TY - GEN
T1 - Synthetic Audio Helps for Cognitive State Tasks
AU - Soubki, Adil
AU - Murzaku, John
AU - Zeng, Peter
AU - Rambow, Owen
N1 - Publisher Copyright: ©2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
N2 - Automatically recognizing a human’s complete cognitive state from text is a difficult task; from text, a model has to recognize a combination of concepts including belief, emotion, common ground, sentiment, and intention. Humans do not only track and update cognitive state from the meaning of words and sentences, but also from paralinguistic cues such as prosody. The NLP community has broadly focused on text-only approaches to cognitive state tasks, but audio can provide vital missing information. We posit that text-to-speech (TTS) models learn to track aspects of cognitive state in order to produce naturalistic audio, and that the signal audio models implicitly identify is orthogonal to the information that language models exploit. We present Synthetic Audio Data fine-tuning (SAD), a framework where we show that seven tasks related to cognitive state modeling benefit from multimodal training on both text and zero-shot synthetic audio data from an off-the-shelf TTS system. We show an improvement over the text-only modality when adding synthetic audio data to text-only corpora. Furthermore, on tasks and corpora that do contain gold audio, we show our SAD framework achieves competitive performance using text and synthetic audio compared to text and gold audio.
UR - https://www.scopus.com/pages/publications/105028794552
U2 - 10.18653/v1/2025.findings-naacl.92
DO - 10.18653/v1/2025.findings-naacl.92
M3 - Conference contribution
AN - SCOPUS:105028794552
T3 - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Proceedings of the Conference Findings, NAACL 2025
SP - 1701
EP - 1708
BT - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
A2 - Chiruzzo, Luis
A2 - Ritter, Alan
A2 - Wang, Lu
PB - Association for Computational Linguistics (ACL)
T2 - 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025
Y2 - 29 April 2025 through 4 May 2025
ER -