TY - GEN
T1 - Prosody Analysis of Audiobooks
AU - Pethe, Charuta
AU - Pham, Bach
AU - Childress, Felix D.
AU - Yin, Yunting
AU - Skiena, Steven
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Recent advances in text-to-speech have made it possible to generate natural-sounding audio from text. However, audiobook narrations involve dramatic vocalizations and intonations by the reader, with greater reliance on emotions, dialogues, and descriptions in the narrative. Using our dataset of 93 aligned book-audiobook pairs, we present improved models to predict prosody (pitch, volume, and rate of speech) from narrative text using language modeling. Our predicted prosody attributes correlate much better with human audiobook readings than results from a state-of-the-art commercial TTS system: our predicted pitch shows a higher correlation with human reading for 22 out of 24 books in the test set, while our predicted volume attribute proves more similar to human reading for 23 out of the 24 books. Finally, we present a human evaluation study to quantify the extent to which people prefer prosody-enhanced audiobook readings over default commercial text-to-speech systems.
KW - character embedding
KW - prosody attribute prediction
KW - text to speech
UR - https://www.scopus.com/pages/publications/105009500395
U2 - 10.1109/ICSC64641.2025.00036
DO - 10.1109/ICSC64641.2025.00036
M3 - Conference contribution
AN - SCOPUS:105009500395
T3 - Proceedings - IEEE International Conference on Semantic Computing, ICSC
SP - 217
EP - 221
BT - Proceedings - 2025 19th International Conference on Semantic Computing, ICSC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 19th International Conference on Semantic Computing, ICSC 2025
Y2 - 3 February 2025 through 5 February 2025
ER -