TY - GEN
T1 - On the distribution of lexical features at multiple levels of analysis
AU - Almodaresi, Fatemeh
AU - Ungar, Lyle
AU - Kulkarni, Vivek
AU - Zakeri, Mohsen
AU - Giorgi, Salvatore
AU - Schwartz, H. Andrew
N1 - Publisher Copyright: © 2017 Association for Computational Linguistics.
PY - 2017
Y1 - 2017
N2 - Natural language processing has increasingly moved from modeling documents and words toward studying the people behind the language. This move to working with data at the user or community level has presented the field with different characteristics of linguistic data. In this paper, we empirically characterize various lexical distributions at different levels of analysis, showing that, while most features are decidedly sparse and non-normal at the message-level (as with traditional NLP), they follow the central limit theorem to become much more Log-normal or even Normal at the user- and county-levels. Finally, we demonstrate that modeling lexical features for the correct level of analysis leads to marked improvements in common social scientific prediction tasks.
UR - https://www.scopus.com/pages/publications/85040538193
U2 - 10.18653/v1/P17-2013
DO - 10.18653/v1/P17-2013
M3 - Conference contribution
AN - SCOPUS:85040538193
T3 - ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
SP - 79
EP - 84
BT - ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
PB - Association for Computational Linguistics (ACL)
T2 - 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
Y2 - 30 July 2017 through 4 August 2017
ER -