
On the distribution of lexical features at multiple levels of analysis

  • Fatemeh Almodaresi
  • Lyle Ungar
  • Vivek Kulkarni
  • Mohsen Zakeri
  • Salvatore Giorgi
  • H. Andrew Schwartz

  • Stony Brook University
  • University of Pennsylvania

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

6 Scopus citations

Abstract

Natural language processing has increasingly moved from modeling documents and words toward studying the people behind the language. This move to working with data at the user or community level has presented the field with different characteristics of linguistic data. In this paper, we empirically characterize various lexical distributions at different levels of analysis, showing that, while most features are decidedly sparse and non-normal at the message-level (as with traditional NLP), they follow the central limit theorem to become much more Log-normal or even Normal at the user- and county-levels. Finally, we demonstrate that modeling lexical features for the correct level of analysis leads to marked improvements in common social scientific prediction tasks.
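The central-limit effect described in the abstract can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the authors' method or data: the function names, the usage rate `p_use`, and the sample sizes are all assumptions chosen for illustration. Each message records whether a word appears (a sparse 0/1 feature), and each user's feature is the mean over that user's messages. Skewness, which is 0 for a symmetric distribution, serves as a rough indicator of non-normality.

```python
import random
import statistics


def skewness(values):
    """Population skewness (third standardized moment); 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate(num_users=2000, messages_per_user=200, p_use=0.05, seed=0):
    """Generate synthetic message-level and user-level values for one word.

    All parameters are hypothetical: every user is assumed to use the word
    in a message with the same probability p_use.
    """
    rng = random.Random(seed)
    message_level = []  # one 0/1 indicator per message
    user_level = []     # one relative frequency per user
    for _ in range(num_users):
        msgs = [1.0 if rng.random() < p_use else 0.0
                for _ in range(messages_per_user)]
        message_level.extend(msgs)
        user_level.append(statistics.fmean(msgs))
    return message_level, user_level


message_level, user_level = simulate()
msg_skew = skewness(message_level)
user_skew = skewness(user_level)
print(f"message-level skewness: {msg_skew:.2f}")
print(f"user-level skewness:    {user_skew:.2f}")
```

Under these assumptions the message-level indicator is strongly right-skewed, while the user-level means are much closer to symmetric. The sketch gives every user the same usage rate, which is a deliberate simplification; real users differ in how often they use a word, so it does not reproduce the Log-normal shapes the abstract reports.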

Original language: English
Title of host publication: ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
Publisher: Association for Computational Linguistics (ACL)
Pages: 79-84
Number of pages: 6
ISBN (Electronic): 9781945626760
State: Published - 2017
Event: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017 - Vancouver, Canada
Duration: Jul 30 2017 to Aug 4 2017

Publication series

Name: ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Short Papers)
Volume: 2

Conference

Conference: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
Country/Territory: Canada
City: Vancouver
Period: 07/30/17 to 08/4/17
