TY - GEN
T1 - The remarkable benefit of user-level aggregation for lexical-based population-level predictions
AU - Giorgi, Salvatore
AU - Preoţiuc-Pietro, Daniel
AU - Buffone, Anneke
AU - Rieman, Daniel
AU - Ungar, Lyle H.
AU - Schwartz, H. Andrew
N1 - Publisher Copyright: © 2018 Association for Computational Linguistics
PY - 2018
Y1 - 2018
AB - Nowcasting based on social media text promises to provide unobtrusive and near real-time predictions of community-level outcomes. These outcomes are typically regarding people, but the data is often aggregated without regard to users in the Twitter populations of each community. This paper describes a simple yet effective method for building community-level models using Twitter language aggregated by user. Results on four different U.S. county-level tasks, spanning demographic, health, and psychological outcomes, show large and consistent improvements in prediction accuracies (e.g., from Pearson r = .73 to .82 for median income prediction or r = .37 to .47 for life satisfaction prediction) over the standard approach of aggregating all tweets. We make our aggregated and anonymized community-level data, derived from 37 billion tweets (over 1 billion of which were mapped to counties), available for research.
UR - https://www.scopus.com/pages/publications/85068310163
M3 - Conference contribution
AN - SCOPUS:85068310163
T3 - Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
SP - 1167
EP - 1172
BT - Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
A2 - Riloff, Ellen
A2 - Chiang, David
A2 - Hockenmaier, Julia
A2 - Tsujii, Jun'ichi
PB - Association for Computational Linguistics
T2 - 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Y2 - 31 October 2018 through 4 November 2018
ER -