TY - GEN
T1 - Freshman or fresher? Quantifying the geographic variation of language in online social media
AU - Kulkarni, Vivek
AU - Perozzi, Bryan
AU - Skiena, Steven
N1 - Publisher Copyright: © Copyright 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2016
Y1 - 2016
N2 - In this paper we present a new computational technique to detect and analyze statistically significant geographic variation in language. While previous approaches have primarily focused on lexical variation between regions, our method identifies words that demonstrate semantic and syntactic variation as well. We extend recently developed techniques for neural language models to learn word representations which capture differing semantics across geographical regions. In order to quantify this variation and ensure robust detection of true regional differences, we formulate a null model to determine whether observed changes are statistically significant. Our method is the first such approach to explicitly account for random variation due to chance while detecting regional variation in word meaning. To validate our model, we study and analyze two different massive online data sets: millions of tweets from Twitter as well as millions of phrases contained in the Google Book Ngrams. Our analysis reveals interesting facets of language change across countries.
UR - https://www.scopus.com/pages/publications/84979641083
M3 - Conference contribution
AN - SCOPUS:84979641083
T3 - Proceedings of the 10th International Conference on Web and Social Media, ICWSM 2016
SP - 615
EP - 618
BT - Proceedings of the International AAAI Conference on Web and Social Media, ICWSM 2016
PB - AAAI Press
T2 - 10th International AAAI Conference on Web and Social Media, ICWSM 2016
Y2 - 17 May 2016 through 20 May 2016
ER -