TY - GEN
T1 - Name-ethnicity classification from open sources
AU - Ambekar, Anurag
AU - Ward, Charles
AU - Mohammed, Jahangir
AU - Male, Swapna
AU - Skiena, Steven
PY - 2009
Y1 - 2009
N2 - The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups with individual group accuracy comparable accuracy to earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends on the representation of particular cultural/ethnic groups.
AB - The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups with individual group accuracy comparable accuracy to earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends on the representation of particular cultural/ethnic groups.
KW - Ethnicity detection
KW - Name classification
KW - News analysis
KW - Social science research
UR - https://www.scopus.com/pages/publications/70350694226
U2 - 10.1145/1557019.1557032
DO - 10.1145/1557019.1557032
M3 - Conference contribution
AN - SCOPUS:70350694226
SN - 9781605584959
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 49
EP - 57
BT - KDD '09
T2 - 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09
Y2 - 28 June 2009 through 1 July 2009
ER -