
Name-ethnicity classification from open sources

  • Anurag Ambekar
  • Charles Ward
  • Jahangir Mohammed
  • Swapna Male
  • Steven Skiena
  • Stony Brook University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

166 Scopus citations

Abstract

The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups, with per-group accuracy comparable to that of earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends in the representation of particular cultural/ethnic groups.
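The general idea of scoring a name under one sequence model per group can be illustrated with a much simpler stand-in. The sketch below is not the paper's method: it swaps the HMMs and decision trees for a plain character-bigram Markov chain per class with add-one smoothing, and picks the class with the highest log-likelihood. The class name `CharMarkovClassifier`, the labels `spanish` and `other` (echoing the binary example in the abstract), and the tiny `TOY_DATA` list are all hypothetical and chosen only for illustration.

```python
import math

START, END = "^", "$"  # boundary markers added around every name


def bigrams(name):
    """Yield (previous_char, next_char) pairs, including the boundary markers."""
    chars = [START] + list(name.lower()) + [END]
    return zip(chars, chars[1:])


class CharMarkovClassifier:
    """One character-bigram Markov chain per label (illustrative stand-in, not an HMM)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha      # additive smoothing constant
        self.counts = {}        # label -> prev_char -> next_char -> count
        self.totals = {}        # label -> prev_char -> total outgoing count
        self.alphabet = {END}   # every symbol seen as a "next" character

    def fit(self, labeled_names):
        for name, label in labeled_names:
            counts = self.counts.setdefault(label, {})
            totals = self.totals.setdefault(label, {})
            for prev, nxt in bigrams(name):
                row = counts.setdefault(prev, {})
                row[nxt] = row.get(nxt, 0) + 1
                totals[prev] = totals.get(prev, 0) + 1
                self.alphabet.add(nxt)
        return self

    def log_likelihood(self, name, label):
        counts, totals = self.counts[label], self.totals[label]
        vocab = len(self.alphabet)
        score = 0.0
        for prev, nxt in bigrams(name):
            seen = counts.get(prev, {}).get(nxt, 0)
            total = totals.get(prev, 0)
            # Smoothed transition probability P(nxt | prev, label)
            score += math.log((seen + self.alpha) / (total + self.alpha * vocab))
        return score

    def predict(self, name):
        # Choose the label whose chain assigns the name the highest likelihood
        return max(self.counts, key=lambda label: self.log_likelihood(name, label))


# Hypothetical toy training set; real training data would be far larger and noisier.
TOY_DATA = [
    ("garcia", "spanish"), ("martinez", "spanish"), ("rodriguez", "spanish"),
    ("fernandez", "spanish"), ("lopez", "spanish"), ("gonzalez", "spanish"),
    ("smith", "other"), ("johnson", "other"), ("williams", "other"),
    ("brown", "other"), ("taylor", "other"), ("wilson", "other"),
]

clf = CharMarkovClassifier().fit(TOY_DATA)
print(clf.predict("hernandez"), clf.predict("wilkinson"))
```

A model this small only shows the scoring mechanics; with so little data its predictions say nothing about the accuracy reported for the 13-group classifier.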

Original language: English
Title of host publication: KDD '09
Subtitle of host publication: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages: 49-57
Number of pages: 9
State: Published - 2009
Event: 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09 - Paris, France
Duration: Jun 28 2009 to Jul 1 2009

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

Conference: 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09
Country/Territory: France
City: Paris
Period: 06/28/09 to 07/1/09

Keywords

  • Ethnicity detection
  • Name classification
  • News analysis
  • Social science research
