Skip to main navigation Skip to search Skip to main content

POLYGLOT-NER: Massive Multilingual named entity recognition

  • Stony Brook University

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

135 Scopus citations

Abstract

The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named Entity Recognition (NER) annotators for 40 major languages using Wikipedia and Freebase. Our approach does not require NER human annotated datasets or language specific resources like treebanks, parallel corpora, and orthographic rules. The novelty of approach lies therein - using only language agnostic techniques, while achieving competitive performance. Our method learns distributed word representations (word embeddings) which encode semantic and syntactic features of words in each language. Then, we automatically generate datasets from Wikipedia link structure and Freebase attributes. Finally, we apply two preprocessing stages (oversampling and exact surface form matching) which do not require any linguistic expertise. Our evaluation is two fold: First, we demonstrate the system performance on human annotated datasets. Second, for languages where no gold-standard benchmarks are available, we propose a new method, distant evaluation, based on statistical machine translation.

Original languageEnglish
Title of host publicationSIAM International Conference on Data Mining 2015, SDM 2015
EditorsSuresh Venkatasubramanian, Jieping Ye
PublisherSociety for Industrial and Applied Mathematics Publications
Pages586-594
Number of pages9
ISBN (Electronic)9781510811522
DOIs
StatePublished - 2015
EventSIAM International Conference on Data Mining 2015, SDM 2015 - Vancouver, Canada
Duration: Apr 30 2015May 2 2015

Publication series

NameSIAM International Conference on Data Mining 2015, SDM 2015

Conference

ConferenceSIAM International Conference on Data Mining 2015, SDM 2015
Country/TerritoryCanada
CityVancouver
Period04/30/1505/2/15

Fingerprint

Dive into the research topics of 'POLYGLOT-NER: Massive Multilingual named entity recognition'. Together they form a unique fingerprint.

Cite this