
Bootstrapping semantic annotation for content-rich HTML documents

  • Stony Brook University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

20 Scopus citations

Abstract

An enormous amount of semantic data is still encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based, content-rich documents that contain many different semantic concepts per document. Starting with a small seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality, and uses it to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.
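
The bootstrapping idea in the abstract can be illustrated with a minimal self-training sketch. This is not the paper's model: the abstract does not specify the statistical model or the features, so the sketch substitutes a count-based naive Bayes classifier, and every name in it (`train`, `predict`, `bootstrap`, the `style` and `region` features, the `threshold` value) is a hypothetical choice made for illustration. It only shows the general loop: learn from a small labeled seed, label the unlabeled items the model is confident about, and retrain.

```python
import math
from collections import Counter, defaultdict


def train(labeled):
    """Fit a count-based naive Bayes model over categorical features."""
    label_counts = Counter(label for _, label in labeled)
    feat_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    vocab = defaultdict(set)            # feature name -> values seen so far
    for feats, label in labeled:
        for name, value in feats.items():
            feat_counts[(label, name)][value] += 1
            vocab[name].add(value)
    return label_counts, feat_counts, vocab


def predict(model, feats):
    """Return (most likely label, its posterior probability)."""
    label_counts, feat_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)
        for name, value in feats.items():
            count = feat_counts[(label, name)][value]
            # Add-one smoothing, with one extra slot for unseen values.
            score += math.log((count + 1) / (n + len(vocab[name]) + 1))
        log_scores[label] = score
    best = max(log_scores, key=log_scores.get)
    top = log_scores[best]
    norm = sum(math.exp(s - top) for s in log_scores.values())
    return best, 1.0 / norm


def bootstrap(seed, unlabeled, threshold=0.75, max_rounds=5):
    """Self-training: repeatedly absorb confidently labeled items."""
    labeled = list(seed)
    pending = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, rest = [], []
        for feats in pending:
            label, prob = predict(model, feats)
            if prob >= threshold:
                confident.append((feats, label))
            else:
                rest.append(feats)
        if not confident:
            break  # nothing new learned; stop early
        labeled.extend(confident)
        pending = rest
    return labeled, pending


# Hypothetical features: a presentation-style signature and a coarse page region.
seed = [
    ({"style": "h1.bold", "region": "top"}, "headline"),
    ({"style": "p.small", "region": "bottom"}, "footer"),
]
unlabeled = [
    {"style": "h1.bold", "region": "top"},
    {"style": "p.small", "region": "bottom"},
    {"style": "p.small", "region": "top"},
]
labeled, pending = bootstrap(seed, unlabeled)
```

In this toy run the two items that match a seed in both style and region are absorbed into the labeled set, while the item with mixed evidence stays unlabeled because neither concept reaches the confidence threshold.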

Original language: English
Title of host publication: Proceedings - 21st International Conference on Data Engineering, ICDE 2005
Pages: 583-593
Number of pages: 11
State: Published - 2005
Event: 21st International Conference on Data Engineering, ICDE 2005 - Tokyo, Japan
Duration: Apr 5 2005 to Apr 8 2005

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627

Conference

Conference: 21st International Conference on Data Engineering, ICDE 2005
Country/Territory: Japan
City: Tokyo
Period: 04/5/05 to 04/8/05
