TY - GEN
T1 - Bootstrapping semantic annotation for content-rich HTML documents
AU - Mukherjee, Saikat
AU - Ramakrishnan, I. V.
AU - Singh, Amarjeet
PY - 2005
Y1 - 2005
AB - An enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents makes them directly amenable to Semantic Web processing. In this paper we describe a highly automated technique for annotating HTML documents, especially template-based content-rich documents, containing many different semantic concepts per document. Starting with a (small) seed of hand-labeled instances of semantic concepts in a set of HTML documents, we bootstrap an annotation process that automatically identifies unlabeled concept instances present in other documents. The bootstrapping technique exploits the observation that semantically related items in content-rich documents exhibit consistency in presentation style and spatial locality to learn a statistical model for accurately identifying different semantic concepts in HTML documents drawn from a variety of Web sources. We also present experimental results on the effectiveness of the technique.
UR - https://www.scopus.com/pages/publications/28444470451
U2 - 10.1109/ICDE.2005.28
DO - 10.1109/ICDE.2005.28
M3 - Conference contribution
AN - SCOPUS:28444470451
SN - 0769522858
T3 - Proceedings - International Conference on Data Engineering
SP - 583
EP - 593
BT - Proceedings - 21st International Conference on Data Engineering, ICDE 2005
T2 - 21st International Conference on Data Engineering, ICDE 2005
Y2 - 5 April 2005 through 8 April 2005
ER -