TY - GEN
T1 - Perception-oriented online news extraction
AU - Chen, Jinlin
AU - Xiao, Keli
PY - 2008
Y1 - 2008
AB - A novel online news extraction approach based on human perception is presented in this paper. The approach simulates how a human perceives and identifies online news content. It first detects news areas based on the content function, space continuity, and formatting continuity of news information. It then identifies detailed news content based on the position, format, and semantics of the detected news areas. Experimental results show that our approach achieves much better performance (on average more than 99% in terms of F1 value) than previous approaches such as the Tree Edit Distance and Visual Wrapper based approaches. Furthermore, our approach does not assume the existence of Web templates in the tested Web pages, as required by the Tree Edit Distance based approach, nor does it need training sets, as required by the Visual Wrapper based approach. The success of our approach demonstrates the strength of the perception-oriented Web information extraction methodology and represents a promising direction for automatic information extraction from sources whose presentation is designed for humans.
KW - Information extraction
KW - Online news
KW - Web
UR - https://www.scopus.com/pages/publications/57649171100
U2 - 10.1145/1378889.1378952
DO - 10.1145/1378889.1378952
M3 - Conference contribution
AN - SCOPUS:57649171100
SN - 9781595939982
T3 - Proceedings of the ACM International Conference on Digital Libraries
SP - 363
EP - 366
BT - JCDL'08
T2 - 8th ACM/IEEE-CS Joint Conference on Digital Libraries 2008, JCDL'08
Y2 - 16 June 2008 through 20 June 2008
ER -