Web news extraction based on path pattern mining

FSKD'09: Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery - Volume 7 (2009)

Abstract
Many Web news sites share similar structures and layout styles. Our extensive case studies indicate a potential correlation between Web content layouts and path patterns. Compared with the delimiting features of Web content, path patterns offer several advantages, including high positioning accuracy, ease of use, and broad applicability. We therefore propose a Web information extraction model whose path patterns are constructed by a path pattern mining algorithm. Our experimental data set consists of news Web pages randomly selected from the CNN website. With a reasonable tolerance threshold, the experimental results show an average precision above 99% and an average recall of 100% when Web information extraction is integrated with our path pattern mining algorithm. Path patterns produced by the mining algorithm perform much better than a priori extraction rules configured from domain knowledge.
Keywords
path pattern, path pattern mining algorithm, Web content, Web content layout, Web information extraction, Web information extraction model, Web news site, news Web page, extraction rule, pattern mining algorithm, Path Pattern Mining, Web News Extraction
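
The abstract does not spell out the mining algorithm, so the following is only a rough illustration of the general idea of path-pattern-based extraction, not the authors' method. It assumes a small set of labeled sample pages, records the root-to-node tag path of each labeled text node, keeps the paths that recur in enough samples, and reuses them on unseen pages. The names `PathCollector`, `text_paths`, `mine_path_patterns`, `extract`, and the `min_support` parameter are all hypothetical, and `min_support` is not claimed to be the paper's tolerance threshold.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Record the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, e.g. ["html", "body", "div.story"]
        self.items = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy markup).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".", 1)[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_paths(html):
    """Return (path, text) pairs for all text nodes in one page."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def mine_path_patterns(samples, min_support=0.8):
    """samples: list of (html, set_of_target_texts).

    Keep every path that holds labeled content in at least
    min_support (a fraction) of the sample pages. Assumed heuristic.
    """
    counts = Counter()
    for html, targets in samples:
        # A set, so each path counts at most once per page.
        counts.update({path for path, text in text_paths(html) if text in targets})
    return {path for path, c in counts.items() if c / len(samples) >= min_support}


def extract(html, patterns):
    """Return the text of every node whose path matches a mined pattern."""
    return [text for path, text in text_paths(html) if path in patterns]
```

Matching whole paths exactly is the simplest possible choice; a more tolerant matcher (for example, ignoring class names or allowing a small edit distance between paths) would be a natural variation, but whether the paper does anything of the kind is not stated in the abstract.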