A Hybrid Approach Towards Wrapper Induction
msra
Abstract
Approaches to learning wrappers for extraction from semi-structured documents (such as HTML documents) fall into two groups: string-based and tree-based. In previous papers we have shown that tree-based approaches perform much better and need fewer examples than string-based approaches, but have the disadvantage that they can only extract complete text nodes, whereas string-based approaches can extract within text nodes. In this paper we propose a hybrid approach that combines the advantages of both systems. We compare this approach experimentally with a string-based approach on some sub-node extraction tasks.
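To make the distinction between node-level and sub-node extraction concrete, the following is a minimal sketch of the two-step idea: a tree-based step selects a complete text node, and a string-based step then extracts a substring inside it. It is an illustration only, not the paper's method: the wrapper here is written by hand rather than induced from examples, and the sample document, the path expression, the delimiters, and all function names are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not taken from the paper).
DOC = """
<html><body>
  <table>
    <tr><td>Name</td><td>Price: 12.50 EUR (incl. VAT)</td></tr>
    <tr><td>Other</td><td>Price: 7.00 EUR (incl. VAT)</td></tr>
  </table>
</body></html>
"""


def extract_node_text(root, path):
    """Tree-based step: return the complete text of every node matching `path`."""
    return [(node.text or "") for node in root.findall(path)]


def extract_within(text, left, right):
    """String-based step: return the substring between two literal delimiters."""
    match = re.search(re.escape(left) + r"(.*?)" + re.escape(right), text)
    return match.group(1) if match else None


def hybrid_extract(doc, path, left, right):
    """Select text nodes by tree path, then extract inside each selected node."""
    root = ET.fromstring(doc)
    return [extract_within(t, left, right) for t in extract_node_text(root, path)]


# Hand-written wrapper: second cell of each row, text between "Price: " and " EUR".
prices = hybrid_extract(DOC, "./body/table/tr/td[2]", "Price: ", " EUR")
print(prices)
```

In this sketch the tree step alone could only return the whole cell text (for example "Price: 12.50 EUR (incl. VAT)"), while the string step narrows the result to the value inside the node.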