Method of Webpage Entity Extraction Based on Mixed Attribute Measurement and DOM Tree.

ICNCC(2022)

Abstract
Most mainstream web information extraction models analyze a page along multiple dimensions and then fuse the results for evaluation, an approach that suffers from high vectorization cost and unreasonable fusion. To address this, a webpage text mining method based on a mixed attribute measure and DOM tree segmentation is proposed. The method uses the K-Nearest Neighbor algorithm to measure the similarity of the mixed attributes of nodes in the DOM tree, which alleviates the shortcoming of previous algorithms that focus solely on categorical or solely on numeric attributes and improves extraction quality. In addition, a web page segmentation algorithm is used to find the set of target DOM tree nodes, after which information extraction is completed at reduced cost. Experimental results show that, compared with multiple baseline models, the method improves accuracy by more than 6.5% and recall by more than 11.6% across numerous scenarios, with clear speed benefits.
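To make the idea of a mixed attribute measure concrete, the following is a minimal sketch of K-Nearest Neighbor voting over DOM nodes that carry both categorical and numeric attributes. It is not the paper's actual measure, which the abstract does not specify. It assumes a Gower-style distance (a 0/1 mismatch for categorical attributes and a range-normalized absolute difference for numeric ones). The attribute names (`tag`, `css_class`, `depth`, `text_len`), the labels, and the toy nodes are all hypothetical.

```python
from collections import Counter

# Hypothetical choice: which attributes of a DOM node are treated as numeric.
NUMERIC_KEYS = ("depth", "text_len")


def numeric_ranges_of(labeled_nodes, keys=NUMERIC_KEYS):
    """Return max - min for each numeric attribute, used for normalization."""
    ranges = {}
    for key in keys:
        values = [attrs[key] for attrs, _ in labeled_nodes]
        ranges[key] = max(values) - min(values)
    return ranges


def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance (assumed, not from the paper) between two nodes.

    Numeric attributes contribute |a - b| / range; categorical attributes
    contribute 0 on a match and 1 on a mismatch. The result is the mean
    contribution over all attributes of `a`.
    """
    total = 0.0
    for key, value in a.items():
        if key in numeric_ranges:
            span = numeric_ranges[key]
            total += abs(value - b[key]) / span if span else 0.0
        else:
            total += 0.0 if value == b[key] else 1.0
    return total / len(a)


def knn_label(query, labeled_nodes, numeric_ranges, k=3):
    """Label a node by majority vote among its k nearest labeled nodes."""
    ranked = sorted(
        labeled_nodes,
        key=lambda item: mixed_distance(query, item[0], numeric_ranges),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labeled DOM nodes (hypothetical data): (attributes, label).
NODES = [
    ({"tag": "p", "css_class": "content", "depth": 5, "text_len": 420}, "target"),
    ({"tag": "p", "css_class": "content", "depth": 5, "text_len": 380}, "target"),
    ({"tag": "div", "css_class": "content", "depth": 4, "text_len": 300}, "target"),
    ({"tag": "a", "css_class": "nav", "depth": 3, "text_len": 12}, "noise"),
    ({"tag": "li", "css_class": "nav", "depth": 4, "text_len": 8}, "noise"),
    ({"tag": "span", "css_class": "footer", "depth": 2, "text_len": 40}, "noise"),
]
RANGES = numeric_ranges_of(NODES)

if __name__ == "__main__":
    query = {"tag": "p", "css_class": "content", "depth": 5, "text_len": 350}
    print(knn_label(query, NODES, RANGES))
```

Normalizing numeric attributes by their range keeps every attribute's contribution in [0, 1], so a large raw value such as text length does not swamp the categorical attributes.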