Classification of Layout vs. Relational Tables on the Web: Machine Learning with Rendered Pages

ACM Transactions on the Web (2023)

Abstract
Table mining on the web is an open problem, and none of the previously proposed techniques provides a complete solution. Most research focuses on the structure of the HTML document, but because of the nature and structure of the web, detecting relational tables remains challenging. The Web Content Accessibility Guidelines (WCAG) include a wide range of recommendations for making tables accessible, but our previous work shows that these recommendations are not followed; as a result, tables remain inaccessible to disabled people and to automated processing. We propose a new approach to table mining that looks not at the HTML structure but at the pages as rendered by the browser. The first task in table mining on the web is to distinguish relational tables from layout tables, and we propose two alternative approaches for that task. We first introduce our dataset, which includes 725 web pages with 9,957 extracted tables. Our first approach extracts features from a page after it has been rendered by the browser, then applies several machine learning algorithms to classify tables as layout or relational. The best result is obtained with Random Forest, with an accuracy of 97.2% (F1-score: 0.955) under 10-fold cross-validation. Our second approach classifies tables from images taken from the same sources using a Convolutional Neural Network (CNN), which gives an accuracy of 95% (F1-score: 0.95). Our work shows that the web's true essence emerges only after it goes through a browser: using the rendered pages and tables, classification is more accurate than the results reported in the literature, which paves the way for making tables more accessible.
Keywords
Table mining, information extraction, table classification, table accessibility
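
The abstract reports accuracy and F1-score under 10-fold cross-validation. As a rough illustration of what those terms involve, the following is a minimal sketch in plain Python, not the authors' code. The function names `k_fold_indices` and `accuracy_and_f1` are hypothetical, the split is unshuffled and unstratified, and treating "relational" as the positive class (label 1) is an assumption that the abstract does not state.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n_samples-1 into k (train, test) pairs.

    Contiguous, unshuffled folds of near-equal size; a real experiment
    would typically shuffle or stratify first (assumption, not from the paper).
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds receive one additional sample each.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) with 1 = relational (assumed positive), 0 = layout."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # 9,957 is the table count given in the abstract; the labels below are toy data.
    folds = k_fold_indices(9957, k=10)
    print(len(folds), len(folds[0][1]), len(folds[-1][1]))

    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    print(accuracy_and_f1(y_true, y_pred))
```

In the toy labels, one missed positive out of ten samples leaves accuracy at 0.9 while F1 falls to about 0.67, which is one reason both figures are usually reported together.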