Content-based and link-based methods for categorical webpage classification

Shushman Choudhury, Tanmay Batra, Christian Hughes, LOWERCASE LEMMATIZER

Semantic Scholar (2016)

Abstract
In this project, we evaluate several methods for categorizing webpages. We use both the textual content of a webpage and data about the hyperlinks on it. We first investigate the performance of classification methods that are purely content-based: Multinomial Naive Bayes, Support Vector Machines, Decision Trees, and Word2Vec embeddings. We then augment some of these classifiers with information from the hyperlinks of a webpage, using a single-hop neighbour lookup. We also comment on the use of graphical models. Our results show that all of the content-only classifiers perform quite well on our test set without a great deal of parameter tuning. Furthermore, we observe a general decrease in accuracy when hyperlink information is used, although in some cases the content-based methods are incorrect on their own but correct when augmented with link-based information.
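The idea of augmenting a content-based classifier with a single-hop neighbour lookup can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how neighbour information is combined, so the averaging rule, the `weight` parameter, the function names, and the toy data below are all assumptions made for illustration. The sketch trains a small Multinomial Naive Bayes model with Laplace smoothing and then adds a weighted mean of the neighbours' class scores to the page's own scores.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with Laplace smoothing (alpha)."""
    vocab = {w for d in docs for w in d.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())
    model = {}
    for y in class_docs:
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        model[y] = {
            "prior": math.log(class_docs[y] / len(docs)),
            "loglik": {w: math.log((word_counts[y][w] + alpha) / total)
                       for w in vocab},
        }
    return model


def log_scores(model, text):
    """Unnormalised per-class log scores; out-of-vocabulary words are skipped."""
    return {
        y: m["prior"] + sum(m["loglik"][w] for w in text.split()
                            if w in m["loglik"])
        for y, m in model.items()
    }


def classify_with_neighbours(model, page, pages, links, weight=0.5):
    """Hypothetical combination rule: own scores plus a weighted mean of the
    scores of pages one hyperlink away (single-hop neighbours)."""
    own = log_scores(model, pages[page])
    neighbours = [n for n in links.get(page, []) if n in pages]
    combined = dict(own)
    if neighbours:
        for y in own:
            mean_nb = sum(log_scores(model, pages[n])[y]
                          for n in neighbours) / len(neighbours)
            combined[y] = own[y] + weight * mean_nb
    return max(combined, key=combined.get)


# Toy data, invented for illustration only.
train_docs = [
    "course syllabus homework lecture",
    "lecture exam homework grades",
    "research publications professor lab",
    "professor office publications students",
]
train_labels = ["course", "course", "faculty", "faculty"]
model = train_nb(train_docs, train_labels)

pages = {
    "a": "homepage welcome",        # no informative words of its own
    "b": "lecture homework exam",
    "c": "syllabus lecture grades",
}
links = {"a": ["b", "c"]}           # page "a" links to "b" and "c"
prediction = classify_with_neighbours(model, "a", pages, links)
```

In this toy setup, page `a` contains no vocabulary words, so its own scores reduce to the class priors and the linked pages decide the label. The same mechanism can also pull a correct content-based prediction towards the wrong class when a page's neighbours belong to a different category, which is one plausible reading of the general accuracy decrease reported above.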