Using markup language to differentiate between reliable and unreliable news

Caireann Kennedy, Josephine Griffith

2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) (2020)

Abstract
The aim of this research is to develop a more accurate method to detect unreliable news articles without considering the article content. Fake news articles are defined here as articles that consist entirely of intentionally fabricated, unreliable news. The approach taken in this work was to consider the type and frequency of the HTML tags used in articles. Comparing the HTML tag counts of reliable and unreliable online articles showed distinct differences between the tags used by the two types of article source. Two datasets were used, each with a different labelling of ground truth. The first, NELA 2017 (News Landscape 2017), comprises 136,000 news articles obtained from 92 different news sources and dated between April 2017 and October 2017. The sources of the articles in NELA 2017 were categorized as either reliable or unreliable using a media bias fact-checker resource, and each article was labelled according to its source. The second, FakeNewsNet, comprises over 15,000 news articles and tweets obtained from a fact-checking website, Gossip Cop, and has preassigned ground truth labels (fake or real). Analysis of NELA 2017 found that unreliable articles use 166 tags that never appear in the reliable articles, and that 8 HTML tags are used only in the reliable articles. Based on these findings, classification algorithms were applied to the extracted HTML tags. Experimental results show that the KNN classifier (k-nearest neighbors) and the CART classifier (classification and regression tree) give the best performance, with accuracies of around 97% under 10-fold cross-validation on the NELA 2017 dataset. Accuracies of around 72% were found when the same techniques were applied to the FakeNewsNet dataset.
Using the NELA 2017 dataset, a comparison was also carried out between this new approach and two other approaches to detecting fake news articles: one that uses content analysis, and a second that combines content analysis with HTML tags. It was found that the new approach has similar, or often better, accuracy than the other methods. This research offers a promising approach to detecting unreliable news articles without having to consider the content words of the articles.
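
The tag-counting step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which tooling the authors used to extract tags, so this version assumes Python's standard-library `html.parser`. The names `TagCounter`, `tag_counts`, and `exclusive_tags` are hypothetical, introduced only for illustration. It shows how per-article tag frequencies might be collected, and how the sets of tags seen in only one class of article could be derived.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Hypothetical helper: counts every opening tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase; self-closing tags
        # such as <br/> are also routed through this hook by default.
        self.counts[tag] += 1


def tag_counts(html):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def exclusive_tags(reliable_docs, unreliable_docs):
    """Return (tags seen only in reliable docs, tags seen only in unreliable docs)."""
    reliable = set()
    for doc in reliable_docs:
        reliable.update(tag_counts(doc))
    unreliable = set()
    for doc in unreliable_docs:
        unreliable.update(tag_counts(doc))
    return reliable - unreliable, unreliable - reliable
```

The per-article counters could then be turned into fixed-length feature vectors (one column per tag) and passed to any classifier, such as the KNN and CART models named in the abstract. That step is left out here because the abstract gives no detail on the feature encoding.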
Keywords
Disinformation, fake news, HTML tags, machine learning classification