6 Million Spam Tweets: A Large Ground Truth For Timely Twitter Spam Detection

2015 IEEE International Conference on Communications (ICC)(2015)

引用 163|浏览174
暂无评分
摘要
Twitter has changed the way of communication and getting news for people's daily life in recent years. Meanwhile, due to the popularity of Twitter, it also becomes a main target for spamming activities. In order to stop spammers, Twitter is using Google SafeBrowsing to detect and block spam links. Despite that blacklists can block malicious URLs embedded in tweets, their lagging time hinders the ability to protect users in real-time. Thus, researchers begin to apply different machine learning algorithms to detect Twitter spam. However, there is no comprehensive evaluation on each algorithms' performance for real-time Twitter spam detection due to the lack of large ground truth. To carry out a thorough evaluation, we collected a large dataset of over 600 million public tweets. We further labelled around 6.5 million spam tweets and extracted 12 lightweight features, which can be used for online detection. In addition, we have conducted a number of experiments on six machine learning algorithms under various conditions to better understand their effectiveness and weakness for timely Twitter spam detection. We will make our labelled dataset for researchers who are interested in validating or extending our work.
更多
查看译文
关键词
Twitter spam detection,spam Tweets,Google SafeBrowsing,online social networks,machine learning algorithm
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要