A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research.

WebSci '18: 10th ACM Conference on Web Science, Amsterdam, Netherlands, May 2018

Abstract
A quality annotated corpus is essential to research. Despite the recent focus of the Web science community on cyberbullying research, the community lacks standard benchmarks. This paper provides both a quality annotated corpus and an offensive words lexicon capturing different types of harassment content: (i) sexual, (ii) racial, (iii) appearance-related, (iv) intellectual, and (v) political. We first crawled data from Twitter using this content-tailored offensive lexicon. As mere presence of an offensive word is not a reliable indicator of harassment, human judges annotated tweets for the presence of harassment. Our corpus consists of 25,000 annotated tweets for the five types of harassment content and is available on the Git repository.
Keywords
Annotated corpus, context, sexual, racial, political, appearance-related, intellectual, cyberbullying, harassment, offensive lexicon, profane word