Building Tamil Text Dataset on LGBTQIA and Offensive Language Detection using Multilingual BERT

2022 International Conference on Inventive Computation Technologies (ICICT), 2022

Abstract
Members of the Lesbian, Gay, Bisexual, Transgender and Queer or Questioning (LGBTQ) community worldwide experience depression and other mental health problems and develop suicidal thoughts due to non-acceptance by society. This primarily stems from a lack of awareness of, and connection with, the LGBTQ community. This paper collects tweets and classifies them as offensive or non-offensive using manual and automated methods. The tweets are in Tamil (a classical Indian language), for which data and research are limited. The presented dataset is annotated by two annotators, and the annotation is validated by a Cohen's kappa score of 0.97. A fine-tuned Multilingual BERT (mBERT) Transformer model, along with Naive Bayes and Long Short-Term Memory baseline models, is used to detect offensive and non-offensive texts on the benchmark dataset. The mBERT model performs best, with an F-score of 93% for the Non-Offensive class and 70% for the Offensive class. We also discuss preliminary findings from the dataset, which can be investigated further for specific research directions in the future.
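For readers unfamiliar with the agreement metric mentioned in the abstract, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators. It is not the authors' code: the function name `cohens_kappa`, the label strings, and the toy annotations are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration only (not from the paper's dataset).
annotator_1 = ["off", "non", "non", "off", "non", "non", "non", "off", "non", "non"]
annotator_2 = ["off", "non", "non", "non", "non", "non", "non", "off", "non", "non"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 1, such as the 0.97 reported above, indicates that the two annotators almost always assigned the same label.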
Keywords
offensive language detection, Tamil text dataset, LGBTQIA