Demographical Based Sentiment Analysis for Detection of Hate Speech Tweets for Low Resource Language (Just Accepted)

Kamal Safdar, Shibli Nisar, Waseem Iqbal, Awais Ahmad, Yawar Abbas Bangash

ACM Transactions on Asian and Low-Resource Language Information Processing (2023)

Abstract
Advances in information and communication technology let social media users share their ideas and thoughts across the globe almost instantly, but the large volume of data generated by this communication poses immense challenges of its own. In recent years, freedom of speech online has been accompanied by a sharp rise in offensive and hate speech content aimed at violating basic human rights. Detecting abusive content on social media in resource-rich languages has recently become an active research area. Low-resource languages, however, remain underserved because large corpora are unavailable and the languages themselves are complex to understand. The proposed methodology has two main parts: detecting abusive content, and performing a demographic analysis of an indigenously developed dataset. The process starts with the construction of a unique unlabeled Urdu dataset of 0.2 M tweets collected from Twitter with the web scraping tool snscrape. The dataset covers the 36 districts of Punjab, Pakistan, and spans 2018 to April 2022. It is labeled with three target classes: Neutral, Offensive, and Hate Speech. After data cleaning, features are extracted using traditional techniques, namely bag-of-words (BoW) and TF-IDF combined with word and character n-grams, as well as the word embedding Word2Vec. Models are trained with both machine learning algorithms (SVM and logistic regression) and deep learning techniques (Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN)). The best F-score on this dataset, 64, is achieved with LSTM, and the best accuracy, 93, with CNN. A choropleth map visualizes the distribution of the dataset across the 36 districts of Punjab, and a time series plot covers the five-year period from 2018 to April 2022.
Keywords
snscrapper, CNN, Word Embedding, LSTM, Choropleth map
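
The feature extraction step named in the abstract (TF-IDF over a combination of word and character n-grams) can be illustrated with a minimal sketch. The abstract does not state the n-gram ranges, the IDF variant, or the normalization used, so the values below (word 1-2 grams, character 2-3 grams, smoothed IDF in the style of scikit-learn's default, L2 normalization) are assumptions, as are all function names. The three romanized strings are toy placeholders, not samples from the paper's dataset.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams of length n_min..n_max (ranges are assumed)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n_min=2, n_max=3):
    """Return character n-grams of length n_min..n_max (ranges are assumed)."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def extract_features(text):
    """Combine both n-gram families; prefixes keep the vocabularies apart."""
    features = ["w:" + g for g in word_ngrams(text)]
    features += ["c:" + g for g in char_ngrams(text)]
    return features


def tfidf(docs):
    """Compute L2-normalized TF-IDF vectors as sparse dicts, plus the IDF table."""
    counts = [Counter(extract_features(d)) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each feature occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1 (an assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = []
    for c in counts:
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors, idf


# Toy placeholder texts (romanized), used only to exercise the functions.
docs = ["yeh acha din hai", "yeh bura din hai", "aaj mausam acha hai"]
vectors, idf = tfidf(docs)

# Show the highest-weighted features of the first document.
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1])[:5]:
    print(term, round(weight, 3))
```

Sparse vectors of this kind are what a linear classifier such as SVM or logistic regression would consume. Character n-grams are often included for noisy social media text because they tolerate spelling variation, though the abstract does not state the authors' motivation for combining them with word n-grams.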