Balancing Techniques for Improving Automated Detection of Hate Speech and Offensive Language on Social Media

B. Ajay Chandrasekhar Reddy, Girish Kumar Chandra, Dilip Singh Sisodia, Arti Anuragi

2023 2nd International Conference for Innovation in Technology (INOCON) (2023)

Abstract

People frequently share information on social media networks such as Twitter, Facebook, and Tumblr. However, these platforms are also notorious for the spread of hate speech and insults, often posted anonymously. Hate speech is violent, abusive, or aggressive language directed at a particular group on the basis of factors such as gender, race, religion, or region. Its prevalence on these platforms is a major concern, and detecting it manually is time-consuming. To address this issue, this study presents an automated hate speech detection model evaluated on a publicly available Twitter dataset. The proposed method emphasizes data pre-processing (including stemming), term frequency-inverse document frequency (TF-IDF) feature extraction, and several sampling techniques (a random sampler, the synthetic minority over-sampling technique (SMOTE), and ALL-KNN) to balance the imbalanced dataset. Logistic regression, support vector machine (SVM), and k-nearest neighbor (k-NN) classifiers were trained and tested using hold-out validation to guard against overfitting and evaluate performance. Performance was measured using metrics such as accuracy, precision, and the confusion matrix. The logistic regression classifier combined with SMOTE performed best, with an accuracy of 82% and macro-averaged precision, recall, and F1-score of 80%, 82%, and 79%, respectively.
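The abstract does not describe how SMOTE was implemented, so the following is only a minimal sketch of the general idea, not the authors' code. It assumes the features are already dense numeric vectors (for example, TF-IDF rows), and the helper names `smote` and `balance` are hypothetical. Each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest neighbours from the same class.

```python
import random
from collections import Counter


def smote(samples, n_new, k=5, seed=0):
    """Return n_new synthetic vectors interpolated between same-class samples.

    Illustrative helper (not from the paper): `samples` is a list of
    equal-length numeric vectors that all belong to one class.
    """
    if len(samples) < 2:
        raise ValueError("SMOTE needs at least two samples in the class")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        # Rank the other samples of this class by squared Euclidean distance.
        others = [p for j, p in enumerate(samples) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def balance(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        if count == target:
            continue
        members = [x for x, lab in zip(X, y) if lab == label]
        new_points = smote(members, target - count, k=k, seed=seed)
        X_out.extend(new_points)
        y_out.extend([label] * len(new_points))
    return X_out, y_out


# Toy data standing in for feature vectors: class 0 has 4 rows, class 1 has 2.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y_train = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = balance(X_train, y_train)
print(Counter(y_bal))
```

In a hold-out setup, oversampling of this kind is normally applied to the training split only, so that the test split keeps the original class distribution and contains no synthetic points.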

Keywords

Sentiment analysis, Hate speech, Offensive language, TF-IDF features, Logistic regression, Sampling techniques
