Bangla Language Dialect Classification using Machine Learning

Md Raihanul Islam Tomal, Tanveer Kader, Abdul Kadar Muhammad Masum, Md. Kalim Amzad Chy

2022 4th International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE) (2022)

Abstract
Dialect classification is difficult because dialects are variations of the same language. It becomes harder still when a dialect is rarely written down or otherwise recorded and survives mainly as speech among local people. This paper classifies dialects from local Bengali text. Because natural language processing (NLP) requires a substantial amount of data, the work focuses on building an enriched dataset of local Bangla dialects. The dataset covers two widely spoken dialects, Chatgaiya and Pabna, and comprises about 5,000 entries, each annotated with its dialect. A five-step Exploratory Data Analysis (EDA) is carried out. Features are extracted with three techniques: CountVectorizer, Term Frequency-Inverse Document Frequency (TF-IDF), and Word2vec. The dialects are then classified with six machine learning algorithms: Support Vector Machine (SVM), Multinomial Naïve Bayes (MNB), Logistic Regression (LR), Random Forest (RF), Decision Tree (DT), and K Nearest Neighbor (KNN). The highest accuracy obtained is 96%.
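
The abstract does not give implementation details, so the following is only a minimal sketch of one of the feature/classifier pairings it names: bag-of-words counts (the idea behind CountVectorizer) fed to a Multinomial Naïve Bayes classifier. It is written in plain Python as a stand-in for a library implementation, and it is not the authors' code. The function names, the smoothing parameter `alpha`, and the training sentences are all assumptions; the sentences are placeholder tokens, not real Chatgaiya or Pabna text.

```python
import math
from collections import Counter, defaultdict


def count_features(text):
    """Bag-of-words counts for one whitespace-tokenized sentence."""
    return Counter(text.split())


def train_mnb(texts, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with additive (Laplace) smoothing."""
    class_docs = Counter(labels)          # number of sentences per dialect
    word_counts = defaultdict(Counter)    # per-dialect word frequencies
    for text, label in zip(texts, labels):
        word_counts[label].update(count_features(text))
    vocab = {w for counts in word_counts.values() for w in counts}

    model = {}
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_like": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model, vocab


def predict(model, vocab, text):
    """Return the dialect label with the highest log-posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word, n in count_features(text).items():
            if word in vocab:             # ignore words unseen in training
                score += n * params["log_like"][word]
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical placeholder tokens standing in for dialect sentences.
train_texts = ["c1 c2 shared", "c2 c3 shared", "p1 p2 shared", "p2 p3 shared"]
train_labels = ["Chatgaiya", "Chatgaiya", "Pabna", "Pabna"]
model, vocab = train_mnb(train_texts, train_labels)
```

Calling `predict(model, vocab, "c1 c3 shared")` scores the sentence under each dialect and returns the label whose word distribution fits best. Replacing the raw counts with TF-IDF weights or Word2vec embeddings would change only the feature step, which is presumably why the abstract lists those extractors separately from the classifiers.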
Keywords
bangla dialect classification, machine learning, countvectorizer, tf-idf, word2vec