Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model.

Turkish J. Electr. Eng. Comput. Sci. (2023)

Abstract
Spell checking and correction are important steps in the text normalization process. These tasks are more challenging in agglutinative languages such as Turkish, since many words can be derived from a single root by attaching multiple suffixes. In this study, we propose a two-step deep learning-based model for misspelled word detection in Turkish. A false positive reduction model is integrated into the system to reduce the false positive predictions caused by foreign words and abbreviations, which are common on Internet sharing platforms. For this purpose, we create a multi-class dataset by developing a mobile application for labeling. We compare the effect of different tokenizers, including character-based, syllable-based, and byte-pair encoding (BPE) approaches, together with Long Short-Term Memory (LSTM) and Bi-directional LSTM (Bi-LSTM) networks. The findings show that the proposed Bi-LSTM-based model with the BPE tokenizer outperforms the benchmark methods. The results also indicate that the false positive reduction step significantly increases the precision of the base detection model at the cost of a comparatively smaller drop in its recall.
Keywords
Text normalization, spell checker, tokenizers, long short-term memory, agglutinative languages
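
The two-step idea described in the abstract (a base detector followed by a multi-class false positive reduction step) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the label set, and the lexicon lookups are hypothetical stand-ins for the paper's Bi-LSTM models, which the abstract does not specify in detail.

```python
# Toy illustration of a detect-then-filter pipeline. The lookups below are
# hypothetical stand-ins for trained models, not the authors' method.

# Assumed label set: the abstract only says the dataset is multi-class and
# names foreign words and abbreviations as sources of false positives.
MISSPELLED, FOREIGN, ABBREVIATION = "misspelled", "foreign", "abbreviation"


def base_detector(word, lexicon):
    """Step 1 stand-in: flag every word that is not in a known-word lexicon."""
    return word.lower() not in lexicon


def fp_reducer(word, foreign_words, abbreviations):
    """Step 2 stand-in: assign one of several classes to a flagged word."""
    w = word.lower()
    if w in abbreviations:
        return ABBREVIATION
    if w in foreign_words:
        return FOREIGN
    return MISSPELLED


def detect(words, lexicon, foreign_words, abbreviations):
    """Keep only flagged words that step 2 still labels as misspelled."""
    flagged = [w for w in words if base_detector(w, lexicon)]
    return [
        w for w in flagged
        if fp_reducer(w, foreign_words, abbreviations) == MISSPELLED
    ]


def precision_recall(predicted, gold):
    """Precision and recall of predicted misspellings against gold labels."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

With the toy input `["kitap", "ktap", "online", "vb", "gelyorum"]`, a lexicon containing only `kitap`, `online` listed as foreign and `vb` as an abbreviation, step 1 alone flags four words, two of which are real misspellings; step 2 discards `online` and `vb`, which raises precision. A learned classifier, unlike this lookup, can also discard genuine misspellings, which is where the recall drop mentioned in the abstract would come from.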