A Preliminary Study on Taiwanese OCR for Assisting Textual Database Construction from Historical Documents

2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 2022

Abstract
Currently, there is not enough Taiwanese text available to build a proper language model (LM) to support the construction of emerging Taiwanese automatic speech recognition (ASR) and text-to-speech (TTS) systems. Therefore, this paper reports the first Taiwanese optical character recognition (OCR) [1, 2, 3] system to assist human annotators in converting a vast collection of scanned images of Taiwanese historical documents preserved in the “Memory of the Written Taiwanese” (MoWT) website [4] into a usable textual database for building state-of-the-art Taiwanese ASR and TTS systems in the future. Supplementary information and replication materials are available on GitHub [5].
Keywords
Written Taiwanese, Optical Character Recognition, Taiwanese Text Corpus