A Preliminary Study on Taiwanese OCR for Assisting Textual Database Construction from Historical Documents
2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP) (2022)
Abstract
Too little Taiwanese text is currently available to build a proper language model (LM) for emerging Taiwanese automatic speech recognition (ASR) and text-to-speech (TTS) systems. This paper therefore reports the first Taiwanese optical character recognition (OCR) [1, 2, 3] system, intended to assist human annotators in converting the vast collection of scanned Taiwanese historical documents preserved on the “Memory of the Written Taiwanese” (MoWT) website [4] into a usable textual database for building state-of-the-art Taiwanese ASR and TTS systems in the future. Supplementary information and replication materials are available on GitHub [5].
Keywords
Written Taiwanese, Optical Character Recognition, Taiwanese Text Corpus