Sequence Labeling Algorithms for Punctuation Restoration in Brazilian Portuguese Texts.

Tiago B. de Lima, Péricles B. C. de Miranda, Rafael Ferreira Mello, Moesio Wenceslau, Ig Ibert Bittencourt, Thiago Damasceno Cordeiro, Jário José

BRACIS (2) (2022)

Abstract

Punctuation Restoration is an essential post-processing task for text generation methods such as Speech-to-Text (STT) and Machine Translation (MT). The generation models employed in those tasks usually produce unpunctuated text, which is difficult for human readers and might degrade the performance of many downstream text processing tasks. Thus, many techniques exist to restore the text's punctuation. For instance, approaches based on Conditional Random Fields (CRF) and pre-trained models, such as Bidirectional Encoder Representations from Transformers (BERT), have been widely applied. In the last few years, however, one approach has gained significant attention: casting the Punctuation Restoration problem as a sequence labeling task. In Sequence Labeling, each punctuation symbol becomes a label (e.g., COMMA, QUESTION, and PERIOD) that sequence tagging models can predict. This approach has achieved competitive results against state-of-the-art punctuation restoration algorithms. However, most research focuses on English, with little discussion of other languages, such as Brazilian Portuguese. Therefore, this paper conducts an experimental analysis comparing the Bidirectional Long Short-Term Memory (BI-LSTM) + CRF model and BERT for predicting punctuation in Brazilian Portuguese. We evaluate those approaches on the IWSLT 2012-03 and OBRAS datasets in terms of precision, recall, and F1-score. The results show that BERT achieved competitive results in punctuation prediction, but it requires much more GPU resources for training than the BI-LSTM + CRF algorithm.
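The sequence labeling formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' preprocessing or evaluation code. The function names, the `O` label for "no punctuation", the lowercasing step, and the regex tokenizer are all assumptions made for illustration; only the COMMA, QUESTION, and PERIOD labels and the precision, recall, and F1 metrics come from the abstract.

```python
import re
from typing import List, Tuple

# Label names COMMA, QUESTION, PERIOD come from the abstract.
# The "O" (no punctuation) label is an assumption for this sketch.
PUNCT_TO_LABEL = {",": "COMMA", "?": "QUESTION", ".": "PERIOD"}


def to_sequence_labels(text: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (lowercased token, label) pairs.

    Each word is labeled with the punctuation mark that follows it,
    or "O" when no mark follows.
    """
    pairs: List[Tuple[str, str]] = []
    # Split into words and the three punctuation marks of interest.
    for piece in re.findall(r"\w+|[,.?]", text):
        if piece in PUNCT_TO_LABEL:
            if pairs:  # attach the mark to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_TO_LABEL[piece])
        else:
            pairs.append((piece.lower(), "O"))
    return pairs


def precision_recall_f1(
    gold: List[str], pred: List[str], label: str
) -> Tuple[float, float, float]:
    """Per-label precision, recall, and F1 over aligned label sequences."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    for token, label in to_sequence_labels("Olá, tudo bem?"):
        print(token, label)
```

A tagger such as BI-LSTM + CRF or BERT would then be trained to predict the label column from the unpunctuated token column, and `precision_recall_f1` would be applied per label to its predictions.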

Keywords

punctuation restoration, sequence
