Extraction Of Arabic Words From Multilingual Documents

Proceedings of the Eighth IASTED International Conference on Artificial Intelligence and Soft Computing (2004)

Abstract
Latin-script words are now commonly used in Arabic-script documents. An OCR system developed for Arabic script will misrecognize words in Latin script, so these words must be filtered out before the Arabic-script words are fed to the Arabic OCR. This creates the need for an automatic script recognition system for words in Arabic and Latin scripts. In this paper we present a method that filters Latin words out of heterogeneous blocks. The method is based on a rapid filtering process that uses morphological and statistical features of Arabic script, such as overlapping and inclusion of bounding boxes, the horizontal bar, low diacritics, and height and width variation of connected components. In tests, the method proved effective at discriminating between Arabic and Latin script at the word level. The data set of words is extracted from the "Directory of North of Africa", and word identification reaches 98% on 1435 words.
Keywords
heterogeneous blocks,document analysis,scripts discrimination,word level
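
The bounding-box features named in the abstract (overlap, inclusion, and height and width variation of connected components) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Box` type, the function names, and the choice of the coefficient of variation as the "variation" measure are all assumptions made for illustration, and no classification rule or threshold is implied. The intuition, stated cautiously, is that Arabic diacritics tend to sit above or below a letter body, so their boxes share horizontal extent with it, whereas printed Latin letters usually occupy separate horizontal slots.

```python
from dataclasses import dataclass
from itertools import combinations
from statistics import mean, pstdev


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of one connected component (pixel coordinates)."""
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def width(self) -> int:
        return self.x1 - self.x0

    @property
    def height(self) -> int:
        return self.y1 - self.y0


def x_overlap(a: Box, b: Box) -> int:
    """Length of the horizontal extent shared by two boxes (0 if disjoint)."""
    return max(0, min(a.x1, b.x1) - max(a.x0, b.x0))


def includes(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and outer.x1 >= inner.x1 and outer.y1 >= inner.y1)


def variation(values: list) -> float:
    """Coefficient of variation (assumed measure); 0.0 for empty or zero-mean input."""
    if not values:
        return 0.0
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def word_features(boxes: list) -> dict:
    """Compute illustrative per-word features from component bounding boxes."""
    pairs = list(combinations(boxes, 2))
    return {
        "components": len(boxes),
        "overlapping_pairs": sum(1 for a, b in pairs if x_overlap(a, b) > 0),
        "included_pairs": sum(1 for a, b in pairs
                              if includes(a, b) or includes(b, a)),
        "height_variation": variation([b.height for b in boxes]),
        "width_variation": variation([b.width for b in boxes]),
    }


# Made-up boxes: three side-by-side letters versus a long connected body
# with a mark above it, a mark below it, and a small component inside it.
latin_like = [Box(0, 0, 10, 20), Box(12, 0, 22, 20), Box(24, 0, 34, 20)]
arabic_like = [Box(0, 5, 40, 25), Box(10, 0, 14, 4),
               Box(20, 27, 24, 31), Box(30, 10, 34, 14)]

print(word_features(latin_like))
print(word_features(arabic_like))
```

In a real pipeline the boxes would come from connected-component labeling of a binarized word image, and the features would feed whatever decision rule the method uses. The sample coordinates above are invented and carry no experimental meaning.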