Syntax annotation for the GENIA corpus

IJCNLP (companion), 2005

Abstract
A linguistically annotated corpus based on texts in the biomedical domain has been constructed to tune natural language processing (NLP) tools for bio-textmining. As the focus of information extraction shifts from "nominal" information such as named entities to "verbal" information such as the functions and interactions of substances, the application of parsers has become one of the key technologies, and a corpus annotated for the syntactic structure of sentences is therefore in demand. A subset of the GENIA corpus consisting of 500 MEDLINE abstracts has been annotated for syntactic structure in an XML-based format based on the Penn Treebank II (PTB) scheme. An inter-annotator agreement test indicated that the writing style, rather than the content, of the research abstracts is the source of the difficulty in tree annotation, and that annotation can be done stably by linguists without much knowledge of biology, given appropriate guidelines on linguistic phenomena particular to scientific texts.
Keywords
information extraction, natural language processing