New Tools for Web-Scale N-grams

LREC 2010 - Seventh International Conference on Language Resources and Evaluation (2010)

Cited by 89 | Viewed 125
Abstract
While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally-similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. The tools will allow novel sources of information to be applied to long-standing natural language challenges.
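The core idea of an N-gram corpus — recording how often each word sequence up to length N occurs — can be illustrated with a minimal sketch. The function name `ngram_counts` and the toy sentence below are illustrative, not from the paper or its released tools:

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count every contiguous token sequence of length 1..max_n,
    the same statistic an N-gram corpus stores for a text collection."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens)
# The unigram ("the",) occurs twice; the bigram ("the", "cat") once.
```

A real web-scale corpus stores these counts on disk (often with a minimum-frequency cutoff); the enhanced corpora the paper describes additionally attach annotations such as part-of-speech tags to each N-gram.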
Keywords
natural language, search engine