The Advantages and Challenges of "Big Data": Insights from the 14 Billion Word iWeb Corpus

Linguistic Research (2019)

Abstract
The iWeb corpus contains nearly 14 billion words from 22 million web pages, and it has been designed so that users can quickly and easily create "Virtual Corpora" that focus on websites related to their areas of interest. The data from this very large corpus provide detailed information on syntactic, morphological, lexical, and semantic phenomena, in ways that would never be possible with a smaller corpus of 100 million or 500 million words. In addition, the corpus provides a number of features that are not available with other large corpora, such as the ability to perform advanced searches of the top 60,000 words in the corpus, and to see a wealth of information on each of these words: definitions, links to images and audio, translations, detailed frequency information, related topics, collocates, word clusters, re-sortable concordance lines, and much more. Finally, we discuss the challenges of large corpora, and how the corpus architecture used for iWeb has been uniquely designed to address these challenges.
Keywords
iWeb, virtual corpora, big data, BNC (British National Corpus), COCA (Corpus of Contemporary American English)