The best of both worlds: Multi-billion word “dynamic” corpora

Abstract
Nearly all of the very large corpora of English are "static", which allows a wide range of one-time, pre-processed data, such as collocates. The challenge comes with large "dynamic" corpora, which are updated regularly and where preprocessing is much more difficult. This paper provides an overview of the NOW corpus (News on the Web), which is currently 8.2 billion words in size and which grows by about 170 million words each month. We discuss the architecture of NOW, and provide many examples that show how data from NOW can (uniquely) be extracted to look at a wide range of ongoing changes in English.

1 Corpus architecture

Multi-billion word corpora have become commonplace in the last 5-10 years. For example, there are several different 10-20 billion word corpora from Sketch Engine (Kilgarriff et al. 2014; www.sketchengine.eu), Corpora from the Web (Schäfer 2015; corporafromtheweb.org), and English-Corpora.org (formerly the BYU Corpora). Most of these, however, are "static" corpora. The corpus texts are collected and annotated, and they are then indexed and preprocessed in other ways, which makes text retrieval very fast even on very large corpora. In the 14 billion word iWeb corpus (https://www.english-corpora.org/iweb), for example, users can search by word form, lemma, part of speech, synonyms, user-defined wordlists, and more. A search for a complex string like VERB _a =EXPENSIVE @CLOTHES (verb + article + any form of any synonym of expensive + any form of any word in the user-defined clothes wordlist) takes just 2-3 seconds.

iWeb and all of the corpora from English-Corpora.org are based on highly-optimized relational databases, which yields corpora that are typically 5-10 times as fast as other large corpora (see www.english-corpora.org/speed.asp). The underlying architecture is similar to that of "columnstore" databases. In a 14 billion word corpus, for example, there are 14 billion rows, each with a structure like the following:

Figure 1: Corpus architecture

Each word / lemma / PoS combination is represented as an integer value, which is tied to an entry in the lexicon (stored in a separate database). In Figure 1, for example, the integer value [1983] represents [ best / best / jjt ]. There is a clustered index on this "middle" column ([word11] in Figure 1), which means that all of the tokens of any word (best in this case) are stored physically adjacent to each other on the SSD, which greatly increases access speed.

As it carries out a search, iWeb (or any of the corpora from English-Corpora.org) parses the search string to find its lowest-frequency, "weakest" part. In the search string the best NOUN, for example, the word best occurs far less frequently than either the or the set of all nouns. The search therefore focuses first on the lemma best, and only when it has found those rows (all of the rows containing the value 1983 in column [word11]) does it narrow them to rows where the preceding column ([word10] in Figure 1) holds the value for the and the following column ([word12] in Figure 1) holds an integer value tied to a noun in the lexicon. (For reasons of space, Figure 1 shows only the two columns immediately to the left and to the right of the "node" column, but, depending on the corpus, there are 5-10 columns on each side.) Davies (2019) explains the underlying architecture in more detail, and provides a number of examples showing that corpora with this architecture are typically 5-10 times as fast as other very large corpora.
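As a concrete illustration of this row layout and of the "weakest link" strategy, the following minimal Go sketch models the node column and its grouping with an in-memory map. All identifiers and the toy data are invented here; the actual corpora implement this with clustered indexes inside a relational database, not in application code.

```go
// Minimal sketch of the row layout in Figure 1 and of the "weakest link"
// search order. Purely illustrative; names and data are invented.
package main

import (
	"fmt"
	"strings"
)

// One corpus row: one integer per "slot", as in Figure 1 (only the column
// immediately to the left and to the right of the node word is shown).
type Row struct {
	TextID                 int
	Word10, Word11, Word12 int // previous word, node word, next word
}

// One lexicon entry: every word/lemma/PoS combination has one integer ID.
type Entry struct {
	Word, Lemma, PoS string
}

func main() {
	// Toy lexicon; [1983] = best/best/jjt as in Figure 1, the rest invented.
	lexicon := map[int]Entry{
		1:    {"the", "the", "at"},
		1983: {"best", "best", "jjt"},
		77:   {"option", "option", "nn1"},
		80:   {"is", "be", "vbz"},
	}

	// Toy corpus rows ("the best option", "the best is", "option is ...").
	corpus := []Row{
		{TextID: 1, Word10: 1, Word11: 1983, Word12: 77},
		{TextID: 1, Word10: 1, Word11: 1983, Word12: 80},
		{TextID: 2, Word10: 1, Word11: 77, Word12: 80},
	}

	// Stand-in for the clustered index on the node column [word11]:
	// all tokens of a given word are grouped together and found at once.
	byWord11 := map[int][]Row{}
	for _, r := range corpus {
		byWord11[r.Word11] = append(byWord11[r.Word11], r)
	}

	// Search "the best NOUN": start from the lowest-frequency slot (best),
	// then check the neighbouring columns of only those few candidate rows.
	var hits []Row
	for _, r := range byWord11[1983] {
		prev, next := lexicon[r.Word10], lexicon[r.Word12]
		if prev.Lemma == "the" && strings.HasPrefix(next.PoS, "nn") {
			hits = append(hits, r)
		}
	}
	fmt.Println("rows matching 'the best NOUN':", hits)
}
```

In the real corpora the grouping comes from the clustered index on [word11] rather than from a map, but the order of operations is the same: locate the rare value first, then test its neighbouring columns.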
Crucially, much of this speed advantage comes from the fact that other corpora typically parse the search string from left to right (starting, for example, with the word the in the string the best NOUN), whereas we focus first on the "weakest link" in the search string. Our approach also takes full advantage of relational database architecture, such as JOINs across any number of highly-optimized tables. In the example of VERB _a =EXPENSIVE @CLOTHES shown above (verb + article + any form of any synonym of expensive + any form of any word in the user-defined clothes wordlist), the search uses lemma and part of speech information from the main [lexicon] table, as well as a separate [synonyms] table containing entries for more than 65,000 words, and another table containing user-defined lists such as clothing, emotions, or a particular class of verbs. Additional tables could contain pronunciation information or further semantic information, and the search speed hardly decreases (if at all) no matter how many tables are involved. Finally, there is a [sources] table that can contain any number of columns related to each of the texts in the corpus, and these are JOINed to the main corpus table (e.g. Figure 1) via the [textID] value. This allows users to quickly and easily create "virtual corpora" using any of the metadata from the [sources] table, such as author, date, website, or genre (a sketch of this kind of metadata join is given below, after the overview of the NOW corpus).

When the corpus sees that all of the "slots" in a search are very frequent, it defaults to using preprocessed n-grams, which are even faster than the approach just described. A very high frequency search like NOUN NOUN, for example, takes less than two seconds, because it searches only 10 or 100 million rows of data in the n-grams databases. (The downside of the n-gram tables is that they refer to the entire corpus, and not just particular sections, such as certain genres or texts.)

Figure 2: iWeb high frequency: NOUN + NOUN

Finally, as with the Sketch Engine corpora, other data such as collocates are pre-processed in iWeb, which means they can be retrieved in just a second or two.

Figure 3: iWeb collocates for bread

Pre-processing also allows for very fast retrieval (1-2 seconds for results from the 14 billion word corpus) of word clusters, related topics (words that frequently co-occur anywhere on the 22 million web pages), websites that use the word the most (which can be used to quickly and easily create "virtual corpora" on almost any topic), and sample concordance lines (see Davies 2019).

2 Creating the dynamic NOW corpus

As we will discuss in Section 4, the challenge comes, however, when we create a corpus that is "dynamic". (We define "dynamic" corpora as corpora to which texts are continually added, rather than corpora in which texts are both added and deleted – although our architecture would have the same advantages in that case as well.) An example of a dynamic corpus is the NOW corpus ("News on the Web"; www.english-corpora.org/now), which is – as far as we are aware – the only corpus larger than a billion words that is growing on a regular basis (at least every month). The NOW corpus debuted at 3.6 billion words in May 2016 (with texts going back to 2010) and is now (early July 2019) about 8.2 billion words in size. Every month 150-170 million words are added to the corpus, or about 1.5 billion words each year.
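In the architecture described in Section 1, a "virtual corpus" is nothing more than a filter on the [sources] metadata, JOINed to the main corpus table via [textID]; for a dynamic corpus such as NOW, the same mechanism can restrict a search to, say, a single month or country. The following minimal Go sketch illustrates the idea in memory; all identifiers and the toy data are invented here, and the actual corpora do this with SQL JOINs against the [sources] table rather than in application code.

```go
// Illustrative sketch of a "virtual corpus": corpus rows carry only a textID,
// and all text-level metadata lives in a separate sources table that is
// joined in at query time. Names and data are invented.
package main

import "fmt"

// Source holds the text-level metadata kept in the [sources] table.
type Source struct {
	Country string
	Month   string // e.g. "2019-06"
	Website string
}

// Hit is one match from the main corpus table; it carries only the textID,
// so any metadata has to be looked up in the sources table.
type Hit struct {
	TextID int
	Phrase string
}

func main() {
	// Stand-in for the [sources] table, keyed by textID.
	sources := map[int]Source{
		101: {"US", "2019-06", "example-news.com"},
		102: {"GB", "2019-06", "example-paper.co.uk"},
		103: {"US", "2019-05", "example-mag.com"},
	}

	// Hits returned by some search over the main corpus table.
	hits := []Hit{
		{101, "the best option"},
		{102, "the best decision"},
		{103, "the best idea"},
	}

	// "Virtual corpus": keep only the hits whose source metadata matches
	// the user's filter -- here, US texts from June 2019.
	var filtered []Hit
	for _, h := range hits {
		if s, ok := sources[h.TextID]; ok && s.Country == "US" && s.Month == "2019-06" {
			filtered = append(filtered, h)
		}
	}
	fmt.Println("hits in the virtual corpus:", filtered)
}
```

Nothing in this lookup depends on when a text was added, which is why the same mechanism remains usable for a corpus that grows every month.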
Note that similar corpora for Spanish and Portuguese are also available (corpusdelespanol.org/now: 6.0 billion words from 21 Spanish-speaking countries since 2012, and corpusdoportugues.org/now: 1.3 billion words from 4 Portuguese-speaking countries since 2012), but the English NOW corpus will be the focus of this paper.

To create the NOW corpus, every hour five different machines search Google News to retrieve newly-listed newspaper and magazine articles for 20 different English-speaking countries (the same 20 countries as GloWbE; see Davies 2013). For example, Figure 4 shows just two sample entries from Google News from 3 July 2019; on average we gather the URLs for about 20,000 such articles each day.

Figure 4: Sample Google News entries

The metadata for each of the roughly 20,000 articles that appear each day (URL, title, source, Google snippet) is stored in a relational database. For example, the following is a small selection of the links from Google News for the US and Canada for the last hour on April 24, 2019, as the initial version of this paper was being written:

Figure 5: NOW sample list of articles

At the end of the month, we download the 250,000-300,000 articles using a custom program written in the Go language, which downloads all of the 250,000+ texts in about 30-40 minutes. We then use JusText (Pomikálek 2011; corpus.tools/wiki/Justext) to remove boilerplate material, and we tag the texts with CLAWS 7 for English (see Garside and Smith 1997) and with a customized tagger based on Eckhard Bick's Palavras tagger for the Portuguese and Spanish corpora (Bick 1999). We then remove duplicate articles (always a problem in newspaper-based corpora) by looking for duplicate 11-grams across texts (see the sketch below). For example, if a text has 68 11-grams starting with the word the, and 39 of these 11-grams are also found in any of the other 250,000+ texts from that month, then the text is tagged as a probable duplicate and is removed from the corpus. (This process takes only 2-3 minutes for the 150-170 million words, because of the relational database architecture underlying the corpus.) Once we have done all of these steps, the new texts are added to the existing corpus. As Figure 6 shows (for Nov 2018 – June 2019), this results in about 150-175 million additional words of data each month:

Figure 6: NOW size by month (last 8 months)

Note that NOW contains just those articles that Google News links to, which are primarily newspaper and magazine sites. But there is an incredible variety in these sites – they are not just "staid" broadsheet newspapers. They include magazine and newspaper articles dealing not only with current events, but also with technology, entertainment, and a wide variety of other topics (as is evidenced by the 7,000+ "news" sites in a given month, as shown in Figure 6). Evidence for the often informal nature of the texts comes from an investigation of the lexical creativity in the corpus. For example, there are more than 540 different –alypse words formed by analogy to the word apocalypse, such as snarkpocalypse, snowpocalypse, chocopalypse, and crapocalypse.
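Returning to the duplicate-removal step referenced above, the 11-gram check can be sketched as follows. This is a minimal illustration only: the identifiers, the toy texts, and the more-than-half threshold are invented here, and the production pipeline performs the comparison inside the relational database across all 250,000+ texts of a month rather than in application code.

```go
// Illustrative sketch of the 11-gram duplicate check. Names, data, and the
// threshold are invented; the real check runs inside the database.
package main

import (
	"fmt"
	"strings"
)

const n = 11 // length of the word n-grams compared across texts

// ngramsStartingWithThe returns every n-word sequence in the text whose
// first word is "the", mirroring the check described in the prose above.
func ngramsStartingWithThe(text string) []string {
	words := strings.Fields(strings.ToLower(text))
	var out []string
	for i := 0; i+n <= len(words); i++ {
		if words[i] != "the" {
			continue
		}
		out = append(out, strings.Join(words[i:i+n], " "))
	}
	return out
}

// probableDuplicate reports whether a large share of the text's 11-grams
// (here: more than half, an invented threshold) already occur in the
// 11-grams collected from the other texts of the month.
func probableDuplicate(text string, seen map[string]bool) bool {
	grams := ngramsStartingWithThe(text)
	if len(grams) == 0 {
		return false
	}
	shared := 0
	for _, g := range grams {
		if seen[g] {
			shared++
		}
	}
	return shared*2 > len(grams)
}

func main() {
	seen := map[string]bool{} // 11-grams from texts accepted so far this month

	texts := []string{
		"the mayor announced on tuesday that the city council will vote on the new budget next week",
		"the mayor announced on tuesday that the city council will vote on the new budget next week after a long debate",
		"an unrelated article about the local football team and the weekend weather with enough words for a few eleven grams",
	}

	for i, t := range texts {
		if probableDuplicate(t, seen) {
			fmt.Printf("text %d: probable duplicate, removed\n", i)
			continue
		}
		for _, g := range ngramsStartingWithThe(t) {
			seen[g] = true
		}
		fmt.Printf("text %d: kept\n", i)
	}
}
```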