Data-Driven Neologism Mining in a TV Corpus

semanticscholar(2022)

Abstract
New words emerge in language all the time. Some become established in the language for a long time, while others disappear from use as quickly as they appeared. Building on our previous methods for historical data (Säily et al., 2021), we focus on neologisms in a more contemporary setting. Our aim is to study the emergence and use of neologisms in the TV Corpus, which contains 325 million words of subtitles (Davies, 2021). Because of the size of the corpus, studying neologism candidates by hand would be prohibitively time-consuming. We therefore apply a filtering approach to create an initial list of neologism candidates. First, we extract the publication year of the TV show episode or movie in which each lemma in the corpus first appears. This gives us the earliest attestation of every lemma in the corpus. We initially compared this list with the earliest attestations in the Oxford English Dictionary (OED Online, n.d.), treating as potential neologism candidates the words that appear in our corpus at the same time as or before their earliest recorded attestation in the OED. Unlike in our previous studies with historical data (Säily et al., 2021), this comparison did not yield enough results. For this reason, we use a large reference corpus, the Corpus of Historical American English (COHA) (Davies, 2012), for the filtering. We thus compare the earliest occurrences of words in the TV Corpus with their earliest occurrences in COHA, producing a list of words that appeared earlier in the TV Corpus than in COHA. This list of candidates will then be reviewed manually by carefully studying each occurrence of a potential neologism. The use of novel vocabulary in television series has been studied by, e.g., Bednarek (2018: Chapter 9). We aim to scale up this research by using a significantly larger corpus and automating comparisons with dictionary and corpus data.
By comparing the TV Corpus with COHA and the OED and by utilizing the metadata associated with them, we are able to analyse the diachronic development of neologism use in English-language television series as well as register variation in their frequency, types, functions and semantics. As an example, science fiction series often seem to use words related to technological innovations (e.g. biodome in our data), and in some series neology may act as a characterization device (Reichelt, forthcoming).
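The filtering step described in the abstract (find the earliest year for each lemma, then keep lemmas attested earlier in the TV Corpus than in COHA) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `(lemma, year)` input format, and the `include_unattested` switch for lemmas missing from the reference corpus are all assumptions made for illustration, and the toy lemmas and years are invented placeholders rather than corpus data.

```python
from typing import Dict, Iterable, List, Tuple


def earliest_attestations(tokens: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Map each lemma to the earliest year in which it occurs.

    `tokens` is an assumed input format: (lemma, year) pairs, where the
    year is the publication year of the episode or movie.
    """
    first_seen: Dict[str, int] = {}
    for lemma, year in tokens:
        if lemma not in first_seen or year < first_seen[lemma]:
            first_seen[lemma] = year
    return first_seen


def neologism_candidates(
    tv_first: Dict[str, int],
    reference_first: Dict[str, int],
    include_unattested: bool = True,
) -> List[Tuple[str, int]]:
    """Return (lemma, year) pairs attested earlier in the TV data than
    in the reference corpus, sorted by year and then by lemma.

    Keeping lemmas that are absent from the reference corpus is an
    assumption of this sketch, controlled by `include_unattested`.
    """
    candidates = []
    for lemma, year in tv_first.items():
        if lemma not in reference_first:
            if include_unattested:
                candidates.append((lemma, year))
        elif year < reference_first[lemma]:
            candidates.append((lemma, year))
    return sorted(candidates, key=lambda item: (item[1], item[0]))


# Toy placeholder data (invented; not taken from the TV Corpus or COHA).
tv_tokens = [("alpha", 1995), ("beta", 2003), ("alpha", 1990), ("gamma", 2010)]
reference_first = {"alpha": 1985, "beta": 2005}

tv_first = earliest_attestations(tv_tokens)
candidates = neologism_candidates(tv_first, reference_first)
```

With this toy input, `candidates` contains `beta` (earlier in the TV data than in the reference) and `gamma` (absent from the reference), while `alpha` is filtered out. A list produced this way is only a starting point, since the abstract states that each candidate occurrence is then studied manually.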