Self-Consistency as an Inductive Bias in Early Language Acquisition.

CogSci (2014)

Cited 27 | Viewed 11
Abstract
Abdellah Fourtassi (abdellah.fourtassi@gmail.com), Ewan Dunbar (emd@umd.edu), Emmanuel Dupoux (emmanuel.dupoux@gmail.com)
Laboratoire de Sciences Cognitives et Psycholinguistique, ENS/EHESS/CNRS, Paris 75005, France

In this paper we introduce an inductive bias for language acquisition under a view where learning of the various levels of linguistic structure takes place interactively. The bias encourages the learner to choose sound systems that lead to more "semantically coherent" lexicons. We quantify this coherence using an intrinsic and unsupervised measure of predictiveness called "self-consistency." We found self-consistency to be optimal under the true phonemic inventory and the correct word segmentation in English and Japanese.

Keywords: Language acquisition, inductive bias, phonemes, word segmentation, semantics.

Introduction

In learning their native language, infants need to make sense of the sounds they are hearing. For the segmental inventory, they need to decide how much of the detail present in the signal matters, and how much of the detail they should ignore. The inventories that human lexicons make use of are somewhere in between maximally coarse and maximally fine-grained. For word segmentation, learners need to decide what to take as a lexical unit of speech: this could in principle be anywhere from a single segment up to an entire utterance, but, in reality, the result is somewhere in between.

Whether learning is seen from a nativist or empiricist perspective, it cannot happen without some kind of learning bias (whether domain specific or domain general) which delimits the hypothesis space, however broadly, and favors one representation over another, however weakly (see Pearl and Goldwater (in press) for a review).

In this paper we propose a novel learning bias and show that it aids in picking out the right level of granularity for both the segmental inventory and lexical segmentation. It makes use of the synergy between different levels of representation (inventory, lexicon, semantics). It takes a systemic approach to language acquisition, whereby infants are understood as trying to build and optimize a coherent system with compatible levels of representation.

Recent developmental studies have indeed begun to suggest that infants start learning both the sound system and the lexicon of their native language at the same time, around 6 months (see Gervain and Mehler (2010) for a review). This paper proposes that these two levels crucially interact in learning.

The bias towards global coherence is coded by a measure we call the self-consistency score (SC-score). It is used to evaluate a phonetic inventory and a word segmentation, as a function of the predictiveness of the lexicon they induce. The lexicon should be one in which words are highly predictive of other (neighboring) words. This can be seen as guiding the learner towards a more "semantically coherent" lexicon. We show, using English and Japanese corpora, that the SC-score picks out the correct (ideal) inventory and word boundaries. We also show that, although the SC-score has some free parameters, it is largely independent of the way these parameters are set.

The paper is organized as follows. We begin by setting the framework of our experiment (modeling of phonetic variation, word segmentation, and semantics). Then, we introduce our learning bias, the SC-score, and explain how it links these different levels of representation in a coherent and intuitive fashion. Next, we present the results of our simulations on two different speech corpora in English and Japanese.
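The excerpt above describes the SC-score only informally, as an intrinsic, unsupervised measure of how predictive words are of their neighbors. The Python sketch below illustrates that intuition and is not the authors' actual measure: it scores a segmented, transcribed corpus by the average similarity between each token's co-occurrence profile and the profiles of its neighbors. The function names, the context window, and the use of cosine similarity over raw co-occurrence counts are all assumptions made for this example.

from collections import defaultdict
import math

def cooccurrence_vectors(utterances, window=2):
    # Count, for every word type, how often each other word occurs within +/- window tokens.
    vecs = defaultdict(lambda: defaultdict(float))
    for words in utterances:
        for i, w in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[w][words[j]] += 1.0
    return vecs

def cosine(u, v):
    # Cosine similarity between two sparse count vectors (dicts).
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def toy_self_consistency(utterances, window=2):
    # Average similarity between each token's co-occurrence profile and its
    # neighbors' profiles: higher means a more mutually predictive lexicon.
    vecs = cooccurrence_vectors(utterances, window)
    total, n = 0.0, 0
    for words in utterances:
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    total += cosine(vecs[w], vecs[words[j]])
                    n += 1
    return total / n if n else 0.0

# Example: each utterance is a list of word forms induced by some
# inventory + segmentation hypothesis.
corpus = [["look", "at", "the", "dog"], ["the", "dog", "barks"]]
print(toy_self_consistency(corpus))

Under a toy measure of this kind, a transcription that collapses too many segments (merging many distinct word forms) or a segmentation at the wrong granularity should yield a lexicon whose words predict their neighbors less well, lowering the score.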
The framework

In order to acquire language, infants must undo various kinds of sub-phonemic variation present in the phonetics, segment words from continuous utterances, and assign meaning to these words. In this section, we explain how phonetic inventories, word segmentation, and semantics are operationalized in this study.

Corpora

We use two speech corpora: the Buckeye Speech corpus (Pitt, Johnson, Hume, Kiesling, & Raymond, 2005), which consists of 40 hours of spontaneous conversations with 40 speakers of American English, and the core of the Corpus of Spontaneous Japanese (Maekawa, Koiso, Furui, & Isahara, 2000), which consists of 45 hours of recorded spontaneous conversations and public speeches in different fields, ranging from engineering to humanities. Following Boruta (2012), we use an inventory of 25 phonemes for transcribing Japanese. For English, we use the phonemic transcription of Pitt et al. (2005), which consists of a set of 45 phonemes. We take these phonemic transcriptions to give the ideal lexical inventories for the two languages.

Phonetic variation

We generate alternate inventories for English and Japanese by modifying the phonetic transcription of each corpus, starting from the ideal (i.e., phonemic) transcription. To generate inventories smaller than the true inventory, we collapse the segments into 9 natural classes: stops, fricatives, affricates, nasals, liquids, glides, high vowels, mid vowels and low vowels; then, into 4 coarser-grained classes: obstruents,
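To make the inventory-coarsening procedure concrete, the sketch below (in the same illustrative spirit) re-transcribes an utterance after mapping each phoneme to a natural-class label, which shrinks the inventory and merges previously distinct word forms. The phoneme-to-class assignments are hypothetical ARPAbet-style examples covering only a handful of segments; they are not the paper's exact mapping for the Buckeye or CSJ phoneme sets.

# Hypothetical phoneme-to-class mapping (illustrative subset, ARPAbet-style);
# the paper's actual class memberships depend on the Buckeye / CSJ inventories.
COARSE_CLASSES = {
    "p": "stop", "t": "stop", "k": "stop", "b": "stop", "d": "stop", "g": "stop",
    "f": "fricative", "s": "fricative", "z": "fricative",
    "ch": "affricate", "jh": "affricate",
    "m": "nasal", "n": "nasal",
    "l": "liquid", "r": "liquid",
    "w": "glide", "y": "glide",
    "iy": "high_vowel", "uw": "high_vowel",
    "eh": "mid_vowel", "ow": "mid_vowel",
    "aa": "low_vowel", "ae": "low_vowel",
}

def collapse_transcription(segments, mapping):
    # Replace each phoneme by its coarser class label; segments missing from
    # the mapping are left unchanged.
    return [mapping.get(seg, seg) for seg in segments]

# "dock" /d aa k/ and "tack" /t ae k/ collapse to the same coarse form,
# so the induced lexicon loses the distinction between them.
print(collapse_transcription(["d", "aa", "k"], COARSE_CLASSES))
print(collapse_transcription(["t", "ae", "k"], COARSE_CLASSES))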
Keywords
early language acquisition, inductive bias, self-consistency