Cross-linguistically Consistent Semantic and Syntactic Annotation of Child-directed Speech

ArXiv(2021)

Abstract
While corpora of child speech and child-directed speech (CDS) have enabled major contributions to the study of child language acquisition, semantic annotation for such corpora is still scarce and lacks a uniform standard. We compile two CDS corpora with sentential logical forms, one in English and the other in Hebrew. In compiling the corpora we employ a methodology that enforces a cross-linguistically consistent representation, building on recent advances in dependency representation and semantic parsing. The corpora are based on a sizable portion of Brown’s Adam corpus from CHILDES (≈80% of its child-directed utterances), and on all child-directed utterances from Berman’s Hebrew CHILDES corpus Hagar. We begin by annotating the corpora with the Universal Dependencies (UD) scheme for syntactic annotation, motivated by its applicability to a wide variety of domains and languages. We then apply an automatic method for transducing sentential logical forms (LFs) from UD structures. The two representations have complementary strengths: UD structures are language-neutral and support direct annotation, whereas LFs are neutral as to the interface between syntax and semantics, and transparently encode semantic distinctions. We verify the quality of the UD annotation using an inter-annotator agreement study. We then demonstrate the utility of the compiled corpora through a longitudinal corpus study of the prevalence of different syntactic and semantic phenomena.

Szubert, Goldwater and Steedman are from the School of Informatics at the University of Edinburgh, UK; Gibbon is from the Centre for Clinical Brain Sciences at the University of Edinburgh, UK; Abend is from the School of Computer Science and Engineering and the Department of Cognitive Science of the Hebrew University of Jerusalem, Israel; Schneider is from the Departments of Linguistics and Computer Science of Georgetown University, Washington, D.C., USA.
ORCID identifiers are Abend (0000-0003-4311-3876), Schneider (0000-0002-5994-671X), Gibbon (0000-0002-5485-7523), Goldwater (0000-0002-7298-0947), Steedman (0000-0003-2509-0797). Email for correspondence: omri.abend@mail.huji.ac.il (Abend).
arXiv:2109.10952v1 [cs.CL] 22 Sep 2021
Keywords
syntactic annotation, consistent semantic, speech, cross-linguistically, child-directed