Putting a Value on Comparable Data.
Meeting of the Association for Computational Linguistics (2011)
Abstract
Machine translation began in 1947 with an influential memo by Warren Weaver. In that memo, Weaver noted that human code-breakers could transform ciphers into natural language (e.g., into Turkish)
• without access to parallel ciphertext/plaintext data, and
• without knowing the plaintext language's syntax and semantics.
Simple word- and letter-statistics seemed to be enough for the task. Weaver then predicted that such statistical methods could also solve a tougher problem, namely language translation.
This raises the question: can sufficient translation knowledge be derived from comparable (non-parallel) data?
In this talk, I will discuss initial work in treating foreign language as a code for English, where we assume the code to involve both word substitutions and word transpositions. In doing so, I will quantitatively estimate the value of non-parallel data, versus parallel data, in terms of end-to-end accuracy of trained translation systems. Because we still know very little about solving word-based codes, I will also describe successful techniques and lessons from the realm of letter-based ciphers, where the non-parallel resources are (1) enciphered text, and (2) unrelated plaintext. As an example, I will describe how we decoded the Copiale cipher with limited computer-like knowledge of the plaintext language.
The talk will wrap up with challenges in exploiting comparable data at all levels: letters, words, phrases, syntax, and semantics.
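As a toy illustration of decipherment from non-parallel resources, the sketch below breaks a simple shift cipher using only (1) the enciphered text and (2) unrelated plaintext, from which it estimates letter statistics. This is not the method from the talk, and a shift cipher is far simpler than the Copiale cipher; the function names, the sample texts, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter
from string import ascii_lowercase as ALPHABET


def shift_text(text, shift):
    """Shift each lowercase letter by `shift` places; leave other characters alone."""
    out = []
    for ch in text:
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def letter_log_probs(plaintext):
    """Add-one smoothed unigram letter log-probabilities from unrelated plaintext."""
    counts = Counter(ch for ch in plaintext if ch in ALPHABET)
    total = sum(counts.values()) + 26
    return {ch: math.log((counts[ch] + 1) / total) for ch in ALPHABET}


def decipher(ciphertext, unrelated_plaintext):
    """Return (key, decoding) that maximizes the unigram log-likelihood of the decoding."""
    log_p = letter_log_probs(unrelated_plaintext)

    def score(key):
        candidate = shift_text(ciphertext, -key)
        return sum(log_p[ch] for ch in candidate if ch in ALPHABET)

    best_key = max(range(26), key=score)
    return best_key, shift_text(ciphertext, -best_key)


# Hypothetical sample data: the reference text is unrelated to the secret message.
REFERENCE = (
    "the quick brown fox jumps over the lazy dog while the sun sets over "
    "the quiet hills and the evening wind carries the scent of rain across "
    "the fields near the old stone house"
)
SECRET = (
    "weaver noted that simple letter statistics seemed to be enough "
    "to break a cipher without any parallel text"
)

ciphertext = shift_text(SECRET, 7)
key, decoded = decipher(ciphertext, REFERENCE)
print(key, decoded)
```

The decoder never sees a ciphertext/plaintext pair: the only knowledge of the plaintext language comes from letter counts in the unrelated reference text. Richer ciphers need richer models (for example, letter n-grams and a search over substitution tables), but the division of resources is the same.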
Keywords
value, data