COMBINING STATISTICAL SIMILARITY MEASURES FOR AUTOMATIC INDUCTION OF SEMANTIC CLASSES

San Juan (2005)

Abstract
In this paper, an unsupervised semantic class induction algorithm is proposed, based on the principle that similarity of context implies similarity of meaning. Two semantic similarity metrics, both variations of the Vector Product distance, are used to measure the semantic distance between words and to automatically generate semantic classes. The first metric computes "wide-context" similarity between words using a "bag-of-words" model, while the second metric computes "narrow-context" similarity using a bigram language model. A hybrid metric, defined as the linear combination of the wide-context and narrow-context metrics, is also proposed and evaluated. To cluster words into semantic classes, an iterative clustering algorithm is used. The semantic metrics are evaluated on two corpora: a semantically heterogeneous web news domain (HR-Net) and an application-specific travel reservation corpus (ATIS). For the hybrid metric, semantic class member precision of 85% is achieved at 17% recall for the HR-Net task, and precision of 85% is achieved at 55% recall for the ATIS task.
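The abstract does not give the exact form of the two metrics, so the following is only a minimal sketch of the general idea, under stated assumptions: cosine similarity (a normalized vector product) stands in for the paper's Vector Product variants, the wide context is taken to be the whole sentence, the narrow context is taken to be the immediate left and right neighbors, and the mixing weight `lam` and all function names are hypothetical. The iterative clustering step is not shown.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Normalized vector product of two sparse count vectors (assumed metric)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def wide_context(sentences, word):
    """Bag-of-words context: every other token in sentences containing `word`."""
    counts = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == word:
                counts.update(sent[:i] + sent[i + 1:])
    return counts


def narrow_context(sentences, word):
    """Bigram-style context: immediate left and right neighbors, kept distinct."""
    counts = Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]  # sentence boundary markers
        for i in range(1, len(padded) - 1):
            if padded[i] == word:
                counts[("L", padded[i - 1])] += 1
                counts[("R", padded[i + 1])] += 1
    return counts


def hybrid_similarity(sentences, w1, w2, lam=0.5):
    """Linear combination of wide- and narrow-context similarity.

    `lam` is a hypothetical mixing weight; the abstract does not state its value.
    """
    wide = cosine(wide_context(sentences, w1), wide_context(sentences, w2))
    narrow = cosine(narrow_context(sentences, w1), narrow_context(sentences, w2))
    return lam * wide + (1 - lam) * narrow


if __name__ == "__main__":
    corpus = [
        ["fly", "to", "boston", "today"],
        ["fly", "to", "denver", "today"],
    ]
    # "boston" and "denver" occur in identical contexts in this toy corpus.
    print(hybrid_similarity(corpus, "boston", "denver"))
```

In this toy corpus, "boston" and "denver" share both their sentence-level and their neighbor-level contexts, so the hybrid score is at its maximum, whereas "boston" and "fly" share no left or right neighbors and get a narrow-context score of zero.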
Keywords
iterative methods,natural languages,pattern clustering,unsupervised learning,application-specific travel reservation corpus,bigram language model,heterogeneous Web news domain,iterative clustering algorithm,statistical similarity measures,unsupervised semantic class induction algorithm,vector product distance