IDF for Word N-grams

ACM Trans. Inf. Syst.(2017)

引用 19|浏览58
暂无评分
摘要
Inverse Document Frequency (IDF) is widely accepted term weighting scheme whose robustness is supported by many theoretical justifications. However, applying IDF to word N-grams (or simply N-grams) of any length without relying on heuristics has remained a challenging issue. This article describes a theoretical extension of IDF to handle N-grams. First, we elucidate the theoretical relationship between IDF and information distance, a universal metric defined by the Kolmogorov complexity. Based on our understanding of this relationship, we propose N-gram IDF, a new IDF family that gives fair weights to words and phrases of any length. Based only on the magnitude relation of N-gram IDF weights, dominant N-grams among overlapping N-grams can be determined. We also propose an efficient method to compute the N-gram IDF weights of all N-grams by leveraging the enhanced suffix array and wavelet tree. Because the exact computation of N-gram IDF provably requires significant computational cost, we modify it to a fast approximation method that can estimate weight errors analytically and maintain application-level performance. Empirical evaluations with unsupervised/supervised key term extraction and web search query segmentation with various experimental settings demonstrate the robustness and language-independent nature of the proposed N-gram IDF.
更多
查看译文
关键词
Term weighting,multiword expression,Kolmogorov complexity,information distance,wavelet tree,Poisson distribution
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要