Class-dependent Canonical Correlation Analysis for scalable cross-lingual document categorization.

CIDM (2013)

Abstract
Canonical Correlation Analysis (CCA) is used to infer a semantic space in which text documents written in different languages can be mapped to a language-independent representation, called latent topics. This greatly reduces the complexity of dealing with different languages, since a document classifier can be trained on the labeled documents in one language and then applied to classify documents in another language. This topic modeling task is usually performed in a class-independent manner. The performance of CCA depends on the number of documents used to infer the semantic space. However, CCA has a high computational complexity with respect to the number of training documents. In this paper, we propose CD-CCA, a scalable variant of CCA in which the projection is performed in a class-dependent manner. It generates a semantic space for each category separately. A binary document classifier is then trained for each category on its own semantic space. CD-CCA was applied to English-Chinese document classification. The experimental results show that CD-CCA can handle large training sets without hurting the performance of the underlying classifiers compared to traditional CCA. CD-CCA also opens the door to distributed training of the semantic spaces of the different categories.
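The class-dependent scheme described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how each per-category space is fitted or which classifier is used, so the code assumes that each category's space is inferred from paired English/Chinese documents of that category, uses a regularized linear CCA solved by whitening plus SVD, and substitutes a ridge least-squares scorer for the binary classifier. All function names (`fit_cca`, `fit_cd_cca`, `predict_zh`) and parameters (`reg`, `ridge`) are hypothetical.

```python
import numpy as np

def _inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def fit_cca(X, Y, k, reg=1e-3):
    """Regularized linear CCA on paired rows of X and Y.

    Returns (mean_x, mean_y, Wx, Wy, corrs), where Wx and Wy project each
    view onto k shared dimensions and corrs are the canonical correlations.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = _inv_sqrt(Cxx), _inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky, full_matrices=False)
    return mx, my, Kx @ U[:, :k], Ky @ Vt[:k].T, s[:k]

def fit_cd_cca(pairs_en, pairs_zh, pair_labels, train_en, train_labels, k, ridge=1e-2):
    """Fit one semantic space and one binary scorer per category.

    Assumption: the space for category c is inferred only from the paired
    documents labeled c. The scorer is a ridge least-squares stand-in for
    whatever binary classifier the paper actually uses.
    """
    models = {}
    for c in np.unique(train_labels):
        mask = pair_labels == c
        mx, my, Wx, Wy, _ = fit_cca(pairs_en[mask], pairs_zh[mask], k)
        # Project the labeled English documents into category c's space.
        Z = (train_en - mx) @ Wx
        Z1 = np.hstack([Z, np.ones((len(Z), 1))])  # append a bias column
        t = np.where(train_labels == c, 1.0, -1.0)  # one-vs-rest targets
        w = np.linalg.solve(Z1.T @ Z1 + ridge * np.eye(k + 1), Z1.T @ t)
        models[c] = (my, Wy, w)
    return models

def predict_zh(models, docs_zh):
    """Score Chinese documents in every category space and pick the best."""
    cats = sorted(models)
    scores = []
    for c in cats:
        my, Wy, w = models[c]
        Z = (docs_zh - my) @ Wy
        scores.append(np.hstack([Z, np.ones((len(Z), 1))]) @ w)
    return np.array(cats)[np.argmax(np.vstack(scores), axis=0)]
```

Because each iteration of the loop in `fit_cd_cca` touches only one category's documents, the per-category fits are independent of one another, which is the property the abstract points to when it mentions distributed training.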
Keywords
canonical correlation analysis,computational complexity,topic modeling,semantics,correlation,feature extraction,text analysis,web pages,natural language processing