Efficient And Distributed Generalized Canonical Correlation Analysis For Big Multiview Data

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING(2019)

Abstract
Generalized canonical correlation analysis (GCCA), an extension of classical two-view CCA, integrates information from data samples acquired in multiple feature spaces (or 'views') to produce low-dimensional representations. Since the 1960s, (G)CCA has attracted much attention in statistics, machine learning, and data mining because of its importance in data analytics. Despite these efforts, existing GCCA algorithms have serious complexity issues. Their memory and computational complexities usually grow as quadratic and cubic functions of the problem dimension (the number of samples or features), respectively. For example, handling views with ≈1,000 features using such algorithms already occupies ≈10^6 memory, and the per-iteration complexity is ≈10^9 flops, which makes it hard to push these methods much further. To circumvent such difficulties, we first propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed ≈100,000. Our second contribution is a pair of distributed algorithms for GCCA, which compute the canonical components of different views in parallel and can thus further reduce the runtime significantly when multiple computing agents are available. We provide detailed convergence analyses of the proposed algorithms and show that all the large-scale GCCA algorithms converge to a Karush-Kuhn-Tucker (KKT) point at least sublinearly. Judiciously designed synthetic and real-data experiments showcase the effectiveness of the proposed algorithms.
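The scaling figures quoted in the abstract can be checked with a little arithmetic. The sketch below is only an order-of-magnitude illustration, not the paper's algorithm: the function names and the `num_components` constant factor are hypothetical, and the "linear" model simply assumes memory proportional to the problem dimension and per-iteration work proportional to the number of nonzero entries, as the abstract states.

```python
def dense_costs(dim):
    """Order-of-magnitude costs when memory grows quadratically and
    per-iteration work grows cubically in the problem dimension."""
    memory_entries = dim ** 2        # e.g. storing a dim x dim dense matrix
    flops_per_iteration = dim ** 3   # e.g. a dense factorization or inversion
    return memory_entries, flops_per_iteration


def linear_costs(dim, nnz, num_components=1):
    """Order-of-magnitude costs when memory is linear in the dimension and
    per-iteration work is linear in the number of nonzeros.

    num_components is a hypothetical constant factor (how many canonical
    components are extracted); the abstract does not specify it.
    """
    memory_entries = dim * num_components
    flops_per_iteration = nnz * num_components
    return memory_entries, flops_per_iteration


# The abstract's example: views with about 1,000 features.
dense_memory, dense_flops = dense_costs(1_000)
# dense_memory == 10**6 and dense_flops == 10**9
```

Under the same assumptions, `linear_costs(100_000, 1_000_000, 10)` gives 10^6 stored entries and 10^7 flops per iteration, whereas `dense_costs(100_000)` would give 10^10 entries and 10^15 flops, which is why the quadratic and cubic growth becomes prohibitive at that scale.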
Keywords
Distributed algorithms, Sparse matrices, Correlation, Machine learning algorithms, Electronic mail, Data mining, Machine learning, Generalized canonical correlation analysis, multiview learning, multilingual word embedding, distributed GCCA