Optimizing Sparse Tensor Times Matrix On Multi-Core And Many-Core Architectures

SC 2016

Cited by 43 | Viewed 14
Abstract
This paper presents the optimized design and implementation of sparse tensor-times-dense matrix multiply (SpTTM) for CPU and GPU platforms. This primitive is a critical bottleneck in data analysis and mining applications based on tensor methods, such as the Tucker decomposition. We first design and implement a sequential SpTTM that avoids the explicit data transformations between a tensor and a matrix required by the conventional approach. We further optimize SpTTM on multicore CPU and GPU systems by parallelizing, avoiding locks, and exploiting data locality. Our sequential SpTTM is up to 3.5x faster than the SpTTM from Tensor Toolbox and up to 1.5x faster than that from the Cyclops Tensor Framework. Our parallel algorithms achieve 4.1x speedup on a multicore Intel Core i7 and 18.8x speedup on an NVIDIA K40c GPU, respectively, over our sequential SpTTM.
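The abstract's key idea, computing the mode-n tensor-times-matrix product directly on the sparse tensor's coordinates instead of first unfolding (matricizing) it, can be illustrated with a minimal sketch. This is not the paper's implementation; the function name `spttm_sketch`, the COO input layout, and the dense output are illustrative assumptions.

```python
import numpy as np

def spttm_sketch(indices, values, shape, U, mode):
    """Illustrative mode-n sparse tensor-times-dense-matrix product.

    Operates directly on COO coordinates, avoiding an explicit
    tensor-to-matrix unfolding (the conventional approach the paper
    improves on).

    indices: (nnz, N) integer array of nonzero coordinates
    values:  (nnz,)   nonzero values
    shape:   full tensor dimensions (I_1, ..., I_N)
    U:       dense (I_mode, R) matrix
    mode:    which mode to contract
    """
    out_shape = list(shape)
    out_shape[mode] = U.shape[1]  # the contracted mode's size becomes R
    Y = np.zeros(out_shape)
    for idx, v in zip(indices, values):
        pre, post = tuple(idx[:mode]), tuple(idx[mode + 1:])
        # scatter v * (row of U) into the mode-n fiber of the output
        Y[pre + (slice(None),) + post] += v * U[idx[mode]]
    return Y
```

Each nonzero contributes one scaled row of U to a fiber of the output, which is why the result is "semi-sparse" (dense only along the product mode) and why per-fiber accumulation is the contention point that the paper's lock-avoiding parallel variants address.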
Keywords
many-core architectures, multicore architectures, sparse tensor-times-dense matrix multiply, CPU platforms, data analysis, data mining applications, Tucker decomposition, sequential SpTTM, data locality, Tensor Toolbox, Cyclops Tensor Framework, multicore Intel Core i7, NVIDIA K40c GPU