Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Parallel Computing, no. 3 (2009): 1-12

Cited by: 862

Abstract

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for important scientific computations. […]

Introduction
  • Industry has moved to chip multiprocessor (CMP) system design in order to better manage trade-offs among performance, energy efficiency, and reliability [10,5].
  • SpMV is a frequent bottleneck in scientific computing applications, and is notorious for sustaining low fractions of peak processor performance (a reference CSR kernel is sketched after this list).
  • The authors implement SpMV for one of the most diverse sets of CMP platforms studied in the existing HPC literature, including the homogeneous multicore designs of the dual-socket quad-core AMD Opteron 2356 (Barcelona), the dual-socket dual-core AMD Opteron 2214 (Santa Rosa) and the dual-socket quad-core Intel Xeon E5345 (Clovertown), the heterogeneous local-store based architecture of the dual-socket eight-SPE IBM QS20 Cell Blade, as well as one of the first scientific studies of the hardware-multithreaded dual-socket eight-core Sun UltraSparc T2+ T5140 (Victoria Falls) – essentially a dual-socket Niagara2.
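For context, a minimal untuned CSR (compressed sparse row) SpMV kernel of the kind the study optimizes might look as follows; this is an illustrative sketch, not the authors' code. The indirect loads of x through the column-index array and the roughly two flops per 12 bytes of matrix data are what keep the kernel memory bound.

```c
/* Minimal CSR SpMV (y = A*x): illustrative sketch, not the paper's tuned code.
 * A has n rows; val[]/col[] hold the nonzeros, and rowptr[i]..rowptr[i+1]
 * delimits the nonzeros of row i. */
#include <stddef.h>

void spmv_csr(size_t n, const size_t *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];   /* indirect access to x: poor locality */
        y[i] = sum;
    }
}
```

The optimizations surveyed in the study (cache and TLB blocking, register blocking, and format/index-size selection) all restructure this loop nest and its data layout to reduce memory traffic.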
Highlights
  • Industry has moved to chip multiprocessor (CMP) system design in order to better manage trade-offs among performance, energy efficiency, and reliability [10,5]
  • sparse matrix–vector multiply (SpMV) is a frequent bottleneck in scientific computing applications, and is notorious for sustaining low fractions of peak processor performance
  • We implement SpMV for one of the most diverse sets of CMP platforms studied in the existing HPC literature, including the homogeneous multicore designs of the dual-socket quad-core AMD Opteron 2356 (Barcelona), the dual-socket dual-core AMD Opteron 2214 (Santa Rosa) and the dual-socket quad-core Intel Xeon E5345 (Clovertown), the heterogeneous local-store based architecture of the dual-socket eight-SPE IBM QS20 Cell Blade, as well as one of the first scientific studies of the hardware-multithreaded dual-socket eight-core Sun UltraSparc T2+ T5140 (Victoria Falls) – essentially a dual-socket Niagara2
  • Our study examines the Sun UltraSparc T2+ T5140 with two T2+ processors operating at 1.16 GHz, with a per-core and per-socket peak performance of 1.16 GFlop/s and 9.33 GFlop/s, respectively (no fused multiply-add (FMA) support); a quick check of these figures follows this list
  • Our findings illuminate the architectural differences among one of the most diverse sets of multicore configurations considered in the existing literature, and speak strongly to the necessity of multicore-specific optimization over the use of existing off-the-shelf approaches to parallelism for multicore machines
  • As tuning often increased performance by more than 20%, we feel that peak power efficiency is typically reached at peak performance
  • Significant additional performance was seen on the dual-socket configurations, where the aggregate bandwidth is doubled
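A quick check of the quoted Victoria Falls peaks (our arithmetic, assuming one double-precision floating-point operation per cycle per core, no FMA, and eight cores per socket):

$$ 1.16\ \text{GHz} \times 1\ \tfrac{\text{flop}}{\text{cycle}} = 1.16\ \text{GFlop/s per core}, \qquad 8 \times 1.16\ \text{GFlop/s} \approx 9.3\ \text{GFlop/s per socket}, $$

which agrees with the quoted 9.33 GFlop/s per socket up to rounding of the clock rate.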
Results
  • SpMV dominates the performance of diverse applications in scientific and engineering computing, economic modeling and information retrieval; yet, conventional implementations have historically been relatively poor, running at 10% or less of machine peak on single-core cache-based microprocessor systems [24].
  • Doing so provides a 20% reduction in memory traffic for some matrices, which could translate into as much as a 20% increase in performance (a rough bandwidth-bound estimate follows this list).
  • The data in Table 3 shows that the Victoria Falls system sustains only 1% of its memory bandwidth when using a single thread on a single core.
  • As tuning often increased performance by more than 20%, the authors feel that peak power efficiency is typically reached at peak performance
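To connect traffic to performance, here is a rough model (our sketch, assuming SpMV is purely memory-bandwidth bound and counting only matrix data: an 8-byte value plus a 4-byte column index, roughly 12 bytes and 2 flops per nonzero):

$$ \text{GFlop/s} \;\approx\; \text{bandwidth} \times \frac{2\ \text{flops}}{\text{bytes per nonzero}} $$

Under this model, cutting bytes per nonzero by 20% (e.g. via register blocking or 16-bit indices) would allow up to a 1/0.8 = 1.25x speedup; the observed gains of up to 20% are consistent with a kernel that is close to, but not perfectly, bandwidth bound.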
Conclusion
  • Summary and conclusions: The authors' findings illuminate the architectural differences among one of the most diverse sets of multicore configurations considered in the existing literature, and speak strongly to the necessity of multicore-specific optimization over the use of existing off-the-shelf approaches to parallelism for multicore machines.
  • The "heavy-weight" out-of-order cores of the Santa Rosa, Barcelona, and Clovertown systems showed sub-linear improvement from one to two cores
  • These powerful cores are severely bandwidth starved.
  • Significant additional performance was seen on the dual-socket configurations, where the aggregate bandwidth is doubled
  • This indicates that sustainable memory bandwidth may become a significant bottleneck as core count increases, and software designers should consider bandwidth reduction as a key algorithmic optimization
Tables
  • Table 1: Architectural summary of AMD Opteron (Santa Rosa), AMD Opteron (Barcelona), Intel Xeon (Clovertown), Sun Victoria Falls, and STI Cell multicore chips. Sustained power measured via digital power meter
  • Table 2: Overview of SpMV optimizations attempted in our study for the x86 (Santa Rosa, Barcelona and Clovertown), Victoria Falls, and Cell architectures
  • Table 3: Sustained bandwidth and computational rate for a dense matrix stored in sparse format, in GB/s (and percentage of configuration's peak bandwidth) and GFlop/s (and percentage of configuration's peak performance)
Funding
  • All authors from Lawrence Berkeley National Laboratory were supported by the Office of Advanced Scientific Computing Research in the Department of Energy Office of Science under contract number DE-AC02-05CH11231
Study subjects and analysis
cases: 3
For each threading model, we implement an auto-tuning framework to produce architecture-optimized kernels. We attempt three cases: no cache and no TLB blocking, cache blocking without TLB blocking, and both cache and TLB blocking. For each of these, a heuristic based on minimization of memory traffic selects the appropriate block size, register blocking, format, and index size, i.e., the compression strategy that results in the smallest matrix footprint is selected.
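As a rough illustration of such a traffic-minimizing heuristic, the sketch below (ours, not the authors' framework) picks the register-block shape and index width whose estimated SpMV memory traffic is smallest; the matrix dimensions, candidate block shapes, and fill ratios are hypothetical, and estimating fill ratios (the padding introduced by blocking) is assumed to happen elsewhere, e.g. by sampling rows.

```c
/* Sketch of a traffic-minimizing register-block/index-width selector.
 * Illustrative only: the fill ratio for each candidate block shape (stored
 * nonzeros, including explicit zero padding, divided by true nonzeros) is
 * assumed to have been estimated elsewhere. */
#include <stdio.h>

/* Estimated bytes streamed for one SpMV with r x c blocks and idx_bytes-wide
 * indices: 8-byte values per stored nonzero, one column index per block,
 * one row pointer per block row (counted at the same width for simplicity). */
static double bcsr_bytes(double nnz, double nrows, int r, int c,
                         double fill, int idx_bytes)
{
    double stored  = nnz * fill;            /* values incl. zero fill  */
    double nblocks = stored / (r * c);      /* register blocks         */
    return stored * 8.0 + nblocks * idx_bytes + (nrows / r) * idx_bytes;
}

int main(void)
{
    double nrows = 1e6, nnz = 5e7;          /* hypothetical matrix     */
    struct { int r, c; double fill; } cand[] = {
        {1, 1, 1.00}, {2, 2, 1.15}, {4, 2, 1.30}, {4, 4, 1.55},
    };
    int widths[] = {2, 4};                  /* 16- vs 32-bit indices   */

    double best = 1e300;
    int br = 1, bc = 1, bi = 4;
    for (size_t i = 0; i < sizeof cand / sizeof cand[0]; i++)
        for (size_t j = 0; j < sizeof widths / sizeof widths[0]; j++) {
            double bytes = bcsr_bytes(nnz, nrows, cand[i].r, cand[i].c,
                                      cand[i].fill, widths[j]);
            if (bytes < best) {
                best = bytes;
                br = cand[i].r; bc = cand[i].c; bi = widths[j];
            }
        }
    printf("pick %dx%d blocks, %d-byte indices (~%.0f MB of traffic)\n",
           br, bc, bi, best / 1e6);
    return 0;
}
```

The same footprint comparison extends naturally to choosing among storage formats and to the cache/TLB-blocking cases listed above.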

References
  • K. Asanovic, R. Bodik, B. Catanzaro, et al., The landscape of parallel computing research: a view from Berkeley, Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 2006.
  • D. Bailey, Little’s law and high performance computing, RNR Technical Report, 1997.
  • S. Balay, W.D. Gropp, L.C. McInnes, B.F. Smith, Efficient management of parallelism in object oriented numerical software libraries, in: E. Arge, A.M. Bruaset, H.P. Langtangen (Eds.), Modern Software Tools in Scientific Computing, 1997, pp. 163–202.
  • G.E. Blelloch, M.A. Heroux, M. Zagha, Segmented operations for sparse matrix computations on vector multiprocessors, Technical Report CMU-CS-93- 173, Department of Computer Science, CMU, 1993.
  • S. Borkar, Design challenges of technology scaling, IEEE Micro 19 (4) (1999) 23–29.
  • R. Geus, S. Röllin, Towards a fast parallel sparse matrix–vector multiplication, in: E.H. D’Hollander, J.R. Joubert, F.J. Peters, H. Sips (Eds.), Proceedings of the International Conference on Parallel Computing (ParCo), Imperial College Press, 1999, pp. 308–315.
  • Xiaogang Gou, Michael Liao, Paul Peng, Gansha Wu, Anwar Ghuloum, Doug Carmean, Report on sparse matrix performance analysis, Intel Report, Intel, United States, 2008.
  • M. Gschwind, Chip multiprocessing and the cell broadband engine, in: CF’06: Proceedings of the third Conference on Computing Frontiers, New York, NY, USA, 2006, pp. 1–8.
  • M. Gschwind, H.P. Hofstee, B.K. Flachs, M. Hopkins, Y. Watanabe, T. Yamazaki, Synergistic processing in Cell's multicore architecture, IEEE Micro 26 (2) (2006).
  • J.L. Hennessy, D.A. Patterson, Computer Architecture: a Quantitative Approach, fourth ed., Morgan Kaufmann, San Francisco, 2006.
  • E.J. Im, K. Yelick, R. Vuduc, Sparsity: optimization framework for sparse matrix kernels, International Journal of High Performance Computing Applications 18 (1) (2004) 135–158.
  • Ankit Jain, pOSKI: an extensible autotuning framework to perform optimized SpMVs on multicore architectures, Technical Report (pending), MS Report, EECS Department, University of California, Berkeley, 2008.
  • Kornilios Kourtis, Georgios I. Goumas, Nectarios Koziris, Optimizing sparse matrix–vector multiplication using index and value compression, in: Proceedings of the Conference on Computing Frontiers, 2008, pp. 87–96.
  • B.C. Lee, R. Vuduc, J. Demmel, K. Yelick, Performance models for evaluation and automatic tuning of symmetric sparse matrix–vector multiply, in: Proceedings of the International Conference on Parallel Processing, Montreal, Canada, August 2004.
  • J. Mellor-Crummey, J. Garvin, Optimizing sparse matrix vector multiply using unroll-and-jam, in: Proceedings of the LACSI Symposium, Santa Fe, NM, USA, October 2002.
  • R. Nishtala, R. Vuduc, J.W. Demmel, K.A. Yelick, When cache blocking sparse matrix vector multiply works and why, Applicable Algebra in Engineering, Communication, and Computing 1 (2007).
  • A. Pinar, M. Heath, Improving performance of sparse matrix–vector multiplication, in: Proceedings of the Supercomputing, 1999.
  • D.J. Rose, A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations, Graph Theory and Computing
  • Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Barbara Kreaseck, Sparse tiling for stationary iterative methods, International Journal of High Performance Computing Applications 18 (1) (2004) 95–114.
  • O. Temam, W. Jalby, Characterizing the behavior of sparse algorithms on caches, in: Proceedings of the Supercomputing, 1992.
  • S. Toledo, Improving memory-system performance of sparse matrix–vector multiplication, in: Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
  • B. Vastenhouw, R.H. Bisseling, A two-dimensional data distribution method for parallel sparse matrix–vector multiplication, SIAM Review 47 (1) (2005).
  • R. Vuduc, Automatic performance tuning of sparse matrix kernels, PhD Thesis, University of California, Berkeley, Berkeley, CA, USA, December 2003.
  • R. Vuduc, J.W. Demmel, K.A. Yelick. OSKI: a library of automatically tuned sparse matrix kernels, in: Proceedings of the SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, June 2005.
  • R. Vuduc, A. Gyulassy, J.W. Demmel, K.A. Yelick, Memory hierarchy optimizations and bounds for sparse AᵀAx, in: Proceedings of the ICCS Workshop on Parallel Linear Algebra, LNCS, Springer, Melbourne, Australia, June 2003.
  • R. Vuduc, S. Kamil, J. Hsu, R. Nishtala, J.W. Demmel, K.A. Yelick, Automatic performance tuning and analysis of sparse triangular solve, in: ICS 2002: Workshop on Performance Optimization via High-Level Languages and Libraries, New York, USA, June 2002.
  • J.B. White, P. Sadayappan, On improving the performance of sparse matrix–vector multiplication, in: Proceedings of the International Conference on High-Performance Computing, 1997.
  • J. Willcock, A. Lumsdaine, Accelerating sparse matrix computations via data compression, in: Proceedings International Conference on Supercomputing (ICS), Cairns, Australia, June 2006.
  • J.W. Willenbring, A.A. Anda, M. Heroux, Improving sparse matrix–vector product kernel performance and availability, in: Proceedings of the Midwest Instruction and Computing Symposium, Mt. Pleasant, IA, 2006.
  • S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, Optimization of sparse matrix–vector multiplication on emerging multicore platforms, in: Proceedings of the Supercomputing, 2007.
  • S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, K. Yelick, Scientific computing kernels on the cell processor, International Journal of Parallel Programming 35 (3) (2007) 263–298.