# Optimization of sparse matrix-vector multiplication on emerging multicore platforms

Parallel Computing 35, no. 3 (2009): 178–194

Abstract

We are witnessing a dramatic change in computer architecture due to the multicore paradigm shift, as every electronic device from cell phones to supercomputers confronts parallelism of unprecedented scale. To fully unleash the potential of these systems, the HPC community must develop multicore-specific optimization methodologies for im…

Introduction

- Industry has moved to chip multiprocessor (CMP) system design in order to better manage trade-offs among performance, energy efficiency, and reliability [10,5].
- Sparse matrix–vector multiplication (SpMV) is a frequent bottleneck in scientific computing applications, and is notorious for sustaining low fractions of peak processor performance.
- The authors implement SpMV for one of the most diverse sets of CMP platforms studied in the existing HPC literature, including the homogeneous multicore designs of the dual-socket × quad-core AMD Opteron 2356 (Barcelona), the dual-socket × dual-core AMD Opteron 2214 (Santa Rosa) and the dual-socket × quad-core Intel Xeon E5345 (Clovertown), the heterogeneous local-store based architecture of the dual-socket × eight-SPE IBM QS20 Cell Blade, as well as one of the first scientific studies of the hardware-multithreaded dual-socket × eight-core Sun UltraSparc T2+ T5140 (Victoria Falls) – essentially a dual-socket Niagara2.
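The kernel under study, sparse matrix–vector multiplication y = Ax with A held in compressed sparse row (CSR) format, can be sketched in C as follows. This is a minimal reference loop, not the paper's tuned code; the optimized variants layer register blocking, cache/TLB blocking, SIMDization, and software prefetching on top of this structure.

```c
#include <stddef.h>

/* y = A*x for an m-row sparse matrix in CSR format.
 * rowptr[i]..rowptr[i+1] delimit the nonzeros of row i;
 * col[] and val[] hold their column indices and values. */
void spmv_csr(size_t m, const size_t *rowptr, const size_t *col,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];  /* irregular gather from x */
        y[i] = sum;
    }
}
```

The indirect, `x[col[k]]`-indexed loads in the inner loop are what make SpMV memory-bound and difficult to optimize with off-the-shelf compiler techniques.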

Highlights

- Our study examines the Sun UltraSparc T2+ T5140, with two T2+ processors operating at 1.16 GHz and, lacking fused multiply-add (FMA) support, a per-core and per-socket peak performance of 1.16 GFlop/s and 9.33 GFlop/s, respectively

Results

- SpMV dominates the performance of diverse applications in scientific and engineering computing, economic modeling and information retrieval; yet, conventional implementations have historically been relatively poor, running at 10% or less of machine peak on single-core cache-based microprocessor systems [24].
- Doing so provides a 20% reduction in memory traffic for some matrices, which could translate into up to a 20% increase in performance.
- The data in Table 3 shows that the Victoria Falls system sustains only 1% of its memory bandwidth when using a single thread on a single core.
- As tuning often increased performance by more than 20% with little change in power draw, the authors conclude that peak power efficiency is typically reached at peak performance.
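The bandwidth-bound behavior these results describe can be made concrete with a simple stream model (a sketch, not the paper's performance model): in CSR, each nonzero contributes two flops and moves roughly one value plus one column index of compulsory matrix traffic, so sustained bandwidth directly bounds attainable performance. The 12 GB/s figure below is a hypothetical sustained bandwidth chosen for round numbers.

```c
/* Stream-model upper bound on SpMV performance: each nonzero costs
 * 2 flops and moves (val_bytes + idx_bytes) of compulsory matrix
 * traffic; vector and row-pointer traffic are ignored for clarity. */
double spmv_flops_bound(double bw_gbytes_s, int val_bytes, int idx_bytes)
{
    return bw_gbytes_s * 2.0 / (val_bytes + idx_bytes);
}
```

At a hypothetical 12 GB/s, 8-byte values with 4-byte indices bound SpMV at 2.0 GFlop/s; compressing indices to 2 bytes cuts per-nonzero matrix traffic from 12 to 10 bytes and lifts the bound to 2.4 GFlop/s, illustrating why the traffic reductions reported above translate into comparable performance gains.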

Conclusion

**Summary and conclusions**

The authors' findings illuminate the architectural differences among one of the most diverse sets of multicore configurations considered in the existing literature, and speak strongly to the necessity of multicore-specific optimization over the use of existing off-the-shelf approaches to parallelism for multicore machines.

- The "heavy-weight" out-of-order cores of the Santa Rosa, Barcelona, and Clovertown systems showed sub-linear improvement from one to two cores
- These powerful cores are severely bandwidth starved.
- Significant additional performance was seen on the dual-socket configurations, where aggregate bandwidth is doubled
- This indicates that sustainable memory bandwidth may become a significant bottleneck as core count increases, and software designers should consider bandwidth reduction as a key algorithmic optimization

- Table1: Architectural summary of AMD Opteron (Santa Rosa), AMD Opteron (Barcelona), Intel Xeon (Clovertown), Sun Victoria Falls, and STI Cell multicore chips. Sustained power measured via digital power meter
- Table2: Overview of SpMV optimizations attempted in our study for the x86 (Santa Rosa, Barcelona and Clovertown), Victoria Falls, and Cell architectures
- Table3: Sustained bandwidth and computational rate for a dense matrix stored in sparse format, in GB/s (and percentage of configuration’s peak bandwidth) and GFlop/s (and percentage of configuration’s peak performance)

Funding

- All authors from Lawrence Berkeley National Laboratory were supported by the Office of Advanced Scientific Computing Research in the Department of Energy Office of Science under contract number DE-AC02-05CH11231

Study subjects and analysis

cases: 3

For each threading model, we implement an auto-tuning framework to produce architecture-optimized kernels. We attempt three cases: no cache and no TLB blocking, cache blocking without TLB blocking, and both cache and TLB blocking. For each of these, a heuristic based on minimizing memory traffic selects the appropriate block size, register blocking, format, and index size, i.e., the compression strategy that yields the smallest matrix representation is selected.
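The traffic-minimization heuristic can be sketched as a search over a small candidate space (the parameter ranges, the BCSR-like size model, and the names `spmv_pick_format` and `fill` are illustrative assumptions, not the paper's exact search): for each candidate register block and index width, estimate the stored bytes and keep the smallest.

```c
#include <math.h>

/* Candidate chosen by the footprint-minimization heuristic (sketch). */
typedef struct { int r, c, idx_bytes; double bytes; } spmv_choice;

/* fill[r-1][c-1] is the ratio of stored entries (including explicit
 * zero padding) to true nonzeros when the matrix is tiled into r x c
 * register blocks; larger blocks amortize index storage but add fill. */
spmv_choice spmv_pick_format(double nnz, const double fill[3][3])
{
    spmv_choice best = {1, 1, 4, HUGE_VAL};
    for (int r = 1; r <= 3; r++)
        for (int c = 1; c <= 3; c++)
            for (int idx = 2; idx <= 4; idx += 2) {
                double blocks = nnz * fill[r - 1][c - 1] / (r * c);
                /* 8-byte values per block entry, one index per block */
                double bytes = blocks * (r * c * 8.0 + idx);
                if (bytes < best.bytes)
                    best = (spmv_choice){r, c, idx, bytes};
            }
    return best;
}
```

With low fill everywhere the search favors the largest block and narrowest index (index storage is amortized over the block); once fill grows, the padding cost pushes the choice back toward unblocked 1×1 storage.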

Reference

- K. Asanovic, R. Bodik, B. Catanzaro, et al., The landscape of parallel computing research: a view from Berkeley, Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 2006.
- D. Bailey, Little’s law and high performance computing, RNR Technical Report, 1997.
- S. Balay, W.D. Gropp, L.C. McInnes, B.F. Smith, Efficient management of parallelism in object oriented numerical software libraries, in: E. Arge, A.M. Bruaset, H.P. Langtangen (Eds.), Modern Software Tools in Scientific Computing, 1997, pp. 163–202.
- G.E. Blelloch, M.A. Heroux, M. Zagha, Segmented operations for sparse matrix computations on vector multiprocessors, Technical Report CMU-CS-93- 173, Department of Computer Science, CMU, 1993.
- S. Borkar, Design challenges of technology scaling, IEEE Micro 19 (4) (1999) 23–29.
- R. Geus, S. Röllin, Towards a fast parallel sparse matrix–vector multiplication, in: E.H. D’Hollander, J.R. Joubert, F.J. Peters, H. Sips (Eds.), Proceedings of the International Conference on Parallel Computing (ParCo), Imperial College Press, 1999, pp. 308–315.
- Xiaogang Gou, Michael Liao, Paul Peng, Gansha Wu, Anwar Ghuloum, Doug Carmean, Report on sparse matrix performance analysis, Intel Report, Intel, United States, 2008.
- M. Gschwind, Chip multiprocessing and the cell broadband engine, in: CF’06: Proceedings of the third Conference on Computing Frontiers, New York, NY, USA, 2006, pp. 1–8.
- M. Gschwind, H.P. Hofstee, B.K. Flachs, M. Hopkins, Y. Watanabe, T. Yamazaki, Synergistic processing in Cell's multicore architecture, IEEE Micro 26 (2) (2006).
- J.L. Hennessy, D.A. Patterson, Computer Architecture: a Quantitative Approach, fourth ed., Morgan Kaufmann, San Francisco, 2006.
- E.J. Im, K. Yelick, R. Vuduc, Sparsity: optimization framework for sparse matrix kernels, International Journal of High Performance Computing Applications 18 (1) (2004) 135–158.
- Ankit Jain. pOSKI: an extensible autotuning framework to perform optimized SpMVs on multicore architectures, Technical Report (pending), MS Report, EECS Department, University of California, Berkeley, 2008.
- Kornilios Kourtis, Georgios I. Goumas, Nectarios Koziris, Optimizing sparse matrix–vector multiplication using index and value compression, in: Proceedings of the Conference on Computing Frontiers, 2008, pp. 87–96.
- B.C. Lee, R. Vuduc, J. Demmel, K. Yelick, Performance models for evaluation and automatic tuning of symmetric sparse matrix–vector multiply, in: Proceedings of the International Conference on Parallel Processing, Montreal, Canada, August 2004.
- J. Mellor-Crummey, J. Garvin, Optimizing sparse matrix vector multiply using unroll-and-jam, in: Proceedings of the LACSI Symposium, Santa Fe, NM, USA, October 2002.
- R. Nishtala, R. Vuduc, J.W. Demmel, K.A. Yelick, When cache blocking sparse matrix vector multiply works and why, Applicable Algebra in Engineering, Communication, and Computing 1 (2007).
- A. Pinar, M. Heath, Improving performance of sparse matrix–vector multiplication, in: Proceedings of the Supercomputing, 1999.
- D.J. Rose, A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations, in: Graph Theory and Computing, Academic Press, 1972.
- Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Barbara Kreaseck, Sparse tiling for stationary iterative methods, International Journal of High Performance Computing Applications 18 (1) (2004) 95–114.
- O. Temam, W. Jalby, Characterizing the behavior of sparse algorithms on caches, in: Proceedings of the Supercomputing, 1992.
- S. Toledo, Improving memory-system performance of sparse matrix–vector multiplication, in: Eighth SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
- B. Vastenhouw, R.H. Bisseling, A two-dimensional data distribution method for parallel sparse matrix–vector multiplication, SIAM Review 47 (1) (2005) 67–95.
- R. Vuduc, Automatic performance tuning of sparse matrix kernels, PhD Thesis, University of California, Berkeley, Berkeley, CA, USA, December 2003.
- R. Vuduc, J.W. Demmel, K.A. Yelick. OSKI: a library of automatically tuned sparse matrix kernels, in: Proceedings of the SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, June 2005.
- R. Vuduc, A. Gyulassy, J.W. Demmel, K.A. Yelick, Memory hierarchy optimizations and bounds for sparse A^T Ax, in: Proceedings of the ICCS Workshop on Parallel Linear Algebra, volume LNCS, Melbourne, Australia, June 2003, Springer.
- R. Vuduc, S. Kamil, J. Hsu, R. Nishtala, J.W. Demmel, K.A. Yelick, Automatic performance tuning and analysis of sparse triangular solve, in: ICS 2002: Workshop on Performance Optimization via High-Level Languages and Libraries, New York, USA, June 2002.
- J.B. White, P. Sadayappan, On improving the performance of sparse matrix–vector multiplication, in: Proceedings of the International Conference on High-Performance Computing, 1997.
- J. Willcock, A. Lumsdaine, Accelerating sparse matrix computations via data compression, in: Proceedings International Conference on Supercomputing (ICS), Cairns, Australia, June 2006.
- J.W. Willenbring, A.A. Anda, M. Heroux, Improving sparse matrix–vector product kernel performance and availability, in: Proceedings of the Midwest Instruction and Computing Symposium, Mt. Pleasant, IA, 2006.
- S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, J. Demmel, Optimization of sparse matrix–vector multiplication on emerging multicore platforms, in: Proceedings of the Supercomputing, 2007.
- S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, K. Yelick, Scientific computing kernels on the cell processor, International Journal of Parallel Programming 35 (3) (2007) 263–298.
