# Cache-oblivious algorithms

CIAC 2003, p. 5

Abstract

Computers with multiple levels of caching have traditionally required techniques such as data blocking in order for algorithms to exploit the cache hierarchy effectively. These "cache-aware" algorithms must be properly tuned to achieve good performance, using so-called "voodoo" parameters that depend on hardware properties, such as cache ...

Introduction

- The cache-oblivious model is a simple and elegant model for designing algorithms that perform well on the hierarchical memories ubiquitous in current systems.
- Section 7 presents a theoretically optimal, randomized cache-oblivious sorting algorithm, along with the running times of an implementation.
- Strassen's matrix multiplication, quicksort, mergesort, closest pair [16], convex hulls [7], and median selection [16] are all cache-oblivious algorithms, though not all of them are optimal in this model.
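As a concrete illustration of the divide-and-conquer style that makes these algorithms cache-oblivious, here is a minimal sketch of recursive classical matrix multiplication (not Strassen's algorithm) in Python. The function name, argument layout, and the `cutoff` constant are illustrative choices, not taken from the paper; `cutoff` only bounds recursion overhead and is not a cache-tuned block size. Because the recursion keeps halving the largest dimension, every subproblem eventually fits in each cache level without the algorithm knowing any cache parameters.

```python
def co_matmul(A, B, C, i0, j0, k0, ni, nj, nk, cutoff=16):
    """Add A[i0:i0+ni, k0:k0+nk] * B[k0:k0+nk, j0:j0+nj] into C[i0:i0+ni, j0:j0+nj].

    Cache-oblivious: recursively split the largest of the three dimensions;
    no block-size parameter depends on the hardware.
    """
    if max(ni, nj, nk) <= cutoff:
        # Base case: plain triple loop on a small subproblem.
        for i in range(i0, i0 + ni):
            for k in range(k0, k0 + nk):
                a = A[i][k]
                for j in range(j0, j0 + nj):
                    C[i][j] += a * B[k][j]
    elif ni >= nj and ni >= nk:
        h = ni // 2  # split the row range of A and C
        co_matmul(A, B, C, i0, j0, k0, h, nj, nk, cutoff)
        co_matmul(A, B, C, i0 + h, j0, k0, ni - h, nj, nk, cutoff)
    elif nj >= nk:
        h = nj // 2  # split the column range of B and C
        co_matmul(A, B, C, i0, j0, k0, ni, h, nk, cutoff)
        co_matmul(A, B, C, i0, j0 + h, k0, ni, nj - h, cutoff=cutoff, nk=nk)
    else:
        h = nk // 2  # split the shared inner dimension; results accumulate in C
        co_matmul(A, B, C, i0, j0, k0, ni, nj, h, cutoff)
        co_matmul(A, B, C, i0, j0, k0 + h, ni, nj, nk - h, cutoff)
```

Strassen's algorithm has the same recursive structure; only the combine step at each level of the recursion changes.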

Highlights

- The cache-oblivious model is a simple and elegant model for designing algorithms that perform well on the hierarchical memories ubiquitous in current systems.
- Chapter outline: We introduce the cache-oblivious model in Section 2.
- We study the cache-oblivious analysis of Strassen's algorithm in Section 5.
- The code written for this experimentation is under 300 lines. The experiments reported in Figure 5 were done on an Itanium dual-processor system with 2 GB of RAM (only one processor was used).
- We present here problems, related bounds, and references for the interested reader.
- Note that in the table, sort() and scan() denote the number of cache misses incurred by optimal cache-oblivious sorting and scanning, respectively.
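For reference, the scan() and sort() shorthands above are the standard cache-complexity bounds in the ideal-cache model, where M denotes the cache size and B the cache-line length (these are the bounds established by Frigo et al.):

```latex
\mathrm{scan}(N) = \Theta\left(1 + \frac{N}{B}\right), \qquad
\mathrm{sort}(N) = \Theta\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right)
```

The sort() bound matches the external-memory sorting bound of Aggarwal and Vitter, which is why table entries can be stated uniformly in terms of these two primitives.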

Results

- The sorting lower bound in the cache-oblivious model is the same as in the external-memory model (see Chapter ??).
- Before going into the divide-and-conquer based, cache-oblivious algorithm for matrix transposition, let us look at some experimental results.
- Remark: Figure 2 shows the effect of using a blocked cache-oblivious algorithm for matrix transposition.
- It is easy to code, exploits the fact that the memory consists of a cache hierarchy, and can be used to speed up tree-based search structures on most current machines.
- Before getting one's hands dirty implementing an algorithm in the cache-oblivious or the external-memory model, one should be aware of practical issues that might arise.
- The authors list a few practical glitches that are shared by both the cache-oblivious and the external-memory model.
- Code written and algorithms designed with the following points in mind can be a lot faster than a direct implementation of an algorithm that is optimal in either the cache-oblivious or the external-memory model.
- One can overcome this problem by writing one's own paging system on top of the OS in order to experiment with cache-oblivious algorithms on huge data sizes.
- The authors' major conclusions are as follows: limited associativity in the mapping from main-memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache-miss penalties are quite high; and low-level performance-tuning "hacks", such as register tiling and array alignment, can significantly distort the effect of improved algorithms ...
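The divide-and-conquer matrix transposition discussed above can be sketched as follows. This is an illustrative reconstruction, not the authors' code from [13]; the `cutoff` constant merely amortizes recursion overhead and is not a cache-dependent tuning parameter.

```python
def co_transpose(A, T, r0, c0, nr, nc, cutoff=16):
    """Write the transpose of A[r0:r0+nr, c0:c0+nc] into T[c0:c0+nc, r0:r0+nr].

    Cache-oblivious: recursively split the larger dimension until the
    subproblem is small, at which point both the source and destination
    tiles fit in cache regardless of the actual cache parameters.
    """
    if nr <= cutoff and nc <= cutoff:
        # Base case: copy a small tile element by element.
        for i in range(r0, r0 + nr):
            for j in range(c0, c0 + nc):
                T[j][i] = A[i][j]
    elif nr >= nc:
        h = nr // 2  # split the rows of A
        co_transpose(A, T, r0, c0, h, nc, cutoff)
        co_transpose(A, T, r0 + h, c0, nr - h, nc, cutoff)
    else:
        h = nc // 2  # split the columns of A
        co_transpose(A, T, r0, c0, nr, h, cutoff)
        co_transpose(A, T, r0, c0 + h, nr, nc - h, cutoff)
```

Note how the same recursion handles non-square matrices: whichever dimension is larger gets halved, so the subproblems stay roughly square.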

Conclusion

- (Toy experiments comparing quicksort with a modified funnelsort or distribution sort don't count!) Currently the only impressive code that might back up "practicality" claims of cache-oblivious algorithms is FFTW [18].
- Matrix multiplication and transposition using blocked cache-oblivious algorithms do fairly well in comparison with cache-aware/external-memory algorithms.
- For matrix transposition, there are at least two cache-oblivious algorithms coded in [13].
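For contrast with the cache-oblivious approach, the cache-aware (blocked) transposition that such comparisons run against might look like the following sketch. Here `block` is exactly the kind of hardware-dependent "voodoo" parameter, in the abstract's terminology, that must be tuned to the cache line and cache size; the function name and default value are illustrative.

```python
def blocked_transpose(A, T, n, m, block=64):
    """Cache-aware transpose of an n-by-m matrix A into the m-by-n matrix T.

    Iterates over block-by-block tiles so that each tile of the source and
    destination fits in cache. Unlike the cache-oblivious version, `block`
    must be chosen to match the target machine's cache parameters.
    """
    for bi in range(0, n, block):
        for bj in range(0, m, block):
            # Transpose one tile; min() handles ragged edge tiles.
            for i in range(bi, min(bi + block, n)):
                for j in range(bj, min(bj + block, m)):
                    T[j][i] = A[i][j]
```

On a machine with several cache levels, a single `block` value can only be tuned to one level at a time, which is precisely the weakness the recursive cache-oblivious formulation sidesteps.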

Funding

- The author is partially supported by NSF (CCR-9732220, CCR-0098172) and by a grant from Sandia National Labs.

References

- A. Aggarwal, B. Alpern, A. K. Chandra, and M. Snir. A model for hierarchical memory. In Proc. 19th Annu. ACM Sympos. Theory Comput., pages 305–313, 1987.
- A. Aggarwal and A. K. Chandra. Virtual memory algorithms. In Proc. 20th Annu. ACM Sympos. Theory Comput., pages 173–185, 1988.
- A. Aggarwal, A. K. Chandra, and M. Snir. Hierarchical memory with block transfer. In Proc. 28th Annu. IEEE Sympos. Found. Comput. Sci., pages 204–216, 1987.
- A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Commun. ACM, 31:1116–1127, 1988.
- B. Alpern, L. Carter, and E. Feig. Uniform memory hierarchies. In Proc. IEEE Sympos. Found. Comput. Sci. (FOCS), pages 600–608, 1990.
- B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierarchy model of computation. Algorithmica, 12(2-3), 1994.
- N. M. Amato and E. A. Ramos. On computing Voronoi diagrams by divide-prune-and-conquer. In Proc. 12th Annu. ACM Sympos. Comput. Geom., pages 166–175, 1996.
- L. Arge, M. A. Bender, E. D. Demaine, B. Holland-Minkley, and J. I. Munro. Cache-oblivious priority queue and graph algorithm applications. In Proceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC), pages 268–276, 2002.
- M. A. Bender, Z. Duan, J. Iacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 29–38, 2002.
- G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. Greg Plaxton, S. J. Smith, and M. Zagha. A comparison of sorting algorithms for the connection machine CM-2. In ACM Symposium on Parallel Algorithms and Architectures, pages 3–16, 1991.
- G. S. Brodal and R. Fagerberg. Funnel heap - a cache oblivious priority queue. In Proc. 13th Annual International Symposium on Algorithms and Computation, Lecture Notes in Computer Science. 2002.
- G. S. Brodal, R. Fagerberg, and R. Jacob. Cache oblivious search trees via binary trees of small height. Technical Report BRICS-RS-01-36, BRICS, Department of Computer Science, University of Aarhus, October 2001.
- S. Chatterjee and S. Sen. Cache-efficient matrix transposition. In HPCA, pages 195–205, 2000.
- Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. External-memory graph algorithms. In Proc. 6th ACM-SIAM Sympos. Discrete Algorithms, pages 139–149, 1995.
- D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progression. Journal of Symbolic Computation, 9:251–280, 1990.
- T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
- N. Eiron, M. Rodeh, and I. Steinwarts. Matrix multiplication: A case study of algorithm engineering. In 2nd Workshop on Algorithm Engineering, volume 16, pages 98–109, 1998.
- M. Frigo. A fast Fourier transform compiler. In PLDI '99 — Conference on Programming Language Design and Implementation, Atlanta, GA, 1999.
- M. Frigo. Portable high-performance programs. Technical Report MIT/LCS/TR785, 1999.
- M. Frigo, Charles E. Leiserson, H. Prokop, and S. Ramachandran. Cache oblivious algorithms. In Proc. 40th Annual Symposium on Foundations of Computer Science, October 1999.
- R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, Reading, MA, 1989.
- J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 1990.
- J. W. Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proc. ACM Sympos. Theory Comput. (STOC), pages 326–333, 1981.
- R. E. Ladner, R. Fortna, and B. H. Nguyen. A comparison of cache aware and cache oblivious static search trees using program instrumentation. In To appear in LNCS volume devoted to Experimental Algorithmics, April 2002.
- A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. In Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 370–379, January 1997.
- C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D. B. Lomet. AlphaSort: A RISC machine sort. In R. T. Snodgrass and M. Winslett, editors, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 24–27, 1994, pages 233–242. ACM Press, 1994.
- N. Rahman, R. Cole, and R. Raman. Optimized predecessor data structures for internal memory. In 5th Workshop on Algorithms Engineering (WAE), 2001.
- J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In Proceedings of the 1st Annual International Conference on Computing and Combinatorics, volume 959 of LNCS, pages 270–281, August 1995.
- S. Sen and S. Chatterjee. Towards a theory of cache-efficient algorithms. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 829–838, January 2000.
- D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Commun. ACM, 28:202–208, 1985.
- V. Strassen. Gaussian elimination is not optimal. Numer Math, 13:354–356, 1969.
- O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Measurement and Modeling of Computer Systems, pages 261–271, 1994.
- S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065–1081, October 1997.
- D. S. Wise. Ahnentafel indexing into Morton-ordered arrays, or matrix locality for free. In Euro-Par 2000 – Parallel Processing, volume 1900 of LNCS, pages 774–784, August 2000.
- Q. Yi, V. Adve, and K. Kennedy. Transforming loops to recursion for multilevel memory hierarchies. In Proceedings of the 2000 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 169–181, Vancouver, Canada, June 2000. ACM.
