[Node diagram labels: Alewife node, Sparcle, CMMU, FPU, cache, private memory, directory, distributed shared memory, network router, VME host interface, host]

semanticscholar (2007)

Abstract
A variety of models for parallel architectures, such as shared memory, message passing, and dataflow, have converged in the recent past to a hybrid architecture form called distributed shared memory (DSM). By using a combination of hardware and software mechanisms, DSM combines the nice features of all the above models and is able to achieve both the scalability of message-passing machines and the programmability of shared-memory systems. Alewife, an early prototype of such DSM architectures, uses a hybrid of software and hardware mechanisms to support coherent shared memory, efficient user-level messaging, fine-grain synchronization, and latency tolerance. Alewife supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. Four mechanisms combine to achieve Alewife's goals of scalability and programmability: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Extensive results from microbenchmarks, together with over a dozen complete applications running on a 32-node prototype, demonstrate that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Our results further show that messaging and shared memory operations are both important because each helps the programmer to achieve the best performance for various machine configurations. Block multithreading and prefetching improve performance significantly, and language constructs that allow programmers to express fine-grain synchronization can improve performance by over a factor of two.

An earlier version of this paper appeared in ISCA '95.

Affiliations: Federal University of Rio de Janeiro, Brazil; Digital Equipment Corporation Systems Research Center, Palo Alto, CA 94301; University of California at Davis, Davis, CA 95616; Xilinx Inc., Boulder, CO; University of California at Berkeley, Berkeley, CA 94720; IBM T.J. Watson Research Center, Yorktown Heights, NY 10598; Georgia Institute of Technology, Atlanta, GA 30332; University of Maryland at College Park, College Park, MD 20742.
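As a rough illustration of the fine-grain synchronization style the abstract describes, the sketch below emulates per-element producer/consumer synchronization in portable C11 with pthreads: a consumer waits only on the single element it needs rather than on a global barrier. The atomic "full" flag and the producer/consumer routines are illustrative stand-ins of my own, not Alewife's hardware full/empty bits or its actual programming primitives.

/* Minimal sketch (not Alewife code): per-element synchronization emulated
 * with a C11 atomic flag per word standing in for a full/empty bit.        */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

#define N 8

typedef struct {
    double value;
    atomic_int full;   /* 0 = empty, 1 = full */
} elem_t;

static elem_t buf[N];  /* static storage: flags start at 0 (empty) */

/* Producer fills each element, then marks that one element "full". */
static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++) {
        buf[i].value = i * 2.0;                        /* compute the element */
        atomic_store_explicit(&buf[i].full, 1,
                              memory_order_release);   /* publish it          */
    }
    return NULL;
}

/* Consumer uses element i as soon as that single element is ready. */
static void *consumer(void *arg) {
    (void)arg;
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        while (!atomic_load_explicit(&buf[i].full, memory_order_acquire))
            ;                                          /* wait on one word only */
        sum += buf[i].value;
    }
    printf("sum = %g\n", sum);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The point of the sketch is the granularity: synchronization state travels with each data element, so consumers never block on work they do not depend on, which is the effect the abstract attributes to Alewife's fine-grain synchronization constructs.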