On Network Locality in MPI-Based HPC Applications.

ICPP '20: Proceedings of the 49th International Conference on Parallel Processing (2020)

Abstract
Data movements through interconnection networks exceed local memory accesses in terms of latency as well as energy by multiple orders of magnitude. While many optimizations invest great effort in improving memory accesses, large distances in the network can easily negate these improvements and increase overall costs. A deep understanding of network locality is therefore key for further optimizations, such as an improved mapping of ranks to physical entities. In this work, we look at locality on the hardware-independent application level and at locality aspects of common network structures. To quantify the former, two new metrics are introduced, namely rank locality and selectivity. Our studies are performed on a selection of 16 exascale proxy mini-apps, with scales ranging from eight to 1152 ranks. Traces of these applications are statically analyzed with respect to their spatial communication pattern at the MPI level. The resulting behavior on actual hardware is evaluated with a network model that implements topologies such as torus, fat tree, and dragonfly, together with corresponding minimal routing. As a result, this work is founded on a large set of experimental configurations based on different applications, scales, and topologies. While in most traces individual ranks have a wide range of communication partners, 90% of the communication is exchanged with a small set of ten or fewer other ranks. Results suggest the 3D torus as the most favorable topology for small numbers of ranks, while for larger configurations the fat tree is preferable. Furthermore, we show that the network is in general highly underutilized: in 93% of all configurations, less than 1% of the network resources are actually used. Overall, this indicates that static analyses could assist in selecting an advanced mapping that assigns groups of heavily communicating ranks to nearby physical entities. Such a mapping could minimize the total number of packet hops and thereby improve latency and reduce the probability of congestion.
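To illustrate the kind of static analysis the abstract describes, the following is a minimal sketch, not the authors' actual tooling: given a per-rank communication matrix extracted from an MPI trace, it computes how many partners cover 90% of each rank's traffic (a selectivity-like statistic) and the minimal hop count between two ranks on a 3D torus under dimension-ordered, wrap-around routing. The function and variable names (`partners_for_fraction`, `torus_hops`, `comm_bytes`) are hypothetical and the metrics shown are simplified stand-ins for the paper's rank locality and selectivity definitions.

```python
# Hypothetical sketch of a trace-based locality analysis; not the paper's code.
import numpy as np

def partners_for_fraction(comm_bytes: np.ndarray, fraction: float = 0.9) -> np.ndarray:
    """For each sender, the smallest number of peers covering `fraction` of its traffic."""
    n = comm_bytes.shape[0]
    counts = np.zeros(n, dtype=int)
    for i in range(n):
        row = np.sort(comm_bytes[i])[::-1]      # heaviest partners first
        total = row.sum()
        if total == 0:
            continue
        cum = np.cumsum(row) / total            # cumulative traffic share
        counts[i] = int(np.searchsorted(cum, fraction) + 1)
    return counts

def torus_hops(src: int, dst: int, dims: tuple) -> int:
    """Minimal hop count between two ranks on a 3D torus with wrap-around links."""
    hops = 0
    for d in dims:
        a, b = src % d, dst % d                 # coordinate in this dimension
        src, dst = src // d, dst // d
        delta = abs(a - b)
        hops += min(delta, d - delta)           # take the shorter way around the ring
    return hops

# Toy usage: 8 ranks on a 2x2x2 torus with a random traffic matrix.
rng = np.random.default_rng(0)
traffic = rng.integers(0, 100, size=(8, 8))
np.fill_diagonal(traffic, 0)                    # ignore self-communication
print(partners_for_fraction(traffic))
print(torus_hops(0, 7, (2, 2, 2)))
```

Summing such hop counts weighted by the traffic matrix gives a rough cost that a rank-to-node mapping could try to minimize, which is the kind of optimization the abstract points to.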