Resilient X10 over MPI User Level Failure Mitigation

X10@PLDI(2016)

Abstract
Many PGAS languages and libraries rely on high-performance transport layers such as GASNet and MPI to achieve low communication latency, portability, and scalability. As systems increase in scale, failures are expected to become normal events rather than exceptions. Unfortunately, GASNet and standard MPI do not provide fault tolerance capabilities. This limitation hinders PGAS languages and other high-level programming models from supporting resilience at scale. For this reason, Resilient X10 has previously been supported over sockets only, not over MPI. This paper describes the use of a fault-tolerant MPI implementation, called ULFM (User Level Failure Mitigation), as a transport layer for Resilient X10. By providing fault-tolerant collective and agreement algorithms, on-demand failure propagation, and support for InfiniBand, ULFM provides the required infrastructure to create a high-performance transport layer for Resilient X10. We show that replacing X10’s emulated collectives with ULFM’s blocking collectives results in significant performance improvements. For three iterative SPMD-style applications running on 1000 X10 places, the improvement ranged between 30% and 51%. The per-step overhead for resilience was less than 9%. A proposal for adding ULFM to the coming MPI-4 standard is currently under assessment by the MPI Forum. Our results show that adding user-level fault tolerance support in MPI makes it a suitable base for resilience in high-level programming models.