Parallel Version of the Framework for Clustering Error Messages

Lobachevskii Journal of Mathematics (2021)

Abstract
Distributed computing environments execute a large number of diverse computing jobs, some of which fail or break. Analysis of the error messages that describe the reasons for these failures has become one of the most crucial parts of existing monitoring systems. This analysis is complicated by the large number of messages, especially in retrospective analysis. The ClusterLogs framework was developed as a modular and flexible tool for clustering the error messages of computing jobs in distributed computing infrastructures. Its general purpose is to simplify error message analysis by grouping together messages that share similar failure reasons and textual patterns. The proposed clustering method comprises a sequence of data processing stages and provides several clustering options: deterministic similarity-based clustering and multiple unsupervised machine learning methods with preliminary vectorization of error messages using word embeddings. Performance tests revealed the most time-consuming stages. In this paper we describe the method used to parallelize these stages and demonstrate how it improved the performance of the whole clustering pipeline. The performance tests were executed on the HPC system Polus.
Keywords
HPC, MPI, parallel, clustering, error messages
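
The idea in the abstract of grouping messages by shared textual pattern, with the per-message stage spread across workers, can be illustrated with a minimal sketch. It is not the ClusterLogs implementation: the masking rules, the function names (`to_pattern`, `split_chunks`, `cluster_messages`), and the exact-pattern grouping are assumptions made for illustration, and a standard-library thread pool stands in for the MPI-based parallelization named in the keywords.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical masking rules: paths, hex values, and numbers are replaced by
# a placeholder so messages that differ only in those parts share a pattern.
_VARIABLE = re.compile(r"(/[\w./-]+|0x[0-9a-fA-F]+|\d+)")


def to_pattern(message: str) -> str:
    """Reduce one error message to its textual pattern."""
    return _VARIABLE.sub("*", message.strip().lower())


def split_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def patterns_for_chunk(chunk):
    """Per-chunk work: each message is processed independently."""
    return [to_pattern(m) for m in chunk]


def cluster_messages(messages, n_workers=4):
    """Group messages by pattern; the masking stage runs chunk by chunk."""
    chunks = split_chunks(messages, n_workers)
    # Threads keep the sketch portable; for CPU-bound work a process pool
    # or one MPI rank per chunk would be the natural replacement.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(patterns_for_chunk, chunks))
    clusters = defaultdict(list)
    for chunk, patterns in zip(chunks, results):
        for message, pattern in zip(chunk, patterns):
            clusters[pattern].append(message)
    return dict(clusters)
```

Because every message is masked independently, the input can be divided into chunks with no communication between workers until the final grouping step, which is what makes a stage of this kind a candidate for parallelization. Exact-pattern grouping is a simplification here; a similarity threshold between patterns would merge near-identical clusters.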