GUIDE: a scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility

SC(2017)

Citations: 20 | Views: 131
Abstract
In this paper, we describe the GUIDE framework used to collect, federate, and analyze log data from the Oak Ridge Leadership Computing Facility (OLCF), and how we use that data to derive insights into facility operations. We collect system logs and extract monitoring data at every level of the various OLCF subsystems, and have developed a suite of pre-processing tools to make the raw data consumable. The cleansed logs are then ingested and federated into a central, scalable data warehouse, Splunk, that offers storage, indexing, querying, and visualization capabilities. We have further developed and deployed a set of tools to analyze these multiple disparate log streams in concert and derive operational insights. We describe our experience from developing and deploying the GUIDE infrastructure, and deriving valuable insights on the various subsystems, based on two years of operations in the production OLCF environment.
Keywords
scalable information directory service, federate, analyze logs, Leadership HPC Facility, GUIDE framework, Oak Ridge Leadership Computing Facility, facility operations, collect system logs, OLCF subsystems, pre-processing tools, raw data consumable, cleansed logs, central data warehouse, scalable data warehouse, indexing, visualization capabilities, multiple disparate log streams, derive operational insights, GUIDE infrastructure, production OLCF