Simon: A Simple And Scalable Method For Sensing, Inference And Measurement In Data Center Networks

PROCEEDINGS OF THE 16TH USENIX SYMPOSIUM ON NETWORKED SYSTEMS DESIGN AND IMPLEMENTATION(2019)

引用 65|浏览201
暂无评分
摘要
It is important to perform measurement and monitoring in order to understand network performance and debug problems encountered by distributed applications. Despite many products and much research on these topics, in the context of data centers, performing accurate measurement at scale in near real-time has remained elusive. There are two main approaches to network telemetry-switch-based and end-host-based-each with its own advantages and drawbacks.In this paper, we attempt to push the boundary of edge-based measurement by scalably and accurately reconstructing the full queueing dynamics in the network with data gathered entirely at the transmit and receive network interface cards (NICs). We begin with a Signal Processing framework for quantifying a key trade-off: reconstruction accuracy versus the amount of data gathered. Based on this, we propose SIMON, an accurate and scalable measurement system for data centers that reconstructs key network state variables like packet queuing times at switches, link utilizations, and queue and link compositions at the flow-level. We use two ideas to speed up SIMON: (i) the hierarchical nature of data center topologies, and (ii) the function approximation capability of multi-layered neural networks. The former gives a speedup of 1,000x while the latter implemented on GPUs gives a speedup of 5,000x to 10,000x, enabling SIMON to run in real-time. We deployed SIMON in three testbeds with different link speeds, layers of switching and number of servers. Evaluations with NetFPGAs and a cross-validation technique show that SIMON reconstructs queue-lengths to within 3-5 KBs and link utilizations to less than 1% of actual. The accuracy and speed of SIMON enables sensitive A/B tests, which greatly aids the real-time development of algorithms, protocols, network software and applications.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要