Microscope: Queue-based Performance Diagnosis for Network Functions


ACM SIGCOMM 2020.

Our evaluation shows that Microscope can correctly capture 89.7% of all performance problems of various types, up to 2.5 times more than the state-of-the-art tools.

Abstract:

By moving monolithic network appliances to software running on commodity hardware, network function virtualization allows flexible resource sharing among network functions and achieves scalability with low cost. However, due to resource contention, network functions can suffer from performance problems that are hard to diagnose. In partic...

Introduction
  • Network function virtualization (NFV) transforms hardware middleboxes into software running on commodity hardware, called Virtual Network Functions (VNFs), thereby bringing flexibility and agility to network operations.
  • Since VNFs process packets in software, they inevitably show more performance variation than hardware platforms.
  • These performance problems have a significant impact on service-level agreements and user experiences [35].
  • Figure 1 shows that a bursty flow lasting 300 μs can affect flows that arrive during the following three milliseconds, because the queues take a long time to drain.
  • While the queues drain, the impact of queuing may propagate to other NFs and flows.
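To see why a short burst can hurt packets that arrive much later, a back-of-the-envelope fluid model of a single NF queue may help. This is only a sketch: the function name and all rates below are illustrative assumptions chosen so that the numbers line up with the 300 μs and 3 ms figures above, not measurements from the paper.

```python
def drain_time_us(burst_us, burst_rate, normal_rate, service_rate):
    """Time (in microseconds) after a burst ends until the queue is empty.

    Fluid approximation; all rates are in packets per microsecond.
    """
    if burst_rate <= service_rate:
        return 0.0  # the NF keeps up, so no backlog builds
    if normal_rate >= service_rate:
        raise ValueError("queue never drains: normal_rate >= service_rate")
    # Backlog accumulated while arrivals exceed the service rate.
    backlog = (burst_rate - service_rate) * burst_us
    # After the burst, the backlog shrinks at the spare service capacity.
    return backlog / (service_rate - normal_rate)


# Assumed rates: the NF serves 1.0 pkt/us, normal load is 0.9 pkt/us,
# and the burst arrives at 2.0 pkt/us for 300 us.
print(round(drain_time_us(300, 2.0, 0.9, 1.0)))  # 3000
```

Under these assumptions, a 300 μs burst leaves a backlog that takes about 3000 μs to clear, because an NF running at 90% load has only 10% spare capacity for draining. Every packet arriving in that window waits behind the backlog.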
Highlights
  • Our evaluation demonstrates that Microscope can correctly capture 89.7% of all performance problems emanating from a variety of reasons such as traffic bursts, interrupts, NF bugs, etc., which is up to 2.5 times higher than the state-of-the-art tools
  • In this paper we presented Microscope to diagnose network performance issues
  • We showed how stringent performance requirements of VNFs can create a lasting impact across time and network functions causing latency and throughput issues for downstream VNFs
  • Our evaluation shows that this decoupling significantly reduces the aggregation time without losing any significant patterns
  • We demonstrated that Microscope can accurately diagnose performance problems caused by interrupts, software bugs, traffic bursts, resource exhaustion, etc., across a chain of various network functions
Methods
  • The authors compare the accuracy of Microscope and NetMedic. Ideally, the authors would like to run evaluation on real problems, but ground truth is often hard to come by in such scenarios.
  • When the authors inject traffic bursts, sometimes interrupts occur at the same time, and these two culprits both contribute to the performance problem.
  • For such scenarios, Microscope identifies other problems as the top reason rather than the injected ones.
  • For 39.9% of victim packets NetMedic ranks the traffic burst the second-most likely culprit
Results
  • The authors evaluate the accuracy and performance of Microscope.
  • The authors' evaluation shows that Microscope can correctly capture 89.7% of all performance problems of various types, up to 2.5 times more than the state-of-the-art tools.
  • The authors demonstrate that this can be achieved with a very small overhead during runtime information collection.
  • If a flow matches a rule at the Firewall, it is forwarded to the Monitor; otherwise, it goes directly to the VPN.
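The forwarding decision in the last bullet can be sketched as a small function. The rule representation (a list of predicates over a flow dictionary) and the example rule are hypothetical; the summary does not say how the Firewall rules are expressed.

```python
def next_hop_after_firewall(flow, rules):
    """Return the NF that receives `flow` after the Firewall.

    `rules` is a list of predicates (hypothetical format); the flow
    "matches" if any predicate returns True.
    """
    matched = any(rule(flow) for rule in rules)
    return "Monitor" if matched else "VPN"


# Hypothetical rule: send traffic destined to port 80 to the Monitor.
rules = [lambda flow: flow["dst_port"] == 80]

print(next_hop_after_firewall({"dst_port": 80}, rules))   # Monitor
print(next_hop_after_firewall({"dst_port": 443}, rules))  # VPN
```

Because flows take different paths depending on the rules, a queue that builds up at one NF affects only the flows routed through it, which is one reason diagnosis needs per-NF information.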
Conclusion
  • The authors can use a non-zero queue length threshold to define the start of a queuing period, to handle the case when the NF queues may be non-zero for a long time.
  • In this paper the authors presented Microscope to diagnose network performance issues.
  • The authors showed how stringent performance requirements of VNFs can create a lasting impact across time and network functions causing latency and throughput issues for downstream VNFs. The authors presented the design and the implementation of Microscope that can be used to diagnose such problems, based on the key insight of analyzing the queuing periods.
  • While Microscope is not a panacea, the authors believe it can help operators in reaping benefits provided by virtualization while maximizing performance
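The threshold idea in the first bullet can be illustrated with a minimal sketch that splits a queue-length trace into queuing periods. The trace format, the function name, and the exact start and end conditions are assumptions made for illustration; the paper's precise definition of a queuing period is not given in this summary.

```python
def queuing_periods(samples, threshold=0):
    """Split a queue-length trace into queuing periods.

    `samples` is a list of (timestamp, queue_length) pairs sorted by time.
    A period starts at the first sample whose length exceeds `threshold`
    and ends at the first later sample at or below `threshold`. A period
    still open at the end of the trace is closed at the last timestamp.
    """
    periods = []
    start = None
    for ts, qlen in samples:
        if qlen > threshold and start is None:
            start = ts
        elif qlen <= threshold and start is not None:
            periods.append((start, ts))
            start = None
    if start is not None:
        periods.append((start, samples[-1][0]))
    return periods


# A queue that never empties: the length hovers at 1 between two spikes.
trace = [(0, 1), (1, 1), (2, 6), (3, 4), (4, 1), (5, 1), (6, 7), (7, 1)]

print(queuing_periods(trace, threshold=0))  # [(0, 7)]
print(queuing_periods(trace, threshold=2))  # [(2, 4), (6, 7)]
```

With a zero threshold the whole trace collapses into one long period, which hides the two separate spikes. A small non-zero threshold recovers them as distinct periods.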
Summary
  • Objectives:

    The authors' goal is to find the causal relations between the NFs/flows with intermittent performance problems.
  • The authors' goal is to identify all the abnormal behaviors which impact the packet p.
  • The authors' goal is to understand the history of PreSet(p) and why these packets take T time at NF f.
  • This is because the goal is to diagnose victim packet p at f: no matter how the packets in PreSet(p) are distributed within the timespan at each upstream NF, they cause the same effect at f.
  • The authors' goal is to identify the culprit flows, culprit NF, or culprit NF-flow pairs
Tables
  • Table 1: Information collected by Microscope during runtime
  • Table 2: Breakdown of problem frequencies based on culprits and victims. Rows represent culprit NFs and columns represent victim NFs. Numbers show the percentage of problems for each
  • Table 3: Frequency differences for problems caused by different
Related work
  • In this section, we discuss works related to Microscope. Given the nature of network function virtualization, we discuss existing solutions for performance diagnostics in the domain of networks and distributed systems. Before making direct comparisons, we first discuss a few recent works on NFV performance optimization, and then discuss how Microscope differs from existing works in its ability to diagnose problems at fine-grained timescales and across functions in service function chains.

    Performance optimization: There has been a great effort on performance optimization for NFV systems and distributed systems in multi-tenant environments.

    Performance optimization for distributed systems: Retro [44], Ernest [52], and HUG [20] are some of the first efforts in this regard. These systems mainly focus on optimizing resource allocation in a distributed system.
Funding
  • Junzhi Gong, Yuliang Li, and Minlan Yu are supported in part by the NSF grant CNS-1618138
Reference
  • Brocade Vyatta 5400 vRouter. http://www.brocade.com/products/all/networkfunctions-virtualization/product-details/5400-vrouter/index.page.
  • The Cooperative Association for Internet Data Analysis (CAIDA). http://www.caida.
  • Data Plane Development Kit. https://www.dpdk.org/.
  • Evolution of the broadband network gateway. https://www.tmcnet.com/tmc/
  • IEEE Standard 1588-2008. http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?
  • Jaeger: open source, end-to-end distributed tracing. https://www.jaegertracing.
  • Microscope survey form and results. https://www.dropbox.com/s/
  • Migration to ethernet-based broadband aggregation. https://www.broadbandforum.org/download/TR-101_Issue-2.pdf.
  • NFV proofs of concept. http://www.etsi.org/technologies-clusters/technologies/
  • Open vSwitch. https://www.openvswitch.org/.
  • VPP. https://fd.io/.
  • Zipkin: A distributed tracing system. https://zipkin.io/.
  • Omid Alipourfard and Minlan Yu. Decoupling algorithms and optimizations in network functions. In Proceedings of the 17th ACM Workshop on Hot Topics in Networks, pages 71–77, 2018.
  • Bilal Anwer, Theophilus Benson, Nick Feamster, and Dave Levin. Programming slick network functions. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research, pages 1–13, 2015.
  • Muhammad Bilal Anwer, Murtaza Motiwala, Mukarram bin Tariq, and Nick Feamster. Switchblade: A platform for rapid deployment of network protocols on programmable hardware. In Proceedings of the ACM SIGCOMM 2010 conference, pages 183–194, 2010.
  • Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A Communication Review, 37(4):13–24, 2007.
  • Paul Barham, Austin Donnelly, Rebecca Isaacs, and Richard Mortier. Using magpie for request extraction and workload modelling. In OSDI, volume 4, pages
  • Anat Bremler-Barr, Yotam Harchol, and David Hay. Openbox: a software-defined framework for developing, deploying, and managing network functions. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 511–524. ACM, 2016.
  • Mike Y Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings
  • Mosharaf Chowdhury, Zhenhua Liu, Ali Ghodsi, and Ion Stoica. HUG: Multi-resource fairness for correlated and elastic demands. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 407–424, 2016.
  • Mihai Dobrescu, Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, and Sylvia Ratnasamy. Routebricks: exploiting parallelism to scale software routers. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 15–28, 2009.
  • Nick G Duffield and Matthias Grossglauser. Trajectory sampling for direct traffic observation. IEEE/ACM Transactions on Networking, 9(3):280–292, 2001.
  • Daniel E Eisenbud, Cheng Yi, Carlo Contavalli, Cody Smith, Roman Kononov, Eric Mann-Hielscher, Ardas Cilingiroglu, Bin Cheyney, Wentao Shang, and Jinnah Dylan Hosein. Maglev: A fast and reliable software network load balancer. In NSDI 16, pages 523–535, 2016.
  • Paul Emmerich, Sebastian Gallenmüller, Daniel Raumer, Florian Wohlfart, and Georg Carle. Moongen: A scriptable high-speed packet generator. In Proceedings of the 2015 Internet Measurement Conference, pages 275–287, 2015.
  • Cristian Estan, Stefan Savage, and George Varghese. Automatically inferring patterns of resource consumption in network traffic. In Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, pages 137–148, 2003.
  • Rodrigo Fonseca, George Porter, Randy H Katz, and Scott Shenker. X-trace: A pervasive network tracing framework. In 4th USENIX Symposium on Networked Systems Design & Implementation (NSDI 07), 2007.
  • Rohan Gandhi, Hongqiang Harry Liu, Y Charlie Hu, Guohan Lu, Jitendra Padhye, Lihua Yuan, and Ming Zhang. Duet: Cloud scale load balancing with hardware and software. ACM SIGCOMM Computer Communication Review, 44(4):27–38, 2014.
  • Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. Exploiting a natural network effect for scalable, fine-grained clock synchronization. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 81–94, 2018.
  • Younghwan Go, Muhammad Asim Jamshed, YoungGyoun Moon, Changho Hwang, and KyoungSoo Park. Apunet: Revitalizing GPU as packet processing accelerator. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 83–96, 2017.
  • Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. Softnic: A software NIC to augment hardware. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2015-155, 2015.
  • Sangjin Han, Keon Jang, KyoungSoo Park, and Sue Moon. Packetshader: a GPU-accelerated software router. ACM SIGCOMM Computer Communication Review, 40(4):195–206, 2010.
  • Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 317–328. ACM, 2012.
  • Keon Jang, Sangjin Han, Seungyeop Han, Sue B Moon, and KyoungSoo Park. Sslshader: Cheap SSL acceleration with commodity processors. In NSDI, pages 1–14, 2011.
  • Murad Kablan, Azzam Alsudais, Eric Keller, and Franck Le. Stateless network functions: Breaking the tight coupling of state and processing. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 97–112, 2017.
  • Kostis Kaffes, Timothy Chong, Jack Tigar Humphries, Adam Belay, David Mazières, and Christos Kozyrakis. Shinjuku: Preemptive scheduling for μsecond-scale tail latency. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 345–360, 2019.
  • Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. Detailed diagnosis in enterprise networks. ACM SIGCOMM Computer Communication Review, 39(4):243–254, 2009.
  • Rishi Kapoor, Alex C Snoeren, Geoffrey M Voelker, and George Porter. Bullet trains: a study of NIC burst behavior at microsecond timescales. In Proceedings of the ninth ACM conference on Emerging networking experiments and technologies, pages 133–138, 2013.
  • Georgios P Katsikas, Tom Barbette, Dejan Kostic, Rebecca Steinert, and Gerald Q Maguire Jr. Metron: NFV service chains at the true speed of the underlying hardware. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 171–186, 2018.
  • Joongi Kim, Keon Jang, Keunhong Lee, Sangwook Ma, Junhyun Shim, and Sue Moon. NBA (network balancing act): A high-performance packet processing framework for heterogeneous processors. In Proceedings of the Tenth European Conference on Computer Systems, page 22. ACM, 2015.
  • Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M Frans Kaashoek. The click modular router. ACM Transactions on Computer Systems (TOCS), 18(3):263–297, 2000.
  • Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, and Alex C. Snoeren. IP fault localization via risk modeling. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, NSDI’05, pages 57–70, Berkeley, CA, USA, 2005. USENIX Association.
  • Sameer G Kulkarni, Wei Zhang, Jinho Hwang, Shriram Rajagopalan, KK Ramakrishnan, Timothy Wood, Mayutan Arumaithurai, and Xiaoming Fu. Nfvnice: Dynamic backpressure and scheduling for NFV service chains. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 71–84. ACM, 2017.
  • Bojie Li, Kun Tan, Layong Larry Luo, Yanqing Peng, Renqian Luo, Ningyi Xu, Yongqiang Xiong, Peng Cheng, and Enhong Chen. Clicknp: Highly flexible and high performance network processing with reconfigurable hardware. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 1–14. ACM, 2016.
  • Jonathan Mace, Peter Bodik, Rodrigo Fonseca, and Madanlal Musuvathi. Retro: Targeted resource management in multi-tenant distributed systems. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 589–603, 2015.
  • Jonathan Mace, Ryan Roelke, and Rodrigo Fonseca. Pivot tracing: Dynamic causal monitoring for distributed systems. ACM Transactions on Computer Systems (TOCS), 35(4):1–28, 2018.
  • Karthik Nagaraj, Charles Killian, and Jennifer Neville. Structured comparative analysis of systems logs to diagnose performance problems. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 353–366, 2012.
  • Jaehyun Nam, Junsik Seo, and Seungwon Shin. Probius: Automated approach for VNF and service chain analysis in software-defined NFV. In Proceedings of the Symposium on SDN Research, pages 1–13, 2018.
  • Shoumik Palkar, Chang Lan, Sangjin Han, Keon Jang, Aurojit Panda, Sylvia Ratnasamy, Luigi Rizzo, and Scott Shenker. E2: a framework for NFV applications. In Proceedings of the 25th Symposium on Operating Systems Principles, pages 121–136, 2015.
  • Aurojit Panda, Sangjin Han, Keon Jang, Melvin Walls, Sylvia Ratnasamy, and Scott Shenker. Netbricks: Taking the v out of NFV. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 203–216, 2016.
  • Luigi Rizzo. Netmap: a novel framework for fast packet I/O. In 21st USENIX Security Symposium (USENIX Security 12), pages 101–112, 2012.
  • Chen Sun, Jun Bi, Zhilong Zheng, Heng Yu, and Hongxin Hu. NFP: Enabling network function parallelism in NFV. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 43–56. ACM, 2017.
  • Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. Ernest: efficient performance prediction for large-scale advanced analytics. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 363–378, 2016.
  • Wenfei Wu, Keqiang He, and Aditya Akella. Perfsight: Performance diagnosis for software dataplanes. In Proceedings of the 2015 Internet Measurement Conference, pages 409–421, 2015.
  • Shaula Alexander Yemini, Shmuel Kliger, Eyal Mozes, Yechiam Yemini, and David Ohsie. High speed and robust event correlation. IEEE Communications Magazine, 34(5):82–90, 1996.
  • Kai Zhang, Bingsheng He, Jiayu Hu, Zeke Wang, Bei Hua, Jiayi Meng, and Lishan Yang. G-net: Effective GPU sharing in NFV systems. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 187–200, 2018.
  • Yang Zhang, Bilal Anwer, Vijay Gopalakrishnan, Bo Han, Joshua Reich, Aman Shaikh, and Zhi-Li Zhang. Parabox: Exploiting parallelism for virtual network functions in service chaining. In Proceedings of the Symposium on SDN Research, pages 143–149, 2017.