Understanding Exception-Related Bugs in Large-Scale Cloud Systems

ASE, pp. 339-351, 2019.

Keywords: exception mechanism, large scale, error condition, exception handling, main business logic

Abstract:

The exception mechanism is widely used in cloud systems, mainly because it separates error handling code from the main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems present a big hurdle to the correct use of the exception mechanism. As a result, mistakes in the exception...

Introduction
  • At the time of this writing, about 7% of the source code in twelve popular open source distributed systems [1] involves the exception mechanism (Figure 1), i.e., throwing exceptions, or being enclosed in try, catch, or finally blocks.
  • Such widespread use of the exception mechanism is mainly due to its advantages over the traditional checking-return-value mechanism [2].
  • It allows developers to handle multiple exceptions together through their common superclass exception, providing greater flexibility in writing error handling code.
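
The superclass point above can be illustrated with a minimal Java sketch. The class and method names (`SuperclassCatchDemo`, `readBlock`, `describe`) and the error messages are hypothetical and do not come from the paper; only the JDK exception classes are real. `FileNotFoundException` and `SocketTimeoutException` both extend `IOException`, so a single catch clause covers both.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.net.SocketTimeoutException;

public class SuperclassCatchDemo {
    // Hypothetical operation that can fail in two different ways.
    static String readBlock(int mode) throws IOException {
        if (mode == 1) {
            throw new FileNotFoundException("block file missing");
        }
        if (mode == 2) {
            throw new SocketTimeoutException("replica did not respond");
        }
        return "data";
    }

    // One catch clause on the common superclass handles both failures,
    // keeping the error handling apart from the main logic in the try block.
    static String describe(int mode) {
        try {
            return "ok: " + readBlock(mode);
        } catch (IOException e) {
            return "failed: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 2; mode++) {
            System.out.println(describe(mode));
        }
    }
}
```

With a return-value convention, the caller would instead have to test the result of every call and distinguish each error code by hand.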
Highlights
  • The exception mechanism is widely used to handle errors in cloud systems
  • We find that exception-related bugs (eBugs) are severe in cloud systems: 74% of our studied eBugs affect system availability or integrity
  • We provide a large benchmark of eBugs in cloud systems, which can be used to evaluate the effectiveness of the tools that expose and detect eBugs in cloud systems
  • To understand the characteristics of eBugs in real-world cloud systems, we select the target systems based on three criteria: (i) The systems must be diverse for an unbiased dataset. (ii) The systems should be mature and popular, so that we can understand the real problems faced by developers. (iii) The systems should be open source and have public issue tracking systems
  • We present a comprehensive analysis of 210 eBugs in six popular cloud systems, from the perspective of triggering conditions
  • We develop DIET to detect inaccurate exceptions in cloud systems
Methods
  • The target systems are selected on three criteria: (i) they must be diverse, (ii) they should be mature and popular, and (iii) they should be open source and have public issue tracking systems.
  • With these requirements in mind, the authors identify the following six cloud systems: (i) Cassandra [11], a highly available peer-to-peer NoSQL database; (ii) HBase [12], a master-slave NoSQL database; (iii) HDFS [13], a distributed file system; (iv) Hadoop MapReduce [14], a distributed data processing framework; (v) YARN [15], a distributed resource management system; and (vi) ZooKeeper [16], a distributed coordination service.
  • The authors apply a few filtering rules to identify the relevant issues
Results
  • 3 and 4 in §V-A show that inaccurate exceptions cannot describe the triggering conditions precisely, and may mislead developers into handling them incorrectly.
  • If an exception’s class and its error message imply different types of triggering conditions, the exception is likely to be inaccurate.
  • Inspired by these observations, the authors build a static analysis tool, DIET, to automatically detect inaccurate exceptions, by inspecting the inconsistency between an exception’s class and its error message.
  • For a given root exception, DIET employs the above probabilities to examine whether its exception class and error message imply different triggering condition types.
  • DIET identifies root exceptions by finding the ones that have no cause exceptions
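
The two ideas in the bullets above, finding root exceptions and comparing what the class and the message imply, can be sketched in Java. This is not DIET's implementation: DIET is a static analysis tool that relies on probabilities from the study, whereas this sketch inspects exception objects at run time and uses a hand-written keyword table. The condition types (`NETWORK`, `DISK`, `UNKNOWN`), the keywords, and all class and method names are illustrative assumptions. Only `Throwable.getCause()` and `Throwable.getMessage()` are real JDK API.

```java
import java.io.IOException;
import java.net.SocketException;
import java.util.Locale;

public class InaccurateExceptionSketch {
    // Hypothetical triggering-condition types; not the paper's taxonomy.
    enum Condition { NETWORK, DISK, UNKNOWN }

    // A root exception has no cause exception; walk the cause chain to reach it.
    static Throwable rootOf(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    // Assumed mapping from exception class to condition type.
    static Condition fromClass(Throwable t) {
        if (t instanceof SocketException) return Condition.NETWORK;
        return Condition.UNKNOWN;
    }

    // Assumed keyword mapping from error message to condition type.
    static Condition fromMessage(Throwable t) {
        String msg = t.getMessage() == null
            ? "" : t.getMessage().toLowerCase(Locale.ROOT);
        if (msg.contains("disk") || msg.contains("file")) return Condition.DISK;
        if (msg.contains("connection") || msg.contains("socket")) return Condition.NETWORK;
        return Condition.UNKNOWN;
    }

    // Flag the root exception when both signals are known and disagree.
    static boolean looksInaccurate(Throwable t) {
        Throwable root = rootOf(t);
        Condition byClass = fromClass(root);
        Condition byMessage = fromMessage(root);
        return byClass != Condition.UNKNOWN
            && byMessage != Condition.UNKNOWN
            && byClass != byMessage;
    }

    public static void main(String[] args) {
        Throwable suspicious = new IOException("write failed",
            new SocketException("disk is full"));
        Throwable consistent = new SocketException("connection reset");
        System.out.println(looksInaccurate(suspicious));
        System.out.println(looksInaccurate(consistent));
    }
}
```

In this sketch, a `SocketException` whose message mentions a full disk is flagged because its class suggests a network condition while its message suggests a disk condition. An exception for which either signal is unknown is left alone, which keeps the check conservative.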
Conclusion
  • The authors present a comprehensive analysis of 210 eBugs in six popular cloud systems, from the perspective of triggering conditions.
  • Most of these eBugs affect the availability or integrity of the cloud systems.
  • The authors develop DIET to detect inaccurate exceptions in cloud systems.
  • DIET has detected 31 eBugs and bad practices, and developers have confirmed 23 of them
Tables
  • Table 1: INVESTIGATED BUG REPORTS IN THE STUDIED SYSTEMS
  • Table 2: THE TYPE DISTRIBUTION OF INACCURATE EXCEPTION EBUGS
  • Table 3: JIRA ISSUE PRIORITY OF EBUGS
  • Table 4: TRIGGERING CONDITIONS OF THE STUDIED EBUGS AND THEIR TYPICAL SCENARIOS IN EACH SYSTEM
  • Table 5: STATISTICS OF EBUG ROOT CAUSES
  • Table 6: EBUG FAILURE SYMPTOMS
  • Table 7: THE TRIGGERING CONDITIONS AND EXCEPTION CLASSES OF FOUR
  • Table 8: APPLYING DIET ON REAL-WORLD CLOUD SYSTEMS
  • Table 9: TIMING REQUIREMENTS ON EBUG TRIGGERING CONDITIONS
  • Table 10: EXCEPTION CLASSES THAT ARE TRIGGERED MORE THAN ONCE BY
  • Table 11: BUGS AND BAD PRACTICES DETECTED BY DIET
Funding
  • In this work, Haicheng Chen and Feng Qin were partially supported by National Science Foundation grants #CNS-1513120 and #CCF-0953759 (CAREER Award)
  • Wensheng Dou and Yanyan Jiang were partially supported by National Key R&D Program of China (2017YFB1001800), National Natural Science Foundation of China (#61732019, #61932021, #61802165), Youth Innovation Promotion Association at Chinese Academy of Sciences, Alibaba Innovative Research Program, and the Collaborative Innovation Center of Novel Software Technology and Industrialization, Jiangsu, China
Reference
  • Apache Hadoop. [Online]. Available: https://hadoop.apache.org
  • Advantages of exceptions. [Online]. Available: https://docs.oracle.com/
  • D. Yuan, Y. Luo, X. Zhuang, G. R. Rodrigues, X. Zhao, Y. Zhang, P. Jain, and M. Stumm, “Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems,” in Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, 2014, pp. 249–265.
  • H. S. Gunawi, M. Hao, T. Leesatapornwongsa, T. Patana-anake, T. Do, J. Adityatama, K. J. Eliazar, A. Laksono, J. F. Lukman, V. Martin et al., “What bugs live in the cloud? A study of 3000+ issues in cloud systems,” in Proceedings of the ACM Symposium on Cloud Computing, 2014, pp. 1–14.
  • F. Ebert, F. Castor, and A. Serebrenik, “An exploratory study on exception handling bugs in Java programs,” Journal of Systems and Software, vol. 106, pp. 82–101, 2015.
  • J. Oliveira, D. Borges, T. Silva, N. Cacho, and F. Castor, “Do Android developers neglect error handling? A maintenance-centric study on the relationship between Android abstractions and uncaught exceptions,” Journal of Systems and Software, vol. 136, pp. 1–18, 2018.
  • R. Coelho, L. Almeida, G. Gousios, and A. van Deursen, “Unveiling exception handling bug hazards in Android based on GitHub and Google code issues,” in Proceedings of the 12th Working Conference on Mining Software Repositories, 2015, pp. 134–145.
  • L. Fan, T. Su, S. Chen, G. Meng, Y. Liu, L. Xu, G. Pu, and Z. Su, “Large-scale analysis of framework-specific exceptions in Android Apps,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 408–419.
  • R. Coelho, A. Rashid, A. von Staa, J. Noble, U. Kulesza, and C. Lucena, “A catalogue of bug patterns for exception handling in aspect-oriented programs,” in Proceedings of the 15th Conference on Pattern Languages of Programs, 2008, p. 23.
  • Jira Software. [Online]. Available: https://www.atlassian.com/software/jira
  • Apache Cassandra. [Online]. Available: http://cassandra.apache.org
  • Apache HBase. [Online]. Available: http://hbase.apache.org
  • HDFS architecture. [Online]. Available: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
  • MapReduce tutorial. [Online]. Available: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
  • Apache Hadoop YARN. [Online]. Available: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
  • Apache ZooKeeper. [Online]. Available: http://zookeeper.apache.org
  • EBugs in cloud systems. [Online]. Available: https://hanseychen.github.io/eBugs/
  • O. R. Gatla, M. Hameed, M. Zheng, V. Dubeyko, A. Manzanares, F. Blagojevic, C. Guyot, and R. Mateescu, “Towards robust file system checkers,” in Proceedings of the 16th USENIX Conference on File and Storage Technologies, 2018, pp. 105–122.
  • Y. Gao, W. Dou, F. Qin, C. Gao, D. Wang, J. Wei, R. Huang, L. Zhou, and Y. Wu, “An empirical study on crash recovery bugs in large-scale distributed systems,” in Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 539–550.
  • C. Cadar, D. Dunbar, D. R. Engler et al., “KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, 2008, pp. 209–224.
  • P. Godefroid, M. Y. Levin, D. A. Molnar et al., “Automated whitebox fuzz testing,” in Proceedings of the 16th Network and Distributed System Security Symposium, 2008, pp. 151–166.
  • M. Zheng, J. Tucek, D. Huang, F. Qin, M. Lillibridge, E. S. Yang, B. W. Zhao, and S. Singh, “Torturing databases for fun and profit,” in Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, 2014, pp. 449–464.
  • H. S. Gunawi, T. Do, P. Joshi, P. Alvaro, J. M. Hellerstein, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, K. Sen, and D. Borthakur, “FATE and DESTINI: A framework for cloud recovery testing,” in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011, pp. 1–18.
  • R. Alagappan, A. Ganesan, Y. Patel, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Correlated crash vulnerabilities,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, 2016, pp. 151–167.
  • H. Liu, X. Wang, G. Li, S. Lu, F. Ye, and C. Tian, “FCatch: Automatically detecting time-of-fault bugs in cloud systems,” in Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems, 2018, pp. 419–431.
  • Jepsen. [Online]. Available: https://jepsen.io/
  • A. Alquraan, H. Takruri, M. Alfatafta, and S. Al-Kiswany, “An analysis of network-partitioning failures in cloud systems,” in Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, 2018, pp. 51–68.
  • T. Leesatapornwongsa, M. Hao, P. Joshi, J. F. Lukman, and H. S. Gunawi, “SAMC: Semantic-aware model checking for fast discovery of deep bugs in cloud systems,” in Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, 2014, pp. 399–414.
  • A. Ganesan, R. Alagappan, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Redundancy does not imply fault tolerance: Analysis of distributed storage reactions to single errors and corruptions,” in Proceedings of the 15th USENIX Conference on File and Storage Technologies, 2017, pp. 149–166.
  • G. B. de Padua and W. Shang, “Revisiting exception handling practices with exception flow analysis,” in Proceedings of 17th International Working Conference on Source Code Analysis and Manipulation, 2017, pp. 11–20.
  • D. Sena, R. Coelho, U. Kulesza, and R. Bonifacio, “Understanding the exception handling strategies of Java libraries: An empirical study,” in Proceedings of the 13th International Conference on Mining Software Repositories, 2016, pp. 212–222.
  • S. Liang, W. Sun, M. Might, A. Keep, and D. Van Horn, “Pruning, pushdown exception-flow analysis,” in Proceedings of 14th International Working Conference on Source Code Analysis and Manipulation, 2014, pp. 265–274.
  • H. Melo, R. Coelho, U. Kulesza, and D. Sena, “In-depth characterization of exception flows in software product lines: An empirical study,” Journal of Software Engineering Research and Development, vol. 1, no. 1, p. 3, 2013.
  • P. Prabhu, N. Maeda, G. Balakrishnan, F. Ivancic, and A. Gupta, “Interprocedural exception analysis for C++,” in Proceedings of the 25th European Conference on Object-Oriented Programming, 2011, pp. 583–608.
  • M. Bravenboer and Y. Smaragdakis, “Exception analysis and points-to analysis: Better together,” in Proceedings of the 18th International Symposium on Software Testing and Analysis, 2009, pp. 1–12.
  • S. Thummalapenta and T. Xie, “Mining exception-handling rules as sequence association rules,” in Proceedings of the 31st International Conference on Software Engineering, 2009, pp. 496–506.
  • T. Montenegro, H. Melo, R. Coelho, and E. Barbosa, “Improving developers awareness of the exception handling policy,” in Proceedings of the 25th International Conference on Software Analysis, Evolution and Reengineering, 2018, pp. 413–422.
  • HDFS-14486. [Online]. Available: https://issues.apache.org/jira/browse/HDFS-14486
  • CASSANDRA-15111. [Online]. Available: https://issues.apache.org/jira/browse/CASSANDRA-15111
  • CASSANDRA-15112. [Online]. Available: https://issues.apache.org/jira/browse/CASSANDRA-15112
  • CASSANDRA-15114. [Online]. Available: https://issues.apache.org/jira/browse/CASSANDRA-15114
  • CASSANDRA-15116. [Online]. Available: https://issues.apache.org/jira/browse/CASSANDRA-15116
  • CASSANDRA-15117. [Online]. Available: https://issues.apache.org/jira/browse/CASSANDRA-15117
  • HBASE-22369. [Online]. Available: https://issues.apache.org/jira/browse/HBASE-22369
  • S. Nakshatri, M. Hegde, and S. Thandra, “Analysis of exception handling patterns in Java projects: An empirical study,” in Proceedings of the 13th International Conference on Mining Software Repositories, 2016, pp. 500–503.
  • M. Monperrus, M. G. de Montauzan, B. Cornu, R. Marvie, and R. Rouvoy, “Challenging analytical knowledge on exception-handling: An empirical study of 32 Java software packages,” Tech. Rep. hal-01093908, 2014.
  • M. B. Kery, C. Le Goues, and B. A. Myers, “Examining programmer practices for locally handling exceptions,” in Proceedings of the 13th International Conference on Mining Software Repositories, 2016, pp. 484–487.
  • T. Leesatapornwongsa, J. F. Lukman, S. Lu, and H. S. Gunawi, “TaxDC: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems,” in Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems, 2016, pp. 517–530.
  • T. Dai, J. He, X. Gu, and S. Lu, “Understanding real-world timeout problems in cloud server systems,” in Proceeding of the IEEE International Conference on Cloud Engineering, 2018, pp. 1–11.
  • S. Lu, S. Park, E. Seo, and Y. Zhou, “Learning from mistakes: A comprehensive study on real world concurrency bug characteristics,” in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008, pp. 329–339.
  • A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler, “An empirical study of operating systems errors,” in Proceedings of the 18th Symposium on Operating Systems Principles, 2001, pp. 73–88.
  • Z. Yin, D. Yuan, Y. Zhou, S. Pasupathy, and L. Bairavasundaram, “How do fixes become bugs?” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, 2011, pp. 26–36.
  • S. Park, S. Lu, and Y. Zhou, “CTrigger: Exposing atomicity violation bugs from their hiding places,” in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009, pp. 25–36.
  • W. Zhang, C. Sun, and S. Lu, “ConMem: Detecting severe concurrency bugs through an effect-oriented approach,” in Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010, pp. 179–192.
  • B. Kasikci, B. Schubert, C. Pereira, G. Pokam, and G. Candea, “Failure sketching: A technique for automated root cause diagnosis of in-production failures,” in Proceedings of the 25th Symposium on Operating Systems Principles, 2015, pp. 344–360.