NaturalFuzz: Natural Input Generation for Big Data Analytics

Ahmad Humayun, Yaoxuan Wu,Miryung Kim,Muhammad Ali Gulzar

2023 38TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE（2023）

引用 1|浏览10

暂无评分

摘要

Fuzzing applies input mutations iteratively with the only goal of finding more bugs, resulting in synthetic tests that tend to lack realism. Big data analytics are expected to ingest real-world data as input. Therefore, when synthetic test data are not easily comprehensible, they are less likely to facilitate the downstream task of fixing errors. Our position is that fuzzing in this domain must achieve both high naturalness and high code coverage. We propose a new natural synthetic test generation tool for big data analytics, called NATURALFUZZ. It generates both unstructured, semi-structured, and structured data with corresponding semantics such as 'zipcode' and 'age.' The key insights behind NATURALFUZZ are two-fold. First, though existing test data may be small and lack coverage, we can grow this data to increase code coverage. Second, we can strategically mix constituent parts across different rows and columns to construct new realistic synthetic data by leveraging fine-grained data provenance. On commercial big data application benchmarks, NATURALFUZZ achieves an additional 19.9% coverage and detects 1.9x more faults than a machine learning-based synthetic data generator (SDV) when generating comparably sized inputs. This is because an ML-based synthetic data generator does not consider which code branches are exercised by which input rows from which tables, while NATURALFUZZ is able to select input rows that have a high potential to increase code coverage and mutate the selected data towards unseen, new program behavior. NATURALFUZZ's test data is more realistic than the test data generated by two baseline fuzzers (BigFuzz and Jazzer), while increasing code coverage and fault detection potential. NATURALFUZZ is the first fuzzing methodology with three benefits: (1) exclusively generate natural inputs, (2) fuzz multiple input sources simultaneously, and (3) find deeper semantics faults.

查看译文

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要