ExplainED: Explanations for EDA Notebooks
Proceedings of the VLDB Endowment 13, no. 12 (2020): 2917-2920
Abstract: Exploratory Data Analysis (EDA) is an essential yet highly demanding task. To get a head start before exploring a new dataset, data scientists often prefer to view existing EDA notebooks: illustrative exploratory sessions that were created by fellow data scientists who examined the same dataset and shared their notebooks via online […]
- Exploratory Data Analysis (EDA) is an important step in any data science (DS) pipeline.
- The authors demonstrate the usefulness of the explanations generated by ExplainED on real-life, undocumented EDA notebooks.
- PVLDB Reference Format: Daniel Deutch, Amir Gilad, Tova Milo, and Amit Somech. ExplainED: Explanations for EDA Notebooks. PVLDB 13(12): 2917-2920, 2020.
- An EDA notebook contains a curated summary of an EDA process, presented through a notebook interface: a literate programming environment that allows users to document a sequence of programmatic operations and their results, as well as to add free-text explanations.
- We will first present the audience with an undocumented EDA notebook, then reveal the explanations generated by ExplainED for each exploratory step.
- As explained in the sequel, ExplainED analyzes the interestingness of each EDA operation qi before producing an explanation that describes what exactly is interesting in the resulting view Vi
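The underlying data model (an initial dataset, a sequence of operations q_i, and the views V_i they produce) can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code; all names are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Row = Tuple                             # a tuple in a view
View = List[Row]
Operation = Callable[[View], View]      # q_i maps view V_{i-1} to V_i
Measure = Callable[[View], float]       # interestingness measure: view -> real number

@dataclass
class EDANotebook:
    dataset: View                       # the initial view V_0
    operations: List[Operation]         # the exploratory steps q_1..q_n

    def views(self) -> List[View]:
        """Materialize V_1..V_n by applying each q_i to the previous view."""
        out, v = [], self.dataset
        for q in self.operations:
            v = q(v)
            out.append(v)
        return out
```

Each view V_i is then scored by interestingness measures and explained independently.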
- Given a view Vi in an EDA notebook, we first assess its interestingness w.r.t. the measures defined above, derive which specific elements in the view have the highest impact on the view's interestingness score, and present them in an illustrative Natural Language (NL) template.
- ExplainED uses Shapley values to measure the contribution of each tuple to the interestingness score of the view.
- ExplainED takes as input a view from a given EDA notebook, and generates a textual explanation as follows: First, the interestingness of the view is evaluated using several measures, each corresponding to a different interestingness facet.
- Focusing on the measure that yielded the highest score, ExplainED computes the Shapley values of the top-k elements in the view w.r.t. the interestingness measure.
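The two steps above can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the authors' implementation: the measure names are invented, and the Shapley computation enumerates all orderings, so it is only feasible for small views with distinct elements:

```python
import itertools
from typing import Callable, Dict, List, Tuple

View = List[Tuple]
Measure = Callable[[View], float]       # interestingness measure I

def shapley_value(view: View, I: Measure, e: Tuple) -> float:
    """Exact Shapley value of element e: average marginal contribution of e
    over all orderings of the view (elements assumed distinct)."""
    perms = list(itertools.permutations(view))
    total = 0.0
    for perm in perms:
        prefix = list(perm[:perm.index(e)])     # elements preceding e
        total += I(prefix + [e]) - I(prefix)    # e's marginal contribution
    return total / len(perms)

def explain_view(view: View, measures: Dict[str, Measure], k: int = 3):
    # Step 1: evaluate every interestingness measure on the view.
    scores = {name: I(view) for name, I in measures.items()}
    # Step 2: focus on the measure that yielded the highest score.
    best = max(scores, key=scores.get)
    # Step 3: rank elements by Shapley value w.r.t. that measure, keep top-k.
    I = measures[best]
    top = sorted(view, key=lambda e: shapley_value(view, I, e), reverse=True)[:k]
    return best, top
```

With a flight-delays toy view, the element dominating the winning measure surfaces as the top contributor.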
- The authors define the data model for EDA notebooks and the considered interestingness measures.
- Given a view Vi, ExplainED generates an explanatory text Ei, which highlights the elements that are interesting in Vi. For example, see the generated explanations in the red frames in Figure 1.
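The final rendering step can be illustrated with a minimal NL-template filler; the template wording below is hypothetical, not taken from the paper:

```python
# Hypothetical explanation template; the paper's actual templates may differ.
TEMPLATE = ("This view is interesting mainly in terms of {measure}: "
            "the elements with the highest impact on its score are {elems}.")

def render_explanation(measure_name: str, top_elements: list) -> str:
    """Fill the NL template with the winning measure and top-impact elements."""
    elems = ", ".join(repr(e) for e in top_elements)
    return TEMPLATE.format(measure=measure_name, elems=elems)
```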
- An interestingness measure I is a function mapping each view to a real number.
- The authors formalize the definition of a Shapley value of an element in a view w.r.t. an interestingness measure as follows.
- Given an EDA view Vi generated from Vi−1 and an interestingness measure I, the Shapley value of an element e ∈ Vi is defined as: Shap(Vi, I, e) = Σ_{S ⊆ Vi\{e}} (|S|! · (|Vi| − |S| − 1)! / |Vi|!) · (I(S ∪ {e}) − I(S))
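This definition can be transcribed directly as a brute-force sketch (exponential in the view size, so for toy views only; `shap` and the additive toy measure are illustrative names, not the authors' API). A useful sanity check is the efficiency property: the Shapley values of all elements sum to I(Vi) − I(∅):

```python
import itertools
from math import factorial

def shap(view, I, e):
    """Shapley value of element e in view w.r.t. interestingness measure I,
    computed by enumerating all subsets S of view \\ {e}."""
    rest = [x for x in view if x != e]
    n = len(view)
    value = 0.0
    for r in range(len(rest) + 1):
        for S in itertools.combinations(rest, r):
            # Weight |S|! * (n - |S| - 1)! / n! from the definition above.
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            value += weight * (I(list(S) + [e]) - I(list(S)))
    return value
```

For an additive measure such as a plain sum, each element's Shapley value is exactly its own contribution, which makes the formula easy to verify by hand.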
- The authors will demonstrate the explanations that ExplainED generates for EDA notebooks and their usefulness over the Kaggle Flights dataset.
- The authors will employ ExplainED to dynamically generate an explanation for each view in the notebook, demonstrating the value of the explanations to the data analysis process.
- Technical Details: The authors will let participants look under the hood of ExplainED, showing how it selects the most relevant interestingness measures and finds the interesting tuples or groups in views based on their Shapley values.
- Various methods of explaining query results have been proposed in the literature: prominently, explanations using provenance [7, 2], interventions, influence, Shapley values, or natural language, among others. The main difference between these works and ours is that they explain which input tuples affected the output of a query, while we try to find the input tuples that make the view interesting (i.e., that most affect the view's interestingness score). There are also other tools for assisting users in composing EDA steps, for example recommending EDA next-steps or highlighting promising features to explore. However, such tools do not explain why the generated views are considered interesting.
- This research has been funded by the Israeli Science Foundation (ISF), the Binational US-Israel Science Foundation, the Tel Aviv University Data Science center, the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No 804302), and the Google Ph.D. Fellowship.
-  P. Buneman, S. Khanna, and W. Tan. Why and where: A characterization of data provenance. In ICDT, pages 316–330, 2001.
-  V. Chandola and V. Kumar. Summarization - compressing data into an informative representation. KAIS, 12(3), 2007.
-  D. Deutch, N. Frost, and A. Gilad. Provenance for natural language queries. PVLDB, 10(5):577–588, 2017.
-  L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. CSUR, 2006.
-  A. Giuzio, G. Mecca, E. Quintarelli, M. Roveri, D. Santoro, and L. Tanca. Indiana: An interactive system for assisting database exploration. Information Systems, 83:40–56, 2019.
-  T. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. In PODS, pages 31–40, 2007.
-  M. B. Kery, M. Radensky, M. Arya, B. E. John, and B. A. Myers. The story in the notebook: Exploratory data science using a literate programming tool. In CHI, 2018.
-  E. Livshits, L. E. Bertossi, B. Kimelfeld, and M. Sebag. The shapley value of tuples in query answering. In ICDT, pages 20:1–20:19, 2020.
-  S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In NIPS. 2017.
-  T. Milo, C. Ozeri, and A. Somech. Predicting "what is interesting" by mining interactive-data-analysis session logs. In EDBT, 2019.
-  T. Milo and A. Somech. Next-step suggestions for modern interactive data analysis platforms. In KDD, 2018.
-  S. Roy and D. Suciu. A formal approach to finding explanations for database queries. In SIGMOD, pages 1579–1590, 2014.
-  E. Strumbelj and I. Kononenko. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst., 41(3):647–665, 2014.
-  E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553–564, 2013.