DataDoc Analyzer: A Tool for Analyzing the Documentation of Scientific Datasets

Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023), 2023

Abstract
Recent public regulatory initiatives and prominent voices in the ML community have identified the need to document datasets along several dimensions to ensure the fairness and trustworthiness of machine learning systems. Accordingly, data-sharing practices in the scientific field have evolved quickly in recent years, with a growing number of research works publishing technical documentation together with the data for replicability purposes. However, this documentation is written in natural language, and its structure, content focus, and composition vary, making it challenging to analyze. We present DataDoc Analyzer, a tool for analyzing the documentation of scientific datasets by extracting the details of the main dimensions required to assess their fairness and potential biases. We believe that our tool could help improve the quality of scientific datasets, aid dataset curators during the documentation process, and support empirical studies on the overall quality of the datasets used in the ML field. The tool implements an ML pipeline that uses Large Language Models at its core for information retrieval. DataDoc is open-source, and a public demo is available online.
Keywords
Datasets, Machine learning, Fairness, Reverse Engineering, Large Language Models, Explainability