TextMachina: Seamless Generation of Machine-Generated Text Datasets
CoRR (2024)
Abstract
Recent advancements in Large Language Models (LLMs) have led to high-quality
Machine-Generated Text (MGT), giving rise to countless new use cases and
applications. However, easy access to LLMs is posing new challenges due to
misuse. To address malicious usage, researchers have released datasets to
effectively train models on MGT-related tasks. Similar strategies are used to
compile these datasets, but no tool currently unifies them. To fill this gap,
we introduce TextMachina, a modular and extensible Python framework designed
to aid in the creation of high-quality, unbiased datasets to build robust
models for MGT-related tasks such as detection, attribution, or boundary
detection. It provides a user-friendly pipeline that abstracts away the
inherent intricacies of building MGT datasets, such as LLM integrations, prompt
templating, and bias mitigation. The quality of the datasets generated by
TextMachina has been assessed in previous works, including shared tasks where
more than one hundred teams trained robust MGT detectors.