Estimating Class Separability of Datasets Using Persistent Homology with Application to LLM Fine-Tuning

Najah F. Ghalyan, Kostis Gourgoulias, Yash Satsangi, Seán Moran, Maxime Labonne, Joseph Sabelja

arXiv (Cornell University) (2023)

Abstract

This paper proposes a method to estimate the class separability of an unlabeled text dataset by inspecting the topological characteristics of sentence-transformer embeddings of the text. Experiments cover both binary and multi-class cases, in balanced and imbalanced scenarios. The results show a clear correlation and better consistency between the proposed method and other separability and classification metrics, such as Thornton's method and the AUC score of a logistic regression classifier, as well as unsupervised methods. Finally, we show empirically that the proposed method can be part of a stopping criterion for fine-tuning language-model classifiers. By monitoring the class separability of the embedding space after each training iteration, we can detect, without using additional labels, when training stops improving the separability of the embeddings.
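
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify which topological statistic or which stopping rule the authors use, so everything below is an assumption for illustration: `h0_death_times` computes 0-dimensional persistent homology of a Vietoris-Rips filtration (equivalent to the edge lengths of a minimum spanning tree), `separability_score` is a hypothetical summary statistic, and `should_stop` is a hypothetical patience-based rule. Plain coordinate tuples stand in for sentence-transformer embeddings.

```python
import math
from itertools import combinations


def h0_death_times(points):
    """Death times of 0-dimensional persistent homology (Vietoris-Rips).

    Every point is born as its own component at scale 0. A component dies
    when an edge merges it into another one. These death times are the edge
    lengths of a minimum spanning tree, found here with Kruskal's algorithm.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise Euclidean distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components: one of them dies
            parent[ri] = rj
            deaths.append(d)
    return deaths  # n - 1 finite values, ascending; one component never dies


def separability_score(points):
    """Hypothetical statistic (an assumption, not the paper's definition):
    longest finite H0 lifetime divided by the (upper) median lifetime."""
    deaths = h0_death_times(points)
    if not deaths:
        return 0.0
    median = deaths[len(deaths) // 2]
    return deaths[-1] / median if median > 0 else math.inf


def should_stop(scores, patience=2, min_delta=1e-3):
    """Hypothetical stopping rule (an assumption): stop once none of the last
    `patience` scores beats the best earlier score by at least `min_delta`."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return all(s < best_before + min_delta for s in scores[-patience:])
```

On two well-separated clusters, the largest death time is the gap between the clusters, so the score is large; on a single evenly spaced cluster, all death times are equal and the score is 1. In a fine-tuning loop, one would append the score computed on the current embeddings after each iteration and call `should_stop` on the resulting list; no labels are needed at any point.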

Keywords

estimating class separability, datasets, unsupervised method, fine-tuning
