SpanSeq: Similarity-based sequence data splitting method for improved development and assessment of deep learning projects
CoRR(2024)
摘要
The use of deep learning models in computational biology has increased
massively in recent years, and is expected to do so further with the current
advances in fields like Natural Language Processing. These models, although
able to draw complex relations between input and target, are also largely
inclined to learn noisy deviations from the pool of data used during their
development. In order to assess their performance on unseen data (their
capacity to generalize), it is common to randomly split the available data in
development (train/validation) and test sets. This procedure, although
standard, has lately been shown to produce dubious assessments of
generalization due to the existing similarity between samples in the databases
used. In this work, we present SpanSeq, a database partition method for machine
learning that can scale to most biological sequences (genes, proteins and
genomes) in order to avoid data leakage between sets. We also explore the
effect of not restraining similarity between sets by reproducing the
development of the state-of-the-art model DeepLoc, not only confirming the
consequences of randomly splitting databases on the model assessment, but
expanding those repercussions to the model development. SpanSeq is available
for downloading and installing at
https://github.com/genomicepidemiology/SpanSeq.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要