What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
arXiv (2024)
Abstract
Vision-language (VL) models, pretrained on colossal image-text datasets, have
attained broad VL competence that is difficult to evaluate. A common belief is
that a small number of VL skills underlie the variety of VL tests. In this
paper, we perform a large-scale transfer learning experiment aimed at
discovering latent VL skills from data. We reveal interesting characteristics
that have important implications for test suite design. First, generation tasks
suffer from a length bias, suggesting benchmarks should balance tasks with
varying output lengths. Second, we demonstrate that factor analysis
successfully identifies reasonable yet surprising VL skill factors, suggesting
benchmarks could leverage similar analyses for task selection. Finally, we
present a new dataset, OLIVE (https://github.com/jq-zh/olive-dataset), which
simulates user instructions in the wild and presents challenges dissimilar to
all datasets we tested. Our findings contribute to the design of balanced and
broad-coverage vision-language evaluation methods.