Measuring the Robustness of NLP Models to Domain Shifts
CoRR (2023)
Abstract
Existing research on Domain Robustness (DR) suffers from disparate setups,
lack of task variety, and scarce research on recent models and capabilities
such as few-shot learning. Furthermore, we claim that the common practice of
measuring DR might further obscure the picture. Current research focuses on
challenge sets and relies solely on the Source Drop (SD), which uses the source
in-domain performance as the reference point for degradation. However, the Target
Drop (TD) should be used as a complementary point of view. To understand the DR
challenge in modern NLP models, we developed a benchmark comprising seven NLP
tasks, including classification, QA, and generation. Our benchmark focuses on
natural topical domain shifts and enables measuring both the SD and the TD. Our
comprehensive study, involving over 14,000 domain shifts across 18 fine-tuned
and few-shot models, shows that both model types suffer from drops upon domain
shifts. While fine-tuned models excel in-domain, few-shot LLMs often surpass
them cross-domain, showing better robustness. In addition, we found that a
large SD can be explained by shifting to a harder domain rather than a genuine
DR challenge. Thus, the TD is a more reliable metric.
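
The relationship between the two metrics can be illustrated with a minimal sketch. The abstract defines the SD as degradation relative to the source in-domain performance; the sketch assumes, by analogy, that the TD is degradation relative to the target in-domain performance (a model trained and evaluated on the target domain). The function names and all scores below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def source_drop(source_in_domain: float, cross_domain: float) -> float:
    """SD: source in-domain score minus the cross-domain score."""
    return source_in_domain - cross_domain


def target_drop(target_in_domain: float, cross_domain: float) -> float:
    """TD (assumed definition): target in-domain score minus the cross-domain score."""
    return target_in_domain - cross_domain


# Hypothetical scores for one source -> target shift (not from the paper).
source_in_domain = 90.0  # trained on source, tested on source
target_in_domain = 80.0  # trained on target, tested on target
cross_domain = 78.0      # trained on source, tested on target

sd = source_drop(source_in_domain, cross_domain)
td = target_drop(target_in_domain, cross_domain)

print(f"SD = {sd:.1f}, TD = {td:.1f}")
```

With these made-up numbers the SD is large while the TD is small: most of the apparent degradation comes from the target domain being harder in itself, which is the situation the abstract describes when it argues that a large SD need not indicate a genuine DR challenge.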