Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction
CVPR 2024
Abstract
Whole Slide Image (WSI) classification is often formulated as a Multiple
Instance Learning (MIL) problem. Recently, Vision-Language Models (VLMs) have
demonstrated remarkable performance in WSI classification. However, existing
methods leverage coarse-grained pathological descriptions for visual
representation supervision, which are insufficient to capture the complex
visual appearance of pathological images, hindering the generalizability of
models on diverse downstream tasks. Additionally, processing high-resolution
WSIs can be computationally expensive. In this paper, we propose a novel
"Fine-grained Visual-Semantic Interaction" (FiVE) framework for WSI
classification. It is designed to enhance the model's generalizability by
leveraging the interplay between localized visual patterns and fine-grained
pathological semantics. Specifically, with meticulously designed queries, we
start by utilizing a large language model to extract fine-grained pathological
descriptions from various non-standardized raw reports. The output descriptions
are then reconstructed into fine-grained labels used for training. By
introducing a Task-specific Fine-grained Semantics (TFS) module, we enable
prompts to capture crucial visual information in WSIs, which enhances
representation learning and augments generalization capabilities significantly.
Furthermore, given that pathological visual patterns are redundantly
distributed across tissue slices, we sample a subset of visual instances during
training. Our method demonstrates robust generalizability and strong
transferability, decisively outperforming its counterparts on the TCGA Lung
Cancer dataset with at least 9.19% higher accuracy in few-shot learning.