Text-Guided Neural Network Training for Image Recognition in Natural Scenes and Medicine

IEEE Transactions on Pattern Analysis and Machine Intelligence(2021)

引用 25|浏览149
暂无评分
摘要
Convolutional neural networks (CNNs) are widely recognized as the foundation for machine vision systems. The conventional rule of teaching CNNs to understand images requires training images with human annotated labels, without any additional instructions. In this article, we look into a new scope and explore the guidance from text for neural network training. We present two versions of attention mechanisms to facilitate interactions between visual and semantic information and encourage CNNs to effectively distill visual features by leveraging semantic features. In contrast to dedicated text-image joint embedding methods, our method realizes asynchronous training and inference behavior: a trained model can classify images, irrespective of the text availability. This characteristic substantially improves the model scalability to multiple (multimodal) vision tasks. We also apply the proposed method onto medical imaging, which learns from richer clinical knowledge and achieves attention-based interpretable decision-making. With comprehensive validation on two natural and two medical datasets, we demonstrate that our method can effectively make use of semantic knowledge to improve CNN performance. Our method performs substantial improvement on medical image datasets. Meanwhile, it achieves promising performance for multi-label image classification and caption-image retrieval as well as excellent performance for phrase-based and multi-object localization on public benchmarks.
更多
查看译文
关键词
Algorithms,Diagnostic Imaging,Humans,Neural Networks, Computer
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要