Learning Using Privileged Information for Food Recognition

Proceedings of the 27th ACM International Conference on Multimedia (2019)

Abstract
Food recognition for user-uploaded images is crucial in visual diet tracking, an emerging application linking the multimedia and healthcare domains. It is challenging, however, because of the wide variation in the visual appearance of food images, caused by differing shooting conditions such as angle, distance, lighting, food containers, and background scenes. To alleviate this semantic gap, this paper presents a cross-modal alignment and transfer network (ATNet), motivated by the paradigm of learning using privileged information (LUPI). ATNet additionally utilizes the ingredients in food images as an "intelligent teacher" during training to facilitate cross-modal information passing. Specifically, ATNet first uses a pair of synchronized autoencoders to build the base image and ingredient channels for information flow. Information passing is then enabled through a two-stage cross-modal interaction. The first stage adopts a two-step method, called partial heterogeneous transfer, to 1) alleviate the intrinsic heterogeneity between images and ingredients and 2) align them in a shared space so that the information each carries about food classes can interact. In the second stage, ATNet learns to map the visual embeddings of images into the ingredient channel, performing food recognition from the "teacher" view. This leads to refined recognition through multi-view fusion. Experiments on two real-world datasets show that ATNet can be combined with state-of-the-art CNN models to consistently improve their performance.
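The abstract describes the architecture only at a high level, so the following PyTorch sketch is a hypothetical rendering of the two-channel, two-stage design it outlines: paired autoencoders for the image and ingredient channels, a shared space for alignment, a learned map from visual embeddings into the ingredient ("teacher") channel, and multi-view fusion of the two classification views. All module names, dimensions, loss terms, and the averaging fusion are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the ATNet idea from the abstract (not the authors' code).
# Dimensions, losses, and fusion strategy are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    """One channel of the paired autoencoders (image or ingredient)."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

class ATNetSketch(nn.Module):
    def __init__(self, img_dim=2048, ingr_dim=300, latent_dim=512, n_classes=101):
        super().__init__()
        # Base channels: synchronized autoencoders for each modality.
        self.img_ae = Autoencoder(img_dim, latent_dim)
        self.ingr_ae = Autoencoder(ingr_dim, latent_dim)
        # Stage 1: project both latents into a shared space where the
        # class information carried by each modality can interact.
        self.img_to_shared = nn.Linear(latent_dim, latent_dim)
        self.ingr_to_shared = nn.Linear(latent_dim, latent_dim)
        # Stage 2: map visual embeddings into the ingredient ("teacher") channel.
        self.img_to_ingr = nn.Linear(latent_dim, latent_dim)
        # Two classification views, fused at inference.
        self.student_head = nn.Linear(latent_dim, n_classes)
        self.teacher_head = nn.Linear(latent_dim, n_classes)

    def forward(self, img_feat, ingr_feat=None):
        z_img, img_rec = self.img_ae(img_feat)
        student_logits = self.student_head(self.img_to_shared(z_img))
        # Teacher view: the visual embedding routed through the ingredient
        # channel, so it remains available at test time without ingredients.
        teacher_logits = self.teacher_head(self.img_to_ingr(z_img))
        fused = (student_logits + teacher_logits) / 2  # multi-view fusion (assumed: simple average)
        aux = {}
        if ingr_feat is not None:  # privileged information, training only
            z_ingr, ingr_rec = self.ingr_ae(ingr_feat)
            # Alignment loss: pull the two modalities together in shared space.
            aux["align"] = F.mse_loss(self.img_to_shared(z_img),
                                      self.ingr_to_shared(z_ingr))
            aux["rec"] = (F.mse_loss(img_rec, img_feat)
                          + F.mse_loss(ingr_rec, ingr_feat))
        return fused, aux
```

At test time only `img_feat` is supplied, matching the LUPI setting in which ingredient annotations exist during training but not at deployment; the image backbone producing `img_feat` can be any CNN, consistent with the abstract's claim that ATNet works with off-the-shelf models.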
Keywords
cross-modal fusion, food recognition, heterogeneous feature alignment, learning using privileged information