What Makes Training Multi-Modal Classification Networks Hard?

Weiyao Wang

CVPR, pp. 12692-12702, 2020.

Keywords:
deep network, uni-modal, video action recognition, Low-Rank Multi-Modal Fusion, multi-modal model

Abstract:

Consider end-to-end training of a multi-modal vs. a uni-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its uni-modal counterpart. In our experiments, however, we observe the opposite: the best uni-modal network can outperform the multi-modal network. […]

Introduction
  • Consider a late-fusion multi-modal network, trained end-to-end to solve a task.
  • The performance drop with multiple input streams appears to be common and was noted in [24, 3, 38, 44].
  • This phenomenon warrants investigation and a solution.
Highlights
  • Consider a late-fusion multi-modal network, trained end-to-end to solve a task
  • Uni-modal solutions are a strict subset of the solutions available to the multi-modal network; a well-optimized multi-modal model should, in theory, always outperform the best uni-modal model
  • Our contributions in this paper include: We empirically demonstrate the significance of overfitting in joint training of multi-modal networks, and we identify two causes for the problem
  • We show the problem is architecture-agnostic: different fusion techniques can suffer from the same overfitting problem. We propose a metric to understand the problem quantitatively, the overfitting-to-generalization ratio (OGR), with both theoretical and empirical justification. We also propose a new training scheme which minimizes OGR via an optimal blend of multiple supervision signals
  • Each modality m_i is processed by a different deep network φ_{m_i} with parameters Θ_{m_i}, and their features are fused and passed to a classifier C (a sketch follows this list)
  • G-Blend provides a 1.3% improvement over the RGB model with the same backbone architecture, ip-CSN-152 [46], when both models are trained from scratch
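To make the highlights concrete, here is a minimal late-fusion sketch in PyTorch: each modality m_i gets its own encoder φ_{m_i}, the features are concatenated and passed to a fused classifier C, and each modality additionally keeps an auxiliary classification head so that training can blend several supervision signals with per-head weights. This is an illustrative sketch only; the class and helper names are ours, and the paper's implementation details (written in Caffe2) are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusionGBlend(nn.Module):
    """Late-fusion multi-modal classifier with auxiliary per-modality heads.

    Each modality i is processed by its own encoder phi_i; the features are
    concatenated and classified by a fused head C. The auxiliary heads exist
    so that training can blend several supervision signals.
    """

    def __init__(self, encoders, feat_dims, num_classes):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)            # one phi_i per modality
        self.uni_heads = nn.ModuleList(
            nn.Linear(d, num_classes) for d in feat_dims   # per-modality classifiers
        )
        self.fused_head = nn.Linear(sum(feat_dims), num_classes)  # classifier C

    def forward(self, inputs):
        # inputs: list of tensors, one per modality, aligned with self.encoders
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]
        uni_logits = [head(f) for head, f in zip(self.uni_heads, feats)]
        fused_logits = self.fused_head(torch.cat(feats, dim=1))
        return uni_logits, fused_logits


def blended_loss(uni_logits, fused_logits, target, weights):
    """Weighted blend of supervision signals:
    L = w_fused * L_fused + sum_i w_i * L_i  (the fused weight is weights[-1])."""
    loss = weights[-1] * F.cross_entropy(fused_logits, target)
    for w, logits in zip(weights[:-1], uni_logits):
        loss = loss + w * F.cross_entropy(logits, target)
    return loss
```

At inference time only the fused head would be used; the per-modality heads act purely as training-time supervision whose relative weights the blending procedure chooses.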
Methods
  • On Kinetics with the RGB-audio setting, online Gradient-Blending surpasses both the uni-modal and the naive multi-modal baselines, by 3.2% and 4.1% respectively.
  • On a different fusion architecture with Low-Rank Multi-Modal Fusion (LMF) [35], Gradient-Blending gives a 4.2% improvement.
  • This suggests Gradient-Blending can be applied to fusion strategies other than late fusion and to fusion architectures other than concatenation (a weight-estimation sketch follows this list)
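How might the blending weights be chosen? The highlights define the overfitting-to-generalization ratio (OGR); the sketch below shows one hedged way to turn per-head train/validation loss curves into weights, measuring ΔG_k (drop in validation loss) and ΔO_k (growth of the train-validation gap) between two checkpoints and setting w_k ∝ ΔG_k / ΔO_k². The exact estimator, checkpoint spacing, and normalization are assumptions for illustration, not quoted from this summary.

```python
def estimate_gblend_weights(train_losses, val_losses, eps=1e-8):
    """Turn per-head loss curves into blending weights (illustrative only).

    train_losses / val_losses: dicts mapping a head name (e.g. 'rgb', 'audio',
    'fused') to a pair (loss_at_checkpoint_N, loss_at_checkpoint_N_plus_n).
    Assumed estimator: w_k proportional to delta_G_k / (delta_O_k)^2, where
      delta_G_k = drop in validation loss       (generalization gained)
      delta_O_k = growth of the train-val gap   (overfitting accumulated),
    so heads that generalize without overfitting receive larger weights.
    """
    raw = {}
    for head, (v0, v1) in val_losses.items():
        t0, t1 = train_losses[head]
        delta_g = v0 - v1                       # > 0 if validation loss improved
        delta_o = (v1 - t1) - (v0 - t0)         # change in the overfitting gap
        raw[head] = max(delta_g, 0.0) / (delta_o ** 2 + eps)
    z = sum(raw.values()) + eps                 # normalize so the weights sum to 1
    return {head: w / z for head, w in raw.items()}


# Toy example (made-up numbers): audio overfits quickly, so it gets a small weight.
# estimate_gblend_weights(
#     train_losses={"rgb": (2.0, 1.6), "audio": (2.0, 0.8), "fused": (2.0, 1.2)},
#     val_losses={"rgb": (2.1, 1.8), "audio": (2.1, 2.0), "fused": (2.1, 1.7)},
# )
```

Offline G-Blend would compute such weights once from a short probe run and keep them fixed; online G-Blend would recompute them every few epochs as the overfitting behavior changes.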
Results
  • G-Blend provides a 1.3% improvement over the RGB model with the same backbone architecture, ip-CSN-152 [46], when both models are trained from scratch.
  • G-Blend outperforms the state-of-the-art multi-modal baseline, Shift-Attention Network [10], by 1.4% while using fewer modalities and no pre-training.
  • It is on par with SlowFast [17] while being 2x faster.
  • With weakly-supervised pre-training on IG-65M [23] for the visual stream, G-Blend achieves an unparalleled 83.3% top-1 accuracy and 96.0% top-5 accuracy
Conclusion
  • In uni-modal networks, diagnosing and correcting overfitting typically involves manual inspection of learning curves.
  • The authors have shown that for multi-modal networks it is essential to measure and correct overfitting in a principled way, and they put forth a useful and practical measure of overfitting.
  • Their method, Gradient-Blending, uses this measure to obtain significant improvements over baselines, and it either outperforms or is comparable with state-of-the-art methods on multiple tasks and benchmarks.
  • The method potentially applies broadly to end-to-end training of ensemble models, and the authors look forward to extending G-Blend to other fields where calibrating multiple losses is needed, such as multi-task learning
Tables
  • Table1: Uni-modal networks consistently outperform multi-modal networks. Best uni-modal networks vs. late-fusion multi-modal networks on Kinetics, using video top-1 validation accuracy. Single-stream modalities include video clips (RGB), optical flow (OF), and audio (A). Multi-modal networks use the same architectures as the uni-modal ones, with late fusion by concatenation at the last layer before prediction
  • Table2: Both offline and online Gradient-Blending outperform naive late fusion and RGB only. Offline G-Blend is slightly less accurate than the online version, but much simpler to implement (see the training-loop sketch after this list)
  • Table3: G-Blend on different optimizers. We compare G-Blend with Visual-Only and Naive AV on two additional optimizers, AdaGrad and Adam. G-Blend consistently outperforms the Visual-Only and Naive AV baselines with all three optimizers
  • Table4: Gradient-Blending (G-Blend) works on different multi-modal problems. Comparison of G-Blend with naive late fusion and the single best modality on Kinetics. On all 4 combinations of different modalities, G-Blend outperforms both the naive late-fusion network and the best uni-modal network by large margins, and it also works for cases with more than two modalities. G-Blend results are averaged over three runs with different initializations; variances are small and are provided in the supplementary material. On mini-AudioSet, G-Blend is only comparable to the auxiliary-loss baseline; the reason is that the weights learned by Gradient-Blending are very similar to equal weights. The failures of the auxiliary loss on Kinetics and mini-Sports demonstrate that the weights used in G-Blend are indeed important. We note that for mini-AudioSet, even though the naively trained multi-modal baseline is better than the uni-modal baseline, Gradient-Blending still improves by finding more generalized information. We also experimented with other multi-task techniques, such as treating the weights as learnable parameters [30]. However, this approach converges to a similar result as naive joint training, because it lacks an overfitting prior: the learnable weights become biased towards the head with the lowest training loss, which is audio-RGB
  • Table5: G-Blend outperforms all baseline methods on different benchmarks and tasks. Comparison of G-Blend with different regularization baselines as well as uni-modal networks on Kinetics, mini-Sports, and mini-AudioSet. G-Blend consistently outperforms the other methods, except for being comparable with the auxiliary loss on mini-AudioSet, due to the similarity between the weights learned by G-Blend and equal weights
  • Table6: Comparison with state-of-the-art methods on Kinetics. G-Blend outperforms or is on par with state-of-the-art methods
  • Table7: Comparison with state-of-the-art methods on AudioSet. G-Blend outperforms the state-of-the-art methods by a large margin
  • Table8: Comparison with state-of-the-art methods on EPIC-Kitchens. G-Blend achieves 2nd place on the seen-kitchen challenge and 4th place on the unseen one, despite using fewer modalities, fewer backbones, and a single model, in contrast to the model ensembles of the published results on the leaderboard
  • Table9: Last row of Table 3 in the main paper, with variance. Results are averaged over three runs with random initialization; ± indicates variance
  • Table10: Multi-modal networks have lower validation accuracy but higher train accuracy
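The offline/online distinction in Table 2 comes down to how often the weights are re-estimated. The loop below is a hedged sketch of the online variant, reusing the hypothetical LateFusionGBlend, blended_loss, and estimate_gblend_weights helpers sketched earlier; probe_losses (a short run that records per-head train/validation losses at two checkpoints), the head names, and the re-weighting interval are illustrative assumptions rather than the paper's exact schedule.

```python
def train_online_gblend(model, loaders, optimizer, num_epochs, reweight_every=5):
    """Online G-Blend sketch: periodically refresh the blending weights.

    Setting reweight_every >= num_epochs degenerates to the offline variant
    (a single weight estimate used for the whole run).
    """
    heads = ["rgb", "audio", "fused"]              # example modality heads
    weights = [1.0 / len(heads)] * len(heads)      # start from an equal blend
    for epoch in range(num_epochs):
        if epoch % reweight_every == 0:
            # probe_losses is a hypothetical helper: a short run recording
            # per-head train/val losses at two checkpoints.
            train_hist, val_hist = probe_losses(model, loaders)
            w = estimate_gblend_weights(train_hist, val_hist)
            weights = [w[h] for h in heads]        # fused weight stays last
        for inputs, target in loaders["train"]:
            uni_logits, fused_logits = model(inputs)
            loss = blended_loss(uni_logits, fused_logits, target, weights)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

With reweight_every set at or above num_epochs, the same loop reproduces the offline behavior of a single fixed blend estimated at the start.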
Related work
  • Video classification. Video understanding has been one of the most active research areas in computer vision recently. Videos have two distinctive properties: temporal information and multi-modality. Previous works have made significant progress in understanding temporal information [27, 45, 50, 40, 47, 55, 17]. However, videos are also rich in multiple modalities: RGB frames, motion vectors (optical flow), and audio. Previous works that exploit this multi-modal nature primarily focus on …
Funding
  • Identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to their increased capacity; second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal
  • Addresses these two problems with a technique called Gradient-Blending, which computes an optimal blending of modalities based on their overfitting behaviors
  • Demonstrates that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection
  • Demonstrates the significance of overfitting in joint training of multi-modal networks, and identifies two causes for the problem
  • Shows the problem is architecture-agnostic: different fusion techniques can suffer from the same overfitting problem. Proposes a metric to understand the problem quantitatively, the overfitting-to-generalization ratio (OGR), with both theoretical and empirical justification. Proposes a new training scheme which minimizes OGR via an optimal blend of multiple supervision signals
Reference
  • Combining correlated unbiased estimators of the mean of a normal distribution. https://projecteuclid.org/download/pdf_1/euclid.lnms/1196285392.4
  • Epic-kitchens action recognition. https://competitions.codalab.org/competitions/20115. Accessed: 2019-11-13. 8
  • H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. B. amd Tim K. Marks, C. Hori, P. Anderson, S. Lee, and D. Parikh. Audio-visual scene-aware dialog. In CVPR, 2019. 1
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh. VQA: Visual Question Answering. In ICCV, 2015. 2
  • R. Arandjelović and A. Zisserman. Look, listen and learn. In ICCV, 2017. 2
  • J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F. A. González. Gated multimodal units for information fusion. In ICLR Workshop, 2017. 2
  • T. Baltrušaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:423–443, 2018. 2
  • F. Baradel, N. Neverova, C. Wolf, J. Mille, and G. Mori. Object level visual reasoning in videos. In ECCV, 2017
  • R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Int. Res., 55(1):409–442, Jan. 2016. 2
  • Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. CoRR, abs/1708.03805, 2017. 2, 7, 8
  • Caffe2-Team. Caffe2: A new lightweight, modular, and scalable deep learning framework. https://caffe2.ai/. 5
  • J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017. 2
  • Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018. 2
  • D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018. 7
  • J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011. 6
  • D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. ICCV, 2015. 2
  • C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. In ICCV, 2019. 2, 7, 8
  • C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016. 2
  • C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016. 2
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A. Ranzato, and T. Mikolov. Devise: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS 26, pages 2121–2129. Curran Associates, Inc., 2013. 2
  • A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP, 2016. 2
  • J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017. 5
  • D. Ghadiyaram, M. Feiszli, D. Tran, X. Yan, H. Wang, and D. K. Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In CVPR, 2019. 7
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017. 1, 2
  • K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 5
  • J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018. 1, 14
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014. 2, 5
  • W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017. 5
  • E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In ICCV, 2019. 7, 8
  • A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018. 2, 7
  • D. Kiela, E. Grave, A. Joulin, and T. Mikolov. Efficient large-scale multi-modal classification. In AAAI, 2018. 1, 2
  • D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 12 2014. 6
  • I. Kokkinos. Ubernet: Training a ‘universal’ convolutional neural network for low-, mid-, and highlevel vision using diverse datasets and limited memory. CVPR, 2017. 2
  • B. Korbar, D. Tran, and L. Torresani. Cooperative learning of audio and video models from selfsupervised synchronization. In NeurIPS, 2018. 2
  • Z. Liu, Y. Shen, V. Lakshminarasimhan, P. Liang, A. Zadeh, and L.-P. Morency. Efficient low-rank multimodal fusion with modality-specific factors. In ACL, pages 2247–2256, 2018. 6
  • P. Nakkiran, G. Kaplun, D. Kalimeris, T. Yang, B. L. Edelman, F. Zhang, and B. Barak. Sgd on neural networks learns functions of increasing complexity. In NeurIPS, 2019. 5
  • A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In The European Conference on Computer Vision (ECCV), September 2018. 1, 2, 6, 14
  • A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme. Hypothesis only baselines in natural language inference. pages 180–191, 2018. 1
  • C. R. Qi, X. Chen, O. Litany, and L. J. Guibas. Imvotenet: Boosting 3d object detection in point clouds with image votes. In CVPR, 2020. 2
  • Z. Qiu, T. Yao,, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017. 2
  • K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014. 2
  • R. Socher, M. Ganjoo, C. D. Manning, and A. Y. Ng. Zero-shot learning through cross-modal transfer. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS’13, pages 935–943, USA, 2013. Curran Associates Inc. 2
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, Jan. 2014. 1, 6
  • J. Thomason, D. Gordan, and Y. Bisk. Shifting the baseline: Single modality performance on visual navigation & qa. In NAACL, 11 2018. 1
  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015. 2
  • D. Tran, H. Wang, L. Torresani, and M. Feiszli. Video classification with channel-separated convolutional networks. In ICCV, 2019. 5, 7, 8
  • D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018. 2, 5, 7
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017. 15
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016. 2
  • X. Wang, A. Farhadi, and A. Gupta. Actionstransformations. In CVPR, 2016. 2
  • X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018. 1, 7, 15
  • X. Wang, Y. Wu, L. Zhu, and Y. Yang. Baidu-uts submission to the epic-kitchens action recognition challenge 2019. arXiv preprint arXiv:1906.09383, 2019. 8
  • Y. Wang, J. Li, and F. Metze. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. arXiv preprint arXiv:1810.09050, 2018. 8
  • J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Volume Three, IJCAI’11, pages 2764–2770. AAAI Press, 2011. 2
  • S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. In ECCV, 2018. 2
  • C. Yu, K. S. Barsim, Q. Kong, and B. Yang. Multi-level attention model for weakly supervised audio classification. arXiv preprint arXiv:1803.02353, 2018. 8
  • J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4694–4702, 2015. 2
  • P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and Yang: Balancing and answering binary visual questions. In CVPR, 2016. 2
  • H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In ECCV, 2018. 2
  • Supplementary note: Similar to the SE-gate, the NL-Gate can be added in multiple directions and at multiple positions. We found that it works best when added after block 4, with a 2-D concatenation of audio and RGB features as Key-Value and visual features as Query to gate the visual stream.