What Makes Training Multi-Modal Networks Hard?
arXiv: Computer Vision and Pattern Recognition, 2019.
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the mu...