Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

David Harwath
Dídac Surís
Galen Chuang

International Journal of Computer Vision, pp. 620-641, 2020.


Abstract:

In this paper, we explore neural network models that learn to associate segments of spoken audio captions with the semantically relevant portions of natural images that they refer to. We demonstrate that these audio-visual associative localizations emerge from network-internal representations learned as a by-product of training to perform...
