Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
We demonstrated that these units are far more robust to noise and domain shift than units derived from previously proposed models. These results supported the notion that semantic supervision via a discriminative, multimodal grounding objective has the potential to be more powerf...
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speec...
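To make the idea of a vector quantization layer concrete, the following is a minimal sketch of the quantization step alone: each continuous frame vector is replaced by its nearest entry in a codebook, and the index of that entry serves as a discrete unit. This is not the paper's implementation. The function name `quantize`, the toy codebook, and the toy frames are all hypothetical, and everything needed to train such a layer inside a neural model is omitted.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame vector to its nearest codebook entry (Euclidean distance).

    frames:   array of shape (T, D), one continuous feature vector per frame
    codebook: array of shape (K, D), one vector per discrete unit
    Returns the unit index for each frame and the quantized vectors.
    """
    # Pairwise differences via broadcasting: (T, 1, D) - (1, K, D) -> (T, K, D)
    diffs = frames[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distance from every frame to every code: (T, K)
    dists = np.sum(diffs ** 2, axis=-1)
    # Pick the closest code for each frame
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Hypothetical toy data: 3 codes and 4 frames in a 2-dimensional feature space
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
frames = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]])

unit_ids, quantized = quantize(frames, codebook)
print(unit_ids.tolist())  # [0, 1, 2, 0]
```

In this toy example the first and last frames map to the same unit, which illustrates how nearby continuous representations collapse onto a shared discrete symbol.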