Generating Speech with Prosodic Prominence based on SSL-Visually Grounded Models

Bella Septina Ika Hartanti, Dipta Tanaya, Kurniawati Azizah, Dessi Puji Lestari, Ayu Purwarianti, Sakriani Sakti

Oriental COCOSDA International Conference on Speech Database and Assessments (2023)

Abstract
Although many existing works address expressive speech synthesis with a desired prosody, few have focused on generating speech with prosodic prominence. Most previous studies on this problem generate speech from given text labels with a contrastive focus emphasizing a specific word. In contrast, this paper investigates whether prosody can be controlled based on the contrastive focus that appears in images. Given an image and its caption, our system first discovers spoken terms associated with objects or situations in natural images using a self-supervised visually grounded model. It then generates speech with prosodic prominence based on the contrastive focus of these spoken terms in a way that best describes the image. The framework can perform the task with or without text annotation, making it applicable to untranscribed, unsegmented speech utterances in unknown languages.
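The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative mock-up, not the paper's actual model: the embeddings, function names, and the `<emph>` markup are all assumptions, and the similarity scoring stands in for the self-supervised visually grounded model.

```python
# Hypothetical sketch of the abstract's pipeline: (1) discover which spoken
# segment is most associated with the image, (2) mark that word as the
# contrastive focus for a prominence-aware TTS stage. All names and values
# here are illustrative assumptions, not the paper's API.

def discover_focus(segment_embeddings, image_embedding):
    """Return the index of the speech segment whose embedding best matches
    the image embedding (stand-in for the visually grounded model)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(seg, image_embedding) for seg in segment_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

def mark_prominence(words, focus_index):
    """Annotate the contrastively focused word for the synthesis stage."""
    return [f"<emph>{w}</emph>" if i == focus_index else w
            for i, w in enumerate(words)]

# Toy example: three word-level speech-segment embeddings vs. one image.
segments = [[0.1, 0.2], [0.9, 0.8], [0.0, 0.3]]
image = [1.0, 1.0]
idx = discover_focus(segments, image)
print(" ".join(mark_prominence(["the", "red", "ball"], idx)))
# → the <emph>red</emph> ball
```

Because the focus is discovered from speech-image associations rather than from text labels, the same sketch applies even when no transcription is available.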