Fusing Multi-Level Features from Audio and Contextual Sentence Embedding from Text for Interview-Based Depression Detection

Junqi Xue, Ruihan Qin, Xinxu Zhou, Honghai Liu, Min Zhang, Zhiguo Zhang

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

Abstract
Automatic depression detection based on audio and text representations from participants' interviews has attracted widespread attention. However, most previous studies used only one type of feature from a single modality for depression detection, so the rich information in the audio and text of interviews has not been fully exploited. Moreover, an effective multi-modal fusion approach that leverages the independence between audio and text representations is still lacking. To address these problems, we propose a multi-modal fusion depression detection model based on the interaction of multi-level audio features and text sentence embeddings. Specifically, we first extract Low-Level Descriptors (LLDs), mel-spectrogram features, and wav2vec features from the audio. We then design a Multi-level Audio Features Interaction Module (MAFIM) to fuse these three levels of features into a comprehensive audio representation. For the interview text, we use pre-trained BERT to extract sentence-level embeddings. Further, to effectively fuse the audio and text representations, we design a Channel Attention-based Multi-modal Fusion Module (CAMFM) that accounts for both the independence and the correlation between the two modalities. Our proposed model outperforms existing methods on two datasets, DAIC-WOZ and EATD-Corpus, so it has high potential for practical interview-based depression detection.
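To make the fusion idea concrete, the sketch below shows one plausible way a channel-attention-based fusion of an audio representation (e.g., the MAFIM output) and a BERT sentence embedding could be implemented in PyTorch. This is a hypothetical illustration, not the authors' code: the layer sizes, the squeeze-and-excitation-style gating, and all names (ChannelAttentionFusion, audio_proj, text_proj, gate) are assumptions for illustration only.

```python
# Hypothetical sketch of channel-attention-based multi-modal fusion.
# Dimensions and module names are assumed, not taken from the paper.
import torch
import torch.nn as nn


class ChannelAttentionFusion(nn.Module):
    def __init__(self, audio_dim: int = 256, text_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # Project both modalities to a common channel dimension.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Squeeze-and-excitation-style gate producing one weight per modality.
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim // 4),
            nn.ReLU(),
            nn.Linear(hidden_dim // 4, 2),
            nn.Softmax(dim=-1),
        )
        self.classifier = nn.Linear(hidden_dim, 2)  # depressed vs. non-depressed

    def forward(self, audio_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        a = self.audio_proj(audio_feat)   # (batch, hidden_dim)
        t = self.text_proj(text_feat)     # (batch, hidden_dim)
        # Channel-attention weights decide how much each modality contributes.
        weights = self.gate(torch.cat([a, t], dim=-1))  # (batch, 2)
        fused = weights[:, 0:1] * a + weights[:, 1:2] * t
        return self.classifier(fused)


# Usage: random tensors stand in for a fused audio feature and a BERT
# sentence embedding (e.g., the [CLS] vector, dimension 768).
model = ChannelAttentionFusion()
logits = model(torch.randn(4, 256), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```

The gating design here simply reweights the two modality channels before summing them; the actual CAMFM may model the independence and correlation between modalities differently.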
Keywords
Depression detection,Multi-modal fusion,Multi-level audio features,Text sentence embedding,Channel attention