4D Multimodal Speaker Model for Remote Speech Diagnosis

IEEE Access (2022)

Abstract
This paper presents the concept of a 4D multimodal speaker model (4D-MSM) for asynchronous remote speech diagnosis. Recording and archiving diagnostically significant articulation material remain an issue in computer-aided speech diagnosis. We therefore propose a workflow for preparing and storing reliable, easily interpretable multimodal pronunciation data. According to our assumptions, data acquisition should be non-invasive, comfortable for both the patient and the therapist, should not interfere with the articulation process, and should provide essential data of high quality. We developed and employed a dedicated device that obtains a 15-channel spatially distributed audio signal and a stable stereovision stream from two cameras focused on the lower part of the face. Our data-preprocessing framework covers digital beamforming of the multichannel audio signal, audio-video synchronization, and segmentation of words in the audio signal. We then use the stereo data to calculate and adjust the depth map and prepare point clouds. Simultaneously, we delineate the mouth in video frames using a dedicated semi-automated segmentation algorithm. The point clouds are then textured with the camera images with superimposed mouth regions. Finally, we add the audio track to constitute the 4D-MSM. In the paper, we present the concept and detailed specification of the model, along with experiments that justify the methodology. The proposed 4D-MSMs may be employed in remote speech diagnosis for objectifying and archiving diagnoses, conducting asynchronous consultations, and documenting progress in therapy.
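The abstract mentions digital beamforming of the 15-channel audio signal as the first preprocessing step, without specifying the method. As an illustration only, the sketch below shows a minimal delay-and-sum beamformer in NumPy, assuming the per-channel delays are already known as integer sample offsets (in practice they would be estimated from the array geometry and the speaker's position); all names and values here are hypothetical.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Delay-and-sum beamforming: shift each channel back by its
    integer sample delay, then average across channels so the
    target source adds coherently and noise averages down."""
    n_ch, n_samp = signals.shape
    out = np.zeros(n_samp)
    for ch in range(n_ch):
        out += np.roll(signals[ch], -delays[ch])
    return out / n_ch

# Toy example: one pulse arriving at each of 15 channels with a known delay.
rng = np.random.default_rng(0)
n_ch, n_samp = 15, 256
delays = rng.integers(0, 10, size=n_ch)  # hypothetical per-channel delays
pulse = np.zeros(n_samp)
pulse[50] = 1.0
signals = np.stack([np.roll(pulse, d) for d in delays])

aligned = delay_and_sum(signals, delays)
print(aligned[50])  # channels align at the pulse position -> 1.0
```

After alignment, the pulse sums coherently to full amplitude at its original position, while uncorrelated noise on each channel would be attenuated by averaging.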
Keywords
Medical treatment, Point cloud compression, Speech processing, Mouth, Data acquisition, Cameras, Pediatrics, Articulation data acquisition, Audio-video processing, Computer-aided speech diagnosis, Remote speech diagnosis and therapy, Stereovision