Multichannel Long-Term Streaming Neural Speech Enhancement for Static and Moving Speakers
arXiv (2024)
Abstract
In this work, we extend our previously proposed offline SpatialNet for
long-term streaming multichannel speech enhancement in both static and moving
speaker scenarios. SpatialNet exploits spatial information, such as the
spatial/steering direction of speech, to discriminate between target speech
and interferences, and has achieved outstanding performance. The core of SpatialNet
is a narrow-band self-attention module used for learning the temporal dynamics
of spatial vectors. Towards long-term streaming speech enhancement, we propose
to replace the offline self-attention network with online networks that have
linear inference complexity w.r.t. signal length while maintaining the
capability of learning long-term information. Three variants are developed
based on (i) masked self-attention, (ii) Retention, a self-attention variant
with linear inference complexity, and (iii) Mamba, a
structured-state-space-based RNN-like network. Moreover, we investigate the
length extrapolation ability of different networks, namely testing on signals that
are much longer than training signals, and propose a short-signal training plus
long-signal fine-tuning strategy, which largely improves the length
extrapolation ability of the networks within limited training time. Overall,
the proposed online SpatialNet achieves outstanding speech enhancement
performance for long audio streams, and for both static and moving speakers.
The proposed method will be open-sourced at
https://github.com/Audio-WestlakeU/NBSS.
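The key property claimed for variants such as Retention is that the parallel (attention-like) form used in training has an exactly equivalent recurrent form whose per-step cost is constant, so inference is linear in signal length. A minimal NumPy sketch of that equivalence (single head, illustrative dimensions and decay factor; this is a generic Retention-style recurrence, not the SpatialNet implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4          # sequence length, feature dimension (illustrative)
gamma = 0.9          # exponential decay factor (illustrative)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Parallel form: O = (Q K^T * D) V, with causal decay mask D[i, j] = gamma^(i-j) for i >= j.
idx = np.arange(T)
D = np.where(idx[:, None] >= idx[None, :],
             gamma ** (idx[:, None] - idx[None, :]), 0.0)
O_parallel = (Q @ K.T * D) @ V

# Recurrent form: a single d x d state updated per step -> O(1) per frame, O(T) overall.
S = np.zeros((d, d))
O_recurrent = np.empty((T, d))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])  # decay old state, accumulate new key-value outer product
    O_recurrent[t] = Q[t] @ S             # read out with the current query

assert np.allclose(O_parallel, O_recurrent)
```

The assertion holds because row t of the masked product is the decayed sum over past key-value pairs, which is exactly what the running state S accumulates; this is what makes such networks usable for long-term streaming, in contrast to full self-attention whose per-step cost grows with the context.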