Event Specific Attention for Polyphonic Sound Event Detection.

Interspeech (2021)

Abstract
The concept of multi-head self-attention (MHSA), introduced as a critical building block of the Transformer encoder/decoder, has made a significant impact in natural language processing (NLP), automatic speech recognition (ASR) and, recently, sound event detection (SED). The current state-of-the-art approaches to SED employ a shared attention mechanism, realized as a stack of MHSA blocks, to detect multiple sound events. Consequently, in a multi-label SED task, a common attention mechanism is responsible for generating relevant feature representations for every event to be detected. In this paper, we show through empirical evaluation that dedicating MHSA blocks to individual events, rather than using a stack of shared MHSA blocks, improves overall detection performance. Interestingly, this improvement arises because the event-specific attention blocks help resolve confusions between co-occurring events. The proposed "Event-specific Attention Network" (ESA-Net) can be trained in an end-to-end manner. On the DCASE 2020 Task 4 dataset, we show that with ESA-Net the best single model achieves an event-based F1 score of 52.1% on the public validation set, improving over the existing state-of-the-art result.
Keywords
Multi-Head Self-Attention, Transformer, Relative Positional Encoding, Sound Event Detection, DCASE