Neural Architecture Search on Acoustic Scene Classification

Li Jixiang, Liang Chuming, Wang Zhao, Xiang Fei
Keywords:
NSGA-II, gated recurrent unit, efficient network, ASC task, neural network

Abstract:

Convolutional neural networks are widely adopted in Acoustic Scene Classification (ASC) tasks, but they generally carry a heavy computational burden. In this work, we propose a lightweight yet high-performing baseline network inspired by MobileNetV2, which replaces square convolutional kernels with unidirectional ones to extract features…

Introduction
  • Acoustic Scene Classification (ASC) is an important task in the field of audio understanding and analysis, which classifies an audio stream into one of the predefined acoustic scenes.
  • With the great success of deep learning in computer vision and the availability of larger audio datasets, methods based on DNNs [6], CNNs [7], and RNNs [8] have gradually become dominant in ASC.
  • Hershey et al. [12] compared different CNN architectures (VGG [13], Xception [14], ResNet [15], etc.) in an audio classification task, and all of them showed promising results.
  • In vision tasks, most of the architectures above have been outperformed by more efficient networks such as MobileNetV2 [16] in terms of parameter count and computational cost (a block-level sketch follows below).
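To make the cited design concrete, here is a minimal PyTorch sketch, our illustration rather than code from the paper, of a MobileNetV2-style inverted residual block whose depthwise convolution uses a unidirectional kernel instead of a square one; the kernel length of 5, expansion ratio of 6, and channel counts are assumptions for illustration only.

```python
# Minimal sketch of a MobileNetV2-style inverted residual block with a
# unidirectional depthwise kernel. Kernel length, expansion ratio, and
# channel counts are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class UnidirectionalInvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6, kernel=5, direction="time"):
        super().__init__()
        hidden = channels * expansion
        # (1, k) slides along the temporal axis, (k, 1) along the frequency axis.
        k = (1, kernel) if direction == "time" else (kernel, 1)
        pad = (0, kernel // 2) if direction == "time" else (kernel // 2, 0)
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),  # pointwise expand
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, k, padding=pad,
                      groups=hidden, bias=False),        # unidirectional depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),  # pointwise project (linear)
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residual: stride 1, equal channels

# Alternating temporal and frequency blocks, as the paper's feature extractor does.
fe = nn.Sequential(
    UnidirectionalInvertedResidual(32, direction="time"),
    UnidirectionalInvertedResidual(32, direction="freq"),
)
out = fe(torch.randn(1, 32, 40, 500))  # (batch, channels, mel bins, frames)
```

Stacking a temporal block and a frequency block yields a cross-shaped receptive field while keeping each depthwise layer's weight count linear, rather than quadratic, in the kernel length.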
Highlights
  • Acoustic Scene Classification (ASC) is an important task in the field of audio understanding and analysis, which classifies an audio stream into one of the predefined acoustic scenes
  • As a high-performing feature extractor, the VGG [13] architecture is widely used by previous works such as [30][31][32] that achieve good performance, but its size and depth bring high computational cost
  • We can see that the smaller single NASC-net outperforms the baseline model in terms of macro F1 score, suggesting our Neural Architecture Search (NAS) method can find more lightweight and more accurate models than expert design
  • We present a novel and efficient network for ASC tasks whose feature extractor is inspired by MobileNetV2
  • Our searched network obtains a new state of the art on the DCASE 2018 Task 5 dataset with much lower computation
  • We can conclude that NAS is applicable in the field of ASC and potentially in other acoustic domains
Methods
  • The authors can see from Table 1 that the baseline model is 0.4% higher in F1 than the comparison model, which simply replaces the feature extractor (FE) with the original VGG16, while using 4G fewer FLOPs and 14M fewer parameters
  • This suggests that model capacity and depth have no absolute relationship with feature-extraction ability, and demonstrates that the proposed architecture, which alternates unidirectional kernels along the temporal and frequency dimensions, is efficient for the ASC task (see the weight-count sketch below).
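The efficiency claim can be checked with a back-of-the-envelope weight count; the snippet below (the channel count of 192 is an illustrative assumption) compares a square k×k depthwise kernel against a temporal (1, k) plus frequency (k, 1) pair, which together still use fewer weights.

```python
# Depthwise kernel weight counts: a square k x k kernel stores k*k weights
# per channel, while a (1,k) temporal kernel plus a (k,1) frequency kernel
# together store only 2*k per channel.
def depthwise_params(channels, kh, kw):
    return channels * kh * kw

c, k = 192, 3
print(depthwise_params(c, k, k))                              # 1728 (square 3x3)
print(depthwise_params(c, 1, k) + depthwise_params(c, k, 1))  # 1152 (unidirectional pair)
```

FLOPs scale the same way, multiplied by the feature-map area, so the saving compounds across the whole network.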
Results
  • In Section 3.4 the authors illustrate the details of the searched network that achieves state-of-the-art results on DCASE 2018 Task 5.
  • The authors can see that the smaller single NASC-net outperforms the baseline model in terms of macro F1 score, suggesting the NAS method can find more lightweight and more accurate models than expert design
Conclusion
  • The authors present a novel and efficient network for ASC tasks whose feature extractor is inspired by MobileNetV2.
  • The authors show that the proposed network can achieve both high performance and low computation.
  • On the basis of the proposed network, the authors apply neural architecture search to reach a more sophisticated architecture using the fairness supernet training strategy and the NSGA-II algorithm (a sampling sketch follows this list).
  • The authors' searched network obtains a new state of the art on the DCASE 2018 Task 5 dataset with much lower computation.
  • The authors can conclude that NAS is applicable in the field of ASC and potentially in other acoustic domains
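As a rough illustration of the fairness strategy referenced above, in the spirit of FairNAS [24] and with made-up layer and operation counts rather than the paper's actual search space, each round of supernet training activates every candidate operation exactly once per layer:

```python
# "Fair" single-path supernet sampling sketch: within each round, every
# candidate op in every layer is activated exactly once, so all ops receive
# the same number of updates. NUM_LAYERS/NUM_OPS are made-up values.
import random

NUM_LAYERS, NUM_OPS = 10, 4

def fair_paths():
    """Yield NUM_OPS single-path architectures covering each op once per layer."""
    perms = [random.sample(range(NUM_OPS), NUM_OPS) for _ in range(NUM_LAYERS)]
    for i in range(NUM_OPS):
        yield [perms[layer][i] for layer in range(NUM_LAYERS)]

# One fair round: in a real run, gradients from these NUM_OPS forward/backward
# passes are accumulated before a single optimizer step.
for arch in fair_paths():
    print(arch)  # e.g. [2, 0, 3, ...]: the op index activated in each layer
```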
Summary
  • Objectives:

    Since the authors aim to search for architectures with higher accuracy and lower computation, they set the accuracy metric and computational cost as the two objectives (see the dominance sketch below).
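A minimal sketch of the bi-objective comparison that NSGA-II [25] builds on, with made-up (accuracy, FLOPs) pairs: a candidate stays on the Pareto front if no other candidate is at least as good in both objectives and strictly better in one.

```python
# Bi-objective Pareto dominance sketch underlying NSGA-II selection.
# Candidates are (accuracy, FLOPs) pairs; all numbers are made up.
def dominates(a, b):
    """a dominates b: no worse in both objectives, strictly better in one."""
    acc_a, flops_a = a
    acc_b, flops_b = b
    no_worse = acc_a >= acc_b and flops_a <= flops_b
    strictly_better = acc_a > acc_b or flops_a < flops_b
    return no_worse and strictly_better

def pareto_front(population):
    """Keep candidates not dominated by any other candidate."""
    return [p for p in population
            if not any(dominates(q, p) for q in population if q != p)]

candidates = [(0.89, 1.2e9), (0.91, 2.0e9), (0.88, 2.5e9), (0.90, 1.1e9)]
print(pareto_front(candidates))  # keeps (0.91, 2.0e9) and (0.90, 1.1e9)
```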
Tables
  • Table 1: Performance comparisons with different feature extractors and input sizes
  • Table 2: Class-wise performance comparison. Note that B-s denotes the single baseline model, N-s the single NASC-net model, and N-e the ensembled NASC-net model
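The macro F1 score used throughout these tables is the unweighted mean of the per-class F1 scores; a minimal scikit-learn sketch with toy labels (not data from the paper):

```python
# Macro F1: the unweighted mean of per-class F1 scores.
from sklearn.metrics import f1_score

y_true = ["absence", "cooking", "absence", "watching_tv"]
y_pred = ["absence", "cooking", "cooking", "watching_tv"]
print(f1_score(y_true, y_pred, average="macro"))  # (2/3 + 2/3 + 1) / 3 ~= 0.778
```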
Funding
  • Based on the ratio of the two sets, we can calculate the F1-score for the entire Evaluation Set from the official published leaderboard; e.g., the F1-score of the category "Absence" in [11] is 88.7% = 87.7% × 0.4286 + 89.4% × 0.5714 (checked in the snippet below)
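The quoted combination is simply a subset-size-weighted mean; a quick check with the figures above:

```python
# Ratio-weighted combination of the two evaluation subsets' F1 scores for
# the "Absence" category, using exactly the figures quoted above.
f1_a, f1_b = 0.877, 0.894       # per-subset F1 from the published leaderboard
w_a, w_b = 0.4286, 0.5714       # subset size ratios (sum to 1)
print(f1_a * w_a + f1_b * w_b)  # 0.8867 -> the 88.7% reported for "Absence"
```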
References
  • [1] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi, "Audio-based context recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 321–329, 2005.
  • [2] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "On acoustic surveillance of hazardous situations," in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 165–168.
  • [3] M. Bugalho, J. Portelo, I. Trancoso, T. Pellegrini, and A. Abad, "Detecting audio events for semantic video search," in Tenth Annual Conference of the International Speech Communication Association, 2009.
  • [4] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," in 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, pp. 1128–1132.
  • [5] A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen, and M. D. Plumbley, "Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 2, pp. 379–393, 2018.
  • [6] S. Mun, S. Shon, W. Kim, D. K. Han, and H. Ko, "Deep neural network based learning and transferring mid-level audio features for acoustic scene classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 796–800.
  • [7] Y. Han and K. Lee, "Acoustic scene classification using convolutional neural network and multiple-width frequency-delta data augmentation," arXiv preprint arXiv:1607.02383, 2016.
  • [8] T. H. Vu and J.-C. Wang, "Acoustic scene and event recognition using recurrent neural networks," Detection and Classification of Acoustic Scenes and Events, vol. 2016, 2016.
  • [9] A. Mesaros, T. Heittola, and T. Virtanen, "Acoustic scene classification: An overview of DCASE 2017 challenge entries," in 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC). IEEE, 2018, pp. 411–415.
  • [10] T. Inoue, P. Vinayavekhin, S. Wang, D. Wood, N. Greco, and R. Tachibana, "Domestic activities classification based on CNN using shuffling and mixing data augmentation," DCASE 2018 Challenge, 2018.
  • [11] G. Dekkers, L. Vuegen, T. van Waterschoot, B. Vanrumste, and P. Karsmakers, "DCASE 2018 Challenge - Task 5: Monitoring of domestic activities based on multi-channel acoustics," KU Leuven, Tech. Rep., 2018. [Online]. Available: https://arxiv.org/abs/1807.11246
  • [12] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "CNN architectures for large-scale audio classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
  • [13] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
  • [14] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [16] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
  • [17] T. Véniat, O. Schwander, and L. Denoyer, "Stochastic adaptive neural architecture search for keyword spotting," in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 2842–2846.
  • [18] H. Mazzawi, X. Gonzalvo, A. Kracun, P. Sridhar, N. Subrahmanya, I. L. Moreno, H. J. Park, and P. Violette, "Improving keyword spotting and language identification via neural architecture search at scale," in Interspeech, 2019, pp. 1278–1282.
  • [19] B. Zoph, V. Vasudevan, J. Shlens, and Q. Le, "Learning transferable architectures for scalable image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697–8710.
  • [20] E. Real, A. Aggarwal, Y. Huang, and Q. Le, "Regularized evolution for image classifier architecture search," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 4780–4789.
  • [21] H. Liu, K. Simonyan, and Y. Yang, "DARTS: Differentiable architecture search," in International Conference on Learning Representations, 2019.
  • [22] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, and Q. Le, "Understanding and simplifying one-shot architecture search," in International Conference on Machine Learning, 2018, pp. 549–558.
  • [23] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, "Single path one-shot neural architecture search with uniform sampling," arXiv preprint arXiv:1904.00420, 2019.
  • [24] X. Chu, B. Zhang, R. Xu, and J. Li, "FairNAS: Rethinking evaluation fairness of weight sharing neural architecture search," arXiv preprint arXiv:1907.01845, 2019.
  • [25] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
  • [26] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  • [27] S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang, "QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions," arXiv preprint arXiv:1910.10261, 2019.
  • [28] T. DeVries and G. W. Taylor, "Improved regularization of convolutional neural networks with cutout," arXiv preprint arXiv:1708.04552, 2017.
  • [29] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2015.
  • [30] Y. Han, J. Park, and K. Lee, "Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification," DCASE 2017 Challenge, 2017.
  • [31] R. Tanabe, T. Endo, Y. Nikaido, T. Ichige, P. Nguyen, Y. Kawaguchi, and K. Hamada, "Multichannel acoustic scene classification by blind dereverberation, blind source separation, data augmentation, and model ensembling," DCASE 2018 Challenge, 2018.
  • [32] W. Wang, W. Wang, M. Sun, and C. Wang, "Acoustic scene analysis with multi-head attention networks," arXiv preprint arXiv:1909.08961, 2019.