Optics and Precision Engineering, Vol. 32, Issue 2, 237 (2024)

Audio object detection network with multimodal cross level feature knowledge transfer

Shibei LIU and Ying CHEN*
Author Affiliations
  • Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214122, China

    Shibei LIU, Ying CHEN. Audio object detection network with multimodal cross level feature knowledge transfer[J]. Optics and Precision Engineering, 2024, 32(2): 237

    Paper Information

    Received: Jun. 8, 2023

    Accepted: --

    Published Online: Apr. 2, 2024

    The Author Email: CHEN Ying (chenying@jiangnan.edu.cn)

    DOI:10.37188/OPE.20243202.0237