Optics and Precision Engineering, Vol. 32, Issue 2, 237 (2024)

Audio object detection network with multimodal cross level feature knowledge transfer

Shibei LIU and Ying CHEN*
Author Affiliations
  • Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214122, China

    Shibei LIU, Ying CHEN. Audio object detection network with multimodal cross level feature knowledge transfer[J]. Optics and Precision Engineering, 2024, 32(2): 237

    Paper Information

    Received: Jun. 8, 2023

    Accepted: --

    Published Online: Apr. 2, 2024

    The Author Email: Ying CHEN (chenying@jiangnan.edu.cn)

    DOI: 10.37188/OPE.20243202.0237
