Optics and Precision Engineering, Vol. 32, Issue 2, 237 (2024)
Audio object detection network with multimodal cross level feature knowledge transfer
Shibei LIU, Ying CHEN. Audio object detection network with multimodal cross level feature knowledge transfer[J]. Optics and Precision Engineering, 2024, 32(2): 237
Received: Jun. 8, 2023
Published Online: Apr. 2, 2024
Author email: Ying CHEN (chenying@jiangnan.edu.cn)