Optics and Precision Engineering, Vol. 32, Issue 2, 237 (2024)
Audio object detection network with multimodal cross level feature knowledge transfer
Shibei LIU, Ying CHEN. Audio object detection network with multimodal cross level feature knowledge transfer[J]. Optics and Precision Engineering, 2024, 32(2): 237
Received: Jun. 8, 2023
Published Online: Apr. 2, 2024
The Author Email: CHEN Ying (chenying@jiangnan.edu.cn)