Audio object detection network with multimodal cross level feature knowledge transfer

Shibei LIU; Ying CHEN

doi:10.37188/OPE.20243202.0237

Optics and Precision Engineering, Vol. 32, Issue 2, 237 (2024)

Shibei LIU and Ying CHEN^*

Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214122, China

Shibei LIU, Ying CHEN. Audio object detection network with multimodal cross level feature knowledge transfer[J]. Optics and Precision Engineering, 2024, 32(2): 237
