Research Progress in Fundamental Architecture of Deep Learning-Based Single Object Tracking Method

Tingfa Xu; Ying Wang; Guokai Shi; Tianhao Li; Jianan Li

doi:10.3788/AOS230746

Acta Optica Sinica, Vol. 43, Issue 15, 1510003 (2023)

Tingfa Xu¹,²,*, Ying Wang¹, Guokai Shi³, Tianhao Li¹, and Jianan Li¹,**

¹Key Laboratory of Photoelectronic Imaging Technology and System, Ministry of Education, School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China

²Chongqing Innovation Center, Beijing Institute of Technology, Chongqing 401120, China

³North Automatic Control Technology Institute, Taiyuan 030006, Shanxi, China

References(63)

[1] Henriques J F, Caseiro R, Martins P et al. High-speed tracking with kernelized correlation filters[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 583-596(2015).

[2] Bertinetto L, Valmadre J, Henriques J F et al. Fully-convolutional Siamese networks for object tracking[M]. Hua G, Jégou H. Computer vision–ECCV 2016 workshops. Lecture notes in computer science, 9914, 850-865(2016).

[3] Cui Z J, An J S, Zhang Y F et al. Light-weight Siamese attention network object tracking for unmanned aerial vehicle[J]. Acta Optica Sinica, 40, 1915001(2020).

[4] Xie Z, Geng Z, Hu J et al. Revealing the dark secrets of masked image modeling[J]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14475-14485(2023).

[5] Held D, Thrun S, Savarese S. Learning to track at 100 fps with deep regression networks[C], 749-765(2016).

[6] Guo D, Shao Y, Cui Y et al. Graph attention tracking[J]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9543-9552(2021).

[7] Xu Y D, Wang Z Y, Li Z X et al. SiamFC++: towards robust and accurate visual tracking with target estimation guidelines[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 12549-12556(2020).

[8] Vaswani A, Shazeer N, Parmar N et al. Attention is all you need[C], 6000-6010(2017).

[9] Rao Y, Zhao W, Liu B et al. DynamicViT: efficient vision transformers with dynamic token sparsification[J]. Advances in Neural Information Processing Systems, 34, 13937-13949(2021).

[10] Carion N, Massa F, Synnaeve G et al. End-to-end object detection with transformers[M]. Vedaldi A, Bischof H, Brox T, et al. Computer vision–ECCV 2020. Lecture notes in computer science, 12346, 213-229(2020).

[11] Wang H Y, Zhu Y K, Adam H et al. MaX-DeepLab: end-to-end panoptic segmentation with mask transformers[C], 5459-5470(2021).

[12] Chen X, Yan B, Zhu J W et al. Transformer tracking[C], 8122-8131(2021).

[13] Xie F, Wang C Y, Wang G T et al. Correlation-aware deep tracking[C], 8741-8750(2022).

[14] Ye B T, Chang H, Ma B P et al. Joint feature learning and relation modeling for tracking: a one-stream framework[M]. Avidan S, Brostow G, Cissé M, et al. Computer vision–ECCV 2022. Lecture notes in computer science, 13682, 341-357(2022).

[15] Kristan M, Leonardis A, Matas J et al. The sixth visual object tracking VOT2018 challenge results[C], 3-53(2018).

[16] Huang L H, Zhao X, Huang K Q. GOT-10k: a large high-diversity benchmark for generic object tracking in the wild[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43, 1562-1577(2021).

[17] Fan H, Lin L T, Yang F et al. LaSOT: a high-quality benchmark for large-scale single object tracking[C], 5369-5378(2020).

[18] Müller M, Bibi A, Giancola S et al. TrackingNet: a large-scale dataset and benchmark for object tracking in the wild[M]. Ferrari V, Hebert M, Sminchisescu C, et al. Computer vision–ECCV 2018. Lecture notes in computer science, 11205, 310-327(2018).

[19] Li P X, Wang D, Wang L J et al. Deep visual tracking: review and experimental comparison[J]. Pattern Recognition, 76, 323-338(2018).

[20] Marvasti-Zadeh S M, Cheng L, Ghanei-Yakhdan H et al. Deep learning for visual tracking: a comprehensive survey[J]. IEEE Transactions on Intelligent Transportation Systems, 23, 3943-3968(2022).

[21] Javed S, Danelljan M, Khan F S et al. Visual object tracking with discriminative filters and Siamese networks: a survey and outlook[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45, 6552-6574(2022).

[22] Yang T Y, Chan A B. Recurrent filter learning for visual tracking[C], 2010-2019(2018).

[23] Song Y B, Ma C, Wu X H et al. VITAL: visual tracking via adversarial learning[C], 8990-8999(2018).

[24] Park E, Berg A C. Meta-tracker: fast and robust online adaptation for visual object trackers[M]. Ferrari V, Hebert M, Sminchisescu C, et al. Computer vision–ECCV 2018. Lecture notes in computer science, 11207, 587-604(2018).

[25] Zheng J L, Ma C, Peng H W et al. Learning to track objects from unlabeled videos[C], 13526-13535(2022).

[26] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 60, 84-90(2017).

[27] Liu Z, Lin Y T, Cao Y et al. Swin transformer: hierarchical vision transformer using shifted windows[C], 9992-10002(2022).

[28] Lin L T, Fan H, Zhang Z P et al. SwinTrack: a simple and strong baseline for transformer tracking[EB/OL]. https://arxiv.org/abs/2112.00995

[29] Valmadre J, Bertinetto L, Henriques J et al. End-to-end representation learning for correlation filter based tracking[C], 2805-2813(2017).

[30] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[EB/OL]. https://arxiv.org/abs/1409.1556

[31] Szegedy C, Vanhoucke V, Ioffe S et al. Rethinking the inception architecture for computer vision[C], 2818-2826(2016).

[32] He K M, Zhang X Y, Ren S Q et al. Deep residual learning for image recognition[C], 770-778(2016).

[33] Zhang Z P, Peng H W. Deeper and wider Siamese networks for real-time visual tracking[C], 4586-4595(2020).

[34] Li B, Wu W, Wang Q et al. SiamRPN++: evolution of Siamese visual tracking with very deep networks[C], 4277-4286(2020).

[35] Dosovitskiy A, Beyer L, Kolesnikov A et al. An image is worth 16x16 words: transformers for image recognition at scale[EB/OL]. https://arxiv.org/abs/2010.11929

[36] Touvron H, Cord M, Douze M et al. Training data-efficient image transformers & distillation through attention[EB/OL]. https://arxiv.org/abs/2012.12877

[37] Fu Z H, Fu Z H, Liu Q J et al. SparseTT: visual tracking with sparse transformers[EB/OL]. https://arxiv.org/abs/2205.03776

[38] Guo M Z, Zhang Z P, Fan H et al. Learning target-aware representation for visual tracking via informative interactions[C], 927-934(2022).

[39] Li B, Yan J J, Wu W et al. High performance visual tracking with Siamese region proposal network[C], 8971-8980(2018).

[40] Yan B, Zhang X Y, Wang D et al. Alpha-refine: boosting tracking performance by precise bounding box estimation[C], 5285-5294(2021).

[41] Wang Y, Xu T F, Li J N et al. Pyramid correlation based deep Hough voting for visual object tracking[C], 610-625(2021).

[42] Fu Z H, Liu Q J, Fu Z H et al. STMTrack: template-free visual tracking with space-time memory networks[C], 13769-13778(2021).

[43] Wang X L, Girshick R, Gupta A et al. Non-local neural networks[C], 7794-7803(2018).

[44] Yan B, Peng H W, Fu J L et al. Learning spatio-temporal transformer for visual tracking[C], 10428-10437(2022).

[45] Song Z K, Yu J Q, Chen Y P P et al. Transformer tracking with cyclic shifting window attention[C], 8781-8790(2022).

[46] Zhang Z P, Liu Y H, Wang X et al. Learn to match: automatic matching network design for visual tracking[C], 13319-13328(2022).

[47] Ren S Q, He K M, Girshick R et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1137-1149(2017).

[48] Yu Y C, Xiong Y L, Huang W L et al. Deformable Siamese attention networks for visual object tracking[C], 6727-6736(2020).

[49] Chen Z D, Zhong B N, Li G R et al. Siamese box adaptive network for visual tracking[C], 6667-6676(2020).

[50] Jiang B R, Luo R X, Mao J Y et al. Acquisition of localization confidence for accurate object detection[M]. Ferrari V, Hebert M, Sminchisescu C, et al. Computer vision–ECCV 2018. Lecture notes in computer science, 11218, 816-832(2018).

[51] Guo D Y, Wang J, Cui Y et al. SiamCAR: Siamese fully convolutional classification and regression for visual tracking[C], 6268-6276(2020).

[52] Du F, Liu P, Zhao W et al. Correlation-guided attention for corner detection based visual tracking[C], 6835-6844(2020).

[53] Deng J, Dong W, Socher R et al. ImageNet: a large-scale hierarchical image database[C], 248-255(2009).

[54] Cui Y T, Jiang C, Wang L M et al. MixFormer: end-to-end tracking with iterative mixed attention[C], 13598-13608(2022).

[55] Chen B Y, Li P X, Bai L et al. Backbone is all your need: a simplified architecture for visual object tracking[M]. Avidan S, Brostow G, Cissé M, et al. Computer vision–ECCV 2022. Lecture notes in computer science, 13682, 375-392(2022).

[56] Wu H P, Xiao B, Codella N et al. CvT: introducing convolutions to vision transformers[C], 22-31(2022).

[57] Song Z K, Luo R, Yu J Q et al. Compact transformer tracker with correlative masked modeling[EB/OL]. https://arxiv.org/abs/2301.10938

[58] Chen X L, Xie S N, He K M. An empirical study of training self-supervised vision transformers[C], 9620-9629(2022).

[59] He K M, Chen X L, Xie S N et al. Masked autoencoders are scalable vision learners[C], 15979-15988(2022).

[60] Radford A, Kim J W, Hallacy C et al. Learning transferable visual models from natural language supervision[EB/OL]. https://arxiv.org/abs/2103.00020

[61] Mueller M, Smith N, Ghanem B. A benchmark and simulator for UAV tracking[M]. Leibe B, Matas J, Sebe N, et al. Computer vision–ECCV 2016. Lecture notes in computer science, 9905, 445-461(2016).

[62] Wang X, Shu X J, Zhang Z P et al. Towards more flexible and accurate object tracking with natural language: algorithms and benchmark[C], 13758-13768(2021).

[63] Zhang Z P, Peng H W, Fu J L et al. Ocean: object-aware anchor-free tracking[M]. Vedaldi A, Bischof H, Brox T, et al. Computer vision–ECCV 2020. Lecture notes in computer science, 12366, 771-787(2020).

Citation

Tingfa Xu, Ying Wang, Guokai Shi, Tianhao Li, Jianan Li. Research Progress in Fundamental Architecture of Deep Learning-Based Single Object Tracking Method[J]. Acta Optica Sinica, 2023, 43(15): 1510003

Paper Information

Category: Image Processing

Received: Mar. 29, 2023

Accepted: Jun. 15, 2023

Published Online: Aug. 15, 2023

Corresponding author email: Tingfa Xu (ciom_xtf1@bit.edu.cn), Jianan Li (lijianan@bit.edu.cn)
