Chinese Optics Letters, Volume 20, Issue 8, 081101 (2022)
FAANet: feature-aligned attention network for real-time multiple object tracking in UAV videos
Multiple object tracking (MOT) in unmanned aerial vehicle (UAV) videos has attracted increasing attention. Because of the observation perspective of UAVs, object scales change dramatically and objects are relatively small. Moreover, most MOT algorithms for UAV videos cannot run in real time because they follow the tracking-by-detection paradigm. We propose a feature-aligned attention network (FAANet), which mainly consists of a channel and spatial attention module and a feature-aligned aggregation module. We further improve real-time performance using the joint-detection-embedding paradigm and a structural re-parameterization technique. Extensive experiments on the UAV detection and tracking benchmark validate the effectiveness of our approach: FAANet achieves a new state-of-the-art 44.0 MOTA and 64.6 IDF1 at 38.24 frames per second on a single 1080Ti graphics processing unit.
1. Introduction
Object detection and tracking have been crucial concerns in both the 2D[1,2,5] and 3D[3,4] vision communities.
Online MOT algorithms can be divided into two-step and one-shot approaches. The two-step approach[1,2] first obtains detections and then extracts re-identification (Re-ID) embeddings with a separate model for association, whereas the one-shot approach[6,7] performs detection and Re-ID embedding extraction in a single network.
Previous efforts have been made to solve these problems. IPGAT[8] fuses individual and global motion cues with a conditional generative adversarial network for MOT in UAV videos.
In this Letter, we are committed to coping with both of the aforementioned problems simultaneously. To address the problem of scale changes, we design a feature-aligned attention network (FAANet), which is mainly composed of two modules: the channel and spatial attention (CSA) module and the feature-aligned aggregation (FAA) module. The CSA module adaptively enhances multi-scale features, and the FAA module generates alignment offsets between two features of different resolutions. FAANet integrates multi-scale features to improve robustness to object scale changes. To address the problem of real-time performance, we adopt the joint-detection-embedding (JDE) paradigm[6,7] and apply the structural re-parameterization technique[10] to accelerate inference.
The major contributions of this paper are summarized as follows. (1) We propose FAANet to enhance and aggregate multi-scale features so as to cope with drastic scale changes in UAV videos. (2) We introduce the JDE paradigm to MOT in UAV videos and use the structural re-parameterization technique to increase network inference speed. (3) Extensive experiments are conducted on the UAV detection and tracking (UAVDT)[9] benchmark, where FAANet achieves a new state-of-the-art 44.0 MOTA and 64.6 IDF1 at 38.24 frames per second.
2. Methods
In this section, we first present the architecture of the proposed FAANet and then explain the details of the CSA module, FAA module, and online inference.
2.1. Overview
As shown in Fig. 1, the framework of the proposed FAANet contains four components: a feature extractor backbone, a feature fusion neck, detection and Re-ID prediction heads, and online tracking association. We adopt RepVGG[10] as the feature extractor backbone, while the proposed CSA and FAA modules constitute the neck.
Figure 1. Architecture of the proposed FAANet tracking framework. The framework contains four components: backbone (RepVGG), neck (CSA + FAA), head (Re-ID + detection), and association.
2.2. Channel and spatial attention module
In general, the input and output dimensions of channel attention (CA) or spatial attention (SA) are the same[11,12]. In contrast, our CSA module applies both attentions while reducing the channel dimension, as illustrated in Fig. 2.
Figure 2. Architecture of the CSA module.
Specifically, let one output of the backbone be $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the channel dimension, height, and width. Accordingly, the weights of the channels can be computed as
$$\omega_c = \sigma\big(\mathrm{Conv1D}_k(\mathrm{GAP}(X))\big),$$
where $\mathrm{GAP}$ denotes global average pooling, $\mathrm{Conv1D}_k$ is a one-dimensional convolution with kernel size $k$, and $\sigma$ is the sigmoid function.
In contrast to channel attention, which focuses on the channel dimension, spatial attention mainly focuses on the height and width dimensions. Inspired by polarized self-attention[12], our SA branch collapses the channel dimension while keeping the full spatial resolution to estimate a spatial weight map.
Because the channel attention adopts a one-dimensional convolution instead of the usual fully connected layers, the number of parameters is greatly reduced, and the inference speed is improved. The number of channels is reduced to 64 through the proposed CSA module, which further cuts the parameter count for real-time performance and provides channel-consistent inputs to the FAA module.
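Since the paper does not release code, the following is a minimal PyTorch sketch of a CSA-style block consistent with the description above: an ECA-style channel attention using a one-dimensional convolution, a spatial attention branch that collapses the channel dimension, and a final reduction to 64 channels. All module and parameter names (e.g., CSA, k, reduce) are our own illustrative choices, not the paper's.

```python
import torch
import torch.nn as nn

class CSA(nn.Module):
    def __init__(self, in_channels, out_channels=64, k=3):
        super().__init__()
        # Channel attention: global average pooling followed by a 1D
        # convolution across channels (no fully connected layers).
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        # Spatial attention: collapse channels to a single H x W weight map.
        self.spatial = nn.Conv2d(in_channels, 1, kernel_size=1, bias=False)
        self.sigmoid = nn.Sigmoid()
        # Reduce the channel dimension to a common size for the FAA module.
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                bias=False)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w_c = self.gap(x).view(b, 1, c)          # (B, 1, C)
        w_c = self.sigmoid(self.conv1d(w_c)).view(b, c, 1, 1)
        x = x * w_c                              # channel-weighted features
        w_s = self.sigmoid(self.spatial(x))      # (B, 1, H, W)
        x = x * w_s                              # spatially weighted features
        return self.reduce(x)                    # (B, 64, H, W)
```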
2.3. Feature-aligned aggregation module
Traditional multi-scale feature aggregation usually adopts bi-linear interpolation up-sampling followed by element-wise summation. However, because of feature misalignment, feature degradation occurs, which weakens the ability to locate objects at different scales. We therefore draw on the feature alignment mechanism in AlignSeg[13] and design the FAA module to learn offsets that align features before aggregation.
The FAA module is shown in Fig. 3. Specifically, let the corresponding two outputs of the CSA module be $F_h$ and $F_l$, which respectively denote the feature maps with high and low resolution. The network learns to generate the offsets of the two feature maps for alignment. Let the corresponding two offsets be $\Delta_h$ and $\Delta_l$; they can be computed as
$$[\Delta_h, \Delta_l] = \mathrm{Conv}\big([F_h; \mathrm{Up}(F_l)]\big),$$
where $\mathrm{Up}(\cdot)$ denotes bi-linear up-sampling to the resolution of $F_h$, $[\cdot;\cdot]$ denotes channel-wise concatenation, and $\mathrm{Conv}$ is a small stack of convolution layers predicting the two offset fields. Both feature maps are then resampled with their offsets and aggregated by summation, as sketched after Fig. 3.
Figure 3. Architecture of the FAA module.
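As a hedged illustration of the mechanism, the sketch below implements an AlignSeg-style feature-aligned aggregation in PyTorch: offsets are predicted from the concatenated maps and used to resample each feature via bilinear grid sampling before summation. The exact offset head (self.offset) and its depth are assumptions, not the paper's published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAA(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Predict a 2-channel (dx, dy) offset field for each of the two inputs.
        self.offset = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 4, 3, padding=1),  # 2 offsets x 2 feature maps
        )

    @staticmethod
    def warp(feat, delta):
        # Build a sampling grid shifted by the learned (normalized) offsets.
        b, _, h, w = feat.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat.device),
            torch.linspace(-1, 1, w, device=feat.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = grid + delta.permute(0, 2, 3, 1)   # (B, H, W, 2)
        return F.grid_sample(feat, grid, align_corners=True)

    def forward(self, f_high, f_low):
        # Up-sample the low-resolution map to the high-resolution size first.
        f_low = F.interpolate(f_low, size=f_high.shape[2:],
                              mode="bilinear", align_corners=True)
        deltas = self.offset(torch.cat((f_high, f_low), dim=1))
        d_high, d_low = deltas[:, :2], deltas[:, 2:]
        # Align both maps with their offsets, then aggregate by summation.
        return self.warp(f_high, d_high) + self.warp(f_low, d_low)
```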
2.4. Online inference
Herein, the two important components of online inference are data association and structural re-parameterization, which we will illustrate further.
We follow the standard online tracking algorithm to associate boxes. As shown in Fig. 4, we first initialize a few tracklets based on the estimated boxes in the first frame and use a Kalman filter to predict the locations of the tracklets in the next frame. We perform two Hungarian matchings between detections and tracklets sequentially. The first matching considers the appearance information (Re-ID embedding) measured by cosine distance and the motion information measured by Mahalanobis distance. The second matching only considers intersection over union (IOU) distance, which is simple but useful. Finally, we initialize new tracklets that meet confidence thresholds and mark the lost tracklets.
Figure 4. Procedure of association between detections and tracklets.
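A simplified sketch of the two-stage association described above, assuming SciPy's Hungarian solver and precomputed distance matrices of shape (tracklets × detections). The fusion weight lam and the thresholds t1 and t2 are illustrative values, not the paper's.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian(cost, thresh):
    """Solve the assignment and keep only pairs below the cost threshold."""
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= thresh]
    matched_r = {r for r, _ in pairs}
    matched_c = {c for _, c in pairs}
    un_rows = [r for r in range(cost.shape[0]) if r not in matched_r]
    un_cols = [c for c in range(cost.shape[1]) if c not in matched_c]
    return pairs, un_rows, un_cols

def associate(cos_dist, maha_dist, iou_dist, lam=0.98, t1=0.4, t2=0.5):
    """Two-stage matching between tracklets (rows) and detections (cols)."""
    # Stage 1: fuse appearance (cosine) and motion (Mahalanobis) distances.
    pairs1, ut, ud = hungarian(lam * cos_dist + (1 - lam) * maha_dist, t1)
    # Stage 2: IoU-only matching on the remaining tracklets and detections.
    pairs2, ut2, ud2 = hungarian(iou_dist[np.ix_(ut, ud)], t2)
    pairs2 = [(ut[r], ud[c]) for r, c in pairs2]  # map back to global indices
    return pairs1 + pairs2, [ut[i] for i in ut2], [ud[j] for j in ud2]
```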
Structural re-parameterization is used in RepVGG[10]. As shown in Fig. 5, a RepVGG block contains parallel 3 × 3 convolution, 1 × 1 convolution, and identity branches, each followed by batch normalization, during training; at inference time, the three branches are merged into a single 3 × 3 convolution, which yields a plain, faster network with identical outputs, as sketched after Fig. 5.
Figure 5. Structural re-parameterization of a RepVGG block.
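The following sketch shows the standard RepVGG fusion arithmetic: each branch's batch normalization is folded into its convolution, the 1 × 1 kernel is zero-padded to 3 × 3, and the identity branch is expressed as a 3 × 3 kernel with ones at the center, so the three branches sum to one equivalent convolution. Shapes assume equal input/output channels and stride 1; the function names are ours.

```python
import torch
import torch.nn.functional as F

def fuse_conv_bn(kernel, bn):
    """Fold a BatchNorm layer into the preceding conv; returns (kernel, bias)."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                      # per-output-channel scale
    fused_w = kernel * scale.reshape(-1, 1, 1, 1)
    fused_b = bn.bias - bn.running_mean * scale
    return fused_w, fused_b

def reparameterize(w3, bn3, w1, bn1, bn_id, channels):
    """Merge 3x3+BN, 1x1+BN, and identity+BN branches into one 3x3 conv."""
    k3, b3 = fuse_conv_bn(w3, bn3)
    k1, b1 = fuse_conv_bn(w1, bn1)
    k1 = F.pad(k1, [1, 1, 1, 1])                 # pad the 1x1 kernel to 3x3
    # Identity branch as an explicit 3x3 kernel: 1 at the center of channel i.
    kid = torch.zeros(channels, channels, 3, 3)
    for i in range(channels):
        kid[i, i, 1, 1] = 1.0
    kid, bid = fuse_conv_bn(kid, bn_id)
    # Parallel convolutions with summed outputs equal a single convolution
    # whose kernel and bias are the sums of the branches'.
    return k3 + k1 + kid, b3 + b1 + bid
```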
3. Experiment
3.1. Datasets and metrics
We evaluate our method on the UAVDT dataset[9], a benchmark for object detection and tracking in UAV videos. For evaluation, we adopt the standard CLEAR MOT metrics, including multiple object tracking accuracy (MOTA), together with the identification F1 score (IDF1).
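For reference, MOTA follows the standard CLEAR MOT definition, combining false negatives (FN), false positives (FP), and identity switches (IDSW) over all frames $t$, normalized by the total number of ground-truth boxes (GT):
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t},$$
while IDF1 is the F1 score of correctly identified detections, emphasizing identity preservation.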
3.2. Implementation details
We choose RepVGG-B0[10] as the backbone of FAANet.
3.3. Experiment analysis
We evaluate our FAANet against various classic and recent algorithms[1,2,8,14–22], including CEM[14], SORT[1], DeepSORT[2], IPGAT[8], and M-CMSN-M[21].
Since most current UAV-based MOT algorithms follow the tracking-by-detection (TBD) paradigm, many do not publish their speed, or publish only the speed of the association phase, which makes speed comparisons difficult and ambiguous. We collect all currently publicly available speed and accuracy figures, shown in Fig. 6. Note that the three algorithms in this comparison count only the time consumed in the association phase. It can be observed that the speed of our FAANet is 60 times that of the current state-of-the-art M-CMSN-M[21].
Figure 6. MOTA-IDF1-FPS comparison with other UAV-based MOT trackers on the UAVDT test dataset. The horizontal axis is FPS, the vertical axis is MOTA, and the radius of each circle is IDF1.
The comparison based on scene attributes is shown in Fig. 7. Our algorithm outperforms the other algorithms in the high-alt, bird-view, and fog scenes, which demonstrates the effectiveness of the proposed modules.
Figure 7. IDF1 comparison with other UAV-based MOT trackers on the UAVDT test dataset based on scene attributes. The IDF1 of FAANet is marked outside the circle.
3.4. Ablation experiments
To validate the effectiveness of the CA, SA, and FAA modules, we introduce a baseline: RepVGG-B0 with the re-parameterization technique. The baseline reduces the feature dimension by convolution and performs multi-scale fusion by bi-linear interpolation up-sampling and element-wise summation. All other settings, such as training hyperparameters and association details, are kept consistent. As shown in Table 2, the contributions of CA and SA are similar, but their combination leads to better performance. Compared with CSA, FAA brings a larger IDF1 gain, which may be because feature-aligned multi-scale aggregation boosts robustness to small objects and scale changes.
As shown in Table 3, we illustrate the speed improvement from the re-parameterization technique. It decreases the number of model parameters from 15.9 × 10⁶ to 14.4 × 10⁶ and the floating-point operations (FLOPs) from 62.3 × 10⁹ to 58.3 × 10⁹. Overall, it increases the frame rate from 30.32 to 38.24 frames per second (FPS), a 26% speed improvement without any loss of accuracy.
3.5. Visualization results
Figure 8 visualizes tracking results of the classical DeepSORT[2] and our FAANet in several typical scenes of the UAVDT test dataset.
Figure 8. Examples and comparison of tracking results between DeepSORT and FAANet on the UAVDT test dataset.
The vertical numbers in Fig. 8 denote the number of objects tracked in the three frames. On average, FAANet tracks 29% more objects than the classical DeepSORT[2].
4. Conclusions
In this Letter, we propose FAANet for MOT in UAV videos. Experimental results demonstrate that our method better copes with scale changes and small objects while running in real time. We hope that its high accuracy and fast speed make our method attractive for industrial applications.
[1] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft. Simple online and realtime tracking. IEEE International Conference on Image Processing, 3464(2016).
[2] N. Wojke, A. Bewley, D. Paulus. Simple online and realtime tracking with a deep association metric. IEEE International Conference on Image Processing, 3645(2017).
[3] Q. Qian, Y. Hu, N. Zhao, M. Li, F. Shao, X. Zhang. Object tracking method based on joint global and local feature descriptor of 3D LIDAR point cloud. Chin. Opt. Lett., 18, 061001(2020).
[4] J. Dai, L. Huang, K. Guo, L. Ling, H. Huang. Reflectance transformation imaging of 3D detection for subtle traces. Chin. Opt. Lett., 19, 031101(2021).
[5] D. Ramanan, D. A. Forsyth. Finding and tracking people from the bottom up. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1(2003).
[6] Z. Wang, L. Zheng, Y. Liu, Y. Li, S. Wang. Towards real-time multi-object tracking. European Conference on Computer Vision, 107(2020).
[7] Y. Zhang, C. Wang, X. Wang, W. Zeng, W. Liu. FairMOT: on the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis., 129, 3069(2021).
[8] H. Yu, G. Li, L. Su, B. Zhong, Q. Huang. Conditional GAN based individual and global motion fusion for multiple object tracking in UAV videos. Pattern Recognit. Lett., 131, 219(2020).
[9] H. Yu, G. Li, W. Zhang, Q. Huang, D. Du, Q. Tian, N. Sebe. The unmanned aerial vehicle benchmark: object detection, tracking and baseline. Int. J. Comput. Vis., 128, 1141(2020).
[10] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, J. Sun. RepVGG: making VGG-style ConvNets great again. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 13728(2021).
[11] Q. Wang, B. Wu, P. Zhu, P. Li, Q. Hu. ECA-Net: efficient channel attention for deep convolutional neural networks. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 11531(2020).
[12] H. Liu, F. Liu, X. Fan, D. Huang. Polarized self-attention: towards high-quality pixel-wise regression(2021).
[13] Z. Huang, Y. Wei, X. Wang, H. Shi, T. S. Huang. AlignSeg: feature-aligned segmentation networks. IEEE Trans. Pattern Anal. Mach. Intell., 44, 550(2021).
[14] A. Milan, S. Roth, K. Schindler. Continuous energy minimization for multitarget tracking. IEEE Trans. Pattern Anal. Mach. Intell., 36, 58(2013).
[15] S. H. Bae, K. J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1218(2014).
[16] H. Pirsiavash, D. Ramanan, C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1201(2011).
[17] E. Bochinski, V. Eiselein, T. Sikora. High-speed tracking-by-detection without using image information. IEEE International Conference on Advanced Video and Signal Based Surveillance, 1(2017).
[18] Y. Xiang, A. Alahi, S. Savarese. Learning to track: online multi-object tracking by decision making. IEEE International Conference on Computer Vision, 4705(2015).
[19] C. Dicle, O. I. Camps, M. Sznaier. The way they move: tracking multiple targets with similar appearance. IEEE International Conference on Computer Vision, 2304(2013).
[20] Q. Zhou, B. Zhong, Y. Zhang, J. Li, Y. Fu. Deep alignment network based multi-person tracking with occlusion and motion reasoning. IEEE Trans. Multimed., 21, 1183(2018).
[21] H. Yu, G. Li, W. Zhang, H. Yao, Q. Huang. Self-balance motion and appearance model for multi-object tracking in UAV. Proceedings of the ACM Multimedia Asia, 1(2019).
[22] H. U. Dike, Y. Zhou. A robust quadruplet and faster region-based CNN for UAV video-based multiple object tracking in crowded environment. Electronics, 10, 795(2021).
Zhenqi Liang, Jingshi Wang, Gang Xiao, Liu Zeng, "FAANet: feature-aligned attention network for real-time multiple object tracking in UAV videos," Chin. Opt. Lett. 20, 081101 (2022)
Category: Imaging Systems and Image Processing
Received: Feb. 5, 2022
Accepted: Apr. 28, 2022
Posted: May 6, 2022
Published Online: May 27, 2022
The Author Email: Gang Xiao (xiaogang@sjtu.edu.cn)