Chinese Optics Letters, Volume 20, Issue 8, 081101 (2022)

FAANet: feature-aligned attention network for real-time multiple object tracking in UAV videos

Zhenqi Liang1, Jingshi Wang1,2, Gang Xiao1,*, and Liu Zeng1
Author Affiliations
  • 1School of Aeronautics and Astronautics, Shanghai Jiao Tong University, Shanghai 200240, China
  • 2Jiangsu Automation Research Institute, Lianyungang 222061, China

    Multiple object tracking (MOT) in unmanned aerial vehicle (UAV) videos has attracted increasing attention. Because of the UAV's observation perspective, object scales change dramatically and objects are relatively small. Besides, most MOT algorithms for UAV videos cannot run in real time because of the tracking-by-detection paradigm. We propose a feature-aligned attention network (FAANet). It mainly consists of a channel and spatial attention module and a feature-aligned aggregation module. We also improve real-time performance using the joint-detection-embedding paradigm and the structural re-parameterization technique. We validate the effectiveness of FAANet with extensive experiments on the UAV detection and tracking (UAVDT) benchmark, achieving a new state-of-the-art 44.0 MOTA and 64.6 IDF1 at 38.24 frames per second on a single 1080Ti graphics processing unit.

    Keywords

    1. Introduction

    Object detection and tracking have been crucial concerns in the 2D[1,2] and 3D[3,4] vision communities. Multiple object tracking (MOT) aims to analyze videos to identify and track objects. Because of the flexibility of unmanned aerial vehicles (UAVs) equipped with cameras, MOT in UAV videos has become a new research hotspot in recent years.

    Online MOT algorithms can be divided into two-step and one-shot approaches. The two-step approach[1,2] mainly follows the tracking-by-detection (TBD) paradigm[5], which formulates the MOT task as two steps: object detection and data association. The one-shot approach[6,7] jointly learns detection and embedding and does not use a separate re-identification (Re-ID) model as in TBD. To date, almost all UAV-based MOT algorithms adopt the TBD paradigm. Unlike the common observation perspectives in MOT, such as fixed security cameras and moving cars, MOT in UAV videos must pay attention to the following challenges. (1) Scale changes and small objects: the flight altitude of the UAV changes over time, and the object scale varies greatly at different altitudes. Because of the wide field of view and high altitude, objects are usually small. (2) Real-time performance: MOT in UAV videos needs to locate fast-moving ground objects, so algorithms should be fast enough for industrial applications.

    Previous efforts have been made to solve these problems. IPGAT[8] predicts complex motions using a conditional generative adversarial network model; however, it does not exploit the multi-scale appearance information of an object. M-CMSN-M[9] unifies single object tracking and MOT for multi-task learning using a Siamese network; nevertheless, its speed is extremely slow because of the numerous objects in UAV videos.

    In this Letter, we address both of the aforementioned problems simultaneously. To cope with scale changes, we design a feature-aligned attention network (FAANet), which is mainly composed of two modules: the channel and spatial attention (CSA) module and the feature-aligned aggregation (FAA) module. The CSA module adaptively enhances multi-scale features, and the FAA module generates alignment offsets for features of two different resolutions. FAANet integrates multi-scale features to improve robustness to object scale changes. To meet the real-time requirement, we adopt the joint-detection-embedding (JDE) paradigm[6] and the structural re-parameterization technique[10] to increase the network inference speed.

    The major contributions of this Letter are summarized as follows. (1) We propose FAANet to enhance and aggregate multi-scale features so as to cope with drastic scale changes in UAV videos. (2) We introduce the JDE paradigm to MOT in UAV videos and use the structural re-parameterization technique to increase the network inference speed. (3) Extensive experiments are conducted on the UAV detection and tracking (UAVDT)[9] benchmark to verify the effectiveness of the proposed method.

    2. Methods

    In this section, we first present the architecture of the proposed FAANet and then explain the details of the CSA module, FAA module, and online inference.

    2.1. Overview

    As shown in Fig. 1, the framework of the proposed FAANet contains four components: a feature extractor backbone, a feature fusion neck, detection and Re-ID prediction heads, and online tracking association. We adopt RepVGG[10] as our backbone to extract multi-scale features, with minor modifications to its channels and layers. Let the shape of the input image be $3 \times H_{\mathrm{input}} \times W_{\mathrm{input}}$; then the shapes of the multi-scale output features are $C_i \times H_i \times W_i$, where $C_i = 64 \times 2^{i-1}$, $H_i = H_{\mathrm{input}}/2^{i+1}$, $W_i = W_{\mathrm{input}}/2^{i+1}$, and $i = 1, 2, 3, 4$. We adopt the same detection and Re-ID heads as FairMOT[7], including the heat map, box size, center offset, and Re-ID embeddings.
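    To make the stride pattern concrete, the following minimal sketch computes the expected backbone output shapes for the 1024 × 544 input used in Section 3.2; the loop is illustrative only and is not part of any released code.

```python
# Expected multi-scale backbone output shapes for a 1024 x 544 input
# (C_i = 64 * 2^(i-1), spatial stride 2^(i+1), i = 1..4).
H_in, W_in = 544, 1024

for i in range(1, 5):
    C_i = 64 * 2 ** (i - 1)
    H_i, W_i = H_in // 2 ** (i + 1), W_in // 2 ** (i + 1)
    print(f"stage {i}: {C_i} x {H_i} x {W_i}")
# stage 1: 64 x 136 x 256, stage 2: 128 x 68 x 128,
# stage 3: 256 x 34 x 64,  stage 4: 512 x 17 x 32
```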


    Figure 1. Architecture of the proposed FAANet tracking framework. The framework contains four components: backbone (RepVGG), neck (CSA + FAA), head (Re-ID + detection), and association.

    2.2. Channel and spatial attention module

    In general, the input and output dimensions of channel attention (CA) or spatial attention (SA) are the same[11,12]. To improve real-time performance and reduce the risk of overfitting, we propose a method that combines feature dimension reduction, channel attention, and spatial attention, as shown in Fig. 2. We perform channel attention first and then spatial attention. We refer to the efficient channel attention mechanism[11] and modify it with a 2D 1×1 convolution and a one-dimensional convolution with varying kernel size and stride.


    Figure 2. Architecture of the CSA module.

    Specifically, let one output of the backbone be $F \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the channel dimension, height, and width. The channel weights can be computed as $w_{\mathrm{channel}} = \sigma(\mathrm{Conv1d}(g(F)))$, where $g(F) = \frac{1}{WH}\sum_{i,j=1}^{W,H} F_{i,j}$ is the channel-wise global average pooling (GAP) and $\sigma$ is the sigmoid activation function. We set the Conv1d kernel sizes to 3, 7, 11, and 15 and the strides to 1, 2, 4, and 8, respectively, for the four multi-scale features. In this way, the channel dimensions of all multi-scale features are reduced to 64. Subsequently, the output of the channel attention is computed as $F_{\mathrm{channel}} = w_{\mathrm{channel}} \odot \mathrm{Conv2d}(F) + \mathrm{Conv2d}(F)$, where $\odot$ denotes the element-wise product. The output dimension of the $\mathrm{Conv2d}(1 \times 1)$ is 64. In particular, for $i = 1$ ($C = 64$), we replace the $\mathrm{Conv2d}(1 \times 1)$ with an identity function.
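    A minimal PyTorch sketch of this channel branch is given below; the class name, padding choice, and module layout are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the CSA channel branch: GAP -> Conv1d over the channel axis ->
    sigmoid weights, applied to a 1x1-reduced feature map with a residual
    connection, following the channel-attention formula above."""

    def __init__(self, in_ch, out_ch=64, kernel=3, stride=1):
        super().__init__()
        # The 1D conv slides over the channel dimension; its stride reduces the
        # C channel weights to out_ch (e.g., kernel 15 / stride 8 for C = 512).
        self.conv1d = nn.Conv1d(1, 1, kernel_size=kernel, stride=stride,
                                padding=kernel // 2, bias=False)
        # 1x1 reduction to 64 channels; identity when the input already has 64.
        self.reduce = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):                      # x: (B, C, H, W)
        gap = x.mean(dim=(2, 3))               # channel-wise GAP, (B, C)
        w = torch.sigmoid(self.conv1d(gap.unsqueeze(1))).squeeze(1)  # (B, out_ch)
        f = self.reduce(x)                     # Conv2d(F), (B, out_ch, H, W)
        return w[..., None, None] * f + f      # w ⊙ Conv2d(F) + Conv2d(F)
```

    For the four backbone stages, this module would be instantiated with the (kernel, stride) pairs (3, 1), (7, 2), (11, 4), and (15, 8), matching the settings above.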

    In contrast to channel attention, which focuses on the channel dimension, spatial attention focuses on the height and width dimensions. Inspired by polarized self-attention[12], we adopt the polarized method to perform spatial attention. Specifically, let the output of the channel attention be $F_{\mathrm{channel}} \in \mathbb{R}^{64 \times H \times W}$; then the spatial weights can be computed by the following formulas: $F_1 = \mathrm{softmax}(f_1(g(\mathrm{Conv2d}(F_{\mathrm{channel}}))))$, $F_2 = f_2(\mathrm{Conv2d}(F_{\mathrm{channel}}))$, and $w_{\mathrm{spatial}} = \sigma(f_3(F_1 F_2))$, where $f_1$, $f_2$, and $f_3$ denote different reshape operations and $\mathrm{softmax}(\cdot)$ denotes the softmax function. After calculating the spatial weights, the final output of the CSA module is computed as $F_{\mathrm{spatial}} = w_{\mathrm{spatial}} \odot F_{\mathrm{channel}}$ and $F_{\mathrm{output}} = F_{\mathrm{channel}} + F_{\mathrm{spatial}}$.
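    A corresponding sketch of the spatial branch in the same polarized style follows; the layer names and the use of 1 × 1 convolutions ahead of $f_1$ and $f_2$ are our assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the CSA spatial branch (polarized-style), following the
    formulas above; the fixed channel width of 64 comes from the text."""

    def __init__(self, ch=64):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, 1)   # Conv2d feeding F1
        self.v = nn.Conv2d(ch, ch, 1)   # Conv2d feeding F2

    def forward(self, x):                              # x: (B, 64, H, W)
        b, c, h, w = x.shape
        q = self.q(x).mean(dim=(2, 3))                 # g(.): GAP -> (B, C)
        q = torch.softmax(q, dim=1).view(b, 1, c)      # f1: reshape to (B, 1, C)
        v = self.v(x).view(b, c, h * w)                # f2: reshape to (B, C, HW)
        w_sp = torch.sigmoid(torch.bmm(q, v)).view(b, 1, h, w)  # f3: back to (B, 1, H, W)
        return x + w_sp * x                            # F_channel + w_spatial ⊙ F_channel
```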

    Because the channel attention adopts a one-dimensional convolution instead of the usual fully connected layers, the number of parameters is greatly reduced and the inference speed is improved. The proposed CSA module also reduces the number of channels to 64, which further cuts the parameter count for real-time performance and provides a channel-consistent input to the FAA module.

    2.3. Feature-aligned aggregation module

    Traditional multi-scale feature aggregation usually adopts bilinear interpolation up-sampling followed by element-wise summation. However, feature misalignment causes feature degradation, which weakens the ability to locate objects at different scales. We therefore adopt the feature alignment mechanism of AlignSeg[13] and apply it to the UAV-based MOT domain.

    The FAA module is shown in Fig. 3. Specifically, let the corresponding two outputs of the CSA module be $F_{\mathrm{high}} \in \mathbb{R}^{64 \times H \times W}$ and $F_{\mathrm{low}} \in \mathbb{R}^{64 \times (H/2) \times (W/2)}$, where $F_{\mathrm{high}}$ and $F_{\mathrm{low}}$ denote the high- and low-resolution feature maps, respectively. The network learns to generate the offsets of the two feature maps for alignment. Let the corresponding two offsets be $\Delta_{\mathrm{high}}$ and $\Delta_{\mathrm{low}}$; then they can be computed by the following formulas: $F_{\mathrm{concat}} = \mathrm{Concat}(\mathrm{upsample}(F_{\mathrm{low}}), F_{\mathrm{high}})$, $\Delta_{\mathrm{low}} = \mathrm{Conv2d}_{3\times3}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2d}_{1\times1}(F_{\mathrm{concat}}))))$, and $\Delta_{\mathrm{high}} = \mathrm{Conv2d}_{3\times3}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv2d}_{1\times1}(F_{\mathrm{concat}}))))$, where $\mathrm{Concat}(\cdot,\cdot)$ denotes concatenation of two feature maps along the channel dimension, $\mathrm{upsample}(\cdot)$ denotes bilinear interpolation, $\mathrm{Conv2d}_{3\times3}$ and $\mathrm{Conv2d}_{1\times1}$ denote convolution layers with $3\times3$ and $1\times1$ kernels, and ReLU and BN denote the rectified linear unit activation function and batch normalization layer, respectively. Let the output of the FAA module be $F_{\mathrm{output}}$; then $F_{\mathrm{output}} = f(\mathrm{upsample}(F_{\mathrm{low}}), \Delta_{\mathrm{low}}) + f(F_{\mathrm{high}}, \Delta_{\mathrm{high}})$, where $f(F, \Delta)$ is the alignment function defined in AlignSeg[13]. Let $A_{h,w}$ be the output of $f(F, \Delta)$ at spatial position $(h, w)$; then $A_{h,w} = \sum_{h'=1}^{H}\sum_{w'=1}^{W} F_{h',w'} \cdot \max(0, 1 - |h + \Delta_{1hw} - h'|) \cdot \max(0, 1 - |w + \Delta_{2hw} - w'|)$, where $\Delta_{1hw}$ and $\Delta_{2hw}$ are the learned 2D transformation offsets for position $(h, w)$.
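    The alignment function $f(F, \Delta)$ is essentially bilinear sampling at offset positions, so it can be sketched with grid_sample as below; we assume 2-channel offsets $(\Delta_1, \Delta_2)$ in pixel units, which may differ from the exact offset-tensor layout used in the paper.

```python
import torch
import torch.nn.functional as F

def aligned_sample(feat, delta):
    """Sketch of the alignment function f(F, Δ): bilinearly sample `feat`
    (B, C, H, W) at positions (h + Δ1, w + Δ2) given `delta` (B, 2, H, W)."""
    b, _, h, w = feat.shape
    # base integer sampling grid (h, w)
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device, dtype=feat.dtype),
                            torch.arange(w, device=feat.device, dtype=feat.dtype),
                            indexing="ij")
    y = ys + delta[:, 0]                      # h + Δ1, broadcast over the batch
    x = xs + delta[:, 1]                      # w + Δ2
    # normalize to [-1, 1] as required by grid_sample
    grid = torch.stack((2 * x / (w - 1) - 1, 2 * y / (h - 1) - 1), dim=-1)
    # bilinear sampling realizes the max(0, 1 - |.|) kernel in the formula above
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
```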


    Figure 3. Architecture of the FAA module.

    2.4. Online inference

    Herein, the two important components of online inference are data association and structural re-parameterization, which we describe in turn.

    We follow the standard online tracking algorithm to associate boxes. As shown in Fig. 4, we first initialize tracklets from the estimated boxes in the first frame and use a Kalman filter to predict the locations of the tracklets in the next frame. We then perform two Hungarian matchings between detections and tracklets sequentially. The first matching considers appearance information (Re-ID embeddings) measured by cosine distance and motion information measured by Mahalanobis distance. The second matching only considers the intersection over union (IOU) distance, which is simple but useful. Finally, we initialize new tracklets that meet the confidence threshold and mark the lost tracklets.
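    The two sequential matchings can be sketched as follows with SciPy's Hungarian solver; the cost construction, function names, and thresholds are placeholders rather than the exact values used in our implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost, rows, cols, thr):
    """Run the Hungarian algorithm on a cost sub-matrix and keep pairs under thr."""
    if not rows or not cols:
        return [], rows, cols
    sub = cost[np.ix_(rows, cols)]
    r_idx, c_idx = linear_sum_assignment(sub)
    matches = [(rows[r], cols[c]) for r, c in zip(r_idx, c_idx) if sub[r, c] <= thr]
    matched_r = {m[0] for m in matches}
    matched_c = {m[1] for m in matches}
    return (matches,
            [r for r in rows if r not in matched_r],
            [c for c in cols if c not in matched_c])

def two_stage_association(cost_app, cost_iou, thr_app=0.4, thr_iou=0.5):
    """Sketch of the association cascade above. cost_app fuses appearance
    (cosine) and motion (Mahalanobis-gated) distances; cost_iou is 1 - IOU.
    Both are (num_tracklets, num_detections); thresholds are illustrative."""
    trk = list(range(cost_app.shape[0]))
    det = list(range(cost_app.shape[1]))
    m1, trk, det = hungarian_match(cost_app, trk, det, thr_app)   # Re-ID + motion
    m2, trk, det = hungarian_match(cost_iou, trk, det, thr_iou)   # IOU fallback
    return m1 + m2, trk, det   # matches, lost tracklets, unmatched detections
```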


    Figure 4. Procedure of association between detections and tracklets.

    Structural re-parameterization is used in RepVGG[10] to decouple a training-time multi-branch topology from a plain inference-time architecture. Herein, we also utilize structural re-parameterization to increase the backbone inference speed. As shown in Fig. 5, during the training phase, a multi-branch topology is adopted to avoid problems such as vanishing gradients. After training, we perform the transformation with the simple algebra in RepVGG[10] and save the plain backbone architecture for online inference.
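    The branch-fusion algebra can be sketched as follows; the attribute names on `block` (conv3, bn3, conv1, bn1, bn_id) are our own, the convolutions are assumed bias-free as in RepVGG, and the identity branch is assumed to exist only when the input and output channel counts match.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv_w, bn):
    """Fold a BatchNorm into the preceding (bias-free) conv's weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    w = conv_w * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

def reparameterize(block):
    """Sketch of collapsing a RepVGG-style block (3x3 conv + BN, 1x1 conv + BN,
    identity BN) into a single 3x3 conv, following the algebra in RepVGG[10]."""
    w3, b3 = fuse_conv_bn(block.conv3.weight, block.bn3)
    # pad the 1x1 kernel to 3x3 so the branches can be summed
    w1, b1 = fuse_conv_bn(F.pad(block.conv1.weight, [1, 1, 1, 1]), block.bn1)
    # identity branch == 3x3 conv with a 1 at the center of its own channel
    c = w3.shape[0]
    w_id = torch.zeros_like(w3)
    w_id[torch.arange(c), torch.arange(c), 1, 1] = 1.0
    w_id, b_id = fuse_conv_bn(w_id, block.bn_id)
    fused = nn.Conv2d(c, c, 3, stride=block.conv3.stride, padding=1)
    fused.weight.data, fused.bias.data = w3 + w1 + w_id, b3 + b1 + b_id
    return fused
```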


    Figure 5. Structural re-parameterization of a RepVGG block.

    3. Experiment

    3.1. Datasets and metrics

    We evaluate our method on the UAVDT dataset[9]. The dataset offers 50 sequences recorded from a UAV (60% for training and 40% for testing). We compare our method with various classic and recent algorithms on the testing sequences. We use multiple metrics, including MOT accuracy (MOTA), identification F1 score (IDF1), MOT precision (MOTP), mostly tracked targets (MT), mostly lost targets (ML), false positives (FP), false negatives (FN), identity switches (IDS), and fragmentations (FM), to evaluate different aspects of tracking performance. MOTA is computed from FP, FN, and IDS. Because the numbers of FP and FN are much larger than that of IDS, MOTA focuses more on detection performance, whereas IDF1 focuses more on association performance.
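    For reference, the standard definitions behind these two metrics (not restated in the original text) are:

```latex
% GT_t: ground-truth objects in frame t; IDTP/IDFP/IDFN: identity-level true/false positives and negatives
\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FP}_t + \mathrm{FN}_t + \mathrm{IDS}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}.
```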

    3.2. Implementation details

    We choose RepVGG[10] as our backbone, with weights initialized from the ImageNet-pretrained model. Specifically, we choose a slightly modified RepVGG-B0 as our default backbone, which has [2, 4, 6, 16] blocks and [64, 128, 256, 512] output channels. Our model is trained and tested on an NVIDIA GeForce GTX 1080 Ti graphics processing unit. We adopt standard data augmentation techniques, including scaling, rotation, and color jittering. The input image is resized to 1024 × 544, and the feature map resolution is 256 × 136. We choose Adam as our optimizer, with a starting learning rate of 8 × 10⁻⁵ and a batch size of 10. The learning rate decays by a factor of 10 every 15 epochs. Training takes about 18 h for 35 epochs in total.
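    A minimal PyTorch sketch of this optimization schedule is shown below; the one-layer `model` and random batch are stand-ins for FAANet and the UAVDT loader, and only the optimizer, learning-rate schedule, and epoch count reflect the settings reported above.

```python
import torch
import torch.nn as nn

# Stand-in model and data; not FAANet or UAVDT.
model = nn.Conv2d(3, 64, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=8e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)

for epoch in range(35):
    x = torch.randn(2, 3, 136, 256)     # the paper uses batch size 10 and 1024 x 544 inputs
    loss = model(x).mean()              # placeholder for the detection + Re-ID losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                    # learning rate x0.1 every 15 epochs
```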

    3.3. Experiment analysis

    We evaluate our FAANet together with various classic and recent algorithms, including CEM[14], CMOT[15], SORT[1], DeepSORT[2], GOG[16], IOUT[17], MDP[18], SMOT[19], DeepAlign[20], SBMA[21], IPGAT[8], M-CMSN-M[9], and Quadruplet[22]. All the results of the comparison algorithms are obtained from recently published papers and the UAVDT benchmark[9].

    Since most current UAV-based MOT algorithms follow the TBD paradigm, many do not publish their speed or publish only the speed of the association phase, which makes speed comparisons difficult and ambiguous. We collect the currently publicly available speeds and performances, which are shown in Fig. 6. Note that three of the compared algorithms only count the time consumed in the association phase. It can be observed that the speed of our FAANet is 60 times that of the current state-of-the-art M-CMSN-M[9], while its MOTA and IDF1 are also slightly higher. A detailed comparison is given in Table 1, which lists seven of the strongest trackers of recent years; our method achieves the best performance on 7 of the 10 metrics.

    Table 1. Results of a Quantitative Comparison among Classic MOT Methods and Recent UAV-Based Methods on the UAVDT Test Dataset

    | MOT Methods | Year | Framework | MOTA | IDF1 | MOTP | MT | ML | FP | FN | IDS | FM | FPS |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|
    | SORT[1] | 2016 | Faster RCNN | 39.0 | 43.7 | 74.3 | 33.9 | 28.0 | 33,037 | 172,628 | 2350 | 5787 | Nan |
    | DeepSORT[2] | 2017 | Faster RCNN | 40.7 | 58.2 | 73.2 | 41.7 | 23.7 | 44,868 | 155,290 | 2061 | 6432 | 15.01 |
    | DeepAlign[20] | 2018 | Faster RCNN | 41.6 | 49.0 | 73.3 | 43.7 | 24.3 | 45,420 | 152,224 | 1546 | 3733 | 0.23 |
    | SBMA[21] | 2019 | LSTM | 38.6 | 48.5 | 72.1 | 38.9 | 24.4 | 44,724 | 160,950 | 3489 | 11,796 | Nan |
    | IPGAT[8] | 2020 | LSTM + CGAN | 39.0 | 49.4 | 72.2 | 37.4 | 25.2 | 42,135 | 163,837 | 2091 | 10,057 | Nan |
    | M-CMSN-M[9] | 2020 | Faster RCNN | 43.1 | 62.6 | 73.5 | 45.3 | 22.7 | 45,900 | 147,638 | 390 | 4259 | 0.64 |
    | Quadruplet[22] | 2021 | Faster RCNN | 40.3 | 55.0 | 74.0 | Nan | Nan | 30,065 | 150,837 | 1091 | 3057 | Nan |
    | FAANet | Nan | RepVGG + JDE | 44.0 | 64.6 | 77.9 | 47.9 | 22.6 | 57,146 | 133,496 | 403 | 7202 | 38.24 |


    Figure 6. MOTA-IDF1-FPS comparison with other UAV-based MOT trackers on the UAVDT test dataset. The horizontal axis is FPS, the vertical axis is MOTA, and the radius of the circle is IDF1.

    The comparison based on scene attributes is shown in Fig. 7. Our algorithm performs better than the other algorithms in the high-altitude, bird-view, and fog scenes, which demonstrates the effectiveness of the proposed modules.


    Figure 7. IDF1 comparison with other UAV-based MOT trackers on the UAVDT test dataset based on scene attributes. The IDF1 of FAANet is marked outside the circle.

    3.4. Ablation experiments

    To validate the effectiveness of the CA, SA, and FAA modules, we introduce a baseline: RepVGG-B0 with the re-parameterization technique, which reduces the feature dimension by 1×1 convolution and performs multi-scale fusion by bilinear interpolation up-sampling and element-wise summation. All other settings, such as the training hyperparameters and association details, are kept consistent. As shown in Table 2, the contributions of CA and SA are similar, but their combination leads to better performance. Compared with CSA, FAA brings a larger IDF1 improvement, which may be because the feature-aligned multi-scale aggregation boosts robustness to small objects and scale changes.

    Table 2. Evaluation of the Critical Factors in FAANet

    | RepVGG-B0 | CA | SA | FAA | MOTA | IDF1 | FPS |
    |---|---|---|---|---|---|---|
    | ✓ |   |   |   | 38.2 | 56.8 | 45.70 |
    | ✓ | ✓ |   |   | 39.7 | 59.2 | 43.52 |
    | ✓ |   | ✓ |   | 39.3 | 59.4 | 43.41 |
    | ✓ | ✓ | ✓ |   | 40.4 | 60.2 | 41.35 |
    | ✓ |   |   | ✓ | 42.1 | 63.7 | 40.54 |
    | ✓ | ✓ | ✓ | ✓ | 44.0 | 64.6 | 38.24 |

    As shown in Table 3, the re-parameterization technique decreases the number of model parameters from 15.9 × 10⁶ to 14.4 × 10⁶ and the number of floating-point operations (FLOPs) from 62.3 × 10⁹ to 58.3 × 10⁹. Overall, it increases the frame rate from 30.32 to 38.24 frames per second (FPS), a 26% speed improvement without any loss of accuracy.

    Table 3. The Improvement of the Re-parameterization Technique

    | Re-param | Params (10⁶) | FLOPs (10⁹) | MOTA | IDF1 | FPS |
    |---|---|---|---|---|---|
    | ✗ | 15.9 | 62.3 | 44.0 | 64.6 | 30.32 |
    | ✓ | 14.4 | 58.3 | 44.0 | 64.6 | 38.24 |

    3.5. Visualization results

    Figure 8 visualizes tracking results of DeepSORT[2] and FAANet in several typical scenes of the UAVDT[9] test dataset. The results on M0403, M1301, and M1004 show that FAANet better tracks the small objects at the far end of the road, which arise from the wide field of view. The results on M0701 show that FAANet performs better under the scale changes caused by varying flight altitude.


    Figure 8. Examples and comparison of tracking results between DeepSORT and FAANet on the UAVDT test dataset.

    The vertical numbers denote the numbers of objects tracked in the three frames. On average, FAANet tracks 29% more objects than the classical DeepSORT[2] method in the four typical scene examples. This is mainly attributed to the multi-scale feature enhancement and fusion of the CSA and FAA modules.

    4. Conclusions

    In this Letter, we propose FAANet for MOT in UAV videos. Experimental results demonstrate that our method better copes with scale changes and small objects at real-time speed. We hope that its high accuracy and fast speed make it attractive for industrial applications.

    [1] A. Bewley, Z. Ge, L. Ott, F. Ramos, B. Upcroft. Simple online and realtime tracking. IEEE International Conference on Image Processing, 3464(2016).

    [2] N. Wojke, A. Bewley, D. Paulus. Simple online and realtime tracking with a deep association metric. IEEE International Conference on Image Processing, 3645(2017).

    [3] Q. Qian, Y. Hu, N. Zhao, M. Li, F. Shao, X. Zhang. Object tracking method based on joint global and local feature descriptor of 3D LIDAR point cloud. Chin. Opt. Lett., 18, 061001(2020).

    [4] J. Dai, L. Huang, K. Guo, L. Ling, H. Huang. Reflectance transformation imaging of 3D detection for subtle traces. Chin. Opt. Lett., 19, 031101(2021).

    [5] D. Ramanan, D. A. Forsyth. Finding and tracking people from the bottom up. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1(2003).

    [6] Z. Wang, L. Zheng, Y. Liu, Y. Li, S. Wang. Towards real-time multi-object tracking. European Conference on Computer Vision, 107(2020).

    [7] Y. Zhang, C. Wang, X. Wang, W. Zeng, W. Liu. FairMOT: on the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis., 129, 3069(2021).

    [8] H. Yu, G. Li, L. Su, B. Zhong, Q. Huang. Conditional GAN based individual and global motion fusion for multiple object tracking in UAV videos. Pattern Recognit. Lett., 131, 219(2020).

    [9] H. Yu, G. Li, W. Zhang, Q. Huang, D. Du, Q. Tian, N. Sebe. The unmanned aerial vehicle benchmark: object detection, tracking and baseline. Int. J. Comput. Vis., 128, 1141(2020).

    [10] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, J. Sun. RepVGG: making VGG-style ConvNets great again. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 13728(2021).

    [11] Q. Wang, B. Wu, P. Zhu, P. Li, Q. Hu. ECA-Net: efficient channel attention for deep convolutional neural networks. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 11531(2020).

    [12] H. Liu, F. Liu, X. Fan, D. Huang. Polarized self-attention: towards high-quality pixel-wise regression. arXiv preprint (2021).

    [13] Z. Huang, Y. Wei, X. Wang, H. Shi, T. S. Huang. AlignSeg: feature-aligned segmentation networks. IEEE Trans. Pattern Anal. Mach. Intell., 44, 550(2021).

    [14] A. Milan, S. Roth, K. Schindler. Continuous energy minimization for multitarget tracking. IEEE Trans. Pattern Anal. Mach. Intell., 36, 58(2013).

    [15] S. H. Bae, K. J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1218(2014).

    [16] H. Pirsiavash, D. Ramanan, C. C. Fowlkes. Globally-optimal greedy algorithms for tracking a variable number of objects. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1201(2011).

    [17] E. Bochinski, V. Eiselein, T. Sikora. High-speed tracking-by-detection without using image information. IEEE International Conference on Advanced Video and Signal Based Surveillance, 1(2017).

    [18] Y. Xiang, A. Alahi, S. Savarese. Learning to track: online multi-object tracking by decision making. IEEE International Conference on Computer Vision, 4705(2015).

    [19] C. Dicle, O. I. Camps, M. Sznaier. The way they move: tracking multiple targets with similar appearance. IEEE International Conference on Computer Vision, 2304(2013).

    [20] Q. Zhou, B. Zhong, Y. Zhang, J. Li, Y. Fu. Deep alignment network based multi-person tracking with occlusion and motion reasoning. IEEE Trans. Multimed., 21, 1183(2018).

    [21] H. Yu, G. Li, W. Zhang, H. Yao, Q. Huang. Self-balance motion and appearance model for multi-object tracking in UAV. Proceedings of the ACM Multimedia Asia, 1(2019).

    [22] H. U. Dike, Y. Zhou. A robust quadruplet and faster region-based CNN for UAV video-based multiple object tracking in crowded environment. Electronics, 10, 795(2021).

    Paper Information

    Category: Imaging Systems and Image Processing

    Received: Feb. 5, 2022

    Accepted: Apr. 28, 2022

    Posted: May 6, 2022

    Published Online: May 27, 2022

    The Author Email: Gang Xiao (xiaogang@sjtu.edu.cn)

    DOI: 10.3788/COL202220.081101
