Acta Optica Sinica, Vol. 44, Issue 6, 0628006 (2024)

Multi-Scale Optical Remote Sensing Image Target Detection Based On Enhanced Small Target Features

Huilin Shan1,2, Shuoyang Wang1, Junyi Tong1, Yuxiang Hu2, Yanhao Zhang2, and Yinsheng Zhang1,2,*
Author Affiliations
  • 1School of Electronics & Information Engineering, Nanjing University of Information Science & Technology, Nanjing 210044, Jiangsu, China
  • 2School of Electronic & Information Engineering, Wuxi University, Wuxi 214105, Jiangsu, China

    Objective

    Remote sensing observes objects and phenomena on the Earth's surface from satellites and aircraft, providing large-scale, multi-spectral, and high-resolution data from remote locations without physical contact. The technology offers global, real-time coverage and supports multi-source data fusion. Remote sensing target detection uses these data to automatically detect, locate, and identify specific target types in remote sensing images, which is significant for disaster warning and response, environmental monitoring, and ecological protection.

    Methods

    Traditional remote sensing image target detection algorithms include valley thresholding, the Sobel operator, and convolutional neural networks (CNNs), of which CNNs are the most widely used. CNNs have sound feature extraction and pattern recognition capabilities, but they are sensitive to location and scale and may still perform poorly on small targets or under large scale changes. Remote sensing target detection must also cope with complex backgrounds, unbalanced target distribution, dense targets, false detections, and missed detections. We therefore propose a multi-scale neural network for enhancing small target features (ESF-MNet) to address the low detection accuracy and poor generalization of current remote sensing target detection. The core idea is to combine multiple CBH modules with the CA attention mechanism into a multi-residual cascade layer and to aggregate its outputs efficiently, enhancing target feature expression. The RFE module is introduced to help the network respond better to remote sensing targets of different scales. GSConv and CARAFE modules form the main structure of the Neck. This design reduces the number of parameters while maintaining accuracy, and the CARAFE module improves the semantic extraction ability of the network. Finally, a detection head better suited to small targets is constructed to reduce the loss of small target information as the network depth increases.
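
    The parameter saving mentioned above can be illustrated with a rough weight count. The sketch below is not taken from the paper: it assumes the commonly described GSConv layout (a dense convolution producing half of the output channels, a 5 x 5 depthwise convolution producing the other half, then concatenation and a weight-free channel shuffle), and the channel and kernel sizes are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a 2-D convolution without bias."""
    return (c_in // groups) * c_out * k * k


def gsconv_params(c_in, c_out, k, dw_k=5):
    """Weight count of a GSConv-style layer under an ASSUMED layout.

    Assumed layout (not confirmed by the source text):
      1. a dense k x k convolution produces c_out / 2 channels;
      2. a depthwise dw_k x dw_k convolution maps those channels to
         another c_out / 2 channels;
      3. both halves are concatenated and channel-shuffled, which adds
         no weights.
    """
    half = c_out // 2
    dense = conv_params(c_in, half, k)
    depthwise = conv_params(half, half, dw_k, groups=half)
    return dense + depthwise


# Hypothetical layer: 256 -> 256 channels with a 3 x 3 kernel.
standard = conv_params(256, 256, 3)
gs = gsconv_params(256, 256, 3)
ratio = gs / standard

print(f"standard conv weights: {standard}")
print(f"GSConv-style weights:  {gs}")
print(f"ratio: {ratio:.3f}")
```

    Under these assumptions the GSConv-style layer needs roughly half the weights of a standard convolution with the same input and output widths, because the depthwise branch is very cheap compared with the dense branch.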

    Results and Discussions

    Qualitative and quantitative experiments compare ESF-MNet with mainstream remote sensing detection models, and ablation experiments are analyzed. To verify the effectiveness of each improvement, we conduct seven experiments on the DOTA and NWPU VHR-10 datasets under the same environment and parameters, based on the YOLOv7 network model. The detected image targets have complex backgrounds, and the results are shown in Table 1. Compared with applying attention alone, the EACM module significantly improves the results. The proposed receptive field enhancement module effectively captures context information at different scales. The constructed Neck layer simplifies the network structure and improves the semantic extraction ability, and the proposed detection layer is suited to small targets and enhances the fusion of shallow features. The mAP0.5 improves by 3.7% and 4.5% on the two datasets, respectively, which demonstrates the effectiveness of each module. To further evaluate model performance, the proposed algorithm is compared with other algorithms in the same experimental environment, with the same training and test sets. Results for Faster R-CNN, FMSSD, YOLOv5s, YOLOv7, YOLOv8s, the algorithms in Refs. [21-23], and the proposed algorithm are shown in Tables 2 and 3. ESF-MNet achieves the best mean average precision, and its advantage is most pronounced on small targets. The mAP reaches 83.6% and 97.6% on the two datasets, respectively. However, the algorithm does not achieve the best accuracy on some large targets (such as track-and-field grounds and basketball courts). The main reason is that the lightweight model has a shallow network and a small downsampling factor. Increasing the network depth and the downsampling factor would improve the detection of large targets but degrade the detection of small targets. Therefore, our research focus is to improve the detection accuracy of small and medium-sized targets while maintaining high detection accuracy for large targets. Overall, compared with other algorithms, the proposed algorithm retains a clear advantage in mAP, greatly reduces the false detection rate, and meets the basic requirements of real-time detection.
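
    To make the mAP0.5 figures easier to interpret, the sketch below shows the intersection-over-union (IoU) test that underlies the metric. It assumes that mAP0.5 here follows the usual convention (a prediction counts as correct only if its IoU with a ground-truth box is at least 0.5). The box sizes and the 3-pixel localization error are hypothetical and are not drawn from the paper's data.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(size, shift):
    """IoU between a size x size box and a copy shifted by `shift` px in x and y."""
    truth = (0, 0, size, size)
    pred = (shift, shift, size + shift, size + shift)
    return iou(truth, pred)


IOU_THRESHOLD = 0.5  # assumed meaning of the "0.5" in mAP0.5

small = shifted_iou(10, 3)    # hypothetical 10 px target, 3 px error
large = shifted_iou(100, 3)   # hypothetical 100 px target, same error

print(f"small target IoU: {small:.3f}, match: {small >= IOU_THRESHOLD}")
print(f"large target IoU: {large:.3f}, match: {large >= IOU_THRESHOLD}")
```

    The same 3-pixel error leaves the large box well above the threshold but pushes the small box below it, which is one reason small targets weigh so heavily on this metric.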

    Conclusions

    The detection and recognition of targets in optical remote sensing images is significant for civilian applications. However, with complex backgrounds, dense small targets, and a lack of feature information, identifying small targets is very difficult. To address this, we construct an efficient layer attention aggregation module in the backbone network to extract target features of various categories and employ the receptive field enhancement module to fuse feature maps of different depths, improving the information expression ability of the network. Additionally, the Neck layer is formed from GSConv and CARAFE modules and compressed by halving the number of channels, and the cross-stage partial network module VoV-GSCSP is designed with a one-shot aggregation method, which reduces network computation and improves detection speed. Adding the CARAFE module improves the detection accuracy. In addition, a multi-scale network is constructed by using feature output layers with downsampling factors of 4, 8, and 16 in the detection head, which effectively improves the detection of small targets. Experimental results show that the model has sound real-time performance and strong robustness for small target detection in complex backgrounds. Despite these improvements, the model may still produce missed and false detections. Although remote sensing image target detection methods are mature, accurate and efficient detection in large, complex scenes remains difficult. We will continue to study these problems in future work.
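
    The effect of the 4x, 8x, and 16x output layers can be sketched with simple arithmetic. The example below assumes a hypothetical 640 x 640 input and a hypothetical 16-pixel target, and compares against the 8x, 16x, and 32x strides that are typical of YOLO-style heads; none of these values is taken from the paper.

```python
def grid_size(input_size, stride):
    """Side length of the feature map produced at a given downsampling factor."""
    return input_size // stride


def cells_covered(target_size, stride):
    """Approximate number of feature-map cells a target spans along one side."""
    return target_size / stride


INPUT_SIZE = 640      # hypothetical input resolution in pixels
TARGET_SIZE = 16      # hypothetical small-target side length in pixels
HEAD_STRIDES = (4, 8, 16)        # downsampling factors named in the text
BASELINE_STRIDES = (8, 16, 32)   # assumed typical YOLO-style strides

summary = {
    stride: (grid_size(INPUT_SIZE, stride), cells_covered(TARGET_SIZE, stride))
    for stride in sorted(set(HEAD_STRIDES + BASELINE_STRIDES))
}

for stride, (grid, cells) in summary.items():
    print(f"stride {stride:>2}: {grid} x {grid} grid, "
          f"target spans {cells} cells per side")
```

    At a stride of 4 the hypothetical target still spans several cells per side, whereas at a stride of 32 it occupies less than one cell, which is consistent with the stated trade-off between small-target and large-target accuracy.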

    Huilin Shan, Shuoyang Wang, Junyi Tong, Yuxiang Hu, Yanhao Zhang, Yinsheng Zhang. Multi-Scale Optical Remote Sensing Image Target Detection Based On Enhanced Small Target Features[J]. Acta Optica Sinica, 2024, 44(6): 0628006

    Paper Information

    Category: Remote Sensing and Sensors

    Received: Oct. 20, 2023

    Accepted: Dec. 15, 2023

    Published Online: Mar. 15, 2024

    The Author Email: Zhang Yinsheng (yorkzhang@nuist.edu.cn)

    DOI:10.3788/AOS231676
