Acta Photonica Sinica, Volume. 53, Issue 3, 0310001(2024)

Object Detection Algorithm Based on CNN-Transformer Dual Modal Feature Fusion

Chen YANG1,2, Zhiqiang HOU1,2,*, Xinyue LI1,2, Sugang MA1,2, and Xiaobao YANG1,2
Author Affiliations
  • 1School of Computer Science and Technology, Xi'an University of Posts & Telecommunications, Xi'an 710121, China
  • 2Shaanxi Provincial Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an 710121, China

    To overcome the limitations of single-modal object detection, this study proposes a dual-modal feature fusion object detection algorithm based on CNN-Transformer. By fully exploiting the clear contour information of infrared images and the rich detail information of visible light images, the algorithm integrates the complementary information of the two modalities, significantly enhancing detection performance and extending its applicability to more complex real-world scenarios. The core innovation is a dual-stream feature extraction network that processes infrared and visible light images simultaneously. Because infrared images provide clear contours that help guide object localization, a CNN-based Feature Extraction (CFE) module processes the infrared stream to better capture object location information and strengthen the feature representation. Visible light images, in contrast, contain rich detail such as color and texture distributed across the whole image, so a Transformer-based Feature Extraction (TFE) module processes the visible stream to better capture global context and fine detail. This differentiated feature extraction strategy exploits the respective advantages of the two modalities, enabling the algorithm to adapt to detection tasks across different scenes and conditions. In addition, a dual-modal feature fusion module is introduced, which fuses the feature information of both modalities through effective inter-modal information interaction.
The design of this module not only preserves the original features of the two modalities but also realizes inter-modal feature complementarity, enhancing the expressiveness of object features and further improving multimodal detection performance. To validate the algorithm, extensive experiments were conducted on three datasets: the publicly available KAIST and FLIR datasets and a self-built GIR dataset. These datasets contain multimodal image pairs, i.e., infrared images, visible light images, and infrared-visible image pairs; the algorithm was trained and tested on them to evaluate its applicability and performance in various situations. On the KAIST dataset, the detection accuracy of the proposed algorithm on dual-modal images is 5.7% and 17.4% higher than that of the baseline algorithm on infrared and visible light images, respectively. On the FLIR dataset, the corresponding gains over the baseline are 11.6% and 17.1%. On the self-built GIR dataset, the proposed algorithm likewise shows a notable improvement in detection accuracy. Furthermore, the algorithm can independently process either infrared or visible light images alone, with a significant accuracy improvement over the baseline for both modalities, further validating its applicability and robustness. The algorithm also flexibly supports displaying the detection results on either the visible or the infrared image during visualization, to meet different needs.
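The dual-stream idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: a single valid 2D convolution stands in for the CNN-based CFE branch, one head of scaled dot-product self-attention stands in for the Transformer-based TFE branch, and simple concatenation of the flattened features stands in for the fusion module. All image sizes, the kernel, and the fusion-by-concatenation choice are illustrative assumptions.

```python
import numpy as np

def cfe_branch(ir_img, kernel):
    """Toy CNN-style branch: one valid 2D convolution over the infrared image,
    producing a local feature map that emphasizes contour/location cues."""
    H, W = ir_img.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(ir_img[i:i + kh, j:j + kw] * kernel)
    return out

def tfe_branch(vis_tokens):
    """Toy Transformer-style branch: single-head scaled dot-product
    self-attention over visible-image tokens, mixing global context."""
    d = vis_tokens.shape[-1]
    scores = vis_tokens @ vis_tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # rows sum to 1
    return weights @ vis_tokens

def fuse(ir_feat, vis_feat):
    """Toy fusion module: concatenate the flattened features of both streams,
    so information from both modalities is preserved in one descriptor."""
    return np.concatenate([ir_feat.ravel(), vis_feat.ravel()])

rng = np.random.default_rng(0)
ir = rng.standard_normal((8, 8))        # stand-in infrared image
vis = rng.standard_normal((8, 8))       # stand-in visible light image
kernel = np.ones((3, 3)) / 9.0          # averaging kernel (illustrative)

ir_feat = cfe_branch(ir, kernel)            # (6, 6) local feature map
vis_feat = tfe_branch(vis.reshape(16, 4))   # (16, 4) globally mixed tokens
joint = fuse(ir_feat, vis_feat)             # (100,) joint dual-modal descriptor
print(joint.shape)
```

In the paper's actual network the two streams stay spatially aligned and the fusion module performs learned inter-modal interaction rather than plain concatenation; the sketch only shows why a convolutional branch captures local infrared structure while an attention branch mixes information across the whole visible image.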


    Chen YANG, Zhiqiang HOU, Xinyue LI, Sugang MA, Xiaobao YANG. Object Detection Algorithm Based on CNN-Transformer Dual Modal Feature Fusion[J]. Acta Photonica Sinica, 2024, 53(3): 0310001

    Paper Information

    Received: Aug. 21, 2023

    Accepted: Sep. 26, 2023

    Published Online: May 16, 2024

    Author Email: HOU Zhiqiang (hou-zhq@sohu.com)

    DOI: 10.3788/gzxb20245303.0310001
