Optics and Precision Engineering, Volume 33, Issue 1, 135 (2025)

Super-resolution reconstruction of text image with multimodal semantic interaction

Yulan HAN*, Yihong LUO, Yujie CUI, and Chaofeng LAN
Author Affiliations
  • College of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China

    Insufficient resolution and the absence of scale transformation in feature representation hinder the accurate extraction of text content from images, which in turn misguides the reconstruction network. To address this challenge, this paper proposes a novel multimodal semantic interaction method for text image super-resolution reconstruction. By incorporating an attention mask within the semantic inference module, the method corrects text content information and employs semantic prior knowledge to constrain and guide the reconstruction of semantically accurate super-resolution text images. To enhance the network's representational capacity and accommodate text images of varying shapes and lengths, a multimodal semantic interaction block is introduced. This block consists of three key components: a visual dual-flow integration module, a cross-modal adaptive fusion module, and an orthogonal bidirectional gated recurrent unit. First, the visual dual-flow integration module captures multi-granularity visual information, including contextual understanding, by leveraging the complementary strengths of global statistical features and robust local approximations. Next, the cross-modal adaptive fusion module dynamically facilitates interaction and alignment between semantic information and multi-granularity visual features, effectively reducing cross-modal feature discrepancies. Finally, the orthogonal bidirectional gated recurrent unit establishes multimodal feature dependencies in both vertical and horizontal orientations. Experimental results on the TextZoom test set demonstrate that the proposed method outperforms state-of-the-art approaches in terms of quantitative metrics, achieving significant improvements in PSNR and SSIM. Compared to the TPGSR model, the proposed method increases the average recognition accuracy of ASTER, MORAN, and CRNN by 2.9%, 3.6%, and 3.7%, respectively. These findings highlight the effectiveness of multimodal semantic interaction in enhancing text image super-resolution and improving text recognition accuracy.
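    The sketch below is only an illustration of the "orthogonal bidirectional gated recurrent unit" idea described in the abstract, not the authors' implementation: it assumes a PyTorch setting, and the class name OrthogonalBiGRU, the hidden size, and the 1x1 fusion projection are placeholders. It runs one bidirectional GRU along each row (horizontal direction) and another along each column (vertical direction) of the multimodal feature map, then fuses the two passes.

    import torch
    import torch.nn as nn

    class OrthogonalBiGRU(nn.Module):
        """Illustrative sketch: bidirectional GRUs along orthogonal axes of a feature map."""
        def __init__(self, channels: int, hidden: int):
            super().__init__()
            self.gru_h = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
            self.gru_v = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
            # Each BiGRU yields 2*hidden features; project the concatenation back to `channels`.
            self.proj = nn.Conv2d(4 * hidden, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, C, H, W) multimodal feature map
            b, c, h, w = x.shape
            # Horizontal pass: each row is a sequence of length W.
            seq_h = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
            out_h, _ = self.gru_h(seq_h)                              # (B*H, W, 2*hidden)
            out_h = out_h.reshape(b, h, w, -1).permute(0, 3, 1, 2)    # (B, 2*hidden, H, W)
            # Vertical pass: each column is a sequence of length H.
            seq_v = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
            out_v, _ = self.gru_v(seq_v)                              # (B*W, H, 2*hidden)
            out_v = out_v.reshape(b, w, h, -1).permute(0, 3, 2, 1)    # (B, 2*hidden, H, W)
            # Fuse the two orthogonal directions with a residual connection.
            return x + self.proj(torch.cat([out_h, out_v], dim=1))

    In this reading, the two passes establish dependencies along the horizontal and vertical orientations respectively, as the abstract states; how the paper actually fuses the directions may differ from the simple concatenation-plus-projection used here.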


    Citation

    Yulan HAN, Yihong LUO, Yujie CUI, Chaofeng LAN. Super-resolution reconstruction of text image with multimodal semantic interaction[J]. Optics and Precision Engineering, 2025, 33(1): 135

    Paper Information

    Received: Jul. 31, 2024

    Published Online: Apr. 1, 2025

    Author Email: Yulan HAN (hanyulan@hrbust.edu.cn)

    DOI: 10.37188/OPE.20253301.0135