Optics and Precision Engineering, Volume 33, Issue 1, 135 (2025)

Super-resolution reconstruction of text image with multimodal semantic interaction

Yulan HAN*, Yihong LUO, Yujie CUI, and Chaofeng LAN
Author Affiliations
  • College of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China

    Insufficient resolution and the absence of scale transformation in feature representation hinder the accurate extraction of text content from images, which in turn misguides the reconstruction network. To address this challenge, this paper proposes a novel multimodal semantic interaction method for text image super-resolution reconstruction. By incorporating an attention mask within the semantic inference module, the method corrects text content information and employs semantic prior knowledge to constrain and guide the reconstruction of semantically accurate super-resolution text images. To enhance the network's representational capacity and accommodate text images of varying shapes and lengths, a multimodal semantic interaction block is introduced. This block consists of three key components: a visual dual-flow integration module, a cross-modal adaptive fusion module, and an orthogonal bidirectional gated recurrent unit. First, the visual dual-flow integration module captures multi-granularity visual information, including contextual understanding, by leveraging the complementary strengths of global statistical features and robust local approximations. Next, the cross-modal adaptive fusion module dynamically facilitates interaction and alignment between semantic information and multi-granularity visual features, effectively reducing cross-modal feature discrepancies. Finally, the orthogonal bidirectional gated recurrent unit establishes multimodal feature dependencies in both vertical and horizontal orientations. Experimental results on the TextZoom test set demonstrate that the proposed method outperforms state-of-the-art approaches in terms of quantitative metrics, achieving significant improvements in PSNR and SSIM. Compared to the TPGSR model, the proposed method increases the average recognition accuracy of ASTER, MORAN, and CRNN by 2.9%, 3.6%, and 3.7%, respectively. These findings highlight the effectiveness of multimodal semantic interaction in enhancing text image super-resolution and improving text recognition accuracy.
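    The sketch below is only an illustration of the "orthogonal bidirectional gated recurrent unit" idea described in the abstract, not the authors' implementation: it assumes a PyTorch setting, and the class name OrthogonalBiGRU, the hidden size, and the 1x1 fusion projection are placeholders. It runs one bidirectional GRU along each row (horizontal direction) and another along each column (vertical direction) of the multimodal feature map, then fuses the two passes.

    import torch
    import torch.nn as nn

    class OrthogonalBiGRU(nn.Module):
        """Illustrative sketch: bidirectional GRUs along orthogonal axes of a feature map."""
        def __init__(self, channels: int, hidden: int):
            super().__init__()
            self.gru_h = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
            self.gru_v = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
            # Each BiGRU yields 2*hidden features; project the concatenation back to `channels`.
            self.proj = nn.Conv2d(4 * hidden, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, C, H, W) multimodal feature map
            b, c, h, w = x.shape
            # Horizontal pass: each row is a sequence of length W.
            seq_h = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
            out_h, _ = self.gru_h(seq_h)                              # (B*H, W, 2*hidden)
            out_h = out_h.reshape(b, h, w, -1).permute(0, 3, 1, 2)    # (B, 2*hidden, H, W)
            # Vertical pass: each column is a sequence of length H.
            seq_v = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
            out_v, _ = self.gru_v(seq_v)                              # (B*W, H, 2*hidden)
            out_v = out_v.reshape(b, w, h, -1).permute(0, 3, 2, 1)    # (B, 2*hidden, H, W)
            # Fuse the two orthogonal directions with a residual connection.
            return x + self.proj(torch.cat([out_h, out_v], dim=1))

    In this reading, the two passes establish dependencies along the horizontal and vertical orientations respectively, as the abstract states; how the paper actually fuses the directions may differ from the simple concatenation-plus-projection used here.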


    Citation

    Yulan HAN, Yihong LUO, Yujie CUI, Chaofeng LAN. Super-resolution reconstruction of text image with multimodal semantic interaction[J]. Optics and Precision Engineering, 2025, 33(1): 135

    Paper Information

    Received: Jul. 31, 2024

    Published Online: Apr. 1, 2025

    Author Email: Yulan HAN (hanyulan@hrbust.edu.cn)

    DOI: 10.37188/OPE.20253301.0135