Journal of Beijing Normal University, Volume 61, Issue 3, 285 (2025)

Uncertainty-aware generative network for Chinese scene text editing

GAO Yutong1,2, ZHANG Ying1, LIU Xianggan3,4,*, LIU Yidian5, JIANG Shan5, GUO Ziyi5, and SONG Feifan6
Author Affiliations
  • 1Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance, Ministry of Education, Minzu University of China, Beijing, China
  • 2Key Laboratory of Big Data and Artificial Intelligence in Transportation, Ministry of Education, Beijing Jiaotong University, Beijing, China
  • 3Natural Language Processing and Knowledge Graph Laboratory, Huazhong University of Science and Technology, Wuhan, Hubei, China
  • 4Hainan Lingshui Li'an International Education Innovation Pilot Zone, Lingshui, Hainan, China
  • 5State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
  • 6Research Center for Rural Financial Reform, Changchun, Jilin, China
    References (20)

    [1] CHEN J Y, HUANG Y P, LYU T C, et al. Textdiffuser: diffusion models as text painters[EB/OL]. [2025-03-29]. https://proceedings.neurips.cc/paper_files/paper/2023/file/1df4afb0b4ebf492a41218ce16b6d8df-Paper-Conference.pdf

    [2] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139

    [3] DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[EB/OL]. (2021-06-01) [2025-03-29]. https://arxiv.org/abs/2105.05233v4

    [6] LEE J, KIM Y, KIM S, et al. RewriteNet: realistic scene text image generation via editing text in real-world image[EB/OL]. [2025-03-29]. https://www.academia.edu/download/91842572/2107.11041v1.pdf

    [7] GU J X, MENG X J, LU G S, et al. Wukong: a 100 million large-scale Chinese cross-modal pre-training benchmark[EB/OL]. [2025-03-29]. https://arxiv.org/abs/2202.06767v4

    [8] GAL R, ALALUF Y, ATZMON Y, et al. An image is worth one word: personalizing text-to-image generation using textual inversion[EB/OL]. (2022-08-02) [2025-03-29]. https://arxiv.org/abs/2208.01618v1

    [9] JI J B, ZHANG G H, WANG Z W, et al. Improving diffusion models for scene text editing with dual encoders[EB/OL]. (2023-04-12) [2025-03-29]. https://arxiv.org/abs/2304.05568v1

    [10] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[EB/OL]. [2025-03-29]. https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf

    [11] ZHANG L M, RAO A Y, AGRAWALA M. Adding conditional control to text-to-image diffusion models[C]//2023 IEEE/CVF International Conference on Computer Vision (ICCV). October 1-6, 2023. Paris: IEEE, 2023: 3813

    [12] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]//2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 18-24, 2022. New Orleans: IEEE, 2022: 10674

    [13] YANG Y K, GUI D N, YUAN Y H, et al. Glyphcontrol: glyph conditional control for visual text generation[EB/OL]. (2023-11-11) [2025-03-29]. https://arxiv.org/abs/2305.18259v2

    [14] BALAJI Y, NAH S J, HUANG X, et al. Ediff-I: text-to-image diffusion models with an ensemble of expert denoisers[EB/OL]. (2023-05-14) [2025-03-29]. https://arxiv.org/pdf/2211.01324

    [15] CHEN H X, XU Z E, GU Z X, et al. Diffute: universal text editing diffusion model[EB/OL]. [2025-03-29]. https://proceedings.neurips.cc/paper_files/paper/2023/file/c7138635035501eb71b0adf6ddc319d6-Paper-Conference.pdf

    [16] ZENG W C, SHU Y, LI Z H, et al. Textctrl: diffusion-based scene text editing with prior guidance control[EB/OL]. (2024-10-14) [2025-03-29]. https://arxiv.org/abs/2410.10133v1

    [17] WANG T, QU X C, LIU T. Textmastero: mastering high-quality scene text editing in diverse languages and styles[EB/OL]. (2024-08-20) [2025-03-29]. https://arxiv.org/abs/2408.10623v1

    [18] CHANG H W, ZHANG H, BARBER J, et al. Muse: text-to-image generation via masked generative transformers[EB/OL]. (2023-01-02) [2025-03-29]. https://arxiv.org/abs/2301.00704v1

    [19] MA J, ZHAO M J, CHEN C, et al. Glyphdraw: seamlessly rendering text with intricate spatial structures in text-to-image generation[EB/OL]. (2023-05-23) [2025-03-29]. https://arxiv.org/abs/2303.17870v2

    [20] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[M]//Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. Cham: Springer International Publishing, 2015: 234

    [21] TUO Y X, XIANG W M, HE J Y, et al. Anytext: multilingual visual text generation and editing[EB/OL]. (2024-02-21) [2025-03-29]. https://arxiv.org/abs/2311.03054v5

    [22] SCHUHMANN C, VENCU R, BEAUMONT R, et al. LAION-400M: open dataset of CLIP-filtered 400 million image-text pairs[EB/OL]. (2021-11-03) [2025-03-29]. https://arxiv.org/abs/2111.02114v1

    GAO Yutong, ZHANG Ying, LIU Xianggan, LIU Yidian, JIANG Shan, GUO Ziyi, SONG Feifan. Uncertainty-aware generative network for Chinese scene text editing[J]. Journal of Beijing Normal University, 2025, 61(3): 285

    Paper Information

    Received: Apr. 9, 2025

    Accepted: Aug. 21, 2025

    Published Online: Aug. 21, 2025

    Corresponding author email: LIU Xianggan (liuxg@hust.edu.cn)

    DOI: 10.12202/j.0476-0301.2025056
