Computer Engineering, Volume 51, Issue 8, 262 (2025)

A Research on Training Method for Diffusion Model Based on Neighborhood Attention

JI Lixia1,2, ZHOU Hongxin1, XIAO Shijie1, CHEN Yunfeng3, and ZHANG Han1,*
Author Affiliations
  • 1School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450000, Henan, China
  • 2College of Software Engineering, Sichuan University, Chengdu 610065, Sichuan, China
  • 3Henan Cocyber Information and Technology Co., Ltd., Zhengzhou 450000, Henan, China
    References (26)

    [1] KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. [2023-10-03]. http://export.arxiv.org/pdf/1312.6114.

    [2] LI X, THICKSTUN J, GULRAJANI I, et al. Diffusion-LM improves controllable text generation[C]//Proceedings of Advances in Neural Information Processing Systems. [S. l.]: Curran Associates, 2022: 4328-4343.

    [3] HO J, CHAN W, SAHARIA C, et al. Imagen video: high definition video generation with diffusion models[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2210.02303.

    [5] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. Berlin, Germany: Springer, 2015: 234-241.

    [7] PEEBLES W, XIE S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2023: 4195-4205.

    [8] MICHELUCCI U. An introduction to autoencoders[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2201.03898.

    [9] CAO Y, LI S, LIU Y, et al. A comprehensive survey of AI-Generated Content (AIGC): a history of generative AI from GAN to ChatGPT[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2303.04226.

    [10] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786): 504-507.

    [11] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2014: 2672-2680.

    [12] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]//Proceedings of Advances in Neural Information Processing Systems. [S. l.]: Curran Associates, 2020: 6840-6851.

    [13] SIDDIQUE N, PAHEDING S, ELKIN C P, et al. U-Net and its variants for medical image segmentation: a review of theory and applications[J]. IEEE Access, 2021, 9: 82031-82057.

    [14] WU J, LIU W L, LI C, et al. A state-of-the-art survey of U-Net in microscopic image analysis: from simple usage to structure mortification[J]. Neural Computing and Applications, 2023, 36: 3317-3346.

    [15] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2022: 10684-10695.

    [17] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2017: 6000-6010.

    [18] LIU Y, ZHANG Y, WANG Y X, et al. A survey of visual transformers[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(6): 7478-7498.

    [19] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2103.00020.

    [20] RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2204.06125.

    [21] DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2105.05233.

    [22] ZHENG H, NIE W, VAHDAT A, et al. Fast training of diffusion models with masked transformers[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2306.09305.

    [23] HASSANI A, WALTON S, LI J, et al. Neighborhood attention transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2023: 6185-6194.

    [24] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2009: 248-255.

    [25] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[EB/OL]. [2023-10-03]. https://arxiv.org/pdf/1706.08500.

    [26] BARRATT S, SHARMA R. A note on the inception score[EB/OL]. [2023-10-03]. https://arxiv.org/pdf/1801.01973.

    [27] HESSEL J, HOLTZMAN A, FORBES M, et al. CLIPScore: a reference-free evaluation metric for image captioning[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2104.08718.

    [28] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2015: 1-9.

    [29] GAO S, ZHOU P, CHENG M M, et al. Masked diffusion transformer is a strong image synthesizer[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2303.14389.

    Paper Information


    Received: Nov. 8, 2023

    Accepted: Aug. 26, 2025

    Published Online: Aug. 26, 2025

    Corresponding Author Email: ZHANG Han (zhang_han@gs.zzu.edu.cn)

    DOI: 10.19678/j.issn.1000-3428.0068793
