Computer Engineering, Vol. 51, Issue 8, 262 (2025)
A Research on Training Method for Diffusion Model Based on Neighborhood Attention
[1] KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. [2023-10-03]. http://export.arxiv.org/pdf/1312.6114.
[2] LI X, THICKSTUN J, GULRAJANI I, et al. Diffusion-LM improves controllable text generation[C]//Proceedings of Advances in Neural Information Processing Systems. [S. l.]: AAAI Press, 2022: 4328-4343.
[3] HO J, CHAN W, SAHARIA C, et al. Imagen video: high definition video generation with diffusion models[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2210.02303.
[5] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. Berlin, Germany: Springer, 2015: 234-241.
[7] PEEBLES W, XIE S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2023: 4195-4205.
[8] MICHELUCCI U. An introduction to autoencoders[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2201.03898?context=cs.AI.
[9] CAO Y, LI S, LIU Y, et al. A comprehensive survey of AI-Generated Content (AIGC): a history of generative AI from GAN to ChatGPT[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2303.04226.
[10] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786): 504-507.
[11] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2014: 2672-2680.
[12] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]//Proceedings of Advances in Neural Information Processing Systems. [S. l.]: AAAI Press, 2020: 6840-6851.
[13] SIDDIQUE N, PAHEDING S, ELKIN C P, et al. U-Net and its variants for medical image segmentation: a review of theory and applications[J]. IEEE Access, 2021, 9: 82031-82057.
[14] WU J, LIU W L, LI C, et al. A state-of-the-art survey of U-Net in microscopic image analysis: from simple usage to structure mortification[J]. Neural Computing and Applications, 2023, 36: 3317-3346.
[15] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2022: 10684-10695.
[17] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2017: 6000-6010.
[18] LIU Y, ZHANG Y, WANG Y X, et al. A survey of visual transformers[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(6): 7478-7498.
[19] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2103.00020.
[20] RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2204.06125.
[21] DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2105.05233.
[22] ZHENG H, NIE W, VAHDAT A, et al. Fast training of diffusion models with masked transformers[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2306.09305.
[23] HASSANI A, WALTON S, LI J, et al. Neighborhood attention transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2023: 6185-6194.
[24] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2009: 248-255.
[25] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[EB/OL]. [2023-10-03]. https://arxiv.org/pdf/1706.08500.
[26] BARRATT S, SHARMA R. A note on the inception score[EB/OL]. [2023-10-03]. https://arxiv.org/pdf/1801.01973.
[27] HESSEL J, HOLTZMAN A, FORBES M, et al. CLIPScore: a reference-free evaluation metric for image captioning[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2104.08718.
[28] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2015: 1-9.
[29] GAO S, ZHOU P, CHENG M M, et al. Masked diffusion transformer is a strong image synthesizer[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2303.14389.
JI Lixia, ZHOU Hongxin, XIAO Shijie, CHEN Yunfeng, ZHANG Han. A Research on Training Method for Diffusion Model Based on Neighborhood Attention[J]. Computer Engineering, 2025, 51(8): 262
Received: Nov. 8, 2023
Accepted: Aug. 26, 2025
Published Online: Aug. 26, 2025
The Author Email: ZHANG Han (zhang_han@gs.zzu.edu.cn)