Computer Engineering, Vol. 51, Issue 8, 262 (2025)
A Research on Training Method for Diffusion Model Based on Neighborhood Attention
[1] KINGMA D P, WELLING M. Auto-encoding variational Bayes[EB/OL]. [2023-10-03]. http://export.arxiv.org/pdf/1312.6114.
[2] LI X, THICKSTUN J, GULRAJANI I, et al. Diffusion-LM improves controllable text generation[C]//Proceedings of Advances in Neural Information Processing Systems. [S. l.]: AAAI Press, 2022: 4328-4343.
[3] HO J, CHAN W, SAHARIA C, et al. Imagen video: high definition video generation with diffusion models[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2210.02303.
[5] RONNEBERGER O, FISCHER P, BROX T. U-Net: convolutional networks for biomedical image segmentation[C]//Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. Berlin, Germany: Springer, 2015: 234-241.
[7] PEEBLES W, XIE S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2023: 4195-4205.
[8] MICHELUCCI U. An introduction to autoencoders[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2201.03898?context=cs.AI.
[9] CAO Y, LI S, LIU Y, et al. A comprehensive survey of AI-Generated Content (AIGC): a history of generative AI from GAN to ChatGPT[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2303.04226.
[10] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786): 504-507.
[11] GOODFELLOW I J, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial nets[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2014: 2672-2680.
[12] HO J, JAIN A, ABBEEL P. Denoising diffusion probabilistic models[C]//Proceedings of Advances in Neural Information Processing Systems. [S. l.]: AAAI Press, 2020: 6840-6851.
[13] SIDDIQUE N, PAHEDING S, ELKIN C P, et al. U-Net and its variants for medical image segmentation: a review of theory and applications[J]. IEEE Access, 2021, 9: 82031-82057.
[14] WU J, LIU W L, LI C, et al. A state-of-the-art survey of U-Net in microscopic image analysis: from simple usage to structure mortification[J]. Neural Computing and Applications, 2023, 36: 3317-3346.
[15] ROMBACH R, BLATTMANN A, LORENZ D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2022: 10684-10695.
[17] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York, USA: ACM Press, 2017: 6000-6010.
[18] LIU Y, ZHANG Y, WANG Y X, et al. A survey of visual transformers[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(6): 7478-7498.
[19] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2103.00020.
[20] RAMESH A, DHARIWAL P, NICHOL A, et al. Hierarchical text-conditional image generation with CLIP latents[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2204.06125.
[21] DHARIWAL P, NICHOL A. Diffusion models beat GANs on image synthesis[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2105.05233.
[22] ZHENG H, NIE W, VAHDAT A, et al. Fast training of diffusion models with masked transformers[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2306.09305.
[23] HASSANI A, WALTON S, LI J, et al. Neighborhood attention transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2023: 6185-6194.
[24] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2009: 248-255.
[25] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[EB/OL]. [2023-10-03]. https://arxiv.org/pdf/1706.08500.
[26] BARRATT S, SHARMA R. A note on the inception score[EB/OL]. [2023-10-03]. https://arxiv.org/pdf/1801.01973.
[27] HESSEL J, HOLTZMAN A, FORBES M, et al. CLIPScore: a reference-free evaluation metric for image captioning[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2104.08718.
[28] SZEGEDY C, LIU W, JIA Y Q, et al. Going deeper with convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2015: 1-9.
[29] GAO S, ZHOU P, CHENG M M, et al. Masked diffusion transformer is a strong image synthesizer[EB/OL]. [2023-10-03]. https://arxiv.org/abs/2303.14389.
JI Lixia, ZHOU Hongxin, XIAO Shijie, CHEN Yunfeng, ZHANG Han. A Research on Training Method for Diffusion Model Based on Neighborhood Attention[J]. Computer Engineering, 2025, 51(8): 262
Received: Nov. 8, 2023
Accepted: Aug. 26, 2025
Published Online: Aug. 26, 2025
The Author Email: ZHANG Han (zhang_han@gs.zzu.edu.cn)