Acta Optica Sinica, Volume 43, Issue 15, 1510002 (2023)

From Perception to Creation: Exploring Frontier of Image and Video Generation Methods

Liang Lin and Binbin Yang*
Author Affiliations
  • School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, Guangdong, China
    Figures & Tables (24)
    Overview of generative adversarial network (GAN) principle
    Overview of variational auto-encoder (VAE)
    Overview of flow-based generative model
    Overview of diffusion model (a minimal training sketch follows this figure list)
    Comparison of different generative models
    Overview of image and video generation models
    Class-conditioned image generation[95]
    Text-conditioned image generation[96]
    Text-to-image generation results of Stable Diffusion[97]
    Weight-tuning-based image customization[102]
    Token-learning-based image customization[103]
    Image customization with multi-concept composition[104]
    Mask-region-based text-to-image editing[105]
    Prompt-editing-based text-to-image editing[106]
    Embedding-interpolation-based text-to-image editing[107]
    Generated videos of Imagen Video[108]
    VIDM framework, which uses two diffusion models to generate video content and motion information, respectively[111]
    PVDM framework, which represents a video as three two-dimensional latent variables so that a two-dimensional diffusion model can be used for training[112]
    Text-to-video generation results of Make-A-Video[113]
    VideoFusion framework, which uses a pre-trained text-to-image diffusion model to generate the base frame and trains a residual noise generator on video data[114]
    Video editing based on input image or text prompt[115]
    One-shot text-to-video generation[116]
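
    The figure list above spans the four mainstream generative families (GAN, VAE, flow-based, and diffusion models, Figs. 1-4) and their image and video applications. As a reference point for the diffusion overview in Fig. 4, a minimal DDPM-style training step is sketched below; the denoiser interface, the step count T, and the linear beta schedule are illustrative assumptions, not details taken from the paper.

      import torch
      import torch.nn.functional as F

      T = 1000                                 # number of diffusion steps (assumed)
      betas = torch.linspace(1e-4, 0.02, T)    # linear noise schedule (assumed)
      alphas_bar = torch.cumprod(1.0 - betas, dim=0)

      def ddpm_loss(denoiser, x0):
          # Sample a random timestep per image and Gaussian noise
          t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
          noise = torch.randn_like(x0)
          # Closed-form forward (noising) process:
          # x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
          a_bar = alphas_bar.to(x0.device)[t].view(-1, 1, 1, 1)
          x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
          # The denoiser is trained to predict the injected noise
          return F.mse_loss(denoiser(x_t, t), noise)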
    • Table 1. FID comparison of different text-to-image pre-trained models on the MS-COCO dataset (a sketch of the FID computation follows the table)

      Method         FID↓
      LAFITE[119]    26.94
      DALL·E[120]    17.89
      LDM[100]       12.63
      GLIDE[96]      12.24
      DALL·E2[97]    10.39
      Imagen[98]     7.27
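
    FID (Fréchet inception distance, lower is better) fits Gaussians to Inception-v3 features of real and generated images and measures the Fréchet distance between them. Below is a minimal NumPy sketch of the distance itself, with feature extraction omitted; the function name is ours, not from the paper.

      import numpy as np
      from scipy.linalg import sqrtm

      def frechet_distance(feats_real, feats_gen):
          # feats_*: (N, D) arrays of pooled Inception-v3 activations
          mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
          cov_r = np.cov(feats_real, rowvar=False)
          cov_g = np.cov(feats_gen, rowvar=False)
          # Matrix square root of the covariance product; drop tiny imaginary parts
          covmean = sqrtm(cov_r @ cov_g).real
          return float(((mu_r - mu_g) ** 2).sum()
                       + np.trace(cov_r + cov_g - 2.0 * covmean))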
    • Table 2. Performance comparison of different class-to-video generation methods on the UCF-101 and Sky Time-lapse datasets (sketches of the FVD and IS metrics follow the table; —: not reported)

      Method           FVD↓ (Sky Time-lapse, 256×256)   FVD↓ (UCF-101, 256×256)   FID↓ (UCF-101, 128×128)   IS↑ (UCF-101, 128×128)
      MoCoGAN[125]     206.6                            1821.4                    —                         12.42
      VideoGPT[126]    222.7                            2880.6                    —                         24.69
      MoCoGAN-HD[127]  164.1                            1729.6                    838                       32.36
      DIGAN[128]       83.1                             471.9                     655                       29.71
      VIDM[111]        57.4                             294.7                     306                       53.34
      PVDM[112]        55.4                             343.6                     —                         74.40
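
    In Table 2, FVD (Fréchet video distance) applies the same Fréchet distance as FID to the features of a pretrained video network (I3D), while IS (inception score, higher is better) rewards samples whose class posterior is confident yet diverse across the sample set. A minimal NumPy sketch of IS, assuming pre-computed classifier probabilities (a C3D classifier is the usual choice on UCF-101):

      import numpy as np

      def inception_score(probs, eps=1e-12):
          # probs: (N, C) softmax outputs of a pretrained classifier
          marginal = probs.mean(axis=0, keepdims=True)   # p(y)
          # IS = exp( E_x[ KL(p(y|x) || p(y)) ] )
          kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
          return float(np.exp(kl.mean()))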
    Citation: Liang Lin, Binbin Yang. From Perception to Creation: Exploring Frontier of Image and Video Generation Methods[J]. Acta Optica Sinica, 2023, 43(15): 1510002

    Paper Information

    Category: Image Processing

    Received: Mar. 30, 2023

    Accepted: Jul. 22, 2023

    Published Online: Aug. 15, 2023

    The Author Email: Yang Binbin (yangbb3@mail2.sysu.edu.cn)

    DOI: 10.3788/AOS230758
