Acta Optica Sinica, Volume. 43, Issue 15, 1510002(2023)

From Perception to Creation: Exploring Frontier of Image and Video Generation Methods

Liang Lin and Binbin Yang*
Author Affiliations
  • School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510006, Guangdong, China

    Significance

    In recent years, advances in computing software and hardware have enabled artificial intelligence (AI) models to reach performance levels approaching or surpassing human capabilities on perception tasks. However, to develop mature AI systems that comprehensively understand the world, models must be capable of generating visual concepts rather than simply recognizing them, because creation and customization require a thorough understanding of both the high-level semantics and the fine details of each generated object.

    From an applied perspective, once AI models acquire the capability of visual understanding and generation, they will significantly promote progress across diverse sectors of industry. For example, visual generative models can be applied to colorizing and restoring old black-and-white photos and films; enhancing and remastering old videos in high definition; synthesizing real-time virtual anchors, talking faces, and AI avatars; adding special effects to personalized video shooting on short-video platforms; stylizing users' portraits and input images; and compositing movie special effects and rendering scenes. Therefore, research on the theories and methods of image and video generation models holds great theoretical significance and industrial application value.

    Progress

    In this paper, we first provide a comprehensive overview of existing generative frameworks, including generative adversarial networks (GANs), variational autoencoders (VAEs), flow models, and diffusion models, which are summarized in Fig. 5. A GAN is trained in an adversarial manner, in which a generator and a discriminator compete with each other until an ideal generator is obtained. A VAE is composed of an encoder and a decoder and is trained via variational inference so that the decoded distribution approximates the real data distribution. A flow model uses a family of invertible mappings and a simple prior to construct an invertible transformation between the real data distribution and the prior distribution; different from GANs and VAEs, flow models are trained by direct maximum-likelihood estimation. Recently, diffusion models have emerged as a class of powerful visual generative models with state-of-the-art synthesis results on visual data. A diffusion model decomposes image generation into a sequence of denoising steps starting from a Gaussian prior. Its training procedure is more stable because it avoids adversarial training, and it can be successfully deployed in large-scale pre-trained generation systems.
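    To make the denoising formulation above concrete, the following is a minimal, self-contained sketch of a DDPM-style training step in PyTorch; the noise schedule, the toy denoiser, and the random data batch are illustrative assumptions rather than the configuration of any model reviewed here.

```python
# Minimal sketch of the denoising-diffusion training objective described above
# (DDPM-style noise prediction). Hyperparameters, the toy denoiser, and the
# data tensor are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

T = 1000                                        # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

class TinyDenoiser(nn.Module):
    """Stand-in for the UNet noise predictor eps_theta(x_t, t)."""
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, x_t, t):
        # Broadcast the normalized timestep as an extra input channel.
        t_map = (t.float() / T).view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1))

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

x0 = torch.rand(8, 3, 32, 32)                   # a batch of "real" images
t = torch.randint(0, T, (x0.shape[0],))         # random timestep per sample
noise = torch.randn_like(x0)

# Forward (diffusion) process: corrupt x0 toward the Gaussian prior.
a_bar = alphas_bar[t].view(-1, 1, 1, 1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

# The reverse process is learned by regressing the injected noise with a simple
# MSE loss, which is why training is more stable than adversarial objectives.
loss = nn.functional.mse_loss(model(x_t, t), noise)
opt.zero_grad(); loss.backward(); opt.step()
```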

    We then review recent state-of-the-art advances in image and video generation and discuss their merits and limitations. Fig. 6 gives an overview of image and video generation models and their classification. Works on pre-trained text-to-image (T2I) generation models study how to pre-train a T2I foundation model on large-scale datasets. Among these T2I foundation models, Stable Diffusion has become a widely used backbone for image/video customization and editing tasks, owing to its impressive performance and scalability. Prompt-based image editing methods use a pre-trained T2I foundation model, e.g., Stable Diffusion, to edit a generated or natural image according to input text prompts. Because of the difficulty of collecting large-scale, high-quality video datasets and the expensive computational cost, research on video generation still lags behind image generation. To transfer the success of text-to-image diffusion models, some works, e.g., the video diffusion model, Imagen Video, VIDM, and PVDM, have tried to train a video diffusion model from scratch on enormous video data, aiming to obtain a video generation foundation model analogous to Stable Diffusion. Another line of work resorts to pre-trained image generators, e.g., Stable Diffusion, to provide a content prior for video generation and only learns the temporal dynamics from video, which significantly improves training efficiency.
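    As an illustration of how such a pre-trained backbone is typically used, the sketch below samples an image from a text prompt and then performs simple prompt-guided editing. It assumes the Hugging Face diffusers library and the public runwayml/stable-diffusion-v1-5 checkpoint; neither is specific to the methods reviewed in this paper.

```python
# Usage sketch of a pre-trained Stable Diffusion backbone for text-to-image
# synthesis and simple prompt-guided editing. Assumes the `diffusers` library
# and the "runwayml/stable-diffusion-v1-5" checkpoint (illustrative choices).
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "runwayml/stable-diffusion-v1-5"

# 1) Text-to-image: sample an image from the Gaussian prior conditioned on a prompt.
t2i = StableDiffusionPipeline.from_pretrained(model_id).to(device)
image = t2i(
    "an oil painting of a lighthouse at sunset",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

# 2) Prompt-based editing: start the reverse diffusion from a noised version of
#    an existing image, so the output keeps its layout while following a new prompt.
i2i = StableDiffusionImg2ImgPipeline.from_pretrained(model_id).to(device)
edited = i2i(
    prompt="the same lighthouse in a snowstorm",
    image=image,
    strength=0.6,          # how much of the reverse process is re-run
    guidance_scale=7.5,
).images[0]
edited.save("edited_lighthouse.png")
```

    The strength parameter controls the trade-off between preserving the original image and following the new prompt, which is the basic mechanism many prompt-based editing methods build upon.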

    Finally, we discuss the drawbacks of existing image and video generative modeling methods, such as the misalignment between input prompts and generated images/videos, propose feasible strategies for improving these visual generative models, and outline promising future research directions. These contributions are crucial for advancing the field of visual generative modeling and realizing the full potential of AI systems in generating visual concepts.

    Conclusions and Prospects

    With the rapid evolution of diffusion models, artificial intelligence has undergone a significant transformation from perception to creation. AI can now generate perceptually realistic and harmonious data and even allows visual customization and editing based on input conditions. In light of this progress in generative models, we provide prospects for the potential future forms of AI: with both perception and cognitive abilities, AI models can establish their own open world, enabling people to realize the concept of "what they think is what they get" without being constrained by real-life conditions. For example, in this open environment, the training of AI models is no longer restricted by data collection, which would reform many existing paradigms in machine learning; techniques such as transfer learning (domain adaptation) and active learning may diminish in importance. AI might achieve self-interaction, self-learning, and self-improvement within the open world it creates, ultimately attaining higher levels of intelligence and profoundly transforming human lifestyles.

    Paper Information

    Category: Image Processing

    Received: Mar. 30, 2023

    Accepted: Jul. 22, 2023

    Published Online: Aug. 15, 2023

    Author Email: Yang Binbin (yangbb3@mail2.sysu.edu.cn)

    DOI: 10.3788/AOS230758