Objective
The rapid advancement of virtual human technology and 3D face reconstruction has led to their wide application in virtual characters, film production, and human-computer interaction. However, traditional single-modal approaches, such as image- or audio-driven models, often struggle to capture rich emotional details and to maintain temporal consistency in dynamic expressions. To address these challenges, this research proposes a novel multi-modal facial expression encoder that integrates the image, text, and audio modalities to enhance the accuracy, naturalness, and emotional consistency of 3D facial expression generation in both static and dynamic scenarios.
Methods
The proposed method introduces a multi-modal facial expression encoding approach that leverages the complementary strengths of the image, text, and audio modalities (Fig. 2). The image modality captures visual facial features, the text modality provides semantic cues via predefined descriptors, and the audio modality refines temporal dynamics by modulating the image–text similarity. First, an image–text feature extraction module regresses facial expression and pose parameters, and the text-driven features are then dynamically adjusted according to their relevance to the visual data. In the final stage, audio features guide expression synthesis through cross-modal attention, aligning the visual expressions with the emotional content of the audio (Fig. 3).
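A minimal sketch of how the similarity-based text reweighting and the audio-guided cross-modal attention described above could be realized is given below; the module name, feature dimensions, expression/pose parameter sizes, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioGuidedFusion(nn.Module):
    """Illustrative sketch: reweight text descriptors by image-text similarity,
    then let audio frames query the fused tokens via cross-modal attention."""
    def __init__(self, dim=512, n_heads=8, expr_dim=50, pose_dim=6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Expression/pose dimensionalities are assumptions for illustration.
        self.expr_head = nn.Linear(dim, expr_dim)
        self.pose_head = nn.Linear(dim, pose_dim)

    def forward(self, img_feat, txt_feat, aud_feat):
        # img_feat: (B, D)    global visual feature
        # txt_feat: (B, K, D) features of K predefined expression descriptors
        # aud_feat: (B, T, D) frame-aligned audio features
        # 1) Dynamic text weighting: relevance of each descriptor to the image.
        sim = F.cosine_similarity(img_feat.unsqueeze(1), txt_feat, dim=-1)    # (B, K)
        weighted_txt = sim.softmax(dim=-1).unsqueeze(-1) * txt_feat           # (B, K, D)
        # 2) Fuse the visual token with the reweighted text tokens.
        tokens = torch.cat([img_feat.unsqueeze(1), weighted_txt], dim=1)      # (B, 1+K, D)
        # 3) Audio-guided cross-modal attention: audio frames attend to the tokens.
        fused, _ = self.cross_attn(query=aud_feat, key=tokens, value=tokens)  # (B, T, D)
        # 4) Regress per-frame expression and pose parameters.
        return self.expr_head(fused), self.pose_head(fused)

In this sketch the softmax over image–text similarities plays the role of the dynamic adjustment of text-driven features, while the attention output carries audio-aligned emotional cues into the per-frame expression and pose regression.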
Additionally, a temporal modeling module captures sequential dependencies between frames, ensuring smooth transitions and consistent emotional expression over time (Fig. 4). This integration of multi-modal fusion and temporal modeling yields natural, expressive, and temporally coherent 3D facial animations.
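One possible realization of the temporal modeling stage is sketched below, under the assumption that the per-frame expression codes from the fusion step are refined with sequence context and regularized by a frame-to-frame smoothness term; the bidirectional GRU, the residual connection, and the loss weight are illustrative choices, not the paper's stated design.

import torch
import torch.nn as nn

class TemporalRefiner(nn.Module):
    """Illustrative sketch: refine per-frame expression codes with sequence
    context so that consecutive frames transition smoothly."""
    def __init__(self, expr_dim=50, hidden=256):
        super().__init__()
        self.gru = nn.GRU(expr_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, expr_dim)

    def forward(self, expr_seq):                  # (B, T, expr_dim)
        ctx, _ = self.gru(expr_seq)               # (B, T, 2*hidden)
        return expr_seq + self.out(ctx)           # residual temporal refinement

def smoothness_loss(expr_seq, weight=0.1):
    # Assumed regularizer: penalize large frame-to-frame jumps in the codes.
    return weight * (expr_seq[:, 1:] - expr_seq[:, :-1]).pow(2).mean()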
Results and Discussions
The experimental results confirm the effectiveness of the proposed method. In static 3D face reconstruction, integrating the text and image modalities yields high emotional consistency (Fig. 6) and improves the precision of expression and pose regression (Tab. 3). When audio is incorporated, the method generates dynamic expressions with enhanced emotional consistency and visual quality, outperforming existing approaches in both static and dynamic scenarios (Tab. 4, Fig. 7). Moreover, the temporal modeling module ensures smooth transitions between frames, resulting in a more natural sequence of expressions.
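The abstract does not spell out how emotional consistency is scored, so the following is only one plausible formulation, assuming a pretrained emotion recognizer (hypothetical here) that maps the generated frames and a reference signal to per-frame emotion embeddings.

import torch
import torch.nn.functional as F

def emotional_consistency(gen_emb, ref_emb):
    # gen_emb, ref_emb: (T, D) per-frame emotion embeddings from an assumed
    # pretrained emotion recognizer; the score is the mean cosine similarity.
    return F.cosine_similarity(gen_emb, ref_emb, dim=-1).mean().item()

# Shape-only example: T = 16 frames, D = 128-dim emotion features.
score = emotional_consistency(torch.randn(16, 128), torch.randn(16, 128))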
Conclusions
This study introduces a novel multi-modal facial expression encoder that fuses the image, text, and audio modalities to enhance the accuracy, naturalness, and emotional consistency of 3D face reconstruction. The model effectively captures both static and dynamic expressions, addressing the limitations of single-modal methods. By integrating cross-modal attention and temporal modeling, it generates realistic and emotionally expressive 3D facial animations, demonstrating the potential of multi-modal fusion in virtual humans, film production, and human-computer interaction. Future work may incorporate additional modalities, such as gestures or contextual information, to further improve realism and expressiveness.