Infrared and Laser Engineering, Volume 54, Issue 8, 20250129 (2025)

3D face reconstruction based on multimodal expression encoding and temporal modeling

Xiaolong HE1,2, Feipeng DA1,2, and Shaoyan GAI1,2
Author Affiliations
  • 1School of Automation, Southeast University, Nanjing 210096, China
  • 2Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education, Nanjing 210096, China

Objective
The rapid advancement of virtual human technology and 3D face reconstruction has led to their wide application in virtual characters, film production, and human-computer interaction. However, traditional single-modal approaches, such as image- or audio-driven models, often struggle to capture rich emotional detail and to maintain temporal consistency in dynamic expressions. To address these challenges, this research proposes a novel multi-modal facial expression encoder that integrates the image, text, and audio modalities to enhance the accuracy, naturalness, and emotional consistency of 3D facial expression generation in both static and dynamic scenarios.

Methods
The proposed method introduces a multi-modal facial expression encoding approach that leverages the complementary strengths of the image, text, and audio modalities (Fig.2). The image modality captures visual facial features, the text modality provides semantic cues via predefined descriptors, and the audio modality refines temporal dynamics by modulating the image-text similarity. First, an image-text feature extraction module regresses facial expression and pose parameters; the text-driven features are then dynamically adjusted according to their relevance to the visual data. In the final stage, audio features guide expression synthesis through cross-modal attention, aligning visual expressions with the emotional content of the audio (Fig.3). Additionally, a temporal modeling module captures sequential dependencies between frames, ensuring smooth transitions and consistent emotional expression over time (Fig.4). This integration of multi-modal fusion and temporal modeling yields natural, expressive, and temporally coherent 3D facial animations.

Results and Discussions
The experimental results confirm the effectiveness of the proposed method.
In static 3D face reconstruction, integrating the text and image modalities yields high emotional consistency (Fig.6) and improved precision in expression and pose regression (Tab.3). When audio is incorporated, the method generates dynamic expressions with enhanced emotional consistency and visual quality, outperforming existing approaches in both static and dynamic scenarios (Tab.4, Fig.7). Moreover, the temporal modeling module guarantees smooth transitions between frames, producing a more natural sequence of expressions.

Conclusions
This study introduces a novel multi-modal facial expression encoder that fuses the image, text, and audio modalities to enhance the accuracy, naturalness, and emotional consistency of 3D face reconstruction. The model effectively captures both static and dynamic expressions, addressing the limitations of single-modal methods. By integrating cross-modal attention and temporal modeling, it generates realistic and emotionally expressive 3D facial animations, demonstrating the potential of multi-modal fusion in virtual humans, film production, and human-computer interaction. Future work may incorporate additional modalities, such as gesture or contextual information, to further improve realism and expressiveness.
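The two mechanisms central to the abstract, audio-to-visual cross-modal attention and temporal modeling, can be illustrated with a minimal sketch. The function names, feature shapes, and the use of exponential smoothing as a stand-in for the paper's temporal modeling module are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_feats, visual_feats):
    # Audio frames act as queries over fused image-text features, so the
    # synthesized expression at each frame follows the audio's content.
    # audio_feats: (T_audio, d), visual_feats: (T_visual, d)
    d = audio_feats.shape[-1]
    scores = audio_feats @ visual_feats.T / np.sqrt(d)  # (T_audio, T_visual)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    return weights @ visual_feats                       # (T_audio, d)

def temporal_smooth(params, alpha=0.8):
    # Exponential smoothing over per-frame expression parameters, a simple
    # proxy for the temporal modeling module: higher alpha keeps more of the
    # previous frame, giving smoother frame-to-frame transitions.
    out = np.empty_like(params)
    out[0] = params[0]
    for t in range(1, len(params)):
        out[t] = alpha * out[t - 1] + (1 - alpha) * params[t]
    return out
```

In this sketch, the attention weights decide how strongly each audio frame draws on each image-text feature, and the smoothing pass suppresses abrupt parameter jumps between consecutive frames.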


    Xiaolong HE, Feipeng DA, Shaoyan GAI. 3D face reconstruction based on multimodal expression encoding and temporal modeling[J]. Infrared and Laser Engineering, 2025, 54(8): 20250129

    Paper Information

    Category: Optical imaging, display and information processing

    Received: Feb. 27, 2025

    Published Online: Aug. 29, 2025

    DOI:10.3788/IRLA20250129
