Objective
The rapid advancement of virtual human technology and 3D face reconstruction has led to their wide application in virtual characters, film production, and human-computer interaction. However, traditional single-modal approaches, such as image- or audio-driven models, often struggle to capture rich emotional details and to maintain temporal consistency in dynamic expressions. To address these challenges, this research proposes a novel multi-modal facial expression encoder that integrates the image, text, and audio modalities to enhance the accuracy, naturalness, and emotional consistency of 3D facial expression generation in both static and dynamic scenarios.
Methods
The proposed method introduces a multi-modal facial expression encoding approach that leverages the complementary strengths of the image, text, and audio modalities (Fig. 2). The image modality captures visual facial features, the text modality provides semantic cues via predefined descriptors, and the audio modality refines temporal dynamics by modulating the image–text similarity. First, an image–text feature extraction module regresses facial expression and pose parameters, and the text-driven features are then dynamically adjusted according to their relevance to the visual data. In the final stage, audio features guide expression synthesis through cross-modal attention, aligning the visual expressions with the emotional content of the audio (Fig. 3).
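A minimal sketch of how the similarity-based text reweighting and the audio-guided cross-modal attention described above could be realized is given below; the module name, feature dimensions, expression/pose parameter sizes, and the use of PyTorch's nn.MultiheadAttention are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioGuidedFusion(nn.Module):
    """Illustrative sketch: reweight text descriptors by image-text similarity,
    then let audio frames query the fused tokens via cross-modal attention."""
    def __init__(self, dim=512, n_heads=8, expr_dim=50, pose_dim=6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Expression/pose dimensionalities are assumptions for illustration.
        self.expr_head = nn.Linear(dim, expr_dim)
        self.pose_head = nn.Linear(dim, pose_dim)

    def forward(self, img_feat, txt_feat, aud_feat):
        # img_feat: (B, D)    global visual feature
        # txt_feat: (B, K, D) features of K predefined expression descriptors
        # aud_feat: (B, T, D) frame-aligned audio features
        # 1) Dynamic text weighting: relevance of each descriptor to the image.
        sim = F.cosine_similarity(img_feat.unsqueeze(1), txt_feat, dim=-1)    # (B, K)
        weighted_txt = sim.softmax(dim=-1).unsqueeze(-1) * txt_feat           # (B, K, D)
        # 2) Fuse the visual token with the reweighted text tokens.
        tokens = torch.cat([img_feat.unsqueeze(1), weighted_txt], dim=1)      # (B, 1+K, D)
        # 3) Audio-guided cross-modal attention: audio frames attend to the tokens.
        fused, _ = self.cross_attn(query=aud_feat, key=tokens, value=tokens)  # (B, T, D)
        # 4) Regress per-frame expression and pose parameters.
        return self.expr_head(fused), self.pose_head(fused)

In this sketch the softmax over image–text similarities plays the role of the dynamic adjustment of text-driven features, while the attention output carries audio-aligned emotional cues into the per-frame expression and pose regression.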
Additionally, a temporal modeling module captures sequential dependencies between frames, ensuring smooth transitions and consistent emotional expression over time (Fig. 4). This integration of multi-modal fusion and temporal modeling yields natural, expressive, and temporally coherent 3D facial animations.
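One possible realization of the temporal modeling stage is sketched below, under the assumption that the per-frame expression codes from the fusion step are refined with sequence context and regularized by a frame-to-frame smoothness term; the bidirectional GRU, the residual connection, and the loss weight are illustrative choices, not the paper's stated design.

import torch
import torch.nn as nn

class TemporalRefiner(nn.Module):
    """Illustrative sketch: refine per-frame expression codes with sequence
    context so that consecutive frames transition smoothly."""
    def __init__(self, expr_dim=50, hidden=256):
        super().__init__()
        self.gru = nn.GRU(expr_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, expr_dim)

    def forward(self, expr_seq):                  # (B, T, expr_dim)
        ctx, _ = self.gru(expr_seq)               # (B, T, 2*hidden)
        return expr_seq + self.out(ctx)           # residual temporal refinement

def smoothness_loss(expr_seq, weight=0.1):
    # Assumed regularizer: penalize large frame-to-frame jumps in the codes.
    return weight * (expr_seq[:, 1:] - expr_seq[:, :-1]).pow(2).mean()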
Results and Discussions
The experimental results confirm the effectiveness of the proposed method. In static 3D face reconstruction, integrating the text and image modalities yields high emotional consistency (Fig. 6) and improves the precision of expression and pose regression (Tab. 3). When audio is incorporated, the method generates dynamic expressions with enhanced emotional consistency and visual quality, outperforming existing approaches in both static and dynamic scenarios (Tab. 4, Fig. 7). Moreover, the temporal modeling module ensures smooth transitions between frames, resulting in a more natural sequence of expressions.
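The abstract does not spell out how emotional consistency is scored, so the following is only one plausible formulation, assuming a pretrained emotion recognizer (hypothetical here) that maps the generated frames and a reference signal to per-frame emotion embeddings.

import torch
import torch.nn.functional as F

def emotional_consistency(gen_emb, ref_emb):
    # gen_emb, ref_emb: (T, D) per-frame emotion embeddings from an assumed
    # pretrained emotion recognizer; the score is the mean cosine similarity.
    return F.cosine_similarity(gen_emb, ref_emb, dim=-1).mean().item()

# Shape-only example: T = 16 frames, D = 128-dim emotion features.
score = emotional_consistency(torch.randn(16, 128), torch.randn(16, 128))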
Conclusions
This study introduces a novel multi-modal facial expression encoder that fuses the image, text, and audio modalities to enhance the accuracy, naturalness, and emotional consistency of 3D face reconstruction. The model effectively captures both static and dynamic expressions, addressing the limitations of single-modal methods. By integrating cross-modal attention and temporal modeling, it generates realistic and emotionally expressive 3D facial animations, demonstrating the potential of multi-modal fusion in virtual humans, film production, and human-computer interaction. Future work may incorporate additional modalities, such as gestures or contextual information, to further improve realism and expressiveness.