
JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation

  • Stony Brook University

Research output: Contribution to conference › Paper › peer-review

Abstract

We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio. Project Page: https://starc52.github.io/publications/JEAN.

Original language: English
State: Published - 2024
Event: 35th British Machine Vision Conference, BMVC 2024 - Glasgow, United Kingdom
Duration: Nov 25 2024 to Nov 28 2024

Conference

Conference: 35th British Machine Vision Conference, BMVC 2024
Country/Territory: United Kingdom
City: Glasgow
Period: 11/25/24 to 11/28/24
