封面.png

A Technique of Synthesizing an Avatar’s Upper Facial Expression Based on the User’s Lower Face

Lead Researcher | VR Design & Development, User Testing, Interdisciplinary Collaboration

Teammates: Xinge Liu (Tsinghua University), Ziyu Han (Carnegie Mellon University)

Supervisors: Prof. Xin Yi (Tsinghua University), Prof. Xin Tong (Duke Kunshan University)

Time: 06/2022-Present

——Research Description——

In social VR, an avatar's upper facial expression is often missing because the head-mounted display (HMD) occludes the user's upper face. The mainstream solution is to embed an eye-tracking camera inside the HMD to capture the user's upper face movements; combined with a facial tracking camera attached to the HMD for the lower face, this lets the avatar's whole face move. However, the connection between upper and lower face movements has not been studied yet. Our research proposes a technique that synthesizes an avatar's upper facial expression from the user's lower face using only a facial tracking camera.

——My Role——

  1. Conducted literature reviews on methods of synthesizing avatars’ facial expressions.

  2. Designed and developed the technique of synthesizing an avatar’s upper facial expression based on a user's lower face with the Unity game engine.

  3. Collaborated with a computational design teammate to design different blendshapes (the units of an avatar's facial expression).

  4. Developed a virtual meeting platform with the Mirror framework on Unity.

  5. Collaborated with teammates to design a user test evaluating the technique's performance.

  6. Recruited 18 participants and carried out the user test.

  7. Authored a manuscript to share the research outcomes with the community.

  8. Identified the technique's limitations and proposed concrete, feasible iteration plans to improve it.

Demo Video

[Accepted as a poster at IEEE VR 2023]

——VR Device Setup——

A facial tracking camera is attached to the VR head-mounted display (HMD). The camera tracks the user's lower face movements and retargets them to the avatar's lower face through 38 lower-face blendshapes. A blendshape is a unit of an avatar's facial expression; combining different blendshapes with different weights forms different expressions. The avatar's upper face (14 upper-face blendshapes) remains static.

VR Device Setup.png
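As a rough illustration of how the blendshape weights described above combine (this is a generic sketch, not the project's actual retargeting code), each blendshape stores per-vertex offsets from the neutral face, and the tracked weights blend them into the final expression:

```python
import numpy as np

# Rough illustration of blendshape mixing, not the project's retargeting code.
# neutral:            (V, 3) vertices of the neutral face mesh
# blendshape_deltas:  (B, V, 3) per-vertex offsets for each of the B blendshapes
# weights:            (B,) tracked blendshape weights in [0, 1]
def blend_face(neutral, blendshape_deltas, weights):
    # Weighted sum of the per-blendshape offsets added onto the neutral face.
    return neutral + np.tensordot(weights, blendshape_deltas, axes=1)
```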

——Technique Implementation——

1. Avatar Facial Expression Blendshape Dataset Construction

  • Covered 6 facial expression categories: Happy, Sad, Angry, Disgust, Fear, and Surprise.

  • Collected 13,500 lines of blendshape data from 15 participants (see the loading sketch below).

6 Facial Expression.png
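A sketch of how such data might be loaded for training, assuming each line holds the 38 lower-face blendshape weights followed by the 14 upper-face weights (the file name and column layout are assumptions, not the project's actual files):

```python
import numpy as np

# Assumed layout: 38 lower-face weights, then 14 upper-face weights per line.
# The file name and column order are illustrative assumptions.
data = np.loadtxt("blendshape_dataset.csv", delimiter=",")
lower_face = data[:, :38]    # network input
upper_face = data[:, 38:52]  # prediction target
print(lower_face.shape, upper_face.shape)  # e.g. (13500, 38) (13500, 14)
```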

2. Neural Network for Upper Face Blendshapes Prediction

  • A multi-output neural network with five dense layers and numerical outputs, developed with TensorFlow (a minimal sketch follows the structure figure below)

Neural Network Pipeline.png

Neural Network Structure
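A minimal TensorFlow/Keras sketch of such a five-dense-layer multi-output regressor, mapping the 38 lower-face blendshape weights to the 14 upper-face weights (hidden layer sizes and activations are assumptions, not the exact architecture used):

```python
import tensorflow as tf

# Minimal sketch: 38 lower-face blendshape weights in, 14 upper-face weights out.
# Hidden layer sizes and activations are illustrative assumptions.
inputs = tf.keras.Input(shape=(38,), name="lower_face_blendshapes")
x = inputs
for units in (128, 128, 64, 32):          # four hidden dense layers
    x = tf.keras.layers.Dense(units, activation="relu")(x)
outputs = tf.keras.layers.Dense(           # fifth dense layer: numerical output
    14, activation="sigmoid", name="upper_face_blendshapes")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()
```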

3. Technique Pipeline

  • Deployed the neural network model into Unity

  • Used the SteamVR package to create the VR environment and the Barracuda package to run the neural network model inside Unity, passing lower-face blendshape data to the model and its upper-face predictions back to the avatar
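Barracuda imports neural network models in ONNX format, so one possible export step on the TensorFlow side (a sketch assuming the tf2onnx converter; file names and the input signature are placeholders) is:

```python
import tensorflow as tf
import tf2onnx

# Sketch: export the trained Keras model to ONNX so it can be imported by
# Unity Barracuda. File names and the input signature are assumptions.
model = tf.keras.models.load_model("upper_face_predictor.h5")
spec = (tf.TensorSpec((None, 38), tf.float32, name="lower_face_blendshapes"),)
tf2onnx.convert.from_keras(model, input_signature=spec,
                           output_path="upper_face_predictor.onnx")
```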

Unity Logo.png
SteamVR Logo.png

Barracuda

Technique Pipeline.png

Technique Pipeline

4. Implementation of the Technique on Various Avatars

  • Collaborated with a teammate from the computational design field to design blendshapes for various avatars.

  • Implemented the technique on these avatars. 

Man_Avatar.png
Woman_Avatar.png

Man Avatar

Woman Avatar

5. Online Virtual Meeting Platform Construction

  • Constructed a virtual meeting platform with Unity and the Mirror networking framework to enable two users to communicate.

  • The platform supports avatars' facial movements and head rotation to enhance immersion.

  • Two users located in different rooms used the platform to communicate; audio was transmitted through Zoom.

Virtual Meeting Platform.png
Unity Logo.png
Mirror Framework.png

The Virtual Meeting Platform

——User Test——

1. Participants

  • 18 participants (5 males and 13 females), average age of 20.44.

2. Study Design

  • Compared 3 avatar upper facial expression conditions:

    • Static upper face with no expression (Static for short)

    • Our proposed technique (Tech for short)

    • VR with eye tracking, serving as the ground truth for the upper facial expression (Eye-tracking for short)

  • Divided the participants into 3 groups, one per condition pair, and further divided each group into 3 sub-groups (2 participants per sub-group):

    • Static vs. Tech (3 sub-groups)

    • Tech vs. Eye-tracking (3 sub-groups)

    • Static vs. Eye-tracking (3 sub-groups)

Study Design Structure.png

Study Design

3. Evaluation

  • Condition preference

  • Social presence (Likert scale from 1 to 7)

  • Interpersonal attraction (Likert scale from 1 to 7)

  • The uncanny valley effect (Likert scale from 1 to 7)

The Uncanny Velley Table.png
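As an illustration of the kind of pairwise comparison later reported for these Likert ratings (the exact statistical test is not detailed here; this sketch assumes a non-parametric Wilcoxon signed-rank test on paired 1–7 ratings, with placeholder file and column names):

```python
import numpy as np
from scipy.stats import wilcoxon

# Placeholder file and column layout; the study's actual analysis may differ.
# Each row: one participant's 1-7 ratings for the two conditions in their group.
ratings = np.loadtxt("social_presence_tech_vs_eyetracking.csv", delimiter=",")
stat, p = wilcoxon(ratings[:, 0], ratings[:, 1])
print(f"Wilcoxon statistic = {stat:.2f}, p = {p:.3f}")
```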

4. User Test Procedure

  • Taking the Static vs. Tech group as an example:

User Test Procedure.png

User Test Procedure

——Results and Conclusion——

1. Results

  • Condition Preference

Tech Preference.png

  • Social Presence (Mean)

Social Presence.png

  • Interpersonal Attraction (Mean)

Interpersonal Attraction.png

  • The Uncanny Valley Effect (Mean)

The Uncanney Valley Effect.png

2. Conclusion

  • Both the Tech and Eye-tracking conditions performed better than Static in social presence, interpersonal attraction, and the uncanny valley effect.

  • Our proposed technique's performance fell between Static and Eye-tracking (the ground truth) and did not differ significantly from Eye-tracking in pairwise comparisons, indicating decent performance. Users felt more natural and comfortable communicating when the technique was enabled.

——Limitations and Future Work——

1. Limitations

  • The technique is a simple end-to-end system: the avatar's upper face always moves whenever the lower face moves, which is not always accurate in real-time virtual communication.

  • When two expressions have similar lower-face movements (e.g., disgust and anger), the technique may synthesize the wrong upper facial expression.

  • The user test had a small number of participants, so the results may be biased.

2. Future Work

  • Add multi-modal emotion recognition (e.g., audio-based emotion recognition and facial expression recognition based on the lower face) for more accurate output.

  • Recruit more participants to reduce bias.

——TAKEAWAYS——

  • Users felt more natural and engaged in communication with our proposed technique than with the Static condition. However, the technique still needs improvement in the accuracy of synthesizing an avatar's upper facial expression.
