System Integration of Dynamic 3D Human Reconstruction


The idea of the Metaverse has become popular over the past two years. In the future, both individuals and corporations will have a significant need to build 3D objects. Our project focuses on 3D human models and aims to let anyone build their own human model easily and quickly. Given a full-body image and a target video, our system transfers the user’s body appearance onto the motion in the target video.

Models We Used

VIBE [1]
Given a video, VIBE predicts the 3D body pose, position, and shape parameters for each frame. These parameters drive the SMPL human body model. VIBE is trained with an adversarial learning framework in which a discriminator learns to tell real human motion from the predicted motion.

DensePose & Coordinate-based texture inpainting [2][3]
These two models produce the body texture. DensePose assigns each body pixel in the user image to a body part and regresses its U, V surface coordinates. The resulting texture is incomplete because a single image carries little information about the back of the body. The coordinate-based texture inpainting model takes the UV map and the source image as input, predicts the missing texture regions, and samples their colors from the source image. We apply these two models to obtain the SMPL texture.

[1] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video Inference for Human Body Pose and Shape Estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[2] Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense Human Pose Estimation in the Wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[3] Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, and Victor Lempitsky. Coordinate-based Texture Inpainting for Pose-Guided Image Generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.


A. Obtain video animation & body position
The target video (input video) is fed into the VIBE model, and the resulting human body pose, shape, and joint parameters are obtained in FBX file format.

B. Obtain user body shape
Because the VIBE model accepts only video input, the user image is first converted into a very short video with FFmpeg. The user’s body shape is then obtained from the VIBE output.
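The image-to-video conversion could look roughly like the sketch below. This is an illustration rather than the project's actual command: the one-second duration, the 25 fps frame rate, the even-size scaling filter, and the helper names `build_still_video_command` and `image_to_short_video` are all assumptions. It relies only on standard FFmpeg options (`-loop 1`, `-t`, `-r`, `-vf`, `-pix_fmt`).

```python
import subprocess


def build_still_video_command(image_path, video_path, seconds=1.0, fps=25):
    """Build an FFmpeg command that repeats one still image as a short clip.

    The duration and frame rate defaults are illustrative assumptions.
    """
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-loop", "1",            # keep re-reading the single input image
        "-i", str(image_path),
        "-t", str(seconds),      # clip length in seconds
        "-r", str(fps),          # output frame rate
        # Round width and height down to even values, which yuv420p needs.
        "-vf", "scale=trunc(iw/2)*2:trunc(ih/2)*2",
        "-pix_fmt", "yuv420p",   # widely supported pixel format
        str(video_path),
    ]


def image_to_short_video(image_path, video_path, seconds=1.0, fps=25):
    """Run FFmpeg (must be installed and on PATH) to create the clip."""
    cmd = build_still_video_command(image_path, video_path, seconds, fps)
    subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with uploaded file names.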

C. Obtain user appearance texture
DensePose takes the user’s full-body image and outputs a UV map that gives the body texture coordinates. Inpainting is then applied to predict the texture regions that are not visible in the image. This stage produces a complete body texture.
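To illustrate why the texture is incomplete before inpainting, the sketch below scatters the visible image pixels into per-part UV tiles using NumPy. It is a simplified illustration, not the DensePose or inpainting code: the function name `unwrap_partial_texture`, the tile size, the 24-part default, and the row/column convention for U and V are assumptions.

```python
import numpy as np


def unwrap_partial_texture(image, part_index, u, v, num_parts=24, tile=64):
    """Scatter visible image pixels into per-part UV tiles.

    image:      (H, W, 3) uint8 colors
    part_index: (H, W) ints, 0 = background, 1..num_parts = body part
    u, v:       (H, W) floats in [0, 1], surface coordinates per pixel
    Returns (texture, filled): texture is (num_parts, tile, tile, 3) and
    filled is a boolean mask of texels that received a color.
    """
    texture = np.zeros((num_parts, tile, tile, 3), dtype=image.dtype)
    filled = np.zeros((num_parts, tile, tile), dtype=bool)

    ys, xs = np.nonzero(part_index > 0)      # pixels that lie on the body
    parts = part_index[ys, xs] - 1           # 0-based tile index
    # Map [0, 1] coordinates to texel indices (axis convention is assumed).
    cols = np.clip(np.round(u[ys, xs] * (tile - 1)).astype(int), 0, tile - 1)
    rows = np.clip(np.round(v[ys, xs] * (tile - 1)).astype(int), 0, tile - 1)

    texture[parts, rows, cols] = image[ys, xs]
    filled[parts, rows, cols] = True
    return texture, filled
```

Texels where `filled` is False correspond to surface regions the camera never saw, such as the back of the body; those are the regions the inpainting stage has to predict.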

D. Merge animation, position, body shape and texture
The outputs of the previous steps are fed into Unity, which merges them into an animation clip, as shown in the “Demo Video” section below. The result is then built for WebGL and deployed in the web application.
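Conceptually, the merge pairs the per-frame motion from the target video with a single body shape from the user. The NumPy sketch below shows only that data relationship; the actual merge happens in Unity. The function name `combine_motion_and_shape` and the averaging of shape coefficients over frames are assumptions, while the array sizes follow the standard SMPL parameterization (72 pose values and 10 shape coefficients).

```python
import numpy as np


def combine_motion_and_shape(target_pose, user_betas):
    """Pair per-frame target motion with one user shape vector.

    target_pose: (N, 72) SMPL axis-angle pose per target-video frame
    user_betas:  (M, 10) SMPL shape coefficients per user-clip frame
    Returns a list of N per-frame dicts with 'pose' and 'betas'.
    """
    target_pose = np.asarray(target_pose, dtype=float)
    user_betas = np.asarray(user_betas, dtype=float)
    if target_pose.ndim != 2 or target_pose.shape[1] != 72:
        raise ValueError("target_pose must have shape (N, 72)")
    if user_betas.ndim != 2 or user_betas.shape[1] != 10:
        raise ValueError("user_betas must have shape (M, 10)")

    # The user clip repeats one image, so averaging mainly removes jitter
    # (assumption: the project may select the shape differently).
    mean_betas = user_betas.mean(axis=0)
    return [{"pose": pose, "betas": mean_betas} for pose in target_pose]
```

Every frame of the result keeps the target video's pose but uses the same user shape, which is why the avatar moves like the target while keeping the user's proportions.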


The system is integrated into a web application built with the Flask framework. The input image is the user’s full-body image, and the input video (target video) shows the body whose motion the user wants to adopt. The output animation shows an avatar with the appearance from the input image, moving with the motion from the input video.


Uploaded image & video source:
Full body shot:
Target video:


National Tsing Hua University, Department of Computer Science, Class of 2022