A. Obtain video animation & body position
The target (input) video is fed into the VIBE model, which estimates human body pose, shape, and joint parameters; these are exported in FBX file format.
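As a minimal sketch of how this step might be driven, the public VIBE repository exposes a `demo.py` script taking `--vid_file` and `--output_folder` arguments; the file paths below are placeholders, not values from this project:

```python
def vibe_command(video_path: str, output_dir: str) -> list:
    """Compose the command line for VIBE's demo script.

    The script name and flag names follow the public VIBE repository;
    the concrete paths here are illustrative placeholders.
    """
    return [
        "python", "demo.py",
        "--vid_file", video_path,
        "--output_folder", output_dir,
    ]

# Example: pose/shape estimation on the target video.
cmd = vibe_command("target_video.mp4", "vibe_output/")
print(" ".join(cmd))
```

The composed command would then be executed (e.g. via `subprocess.run(cmd, check=True)`) inside a checkout of the VIBE repository.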
B. Obtain user body shape
Since the VIBE model only accepts video input, the user's image is first converted into a very short video with FFmpeg. The user's body shape is then obtained by running this video through VIBE.
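One way to perform this conversion is FFmpeg's `-loop`/`-t`/`-r` options, which repeat a single still image for a fixed duration at a given frame rate. The sketch below only composes the command; the file names are illustrative assumptions:

```python
import subprocess

def image_to_clip_cmd(image_path: str, clip_path: str,
                      seconds: float = 1.0, fps: int = 30) -> list:
    # -loop 1 repeats the single input image, -t caps the duration,
    # -r sets the frame rate, and -pix_fmt yuv420p keeps the output
    # broadly decodable, so VIBE sees an ordinary short video.
    return [
        "ffmpeg", "-loop", "1", "-i", image_path,
        "-t", str(seconds), "-r", str(fps),
        "-pix_fmt", "yuv420p", clip_path,
    ]

cmd = image_to_clip_cmd("user.jpg", "user.mp4")
print(" ".join(cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```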
C. Obtain user appearance texture
DensePose takes the user’s full-body image and outputs a UV map indicating body texture coordinates. Because a single image observes only part of the body surface, inpainting is applied to predict the unseen texture, so this stage yields a complete body texture.
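To illustrate the hole-filling idea, the toy sketch below fills unobserved UV texels by iteratively copying values from observed neighbors. This naive diffusion is only a stand-in for the learned inpainting model the pipeline would use; the array shapes and names are assumptions:

```python
import numpy as np

def fill_uv_holes(texture, valid, max_iters=100):
    """Fill unobserved texels by copying from already-filled neighbors.

    texture: (H, W) float array of texel values (one channel for brevity).
    valid:   (H, W) bool mask, True where the texel was observed.
    A simple placeholder for the actual learned inpainting step.
    """
    tex = texture.copy()
    filled = valid.copy()
    for _ in range(max_iters):
        if filled.all():
            break
        # Propagate filled values one texel in each of the four directions.
        for axis, shift in ((0, 1), (0, -1), (1, 1), (1, -1)):
            src_vals = np.roll(tex, shift, axis=axis)
            src_ok = np.roll(filled, shift, axis=axis)
            take = ~filled & src_ok          # holes with a filled neighbor
            tex[take] = src_vals[take]
            filled |= take
    return tex

# Tiny example: a 4x4 UV patch with a single observed texel.
uv = np.zeros((4, 4))
uv[0, 0] = 5.0
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True
result = fill_uv_holes(uv, mask)
print(result[3, 3])  # the far corner is filled from the observed texel
```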
D. Merge animation, position, body shape and texture
The outputs of steps A–C are fed into Unity, which merges the animation, body shape, and texture into an animation clip, as shown in the “Demo Video” section below. The project is then built to WebGL and deployed on the web application.