We separate our pipeline into a training phase and a testing phase and give details of each in the following.
Train InstantAvatar with a monocular video
InstantAvatar builds on NeRF with InstantNGP as its backbone and quickly learns an avatar from a monocular video. For each frame, the method samples points along rays in posed space and transforms them into a normalized space to remove the global orientation and translation. Points in empty space are filtered out with an occupancy grid shared across all frames, and the remaining points are deformed to canonical space by an articulation module. Finally, they are fed into the canonical neural radiance field to predict color and density.
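To make this flow concrete, the sketch below walks through the per-frame rendering steps just described. It is only an illustration: the module names (occupancy_grid, articulation, canonical_nerf) and the SMPL parameter keys are placeholders, not InstantAvatar's actual API.

```python
import torch

def render_frame(rays_o, rays_d, smpl_params, occupancy_grid, articulation, canonical_nerf):
    """Illustrative per-frame rendering loop; the three modules are placeholders."""
    # 1. Sample K points along each ray in posed space.
    t_vals = torch.linspace(0.0, 1.0, 64)                                  # (K,)
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t_vals[None, :, None]  # (R, K, 3)

    # 2. Remove the global orientation and translation (normalized space).
    pts = (pts - smpl_params["transl"]) @ smpl_params["global_rot"].T

    # 3. Drop points that the occupancy grid (shared across frames) marks as empty.
    keep = occupancy_grid(pts)                                             # bool mask, (R, K)

    # 4. Deform the surviving points to canonical space with the articulation module.
    pts_canonical = articulation(pts[keep], smpl_params["pose"])

    # 5. Query the canonical radiance field for per-point color and density.
    dirs = rays_d[:, None, :].expand_as(pts)[keep]
    rgb, sigma = canonical_nerf(pts_canonical, dirs)
    return rgb, sigma, keep
```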
Once we finish training InstantAvatar, we move to the testing phase of our pipeline. Since we want to synthesize views of a person playing an instrument, we must first capture the human poses of a person playing that instrument and extract SMPL parameters to feed into InstantAvatar. However, accurately capturing 3D motion from a single-view video, which is how such performances are usually filmed, is not possible. Therefore, we simulate a virtual person playing a drum in Blender and film the player from different views.
After filming the virtual player from different views, we use an existing motion capture system, EasyMocap, to extract SMPL parameters. The overall pipeline for motion capturing a virtual person playing a drum is shown below.
We first need to analyze the music to decide when each human pose should be executed. For this, we use pretty_midi, the Python library presented in the paper Intuitive Analysis, Creation and Manipulation of MIDI Data with pretty_midi, which lets us work with MIDI data efficiently. The steps are listed below:
Playing an instrument means associating each note with a pose and transitioning continuously from one pose to the next over the course of the piece. We use the motion capture system to obtain pose parameters for each note and set a keyframe on the timeline at the moment the note is played.
To generate the frames between two keyframes, we use the insight that transitioning from one pose to another amounts to rotating body joints. SMPL pose parameters describe the rotations of 23 body joints in axis-angle form. We apply spherical linear interpolation (Slerp) to compute the 23 joint rotations for each frame between two keyframes. Slerp interpolates with constant angular velocity along the shortest path (geodesic) on the unit hypersphere between two quaternions $q_1$ and $q_2$.
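For reference, the standard Slerp expression between two unit quaternions $q_1$ and $q_2$ with interpolation parameter $u$ is

$$\mathrm{Slerp}(q_1, q_2; u) = \frac{\sin\bigl((1-u)\,\theta\bigr)}{\sin\theta}\, q_1 + \frac{\sin(u\,\theta)}{\sin\theta}\, q_2, \qquad \cos\theta = q_1 \cdot q_2 .$$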
The parameter $u$ runs from 0 (where the expression reduces to $q_1$) to 1 (where it reduces to $q_2$). For each joint, we convert the axis-angle rotation to a quaternion, substitute the first keyframe's joint rotation for $q_1$ and the second keyframe's joint rotation for $q_2$, and then apply Slerp.
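A minimal sketch of this per-joint interpolation using SciPy is given below; the array layout of the keyframes is illustrative, not the exact format used in our code.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(pose_a, pose_b, num_frames):
    """Interpolate SMPL body poses between two keyframes.

    pose_a, pose_b: (23, 3) axis-angle rotations of the 23 body joints.
    Returns (num_frames, 23, 3) axis-angle poses, including both endpoints.
    """
    u = np.linspace(0.0, 1.0, num_frames)            # interpolation parameter per frame
    out = np.empty((num_frames, 23, 3))
    for j in range(23):
        # Axis-angle keyframes of joint j -> rotations (quaternions handled internally).
        key_rots = Rotation.from_rotvec(np.stack([pose_a[j], pose_b[j]]))
        slerp = Slerp([0.0, 1.0], key_rots)           # constant angular velocity
        out[:, j] = slerp(u).as_rotvec()              # back to axis-angle for SMPL
    return out

# Example: 30 frames between a rest pose and a slightly rotated pose.
frames = interpolate_pose(np.zeros((23, 3)), 0.1 * np.ones((23, 3)), 30)
```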
To make the avatar hit every note at the right time, we first set the fps of our output video. By preprocessing the MIDI file, we obtain an array that records when each note is hit. Let $t_1$ be the time a note is hit and $t_2$ the time the following note is hit; the number of frames between the two notes is then the fps multiplied by $(t_2 - t_1)$.
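The sketch below shows this timing step with pretty_midi; the file name is hypothetical, and for simplicity we assume every note of every instrument track counts as a hit.

```python
import pretty_midi

FPS = 30  # frame rate of the output video

# Collect the onset time of every note in the MIDI file (hypothetical path).
midi = pretty_midi.PrettyMIDI("drum_track.mid")
onsets = sorted(note.start for inst in midi.instruments for note in inst.notes)

# Number of frames between consecutive notes: fps * (t2 - t1).
frame_counts = [int(round(FPS * (t2 - t1))) for t1, t2 in zip(onsets[:-1], onsets[1:])]
```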
We use a 3D mesh to represent the instrument and film a scene containing only the instrument along the same camera path used to render our avatar. As shown below, the circular line is the camera path, and we render the drum scene along it. This yields a video of the drum viewed from different angles.
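As a sketch, camera positions on such a circular path can be generated as below and reused for both the avatar and the drum renders; the radius, height, and frame count are illustrative values.

```python
import numpy as np

def circular_camera_path(num_frames, radius=3.0, height=1.2):
    """Return (num_frames, 3) camera positions on a circle around the scene origin."""
    angles = np.linspace(0.0, 2.0 * np.pi, num_frames, endpoint=False)
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.full_like(angles, height)], axis=-1)
```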
Next, we include instruments in the scene using a straightforward scene composition algorithm, the Z-buffer algorithm, shown below. This algorithm efficiently combines two images into one image, ensuring proper occlusion effects.
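The following is a minimal NumPy sketch of this per-pixel depth test, not the exact listing referred to as Algorithm 1; background pixels are assumed to carry infinite depth.

```python
import numpy as np

def zbuffer_composite(rgb_a, depth_a, rgb_b, depth_b):
    """Keep, at every pixel, the color of whichever scene is closer to the camera.

    rgb_*:   (H, W, 3) rendered color images of the two scenes.
    depth_*: (H, W) per-pixel depth maps; background pixels hold +inf.
    """
    closer_a = depth_a <= depth_b                     # True where scene A occludes scene B
    rgb = np.where(closer_a[..., None], rgb_a, rgb_b)
    depth = np.where(closer_a, depth_a, depth_b)
    return rgb, depth
```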
The algorithm above gives us a way to combine two distinct scenes, but the way we obtain the depth of each scene differs. The instrument is a 3D mesh, so its depth is easy to determine because we know where the mesh surface lies. For the human avatar, however, this is challenging: the avatar is an implicit neural representation, so we cannot explicitly obtain the $(x, y, z)$ coordinates of the human surface.
Therefore, we exploit a property of NeRF to obtain the depth image. To illustrate, let us recap what NeRF does:
As shown below, to predict the color $\hat{C}(\mathbf{r})$ of pixel $p$, we follow Ray A and sample $K$ points. For the $k$-th sample point, we input its $(x, y, z)$ coordinates and viewing direction $(\theta, \phi)$ into an MLP, which outputs the color $c_k$ and density $\sigma_k$ of that point. This process is repeated for all $K$ sample points along Ray A; once all $c_k$ and $\sigma_k$ are obtained, we proceed with the volume rendering step.
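Concretely, the pixel color is accumulated with per-sample weights $w_k$; this is the expression we refer to as Equation 1:

$$\hat{C}(\mathbf{r}) = \sum_{k=1}^{K} w_k\, c_k, \qquad w_k = T_k\bigl(1 - e^{-\sigma_k \delta_k}\bigr), \qquad T_k = \exp\Bigl(-\sum_{j=1}^{k-1} \sigma_j \delta_j\Bigr) \qquad (1)$$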
where $\delta_k = t_{k+1} - t_k$ is the distance between two adjacent samples and $t_k$ is the distance from the camera origin to the $k$-th sample point.
To calculate the depth of a pixel, we can start from Equation 1 and reuse the weights $w_k$. This gives the depth equation:
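$$\hat{D}(\mathbf{r}) = \sum_{k=1}^{K} w_k\, t_k$$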
where $t_k$ is again the distance from the camera origin to the $k$-th sample point.
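As a concrete illustration, the small sketch below turns the per-sample densities, colors, and sample distances of one ray into the pixel color of Equation 1 and the pixel depth; the array shapes are assumptions for the example.

```python
import numpy as np

def composite_ray(sigma, c, t):
    """Volume-render one ray.

    sigma: (K,)   densities of the K sample points.
    c:     (K, 3) colors of the K sample points.
    t:     (K+1,) distances from the camera origin to the sample points.
    """
    delta = t[1:] - t[:-1]                                          # delta_k = t_{k+1} - t_k
    alpha = 1.0 - np.exp(-sigma * delta)                            # per-sample opacity
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))   # transmittance T_k
    w = trans * alpha                                               # weights w_k
    color = (w[:, None] * c).sum(axis=0)                            # C_hat = sum_k w_k c_k
    depth = (w * t[:-1]).sum()                                      # D_hat = sum_k w_k t_k
    return color, depth
```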
Once the depth images of both scenes are known, we can apply the Z-buffer algorithm (Algorithm 1) to composite the two scenes.
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black, "SMPL: A Skinned Multi-Person Linear Model", SIGGRAPH Asia, 2015.
Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll, "People Snapshot Dataset", CVPR, 2018.