We separate our pipeline into a training phase and a testing phase and give details of each in the following.
Train InstantAvatar with a monocular video
InstantAvatar builds on NeRF with InstantNGP as its backbone and quickly learns an avatar from a monocular video. For each frame, the method samples points along rays in posed space and transforms them into a normalized space to remove the global orientation and translation. Points in empty space are filtered out with an occupancy grid shared across all frames, and the remaining points are deformed to canonical space by an articulation module. Finally, they are fed into the canonical neural radiance field to predict color and density.
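To make this flow concrete, the sketch below walks through the per-frame rendering steps just described. It is only an illustration: the module names (occupancy_grid, articulation, canonical_nerf) and the SMPL parameter keys are placeholders, not InstantAvatar's actual API.

```python
import torch

def render_frame(rays_o, rays_d, smpl_params, occupancy_grid, articulation, canonical_nerf):
    """Illustrative per-frame rendering loop; the three modules are placeholders."""
    # 1. Sample K points along each ray in posed space.
    t_vals = torch.linspace(0.0, 1.0, 64)                                  # (K,)
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t_vals[None, :, None]  # (R, K, 3)

    # 2. Remove the global orientation and translation (normalized space).
    pts = (pts - smpl_params["transl"]) @ smpl_params["global_rot"].T

    # 3. Drop points that the occupancy grid (shared across frames) marks as empty.
    keep = occupancy_grid(pts)                                             # bool mask, (R, K)

    # 4. Deform the surviving points to canonical space with the articulation module.
    pts_canonical = articulation(pts[keep], smpl_params["pose"])

    # 5. Query the canonical radiance field for per-point color and density.
    dirs = rays_d[:, None, :].expand_as(pts)[keep]
    rgb, sigma = canonical_nerf(pts_canonical, dirs)
    return rgb, sigma, keep
```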
Once we finish training InstantAvatar, we move to the testing phase of our pipeline. Since we want to synthesize views of a person playing an instrument, we must first capture the human poses of a person playing that instrument and extract SMPL parameters to feed into InstantAvatar. However, accurately capturing 3D motion from a single-view video, which is how such performances are usually filmed, is not possible. Therefore, we simulate a virtual person playing a drum in Blender and film the player from different views.
After filming the virtual player from different views, we use an existing motion capture system, EasyMocap, to extract SMPL parameters. The overall pipeline for motion capturing a virtual person playing a drum is shown below.
We first need to analyze the music to decide when each human pose should be executed. For this, we use pretty_midi, the Python library presented in the paper Intuitive Analysis, Creation and Manipulation of MIDI Data with pretty_midi, which lets us work with MIDI data efficiently. The steps are listed below:
Playing an instrument means associating each note with a pose and transitioning continuously from one pose to the next over the course of the piece. We use the motion capture system to obtain pose parameters for each note and set a keyframe on the timeline at the moment the note is played.
To generate the frames between two keyframes, we use the insight that transitioning from one pose to another amounts to rotating body joints. SMPL pose parameters describe the rotations of 23 body joints in axis-angle form. We apply spherical linear interpolation (Slerp) to compute the 23 joint rotations for each frame between two keyframes. Slerp interpolates with constant angular velocity along the shortest path (geodesic) on the unit hypersphere between two quaternions $q_1$ and $q_2$.
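For reference, the standard Slerp expression between two unit quaternions $q_1$ and $q_2$ with interpolation parameter $u$ is

$$\mathrm{Slerp}(q_1, q_2; u) = \frac{\sin\bigl((1-u)\,\theta\bigr)}{\sin\theta}\, q_1 + \frac{\sin(u\,\theta)}{\sin\theta}\, q_2, \qquad \cos\theta = q_1 \cdot q_2 .$$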
The parameter $u$ runs from 0 (where the expression reduces to $q_1$) to 1 (where it reduces to $q_2$). For each joint, we convert the axis-angle rotation to a quaternion, substitute the first keyframe's joint rotation for $q_1$ and the second keyframe's joint rotation for $q_2$, and then apply Slerp.
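A minimal sketch of this per-joint interpolation using SciPy is given below; the array layout of the keyframes is illustrative, not the exact format used in our code.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_pose(pose_a, pose_b, num_frames):
    """Interpolate SMPL body poses between two keyframes.

    pose_a, pose_b: (23, 3) axis-angle rotations of the 23 body joints.
    Returns (num_frames, 23, 3) axis-angle poses, including both endpoints.
    """
    u = np.linspace(0.0, 1.0, num_frames)            # interpolation parameter per frame
    out = np.empty((num_frames, 23, 3))
    for j in range(23):
        # Axis-angle keyframes of joint j -> rotations (quaternions handled internally).
        key_rots = Rotation.from_rotvec(np.stack([pose_a[j], pose_b[j]]))
        slerp = Slerp([0.0, 1.0], key_rots)           # constant angular velocity
        out[:, j] = slerp(u).as_rotvec()              # back to axis-angle for SMPL
    return out

# Example: 30 frames between a rest pose and a slightly rotated pose.
frames = interpolate_pose(np.zeros((23, 3)), 0.1 * np.ones((23, 3)), 30)
```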
To make the avatar hit every note at the right time, we first set the fps of our output video. By preprocessing the MIDI file, we obtain an array that records when each note is hit. Let $t_1$ be the time a note is hit and $t_2$ the time the following note is hit; the number of frames between the two notes is then the fps multiplied by $(t_2 - t_1)$.
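The sketch below shows this timing step with pretty_midi; the file name is hypothetical, and for simplicity we assume every note of every instrument track counts as a hit.

```python
import pretty_midi

FPS = 30  # frame rate of the output video

# Collect the onset time of every note in the MIDI file (hypothetical path).
midi = pretty_midi.PrettyMIDI("drum_track.mid")
onsets = sorted(note.start for inst in midi.instruments for note in inst.notes)

# Number of frames between consecutive notes: fps * (t2 - t1).
frame_counts = [int(round(FPS * (t2 - t1))) for t1, t2 in zip(onsets[:-1], onsets[1:])]
```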
We use a 3D mesh to represent the instrument and film a scene containing only the instrument along the same camera path used to render our avatar. As shown below, the circular line is the camera path, and we render the drum scene along it. This yields a video of the drum viewed from different angles.
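As a sketch, camera positions on such a circular path can be generated as below and reused for both the avatar and the drum renders; the radius, height, and frame count are illustrative values.

```python
import numpy as np

def circular_camera_path(num_frames, radius=3.0, height=1.2):
    """Return (num_frames, 3) camera positions on a circle around the scene origin."""
    angles = np.linspace(0.0, 2.0 * np.pi, num_frames, endpoint=False)
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.full_like(angles, height)], axis=-1)
```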
Next, we include instruments in the scene using a straightforward scene composition algorithm, the Z-buffer algorithm, shown below. This algorithm efficiently combines two images into one image, ensuring proper occlusion effects.
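The following is a minimal NumPy sketch of this per-pixel depth test, not the exact listing referred to as Algorithm 1; background pixels are assumed to carry infinite depth.

```python
import numpy as np

def zbuffer_composite(rgb_a, depth_a, rgb_b, depth_b):
    """Keep, at every pixel, the color of whichever scene is closer to the camera.

    rgb_*:   (H, W, 3) rendered color images of the two scenes.
    depth_*: (H, W) per-pixel depth maps; background pixels hold +inf.
    """
    closer_a = depth_a <= depth_b                     # True where scene A occludes scene B
    rgb = np.where(closer_a[..., None], rgb_a, rgb_b)
    depth = np.where(closer_a, depth_a, depth_b)
    return rgb, depth
```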
The algorithm above gives us a way to combine two distinct scenes, but the way we obtain the depth of each scene differs. The instrument is a 3D mesh, so its depth is easy to determine because we know where the mesh surface lies. For the human avatar, however, this is challenging: the avatar is an implicit neural representation, so we cannot explicitly obtain the $(x, y, z)$ coordinates of the human surface.
Therefore, we exploit a property of NeRF to obtain the depth image. To illustrate, let us recap what NeRF does:
As shown below, to predict the color $\hat{C}(\mathbf{r})$ of pixel $p$, we follow Ray A and sample $K$ points. For the $k$-th sample point, we input its $(x, y, z)$ coordinates and viewing direction $(\theta, \phi)$ into an MLP, which outputs the color $c_k$ and density $\sigma_k$ of that point. This process is repeated for all $K$ sample points along Ray A; once all $c_k$ and $\sigma_k$ are obtained, we proceed with the volume rendering step.
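Concretely, the pixel color is accumulated with per-sample weights $w_k$; this is the expression we refer to as Equation 1:

$$\hat{C}(\mathbf{r}) = \sum_{k=1}^{K} w_k\, c_k, \qquad w_k = T_k\bigl(1 - e^{-\sigma_k \delta_k}\bigr), \qquad T_k = \exp\Bigl(-\sum_{j=1}^{k-1} \sigma_j \delta_j\Bigr) \qquad (1)$$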
where $\delta_k = t_{k+1} - t_k$ is the distance between two adjacent samples and $t_k$ is the distance from the camera origin to the $k$-th sample point.
To calculate the depth of a pixel, we can start from Equation 1 and reuse the weights $w_k$. This gives the depth equation:
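$$\hat{D}(\mathbf{r}) = \sum_{k=1}^{K} w_k\, t_k$$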
where $t_k$ is again the distance from the camera origin to the $k$-th sample point.
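As a concrete illustration, the small sketch below turns the per-sample densities, colors, and sample distances of one ray into the pixel color of Equation 1 and the pixel depth; the array shapes are assumptions for the example.

```python
import numpy as np

def composite_ray(sigma, c, t):
    """Volume-render one ray.

    sigma: (K,)   densities of the K sample points.
    c:     (K, 3) colors of the K sample points.
    t:     (K+1,) distances from the camera origin to the sample points.
    """
    delta = t[1:] - t[:-1]                                          # delta_k = t_{k+1} - t_k
    alpha = 1.0 - np.exp(-sigma * delta)                            # per-sample opacity
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))   # transmittance T_k
    w = trans * alpha                                               # weights w_k
    color = (w[:, None] * c).sum(axis=0)                            # C_hat = sum_k w_k c_k
    depth = (w * t[:-1]).sum()                                      # D_hat = sum_k w_k t_k
    return color, depth
```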
Once the depth images of both scenes are known, we can apply the Z-buffer algorithm (Algorithm 1) to composite the two scenes.
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black, "SMPL: A Skinned Multi-Person Linear Model", SIGGRAPH Asia, 2015.
Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll, "People Snapshot Dataset", CVPR, 2018.