RandLA-Net

Why We Use RandLA-Net

SensatUrban is a large-scale point cloud dataset, and the first thing to consider when doing 3D semantic segmentation is down-sampling. Because of hardware limitations, most down-sampling methods cut the entire point cloud into small blocks.
However, many objects then lose their original geometric structure, so on a large point cloud it becomes difficult for the neural network to learn the overall geometry of each object.

Therefore, we chose RandLA-Net.

Random Sampling

Using random sampling, we only need O(K) time to select K points from the N input points, i.e. O(1) per point, so the amount of computation depends only on the number of points we need to sample, which makes it very efficient.
However, the randomness of this sampling can discard useful local information. To address this, the authors propose a Local Feature Aggregation (LFA) module.
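As a rough illustration of why this is cheap (a NumPy sketch of our own, not the RandLA-Net code), drawing K random indices costs O(1) per index, independent of N:

import numpy as np

def random_sample(points, k):
    # Draw k indices uniformly at random; each draw is O(1), so the total
    # cost depends only on k, not on the size N of the cloud.
    # (Sampling with replacement: duplicates are possible but extremely
    # unlikely when N is much larger than k.)
    idx = np.random.randint(0, points.shape[0], size=k)
    return points[idx]

cloud = np.random.rand(1_000_000, 3).astype(np.float32)
print(random_sample(cloud, 100_000).shape)  # (100000, 3)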

Local Feature Aggregation

It includes three parts:
1. LocSE (Local Spatial Encoding):
          To encode the coordinates of the point cloud, the KNN algorithm finds the K nearest neighbors of each point, and the center point coordinates, the K neighbor coordinates, the relative coordinates, and the Euclidean distances are concatenated.
          This encoded position is then concatenated with the features of the corresponding neighbor points to form a new feature.

2. Attentive Pooling
          A shared function learns an attention score for each neighbor feature; the score acts as a soft mask that selects the important features, and the final learned feature is the weighted sum of these neighborhood features. (A toy sketch of LocSE and attentive pooling follows after this list.)

3. Dilated Residual Block
          Because the points are repeatedly down-sampled, the purpose of the dilated residual block is to increase the receptive field of each point.
          Multiple local spatial encoding units, attentive pooling layers, and skip connections are combined into a dilated residual block.
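Below is a toy NumPy sketch of LocSE and attentive pooling (our own illustration, not the official implementation); the matrix w stands in for the shared MLP that would normally be learned, and the feature sizes are arbitrary.

import numpy as np

def locse(center, neighbors):
    # center: (3,), neighbors: (K, 3)
    rel = neighbors - center                                 # relative coordinates
    dist = np.linalg.norm(rel, axis=1, keepdims=True)        # Euclidean distances
    k = neighbors.shape[0]
    # Concatenate center, neighbor, relative coordinates and distance -> (K, 10).
    return np.concatenate([np.tile(center, (k, 1)), neighbors, rel, dist], axis=1)

def attentive_pooling(features, w):
    # features: (K, D); w stands in for the shared scoring function.
    scores = features @ w                                    # one score per neighbor feature
    mask = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # soft mask
    return (mask * features).sum(axis=0)                     # weighted sum over the K neighbors

center = np.zeros(3)
neighbors = np.random.rand(4, 3)                             # K = 4 toy neighbors
encoded = locse(center, neighbors)                           # (4, 10)
pooled = attentive_pooling(encoded, np.random.rand(10, 10))  # (10,)
print(encoded.shape, pooled.shape)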

Structure of LFA module

Modifications We've Done

In the implementation stage, we divided our modifications to RandLA-Net into 3 important parts:

Dataset Preprocessing
Ratio between batch size and data size
Sampling Method

Data Preprocessing

Random Rotation

The new coordinates are obtained by multiplying the coordinates of each point in the data by the rotation matrices (Figure 6) for the three axes. This is equivalent to rotating the entire data, and rotating in a random way increases the diversity of the data.
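A minimal sketch of this step (the full 0–2π angle range on all three axes is our own placeholder): build one rotation matrix per axis from a random angle and multiply every point by the combined matrix.

import numpy as np

def random_rotation(points):
    # points: (N, 3). Draw one random angle per axis.
    ax, ay, az = np.random.uniform(0, 2 * np.pi, 3)
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax),  np.cos(ax)]])
    ry = np.array([[ np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az),  np.cos(az), 0],
                   [0, 0, 1]])
    # Rotate the entire sample around all three axes.
    return points @ (rz @ ry @ rx).T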

Random Translate

Add a random offset to the coordinates of all points, shifting the whole data by some distance.
The translation offset is also chosen in a stochastic manner, which increases the diversity of the data.
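A minimal sketch of this step, assuming the shift is one random offset vector shared by all points (the max_shift range is our own placeholder):

import numpy as np

def random_translation(points, max_shift=1.0):
    # One random offset per axis, applied to every point in the sample.
    offset = np.random.uniform(-max_shift, max_shift, size=(1, 3))
    return points + offset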

Random Scale

The coordinates of all points in a piece of data are multiplied by a 3 × 3 scaling matrix to obtain new data with a certain scaling.
Because the whole data is scaled, the distances between points and the overall density of the point cloud are also affected; the scaling parameter is likewise chosen in a random manner.
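A minimal sketch of this step, assuming an independent random factor per axis on the diagonal of the 3 × 3 scaling matrix (the 0.8–1.2 range is our own placeholder):

import numpy as np

def random_scale(points, low=0.8, high=1.2):
    # Diagonal 3 x 3 scaling matrix with a random factor per axis;
    # scaling also changes point-to-point distances and cloud density.
    scale = np.diag(np.random.uniform(low, high, 3))
    return points @ scale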

Gaussian Noise

The noise values generated by Gaussian noise are random, but once they are tallied they follow a normal distribution.
Looking at the noise spectrum, Gaussian noise blends relatively well with the original data and does not produce excessive noise that would destroy the integrity of the data; different noise can also be generated by adjusting parameters such as the mean and the standard deviation.
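A minimal sketch of this step, with the mean and standard deviation exposed as the adjustable parameters mentioned above (the default values are our own placeholders):

import numpy as np

def gaussian_noise(points, mean=0.0, std=0.01):
    # Per-coordinate jitter drawn from N(mean, std); kept small so the
    # structure of the cloud is not destroyed.
    return points + np.random.normal(mean, std, size=points.shape)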

Ratio Between Batch Size and Data Size

Progress

Since training on a GTX 1080 is severely limited by VRAM size, we did not modify the ratio between the batch size and the data size on this GPU.
Only after the laboratory provided us with a second GPU (a GTX Titan X) could we adjust this ratio. This part is shown in the results.

In this modification, we can see that after increasing the available VRAM, the performance of the overall model improved greatly, which shows that RandLA-Net is very sensitive to the size of the input data.

We also found a sweet spot for this ratio in RandLA-Net.

Sampling Method

The default sampling method of RandLA-Net is to randomly sample a point, find the 30,000 points adjacent to it, and use them as the input of the network. We follow this sampling idea but change the sampling method to the following two approaches:
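A rough sketch of this default scheme (our own illustration using a SciPy KD-tree, not RandLA-Net's exact code): pick one random center point and gather its 30,000 nearest points as a single network input.

import numpy as np
from scipy.spatial import cKDTree

def sample_block(points, block_size=30_000):
    # points: (N, 3). In practice the KD-tree would be built once and reused.
    tree = cKDTree(points)
    center = points[np.random.randint(points.shape[0])]
    # Gather the block_size nearest points to the random center.
    _, idx = tree.query(center, k=block_size)
    return points[idx]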

Farthest Point Sampling

In the SensatUrban dataset, the point cloud has already been cut into 33 different PLY files by default, and because of insufficient hardware memory it is impossible to read all the files at once and piece the entire SensatUrban dataset back together into one complete point cloud for sampling. For the farthest point sampling method we therefore down-sampled again, reading multiple files and sampling a pool of 3,000 records, and tried three different ways of drawing those records (a plain FPS sketch follows the list below):

1. The 3,000 pieces of data come from the same small file.
          Easier to sample.
          Low point cloud diversity.
2. The 3,000 pieces of data come from the 33 files evenly (3,000 / 33 ≈ 90 per file).
          The data is distributed more evenly than in the previous method.
3. The 3,000 pieces of data come from the 33 files in proportion to the original file sizes.
          The data is distributed evenly.
          Needs more time to compute the FPS sampling.
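For reference, here is a plain (non-optimized) NumPy sketch of farthest point sampling itself; this is our own illustration, and its O(N·K) cost is also why FPS needs noticeably more time than random sampling.

import numpy as np

def farthest_point_sampling(points, k):
    # points: (N, 3). Iteratively pick the point farthest from the set
    # selected so far; each iteration scans all N points, so the cost is O(N * k).
    n = points.shape[0]
    selected = np.zeros(k, dtype=np.int64)
    selected[0] = np.random.randint(n)
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, k):
        selected[i] = np.argmax(dist)                   # farthest remaining point
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        dist = np.minimum(dist, new_dist)               # distance to the nearest selected point
    return points[selected]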

Weird Result with FPS

Why does FPS get a lower mIoU than random sampling, even though FPS samples the data more evenly?

Our assumption: although FPS produces a more even sample, its coverage is too similar in every epoch.
Let's take a point cloud as an example:

The three point clouds A, B, and C in the figure above are the same point cloud.
The sampling order is as follows: a point is first selected at random (red), then farthest point sampling proceeds in the order blue → yellow → green → purple. The first point sampled in point cloud A is similar to the first point sampled in point cloud B, while the first point sampled in point cloud C is at the center.

As can be seen from the figure above, even though the three FPS runs on the point cloud start out completely differently, the data obtained after finding the 30,000 points nearest to each sampled point (the white regions) is very similar:
all of it lies at the four corners and the central part. This makes the coverage of different epochs too similar, and when the number of samples from the large point cloud is too small, the parts with the gray background can never be obtained, which creates dead zones.
Compared with FPS, random sampling is more likely to reach the gray parts that FPS never obtains, so random sampling achieves better performance when resources are limited.

Adaptive Sampling

In the process of implementing adaptive sampling, one sampled point is still used as the benchmark, and additional processing then moves an outlier center toward the normal point cloud contour. After evaluation, we did not adopt adaptive sampling in the model.

Let's take a concrete situation as an example.
Suppose we have a data sample with 12 points. The sampling process of RandLA-Net is: after finding a point, take the 11 points nearest to it as one data sample. As shown on the left side of the figure above, when we pick an outlier (the yellow dot), which is usually caused by noise, it lies close to the contour of the point cloud, so the points obtained are the 11 points closest to the outlier; generally speaking, these will be one of the corners of the point cloud contour, i.e. the yellow and red points.
The right side of the figure shows what happens when we use adaptive sampling: the center point we get is the contour point closest to the outlier, i.e. the yellow point at the corner of the contour on the right side of the figure, and we then find the 11 red points closest to it as the sampled data.
Comparing the two pictures, although the positions of the center points differ greatly, the final data samples are the same. It can be seen that adaptive sampling is not suitable for RandLA-Net's data collection method: in most cases it spends extra computation only to obtain the same data.
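To make that argument concrete, here is a toy reproduction of the 12-point example with NumPy and a SciPy KD-tree (our own illustration): whether the query is centered on the outlier itself or on the contour point nearest to it, both samples end up containing the same 12 points.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
contour = rng.random((11, 3))                  # 11 "normal" contour points
outlier = np.array([5.0, 5.0, 5.0])            # one noise point far from the contour
cloud = np.vstack([contour, outlier])          # the 12-point toy data
tree = cKDTree(cloud)

# Default sampling: the outlier plus its 11 nearest points.
_, idx_default = tree.query(outlier, k=12)
# Adaptive variant: re-center on the contour point closest to the outlier,
# then take the 11 points nearest to it.
center = contour[np.argmin(np.linalg.norm(contour - outlier, axis=1))]
_, idx_adaptive = tree.query(center, k=12)

print(sorted(idx_default) == sorted(idx_adaptive))  # True: both samples are the same 12 points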