-A Details of the network architecture
Our network follows the widely used encoder-decoder architecture with skip connections. The consecutive input frames are first fed into the URS unit to obtain the mapped sample points and then into a fully-connected layer to extract per-point features. Five LDC modules and the corresponding decoder layers learn the spatial features of each point, and five TC modules are added to learn the inter-frame temporal features. Finally, three FC layers and a DP layer predict the label of each point. The details of each part are as follows:
Network input: The inputs are the two adjacent frames with sizes of , respectively, where is the number of points and is the feature dimension of each input point.
URS unit: Each frame is divided into 3D tiles at the client side, where . Within each tile, sampling is conducted with points, where . Setting the batch size of the network as , . Therefore, the size of the data entered at one time for encoding is . Moreover, we select K=16 for KNN based on the K enumeration experiment, which evaluates the network's performance for different values of K, as shown in Table I. The best network performance is observed at K=16. The results also indicate that increasing K further poses a risk of overfitting. Therefore, we introduce a dropout layer that randomly discards neurons with a probability of 0.5.
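The tiling and per-tile sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that URS denotes uniform random sampling inside a regular grid of tiles, and the names `tile_and_sample`, `tiles_per_axis`, and `points_per_tile`, as well as the values used in the demo, are hypothetical placeholders.

```python
import random
from collections import defaultdict


def tile_and_sample(points, tiles_per_axis, points_per_tile, seed=0):
    """Split one frame into a regular 3D grid of tiles and sample each tile.

    points: list of (x, y, z) tuples.
    Returns a dict mapping a tile index (i, j, k) to its sampled points.
    Empty tiles are simply absent from the result.
    """
    rng = random.Random(seed)
    # Axis-aligned bounding box of the frame.
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]

    tiles = defaultdict(list)
    for p in points:
        idx = []
        for a in range(3):
            extent = (maxs[a] - mins[a]) or 1.0  # guard against a flat axis
            cell = int((p[a] - mins[a]) / extent * tiles_per_axis)
            idx.append(min(cell, tiles_per_axis - 1))  # clamp the far face
        tiles[tuple(idx)].append(p)

    sampled = {}
    for idx, members in tiles.items():
        if len(members) >= points_per_tile:
            # Dense tile: sample without replacement.
            sampled[idx] = rng.sample(members, points_per_tile)
        else:
            # Sparse tile: sample with replacement to keep a fixed size
            # (an assumption; the source does not say how this case is handled).
            sampled[idx] = [rng.choice(members) for _ in range(points_per_tile)]
    return sampled


# Hypothetical demo values: 2000 random points, a 2x2x2 grid, 64 points per tile.
_rng = random.Random(1)
frame = [(_rng.random(), _rng.random(), _rng.random()) for _ in range(2000)]
tiles = tile_and_sample(frame, tiles_per_axis=2, points_per_tile=64)
```

Fixing the number of points per tile gives every tile the same tensor shape, which is what allows tiles to be stacked into a batch before encoding.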
Encoding layer and temporal contrast layer: Five encoding layers are used in our network to progressively reduce the size of the point clouds and increase the per-point feature dimensions. Each encoding layer consists of an LDC encoding layer and an RS operation. In particular, only of the point features are retained after each encoding layer, i.e., (). Meanwhile, the per-point feature dimension is gradually increased by each encoding layer to preserve more information, i.e., (). As for the temporal part, the temporal contrast layer does not change the feature dimension but does change the feature values, which are either enhanced or weakened by the temporal saliency operator.
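The way point count and feature dimension evolve through the five encoding layers can be traced with a short sketch. It assumes that RS denotes random sampling. The retention ratio and the feature dimensions are not reproduced here, so `keep_ratio=0.5` and `out_dims=[16, 32, 64, 128, 256]` are purely hypothetical values chosen for illustration, and both function names are ours.

```python
import random


def random_downsample(features, keep_ratio, rng):
    """Keep a random subset of per-point feature vectors (assumed RS step)."""
    n_keep = max(1, int(len(features) * keep_ratio))
    keep = sorted(rng.sample(range(len(features)), n_keep))
    return [features[i] for i in keep]


def trace_encoder_shapes(n_points, keep_ratio, out_dims):
    """Return the (num_points, feature_dim) shape after each encoding layer."""
    shapes = []
    n = n_points
    for d in out_dims:
        n = max(1, int(n * keep_ratio))  # points retained after sampling
        shapes.append((n, d))            # feature dimension set by the layer
    return shapes


# Hypothetical values only; the actual ratio and dimensions are not given here.
shapes = trace_encoder_shapes(1024, keep_ratio=0.5, out_dims=[16, 32, 64, 128, 256])
subset = random_downsample([[float(i)] for i in range(10)], 0.5, random.Random(0))
```

The temporal contrast layer would leave every shape in this trace unchanged, since it only rescales feature values.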
Decoding layers: After the above encoding layers, five decoding layers are used. For each layer in the decoder, we first find the nearest neighbor of each query point using the KNN algorithm, and then upsample the point feature set by nearest-neighbor interpolation. Next, the upsampled feature maps are concatenated with the intermediate feature maps generated by the corresponding encoding layer through skip connections, after which a shared MLP is applied to the concatenated features.
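One decoding step, as described above, can be sketched in a few lines. This is an illustrative brute-force version with hypothetical function names and toy data; the shared MLP that follows the concatenation is omitted, and a real implementation would use an accelerated neighbor search rather than a linear scan.

```python
import math


def nearest_neighbor_upsample(dense_xyz, coarse_xyz, coarse_feats):
    """Give each dense point the features of its nearest coarse point."""
    upsampled = []
    for q in dense_xyz:
        # Brute-force 1-nearest-neighbor search over the coarse points.
        nearest = min(range(len(coarse_xyz)),
                      key=lambda i: math.dist(q, coarse_xyz[i]))
        upsampled.append(list(coarse_feats[nearest]))
    return upsampled


def concat_skip(upsampled_feats, encoder_feats):
    """Concatenate upsampled features with encoder features, point by point."""
    return [u + list(e) for u, e in zip(upsampled_feats, encoder_feats)]


# Toy data: four dense points, two coarse points with 2-D features.
dense_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.9, 0.0, 0.0)]
coarse_xyz = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
coarse_feats = [[1.0, 2.0], [3.0, 4.0]]
encoder_feats = [[0.5], [0.5], [0.5], [0.5]]  # skip-connection features

upsampled = nearest_neighbor_upsample(dense_xyz, coarse_xyz, coarse_feats)
fused = concat_skip(upsampled, encoder_feats)
```

After concatenation, each point carries both the coarse decoder features and the finer encoder features, which is the input the shared MLP would then operate on.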
Final prediction: The final predicted label of each point is obtained through three shared fully-connected layers and a dropout layer, where . The dropout ratio is 0.5.
-B Details of ground truth production
Our client side plays 4 point cloud video sequences to 40 users using HTC Vive HMDs, and our server side records all user trajectories using Unity. As shown in Fig. 1, we superimpose the viewports of 40 users to get the frequency with which each point was viewed. We then divide the point cloud frames into 8 intervals according to frequency. We randomly select 100 frames and calculate the intersection ratio of the point cloud within the frequency intervals to the point cloud visible to the actual users.
![Refer to caption](x1.png)
K value | Point-level MIoU | Tile-level MIoU |
---|---|---|
10 | 66.30 | 77.80 |
11 | 68.05 | 78.52 |
12 | 73.01 | 80.98 |
13 | 74.07 | 81.96 |
14 | 75.54 | 82.08 |
15 | 75.64 | 82.60 |
16 | 82.09 | 88.10 |
17 | 79.53 | 84.47 |
18 | 80.64 | 86.23 |
We can see that the overlapping regions with frequencies greater than 5 gradually approach the user's viewport. Based on this experimental result, and considering that each user's viewpoint is related to both the user's own trajectory and the salient regions in the video, we mark the low-frequency interval [0, 5] as a non-FoV region to exclude the randomness of individual users, and take the point cloud region with a frequency greater than or equal to 5 as the FoV region, i.e., the ground truth.
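The labelling procedure can be sketched as follows. This is a minimal illustration with hypothetical function names and toy data: per-point view frequency is obtained by superimposing the users' visibility masks, and points are labelled FoV using the "greater than or equal to 5" rule stated above. The `intersection_ratio` helper reflects one possible reading of the intersection ratio, namely the fraction of a user's visible points that fall inside the candidate region; the exact definition used in the experiment is an assumption here.

```python
def view_frequency(visibility):
    """Count how many users saw each point.

    visibility[u][p] is True if user u had point p inside the viewport.
    """
    n_points = len(visibility[0])
    return [sum(1 for user in visibility if user[p]) for p in range(n_points)]


def fov_labels(frequency, threshold=5):
    """Label a point 1 (FoV) if at least `threshold` users viewed it, else 0."""
    return [1 if f >= threshold else 0 for f in frequency]


def intersection_ratio(region, user_visible):
    """Fraction of a user's visible points that lie in the region (assumed definition)."""
    if not user_visible:
        return 0.0
    return len(set(region) & set(user_visible)) / len(set(user_visible))


# Toy data: 6 users, 4 points.
visibility = [
    [True, True, False, False],
    [True, True, False, False],
    [True, False, True, False],
    [True, True, False, False],
    [True, True, False, False],
    [True, True, False, True],
]
frequency = view_frequency(visibility)
labels = fov_labels(frequency)
fov_region = [p for p, lab in enumerate(labels) if lab == 1]
ratio = intersection_ratio(fov_region, user_visible=[0, 2])
```

In this toy case the third user's isolated glance at point 2 does not survive the threshold, which is the intended effect of discarding low-frequency regions.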