-A Details of the network architecture

Our network follows the widely used encoder-decoder architecture with skip connections. The consecutive input frames are first fed into the URS unit to obtain the mapped sample points and then into a fully-connected (FC) layer to extract per-point features. Five LDC modules and the corresponding decoder layers learn the spatial features of each point, and five TC modules are added to them to learn the inter-frame temporal features. Finally, three FC layers and a dropout (DP) layer predict the label of each point. The details of each part are as follows:

Network input: The input consists of two adjacent frames, each of size $N \times D_{in}$, where $N$ is the number of points and $D_{in}$ is the feature dimension of each input point.

URS unit: Each frame is divided into $m$ 3D tiles at the client side, where $m = 12$. Within each tile, $V$ points are sampled, where $V = 12288$. The batch size of the network is $B = 4$, so the data fed to the encoder at one time has size $B \times V \times D_{in}$. We select $K = 16$ for KNN based on an enumeration experiment that evaluates the network's performance for different values of $K$, as shown in Table I; the best performance is observed at $K = 16$. The results also show that increasing the value of $K$ poses a risk of overfitting, so we introduce a dropout layer that randomly discards neurons with a probability of 0.5.
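The batch construction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it assumes that sampling within a tile is uniform and random, that each batch element is one sampled tile, and that tiles with fewer than $V$ points are padded by sampling with replacement. None of these details is specified in the text, and the names `sample_tile`, `make_batch`, `batch_shape`, and the value `D_IN = 6` are hypothetical.

```python
import random

V = 12288   # points sampled per tile (from the text)
B = 4       # batch size (from the text)
D_IN = 6    # hypothetical input feature dimension, not given in the text


def sample_tile(points, v, rng):
    """Return exactly v points drawn uniformly at random from one tile."""
    if not points:
        raise ValueError("cannot sample from an empty tile")
    if len(points) >= v:
        return rng.sample(points, v)  # sampling without replacement
    # Assumed fallback: pad small tiles by sampling with replacement.
    return points + rng.choices(points, k=v - len(points))


def make_batch(tiles, v, rng):
    """Sample every tile to a fixed size so the batch is rectangular."""
    return [sample_tile(tile, v, rng) for tile in tiles]


def batch_shape(batch):
    """Shape of the nested list as (B, V, D_in)."""
    return (len(batch), len(batch[0]), len(batch[0][0]))


rng = random.Random(0)
# Four synthetic tiles with different point counts, one per batch element.
tiles = [
    [[rng.random() for _ in range(D_IN)] for _ in range(n)]
    for n in (20000, 15000, 12288, 5000)
]
batch = make_batch(tiles, V, rng)
print(batch_shape(batch))
```

Fixing the number of points per tile is what makes the encoder input a regular $B \times V \times D_{in}$ array regardless of how many points each tile originally contains.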

Encoding layer and temporal contrast layer: Five encoding layers are used in our network to progressively reduce the size of the point clouds and increase the per-point feature dimensions. Each encoding layer consists of an LDC encoding layer and an RS operation. In particular, only $\frac{1}{4}$ of the point features are retained after each encoding layer, i.e., ($N \rightarrow \frac{N}{4} \rightarrow \frac{N}{16} \rightarrow \frac{N}{64} \rightarrow \frac{N}{256} \rightarrow \frac{N}{512}$). Meanwhile, the per-point feature dimension is increased at each encoding step to preserve more information, i.e., ($8 \rightarrow 32 \rightarrow 128 \rightarrow 256 \rightarrow 512 \rightarrow 1024$). As for the temporal part, the temporal contrast layer leaves the feature dimension unchanged but changes the feature values, which are either enhanced or weakened by the temporal saliency operator $G(\cdot)$.
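As a quick sanity check of the encoder schedule, the sketch below lists the per-stage shapes implied by the two sequences given above. The cumulative divisors and feature dimensions are copied as-is from the text (including the final step to $\frac{N}{512}$); taking $N = 12288$ is an assumption made only for illustration, and the function name `encoder_shapes` is hypothetical.

```python
# Cumulative down-sampling divisors and feature dimensions, as listed in the text.
CUMULATIVE_DIVISORS = [1, 4, 16, 64, 256, 512]
FEATURE_DIMS = [8, 32, 128, 256, 512, 1024]


def encoder_shapes(n_points):
    """Return (points, feature_dim) for the input and each encoding stage."""
    shapes = []
    for divisor, dim in zip(CUMULATIVE_DIVISORS, FEATURE_DIMS):
        if n_points % divisor != 0:
            raise ValueError(f"{n_points} is not divisible by {divisor}")
        shapes.append((n_points // divisor, dim))
    return shapes


# Assumed N for illustration: the per-tile sample count V = 12288.
for points, dim in encoder_shapes(12288):
    print(points, dim)
```

The temporal contrast layer would leave each of these shapes unchanged, since it only rescales feature values.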

Decoding layers: The five encoding layers are followed by five decoding layers. In each decoding layer, we first find the nearest neighbor of each query point using the KNN algorithm and then upsample the point feature set by nearest-neighbor interpolation. Next, the upsampled feature maps are concatenated with the intermediate feature maps produced by the corresponding encoding layer through skip connections, after which a shared MLP is applied to the concatenated features.
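The upsampling and skip-concatenation step of one decoding layer can be sketched as follows. This is a minimal, brute-force illustration in plain Python, not the network's implementation: the nearest-neighbor search is exhaustive, the shared MLP is omitted, and the concatenation order (upsampled features first, then skip features) as well as the names `nearest_index` and `upsample_and_concat` are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(query, support_xyz):
    """Index of the support point closest to the query (brute force)."""
    return min(range(len(support_xyz)),
               key=lambda i: squared_distance(query, support_xyz[i]))


def upsample_and_concat(dense_xyz, skip_feats, coarse_xyz, coarse_feats):
    """Copy each dense point's nearest coarse feature, then append its skip feature."""
    out = []
    for query, skip in zip(dense_xyz, skip_feats):
        j = nearest_index(query, coarse_xyz)
        out.append(coarse_feats[j] + skip)  # list concatenation along the feature axis
    return out


# Two coarse points are upsampled back onto four dense points.
coarse_xyz = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
coarse_feats = [[1.0, 2.0], [3.0, 4.0]]
dense_xyz = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 1.0, 1.0), (0.9, 1.0, 1.0)]
skip_feats = [[0.5], [0.6], [0.7], [0.8]]

decoded = upsample_and_concat(dense_xyz, skip_feats, coarse_xyz, coarse_feats)
print(decoded)
```

Each output row has the coarse feature width plus the skip feature width, which is the input a shared MLP would then reduce.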

Final prediction: The final predicted label of each point is obtained through three shared fully-connected layers, $(N, 64) \rightarrow (N, 32) \rightarrow (N, \mathit{nclass})$, and a dropout layer, where $\mathit{nclass} = 2$. The dropout ratio is 0.5.

-B Details of ground truth production

Our client side plays four point cloud video sequences to 40 users wearing HTC Vive HMDs, and our server side records all user trajectories using Unity. As shown in Fig. 1, we superimpose the viewports of the 40 users to obtain the frequency with which each point was viewed. We then divide the point cloud frames into 8 intervals according to frequency. We randomly select 100 frames and calculate the intersection ratio between the point cloud within each frequency interval and the point cloud visible to the actual users.
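The text does not define the intersection ratio formally. One plausible reading, consistent with the caption of Fig. 1, is the share of a user's visible points that also belong to the set of points viewed at least a given number of times. The sketch below implements that reading only; the function name `coverage_ratio` and the toy data are hypothetical.

```python
def coverage_ratio(frequency, user_visible, min_freq):
    """Share of a user's visible points whose view frequency is >= min_freq."""
    if not user_visible:
        raise ValueError("user_visible must not be empty")
    covered = sum(1 for p in user_visible if frequency.get(p, 0) >= min_freq)
    return covered / len(user_visible)


# Toy data: point id -> number of users (out of 40) who viewed the point.
frequency = {0: 40, 1: 12, 2: 5, 3: 4, 4: 0}
user_visible = {0, 1, 2, 3}  # points visible to one particular user

print(coverage_ratio(frequency, user_visible, min_freq=5))
print(coverage_ratio(frequency, user_visible, min_freq=10))
```

Lowering `min_freq` enlarges the frequency-selected region, so the ratio can only stay the same or grow, which matches the trend described for Fig. 1.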

Figure 1: The relationship between the percentage of overlapping area coverage and frequency. The maximum frequency of a viewed point is 40, the overlap area expands with the frequency interval, and the overlap area with a frequency greater than or equal to 5 accounts for almost 100% of the user's viewport.
TABLE I: The performance of STVP at different K values

| K value | Point-level MIoU | Tile-level MIoU |
|---|---|---|
| 10 | 66.30% | 77.80% |
| 11 | 68.05% | 78.52% |
| 12 | 73.01% | 80.98% |
| 13 | 74.07% | 81.96% |
| 14 | 75.54% | 82.08% |
| 15 | 75.64% | 82.60% |
| 16 | 82.09% | 88.10% |
| 17 | 79.53% | 84.47% |
| 18 | 80.64% | 86.23% |

We can see that the overlapping regions with frequencies of 5 or more closely approach the user's viewport. Based on this experimental result, and considering that each user's viewpoint depends on both the user's own trajectory and the salient regions in the video, we mark the low-frequency interval [0, 5) as the non-FoV region, which excludes the randomness of individual users, and take the point cloud region with a frequency greater than or equal to 5 as the FoV region, i.e., the ground truth.
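The labeling rule can be sketched as follows: count how many users' viewports contain each point, then threshold the count at 5. This is a minimal illustration, assuming each viewport is available as a set of visible point ids; the names `view_frequency` and `fov_labels` are hypothetical, and the toy example uses 6 users rather than 40.

```python
from collections import Counter


def view_frequency(viewports):
    """Count, for every point id, how many users' viewports contain it."""
    counts = Counter()
    for visible in viewports:
        counts.update(set(visible))  # each user counts at most once per point
    return counts


def fov_labels(point_ids, viewports, min_freq=5):
    """Label a point 1 (FoV) if at least min_freq users viewed it, else 0 (non-FoV)."""
    counts = view_frequency(viewports)
    return {p: int(counts[p] >= min_freq) for p in point_ids}


# Toy example: point 0 is seen by 6 users, point 1 by 5, point 2 by 4, point 3 by none.
viewports = [{0, 1, 2}] * 4 + [{0, 1}] + [{0}]
labels = fov_labels(range(4), viewports)
print(labels)
```

Points seen by only a few users fall below the threshold and are labeled non-FoV, which is how the rule filters out individual users' random viewing behavior.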