-A Details of the network architecture

Our network follows the widely used encoder-decoder architecture with skip connections. The input consecutive frames are first fed into the URS unit to obtain the mapped sample points, and then into a fully-connected (FC) layer that extracts per-point features. Five LDC modules and the corresponding decoder layers learn the spatial features of each point, and five TC modules are added to learn the inter-frame temporal features. Finally, three FC layers and a dropout (DP) layer predict the label of each point. The details of each part are as follows:

Network input: The input consists of two adjacent frames, each of size $N\times D_{in}$ , where $N$ is the number of points and $D_{in}$ is the feature dimension of each input point.

URS unit: Each frame is divided into $m=12$ 3D tiles at the client side, and $V=12288$ points are sampled within each tile. With a batch size of $B=4$ , the data fed to the encoder at one time has size $B\times V\times D_{in}$ . We select $K=16$ for KNN based on an enumeration experiment that evaluates the network’s performance for different values of $K$ , as shown in Table I. The best performance is observed at $K=16$ . The results also indicate that increasing $K$ further poses a risk of overfitting. Therefore, we introduce a dropout layer that randomly discards neurons with a probability of 0.5.
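The batch construction described above can be sketched in plain Python. This is a minimal illustration, assuming that sampling within a tile is a uniform random draw and that tiles with fewer than $V$ points are padded by drawing with replacement. The source does not state how such tiles are handled, and the names `sample_tile` and `make_batch` are hypothetical.

```python
import random

M_TILES = 12      # m, tiles per frame (from the text)
V_POINTS = 12288  # V, points sampled per tile (from the text)
BATCH = 4         # B, batch size (from the text)

def sample_tile(points, v=V_POINTS, rng=None):
    """Return exactly v points drawn uniformly at random from one non-empty tile.

    Assumption: a tile holding fewer than v points is padded by drawing
    extra points with replacement.
    """
    rng = rng or random.Random(0)
    if len(points) >= v:
        return rng.sample(points, v)  # without replacement
    return list(points) + rng.choices(points, k=v - len(points))

def make_batch(tiles, v=V_POINTS, rng=None):
    """Sample every tile, giving a nested list of shape (len(tiles), v, D_in)."""
    rng = rng or random.Random(0)
    return [sample_tile(tile, v, rng) for tile in tiles]
```

Calling `make_batch` on `BATCH` tiles yields the $B\times V\times D_{in}$ layout that the encoder receives.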

Encoding layer and temporal contrast layer: Five encoding layers are used in our network to progressively reduce the size of the point cloud and increase the per-point feature dimension. Each encoding layer consists of an LDC module and an RS operation. In particular, only $\frac{1}{4}$ of the point features are retained after each encoding layer, i.e., ( $N\rightarrow\frac{N}{4}\rightarrow\frac{N}{16}\rightarrow\frac{N}{64}\rightarrow\frac{N}{256}\rightarrow\frac{N}{512}$ ). Meanwhile, the per-point feature dimension is increased at each encoding layer to preserve more information, i.e., ( $8\rightarrow 32\rightarrow 128\rightarrow 256\rightarrow 512\rightarrow 1024$ ). As for the temporal part, the temporal contrast layer does not change the feature dimension, but it changes the feature values, which are either enhanced or weakened by the temporal saliency operator $G(\cdot)$ .
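To make the two progressions concrete, the short sketch below tabulates the point count and feature dimension at the input of the encoder and after each of the five encoding layers. It assumes that $N$ equals the per-tile sample count $V=12288$, and it takes the divisors and feature dimensions exactly as listed above. The function name `encoder_shapes` is hypothetical.

```python
V = 12288  # points per sampled tile, taken here as N (assumption)
POINT_DIVISORS = [1, 4, 16, 64, 256, 512]    # N, N/4, ..., N/512 as listed
FEATURE_DIMS = [8, 32, 128, 256, 512, 1024]  # per-point feature dimensions as listed

def encoder_shapes(n_points=V):
    """Return (points, feature_dim) at the encoder input and after each layer."""
    return [(n_points // d, c) for d, c in zip(POINT_DIVISORS, FEATURE_DIMS)]

for stage, (n, c) in enumerate(encoder_shapes()):
    print(f"stage {stage}: {n} points x {c} features")
```

Under these assumptions, the deepest stage holds 24 points with 1024 features each.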

Decoding layers: After the above encoding layers, five decoding layers are used. For each layer in the decoder, we first find the nearest neighbor of each query point using the KNN algorithm, and then upsample the point feature set by nearest-neighbor interpolation. Next, the upsampled feature maps are concatenated with the intermediate feature maps generated by the encoding layers through skip connections, after which a shared MLP is applied to the concatenated features.
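One decoding step can be illustrated as follows. This is a brute-force sketch in plain Python, not the implementation used in the network: it performs the nearest-neighbor lookup by exhaustive search, omits the shared MLP that follows, and uses hypothetical names (`nearest_index`, `upsample_and_concat`) and input layouts.

```python
def nearest_index(query, coords):
    """Index of the point in coords closest to query (squared Euclidean distance)."""
    return min(
        range(len(coords)),
        key=lambda i: sum((q - c) ** 2 for q, c in zip(query, coords[i])),
    )

def upsample_and_concat(dense_xyz, coarse_xyz, coarse_feats, skip_feats):
    """Nearest-neighbor upsampling followed by skip concatenation.

    dense_xyz:    coordinates at the higher (output) resolution
    coarse_xyz:   coordinates at the lower (input) resolution
    coarse_feats: one feature list per coarse point
    skip_feats:   encoder features, one feature list per dense point
    """
    out = []
    for xyz, skip in zip(dense_xyz, skip_feats):
        j = nearest_index(xyz, coarse_xyz)            # nearest coarse point
        out.append(list(coarse_feats[j]) + list(skip))  # copy, then concatenate
    return out
```

Each dense point copies the feature of its nearest coarse point, so the output has as many rows as `dense_xyz` and a width equal to the sum of the two feature widths.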

Final prediction: The final predicted label of each point is obtained through three shared fully-connected layers $(N,64)\rightarrow(N,32)\rightarrow(N,nclass)$ and a dropout layer, where $nclass=2$ . The dropout ratio is 0.5.
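The prediction head can be sketched in plain Python as below. Several details are assumptions rather than facts from the text: the ReLU activations, the placement of dropout before the last layer, the random weight initialisation, and all function names. Only the layer widths (64, 32, $nclass=2$) and the dropout ratio of 0.5 come from the description above.

```python
import random

def linear(x, w, b):
    """Apply y = W x + b to one point's feature vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, p, rng, training):
    """Inverted dropout: zero each value with probability p during training only."""
    if not training:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]

def init_layer(n_in, n_out, rng):
    """Illustrative random weights and zero biases for one FC layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def predict_labels(feats, layers, p=0.5, training=False, rng=None):
    """Shared FC layers: the same three layers are applied to every point."""
    rng = rng or random.Random(0)
    (w1, b1), (w2, b2), (w3, b3) = layers
    labels = []
    for x in feats:
        h = relu(linear(x, w1, b1))       # -> 64 features
        h = relu(linear(h, w2, b2))       # -> 32 features
        h = dropout(h, p, rng, training)  # placement is an assumption
        logits = linear(h, w3, b3)        # -> nclass scores
        labels.append(max(range(len(logits)), key=logits.__getitem__))
    return labels
```

With `layers = [init_layer(d, 64, rng), init_layer(64, 32, rng), init_layer(32, 2, rng)]` for an input width `d`, `predict_labels` returns one label in {0, 1} per point.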

-B Details of ground truth production

Our client side plays 4 point cloud video sequences to 40 users using HTC Vive HMDs, and our server side records all user trajectories using Unity. As shown in Fig. 1, we superimpose the viewports of 40 users to get the frequency with which each point was viewed. We then divide the point cloud frames into 8 intervals according to frequency. We randomly select 100 frames and calculate the intersection ratio of the point cloud within the frequency intervals to the point cloud visible to the actual users.
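The superposition step can be sketched as a simple counting procedure. The sketch assumes each user's viewport is available as a set of visible point ids, which is a hypothetical input format, and it implements one possible reading of the intersection ratio: the mean fraction of each user's visible points whose view frequency reaches a given minimum. The function names are illustrative.

```python
from collections import Counter

def view_frequency(viewports):
    """Count, for each point id, how many users had it inside their viewport."""
    freq = Counter()
    for visible in viewports:
        freq.update(set(visible))  # each user counts at most once per point
    return freq

def coverage(viewports, freq, min_freq):
    """Mean fraction of each user's visible points with frequency >= min_freq."""
    ratios = []
    for visible in viewports:
        if visible:
            kept = sum(1 for p in visible if freq[p] >= min_freq)
            ratios.append(kept / len(visible))
    return sum(ratios) / len(ratios) if ratios else 0.0
```

Evaluating `coverage` for increasing values of `min_freq` gives a curve relating frequency to viewport coverage.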

Figure 1: The relationship between the percentage of overlapping area coverage and frequency. We know that the maximum frequency of the viewed point cloud is 40, the overlap area expands with the frequency interval, and the overlap area with a frequency greater than or equal to 5 accounts for almost 100 $\%$ of the user’s viewport.

TABLE I: The performance of STVP at different K values

K value	Point-level MIoU	Tile-level MIoU
10	66.30 $\%$	77.80 $\%$
11	68.05 $\%$	78.52 $\%$
12	73.01 $\%$	80.98 $\%$
13	74.07 $\%$	81.96 $\%$
14	75.54 $\%$	82.08 $\%$
15	75.64 $\%$	82.60 $\%$
16	82.09 $\%$	88.10 $\%$
17	79.53 $\%$	84.47 $\%$
18	80.64 $\%$	86.23 $\%$

We can see that the overlapping regions with frequencies greater than or equal to 5 closely approach the user’s viewport. Based on this experimental result, and considering that each user’s viewpoint is related to both the user’s own trajectory and the salient regions in the video, we mark the low-frequency interval [0, 5) as the non-FoV region, which excludes the randomness of individual users, and take the point cloud region with a frequency greater than or equal to 5 as the FoV region, i.e., the ground truth.
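The labeling rule amounts to a single threshold on the per-point view frequency. A minimal sketch follows. The encoding of FoV as 1 and non-FoV as 0 is an assumption, as is the function name `label_points`.

```python
FOV_THRESHOLD = 5  # minimum view frequency for the FoV class (from the text)

def label_points(frequencies, threshold=FOV_THRESHOLD):
    """Map per-point view counts to binary labels: 1 = FoV, 0 = non-FoV."""
    return [1 if f >= threshold else 0 for f in frequencies]
```

For example, view counts of 0, 4, 5, and 40 map to the labels 0, 0, 1, and 1.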