License: CC BY 4.0
arXiv:2402.12532v3 [cs.CV] 23 Feb 2024

Scalable Human-Machine Point Cloud Compression

Mateen Ulhaq and Ivan V. Bajić
School of Engineering Science
Simon Fraser University
Burnaby, BC, Canada

Abstract

Due to the limited computational capabilities of edge devices, deep learning inference can be quite expensive. One remedy is to compress and transmit point cloud data over the network for server-side processing. Unfortunately, this approach can be sensitive to network factors, including available bitrate. Luckily, the bitrate requirements can be reduced without sacrificing inference accuracy by using a machine task-specialized codec. In this paper, we present a scalable codec for point-cloud data that is specialized for the machine task of classification, while also providing a mechanism for human viewing. In the proposed scalable codec, the “base” bitstream supports the machine task, and an “enhancement” bitstream may be used for better input reconstruction performance for human viewing. We base our architecture on PointNet++, and test its efficacy on the ModelNet40 dataset. We show significant improvements over prior non-specialized codecs.

Index Terms:
deep learning, point cloud compression, coding for machines, scalable coding, classification

I Introduction

Point clouds representing 3D visual data are increasingly being used in many applications, including augmented reality, robotics, and autonomous driving. Advances in deep learning have led to the development of deep models for performing machine vision tasks on point cloud data, including classification, object detection, and segmentation. However, most deep learning models are computationally expensive. This poses a challenge for computationally-limited edge devices that want to perform machine vision tasks, and yet are limited in size, energy consumption, cost, and other factors.

One option for performing a machine task on the edge device is to limit the complexity of the model. Unfortunately, this usually comes at the cost of model accuracy. Another option is to transmit the input data to a server for machine analysis. In this approach, the edge device compresses the input data prior to transmission, often using a codec designed to reconstruct the input for human viewing. However, such input reconstruction codecs are not optimized for machine analysis. Thus, they often spend large amounts of bits on encoding information that is not relevant to the machine task. This results in a worse rate-accuracy trade-off than is possible with a more specialized codec. In situations where a low rate is desired — for instance, in areas of poor network connectivity, or congested networks — using a non-specialized codec may result in excessively high machine task latency [1].

In order to improve the rate-accuracy trade-off, we may instead use a codec that is specialized for the machine task. Such codecs often perform part of the machine task on the edge device itself. In this hybrid approach, the edge device simultaneously compresses the input and performs part of the machine task. This allows the model to discard unnecessary information, thus reducing the rate, resulting in a system that is more robust to changing network conditions, and may reduce system latency over a certain range of available bitrates [2].

In [3], a novel codec for point cloud classification was proposed. This learned codec, based on PointNet [4], compresses the input point cloud into a highly compressed representation that is intended solely for machine analysis, in this case classification. This codec was shown to achieve a significantly better rate-accuracy trade-off in comparison with alternative methods using standard codecs that are not specialized for machine analysis. This was achieved by removing not only statistical redundancy, but also task-irrelevant information, during the compression process.

While the codec in [3] achieves a good rate-accuracy trade-off for point cloud classification, it is not suitable for other purposes. Most applications involving automated machine-based analysis are expected to run the machine task continuously, but may occasionally require human verification or review. Hence, it is important to develop codecs that efficiently support both tasks: machine vision and human viewing. In this paper, we present such a scalable codec, the first in the point cloud literature, which supports point cloud classification while also providing a mechanism for human viewing. Our code is available online (https://github.com/multimedialabsfu/learned-point-cloud-compression-for-classification).

II Related work

Point cloud classification is among the most researched point cloud analysis tasks. Related classification models accept different input formats, including point lists (e.g., PointNet [4] and PointNet++ [5]), 3D voxel grids (e.g., VoxNet [6]), and Octrees (e.g., OctNet [7]). Since our work builds on PointNet and PointNet++, we assume the input format is a point list.

Conventional handcrafted point cloud codecs include Draco [8] and G-PCC [9] (implemented as TMC13 [10]). More recently, the research focus has shifted towards learned codecs, following the seminal work of Ballé et al. [11], who proposed a variational autoencoder (VAE) based architecture for image compression. In their architecture, the input $\boldsymbol{x}$ is first transformed into a latent representation $\boldsymbol{y}$, which is then quantized and entropy-coded using a learned entropy model. Such an architecture can be trained end-to-end using the loss

$$\mathcal{L} = R + \lambda \cdot D(\boldsymbol{x}, \boldsymbol{\hat{x}}), \tag{1}$$

where $D(\boldsymbol{x}, \boldsymbol{\hat{x}})$ is the distortion measure between the input $\boldsymbol{x}$ and the decoded $\boldsymbol{\hat{x}}$, and $R$ is the estimate of the entropy of the quantized latent $\boldsymbol{\hat{y}}$. This mechanism has been successful in various fields of learned compression, including point cloud compression [12, 13, 14, 15, 16].
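As a rough illustration of eq. (1), the sketch below estimates the rate as the ideal code length of the quantized symbols under a fixed probability table, and uses mean squared error as the distortion. The probability table, the choice of MSE, and all function names are assumptions made for this example; the cited codecs learn their entropy models and may use other distortion measures.

```python
import math

def estimate_rate_bits(symbols, pmf):
    """Ideal code length in bits: the sum of -log2 p(s) over all symbols."""
    return sum(-math.log2(pmf[s]) for s in symbols)

def mean_squared_error(x, x_hat):
    """Stand-in distortion D(x, x_hat); the choice of MSE is an assumption."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def rd_loss(symbols, pmf, x, x_hat, lam):
    """Loss of eq. (1): L = R + lambda * D(x, x_hat)."""
    return estimate_rate_bits(symbols, pmf) + lam * mean_squared_error(x, x_hat)

# Hypothetical toy values, chosen only for illustration.
pmf = {-1: 0.25, 0: 0.5, 1: 0.25}   # assumed probabilities of quantized symbols
y_hat = [0, 1, 0, -1]               # quantized latent
x = [0.0, 1.0, 2.0]                 # input (flattened)
x_hat = [0.0, 1.0, 1.0]             # reconstruction
loss = rd_loss(y_hat, pmf, x, x_hat, lam=3.0)
print(loss)
```

A larger $\lambda$ puts more weight on the distortion term, so training tends to spend more bits in exchange for a more faithful reconstruction.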

Specialized compression for machine tasks — often referred to as coding for machines — has been explored for images [17] and video [18] and, recently, for point clouds [3]. Moreover, scalable multi-task coding approaches [19, 20] have shown that one can perform a machine vision task at a fairly low bitrate, while enabling other tasks, such as input reconstruction for human viewing, with an additional enhancement bitstream.

While quality-scalable point cloud coding has been studied before [21], this paper presents the first scalable multi-task point cloud codec: the base bitstream supports point cloud classification, while the enhancement bitstream allows point cloud reconstruction. Unlike our earlier work [3], which presented a classification-optimized codec based on PointNet [4], our scalable codec is based on PointNet++ [5], which is a hierarchical extension of PointNet.

III Proposed codec

Figure 1: High-level comparison of codec architectures. (a) Input compression. (b) Machine task compression, as used in e.g. [3]. (c) Scalable multi-task compression.
Figure 2: Proposed codec architecture.
Figure 3: Proposed codec architecture (details).

III-A Preliminaries

In fig. 1(a), we show an abstract representation of an input compression codec. The input point cloud $\boldsymbol{x}$ is encoded and decoded as $\boldsymbol{\hat{x}}$ by any desired point cloud codec, including non-learned codecs such as G-PCC [9]. Then, the reconstructed point cloud $\boldsymbol{\hat{x}}$ is fed into a classification model (e.g., PointNet) in order to obtain the class prediction $\boldsymbol{\hat{t}}$. This approach provides a baseline for comparison with our proposed codec.

In fig. 1(b), we show an abstract representation of a machine task codec, as was explored by [3] for point cloud classification. Using the same terminology as in [11], $g_a$ refers to the analysis transform, and $g_s$ refers to the synthesis transform. In this codec, the input point cloud $\boldsymbol{x}$ is first encoded into a latent representation $\boldsymbol{y} = g_a(\boldsymbol{x})$. This is then quantized as $\boldsymbol{\hat{y}} = Q(\boldsymbol{y})$, and then losslessly compressed using a learned entropy model. For instance, in [3], a fully-factorized entropy model was used. The reconstructed latent representation $\boldsymbol{\hat{y}}$ may then be used to predict the classes $\boldsymbol{\hat{t}} = g_{s,t}(\boldsymbol{\hat{y}})$.
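The data flow of such a machine task codec can be sketched with toy stand-ins for each stage. Everything below is hypothetical: the per-axis mean used as the analysis transform, the linear task backend, and the example weights are assumptions chosen to keep the example small, not the learned transforms of [3]. Entropy coding is lossless, so the decoder sees exactly the quantized latent, and that step is omitted here.

```python
def analysis_transform(points):
    """Toy stand-in for g_a: the per-axis mean of the points (an assumption)."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]

def quantize(y):
    """Q: round each latent element to the nearest integer."""
    return [round(v) for v in y]

def task_backend(y_hat, weights):
    """Toy stand-in for g_{s,t}: linear class scores followed by argmax."""
    scores = [sum(w * v for w, v in zip(row, y_hat)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical input with two points and a two-class backend.
x = [(0.9, 0.1, 0.0), (1.1, -0.1, 0.2)]
weights = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

y = analysis_transform(x)          # latent y = g_a(x)
y_hat = quantize(y)                # quantized latent y_hat = Q(y)
t_hat = task_backend(y_hat, weights)  # class prediction from y_hat only
print(y_hat, t_hat)
```

Note that the input is never reconstructed: the prediction is computed directly from the quantized latent.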

III-B Scalable human-machine compression codec

In fig. 1(c), we show a high-level representation of a scalable multi-task codec. Following the principle of latent space scalability [20], the scalable multi-task codec splits the latent space into two parts, $[\boldsymbol{\hat{y}}_1, \boldsymbol{\hat{y}}_2] = \operatorname{split}(\boldsymbol{\hat{y}})$, along the channel dimension. The first part $\boldsymbol{\hat{y}}_1$ is called the “base” layer and the second part $\boldsymbol{\hat{y}}_2$ is called the “enhancement” layer. The base layer is used for the machine task (i.e., classification) to predict the class $\boldsymbol{\hat{t}} = g_{s,t}(\boldsymbol{\hat{y}}_1)$. Both base and enhancement layers are used for input reconstruction: $\boldsymbol{\hat{x}} = g_{s,x}(\operatorname{concat}[\operatorname{detach}(\boldsymbol{\hat{y}}_1), \boldsymbol{\hat{y}}_2])$. This approach allows for scalability: the enhancement bitstream only needs to be computed and transmitted when human viewing is desired. Note that the $\operatorname{detach}$ operation is used to disable gradient propagation of $D(\boldsymbol{x}, \boldsymbol{\hat{x}})$ backwards through $\boldsymbol{\hat{y}}_1$. This improves the specialization of $\boldsymbol{\hat{y}}_1$ towards the machine task; otherwise, $\boldsymbol{\hat{y}}_1$ may end up with some enhancement-layer information [22]. Two separate bitstreams (base and enhancement) are produced, whose total rate is $R_{\boldsymbol{\hat{y}}} = R_{\boldsymbol{\hat{y}}_1} + R_{\boldsymbol{\hat{y}}_2}$.
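The channel split and the on-demand enhancement bitstream can be sketched as follows. The channel counts, rates, and function names are hypothetical and chosen for illustration. The detach operation only affects gradients during training, so it has no counterpart in this inference-time sketch.

```python
def split(y_hat, num_base):
    """Split latent channels into (base, enhancement) parts."""
    return y_hat[:num_base], y_hat[num_base:]

def transmitted_bits(rate_base, rate_enh, human_viewing):
    """Base bits are always sent; enhancement bits only when viewing is requested."""
    return rate_base + (rate_enh if human_viewing else 0.0)

# Hypothetical quantized latent with 6 channels, 4 of them in the base layer.
y_hat = [3, -1, 0, 2, 5, 0]
y1_hat, y2_hat = split(y_hat, num_base=4)

# The task backend would consume y1_hat only, while the reconstruction
# decoder consumes the concatenation of both parts along the channel axis.
recon_input = y1_hat + y2_hat

print(transmitted_bits(10.0, 30.0, human_viewing=False))
print(transmitted_bits(10.0, 30.0, human_viewing=True))
```

When only classification is needed, the cost is the base rate alone; the total rate $R_{\boldsymbol{\hat{y}}_1} + R_{\boldsymbol{\hat{y}}_2}$ is paid only when a reconstruction is requested.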

III-C Proposed architecture

Our complete proposed codec architecture, based on PointNet++ [5], is shown in fig. 2, and the details of each block are shown in fig. 3. Our codec is provided in “full” and “lite” configurations, as detailed in table I. The input point cloud of $P$ points is represented as a matrix $\boldsymbol{x} \in \mathbb{R}^{3 \times P}$ of $(x, y, z)$ coordinates. $\boldsymbol{u}$ is an additional set of attribute features (e.g., normals or color) that optionally may also be compressed. At the beginning of the encoder, $\boldsymbol{x}$ and $\boldsymbol{u}$ are concatenated along the channel dimension so that $\boldsymbol{u}^{(0)} = \operatorname{concat}[\boldsymbol{x}, \boldsymbol{u}]$. This ensures that the same encoding capabilities are available for both. We feed $\boldsymbol{x}^{(0)} = \boldsymbol{x}$ and $\boldsymbol{u}^{(0)}$ into a sequence of downsampling blocks. The $i$-th downsampling block takes in $\boldsymbol{x}^{(i-1)}$ and $\boldsymbol{u}^{(i-1)}$, and outputs a smaller set of centroids $\boldsymbol{x}^{(i)}$ and features $\boldsymbol{u}^{(i)}$, along with features $\boldsymbol{u_g}^{(i-1)}$ that are grouped by the centroids. The final $\boldsymbol{u}^{(3)}$ is compressed in a multi-task scalable manner, from which $\boldsymbol{\hat{u}}^{(3)}$ is derived. Each $\boldsymbol{u_g}^{(i-1)}$ is compressed using a standard transform-encoder-decoder-transform compression pipeline to obtain $\boldsymbol{\hat{u}_g}^{(i-1)}$. A sequence of upsampling blocks is then applied, where the $i$-th upsampling block takes in $\boldsymbol{\hat{u}}^{(i)}$. For $i > 0$, the output of the $i$-th upsampling block is concatenated with $\boldsymbol{\hat{u}_g}^{(i-1)}$ to give $\boldsymbol{\hat{u}}^{(i-1)}$. For $i = 0$, the output is $\boldsymbol{\hat{x}}$.

III-C1 Downsampling

Each downsampling block is a PointNet++ “set abstraction” layer (see [5] for more details), with minor modifications. We used single-scale grouping (SSG) of points for our proposed codec, though it could likely be improved with multi-scale grouping (MSG) or multi-resolution grouping (MRG) as described in [5]. The “set abstraction” layer takes as input a (subsampled) point cloud $\boldsymbol{x}^{(i-1)}$ of shape $3 \times P^{(i-1)}$ and features $\boldsymbol{u}^{(i-1)}$ of shape $D^{(i-1)} \times P^{(i-1)}$ containing information about each corresponding point. Using farthest point sampling (FPS), a set of $P^{(i)}$ centroids $\boldsymbol{x}^{(i)}$ of shape $3 \times P^{(i)}$ is selected from $\boldsymbol{x}^{(i-1)}$. Then, the points in $\boldsymbol{x}^{(i-1)}$ are grouped into $P^{(i)}$ groups of $S^{(i)}$ points each. For each centroid, a ball query [5] is performed to find the first $S^{(i)}$ points that are within a radius of $R^{(i)}$ from the given centroid. (Note that due to the ball query, the same point may be assigned to multiple centroids. Also, centroids are not always present within their own group of points. Nonetheless, the ball query has the benefit of scale-invariant grouping.) The relative positions of each group of points are then computed with respect to the associated centroid, resulting in the residuals $\boldsymbol{r}^{(i)} = \operatorname{ball\_query}(\boldsymbol{x}^{(i-1)}, \boldsymbol{x}^{(i)}, R^{(i)}) - \operatorname{repeat}(\boldsymbol{x}^{(i)}, S^{(i)})$ of shape $3 \times P^{(i)} \times S^{(i)}$. Each point in $\boldsymbol{r}^{(i)}$ is concatenated with its respective feature vector in $\boldsymbol{u}^{(i-1)}$, resulting in a grouped feature tensor $\boldsymbol{u_g}^{(i-1)}$ of shape $(D^{(i-1)} + 3) \times P^{(i)} \times S^{(i)}$. Then, $\boldsymbol{u_g}^{(i-1)}$ is fed into a miniature group-wise “PointNet encoder” block consisting of a shared multi-layer perceptron (MLP) with a max pooling operation at the end. That is, the same PointNet encoder is applied to each group of features independently and identically. This results in $\boldsymbol{u}^{(i)}$ of shape $D^{(i)} \times P^{(i)}$. Three downsampling blocks are used in our proposed codec. The last downsampling block groups all the remaining points into a single group, i.e., $P^{(3)} = 1$.
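The sampling and grouping steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: starting FPS from index 0, padding short groups by repeating the first hit, and all function names are choices made for this example.

```python
import math

def farthest_point_sampling(points, num_centroids, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.
    Starting from index `start` is an assumption of this sketch."""
    selected = [start]
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < num_centroids:
        idx = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(idx)
        min_dist = [min(d, math.dist(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return selected

def ball_query(points, centroid, radius, group_size):
    """Return the first `group_size` points within `radius` of `centroid`.
    Short groups are padded by repeating the first hit (an assumed convention)."""
    hits = [p for p in points if math.dist(p, centroid) <= radius][:group_size]
    if not hits:
        hits = [centroid]
    while len(hits) < group_size:
        hits.append(hits[0])
    return hits

def group_residuals(points, centroid_idx, radius, group_size):
    """For each centroid, gather its group and subtract the centroid position."""
    groups = []
    for ci in centroid_idx:
        c = points[ci]
        group = ball_query(points, c, radius, group_size)
        groups.append([tuple(a - b for a, b in zip(p, c)) for p in group])
    return groups

# Hypothetical four-point cloud forming two small clusters.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.1, 0.0)]
centroid_idx = farthest_point_sampling(points, num_centroids=2)
groups = group_residuals(points, centroid_idx, radius=0.2, group_size=2)
print(centroid_idx)
print(groups)
```

Because the query keeps only the first few points inside the ball, a group may omit its own centroid when the group size is smaller than the number of points in range, which matches the caveat noted in the text.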

III-C2 Feature compression

At each level $i < 3$, a latent representation $\boldsymbol{y}^{(i)} = h_a^{(i)}(\boldsymbol{u_g}^{(i)})$ of shape $M^{(i)} \times P^{(i)}$ is computed, then quantized as $\boldsymbol{\hat{y}}^{(i)} = Q(\boldsymbol{y}^{(i)})$, and then entropy-coded using a learned entropy model. The resulting bitstream is transmitted and then decoded as $\boldsymbol{\hat{y}}^{(i)}$. From this, a set of grouped features $\boldsymbol{\hat{u}_g}^{(i)} = h_s^{(i)}(\boldsymbol{\hat{y}}^{(i)})$ is computed for usage during upsampling. (Note that $\boldsymbol{\hat{u}_g}^{(i)}$ and $\boldsymbol{u_g}^{(i)}$ may inhabit entirely different feature spaces, and are unrelated.)

The final feature vector $\boldsymbol{u}^{(3)}$ is fed into the scalable multi-task compression pipeline described in section III-B. The latent representation $\boldsymbol{y}^{(3)} = h_a^{(3)}(\boldsymbol{u}^{(3)})$ is computed, then quantized as $\boldsymbol{\hat{y}}^{(3)} = Q(\boldsymbol{y}^{(3)})$, and then entropy-coded to generate a base bitstream for $\boldsymbol{\hat{y}}_1^{(3)}$ and an enhancement bitstream for $\boldsymbol{\hat{y}}_2^{(3)}$. The base bitstream is decoded as $\boldsymbol{\hat{y}}_1^{(3)}$, and then fed into a decoder-side task backend to obtain the class prediction $\boldsymbol{\hat{t}} = g_{s,t}^{(3)}(\boldsymbol{\hat{y}}_1^{(3)})$. The enhancement bitstream is decoded as $\boldsymbol{\hat{y}}_2^{(3)}$, and then its concatenation with $\boldsymbol{\hat{y}}_1^{(3)}$ (detached) is fed into a decoder-side transform to obtain $\boldsymbol{\hat{u}_g}^{(3)} = h_s^{(3)}(\operatorname{concat}[\operatorname{detach}(\boldsymbol{\hat{y}}_1^{(3)}), \boldsymbol{\hat{y}}_2^{(3)}])$.

Note that we split only $\boldsymbol{y}^{(3)}$ into $\boldsymbol{y}^{(3)}_1$ and $\boldsymbol{y}^{(3)}_2$. This is because $\boldsymbol{y}^{(3)}_1$ suffices for the classification task, while $\boldsymbol{y}^{(3)}_2$ and $\boldsymbol{y}^{(j)}$ for $j < 3$ are only needed for input reconstruction. For brevity, in the experiments we only test input reconstruction with all enhancement layers, but the proposed architecture also allows for quality- or density-scalable input reconstruction.

III-C3 Upsampling

The $i$-th upsampling block takes in $E^{(i+1)} + D^{(i)} + 3 \cdot (1 - \delta_{i,3})$ channels and outputs $E^{(i)}$ channels, where $\delta_{i,j}$ is the Kronecker delta ($\delta_{i,j} = 1$ if $i = j$, and $0$ otherwise). Each upsampling block consists of a simple two-layer point-wise MLP. For upsampling blocks at levels $i > 0$, batch normalizations are also present. Furthermore, each such block is followed by a reshape-transpose-reshape combination that converts the resulting shape $(E^{(i)} \cdot S^{(i)}) \times P^{(i)}$ into the output shape $E^{(i)} \times (S^{(i)} \cdot P^{(i)})$. To ensure compatibility of shapes among various components during reconstruction, we enforce the constraint $P^{(i-1)} = P^{(i)} \cdot S^{(i)}$.
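The reshape-transpose-reshape step can be illustrated with nested lists. The memory layout here is an assumption made for this sketch: input channel index $e \cdot S + s$, and an output in which the $S$ points generated from each centroid sit next to each other. The text specifies only the input and output shapes.

```python
def unfold_points(t, E, S):
    """Convert a tensor of shape (E*S) x P into shape E x (P*S).

    Equivalent to: reshape to (E, S, P), transpose to (E, P, S),
    then reshape to (E, P*S). The layout is an assumption of this sketch.
    """
    assert len(t) == E * S
    P = len(t[0])
    out = []
    for e in range(E):
        row = []
        for p in range(P):
            for s in range(S):
                row.append(t[e * S + s][p])
        out.append(row)
    return out

# Hypothetical tensor with E=2, S=2, P=3, where entry = 100*e + 10*s + p.
t = [
    [0, 1, 2],        # e=0, s=0
    [10, 11, 12],     # e=0, s=1
    [100, 101, 102],  # e=1, s=0
    [110, 111, 112],  # e=1, s=1
]
out = unfold_points(t, E=2, S=2)
print(out)
```

Each output row has $P \cdot S$ entries, which is why the next (shallower) level must contain exactly $P^{(i-1)} = P^{(i)} \cdot S^{(i)}$ points.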

TABLE I: Architecture hyperparameter configurations
Codec  $i$  $P^{(i)}$  $S^{(i)}$  $R^{(i)}$  $D^{(i)}$  $E^{(i)}$  $M^{(i)}$
full   0    1024       -          -          3          3          0
       1    256        4          0.2        128        64         0
       2    64         4          0.4        192        32         64
       3    1          64         -          256        16         (48, 16)
lite   0    1024       -          -          3          3          0
       1    256        4          0.2        32         32         0
       2    64         4          0.4        48         16         16
       3    1          64         -          64         8          (48, 16)
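
The shape constraint $P^{(i-1)} = P^{(i)} \cdot S^{(i)}$ can be checked against the values in Table I. The short sketch below does this, assuming that the blank entries of the table belong to the $S^{(i)}$ and $R^{(i)}$ columns (so that $S^{(3)} = 64$); the helper name is hypothetical.

```python
# Point counts P^(i) and factors S^(i) for levels i = 0..3, as read from
# Table I (the same for the "full" and "lite" configurations).
# Assumption: S is undefined at level 0, and S^(3) = 64.
P = {0: 1024, 1: 256, 2: 64, 3: 1}
S = {1: 4, 2: 4, 3: 64}


def check_shape_constraint(P, S):
    """Return True if P^(i-1) == P^(i) * S^(i) for every level where S is defined."""
    return all(P[i - 1] == P[i] * S[i] for i in S)


print(check_shape_constraint(P, S))
```

Under this reading, $1024 = 256 \cdot 4$, $256 = 64 \cdot 4$, and $64 = 1 \cdot 64$.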

IV Experiments

We trained our models on the ModelNet40 [23] dataset, sampling P=1024𝑃1024P=1024italic_P = 1024 points per object, and reconstructed the same number of points. Our implementation uses the PyTorch, CompressAI [24], and CompressAI Trainer [25] libraries.

Figure 4: Rate-accuracy (RA) and rate-distortion (RD) curves on the ModelNet40 dataset, with rate units of bits per point (bpp) scaled for 1024 points. (a) RA for base task. (b) RA for base task (zoomed in). (c) RD for reconstruction task.

The loss function used for training was

$$\mathcal{L} = \sum_{i=0}^{3} R_{\boldsymbol{\hat{y}}^{(i)}} + \lambda_x D(\boldsymbol{x}, \boldsymbol{\hat{x}}) + \lambda_t D(\boldsymbol{t}, \boldsymbol{\hat{t}}). \tag{2}$$

The rate estimate $R_{\boldsymbol{\hat{y}}^{(i)}}$ for each level is the negative log-likelihood output by the respective entropy model. The distortion $D(\boldsymbol{t}, \boldsymbol{\hat{t}})$ is the cross-entropy between the one-hot encoded labels $\boldsymbol{t}$ and the softmax of the model's prediction $\boldsymbol{\hat{t}}$. The distortion $D(\boldsymbol{x}, \boldsymbol{\hat{x}})$ is the Chamfer distance between the input and reconstructed point clouds. We trained different models to operate at different base/enhancement rate points by varying the hyperparameters $\lambda_t \in \{2^{-7}, 2^{-6}, \ldots, 2^{0}\}$ and $\lambda_x \in [1, 8000]$. These values are reported for rates in units of bits per point (bpp), i.e., the rates are divided by $P = 1024$. To improve stability during training, we disabled the first two levels by setting $M^{(0)} = 0$ and $M^{(1)} = 0$.
We also set $M^{(2)} = 64$ and $M^{(3)} = 64$, and used a partitioning ratio of $0.75$ for the last layer, i.e., $M^{(3)}_1 = 48$ (base) and $M^{(3)}_2 = 16$ (enhancement).
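
To make the terms of the loss concrete, the following is a minimal pure-Python sketch under stated assumptions, not the training code used for the experiments. It assumes the rate is measured in bits (base-2 logarithm) and divided by the number of points, and it uses one common Chamfer distance convention (squared Euclidean distances, averaged over each set and summed over both directions); the paper does not state which convention it uses. All function names are hypothetical.

```python
import math


def rate_bpp(likelihoods, num_points=1024):
    """Negative log2-likelihood of one level's latents, in bits per point."""
    return sum(-math.log2(p) for p in likelihoods) / num_points


def chamfer_distance(a, b):
    """Symmetric Chamfer distance (assumed convention: squared distances, set means)."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def cross_entropy(label, logits):
    """Cross-entropy between a one-hot label (class index) and softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(level_likelihoods, x, x_hat, t, t_logits, lambda_x, lambda_t,
               num_points=1024):
    """Sum of per-level rates plus the two weighted distortion terms."""
    rate = sum(rate_bpp(lk, num_points) for lk in level_likelihoods)
    return (rate
            + lambda_x * chamfer_distance(x, x_hat)
            + lambda_t * cross_entropy(t, t_logits))


def split_channels(num_channels, ratio):
    """Split channels into (base, enhancement) parts by a partitioning ratio."""
    base = round(num_channels * ratio)
    return base, num_channels - base


# Task-weight grid from the text: 2^-7, 2^-6, ..., 2^0.
lambda_t_grid = [2.0 ** k for k in range(-7, 1)]

# Toy example with made-up values, for illustration only.
x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
x_hat = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)]
print(total_loss([[0.5, 0.25], [0.5]], x, x_hat, t=1, t_logits=[0.2, 1.5, -0.3],
                 lambda_x=100.0, lambda_t=lambda_t_grid[-1], num_points=2))
print(split_channels(64, 0.75))
```

Increasing `lambda_t` weights classification accuracy more heavily relative to rate, while increasing `lambda_x` does the same for reconstruction quality.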

The input compression codecs were evaluated using the same methodology for varying $P$ and input scaling as in [3]. Since codecs like TMC13 [10] and OctAttention [15] are tailored to point clouds with a large number of points, the coding overhead they produce may be significant when applied to smaller point clouds such as those used in classification. To make the rate comparison fairer, we excluded this overhead from the measured bitstream size where possible.
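
The effect of discounting overhead on the reported rate can be illustrated with a small calculation. The sketch below is only an illustration of the arithmetic: the function name and the byte counts are hypothetical and are not measurements from the experiments.

```python
def bits_per_point(total_bytes, overhead_bytes, num_points):
    """Rate in bits per point after discounting fixed bitstream overhead."""
    if not 0 <= overhead_bytes <= total_bytes:
        raise ValueError("overhead must be between 0 and the total size")
    return 8 * (total_bytes - overhead_bytes) / num_points


# Hypothetical example: a 300-byte bitstream with 44 bytes of overhead,
# reported per point for a 1024-point cloud.
print(bits_per_point(total_bytes=300, overhead_bytes=44, num_points=1024))
```

For small point clouds, a fixed overhead of this size is a sizeable fraction of the total, which is why ignoring it changes the comparison.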

V Results

Figs. 3(a) and 3(b) show the rate-accuracy curves for the base task of point-cloud classification. Only the rate of the base bitstream is measured for the base task. Scalable codec curves are grouped by their worst-case reconstruction ability measured in terms of Chamfer distance [26]. Our proposed codec achieves better classification performance than all existing codecs, including [3] (denoted “Ulhaq2023”), which in turn is better than input compression codecs (“ICC”) that use a point-cloud codec (TMC13 [10] and OctAttention [15]) followed by a PointNet/PointNet++ classifier. Only the best ICC curves are shown. Note that $P{=}{}^{*}$ indicates the best results attained when varying $P \in \{8, 16, \ldots, 1024\}$.

There is a large gap in base task performance between task-specialized codecs (the proposed one and [3]) and non-specialized codecs because, in addition to statistical redundancy, the former codecs remove a significant amount of task-irrelevant information. Meanwhile, the proposed codec outperforms [3] due to its reliance on the more powerful PointNet++, rather than PointNet.

Fig. 3(c) shows the rate-distortion curves for the reconstruction task, plotting rate against Chamfer distance. For the proposed codec, the rate includes both the base and enhancement bitstreams. The figure shows that our proposed codec also achieves better reconstruction quality than conventional codecs at low rates, below 1.6 bpp. As the rate increases, the conventional codecs eventually catch up in terms of reconstruction quality.

VI Conclusion

In this paper, we proposed a scalable multi-task compression codec for point clouds. Our proposed codec produces a base bitstream that supports point cloud classification and an enhancement bitstream that, together with the base bitstream, supports reconstruction of the point cloud for human viewing. The proposed codec showed a significant improvement in rate-accuracy performance for the base task over existing codecs, while also achieving competitive performance against conventional codecs on input reconstruction at reasonably low rates. In the future, we hope that this work will inspire further research into scalable multi-task compression codecs for point clouds on more complex real-world datasets, and for other tasks such as semantic segmentation.

References

  • [1] N. Shlezinger and I. V. Bajić, “Collaborative inference for AI-empowered IoT devices,” IEEE Internet of Things Magazine, vol. 5, no. 4, pp. 92–98, Dec. 2022.
  • [2] I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in Proc. IEEE ICASSP, 2021, pp. 8493–8497.
  • [3] M. Ulhaq and I. V. Bajić, “Learned point cloud compression for classification,” in Proc. IEEE MMSP, 2023.
  • [4] C. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE/CVF CVPR, 2017, pp. 77–85.
  • [5] C. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. NIPS, 2017.
  • [6] D. Maturana and S. A. Scherer, “VoxNet: A 3D convolutional neural network for real-time object recognition,” in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), 2015, pp. 922–928.
  • [7] G. Riegler, A. O. Ulusoy, and A. Geiger, “OctNet: Learning deep 3D representations at high resolutions,” in Proc. IEEE/CVF CVPR, 2017, pp. 6620–6629.
  • [8] Google, “Draco: 3D data compression,” 2017. [Online]. Available: https://google.github.io/draco/
  • [9] K. Mammou, P. A. Chou, D. Flynn, M. Krivokuća, O. Nakagami, and T. Sugio, “G-PCC codec description v2,” 2019, ISO/IEC JTC1/SC29/WG11 N18189.
  • [10] D. Flynn and K. Mammou, “MPEG-PCC-TMC13,” 2021. [Online]. Available: https://github.com/MPEGGroup/mpeg-pcc-tmc13
  • [11] J. Ballé, D. C. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” arXiv preprint arXiv:1802.01436, 2018.
  • [12] M. Quach, G. Valenzise, and F. Dufaux, “Learning convolutional transforms for lossy point cloud geometry compression,” in Proc. IEEE ICIP, 2019, pp. 4320–4324.
  • [13] W. Yan, Y. Shao, S. Liu, T. H. Li, Z. Li, and G. Li, “Deep autoencoder-based lossy geometry compression for point clouds,” arXiv preprint arXiv:1905.03691, 2019.
  • [14] Y. He, X. Ren, D. Tang, Y. Zhang, X. Xue, and Y. Fu, “Density-preserving deep point cloud compression,” arXiv preprint arXiv:2204.12684v1, 2022.
  • [15] C. Fu, G. Li, R. Song, W. Gao, and S. Liu, “OctAttention: Octree-based large-scale contexts model for point cloud compression,” in Proc. AAAI, vol. 36, no. 1, Jun. 2022, pp. 625–633.
  • [16] K.-S. You, P. Gao, and Q. T. Li, “IPDAE: Improved patch-based deep autoencoder for lossy point cloud geometry compression,” in Proc. 1st Int. Workshop on Advances in Point Cloud Compression, Processing and Analysis, 2022.
  • [17] N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari, and E. Rahtu, “Image coding for machines: an end-to-end learned approach,” in Proc. IEEE ICASSP, 2021, pp. 1590–1594.
  • [18] L.-Y. Duan, J. Liu, W. Yang, T. Huang, and W. Gao, “Video Coding for Machines: A paradigm of collaborative compression and intelligent analytics,” IEEE Trans. Image Process., vol. 29, pp. 8680–8695, 2020.
  • [19] Y. Hu, S. Yang, W. Yang, L.-Y. Duan, and J. Liu, “Towards coding for human and machine vision: A scalable image coding approach,” in Proc. IEEE ICME, 2020, pp. 1–6.
  • [20] H. Choi and I. V. Bajić, “Scalable image coding for humans and machines,” IEEE Trans. Image Process., vol. 31, pp. 2739–2754, 2022.
  • [21] A. F. R. Guarda, N. M. M. Rodrigues, and F. Pereira, “Point cloud geometry scalable coding with a single end-to-end deep learning model,” in Proc. IEEE ICIP, 2020, pp. 3354–3358.
  • [22] Y. Foroutan, A. Harell, A. de Andrade, and I. V. Bajić, “Base layer efficiency in scalable human-machine coding,” in Proc. IEEE ICIP, 2023, pp. 3299–3303.
  • [23] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, “3D ShapeNets: A deep representation for volumetric shapes,” in Proc. IEEE/CVF CVPR, 2015, pp. 1912–1920.
  • [24] J. Bégaint, F. Racapé, S. Feltman, and A. Pushparaja, “CompressAI: a PyTorch library and evaluation platform for end-to-end compression research,” arXiv preprint arXiv:2011.03029, 2020.
  • [25] M. Ulhaq and F. Racapé, “CompressAI Trainer,” 2022. [Online]. Available: https://github.com/InterDigitalInc/CompressAI-Trainer
  • [26] H. Blum, “A transformation for extracting new descriptors of shape,” in Models for the Perception of Speech and Visual Form, W. Wathen-Dunn, Ed. Cambridge, MA: MIT Press, 1967, pp. 362–380.