License: arXiv.org perpetual non-exclusive license
arXiv:2403.03296v1 [cs.CV] 05 Mar 2024

CenterDisks: Real-time instance segmentation with disk covering
Acknowledgment: We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Institute for Data Valorization (IVADO) and the Apogée fund.

Katia Jodogne-del Litto LITIV Lab., Polytechnique Montréal
Montréal, Canada
[email protected]
   Guillaume-Alexandre Bilodeau LITIV Lab., Polytechnique Montréal
Montréal, Canada
[email protected]
Abstract

Increasing the accuracy of instance segmentation methods is often done at the expense of speed. Using coarser representations, we can reduce the number of parameters and thus obtain real-time masks.

In this paper, we take inspiration from the set cover problem to predict mask approximations. Given ground-truth binary masks of objects of interest as training input, our method learns to predict the approximate coverage of these objects by disks without supervision on their location or radius. Each object is represented by a fixed number of disks with different radii. In the learning phase, we consider the radius as proportional to a standard deviation in order to compute the error to propagate on a set of two-dimensional Gaussian functions rather than disks.

We trained and tested our instance segmentation method on challenging datasets showing dense urban settings with various road users.

Our method achieves state-of-the-art results on the IDD and KITTI datasets with an inference time of 0.040 s on a single RTX 3090 GPU.

The code is available at:

https://github.com/KatiaJDL/CenterDisks

I Introduction

The development of intelligent vehicles is a major technological advance, which relies heavily on computer vision to observe and interpret the environment. The use cases are multiple in the context of light individual vehicles or public transport, and range from driving assistance mechanisms to autonomous driving. Deployment in urban areas requires the ability to detect road users accurately and in real time in dense scenes. However, the deep learning models used for image processing are generally quite resource intensive. One of the factors at play is their large number of parameters. Reducing their size both speeds them up and lowers the computing resources they require. Smaller models can be used with less powerful and lighter hardware, for example in embedded systems. The climate crisis and its challenges also make it necessary to be aware of the resources consumed for training and production.

To localize road users precisely, detections with bounding boxes are not enough. We need to segment the image to get the pixels belonging to each object of interest. But predicting binary masks is quite slow, and the most precise instance segmentation methods do not reach real-time performance [1, 2]. One way to achieve a trade-off between speed and accuracy is to use mask approximations, with a lighter representation and therefore fewer parameters.

One way to simplify a binary mask is to use only its contours [3, 4, 5, 6], but a binary mask can also be approximated with few parameters by representing its interior directly. This can be viewed as a mask covering problem. We no longer characterize each pixel of the image; instead, we project shapes onto it. The simplest and least expensive shape, in terms of number of parameters, for representing a surface is a disk. Only three parameters are needed: the two coordinates of the center and the radius. Thus, we propose to approximate a complex binary mask with a set of disks (see Figure 1).

Figure 1: Disk covering on a Cityscapes image crop. (a) Original image. (b) Binary mask. (c) Set of covering disks. (d) Mask produced by the disks. The set of covering disks is composed of 24 disks with identical radius.

One of the main advantages of this representation is that it does not require direct supervision or a custom ground-truth. There is no need for a ground-truth representing the centers and radii of the sets of disks. The optimization is performed directly on the binary masks.

Our contributions are the following:

  • We propose a new approach for mask approximation with disk covering;

  • We show that a deep neural network can learn to cover a surface by positioning a set of disks;

  • Our proposed method is real-time and achieves state-of-the-art results on the IDD and KITTI test sets.

II Related works

Real-time instance segmentation

Most traditional instance segmentation methods, like Mask R-CNN [1], PANET [2], or more recently SAM [7], cannot reach real-time performance. However, techniques have been developed to speed them up. First, we can mention sparse representations of object localization. In SOLOv2 [8], the images are divided into cells, and each cell corresponds to a mask, representing the potential object whose center lies in this cell. SparseInst [9] also uses sparse feature maps to produce mask kernels but does not need to localize the objects by their center. Spatial Sampling Net [10] produces a non-uniform density map. It follows the object distribution, and the masks are obtained with a diffusion process through a spatial sampling operator.

Furthermore, combining bounding boxes and segmentation masks can be an acceleration factor. ESE SEG [11] performs the bounding box prediction and the object segmentation at the same time, and Box2Pix [12] combines semantic segmentation and bounding boxes. Finally, real-time performance can be achieved with a simplified representation of masks, with a combination of mask prototypes [13], or with polygonal mask approximations [5, 4].

Figure 2: Architecture for CenterDisks. The backbone is represented here as an Hourglass backbone [14]. The five heads predict the heatmap for object centers, the offsets from this center, the relative depths, and the parameters of the disk sets. The number of parameters displayed on this figure is only an illustration. For implementation details, please refer to Section IV-A and to the code provided.

Instance segmentation with mask approximations

Except for YOLACT [13], which uses mask prototypes and predicted coefficients to combine them, most mask approximation methods consider only the boundary of the objects. A popular representation is the polygonal mask.

Polygon-RNN [15] and Polygon-RNN++ [16] use a recurrent neural network to determine the next vertex. These methods were meant for semi-automated annotation systems, and are very slow. PolarMask [3] fixes the angles of the vertices in a polar representation, which produces a star structure. Poly-YOLO [5] also uses a polar grid, with free angles and a dynamic number of vertices, chosen with a dedicated confidence score during the inference phase. CenterPoly [4] and CenterPolyV2 [17] simultaneously generate heatmaps for object detection and polygon vertices for each pixel. Nevertheless, most of these methods do not consider potential holes in the masks.

However, non-polygonal representations exist. ESE SEG [11] uses an approximation of object boundaries based on Chebyshev polynomials as mask approximation. FourierNet [18] and SCR [19] instead use Fourier series and add to the output a differentiable decoder that allows learning directly on the final shape of the masks. DeepSnake [6] directly generates the contours and then iteratively distorts them to get closer to the actual contours of the objects. BCondInst [20] and BshapeNet [21] use boundary predictions to improve their mask predictions.

Set cover problem

The mask approximation methods presented above rely on the detection of the object contour. One can also consider a representation based on the inside of the objects. Approximating a complex surface with a set of shapes can be seen as a sub-problem of the general set cover problem, which is NP-complete [22]. Depending on the evaluation criteria for the covering, there are several mathematical results to this computational geometry problem for simple configurations.

The problem of covering the unit disk uses an indirect approach by requiring the minimization of the common radius of a set of $n$ disks to cover the whole surface. This mathematical problem is solved exactly for different values of $n$: 1, 2, 3, 4, 5, 7 [23], and approximate radius values have been obtained up to $n = 10$ [24]. Here the location of the disks is secondary, and the radius is the value to be minimized. For less regular surfaces, under real-life conditions, this problem can be interpreted as the placement of radio antennas (or base station placement) [25]. There are approximation algorithms for convex polygons [26]. For more complex surfaces, the optimization algorithms range from integer linear programming [27] and greedy algorithms [28] to meta-heuristics such as genetic algorithms [29].

Thus, there is no exact algorithm to find the position of disks covering any surface by minimizing their radii. The existing algorithms use optimization techniques. In our case, the input is an image with multiple areas that should be covered as exactly as possible. It is not a known surface. Nevertheless, we postulate and demonstrate with practical results that it is possible to optimize the overlap by deep learning with the complete mask as the only supervision.
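To make the classical optimization view concrete, here is a minimal sketch of a greedy covering heuristic, assuming a small binary mask, integer candidate centers, and a single common radius. It illustrates the kind of non-learned baseline mentioned above and is not the method proposed in this paper; the function name `greedy_disk_cover` and the toy mask are hypothetical.

```python
def greedy_disk_cover(mask, radius, n_disks):
    """Greedily place up to n_disks disks of a common radius on a binary mask.

    At each step the candidate center (on the pixel grid) that covers the most
    still-uncovered foreground pixels is selected. Illustrative baseline only.
    """
    height, width = len(mask), len(mask[0])
    uncovered = {(y, x) for y in range(height) for x in range(width) if mask[y][x]}
    r2 = radius ** 2
    centers = []
    for _ in range(n_disks):
        if not uncovered:
            break
        best, best_gain = None, -1
        for cy in range(height):
            for cx in range(width):
                gain = sum(1 for (y, x) in uncovered
                           if (y - cy) ** 2 + (x - cx) ** 2 <= r2)
                if gain > best_gain:
                    best, best_gain = (cy, cx), gain
        centers.append(best)
        uncovered = {(y, x) for (y, x) in uncovered
                     if (y - best[0]) ** 2 + (x - best[1]) ** 2 > r2}
    return centers, len(uncovered)


# Toy example: a 4x4 foreground square inside a 6x6 image.
mask = [[1 if 1 <= y <= 4 and 1 <= x <= 4 else 0 for x in range(6)]
        for y in range(6)]
centers, remaining = greedy_disk_cover(mask, radius=2, n_disks=4)
```

Such a search has to be rerun for every object and every image, and it only accounts for covered foreground pixels, not for disks spilling over the background. This is the cost that a learned, single-pass prediction tries to avoid.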

III Method

Our method is based on the object detector CenterNet [30], which locates objects by their center and regresses from it the coordinates of their bounding box. Instead of bounding boxes, we adapt the prediction heads to obtain a set of centers and radii for each object detected. The architecture of our method is shown in Figure 2. There are five prediction heads. The heatmap and offset heads come directly from CenterNet [30]. With one heatmap for each semantic category, we can extract the peaks, which correspond to the centers of the objects. The relative depth head is inspired by CenterPoly [4]. We added two heads for centers and radii prediction.

III-A Gaussian projection

The covering is densely predicted: a set of $N$ disks is predicted for each pixel, and only the sets corresponding to a peak of the heatmap are kept. Therefore, a set of $N$ disks covers one object. That represents $2N$ coordinates for the centers of the disks, $x_1, y_1, x_2, y_2, \ldots, x_N, y_N$, and $M \leq N$ size parameters $\sigma_1, \ldots, \sigma_M$. These last values can be considered as the standard deviations of two-dimensional Gaussian functions.

Each center is matched with a standard deviation using an association function $\lambda$. $N$, $M$ and $\lambda$ are hyper-parameters of the model. If $z$ is the index of a center, $\lambda(z)$ is the index of the corresponding radius. For example, if the standard deviation is the same for all centers, we must fix $M = 1$ and $\lambda: z \rightarrow 1$. For each center to get its own standard deviation, we set $M = N$ and $\lambda: z \rightarrow z$. We performed an ablation study to choose the number of size parameters $M$ relative to the number of centers $N$.
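The two extreme cases of the association function can be sketched as follows. This is a minimal illustration that uses 0-based indices. The cyclic assignment `z % M` for intermediate values of M is an assumption made here for the example, since the text only specifies the cases M = 1 and M = N; the names `make_association` and `n_disk_parameters` are hypothetical.

```python
def make_association(n_centers, n_radii):
    """Build a 0-based association function: center index -> radius index.

    Assumption: centers are assigned to radii cyclically (z mod M). This
    reproduces the two cases described in the text: a constant function for
    M = 1 and the identity for M = N.
    """
    if not 1 <= n_radii <= n_centers:
        raise ValueError("expected 1 <= M <= N")

    def lam(z):
        if not 0 <= z < n_centers:
            raise IndexError("center index out of range")
        return z % n_radii

    return lam


def n_disk_parameters(n_centers, n_radii):
    """Values predicted per object: 2N center coordinates plus M sizes."""
    return 2 * n_centers + n_radii


shared = make_association(16, 1)        # one radius shared by all disks
individual = make_association(16, 16)   # one radius per disk
print([shared(z) for z in range(4)])      # [0, 0, 0, 0]
print([individual(z) for z in range(4)])  # [0, 1, 2, 3]
print(n_disk_parameters(16, 16))          # 48
```

With 16 disks and 16 radii, an object is therefore described by 48 values, regardless of its size in pixels.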

The equation for the Gaussian function of center i𝑖iitalic_i is

$$f_i(x, y) = e^{-\frac{(x - x_i)^2 + (y - y_i)^2}{2\sigma_{\lambda(i)}^2}}, \tag{1}$$

with which we can deduce the global probability map for one object:

$$f(x, y) = \sum_{i=1}^{N} e^{-\frac{(x - x_i)^2 + (y - y_i)^2}{2\sigma_{\lambda(i)}^2}}. \tag{2}$$

The final result of the prediction head for the disks is a probability map.
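The projection of Equations 1 and 2 can be sketched in a few lines of plain Python. This is only an illustration on a tiny grid, assuming 0-based indices and nested lists; it is not taken from the released code, which relies on PyTorch tensors, and the name `gaussian_map` is hypothetical.

```python
import math


def gaussian_map(centers, sigmas, lam, height, width):
    """Sum of isotropic 2D Gaussians, one per disk center (Equation 2).

    centers: list of (x, y) center coordinates.
    sigmas:  list of size parameters (standard deviations).
    lam:     association function, center index -> index into sigmas.
    """
    prob = [[0.0] * width for _ in range(height)]
    for i, (cx, cy) in enumerate(centers):
        two_var = 2.0 * sigmas[lam(i)] ** 2
        for y in range(height):
            for x in range(width):
                prob[y][x] += math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / two_var)
    return prob


# Two disks placed at the same location share a single size parameter.
centers = [(2.0, 2.0), (2.0, 2.0)]
sigmas = [1.5]
f = gaussian_map(centers, sigmas, lambda i: 0, height=5, width=5)
print(f[2][2])  # 2.0
```

The value 2.0 at the shared center shows that the raw sum is not bounded by 1 when disks overlap, so it cannot be read directly as a probability.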

III-B Training

Since the output is a sum of Gaussian functions, we have to apply a normalization function on these probability maps to reduce their range to $[0, 1]$. This allows us to use classic loss functions for segmentation tasks. We chose to use the hyperbolic tangent function.
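As a short worked note (the symbol $\hat{f}$ is notation introduced here for illustration, not the paper's): every Gaussian term is positive, so the sum $f$ of Equation 2 is positive and bounded by the number of disks $N$, and the hyperbolic tangent maps it into the unit interval:

$$\hat{f}(x, y) = \tanh\big(f(x, y)\big), \qquad 0 < f(x, y) \leq N \;\Rightarrow\; 0 < \hat{f}(x, y) < 1.$$

For instance, at the center of a single isolated disk, $f = 1$ and $\hat{f} = \tanh(1) \approx 0.76$. Where several disks overlap, $f$ grows and $\hat{f}$ saturates towards 1.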

No custom ground-truth (GT) is necessary. We can simply use the binary masks provided with the datasets. The loss function used for the training of the model is defined as follows:

$$\mathrm{Loss} = W_{hm}\,\mathrm{Loss}_{hm} + W_{offset}\,\mathrm{Loss}_{offset} + W_{depth}\,\mathrm{Loss}_{depth} + W_{disks}\,\mathrm{Loss}_{disks}, \tag{3}$$

where $W_{hm}$, $W_{offset}$ and $W_{depth}$ are the respective weights of the losses for the heatmap, offset and relative depth prediction heads, and $W_{disks}$ is the weight of the loss for the disk heads (head for disk centers and head for standard deviations). The loss for the heatmaps describing the center of objects is the focal loss [31], and $\mathrm{Loss}_{offset}$ and $\mathrm{Loss}_{depth}$ are L1 loss functions.

The loss function used for the training of the two disk heads should compute the difference between a GT binary mask and a probability map. We propose the use of two possible loss functions: binary cross-entropy [32] and Dice loss [33]. The first one is a local loss, based on the distribution, while the second is a regional loss, taking into account the whole image. For a pixel $k$ with coordinates $(x, y)$, $z_k \in \{0, 1\}$ is the GT value and $p_k = f(x, y) \in [0, 1]$ is the predicted value, corresponding to Equation 2 after normalization. The losses are defined by

$$L_{cross\_entropy} = -\frac{1}{N} \sum_{k} \Big[ z_k \log(p_k) + (1 - z_k) \log(1 - p_k) \Big] \tag{4}$$

and

$$L_{Dice} = 1 - \frac{2 \sum_{k} z_k p_k}{\sum_{k} z_k + \sum_{k} p_k + \epsilon}, \tag{5}$$

where the sums run over all pixels $k$, $N$ in Equation 4 denotes the number of pixels (not the number of disks), and $\epsilon$ is a smoothing term. $\mathrm{Loss}_{disks}$ is then one of those two.
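The two candidate losses can be sketched as follows on flattened lists of pixel values. This is a plain-Python illustration rather than the released PyTorch implementation. The clamping of predictions inside `bce_loss` and the default values of `eps` are assumptions added here for numerical safety.

```python
import math


def bce_loss(z, p, eps=1e-7):
    """Binary cross-entropy averaged over pixels (Equation 4).

    z: ground-truth values in {0, 1}; p: predicted values in [0, 1].
    Predictions are clamped to [eps, 1 - eps] to avoid log(0) (assumption).
    """
    total = 0.0
    for zk, pk in zip(z, p):
        pk = min(max(pk, eps), 1.0 - eps)
        total += zk * math.log(pk) + (1 - zk) * math.log(1.0 - pk)
    return -total / len(z)


def dice_loss(z, p, eps=1e-6):
    """Dice loss computed over the whole mask (Equation 5)."""
    intersection = sum(zk * pk for zk, pk in zip(z, p))
    return 1.0 - 2.0 * intersection / (sum(z) + sum(p) + eps)


gt = [1, 1, 0, 0]
perfect = [1.0, 1.0, 0.0, 0.0]
inverted = [0.0, 0.0, 1.0, 1.0]
print(dice_loss(gt, inverted))  # 1.0
```

Because the Dice loss is a ratio of sums over the object and the prediction, the many background pixels where both $z_k$ and $p_k$ are zero do not contribute to it, whereas they all enter the cross-entropy average.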

III-C Inference

Figure 3: Thresholding with $\alpha = 0.5$ on the predicted set of disks at inference phase. (a) Mask generated with superposition of Gaussian functions. (b) Final binary mask.

In the inference phase, we perform no normalization on the probability maps representing the objects. After projection with two-dimensional Gaussian functions, the predicted binary mask is obtained by thresholding the map at $\alpha$ (Figure 3).
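The thresholding step can be sketched as below, assuming the Gaussian sum is stored as a nested list. Whether the comparison is strict or not is not stated in the text, so the non-strict comparison used here is an assumption, and `threshold_mask` is a hypothetical name.

```python
def threshold_mask(prob_map, alpha=0.5):
    """Binarize an unnormalized Gaussian-sum map with threshold alpha."""
    return [[1 if value >= alpha else 0 for value in row] for row in prob_map]


prob_map = [[0.1, 0.6, 1.4],
            [0.5, 0.49, 0.0]]
mask = threshold_mask(prob_map)
print(mask)  # [[0, 1, 1], [1, 0, 0]]
```

Values above 1, which occur where disks overlap, simply end up in the foreground, so the absence of normalization does not affect the binary result.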

We may perform contour smoothing with the Douglas-Peucker algorithm [34]. With simple computer graphics tools, we retrieve a polygon approximation of the contour from the binary mask [35] and reduce its number of vertices. The hyper-parameter $\beta$ sets the maximum distance to the original outline as a fraction of the initial perimeter. This post-processing step can reduce the roundness of the mask, which is inherent to disks.
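For reference, the Douglas-Peucker simplification can be sketched in plain Python as follows. This is an illustrative version on an open polyline, assuming the tolerance is `beta` times the polyline length. It measures the distance to the chord segment, a common variant of the original line distance, and it is not the implementation used in the released code.

```python
import math


def _point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment [a, b]."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Keep only the vertices farther than epsilon from the current chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [first, last]
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # the split vertex appears only once


def polyline_length(points):
    """Total length of the polyline through the given points."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


outline = [(0, 0), (1, 0.01), (2, 0), (2, 1), (2, 2)]
beta = 0.01
simplified = douglas_peucker(outline, beta * polyline_length(outline))
print(simplified)  # [(0, 0), (2, 0), (2, 2)]
```

The nearly collinear vertices are removed while the corner is kept, which is how this step can replace rounded disk boundaries with straighter edges.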

IV Experiments

IV-A Implementation

The focus of our study is the segmentation of road users in urban settings. We used the Cityscapes [36], IDD [37] and KITTI [38] datasets. We performed ablation studies on the Cityscapes validation set. The instance categories used correspond to road users: bicycle, bus, car, motorcycle, person, rider, train, truck. The data splits used for training, validation and testing are predefined for all datasets. The images of the Cityscapes dataset were recorded in different cities in Germany. It includes 5,000 images, with a standard resolution of $2048 \times 1024$. In the Indian Driving Dataset (IDD), there are approximately 10,000 images, with sizes varying from $1920 \times 1080$ to $1280 \times 964$. Finally, the KITTI dataset includes 400 images for segmentation, with a size of $1280 \times 384$.

Our main evaluation metric is the Average Precision (AP) [39]. It is an aggregation of the average precision for different IoU thresholds, from 50% to 95%. The AP50%, for average precision with minimum IoU of 50%, and AP50m and AP100m, for objects within a range of 50m and 100m respectively, are also used. Moreover, for all these metrics, we can have access to the average over all categories or to the detailed results.
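The quantity underlying these metrics is the mask IoU between a prediction and a ground-truth instance. The sketch below computes it for two small binary masks and lists the thresholds over which AP is averaged. The 5% step is the usual COCO and Cityscapes convention and is assumed here, since the text only gives the range. The full AP computation (ranking by confidence, precision-recall integration) is not shown.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of identical shape."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return intersection / union if union else 0.0


# IoU thresholds from 50% to 95%, assuming the conventional step of 5%.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

prediction = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
iou = mask_iou(prediction, ground_truth)
matched = [t for t in IOU_THRESHOLDS if iou >= t]
print(iou, matched)  # 0.5 [0.5]
```

A prediction with an IoU of 0.5 counts as a match only at the loosest threshold, which is why the averaged AP is much more demanding than AP50% for coarse mask approximations.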

We implemented our method with PyTorch [40]. We use the Hourglass network [14] with one stack as the backbone for all our experiments. The backbone, heatmap head, and offset head are pre-trained on COCO [39]. We first trained on Cityscapes and then fine-tuned our model for KITTI and IDD. For training, we used classical data augmentation techniques, with a resolution of $1024 \times 512$: color augmentation, random cropping, and flipping. The loss weights are $W_{hm} = 1$, $W_{disks} = 1$, $W_{depth} = 0.1$ and $W_{offset} = 0.1$. For the disk heads we used the Dice loss.
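Substituting these weights into Equation 3, and taking the Dice loss of Equation 5 for the disk heads, the training objective used in the experiments reads:

$$\mathrm{Loss} = \mathrm{Loss}_{hm} + 0.1\,\mathrm{Loss}_{offset} + 0.1\,\mathrm{Loss}_{depth} + L_{Dice}.$$

The detection heatmap and the disk covering thus carry equal weight, while the offset and relative depth terms are down-weighted by a factor of ten.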

The model was trained for 240 epochs with a batch size of 4 on a single RTX 3090 GPU with the Adam optimizer [41]. The starting learning rate was 2e-4, and we divided the learning rate by ten at epochs 90 and 120. The threshold for the inference was fixed to $\alpha = 0.5$. We use 16 disks with 16 different radii.
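The step schedule described above can be written as a small function. This is a sketch that assumes 0-based epoch indices and that each drop takes effect at the start of the milestone epoch; in PyTorch, the same behavior is usually obtained with `torch.optim.lr_scheduler.MultiStepLR` (with `milestones=[90, 120]` and `gamma=0.1`), although the exact call used in the released code is not confirmed here.

```python
def learning_rate(epoch, base_lr=2e-4, milestones=(90, 120), factor=0.1):
    """Step schedule: multiply the learning rate by `factor` at each milestone."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr


for epoch in (0, 89, 90, 119, 120, 239):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the learning rate is 2e-4 for epochs 0 to 89, about 2e-5 for epochs 90 to 119, and about 2e-6 for the remaining epochs.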

Figure 4: Results on the Cityscapes validation set by categories. (a) AP and AP50% for each semantic category. (b) Frequency of each category in the Cityscapes dataset.

IV-B Results

TABLE I: Results on the Cityscapes test set. If the runtimes were not explicitly stated in the original paper, they are estimated based on our knowledge of the method. Results are taken from the original papers or public online benchmarks, unless stated otherwise. Boldface: Best results for real-time methods. Underline: Best results overall.

| Method | Mask type | AP ↑ | AP50% ↑ | AP100m ↑ | AP50m ↑ | Runtime (s) ↓ |
|---|---|---|---|---|---|---|
| Mask-RCNN [1] | Full | 26.22 | 49.89 | 37.63 | 40.11 | ≃ 0.2 |
| PANET [2] | Full | 31.80 | 57.10 | 44.20 | 46.00 | > 1 |
| PolyTransform [42] | Polygon | 40.10 | 65.90 | 54.80 | 58.00 | > 1 |
| DeepSnake [6] | Outline | 31.70 | 58.40 | 43.20 | 44.70 | 4.6 |
| Polygon-RNN++ [16] | Polygon | 25.50 | 45.50 | 39.30 | 43.40 | > 1 |
| Spatial Sampling Net [10] | Full | 9.20 | 16.80 | 16.4 | 21.4 | 0.009 |
| Box2Pix [12] | Full | 13.10 | 27.20 | - | - | 0.092 |
| Poly-YOLO [5] | Polygon | 11.50 | 26.70 | - | - | 0.049 |
| Poly-YOLO lite [5] | Polygon | 10.10 | 23.90 | - | - | 0.027 |
| CenterPoly [4] | Polygon | 15.54 | 39.49 | 23.33 | 24.45 | 0.045 |
| CenterDisks (Ours) | Shape set | 7.36 | 25.89 | 11.64 | 12.37 | 0.040 |

The results for our method on the test sets of Cityscapes, IDD and KITTI are presented in Table I, Table II and Table III. We included methods with real-time performance as well as slower but more precise methods.

For the Cityscapes dataset (Table I), the inference time for one image is about 0.040 s. It means that our method can be used in real-time settings. Our method reaches 7.36 for Average Precision and 25.89 for AP50%. Except for CenterPoly, our results for AP50% are competitive with other real-time methods. The evaluation metrics are very demanding, and a low score, especially for the AP metric, does not mean that results are not usable.
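As a quick conversion, a per-image inference time $t$ corresponds to a throughput of $1/t$ frames per second (fps):

$$\frac{1}{0.040\ \mathrm{s}} = 25\ \mathrm{fps}.$$

Whether this rate is sufficient depends on the frame rate required by the target application; the text does not state a numerical real-time criterion.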

As we can see in Figure 4, the best performance is obtained on the categories that are most present in the dataset, and that have the biggest surface on average. The cars, representing almost half of the objects in the Cityscapes dataset, are especially well segmented. The shape of the objects also matters greatly, since disks are not very flexible. Compact blocks (cars, buses, trucks, trains) are easier to approximate here, whereas the handlebars of a bike or the slender shape of pedestrians are harder to get right.

However, on the Indian Driving Dataset (Table II) and on the KITTI dataset (Table III), CenterDisks performs better than the previous real-time methods. We reach 20.30 in AP and 49.90 in AP50% for IDD, and for KITTI, we improve on the state-of-the-art method by 3 points for the AP metric and 10.5 points for the AP50% metric. These datasets contain fewer pedestrians, and KITTI contains no motorcycle, bicycle or rider. We have discussed previously that the most frequently represented and the largest objects were the best segmented objects, so this can explain the very good performance on these two datasets. Moreover, IDD is more diverse inside each category, and KITTI does not have a lot of training data. Our method thus seems to have a higher generalization capacity than previous methods.

Qualitative results for the KITTI test set and Cityscapes test set are shown in Figure 5 and Figure 6. The instances are well-segmented, and some details can be captured, for example the wheels or the leg separation of pedestrians when they are big enough. The predicted masks are round-shaped. This is inherent to representing the mask with a set of disks, which cannot create straight lines or sharp angles. Slower methods provide more accurate masks (Figure 5), especially for details, such as the tires.

TABLE II: Results on the IDD test set. * Results from the original IDD paper [37]. Boldface: Best results for real-time methods. Underline: Best results overall.

| Method | AP | AP50% | Time (s) |
|---|---|---|---|
| Mask-RCNN [1]* | 26.80 | 49.90 | ≃ 0.2 |
| PANET [2]* | 37.60 | 66.10 | > 1 |
| Poly-YOLO [5] | 11.50 | 26.70 | 0.049 |
| Poly-YOLO lite [5] | 10.10 | 23.90 | 0.027 |
| CenterPoly [4] | 14.40 | 36.90 | 0.045 |
| CenterDisks (Ours) | 20.30 | 49.90 | 0.032 |

TABLE III: Results on the KITTI test set. Boldface: Best results.

| Method | AP | AP50% | Time (s) |
|---|---|---|---|
| CenterPoly [4] | 8.73 | 26.74 | 0.045 |
| CenterDisks (Ours) | 11.75 | 37.24 | 0.033 |

V Discussion

V-A Ablation studies

We performed ablation studies on the number of disks used for the covering (Table IV). At first, adding more disks allows for a better covering, but the accuracy then reaches a plateau. The best value among the ones we tested is 16 disks. The runtime for this configuration also stays under the limit for real-time application.

To decide between a region-based loss and a distribution-based loss, we experimented with the binary cross-entropy loss [32] and the Dice loss [33]. The performance is much better using the region-based Dice loss function (Table IV). This can be partly attributed to the imbalance between objects of interest and background, which is better taken into account by this error function.

We also checked whether each disk should get its own radius. The more the radii are individualized, the more accurate the covering should be. But at the same time, a larger number of distinct prediction parameters can be harder to optimize. As we can see in Table V, the trend is not monotonic: a single shared radius (11.80 AP) outperforms 2 or 4 radii (10.24 and 11.59 AP), but the best configuration is the one where all disks have a different radius (12.22 AP), so it is the one we used in our other experiments.

TABLE IV: Results on the Cityscapes validation set. All radii are different. Boldface: Best results.

| N disks | N radii | Loss | AP | AP50% | Runtime (s) |
|---|---|---|---|---|---|
| 2 | 2 | Dice | 5.03 | 19.50 | 0.032 |
| 4 | 4 | Dice | 9.51 | 33.79 | 0.036 |
| 8 | 8 | Dice | 10.53 | 33.33 | 0.039 |
| 16 | 16 | Dice | 12.22 | 35.47 | 0.040 |
| 24 | 24 | Dice | 10.78 | 32.18 | 0.043 |
| 32 | 32 | Dice | 10.65 | 33.31 | 0.046 |
| 16 | 16 | BCE | 8.00 | 27.14 | 0.040 |
| 16 | 16 | Dice | 12.22 | 35.47 | 0.040 |

TABLE V: Results on the Cityscapes validation set. The loss function used for the training is the Dice loss. Boldface: Best results.

| N disks | N radii | AP | AP50% | Runtime (s) |
|---|---|---|---|---|
| 16 | 1 | 11.80 | 35.04 | 0.039 |
| 16 | 2 | 10.24 | 31.40 | 0.040 |
| 16 | 4 | 11.59 | 34.11 | 0.040 |
| 16 | 16 | 12.22 | 35.47 | 0.040 |

Figure 5, panels (a) to (h): Qualitative results on the KITTI test set. Comparison with state-of-the-art instance segmentation methods. From top to bottom: CenterDisks (our method), Mask-RCNN [1], Segment Anything [7], SparseInst [9]. We used the pre-trained models provided by the authors, without fine-tuning them on KITTI. Best viewed on a screen.

Figure 6, panels (a) to (f): Qualitative results on the Cityscapes test set. Colors correspond to semantic categories (blue - cars, yellow - person, red - bus, light pink - bicycle, orange - rider, violet - truck, bright pink - motorcycle). Best viewed on a screen.

The initial results of CenterDisks are somewhat round-shaped, which does not accurately represent the type of objects present in the datasets. Therefore, we studied the refinement of the masks during post-processing with the Douglas-Peucker algorithm (see Section III-C). With $\beta = 0.01$, the AP is improved by 1.1 points, and the AP50% is improved by 2.4 points (Table VI). We find that although this post-processing slightly improves the final accuracy, the improvement is not worth the additional inference time, which rises to about 1 s per image.

When using oracle predictions for the detection, the evaluation metrics reach 12.42 for the AP and 39.35 for the AP50%. The accuracy is thus improved by about 4 points for the AP50%. It means that a better detection head could improve the results slightly.

TABLE VI: Ablation study for the post-processing step. The reduction factor for the Douglas-Peucker algorithm is defined in Section III-C, and is proportional to $\beta$. Results on the Cityscapes validation set. We use the Dice loss, with 16 different radii. Boldface: Best results.

| N disks | $\beta$ | AP | AP50% | Runtime (s) |
|---|---|---|---|---|
| 16 | None | 12.22 | 35.47 | 0.040 |
| 16 | 0.001 | 13.03 | 37.52 | 0.925 |
| 16 | 0.01 | 13.36 | 37.85 | 0.923 |

V-B Limitations

The overall accuracy of our method does not reach the performance of the best existing instance segmentation methods. Due to the use of overlapping disks, it is almost impossible to get straight lines and sharp angles. This difficulty could only be overcome by using other shapes. Small objects are also hard to segment accurately, as we can see in the qualitative results (objects in the background in Figure 6) and when decomposing the results according to the category (Figure 4). Finally, our method also struggles with fine separation, for example the legs of pedestrians. Nonetheless, our proposed representation can theoretically achieve a better mask than the polygonal methods for objects with holes, or whose center of gravity is outside the mask.

VI Conclusion

In this paper, we propose a new paradigm for mask approximation using disk covering. A fixed number of disks with different radii represent the objects. The model is trained without the need for elaborate or custom ground-truths. During training, the disks are projected as two-dimensional Gaussian functions on the image. This allows a direct comparison to the binary masks. This method shows promising results with regard to the accuracy-speed compromise. It could be further improved by refining the representation, with ellipses or more complex shapes. Finally, this segmentation approach could be generalized to other domains and could benefit problems that deal more specifically with the coverage of complex surfaces.

References

  • [1] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html
  • [2] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path Aggregation Network for Instance Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/html/Liu_Path_Aggregation_Network_CVPR_2018_paper.html
  • [3] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo, “PolarMask: Single Shot Instance Segmentation With Polar Representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 193–12 202. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/html/Xie_PolarMask_Single_Shot_Instance_Segmentation_With_Polar_Representation_CVPR_2020_paper.html
  • [4] H. Perreault, G.-A. Bilodeau, N. Saunier, and M. Héritier, “CenterPoly: Real-Time Instance Segmentation Using Bounding Polygons,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2982–2991. [Online]. Available: https://openaccess.thecvf.com/content/ICCV2021W/AVVision/html/Perreault_CenterPoly_Real-Time_Instance_Segmentation_Using_Bounding_Polygons_ICCVW_2021_paper.html
  • [5] P. Hurtik, V. Molek, J. Hula, M. Vajgl, P. Vlasanek, and T. Nejezchleba, “Poly-YOLO: higher speed, more precise detection and instance segmentation for YOLOv3,” Neural Computing and Applications, vol. 34, no. 10, pp. 8275–8290, May 2022. [Online]. Available: https://doi.org/10.1007/s00521-021-05978-9
  • [6] S. Peng, W. Jiang, H. Pi, X. Li, H. Bao, and X. Zhou, “Deep Snake for Real-Time Instance Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Apr. 2020, arXiv:2001.01629 [cs] type: article. [Online]. Available: http://arxiv.longhoe.net/abs/2001.01629
  • [7] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment Anything,” Apr. 2023, arXiv:2304.02643 [cs]. [Online]. Available: http://arxiv.longhoe.net/abs/2304.02643
  • [8] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “SOLOv2: Dynamic and Fast Instance Segmentation,” in Advances in Neural Information Processing Systems, vol. 33.   Curran Associates, Inc., 2020, pp. 17 721–17 732. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/cd3afef9b8b89558cd56638c3631868a-Abstract.html
  • [9] T. Cheng, X. Wang, S. Chen, W. Zhang, Q. Zhang, C. Huang, Z. Zhang, and W. Liu, “Sparse Instance Activation for Real-Time Instance Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Mar. 2022. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2022/html/Cheng_Sparse_Instance_Activation_for_Real-Time_Instance_Segmentation_CVPR_2022_paper.html
  • [10] D. Mazzini and R. Schettini, “Spatial Sampling Network for Fast Scene Understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0. [Online]. Available: https://openaccess.thecvf.com/content_CVPRW_2019/html/WAD/Mazzini_Spatial_Sampling_Network_for_Fast_Scene_Understanding_CVPRW_2019_paper.html
  • [11] W. Xu, H. Wang, F. Qi, and C. Lu, “Explicit Shape Encoding for Real-Time Instance Segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5168–5177. [Online]. Available: https://openaccess.thecvf.com/content_ICCV_2019/html/Xu_Explicit_Shape_Encoding_for_Real-Time_Instance_Segmentation_ICCV_2019_paper.html
  • [12] J. Uhrig, E. Rehder, B. Fröhlich, U. Franke, and T. Brox, “Box2Pix: Single-Shot Instance Segmentation by Assigning Pixels to Object Boxes,” in 2018 IEEE Intelligent Vehicles Symposium (IV), Jun. 2018, pp. 292–299. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8500621
  • [13] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT: Real-time Instance Segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, arXiv: 1904.02689. [Online]. Available: http://arxiv.longhoe.net/abs/1904.02689
  • [14] A. Newell, K. Yang, and J. Deng, “Stacked Hourglass Networks for Human Pose Estimation,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., 2016, pp. 483–499. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-46484-8_29
  • [15] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler, “Annotating Object Instances With a Polygon-RNN,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5230–5238. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Castrejon_Annotating_Object_Instances_CVPR_2017_paper.html
  • [16] D. Acuna, H. Ling, A. Kar, and S. Fidler, “Efficient Interactive Annotation of Segmentation Datasets With Polygon-RNN++,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 859–868. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/html/Acuna_Efficient_Interactive_Annotation_CVPR_2018_paper.html
  • [17] K. Jodogne-Del Litto and G.-A. Bilodeau, “Real-time instance segmentation with polygons using an Intersection-over-Union loss,” May 2023, arXiv:2305.05490 [cs]. [Online]. Available: http://arxiv.longhoe.net/abs/2305.05490
  • [18] H. U. M. Riaz, N. Benbarka, and A. Zell, “FourierNet: Compact Mask Representation for Instance Segmentation Using Differentiable Shape Decoders,” in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 7833–7840, iSSN: 1051-4651.
  • [19] G. Bahl, L. Daniel, and F. Lafarge, “SCR: Smooth Contour Regression with Geometric Priors,” arXiv, Tech. Rep. arXiv:2202.03784, Feb. 2022, arXiv:2202.03784 [cs] type: article. [Online]. Available: http://arxiv.longhoe.net/abs/2202.03784
  • [20] Q.-L. Zhang and Y.-B. Yang, “A boundary-preserving conditional convolution network for instance segmentation,” Pattern Recognition Letters, vol. 163, pp. 1–9, Nov. 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167865522002665
  • [21] B. R. Kang, H. Lee, K. Park, H. Ryu, and H. Y. Kim, “BshapeNet: Object detection and instance segmentation with bounding shape masks,” Pattern Recognition Letters, vol. 131, pp. 449–455, Mar. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167865520300350
  • [22] R. M. Karp, “Reducibility among Combinatorial Problems,” in Complexity of Computer Computations: Proceedings of a symposium on the Complexity of Computer Computations, held March 20–22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, and sponsored by the Office of Naval Research, Mathematics Program, IBM World Trade Corporation, and the IBM Research Mathematical Sciences Department, ser. The IBM Research Symposia Series, R. E. Miller, J. W. Thatcher, and J. D. Bohlinger, Eds.   Boston, MA: Springer US, 1972, pp. 85–103. [Online]. Available: https://doi.org/10.1007/978-1-4684-2001-2_9
  • [23] E. H. Neville, “On the Solution of Numerical Functional Equations: Illustrated by an Account of a Popular Puzzle and of its Solution,” Proceedings of the London Mathematical Society, vol. s2_14, no. 1, pp. 308–326, Jan. 1915. [Online]. Available: https://doi.org/10.1112/plms/s2_14.1.308
  • [24] C. T. Zahn, Jr., “Black box maximization of circular coverage,” Journal of Research of the National Bureau of Standards. Section B. Mathematics and Mathematical Physics, vol. 66B, pp. 181–216, 1962. [Online]. Available: https://mathscinet.ams.org/mathscinet-getitem?mr=164285
  • [25] A. Salhieh, J. Weinmann, M. Kochhal, and L. Schwiebert, “Power efficient topologies for wireless sensor networks,” in International Conference on Parallel Processing, 2001., Sep. 2001, pp. 156–163, iSSN: 0190-3918.
  • [26] Y. Xu, J. Peng, W. Wang, and B. Zhu, “The connected disk covering problem,” Journal of Combinatorial Optimization, vol. 35, no. 2, pp. 538–554, Feb. 2018. [Online]. Available: https://doi.org/10.1007/s10878-017-0195-0
  • [27] E. Horster and R. Lienhart, “Approximating Optimal Visual Sensor Placement,” in 2006 IEEE International Conference on Multimedia and Expo, Jul. 2006, pp. 1257–1260, iSSN: 1945-788X.
  • [28] V. P. Munishwar and N. B. Abu-Ghazaleh, “Coverage algorithms for visual sensor networks,” ACM Transactions on Sensor Networks, vol. 9, no. 4, pp. 45:1–45:36, Jul. 2013. [Online]. Available: https://dl.acm.org/doi/10.1145/2489253.2489262
  • [29] J. K. Han, B. S. Park, Y. S. Choi, and H. K. Park, “Genetic approach with a new representation for base station placement in mobile communications,” in IEEE 54th Vehicular Technology Conference. VTC Fall 2001. Proceedings (Cat. No.01CH37211), vol. 4, Oct. 2001, pp. 2703–2707 vol.4. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/957251
  • [30] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as Points,” arXiv:1904.07850 [cs], Apr. 2019. [Online]. Available: http://arxiv.longhoe.net/abs/1904.07850
  • [31] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal Loss for Dense Object Detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/Lin_Focal_Loss_for_ICCV_2017_paper.html
  • [32] M. Yi-de, L. Qing, and Q. Zhi-bai, “Automated image segmentation using improved PCNN model based on cross-entropy,” in Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004., Oct. 2004, pp. 743–746.
  • [33] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV), Oct. 2016, pp. 565–571.
  • [34] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of the number of points required to represent a digitized line or its caricature,” Cartographica: The International Journal for Geographic Information and Geovisualization, vol. 10, no. 2, pp. 112–122, Dec. 1973, publisher: University of Toronto Press. [Online]. Available: https://www.utpjournals.press/doi/abs/10.3138/fm57-6770-u75u-7727
  • [35] S. Suzuki and K. be, “Topological structural analysis of digitized binary images by border following,” Computer Vision, Graphics, and Image Processing, vol. 30, no. 1, pp. 32–46, Apr. 1985. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0734189X85900167
  • [36] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 3213–3223. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/Cordts_The_Cityscapes_Dataset_CVPR_2016_paper.html
  • [37] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, and C. Jawahar, “IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan. 2019, pp. 1743–1751.
  • [38] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2012, pp. 3354–3361, iSSN: 1063-6919.
  • [39] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common Objects in Context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 2014, pp. 740–755. [Online]. Available: http://arxiv.longhoe.net/abs/1405.0312
  • [40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
  • [41] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in 3rd International Conference on Learning Representations, {ICLR} 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [Online]. Available: http://arxiv.longhoe.net/abs/1412.6980
  • [42] J. Liang, N. Homayounfar, W.-C. Ma, Y. Xiong, R. Hu, and R. Urtasun, “PolyTransform: Deep Polygon Transformer for Instance Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9131–9140. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/html/Liang_PolyTransform_Deep_Polygon_Transformer_for_Instance_Segmentation_CVPR_2020_paper.html