CenterDisks: Real-time instance segmentation with disk covering
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Institute for Data Valorization (IVADO) and the Apogée fund.

Katia Jodogne-del Litto
LITIV Lab., Polytechnique Montréal
Montréal, Canada
[email protected]

Guillaume-Alexandre Bilodeau
LITIV Lab., Polytechnique Montréal
Montréal, Canada
[email protected]

Abstract

Increasing the accuracy of instance segmentation methods is often done at the expense of speed. Using coarser representations, we can reduce the number of parameters and thus obtain real-time masks.

In this paper, we take inspiration from the set cover problem to predict mask approximations. Given ground-truth binary masks of objects of interest as training input, our method learns to predict the approximate coverage of these objects by disks without supervision on their location or radius. Each object is represented by a fixed number of disks with different radii. In the learning phase, we consider the radius as proportional to a standard deviation in order to compute the error to propagate on a set of two-dimensional Gaussian functions rather than disks.

We trained and tested our instance segmentation method on challenging datasets showing dense urban settings with various road users.

Our method achieves state-of-the-art results on the IDD and KITTI datasets with an inference time of 0.040 s on a single RTX 3090 GPU.

The code is available at:

https://github.com/KatiaJDL/CenterDisks

I Introduction

The development of intelligent vehicles is a major technological advance, which relies heavily on computer vision to observe and interpret the environment. The use cases are numerous, for light individual vehicles as well as public transport, and range from driving assistance mechanisms to autonomous driving. Deployment in urban areas requires the ability to detect road users accurately and in real time in dense scenes. However, the deep learning models used for image processing are generally quite resource intensive. One of the factors at play is their large number of parameters. Reducing their size both speeds them up and lowers their computing requirements. Smaller models can be used with less powerful and lighter hardware, for example in embedded systems. The climate crisis and its challenges also make it necessary to be aware of the resources consumed for training and production.

To localize road users precisely, detections with bounding boxes are not enough. We need to segment the image to get the pixels belonging to each object of interest. But predicting binary masks is quite slow, and the most precise instance segmentation methods do not reach real-time performance [1, 2]. One way to achieve a trade-off between speed and accuracy is to use mask approximations, with a lighter representation and therefore fewer parameters.

One way to simplify a binary mask is to use only its contours [3, 4, 5, 6], but a binary mask can also be approximated with few parameters by representing its interior directly. This can be viewed as a mask covering problem. We no longer characterize each pixel of the image; instead, we project shapes onto it. The simplest and least expensive shape, in terms of number of parameters, to represent a surface is a disk. Only three parameters are needed: the two coordinates of the center and the radius. Thus, we propose to approximate a complex binary mask with a set of disks (see Figure 1).

One of the main advantages of this representation is that it does not require direct supervision or a custom ground-truth. There is no need for a ground-truth representing the centers and radii of the sets of disks. The optimization is performed directly on the binary masks.

Our contributions are the following:

• We propose a new approach for mask approximation with disk covering;
• We show that a deep neural network can learn to cover a surface by positioning a set of disks;
• Our proposed method is real-time and achieves state-of-the-art results on the IDD and KITTI test sets.

II Related works

Real-time instance segmentation

Most traditional instance segmentation methods, like Mask R-CNN [1], PANET [2], or more recently SAM [7], cannot reach real-time performance. However, techniques have been developed to speed them up. First, we can mention sparse representations of object localization. In SOLOv2 [8], the images are divided into cells, and each cell corresponds to a mask representing the potential object whose center lies in this cell. SparseInst [9] also uses sparse feature maps to produce mask kernels but does not need to localize the objects by their center. Spatial Sampling Net [10] produces a non-uniform density map that follows the object distribution, and the masks are obtained with a diffusion process through a spatial sampling operator.

Furthermore, combining bounding boxes and segmentation masks can be an acceleration factor. ESE SEG [11] performs the bounding box prediction and the object segmentation at the same time, and Box2Pix [12] combines semantic segmentation and bounding boxes. Finally, real-time performance can be achieved with a simplified representation of masks, with a combination of mask prototypes [13], or with polygonal mask approximations [5, 4].

Instance segmentation with mask approximations

Except for YOLACT [13], which uses mask prototypes and predicted coefficients to combine them, most mask approximation methods consider only the boundary of the objects. A popular representation is the polygonal mask.

Polygon-RNN [15] and Polygon-RNN++ [16] use a recurrent neural network to determine the next vertex. These methods were meant for semi-automated annotation systems, and are very slow. PolarMask [3] fixes the angles of the vertices in a polar representation, which produces a star structure. Poly-YOLO [5] also uses a polar grid, with free angles and a dynamic number of vertices, chosen with a dedicated confidence score during the inference phase. CenterPoly [4] and CenterPolyV2 [17] simultaneously generate heatmaps for object detection and polygon vertices for each pixel. Nevertheless, most of these methods do not consider potential holes in the masks.

However, non-polygonal representations exist. ESE SEG [11] uses an approximation of object boundaries based on Chebyshev polynomials as mask approximation. FourierNet [18] and SCR [19] instead use Fourier series and add to the output a differentiable decoder that allows learning directly on the final shape of the masks. DeepSnake [6] directly generates the contours and then iteratively distorts them to get closer to the actual contours of the objects. BCondInst [20] and BshapeNet [21] use boundary predictions to improve their mask predictions.

Set cover problem

The mask approximation methods presented above rely on the detection of the object contour. One can also consider a representation based on the inside of the objects. Approximating a complex surface with a set of shapes can be seen as a sub-problem of the general set cover problem, which is NP-complete [22]. Depending on the evaluation criteria for the covering, there are several mathematical results to this computational geometry problem for simple configurations.

The problem of covering the unit disk uses an indirect approach by requiring the minimization of the common radius of a set of $n$ disks to cover the whole surface. This mathematical problem has been solved exactly for the following values of $n$ : 1, 2, 3, 4, 5, 7 [23], and approximate radius values have been obtained up to $n=10$ [24]. Here the location of the disks is secondary, and the radius is the value to be minimized. For less regular surfaces, under real-life conditions, this problem can be interpreted as the placement of radio antennas (or base station placement) [25]. There are approximation algorithms for convex polygons [26]. For more complex surfaces, the optimization algorithms range from integer linear programming [27] and greedy algorithms [28] to meta-heuristics such as genetic algorithms [29].

Thus, there is no exact algorithm to find the position of disks covering any surface by minimizing their radii. The existing algorithms use optimization techniques. In our case, the input is an image with multiple areas that should be covered as exactly as possible. It is not a known surface. Nevertheless, we postulate and demonstrate with practical results that it is possible to optimize the overlap by deep learning with the complete mask as the only supervision.

III Method

Our method is based on the object detector CenterNet [30], which locates objects by their center and regresses from it the coordinates of their bounding box. Instead of bounding boxes, we adapt the prediction heads to obtain a set of centers and radii for each detected object. The architecture of our method is shown in Figure 2. There are five prediction heads. The heatmap and offset heads come directly from CenterNet [30]. With one heatmap for each semantic category, we can extract the peaks, which correspond to the centers of the objects. The relative depth head is inspired by CenterPoly [4]. We added two heads for the prediction of centers and radii.
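To make the peak extraction step concrete, here is a minimal pure-Python sketch of how object centers could be read from one category heatmap. It assumes the CenterNet-style convention that a peak is a pixel whose score is not exceeded by any of its eight neighbours; the function name `heatmap_peaks` and the score threshold of 0.3 are hypothetical choices for illustration, not values taken from this paper.

```python
def heatmap_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for the local maxima of a 2D heatmap.

    A pixel is kept if its score is at least `threshold` (hypothetical
    value) and no 8-connected neighbour has a strictly higher score.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the scores of the neighbours inside the image.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    return peaks


# Toy heatmap with a single object center in the middle.
toy = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.2],
    [0.1, 0.2, 0.1],
]
peaks = heatmap_peaks(toy)
```

The dense outputs of the other heads would then be read only at the returned `(row, col)` positions, one set of disk parameters per detected object.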

III-A Gaussian projection

The covering is densely predicted: a set of $N$ disks is predicted for each pixel, and only the ones corresponding to a peak of the heatmap are kept. Therefore, a set of $N$ disks covers one object. That represents $2N$ coordinates for the centers of the disks $x_{1},y_{1},x_{2},y_{2},...,x_{N},y_{N}$ and $M\leq N$ size parameters $\sigma_{1},...,\sigma_{M}$ . These last values can be considered as the standard deviations of two-dimensional Gaussian functions.

Each center is matched with a standard deviation using an association function $\lambda$ . $N,M,\lambda$ are hyper-parameters of the model. If $z$ is the index of a center, $\lambda(z)$ is the index of the corresponding radius. For example, if the standard deviation is the same for all centers, we must fix $M=1$ and $\lambda:z\rightarrow 1$ . For each center to get its personalized standard deviation, $M=N$ and $\lambda:z\rightarrow z$ . We performed an ablation study to choose the proportion of size parameters compared to the number of centers $N$ .

The equation for the Gaussian function of center $i$ is

f_{i}(x,y)=e^{-\frac{(x-x_{i})^{2}+(y-y_{i})^{2}}{2\sigma_{\lambda(i)}^{2}}},

(1)

with which we can deduce the global probability map for one object:

f(x,y)=\sum_{i=1}^{N}e^{-\frac{(x-x_{i})^{2}+(y-y_{i})^{2}}{2\sigma_{\lambda(i)}^{2}}}.

(2)

The final result of the prediction head for the disks is a probability map.
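As an illustration of Equations (1) and (2), the following pure-Python sketch evaluates the sum of Gaussians on a small pixel grid. It is only a sketch under stated assumptions: the function name `gaussian_map` is hypothetical, indices are 0-based in the code whereas the text counts from 1, and the actual implementation presumably operates on batched tensors rather than nested lists.

```python
import math


def gaussian_map(centers, sigmas, assoc, height, width):
    """Sum of 2D Gaussians evaluated on a height x width grid.

    centers: list of N (x_i, y_i) pairs
    sigmas:  list of M standard deviations, with M <= N
    assoc:   association function (lambda in the text), mapping a
             center index to the index of its standard deviation
    """
    prob = [[0.0] * width for _ in range(height)]
    for i, (cx, cy) in enumerate(centers):
        sigma = sigmas[assoc(i)]
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                prob[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    return prob


# One shared size parameter: M = 1, every center maps to index 0.
shared = gaussian_map([(2, 2), (6, 2)], [1.5], lambda z: 0, 5, 9)
# One size parameter per center: M = N, identity association.
own = gaussian_map([(2, 2), (6, 2)], [1.0, 2.0], lambda z: z, 5, 9)
```

Because the Gaussians are summed, the value at a pixel covered by several of them can exceed 1 (for instance `own[2][2]` is about 1.14), so the raw map is not bounded by 1.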

III-B Training

Since the output is a sum of Gaussian functions, we have to apply a normalization function on these probability maps to reduce their range to $[0,1]$ . This allows us to use classic loss functions for segmentation tasks. We chose to use the hyperbolic tangent function.

No custom ground-truth (GT) is necessary. We can simply use the binary masks provided with the datasets. The loss function used for the training of the model is defined as follows:

Loss=W_{hm}Loss_{hm}+W_{offset}Loss_{offset}+W_{depth}Loss_{depth}+W_{disks}Loss_{disks},

(3)

where $W_{hm}$ , $W_{offset}$ , $W_{depth}$ are the respective weights of the loss for the prediction heads for heatmap, offset and relative depth, and $W_{disks}$ is the weight for the loss for the disk heads (head for disk centers and head for standard deviations). The loss for the heatmaps describing the center of objects is the focal loss [31], and $Loss_{offset}$ and $Loss_{depth}$ are L1 loss functions.

The loss function used to train the two disk heads should measure the difference between a GT binary mask and a probability map. We propose two possible loss functions: binary cross-entropy [32] and Dice loss [33]. The first is a local, distribution-based loss computed pixel by pixel, while the second is a regional loss that takes the whole image into account. For a pixel $k$ with coordinates $(x,y)$ , $z_{k}\in\{0,1\}$ is the GT value and $p_{k}\in[0,1]$ is the predicted value, obtained by normalizing $f(x,y)$ from Equation (2). With $K$ the number of pixels, the losses are defined by

L_{cross\_entropy}=-\frac{1}{K}\sum_{k}\left[z_{k}\log(p_{k})+(1-z_{k})\log(1-p_{k})\right]

(4)

and

L_{Dice}=1-\frac{2\sum_{k}z_{k}p_{k}}{\sum_{k}z_{k}+\sum_{k}p_{k}+\epsilon},

(5)

$\epsilon$ being a smoothing term. $Loss_{disks}$ is then one of those two.
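The two candidate losses of Equations (4) and (5), together with the hyperbolic tangent normalization, can be sketched on flattened pixel lists as follows. This is a minimal illustration rather than the training code: the function names, the clamping constant in the cross-entropy and the default smoothing value are assumptions introduced here.

```python
import math


def normalize(raw_values):
    """Squash non-negative sums of Gaussians into [0, 1) with tanh."""
    return [math.tanh(v) for v in raw_values]


def bce_loss(pred, target, clamp=1e-7):
    """Mean binary cross-entropy over all pixels (Equation 4).

    `clamp` (assumed value) keeps predictions away from 0 and 1 so
    that the logarithms stay finite.
    """
    total = 0.0
    for p, z in zip(pred, target):
        p = min(max(p, clamp), 1.0 - clamp)
        total += z * math.log(p) + (1 - z) * math.log(1.0 - p)
    return -total / len(pred)


def dice_loss(pred, target, eps=1.0):
    """Dice loss (Equation 5); `eps` is the smoothing term (assumed value)."""
    intersection = sum(p * z for p, z in zip(pred, target))
    return 1.0 - 2.0 * intersection / (sum(target) + sum(pred) + eps)


# Four pixels: the first two belong to the object.
target = [1, 1, 0, 0]
good = normalize([3.0, 2.0, 0.1, 0.0])  # high values on the object
bad = normalize([0.0, 0.1, 2.0, 3.0])   # high values on the background
```

On this toy example both losses are lower for `good` than for `bad`, which is the behaviour needed to push the Gaussians onto the object.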

III-C Inference

In the inference phase, we perform no normalization on the probability maps representing the objects. After projection, with two-dimensional Gaussian functions, the predicted binary mask is selected with a threshold $\alpha$ (Figure 3).
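The thresholding step, and the link it creates between a standard deviation and a disk radius, can be sketched as below. Setting a single Gaussian of Equation (1) equal to $\alpha$ gives $r=\sigma\sqrt{-2\ln\alpha}$, so an isolated Gaussian thresholded at $\alpha$ yields a disk whose radius is proportional to $\sigma$. The function names are hypothetical, and the use of a non-strict comparison at the threshold is an assumption.

```python
import math


def binarize(prob_map, alpha=0.5):
    """Keep the pixels whose un-normalized Gaussian sum reaches alpha."""
    return [[1 if v >= alpha else 0 for v in row] for row in prob_map]


def disk_radius(sigma, alpha=0.5):
    """Radius of the disk obtained by thresholding one isolated Gaussian.

    Solves exp(-r^2 / (2 sigma^2)) = alpha for r, with 0 < alpha < 1.
    """
    return sigma * math.sqrt(-2.0 * math.log(alpha))


mask = binarize([[0.2, 0.5, 1.3]])
radius = disk_radius(1.0, alpha=0.5)  # about 1.177 times sigma
```

When several Gaussians overlap, their values add up, so the region above the threshold is larger than the union of the individual disks given by `disk_radius`.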

We may perform contour smoothing with the Douglas-Peucker algorithm [34]. With simple computer graphics tools, we retrieve a polygon approximation of the contour from the binary mask [35] and perform a reduction of the number of vertices. The hyper-parameter $\beta$ , used as a maximum distance to the original outline, is proportional to the initial perimeter. This post-processing step can reduce the roundness of the mask, which is inherent to disks.
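The contour simplification can be sketched with a small recursive Douglas-Peucker routine whose tolerance is $\beta$ times the perimeter of the input polygon. This is an illustrative stand-alone version, not the post-processing code used for the experiments: contour extraction from the binary mask is not shown, the helper names are hypothetical, and the distance is measured to the chord segment, which is a common variant of the perpendicular distance.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment [a, b]."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify an open polyline, always keeping both endpoints."""
    if len(points) < 3:
        return list(points)
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # drop the duplicated split point


def perimeter(polygon):
    """Perimeter of a closed polygon given as a list of vertices."""
    n = len(polygon)
    return sum(
        math.hypot(polygon[(i + 1) % n][0] - polygon[i][0],
                   polygon[(i + 1) % n][1] - polygon[i][1])
        for i in range(n)
    )


def simplify_contour(polygon, beta):
    """Simplify a closed contour with tolerance beta * perimeter."""
    epsilon = beta * perimeter(polygon)
    closed = list(polygon) + [polygon[0]]
    return douglas_peucker(closed, epsilon)[:-1]


# A square with one slightly bulging extra vertex on its bottom edge.
contour = [(0, 0), (5, 0.1), (10, 0), (10, 10), (0, 10)]
simplified = simplify_contour(contour, beta=0.01)
```

With $\beta=0.01$ the tolerance is about 0.4 pixel here, so the bulging vertex is removed and the four corners remain; a much smaller $\beta$ keeps all five vertices.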

IV Experiments

IV-A Implementation

The focus of our study is the segmentation of road users in urban settings. We used the Cityscapes [36], IDD [37] and KITTI [38] datasets. We performed ablation studies on the Cityscapes validation set. The instance categories used correspond to road users: bicycle, bus, car, motorcycle, person, rider, train, truck. The data splits used for training, validation and testing are predefined for all datasets. The images of the Cityscapes dataset were recorded in different cities in Germany. It includes 5,000 images, with a standard resolution of $2048\times 1024$ . In the Indian Driving Dataset (IDD), there are approximately 10,000 images, with sizes varying from $1920\times 1080$ to $1280\times 964$ . Finally, the KITTI dataset includes 400 images for segmentation, with a size of $1280\times 384$ .

Our main evaluation metric is the Average Precision (AP) [39]. It is an aggregation of the average precision for different IoU thresholds, from 50% to 95%. The AP50%, for average precision with minimum IoU of 50%, and AP50m and AP100m, for objects within a range of 50m and 100m respectively, are also used. Moreover, for all these metrics, we can have access to the average over all categories or to the detailed results.
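The IoU criterion underlying these metrics can be sketched as follows. The code computes the IoU between two binary masks and counts at how many thresholds a prediction would be accepted as a match. The step of 0.05 between thresholds follows the usual COCO-style convention and is an assumption here, as are the function names; the full AP additionally ranks predictions by confidence and integrates a precision-recall curve, which this sketch does not do.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks (nested 0/1 lists)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p & t
            union += p | t
    return inter / union if union else 0.0


# IoU thresholds from 50% to 95% (step of 0.05 assumed).
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def matched_fraction(iou):
    """Fraction of thresholds at which a prediction counts as a match."""
    return sum(iou >= t for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


pred = [[1, 1, 0], [1, 1, 0]]
target = [[0, 1, 1], [0, 1, 1]]
iou = mask_iou(pred, target)  # 2 shared pixels out of 6 covered
```

A mask with an IoU of 0.72, for example, is accepted at the five thresholds from 0.50 to 0.70 and rejected at the others, which illustrates why the averaged AP is much more demanding than AP50%.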

We implemented our method with PyTorch [40]. We use the Hourglass network [14] with one stack as a backbone for all our experiments. The backbone, heatmap head, and offset head are pre-trained on COCO [39]. We first trained on Cityscapes and then fine-tuned our model for KITTI and IDD. For training, we used classical data augmentation techniques, with a resolution of $1024\times 512$ : color augmentation, random cropping, and flipping. The loss weights are $W_{hm}=1$ , $W_{disks}=1$ , $W_{depth}=0.1$ and $W_{offset}=0.1$ . For the disk heads we used the Dice loss.

The model was trained for 240 epochs with a batch size of 4 on a single RTX 3090 GPU with the Adam optimizer [41]. The starting learning rate was 2e-4, and we divided the learning rate by ten at epochs 90 and 120. The threshold for the inference was fixed to $\alpha=0.5$ . We use 16 disks with 16 different radii.
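The step schedule described above can be written as a small function. This is a sketch only: the function name is hypothetical, and whether the drop takes effect at the start or at the end of epochs 90 and 120 is an assumption (here, from the start of the milestone epoch, with epochs counted from 0).

```python
def learning_rate(epoch, base_lr=2e-4, milestones=(90, 120), gamma=0.1):
    """Learning rate at a given epoch for a step schedule.

    The rate is multiplied by `gamma` once for every milestone that
    has been reached.
    """
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma
    return lr


schedule = [learning_rate(e) for e in (0, 89, 90, 119, 120, 239)]
```

Under these assumptions the rate is 2e-4 for epochs 0 to 89, 2e-5 for epochs 90 to 119, and 2e-6 for the remaining epochs.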

IV-B Results

TABLE I: Results on the Cityscapes test set. If the runtimes were not explicitly stated in the original paper, they are estimated based on our knowledge of the method. Results are taken from the original papers or public online benchmarks, unless stated otherwise. Boldface: Best results for real-time methods. Underline: Best results overall.

Method	Mask type	AP ↑	AP50% ↑	AP100m ↑	AP50m ↑	Runtime (s) ↓
Mask-RCNN [1]	Full	26.22	49.89	37.63	40.11	$\simeq$ 0.2
PANET [2]	Full	31.80	57.10	44.20	46.00	$>$ 1
PolyTransform [42]	Polygon	40.10	65.90	54.80	58.00	$>$ 1
DeepSnake [6]	Outline	31.70	58.40	43.20	44.70	4.6
Polygon-RNN++ [16]	Polygon	25.50	45.50	39.30	43.40	$>$ 1
Spatial Sampling Net [10]	Full	9.20	16.80	16.4	21.4	0.009
Box2Pix [12]	Full	13.10	27.20	-	-	0.092
Poly-YOLO [5]	Polygon	11.50	26.70	-	-	0.049
Poly-YOLO lite [5]	Polygon	10.10	23.90	-	-	0.027
CenterPoly [4]	Polygon	15.54	39.49	23.33	24.45	0.045
CenterDisks (Ours)	Shape set	7.36	25.89	11.64	12.37	0.040

The results for our method on the test sets of Cityscapes, IDD and KITTI are presented in Tables I, II and III. We included methods with real-time performance as well as slower but more precise methods.

For the Cityscapes dataset (Table I), the inference time for one image is about 0.040 s, which means that our method can be used in real-time settings. Our method reaches 7.36 for Average Precision and 25.89 for AP50%. Except for CenterPoly, our results for AP50% are competitive with other real-time methods. The evaluation metrics are very demanding, and a low score, especially for the AP metric, does not mean that the results are not usable.

As we can see in Figure 4, the best performance is obtained on the categories that are most present in the dataset and that have the largest surface on average. The cars, representing almost half of the objects in the Cityscapes dataset, are especially well segmented. The shape of the objects is also very important, since disks are not very flexible. Compact blocks (cars, buses, trucks, trains) are easier to approximate here, whereas the handlebars of a bike or the slender shape of pedestrians are harder to get right.

However, on the Indian Driving Dataset (Table II) and on the KITTI dataset (Table III), CenterDisks performs better than the previous real-time methods. We reach 20.30 in AP and 49.90 in AP50% for IDD, and for KITTI, we improve on the state-of-the-art method by 3 points for the AP metric and 10.5 points for the AP50% metric. These datasets contain fewer pedestrians, and KITTI contains no motorcycle, bicycle or rider. We discussed previously that the most frequently represented and the largest objects are the best segmented, which can explain the very good performance on these two datasets. Moreover, IDD is more diverse inside each category, and KITTI does not have a lot of training data. Our method thus seems to have a higher generalization capacity than previous methods.

Qualitative results for the KITTI test set and Cityscapes test set are shown in Figure 5 and Figure 6. The instances are well segmented, and some details can be captured, for example the wheels or the leg separation of pedestrians when they are big enough. The predicted masks are round-shaped. This is inherent to representing a mask with a set of disks, which cannot create straight lines or sharp angles. Slower methods provide more accurate masks (Figure 5), especially for details such as the tires.

TABLE II: Results on the IDD test set. * Results from the original IDD paper [37]. Boldface: Best results for real-time methods. Underline: Best results overall.

Method	AP	AP50%	Time (s)
Mask-RCNN [1]*	26.80	49.90	$\simeq$ 0.2
PANET [2]*	37.60	66.10	$>$ 1
Poly-YOLO [5]	11.50	26.70	0.049
Poly-YOLO lite [5]	10.10	23.90	0.027
CenterPoly [4]	14.40	36.90	0.045
CenterDisks (Ours)	20.30	49.90	0.032

TABLE III: Results on the KITTI test set. Boldface: Best results.

Method	AP	AP50%	Time (s)
CenterPoly [4]	8.73	26.74	0.045
CenterDisks (Ours)	11.75	37.24	0.033

V Discussion

V-A Ablation studies

We performed ablation studies on the number of disks used for the covering (Table IV). At first, adding more disks allows for a better covering, but the accuracy then reaches a plateau. The best value among the ones we tested is 16 disks. The runtime for this configuration also stays under the limit for real-time applications.

To decide between a region-based loss and a distribution-based loss, we experimented with the binary cross-entropy loss [32] and the Dice loss [33]. The performance is much better using the region-based Dice loss (Table IV). This can be partly attributed to the imbalance between objects of interest and background, which is better taken into account by this error function.

We also checked whether each disk should get its own personalized radius. The more the radii are individualized, the more accurate the covering should be. But at the same time, it can be harder to optimize a larger number of prediction parameters. As we can see in Table V, fully individualized radii give the best covering, although the trend is not monotonic: a single shared radius (11.80 AP) outperforms 2 or 4 radii. The best configuration is the one where all disks have a different radius, so it is the one we used in our other experiments.

TABLE IV: Results on the Cityscapes validation set. All radii are different. Boldface: Best results.

N disks	N radii	Loss	AP	AP50%	Runtime (s)
2	2	Dice	5.03	19.50	0.032
4	4	Dice	9.51	33.79	0.036
8	8	Dice	10.53	33.33	0.039
16	16	Dice	12.22	35.47	0.040
24	24	Dice	10.78	32.18	0.043
32	32	Dice	10.65	33.31	0.046
16	16	BCE	8.00	27.14	0.040
16	16	Dice	12.22	35.47	0.040

TABLE V: Results on the Cityscapes validation set. The loss function used for the training is the Dice loss. Boldface: Best results.

N disks	N radii	AP	AP50%	Runtime (s)
16	1	11.80	35.04	0.039
16	2	10.24	31.40	0.040
16	4	11.59	34.11	0.040
16	16	12.22	35.47	0.040

The initial results of CenterDisks are somewhat round-shaped, which does not accurately represent the type of objects present in the datasets. Therefore, we studied the refinement of the masks during post-processing with the Douglas-Peucker algorithm (see section III-C). With $\beta=0.01$ , the AP is improved by 1.1 points, and the AP50% is improved by 2.4 points (Table VI). We find that although this post-processing slightly improves the final accuracy, the improvement is not worth the additional inference time, which rises to nearly 1 s per image.

When using oracle prediction for the detection, the evaluation metrics reach the results of 12.42 for the AP and 39.35 for AP50%. The accuracy is thus improved by 4 points for the AP50%. It means that getting a better detection head could improve the results slightly.

TABLE VI: Ablation study for the post-processing step. The reduction factor for the Douglas-Peucker algorithm is defined in section III-C, and is proportional to $\beta$ . Results on the Cityscapes validation set. We use the Dice loss, with 16 different radii. Boldface: Best results.

N disks	$\beta$	AP	AP50%	Runtime (s)
16	None	12.22	35.47	0.040
16	0.001	13.03	37.52	0.925
16	0.01	13.36	37.85	0.923

V-B Limitations

The overall accuracy of our method does not reach the performance of the best existing instance segmentation methods. Due to the use of overlapping disks, it is almost impossible to get straight lines and sharp angles. This difficulty could only be overcome by using other shapes. Small objects are also hard to segment accurately, as we can see in the qualitative results (objects in the background in Figure 6) and when decomposing the results according to the category (Figure 4). Finally, our method also struggles with fine separation, for example the legs of pedestrians. Nonetheless, our proposed representation can theoretically achieve a better mask than the polygonal methods for objects with holes, or whose center of gravity is outside the mask.

VI Conclusion

In this paper, we propose a new paradigm for mask approximation using disk covering. A fixed number of disks with different radii represents each object. The model is trained without the need for elaborate or custom ground-truths. During training, the disks are projected as two-dimensional Gaussian functions on the image, which allows a direct comparison with the binary masks. This method shows promising results with regard to the accuracy-speed compromise. It could be further improved by refining the representation, with ellipses or more complex shapes. Finally, this segmentation approach could be generalized to other domains and could benefit problems concerning more specifically the coverage of complex surfaces.

References

[1] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/He_Mask_R-CNN_ICCV_2017_paper.html
[2] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path Aggregation Network for Instance Segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/html/Liu_Path_Aggregation_Network_CVPR_2018_paper.html
[3] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo, “PolarMask: Single Shot Instance Segmentation With Polar Representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 193–12 202. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/html/Xie_PolarMask_Single_Shot_Instance_Segmentation_With_Polar_Representation_CVPR_2020_paper.html
[4] H. Perreault, G.-A. Bilodeau, N. Saunier, and M. Héritier, “CenterPoly: Real-Time Instance Segmentation Using Bounding Polygons,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2982–2991. [Online]. Available: https://openaccess.thecvf.com/content/ICCV2021W/AVVision/html/Perreault_CenterPoly_Real-Time_Instance_Segmentation_Using_Bounding_Polygons_ICCVW_2021_paper.html
[5] P. Hurtik, V. Molek, J. Hula, M. Vajgl, P. Vlasanek, and T. Nejezchleba, “Poly-YOLO: higher speed, more precise detection and instance segmentation for YOLOv3,” Neural Computing and Applications, vol. 34, no. 10, pp. 8275–8290, May 2022. [Online]. Available: https://doi.org/10.1007/s00521-021-05978-9
[6] S. Peng, W. Jiang, H. Pi, X. Li, H. Bao, and X. Zhou, “Deep Snake for Real-Time Instance Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Apr. 2020, arXiv:2001.01629. [Online]. Available: https://arxiv.org/abs/2001.01629
[7] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár, and R. Girshick, “Segment Anything,” Apr. 2023, arXiv:2304.02643. [Online]. Available: https://arxiv.org/abs/2304.02643
[8] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “SOLOv2: Dynamic and Fast Instance Segmentation,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 17 721–17 732. [Online]. Available: https://proceedings.neurips.cc/paper/2020/hash/cd3afef9b8b89558cd56638c3631868a-Abstract.html
[9] T. Cheng, X. Wang, S. Chen, W. Zhang, Q. Zhang, C. Huang, Z. Zhang, and W. Liu, “Sparse Instance Activation for Real-Time Instance Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Mar. 2022. [Online]. Available: https://openaccess.thecvf.com/content/CVPR2022/html/Cheng_Sparse_Instance_Activation_for_Real-Time_Instance_Segmentation_CVPR_2022_paper.html
[10] D. Mazzini and R. Schettini, “Spatial Sampling Network for Fast Scene Understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0. [Online]. Available: https://openaccess.thecvf.com/content_CVPRW_2019/html/WAD/Mazzini_Spatial_Sampling_Network_for_Fast_Scene_Understanding_CVPRW_2019_paper.html
[11] W. Xu, H. Wang, F. Qi, and C. Lu, “Explicit Shape Encoding for Real-Time Instance Segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5168–5177. [Online]. Available: https://openaccess.thecvf.com/content_ICCV_2019/html/Xu_Explicit_Shape_Encoding_for_Real-Time_Instance_Segmentation_ICCV_2019_paper.html
[12] J. Uhrig, E. Rehder, B. Fröhlich, U. Franke, and T. Brox, “Box2Pix: Single-Shot Instance Segmentation by Assigning Pixels to Object Boxes,” in 2018 IEEE Intelligent Vehicles Symposium (IV), Jun. 2018, pp. 292–299. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8500621
[13] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT: Real-time Instance Segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2019, arXiv:1904.02689. [Online]. Available: https://arxiv.org/abs/1904.02689
[14] A. Newell, K. Yang, and J. Deng, “Stacked Hourglass Networks for Human Pose Estimation,” in Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., 2016, pp. 483–499. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-319-46484-8_29
[15] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler, “Annotating Object Instances With a Polygon-RNN,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5230–5238. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2017/html/Castrejon_Annotating_Object_Instances_CVPR_2017_paper.html
[16] D. Acuna, H. Ling, A. Kar, and S. Fidler, “Efficient Interactive Annotation of Segmentation Datasets With Polygon-RNN++,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 859–868. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/html/Acuna_Efficient_Interactive_Annotation_CVPR_2018_paper.html
[17] K. Jodogne-Del Litto and G.-A. Bilodeau, “Real-time instance segmentation with polygons using an Intersection-over-Union loss,” May 2023, arXiv:2305.05490. [Online]. Available: https://arxiv.org/abs/2305.05490
[18] H. U. M. Riaz, N. Benbarka, and A. Zell, “FourierNet: Compact Mask Representation for Instance Segmentation Using Differentiable Shape Decoders,” in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 7833–7840.
[19] G. Bahl, L. Daniel, and F. Lafarge, “SCR: Smooth Contour Regression with Geometric Priors,” Feb. 2022, arXiv:2202.03784. [Online]. Available: https://arxiv.org/abs/2202.03784
[20] Q.-L. Zhang and Y.-B. Yang, “A boundary-preserving conditional convolution network for instance segmentation,” Pattern Recognition Letters, vol. 163, pp. 1–9, Nov. 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167865522002665
[21] B. R. Kang, H. Lee, K. Park, H. Ryu, and H. Y. Kim, “BshapeNet: Object detection and instance segmentation with bounding shape masks,” Pattern Recognition Letters, vol. 131, pp. 449–455, Mar. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167865520300350
[22] R. M. Karp, “Reducibility among Combinatorial Problems,” in Complexity of Computer Computations: Proceedings of a symposium on the Complexity of Computer Computations, held March 20–22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, and sponsored by the Office of Naval Research, Mathematics Program, IBM World Trade Corporation, and the IBM Research Mathematical Sciences Department, ser. The IBM Research Symposia Series, R. E. Miller, J. W. Thatcher, and J. D. Bohlinger, Eds. Boston, MA: Springer US, 1972, pp. 85–103. [Online]. Available: https://doi.org/10.1007/978-1-4684-2001-2_9
[23] E. H. Neville, “On the Solution of Numerical Functional Equations: Illustrated by an Account of a Popular Puzzle and of its Solution,” Proceedings of the London Mathematical Society, vol. s2_14, no. 1, pp. 308–326, Jan. 1915. [Online]. Available: https://doi.org/10.1112/plms/s2_14.1.308
[24] C. T. Zahn, Jr., “Black box maximization of circular coverage,” Journal of Research of the National Bureau of Standards. Section B. Mathematics and Mathematical Physics, vol. 66B, pp. 181–216, 1962. [Online]. Available: https://mathscinet.ams.org/mathscinet-getitem?mr=164285
[25] A. Salhieh, J. Weinmann, M. Kochhal, and L. Schwiebert, “Power efficient topologies for wireless sensor networks,” in International Conference on Parallel Processing, 2001., Sep. 2001, pp. 156–163, iSSN: 0190-3918.
[26] Y. Xu, J. Peng, W. Wang, and B. Zhu, “The connected disk covering problem,” Journal of Combinatorial Optimization, vol. 35, no. 2, pp. 538–554, Feb. 2018. [Online]. Available: https://doi.org/10.1007/s10878-017-0195-0
[27] E. Horster and R. Lienhart, “Approximating Optimal Visual Sensor Placement,” in 2006 IEEE International Conference on Multimedia and Expo, Jul. 2006, pp. 1257–1260, ISSN: 1945-788X.
[28] V. P. Munishwar and N. B. Abu-Ghazaleh, “Coverage algorithms for visual sensor networks,” ACM Transactions on Sensor Networks, vol. 9, no. 4, pp. 45:1–45:36, Jul. 2013. [Online]. Available: https://dl.acm.org/doi/10.1145/2489253.2489262
[29] J. K. Han, B. S. Park, Y. S. Choi, and H. K. Park, “Genetic approach with a new representation for base station placement in mobile communications,” in IEEE 54th Vehicular Technology Conference. VTC Fall 2001. Proceedings (Cat. No.01CH37211), vol. 4, Oct. 2001, pp. 2703–2707. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/957251
[30] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as Points,” arXiv:1904.07850 [cs], Apr. 2019. [Online]. Available: http://arxiv.org/abs/1904.07850
[31] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss for Dense Object Detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988. [Online]. Available: https://openaccess.thecvf.com/content_iccv_2017/html/Lin_Focal_Loss_for_ICCV_2017_paper.html
[32] M. Yi-de, L. Qing, and Q. Zhi-bai, “Automated image segmentation using improved PCNN model based on cross-entropy,” in Proceedings of 2004 International Symposium on Intelligent Multimedia, Video and Speech Processing, 2004, Oct. 2004, pp. 743–746.
[33] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” in 2016 Fourth International Conference on 3D Vision (3DV), Oct. 2016, pp. 565–571.
[34] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of the number of points required to represent a digitized line or its caricature,” Cartographica: The International Journal for Geographic Information and Geovisualization, vol. 10, no. 2, pp. 112–122, Dec. 1973, publisher: University of Toronto Press. [Online]. Available: https://www.utpjournals.press/doi/abs/10.3138/fm57-6770-u75u-7727
[35] S. Suzuki and K. Abe, “Topological structural analysis of digitized binary images by border following,” Computer Vision, Graphics, and Image Processing, vol. 30, no. 1, pp. 32–46, Apr. 1985. [Online]. Available: https://www.sciencedirect.com/science/article/pii/0734189X85900167
[36] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 3213–3223. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/Cordts_The_Cityscapes_Dataset_CVPR_2016_paper.html
[37] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, and C. Jawahar, “IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan. 2019, pp. 1743–1751.
[38] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2012, pp. 3354–3361, ISSN: 1063-6919.
[39] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common Objects in Context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, 2014, pp. 740–755. [Online]. Available: http://arxiv.org/abs/1405.0312
[40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
[41] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[42] J. Liang, N. Homayounfar, W.-C. Ma, Y. Xiong, R. Hu, and R. Urtasun, “PolyTransform: Deep Polygon Transformer for Instance Segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9131–9140. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/html/Liang_PolyTransform_Deep_Polygon_Transformer_for_Instance_Segmentation_CVPR_2020_paper.html
