VIPriors 4: Visual Inductive Priors for
Data-Efficient Deep Learning Challenges

Robert-Jan Bruintjes, Attila Lengyel, Osman Semih Kayhan, Nergis Tomen, Hadi Jamali-Rad, and Jan van Gemert

R. Bruintjes, A. Lengyel, N. Tomen, H. Jamali-Rad and J. van Gemert are with Delft University of Technology. E-mail: [email protected]
O. S. Kayhan is with Bosch Security Systems B.V.
H. Jamali-Rad is with Shell.

Abstract

The fourth edition of the “VIPriors: Visual Inductive Priors for Data-Efficient Deep Learning” workshop features two data-impaired challenges. These challenges address the problem of training deep learning models for computer vision tasks with limited data. Participants are limited to training models from scratch using a low number of training samples and are not allowed to use any form of transfer learning. We aim to stimulate the development of novel approaches that incorporate inductive biases to improve the data efficiency of deep learning models. Significant advancements are made compared to the provided baselines, where winning solutions surpass the baselines by a considerable margin in both tasks. As in previous editions, these achievements are primarily attributed to heavy use of data augmentation policies and large model ensembles, though novel prior-based methods seem to contribute more to successful solutions compared to last year. This report highlights the key aspects of the challenges and their outcomes.

Index Terms:
Visual inductive priors, challenge, object detection, instance segmentation.

1 Introduction

Deep learning is increasingly powered by large training datasets. However, collecting such datasets is often costly. Progress in recent years has shown the successful application of large quantities of data to train comprehensive foundation models for vision and language [4], and to combine multiple modalities for weak supervision [39].

However, training with massive datasets still requires a significant amount of energy, contributing to carbon emissions. Furthermore, access to datasets and compute at such scale is limited to a few powerful deep learning behemoths. Additionally, such amounts of data may not be available for some domains. The Visual Inductive Priors for Data-Efficient Deep Learning workshop (VIPriors) therefore encourages research in learning from small datasets, by way of combining the learning power of deep learning with hard-won prior knowledge from specific domains.

The Visual Inductive Priors for Data-Efficient Deep Learning workshop has now been organized for the fourth year in a row [5, 27, 6], with the latest edition taking place at ICCV 2023 in Paris, France. The workshop features a paper track as well as challenges, where participants train computer vision models on small subsets of (publicly available) datasets, challenging them to find competitive solutions without the large quantities of data that power state-of-the-art deep computer vision models.

In this report, we present the outcomes of the fourth edition of the VIPriors challenges. This edition features an object detection challenge as well as an instance segmentation challenge. We discuss top-ranking solutions of both challenges, as well as the submission that receives the jury prize for introducing innovative methods. We observe that, even though some novel prior-based methods seem to contribute to success in several solutions, heavy use of model ensembling and data augmentation dominates successful solutions, like in previous editions.

2 Challenges

The workshop hosts two computer vision challenges in which the number of training samples is reduced to a small fraction of the full set:

Object detection: The DelftBikes [25] dataset is used for the object detection challenge. Each image contains 22 different bike parts that are annotated with bounding box, class and object state labels.

Instance segmentation: The main objective of this challenge is to segment basketball players and the ball on images recorded of a basketball court. The dataset is provided by SynergySports (https://synergysports.com) and contains a train, validation and test set of basketball games recorded at different courts with instance labels.

We provide a toolkit (https://github.com/VIPriors/vipriors-challenges-toolkit) with guidelines, baseline models and datasets for each challenge. The competitions are hosted on Codalab [37]. Each participating team submits their predictions computed over a test set of samples for which labels are withheld from competitors.

The challenges include certain rules to follow:

  • Models shall be trained from scratch with only the given dataset.

  • Using data other than the provided training data, for example by pretraining models on other data or by transfer learning, is prohibited. It is, however, allowed to train with synthetic data generated from the training data.

  • The participating teams are required to write a technical report about their methodology and experiments. We cite from these reports.

2.1 Object Detection

The object detection challenge uses the DelftBikes [25] dataset. The dataset includes 8,000 bike images for training and 2,000 images for testing (Fig. 1). Images contain 22 bike parts, labeled by class, bounding box and part state labels such as intact, missing, broken or occluded. The dataset contains varying object sizes, and contextual and location biases that can cause false positive detections [25, 24]. Note that some of the object boxes are noisy, which makes detecting object parts more challenging.

Figure 1: Some images from the DelftBikes dataset. Each image has a single bike with 22 labeled parts.

As a baseline detector we use the same model as in the 2021 and 2022 challenges [27, 6]: a Faster RCNN with a ResNet-50 FPN [40] backbone, trained from scratch for 16 epochs. This baseline is trained at the original image size without any data augmentation and attains a 25.8% AP score on the test set. Note that evaluation is done only on available parts, i.e. parts that are intact, damaged or occluded.

TABLE I: Final rankings of the Object Detection challenge. J indicates jury prize.
Ranking | Teams | % AP @ 0.50:0.95
1 | Jiawei Zhao, Xuede Li, Xingyue Chen, Junfeng Luo. Vision Intelligence Department (VID), Meituan. | 34.5
2 & J | Xiaoqiang Lu, Jiaxuan Zhao, Yuting Yang, Zhongjian Huang, Xu Liu, Fang Liu, Licheng Jiao. School of Artificial Intelligence, Xidian University. | 33.3
3 | Zhang **g, Qinliang Wang, Shizhan Zhao. School of Artificial Intelligence, Xidian University. | 30.6
4 | Zheng Wang, Dong Xie, Hanzhi Wang, Jiang Tian. AI Lab, Lenovo Research; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University; Department of Computer Science, Yale University. | 30.4
5 | Xinyu Sun, Xiaoyu Hao. Xidian University. | 29.4
6 | Team fha.ddd | 26.6

The detection challenge has six participating teams. The team from the Vision Intelligence Department of Meituan obtains first place with a 34.5% AP score. Two teams from Xidian University follow with 33.3% AP and 30.6% AP, respectively. The team from Xidian University led by Xiaoqiang Lu wins the jury prize. The final rankings are shown in Table I.
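To make the ranking metric concrete, the following minimal sketch shows how a single predicted box relates to the AP @ 0.50:0.95 metric: a prediction counts as a match at a given threshold only if its intersection-over-union (IoU) with the ground-truth box reaches that threshold, and AP is averaged over the ten thresholds 0.50, 0.55, ..., 0.95. The function names are illustrative and are not taken from the challenge toolkit; the full AP computation (ranking by confidence and integrating the precision-recall curve) is omitted.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou_thresholds() -> List[float]:
    """IoU thresholds 0.50, 0.55, ..., 0.95 over which AP is averaged."""
    return [(50 + 5 * i) / 100 for i in range(10)]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def thresholds_matched(pred: Box, gt: Box) -> List[float]:
    """Thresholds at which `pred` would count as a true positive for `gt`."""
    overlap = box_iou(pred, gt)
    return [t for t in iou_thresholds() if overlap >= t]


# A prediction with IoU 0.8 is a match at thresholds 0.50 through 0.80 only.
matched = thresholds_matched((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 8.0))
```

A loosely localized box therefore still earns credit at the low thresholds but none at the strict ones, which is why noisy box annotations make the averaged metric harder to improve.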

2.1.1 First place

Zhao et al. use a Cascade RCNN [8] with a ConvNeXtV2 backbone [53]. For data augmentation they use the Albumentations library [7] to apply Albu, PhotoMetricDistortion, MixUp [62], and Auto Augment V2 [15]. They pretrain their model on a synthetic dataset created from horizontal and vertical recombinations of binary pairs of samples (see Fig. 2). They finally apply SWA [22] and retrain the model on manually identified hard classes.

Figure 2: The first place solution of Zhao et al. recombines binary pairs of images horizontally and vertically to create a synthetic pre-training dataset. Figure adapted from technical report by Zhao et al. provided to competition organizers.
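The recombination idea can be sketched as follows. This is only an illustration under the assumption that recombining a pair means concatenating two full images side by side or on top of each other and shifting the bounding boxes of the second image accordingly; the report does not specify the exact procedure, and the function names are hypothetical. Images are plain nested lists to keep the sketch dependency-free.

```python
from typing import List, Tuple

Image = List[List[int]]          # rows of pixel values (grayscale for brevity)
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


def recombine_horizontal(img_a: Image, boxes_a: List[Box],
                         img_b: Image, boxes_b: List[Box]):
    """Place img_b to the right of img_a; both must have the same height."""
    if len(img_a) != len(img_b):
        raise ValueError("images must share the same height")
    width_a = len(img_a[0])
    image = [row_a + row_b for row_a, row_b in zip(img_a, img_b)]
    # Boxes of the right-hand image shift by the width of the left-hand image.
    shifted = [(x0 + width_a, y0, x1 + width_a, y1) for x0, y0, x1, y1 in boxes_b]
    return image, boxes_a + shifted


def recombine_vertical(img_a: Image, boxes_a: List[Box],
                       img_b: Image, boxes_b: List[Box]):
    """Place img_b below img_a; both must have the same width."""
    if len(img_a[0]) != len(img_b[0]):
        raise ValueError("images must share the same width")
    height_a = len(img_a)
    image = [list(row) for row in img_a] + [list(row) for row in img_b]
    # Boxes of the lower image shift by the height of the upper image.
    shifted = [(x0, y0 + height_a, x1, y1 + height_a) for x0, y0, x1, y1 in boxes_b]
    return image, boxes_a + shifted
```

Because the synthetic samples are built only from the provided training images, this kind of pre-training stays within the challenge rules.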

2.1.2 Second place and jury prize

Lu et al. use Scaled-YOLOv4 [47], YOLOv7 [49], YOLOR [50] and CBNetv2 [28] backbones. These backbones are trained using cross-validation and light data augmentation (random scaling, random flipping, color jitter), after which all model instances of the same backbone are combined using Model Soups [54]. The resulting models are used as pre-training weights in the final training scheme, where once again models are trained with cross-validation, as well as stronger data augmentation (Mosaic Augmentation [2], Copy-Paste [26], Mix-Up [62], Cutout [16]) and a custom method called Image Uncertainty Weighted, where images with less certain box scores are weighted more heavily in the loss function. After applying Model Soups to all model instances, Test-Time Augmentation [35] and Weighted Boxes Fusion [42] are applied to achieve the final model. The full pipeline is shown in Fig. 3.
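As an illustration of how Model Soups combines model instances of the same backbone, the sketch below averages the parameters of several models element-wise (a "uniform soup"). Whether Lu et al. use the uniform or the greedy variant is not stated in this report, so the choice here is an assumption, and weights are represented as plain dictionaries of flattened lists rather than framework tensors.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened parameter values


def uniform_soup(models: List[Weights]) -> Weights:
    """Average the parameters of models that share one architecture."""
    if not models:
        raise ValueError("need at least one model")
    soup: Weights = {}
    for name, first in models[0].items():
        total = [0.0] * len(first)
        for model in models:
            for i, value in enumerate(model[name]):
                total[i] += value
        soup[name] = [t / len(models) for t in total]
    return soup


# Two instances of the same (tiny, hypothetical) model from different folds.
fold_a = {"w": [1.0, 2.0]}
fold_b = {"w": [3.0, 6.0]}
souped = uniform_soup([fold_a, fold_b])
```

Unlike a prediction-level ensemble, the result is a single set of weights, so inference cost does not grow with the number of averaged models.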

The jury prize is awarded to Lu et al. for their proposed novel method Image Uncertainty Weighted, which addresses the dataset-specific prior knowledge that some classes are underrepresented in the data.
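The exact formulation of Image Uncertainty Weighted is not given in this report. Purely to illustrate the general idea of weighting images with less certain box scores more heavily, the sketch below uses a hypothetical weight of 1 + (1 - mean box confidence) per image and a weighted mean of per-image losses. Both the formula and the function names are assumptions, not the method of Lu et al.

```python
from typing import List


def image_uncertainty_weights(box_scores_per_image: List[List[float]]) -> List[float]:
    """Hypothetical weighting: the lower the mean box confidence, the higher the weight."""
    weights = []
    for scores in box_scores_per_image:
        # Images without any predicted box are treated as maximally uncertain.
        mean_score = sum(scores) / len(scores) if scores else 0.0
        weights.append(1.0 + (1.0 - mean_score))
    return weights


def weighted_loss(per_image_losses: List[float], weights: List[float]) -> float:
    """Weighted mean of per-image losses."""
    return sum(l * w for l, w in zip(per_image_losses, weights)) / sum(weights)


# One confidently predicted image and one uncertain image.
weights = image_uncertainty_weights([[1.0, 1.0], [0.5, 0.5]])
loss = weighted_loss([2.0, 4.0], weights)
```

With these assumed weights, the uncertain image contributes 1.5 times as much to the loss as the confident one.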

Figure 3: Overview of the training pipeline of the second place solution and jury prize winner of Lu et al.

2.1.3 Third place

**g et al. initially train five backbones on training size 1280px with a confidence threshold of 0.001: YOLOv7 [48], YOLOv8x-p2[1], YOLOv8x, YOLOv8x-p6 and Cascade RCNN [8], while employing Mosaic Augmentation [2], MixUp [62], Test-Time Augmentation [35] and horizontal flip testing. Finally, the models are combined using Weighted Boxes Fusion [42].
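Weighted Boxes Fusion, used here and in several other submissions, merges overlapping predictions from different models into one box instead of discarding all but the best one. The following is a simplified single-class sketch of the idea, not the reference implementation: the IoU threshold of 0.55 and the function names are assumptions, and per-model weights are left out.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (box, confidence)


def iou(a: Box, b: Box) -> float:
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse(cluster: List[Detection]) -> Detection:
    """Confidence-weighted average of coordinates, mean of confidences."""
    total = sum(score for _, score in cluster)
    box = tuple(sum(b[i] * s for b, s in cluster) / total for i in range(4))
    return box, total / len(cluster)


def weighted_boxes_fusion(per_model: List[List[Detection]],
                          iou_thr: float = 0.55) -> List[Detection]:
    detections = sorted(
        (det for model in per_model for det in model if det[1] > 0.0),
        key=lambda det: det[1], reverse=True)
    clusters: List[List[Detection]] = []
    for det in detections:
        # Attach the detection to the best-overlapping fused box, if any.
        best_cluster, best_iou = None, iou_thr
        for cluster in clusters:
            overlap = iou(fuse(cluster)[0], det[0])
            if overlap > best_iou:
                best_cluster, best_iou = cluster, overlap
        if best_cluster is None:
            clusters.append([det])
        else:
            best_cluster.append(det)
    n_models = len(per_model)
    fused = []
    for cluster in clusters:
        box, score = fuse(cluster)
        # Boxes confirmed by only a few models get their confidence reduced.
        fused.append((box, score * min(len(cluster), n_models) / n_models))
    return fused
```

In this sketch a box predicted by only one of two models keeps its coordinates but has its confidence halved, which reflects the intuition that agreement between models is evidence for a detection.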

2.2 Instance Segmentation

In the task of instance segmentation, one detects and segments instances of objects in an image. Instance segmentation is a popular and widely applicable computer vision problem, with applications ranging from autonomous driving, surveillance, remote sensing to sport analysis. Similarly to the 2021 and 2022 editions of this challenge [27, 6], our challenge is based on the basketball dataset provided by SynergySports [46], consisting of images recorded during various basketball games played on different courts. The goal is to detect and predict segmentation masks of all players and ball objects in the images. With a mere 184, 62, and 64 samples for the train, validation and test splits, respectively, the dataset is considered very small. The test labels are withheld from the challenge participants and final performance on the test set is evaluated on an online server. The main metric used is the Average Precision (AP) @ 0.50:0.95. As in last year’s challenge [6], our baseline method is based on the Detectron2 [56] implementation of Mask-RCNN [19].

Forty-six teams submitted solutions to the evaluation server, of which four teams submitted a report to qualify their submission for the challenge. Two teams from Xidian University obtain first and second place with 59.0% AP and 58.2% AP, respectively. The team from Sichuan University follows them with 55.2% AP. The team from National Cheng Kung University wins the jury prize. The final rankings are shown in Table II.

TABLE II: Final rankings of the Instance Segmentation challenge. J indicates jury prize.
Ranking | Teams | % AP @ 0.50:0.95
1 | Junpei Zhang, Kexin Zhang, Rui Peng, Licheng Jiao, Fang Liu, Lingling Li, Yuting Yang. Xidian University, Xi’an, Shaanxi. | 59.0
2 | Xiaoqiang Lu, Yuting Yang, Zhongjian Huang, Jiaxuan Zhao, Xu Liu, Fang Liu, Licheng Jiao. School of Artificial Intelligence, Xidian University. | 58.2
3 | Huijia Liang, ** Yang. Sichuan University. | 55.2
4 & J | Chih-Chung Hsu, Chia-Ming Lee, Ming-Shyen Wu. Institute of Data Science, National Cheng Kung University. | 50.9

2.2.1 First place

Zhang et al. achieve first place using a novel method called Orthogonal Uncertainty Representation (OUR), which ensures that the observed geometric manifold [34] of underrepresented classes is broadened during training. The model trained is a Mask RCNN [19] model with Swin [31], ResNet [20] and CBNet [30] backbones, which is trained with geometric [36], color space, sharpness, noise injection and Copy-Paste [26] data augmentations. The segmentation head is a Hybrid Task Cascade (HTC) [10] head. The loss function is a Seesaw loss [51]. Predicted boxes are fused using Weighted Boxes Fusion [42]. Models are ensembled using SWALP [59].

Incorporating the novel OUR method increases the performance of the trained model by 1.1% AP. Given that first and second place in the final standings are separated by only 0.8% AP (Table II), it is apparent that OUR contributes significantly to Zhang et al. achieving first place in the challenge.

2.2.2 Second place

Lu et al. achieve second place by building on last year’s entry by Yunusov et al. [60]. This model is an HTC [11] model with BEiTv2-L [38] with ViT-Adapter [13] and Internimage-XL [52] backbones. The loss function is a GIoU loss [41] and they use Soft NMS [3]. Lu et al. use extensive data augmentation: Mosaic Augmentation [2], Copy-Paste [26], Mix-Up [62], random brightness, random contrast, random saturation, random scale, random flip, sharpen and overlay, blur, Gaussian noise and grid-mask. An example data augmentation output is shown in Fig. 4. The model is trained with AdamW [33]. Two expert networks are used to refine mask output: SegFormer [57] and SeMask [23]. Models are ensembled using Model Soups [54]. The final model is trained on a combination of the provided training and validation sets.
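For reference, the GIoU loss mentioned above can be written down for a single pair of axis-aligned boxes as follows. This is a minimal sketch of the published definition (GIoU equals IoU minus the fraction of the smallest enclosing box that is not covered by the union, and the loss is 1 - GIoU), not the training code of Lu et al.; it assumes both boxes have positive area.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def giou_loss(pred: Box, target: Box) -> float:
    """Generalized IoU loss, 1 - GIoU, for boxes with positive area."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    # Smallest axis-aligned box that encloses both boxes.
    enclose = ((max(pred[2], target[2]) - min(pred[0], target[0]))
               * (max(pred[3], target[3]) - min(pred[1], target[1])))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou
```

Unlike a plain IoU loss, this loss keeps growing as two non-overlapping boxes move further apart, so it still distinguishes between near and far misses.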

Interestingly, Lu et al. note that “integrating other detectors’ prediction under different backbones” was necessary to achieve the final test set score of 58.2% AP. Other significant contributors to the final AP score are the data augmentation methods Copy-Paste and Mosaic, which together contribute 1.4% AP improvement to the final model.

Figure 4: Example data augmentation output of Lu et al. Figure adapted from technical report by Lu et al. provided to competition organizers.

2.2.3 Third place

Liang et al. achieve third place using an ensemble of three models: Mask2Former [14], YOLOX [18], and DETR [9], fused using Weighted Boxes Fusion [42]. Furthermore, Seesaw loss [51], SWA [61] and Soft NMS [3] are used in the model. A large set of data augmentations is applied during training: Copy-Paste [26], rotation, mirroring, cropping, scaling, random brightness, random saturation, random contrast, random color equality, sharpness, random noise, random erasure and local erasure.

Liang et al. show that a relatively simple set of networks and methods can achieve competitive performance when combined with extensive data augmentation. Furthermore, the Seesaw loss and SWA each contribute a 1.1% AP improvement.
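Soft NMS, used by both the second and third place solutions, lowers the scores of boxes that overlap a higher-scoring box instead of removing them outright. A minimal sketch of the Gaussian variant is given below; the values of `sigma` and `score_thr` are illustrative assumptions, not the settings used by the participants.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (box, confidence)


def iou(a: Box, b: Box) -> float:
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(detections: List[Detection], sigma: float = 0.5,
             score_thr: float = 0.001) -> List[Detection]:
    """Gaussian Soft NMS: decay scores by exp(-IoU^2 / sigma) instead of deleting."""
    remaining = list(detections)
    kept: List[Detection] = []
    while remaining:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best)
        kept.append((best_box, best_score))
        rescored = []
        for box, score in remaining:
            score *= math.exp(-(iou(best_box, box) ** 2) / sigma)
            if score >= score_thr:  # drop boxes whose score decayed to almost zero
                rescored.append((box, score))
        remaining = rescored
    return kept
```

In crowded scenes such as a basketball court, where players frequently occlude each other, this keeps heavily overlapping but distinct instances alive with a reduced score rather than suppressing them.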

2.2.4 Jury prize

Hsu et al. are awarded the jury prize as well as fourth place in the competition. The model used is an HTC [11] model with CB-SwinTransformer-Base [30, 31] backbone. The CB-FPN with Group Normalization [55] is used to “better capture from low to high-level feature representations”. Furthermore, the Mask Scoring head [21] replaces the default HTC mask head “to improve model performance on instance’s texture and boundary details”. Model Soups [54] is used to combine models, and SWA [61] is used to average model weights over multiple training epochs.

We award Hsu et al. the jury prize for demonstrating extensive use of (visual) prior knowledge for data-efficient deep learning. First of all, they propose Basketball Court Detection, which uses Canny-Hough line detectors [17] to find the basketball court and crop the image to contain only the field, as displayed in Fig. 5. In addition to GridMask augmentation [12], different augmentations are applied to different bounding boxes: RGB curve distortion is applied to those predicted to be “players” specifically to vary skin tone and jersey colors; salt-and-pepper noise and brightness variations are applied to all other boxes, including “referee”, “ball” etc. The locations of Copy-Paste augmentations are also changed based on prior information, though exactly how this is done is not clear from the report provided to the organizers. Hsu et al. also choose to implement prior knowledge to save on memory usage, by only inferring the model on the part of the court that is relevant to the task.
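The cropping step of Basketball Court Detection can be illustrated with a small geometric sketch. It assumes that line segments have already been detected (for example by a Canny edge detector followed by a Hough transform, which is not reproduced here), takes the convex hull of their endpoints as the court region, and crops to the rectangle enclosing that hull. The function names and the rectangular crop are assumptions for illustration and do not reproduce the implementation of Hsu et al.

```python
from typing import List, Tuple

Point = Tuple[int, int]
Segment = Tuple[Point, Point]  # a detected line given by its two endpoints


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns the hull vertices in order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> int:
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def court_crop(segments: List[Segment], width: int, height: int):
    """Rectangle (x_min, y_min, x_max, y_max) enclosing the hull, clipped to the image."""
    hull = convex_hull([p for segment in segments for p in segment])
    xs = [x for x, _ in hull]
    ys = [y for _, y in hull]
    return (max(0, min(xs)), max(0, min(ys)),
            min(width, max(xs)), min(height, max(ys)))


# Hypothetical detected court lines in a 100 x 100 image.
lines = [((10, 20), (90, 20)), ((10, 20), (5, 80)),
         ((90, 20), (95, 80)), ((50, 30), (50, 70))]
hull = convex_hull([p for segment in lines for p in segment])
crop = court_crop(lines, width=100, height=100)
```

The hull is computed as a separate step because it could also serve as a polygon mask for the court region; the rectangular crop itself only needs the extreme coordinates.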

The proposed prior-based augmentation pipeline improves AP by 3.7%. Unfortunately, the resulting model is too resource-intensive for the authors to apply SOTA training settings, which leads to a lacking score on the competition AP metric. However, though the final model by Hsu et al. is not a top performer in this competition, they do improve [email protected] by 14.6% over the 2021 competition winner [60] and by 5.9% over the 2022 competition winner [58], while using less memory and inference time than either competitor.

Figure 5: Basketball Court Detection method by Hsu et al. The top-left figure is the original image. The top-right one is cropped, with red lines detected by the Canny edge detector and Hough transform. The blue line shows a boundary based on image size, while the green lines indicate dynamic boundary from the detected lines. The bottom-left figure displays a region identified based on the maximum convex hull, which is determined using the endpoints of all lines detected by the Canny-Hough operator. The subclass attributes of the object are determined by its bounding box coordinates. In the bottom-right image, the object marked by a dotted line represents the result of location-based copy-paste augmentation. Figure and caption adapted from technical report by Hsu et al. provided to competition organizers.

TABLE III: Overview of challenge submissions. J indicates jury prize. Bold-faced methods are contributions by the competitors.
Rank | Team | Encoder architectures | Data augmentation | Methods | Main metric
Object detection
1 | Zhao et al. | Cascade RCNN [8], Swin T. [31], ConvNeXt [32], ConvNeXtV2 [53] | Albumentations [7], PhotoMetricDistortion, MixUp [62], Auto Augment V2 [15] | FPN [29], SWA [22], recombined synthetic dataset, retraining hard classes | 34.5
2 & J | Lu et al. | Scaled-YOLOv4 [47], YOLOv7 [49], YOLOR [50], CBNetv2 [28] | Pre-training: random scaling, random flipping, color jitter; fine-tuning: Mosaic Augmentation [2], Copy-Paste [26], Mix-Up [62], Cutout [16] | Model Soups [54], Weighted Boxes Fusion [42], Test-Time Augmentation [35], Image Uncertainty Weighted | 33.3
3 | **g et al. | YOLOv7 [48], YOLOv8x-p2 [1], YOLOv8x, YOLOv8x-p6, Cascade RCNN [8] | Mosaic Augmentation [2], MixUp [62], Test-Time Augmentation [35], horizontal flip testing | Weighted Boxes Fusion [42] | 30.6
4 | Wang et al. | YOLOv8 [1] | Mosaic Augmentation [2], MixUp [62], Test-Time Augmentation [35], horizontal flip testing | SparK [45] pre-training, Weighted Boxes Fusion [42] | 30.4
5 | Sun et al. | YOLOv8 [1] | HSV, rotation, translation, scaling, shearing, flipping, Mosaic Augmentation [2], MixUp [62], Copy-Paste [26] | GAM [63], cross-validation | 29.4
6 | Team fha.ddd | Faster RCNN [40], EfficientNet-V2 [44] | AutoAugment [15] | | 26.6
Instance segmentation
1 | Zhang et al. | Mask RCNN [19], Swin [31], ResNet [20], FPN [29], CBNet [30] | Geometric [36], color space, sharpness, noise injection, Copy-Paste [26] | Orthogonal Uncertainty Representation, Hybrid Task Cascade [10], Weighted Boxes Fusion [42], Seesaw loss [51], SWALP [59] | 59.0
2 | Lu et al. | HTC [11], BEiTv2-L [38], ViT-Adapter [13], Internimage [52] | Mosaic Augmentation [2], Copy-Paste [26], Mix-Up [62], random brightness, random contrast, random saturation, random scale, random flip, sharpen and overlay, blur, Gaussian noise, grid-mask | GIoU loss [41], Soft NMS [3], expert network with SegFormer [57] and SeMask [23], Test-Time Augmentation [35], Model Soups [54], random scaling by Yunusov et al. [60] | 58.2
3 | Liang et al. | Mask2Former [14], YOLOX [18], DETR [9] | Copy-Paste [26], rotation, mirroring, cropping, scaling, random brightness, random saturation, random contrast, random color equality, sharpness, random noise, random erasure, local erasure | Weighted Boxes Fusion [42], Seesaw loss [51], SWA [61], Soft NMS [3] | 55.2
4 & J | Hsu et al. | HTC [11], CB-SwinTransformer-Base [30, 31] | Players: RGB curve distortion; other objects: salt-and-pepper noise & brightness variations; GridMask [12] | Basketball Court Detection, GroupNorm [55], Mask Scoring R-CNN [21], SWA [61], Model Soups [54] | 50.9

3 Conclusion

Table III lists all qualifying entries for each challenge by architecture, data augmentation techniques and any other methods used.

As we organize the VIPriors challenges for the fourth time, we can spot patterns in what makes a successful challenge submission, both for this year and since the first edition. As in all previous editions, heavy use of model ensembling and data augmentation seems to be a recipe for success. Since last year’s edition, however, Model Soups [54] and SWA [61, 59] are used rather than standard ensembling. The chosen data augmentation methods have also seen some change, with Mosaic [2] (especially in combination with YOLOv7/YOLOv8) and Copy-Paste [26] gaining in popularity. As for the architectures used, there is still a task-dependent bias. YOLO is popular for object detection, even though the winning method uses a Cascade RCNN with ViT-based backbones. For instance segmentation, the picture is more blurred, with both RCNN-based and ViT-based architectures having success.

Encouragingly, we note that novel prior-based methods were more successful in this edition than in the previous edition. Throughout this and previous editions there is a slight signal to indicate that the engineering of prior-based methods does contribute to success in data-efficient deep learning, even though the effort required and the potential increase in inference time and/or memory usage (as in the submission by Hsu et al.) can make it challenging to deploy such methods. However, we must conclude that general deep learning approaches such as increasing model ensembles, heavy data augmentations and extensive tuning are still very effective for data-efficient deep learning. Perhaps we can add a slight metaphorical sweetener to the “bitter lesson” [43], but one has to try hard to taste it.

References

  • [1] GitHub - ultralytics/ultralytics. https://github.com/ultralytics/ultralytics, [Accessed 13-06-2024]
  • [2] Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: Optimal speed and accuracy of object detection (2020)
  • [3] Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-nms–improving object detection with one line of code. In: Proceedings of the IEEE international conference on computer vision. pp. 5561–5569 (2017)
  • [4] Bommasani, R., Hudson, D.A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M.S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N.S., Chen, A.S., Creel, K.A., Davis, J., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L.E., Goel, K., Goodman, N.D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D.E., Hong, J., Hsu, K., Huang, J., Icard, T.F., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P.W., Krass, M.S., Krishna, R., Kuditipudi, R., Kumar, A., Ladhak, F., Lee, M., Lee, T., Leskovec, J., Levent, I., Li, X.L., Li, X., Ma, T., Malik, A., Manning, C.D., Mirchandani, S., Mitchell, E., Munyikwa, Z., Nair, S., Narayan, A., Narayanan, D., Newman, B., Nie, A., Niebles, J.C., Nilforoshan, H., Nyarko, J.F., Ogut, G., Orr, L.J., Papadimitriou, I., Park, J.S., Piech, C., Portelance, E., Potts, C., Raghunathan, A., Reich, R., Ren, H., Rong, F., Roohani, Y.H., Ruiz, C., Ryan, J., Ré, C., Sadigh, D., Sagawa, S., Santhanam, K., Shih, A., Srinivasan, K.P., Tamkin, A., Taori, R., Thomas, A.W., Tramèr, F., Wang, R.E., Wang, W., Wu, B., Wu, J., Wu, Y., Xie, S.M., Yasunaga, M., You, J., Zaharia, M.A., Zhang, M., Zhang, T., Zhang, X., Zhang, Y., Zheng, L., Zhou, K., Liang, P.: On the opportunities and risks of foundation models. ArXiv abs/2108.07258 (2021)
  • [5] Bruintjes, R.J., Lengyel, A., Rios, M.B., Kayhan, O.S., van Gemert, J.: Vipriors 1: Visual inductive priors for data-efficient deep learning challenges. arXiv preprint arXiv:2103.03768 (2021)
  • [6] Bruintjes, R.J., Lengyel, A., Rios, M.B., Kayhan, O.S., Zambrano, D., Tomen, N., van Gemert, J.: Vipriors 3: Visual inductive priors for data-efficient deep learning challenges (2023)
  • [7] Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.: Albumentations: Fast and flexible image augmentations. Information 11(2) (2020). https://doi.org/10.3390/info11020125, https://www.mdpi.com/2078-2489/11/2/125
  • [8] Cai, Z., Vasconcelos, N.: Cascade r-cnn: Delving into high quality object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6154–6162 (2018)
  • [9] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
  • [10] Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
  • [11] Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al.: Hybrid task cascade for instance segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4974–4983 (2019)
  • [12] Chen, P., Liu, S., Zhao, H., Jia, J.: Gridmask data augmentation (2020)
  • [13] Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., Qiao, Y.: Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534 (2022)
  • [14] Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1290–1299 (2022)
  • [15] Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018)
  • [16] DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
  • [17] Duda, R.O., Hart, P.E.: Use of the hough transformation to detect lines and curves in pictures. Communications of the ACM 15(1), 11–15 (1972)
  • [18] Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021)
  • [19] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 2980–2988 (2017). https://doi.org/10.1109/ICCV.2017.322
  • [20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [21] Huang, Z., Huang, L., Gong, Y., Huang, C., Wang, X.: Mask scoring r-cnn. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6402–6411 (2019). https://doi.org/10.1109/CVPR.2019.00657
  • [22] Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D.P., Wilson, A.G.: Averaging weights leads to wider optima and better generalization. CoRR abs/1803.05407 (2018), http://arxiv.org/abs/1803.05407
  • [23] Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 752–761 (2023)
  • [24] Kayhan, O.S., van Gemert, J.C.: Evaluating context for deep object detectors. arXiv preprint arXiv:2205.02887 (2022)
  • [25] Kayhan, O.S., Vredebregt, B., van Gemert, J.C.: Hallucination in object detection–a study in visual part verification. arXiv preprint arXiv:2106.02523 (2021)
  • [26] Kisantal, M., Wojna, Z., Murawski, J., Naruniec, J., Cho, K.: Augmentation for small object detection. arXiv preprint arXiv:1902.07296 (2019)
  • [27] Lengyel, A., Bruintjes, R.J., Rios, M.B., Kayhan, O.S., Zambrano, D., Tomen, N., van Gemert, J.: Vipriors 2: visual inductive priors for data-efficient deep learning challenges. arXiv preprint arXiv:2201.08625 (2022)
  • [28] Liang, T., Chu, X., Liu, Y., Wang, Y., Tang, Z., Chu, W., Chen, J., Ling, H.: Cbnetv2: A composite backbone network architecture for object detection. arXiv preprint arXiv:2107.00420 (2021)
  • [29] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection (2017)
  • [30] Liu, Y., Wang, Y., Wang, S., Liang, T., Zhao, Q., Tang, Z., Ling, H.: Cbnet: A novel composite backbone network architecture for object detection. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34, pp. 11653–11660 (2020)
  • [31] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. International Conference on Computer Vision (ICCV) (2021)
  • [32] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
  • [33] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  • [34] Ma, Y., Jiao, L., Liu, F., Yang, S., Liu, X., Li, L.: Curvature-balanced feature manifold learning for long-tailed classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15824–15835 (2023)
  • [35] Moshkov, N., Mathe, B., Kertesz-Farkas, A., Hollandi, R., Horvath, P.: Test-time augmentation for deep learning-based cell segmentation on microscopy images. Scientific reports 10(1), 5068 (2020)
  • [36] Paschali, M., Simson, W., Roy, A.G., Göbl, R., Wachinger, C., Navab, N.: Manifold exploring data augmentation with geometric transformations for increased performance and robustness. In: Information Processing in Medical Imaging: 26th International Conference, IPMI 2019, Hong Kong, China, June 2–7, 2019, Proceedings 26. pp. 517–529. Springer (2019)
  • [37] Pavao, A., Guyon, I., Letournel, A.C., Tran, D.T., Baro, X., Escalante, H.J., Escalera, S., Thomas, T., Xu, Z.: Codalab competitions: An open source platform to organize scientific challenges. Journal of Machine Learning Research 24(198), 1–6 (2023), http://jmlr.org/papers/v24/21-1436.html
  • [38] Peng, Z., Dong, L., Bao, H., Ye, Q., Wei, F.: A unified view of masked image modeling. arXiv preprint arXiv:2210.10615 (2022)
  • [39] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (18–24 Jul 2021), https://proceedings.mlr.press/v139/radford21a.html
  • [40] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks (2016)
  • [41] Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 658–666 (2019)
  • [42] Solovyev, R., Wang, W., Gabruseva, T.: Weighted boxes fusion: Ensembling boxes from different object detection models. Image and Vision Computing 107, 104117 (2021)
  • [43] Sutton, R.: The bitter lesson. Incomplete Ideas (blog) 13(1), 38 (2019)
  • [44] Tan, M., Le, Q.: Efficientnetv2: Smaller models and faster training. In: International conference on machine learning. pp. 10096–10106. PMLR (2021)
  • [45] Tian, K., Jiang, Y., Diao, Q., Lin, C., Wang, L., Yuan, Z.: Designing bert for convolutional networks: Sparse and hierarchical masked modeling. arXiv preprint arXiv:2301.03580 (2023)
  • [46] Van Zandycke, G., Somers, V., Istasse, M., Don, C.D., Zambrano, D.: Deepsportradar-v1: Computer vision dataset for sports understanding with high quality annotations. In: Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports. pp. 1–8. MMSports ’22, Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3552437.3555699
  • [47] Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Scaled-yolov4: Scaling cross stage partial network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13029–13038 (2021)
  • [48] Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696 (2022)
  • [49] Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7464–7475 (2023)
  • [50] Wang, C.Y., Yeh, I.H., Liao, H.Y.M.: You only learn one representation: Unified network for multiple tasks. arXiv preprint arXiv:2105.04206 (2021)
  • [51] Wang, J., Zhang, W., Zang, Y., Cao, Y., Pang, J., Gong, T., Chen, K., Liu, Z., Loy, C.C., Lin, D.: Seesaw loss for long-tailed instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9695–9704 (June 2021)
  • [52] Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al.: Internimage: Exploring large-scale vision foundation models with deformable convolutions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14408–14419 (2023)
  • [53] Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16133–16142 (2023)
  • [54] Wortsman, M., Ilharco, G., Gadre, S.Y., Roelofs, R., Gontijo-Lopes, R., Morcos, A.S., Namkoong, H., Farhadi, A., Carmon, Y., Kornblith, S., et al.: Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In: International Conference on Machine Learning. pp. 23965–23998. PMLR (2022)
  • [55] Wu, Y., He, K.: Group normalization. In: ECCV (2018)
  • [56] Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://github.com/facebookresearch/detectron2 (2019)
  • [57] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems 34, 12077–12090 (2021)
  • [58] Yan, B., Zhao, X., Li, Y., Wang, H.: Task-specific data augmentation and inference processing for vipriors instance segmentation challenge. arXiv preprint arXiv:2211.11282 (2022)
  • [59] Yang, G., Zhang, T., Kirichenko, P., Bai, J., Wilson, A.G., De Sa, C.: Swalp: Stochastic weight averaging in low precision training. In: International Conference on Machine Learning. pp. 7015–7024. PMLR (2019)
  • [60] Yunusov, J., Rakhmatov, S., Namozov, A., Gaybulayev, A., Kim, T.H.: Instance segmentation challenge track technical report, vipriors workshop at iccv 2021: Task-specific copy-paste data augmentation method for instance segmentation (2021)
  • [61] Zhang, H., Wang, Y., Dayoub, F., Sünderhauf, N.: Swa object detection. arXiv preprint arXiv:2012.12645 (2020)
  • [62] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=r1Ddp1-Rb
  • [63] Zhou, S., Zhao, Y., Guo, D.: Yolov5-ge vehicle detection algorithm integrating global attention mechanism. 2022 3rd International Conference on Information Science, Parallel and Distributed Systems (ISPDS) pp. 439–444 (2022), https://api.semanticscholar.org/CorpusID:252112886