Comparing fine-grained and coarse-grained object detection for ecology

Jess Tam
University of New South Wales, Sydney
   Justin Kay
Massachusetts Institute of Technology
Abstract

Computer vision applications are increasingly popular for wildlife monitoring tasks. While some studies focus on monitoring a single species, such as a particular endangered species, others monitor larger functional groups, such as predators. In our study, we used camera trap images collected in north-western New South Wales, Australia, to investigate how model results were affected by combining multiple species into single classes, and whether the addition of negative samples can improve model performance. We found that the species that benefited most from merging into a single class were mainly those that look alike morphologically, i.e. macropods, whereas species that look distinctly different gave mixed results when merged, e.g. merging pigs and goats as non-native large mammals. We also found that adding negative samples improved model performance marginally in most instances, and we recommend a more comprehensive study to explore whether the marginal gains were random or consistent. We suggest that practitioners could classify morphologically similar species together as a functional group or higher taxonomic group to draw ecological inferences. Nevertheless, whether to merge classes will depend on the ecological question to be explored.

1 Introduction

There are many avenues to monitor ecosystem health, including estimating the population density of wildlife, monitoring behaviour, and assessing community diversity [6]. Traditionally, population density is determined by methods such as capture-mark-recapture [17]. Recently, motion-triggered camera traps have enabled mass data collection while reducing the manual labour involved in capture-mark-recapture, thus encouraging ecologists to collect data remotely [4].

Analysing camera trap data is often labour-intensive because of the vast amount of imagery captured by the cameras. An appealing option is to use computer vision to automate some or all of this processing. A number of recent developments have made this increasingly practical for ecologists, including highly efficient and accurate object detection networks that can be used to localize and classify wildlife in imagery [19, 15, 21]. In particular, the MegaDetector [2], a pre-trained YOLOv5-based model for wildlife detection, has made ecological computer vision processing widely accessible. The MegaDetector classifies images into coarse categories of 'animal', 'vehicle', 'person', and 'empty' [2]. This is a useful first step in data processing. However, in order to make ecological inferences, ecologists must build additional species-level models.

Fine-grained classification is a known challenge in computer vision [22, 24, 26, 23]. Unfortunately, many classification problems in ecology are fine-grained, with many species exhibiting highly similar physical characteristics. On the other hand, this level of granularity is not always necessary for ecological analysis. For instance, ecologists may be interested in the behaviour of functional groups of species that share similar ecosystem functions, where the differentiation of species within groups may be of less concern.

In this paper we investigate the challenges of training computer vision models for wildlife detection and classification from an ecological perspective. In particular, we ask: (1) How does the level of granularity of species recognition models affect ecological analysis? We construct a dataset to study this question (Sec. 3) by comparing model performance on both fine-grained (i.e. species-level) and coarse-grained (i.e. functional group-level) categories. (2) How do negative samples—data that do not contain target objects, which are plentiful in camera traps due to false triggers—affect model accuracy?

2 Related works

2.1 Species recognition from camera trap images with YOLO

Fine-grained classification of wildlife can refer to the classification of individual species, whereas coarse-grained classification can refer to that of larger groups of species, such as functional groups, genus-level or family-level groups. In the past decade, there has been an increasing trend in using neural networks for image processing in ecology [3, 9]. Multiple studies have applied various iterations of YOLO to detect species of wildlife from Africa [8, 16], Europe [16], Asia [20], and Australia [10, 5], in some instances performing as well as two-stage models such as Faster R-CNN.

However, Zurita et al. [27] noted that model performance was not ideal when trying to detect and classify two similar Amazonian pig species: white-lipped peccary (Tayassu pecari) and collared peccary (Dicotyles tajacu). Further, Schneider et al. [16] demonstrated that accuracy was especially low for species with insufficient images (0% for some species).

2.2 Modelling based on functional groups

Species-specific models are common in ecological studies because different species occupy different spatial, temporal, and ecological niches. These models are useful not only when studying the distribution of endangered species, but are also suitable where there may be bias towards patterns created by common species [18].

Functional groups are groups of species that share similar ecosystem functions, such as diet, geographic niche, and life histories. Modelling based on functional or taxonomic groups in ecosystems is advantageous as some wildlife management strategies are based on larger groups, such as kangaroo harvesting [12] and predator management [13].

Grouping species into coarser categories can also be particularly useful where data for a particular species is insufficient. Thus, incorporating information from another species with similar ecosystem functions could improve the predictability of their ecology and movement patterns.

3 Methods

3.1 Wild Deserts dataset

The Wild Deserts project is a reintroduction project based in Sturt National Park, near the north-western border of New South Wales, Australia. The project aims to reintroduce regionally extinct small mammals, such as the greater bilby (Macrotis lagotis) and the Shark Bay bandicoot (Perameles bougainville), back onto the landscape and monitor their activities where native and non-native predators are present. These predators include dingoes, cats, and foxes, all of which play important roles in influencing the behaviour of other wildlife in the same ecosystem. Camera traps were placed on paths that wildlife were known to use, in order to capture more images. The dataset consisted of 15000 images with 15 classes of wildlife collected from 30 different cameras around the study site.

3.2 Data labelling

Ecologists from the Wild Deserts project selected images where wildlife were visible and provided class labels in a spreadsheet. We used MegaDetector 5.0 [2], with the confidence parameter set to 0.9, to filter out non-animal images and to draw bounding boxes around the individual animals. Afterwards, we imported the MegaDetector results into Label Studio to conduct a quality check on the bounding boxes, while matching class labels to each box.
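The confidence filtering step could look roughly like the sketch below. It assumes the detections are available in MegaDetector's batch-output JSON layout (an `images` list whose entries carry `detections` with `category`, `conf`, and `bbox` fields, with category `"1"` meaning animal); the function name and the example records are hypothetical and are not taken from our pipeline.

```python
ANIMAL_CATEGORY = "1"  # assumed MegaDetector category id for 'animal'


def keep_confident_animals(md_results, threshold=0.9):
    """Return {file: [bbox, ...]} for animal detections at or above threshold."""
    kept = {}
    for image in md_results["images"]:
        boxes = [
            det["bbox"]
            # 'detections' may be missing or None for images that failed to load
            for det in image.get("detections") or []
            if det["category"] == ANIMAL_CATEGORY and det["conf"] >= threshold
        ]
        if boxes:
            kept[image["file"]] = boxes
    return kept


# Hypothetical results for three images
example = {
    "images": [
        {"file": "cam01/img_0001.jpg", "detections": [
            {"category": "1", "conf": 0.97, "bbox": [0.10, 0.20, 0.30, 0.40]},
            {"category": "1", "conf": 0.42, "bbox": [0.50, 0.50, 0.10, 0.10]},
        ]},
        {"file": "cam01/img_0002.jpg", "detections": []},
        {"file": "cam02/img_0001.jpg", "detections": [
            {"category": "2", "conf": 0.99, "bbox": [0.0, 0.0, 0.5, 0.5]},
        ]},
    ]
}
confident = keep_confident_animals(example)
```

Only the first image survives here: its low-confidence box is dropped, the second image is empty, and the third contains a non-animal detection.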

The raw data was very imbalanced due to the natural distribution of species, a common issue in ecological datasets: the most prominent class (Red kangaroo) had almost 700 times more instances than the least prominent class (Lizard). We therefore removed around 8000 Red kangaroo images, but otherwise left the original class distribution intact. We also removed images where the species label could not be ascertained. After cleaning, the final dataset had 6140 images in total. We split the images temporally with a 70:15:15 ratio to prevent the same sequences from being distributed across the training, validation, and testing sets.
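One possible way to implement such a temporal split is sketched below. It is an illustration under stated assumptions rather than our exact procedure: every image is assumed to carry a capture timestamp, images are ordered chronologically, and a cut point is pushed later whenever it would fall inside a burst of images taken less than `max_gap` apart. The function name, the 60-second gap, and the example records are all hypothetical.

```python
from datetime import datetime, timedelta


def temporal_split(records, ratios=(0.70, 0.15, 0.15), max_gap=timedelta(seconds=60)):
    """Split (filename, timestamp) records chronologically into train/val/test.

    A cut point is moved later while the images on either side of it are
    closer together than max_gap, so a burst of images stays in one subset.
    """
    ordered = sorted(records, key=lambda r: r[1])
    n = len(ordered)

    def adjust(cut):
        while 0 < cut < n and ordered[cut][1] - ordered[cut - 1][1] < max_gap:
            cut += 1
        return cut

    first = adjust(round(n * ratios[0]))
    second = adjust(max(first, round(n * (ratios[0] + ratios[1]))))
    return ordered[:first], ordered[first:second], ordered[second:]


# Hypothetical example: 20 images taken one hour apart
start = datetime(2023, 1, 1)
records = [(f"img_{i:03d}.jpg", start + timedelta(hours=i)) for i in range(20)]
train, val, test = temporal_split(records)
```

Because the subsets are contiguous in time, every training image precedes every validation image, which in turn precedes every test image.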

The fine-grained models consisted of 14 classes of wildlife, and the coarse-grained models had 9 classes (Fig. 1). The groups in the coarse-grained models were formed by grouping certain species that had similar ecological functions. For instance, cats (Felis catus) and foxes (Vulpes vulpes) were grouped together as non-native predators since they are both predators that prey on small wildlife (e.g. small mammals, lizards, and birds), and were both introduced to Australia.

To test whether empty background images could improve model performance, we added images containing no wildlife by extracting around 20 images from each of the 30 cameras, totalling around 700 images (approximately 10% of the dataset). We purposefully chose images that had different lighting and weather conditions.
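We selected the background images by hand. For readers who prefer an automated alternative, the sketch below shows one way to draw a fixed number of empty images per camera while cycling through condition tags so that no single lighting or weather type dominates. The tuple layout, the condition tags, and the function name are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_negatives(empty_images, per_camera=20, seed=0):
    """Pick up to per_camera empty images per camera, cycling through conditions.

    empty_images: iterable of (filename, camera_id, condition) tuples, where
    condition is a hypothetical tag such as 'day_sunny' or 'night'.
    """
    rng = random.Random(seed)
    by_camera = defaultdict(lambda: defaultdict(list))
    for filename, camera, condition in empty_images:
        by_camera[camera][condition].append(filename)

    selected = []
    for camera in sorted(by_camera):
        # One shuffled pool of filenames per condition
        pools = [rng.sample(files, len(files))
                 for _, files in sorted(by_camera[camera].items())]
        picked = []
        # Round-robin over conditions until the quota is met or pools run dry
        while len(picked) < per_camera and any(pools):
            for pool in pools:
                if pool and len(picked) < per_camera:
                    picked.append(pool.pop())
        selected.extend(picked)
    return selected


# Hypothetical pool: 3 cameras x 3 conditions x 10 empty images each
empties = [
    (f"cam{c:02d}_{cond}_{i:02d}.jpg", f"cam{c:02d}", cond)
    for c in range(3)
    for cond in ("day_sunny", "day_cloudy", "night")
    for i in range(10)
]
negatives = sample_negatives(empties, per_camera=20)
```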

Figure 1: The distribution of the Wild Deserts dataset and the number of instances of each species.

3.3 Metrics

We report Mean Average Precision (mAP) at IoU=0.5 (AP50) for all experiments. For the fine-grained model, in addition to reporting AP50 for the set of fine-grained classes that the model was trained with, we also evaluate the “coarse-grained” AP50 by grouping fine-grained predictions according to our set of coarse-grained classes, and then evaluating these against the coarse-grained ground truth using pycocotools [25].
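The regrouping step can be illustrated with a short sketch. The group names and the species assigned to them follow the groups named in this paper, but the mapping is partial, the exact label strings are assumptions, and the dictionary layout of a detection is hypothetical. The sketch only relabels predictions and ground truth; the relabelled records would then be handed to an evaluator such as pycocotools, which also matches boxes by IoU.

```python
# Partial mapping; classes not listed keep their fine-grained name
FINE_TO_COARSE = {
    "Red kangaroo": "Native large mammals",
    "Western grey kangaroo": "Native large mammals",
    "Euro": "Native large mammals",
    "Pig": "Non-native large mammals",
    "Goat": "Non-native large mammals",
    "Cat": "Non-native predators",
    "Fox": "Non-native predators",
}


def to_coarse(detections, mapping=FINE_TO_COARSE):
    """Return copies of the detections with labels replaced by their group."""
    return [{**det, "label": mapping.get(det["label"], det["label"])}
            for det in detections]


# Hypothetical fine-grained output: the first image shows a Euro that the
# model labelled as a Red kangaroo
predictions = [
    {"image": "a.jpg", "label": "Red kangaroo", "score": 0.91},
    {"image": "b.jpg", "label": "Goat", "score": 0.88},
    {"image": "c.jpg", "label": "Lizard", "score": 0.55},
]
ground_truth = [
    {"image": "a.jpg", "label": "Euro"},
    {"image": "b.jpg", "label": "Goat"},
    {"image": "c.jpg", "label": "Lizard"},
]

coarse_pred = to_coarse(predictions)
coarse_gt = to_coarse(ground_truth)

# Label agreement only (boxes are ignored in this toy comparison)
fine_matches = sum(p["label"] == g["label"] for p, g in zip(predictions, ground_truth))
coarse_matches = sum(p["label"] == g["label"] for p, g in zip(coarse_pred, coarse_gt))
```

In this toy case the Euro mislabelled as Red kangaroo counts as an error at the species level but as a correct Native large mammals prediction after regrouping.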

3.4 Training and testing

The models in this project were run with Python 3.11.4 [14], CUDA 12.1 [11], and ultralytics 8.1.9 [7]. All of the models were trained with YOLOv8 small, with the addition of an early stopping policy under which training was stopped once 5 epochs had passed without improvement.
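The early stopping rule can be sketched in a few lines. Ultralytics exposes this behaviour through its `patience` training argument; the function below is not that implementation but a minimal stand-alone illustration, and it assumes that the monitored quantity is a per-epoch validation score where higher is better. The function name and the score sequence are hypothetical.

```python
def early_stopping_epoch(val_scores, patience=5):
    """Return (epochs_run, best_epoch) for a sequence of per-epoch scores.

    Training stops once `patience` epochs have passed since the last
    strict improvement in the score.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch + 1, best_epoch
    return len(val_scores), best_epoch


# Hypothetical validation scores: the best value is reached at epoch index 2
scores = [0.50, 0.60, 0.65, 0.64, 0.65, 0.63, 0.64, 0.62, 0.61, 0.60]
epochs_run, best_epoch = early_stopping_epoch(scores, patience=5)
```

With these scores, training halts after 8 epochs because epochs 3 to 7 bring no improvement over epoch 2.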

4 Results

We found that, on average, coarse-grained models performed slightly better than fine-grained models. Comparing the class-wise difference, coarse-grained models improved the mAP of some classes that had lower scores, while retaining most of the performance of other classes from the fine-grained models (Fig. 2).

Figure 2: Mean average precision (mAP) of models (a) without and (b) with negative samples. Columns represent (i) mAP of the fine-grained classes from the fine-grained model calculated by YOLOv8, (ii) coarse-grained mAP extracted from the fine-grained model with pycocotools, and (iii) coarse-grained mAP from the coarse-grained model calculated by YOLOv8.

The individual classes that gained the most performance from the coarse-grained models were Red kangaroo, Western grey kangaroo, and Euro (Fig. 3). The mAP of the three macropod classes improved from 0.897, 0.555, and 0.097 respectively to 0.968 as Native large mammals without negative samples, and from 0.922, 0.540, and 0.000 to 0.969 with negative samples. Euro benefited the most, as only 17% of its test images were correctly classified prior to the merge (Fig. 4). Around one-third of Euro’s test set was misidentified as Red kangaroo and another one-third as other species.

Figure 3: Examples of macropod species from the Wild Deserts dataset. From the left: Red kangaroo (Osphranter rufus), Western grey kangaroo (Macropus fuliginosus), and Euro (Osphranter robustus)
Figure 4: Normalised confusion matrix of the fine-grained model before adding negative samples.

The classes whose mAP was lowered by merging were Pig and Goat, which went from 0.897 and 0.781 respectively to 0.699 as Non-native large mammals before the addition of negative samples. After the addition, their scores of 0.988 and 0.470 averaged out to 0.736 when merged.

On closer inspection, we observed mixed results when comparing the mAP of the two fine-grained models: adding negative samples to the training set improved performance for some classes and worsened it for others. For example, Pig detection improved by almost 0.1 mAP (from 0.897 to 0.988), while the detection of Goats dropped by about 40% in relative terms (from 0.781 to 0.470).
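Because one of these changes is quoted in absolute mAP and the other as a percentage, the short calculation below makes both forms explicit for the Pig and Goat scores reported above. The helper name is hypothetical.

```python
def changes(before, after):
    """Return (absolute change, relative change) rounded to three decimals."""
    absolute = after - before
    relative = absolute / before
    return round(absolute, 3), round(relative, 3)


pig = changes(0.897, 0.988)   # fine-grained Pig mAP without vs with negatives
goat = changes(0.781, 0.470)  # fine-grained Goat mAP without vs with negatives
```

Pig gains 0.091 mAP (about 10% relative), whereas Goat loses 0.311 mAP, which is roughly 40% of its original score.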

On the other hand, the improvements between the two coarse-grained models were marginal. The biggest difference we could observe was the improvement in non-native large mammals with the addition of negative samples, which increased by 0.037 mAP.

5 Discussion

Overall, we found that coarse-grained models performed better than fine-grained models, especially for species with similar physical features, i.e. Red kangaroo, Western grey kangaroo, and Euro, whereas species that were not morphologically similar benefited less from being merged. Adding negative samples also improved model performance to a certain extent.

5.1 When to merge classes

Although fine-grained models are useful in some contexts, distinguishing between morphologically similar species is difficult. The species-level mAP of the macropods was lower than their combined functional group-level mAP for the same model, highlighting that telling these species apart was a challenging task for the model. Some of the macropods’ distinguishing features are fur colour, relative body size, and ear size. However, these characteristics can be difficult to observe in camera trap images, especially when the images are harder to interpret due to factors such as lighting conditions and obstructed views [1]. Leaving them as separate classes could increase false positive detections, especially if one class has significantly more training data than the others. In contrast, species that look distinctly different and have sufficient data could be left as single classes.

5.2 The inclusion of negative samples

We included camera trap images without wildlife in the training set, but found mixed results. While we purposefully chose images from a variety of lighting situations, it may be that certain species only emerged during a specific time of the day, especially during the daytime, and that there were insufficient background images to represent those situations. Day images show more diverse ranges of colours and colour temperatures depending on the time of day and weather. For example, images tend to be warmer when the weather is sunny, but cooler when cloudy or rainy. In contrast, night images are exclusively in black and white as they are captured by the camera’s infrared technology.

5.3 Other factors affecting model performance

In some cases, class groupings were not enough to improve the accuracy of rare classes. For instance, the Lizard class, which was not part of any larger functional group, had a score of 0 mAP in all models. This can be explained by the fact that there were very few observations to begin with (13 images in total). Following the same logic, the performance of Western grey kangaroo and Euro as single classes could potentially be improved by collecting more images, which could decrease false positives.

6 Conclusion

While most existing studies have focused on classifying individual species from camera traps, we investigated the differences between fine-grained and coarse-grained recognition of wildlife. We suggest that species with similar physical features be grouped together, and that species that look more distinctive need not be combined with others, provided there are sufficient images. Whether to use fine-grained or coarse-grained models depends highly on the ecological question at hand. Further, we provided modest insight that the inclusion of negative samples could influence results, although a more comprehensive study will be required to investigate the threshold of empty images that could significantly benefit detection results.

Acknowledgements

This work was made possible with the support of the Resnick Sustainability Institute and Computer Vision for Ecology summer workshop held at the California Institute of Technology. We also thank the ecologists from the UNSW Wild Deserts project for providing the labelled camera trap images.

References

  • Beery et al. [2018] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • Beery et al. [2019] Sara Beery, Dan Morris, and Siyu Yang. Efficient pipeline for camera trap image review. arXiv preprint arXiv:1907.06772, 2019.
  • Christin et al. [2019] Sylvain Christin, Éric Hervet, and Nicolas Lecomte. Applications for deep learning in ecology. Methods in Ecology and Evolution, 10(10):1632–1644, 2019.
  • Delisle et al. [2021] Zackary J Delisle, Elizabeth A Flaherty, Mackenzie R Nobbe, Cole M Wzientek, and Robert K Swihart. Next-generation camera trapping: systematic review of historic trends suggests keys to expanded research applications in ecology and conservation. Frontiers in Ecology and Evolution, 9:617996, 2021.
  • Falzon et al. [2019] Greg Falzon, Christopher Lawson, Ka-Wai Cheung, Karl Vernes, Guy A Ballard, Peter JS Fleming, Alistair S Glen, Heath Milne, Atalya Mather-Zardain, and Paul D Meek. Classifyme: a field-scouting software for the identification of wildlife in camera trap images. Animals, 10(1):58, 2019.
  • Glover-Kapfer et al. [2019] P Glover-Kapfer, CA Soto-Navarro, and OR Wearn. Camera-trapping version 3.0: current constraints and future priorities for development. Remote Sensing in Ecology and Conservation, 5(3):209–223, 2019.
  • Jocher et al. [2023] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. Ultralytics yolov8, 2023.
  • Leonid et al. [2023] T Thomas Leonid, Harish Kanna, Claudia Christy VJ, AS Hamritha, and Chebolu Lokesh. Human wildlife conflict mitigation using yolo algorithm. In 2023 Eighth international conference on science technology engineering and mathematics (ICONSTEM), pages 1–7. IEEE, 2023.
  • Nakagawa et al. [2023] Shinichi Nakagawa, Malgorzata Lagisz, Roxane Francis, Jessica Tam, Xun Li, Andrew Elphinstone, Neil R Jordan, Justine K O’Brien, Benjamin J Pitcher, Monique Van Sluys, et al. Rapid literature mapping on the recent use of machine learning for wildlife imagery. Peer Community Journal, 3, 2023.
  • Nguyen et al. [2023] Thi Thu Thuy Nguyen, Anne C Eichholtzer, Don A Driscoll, Nathan I Semianiw, Dean M Corva, Abbas Z Kouzani, Thanh Thi Nguyen, and Duc Thanh Nguyen. Sawit: A small-sized animal wild image dataset with annotations. Multimedia Tools and Applications, pages 1–26, 2023.
  • NVIDIA et al. [2023] NVIDIA, Péter Vingelmann, and Frank H.P. Fitzek. Cuda, release: 12.1, 2023.
  • Department of Planning and Environment [2023] Department of Planning and Environment. Wildlife trade management plan for the commercial harvest of kangaroos in New South Wales 2022–26. Publication, annual report, New South Wales Government, 2023.
  • Prugh et al. [2019] Laura R Prugh, Kelly J Sivy, Peter J Mahoney, Taylor R Ganz, Mark A Ditmer, Madelon van de Kerk, Sophie L Gilbert, and Robert A Montgomery. Designing studies of predation risk for improved inference in carnivore-ungulate systems. Biological Conservation, 232:194–207, 2019.
  • Python Core Team [2023] Python Core Team. Python: A dynamic, open source programming language. Python Software Foundation, 2023. Python version 3.11.4.
  • Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
  • Schneider et al. [2018] Stefan Schneider, Graham W Taylor, and Stefan Kremer. Deep learning object detection methods for ecological camera trap data. In 2018 15th Conference on computer and robot vision (CRV), pages 321–328. IEEE, 2018.
  • Schwarz and Arnason [1996] Carl James Schwarz and A Neil Arnason. A general methodology for the analysis of capture-recapture experiments in open populations. Biometrics, pages 860–873, 1996.
  • Secco et al. [2024] Helio Secco, Luis Felipe Farina, Vitor Oliveira da Costa, Wallace Beiroz, Marcello Guerreiro, and Pablo Rodrigues Gonçalves. Identifying roadkill hotspots for mammals in the brazilian atlantic forest using a functional group approach. Environmental Management, 73(2):365–377, 2024.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
  • Tan et al. [2022] Mengyu Tan, Wentao Chao, Jo-Ku Cheng, Mo Zhou, Yiwen Ma, ** Ge, Lian Yu, and Limin Feng. Animal detection and classification from camera trap images using different mainstream object detection architectures. Animals, 12(15):1976, 2022.
  • Terven and Cordova-Esparza [2023] Juan Terven and Diana Cordova-Esparza. A comprehensive review of yolo: From yolov1 to yolov8 and beyond. arXiv preprint arXiv:2304.00501, 2023.
  • Van Horn et al. [2018] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
  • Wei et al. [2019] Xiu-Shen Wei, Jianxin Wu, and Quan Cui. Deep learning for fine-grained image analysis: A survey. arXiv preprint arXiv:1907.03069, 2019.
  • Wei et al. [2021] Xiu-Shen Wei, Yi-Zhe Song, Oisin Mac Aodha, Jianxin Wu, Yuxin Peng, Jinhui Tang, Jian Yang, and Serge Belongie. Fine-grained image analysis with deep learning: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(12):8927–8948, 2021.
  • Yuxin Wu [2023] Ross Girshick Yuxin Wu. pycocotools, 2023.
  • Zhao et al. [2017] Bo Zhao, Jiashi Feng, Xiao Wu, and Shuicheng Yan. A survey on deep learning-based fine-grained object classification and semantic segmentation. International Journal of Automation and Computing, 14(2):119–135, 2017.
  • Zurita et al. [2023] María-José Zurita, Daniel Riofrío, Noel Pérez-Pérez, David Romo, Diego S Benítez, Ricardo Flores Moyano, Felipe Grijalva, and Maria Baldeon-Calisto. On the use of deep learning models for automatic animal classification of native species in the amazon. In IEEE Colombian Conference on Applications of Computational Intelligence, pages 84–103. Springer, 2023.