FishNet: Deep Neural Networks for Low-Cost Fish Stock Estimation

Moseli Mots’oehli, Information and Computer Sciences, University of Hawai’i at Manoa
Anton Nikolaev, Information and Computer Sciences, University of Hawai’i at Manoa
Wawan B. IGede, Yayasan Konservasi Alam Nusantara, People and Nature Consulting International
John Lynham, Economics, University of Hawai’i at Manoa
Peter J. Mous, Yayasan Konservasi Alam Nusantara, People and Nature Consulting International
Peter Sadowski, Information and Computer Sciences, University of Hawai’i at Manoa

Abstract

Fish stock assessment often involves manual fish counting by taxonomy specialists, which is both time-consuming and costly. We propose FishNet, an automated computer vision system for both taxonomic classification and fish size estimation from images captured with a low-cost digital camera. The system first performs object detection and segmentation using a Mask R-CNN to identify individual fish in images containing multiple fish, possibly of different species. Then the species of each fish is classified and its length is predicted using separate machine learning models. To develop the model, we use a dataset of 300,000 hand-labeled images containing 1.2M fish of 163 different species and ranging in length from 10 cm to 250 cm, with additional annotations and quality control methods used to curate high-quality training data. On held-out test data sets, our system achieves a 92% intersection over union on the fish segmentation task, an 89% top-1 classification accuracy on single fish species classification, and a 2.3 cm mean absolute error on the fish length estimation task.

Index Terms:

Computer Vision, Fish Stock Estimation, Image Segmentation, Image Classification, Size Estimation

I Introduction

Predictions that all of the world’s stocks of commercially important fish could collapse by the year 2048 [1] have been tempered by recent evidence that fish stocks are recovering in many high-income countries [2]. This turnaround is typically attributed to stringent catch limits backed by an accurate scientific assessment of the number of fish in a particular stock (population). It is challenging to set an appropriate catch limit unless scientists know how many fish are actually present in a wild population. Despite progress in many high-income countries, the majority of the world’s fisheries remain “unassessed” and the prognosis for these fisheries is concerning: populations are declining and are on a trajectory towards functional extinction [3]. One of the major barriers to performing fish stock assessment in the developing world is the cost. The federal government in the US spends approximately $\$215$ million a year on fish stock assessment [4], which excludes spending by state governments on assessment of stocks within three nautical miles of the coast (state waters). As an example, the average cost of a fish stock assessment performed by NOAA’s Pacific Islands Fisheries Science Center in Honolulu is $\$5.6$ million, which exceeds the total value of many developing-country fisheries.

Figure 1: Fishers in Indonesia on their boats photographing their catch on a standard color-coded measuring board. Source: Ed Wray, for Yayasan Konservasi Alam Nusantara.

Costly stock assessment also prevents many well-managed fisheries from accessing lucrative markets for sustainable seafood. For example, the Marine Stewardship Council (MSC), which endorses seafood with its blue eco-label, requires a detailed stock assessment of the fishery for certification. In [5], the authors argue that the high costs of certification remain a barrier to many fisheries in Latin America and the Caribbean accessing the benefits of seafood ecolabels. Similarly, there is only one fishery in Indonesia certified as sustainable by the MSC, and this is, in part, because the species being targeted (yellowfin tuna) occurs throughout the Pacific Ocean; the expensive assessment of this stock is performed by an intergovernmental body, funded by the US, Australia, France, Japan, and a number of other countries. Here we propose a methodology for drastically reducing the cost of fisheries stock assessment by combining citizen science with machine learning.

Advances in digital photography, computer vision, and artificial intelligence (AI) make automating fish-stock estimation an attractive alternative. To address this challenge, we propose FishNet, a novel computer vision system for automated fish species classification and size estimation in a tropical snapper-grouper fishery in Indonesia. This is one of the most difficult fisheries in the world in which to perform a stock assessment, with small-scale fishers catching over a hundred different species, using a variety of fishing gear, spread across Indonesia’s 17,000 different islands. As part of a participatory science program, The Nature Conservancy in Indonesia gave small-scale fishers digital cameras and asked them to photograph all their catch against a consistent background with color-coded measurement scales, as seen in Figure 1. A team of marine biologists then identified the species and length of fish from the photographs, resulting in 300,000 images containing approximately 1.2 million fish annotated with species and length. Because the system requires no special equipment beyond a digital camera and a standardized board with fiduciary markings, the approach is a feasible and financially practical solution to the problem of fish stock estimation at scale.

II Related Work

Taxonomic Classification: Modern computer vision systems such as Facebook’s Detectron2 [6] use deep neural network models to perform object detection, segmentation, and classification. These models are pre-trained on the ImageNet-1k dataset and can be fine-tuned to application-specific tasks through transfer learning. Previous works have used this approach to classify fish for fish stock estimation [7, 8, 9, 10], relying on specialized and expensive cameras. In [7, 11], the authors use a pre-trained convolutional neural network (CNN) architecture and the Fish4Knowledge Ground-Truth dataset [12] for fish species classification. However, this dataset contains a much smaller number of images (27,370) of underwater fish extracted from video recordings and only 23 species, in comparison to the 163 species in this work. Kandimalla et al. present an automated framework for detecting, classifying, and tracking fish using high-resolution imaging sonar and underwater videos [13]. While it shares similarities with our work in addressing the classification and counting of multiple fish, it focuses solely on eight fish species and does not estimate the size of each fish. A similar approach is used in [14], where underwater video is used for detection and fish count estimation. This system is significantly more costly, as underwater images require specialized cameras and hardware, and processing video requires significantly more computation. The work in [15] is similar to ours in that it performs detection, multi-fish segmentation, classification, and size estimation. However, it uses a smaller dataset (DeepFish) of 1,291 labeled images with 7,339 fish of 59 species, compared to our 300,000 images containing 1.2M fish of more than 160 species. It also focuses on the retail side of the problem, where measurements and labeling are derived from fish markets and the fish are market-sized, as opposed to all fish caught at sea (which is more relevant for fish stock assessment).
The work of [15] focuses solely on multi-fish images where all fish are assumed to be of the same species. In contrast, our system is more versatile, and is capable of identifying and estimating the size of each fish in images containing multiple species. Furthermore, their focus is on large fish like bluefin tuna and sharks, while our system is trained to identify and measure fish ranging from 10 cm to over 200 cm in size. To our knowledge, our work offers the most comprehensive and practical approach to fish stock estimation, covering segmentation, classification, and size estimation using low-cost camera images on a dataset of over 160 species.

Fish Size Estimation: To help estimate fish size from images, our system uses fiduciary markers of known length: 10 cm rectangular colored areas on the sides of each measuring board. These fiduciary markers help estimate the size of the fish even when the photos are taken from different distances and angles. In [16], a Mask R-CNN similar to the one used in this work detects and measures the length of fish heads and bodies, but only for a single species. In [17], the authors place three ArUco fiduciary markers of different sizes on polypropylene sheets, along with the fish to be measured. The authors extend their work in [18] by using the same three fiduciary markers and estimating fish length using standard smartphone cameras. They use Mask R-CNN to detect and segment objects of interest. Their evaluation uses a smaller dataset of 1,000 images with fewer fish species and does not include species classification, as all fish are assumed to be the same species. This limits its applicability for general automated fish stock estimation. In contrast, FishNet is more general, accurately estimating sizes for a broader range of 163 fish species ranging from 10 cm to over 200 cm.

III Methods

In this section, we outline the implementation of data collection, image segmentation, classification, and fish length regression. Figure 2 shows the complete FishNet system: the models, outputs, and training datasets.

III-A Data Collection

CODRS (Crew Operated Data Recording Systems) is a program run by The Nature Conservancy’s Indonesia office. Data on species and size distributions of catch (as needed for length-based stock assessments) are collected via photographs taken on digital cameras by crew onboard fishing vessels. The fish are positioned over standardized measuring boards (1-by-0.8 m) before the photographs are taken so their length can be easily inferred (see Figure 4). The whiteboard, with multi-colored markings every 10 cm, serves as a fiduciary marker for scale and perspective. The images often contain multiple fish, and photographs are taken from a variety of distances and angles and under a variety of lighting conditions. At the end of each month, the memory card from each camera is handed over so that the images can be processed by fish identification experts working for The Nature Conservancy. Processing includes identification of the species and length of the fish, double-checking by a data quality control coordinator, and storage in an online database known as i-Fish. Participation in the program is voluntary and fishers receive a stipend for participating. Over 300 vessels have cooperated and contributed to the database. The set of 300,000 photographs was hand-annotated by taxonomic experts to label each fish with species and length. Approximately 1.2 million fish were hand-labeled in this manner, with 163 different species represented.

III-B Data Curation

Large-scale data-labeling projects nearly always contain some errors. In this work, one source of error is that photographs containing multiple fish are annotated with a list of species and fish lengths, but the order of the list does not always match the order of the fish in the image. In most images, fish are labeled from top to bottom and left to right, but many examples were found in which this is not the case, leading to ambiguity in the annotations. Therefore, the species annotations for images containing a single fish were assumed to be reliable, while the species annotations for images containing multiple fish were assumed to be noisy.

In experiments, we used the most reliable data for training, but there is interest in developing methods for curating the rest of the data. We developed an algorithm that attempts to disambiguate the species annotations using the predictions of the species classification model. For images with one or two fish, matching the labels to the fish is trivial, as there are at most two possible assignments. For images with three or more fish on a board, we use the output of the species classification model to assign a likelihood to each permutation of the labels over the detected fish, and select the permutation with the highest likelihood.
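The permutation search described above can be illustrated with a minimal sketch. It assumes the classifier returns, for each detected fish, a mapping from species name to probability, and that the likelihood of a permutation is the product of the per-fish probabilities (scored here in log space). The function name, the example species, and the probabilities are hypothetical and are not taken from the FishNet implementation.

```python
from itertools import permutations
from math import log


def best_label_assignment(probs, labels):
    """Return the ordering of `labels` that best matches the detected fish.

    probs[i][s] is the (assumed) classifier probability that fish i is species s.
    labels is the unordered list of annotated species for the image.
    """
    if len(probs) != len(labels):
        raise ValueError("need exactly one annotated label per detected fish")
    eps = 1e-12  # avoids log(0) when the classifier rules a species out
    best_order, best_score = None, float("-inf")
    # set() removes duplicate orderings when a species is listed more than once
    for order in set(permutations(labels)):
        score = sum(log(probs[i].get(s, 0.0) + eps) for i, s in enumerate(order))
        if score > best_score:
            best_order, best_score = order, score
    return list(best_order)


# Hypothetical classifier outputs for three detected fish
probs = [
    {"snapper": 0.7, "grouper": 0.2, "emperor": 0.1},
    {"snapper": 0.1, "grouper": 0.1, "emperor": 0.8},
    {"snapper": 0.3, "grouper": 0.6, "emperor": 0.1},
]
labels = ["grouper", "snapper", "emperor"]  # annotation order is unreliable
assignment = best_label_assignment(probs, labels)
```

Exhaustive enumeration grows factorially with the number of fish on the board. For crowded boards, the same objective could be solved as a linear assignment problem (for example with `scipy.optimize.linear_sum_assignment`), although the text does not say which search strategy was used.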

III-C Image Segmentation

The FishNet system first applies an object detection and segmentation model to predict a bounding box and segmentation mask around each individual fish and fiduciary color box. For this, we use Meta’s Detectron2 [6] implementation of Mask R-CNN [19] with a ResNet-50 backbone [20]. This model has been pre-trained on the Microsoft Common Objects in Context (MS COCO) dataset [21] and fine-tuned on a small self-annotated dataset.

The segmentation model is fine-tuned on a set of random images that were manually annotated for object detection and segmentation. Annotators segmented each fish and a set of fiduciary markers consisting of four 3-by-10 cm colored boxes (two yellow and two blue) located on the edges of the presentation board. The Visual Geometry Group (VGG) Image Annotation (VIA) tool [22] was used to annotate 700 random images containing multiple fish of the same species with polygon outlines of three object types: (1) fish, (2) yellow boxes, and (3) blue boxes. When these objects are partially occluded (e.g., by humans, shadows, or fishing nets), the annotator infers their shape as shown in Figure 4. Of the 700 images, 420 (60%) were used for training, 140 (20%) were used for validation, and 140 (20%) were used for testing.

Given a new image, the fine-tuned Detectron2 model detects objects of interest with (1) a bounding box enclosing the object, (2) a pixel-wise segmentation mask, and (3) the predicted class (fish, yellow box, or blue box). During inference, FishNet uses the segmentation mask from Detectron2 to crop individual fish images for input to classification and size regression models. We evaluated the segmentation model in terms of accuracy and the Intersection over Union (IoU) score on the held-out validation set. These metrics inform us of how often we are detecting and correctly assigning objects, as well as how well the segmentation mask covers the object since these outputs directly feed into the length estimation algorithm. A high IoU score (near 100%) means the bounding boxes fully and tightly enclose the fish, and this will result in better data for the downstream classification and regression tasks.
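As a concrete illustration of the IoU score used here, the following sketch computes it for two binary masks. It assumes masks are given as equally sized nested lists of 0/1 values, and it adopts the convention (an assumption, not stated in the text) that two empty masks have an IoU of 1.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention assumed here: two empty masks agree perfectly
    return intersection / union if union else 1.0


# Toy 3x4 masks: the prediction misses one column of the true object
pred = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
truth = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
iou = mask_iou(pred, truth)  # 4 shared pixels out of 6 covered pixels
```

In this toy case the predicted mask is shorter than the true object along its long axis, which is the kind of error that would lead to an underestimated length downstream.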

III-D Species Classification

For species classification, a pre-trained ResNet-50 model was fine-tuned on a dataset of 50,000 images containing a single fish, cropped using the bounding box supplied by the object detection model. We use 30,000 images (60%) for training, 10,000 images (20%) for validation, and 10,000 images (20%) as a test set. For training, the initial learning rate was $0.005$, which was reduced by 10% whenever the validation set error did not decrease for 10 epochs. Various augmentation strategies are applied during training to improve generalization, as listed in Table I. During testing, the images are re-scaled to $152\times 152$ and normalized.

Transformation	Range/Value
Intensity normalization	[0,1]
Random Rotation	[0°, 30°]
Width, length, channel shift, and random zoom	[0, 0.2]
Shear	[0, 0.3]
Feature-wise normalization and horizontal flip	Yes
Fill mode	nearest
Re-scale	$152\times 152$

Table I: Data augmentations applied during training of the species classification model.
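The learning-rate rule used for the species classifier can be read as a reduce-on-plateau schedule. The sketch below shows one such reading: the rate starts at 0.005 and is multiplied by 0.9 each time the validation error has failed to improve for 10 consecutive epochs. The multiplicative interpretation of "reduced by 10%" and the reset of the counter after each reduction are assumptions; the function name is hypothetical.

```python
def plateau_schedule(val_errors, initial_lr=0.005, factor=0.9, patience=10):
    """Return the learning rate in effect at each epoch.

    val_errors is the sequence of validation errors observed after each epoch.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0  # epochs since the validation error last improved
    history = []
    for err in val_errors:
        history.append(lr)
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor  # assumed: cut the rate by 10% of its current value
                stale = 0
    return history


# Two improving epochs followed by a 12-epoch plateau
val_errors = [1.0, 0.9] + [0.95] * 12
lr_history = plateau_schedule(val_errors)
```

With this input the rate stays at 0.005 through the tenth stalled epoch and drops to 0.0045 from the following epoch onward.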

III-E Fish Length Estimation

Length regression is a two-step process. First, we use Detectron2 to detect and locate fish and colored fiduciary markers, which provide scale information due to their fixed and known size. Images without a visible fiduciary marker are discarded from analysis. Second, a random forest regression model is trained to predict fish size in centimeters from a set of features that are extracted from the image and the detected fish and fiduciary markers:

1. Color Boxes: Color box count, median color box length, mean color box length, median color box segment length, and maximum color box segment length.

2. Detected Fish: The confidence score, bounding box length, and segment mask length for each fish. We use both measurements since the segment mask can be slightly shorter than the actual fish due to less than $100\%$ IoU in fish segmentation.

The features include redundant information which helps improve predictions under occlusion or poor lighting conditions. We chose to use a random forest due to its robustness to outliers.
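To make the feature set listed above concrete, the sketch below assembles it for one fish from pixel measurements. It assumes each detected color box is summarized by a pair (bounding box length, segment mask length) in pixels; the dictionary keys and function names are hypothetical. The second function is a naive proportional baseline included only to show how the 10 cm markers provide scale. It is not the random forest model used by FishNet.

```python
from statistics import mean, median

MARKER_LENGTH_CM = 10.0  # known length of each colored box, as stated in the text


def length_features(color_boxes_px, fish_bbox_px, fish_mask_px, fish_score):
    """Build the regression features for one fish.

    color_boxes_px: list of (bbox_length, mask_length) pairs in pixels.
    """
    if not color_boxes_px:
        # Images without a visible marker are discarded from analysis
        raise ValueError("no fiduciary marker detected")
    bbox_lengths = [b for b, _ in color_boxes_px]
    mask_lengths = [m for _, m in color_boxes_px]
    return {
        "box_count": len(color_boxes_px),
        "box_median_px": median(bbox_lengths),
        "box_mean_px": mean(bbox_lengths),
        "box_segment_median_px": median(mask_lengths),
        "box_segment_max_px": max(mask_lengths),
        "fish_score": fish_score,
        "fish_bbox_px": fish_bbox_px,
        "fish_mask_px": fish_mask_px,
    }


def naive_length_cm(features):
    """Proportional baseline: scale the fish length by the marker length."""
    cm_per_px = MARKER_LENGTH_CM / features["box_median_px"]
    return features["fish_bbox_px"] * cm_per_px


# Hypothetical detections: three color boxes and one fish
features = length_features(
    color_boxes_px=[(100, 98), (102, 100), (96, 95)],
    fish_bbox_px=520,
    fish_mask_px=500,
    fish_score=0.98,
)
baseline_cm = naive_length_cm(features)
```

A learned regressor over these redundant features can compensate for perspective, occlusion, and segmentation error in ways that the single-ratio baseline cannot, which is consistent with the motivation given above.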

IV Results

In experiments, we evaluate each component of the FishNet model: detection, segmentation, species classification, and length regression. Due to the limitations of human annotations discussed above, we report results on held-out subsets of the full dataset that have high-quality annotations.

IV-A Detection

The performance of the object detection model was evaluated by comparing the number of fish detected in each image with the number of fish listed by human annotators. On a random subset of 10,000 images containing only a single fish, the model detects fish in 99% of the images and fiduciary markers in 97% of the images. For the more challenging case where an image can contain multiple fish, we ran the object detection model on all 300,000 images and observed that the predicted number of fish exactly agreed with the human annotator in 92% of images. Figure 5 shows the confusion matrix for detected vs. human-annotated number of fish per image. When the detection model is wrong, it is usually off by only one or two fish, overestimating the count in 3% of images and underestimating the count in 5% of images. The model is more likely to miscount fish when there are more on the board, especially smaller fish, due to overlap. Some of these errors are due to fish located in the background of the image, as shown in Figure 6, and could be eliminated by more careful image-capture procedures or by cropping the images as a pre-processing step.
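The count-agreement statistics reported above can be computed as in the following sketch, which assumes two parallel lists holding the detected and the human-annotated number of fish per image. The function name and the toy counts are hypothetical.

```python
def count_agreement(predicted, annotated):
    """Fractions of images where the detected fish count matches,
    exceeds, or falls short of the human-annotated count."""
    n = len(predicted)
    pairs = list(zip(predicted, annotated))
    return {
        "exact": sum(p == a for p, a in pairs) / n,
        "over": sum(p > a for p, a in pairs) / n,
        "under": sum(p < a for p, a in pairs) / n,
    }


# Toy counts for five images
predicted = [1, 2, 3, 4, 2]
annotated = [1, 2, 3, 3, 3]
rates = count_agreement(predicted, annotated)
```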

IV-B Segmentation

Segmentation performance was evaluated using a held-out test set of 140 (20% of 700) images containing multiple fish and segmented by human annotators. The predicted segmentation masks have a mean IoU of 92% (median 94%) for fish and a mean IoU of 86% (median 88%) for fiduciary markers, as shown in the box and whisker plot in Figure 8. Figure 7 is an example of predictions on an image where a high IoU is achieved for all the fiduciary markers and fish on the board. The performance of the detection and segmentation model affects species classification and length regression. A high IoU score indicates bounding boxes tightly enclose the fish, leading to better training data for length estimation in single fish images. In multi-fish images, poor IoU from Detectron2 affects both classification and regression. Loose bounding boxes can overestimate lengths and cause parts of one fish to appear in the crop of another, making it harder to learn critical features that explain the differences in species.

IV-C Species Classification

Species classification using the fine-tuned ResNet-50 model was evaluated using the held-out test set of 10,000 single-fish images. The model achieves 89% top-1 classification accuracy and 97% top-5 accuracy on the test set.
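For reference, top-k accuracy counts a prediction as correct when the true species is among the k highest-scoring classes. A minimal sketch, assuming per-image score lists indexed by class id (the scores below are made up for illustration):

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of examples whose true class is among the k highest scores."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Class indices sorted by descending score, truncated to the top k
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy scores for three images over three classes
scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.4, 0.5],
]
true_labels = [0, 2, 1]
top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
```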

IV-D Length Regression

Length regression was evaluated using five-fold cross-validation on the set of 50,000 single-fish images. The mean absolute error (MAE) was 2.3 cm and the coefficient of determination ($R^{2}$) was 79%. Figure 9 shows a scatter plot of predicted vs. true fish lengths on the validation set. The length estimates for mid-sized fish (60-80 cm), which are the majority in the validation set as shown in the distribution of size in Figure 11, are highly accurate. The model tends to underestimate larger fish sizes, likely due to a lack of large fish in the training data. This issue can be mitigated by including more large fish in the training set. Figure 10 shows that the relative errors are symmetrically distributed around zero. This is critical for accurate length-based fish stock assessment methods, which evaluate the proportion of a population above or below certain length-based thresholds (such as size-at-maturity). We verify that the empirical distribution of the predicted fish lengths is close to the ground truth observations in Figure 11.
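The two regression metrics reported above follow their standard definitions, sketched here for lists of true and predicted lengths in centimeters. The example values are invented for illustration and are not taken from the dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted lengths."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy lengths in cm
y_true = [50.0, 60.0, 70.0, 80.0]
y_pred = [52.0, 59.0, 68.0, 83.0]
mae = mean_absolute_error(y_true, y_pred)  # (2 + 1 + 2 + 3) / 4
r2 = r_squared(y_true, y_pred)             # 1 - 18 / 500
```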

V Discussion and Future Research

The use of computer vision models for fish stock assessment could have great economic benefits by reducing time, cost, and labor. This paper presents FishNet: a system for performing fish counting, species classification, and size estimation from photographs. The system was developed and tested using a large collection of 300,000 annotated photographs generously provided by Yayasan Konservasi Alam Nusantara. While further work is needed to curate these annotated photographs to fully utilize them for machine learning, our results show that even training on a subset of this data yields promising results. One avenue for future research involves considering machine learning in the presence of label noise, as well as deep active learning methods for image classification [23], and segmentation under a noisy labeling oracle [24, 25].

Acknowledgement

We acknowledge the David and Lucile Packard Foundation for supporting this experiment and the CODRS data collection program (run by Yayasan Konservasi Alam Nusantara). The CODRS program has also received support from the Walton Family Foundation and the USAID Indonesia SNAPPER program (Cooperative Agreement No. AID497A1600011). We wish to thank technicians from People and Nature Consulting International for their help with the collection and labeling of the images and the group of over 500 Indonesian fishers who took pictures of their catch during their work at sea. We acknowledge technical support and computing resources from University of Hawaii Information Technology Services – Cyberinfrastructure funded in part by the National Science Foundation CC* award #2201428.

References

[1] B. Worm, E. Barbier, N. Beaumont, E. Duffy, C. Folke, B. Halpern, J. Jackson, H. Lotze, F. Micheli, S. Palumbi, et al. Impacts of biodiversity loss on ocean ecosystem services. Science, 314(5800):787–790, 2006.
[2] B. Worm, R. Hilborn, J. Baum, T. Branch, J. Collie, C. Costello, M. Fogarty, E. Fulton, J. Hutchings, S. Jennings, et al. Rebuilding global fisheries. Science, 325(5940):578–585, 2009.
[3] Christopher Costello, Daniel Ovando, Ray Hilborn, Steven D Gaines, Olivier Deschenes, and Sarah E Lester. Status and solutions for the world’s unassessed fisheries. Science, 338(6106):517–520, 2012.
[4] R. Merrick and R. Methot. NOAA’s cost of fish stock assessments. National Oceanic and Atmospheric Administration, 2016.
[5] Mónica Pérez-Ramírez, Mauricio Castrejón, Nicolás L Gutiérrez, and Omar Defeo. The marine stewardship council certification in latin america and the caribbean: A review of experiences, potentials and pitfalls. Fisheries Research, 182:50–58, 2016.
[6] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[7] E. Prasetyo, N. Suciati, and C. Fatichah. Multi-level residual network vggnet for fish species classification. Journal of King Saud University - Computer and Information Sciences, 34(8, Part A):5286–5295, 2022.
[8] R. Dhruv, J. Sushant, and S. Indu. Underwater fish species classification using convolutional neural network and deep learning. In 2017 Ninth International Conference on Advances in Pattern Recognition (ICAPR), pages 1–6, 2017.
[9] C. Guang, S. Peng, and S. Yi. Automatic fish classification system using deep learning. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI), pages 24–29, 2017.
[10] U. Oguzhan, K. Diclehan, and T. Mehmet. A large-scale dataset for fish segmentation and classification. In 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), pages 1–5, 2020.
[11] Hongwei Qin, Xiu Li, Jian Liang, Yigang Peng, and Changshui Zhang. Deepfish: Accurate underwater live fish recognition with a deep architecture. Neurocomputing, 187:49–58, 2016.
[12] Jessica Yun-Heh Chen-Burger, Robert Fisher, Daniela Giordano, Lynda Hardman, and Fang-Pang Lin. Fish4Knowledge Ground-Truth dataset. Online, 2012. Available at: https://homepages.inf.ed.ac.uk/rbf/fish4knowledge/.
[13] V. Kandimalla, M. Richard, F. Smith, J. Quirion, L. Torgo, and C. Whidden. Automated detection, classification and counting of fish in fish passages with deep learning. In Frontiers in Marine Science, 2022.
[14] Rod Connolly, David Fairclough, Eric Jinks, Ellen Ditria, Gary Jackson, Sebastian Lopez-Marcano, Andrew Olds, and Kristin Jinks. Improved accuracy for automated counting of a fish in baited underwater videos for stock assessment. Frontiers in Marine Science, 8, 10 2021.
[15] Garcia-D’Urso Nahuel, Alejandro Galán Cuenca, Paula Pérez-Sánchez, Pau Climent i Pérez, Andrés Guilló, Jorge Azorin-Lopez, Marcelo Saval-Calvo, Juan Guillén-Nieto, and Gabriel Soler-Capdepón. The deepfish computer vision dataset for fish instance segmentation, classification, and size estimation. Scientific Data, 9:287, 06 2022.
[16] A. Álvarez-Ellacuría, M. Palmer, I. Catalán, and J. Lisani. Image-based, unsupervised estimation of fish size from commercial landings using deep learning. ICES Journal of Marine Science, 77(4):1330–1339, 2019.
[17] G. Monkman, K. Hyder, M.J. Kaiser, and F. Vidal. Using machine vision to estimate fish length from images using regional convolutional neural networks. Methods in Ecology and Evolution, 10(12):2045–2056, 2019.
[18] G. Monkman, K. Hyder, M.J. Kaiser, and F. Vidal. Accurate estimation of fish length in single camera photogrammetry with a fiducial marker. ICES Journal of Marine Science, 77(6):2245–2254, 2020.
[19] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[21] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and L.C. Zitnick. Microsoft coco: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755, 2014.
[22] Abhishek Dutta and Andrew Zisserman. The VIA annotation software for images, audio and video. In Proceedings of the 27th ACM International Conference on Multimedia, 2019.
[23] M. Mots’oehli and K. Baek. Deep active learning in the presence of label noise: A survey. arXiv preprint arXiv:2302.11075, 2023.
[24] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In 32nd Conference on Neural Information Processing Systems, NIPS’18, pages 8536–8546. Curran Associates Inc., 2018.
[25] Z. Chicheng and C. Kamalika. Active learning from weak and strong labelers. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.