-
Enhancing Post-Hoc Explanation Benchmark Reliability for Image Classification
Authors:
Tristan Gomez,
Harold Mouchère
Abstract:
Deep neural networks, while powerful for image classification, often operate as "black boxes," complicating the understanding of their decision-making processes. Various explanation methods, particularly those generating saliency maps, aim to address this challenge. However, the inconsistency issues of faithfulness metrics hinder reliable benchmarking of explanation methods. This paper employs an…
▽ More
Deep neural networks, while powerful for image classification, often operate as "black boxes," complicating the understanding of their decision-making processes. Various explanation methods, particularly those generating saliency maps, aim to address this challenge. However, the inconsistency issues of faithfulness metrics hinder reliable benchmarking of explanation methods. This paper employs an approach inspired by psychometrics, utilizing Krippendorf's alpha to quantify the benchmark reliability of post-hoc methods in image classification. The study proposes model training modifications, including feeding perturbed samples and employing focal loss, to enhance robustness and calibration. Empirical evaluations demonstrate significant improvements in benchmark reliability across metrics, datasets, and post-hoc methods. This pioneering work establishes a foundation for more reliable evaluation practices in the realm of post-hoc explanation methods, emphasizing the importance of model robustness in the assessment process.
△ Less
Submitted 29 November, 2023;
originally announced November 2023.
-
Handwritten Text Recognition from Crowdsourced Annotations
Authors:
Solène Tarride,
Tristan Faine,
Mélodie Boillet,
Harold Mouchère,
Christopher Kermorvant
Abstract:
In this paper, we explore different ways of training a model for handwritten text recognition when multiple imperfect or noisy transcriptions are available. We consider various training configurations, such as selecting a single transcription, retaining all transcriptions, or computing an aggregated transcription from all available annotations. In addition, we evaluate the impact of quality-based…
▽ More
In this paper, we explore different ways of training a model for handwritten text recognition when multiple imperfect or noisy transcriptions are available. We consider various training configurations, such as selecting a single transcription, retaining all transcriptions, or computing an aggregated transcription from all available annotations. In addition, we evaluate the impact of quality-based data selection, where samples with low agreement are removed from the training set. Our experiments are carried out on municipal registers of the city of Belfort (France) written between 1790 and 1946. % results The results show that computing a consensus transcription or training on multiple transcriptions are good alternatives. However, selecting training samples based on the degree of agreement between annotators introduces a bias in the training data and does not improve the results. Our dataset is publicly available on Zenodo: https://zenodo.org/record/8041668.
△ Less
Submitted 19 June, 2023;
originally announced June 2023.
-
Model-based inexact graph matching on top of CNNs for semantic scene understanding
Authors:
Jérémy Chopin,
Jean-Baptiste Fasquel,
Harold Mouchère,
Rozenn Dahyot,
Isabelle Bloch
Abstract:
Deep learning based pipelines for semantic segmentation often ignore structural information available on annotated images used for training. We propose a novel post-processing module enforcing structural knowledge about the objects of interest to improve segmentation results provided by deep learning. This module corresponds to a "many-to-one-or-none" inexact graph matching approach, and is formul…
▽ More
Deep learning based pipelines for semantic segmentation often ignore structural information available on annotated images used for training. We propose a novel post-processing module enforcing structural knowledge about the objects of interest to improve segmentation results provided by deep learning. This module corresponds to a "many-to-one-or-none" inexact graph matching approach, and is formulated as a quadratic assignment problem. Our approach is compared to a CNN-based segmentation (for various CNN backbones) on two public datasets, one for face segmentation from 2D RGB images (FASSEG), and the other for brain segmentation from 3D MRIs (IBSR). Evaluations are performed using two types of structural information (distances and directional relations, , this choice being a hyper-parameter of our generic framework). On FASSEG data, results show that our module improves accuracy of the CNN by about 6.3% (the Hausdorff distance decreases from 22.11 to 20.71). On IBSR data, the improvement is of 51% (the Hausdorff distance decreases from 11.01 to 5.4). In addition, our approach is shown to be resilient to small training datasets that often limit the performance of deep learning methods: the improvement increases as the size of the training dataset decreases.
△ Less
Submitted 1 August, 2023; v1 submitted 18 January, 2023;
originally announced January 2023.
-
Comparison of attention models and post-hoc explanation methods for embryo stage identification: a case study
Authors:
Tristan Gomez,
Thomas Fréour,
Harold Mouchère
Abstract:
An important limitation to the development of AI-based solutions for In Vitro Fertilization (IVF) is the black-box nature of most state-of-the-art models, due to the complexity of deep learning architectures, which raises potential bias and fairness issues. The need for interpretable AI has risen not only in the IVF field but also in the deep learning community in general. This has started a trend…
▽ More
An important limitation to the development of AI-based solutions for In Vitro Fertilization (IVF) is the black-box nature of most state-of-the-art models, due to the complexity of deep learning architectures, which raises potential bias and fairness issues. The need for interpretable AI has risen not only in the IVF field but also in the deep learning community in general. This has started a trend in literature where authors focus on designing objective metrics to evaluate generic explanation methods. In this paper, we study the behavior of recently proposed objective faithfulness metrics applied to the problem of embryo stage identification. We benchmark attention models and post-hoc methods using metrics and further show empirically that (1) the metrics produce low overall agreement on the model ranking and (2) depending on the metric approach, either post-hoc methods or attention models are favored. We conclude with general remarks about the difficulty of defining faithfulness and the necessity of understanding its relationship with the type of approach that is favored.
△ Less
Submitted 13 May, 2022;
originally announced May 2022.
-
Towards deep learning-powered IVF: A large public benchmark for morphokinetic parameter prediction
Authors:
Tristan Gomez,
Magalie Feyeux,
Nicolas Normand,
Laurent David,
Perrine Paul-Gilloteaux,
Thomas Fréour,
Harold Mouchère
Abstract:
An important limitation to the development of Artificial Intelligence (AI)-based solutions for In Vitro Fertilization (IVF) is the absence of a public reference benchmark to train and evaluate deep learning (DL) models. In this work, we describe a fully annotated dataset of 704 videos of develo** embryos, for a total of 337k images. We applied ResNet, LSTM, and ResNet-3D architectures to our dat…
▽ More
An important limitation to the development of Artificial Intelligence (AI)-based solutions for In Vitro Fertilization (IVF) is the absence of a public reference benchmark to train and evaluate deep learning (DL) models. In this work, we describe a fully annotated dataset of 704 videos of develo** embryos, for a total of 337k images. We applied ResNet, LSTM, and ResNet-3D architectures to our dataset and demonstrate that they overperform algorithmic approaches to automatically annotate stage development phases. Altogether, we propose the first public benchmark that will allow the community to evaluate morphokinetic models. This is the first step towards deep learning-powered IVF. Of note, we propose highly detailed annotations with 16 different development phases, including early cell division phases, but also late cell divisions, phases after morulation, and very early phases, which have never been used before. We postulate that this original approach will help improve the overall performance of deep learning approaches on time-lapse videos of embryo development, ultimately benefiting infertile patients with improved clinical success rates (Code and data are available at https://gitlab.univ-nantes.fr/E144069X/bench_mk_pred.git).
△ Less
Submitted 13 May, 2022; v1 submitted 1 March, 2022;
originally announced March 2022.
-
Metrics for saliency map evaluation of deep learning explanation methods
Authors:
Tristan Gomez,
Thomas Fréour,
Harold Mouchère
Abstract:
Due to the black-box nature of deep learning models, there is a recent development of solutions for visual explanations of CNNs. Given the high cost of user studies, metrics are necessary to compare and evaluate these different methods. In this paper, we critically analyze the Deletion Area Under Curve (DAUC) and Insertion Area Under Curve (IAUC) metrics proposed by Petsiuk et al. (2018). These me…
▽ More
Due to the black-box nature of deep learning models, there is a recent development of solutions for visual explanations of CNNs. Given the high cost of user studies, metrics are necessary to compare and evaluate these different methods. In this paper, we critically analyze the Deletion Area Under Curve (DAUC) and Insertion Area Under Curve (IAUC) metrics proposed by Petsiuk et al. (2018). These metrics were designed to evaluate the faithfulness of saliency maps generated by generic methods such as Grad-CAM or RISE. First, we show that the actual saliency score values given by the saliency map are ignored as only the ranking of the scores is taken into account. This shows that these metrics are insufficient by themselves, as the visual appearance of a saliency map can change significantly without the ranking of the scores being modified. Secondly, we argue that during the computation of DAUC and IAUC, the model is presented with images that are out of the training distribution which might lead to an unreliable behavior of the model being explained. To complement DAUC/IAUC, we propose new metrics that quantify the sparsity and the calibration of explanation methods, two previously unstudied properties. Finally, we give general remarks about the metrics studied in this paper and discuss how to evaluate them in a user study.
△ Less
Submitted 22 June, 2022; v1 submitted 31 January, 2022;
originally announced January 2022.
-
BR-NPA: A Non-Parametric High-Resolution Attention Model to improve the Interpretability of Attention
Authors:
Tristan Gomez,
Suiyi Ling,
Thomas Fréour,
Harold Mouchère
Abstract:
The prevalence of employing attention mechanisms has brought along concerns on the interpretability of attention distributions. Although it provides insights about how a model is operating, utilizing attention as the explanation of model predictions is still highly dubious. The community is still seeking more interpretable strategies for better identifying local active regions that contribute the…
▽ More
The prevalence of employing attention mechanisms has brought along concerns on the interpretability of attention distributions. Although it provides insights about how a model is operating, utilizing attention as the explanation of model predictions is still highly dubious. The community is still seeking more interpretable strategies for better identifying local active regions that contribute the most to the final decision. To improve the interpretability of existing attention models, we propose a novel Bilinear Representative Non-Parametric Attention (BR-NPA) strategy that captures the task-relevant human-interpretable information. The target model is first distilled to have higher-resolution intermediate feature maps. From which, representative features are then grouped based on local pairwise feature similarity, to produce finer-grained, more precise attention maps highlighting task-relevant parts of the input. The obtained attention maps are ranked according to the activity level of the compound feature, which provides information regarding the important level of the highlighted regions. The proposed model can be easily adapted in a wide variety of modern deep models, where classification is involved. Extensive quantitative and qualitative experiments showcase more comprehensive and accurate visual explanations compared to state-of-the-art attention models and visualizations methods across multiple tasks including fine-grained image classification, few-shot classification, and person re-identification, without compromising the classification accuracy. The proposed visualization model sheds imperative light on how neural networks `pay their attention' differently in different tasks.
△ Less
Submitted 15 September, 2022; v1 submitted 4 June, 2021;
originally announced June 2021.
-
A General Framework for the Recognition of Online Handwritten Graphics
Authors:
Frank Julca-Aguilar,
Harold Mouchère,
Christian Viard-Gaudin,
Nina S. T. Hirata
Abstract:
We propose a new framework for the recognition of online handwritten graphics. Three main features of the framework are its ability to treat symbol and structural level information in an integrated way, its flexibility with respect to different families of graphics, and means to control the tradeoff between recognition effectiveness and computational cost. We model a graphic as a labeled graph gen…
▽ More
We propose a new framework for the recognition of online handwritten graphics. Three main features of the framework are its ability to treat symbol and structural level information in an integrated way, its flexibility with respect to different families of graphics, and means to control the tradeoff between recognition effectiveness and computational cost. We model a graphic as a labeled graph generated from a graph grammar. Non-terminal vertices represent subcomponents, terminal vertices represent symbols, and edges represent relations between subcomponents or symbols. We then model the recognition problem as a graph parsing problem: given an input stroke set, we search for a parse tree that represents the best interpretation of the input. Our graph parsing algorithm generates multiple interpretations (consistent with the grammar) and then we extract an optimal interpretation according to a cost function that takes into consideration the likelihood scores of symbols and structures. The parsing algorithm consists in recursively partitioning the stroke set according to structures defined in the grammar and it does not impose constraints present in some previous works (e.g. stroke ordering). By avoiding such constraints and thanks to the powerful representativeness of graphs, our approach can be adapted to the recognition of different graphic notations. We show applications to the recognition of mathematical expressions and flowcharts. Experimentation shows that our method obtains state-of-the-art accuracy in both applications.
△ Less
Submitted 19 September, 2017;
originally announced September 2017.