Search | arXiv e-print repository

doi 10.1007/s11222-023-10309-0

Random Forest Kernel for High-Dimension Low Sample Size Classification

Authors: Lucca Portes Cavalheiro, Simon Bernard, Jean Paul Barddal, Laurent Heutte

Abstract: High dimension, low sample size (HDLSS) problems are numerous among real-world applications of machine learning. From medical images to text processing, traditional machine learning algorithms are usually unsuccessful in learning the best possible concept from such data. In a previous work, we proposed a dissimilarity-based approach for multi-view classification, the Random Forest Dissimilarity (RFD), that achieves state-of-the-art results for such problems. In this work, we transpose the core principle of this approach to solving HDLSS classification problems, by using the RF similarity measure as a learned precomputed SVM kernel (RFSVM). We show that such a learned similarity measure is particularly suited and accurate for this classification context. Experiments conducted on 40 public HDLSS classification datasets, supported by rigorous statistical analyses, show that the RFSVM method outperforms existing methods for the majority of HDLSS problems and remains at the same time very competitive for low or non-HDLSS problems.

Submitted 17 November, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: 23 pages. To be published in Statistics and Computing (accepted September 26, 2023)

Journal ref: Stat Comput 34, 9 (2024)

arXiv:2309.14394 [pdf, other]

Multiple Noises in Diffusion Model for Semi-Supervised Multi-Domain Translation

Authors: Tsiry Mayet, Simon Bernard, Clement Chatelain, Romain Herault

Abstract: Domain-to-domain translation involves generating a target domain sample given a condition in the source domain. Most existing methods focus on fixed input and output domains, i.e. they only work for specific configurations (i.e. for two domains, either $D_1\rightarrow{}D_2$ or $D_2\rightarrow{}D_1$). This paper proposes Multi-Domain Diffusion (MDD), a conditional diffusion framework for multi-domain translation in a semi-supervised context. Unlike previous methods, MDD does not require defining input and output domains, allowing translation between any partition of domains within a set (such as $(D_1, D_2)\rightarrow{}D_3$, $D_2\rightarrow{}(D_1, D_3)$, $D_3\rightarrow{}D_1$, etc. for 3 domains), without the need to train separate models for each domain configuration. The key idea behind MDD is to leverage the noise formulation of diffusion models by incorporating one noise level per domain, which allows missing domains to be modeled with noise in a natural way. This transforms the training task from a simple reconstruction task to a domain translation task, where the model relies on less noisy domains to reconstruct more noisy domains. We present results on a multi-domain (with more than two domains) synthetic image translation dataset with challenging semantic domain inversion.

Submitted 25 September, 2023; originally announced September 2023.

arXiv:2307.13517 [pdf, other]

Towards Long-Term predictions of Turbulence using Neural Operators

Authors: Fernando Gonzalez, François-Xavier Demoulin, Simon Bernard

Abstract: This paper explores Neural Operators to predict turbulent flows, focusing on the Fourier Neural Operator (FNO) model. It aims to develop reduced-order/surrogate models for turbulent flow simulations using Machine Learning. Different model configurations are analyzed, with U-NET structures (UNO and U-FNET) performing better than the standard FNO in accuracy and stability. U-FNET excels in predicting turbulence at higher Reynolds numbers. Regularization terms, like gradient and stability losses, are essential for stable and accurate predictions. The study emphasizes the need for improved metrics for deep learning models in fluid flow prediction. Further research should focus on models handling complex flows and practical benchmarking metrics.

Submitted 25 July, 2023; originally announced July 2023.

Comments: ETMM14 proceedings

arXiv:2303.12888 [pdf, other]

A dynamic risk score for early prediction of cardiogenic shock using machine learning

Authors: Yuxuan Hu, Albert Lui, Mark Goldstein, Mukund Sudarshan, Andrea Tinsay, Cindy Tsui, Samuel Maidman, John Medamana, Neil Jethani, Aahlad Puli, Vuthy Nguy, Yindalon Aphinyanaphongs, Nicholas Kiefer, Nathaniel Smilowitz, James Horowitz, Tania Ahuja, Glenn I Fishman, Judith Hochman, Stuart Katz, Samuel Bernard, Rajesh Ranganath

Abstract: Myocardial infarction and heart failure are major cardiovascular diseases that affect millions of people in the US. The morbidity and mortality are highest among patients who develop cardiogenic shock. Early recognition of cardiogenic shock is critical. Prompt implementation of treatment measures can prevent the deleterious spiral of ischemia, low blood pressure, and reduced cardiac output due to cardiogenic shock. However, early identification of cardiogenic shock has been challenging due to human providers' inability to process the enormous amount of data in the cardiac intensive care unit (ICU) and lack of an effective risk stratification tool. We developed a deep learning-based risk stratification tool, called CShock, for patients admitted into the cardiac ICU with acute decompensated heart failure and/or myocardial infarction to predict onset of cardiogenic shock. To develop and validate CShock, we annotated cardiac ICU datasets with physician adjudicated outcomes. CShock achieved an area under the receiver operator characteristic curve (AUROC) of 0.820, which substantially outperformed CardShock (AUROC 0.519), a well-established risk score for cardiogenic shock prognosis. CShock was externally validated in an independent patient cohort and achieved an AUROC of 0.800, demonstrating its generalizability in other cardiac ICUs.

Submitted 28 March, 2023; v1 submitted 22 March, 2023; originally announced March 2023.

arXiv:2212.03361 [pdf, other]

Domain Translation via Latent Space Mapping

Authors: Tsiry Mayet, Simon Bernard, Clement Chatelain, Romain Herault

Abstract: In this paper, we investigate the problem of multi-domain translation: given an element $a$ of domain $A$, we would like to generate a corresponding $b$ sample in another domain $B$, and vice versa. Acquiring supervision in multiple domains can be a tedious task, so we propose to learn this translation from one domain to another when supervision is available as a pair $(a,b)\sim A\times B$, leveraging possible unpaired data when only $a\sim A$ or only $b\sim B$ is available. We introduce a new unified framework called Latent Space Mapping (\model) that exploits the manifold assumption in order to learn, from each domain, a latent space. Unlike existing approaches, we propose to further regularize each latent space using available domains by learning each dependency between pairs of domains. We evaluate our approach on three tasks: i) a synthetic dataset with image translation, ii) a real-world task of semantic segmentation for medical images, and iii) a real-world task of facial landmark detection.

Submitted 6 December, 2022; originally announced December 2022.

arXiv:2007.08377 [pdf, other]

doi 10.1142/9789811211072_0007

Random Forest for Dissimilarity-based Multi-view Learning

Authors: Simon Bernard, Hongliu Cao, Robert Sabourin, Laurent Heutte

Abstract: Many classification problems are naturally multi-view in the sense that their data are described through multiple heterogeneous descriptions. For such tasks, dissimilarity strategies are effective ways to make the different descriptions comparable and to easily merge them, by (i) building intermediate dissimilarity representations for each view and (ii) fusing these representations by averaging the dissimilarities over the views. In this work, we show that the Random Forest proximity measure can be used to build the dissimilarity representations, since this measure reflects similarities between features but also class membership. We then propose a Dynamic View Selection method to better combine the view-specific dissimilarity representations. This allows a decision to be taken, for each instance to predict, with only the most relevant views for that instance. Experiments are conducted on several real-world multi-view datasets, and show that the Dynamic View Selection offers a significant improvement in performance compared to the simple average combination and two state-of-the-art static view combinations.

Submitted 16 July, 2020; originally announced July 2020.

Comments: Published in Handbook of Pattern Recognition and Computer Vision, 2020 (preprint)

arXiv:2007.02572 [pdf, other]

A Novel Random Forest Dissimilarity Measure for Multi-View Learning

Authors: Hongliu Cao, Simon Bernard, Robert Sabourin, Laurent Heutte

Abstract: Multi-view learning is a learning task in which data is described by several concurrent representations. Its main challenge is most often to exploit the complementarities between these representations to help solve a classification/regression task. This is a challenge that can be met nowadays if there is a large amount of data available for learning. However, this is not necessarily true for all real-world problems, where data are sometimes scarce (e.g. problems related to the medical environment). In these situations, an effective strategy is to use intermediate representations based on the dissimilarities between instances. This work presents new ways of constructing these dissimilarity representations, learning them from data with Random Forest classifiers. More precisely, two methods are proposed, which modify the Random Forest proximity measure, to adapt it to the context of High Dimension Low Sample Size (HDLSS) multi-view classification problems. The second method, based on an Instance Hardness measurement, is significantly more accurate than other state-of-the-art measurements including the original RF Proximity measurement and the Large Margin Nearest Neighbor (LMNN) metric learning measurement.

Submitted 6 July, 2020; originally announced July 2020.

Comments: accepted to ICPR 2020 (22/06/2020)

arXiv:1806.07686 [pdf, other]

Dynamic voting in multi-view learning for radiomics applications

Authors: Hongliu Cao, Simon Bernard, Laurent Heutte, Robert Sabourin

Abstract: Cancer diagnosis and treatment often require a personalized analysis for each patient nowadays, due to the heterogeneity among the different types of tumor and among patients. Radiomics is a recent medical imaging field that has shown during the past few years to be promising for achieving this personalization. However, a recent study shows that most of the state-of-the-art works in Radiomics fail to identify this problem as a multi-view learning task and that multi-view learning techniques are generally more efficient. In this work, we propose to further investigate the potential of one family of multi-view learning methods based on Multiple Classifiers Systems where one classifier is learnt on each view and all classifiers are combined afterwards. In particular, we propose a random forest based dynamic weighted voting scheme, which personalizes the combination of views for each new patient for classification tasks. The proposed method is validated on several real-world Radiomics problems.

Submitted 26 June, 2018; v1 submitted 20 June, 2018; originally announced June 2018.

Comments: 10 pages

arXiv:1803.11241 [pdf, ps, other]

Improve the performance of transfer learning without fine-tuning using dissimilarity-based multi-view learning for breast cancer histology images

Authors: Hongliu Cao, Simon Bernard, Laurent Heutte, Robert Sabourin

Abstract: Breast cancer is one of the most common types of cancer and one of the leading causes of cancer-related death for women. In the context of the ICIAR 2018 Grand Challenge on Breast Cancer Histology Images, we compare one handcrafted feature extractor and five transfer learning feature extractors based on deep learning. We find that the deep learning networks pretrained on ImageNet have better performance than the popular handcrafted features used for breast cancer histology images. The best feature extractor achieves an average accuracy of 79.30%. To improve the classification performance, a random forest dissimilarity based integration method is used to combine different feature groups together. When the five deep learning feature groups are combined, the average accuracy is improved to 82.90% (best accuracy 85.00%). When handcrafted features are combined with the five deep learning feature groups, the average accuracy is improved to 87.10% (best accuracy 93.00%).

Submitted 29 March, 2018; originally announced March 2018.

arXiv:1803.04460 [pdf, other]

Dissimilarity-based representation for radiomics applications

Authors: Hongliu Cao, Simon Bernard, Laurent Heutte, Robert Sabourin

Abstract: Radiomics is a term which refers to the analysis of the large amount of quantitative tumor features extracted from medical images to find useful predictive, diagnostic or prognostic information. Many recent studies have proved that radiomics can offer a lot of useful information that physicians cannot extract from the medical images and can be associated with other information like gene or protein data. However, most of the classification studies in radiomics report the use of feature selection methods without identifying the machine learning challenges behind radiomics. In this paper, we first show that the radiomics problem should be viewed as a high-dimensional, low sample size, multi-view learning problem, then we compare different solutions proposed in multi-view learning for classifying radiomics data. Our experiments, conducted on several real-world multi-view datasets, show that the intermediate integration methods work significantly better than filter and embedded feature selection methods commonly used in radiomics.

Submitted 12 March, 2018; originally announced March 2018.

Comments: conference, 6 pages, 2 figures

arXiv:1803.03453 [pdf, other]

The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities

Authors: Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J. Bentley, Samuel Bernard, Guillaume Beslon, David M. Bryson, Patryk Chrabaszcz, Nick Cheney, Antoine Cully, Stephane Doncieux, Fred C. Dyer, Kai Olav Ellefsen, Robert Feldt, Stephan Fischer, Stephanie Forrest, Antoine Frénoy, Christian Gagné, Leni Le Goff, Laura M. Grabowski, Babak Hodjat, Frank Hutter , et al. (28 additional authors not shown)

Abstract: Biological evolution provides a creative fount of complex and subtle adaptations, often surprising the scientists who discover them. However, because evolution is an algorithmic process that transcends the substrate in which it occurs, evolution's creativity is not limited to nature. Indeed, many researchers in the field of digital evolution have observed their evolving algorithms and organisms subverting their intentions, exposing unrecognized bugs in their code, producing unexpected adaptations, or exhibiting outcomes uncannily convergent with ones in nature. Such stories routinely reveal creativity by evolution in these digital worlds, but they rarely fit into the standard scientific narrative. Instead they are often treated as mere obstacles to be overcome, rather than results that warrant study in their own right. The stories themselves are traded among researchers through oral tradition, but that mode of information transmission is inefficient and prone to error and outright loss. Moreover, the fact that these stories tend to be shared only among practitioners means that many natural scientists do not realize how interesting and lifelike digital organisms are and how natural their evolution can be. To our knowledge, no collection of such anecdotes has been published before. This paper is the crowd-sourced product of researchers in the fields of artificial life and evolutionary computation who have provided first-hand accounts of such cases. It thus serves as a written, fact-checked collection of scientifically important and even entertaining stories. In doing so we also present here substantial evidence that the existence and importance of evolutionary surprises extends beyond the natural world, and may indeed be a universal property of all complex evolving systems.

Submitted 21 November, 2019; v1 submitted 9 March, 2018; originally announced March 2018.

arXiv:1512.02791 [pdf, ps, other]

Formal Proofs of Transcendence for e and $π$ as an Application of Multivariate and Symmetric Polynomials

Authors: Sophie Bernard, Yves Bertot, Laurence Rideau, Pierre-Yves Strub

Abstract: We describe the formalisation in Coq of a proof that the numbers e and $π$ are transcendental. This proof lies at the interface of two domains of mathematics that are often considered separately: calculus (real and elementary complex analysis) and algebra. For the work on calculus, we rely on the Coquelicot library and for the work on algebra, we rely on the Mathematical Components library. Moreover, some of the elements of our formalized proof originate in the more ancient library for real numbers included in the Coq distribution. The case of $π$ relies extensively on properties of multivariate polynomials and this experiment was also an occasion to put to the test a newly developed library for these multivariate polynomials.

Submitted 9 December, 2015; originally announced December 2015.

Comments: in Jeremy Avigad and Adam Chlipala. Certified Programs and Proofs, Jan 2016, St Petersburg, Florida, United States. ACM Press, pp.12, 2016

arXiv:0805.0851 [pdf, other]

Bounds for self-stabilization in unidirectional networks

Authors: Samuel Bernard, Stéphane Devismes, Maria Gradinariu Potop-Butucaru, Sébastien Tixeuil

Abstract: A distributed algorithm is self-stabilizing if after faults and attacks hit the system and place it in some arbitrary global state, the system recovers from this catastrophic situation without external intervention in finite time. Unidirectional networks preclude many common techniques in self-stabilization from being used, such as preserving local predicates. In this paper, we investigate the intrinsic complexity of achieving self-stabilization in unidirectional networks, and focus on the classical vertex coloring problem. When deterministic solutions are considered, we prove a lower bound of $n$ states per process (where $n$ is the network size) and a recovery time of at least $n(n-1)/2$ actions in total. We present a deterministic algorithm with matching upper bounds that performs in arbitrary graphs. When probabilistic solutions are considered, we observe that at least $Δ+1$ states per process and a recovery time of $Ω(n)$ actions in total are required (where $Δ$ denotes the maximal degree of the underlying simple undirected graph). We present a probabilistically self-stabilizing algorithm that uses $\mathtt{k}$ states per process, where $\mathtt{k}$ is a parameter of the algorithm. When $\mathtt{k}=Δ+1$, the algorithm recovers in expected $O(Δn)$ actions. When $\mathtt{k}$ may grow arbitrarily, the algorithm recovers in expected $O(n)$ actions in total. Thus, our algorithm can be made optimal with respect to space or time complexity.

Submitted 13 May, 2008; v1 submitted 7 May, 2008; originally announced May 2008.

Report number: RR-6524

Showing 1–13 of 13 results for author: Bernard, S