Search | arXiv e-print repository

Injecting Hierarchical Biological Priors into Graph Neural Networks for Flow Cytometry Prediction

Authors: Fatemeh Nassajian Mojarrad, Lorenzo Bini, Thomas Matthes, Stéphane Marchand-Maillet

Abstract: In the complex landscape of hematologic samples such as peripheral blood or bone marrow derived from flow cytometry (FC) data, cell-level prediction presents profound challenges. This work explores injecting hierarchical prior knowledge into graph neural networks (GNNs) for single-cell multi-class classification of tabular cellular data. By representing the data as graphs and encoding hierarchical relationships between classes, we propose a hierarchical plug-in method, FCHC-GNN, that can be applied to several GNN models and is designed to capture neighborhood information crucial for the single-cell FC domain. Extensive experiments on our cohort of 19 distinct patients demonstrate that incorporating hierarchical biological constraints boosts performance significantly across multiple metrics compared to baseline GNNs without such priors. The proposed approach highlights the importance of structured inductive biases for gaining improved generalization in complex biological prediction tasks.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 14 pages, ICML Conference Workshop 2024. arXiv admin note: text overlap with arXiv:2402.18610

arXiv:2403.00024 [pdf, other]

FlowCyt: A Comparative Study of Deep Learning Approaches for Multi-Class Classification in Flow Cytometry Benchmarking

Authors: Lorenzo Bini, Fatemeh Nassajian Mojarrad, Margarita Liarou, Thomas Matthes, Stéphane Marchand-Maillet

Abstract: This paper presents FlowCyt, the first comprehensive benchmark for multi-class single-cell classification in flow cytometry data. The dataset comprises bone marrow samples from 30 patients, with each cell characterized by twelve markers. Ground truth labels identify five hematological cell types: T lymphocytes, B lymphocytes, Monocytes, Mast cells, and Hematopoietic Stem/Progenitor Cells (HSPCs). Experiments utilize supervised inductive learning and semi-supervised transductive learning on up to 1 million cells per patient. Baseline methods include Gaussian Mixture Models, XGBoost, Random Forests, Deep Neural Networks, and Graph Neural Networks (GNNs). GNNs demonstrate superior performance by exploiting spatial relationships in graph-encoded data. The benchmark allows standardized evaluation of clinically relevant classification tasks, along with exploratory analyses to gain insights into hematological cell phenotypes. This represents the first public flow cytometry benchmark with a richly annotated, heterogeneous dataset. It will empower the development and rigorous assessment of novel methodologies for single-cell analysis.

Submitted 25 April, 2024; v1 submitted 28 February, 2024; originally announced March 2024.

Comments: arXiv admin note: text overlap with arXiv:2402.18611

arXiv:2402.18611 [pdf, other]

HemaGraph: Breaking Barriers in Hematologic Single Cell Classification with Graph Attention

Authors: Lorenzo Bini, Fatemeh Nassajian Mojarrad, Thomas Matthes, Stéphane Marchand-Maillet

Abstract: In the realm of hematologic cell populations classification, the intricate patterns within flow cytometry data necessitate advanced analytical tools. This paper presents 'HemaGraph', a novel framework based on Graph Attention Networks (GATs) for single-cell multi-class classification of hematological cells from flow cytometry data. Harnessing the power of GATs, our method captures subtle cell relationships, offering highly accurate patient profiling. Based on evaluation of data from 30 patients, HemaGraph demonstrates classification performance across five different cell classes, outperforming traditional methodologies and state-of-the-art methods. Moreover, the uniqueness of this framework lies in the training and testing phase of HemaGraph, where it has been applied for extremely large graphs, containing up to hundreds of thousands of nodes and two million edges, to detect low frequency cell populations (e.g. 0.01% for one population), with accuracies reaching 98%. Our findings underscore the potential of HemaGraph in improving hematologic multi-class classification, paving the way for patient-personalized interventions. To the best of our knowledge, this is the first effort to use GATs, and Graph Neural Networks (GNNs) in general, to classify cell populations from single-cell flow cytometry data. We envision applying this method to single-cell data from a larger cohort of patients and on other hematologic diseases.

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.18610 [pdf, other]

Why Attention Graphs Are All We Need: Pioneering Hierarchical Classification of Hematologic Cell Populations with LeukoGraph

Authors: Fatemeh Nassajian Mojarrad, Lorenzo Bini, Thomas Matthes, Stéphane Marchand-Maillet

Abstract: In the complex landscape of hematologic samples such as peripheral blood or bone marrow, cell classification, delineating diverse populations into a hierarchical structure, presents profound challenges. This study presents LeukoGraph, a recently developed framework designed explicitly for this purpose employing graph attention networks (GATs) to navigate hierarchical classification (HC) complexities. Notably, LeukoGraph stands as a pioneering effort, marking the application of graph neural networks (GNNs) for hierarchical inference on graphs, accommodating up to one million nodes and millions of edges, all derived from flow cytometry data. LeukoGraph intricately addresses a classification paradigm where for example four different cell populations undergo flat categorization, while a fifth diverges into two distinct child branches, exemplifying the nuanced hierarchical structure inherent in complex datasets. The technique is more general than this example. A hallmark achievement of LeukoGraph is its F-score of 98%, significantly outclassing prevailing state-of-the-art methodologies. Crucially, LeukoGraph's prowess extends beyond theoretical innovation, showcasing remarkable precision in predicting both flat and hierarchical cell types across flow cytometry datasets from 30 distinct patients. This precision is further underscored by LeukoGraph's ability to maintain a correct label ratio, despite the inherent challenges posed by hierarchical classifications.

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2306.11122 [pdf, other]

Supervised Auto-Encoding Twin-Bottleneck Hashing

Authors: Yuan Chen, Stéphane Marchand-Maillet

Abstract: Deep hashing has been shown to be a complexity-efficient solution for the Approximate Nearest Neighbor search problem in high dimensional space. Many methods usually build the loss function from pairwise or triplet data points to capture the local similarity structure. Other existing methods construct the similarity graph and consider all points simultaneously. Auto-encoding Twin-bottleneck Hashing is one such method that dynamically builds the graph. Specifically, each input data point is encoded into a binary code and a continuous variable, the so-called twin bottlenecks. The similarity graph is then computed from these binary codes, which get updated consistently during the training. In this work, we generalize the original model into a supervised deep hashing network by incorporating the label information. In addition, we examine the differences in code structure between these two networks and consider the class imbalance problem, especially in multi-labeled datasets. Experiments on three datasets yield statistically significant improvement against the original model. Results are also comparable and competitive to other supervised methods.

Submitted 19 June, 2023; originally announced June 2023.

arXiv:2201.10227 [pdf, other]

Cold Start Active Learning Strategies in the Context of Imbalanced Classification

Authors: Etienne Brangbour, Pierrick Bruneau, Thomas Tamisier, Stéphane Marchand-Maillet

Abstract: We present novel active learning strategies dedicated to providing a solution to the cold start stage, i.e. initializing the classification of a large set of data with no attached labels. Moreover, the proposed strategies are designed to handle an imbalanced context in which random selection is highly inefficient. Specifically, our active learning iterations address label scarcity and imbalance using element scores, combining information extracted from a clustering structure with a label propagation model. The strategy is illustrated by a case study on annotating Twitter content w.r.t. testimonies of a real flood event. We show that our method effectively copes with class imbalance, by boosting the recall of samples from the minority class.

Submitted 25 January, 2022; originally announced January 2022.

Comments: 13 pages, submitted to PAKDD 2021, eventually rejected

arXiv:2201.06329 [pdf, other]

H&E-adversarial network: a convolutional neural network to learn stain-invariant features through Hematoxylin & Eosin regression

Authors: Niccoló Marini, Manfredo Atzori, Sebastian Otálora, Stephane Marchand-Maillet, Henning Müller

Abstract: Computational pathology is a domain that aims to develop algorithms to automatically analyze large digitized histopathology images, called whole slide images (WSI). WSIs are produced by scanning thin tissue samples that are stained to make specific structures visible. They show stain colour heterogeneity due to different preparation and scanning settings applied across medical centers. Stain colour heterogeneity is a problem when training convolutional neural networks (CNN), the state-of-the-art algorithms for most computational pathology tasks, since CNNs usually underperform when tested on images including different stain variations than those within data used to train the CNN. Despite several methods that were developed, stain colour heterogeneity is still an unsolved challenge that limits the development of CNNs that can generalize on data from several medical centers. This paper aims to present a novel method to train CNNs that better generalize on data including several colour variations. The method, called H&E-adversarial CNN, exploits H&E matrix information to learn stain-invariant features during the training. The method is evaluated on the classification of colon and prostate histopathology images, involving eleven heterogeneous datasets, and compared with five other techniques used to handle stain colour heterogeneity. H&E-adversarial CNNs show an improvement in performance compared to the other algorithms, demonstrating that the method can help to better deal with stain colour heterogeneous images.

Submitted 19 January, 2022; v1 submitted 17 January, 2022; originally announced January 2022.

Comments: Errata corrige Proceedings of the IEEE/CVF International Conference on Computer Vision 2021

arXiv:2107.01407 [pdf, other]

Optimality Inductive Biases and Agnostic Guidelines for Offline Reinforcement Learning

Authors: Lionel Blondé, Alexandros Kalousis, Stéphane Marchand-Maillet

Abstract: The performance of state-of-the-art offline RL methods varies widely over the spectrum of dataset qualities, ranging from far-from-optimal random data to close-to-optimal expert demonstrations. We re-implement these methods to test their reproducibility, and show that when a given method outperforms the others on one end of the spectrum, it never does on the other end. This prevents us from naming a victor across the board. We attribute the asymmetry to the amount of inductive bias injected into the agent to entice it to posit that the behavior underlying the offline dataset is optimal for the task. Our investigations confirm that careless injections of such optimality inductive biases make dominant agents subpar as soon as the offline policy is sub-optimal. To bridge this gap, we generalize importance-weighted regression methods that have proved the most versatile across the spectrum of dataset grades into a modular framework that allows for the design of methods that align with how much we know about the dataset. This modularity enables qualitatively different injections of optimality inductive biases. We show that certain orchestrations strike the right balance, improving the return on one end of the spectrum without harming it on the other end. While the formulation of guidelines for the design of an offline method reduces to aligning the amount of optimality bias to inject with what we know about the quality of the data, the design of an agnostic method for which we need not know the quality of the data beforehand is more nuanced. Only our framework allowed us to design a method that performed well across the spectrum while remaining modular if more information about the quality of the data ever becomes available.

Submitted 19 January, 2022; v1 submitted 3 July, 2021; originally announced July 2021.

arXiv:2106.00358 [pdf, other]

Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features

Authors: Nicola Messina, Giuseppe Amato, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet

Abstract: Cross-modal retrieval is an important functionality in modern search engines, as it improves the user experience by allowing queries and retrieved objects to pertain to different modalities. In this paper, we focus on the image-sentence retrieval task, where the objective is to efficiently find relevant images for a given sentence (image-retrieval) or the relevant sentences for a given image (sentence-retrieval). Computer vision literature reports the best results on the image-sentence matching task using deep neural networks equipped with attention and self-attention mechanisms. They evaluate the matching performance on the retrieval task by performing sequential scans of the whole dataset. This method does not scale well with an increasing amount of images or captions. In this work, we explore different preprocessing techniques to produce sparsified deep multi-modal features, extracting them from state-of-the-art deep-learning architectures for image-text matching. Our main objective is to lay down the paths for efficient indexing of complex multi-modal descriptions. We use the recently introduced TERN architecture as an image-sentence features extractor. It is designed for producing fixed-size 1024-d vectors describing whole images and sentences, as well as variable-length sets of 1024-d vectors describing the various building components of the two modalities (image regions and sentence words respectively). All these vectors are enforced by the TERN design to lie in the same common space. Our experiments show interesting preliminary results on the explored methods and suggest further experimentation in this important research direction.

Submitted 1 June, 2021; originally announced June 2021.

Comments: Accepted at CBMI 2021

arXiv:2012.03731 [pdf, other]

doi 10.5065/y82j-f154

Computing flood probabilities using Twitter: application to the Houston urban area during Harvey

Authors: Etienne Brangbour, Pierrick Bruneau, Stéphane Marchand-Maillet, Renaud Hostache, Marco Chini, Patrick Matgen, Thomas Tamisier

Abstract: In this paper, we investigate the conversion of a Twitter corpus into geo-referenced raster cells holding the probability of the associated geographical areas of being flooded. We describe a baseline approach that combines a density ratio function, aggregation using a spatio-temporal Gaussian kernel function, and TFIDF textual features. The features are transformed to probabilities using a logistic regression model. The described method is evaluated on a corpus collected after the floods that followed Hurricane Harvey in the Houston urban area in August-September 2017. The baseline reaches an F1 score of 68%. We highlight research directions likely to improve these initial results.

Submitted 7 December, 2020; originally announced December 2020.

Comments: 5 pages, 1 figure. Published in Proceedings of the 9th International Workshop on Climate Informatics: CI 2019

arXiv:2008.05231 [pdf, other]

Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders

Authors: Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, Stéphane Marchand-Maillet

Abstract: Despite the evolution of deep-learning-based visual-textual processing systems, precise multi-modal matching remains a challenging task. In this work, we tackle the task of cross-modal retrieval through image-sentence matching based on word-region alignments, using supervision only at the global image-sentence level. Specifically, we present a novel approach called Transformer Encoder Reasoning and Alignment Network (TERAN). TERAN enforces a fine-grained match between the underlying components of images and sentences, i.e., image regions and words, respectively, in order to preserve the informative richness of both modalities. TERAN obtains state-of-the-art results on the image retrieval task on both MS-COCO and Flickr30k datasets. Moreover, on MS-COCO, it also outperforms current approaches on the sentence retrieval task. Focusing on scalable cross-modal information retrieval, TERAN is designed to keep the visual and textual data pipelines well separated. Cross-attention links invalidate any chance to separately extract visual and textual features needed for the online search and the offline indexing steps in large-scale retrieval systems. In this respect, TERAN merges the information from the two domains only during the final alignment phase, immediately before the loss computation. We argue that the fine-grained alignments produced by TERAN pave the way towards the research for effective and efficient methods for large-scale cross-modal information retrieval. We compare the effectiveness of our approach against relevant state-of-the-art methods. On the MS-COCO 1K test set, we obtain an improvement of 5.7% and 3.5% respectively on the image and the sentence retrieval tasks on the Recall@1 metric. The code used for the experiments is publicly available on GitHub at https://github.com/mesnico/TERAN.

Submitted 2 March, 2021; v1 submitted 12 August, 2020; originally announced August 2020.

Comments: Accepted in ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM). arXiv admin note: text overlap with arXiv:2004.09144

arXiv:2008.01478 [pdf, other]

doi 10.59275/j.melba.2023-3462

Learning Interpretable Microscopic Features of Tumor by Multi-task Adversarial CNNs To Improve Generalization

Authors: Mara Graziani, Sebastian Otalora, Stephane Marchand-Maillet, Henning Muller, Vincent Andrearczyk

Abstract: Adopting Convolutional Neural Networks (CNNs) in the daily routine of primary diagnosis requires not only near-perfect precision, but also a sufficient degree of generalization to data acquisition shifts and transparency. Existing CNN models act as black boxes, giving physicians no assurance that important diagnostic features are used by the model. Building on top of successful existing techniques such as multi-task learning, domain adversarial training and concept-based interpretability, this paper addresses the challenge of introducing diagnostic factors in the training objectives. Here we show that our architecture, by learning end-to-end an uncertainty-based weighting combination of multi-task and adversarial losses, is encouraged to focus on pathology features such as density and pleomorphism of nuclei, e.g. variations in size and appearance, while discarding misleading features such as staining differences. Our results on breast lymph node tissue show significantly improved generalization in the detection of tumorous tissue, with best average AUC 0.89 (0.01) against the baseline AUC 0.86 (0.005). By applying the interpretability technique of linearly probing intermediate representations, we also demonstrate that interpretable pathology features such as nuclei density are learned by the proposed CNN architecture, confirming the increased transparency of this model. This result is a starting point towards building interpretable multi-task architectures that are robust to data heterogeneity. Our code is available at https://github.com/maragraziani/multitask_adversarial

Submitted 21 June, 2023; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2023:011

arXiv:2003.07323 [pdf, other]

Tuning Ranking in Co-occurrence Networks with General Biased Exchange-based Diffusion on Hyper-bag-graphs

Authors: Xavier Ouvrard, Jean-Marie Le Goff, Stéphane Marchand-Maillet

Abstract: Co-occurrence networks can be adequately modeled by hyper-bag-graphs (hb-graphs for short). A hb-graph is a family of multisets having the same universe, called the vertex set. An efficient exchange-based diffusion scheme has been previously proposed that allows the ranking of both vertices and hb-edges. In this article, we extend this scheme to allow biases of different kinds and explore their effect on the different rankings obtained. The biases enhance the emphasis on some particular aspects of the network.

Submitted 16 March, 2020; originally announced March 2020.

arXiv:1905.11695 [pdf, other]

The HyperBagGraph DataEdron: An Enriched Browsing Experience of Multimedia Datasets

Authors: Xavier Ouvrard, Jean-Marie Le Goff, Stéphane Marchand-Maillet

Abstract: Traditional verbatim browsers give back information in a linear way according to a ranking performed by a search engine that may not be optimal for the surfer. The latter may need to assess the pertinence of the information retrieved, particularly when s$\cdot$he wants to explore other facets of a multi-faceted information space. For instance, in a multimedia dataset different facets such as keywords, authors, publication category, organisations and figures can be of interest. The simultaneous visualisation of facets can help to gain insights on the information retrieved and call for further searches. Facets are co-occurrence networks, modeled by HyperBag-Graphs -- families of multisets -- and are in fact linked not only to the publication itself, but to any chosen reference. These references make it possible to navigate inside the dataset and perform visual queries. We explore here the case of scientific publications based on Arxiv searches.

Submitted 28 May, 2019; originally announced May 2019.

Comments: Extension of the hypergraph framework shortly presented in arXiv:1809.00164 (possible small overlaps); use the theoretical framework of hb-graphs presented in arXiv:1809.00190

arXiv:1905.11245 [pdf, other]

Learning by stochastic serializations

Authors: Pablo Strasser, Stephane Armand, Stephane Marchand-Maillet, Alexandros Kalousis

Abstract: Complex structures are typical in machine learning. Tailoring learning algorithms for every structure requires an effort that may be saved by defining a generic learning procedure adaptive to any complex structure. In this paper, we propose to map any complex structure onto a generic form, called serialization, over which we can apply any sequence-based density estimator. We then show how to transfer the learned density back onto the space of original structures. To expose the learning procedure to the structural particularities of the original structures, we take care that the serializations reflect accurately the structures' properties. Enumerating all serializations is infeasible. We propose an effective way to sample representative serializations from the complete set of serializations which preserves the statistics of the complete set. Our method is competitive with or better than state-of-the-art learning algorithms that have been specifically designed for given structures. In addition, since the serialization involves sampling from a combinatorial process, it provides considerable protection from overfitting, which we clearly demonstrate in a number of experiments.

Submitted 27 May, 2019; originally announced May 2019.

Comments: Submission to NeurIPS 2019

arXiv:1903.04748 [pdf, other]

Extracting localized information from a Twitter corpus for flood prevention

Authors: Etienne Brangbour, Pierrick Bruneau, Stéphane Marchand-Maillet, Renaud Hostache, Patrick Matgen, Marco Chini, Thomas Tamisier

Abstract: In this paper, we discuss the collection of a corpus associated with tropical storm Harvey, as well as its analysis from both spatial and topical perspectives. From the spatial perspective, our goal here is to get a first estimation of the quality and precision of the geographical information featured in the collected corpus. From a topical perspective, we discuss the representation of Twitter posts, and strategies to process an initially unlabeled corpus of tweets.

Submitted 10 May, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

arXiv:1809.00190 [pdf, other]

Exchange-Based Diffusion in Hb-Graphs: Highlighting Complex Relationships

Authors: Xavier Ouvrard, Jean-Marie Le Goff, Stephane Marchand-Maillet

Abstract: Most networks tend to show complex and multiple relationships between entities. Networks are usually modeled by graphs or hypergraphs; nonetheless a given entity can occur many times in a relationship: this brings the need to deal with multisets instead of sets or simple edges. Diffusion processes are useful to highlight interesting parts of a network: they usually start with a stroke at one vertex and diffuse throughout the network to reach a uniform distribution. Several iterations of the process are required prior to reaching a stable solution. We propose an alternative solution to highlighting the main components of a network using a diffusion process based on exchanges: it is an iterative two-phase step exchange process. This process makes it possible to evaluate the importance not only of the vertices but also of the regrouping level. To model the diffusion process, we extend the concept of hypergraphs that are families of sets to families of multisets, that we call hb-graphs. This version is an extended version of arXiv:1809.00190v1: the overlaps with the v1 are in black, the new content is in blue. The contributions of this extended version are: the proofs of conservation and convergence of the extracted sequences of the diffusion process, as well as the illustration of the speed of convergence and comparison to classical and modified random walks; the algorithms of the exchange-based diffusion and the modified random walk; the application to a use case based on Arxiv publications. All the figures except one have been either modified or added in this extended version to take into account the new developments.

Submitted 28 May, 2019; v1 submitted 1 September, 2018; originally announced September 2018.

Comments: arXiv:1809.00190v1: Accepted version of article submitted at CBMI 2018 IEEE. This version is an extended version of arXiv:1809.00190v1, currently in submission

arXiv:1809.00164 [pdf, ps, other]

Hypergraph Modeling and Visualisation of Complex Co-occurence Networks

Authors: Xavier Ouvrard, Jean-Marie Le Goff, Stephane Marchand-Maillet

Abstract: Finding inherent or processed links within a dataset allows to discover potential knowledge. The main contribution of this article is to define a global framework that enables optimal knowledge discovery by visually rendering co-occurences (i.e. groups of linked data instances attached to a metadata reference) - either inherently present or processed - from a dataset as facets. Hypergraphs are well suited for modeling co-occurences since they support multi-adicity whereas graphs only support pairwise relationships. This article introduces an efficient navigation between different facets of an information space based on hypergraph modelisation and visualisation.

Submitted 1 September, 2018; originally announced September 2018.

Comments: Preprint submitted at ENDM Special Journal 2nd IMA Conference on Theoretical and Computational Discrete Mathematics

arXiv:1809.00162 [pdf, ps, other]

On Adjacency and e-Adjacency in General Hypergraphs: Towards a New e-Adjacency Tensor

Authors: Xavier Ouvrard, Jean-Marie Le Goff, Stephane Marchand-Maillet

Abstract: In graphs, the concept of adjacency is clearly defined: it is a pairwise relationship between vertices. Adjacency in hypergraphs has to integrate hyperedge multi-adicity: the concept of adjacency needs to be defined properly by introducing two new concepts: $k$-adjacency - $k$ vertices are in the same hyperedge - and e-adjacency - vertices of a given hyperedge are e-adjacent. In order to build a new e-adjacency tensor that is interpretable in terms of hypergraph uniformisation, we designed two processes: the first is a hypergraph uniformisation process (HUP) and the second is a polynomial homogeneisation process (PHP). The PHP allows the construction of the e-adjacency tensor while the HUP ensures that the PHP keeps interpretability. This tensor is symmetric and can be fully described by the number of hyperedges; its order is the range of the hypergraph, while extra dimensions allow to capture additional hypergraph structural information including the maximum level of $k$-adjacency of each hyperedge. Some results on spectral analysis are discussed.

Submitted 1 September, 2018; originally announced September 2018.

Comments: Preprint submitted to ENDM special journal 2nd IMA Conference on Theoretical and Computational Discrete Mathematics

arXiv:1805.11952 [pdf, ps, other]

Adjacency and Tensor Representation in General Hypergraphs. Part 2: Multisets, Hb-graphs and Related e-adjacency Tensors

Authors: Xavier Ouvrard, Jean-Marie Le Goff, Stephane Marchand-Maillet

Abstract: HyperBagGraphs (hb-graphs as short) extend hypergraphs by allowing the hyperedges to be multisets. Multisets are composed of elements that have a multiplicity. When this multiplicity has positive integer values, it corresponds to non ordered lists of potentially duplicated elements. We define hb-graphs as family of multisets over a vertex set; natural hb-graphs correspond to hb-graphs that have multiplicity function with positive integer values. Extending the definition of e-adjacency to natural hb-graphs, we define different way of building an e-adjacency tensor, that we compare before having a final choice of the tensor. This hb-graph e-adjacency tensor is used with hypergraphs.

Submitted 18 September, 2018; v1 submitted 30 May, 2018; originally announced May 2018.

arXiv:1805.06258 [pdf, other]

Structured nonlinear variable selection

Authors: Magda Gregorová, Alexandros Kalousis, Stéphane Marchand-Maillet

Abstract: We investigate structured sparsity methods for variable selection in regression problems where the target depends nonlinearly on the inputs. We focus on general nonlinear functions not limiting a priori the function space to additive models. We propose two new regularizers based on partial derivatives as nonlinear equivalents of group lasso and elastic net. We formulate the problem within the framework of learning in reproducing kernel Hilbert spaces and show how the variational problem can be reformulated into a more practical finite dimensional equivalent. We develop a new algorithm derived from the ADMM principles that relies solely on closed forms of the proximal operators. We explore the empirical properties of our new algorithm for Nonlinear Variable Selection based on Derivatives (NVSD) on a set of experiments and confirm favourable properties of our structured-sparsity models and the algorithm in terms of both prediction and variable selection accuracy.

Submitted 16 May, 2018; originally announced May 2018.

Comments: Accepted to UAI2018

arXiv:1804.07169 [pdf, ps, other]

Large-scale Nonlinear Variable Selection via Kernel Random Features

Authors: Magda Gregorová, Jason Ramapuram, Alexandros Kalousis, Stéphane Marchand-Maillet

Abstract: We propose a new method for input variable selection in nonlinear regression. The method is embedded into a kernel regression machine that can model general nonlinear functions, not being a priori limited to additive models. This is the first kernel-based variable selection method applicable to large datasets. It sidesteps the typical poor scaling properties of kernel methods by mapping the inputs into a relatively low-dimensional space of random features. The algorithm discovers the variables relevant for the regression task together with learning the prediction model through learning the appropriate nonlinear random feature maps. We demonstrate the outstanding performance of our method on a set of large-scale synthetic and real datasets.

Submitted 1 September, 2018; v1 submitted 19 April, 2018; originally announced April 2018.

Comments: Final version for proceedings of ECML/PKDD 2018

arXiv:1712.08189 [pdf, ps, other]

Adjacency and Tensor Representation in General Hypergraphs Part 1: e-adjacency Tensor Uniformisation Using Homogeneous Polynomials

Authors: Xavier Ouvrard, Jean-Marie Le Goff, Stéphane Marchand-Maillet

Abstract: Adjacency between two vertices in graphs or hypergraphs is a pairwise relationship. It is redefined in this article as 2-adjacency. In general hypergraphs, hyperedges hold for $n$-adic relationship. To keep the $n$-adic relationship the concepts of $k$-adjacency and e-adjacency are defined. In graphs 2-adjacency and e-adjacency concepts match, just as $k$-adjacency and e-adjacency do for $k$-uniform hypergraphs. For general hypergraphs these concepts are different. This paper also contributes in a uniformization process of a general hypergraph to allow the definition of an e-adjacency tensor, viewed as a hypermatrix, reflecting the general hypergraph structure. This symmetric e-adjacency hypermatrix allows to capture not only the degree of the vertices and the cardinality of the hyperedges but also makes a full separation of the different layers of a hypergraph.

Submitted 30 May, 2018; v1 submitted 21 December, 2017; originally announced December 2017.

arXiv:1707.00115 [pdf, other]

Networks of Collaborations: Hypergraph Modeling and Visualisation

Authors: Xavier Ouvrard, Jean-Marie Le Goff, Stéphane Marchand-Maillet

Abstract: The acknowledged model for networks of collaborations is the hypergraph model. Nonetheless when it comes to be visualized hypergraphs are transformed into simple graphs. Very often, the transformation is made by clique expansion of the hyperedges resulting in a loss of information for the user and in artificially more complex graphs due to the high number of edges represented. The extra-node representation gives substantial improvement in the visualisation of hypergraphs and in the retrieval of information. This paper aims at showing qualitatively and quantitatively how the extra-node representation can improve the visualisation of hypergraphs without loss of information.

Submitted 1 July, 2017; originally announced July 2017.

Comments: 24 pages, 9 figures

arXiv:1706.08811 [pdf, ps, other]

Forecasting and Granger Modelling with Non-linear Dynamical Dependencies

Authors: Magda Gregorová, Alexandros Kalousis, Stéphane Marchand-Maillet

Abstract: Traditional linear methods for forecasting multivariate time series are not able to satisfactorily model the non-linear dependencies that may exist in non-Gaussian series. We build on the theory of learning vector-valued functions in the reproducing kernel Hilbert space and develop a method for learning prediction functions that accommodate such non-linearities. The method not only learns the predictive function but also the matrix-valued kernel underlying the function search space directly from the data. Our approach is based on learning multiple matrix-valued kernels, each of those composed of a set of input kernels and a set of output kernels learned in the cone of positive semi-definite matrices. In addition to superior predictive performance in the presence of strong non-linearities, our method also recovers the hidden dynamic relationships between the series and thus is a new alternative to existing graphical Granger techniques.

Submitted 27 June, 2017; originally announced June 2017.

Comments: Accepted for ECML-PKDD 2017

arXiv:1701.03916 [pdf, other]

doi 10.3390/e19030122

On Hölder projective divergences

Authors: Frank Nielsen, Ke Sun, Stéphane Marchand-Maillet

Abstract: We describe a framework to build distances by measuring the tightness of inequalities, and introduce the notion of proper statistical divergences and improper pseudo-divergences. We then consider the Hölder ordinary and reverse inequalities, and present two novel classes of Hölder divergences and pseudo-divergences that both encapsulate the special case of the Cauchy-Schwarz divergence. We report closed-form formulas for those statistical dissimilarities when considering distributions belonging to the same exponential family provided that the natural parameter space is a cone (e.g., multivariate Gaussians), or affine (e.g., categorical distributions). Those new classes of Hölder distances are invariant to rescaling, and thus do not require distributions to be normalized. Finally, we show how to compute statistical Hölder centroids with respect to those divergences, and carry out center-based clustering toy experiments on a set of Gaussian distributions that demonstrate empirically that symmetrized Hölder divergences outperform the symmetric Cauchy-Schwarz divergence.

Submitted 14 January, 2017; originally announced January 2017.

Comments: 25 pages

arXiv:1507.01978 [pdf, other]

Learning Leading Indicators for Time Series Predictions

Authors: Magda Gregorova, Alexandros Kalousis, Stéphane Marchand-Maillet

Abstract: We consider the problem of learning models for forecasting multiple time-series systems together with discovering the leading indicators that serve as good predictors for the system. We model the systems by linear vector autoregressive models (VAR) and link the discovery of leading indicators to inferring sparse graphs of Granger-causality. We propose new problem formulations and develop two new methods to learn such models, gradually increasing the complexity of assumptions and approaches. While the first method assumes common structures across the whole system, our second method uncovers model clusters based on the Granger-causality and leading indicators together with learning the model parameters. We study the performance of our methods on a comprehensive set of experiments and confirm their efficacy and their advantages over state-of-the-art sparse VAR and graphical Granger learning methods.

Submitted 2 November, 2016; v1 submitted 7 July, 2015; originally announced July 2015.

Comments: Changed title plus minor updates in the text

arXiv:1405.2798 [pdf, other]

Two-Stage Metric Learning

Authors: Jun Wang, Ke Sun, Fei Sha, Stephane Marchand-Maillet, Alexandros Kalousis

Abstract: In this paper, we present a novel two-stage metric learning algorithm. We first map each learning instance to a probability distribution by computing its similarities to a set of fixed anchor points. Then, we define the distance in the input data space as the Fisher information distance on the associated statistical manifold. This induces in the input data space a new family of distance metric with unique properties. Unlike kernelized metric learning, we do not require the similarity measure to be positive semi-definite. Moreover, it can also be interpreted as a local metric learning algorithm with well defined distance approximation. We evaluate its performance on a number of datasets. It outperforms significantly other metric learning methods and SVM.

Submitted 12 May, 2014; originally announced May 2014.

Comments: Accepted for publication in ICML 2014

Showing 1–28 of 28 results for author: Marchand-Maillet, S