-
The Canadian Cropland Dataset: A New Land Cover Dataset for Multitemporal Deep Learning Classification in Agriculture
Authors:
Amanda A. Boatswain Jacques,
Abdoulaye Baniré Diallo,
Etienne Lord
Abstract:
Monitoring land cover using remote sensing is vital for studying environmental changes and ensuring global food security through crop yield forecasting. Specifically, multitemporal remote sensing imagery provides relevant information about the dynamics of a scene, which has been shown to improve land cover classification results. Nevertheless, few studies have benefited from high spatial and temporal resolution data due to the difficulty of accessing reliable, fine-grained and high-quality annotated samples to support their hypotheses. Therefore, we introduce a temporal patch-based dataset of Canadian croplands, enriched with labels retrieved from the Canadian Annual Crop Inventory. The dataset contains 78,536 manually verified high-resolution (10 m/pixel, 640 x 640 m) geo-referenced images from 10 crop classes collected over four crop production years (2017-2020) and five months (June-October). Each instance contains 12 spectral bands, an RGB image, and additional vegetation index bands. Individually, each category contains at least 4,800 images. Moreover, as a benchmark, we provide models and source code that allow a user to predict the crop class using a single image (ResNet, DenseNet, EfficientNet) or a sequence of images (LRCN, 3D-CNN) from the same location. Looking ahead, we expect this evolving dataset to propel the creation of robust agro-environmental models that can accelerate the comprehension of complex agricultural regions by providing accurate and continuous monitoring of land cover.
Submitted 4 June, 2023; v1 submitted 31 May, 2023;
originally announced June 2023.
-
Prior Density Learning in Variational Bayesian Phylogenetic Parameters Inference
Authors:
Amine M. Remita,
Golrokh Vitae,
Abdoulaye Baniré Diallo
Abstract:
Advances in variational inference are opening promising paths in Bayesian estimation problems. These advances make variational phylogenetic inference an alternative approach to Markov Chain Monte Carlo methods for approximating the phylogenetic posterior. However, one of the main drawbacks of such approaches is modelling the prior through fixed distributions, which could bias the posterior approximation if they are distant from the current data distribution. In this paper, we propose an approach and an implementation framework to relax the rigidity of the prior densities by learning their parameters using a gradient-based method and a neural network-based parameterization. We applied this approach to the estimation of branch lengths and evolutionary parameters under several Markov chain substitution models. Simulation results show that the approach is powerful in estimating branch lengths and evolutionary model parameters. They also show that a flexible prior model could provide better results than a predefined prior model. Finally, the results highlight that using neural networks improves the initialization of the optimization of the prior density parameters.
Submitted 8 September, 2023; v1 submitted 5 February, 2023;
originally announced February 2023.
-
EvoVGM: a Deep Variational Generative Model for Evolutionary Parameter Estimation
Authors:
Amine M. Remita,
Abdoulaye Baniré Diallo
Abstract:
Most evolutionary-oriented deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences, as is done within the Bayesian phylogenetic inference framework. In this study, we propose a method for a deep variational Bayesian generative model (EvoVGM) that jointly approximates the true posterior of local evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as JC69, K80 and GTR. We train the model via a low-variance stochastic estimator and a gradient ascent algorithm. Here, we analyze the consistency and effectiveness of EvoVGM on synthetic sequence alignments simulated with several evolutionary scenarios and different sizes. Finally, we highlight the robustness of a fine-tuned EvoVGM model using a sequence alignment of gene S of coronaviruses.
Submitted 30 June, 2022; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Etude de classification des bacteriophages
Authors:
Dung Nguyen,
Alix Boc,
Abdoulaye Banire Diallo,
Vladimir Makarenkov
Abstract:
Phages are among the most abundant groups of organisms in the biosphere. Their identification is ongoing, and their taxonomies diverge. However, because of their mode of evolution and the complexity of their species ecosystem, their classification remains incomplete. Here, we present a new approach to phage classification that combines methods for horizontal gene transfer detection and ancestral sequence reconstruction.
Submitted 1 January, 2022;
originally announced January 2022.
-
Supporting supervised learning in fungal Biosynthetic Gene Cluster discovery: new benchmark datasets
Authors:
Hayda Almeida,
Adrian Tsang,
Abdoulaye Baniré Diallo
Abstract:
Fungal Biosynthetic Gene Clusters (BGCs) of secondary metabolites are clusters of genes capable of producing natural products, compounds that play an important role in the production of a wide variety of bioactive compounds, including antibiotics and pharmaceuticals. Identifying BGCs can lead to the discovery of novel natural products to benefit human health. Previous work has focused on developing automatic tools to support BGC discovery in plants, fungi, and bacteria. Data-driven methods, as well as probabilistic and supervised learning methods, have been explored in identifying BGCs. Most methods applied to identify fungal BGCs were data-driven and presented limited scope. Supervised learning methods have been shown to perform well at identifying BGCs in bacteria, and could be well suited to perform the same task in fungi. However, labeled data instances are needed to perform supervised learning. Openly accessible BGC databases contain only a very small portion of previously curated fungal BGCs. Making new fungal BGC datasets available could motivate the development of supervised learning methods for fungal BGCs and potentially improve prediction performance compared to data-driven methods. In this work, we propose new publicly available fungal BGC datasets to support the BGC discovery task using supervised learning. These datasets are prepared to perform binary classification and predict candidate BGC regions in fungal genomes. In addition, we analyse the performance of a well-supported supervised learning tool developed to predict BGCs.
Submitted 9 January, 2020;
originally announced January 2020.
-
Statistical Linear Models in Virus Genomic Alignment-free Classification: Application to Hepatitis C Viruses
Authors:
Amine M. Remita,
Abdoulaye Baniré Diallo
Abstract:
Viral sequence classification is an important task in pathogen detection, epidemiological surveys and evolutionary studies. Statistical learning methods are widely used to classify and identify viral sequences in samples from environments. These methods face several challenges associated with the nature and properties of viral genomes such as recombination, mutation rate and diversity. Also, new generations of sequencing technologies raise other difficulties by generating massive amounts of fragmented sequences. While linear classifiers are often used to classify viruses, there is a lack of exploration of the accuracy space of existing models in the context of alignment-free approaches. In this study, we present an exhaustive assessment procedure exploring the power of linear classifiers in genotyping and subtyping partial and complete genomes. It is applied to the Hepatitis C viruses (HCV). Several variables are considered in this investigation such as classifier types (generative and discriminative) and their hyper-parameters (smoothing value and regularization penalty function), the classification task (genotyping and subtyping), the length of the tested sequences (partial and complete) and the length of k-mer words. Overall, several classifiers perform well given a precise combination of the experimental variables mentioned above. Finally, we provide the procedure and benchmark data to allow for more robust assessment of classification from virus genomes.
Submitted 28 May, 2024; v1 submitted 11 October, 2019;
originally announced October 2019.
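As an illustration of the alignment-free representation this abstract refers to, the sketch below turns a nucleotide sequence into a fixed-length vector of k-mer counts, which a linear classifier could then consume. It is a minimal sketch and not the authors' pipeline: the function names `kmer_counts` and `kmer_vector`, the restriction to the ACGT alphabet, and the use of raw counts rather than any normalization are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: ambiguous bases are simply skipped


def kmer_counts(sequence, k):
    """Count overlapping k-mer words, ignoring words with non-ACGT symbols."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        word = sequence[i:i + k]
        if set(word) <= set(ALPHABET):
            counts[word] += 1
    return counts


def kmer_vector(sequence, k):
    """Return counts in a fixed lexicographic order over all 4**k words."""
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = kmer_counts(sequence, k)
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    # Hypothetical toy fragment, not real HCV data.
    print(kmer_vector("ACGTAC", 2))
```

Because every sequence maps to a vector of the same length (4**k), partial and complete genomes can be fed to the same classifier, which is what makes the length of the k-mer words a natural experimental variable.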
-
PGR: A Graph Repository of Protein 3D-Structures
Authors:
Wajdi Dhifli,
Abdoulaye Baniré Diallo
Abstract:
Graph theory and graph mining constitute rich fields of computational techniques to study the structures, topologies and properties of graphs. These techniques constitute a good asset in bioinformatics if there exist efficient methods for transforming biological data into graphs. In this paper, we present Protein Graph Repository (PGR), a novel database of protein 3D-structures transformed into graphs, allowing the use of the large repertoire of graph theory techniques in protein mining. This repository contains graph representations of all currently known protein 3D-structures described in the Protein Data Bank (PDB). PGR also provides an efficient online converter of protein 3D-structures into graphs, biological and graph-based descriptions, pre-computed protein graph attributes and statistics, visualization of each protein graph, as well as a graph-based protein similarity search tool. Such a repository presents an enrichment of existing online databases that will help bridge the gap between graph mining and protein structure analysis. PGR data and features are unique and not included in any other protein database. The repository is available at http://wjdi.bioinfo.uqam.ca/.
Submitted 24 January, 2016;
originally announced April 2016.
-
ProtNN: Fast and Accurate Nearest Neighbor Protein Function Prediction based on Graph Embedding in Structural and Topological Space
Authors:
Wajdi Dhifli,
Abdoulaye Baniré Diallo
Abstract:
Studying the function of proteins is important for understanding the molecular mechanisms of life. The number of publicly available protein structures has grown extremely large. Still, the determination of the function of a protein structure remains a difficult, costly, and time-consuming task. The difficulties are often due to the essential role of spatial and topological structures in the determination of protein functions in living cells. In this paper, we propose ProtNN, a novel approach for protein function prediction. Given an unannotated protein structure and a set of annotated proteins, ProtNN finds the nearest neighbor annotated structures based on protein-graph pairwise similarities. Given a query protein, ProtNN finds the nearest neighbor reference proteins based on a graph representation model and a pairwise similarity between vector embeddings of both query and reference protein-graphs in structural and topological spaces. ProtNN assigns to the query protein the function with the highest number of votes across the set of k nearest neighbor reference proteins, where k is a user-defined parameter. Experimental evaluation demonstrates that ProtNN is able to accurately classify several datasets in an extremely fast runtime compared to state-of-the-art approaches. We further show that ProtNN is able to scale up to a whole PDB dataset in a single-process mode with no parallelization, with a gain of thousands order of magnitude of runtime compared to state-of-the-art approaches.
Submitted 24 January, 2016; v1 submitted 2 November, 2015;
originally announced November 2015.
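The voting rule described in this abstract (assign the function with the most votes among the k nearest reference proteins) can be sketched as follows. This is only an illustration under stated assumptions: the embedding step that turns a protein graph into a vector is not shown, Euclidean distance stands in for whatever pairwise similarity ProtNN actually uses, and the function name `knn_predict` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, references, labels, k):
    """Majority vote over the k reference vectors closest to the query.

    Assumption: smaller Euclidean distance means higher similarity.
    """
    ranked = sorted(
        (math.dist(query, ref), label)
        for ref, label in zip(references, labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    # Ties go to the label first encountered, i.e. the closest neighbor.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical 2-D embeddings of annotated protein graphs.
    references = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
    labels = ["kinase", "kinase", "protease", "protease"]
    print(knn_predict((0.2, 0.1), references, labels, k=3))
```

The parameter k plays the same role as the user-defined k in the abstract: a larger value smooths the decision over more neighbors at the cost of letting distant reference proteins vote.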
-
Toward an Efficient Multi-class Classification in an Open Universe
Authors:
Wajdi Dhifli,
Abdoulaye Baniré Diallo
Abstract:
Classification is a fundamental task in machine learning and data mining. Existing classification methods are designed to classify unknown instances within a set of previously known training classes. Such a classification takes the form of a prediction within a closed set of classes. However, a more realistic scenario that fits real-world applications is to consider the possibility of encountering instances that do not belong to any of the training classes, i.e., an open-set classification. In such a situation, existing closed-set classifiers will assign a training label to these instances, resulting in a misclassification. In this paper, we introduce Galaxy-X, a novel multi-class classification approach for open-set recognition problems. For each class of the training set, Galaxy-X creates a minimum bounding hyper-sphere that encompasses the distribution of the class by enclosing all of its instances. In this manner, our method is able to distinguish instances resembling previously seen classes from those belonging to unknown classes. To adequately evaluate open-set classification, we introduce a novel evaluation procedure. Experimental results on benchmark datasets show the efficiency of our approach in classifying novel instances from known as well as unknown classes.
Submitted 1 March, 2018; v1 submitted 2 November, 2015;
originally announced November 2015.
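The idea of enclosing each training class in a hyper-sphere and rejecting points that fall outside every sphere can be sketched as below. This is a simplified illustration, not the Galaxy-X algorithm itself: each sphere is centered on the class centroid with a radius equal to the distance to the farthest member, which encloses all instances but is generally not the minimum bounding sphere, and the names `fit_spheres` and `predict_open_set` are hypothetical.

```python
import math


def fit_spheres(points, labels):
    """Build one enclosing sphere (center, radius) per training class.

    Assumption: centroid-centered spheres approximate the minimum
    bounding hyper-sphere mentioned in the abstract.
    """
    grouped = {}
    for point, label in zip(points, labels):
        grouped.setdefault(label, []).append(point)
    spheres = {}
    for label, members in grouped.items():
        dim = len(members[0])
        center = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
        radius = max(math.dist(center, m) for m in members)
        spheres[label] = (center, radius)
    return spheres


def predict_open_set(x, spheres, unknown="unknown"):
    """Return the class whose sphere contains x, or `unknown` if none does.

    When several spheres contain x, the one with the nearest center wins.
    """
    inside = [
        (math.dist(x, center), label)
        for label, (center, radius) in spheres.items()
        if math.dist(x, center) <= radius
    ]
    return min(inside)[1] if inside else unknown


if __name__ == "__main__":
    points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
    labels = ["a", "a", "b", "b"]
    spheres = fit_spheres(points, labels)
    print(predict_open_set((1.0, 0.5), spheres))
    print(predict_open_set((5.0, 5.0), spheres))
```

A closed-set classifier would be forced to label the point (5.0, 5.0) as either class; the explicit rejection branch is what makes the prediction open-set.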