Search | arXiv e-print repository

DeePathology: Deep Multi-Task Learning for Inferring Molecular Pathology from Cancer Transcriptome

Authors: Behrooz Azarkhalili, Ali Saberi, Hamidreza Chitsaz, Ali Sharifi-Zarchi

Abstract: Despite great advances, molecular cancer pathology is often limited to the use of a small number of biomarkers rather than the whole transcriptome, partly due to computational challenges. Here, we introduce a novel architecture of Deep Neural Networks (DNNs) that is capable of simultaneous inference of various properties of biological samples, through multi-task and transfer learning. It encodes t… ▽ More Despite great advances, molecular cancer pathology is often limited to the use of a small number of biomarkers rather than the whole transcriptome, partly due to computational challenges. Here, we introduce a novel architecture of Deep Neural Networks (DNNs) that is capable of simultaneous inference of various properties of biological samples, through multi-task and transfer learning. It encodes the whole transcription profile into a strikingly low-dimensional latent vector of size 8, and then recovers mRNA and miRNA expression profiles, tissue and disease type from this vector. This latent space is significantly better than the original gene expression profiles for discriminating samples based on their tissue and disease. We employed this architecture on mRNA transcription profiles of 10787 clinical samples from 34 classes (one healthy and 33 different types of cancer) from 27 tissues. Our method significantly outperforms prior works and classical machine learning approaches in predicting tissue-of-origin, normal or disease state and cancer type of each sample. For tissues with more than one type of cancer, it reaches 99.4\% accuracy in identifying the correct cancer subtype. We also show this system is very robust against noise and missing values. Collectively, our results highlight applications of artificial intelligence in molecular cancer pathology and oncological research. DeePathology is freely available at \url{https://github.com/SharifBioinf/DeePathology}. △ Less

Submitted 13 August, 2019; v1 submitted 7 August, 2018; originally announced August 2018.

arXiv:1806.04863 [pdf, other]

Cell Identity Codes: Understanding Cell Identity from Gene Expression Profiles using Deep Neural Networks

Authors: Farzad Abdolhosseini, Behrooz Azarkhalili, Abbas Maazallahi, Aryan Kamal, Seyed Abolfazl Motahari, Ali Sharifi-Zarchi, Hamidreza Chitsaz

Abstract: Understanding cell identity is an important task in many biomedical areas. Expression patterns of specific marker genes have been used to characterize some limited cell types, but exclusive markers are not available for many cell types. A second approach is to use machine learning to discriminate cell types based on the whole gene expression profiles (GEPs). The accuracies of simple classification… ▽ More Understanding cell identity is an important task in many biomedical areas. Expression patterns of specific marker genes have been used to characterize some limited cell types, but exclusive markers are not available for many cell types. A second approach is to use machine learning to discriminate cell types based on the whole gene expression profiles (GEPs). The accuracies of simple classification algorithms such as linear discriminators or support vector machines are limited due to the complexity of biological systems. We used deep neural networks to analyze 1040 GEPs from 16 different human tissues and cell types. After comparing different architectures, we identified a specific structure of deep autoencoders that can encode a GEP into a vector of 30 numeric values, which we call the cell identity code (CIC). The original GEP can be reproduced from the CIC with an accuracy comparable to technical replicates of the same experiment. Although we use an unsupervised approach to train the autoencoder, we show different values of the CIC are connected to different biological aspects of the cell, such as different pathways or biological processes. This network can use CIC to reproduce the GEP of the cell types it has never seen during the training. It also can resist some noise in the measurement of the GEP. Furthermore, we introduce classifier autoencoder, an architecture that can accurately identify cell type based on the GEP or the CIC. △ Less

Submitted 13 June, 2018; originally announced June 2018.

arXiv:1710.00760 [pdf, ps, other]

Scalable Nonlinear AUC Maximization Methods

Authors: Majdi Khalid, Indrakshi Ray, Hamidreza Chitsaz

Abstract: The area under the ROC curve (AUC) is a measure of interest in various machine learning and data mining applications. It has been widely used to evaluate classification performance on heavily imbalanced data. The kernelized AUC maximization machines have established a superior generalization ability compared to linear AUC machines because of their capability in modeling the complex nonlinear struc… ▽ More The area under the ROC curve (AUC) is a measure of interest in various machine learning and data mining applications. It has been widely used to evaluate classification performance on heavily imbalanced data. The kernelized AUC maximization machines have established a superior generalization ability compared to linear AUC machines because of their capability in modeling the complex nonlinear structure underlying most real-world data. However, the high training complexity renders the kernelized AUC machines infeasible for large-scale data. In this paper, we present two nonlinear AUC maximization algorithms that optimize pairwise linear classifiers over a finite-dimensional feature space constructed via the k-means Nyström method. Our first algorithm maximize the AUC metric by optimizing a pairwise squared hinge loss function using the truncated Newton method. However, the second-order batch AUC maximization method becomes expensive to optimize for extremely massive datasets. This motivate us to develop a first-order stochastic AUC maximization algorithm that incorporates a scheduled regularization update and scheduled averaging techniques to accelerate the convergence of the classifier. Experiments on several benchmark datasets demonstrate that the proposed AUC classifiers are more efficient than kernelized AUC machines while they are able to surpass or at least match the AUC performance of the kernelized AUC machines. The experiments also show that the proposed stochastic AUC classifier outperforms the state-of-the-art online AUC maximization methods in terms of AUC classification accuracy. △ Less

Submitted 29 April, 2019; v1 submitted 2 October, 2017; originally announced October 2017.

arXiv:1610.02573 [pdf, other]

Proceedings of the 1st International Workshop on Robot Learning and Planning (RLP 2016)

Authors: Nancy Amato, Charles Anderson, Gregory Chirikjian, Hamidreza Chitsaz, Vishnu Desaraju, Chinwe Ekenna, Kris Hauser, Geoff Hollinger, Reza Iraji, Minwoo Lee, Qianli Ma, Seth McCammon, Nathan Michael, Shawna Thomas, Diane Uwacu, Yan Yan

Abstract: Proceedings of the 1st International Workshop on Robot Learning and Planning (RLP 2016) Proceedings of the 1st International Workshop on Robot Learning and Planning (RLP 2016) △ Less

Submitted 8 October, 2016; originally announced October 2016.

arXiv:1607.00847 [pdf, other]

Confidence-Weighted Bipartite Ranking

Authors: Majdi Khalid, Indrakshi Ray, Hamidreza Chitsaz

Abstract: Bipartite ranking is a fundamental machine learning and data mining problem. It commonly concerns the maximization of the AUC metric. Recently, a number of studies have proposed online bipartite ranking algorithms to learn from massive streams of class-imbalanced data. These methods suggest both linear and kernel-based bipartite ranking algorithms based on first and second-order online learning. U… ▽ More Bipartite ranking is a fundamental machine learning and data mining problem. It commonly concerns the maximization of the AUC metric. Recently, a number of studies have proposed online bipartite ranking algorithms to learn from massive streams of class-imbalanced data. These methods suggest both linear and kernel-based bipartite ranking algorithms based on first and second-order online learning. Unlike kernelized ranker, linear ranker is more scalable learning algorithm. The existing linear online bipartite ranking algorithms lack either handling non-separable data or constructing adaptive large margin. These limitations yield unreliable bipartite ranking performance. In this work, we propose a linear online confidence-weighted bipartite ranking algorithm (CBR) that adopts soft confidence-weighted learning. The proposed algorithm leverages the same properties of soft confidence-weighted learning in a framework for bipartite ranking. We also develop a diagonal variation of the proposed confidence-weighted bipartite ranking algorithm to deal with high-dimensional data by maintaining only the diagonal elements of the covariance matrix. We empirically evaluate the effectiveness of the proposed algorithms on several benchmark and high-dimensional datasets. The experimental results validate the reliability of the proposed algorithms. The results also show that our algorithms outperform or are at least comparable to the competing online AUC maximization methods. △ Less

Submitted 10 March, 2019; v1 submitted 4 July, 2016; originally announced July 2016.

Comments: 15 pages, 6 tables, and 2 figures

arXiv:1408.5592 [pdf, other]

HyDA-Vista: Towards Optimal Guided Selection of k-mer Size for Sequence Assembly

Authors: Seyed Basir Shariat Razavi, Narjes Sadat Movahedi Tabrizi, Hamidreza Chitsaz, Christina Boucher

Abstract: Motivation: Intimately tied to assembly quality is the complexity of the de Bruijn graph built by the assembler. Thus, there have been many paradigms developed to decrease the complexity of the de Bruijn graph. One obvious combinatorial paradigm for this is to allow the value of $k$ to vary; having a larger value of $k$ where the graph is more complex and a smaller value of $k$ where the graph wou… ▽ More Motivation: Intimately tied to assembly quality is the complexity of the de Bruijn graph built by the assembler. Thus, there have been many paradigms developed to decrease the complexity of the de Bruijn graph. One obvious combinatorial paradigm for this is to allow the value of $k$ to vary; having a larger value of $k$ where the graph is more complex and a smaller value of $k$ where the graph would likely contain fewer spurious edges and vertices. One open problem that affects the practicality of this method is how to predict the value of $k$ prior to building the de Bruijn graph. We show that optimal values of $k$ can be predicted prior to assembly by using the information contained in a phylogenetically-close genome and therefore, help make the use of multiple values of $k$ practical for genome assembly. Results: We present HyDA-Vista, which is a genome assembler that uses homology information to choose a value of $k$ for each read prior to the de Bruijn graph construction. The chosen $k$ is optimal if there are no sequencing errors and the coverage is sufficient. Fundamental to our method is the construction of the {\em maximal sequence landscape}, which is a data structure that stores for each position in the input string, the largest repeated substring containing that position. In particular, we show the maximal sequence landscape can be constructed in $O(n + n \log n)$-time and $O(n)$-space. HyDA-Vista first constructs the maximal sequence landscape for a homologous genome. The reads are then aligned to this reference genome, and values of $k$ are assigned to each read using the maximal sequence landscape and the alignments. Eventually, all the reads are assembled by an iterative de Bruijn graph construction method. Our results and comparison to other assemblers demonstrate that HyDA-Vista achieves the best assembly of {\em E. coli} before repeat resolution or scaffolding. △ Less

Submitted 24 August, 2014; originally announced August 2014.

Comments: 11 pages, 1 figure, 1 table

arXiv:1403.5384 [pdf, other]

NUROA: A Numerical Roadmap Algorithm

Authors: Reza Iraji, Hamidreza Chitsaz

Abstract: Motion planning has been studied for nearly four decades now. Complete, combinatorial motion planning approaches are theoretically well-rooted with completeness guarantees but they are hard to implement. Sampling-based and heuristic methods are easy to implement and quite simple to customize but they lack completeness guarantees. Can the best of both worlds be ever achieved, particularly for missi… ▽ More Motion planning has been studied for nearly four decades now. Complete, combinatorial motion planning approaches are theoretically well-rooted with completeness guarantees but they are hard to implement. Sampling-based and heuristic methods are easy to implement and quite simple to customize but they lack completeness guarantees. Can the best of both worlds be ever achieved, particularly for mission critical applications such as robotic surgery, space explorations, and handling hazardous material? In this paper, we answer affirmatively to that question. We present a new methodology, NUROA, to numerically approximate the Canny's roadmap, which is a network of one-dimensional algebraic curves. Our algorithm encloses the roadmap with a chain of tiny boxes each of which contains a piece of the roadmap and whose connectivity captures the roadmap connectivity. It starts by enclosing the entire space with a box. In each iteration, remaining boxes are shrunk on all sides and then split into smaller sized boxes. Those boxes that are empty are detected in the shrink phase and removed. The algorithm terminates when all remaining boxes are smaller than a resolution that can be either given as input or automatically computed using root separation lower bounds. Shrink operation is cast as a polynomial optimization with semialgebraic constraints, which is in turn transformed into a (series of) semidefinite programs (SDP) using the Lasserre's approach. NUROA's success is due to fast SDP solvers. NUROA correctly captured the connectivity of multiple curves/skeletons whereas competitors such as IBEX and Realpaver failed in some cases. Since boxes are independent from one another, NUROA can be parallelized particularly on GPUs. NUROA is available as an open source package at http://nuroa.sourceforge.net/. △ Less

Submitted 25 June, 2014; v1 submitted 21 March, 2014; originally announced March 2014.

arXiv:1402.3213 [pdf, other]

Proceedings of the 1st Workshop on Robotics Challenges and Vision (RCV2013)

Authors: Aitor Aladren, Sasa Bodiroza, Hamidreza Chitsaz, J. J. Guerrero, Verena Hafner, Kris Hauser, Aleksandar Jevtic, Moslem Kazemi, Bruno Lara, Gonzalo Lopez-Nicolas, Peer Neubert, Peter Protzel, Laurel D. Riek, Niko Sunderhauf, Chee Yap

Abstract: Proceedings of the 1st Workshop on Robotics Challenges and Vision (RCV2013) Proceedings of the 1st Workshop on Robotics Challenges and Vision (RCV2013) △ Less

Submitted 13 February, 2014; originally announced February 2014.

Comments: http://compbio.cs.wayne.edu/robotics/rcv2013/proceedings-emb.pdf

arXiv:1310.4223 [pdf, ps, other]

Exact Learning of RNA Energy Parameters From Structure

Authors: Hamidreza Chitsaz, Mohammad Aminisharifabad

Abstract: We consider the problem of exact learning of parameters of a linear RNA energy model from secondary structure data. A necessary and sufficient condition for learnability of parameters is derived, which is based on computing the convex hull of union of translated Newton polytopes of input sequences. The set of learned energy parameters is characterized as the convex cone generated by the normal vec… ▽ More We consider the problem of exact learning of parameters of a linear RNA energy model from secondary structure data. A necessary and sufficient condition for learnability of parameters is derived, which is based on computing the convex hull of union of translated Newton polytopes of input sequences. The set of learned energy parameters is characterized as the convex cone generated by the normal vectors to those facets of the resulting polytope that are incident to the origin. In practice, the sufficient condition may not be satisfied by the entire training data set; hence, computing a maximal subset of training data for which the sufficient condition is satisfied is often desired. We show that problem is NP-hard in general for an arbitrary dimensional feature space. Using a randomized greedy algorithm, we select a subset of RNA STRAND v2.0 database that satisfies the sufficient condition for separate A-U, C-G, G-U base pair counting model. The set of learned energy parameters includes experimentally measured energies of A-U, C-G, and G-U pairs; hence, our parameter set is in agreement with the Turner parameters. △ Less

Submitted 15 October, 2013; originally announced October 2013.

arXiv:1305.0062 [pdf, ps, other]

doi 10.1093/bioinformatics/btt420

Distilled Single Cell Genome Sequencing and De Novo Assembly for Sparse Microbial Communities

Authors: Zeinab Taghavi, Narjes S. Movahedi, Sorin Draghici, Hamidreza Chitsaz

Abstract: Identification of every single genome present in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single cell sequencing using multiple displacement amplification, which is simply intrac… ▽ More Identification of every single genome present in a microbial sample is an important and challenging task with crucial applications. It is challenging because there are typically millions of cells in a microbial sample, the vast majority of which elude cultivation. The most accurate method to date is exhaustive single cell sequencing using multiple displacement amplification, which is simply intractable for a large number of cells. However, there is hope for breaking this barrier as the number of different cell types with distinct genome sequences is usually much smaller than the number of cells. Here, we present a novel divide and conquer method to sequence and de novo assemble all distinct genomes present in a microbial sample with a sequencing cost and computational complexity proportional to the number of genome types, rather than the number of cells. The method is implemented in a tool called Squeezambler. We evaluated Squeezambler on simulated data. The proposed divide and conquer method successfully reduces the cost of sequencing in comparison with the naive exhaustive approach. Availability: Squeezambler and datasets are available under http://compbio.cs.wayne.edu/software/squeezambler/. △ Less

Submitted 22 May, 2013; v1 submitted 30 April, 2013; originally announced May 2013.

arXiv:1301.6662 [pdf, ps, other]

On Time-optimal Trajectories for a Car-like Robot with One Trailer

Authors: Hamidreza Chitsaz

Abstract: In addition to the theoretical value of challenging optimal control problmes, recent progress in autonomous vehicles mandates further research in optimal motion planning for wheeled vehicles. Since current numerical optimal control techniques suffer from either the curse of dimens ionality, e.g. the Hamilton-Jacobi-Bellman equation, or the curse of complexity, e.g. pseudospectral optimal control a… ▽ More In addition to the theoretical value of challenging optimal control problmes, recent progress in autonomous vehicles mandates further research in optimal motion planning for wheeled vehicles. Since current numerical optimal control techniques suffer from either the curse of dimens ionality, e.g. the Hamilton-Jacobi-Bellman equation, or the curse of complexity, e.g. pseudospectral optimal control and max-plus methods, analytical characterization of geodesics for wheeled vehicles becomes important not only from a theoretical point of view but also from a prac tical one. Such an analytical characterization provides a fast motion planning algorithm that can be used in robust feedback loops. In this work, we use the Pontryagin Maximum Principle to characterize extremal trajectories, i.e. candidate geodesics, for a car-like robot with one trailer. We use time as the distance function. In spite of partial progress, this problem has remained open in the past two decades. Besides straight motion and turn with maximum allowed curvature, we identify planar elastica as the third piece of motion that occurs along our extr emals. We give a detailed characterization of such curves, a special case of which, called \emph{merging curve}, connects maximum curvature turns to straight line segments. The structure of extremals in our case is revealed through analytical integration of the system and adjoint equations. △ Less

Submitted 28 January, 2013; originally announced January 2013.

arXiv:1301.1608 [pdf, ps, other]

The RNA Newton Polytope and Learnability of Energy Parameters

Authors: Elmirasadat Forouzmand, Hamidreza Chitsaz

Abstract: Despite nearly two scores of research on RNA secondary structure and RNA-RNA interaction prediction, the accuracy of the state-of-the-art algorithms are still far from satisfactory. Researchers have proposed increasingly complex energy models and improved parameter estimation methods in anticipation of endowing their methods with enough power to solve the problem. The output has disappointingly be… ▽ More Despite nearly two scores of research on RNA secondary structure and RNA-RNA interaction prediction, the accuracy of the state-of-the-art algorithms are still far from satisfactory. Researchers have proposed increasingly complex energy models and improved parameter estimation methods in anticipation of endowing their methods with enough power to solve the problem. The output has disappointingly been only modest improvements, not matching the expectations. Even recent massively featured machine learning approaches were not able to break the barrier. In this paper, we introduce the notion of learnability of the parameters of an energy model as a measure of its inherent capability. We say that the parameters of an energy model are learnable iff there exists at least one set of such parameters that renders every known RNA structure to date the minimum free energy structure. We derive a necessary condition for the learnability and give a dynamic programming algorithm to assess it. Our algorithm computes the convex hull of the feature vectors of all feasible structures in the ensemble of a given input sequence. Interestingly, that convex hull coincides with the Newton polytope of the partition function as a polynomial in energy parameters. We demonstrated the application of our theory to a simple energy model consisting of a weighted count of A-U and C-G base pairs. Our results show that this simple energy model satisfies the necessary condition for less than one third of the input unpseudoknotted sequence-structure pairs chosen from the RNA STRAND v2.0 database. For another one third, the necessary condition is barely violated, which suggests that augmenting this simple energy model with more features such as the Turner loops may solve the problem. The necessary condition is severely violated for 8%, which provides a small set of hard cases that require further investigation. △ Less

Submitted 8 January, 2013; originally announced January 2013.

arXiv:1301.1590 [pdf, ps, other]

An Efficient Algorithm for Upper Bound on the Partition Function of Nucleic Acids

Authors: Hamidreza Chitsaz, Elmirasadat Forouzmand, Gholamreza Haffari

Abstract: It has been shown that minimum free energy structure for RNAs and RNA-RNA interaction is often incorrect due to inaccuracies in the energy parameters and inherent limitations of the energy model. In contrast, ensemble based quantities such as melting temperature and equilibrium concentrations can be more reliably predicted. Even structure prediction by sampling from the ensemble and clustering tho… ▽ More It has been shown that minimum free energy structure for RNAs and RNA-RNA interaction is often incorrect due to inaccuracies in the energy parameters and inherent limitations of the energy model. In contrast, ensemble based quantities such as melting temperature and equilibrium concentrations can be more reliably predicted. Even structure prediction by sampling from the ensemble and clustering those structures by Sfold [7] has proven to be more reliable than minimum free energy structure prediction. The main obstacle for ensemble based approaches is the computational complexity of the partition function and base pairing probabilities. For instance, the space complexity of the partition function for RNA-RNA interaction is $O(n^4)$ and the time complexity is $O(n^6)$ which are prohibitively large [4,12]. Our goal in this paper is to give a fast algorithm, based on sparse folding, to calculate an upper bound on the partition function. Our work is based on the recent algorithm of Hazan and Jaakkola [10]. The space complexity of our algorithm is the same as that of sparse folding algorithms, and the time complexity of our algorithm is $O(MFE(n)\ell)$ for single RNA and $O(MFE(m, n)\ell)$ for RNA-RNA interaction in practice, in which $MFE$ is the running time of sparse folding and $\ell \leq n$ ($\ell \leq n + m$) is a sequence dependent parameter. △ Less

Submitted 8 January, 2013; originally announced January 2013.

Showing 1–13 of 13 results for author: Chitsaz, H