-
Unsupervised Hierarchical Graph Representation Learning by Mutual Information Maximization
Authors:
Fei Ding,
Xiaohong Zhang,
Justin Sybrandt,
Ilya Safro
Abstract:
Graph representation learning based on graph neural networks (GNNs) can greatly improve the performance of downstream tasks, such as node and graph classification. However, the general GNN models do not aggregate node information in a hierarchical manner, and can miss key higher-order structural features of many graphs. The hierarchical aggregation also enables the graph representations to be expl…
▽ More
Graph representation learning based on graph neural networks (GNNs) can greatly improve the performance of downstream tasks, such as node and graph classification. However, the general GNN models do not aggregate node information in a hierarchical manner, and can miss key higher-order structural features of many graphs. The hierarchical aggregation also enables the graph representations to be explainable. In addition, supervised graph representation learning requires labeled data, which is expensive and error-prone. To address these issues, we present an unsupervised graph representation learning method, Unsupervised Hierarchical Graph Representation (UHGR), which can generate hierarchical representations of graphs. Our method focuses on maximizing mutual information between "local" and high-level "global" representations, which enables us to learn the node embeddings and graph embeddings without any labeled data. To demonstrate the effectiveness of the proposed method, we perform the node and graph classification using the learned node and graph embeddings. The results show that the proposed method achieves comparable results to state-of-the-art supervised methods on several benchmarks. In addition, our visualization of hierarchical representations indicates that our method can capture meaningful and interpretable clusters.
△ Less
Submitted 28 July, 2020; v1 submitted 18 March, 2020;
originally announced March 2020.
-
CBAG: Conditional Biomedical Abstract Generation
Authors:
Justin Sybrandt,
Ilya Safro
Abstract:
Biomedical research papers use significantly different language and jargon when compared to typical English text, which reduces the utility of pre-trained NLP models in this domain. Meanwhile Medline, a database of biomedical abstracts, introduces nearly a million new documents per-year. Applications that could benefit from understanding this wealth of publicly available information, such as scien…
▽ More
Biomedical research papers use significantly different language and jargon when compared to typical English text, which reduces the utility of pre-trained NLP models in this domain. Meanwhile Medline, a database of biomedical abstracts, introduces nearly a million new documents per-year. Applications that could benefit from understanding this wealth of publicly available information, such as scientific writing assistants, chat-bots, or descriptive hypothesis generation systems, require new domain-centered approaches. A conditional language model, one that learns the probability of words given some a priori criteria, is a fundamental building block in many such applications. We propose a transformer-based conditional language model with a shallow encoder "condition" stack, and a deep "language model" stack of multi-headed attention blocks. The condition stack encodes metadata used to alter the output probability distribution of the language model stack. We sample this distribution in order to generate biomedical abstracts given only a proposed title, an intended publication year, and a set of keywords. Using typical natural language generation metrics, we demonstrate that this proposed approach is more capable of producing non-trivial relevant entities within the abstract body than the 1.5B parameter GPT-2 language model.
△ Less
Submitted 13 February, 2020;
originally announced February 2020.
-
AGATHA: Automatic Graph-mining And Transformer based Hypothesis generation Approach
Authors:
Justin Sybrandt,
Ilya Tyagin,
Michael Shtutman,
Ilya Safro
Abstract:
Medical research is risky and expensive. Drug discovery, as an example, requires that researchers efficiently winnow thousands of potential targets to a small candidate set for more thorough evaluation. However, research groups spend significant time and money to perform the experiments necessary to determine this candidate set long before seeing intermediate results. Hypothesis generation systems…
▽ More
Medical research is risky and expensive. Drug discovery, as an example, requires that researchers efficiently winnow thousands of potential targets to a small candidate set for more thorough evaluation. However, research groups spend significant time and money to perform the experiments necessary to determine this candidate set long before seeing intermediate results. Hypothesis generation systems address this challenge by mining the wealth of publicly available scientific information to predict plausible research directions. We present AGATHA, a deep-learning hypothesis generation system that can introduce data-driven insights earlier in the discovery process. Through a learned ranking criteria, this system quickly prioritizes plausible term-pairs among entity sets, allowing us to recommend new research directions. We massively validate our system with a temporal holdout wherein we predict connections first introduced after 2015 using data published beforehand. We additionally explore biomedical sub-domains, and demonstrate AGATHA's predictive capacity across the twenty most popular relationship types. This system achieves best-in-class performance on an established benchmark, and demonstrates high recommendation scores across subdomains. Reproducibility: All code, experimental data, and pre-trained models are available online: sybrandt.com/2020/agatha
△ Less
Submitted 13 February, 2020;
originally announced February 2020.
-
FOBE and HOBE: First- and High-Order Bipartite Embeddings
Authors:
Justin Sybrandt,
Ilya Safro
Abstract:
Typical graph embeddings may not capture type-specific bipartite graph features that arise in such areas as recommender systems, data visualization, and drug discovery. Machine learning methods utilized in these applications would be better served with specialized embedding techniques. We propose two embeddings for bipartite graphs that decompose edges into sets of indirect relationships between n…
▽ More
Typical graph embeddings may not capture type-specific bipartite graph features that arise in such areas as recommender systems, data visualization, and drug discovery. Machine learning methods utilized in these applications would be better served with specialized embedding techniques. We propose two embeddings for bipartite graphs that decompose edges into sets of indirect relationships between node neighborhoods. When sampling higher-order relationships, we reinforce similarities through algebraic distance on graphs. We also introduce ensemble embeddings to combine both into a "best of both worlds" embedding. The proposed methods are evaluated on link prediction and recommendation tasks and compared with other state-of-the-art embeddings. While being all highly beneficial in applications, we demonstrate that none of the considered embeddings is clearly superior (in contrast to what is claimed in many papers), and discuss the trade offs present among them. Reproducibility: Our code, data sets, and results are all publicly available online at: http://sybrandt.com/2020/fobe_hobe.
△ Less
Submitted 22 July, 2020; v1 submitted 26 May, 2019;
originally announced May 2019.
-
Spatio-temporal prediction of crimes using network analytic approach
Authors:
Saroj Kumar Dash,
Ilya Safro,
Ravisutha Sakrepatna Srinivasamurthy
Abstract:
It is quite evident that majority of the population lives in urban area today than in any time of the human history. This trend seems to increase in coming years. A study [5] says that nearly 80.7% of total population in USA stays in urban area. By 2030 nearly 60% of the population in the world will live in or move to cities. With the increase in urban population, it is important to keep an eye on…
▽ More
It is quite evident that majority of the population lives in urban area today than in any time of the human history. This trend seems to increase in coming years. A study [5] says that nearly 80.7% of total population in USA stays in urban area. By 2030 nearly 60% of the population in the world will live in or move to cities. With the increase in urban population, it is important to keep an eye on criminal activities. By doing so, governments can enforce intelligent policing systems and hence many government agencies and local authorities have made the crime data publicly available. In this paper, we analyze Chicago city crime data fused with other social information sources using network analytic techniques to predict criminal activity for the next year. We observe that as we add more layers of data which represent different aspects of the society, the quality of prediction is improved. Our prediction models not just predict total number of crimes for the whole Chicago city, rather they predict number of crimes for all types of crimes and for different regions in City of Chicago.
△ Less
Submitted 30 October, 2018; v1 submitted 19 August, 2018;
originally announced August 2018.
-
Detecting and monitoring foodborne illness outbreaks: Twitter communications and the 2015 U.S. Salmonella outbreak linked to imported cucumbers
Authors:
Yuliya V. Bolotova,
Jie Lou,
Ilya Safro
Abstract:
This research uses Twitter, as a social media device, to track communications related to the 2015 U.S. foodborne illness outbreak linked to Salmonella in imported cucumbers from Mexico. The relevant Twitter data are analyzed in light of the timeline of the official announcements made by the Centers for Disease Control and Prevention (CDC). The largest number of registered tweets is associated with…
▽ More
This research uses Twitter, as a social media device, to track communications related to the 2015 U.S. foodborne illness outbreak linked to Salmonella in imported cucumbers from Mexico. The relevant Twitter data are analyzed in light of the timeline of the official announcements made by the Centers for Disease Control and Prevention (CDC). The largest number of registered tweets is associated with the period immediately following the CDC initial announcement and the official release of the first recall of cucumbers.
△ Less
Submitted 24 August, 2017;
originally announced August 2017.
-
Engineering fast multilevel support vector machines
Authors:
E. Sadrfaridpour,
T. Razzaghi,
I. Safro
Abstract:
The computational complexity of solving nonlinear support vector machine (SVM) is prohibitive on large-scale data. In particular, this issue becomes very sensitive when the data represents additional difficulties such as highly imbalanced class sizes. Typically, nonlinear kernels produce significantly higher classification quality to linear kernels but introduce extra kernel and model parameters w…
▽ More
The computational complexity of solving nonlinear support vector machine (SVM) is prohibitive on large-scale data. In particular, this issue becomes very sensitive when the data represents additional difficulties such as highly imbalanced class sizes. Typically, nonlinear kernels produce significantly higher classification quality to linear kernels but introduce extra kernel and model parameters which requires computationally expensive fitting. This increases the quality but also reduces the performance dramatically. We introduce a generalized fast multilevel framework for regular and weighted SVM and discuss several versions of its algorithmic components that lead to a good trade-off between quality and time. Our framework is implemented using PETSc which allows an easy integration with scientific computing tasks. The experimental results demonstrate significant speed up compared to the state-of-the-art nonlinear SVM libraries.
Reproducibility: our source code, documentation and parameters are available at https:// github.com/esadr/mlsvm.
△ Less
Submitted 5 April, 2019; v1 submitted 24 July, 2017;
originally announced July 2017.
-
MOLIERE: Automatic Biomedical Hypothesis Generation System
Authors:
Justin Sybrandt,
Michael Shtutman,
Ilya Safro
Abstract:
Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relation…
▽ More
Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.
△ Less
Submitted 31 May, 2017; v1 submitted 20 February, 2017;
originally announced February 2017.
-
Algebraic multigrid support vector machines
Authors:
Ehsan Sadrfaridpour,
Sandeep Jeereddy,
Ken Kennedy,
Andre Luckow,
Talayeh Razzaghi,
Ilya Safro
Abstract:
The support vector machine is a flexible optimization-based technique widely used for classification problems. In practice, its training part becomes computationally expensive on large-scale data sets because of such reasons as the complexity and number of iterations in parameter fitting methods, underlying optimization solvers, and nonlinearity of kernels. We introduce a fast multilevel framework…
▽ More
The support vector machine is a flexible optimization-based technique widely used for classification problems. In practice, its training part becomes computationally expensive on large-scale data sets because of such reasons as the complexity and number of iterations in parameter fitting methods, underlying optimization solvers, and nonlinearity of kernels. We introduce a fast multilevel framework for solving support vector machine models that is inspired by the algebraic multigrid. Significant improvement in the running has been achieved without any loss in the quality. The proposed technique is highly beneficial on imbalanced sets. We demonstrate computational results on publicly available and industrial data sets.
△ Less
Submitted 23 November, 2016; v1 submitted 16 November, 2016;
originally announced November 2016.
-
Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)
Authors:
Chris Gropp,
Alexander Herzog,
Ilya Safro,
Paul W. Wilson,
Amy W. Apon
Abstract:
Topic modeling, a method for extracting the underlying themes from a collection of documents, is an increasingly important component of the design of intelligent systems enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM) do not lend themselves well to direct parallelization because of dependencies from one time ste…
▽ More
Topic modeling, a method for extracting the underlying themes from a collection of documents, is an increasingly important component of the design of intelligent systems enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM) do not lend themselves well to direct parallelization because of dependencies from one time step to another. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. Our approach is based on data decomposition in which the data is partitioned into segments, followed by topic modeling on the individual segments. The resulting local models are then combined into a global solution using clustering. The decomposition and resulting parallelization leads to very fast runtime even on very large datasets. Our approach furthermore provides insight into how the composition of topics changes over time and can also be applied using other data partitioning strategies over any discrete features of the data, such as geographic features or classes of users. In this paper CLDA is applied successfully to seventeen years of NIPS conference papers (2,484 documents and 3,280,697 words), seventeen years of computer science journal abstracts (533,560 documents and 32,551,540 words), and to forty years of the PubMed corpus (4,025,978 documents and 273,853,980 words).
△ Less
Submitted 4 October, 2019; v1 submitted 24 October, 2016;
originally announced October 2016.
-
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
Authors:
Talayeh Razzaghi,
Oleg Roderick,
Ilya Safro,
Nicholas Marko
Abstract:
This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, with imbalance in classes of interests, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for development of specialized techn…
▽ More
This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, with imbalance in classes of interests, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for development of specialized techniques of data-preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expected maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, and more accurate and robust classification results.
△ Less
Submitted 7 April, 2016;
originally announced April 2016.
-
Fast Imbalanced Classification of Healthcare Data with Missing Values
Authors:
Talayeh Razzaghi,
Oleg Roderick,
Ilya Safro,
Nick Marko
Abstract:
In medical domain, data features often contain missing values. This can create serious bias in the predictive modeling. Typical standard data mining methods often produce poor performance measures. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. The proposed method is based on a multilevel framework of the cost-sensitive SV…
▽ More
In medical domain, data features often contain missing values. This can create serious bias in the predictive modeling. Typical standard data mining methods often produce poor performance measures. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. The proposed method is based on a multilevel framework of the cost-sensitive SVM and the expected maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, and more accurate and robust classification results.
△ Less
Submitted 20 March, 2015;
originally announced March 2015.
-
Fast Multilevel Support Vector Machines
Authors:
Talayeh Razzaghi,
Ilya Safro
Abstract:
Solving different types of optimization models (including parameters fitting) for support vector machines on large-scale training data is often an expensive computational task. This paper proposes a multilevel algorithmic framework that scales efficiently to very large data sets. Instead of solving the whole training set in one optimization process, the support vectors are obtained and gradually r…
▽ More
Solving different types of optimization models (including parameters fitting) for support vector machines on large-scale training data is often an expensive computational task. This paper proposes a multilevel algorithmic framework that scales efficiently to very large data sets. Instead of solving the whole training set in one optimization process, the support vectors are obtained and gradually refined at multiple levels of coarseness of the data. The proposed framework includes: (a) construction of hierarchy of large-scale data coarse representations, and (b) a local processing of updating the hyperplane throughout this hierarchy. Our multilevel framework substantially improves the computational time without loosing the quality of classifiers. The algorithms are demonstrated for both regular and weighted support vector machines. Experimental results are presented for balanced and imbalanced classification problems. Quality improvement on several imbalanced data sets has been observed.
△ Less
Submitted 13 October, 2014;
originally announced October 2014.