-
MLQAOA: Graph Learning Accelerated Hybrid Quantum-Classical Multilevel QAOA
Authors:
Bao Bach,
Jose Falla,
Ilya Safro
Abstract:
Learning the problem structure at multiple levels of coarseness to inform the decomposition-based hybrid quantum-classical combinatorial optimization solvers is a promising approach to scaling up variational approaches. We introduce a multilevel algorithm reinforced with the spectral graph representation learning-based accelerator to tackle large-scale graph maximum cut instances and fused with several versions of the quantum approximate optimization algorithm (QAOA) and QAOA-inspired algorithms. The graph representation learning model utilizes the idea of QAOA variational parameters concentration and substantially improves the performance of QAOA. We demonstrate the potential of using multilevel QAOA and representation learning-based approaches on very large graphs by achieving high-quality solutions in much less time. Reproducibility: Our source code and results are available at https://github.com/bachbao/MLQAOA
Submitted 30 April, 2024; v1 submitted 22 April, 2024;
originally announced April 2024.
-
Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique
Authors:
Ilya Tyagin,
Ilya Safro
Abstract:
This paper presents a novel benchmarking framework Dyport for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This not only assesses hypothesis accuracy but also their potential impact in biomedical research which significantly extends traditional link prediction benchmarks. Applicability of our benchmarking process is demonstrated on several link prediction systems applied on biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community. Availability and implementation: Dyport framework is fully open-source. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport
Submitted 6 December, 2023;
originally announced December 2023.
-
Open Problems in (Hyper)Graph Decomposition
Authors:
Deepak Ajwani,
Rob H. Bisseling,
Katrin Casel,
Ümit V. Çatalyürek,
Cédric Chevalier,
Florian Chudigiewitsch,
Marcelo Fonseca Faraj,
Michael Fellows,
Lars Gottesbüren,
Tobias Heuer,
George Karypis,
Kamer Kaya,
Jakub Lacki,
Johannes Langguth,
Xiaoye Sherry Li,
Ruben Mayer,
Johannes Meintrup,
Yosuke Mizutani,
François Pellegrini,
Fabrizio Petrini,
Frances Rosamond,
Ilya Safro,
Sebastian Schlag,
Christian Schulz,
Roohani Sharma, et al. (4 additional authors not shown)
Abstract:
Large networks are useful in a wide range of applications. Sometimes problem instances are composed of billions of entities. Decomposing and analyzing these structures helps us gain new insights about our surroundings. Even if the final application concerns a different problem (such as traversal, finding paths, trees, and flows), decomposing large graphs is often an important subproblem for complexity reduction or parallelization. This report is a summary of discussions that happened at Dagstuhl seminar 23331 on "Recent Trends in Graph Decomposition" and presents currently open problems and future directions in the area of (hyper)graph decomposition.
Submitted 18 October, 2023;
originally announced October 2023.
-
QArchSearch: A Scalable Quantum Architecture Search Package
Authors:
Ankit Kulshrestha,
Danylo Lykov,
Ilya Safro,
Yuri Alexeev
Abstract:
The current era of quantum computing has yielded several algorithms that promise high computational efficiency. While the algorithms are sound in theory and can provide potentially exponential speedup, there is little guidance on how to design proper quantum circuits to realize the appropriate unitary transformation to be applied to the input quantum state. In this paper, we present QArchSearch, an AI-based quantum architecture search package with the QTensor library as a backend that provides a principled and automated approach to finding the best model given a task and input quantum state. We show that the search package is able to efficiently scale the search to large quantum circuits and enables the exploration of more complex models for different quantum applications. QArchSearch runs at scale and high efficiency on high-performance computing systems using a two-level parallelization scheme on both CPUs and GPUs, which has been demonstrated on the Polaris supercomputer.
Submitted 11 October, 2023;
originally announced October 2023.
-
Decomposition Based Refinement for the Network Interdiction Problem
Authors:
Krish Matta,
Xiaoyuan Liu,
Ilya Safro
Abstract:
The shortest path network interdiction (SPNI) problem poses significant computational challenges due to its NP-hardness. Current solutions, primarily based on integer programming methods, are inefficient for large-scale instances. In this paper, we introduce a novel hybrid algorithm that can utilize Ising Processing Units (IPUs) alongside classical solvers. This approach decomposes the problem into manageable sub-problems, which are then offloaded to the slow but high-quality classical solvers or IPU. Results are subsequently recombined to form a global solution. Our method demonstrates comparable quality to existing whole problem solvers while reducing computational time for large-scale instances. Furthermore, our approach is amenable to parallelization, allowing for simultaneous processing of decomposed sub-problems.
Submitted 14 September, 2023; v1 submitted 14 July, 2023;
originally announced July 2023.
-
Literature-based Discovery for Landscape Planning
Authors:
David Marasco,
Ilya Tyagin,
Justin Sybrandt,
James H. Spencer,
Ilya Safro
Abstract:
This project demonstrates how medical corpus hypothesis generation, a knowledge discovery field of AI, can be used to derive new research angles for landscape and urban planners. The hypothesis generation approach herein consists of a combination of deep learning with topic modeling, a probabilistic approach to natural language analysis that scans aggregated research databases for words that can be grouped together based on their subject matter commonalities; the word groups accordingly form topics that can provide implicit connections between two general research terms. The hypothesis generation system AGATHA was used to identify likely conceptual relationships between emerging infectious diseases (EIDs) and deforestation, with the objective of providing landscape planners guidelines for productive research directions to help them formulate research hypotheses centered on deforestation and EIDs that will contribute to the broader health field that asserts causal roles of landscape-level issues. This research also serves as a partial proof-of-concept for the application of medical database hypothesis generation to medicine-adjacent hypothesis discovery.
Submitted 5 June, 2023;
originally announced June 2023.
-
Learning To Optimize Quantum Neural Network Without Gradients
Authors:
Ankit Kulshrestha,
Xiaoyuan Liu,
Hayato Ushijima-Mwesigwa,
Ilya Safro
Abstract:
Quantum Machine Learning is an emerging sub-field in machine learning where one of the goals is to perform pattern recognition tasks by encoding data into quantum states. This extension from the classical to the quantum domain has been made possible by the development of hybrid quantum-classical algorithms that allow a parameterized quantum circuit to be optimized using gradient-based algorithms that run on a classical computer. The similarities in training of these hybrid algorithms and classical neural networks have further led to the development of Quantum Neural Networks (QNNs). However, in the current training regime for QNNs, the gradients with respect to the objective function have to be computed on the quantum device. This computation is highly non-scalable and is affected by hardware and sampling noise present in the current generation of quantum hardware. In this paper, we propose a training algorithm that does not rely on gradient information. Specifically, we introduce a novel meta-optimization algorithm that trains a meta-optimizer network to output parameters for the quantum circuit such that the objective function is minimized. We empirically and theoretically show that we achieve better-quality minima in fewer circuit evaluations than existing gradient-based algorithms on different datasets.
Submitted 14 April, 2023;
originally announced April 2023.
-
Towards Practical Explainability with Cluster Descriptors
Authors:
Xiaoyuan Liu,
Ilya Tyagin,
Hayato Ushijima-Mwesigwa,
Indradeep Ghosh,
Ilya Safro
Abstract:
With the rapid development of machine learning, improving its explainability has become a crucial research goal. We study the problem of making the clusters more explainable by investigating the cluster descriptors. We are given a set of objects $S$, a clustering $π$ of these objects, and a set of tags $T$ that have not participated in the clustering algorithm, where each object in $S$ is associated with a subset of $T$. The goal is to find a representative set of tags for each cluster, referred to as the cluster descriptors, with the constraint that these descriptors are pairwise disjoint and the total size of all the descriptors is minimized. In general, this problem is NP-hard. We propose a novel explainability model that reinforces the previous models in such a way that tags that do not contribute to explainability and do not sufficiently distinguish between clusters are not added to the optimal descriptors. The proposed model is formulated as a quadratic unconstrained binary optimization problem, which makes it suitable for solving on modern optimization hardware accelerators. We experimentally demonstrate how the proposed explainability model can be solved on specialized hardware for accelerating combinatorial optimization, the Fujitsu Digital Annealer, and use real-life Twitter and PubMed datasets for use cases.
Submitted 20 October, 2022; v1 submitted 17 October, 2022;
originally announced October 2022.
-
Constructing Optimal Contraction Trees for Tensor Network Quantum Circuit Simulation
Authors:
Cameron Ibrahim,
Danylo Lykov,
Zichang He,
Yuri Alexeev,
Ilya Safro
Abstract:
One of the key problems in tensor network based quantum circuit simulation is the construction of a contraction tree which minimizes the cost of the simulation, where the cost can be expressed in the number of operations as a proxy for the simulation running time. This same problem arises in a variety of application areas, such as combinatorial scientific computing, marginalization in probabilistic graphical models, and solving constraint satisfaction problems. In this paper, we reduce the computationally hard portion of this problem to one of graph linear ordering, and demonstrate how existing approaches in this area can be utilized to achieve results up to several orders of magnitude better than existing state of the art methods for the same running time. To do so, we introduce a novel polynomial time algorithm for constructing an optimal contraction tree from a given order. Furthermore, we introduce a fast and high quality linear ordering solver, and demonstrate its applicability as a heuristic for providing orderings for contraction trees. Finally, we compare our solver with competing methods for constructing contraction trees in quantum circuit simulation on a collection of randomly generated Quantum Approximate Optimization Algorithm Max Cut circuits and show that our method achieves superior results on a majority of tested quantum circuits.
Reproducibility: Our source code and data are available at https://github.com/cameton/HPEC2022_ContractionTrees.
Submitted 6 September, 2022;
originally announced September 2022.
-
Quantum Approximate Optimization Algorithm with Sparsified Phase Operator
Authors:
Xiaoyuan Liu,
Ruslan Shaydulin,
Ilya Safro
Abstract:
The Quantum Approximate Optimization Algorithm (QAOA) is a promising candidate algorithm for demonstrating quantum advantage in optimization using near-term quantum computers. However, QAOA has high requirements on gate fidelity due to the need to encode the objective function in the phase separating operator, requiring a large number of gates that potentially do not match the hardware connectivity. Using the MaxCut problem as the target, we demonstrate numerically that an easier way to implement an alternative phase operator can be used in lieu of the phase operator encoding the objective function, as long as the ground state is the same. We observe that if the ground state energy is not preserved, the approximation ratio obtained by QAOA with such phase separating operator is likely to decrease. Moreover, we show that a better alignment of the low energy subspace of the alternative operator leads to better performance. Leveraging these observations, we propose a sparsification strategy that reduces the resource requirements of QAOA. We also compare our sparsification strategy with some other classical graph sparsification methods, and demonstrate the efficacy of our approach.
Submitted 29 April, 2022;
originally announced May 2022.
-
BEINIT: Avoiding Barren Plateaus in Variational Quantum Algorithms
Authors:
Ankit Kulshrestha,
Ilya Safro
Abstract:
Barren plateaus are a notorious problem in the optimization of variational quantum algorithms and pose a critical obstacle in the quest for more efficient quantum machine learning algorithms. Many potential reasons for barren plateaus have been identified but few solutions have been proposed to avoid them in practice. Existing solutions are mainly focused on the initialization of unitary gate parameters without taking into account the changes induced by input data. In this paper, we propose an alternative strategy which initializes the parameters of a unitary gate by drawing from a beta distribution. The hyperparameters of the beta distribution are estimated from the data. To further prevent barren plateau during training we add a novel perturbation at every gradient descent step. Taking these ideas together, we empirically show that our proposed framework significantly reduces the possibility of a complex quantum neural network getting stuck in a barren plateau.
Submitted 28 April, 2022;
originally announced April 2022.
-
Partitioning Dense Graphs with Hardware Accelerators
Authors:
Xiaoyuan Liu,
Hayato Ushijima-Mwesigwa,
Indradeep Ghosh,
Ilya Safro
Abstract:
Graph partitioning is a fundamental combinatorial optimization problem that attracts a lot of attention from theoreticians and practitioners due to its broad applications. From multilevel graph partitioning to more general-purpose optimization solvers such as Gurobi and CPLEX, a wide range of approaches have been developed. Limitations of these approaches are important to study in order to break the computational optimization barriers of this problem. As we approach the limits of Moore's law, there is now a need to explore ways of solving such problems with special-purpose hardware such as quantum computers or quantum-inspired accelerators. In this work, we experiment with solving the graph partitioning problem on the Fujitsu Digital Annealer (special-purpose hardware designed for solving combinatorial optimization problems) and compare it with the existing top solvers. We demonstrate limitations of existing solvers on many dense graphs as well as those of the Digital Annealer on sparse graphs, which opens an avenue to hybridize these approaches.
Submitted 21 February, 2022; v1 submitted 18 February, 2022;
originally announced February 2022.
-
Proactive Query Expansion for Streaming Data Using External Source
Authors:
Farah Alshanik,
Amy Apon,
Yuheng Du,
Alexander Herzog,
Ilya Safro
Abstract:
Query expansion is the process of reformulating the original query by adding relevant words. Choosing which terms to add in order to improve the performance of the query expansion methods or to enhance the quality of the retrieved results is an important aspect of any information retrieval system. Adding words that positively impact the quality of the search query or are sufficiently informative plays an important role in retrieving relevant documents that cover a certain topic and can improve the efficiency of the information retrieval system. Typically, query expansion techniques are used to add or substitute words to a given search query to collect relevant data. In this paper, we design and implement a pipeline of automated query expansion. We outline several tools using different methods to expand the query. Our methods depend on targeting emergent events in streaming data over time and finding the hidden topics from targeted documents using probabilistic topic models. We employ Dynamic Eigenvector Centrality to trigger the emergent events, and Latent Dirichlet Allocation to discover the topics. Also, we use an external data source as a secondary stream to supplement the primary stream with relevant words and expand the query using the words from both primary and secondary streams. An experimental study is performed on Twitter data (primary stream) related to the events that happened during protests in Baltimore in 2015. The quality of the retrieved results was measured using quality indicators of the streaming data: tweet count, hashtag count, and hashtag clustering.
Submitted 17 January, 2022;
originally announced January 2022.
-
CONFAIR: Configurable and Interpretable Algorithmic Fairness
Authors:
Ankit Kulshrestha,
Ilya Safro
Abstract:
The rapid growth of data in recent years has led to the development of complex learning algorithms that are often used to make decisions in the real world. While the positive impact of the algorithms has been tremendous, there is a need to mitigate any bias arising from either training samples or implicit assumptions made about the data samples. This need becomes critical when algorithms are used in automated decision-making systems that can hugely impact people's lives.
Many approaches have been proposed to make learning algorithms fair by detecting and mitigating bias in different stages of optimization. However, due to the lack of a universal definition of fairness, these algorithms optimize for a particular interpretation of fairness, which limits them for real-world use. Moreover, an underlying assumption that is common to all algorithms is the apparent equivalence of achieving fairness and removing bias. In other words, there are no user-defined criteria that can be incorporated into the optimization procedure for producing a fair algorithm. Motivated by these shortcomings of existing methods, we propose the CONFAIR procedure that produces a fair algorithm by incorporating user constraints into the optimization procedure. Furthermore, we make the process interpretable by estimating the most predictive features from data. We demonstrate the efficacy of our approach on several real-world datasets using different fairness criteria.
Submitted 29 December, 2021; v1 submitted 16 November, 2021;
originally announced November 2021.
-
Coping with Mistreatment in Fair Algorithms
Authors:
Ankit Kulshrestha,
Ilya Safro
Abstract:
Machine learning actively impacts our everyday life in almost all endeavors and domains such as healthcare, finance, and energy. As our dependence on machine learning increases, it is inevitable that these algorithms will be used to make decisions that will have a direct impact on society, spanning all resolutions from personal choices to world-wide policies. Hence, it is crucial to ensure that (un)intentional bias does not affect machine learning algorithms, especially when they are required to make decisions that may have unintended consequences. Algorithmic fairness techniques have found traction in the machine learning community, and many methods and metrics have been proposed to ensure and evaluate fairness in algorithms and data collection.
In this paper, we study the algorithmic fairness in a supervised learning setting and examine the effect of optimizing a classifier for the Equal Opportunity metric. We demonstrate that such a classifier has an increased false positive rate across sensitive groups and propose a conceptually simple method to mitigate this bias. We rigorously analyze the proposed method and evaluate it on several real world datasets demonstrating its efficacy.
Submitted 21 February, 2021;
originally announced February 2021.
-
Accelerating COVID-19 research with graph mining and transformer-based learning
Authors:
Ilya Tyagin,
Ankit Kulshrestha,
Justin Sybrandt,
Krish Matta,
Michael Shtutman,
Ilya Safro
Abstract:
In 2020, the White House released the "Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset," wherein artificial intelligence experts are asked to collect data and develop text mining techniques that can help the science community answer high-priority scientific questions related to COVID-19. The Allen Institute for AI and collaborators announced the availability of a rapidly growing open dataset of publications, the COVID-19 Open Research Dataset (CORD-19). As the pace of research accelerates, biomedical scientists struggle to stay current. To expedite their investigations, scientists leverage hypothesis generation systems, which can automatically inspect published papers to discover novel implicit connections. We present the automated general-purpose hypothesis generation systems AGATHA-C and AGATHA-GP for COVID-19 research. The systems are based on graph mining and the transformer model. The systems are massively validated using retrospective information rediscovery and proactive analysis involving human-in-the-loop expert analysis. Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) in fast computational time and are released to the broad scientific community to accelerate biomedical research. In addition, by performing a domain-expert-curated study, we show that the systems are able to discover ongoing research findings such as the relationship between COVID-19 and the oxytocin hormone.
Submitted 29 September, 2021; v1 submitted 10 February, 2021;
originally announced February 2021.
-
Layer VQE: A Variational Approach for Combinatorial Optimization on Noisy Quantum Computers
Authors:
Xiaoyuan Liu,
Anthony Angone,
Ruslan Shaydulin,
Ilya Safro,
Yuri Alexeev,
Lukasz Cincio
Abstract:
Combinatorial optimization on near-term quantum devices is a promising path to demonstrating quantum advantage. However, the capabilities of these devices are constrained by high noise or error rates. In this paper, we propose an iterative Layer VQE (L-VQE) approach, inspired by the Variational Quantum Eigensolver (VQE). We present a large-scale numerical study, simulating circuits with up to 40 qubits and 352 parameters, that demonstrates the potential of the proposed approach. We evaluate quantum optimization heuristics on the problem of detecting multiple communities in networks, for which we introduce a novel qubit-frugal formulation. We numerically compare L-VQE with Quantum Approximate Optimization Algorithm (QAOA) and demonstrate that QAOA achieves lower approximation ratios while requiring significantly deeper circuits. We show that L-VQE is more robust to finite sampling errors and has a higher chance of finding the solution as compared with standard VQE approaches. Our simulation results show that L-VQE performs well under realistic hardware noise.
Submitted 11 May, 2022; v1 submitted 10 February, 2021;
originally announced February 2021.
-
Classical symmetries and the Quantum Approximate Optimization Algorithm
Authors:
Ruslan Shaydulin,
Stuart Hadfield,
Tad Hogg,
Ilya Safro
Abstract:
We study the relationship between the Quantum Approximate Optimization Algorithm (QAOA) and the underlying symmetries of the objective function to be optimized. Our approach formalizes the connection between quantum symmetry properties of the QAOA dynamics and the group of classical symmetries of the objective function. The connection is general and includes but is not limited to problems defined on graphs. We show a series of results exploring the connection and highlight examples of hard problem classes where a nontrivial symmetry subgroup can be obtained efficiently. In particular we show how classical objective function symmetries lead to invariant measurement outcome probabilities across states connected by such symmetries, independent of the choice of algorithm parameters or number of layers. To illustrate the power of the developed connection, we apply machine learning techniques towards predicting QAOA performance based on symmetry considerations. We provide numerical evidence that a small set of graph symmetry properties suffices to predict the minimum QAOA depth required to achieve a target approximation ratio on the MaxCut problem, in a practically important setting where QAOA parameter schedules are constrained to be linear and hence easier to optimize.
Submitted 27 October, 2021; v1 submitted 8 December, 2020;
originally announced December 2020.
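The classical side of the symmetry argument in the abstract above can be illustrated with a small check. This is only a sketch under our own assumptions: the helper names (`cut_value`, `flip_all`, `permute`, `check_symmetries`) and the 4-cycle example are hypothetical and not taken from the paper. It verifies that the MaxCut objective is unchanged by a global bit flip and by a graph automorphism; it does not simulate QAOA itself.

```python
from itertools import product

def cut_value(edges, bits):
    """Number of edges whose endpoints receive different labels."""
    return sum(1 for u, v in edges if bits[u] != bits[v])

def flip_all(bits):
    """Global bit flip: swap the two sides of the cut."""
    return tuple(1 - b for b in bits)

def permute(bits, perm):
    """Relabel vertices: vertex i takes the label previously on perm[i]."""
    return tuple(bits[perm[i]] for i in range(len(bits)))

def check_symmetries(edges, n, perm):
    """True if the cut value of every assignment is invariant under
    the global bit flip and under the vertex permutation `perm`."""
    for bits in product((0, 1), repeat=n):
        c = cut_value(edges, bits)
        if cut_value(edges, flip_all(bits)) != c:
            return False
        if cut_value(edges, permute(bits, perm)) != c:
            return False
    return True

# Illustrative instance: the 4-cycle 0-1-2-3-0.
# The rotation i -> i+1 (mod 4) maps edges to edges, so it is an automorphism.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rotation = [1, 2, 3, 0]

# Counterexample: swapping vertices 0 and 1 is not an automorphism of the path 0-1-2.
path_edges = [(0, 1), (1, 2)]
swap01 = [1, 0, 2]
```

Calling `check_symmetries(cycle_edges, 4, rotation)` succeeds, while the same check with `path_edges` and `swap01` fails, because that permutation does not preserve the edge set.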
-
Accelerating Text Mining Using Domain-Specific Stop Word Lists
Authors:
Farah Alshanik,
Amy Apon,
Alexander Herzog,
Ilya Safro,
Justin Sybrandt
Abstract:
Text preprocessing is an essential step in text mining. Removing words that can negatively impact the quality of prediction algorithms or are not informative enough is a crucial storage-saving technique in text indexing and results in improved computational efficiency. Typically, a generic stop word list is applied to a dataset regardless of the domain. However, many common words are different from one domain to another but have no significance within a particular domain. Eliminating domain-specific common words in a corpus reduces the dimensionality of the feature space and improves the performance of text mining tasks. In this paper, we present a novel mathematical approach for the automatic extraction of domain-specific words called the hyperplane-based approach. This new approach depends on a low-dimensional representation of each word in vector space and its distance from a hyperplane. The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features. We compare the hyperplane-based approach with other feature selection methods, namely χ² and mutual information. An experimental study is performed on three different datasets and five classification algorithms to measure the dimensionality reduction and the increase in classification performance. Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information. The computational time to identify the domain-specific words is significantly lower than that of mutual information.
Submitted 18 November, 2020;
originally announced December 2020.
-
AML-SVM: Adaptive Multilevel Learning with Support Vector Machines
Authors:
Ehsan Sadrfaridpour,
Korey Palmer,
Ilya Safro
Abstract:
The support vector machine (SVM) is one of the most widely used and practical optimization-based classification models in machine learning because of its interpretability and flexibility to produce high-quality results. However, big data poses a difficulty for the most sophisticated but relatively slow versions of SVM, namely, the nonlinear SVM. The complexity of nonlinear SVM solvers and the number of elements in the kernel matrix increase quadratically with the number of samples in the training data. Therefore, both runtime and memory requirements are negatively affected. Moreover, the parameter fitting has extra kernel parameters to tune, which increases the runtime even further. This paper proposes an adaptive multilevel learning framework for the nonlinear SVM, which addresses these challenges, improves the classification quality across the refinement process, and leverages multi-threaded parallel processing for better performance. The integration of parameter fitting in the hierarchical learning framework and an adaptive process to stop unnecessary computation significantly reduce the running time while increasing the overall performance. The experimental results demonstrate reduced variance on prediction over validation and test data across levels in the hierarchy, and significant speedup compared to state-of-the-art nonlinear SVM libraries without a decrease in the classification quality. The code is accessible at https://github.com/esadr/amlsvm.
Submitted 4 November, 2020;
originally announced November 2020.
-
Unsupervised Hierarchical Graph Representation Learning by Mutual Information Maximization
Authors:
Fei Ding,
Xiaohong Zhang,
Justin Sybrandt,
Ilya Safro
Abstract:
Graph representation learning based on graph neural networks (GNNs) can greatly improve the performance of downstream tasks, such as node and graph classification. However, the general GNN models do not aggregate node information in a hierarchical manner, and can miss key higher-order structural features of many graphs. The hierarchical aggregation also enables the graph representations to be explainable. In addition, supervised graph representation learning requires labeled data, which is expensive and error-prone. To address these issues, we present an unsupervised graph representation learning method, Unsupervised Hierarchical Graph Representation (UHGR), which can generate hierarchical representations of graphs. Our method focuses on maximizing mutual information between "local" and high-level "global" representations, which enables us to learn the node embeddings and graph embeddings without any labeled data. To demonstrate the effectiveness of the proposed method, we perform the node and graph classification using the learned node and graph embeddings. The results show that the proposed method achieves comparable results to state-of-the-art supervised methods on several benchmarks. In addition, our visualization of hierarchical representations indicates that our method can capture meaningful and interpretable clusters.
Submitted 28 July, 2020; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Recent Advances in Scalable Network Generation
Authors:
Manuel Penschuck,
Ulrik Brandes,
Michael Hamann,
Sebastian Lamm,
Ulrich Meyer,
Ilya Safro,
Peter Sanders,
Christian Schulz
Abstract:
Random graph models are frequently used as a controllable and versatile data source for experimental campaigns in various research fields. Generating such data-sets at scale is a non-trivial task as it requires design decisions typically spanning multiple areas of expertise. Challenges begin with the identification of relevant domain-specific network features, continue with the question of how to compile such features into a tractable model, and culminate in algorithmic details arising while implementing the pertaining model.
In the present survey, we explore crucial aspects of random graph models with known scalable generators. We begin by briefly introducing network features considered by such models, and then discuss random graphs alongside generation algorithms. Our focus lies on modelling techniques and algorithmic primitives that have proven successful in obtaining massive graphs. We consider concepts and graph models for various domains (such as social network, infrastructure, ecology, and numerical simulations), and discuss generators for different models of computation (including shared-memory parallelism, massive-parallel GPUs, and distributed systems).
Submitted 2 March, 2020;
originally announced March 2020.
-
CBAG: Conditional Biomedical Abstract Generation
Authors:
Justin Sybrandt,
Ilya Safro
Abstract:
Biomedical research papers use significantly different language and jargon when compared to typical English text, which reduces the utility of pre-trained NLP models in this domain. Meanwhile Medline, a database of biomedical abstracts, introduces nearly a million new documents per-year. Applications that could benefit from understanding this wealth of publicly available information, such as scientific writing assistants, chat-bots, or descriptive hypothesis generation systems, require new domain-centered approaches. A conditional language model, one that learns the probability of words given some a priori criteria, is a fundamental building block in many such applications. We propose a transformer-based conditional language model with a shallow encoder "condition" stack, and a deep "language model" stack of multi-headed attention blocks. The condition stack encodes metadata used to alter the output probability distribution of the language model stack. We sample this distribution in order to generate biomedical abstracts given only a proposed title, an intended publication year, and a set of keywords. Using typical natural language generation metrics, we demonstrate that this proposed approach is more capable of producing non-trivial relevant entities within the abstract body than the 1.5B parameter GPT-2 language model.
Submitted 13 February, 2020;
originally announced February 2020.
-
AGATHA: Automatic Graph-mining And Transformer based Hypothesis generation Approach
Authors:
Justin Sybrandt,
Ilya Tyagin,
Michael Shtutman,
Ilya Safro
Abstract:
Medical research is risky and expensive. Drug discovery, as an example, requires that researchers efficiently winnow thousands of potential targets to a small candidate set for more thorough evaluation. However, research groups spend significant time and money to perform the experiments necessary to determine this candidate set long before seeing intermediate results. Hypothesis generation systems address this challenge by mining the wealth of publicly available scientific information to predict plausible research directions. We present AGATHA, a deep-learning hypothesis generation system that can introduce data-driven insights earlier in the discovery process. Through a learned ranking criterion, this system quickly prioritizes plausible term-pairs among entity sets, allowing us to recommend new research directions. We massively validate our system with a temporal holdout wherein we predict connections first introduced after 2015 using data published beforehand. We additionally explore biomedical sub-domains, and demonstrate AGATHA's predictive capacity across the twenty most popular relationship types. This system achieves best-in-class performance on an established benchmark, and demonstrates high recommendation scores across subdomains. Reproducibility: All code, experimental data, and pre-trained models are available online: sybrandt.com/2020/agatha
Submitted 13 February, 2020;
originally announced February 2020.
-
Leveraging Special-Purpose Hardware for Local Search Heuristics
Authors:
Xiaoyuan Liu,
Hayato Ushijima-Mwesigwa,
Avradip Mandal,
Sarvagya Upadhyay,
Ilya Safro,
Arnab Roy
Abstract:
As we approach the physical limits predicted by Moore's law, a variety of specialized hardware is emerging to tackle specialized tasks in different domains. Within combinatorial optimization, adiabatic quantum computers, CMOS annealers, and optical parametric oscillators are a few of the emerging specialized hardware technologies aimed at solving optimization problems. In terms of mathematical framework, the Ising optimization model unifies all of these emerging special-purpose hardware. In other words, they are all designed to solve optimization problems expressed in the Ising model or, equivalently, as a quadratic unconstrained binary optimization model. Due to a variety of constraints specific to each type of hardware, they usually suffer from a major challenge: the number of variables that the hardware can manage to solve is very limited. Given that large-scale practical problems, including problems in operations research, combinatorial scientific computing, data science and network science, require significantly more variables to model than these devices provide, we are likely to witness that cloud-based deployments of these devices will be available for parallel and shared access. Thus hybrid techniques in combination with both hardware and software must be developed to utilize these technologies. Local search meta-heuristics are one of the approaches to tackle large-scale problems. However, a general optimization step within local search is not traditionally formulated in the Ising form. In this work, we propose a new meta-heuristic to model local search in the Ising form for the special-purpose hardware devices. As such, we demonstrate that our method takes the limitations of the Ising model and current hardware into account, utilizes a given hardware more efficiently compared to previous approaches, while also producing high quality solutions compared to other well-known meta-heuristics.
Submitted 28 November, 2020; v1 submitted 21 November, 2019;
originally announced November 2019.
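The abstract above relies on the equivalence between the Ising model and quadratic unconstrained binary optimization (QUBO). A minimal sketch of that standard change of variables, x_i = (1 + s_i)/2 with x_i in {0, 1} and s_i in {-1, +1}, is given below. The function names, the dictionary layout (upper-triangular keys `(i, j)` with `i <= j`), and the example coefficients are illustrative assumptions, not part of the paper.

```python
from itertools import product

def qubo_to_ising(Q):
    """Convert a QUBO {(i, j): w} with i <= j into Ising fields h,
    couplings J, and a constant offset, using x_i = (1 + s_i) / 2."""
    h, J, offset = {}, {}, 0.0
    for (i, j), w in Q.items():
        if i == j:
            # w * x_i = w/2 + (w/2) * s_i
            h[i] = h.get(i, 0.0) + w / 2
            offset += w / 2
        else:
            # w * x_i * x_j = (w/4) * (1 + s_i + s_j + s_i * s_j)
            J[(i, j)] = J.get((i, j), 0.0) + w / 4
            h[i] = h.get(i, 0.0) + w / 4
            h[j] = h.get(j, 0.0) + w / 4
            offset += w / 4
    return h, J, offset

def qubo_energy(Q, x):
    """QUBO objective; x_i * x_i equals x_i for binary x_i."""
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())

def ising_energy(h, J, offset, s):
    """Ising objective with linear fields, pairwise couplings, and offset."""
    linear = sum(hi * s[i] for i, hi in h.items())
    pairwise = sum(w * s[i] * s[j] for (i, j), w in J.items())
    return offset + linear + pairwise

def max_abs_gap(Q, n):
    """Largest |QUBO energy - Ising energy| over all 2^n assignments."""
    h, J, offset = qubo_to_ising(Q)
    gap = 0.0
    for x in product((0, 1), repeat=n):
        s = tuple(2 * xi - 1 for xi in x)  # inverse map: s_i = 2 x_i - 1
        gap = max(gap, abs(qubo_energy(Q, x) - ising_energy(h, J, offset, s)))
    return gap

# Hypothetical 3-variable instance used only to exercise the conversion.
example_Q = {(0, 0): -1.0, (1, 1): 2.0, (0, 1): 3.0, (1, 2): -4.0}
```

For the example instance, `max_abs_gap(example_Q, 3)` is zero up to floating-point error, which confirms that both forms assign the same energy to every assignment.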
-
ELRUNA: Elimination Rule-based Network Alignment
Authors:
Zirou Qiu,
Ruslan Shaydulin,
Xiaoyuan Liu,
Yuri Alexeev,
Christopher S. Henry,
Ilya Safro
Abstract:
Networks model a variety of complex phenomena across different domains. In many applications, one of the most essential tasks is to align two or more networks to infer the similarities between cross-network vertices and discover potential node-level correspondence. In this paper, we propose ELRUNA (Elimination rule-based network alignment), a novel network alignment algorithm that relies exclusively on the underlying graph structure. Under the guidance of the elimination rules that we defined, ELRUNA computes the similarity between a pair of cross-network vertices iteratively by accumulating the similarities between their selected neighbors. The resulting cross-network similarity matrix is then used to infer a permutation matrix that encodes the final alignment of cross-network vertices. In addition to the novel alignment algorithm, we also improve the performance of local search, a commonly used post-processing step for solving the network alignment problem, by introducing a novel selection method RAWSEM (Randomwalk based selection method) based on the propagation of the levels of mismatching (defined in the paper) of vertices across the networks. The key idea is to pass on the initial levels of mismatching of vertices throughout the entire network in a random-walk fashion. Through extensive numerical experiments on real networks, we demonstrate that ELRUNA significantly outperforms the state-of-the-art alignment methods in terms of alignment accuracy under lower or comparable running time. Moreover, ELRUNA is robust to network perturbations such that it can maintain a close to optimal objective value under a high level of noise added to the original networks. Finally, the proposed RAWSEM can further improve the alignment quality with fewer iterations compared with the naive local search method.
Submitted 23 February, 2021; v1 submitted 29 October, 2019;
originally announced November 2019.
-
Multilevel Combinatorial Optimization Across Quantum Architectures
Authors:
Hayato Ushijima-Mwesigwa,
Ruslan Shaydulin,
Christian F. A. Negre,
Susan M. Mniszewski,
Yuri Alexeev,
Ilya Safro
Abstract:
Emerging quantum processors provide an opportunity to explore new approaches for solving traditional problems in the post Moore's law supercomputing era. However, the limited number of qubits makes it infeasible to tackle massive real-world datasets directly in the near future, leading to new challenges in utilizing these quantum processors for practical purposes. Hybrid quantum-classical algorithms that leverage both quantum and classical types of devices are considered one of the main strategies to apply quantum computing to large-scale problems. In this paper, we advocate the use of multilevel frameworks for combinatorial optimization as a promising general paradigm for designing hybrid quantum-classical algorithms. In order to demonstrate this approach, we apply this method to two well-known combinatorial optimization problems, namely, the Graph Partitioning Problem, and the Community Detection Problem. We develop hybrid multilevel solvers with quantum local search on D-Wave's quantum annealer and IBM's gate-model based quantum processor. We carry out experiments on graphs that are orders of magnitude larger than the current quantum hardware size, and we observe results comparable to state-of-the-art solvers in terms of quality of the solution.
Submitted 22 September, 2020; v1 submitted 22 October, 2019;
originally announced October 2019.
-
Hypergraph Partitioning With Embeddings
Authors:
Justin Sybrandt,
Ruslan Shaydulin,
Ilya Safro
Abstract:
Problems in scientific computing, such as distributing large sparse matrix operations, have analogous formulations as hypergraph partitioning problems. A hypergraph is a generalization of a traditional graph wherein "hyperedges" may connect any number of nodes. As a result, hypergraph partitioning is an NP-Hard problem to both solve and approximate. State-of-the-art algorithms that solve this problem follow the multilevel paradigm, which begins by iteratively "coarsening" the input hypergraph to smaller problem instances that share key structural features. Once an approximate problem that is small enough to be solved directly is identified, its solution can be interpolated and refined to the original problem. While this strategy represents an excellent trade off between quality and running time, it is sensitive to the coarsening strategy. In this work we propose using graph embeddings of the initial hypergraph in order to ensure that coarsened problem instances retain key structural features. Our approach prioritizes coarsening within self-similar regions of the input graph, and leads to significantly improved solution quality across a range of considered hypergraphs. Reproducibility: All source code, plots and experimental data are available at https://sybrandt.com/2019/partition.
Submitted 25 August, 2020; v1 submitted 9 September, 2019;
originally announced September 2019.
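The coarsen-then-interpolate step of the multilevel paradigm described in the abstract above can be sketched on an ordinary weighted graph. This is a deliberately simplified illustration, not the paper's method: it uses a greedy heavy-edge matching instead of embedding-guided coarsening, works on graphs rather than hypergraphs, and ignores vertex weights and balance constraints. All function names and the example graph are our own assumptions.

```python
from itertools import product

def coarsen(n, edges):
    """Contract a greedy heavy-edge matching.
    edges: {(u, v): weight} with u < v.
    Returns (coarse_n, coarse_edges, mapping), where mapping[v] is the
    coarse vertex that fine vertex v was merged into."""
    mapping = [-1] * n
    next_id = 0
    # Visit the heaviest edges first and merge endpoints that are still free.
    for (u, v), _ in sorted(edges.items(), key=lambda kv: -kv[1]):
        if mapping[u] == -1 and mapping[v] == -1:
            mapping[u] = mapping[v] = next_id
            next_id += 1
    # Unmatched vertices become singleton coarse vertices.
    for v in range(n):
        if mapping[v] == -1:
            mapping[v] = next_id
            next_id += 1
    # Accumulate weights of parallel edges; drop edges inside a merged pair.
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = mapping[u], mapping[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return next_id, coarse_edges, mapping

def cut_weight(edges, part):
    """Total weight of edges whose endpoints lie in different parts."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])

def project(mapping, coarse_part):
    """Interpolate a coarse partition back to the fine graph."""
    return [coarse_part[c] for c in mapping]

def projection_preserves_cut(n, edges):
    """Check that every coarse bipartition has the same cut weight
    before and after projection to the fine graph."""
    coarse_n, coarse_edges, mapping = coarsen(n, edges)
    for coarse_part in product((0, 1), repeat=coarse_n):
        fine_part = project(mapping, coarse_part)
        if cut_weight(coarse_edges, coarse_part) != cut_weight(edges, fine_part):
            return False
    return True

# Hypothetical 4-vertex example with two heavy edges that get contracted.
example_edges = {(0, 1): 5, (1, 2): 1, (2, 3): 5, (0, 3): 1, (0, 2): 2}
```

The check in `projection_preserves_cut` shows why a solution found on the coarse problem is a meaningful starting point: its objective value carries over exactly to the finer level, where a refinement step (omitted here) would then try to improve it.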
-
FOBE and HOBE: First- and High-Order Bipartite Embeddings
Authors:
Justin Sybrandt,
Ilya Safro
Abstract:
Typical graph embeddings may not capture type-specific bipartite graph features that arise in such areas as recommender systems, data visualization, and drug discovery. Machine learning methods utilized in these applications would be better served with specialized embedding techniques. We propose two embeddings for bipartite graphs that decompose edges into sets of indirect relationships between node neighborhoods. When sampling higher-order relationships, we reinforce similarities through algebraic distance on graphs. We also introduce ensemble embeddings to combine both into a "best of both worlds" embedding. The proposed methods are evaluated on link prediction and recommendation tasks and compared with other state-of-the-art embeddings. While being all highly beneficial in applications, we demonstrate that none of the considered embeddings is clearly superior (in contrast to what is claimed in many papers), and discuss the trade offs present among them. Reproducibility: Our code, data sets, and results are all publicly available online at: http://sybrandt.com/2020/fobe_hobe.
Submitted 22 July, 2020; v1 submitted 26 May, 2019;
originally announced May 2019.
-
Centralities for Networks with Consumable Resources
Authors:
Hayato Ushijima-Mwesigwa,
Zadid Khan,
Mashrur A. Chowdhury,
Ilya Safro
Abstract:
Identification of influential nodes is an important step in understanding and controlling the dynamics of information, traffic and spreading processes in networks. As a result, a number of centrality measures have been proposed and used across different application domains. At the heart of many of these measures, lies an assumption describing the manner in which traffic (of information, social actors, particles, etc.) flows through the network. For example, some measures only count shortest paths while others consider random walks.
This paper considers a spreading process in which a resource necessary for transit is partially consumed along the way while being refilled at special nodes on the network. Examples include fuel consumption of vehicles together with refueling stations, information loss during dissemination with error correcting nodes, and consumption of ammunition of military troops while moving.
We propose generalizations of the well-known measures of betweenness, random walk betweenness, and Katz centralities to take such a spreading process with consumable resources into account. In order to validate the results, experiments on real-world networks are carried out by developing simulations based on well-known models such as Susceptible-Infected-Recovered and congestion with respect to particle hopping from vehicular flow theory. The simulation-based models are shown to be highly correlated to the proposed centrality measures.
Submitted 2 March, 2019;
originally announced March 2019.
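The spreading process in the abstract above, where a resource is consumed in transit and refilled at special nodes, changes which paths are feasible at all. The sketch below illustrates only that building block, under simplifying assumptions of our own: every hop costs one unit, the walker starts with a full tank, and refill nodes restore the resource to full capacity. The function name, parameters, and example graph are hypothetical; the centrality measures proposed in the paper are not reproduced here.

```python
from collections import deque

def shortest_feasible_hops(adj, source, target, capacity, stations):
    """Fewest hops from source to target when each hop consumes one unit
    of a resource, the walker starts with `capacity` units, and the
    resource is reset to `capacity` on arrival at any node in `stations`.
    Returns None if the target cannot be reached."""
    start = (source, capacity)
    dist = {start: 0}
    queue = deque([start])
    # Breadth-first search over (node, remaining resource) states.
    while queue:
        node, fuel = queue.popleft()
        if node == target:
            return dist[(node, fuel)]
        if fuel == 0:
            continue  # stranded: no resource left to make another hop
        for nxt in adj[node]:
            new_fuel = capacity if nxt in stations else fuel - 1
            state = (nxt, new_fuel)
            if state not in dist:
                dist[state] = dist[(node, fuel)] + 1
                queue.append(state)
    return None

# Hypothetical network: a short route 0-1-2 and a longer route 0-3-4-2
# that passes through refill nodes 3 and 4.
example_adj = {0: [1, 3], 1: [0, 2], 2: [1, 4], 3: [0, 4], 4: [3, 2]}
```

With capacity 1 and refill nodes {3, 4}, the two-hop route is infeasible and the walker must take the three-hop detour, so nodes 3 and 4 carry all of the traffic even though they are not on the unconstrained shortest path. This is the kind of effect that resource-aware generalizations of betweenness are meant to capture.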
-
Multilevel Graph Partitioning for Three-Dimensional Discrete Fracture Network Flow Simulations
Authors:
Hayato Ushijima-Mwesigwa,
Jeffrey D. Hyman,
Aric Hagberg,
Ilya Safro,
Satish Karra,
Carl W. Gable,
Matthew R. Sweeney,
Gowri Srinivasan
Abstract:
We present a topology-based method for mesh-partitioning in three-dimensional discrete fracture network (DFN) simulations that takes advantage of the intrinsic multi-level nature of a DFN. DFN models are used to simulate flow and transport through low-permeability fractured media in the subsurface by explicitly representing fractures as discrete entities. The governing equations for flow and transport are numerically integrated on computational meshes generated on the interconnected fracture networks. Modern high-fidelity DFN simulations require high-performance computing on multiple processors where performance and scalability depend partially on obtaining a high-quality partition of the mesh to balance work-loads and minimize communication across all processors. The discrete structure of a DFN naturally lends itself to various graph representations. We develop two applications of the multilevel graph partitioning algorithm to partition the mesh of a DFN. In the first, we project a partition of the graph based on the DFN topology onto the mesh of the DFN and in the second, this projection is used as the initial condition for further partitioning refinement of the mesh. We compare the performance of these methods with standard multi-level graph partitioning using graph-based metrics (cut, imbalance, partitioning time), computational-based metrics (FLOPS, iterations, solver time), and total run time. The DFN-based and the mesh-based partitioning methods are comparable in terms of the graph-based metrics, but the DFN-based partition is obtained several orders of magnitude faster. In combination, these partitions are obtained several orders of magnitude faster than the mesh-based partition. In turn, this hybrid method outperformed both of the other methods in terms of the total run time.
Submitted 1 April, 2021; v1 submitted 18 February, 2019;
originally announced February 2019.
-
Community Detection Across Emerging Quantum Architectures
Authors:
Ruslan Shaydulin,
Hayato Ushijima-Mwesigwa,
Ilya Safro,
Susan Mniszewski,
Yuri Alexeev
Abstract:
One of the roadmap plans for quantum computers is an integration within HPC ecosystems assigning them a role of accelerators for a variety of computationally hard tasks. However, in the near term, quantum hardware will be in a constant state of change. Heading towards solving real-world problems, we advocate development of portable, architecture-agnostic hybrid quantum-classical frameworks and demonstrate one for the community detection problem evaluated using quantum annealing and gate-based universal quantum computation paradigms.
Submitted 1 October, 2018;
originally announced October 2018.
-
Spatio-temporal prediction of crimes using network analytic approach
Authors:
Saroj Kumar Dash,
Ilya Safro,
Ravisutha Sakrepatna Srinivasamurthy
Abstract:
A larger share of the population lives in urban areas today than at any time in human history, and this trend is expected to continue in the coming years. A study [5] reports that nearly 80.7% of the total population of the USA lives in urban areas. By 2030, nearly 60% of the world's population will live in or move to cities. With the increase in urban population, it is important to keep an eye on criminal activities. By doing so, governments can enforce intelligent policing systems, and hence many government agencies and local authorities have made crime data publicly available. In this paper, we analyze Chicago crime data fused with other social information sources using network analytic techniques to predict criminal activity for the next year. We observe that as we add more layers of data that represent different aspects of society, the quality of prediction improves. Our prediction models predict not only the total number of crimes for the whole city of Chicago but also the number of crimes of each type and in different regions of the city.
Submitted 30 October, 2018; v1 submitted 19 August, 2018;
originally announced August 2018.
-
Are Abstracts Enough for Hypothesis Generation?
Authors:
Justin Sybrandt,
Angelo Carrabba,
Alexander Herzog,
Ilya Safro
Abstract:
The potential for automatic hypothesis generation (HG) systems to improve research productivity keeps pace with the growing set of publicly available scientific information. But as data becomes easier to acquire, we must understand the effect different textual data sources have on our resulting hypotheses. Are abstracts enough for HG, or does it need full-text papers? How many papers does an HG system need to make valuable predictions? How sensitive is a general-purpose HG system to hyperparameter values or input quality? What effect do corpus size and document length have on HG results? To answer these questions we train multiple versions of a knowledge network-based HG system, Moliere, on varying corpora in order to compare challenges and trade offs in terms of result quality and computational requirements. Moliere generalizes the main principles of similar knowledge network-based HG systems and reinforces them with topic modeling components. The corpora include the abstract and full-text versions of PubMed Central, as well as iterative halves of MEDLINE, which allows us to compare the effect document length and count have on the results. We find that, quantitatively, corpora with a higher median document length result in marginally higher quality results, yet require substantially longer to process. However, qualitatively, full-length papers introduce a significant number of intruder terms to the resulting topics, which decreases human interpretability. Additionally, we find that the effect of document length is greater than that of document count, even if both sets contain only paper abstracts. Reproducibility: Our code and data are available at github.com/jsybran/moliere, and bit.ly/2GxghpM respectively.
Submitted 20 October, 2018; v1 submitted 13 April, 2018;
originally announced April 2018.
-
Multiscale Planar Graph Generation
Authors:
Varsha Chauhan,
Alexander Gutfraind,
Ilya Safro
Abstract:
The study of network representations of physical, biological, and social phenomena can help us better understand the structural and functional dynamics of their networks and formulate predictive models of these phenomena. However, real-world network data is scarce, owing to factors such as the cost and effort required to collect it and its sensitivity to theft and misuse, so engineers and researchers often rely on synthetic data for simulations, hypothesis testing, decision making, and algorithm engineering. An important characteristic of infrastructure networks such as roads, water distribution, and other utility systems is that they can be embedded in a plane; therefore, to simulate these systems we need realistic networks that are also planar. While the currently available synthetic network generators can model networks that exhibit realism, they do not guarantee or achieve planarity. Therefore, in this paper we present a flexible algorithm that can synthesize realistic networks that are planar. The method follows a multi-scale randomized editing approach, generating a hierarchy of coarsened networks of a given planar graph and introducing edits at various levels in the hierarchy. The method preserves the structural properties, including the planarity of the network, with minimal bias, while introducing realistic variability at multiple scales.
Submitted 12 May, 2019; v1 submitted 26 February, 2018;
originally announced February 2018.
-
Aggregative Coarsening for Multilevel Hypergraph Partitioning
Authors:
Ruslan Shaydulin,
Ilya Safro
Abstract:
Algorithms for many hypergraph problems, including partitioning, utilize multilevel frameworks to achieve a good trade-off between the performance and the quality of results. In this paper we introduce two novel aggregative coarsening schemes and incorporate them within the state-of-the-art hypergraph partitioner Zoltan. Our coarsening schemes are inspired by the algebraic multigrid and stable matching approaches. We demonstrate the effectiveness of the developed schemes as part of a multilevel hypergraph partitioning framework on a wide range of problems.
Submitted 9 April, 2018; v1 submitted 26 February, 2018;
originally announced February 2018.
-
Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking
Authors:
Justin Sybrandt,
Michael Shtutman,
Ilya Safro
Abstract:
The first step of many research projects is to define and rank a short list of candidates for study. Given the rapid pace of modern scientific progress, some turn to automated hypothesis generation (HG) systems to aid this process. These systems can identify implicit or overlooked connections within a large scientific corpus, and while their importance grows alongside the pace of science, they lack thorough validation. Without any standard numerical evaluation method, many validate general-purpose HG systems by rediscovering a handful of historical findings, and some wishing to be more thorough may run laboratory experiments based on automatic suggestions. These methods are expensive, time-consuming, and cannot scale. Thus, we present a numerical evaluation framework for validating HG systems that leverages thousands of validation hypotheses. This method evaluates an HG system by its ability to rank hypotheses by plausibility, a process reminiscent of human candidate selection. Because HG systems, specifically those that produce topic models, do not produce a ranking criterion, we additionally present novel metrics to quantify the plausibility of hypotheses given topic model system output. Finally, we demonstrate that our proposed validation method aligns with real-world research goals by deploying our method within Moliere, our recent topic-driven HG system, in order to automatically generate a set of candidate genes related to HIV-associated neurodegenerative disease (HAND). By performing laboratory experiments based on this candidate set, we discover a new connection between HAND and Dead Box RNA Helicase 3 (DDX3). Reproducibility: code, validation data, and results can be found at sybrandt.com/2018/validation.
Submitted 5 December, 2018; v1 submitted 11 February, 2018;
originally announced February 2018.
-
Relaxation-Based Coarsening for Multilevel Hypergraph Partitioning
Authors:
Ruslan Shaydulin,
Jie Chen,
Ilya Safro
Abstract:
Multilevel partitioning methods that are inspired by principles of multiscaling are the most powerful practical hypergraph partitioning solvers. Hypergraph partitioning has many applications in disciplines ranging from scientific computing to data science. In this paper we introduce the concept of algebraic distance on hypergraphs and demonstrate its use as an algorithmic component in the coarsening stage of multilevel hypergraph partitioning solvers. The algebraic distance is a vertex distance measure that extends hyperedge weights for capturing the local connectivity of vertices which is critical for hypergraph coarsening schemes. The practical effectiveness of the proposed measure and corresponding coarsening scheme is demonstrated through extensive computational experiments on a diverse set of problems. Finally, we propose a benchmark of hypergraph partitioning problems to compare the quality of other solvers.
Submitted 8 February, 2019; v1 submitted 17 October, 2017;
originally announced October 2017.
-
Detecting and monitoring foodborne illness outbreaks: Twitter communications and the 2015 U.S. Salmonella outbreak linked to imported cucumbers
Authors:
Yuliya V. Bolotova,
Jie Lou,
Ilya Safro
Abstract:
This research uses Twitter, as a social media device, to track communications related to the 2015 U.S. foodborne illness outbreak linked to Salmonella in imported cucumbers from Mexico. The relevant Twitter data are analyzed in light of the timeline of the official announcements made by the Centers for Disease Control and Prevention (CDC). The largest number of registered tweets is associated with the period immediately following the CDC initial announcement and the official release of the first recall of cucumbers.
Submitted 24 August, 2017;
originally announced August 2017.
-
Utility Maximization Framework for Opportunistic Wireless Electric Vehicle Charging
Authors:
MD Zadid Khan,
Mashrur Chowdhury,
Sakib Mahmud Khan,
Ilya Safro,
Hayato Ushijima-Mwesigwa
Abstract:
This is an extended abstract, it has no separate abstract section
Submitted 6 December, 2017; v1 submitted 22 August, 2017;
originally announced August 2017.
-
Engineering fast multilevel support vector machines
Authors:
E. Sadrfaridpour,
T. Razzaghi,
I. Safro
Abstract:
The computational complexity of solving a nonlinear support vector machine (SVM) is prohibitive on large-scale data. In particular, this issue becomes very sensitive when the data presents additional difficulties such as highly imbalanced class sizes. Typically, nonlinear kernels produce significantly higher classification quality than linear kernels but introduce extra kernel and model parameters, which require computationally expensive fitting. This increases the quality but also reduces the performance dramatically. We introduce a generalized fast multilevel framework for regular and weighted SVM and discuss several versions of its algorithmic components that lead to a good trade-off between quality and time. Our framework is implemented using PETSc, which allows easy integration with scientific computing tasks. The experimental results demonstrate a significant speedup compared to the state-of-the-art nonlinear SVM libraries.
Reproducibility: our source code, documentation, and parameters are available at https://github.com/esadr/mlsvm.
Submitted 5 April, 2019; v1 submitted 24 July, 2017;
originally announced July 2017.
-
MOLIERE: Automatic Biomedical Hypothesis Generation System
Authors:
Justin Sybrandt,
Michael Shtutman,
Ilya Safro
Abstract:
Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.
Submitted 31 May, 2017; v1 submitted 20 February, 2017;
originally announced February 2017.
-
Algebraic multigrid support vector machines
Authors:
Ehsan Sadrfaridpour,
Sandeep Jeereddy,
Ken Kennedy,
Andre Luckow,
Talayeh Razzaghi,
Ilya Safro
Abstract:
The support vector machine is a flexible optimization-based technique widely used for classification problems. In practice, its training part becomes computationally expensive on large-scale data sets for reasons such as the complexity and number of iterations in parameter fitting methods, underlying optimization solvers, and nonlinearity of kernels. We introduce a fast multilevel framework for solving support vector machine models that is inspired by the algebraic multigrid. Significant improvement in the running time has been achieved without any loss in the quality. The proposed technique is highly beneficial on imbalanced sets. We demonstrate computational results on publicly available and industrial data sets.
Submitted 23 November, 2016; v1 submitted 16 November, 2016;
originally announced November 2016.
-
Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)
Authors:
Chris Gropp,
Alexander Herzog,
Ilya Safro,
Paul W. Wilson,
Amy W. Apon
Abstract:
Topic modeling, a method for extracting the underlying themes from a collection of documents, is an increasingly important component of the design of intelligent systems enabling the sense-making of highly dynamic and diverse streams of text data. Traditional methods such as Dynamic Topic Modeling (DTM) do not lend themselves well to direct parallelization because of dependencies from one time step to another. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. Our approach is based on data decomposition in which the data is partitioned into segments, followed by topic modeling on the individual segments. The resulting local models are then combined into a global solution using clustering. The decomposition and resulting parallelization lead to very fast runtime even on very large datasets. Our approach furthermore provides insight into how the composition of topics changes over time and can also be applied using other data partitioning strategies over any discrete features of the data, such as geographic features or classes of users. In this paper CLDA is applied successfully to seventeen years of NIPS conference papers (2,484 documents and 3,280,697 words), seventeen years of computer science journal abstracts (533,560 documents and 32,551,540 words), and to forty years of the PubMed corpus (4,025,978 documents and 273,853,980 words).
Submitted 4 October, 2019; v1 submitted 24 October, 2016;
originally announced October 2016.
-
Detecting and Summarizing Emergent Events in Microblogs and Social Media Streams by Dynamic Centralities
Authors:
Neela Avudaiappan,
Alexander Herzog,
Sneha Kadam,
Yuheng Du,
Jason Thatcher,
Ilya Safro
Abstract:
Methods for detecting and summarizing emergent keywords have been extensively studied since social media and microblogging activities have started to play an important role in data analysis and decision making. We present a system for monitoring emergent keywords and summarizing a document stream based on the dynamic semantic graphs of streaming documents. We introduce the notion of dynamic eigenvector centrality for ranking emergent keywords, and present an algorithm for summarizing emergent events that is based on the minimum weight set cover. We demonstrate our system with an analysis of streaming Twitter data related to public security events.
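As an illustration of the minimum weight set cover idea mentioned above, the following is a minimal sketch of the standard greedy approximation. It is not taken from the paper: the abstract does not say how sets and weights are defined or which solver is used. Here it is assumed, purely for illustration, that each candidate document covers a set of emergent keywords and carries a cost, and the function name `greedy_weighted_set_cover` and the sample data are hypothetical.

```python
def greedy_weighted_set_cover(universe, candidates):
    """Greedy approximation for minimum weight set cover.

    universe:   iterable of elements that must be covered (assumed: keywords).
    candidates: dict mapping a name to (weight, set_of_elements)
                (assumed: documents with a cost and the keywords they contain).
    Returns the list of chosen names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, None
        # Iterate in sorted order so ties are broken deterministically.
        for name in sorted(candidates):
            weight, items = candidates[name]
            gain = len(items & uncovered)
            if gain == 0:
                continue
            ratio = weight / gain  # cost per newly covered element
            if best_ratio is None or ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("the candidates cannot cover the universe")
        chosen.append(best_name)
        uncovered -= candidates[best_name][1]
    return chosen


# Hypothetical example: four short documents covering four emergent keywords.
keywords = {"explosion", "airport", "police", "evacuation"}
documents = {
    "t1": (0.8, {"explosion", "airport"}),
    "t2": (1.0, {"police"}),
    "t3": (1.5, {"airport", "police", "evacuation"}),
    "t4": (3.0, {"explosion", "airport", "police", "evacuation"}),
}
summary = greedy_weighted_set_cover(keywords, documents)
print(summary)  # ['t1', 't3']
```

The greedy rule repeatedly picks the candidate with the lowest cost per newly covered element; it gives an approximate, not necessarily optimal, cover.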
Submitted 20 October, 2016;
originally announced October 2016.
-
Generating realistic scaled complex networks
Authors:
Christian L. Staudt,
Michael Hamann,
Alexander Gutfraind,
Ilya Safro,
Henning Meyerhenke
Abstract:
Research on generative models is a central project in the emerging field of network science, and it studies how statistical patterns found in real networks could be generated by formal rules. Output from these generative models is then the basis for designing and evaluating computational methods on networks, and for verification and simulation studies. During the last two decades, a variety of models has been proposed with an ultimate goal of achieving comprehensive realism for the generated networks. In this study, we (a) introduce a new generator, termed ReCoN; (b) explore how ReCoN and some existing models can be fitted to an original network to produce a structurally similar replica; (c) use ReCoN to produce networks much larger than the original exemplar; and finally (d) discuss open problems and promising research directions. In a comparative experimental study, we find that ReCoN is often superior to many other state-of-the-art network generation methods. We argue that ReCoN is a scalable and effective tool for modeling a given network while preserving important properties at both micro- and macroscopic scales, and for scaling the exemplar data by orders of magnitude in size.
Submitted 23 March, 2017; v1 submitted 7 September, 2016;
originally announced September 2016.
-
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
Authors:
Talayeh Razzaghi,
Oleg Roderick,
Ilya Safro,
Nicholas Marko
Abstract:
This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries and imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results.
Submitted 7 April, 2016;
originally announced April 2016.
-
Single- and Multi-level Network Sparsification by Algebraic Distance
Authors:
Emmanuel John,
Ilya Safro
Abstract:
Network sparsification methods play an important role in modern network analysis when fast estimation of computationally expensive properties (such as the diameter, centrality indices, and paths) is required. We propose a method of network sparsification that preserves a wide range of structural properties. Depending on the analysis goals, the method allows one to distinguish between local and global range edges that can be filtered out during the sparsification. First we rank edges by their algebraic distances and then we sample them. We also introduce a multilevel framework for sparsification that can be used to control the sparsification process at various coarse-grained resolutions. Based primarily on matrix-vector multiplications, our method is easily parallelized for different architectures.
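The edge-ranking step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the commonly described form of algebraic distance, in which a few random vectors are smoothed by Jacobi over-relaxation and the distance of an edge is the largest remaining difference between its endpoints. The function names, the parameter defaults (`num_vectors`, `iterations`, `omega`), and the toy graph are all hypothetical, and the sampling step from the abstract is not reproduced.

```python
import random


def algebraic_distances(n, edges, num_vectors=5, iterations=20, omega=0.5, seed=0):
    """Approximate algebraic distance for every edge of a simple weighted graph.

    n:     number of vertices, labeled 0..n-1.
    edges: list of (u, v, weight) tuples with positive weights, no duplicates.
    Returns a dict mapping (u, v) to a non-negative distance.
    """
    rng = random.Random(seed)
    neighbors = [[] for _ in range(n)]
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    dist = {(u, v): 0.0 for u, v, _ in edges}
    for _ in range(num_vectors):
        # Start from a random test vector.
        x = [rng.uniform(-0.5, 0.5) for _ in range(n)]
        for _ in range(iterations):
            new_x = []
            for i in range(n):
                total = sum(w for _, w in neighbors[i])
                if total == 0:  # isolated vertex keeps its value
                    new_x.append(x[i])
                    continue
                avg = sum(w * x[j] for j, w in neighbors[i]) / total
                # Jacobi over-relaxation: blend old value with neighbor average.
                new_x.append((1 - omega) * x[i] + omega * avg)
            x = new_x
        for u, v, _ in edges:
            dist[(u, v)] = max(dist[(u, v)], abs(x[u] - x[v]))
    return dist


def rank_edges(n, edges, **kwargs):
    """Return edges as (u, v) pairs sorted by increasing algebraic distance."""
    dist = algebraic_distances(n, edges, **kwargs)
    return sorted(dist, key=dist.get)


# Hypothetical toy graph: two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
             (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
             (2, 3, 1.0)]
ranking = rank_edges(6, toy_edges)
print(ranking)
```

Each relaxation sweep is a sparse matrix-vector style update, which is in line with the abstract's remark that the method rests primarily on matrix-vector multiplications. In this sketch, a small distance is read as a strong local connection and a large distance as a weaker, longer-range one; how the ranked edges are then sampled is left open.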
Submitted 21 January, 2016;
originally announced January 2016.
-
Fast Imbalanced Classification of Healthcare Data with Missing Values
Authors:
Talayeh Razzaghi,
Oleg Roderick,
Ilya Safro,
Nick Marko
Abstract:
In the medical domain, data features often contain missing values. This can create serious bias in predictive modeling. Standard data mining methods often produce poor performance measures. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. The proposed method is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results.
Submitted 20 March, 2015;
originally announced March 2015.
-
A Multilevel Bilinear Programming Algorithm For the Vertex Separator Problem
Authors:
William W. Hager,
James T. Hungerford,
Ilya Safro
Abstract:
The Vertex Separator Problem for a graph is to find the smallest collection of vertices whose removal breaks the graph into two disconnected subsets that satisfy specified size constraints. In the paper 10.1016/j.ejor.2014.05.042, the Vertex Separator Problem was formulated as a continuous (non-concave/non-convex) bilinear quadratic program. In this paper, we develop a more general continuous bilinear program which incorporates vertex weights, and which applies to the coarse graphs that are generated in a multilevel compression of the original Vertex Separator Problem. A Mountain Climbing Algorithm is used to find a stationary point of the continuous bilinear quadratic program, while second-order optimality conditions and perturbation techniques are used to escape from either a stationary point or a local maximizer. The algorithms for solving the continuous bilinear program are employed during the solution and refinement phases in a multilevel scheme. Computational results and comparisons demonstrate the advantage of the proposed algorithm.
Submitted 17 July, 2016; v1 submitted 17 October, 2014;
originally announced October 2014.