-
Pseudo-Riemannian Embedding Models for Multi-Relational Graph Representations
Authors:
Saee Paliwal,
Angus Brayne,
Benedek Fabian,
Maciej Wiatrak,
Aaron Sim
Abstract:
In this paper we generalize single-relation pseudo-Riemannian graph embedding models to multi-relational networks, and show that the typical approach of encoding relations as manifold transformations translates from the Riemannian to the pseudo-Riemannian case. In addition we construct a view of relations as separate spacetime submanifolds of multi-time manifolds, and consider an interpolation between a pseudo-Riemannian embedding model and its Wick-rotated Riemannian counterpart. We validate these extensions in the task of link prediction, focusing on flat Lorentzian manifolds, and demonstrate their use in both knowledge graph completion and knowledge discovery in a biological domain.
Submitted 2 December, 2022;
originally announced December 2022.
-
Extract Dynamic Information To Improve Time Series Modeling: a Case Study with Scientific Workflow
Authors:
Jeeyung Kim,
Mengtian Jin,
Youkow Homma,
Alex Sim,
Wilko Kroeger,
Kesheng Wu
Abstract:
In modeling time series data, we often need to augment the existing data records to increase the modeling accuracy. In this work, we describe a number of techniques to extract dynamic information about the current state of a large scientific workflow, which could be generalized to other types of applications. The specific task to be modeled is the time needed for transferring a file from an experimental facility to a data center. The key idea of our approach is to find recent past data transfer events that match the current event in some ways. Tests showed that we could identify recent events matching some recorded properties and reduce the prediction error by about 12% compared to similar models with only static features. We additionally explored an application-specific technique to extract information about the data production process, and were able to reduce the average prediction error by 44%.
Submitted 19 May, 2022;
originally announced May 2022.
-
What Makes You Hold on to That Old Car? Joint Insights from Machine Learning and Multinomial Logit on Vehicle-level Transaction Decisions
Authors:
Ling Jin,
Alina Lazar,
Caitlin Brown,
Bingrong Sun,
Venu Garikapati,
Srinath Ravulaparthy,
Qianmiao Chen,
Alexander Sim,
Kesheng Wu,
Tin Ho,
Thomas Wenzel,
C. Anna Spurlock
Abstract:
What makes you hold on to that old car? While the vast majority of household vehicles are still powered by conventional internal combustion engines, the progress of adopting emerging vehicle technologies will critically depend on how soon the existing vehicles are transacted out of the household fleet. Leveraging a nationally representative longitudinal data set, the Panel Study of Income Dynamics, this study examines how household decisions to dispose of or replace a given vehicle are: (1) influenced by the vehicle's attributes, (2) mediated by households' concurrent socio-demographic and economic attributes, and (3) triggered by key life cycle events. Coupled with a newly developed machine learning interpretation tool, TreeExplainer, we demonstrate an innovative use of machine learning models to augment traditional logit modeling to both generate behavioral insights and improve model performance. We find the two gradient-boosting-based methods, CatBoost and LightGBM, are the best performing machine learning models for this problem. The multinomial logistic model can achieve similar performance levels after its model specification is informed by TreeExplainer. Both machine learning and multinomial logit models suggest that while older vehicles are more likely to be disposed of or replaced than newer ones, such probability decreases as the vehicles serve the family longer. We find that married families, families with higher education levels, homeowners, and older families tend to keep their vehicles longer. Life events such as childbirth, residential relocation, and change of household composition and income are found to increase vehicle disposal and/or replacement. We provide additional insights on the timing of vehicle replacement or disposal; in particular, the presence of children and childbirth events are more strongly associated with vehicle replacement among younger parents.
Submitted 13 May, 2022;
originally announced May 2022.
-
Phenotyping with Positive Unlabelled Learning for Genome-Wide Association Studies
Authors:
Andre Vauvelle,
Hamish Tomlinson,
Aaron Sim,
Spiros Denaxas
Abstract:
Identifying phenotypes plays an important role in furthering our understanding of disease biology through practical applications within healthcare and the life sciences. The challenge of dealing with the complexities and noise within electronic health records (EHRs) has motivated applications of machine learning in phenotypic discovery. While recent research has focused on finding predictive subtypes for clinical decision support, here we instead focus on the noise that results in phenotypic misclassification, which can reduce a phenotype's ability to detect associations in genome-wide association studies (GWAS). We show that by combining anchor learning and transformer architectures into our proposed model, AnchorBERT, we are able to detect genomic associations only previously found in large consortium studies with 5× more cases. When reducing the number of controls available by 50%, we find our model is able to maintain 40% more significant genomic associations from the GWAS catalog compared to standard phenotype definitions. Keywords: Phenotyping, Machine Learning, Semi-Supervised, Genetic Association Studies, Biological Discovery
Submitted 15 February, 2022;
originally announced February 2022.
-
Directed Graph Embeddings in Pseudo-Riemannian Manifolds
Authors:
Aaron Sim,
Maciej Wiatrak,
Angus Brayne,
Páidí Creed,
Saee Paliwal
Abstract:
The inductive biases of graph representation learning algorithms are often encoded in the background geometry of their embedding space. In this paper, we show that general directed graphs can be effectively represented by an embedding model that combines three components: a pseudo-Riemannian metric structure, a non-trivial global topology, and a unique likelihood function that explicitly incorporates a preferred direction in embedding space. We demonstrate the representational capabilities of this method by applying it to the task of link prediction on a series of synthetic and real directed graphs from natural language applications and biology. In particular, we show that low-dimensional cylindrical Minkowski and anti-de Sitter spacetimes can produce equal or better graph representations than curved Riemannian manifolds of higher dimensions.
Submitted 16 June, 2021;
originally announced June 2021.
-
Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness
Authors:
Adam Foster,
Árpi Vezér,
Craig A Glastonbury,
Páidí Creed,
Sam Abujudeh,
Aaron Sim
Abstract:
Learning meaningful representations of data that can address challenges such as batch effect correction and counterfactual inference is a central problem in many domains including computational biology. Adopting a Conditional VAE framework, we show that marginal independence between the representation and a condition variable plays a key role in both of these challenges. We propose the Contrastive Mixture of Posteriors (CoMP) method that uses a novel misalignment penalty defined in terms of mixtures of the variational posteriors to enforce this independence in latent space. We show that CoMP has attractive theoretical properties compared to previous approaches, and we prove counterfactual identifiability of CoMP under additional assumptions. We demonstrate state-of-the-art performance on a set of challenging tasks including aligning human tumour samples with cancer cell-lines, predicting transcriptome-level perturbation responses, and batch correction on single-cell RNA sequencing data. We also find parallels to fair representation learning and demonstrate that CoMP is competitive on a common task in the field.
Submitted 26 June, 2022; v1 submitted 15 June, 2021;
originally announced June 2021.
-
Investigating Underlying Drivers of Variability in Residential Energy Usage Patterns with Daily Load Shape Clustering of Smart Meter Data
Authors:
Ling Jin,
C. Anna Spurlock,
Sam Borgeson,
Alina Lazar,
Daniel Fredman,
Annika Todd,
Alexander Sim,
Kesheng Wu
Abstract:
Residential customers have traditionally not been treated as individual entities due to the high volatility in residential consumption patterns as well as a historic focus on aggregated loads from the utility and system feeder perspective. Large-scale deployment of smart meters has motivated a growing number of studies exploring disaggregated daily load patterns, which can reveal important heterogeneity across different time scales and weather conditions, as well as within and across individual households. This paper aims to shed light on the mechanisms by which electricity consumption patterns exhibit variability and the different constraints that may affect demand-response (DR) flexibility. We systematically evaluate how daily time-of-use patterns and their variability relate to external and internal influencing factors, including time scales of interest, meteorological conditions, and household characteristics, by applying an improved version of the adaptive K-means clustering method to profile "household-days" of a summer-peaking utility. We find that for this summer-peaking utility, outdoor temperature is the most important external driver of the load shape variability relative to seasonality and day-of-week. The top three consumption patterns represent approximately 50% of usage on the highest temperature days. The variability in summer load shapes across customers can be explained by the responsiveness of the households to outside temperature. Our results suggest that, depending on the influencing factors, not all the consumption variability can be readily translated to consumption flexibility. Such information needs to be further explored in segmenting customers for better program targeting and tailoring to meet the needs of the rapidly evolving electricity grid.
Submitted 16 February, 2021;
originally announced February 2021.
-
An Ensemble Approach toward Automated Variable Selection for Network Anomaly Detection
Authors:
Makiya Nakashima,
Alex Sim,
Youngsoo Kim,
Jonghyun Kim,
Jinoh Kim
Abstract:
While variable selection is essential to optimize the learning complexity by prioritizing features, automating the selection process is preferred since it otherwise requires laborious effort and intensive analysis. However, enabling the automation is not an easy task, for several reasons. First, selection techniques often need a condition to terminate the reduction process, for example, a threshold or a target number of features, and searching for an adequate stopping condition is highly challenging. Second, it is uncertain whether the reduced variable set would work well; our preliminary experimental result shows that well-known selection techniques produce different sets of variables as a result of reduction (even with the same termination condition), and it is hard to estimate which of them would work the best in future testing. In this paper, we demonstrate the potential power of our approach to the automation of the selection process, which incorporates well-known selection methods for identifying important variables. Our experimental results with two public network traffic datasets (UNSW-NB15 and IDS2017) show that our proposed method identifies a small number of core variables, with which it is possible to approximate the performance obtained with the entire variable set.
Submitted 28 October, 2019;
originally announced October 2019.
-
Interpretable Graph Convolutional Neural Networks for Inference on Noisy Knowledge Graphs
Authors:
Daniel Neil,
Joss Briody,
Alix Lacoste,
Aaron Sim,
Paidi Creed,
Amir Saffari
Abstract:
In this work, we provide a new formulation for Graph Convolutional Neural Networks (GCNNs) for link prediction on graph data that addresses common challenges for biomedical knowledge graphs (KGs). We introduce a regularized attention mechanism to GCNNs that not only improves performance on clean datasets, but also favorably accommodates noise in KGs, a pervasive issue in real-world applications. Further, we explore new visualization methods for interpretable modelling and to illustrate how the learned representation can be exploited to automate dataset denoising. The results are demonstrated on a synthetic dataset, the common benchmark dataset FB15k-237, and a large biomedical knowledge graph derived from a combination of noisy and clean data sources. Using these improvements, we visualize a learned model's representation of the disease cystic fibrosis and demonstrate how to interrogate a neural network to show the potential of PPARG as a candidate therapeutic target for rheumatoid arthritis.
Submitted 1 December, 2018;
originally announced December 2018.
-
Random Forests on Distance Matrices for Imaging Genetics Studies
Authors:
Aaron Sim,
Dimosthenis Tsagkrasoulis,
Giovanni Montana
Abstract:
We propose a non-parametric regression methodology, Random Forests on Distance Matrices (RFDM), for detecting genetic variants associated with quantitative phenotypes representing the human brain's structure or function, and obtained using neuroimaging techniques. RFDM, which is an extension of decision forests, requires a distance matrix as response that encodes all pair-wise phenotypic distances in the random sample. We discuss ways to learn such distances directly from the data using manifold learning techniques, and how to define such distances when the phenotypes are non-vectorial objects such as brain connectivity networks. We also describe an extension of RFDM to detect epistatic effects while keeping the computational complexity low. Extensive simulation results and an application to an imaging genetics study of Alzheimer's Disease are presented and discussed.
Submitted 24 September, 2013;
originally announced September 2013.
-
Information Geometry and Sequential Monte Carlo
Authors:
Aaron Sim,
Sarah Filippi,
Michael P. H. Stumpf
Abstract:
This paper explores the application of methods from information geometry to the sequential Monte Carlo (SMC) sampler. In particular the Riemannian manifold Metropolis-adjusted Langevin algorithm (mMALA) is adapted for the transition kernels in SMC. Similar to its function in Markov chain Monte Carlo methods, the mMALA is a fully adaptable kernel which allows for efficient sampling of high-dimensional and highly correlated parameter spaces. We set up the theoretical framework for its use in SMC with a focus on the application to the problem of sequential Bayesian inference for dynamical systems as modelled by sets of ordinary differential equations. In addition, we argue that defining the sequence of distributions on geodesics optimises the effective sample sizes in the SMC run. We illustrate the application of the methodology by inferring the parameters of simulated Lotka-Volterra and Fitzhugh-Nagumo models. In particular we demonstrate that compared to employing a standard adaptive random walk kernel, the SMC sampler with an information geometric kernel design attains a higher level of statistical robustness in the inferred parameters of the dynamical systems.
Submitted 4 December, 2012;
originally announced December 2012.