-
Geometric-informed GFlowNets for Structure-Based Drug Design
Authors:
Grayson Lee,
Tony Shen,
Martin Ester
Abstract:
The rising cost of drug discovery and the slow pace at which new drugs are currently discovered underscore the need for more efficient structure-based drug design (SBDD) methods. We employ Generative Flow Networks (GFlowNets) to effectively explore the vast combinatorial space of drug-like molecules, which traditional virtual screening methods fail to cover. We introduce a novel modification to the GFlowNet framework by incorporating trigonometrically consistent embeddings, previously utilized in tasks involving protein conformation and protein-ligand interactions, to enhance the model's ability to generate molecules tailored to specific protein pockets. We have modified the existing protein conditioning used by GFlowNets, blending geometric information from both protein and ligand embeddings to achieve more geometrically consistent embeddings. Experiments conducted using CrossDocked2020 demonstrated an improvement in the binding affinity between generated molecules and protein pockets for both single and multi-objective tasks, compared to previous work. Additionally, we propose future work aimed at further increasing the geometric information captured in protein-ligand interactions.
Submitted 16 June, 2024;
originally announced June 2024.
-
IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs
Authors:
Yuzhen Mao,
Martin Ester,
Ke Li
Abstract:
One limitation of existing Transformer-based models is that they cannot handle very long sequences as input since their self-attention operations exhibit quadratic time and space complexity. This problem becomes especially acute when Transformers are deployed on hardware platforms equipped only with CPUs. To address this issue, we propose a novel method for accelerating self-attention at inference time that works with pretrained Transformer models out-of-the-box without requiring retraining. We experiment using our method to accelerate various long-sequence Transformers, including a leading LLaMA 2-based LLM, on various benchmarks and demonstrate a speedup of 2.73x to 7.63x while retaining 98.6% to 99.6% of the accuracy of the original pretrained models. The code is available on our project website at https://yuzhenmao.github.io/IceFormer/.
Submitted 5 May, 2024;
originally announced May 2024.
-
Adversarially Balanced Representation for Continuous Treatment Effect Estimation
Authors:
Amirreza Kazemi,
Martin Ester
Abstract:
Individual treatment effect (ITE) estimation requires adjusting for the covariate shift between populations with different treatments, and deep representation learning has shown great promise in learning a balanced representation of covariates. However, existing methods mostly consider the scenario of binary treatments. In this paper, we consider the more practical and challenging scenario in which the treatment is a continuous variable (e.g. dosage of a medication), and we address the two main challenges of this setup. We propose the adversarial counterfactual regression network (ACFR) that adversarially minimizes the representation imbalance in terms of KL divergence, and also maintains the impact of the treatment value on the outcome prediction by leveraging an attention mechanism. Theoretically, we demonstrate that the ACFR objective function is grounded in an upper bound on counterfactual outcome prediction error. Our experimental evaluation on semi-synthetic datasets demonstrates the empirical superiority of ACFR over a range of state-of-the-art methods.
Submitted 16 December, 2023;
originally announced December 2023.
-
TacoGFN: Target-conditioned GFlowNet for Structure-based Drug Design
Authors:
Tony Shen,
Seonghwan Seo,
Grayson Lee,
Mohit Pandey,
Jason R Smith,
Artem Cherkasov,
Woo Youn Kim,
Martin Ester
Abstract:
Searching the vast chemical space for drug-like and synthesizable molecules with high binding affinity to a protein pocket is a challenging task in drug discovery. Recently, molecular deep generative models have been introduced which promise to be more efficient than exhaustive virtual screening, by directly generating molecules based on the protein structure. However, since they learn the distribution of a limited protein-ligand complex dataset, the existing methods struggle with generating novel molecules with significant property improvements. In this paper, we frame the generation task as a Reinforcement Learning task, where the goal is to search the wider chemical space for molecules with desirable properties as opposed to fitting a training data distribution. More specifically, we propose TacoGFN, a Generative Flow Network conditioned on protein pocket structure, using binding affinity, drug-likeness and synthesizability measures as our reward. Empirically, our method outperforms state-of-the-art methods on the CrossDocked2020 benchmark for every molecular property (Vina score, QED, SA), while significantly improving the generation time. TacoGFN achieves $-8.82$ in median docking score and $52.63\%$ in Novel Hit Rate.
Submitted 7 April, 2024; v1 submitted 4 October, 2023;
originally announced October 2023.
-
Semi-Supervised Junction Tree Variational Autoencoder for Molecular Property Prediction
Authors:
Atia Hamidizadeh,
Tony Shen,
Martin Ester
Abstract:
Molecular Representation Learning is essential to solving many drug discovery and computational chemistry problems. It is a challenging problem due to the complex structure of molecules and the vast chemical space. Graph representations of molecules are more expressive than traditional representations, such as molecular fingerprints. Therefore, they can improve the performance of machine learning models. We propose SeMole, a method that augments the Junction Tree Variational Autoencoder, a state-of-the-art generative model for molecular graphs, with semi-supervised learning. SeMole aims to improve the accuracy of molecular property prediction when labeled data is limited by exploiting unlabeled data. We enforce that the model generates molecular graphs conditioned on target properties by incorporating the property into the latent representation. We propose an additional pre-training phase to improve the training process for our semi-supervised generative model. We perform an experimental evaluation on the ZINC dataset using three different molecular properties and demonstrate the benefits of semi-supervision.
Submitted 14 January, 2023; v1 submitted 9 August, 2022;
originally announced August 2022.
-
Subgroup Discovery in Unstructured Data
Authors:
Ali Arab,
Dev Arora,
Jialin Lu,
Martin Ester
Abstract:
Subgroup discovery is a descriptive and exploratory data mining technique to identify subgroups in a population that exhibit interesting behavior with respect to a variable of interest. Subgroup discovery has numerous applications in knowledge discovery and hypothesis generation, yet it remains inapplicable for unstructured, high-dimensional data such as images. This is because subgroup discovery algorithms rely on defining descriptive rules based on (attribute, value) pairs; in unstructured data, however, an attribute is not well defined. Even in cases where the notion of attribute intuitively exists in the data, such as a pixel in an image, due to the high dimensionality of the data, these attributes are not informative enough to be used in a rule. In this paper, we introduce the subgroup-aware variational autoencoder, a novel variational autoencoder that learns a representation of unstructured data which leads to subgroups of higher quality. Our experimental results demonstrate the effectiveness of the method at learning high-quality subgroups while supporting the interpretability of the concepts.
Submitted 15 July, 2022;
originally announced July 2022.
-
A Comprehensive Survey on Deep Clustering: Taxonomy, Challenges, and Future Directions
Authors:
Sheng Zhou,
Hongjia Xu,
Zhuonan Zheng,
Jiawei Chen,
Zhao li,
Jiajun Bu,
Jia Wu,
Xin Wang,
Wenwu Zhu,
Martin Ester
Abstract:
Clustering is a fundamental machine learning task which has been widely studied in the literature. Classic clustering methods follow the assumption that data are represented as features in a vectorized form through various representation learning techniques. As data become increasingly complex, shallow (traditional) clustering methods can no longer handle high-dimensional data. With the huge success of deep learning, especially deep unsupervised learning, many representation learning techniques with deep architectures have been proposed in the past decade. Recently, the concept of Deep Clustering, i.e., jointly optimizing the representation learning and clustering, has been proposed and has attracted growing attention in the community. Motivated by the tremendous success of deep learning in clustering, one of the most fundamental machine learning tasks, and the large number of recent advances in this direction, in this paper we conduct a comprehensive survey on deep clustering by proposing a new taxonomy of different state-of-the-art approaches. We summarize the essential components of deep clustering and categorize existing methods by the ways they design interactions between deep representation learning and clustering. Moreover, this survey also provides the popular benchmark datasets, evaluation metrics and open-source implementations to clearly illustrate various experimental settings. Last but not least, we discuss the practical applications of deep clustering and suggest challenging topics that deserve further investigation as future directions.
Submitted 15 June, 2022;
originally announced June 2022.
-
Causal Inference from Small High-dimensional Datasets
Authors:
Raquel Aoki,
Martin Ester
Abstract:
Many methods have been proposed to estimate treatment effects with observational data. Often, the choice of the method considers the application's characteristics, such as type of treatment and outcome, confounding effect, and the complexity of the data. These methods implicitly assume that the sample size is large enough to train such models, especially the neural network-based estimators. What if this is not the case? In this work, we propose Causal-Batle, a methodology to estimate treatment effects in small high-dimensional datasets in the presence of another high-dimensional dataset in the same feature space. We adopt an approach that brings transfer learning techniques into causal inference. Our experiments show that such an approach helps to bring stability to neural network-based methods and improve the treatment effect estimates in small high-dimensional datasets.
Submitted 18 May, 2022;
originally announced May 2022.
-
Multi-treatment Effect Estimation from Biomedical Data
Authors:
Raquel Aoki,
Yizhou Chen,
Martin Ester
Abstract:
This work proposes M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines on three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the multiple treatment effects.
Submitted 5 January, 2023; v1 submitted 14 December, 2021;
originally announced December 2021.
-
An Interactive Visualization Tool for Understanding Active Learning
Authors:
Zihan Wang,
Jialin Lu,
Oliver Snow,
Martin Ester
Abstract:
Despite recent progress in artificial intelligence and machine learning, many state-of-the-art methods suffer from a lack of explainability and transparency. The ability to interpret the predictions made by machine learning models and accurately evaluate these models is crucially important. In this paper, we present an interactive visualization tool to elucidate the training process of active learning. This tool enables one to select a sample of interesting data points, view how their prediction values change at different querying stages, and thus better understand when and how active learning works. Additionally, users can utilize this tool to compare different active learning strategies simultaneously and inspect why some strategies outperform others in certain contexts. With some preliminary experiments, we demonstrate that our visualization panel has great potential to be used in various active learning experiments and to help users evaluate their models appropriately.
Submitted 8 November, 2021;
originally announced November 2021.
-
CoSam: An Efficient Collaborative Adaptive Sampler for Recommendation
Authors:
Jiawei Chen,
Chengquan Jiang,
Can Wang,
Sheng Zhou,
Yan Feng,
Chun Chen,
Martin Ester,
Xiangnan He
Abstract:
Sampling strategies have been widely applied in many recommendation systems to accelerate model learning from implicit feedback data. A typical strategy is to draw negative instances from a uniform distribution, which, however, severely affects the model's convergence, stability, and even recommendation accuracy. A promising solution to this problem is to over-sample the "difficult" (a.k.a. informative) instances that contribute more to training. But this increases the risk of biasing the model and leading to non-optimal results. Moreover, existing samplers are either heuristic, which require domain knowledge and often fail to capture real "difficult" instances, or rely on a sampler model that suffers from low efficiency.
To deal with these problems, we propose an efficient and effective collaborative sampling method, CoSam, which consists of: (1) a collaborative sampler model that explicitly leverages user-item interaction information in the sampling probability and exhibits good properties of normalization, adaptation, interaction information awareness, and sampling efficiency; and (2) an integrated sampler-recommender framework, leveraging the sampler model in prediction to offset the bias caused by uneven sampling. Correspondingly, we derive a fast reinforced training algorithm for our framework to boost the sampler performance and sampler-recommender collaboration. Extensive experiments on four real-world datasets demonstrate the superiority of the proposed collaborative sampler model and integrated sampler-recommender framework.
Submitted 16 November, 2020;
originally announced November 2020.
-
Combining Domain-Specific Meta-Learners in the Parameter Space for Cross-Domain Few-Shot Classification
Authors:
Shuman Peng,
Weilian Song,
Martin Ester
Abstract:
The goal of few-shot classification is to learn a model that can classify novel classes using only a few training examples. Despite the promising results shown by existing meta-learning algorithms in solving the few-shot classification problem, there still remains an important challenge: how to generalize to unseen domains while meta-learning on multiple seen domains? In this paper, we propose an optimization-based meta-learning method, called Combining Domain-Specific Meta-Learners (CosML), that addresses the cross-domain few-shot classification problem. CosML first trains a set of meta-learners, one for each training domain, to learn prior knowledge (i.e., meta-parameters) specific to each domain. The domain-specific meta-learners are then combined in the parameter space, by taking a weighted average of their meta-parameters, which is used as the initialization parameters of a task network that is quickly adapted to novel few-shot classification tasks in an unseen domain. Our experiments show that CosML outperforms a range of state-of-the-art methods and achieves strong cross-domain generalization ability.
Submitted 30 October, 2020;
originally announced November 2020.
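The parameter-space combination described in the CosML abstract above (a weighted average of domain-specific meta-parameters used as an initialization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `combine_meta_parameters`, the dictionary-of-lists parameter layout, and the way the weights are supplied are all assumptions made for illustration, since the abstract does not say how the weights are chosen.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: each meta-learner's parameters are a dict that maps
# a parameter name to a flat list of floats.
Params = Dict[str, List[float]]


def combine_meta_parameters(meta_params: Sequence[Params],
                            weights: Sequence[float]) -> Params:
    """Return the weighted average of several parameter sets.

    The weights are normalized to sum to 1. How they are chosen is an
    assumption left to the caller; the abstract does not specify it.
    """
    if not meta_params or len(meta_params) != len(weights):
        raise ValueError("need one weight per meta-learner")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    normalized = [w / total for w in weights]

    combined: Params = {}
    for name, reference in meta_params[0].items():
        # Average each coordinate of this parameter across all domains.
        combined[name] = [
            sum(w * params[name][i] for w, params in zip(normalized, meta_params))
            for i in range(len(reference))
        ]
    return combined


# Two hypothetical domain-specific meta-learners with one parameter vector each.
domain_a = {"w": [0.0, 4.0]}
domain_b = {"w": [4.0, 0.0]}
init = combine_meta_parameters([domain_a, domain_b], weights=[1.0, 3.0])
```

In this sketch `init` would serve as the starting point of a task network that is then fine-tuned on a few examples from the unseen domain; the fine-tuning step itself is omitted here.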
-
Domain Generalization via Semi-supervised Meta Learning
Authors:
Hossein Sharifi-Noghabi,
Hossein Asghari,
Nazanin Mehrasa,
Martin Ester
Abstract:
The goal of domain generalization is to learn from multiple source domains to generalize to unseen target domains under distribution discrepancy. Current state-of-the-art methods in this area are fully supervised, but for many real-world problems it is hardly possible to obtain enough labeled samples. In this paper, we propose the first method of domain generalization to leverage unlabeled samples, called DGSML, which combines meta learning's episodic training with semi-supervised learning. DGSML employs an entropy-based pseudo-labeling approach to assign labels to unlabeled samples and then utilizes a novel discrepancy loss to ensure that class centroids before and after labeling unlabeled samples are close to each other. To learn a domain-invariant representation, it also utilizes a novel alignment loss to ensure that the distance between pairs of class centroids, computed after adding the unlabeled samples, is preserved across different domains. DGSML is trained by a meta learning approach to mimic the distribution shift between the input source domains and unseen target domains. Experimental results on benchmark datasets indicate that DGSML outperforms state-of-the-art domain generalization and semi-supervised learning methods.
Submitted 30 September, 2020; v1 submitted 26 September, 2020;
originally announced September 2020.
-
CAST: A Correlation-based Adaptive Spectral Clustering Algorithm on Multi-scale Data
Authors:
Xiang Li,
Ben Kao,
Caihua Shan,
Dawei Yin,
Martin Ester
Abstract:
We study the problem of applying spectral clustering to cluster multi-scale data, which is data whose clusters are of various sizes and densities. Traditional spectral clustering techniques discover clusters by processing a similarity matrix that reflects the proximity of objects. For multi-scale data, distance-based similarity is not effective because objects of a sparse cluster could be far apart while those of a dense cluster have to be sufficiently close. Following [16], we solve the problem of spectral clustering on multi-scale data by integrating the concept of objects' "reachability similarity" with a given distance-based similarity to derive an objects' coefficient matrix. We propose the algorithm CAST that applies trace Lasso to regularize the coefficient matrix. We prove that the resulting coefficient matrix has the "grouping effect" and that it exhibits "sparsity". We show that these two characteristics imply very effective spectral clustering. We evaluate CAST and 10 other clustering methods on a wide range of datasets w.r.t. various measures. Experimental results show that CAST provides excellent performance and is highly robust across test cases of multi-scale data.
Submitted 8 June, 2020;
originally announced June 2020.
-
ParKCa: Causal Inference with Partially Known Causes
Authors:
Raquel Aoki,
Martin Ester
Abstract:
Methods for causal inference from observational data are an alternative for scenarios where collecting counterfactual data or realizing a randomized experiment is not possible. Adopting a stacking approach, our proposed method ParKCA combines the results of several causal inference methods to learn new causes in applications with some known causes and many potential causes. We validate ParKCA in two Genome-wide association studies, one real-world and one simulated dataset. Our results show that ParKCA can infer more causes than existing methods.
Submitted 11 November, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
-
Hierarchical Graph Pooling with Structure Learning
Authors:
Zhen Zhang,
Jiajun Bu,
Martin Ester,
Jianfeng Zhang,
Chengwei Yao,
Zhi Yu,
Can Wang
Abstract:
Graph Neural Networks (GNNs), which generalize deep neural networks to graph-structured data, have drawn considerable attention and achieved state-of-the-art performance in numerous graph-related tasks. However, existing GNN models mainly focus on designing graph convolution operations. The graph pooling (or downsampling) operations, which play an important role in learning hierarchical representations, are usually overlooked. In this paper, we propose a novel graph pooling operator, called Hierarchical Graph Pooling with Structure Learning (HGP-SL), which can be integrated into various graph neural network architectures. HGP-SL incorporates graph pooling and structure learning into a unified module to generate hierarchical representations of graphs. More specifically, the graph pooling operation adaptively selects a subset of nodes to form an induced subgraph for the subsequent layers. To preserve the integrity of the graph's topological information, we further introduce a structure learning mechanism to learn a refined graph structure for the pooled graph at each layer. By combining the HGP-SL operator with graph neural networks, we perform graph-level representation learning with a focus on the graph classification task. Experimental results on six widely used benchmarks demonstrate the effectiveness of our proposed model.
Submitted 25 December, 2019; v1 submitted 14 November, 2019;
originally announced November 2019.
-
An Active Approach for Model Interpretation
Authors:
Jialin Lu,
Martin Ester
Abstract:
Model interpretation, or explanation of a machine learning classifier, aims to extract generalizable knowledge from a trained classifier into a human-understandable format, for various purposes such as model assessment, debugging and trust. From a computational viewpoint, it is formulated as approximating the target classifier using a simpler interpretable model, such as rule models like a decision set/list/tree. Often, this approximation is handled as standard supervised learning and the only difference is that the labels are provided by the target classifier instead of ground truth. This paradigm is particularly popular because there exists a variety of well-studied supervised algorithms for learning an interpretable classifier. However, we argue that this paradigm is suboptimal because it does not utilize the unique property of the model interpretation problem, that is, the ability to generate synthetic instances and query the target classifier for their labels. We call this the active-query property, suggesting that we should consider model interpretation from an active learning perspective. Following this insight, we argue that the active-query property should be employed when designing a model interpretation algorithm, and that the generation of synthetic instances should be integrated seamlessly with the algorithm that learns the model interpretation. In this paper, we demonstrate that by doing so, it is possible to achieve more faithful interpretation with simpler model complexity. As a technical contribution, we present an active algorithm, Active Decision Set Induction (ADS), to learn a decision set, a set of if-else rules, for model interpretation. ADS performs a local search over the space of all decision sets. In every iteration, ADS computes confidence intervals for the value of the objective function of all local actions and utilizes active-query to determine the best one.
Submitted 27 October, 2019;
originally announced October 2019.
-
PADME: A Deep Learning-based Framework for Drug-Target Interaction Prediction
Authors:
Qingyuan Feng,
Evgenia Dueva,
Artem Cherkasov,
Martin Ester
Abstract:
In silico drug-target interaction (DTI) prediction is an important and challenging problem in biomedical research with a huge potential benefit to the pharmaceutical industry and patients. Most existing methods for DTI prediction including deep learning models generally have binary endpoints, which could be an oversimplification of the problem, and those methods are typically unable to handle cold…
▽ More
In silico drug-target interaction (DTI) prediction is an important and challenging problem in biomedical research with a huge potential benefit to the pharmaceutical industry and patients. Most existing methods for DTI prediction including deep learning models generally have binary endpoints, which could be an oversimplification of the problem, and those methods are typically unable to handle cold-target problems, i.e., problems involving target protein that never appeared in the training set. Towards this, we contrived PADME (Protein And Drug Molecule interaction prEdiction), a framework based on Deep Neural Networks, to predict real-valued interaction strength between compounds and proteins without requiring feature engineering. PADME takes both compound and protein information as inputs, so it is capable of solving cold-target (and cold-drug) problems. To our knowledge, we are the first to combine Molecular Graph Convolution (MGC) for compound featurization with protein descriptors for DTI prediction. We used multiple cross-validation split schemes and evaluation metrics to measure the performance of PADME on multiple datasets, including the ToxCast dataset, and PADME consistently dominates baseline methods. The results of a case study, which predicts the binding affinity between various compounds and androgen receptor (AR), suggest PADME's potential in drug development. The scalability of PADME is another advantage in the age of Big Data.
Submitted 21 August, 2019; v1 submitted 25 July, 2018;
originally announced July 2018.
-
Detecting Singleton Review Spammers Using Semantic Similarity
Authors:
Vlad Sandulescu,
Martin Ester
Abstract:
Online reviews have increasingly become a very important resource for consumers when making purchases. However, it is becoming more and more difficult for people to make well-informed buying decisions without being deceived by fake reviews. Prior works on the opinion spam problem mostly considered classifying fake reviews using behavioral user patterns. They focused on prolific users who write more than a couple of reviews, discarding one-time reviewers. The number of singleton reviewers, however, is expected to be high for many review websites. While behavioral patterns are effective when dealing with elite users, for one-time reviewers, the review text needs to be exploited. In this paper we tackle the problem of detecting fake reviews written by the same person using multiple names, posting each review under a different name. We propose two methods to detect similar reviews and show the results generally outperform the vectorial similarity measures used in prior works. The first method extends the semantic similarity between words to the review level. The second method is based on topic modeling and exploits the similarity of the reviews' topic distributions using two models: bag-of-words and bag-of-opinion-phrases. The experiments were conducted on reviews from three different datasets: Yelp (57K reviews), Trustpilot (9K reviews) and the Ott dataset (800 reviews).
Submitted 9 September, 2016;
originally announced September 2016.
-
Structural Analysis of User Choices for Mobile App Recommendation
Authors:
Bin Liu,
Yao Wu,
Neil Zhenqiang Gong,
Junjie Wu,
Hui Xiong,
Martin Ester
Abstract:
Advances in smartphone technology have promoted the rapid development of mobile apps. However, the availability of a huge number of mobile apps in application stores has imposed the challenge of finding the right apps to meet the user needs. Indeed, there is a critical demand for personalized app recommendations. Along this line, there are opportunities and challenges posed by two unique characteristics of mobile apps. First, app markets have organized apps in a hierarchical taxonomy. Second, apps with similar functionalities are competing with each other. While there are a variety of approaches for mobile app recommendations, these approaches do not have a focus on dealing with these opportunities and challenges. To this end, in this paper, we provide a systematic study for addressing these challenges. Specifically, we develop a Structural User Choice Model (SUCM) to learn fine-grained user preferences by exploiting the hierarchical taxonomy of apps as well as the competitive relationships among apps. Moreover, we design an efficient learning algorithm to estimate the parameters for the SUCM model. Finally, we perform extensive experiments on a large app adoption data set collected from Google Play. The results show that SUCM consistently outperforms state-of-the-art top-N recommendation methods by a significant margin.
Submitted 25 May, 2016;
originally announced May 2016.
-
KiWi: A Scalable Subspace Clustering Algorithm for Gene Expression Analysis
Authors:
Obi L. Griffith,
Byron J. Gao,
Mikhail Bilenky,
Yuliya Prichyna,
Martin Ester,
Steven J. M. Jones
Abstract:
Subspace clustering has gained increasing popularity in the analysis of gene expression data. Among subspace cluster models, the recently introduced order-preserving sub-matrix (OPSM) has demonstrated high promise. An OPSM, essentially a pattern-based subspace cluster, is a subset of rows and columns in a data matrix for which all the rows induce the same linear ordering of columns. Existing OPSM discovery methods do not scale well to increasingly large expression datasets. In particular, twig clusters having few genes and many experiments incur explosive computational costs and are completely pruned off by existing methods. However, it is of particular interest to determine small groups of genes that are tightly coregulated across many conditions. In this paper, we present KiWi, an OPSM subspace clustering algorithm that is scalable to massive datasets, capable of discovering twig clusters and identifying negative as well as positive correlations. We extensively validate KiWi using relevant biological datasets and show that KiWi correctly assigns redundant probes to the same cluster, groups experiments with common clinical annotations, differentiates real promoter sequences from negative control sequences, and shows good association with cis-regulatory motif predictions.
Submitted 13 April, 2009;
originally announced April 2009.
-
Learning Class-Level Bayes Nets for Relational Data
Authors:
Oliver Schulte,
Hassan Khosravi,
Flavia Moser,
Martin Ester
Abstract:
Many databases store data in relational format, with different types of entities and information about links between the entities. The field of statistical-relational learning (SRL) has developed a number of new statistical models for such data. In this paper we focus on learning class-level or first-order dependencies, which model the general database statistics over attributes of linked objects and links (e.g., the percentage of A grades given in computer science classes). Class-level statistical relationships are important in themselves, and they support applications like policy making, strategic planning, and query optimization. Most current SRL methods find class-level dependencies, but their main task is to support instance-level predictions about the attributes or links of specific entities. We focus only on class-level prediction, and describe algorithms for learning class-level models that are orders of magnitude faster for this task. Our algorithms learn Bayes nets with relational structure, leveraging the efficiency of single-table nonrelational Bayes net learners. An evaluation of our methods on three data sets shows that they are computationally feasible for realistic table sizes, and that the learned structures represent the statistical information in the databases well. After learning compiles the database statistics into a Bayes net, querying these statistics via Bayes net inference is faster than with SQL queries, and does not depend on the size of the database.
Submitted 20 October, 2009; v1 submitted 26 November, 2008;
originally announced November 2008.
-
Association Rules in the Relational Calculus
Authors:
Oliver Schulte,
Flavia Moser,
Martin Ester,
Zhiyong Lu
Abstract:
One of the most utilized data mining tasks is the search for association rules. Association rules represent significant relationships between items in transactions. We extend the concept of association rule to represent a much broader class of associations, which we refer to as entity-relationship rules. Semantically, entity-relationship rules express associations between properties of related objects. Syntactically, these rules are based on a broad subclass of safe domain relational calculus queries. We propose a new definition of support and confidence for entity-relationship rules and for the frequency of entity-relationship queries. We prove that the definition of frequency satisfies standard probability axioms and the Apriori property.
Submitted 10 October, 2007;
originally announced October 2007.