Search | arXiv e-print repository

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language

Authors: James Requeima, John Bronskill, Dami Choi, Richard E. Turner, David Duvenaud

Abstract: Machine learning practitioners often face significant challenges in formally integrating their prior knowledge and beliefs into predictive models, limiting the potential for nuanced and context-aware analyses. Moreover, the expertise needed to integrate this prior knowledge into probabilistic modeling typically limits the application of these models to specialists. Our goal is to build a regression model that can process numerical data and make probabilistic predictions at arbitrary locations, guided by natural language text which describes a user's prior knowledge. Large Language Models (LLMs) provide a useful starting point for designing such a tool since they 1) provide an interface where users can incorporate expert insights in natural language and 2) provide an opportunity for leveraging latent problem-relevant knowledge encoded in LLMs that users may not have themselves. We start by exploring strategies for eliciting explicit, coherent numerical predictive distributions from LLMs. We examine these joint predictive distributions, which we call LLM Processes, over arbitrarily-many quantities in settings such as forecasting, multi-dimensional regression, black-box optimization, and image modeling. We investigate the practical details of prompting to elicit coherent predictive distributions, and demonstrate their effectiveness at regression. Finally, we demonstrate the ability to usefully incorporate text into numerical predictions, improving predictive performance and giving quantitative structure that reflects qualitative descriptions. This lets us begin to explore the rich, grounded hypothesis space that LLMs implicitly encode.

Submitted 25 May, 2024; v1 submitted 21 May, 2024; originally announced May 2024.

arXiv:2309.03969 [pdf, other]

Estimating the prevalence of indirect effects and other spillovers

Authors: David Choi

Abstract: In settings where interference between units is possible, we define the prevalence of indirect effects to be the number of units who are affected by the treatment of others. This quantity does not fully identify an indirect effect, but may be used to show whether such effects are widely prevalent. Given a randomized experiment with binary-valued outcomes, methods are presented for conservative point estimation and one-sided interval estimation. No assumptions beyond randomization of treatment are required, allowing for usage in settings where models or assumptions on interference might be questionable. To show asymptotic coverage of our intervals in settings not covered by existing results, we provide a central limit theorem that combines local dependence and sampling without replacement. Consistency and minimax properties of the point estimator are shown as well. The approach is demonstrated on an experiment in which students were treated for a highly transmissible parasitic infection, for which we find that a significant fraction of students were affected by the treatment of schools other than their own.

Submitted 16 January, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: small corrections to proofs and statement of Theorem 4

arXiv:2305.11445 [pdf, ps, other]

A general model-checking procedure for semiparametric accelerated failure time models

Authors: Dongrak Choi, Woojung Bae, Jun Yan, Sangwook Kang

Abstract: We propose a set of goodness-of-fit tests for the semiparametric accelerated failure time (AFT) model, including an omnibus test, a link function test, and a functional form test. This set of tests is derived from a multi-parameter cumulative sum process shown to follow asymptotically a zero-mean Gaussian process. Its evaluation is based on the asymptotically equivalent perturbed version, which enables both graphical and numerical evaluations of the assumed AFT model. Empirical p-values are obtained using the Kolmogorov-type supremum test, which provides a reliable approach for estimating the significance of both proposed un-standardized and standardized test statistics. The proposed procedure is illustrated using the induced smoothed rank-based estimator but is directly applicable to other popular estimators such as the non-smooth rank-based estimator or the least-squares estimator. Our proposed methods are rigorously evaluated using extensive simulation experiments that demonstrate their effectiveness in maintaining a Type I error rate and detecting departures from the assumed AFT model in practical sample sizes and censoring rates. Furthermore, the proposed approach is applied to the analysis of the Primary Biliary Cirrhosis data, a widely studied dataset in survival analysis, providing further evidence of the practical usefulness of the proposed methods in real-world scenarios. To make the proposed methods more accessible to researchers, we have implemented them in the R package afttest, which is publicly available on the Comprehensive R Archive Network.

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2210.14602 [pdf, other]

Efficient Data Mosaicing with Simulation-based Inference

Authors: Andrew Gambardella, Youngjun Choi, Doyo Choi, Jinjoon Lee

Abstract: We introduce an efficient algorithm for general data mosaicing, based on the simulation-based inference paradigm. Our algorithm takes as input a target datum, source data, and partitions of the target and source data into fragments, learning distributions over averages of fragments of the source data such that samples from those distributions approximate fragments of the target datum. We utilize a model that can be trivially parallelized in conjunction with the latest advances in efficient simulation-based inference in order to find approximate posteriors fast enough for use in practical applications. We demonstrate our technique is effective in both audio and image mosaicing problems.

Submitted 1 February, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

arXiv:2107.00248 [pdf, ps, other]

New Estimands for Experiments with Strong Interference

Authors: David Choi

Abstract: In experiments that study social phenomena, such as peer influence or herd immunity, the treatment of one unit may influence the outcomes of others. Such "interference between units" violates traditional approaches for causal inference, so that additional assumptions are often imposed to model or limit the underlying social mechanism. For binary outcomes, we propose new estimands that can be estimated without such assumptions, allowing for interval estimates assuming only the randomization of treatment. However, the causal implications of these estimands are more limited than those attainable under stronger assumptions, showing only that the treatment effects under the observed assignment varied systematically as a function of each unit's direct and indirect exposure, while also lower bounding the number of units affected.

Submitted 29 August, 2023; v1 submitted 1 July, 2021; originally announced July 2021.

Comments: new title, expanded discussion of interpretation and limitations, consolidation of central limit theorem results

arXiv:2105.02381 [pdf, other]

Balancing weights for region-level analysis: the effect of Medicaid Expansion on the uninsurance rate among states that did not expand Medicaid

Authors: Max Rubinstein, Amelia Haviland, David Choi

Abstract: We predict the average effect of Medicaid expansion on the non-elderly adult uninsurance rate among states that did not expand Medicaid in 2014 as if they had expanded their Medicaid eligibility requirements. Using American Community Survey data aggregated to the region level, we estimate this effect by finding weights that approximately reweight the expansion regions to match the covariate distribution of the non-expansion regions. Existing methods to estimate balancing weights often assume that the covariates are measured without error and do not account for dependencies in the outcome model. Our covariates have random noise that is uncorrelated with the outcome errors and our outcome model has state-level random effects inducing dependence between regions. To correct for the bias induced by the measurement error, we propose generating our weights on a linear approximation to the true covariates, using an idea from measurement error literature known as "regression-calibration" (see, e.g., Carroll (2006)). This requires auxiliary data to estimate the variability of the measurement error. We also modify the Stable Balancing Weights objective proposed by Zubizarreta (2015) to reduce the variance of our estimator when the model errors follow our assumed correlation structure. We show that these approaches outperform existing methods when attempting to predict observed outcomes during the pre-treatment period. Using this method we estimate that Medicaid expansion would have caused a -2.33 (-3.54, -1.11) percentage point change in the adult uninsurance rate among states that did not expand Medicaid.

Submitted 23 May, 2022; v1 submitted 5 May, 2021; originally announced May 2021.

arXiv:2009.01444 [pdf, other]

Data Programming by Demonstration: A Framework for Interactively Learning Labeling Functions

Authors: Sara Evensen, Chang Ge, Dongjin Choi, Çağatay Demiralp

Abstract: Data programming is a programmatic weak supervision approach to efficiently curate large-scale labeled training data. Writing data programs (labeling functions) requires, however, both programming literacy and domain expertise. Many subject matter experts have neither programming proficiency nor time to effectively write data programs. Furthermore, regardless of one's expertise in coding or machine learning, transferring domain expertise into labeling functions by enumerating rules and thresholds is not only time consuming but also inherently difficult. Here we propose a new framework, data programming by demonstration (DPBD), to generate labeling rules using interactive demonstrations of users. DPBD aims to relieve the burden of writing labeling functions from users, enabling them to focus on higher-level semantics such as identifying relevant signals for labeling tasks. We operationalize our framework with Ruler, an interactive system that synthesizes labeling rules for document classification by using span-level annotations of users on document examples. We compare Ruler with conventional data programming through a user study conducted with 10 data scientists creating labeling functions for sentiment and spam classification tasks. We find that Ruler is easier to use and learn and offers higher overall satisfaction, while providing discriminative model performances comparable to ones achieved by conventional data programming.

Submitted 15 September, 2020; v1 submitted 3 September, 2020; originally announced September 2020.

arXiv:2006.08063 [pdf, other]

Gradient Estimation with Stochastic Softmax Tricks

Authors: Max B. Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, Chris J. Maddison

Abstract: The Gumbel-Max trick is the basis of many relaxed gradient estimators. These estimators are easy to implement and low variance, but the goal of scaling them comprehensively to large combinatorial distributions is still outstanding. Working within the perturbation model framework, we introduce stochastic softmax tricks, which generalize the Gumbel-Softmax trick to combinatorial spaces. Our framework is a unified perspective on existing relaxed estimators for perturbation models, and it contains many novel relaxations. We design structured relaxations for subset selection, spanning trees, arborescences, and others. When compared to less structured baselines, we find that stochastic softmax tricks can be used to train latent variable models that perform better and discover more latent structure.

Submitted 28 February, 2021; v1 submitted 14 June, 2020; originally announced June 2020.

Comments: NeurIPS 2020, final copy
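
For readers unfamiliar with the two building blocks named in the abstract above, the following is a minimal sketch of the standard Gumbel-Max trick and its Gumbel-Softmax relaxation for a plain categorical distribution. It is not the paper's stochastic softmax tricks for combinatorial spaces; the function names, the temperature value, and the example probabilities are illustrative assumptions.

```python
import math
import random


def sample_gumbel(rng):
    """Draw standard Gumbel(0, 1) noise by inverse transform: -log(-log(U))."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Gumbel-Max: the argmax of (logit + Gumbel noise) is a categorical sample."""
    perturbed = [logit + sample_gumbel(rng) for logit in logits]
    return max(range(len(logits)), key=lambda i: perturbed[i])


def gumbel_softmax_sample(logits, temperature, rng):
    """Gumbel-Softmax: replace the argmax with a temperature-scaled softmax."""
    perturbed = [(logit + sample_gumbel(rng)) / temperature for logit in logits]
    top = max(perturbed)  # subtract the max for numerical stability
    exps = [math.exp(p - top) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative check: empirical Gumbel-Max frequencies should approach probs.
rng = random.Random(0)
probs = [0.2, 0.3, 0.5]  # hypothetical example distribution
logits = [math.log(p) for p in probs]
num_draws = 20000
counts = [0] * len(probs)
for _ in range(num_draws):
    counts[gumbel_max_sample(logits, rng)] += 1
freqs = [c / num_draws for c in counts]

# One relaxed sample: a point on the simplex instead of a hard index.
relaxed = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
```

Lower temperatures push the relaxed sample toward a one-hot vector, at the cost of higher-variance gradients when the relaxation is differentiated.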

arXiv:1910.05446 [pdf, other]

On Empirical Comparisons of Optimizers for Deep Learning

Authors: Dami Choi, Christopher J. Shallue, Zachary Nado, Jaehoon Lee, Chris J. Maddison, George E. Dahl

Abstract: Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. Our findings suggest that the hyperparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when hyperparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the hyperparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent. We also report practical tips around tuning often ignored hyperparameters of adaptive gradient methods and raise concerns about fairly benchmarking optimizers for neural network training.

Submitted 15 June, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

arXiv:1906.01498 [pdf, other]

Multimodal Ensemble Approach to Incorporate Various Types of Clinical Notes for Predicting Readmission

Authors: Bonggun Shin, Julien Hogan, Andrew B. Adams, Raymond J. Lynch, Rachel E. Patzer, Jinho D. Choi

Abstract: Electronic Health Records (EHRs) have been heavily used to predict various downstream clinical tasks such as readmission or mortality. One of the modalities in EHRs, clinical notes, has not been fully explored for these tasks due to its unstructured and inexplicable nature. Although recent advances in deep learning (DL) enable models to extract interpretable features from unstructured data, they often require a large amount of training data. However, many tasks in medical domains inherently consist of small sample data with lengthy documents; for a kidney transplant as an example, data from only a few thousand patients are available and each patient's document consists of a couple of million words in major hospitals. Thus, complex DL methods cannot be applied to these kinds of domains. In this paper, we present a comprehensive ensemble model using vector space modeling and topic modeling. Our proposed model is evaluated on the readmission task of kidney transplant patients and improves the c-statistic by 0.0211 over the previous state-of-the-art approach using structured data, while typical DL methods fail to beat this approach. The proposed architecture provides the interpretable score for each feature from both modalities, structured and unstructured data, which is shown to be meaningful through a physician's evaluation.

Submitted 31 May, 2019; originally announced June 2019.

Comments: 4 pages, IEEE BHI 2019

Journal ref: Proceedings of the IEEE-EMBS International Conference on Biomedical and Health Informatics, 2019 (BHI'19)

arXiv:1906.00095 [pdf, other]

The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning

Authors: Bonggun Shin, Hao Yang, Jinho D. Choi

Abstract: Recent advances in deep learning have facilitated the demand of neural models for real applications. In practice, these applications often need to be deployed with limited resources while keeping high accuracy. This paper touches the core of neural models in NLP, word embeddings, and presents a new embedding distillation framework that remarkably reduces the dimension of word embeddings without compromising accuracy. A novel distillation ensemble approach is also proposed that trains a highly efficient student model using multiple teacher models. In our approach, the teacher models play roles only during training such that the student model operates on its own without support from the teacher models during decoding, which makes it eighty times faster and lighter than other typical ensemble methods. All models are evaluated on seven document classification datasets and show a significant advantage over the teacher models for most cases. Our analysis depicts insightful transformation of word embeddings from distillation and suggests a future direction to ensemble approaches using neural models.

Submitted 31 May, 2019; originally announced June 2019.

Comments: 7 pages, Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019 (IJCAI'19)

arXiv:1905.09680 [pdf, other]

DEEP-BO for Hyperparameter Optimization of Deep Networks

Authors: Hyunghun Cho, Yongjin Kim, Eunjung Lee, Daeyoung Choi, Yongjae Lee, Wonjong Rhee

Abstract: The performance of deep neural networks (DNN) is very sensitive to the particular choice of hyper-parameters. To make matters worse, the shape of the learning curve can be significantly affected when a technique like batchnorm is used. As a result, hyperparameter optimization of deep networks can be much more challenging than that of traditional machine learning models. In this work, we start from well-known Bayesian Optimization solutions and provide enhancement strategies specifically designed for hyperparameter optimization of deep networks. The resulting algorithm is named DEEP-BO (Diversified, Early-termination-Enabled, and Parallel Bayesian Optimization). When evaluated over six DNN benchmarks, DEEP-BO easily outperforms or shows comparable performance with some of the well-known solutions including GP-Hedge, Hyperband, BOHB, Median Stopping Rule, and Learning Curve Extrapolation. The code used is made publicly available at https://github.com/snu-adsl/DEEP-BO.

Submitted 23 May, 2019; originally announced May 2019.

Comments: 26 pages, NeurIPS19 under review

arXiv:1811.03666 [pdf, other]

Statistical Characteristics of Deep Representations: An Empirical Investigation

Authors: Daeyoung Choi, Kyungeun Lee, Duhun Hwang, Wonjong Rhee

Abstract: In this study, the effects of eight representation regularization methods are investigated, including two newly developed rank regularizers (RR). The investigation shows that the statistical characteristics of representations such as correlation, sparsity, and rank can be manipulated as intended, during training. Furthermore, it is possible to improve the baseline performance simply by trying all the representation regularizers and fine-tuning the strength of their effects. In contrast to performance improvement, no consistent relationship between performance and statistical characteristics was observable. The results indicate that manipulation of statistical characteristics can be helpful for improving performance, but only indirectly through its influence on learning dynamics or its tuning effects.

Submitted 2 December, 2020; v1 submitted 8 November, 2018; originally announced November 2018.

arXiv:1809.09307 [pdf, other]

Utilizing Class Information for Deep Network Representation Shaping

Authors: Daeyoung Choi, Wonjong Rhee

Abstract: Statistical characteristics of deep network representations, such as sparsity and correlation, are known to be relevant to the performance and interpretability of deep learning. When a statistical characteristic is desired, often an adequate regularizer can be designed and applied during the training phase. Typically, such a regularizer aims to manipulate a statistical characteristic over all classes together. For classification tasks, however, it might be advantageous to enforce the desired characteristic per class such that different classes can be better distinguished. Motivated by the idea, we design two class-wise regularizers that explicitly utilize class information: class-wise Covariance Regularizer (cw-CR) and class-wise Variance Regularizer (cw-VR). cw-CR aims to reduce the covariance of representations calculated from the same class samples for encouraging feature independence. cw-VR is similar, but variance instead of covariance is targeted to improve feature compactness. For the sake of completeness, their counterparts without using class information, Covariance Regularizer (CR) and Variance Regularizer (VR), are considered together. The four regularizers are conceptually simple and computationally very efficient, and the visualization shows that the regularizers indeed perform distinct representation shaping. In terms of classification performance, significant improvements over the baseline and L1/L2 weight regularization methods were found for 21 out of 22 tasks over popular benchmark datasets. In particular, cw-VR achieved the best performance for 13 tasks including ResNet-32/110.

Submitted 28 February, 2019; v1 submitted 24 September, 2018; originally announced September 2018.

Comments: Published in AAAI 2019
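
The contrast the abstract above draws between a variance penalty computed over all samples (VR) and one computed within each class (cw-VR) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact regularizer: the normalization (mean over units, mean over classes), the function names, and the toy mini-batch are all hypothetical.

```python
from collections import defaultdict


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_penalty(representations):
    """VR-style penalty: per-unit activation variance over all samples,
    averaged over units (assumed normalization)."""
    num_units = len(representations[0])
    per_unit = [variance([row[j] for row in representations]) for j in range(num_units)]
    return sum(per_unit) / num_units


def classwise_variance_penalty(representations, labels):
    """cw-VR-style penalty: the same variance penalty computed separately
    within each class, then averaged over classes (assumed normalization)."""
    groups = defaultdict(list)
    for row, label in zip(representations, labels):
        groups[label].append(row)
    return sum(variance_penalty(rows) for rows in groups.values()) / len(groups)


# Hypothetical mini-batch: two tight clusters, one per class, far apart.
batch = [[0.0, 1.0], [0.2, 1.0], [5.0, -1.0], [5.2, -1.0]]
batch_labels = [0, 0, 1, 1]

vr = variance_penalty(batch)
cw_vr = classwise_variance_penalty(batch, batch_labels)
```

On this toy batch the class-wise penalty is far smaller than the all-sample penalty, because it ignores the separation between classes and only penalizes spread within each class. In training, such a term would be added to the task loss with a tunable weight.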

arXiv:1809.01316 [pdf, other]

Learning User Preferences and Understanding Calendar Contexts for Event Scheduling

Authors: Donghyeon Kim, Jinhyuk Lee, Donghee Choi, Jaehoon Choi, Jaewoo Kang

Abstract: With online calendar services gaining popularity worldwide, calendar data has become one of the richest context sources for understanding human behavior. However, event scheduling is still time-consuming even with the development of online calendars. Although machine learning based event scheduling models have automated scheduling processes to some extent, they often fail to understand subtle user preferences and complex calendar contexts with event titles written in natural language. In this paper, we propose Neural Event Scheduling Assistant (NESA) which learns user preferences and understands calendar contexts, directly from raw online calendars for fully automated and highly effective event scheduling. We leverage over 593K calendar events for NESA to learn scheduling personal events, and we further utilize NESA for multi-attendee event scheduling. NESA successfully incorporates deep neural networks such as Bidirectional Long Short-Term Memory, Convolutional Neural Network, and Highway Network for learning the preferences of each user and understanding calendar context based on natural languages. The experimental results show that NESA significantly outperforms previous baseline models in terms of various evaluation metrics on both personal and multi-attendee event scheduling tasks. Our qualitative analysis demonstrates the effectiveness of each layer in NESA and learned user preferences.

Submitted 18 July, 2020; v1 submitted 5 September, 2018; originally announced September 2018.

Comments: CIKM 2018

arXiv:1806.11219 [pdf, other]

Using Exposure Mappings as Side Information in Experiments with Interference

Authors: David Choi

Abstract: Exposure mappings are widely used to model potential outcomes in the presence of interference, where each unit's outcome may depend not only on its own treatment, but also on the treatment of other units. However, in practice these models may be only a crude proxy for social dynamics. In this work, we give estimands and estimators that are robust to the misspecification of an exposure model. In the first part, we require the treatment effect to be nonnegative (or "monotone") in both direct effects and spillovers. In the second part, we consider a weaker estimand ("contrasts attributable to treatment") which makes no restrictions on the interference at all.

Submitted 28 June, 2018; originally announced June 2018.

arXiv:1806.10230 [pdf, other]

Guided evolutionary strategies: Augmenting random search with surrogate gradients

Authors: Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, Jascha Sohl-Dickstein

Abstract: Many applications in machine learning require optimizing a function whose true gradient is unknown, but where surrogate gradient information (directions that may be correlated with, but not necessarily identical to, the true gradient) is available instead. This arises when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g. in certain reinforcement learning applications, or when using synthetic gradients). We propose Guided Evolutionary Strategies, a method for optimally using surrogate gradient directions along with random search. We define a search distribution for evolutionary strategies that is elongated along a guiding subspace spanned by the surrogate gradients. This allows us to estimate a descent direction which can then be passed to a first-order optimizer. We analytically and numerically characterize the tradeoffs that result from tuning how strongly the search distribution is stretched along the guiding subspace, and we use this to derive a setting of the hyperparameters that works well across problems. Finally, we apply our method to example problems, demonstrating an improvement over both standard evolutionary strategies and first-order methods (that directly follow the surrogate gradient). We provide a demo of Guided ES at https://github.com/brain-research/guided-evolutionary-strategies

Submitted 10 June, 2019; v1 submitted 26 June, 2018; originally announced June 2018.

Comments: Published at ICML 2019

arXiv:1711.08095 [pdf, ps, other]

SNeCT: Scalable network constrained Tucker decomposition for integrative multi-platform data analysis

Authors: Dongjin Choi, Lee Sael

Abstract: Motivation: How do we integratively analyze large-scale multi-platform genomic data that are high dimensional and sparse? Furthermore, how can we incorporate prior knowledge, such as the association between genes, in the analysis systematically? Method: To solve this problem, we propose a Scalable Network Constrained Tucker decomposition method we call SNeCT. SNeCT adopts a parallel stochastic gradient descent approach on the proposed parallelizable network constrained optimization function. SNeCT decomposition is applied to a tensor constructed from large scale multi-platform multi-cohort cancer data, PanCan12, constrained on a network built from the PathwayCommons database. Results: The decomposed factor matrices are applied to stratify cancers, to search for top-k similar patients, and to illustrate how the matrices can be used for personalized interpretation. In the stratification test, combined twelve-cohort data is clustered to form thirteen subclasses. The thirteen subclasses have a high correlation to tissue of origin in addition to other interesting observations, such as clear separation of OV cancers into two groups, and high clinical correlation within subclusters formed in cohorts BRCA and UCEC. In the top-k search, a new patient's genomic profile is generated and searched against existing patients based on the factor matrices. The similarity of the top-k patient to the query is high for 23 clinical features, including estrogen/progesterone receptor statuses of BRCA patients, with average precision values ranging from 0.72 to 0.86 and from 0.68 to 0.86, respectively. We also provide an illustration of how the factor matrices can be used for interpretable personalized analysis of each patient.

Submitted 26 November, 2017; v1 submitted 21 November, 2017; originally announced November 2017.

Comments: 8 pages

arXiv:1710.03608 [pdf, other]

doi 10.1371/journal.pone.0200579

CTD: Fast, Accurate, and Interpretable Method for Static and Dynamic Tensor Decompositions

Authors: Jungwoo Lee, Dongjin Choi, Lee Sael

Abstract: How can we find patterns and anomalies in a tensor, or multi-dimensional array, in an efficient and directly interpretable way? How can we do this in an online environment, where a new tensor arrives each time step? Finding patterns and anomalies in a tensor is a crucial problem with many applications, including building safety monitoring, patient health monitoring, cyber security, terrorist detection, and fake user detection in social networks. Standard PARAFAC and Tucker decomposition results are not directly interpretable. Although a few sampling-based methods have previously been proposed towards better interpretability, they need to be made faster, more memory efficient, and more accurate. In this paper, we propose CTD, a fast, accurate, and directly interpretable tensor decomposition method based on sampling. CTD-S, the static version of CTD, provably guarantees a high accuracy that is 17 ~ 83x more accurate than that of the state-of-the-art method. Also, CTD-S is made 5 ~ 86x faster, and 7 ~ 12x more memory-efficient than the state-of-the-art method by removing redundancy. CTD-D, the dynamic version of CTD, is the first interpretable dynamic tensor decomposition method ever proposed. Also, it is made 2 ~ 3x faster than already fast CTD-S by exploiting factors at previous time step and by reordering operations. With CTD, we demonstrate how the results can be effectively interpreted in the online distributed denial of service (DDoS) attack detection.

Submitted 9 October, 2017; originally announced October 2017.

arXiv:1611.05407 [pdf, other]

A Semidefinite Program for Structured Blockmodels

Authors: David Choi

Abstract: Semidefinite programs have recently been developed for the problem of community detection, which may be viewed as a special case of the stochastic blockmodel. Here, we develop a semidefinite program that can be tailored to other instances of the blockmodel, such as non-assortative networks and overlapping communities. We establish label recovery in sparse settings, with conditions that are analogous to recent results for community detection. In settings where the data is not generated by a blockmodel, we give an oracle inequality that bounds excess risk relative to the best blockmodel approximation. Simulations are presented for community detection, for overlapping communities, and for latent space models.

Submitted 16 November, 2016; originally announced November 2016.

arXiv:1604.04264 [pdf, other]

A semiparametric mixture method for local false discovery rate estimation

Authors: Seok-Oh Jeong, Dongseok Choi, Woncheol Jang

Abstract: We propose a semiparametric mixture model to estimate local false discovery rates in multiple testing problems. The two pillars of the proposed approach are Efron's empirical null principle and log-concave density estimation for the alternative distribution. Compared to existing methods, our method can be easily extended to high dimension. Simulation results show that our method outperforms other existing methods and we illustrate its use via case studies in astronomy and microarray.

Submitted 14 April, 2016; originally announced April 2016.

arXiv:1408.4102 [pdf, other]

Estimation of Monotone Treatment Effects in Network Experiments

Authors: David S. Choi

Abstract: Randomized experiments on social networks pose statistical challenges, due to the possibility of interference between units. We propose new methods for estimating attributable treatment effects in such settings. The methods do not require partial interference, but instead require an identifying assumption that is similar to requiring nonnegative treatment effects. Network or spatial information can be used to customize the test statistic; in principle, this can increase power without making assumptions on the data generating process.

Submitted 12 October, 2015; v1 submitted 18 August, 2014; originally announced August 2014.

Comments: new methods and data examples added

arXiv:1310.4249 [pdf, other]

Mapping the stereotyped behaviour of freely-moving fruit flies

Authors: Gordon J. Berman, Daniel M. Choi, William Bialek, Joshua W. Shaevitz

Abstract: Most animals possess the ability to actuate a vast diversity of movements, ostensibly constrained only by morphology and physics. In practice, however, a frequent assumption in behavioral science is that most of an animal's activities can be described in terms of a small set of stereotyped motifs. Here we introduce a method for mapping the behavioral space of organisms, relying only upon the underlying structure of postural movement data to organize and classify behaviors. We find that six different drosophilid species each perform a mix of non-stereotyped actions and over one hundred hierarchically-organized, stereotyped behaviors. Moreover, we use this approach to compare these species' behavioral spaces, systematically identifying subtle behavioral differences between closely-related species.

Submitted 11 August, 2014; v1 submitted 15 October, 2013; originally announced October 2013.

Comments: 21 pages, 17 figures. Email GJB ([email protected]) to see supplementary movies, Journal of the Royal Society Interface, 2014

arXiv:1212.4093 [pdf, ps, other]

doi 10.1214/13-AOS1173

Co-clustering separately exchangeable network data

Authors: David Choi, Patrick J. Wolfe

Abstract: This article establishes the performance of stochastic blockmodels in addressing the co-clustering problem of partitioning a binary array into subsets, assuming only that the data are generated by a nonparametric process satisfying the condition of separate exchangeability. We provide oracle inequalities with rate of convergence $\mathcal{O}_P(n^{-1/4})$ corresponding to profile likelihood maximization and mean-square error minimization, and show that the blockmodel can be interpreted in this setting as an optimal piecewise-constant approximation to the generative nonparametric model. We also show for large sample sizes that the detection of co-clusters in such data indicates with high probability the existence of co-clusters of equal size and asymptotically equivalent connectivity in the underlying generative process.

Submitted 16 January, 2014; v1 submitted 17 December, 2012; originally announced December 2012.

Comments: Published at http://dx.doi.org/10.1214/13-AOS1173 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1173

Journal ref: Annals of Statistics 2014, Vol. 42, No. 1, 29-63

arXiv:1105.6245 [pdf, other]

doi 10.1002/sam.10136

Confidence sets for network structure

Authors: Edoardo M. Airoldi, David S. Choi, Patrick J. Wolfe

Abstract: Latent variable models are frequently used to identify structure in dichotomous network data, in part because they give rise to a Bernoulli product likelihood that is both well understood and consistent with the notion of exchangeable random graphs. In this article we propose conservative confidence sets that hold with respect to these underlying Bernoulli parameters as a function of any given partition of network nodes, enabling us to assess estimates of 'residual' network structure, that is, structure that cannot be explained by known covariates and thus cannot be easily verified by manual inspection. We demonstrate the proposed methodology by analyzing student friendship networks from the National Longitudinal Survey of Adolescent Health that include race, gender, and school year as covariates. We employ a stochastic expectation-maximization algorithm to fit a logistic regression model that includes these explanatory variables as well as a latent stochastic blockmodel component and additional node-specific effects. Although maximum-likelihood estimates do not appear consistent in this context, we are able to evaluate confidence sets as a function of different blockmodel partitions, which enables us to qualitatively assess the significance of estimated residual network structure relative to a baseline, which models covariates but lacks block structure.

Submitted 31 May, 2011; originally announced May 2011.

Comments: 17 pages, 3 figures, 3 tables

Journal ref: Statistical Analysis and Data Mining, vol. 4, pp. 461-469, 2011

arXiv:1011.4644 [pdf, ps, other]

doi 10.1093/biomet/asr053

Stochastic blockmodels with growing number of classes

Authors: David S. Choi, Patrick J. Wolfe, Edoardo M. Airoldi

Abstract: We present asymptotic and finite-sample results on the use of stochastic blockmodels for the analysis of network data. We show that the fraction of misclassified network nodes converges in probability to zero under maximum likelihood fitting when the number of classes is allowed to grow as the root of the network size and the average network degree grows at least poly-logarithmically in this size. We also establish finite-sample confidence bounds on maximum-likelihood blockmodel parameter estimates from data comprising independent Bernoulli random variates; these results hold uniformly over class assignment. We provide simulations verifying the conditions sufficient for our results, and conclude by fitting a logit parameterization of a stochastic blockmodel with covariates to a network data example comprising a collection of Facebook profiles, resulting in block estimates that reveal residual structure.

Submitted 30 April, 2011; v1 submitted 21 November, 2010; originally announced November 2010.

Comments: 12 pages, 3 figures; revised version

Journal ref: Biometrika, 99:273–284, 2012

Showing 1–26 of 26 results for author: Choi, D