
Generalization error of min-norm interpolators in transfer learning

Authors: Yanke Song, Sohom Bhattacharya, Pragya Sur

Abstract: This paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well-understood. We address this gap by characterizing the bias and variance of pooled min-$\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal ratio (SSR) lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk, and vice versa.
We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 53 pages, 2 figures

arXiv:2406.11828 [pdf, other]

Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations

Authors: Kazusato Oko, Yu** Song, Taiji Suzuki, Denny Wu

Abstract: We study the computational and sample complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure, that is, $f_*(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^M f_m(\langle x, v_m\rangle)$, where $f_1,f_2,...,f_M:\mathbb{R}\to\mathbb{R}$ are nonlinear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features $\{v_m\}_{m=1}^M$, and the number of additive tasks $M$ grows with the dimensionality $M\asymp d^\gamma$ for $\gamma\ge 0$. This problem setting is motivated by the classical additive model literature, the recent representation learning theory of two-layer neural networks, and large-scale pretraining where the model simultaneously acquires a large number of "skills" that are often localized in distinct parts of the trained network. We prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network, with a polynomial statistical and computational complexity that depends on the number of tasks $M$ and the information exponent of $f_m$, despite the unknown link function and $M$ growing with the dimensionality. We complement this learnability guarantee with a computational hardness result by establishing statistical query (SQ) lower bounds for both the correlational SQ and full SQ algorithms.

Submitted 17 June, 2024; originally announced June 2024.

Comments: COLT 2024

arXiv:2406.11184 [pdf, other]

HEDE: Heritability estimation in high dimensions by Ensembling Debiased Estimators

Authors: Yanke Song, Xihong Lin, Pragya Sur

Abstract: Estimating heritability remains a significant challenge in statistical genetics. Diverse approaches have emerged over the years that are broadly categorized as either random effects or fixed effects heritability methods. In this work, we focus on the latter. We propose HEDE, an ensemble approach to estimate heritability or the signal-to-noise ratio in high-dimensional linear models where the sample size and the dimension grow proportionally. Our method ensembles post-processed versions of the debiased lasso and debiased ridge estimators, and incorporates a data-driven strategy for hyperparameter selection that significantly boosts estimation performance. We establish rigorous consistency guarantees that hold despite adaptive tuning. Extensive simulations demonstrate our method's superiority over existing state-of-the-art methods across various signal structures and genetic architectures, ranging from sparse to relatively dense and from evenly to unevenly distributed signals. Furthermore, we discuss the advantages of fixed effects heritability estimation compared to random effects estimation. Our theoretical guarantees hold for realistic genotype distributions observed in genetic studies, where genotypes typically take on discrete values and are often well-modeled by sub-Gaussian distributed random variables. We establish our theoretical results by deriving uniform bounds, built upon the convex Gaussian min-max theorem, and leveraging universality results.
Finally, we showcase the efficacy of our approach in estimating height and BMI heritability using the UK Biobank.

Submitted 16 June, 2024; originally announced June 2024.

Comments: 58 pages, 7 figures

arXiv:2406.06149 [pdf, other]

Decoupled Marked Temporal Point Process using Neural Ordinary Differential Equations

Authors: Yujee Song, Donghyun Lee, Rui Meng, Won Hwa Kim

Abstract: A Marked Temporal Point Process (MTPP) is a stochastic process whose realization is a set of event-time data. MTPP is often used to understand complex dynamics of asynchronous temporal events such as money transactions, social media, healthcare, etc. Recent studies have utilized deep neural networks to capture complex temporal dependencies of events and generate embeddings that aptly represent the observed events. While most previous studies focus on the inter-event dependencies and their representations, how individual events influence the overall dynamics over time has been under-explored. In this regime, we propose a Decoupled MTPP framework that disentangles characterization of a stochastic process into a set of evolving influences from different events. Our approach employs Neural Ordinary Differential Equations (Neural ODEs) to learn flexible continuous dynamics of these influences while simultaneously addressing multiple inference problems, such as density estimation and survival rate computation. We emphasize the significance of disentangling the influences by comparing our framework with state-of-the-art methods on real-life datasets, and provide analysis on the model behavior for potential applications.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 18 pages, 8 figures, The Twelfth International Conference on Learning Representations (ICLR 2024)

arXiv:2406.00396 [pdf, other]

Stochastic Restarting to Overcome Overfitting in Neural Networks with Noisy Labels

Authors: Youngkyoung Bae, Yeongwoo Song, Hawoong Jeong

Abstract: Despite its prevalence, giving up and starting over may seem wasteful in many situations such as searching for a target or training deep neural networks (DNNs). Our study, though, demonstrates that restarting from a checkpoint can significantly improve generalization performance when training DNNs with noisy labels. In the presence of noisy labels, DNNs initially learn the general patterns of the data but then gradually overfit to the noisy labels. To combat this overfitting phenomenon, we developed a method based on stochastic restarting, which has been actively explored in the statistical physics field for finding targets efficiently. By approximating the dynamics of stochastic gradient descent into Langevin dynamics, we theoretically show that restarting can provide great improvements as the batch size and the proportion of corrupted data increase. We then empirically validate our theory, confirming the significant improvements achieved by restarting. An important aspect of our method is its ease of implementation and compatibility with other methods, while still yielding notably improved performance. We envision it as a valuable tool that can complement existing methods for handling noisy labels.

Submitted 1 June, 2024; originally announced June 2024.

Comments: 21 pages, 10 figures

arXiv:2405.07220 [pdf, other]

On Discovery of Local Independence over Continuous Variables via Neural Contextual Decomposition

Authors: Inwoo Hwang, Yunhyeok Kwak, Yeon-Ji Song, Byoung-Tak Zhang, Sanghack Lee

Abstract: Conditional independence provides a way to understand causal relationships among the variables of interest. An underlying system may exhibit more fine-grained causal relationships, especially between a variable and its parents, which will be called the local independence relationships. One of the most widely studied local relationships is Context-Specific Independence (CSI), which holds in a specific assignment of conditioned variables. However, its applicability is often limited since it does not allow continuous variables: data conditioned to the specific value of a continuous variable contains few instances, if not none, making it infeasible to test independence. In this work, we define and characterize the local independence relationship that holds in a specific set of joint assignments of parental variables, which we call context-set specific independence (CSSI). We then provide a canonical representation of CSSI and prove its fundamental properties. Based on our theoretical findings, we cast the problem of discovering multiple CSSI relationships in a system as finding a partition of the joint outcome space. Finally, we propose a novel method, coined neural contextual decomposition (NCD), which learns such a partition by imposing each set to induce CSSI via modeling a conditional distribution. We empirically demonstrate that the proposed method successfully discovers the ground truth local independence relationships in both a synthetic dataset and a complex system reflecting real-world physical dynamics.

Submitted 12 May, 2024; originally announced May 2024.

Comments: Conference on Causal Learning and Reasoning (CLeaR), 2023

arXiv:2402.13259 [pdf, other]

Fast Discrete-Event Simulation of Markovian Queueing Networks through Euler Approximation

Authors: L. Jeff Hong, Yingda Song, Tan Wang

Abstract: The efficient management of large-scale queueing networks is critical for a variety of sectors, including healthcare, logistics, and customer service, where system performance has profound implications for operational effectiveness and cost management. To address this key challenge, our paper introduces simulation techniques tailored for complex, large-scale Markovian queueing networks. We develop two simulation schemes based on Euler approximation, namely the backward and forward schemes. These schemes can accommodate time-varying dynamics and are optimized for efficient implementation using vectorization. Assuming a feedforward queueing network structure, we establish that the two schemes provide stochastic upper and lower bounds for the system state, while the approximation error remains bounded over the simulation horizon. With the recommended choice of time step, we show that our approximation schemes exhibit diminishing asymptotic relative error as the system scales up, while maintaining much lower computational complexity compared to traditional discrete-event simulation and achieving speedups of up to tens of thousands of times. This study highlights the substantial potential of Euler approximation in simulating large-scale discrete systems.

Submitted 2 February, 2024; originally announced February 2024.

arXiv:2401.12824 [pdf, other]

MAPPING: Debiasing Graph Neural Networks for Fair Node Classification with Limited Sensitive Information Leakage

Authors: Ying Song, Balaji Palanisamy

Abstract: Despite remarkable success in diverse web-based applications, Graph Neural Networks (GNNs) inherit and further exacerbate historical discrimination and social stereotypes, which critically hinder their deployment in high-stakes domains such as online clinical diagnosis, financial crediting, etc. However, current fairness research, which is primarily crafted for i.i.d. data, cannot be trivially replicated to non-i.i.d. graph structures with topological dependence among samples. Existing fair graph learning typically favors pairwise constraints to achieve fairness but fails to cast off dimensional limitations and generalize them into multiple sensitive attributes; besides, most studies focus on in-processing techniques to enforce and calibrate fairness, while constructing a model-agnostic debiasing GNN framework at the pre-processing stage to prevent downstream misuses and improve training reliability remains largely under-explored. Furthermore, previous work on GNNs tends to enhance either fairness or privacy individually, but few studies probe into their interplay. In this paper, we propose a novel model-agnostic debiasing framework named MAPPING (\underline{M}asking \underline{A}nd \underline{P}runing and Message-\underline{P}assing train\underline{ING}) for fair node classification, in which we adopt distance covariance ($dCov$)-based fairness constraints to simultaneously reduce feature and topology biases in arbitrary dimensions, and combine them with adversarial debiasing to confine the risks of attribute inference attacks.
Experiments on real-world datasets with different GNN variants demonstrate the effectiveness and flexibility of MAPPING. Our results show that MAPPING can achieve better trade-offs between utility and fairness, and mitigate privacy risks of sensitive information leakage.

Submitted 23 January, 2024; originally announced January 2024.

Comments: Finished May last year. Remember to submit all papers to arXiv early without compromising the principles of conferences

arXiv:2311.08384 [pdf, other]

Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees

Authors: Yifei Zhou, Ayush Sekhari, Yuda Song, Wen Sun

Abstract: Hybrid RL is the setting where an RL agent has access to both offline data and online data by interacting with the real-world environment. In this work, we propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. On-policy methods such as policy gradient and natural policy gradient (NPG) have been shown to be more robust to model misspecification, though they may sometimes not be as sample efficient as methods that rely on off-policy learning. On the other hand, offline methods that depend on off-policy training often require strong assumptions in theory and are less stable to train in practice. Our new approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework. We show that our approach, in theory, can obtain a best-of-both-worlds type of result -- it achieves the state-of-the-art theoretical guarantees of offline RL when offline RL-specific assumptions hold, while at the same time maintaining the theoretical guarantees of on-policy NPG regardless of the offline RL assumptions' validity. Experimentally, in challenging rich-observation environments, we show that our approach outperforms a state-of-the-art hybrid RL baseline which only relies on off-policy policy optimization, demonstrating the empirical benefit of combining on-policy and off-policy learning. Our code is publicly available at https://github.com/YifeiZhou02/HNPG.

Submitted 14 November, 2023; originally announced November 2023.

Comments: The first two authors contributed equally

arXiv:2310.04367 [pdf]

A Marketplace Price Anomaly Detection System at Scale

Authors: Akshit Sarpal, Qiwen Kang, Fang** Huang, Yang Song, Lijie Wan

Abstract: Online marketplaces execute a large volume of price updates that are initiated by individual marketplace sellers each day on the platform. This price democratization comes with increasing challenges with data quality. The lack of the centralized guardrails that are available to a traditional online retailer makes it more likely that inaccurate prices get published on the website, leading to poor customer experience and potential revenue loss. We present MoatPlus (Masked Optimal Anchors using Trees, Proximity-based Labeling and Unsupervised Statistical-features), a scalable price anomaly detection framework for a growing marketplace platform. The goal is to leverage proximity and historical price trends from unsupervised statistical features to generate an upper price bound. We build an ensemble of models to detect irregularities in price-based features, exclude irregular features, and use an optimized weighting scheme to build a reliable price bound in a real-time pricing pipeline. We observed that our approach improves precise anchor coverage by up to 46.6% in high-vulnerability item subsets.

Submitted 9 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

Comments: 10 pages, 4 figures, 7 tables

arXiv:2310.02216 [pdf, other]

Efficient stochastic generators with spherical harmonic transformation for high-resolution global climate simulations from CESM2-LENS2

Authors: Yan Song, Zubair Khalid, Marc G. Genton

Abstract: Earth system models (ESMs) are fundamental for understanding Earth's complex climate system. However, the computational demands and storage requirements of ESM simulations limit their utility. For the newly published CESM2-LENS2 data, which suffer from this issue, we propose a novel stochastic generator (SG) as a practical complement to the CESM2, capable of rapidly producing emulations closely mirroring training simulations. Our SG leverages the spherical harmonic transformation (SHT) to shift from spatial to spectral domains, enabling efficient low-rank approximations that significantly reduce computational and storage costs. By accounting for axial symmetry and retaining distinct ranks for land and ocean regions, our SG captures intricate non-stationary spatial dependencies. Additionally, a modified Tukey g-and-h (TGH) transformation accommodates non-Gaussianity in high-temporal-resolution data. We apply the proposed SG to generate emulations for surface temperature simulations from the CESM2-LENS2 data across various scales, marking the first attempt at reproducing daily data. These emulations are then meticulously validated against training simulations. This work offers a promising complementary pathway for efficient climate modeling and analysis while overcoming computational and storage limitations.

Submitted 24 May, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

arXiv:2307.00190 [pdf]

Estimands in Real-World Evidence Studies

Authors: Jie Chen, Daniel Scharfstein, Hongwei Wang, Binbing Yu, Yang Song, Weili He, John Scott, Xiwu Lin, Hana Lee

Abstract: A Real-World Evidence (RWE) Scientific Working Group (SWG) of the American Statistical Association Biopharmaceutical Section (ASA BIOP) has been reviewing statistical considerations for the generation of RWE to support regulatory decision-making. As part of the effort, the working group is addressing estimands in RWE studies. Constructing the right estimand -- the target of estimation -- which reflects the research question and the study objective, is one of the key components in formulating a clinical study. ICH E9(R1) describes statistical principles for constructing estimands in clinical trials with a focus on five attributes -- population, treatment, endpoints, intercurrent events, and population-level summary. However, defining estimands for clinical studies using real-world data (RWD), i.e., RWE studies, requires additional considerations due to, for example, heterogeneity of study population, complexity of treatment regimes, different types and patterns of intercurrent events, and complexities in choosing study endpoints. This paper reviews the essential components of estimands and causal inference framework, discusses considerations in constructing estimands for RWE studies, highlights similarities and differences in traditional clinical trial and RWE study estimands, and provides a roadmap for choosing appropriate estimands for RWE studies.

Submitted 30 June, 2023; originally announced July 2023.

arXiv:2305.07813 [pdf, other]

Fast robust location and scatter estimation: a depth-based method

Authors: Maoyu Zhang, Yan Song, Wenlin Dai

Abstract: The minimum covariance determinant (MCD) estimator is ubiquitous in multivariate analysis, the critical step of which is to select a subset of a given size with the lowest sample covariance determinant. The concentration step (C-step) is a common tool for subset-seeking; however, it becomes computationally demanding for high-dimensional data. To alleviate the challenge, we propose a depth-based algorithm, termed \texttt{FDB}, which replaces the optimal subset with the trimmed region induced by statistical depth. We show that the depth-based region is consistent with the MCD-based subset under a specific class of depth notions, for instance, the projection depth. With the two suggested depths, the \texttt{FDB} estimator is not only computationally more efficient but also reaches the same level of robustness as the MCD estimator. Extensive simulation studies are conducted to assess the empirical performance of our estimators. We also validate the computational efficiency and robustness of our estimators under several typical tasks such as principal component analysis, linear discriminant analysis, image denoising and outlier detection on real-life datasets. An R package \textit{FDB} and potential extensions are available in the Supplementary Materials.

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2305.01188 [pdf, other]

Advancing inverse scattering with surrogate modeling and Bayesian inference for functional inputs

Authors: Chih-Li Sung, Yao Song, Ying Hung

Abstract: Inverse scattering aims to infer information about a hidden object by using the received scattered waves and training data collected from forward mathematical models. Recent advances in computing have led to increasing attention towards functional inverse inference, which can reveal more detailed properties of a hidden object. However, rigorous studies on functional inverse, including the reconstruction of the functional input and quantification of uncertainty, remain scarce. Motivated by an inverse scattering problem where the objective is to infer the functional input representing the refractive index of a bounded scatterer, a new Bayesian framework is proposed. It contains a surrogate model that takes into account the functional inputs directly through kernel functions, and a Bayesian procedure that infers functional inputs through the posterior distribution. Furthermore, the proposed Bayesian framework is extended to reconstruct functional inverse by integrating multi-fidelity simulations, including a high-fidelity simulator solved by finite element methods and a low-fidelity simulator called the Born approximation. When compared with existing alternatives developed by finite basis expansion, the proposed method provides more accurate functional recoveries with smaller prediction variations.

Submitted 1 May, 2023; originally announced May 2023.

arXiv:2304.09868 [pdf, other]

Accelerate Support Vector Clustering via Spectrum-Preserving Data Compression

Authors: Yuxuan Song, Yongyu Wang

Abstract: This paper proposes a novel framework for accelerating support vector clustering. The proposed method first computes much smaller compressed data sets while preserving the key cluster properties of the original data sets based on a novel spectral data compression approach. Then, the resultant spectrally-compressed data sets are leveraged for the development of a fast and high-quality algorithm for support vector clustering. We conducted extensive experiments using real-world data sets and obtained very promising results. The proposed method allows us to achieve 100X and 115X speedups over the state-of-the-art SVC method on the Pendigits and USPS data sets, respectively, while achieving even better clustering quality. To the best of our knowledge, this represents the first practical method for high-quality and fast SVC on large-scale real-world data sets.

Submitted 14 May, 2023; v1 submitted 18 April, 2023; originally announced April 2023.

arXiv:2304.09132 [pdf, other]

Independence testing for inhomogeneous random graphs

Authors: Yukun Song, Carey E. Priebe, Minh Tang

Abstract: Testing for independence between graphs is a problem that arises naturally in social network analysis and neuroscience. In this paper, we address independence testing for inhomogeneous Erdős-Rényi random graphs on the same vertex set. We first formulate a notion of pairwise correlations between the edges of these graphs and derive a necessary condition for their detectability. We next show that the problem can exhibit a statistical vs. computational tradeoff, i.e., there are regimes for which the correlations are statistically detectable but may require algorithms whose running time is exponential in n, the number of vertices. Finally, we consider a special case of correlation testing when the graphs are sampled from a latent space model (graphon) and propose an asymptotically valid and consistent test procedure that also runs in time polynomial in n.

Submitted 18 April, 2023; originally announced April 2023.

Comments: 24 pages, 2 figures

arXiv:2303.01469 [pdf, other]

Consistency Models

Authors: Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever

Abstract: Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.

Submitted 31 May, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

Comments: ICML 2023

arXiv:2302.11077 [pdf]

doi 10.1016/j.aap.2023.107016

Impact of Event Encoding and Dissimilarity Measures on Traffic Crash Characterization Based on Sequence of Events

Authors: Yu Song, Madhav V. Chitturi, David A. Noyce

Abstract: Crash sequence analysis has been shown in prior studies to be useful for characterizing crashes and identifying safety countermeasures. Sequence analysis is highly domain-specific, but its various techniques have not been evaluated for adaptation to crash sequences. This paper evaluates the impact of encoding and dissimilarity measures on crash sequence analysis and clustering. Sequence data of interstate highway, single-vehicle crashes in the United States, from 2016-2018, were studied. Two encoding schemes and five optimal matching based dissimilarity measures were compared by evaluating the sequence clustering results. The five dissimilarity measures were categorized into two groups based on correlations between dissimilarity matrices. The optimal dissimilarity measure and encoding scheme were identified based on the agreements with a benchmark crash categorization. The transition-rate-based, localized optimal matching dissimilarity and consolidated encoding scheme had the highest agreement with the benchmark. Evaluation results indicate that the selection of dissimilarity measure and encoding scheme determines the results of sequence clustering and crash characterization. A dissimilarity measure that considers the relationships between events and domain context tends to perform well in crash sequence clustering. An encoding scheme that consolidates similar events naturally takes domain context into consideration.

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2302.01269 [pdf, other]

Adjusting for Incomplete Baseline Covariates in Randomized Controlled Trials: A Cross-World Imputation Framework

Authors: Yilin Song, James P. Hughes, Ting Ye

Abstract: In randomized controlled trials, adjusting for baseline covariates is often applied to improve the precision of treatment effect estimation. However, missingness in covariates is common. Recently, Zhao & Ding (2022) studied two simple strategies, the single imputation method and missingness indicator method (MIM), to deal with missing covariates, and showed that both methods can provide efficiency gain. To better understand and compare these two strategies, we propose and investigate a novel imputation framework termed cross-world imputation (CWI), which includes single imputation and MIM as special cases. Through the lens of CWI, we show that MIM implicitly searches for the optimal CWI values and thus achieves optimal efficiency. We also derive conditions under which the single imputation method, by searching for the optimal single imputation values, can achieve the same efficiency as the MIM.

Submitted 2 February, 2023; originally announced February 2023.

arXiv:2212.01168 [pdf, other]

Towards Cross Domain Generalization of Hamiltonian Representation via Meta Learning

Authors: Yeongwoo Song, Hawoong Jeong

Abstract: Recent advances in deep learning for physics have focused on discovering shared representations of target systems by incorporating physics priors or inductive biases into neural networks. While effective, these methods are limited to the system domain, where the type of system remains consistent and thus cannot ensure the adaptation to new, or unseen physical systems governed by different laws. For instance, a neural network trained on a mass-spring system cannot guarantee accurate predictions for the behavior of a two-body system or any other system with different physical laws. In this work, we take a significant leap forward by targeting cross domain generalization within the field of Hamiltonian dynamics. We model our system with a graph neural network (GNN) and employ a meta learning algorithm to enable the model to gain experience over a distribution of systems and make it adapt to new physics. Our approach aims to learn a unified Hamiltonian representation that is generalizable across multiple system domains, thereby overcoming the limitations of system-specific models. We demonstrate that the meta-trained model captures the generalized Hamiltonian representation that is consistent across different physical domains. Overall, through the use of meta learning, we offer a framework that achieves cross domain generalization, providing a step towards a unified model for understanding a wide array of dynamical systems via deep learning.

Submitted 27 April, 2024; v1 submitted 2 December, 2022; originally announced December 2022.

Comments: Conference paper at ICLR 2024

arXiv:2210.16976 [pdf, other]

Representation Learning for General-sum Low-rank Markov Games

Authors: Chengzhuo Ni, Yuda Song, Xuezhou Zhang, Chi Jin, Mengdi Wang

Abstract: We study multi-agent general-sum Markov games with nonlinear function approximation. We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation. The goal is to design an algorithm that (1) finds an $\varepsilon$-equilibrium policy sample efficiently without prior knowledge of the environment or the representation, and (2) permits a deep-learning friendly implementation. We leverage representation learning and present a model-based and a model-free approach to construct an effective representation from the collected data. For both approaches, the algorithm achieves a sample complexity of poly$(H,d,A,1/\varepsilon)$, where $H$ is the game horizon, $d$ is the dimension of the feature vector, $A$ is the size of the joint action space and $\varepsilon$ is the optimality gap. When the number of players is large, the above sample complexity can scale exponentially with the number of players in the worst case. To address this challenge, we consider Markov games with a factorized transition structure and present an algorithm that escapes such exponential scaling. To our best knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates (non-linear) function approximation. We accompany our theoretical result with a neural network-based implementation of our algorithm and evaluate it against the widely used deep RL baseline, DQN with fictitious play.

Submitted 30 October, 2022; originally announced October 2022.

arXiv:2209.13117 [pdf, ps, other]

doi 10.1111/sjoe.12703

Consistent Covariance estimation for stratum imbalances under minimization method for covariate-adaptive randomization

Authors: Zixuan Zhao, Yanglei Song, Wenyu Jiang, Dongsheng Tu

Abstract: Pocock and Simon's minimization method is a popular approach for covariate-adaptive randomization in clinical trials. Valid statistical inference with data collected under the minimization method requires the knowledge of the limiting covariance matrix of within-stratum imbalances, whose existence is only recently established. In this work, we propose a bootstrap-based estimator for this limit and establish its consistency, in particular, by Le Cam's third lemma. As an application, we consider in simulation studies adjustments to existing robust tests for treatment effects with survival data by the proposed estimator. It shows that the adjusted tests achieve a size close to the nominal level, and unlike other designs, the robust tests without adjustment may have an asymptotic size inflation issue under the minimization method.

Submitted 26 December, 2023; v1 submitted 26 September, 2022; originally announced September 2022.

Comments: 29 pages, peer reviewed version, will appear in Scandinavian Journal of Statistics

arXiv:2208.09103 [pdf]

doi 10.1016/j.aap.2022.106814

Intersection Two-Vehicle Crash Scenario Specification for Automated Vehicle Safety Evaluation Using Sequence Analysis and Bayesian Networks

Authors: Yu Song, Madhav V. Chitturi, David A. Noyce

Abstract: This paper develops a test scenario specification procedure using crash sequence analysis and Bayesian network modeling. Intersection two-vehicle crash data was obtained from the 2016 to 2018 National Highway Traffic Safety Administration Crash Report Sampling System database. Vehicles involved in the crashes are specifically renumbered based on their initial positions and trajectories. Crash sequences are encoded to include detailed pre-crash events and concise collision events. Based on sequence patterns, the crashes are characterized as 55 types. A Bayesian network model is developed to depict the interrelationships among crash sequence types, crash outcomes, human factors, and environmental conditions. Scenarios are specified by querying the Bayesian network conditional probability tables. Distributions of operational design domain attributes - such as driver behavior, weather, lighting condition, intersection geometry, traffic control device - are specified based on conditions of sequence types. Also, distribution of sequence types is specified on specific crash outcomes or combinations of operational design domain attributes.

Submitted 18 August, 2022; originally announced August 2022.

arXiv:2207.12804 [pdf, other]

Large-Scale Low-Rank Gaussian Process Prediction with Support Points

Authors: Yan Song, Wenlin Dai, Marc G. Genton

Abstract: Low-rank approximation is a popular strategy to tackle the "big n problem" associated with large-scale Gaussian process regressions. Basis functions for developing low-rank structures are crucial and should be carefully specified. Predictive processes simplify the problem by inducing basis functions with a covariance function and a set of knots. The existing literature suggests certain practical implementations of knot selection and covariance estimation; however, theoretical foundations explaining the influence of these two factors on predictive processes are lacking. In this paper, the asymptotic prediction performance of the predictive process and Gaussian process predictions is derived and the impacts of the selected knots and estimated covariance are studied. We suggest the use of support points as knots, which best represent data locations. Extensive simulation studies demonstrate the superiority of support points and verify our theoretical results. Real data of precipitation and ozone are used as examples, and the efficiency of our method over other widely used low-rank approximation methods is verified.

Submitted 26 July, 2022; originally announced July 2022.

arXiv:2207.07890 [pdf, other]

doi 10.1002/sim.9840

Covariate Adjustment in Randomized Clinical Trials with Missing Covariate and Outcome Data

Authors: Chia-Rui Chang, Yue Song, Fan Li, Rui Wang

Abstract: When analyzing data from randomized clinical trials, covariate adjustment can be used to account for chance imbalance in baseline covariates and to increase precision of the treatment effect estimate. A practical barrier to covariate adjustment is the presence of missing data. In this paper, in the light of recent theoretical advancement, we first review several covariate adjustment methods with incomplete covariate data. We investigate the implications of the missing data mechanism on estimating the average treatment effect in randomized clinical trials with continuous or binary outcomes. In parallel, we consider settings where the outcome data are fully observed or are missing at random; in the latter setting, we propose a full weighting approach that combines inverse probability weighting for adjusting missing outcomes and overlap weighting for covariate adjustment. We highlight the importance of including the interaction terms between the missingness indicators and covariates as predictors in the models. We conduct comprehensive simulation studies to examine the finite-sample performance of the proposed methods and compare with a range of common alternatives. We find that conducting the proposed adjustment methods generally improves the precision of treatment effect estimates regardless of the imputation methods when the adjusted covariate is associated with the outcome. We apply the methods to the Childhood Adenotonsillectomy Trial to assess the effect of adenotonsillectomy on neurocognitive functioning scores.

Submitted 16 May, 2023; v1 submitted 16 July, 2022; originally announced July 2022.

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline.
Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2202.11735 [pdf, other]

Truncated LinUCB for Stochastic Linear Bandits

Authors: Yanglei Song, Meng Zhou

Abstract: This paper considers contextual bandits with a finite number of arms, where the contexts are independent and identically distributed $d$-dimensional random vectors, and the expected rewards are linear in both the arm parameters and contexts. The LinUCB algorithm, which is near minimax optimal for related linear bandits, is shown to have a cumulative regret that is suboptimal in both the dimension $d$ and time horizon $T$, due to its over-exploration. A truncated version of LinUCB is proposed and termed "Tr-LinUCB", which follows LinUCB up to a truncation time $S$ and performs pure exploitation afterwards. The Tr-LinUCB algorithm is shown to achieve $O(d\log(T))$ regret if $S = Cd\log(T)$ for a sufficiently large constant $C$, and a matching lower bound is established, which shows the rate optimality of Tr-LinUCB in both $d$ and $T$ under a low dimensional regime. Further, if $S = d\log^{\kappa}(T)$ for some $\kappa>1$, the loss compared to the optimal is a multiplicative $\log\log(T)$ factor, which does not depend on $d$. This insensitivity to overshooting in choosing the truncation time of Tr-LinUCB is of practical importance.

Submitted 17 November, 2022; v1 submitted 23 February, 2022; originally announced February 2022.

Comments: A typo corrected: in Lemma 34(ii), it should be \|x\| instead of \|x\|^2. Thus, in the proof of Lemma 3, exp(-r^2) should be exp(-r), which, however, does not affect other parts

arXiv:2112.10992 [pdf, other]

Expansion-Squeeze-Excitation Fusion Network for Elderly Activity Recognition

Authors: Xiangbo Shu, Jiawen Yang, Rui Yan, Yan Song

Abstract: This work focuses on the task of elderly activity recognition, which is a challenging task due to the existence of individual actions and human-object interactions in elderly activities. Thus, we attempt to effectively aggregate the discriminative information of actions and interactions from both RGB videos and skeleton sequences by attentively fusing multi-modal features. Recently, some nonlinear multi-modal fusion approaches have been proposed that utilize a nonlinear attention mechanism extended from Squeeze-and-Excitation Networks (SENet). Inspired by this, we propose a novel Expansion-Squeeze-Excitation Fusion Network (ESE-FN) to effectively address the problem of elderly activity recognition, which learns modal and channel-wise Expansion-Squeeze-Excitation (ESE) attentions for attentively fusing the multi-modal features in the modal and channel-wise ways. Furthermore, we design a new Multi-modal Loss (ML) to keep the consistency between the single-modal features and the fused multi-modal features by adding the penalty of difference between the minimum prediction losses on single modalities and the prediction loss on the fused modality. Finally, we conduct experiments on the largest-scale elderly activity dataset, i.e., ETRI-Activity3D (including 110,000+ videos, and 50+ categories), to demonstrate that the proposed ESE-FN achieves the best accuracy compared with the state-of-the-art methods.
In addition, more extensive experimental results show that the proposed ESE-FN is also comparable to the other methods on the normal action recognition task.

Submitted 24 April, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

arXiv:2111.11010 [pdf, other]

Density Ratio Estimation via Infinitesimal Classification

Authors: Kristy Choi, Chenlin Meng, Yang Song, Stefano Ermon

Abstract: Density ratio estimation (DRE) is a fundamental machine learning technique for comparing two probability distributions. However, existing methods struggle in high-dimensional settings, as it is difficult to accurately compare probability distributions based on finite samples. In this work we propose DRE-$\infty$, a divide-and-conquer approach to reduce DRE to a series of easier subproblems. Inspired by Monte Carlo methods, we smoothly interpolate between the two distributions via an infinite continuum of intermediate bridge distributions. We then estimate the instantaneous rate of change of the bridge distributions indexed by time (the "time score") -- a quantity defined analogously to data (Stein) scores -- with a novel time score matching objective. Crucially, the learned time scores can then be integrated to compute the desired density ratio. In addition, we show that traditional (Stein) scores can be used to obtain integration paths that connect regions of high density in both distributions, improving performance in practice. Empirically, we demonstrate that our approach performs well on downstream tasks such as mutual information estimation and energy-based modeling on complex, high-dimensional datasets.

Submitted 12 March, 2022; v1 submitted 22 November, 2021; originally announced November 2021.

Comments: First two authors contributed equally

arXiv:2111.08005 [pdf, other]

Solving Inverse Problems in Medical Imaging with Score-Based Generative Models

Authors: Yang Song, Liyue Shen, Lei Xing, Stefano Ermon

Abstract: Reconstructing medical images from partial measurements is an important inverse problem in Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). Existing solutions based on machine learning typically train a model to directly map measurements to medical images, leveraging a training dataset of paired images and measurements. These measurements are typically synthesized from images using a fixed physical model of the measurement process, which hinders the generalization capability of models to unknown measurement processes. To address this issue, we propose a fully unsupervised technique for inverse problem solving, leveraging the recently introduced score-based generative models. Specifically, we first train a score-based generative model on medical images to capture their prior distribution. Given measurements and a physical model of the measurement process at test time, we introduce a sampling method to reconstruct an image consistent with both the prior and the observed measurements. Our method does not assume a fixed measurement process during training, and can thus be flexibly adapted to different measurement processes at test time. Empirically, we observe comparable or better performance to supervised learning techniques in several medical imaging tasks in CT and MRI, while demonstrating significantly better generalization to unknown measurement processes.

Submitted 15 June, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: Published at ICLR 2022

arXiv:2111.07067 [pdf, other]

Interquantile Shrinkage in Spatial Quantile Autoregressive Regression models

Authors: ** Dong, Jiawei Hou, Yunquan Song

Abstract: Spatial dependent data frequently occur in many fields such as spatial econometrics and epidemiology. To deal with the dependence of variables and estimate quantile-specific effects by covariates, spatial quantile autoregressive models (SQAR models) are introduced. Conventional quantile regression only focuses on the fitting models but ignores the examination of multiple conditional quantile functions, which provides a comprehensive view of the relationship between the response and covariates. Thus, it is necessary to study the different regression slopes at different quantiles, especially in situations where the quantile coefficients share some common feature. However, traditional Wald multiple tests not only increase the burden of computation but also bring greater FDR. In this paper, we transform the estimation and examination problem into a penalization problem, which estimates the parameters at different quantiles and identifies the interquantile commonality at the same time. To avoid the endogeneity caused by the spatial lag variables in SQAR models, we also introduce instrumental variables before estimation and propose two-stage estimation methods based on fused adaptive LASSO and fused adaptive sup-norm penalty approaches. The oracle properties of the proposed estimation methods are established. Through numerical investigations, it is demonstrated that the proposed methods lead to higher estimation efficiency than the traditional quantile regression.

Submitted 13 November, 2021; originally announced November 2021.

arXiv:2111.04726 [pdf, other]

Estimating High Order Gradients of the Data Distribution by Denoising

Authors: Chenlin Meng, Yang Song, Wenzhe Li, Stefano Ermon

Abstract: The first order derivative of a data density can be estimated efficiently by denoising score matching, and has become an important component in many applications, such as image generation and audio synthesis. Higher order derivatives provide additional local information about the data distribution and enable new applications. Although they can be estimated via automatic differentiation of a learned density model, this can amplify estimation errors and is expensive in high dimensional settings. To overcome these limitations, we propose a method to directly estimate high order derivatives (scores) of a data density from samples. We first show that denoising score matching can be interpreted as a particular case of Tweedie's formula. By leveraging Tweedie's formula on higher order moments, we generalize denoising score matching to estimate higher order derivatives. We demonstrate empirically that models trained with the proposed method can approximate second order derivatives more efficiently and accurately than via automatic differentiation. We show that our models can be used to quantify uncertainty in denoising and to improve the mixing speed of Langevin dynamics via Ozaki discretization for sampling synthetic data and natural images.

Submitted 8 November, 2021; originally announced November 2021.

Comments: NeurIPS 2021
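
For readers scanning this entry, the link between denoising score matching and Tweedie's formula mentioned in the abstract can be written out in a few lines. This is a sketch of the standard Gaussian-noise identities, not the paper's own derivation; the symbols $\sigma$, $\tilde{x}$, $p_\sigma$, and $s_\theta$ are notation assumed here for illustration.

```latex
% Assumed setup: clean sample x, noisy sample \tilde{x}, noise level \sigma.
\[
  \tilde{x} = x + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
  \quad\Longrightarrow\quad
  \mathbb{E}[x \mid \tilde{x}] = \tilde{x} + \sigma^{2}\, \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}).
\]
% Denoising score matching regresses s_\theta onto the rescaled residual;
% its minimizer is the conditional mean of that residual, i.e. the score.
\[
  \mathcal{L}_{\mathrm{DSM}}(\theta)
  = \mathbb{E}_{x,\tilde{x}} \left\| s_\theta(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^{2}} \right\|_2^{2},
  \qquad
  s^{\ast}(\tilde{x}) = \frac{\mathbb{E}[x \mid \tilde{x}] - \tilde{x}}{\sigma^{2}}
  = \nabla_{\tilde{x}} \log p_\sigma(\tilde{x}).
\]
% Second-order analogue: the posterior covariance exposes the Hessian of \log p_\sigma.
\[
  \operatorname{Cov}[x \mid \tilde{x}]
  = \sigma^{2} I + \sigma^{4}\, \nabla_{\tilde{x}}^{2} \log p_\sigma(\tilde{x}).
\]
```

The last identity is the sense in which higher order moments of the denoising posterior carry higher order derivatives of the log-density.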

arXiv:2110.00473 [pdf, other]

Score-Based Generative Classifiers

Authors: Roland S. Zimmermann, Lukas Schott, Yang Song, Benjamin A. Dunn, David A. Klindt

Abstract: The tremendous success of generative models in recent years raises the question whether they can also be used to perform classification. Generative models have been used as adversarially robust classifiers on simple datasets such as MNIST, but this robustness has not been observed on more complex datasets like CIFAR-10. Additionally, on natural image datasets, previous results have suggested a trade-off between the likelihood of the data and classification accuracy. In this work, we investigate score-based generative models as classifiers for natural images. We show that these models not only obtain competitive likelihood values but simultaneously achieve state-of-the-art classification accuracy for generative classifiers on CIFAR-10. Nevertheless, we find that these models are only slightly, if at all, more robust than discriminative baseline models on out-of-distribution tasks based on common image corruptions. Similarly and contrary to prior results, we find that score-based models are prone to worst-case distribution shifts in the form of adversarial perturbations. Our work highlights that score-based generative models are closing the gap in classification accuracy compared to standard discriminative models. While they do not yet deliver on the promise of adversarial and out-of-domain robustness, they provide a different approach to classification that warrants further research.

Submitted 11 December, 2021; v1 submitted 1 October, 2021; originally announced October 2021.

Comments: published at https://dgms-and-applications.github.io/2021/ ; project website: https://zimmerrol.github.io/SBGC/

arXiv:2109.15261 [pdf, other]

A simple and flexible test of sample exchangeability with applications to statistical genomics

Authors: Alan J. Aw, Jeffrey P. Spence, Yun S. Song

Abstract: In scientific studies involving analyses of multivariate data, basic but important questions often arise for the researcher: Is the sample exchangeable, meaning that the joint distribution of the sample is invariant to the ordering of the units? Are the features independent of one another, or perhaps the features can be grouped so that the groups are mutually independent? In statistical genomics, these considerations are fundamental to downstream tasks such as demographic inference and the construction of polygenic risk scores. We propose a non-parametric approach, which we call the V test, to address these two questions, namely, a test of sample exchangeability given dependency structure of features, and a test of feature independence given sample exchangeability. Our test is conceptually simple, yet fast and flexible. It controls the Type I error across realistic scenarios, and handles data of arbitrary dimensions by leveraging large-sample asymptotics. Through extensive simulations and a comparison against unsupervised tests of stratification based on random matrix theory, we find that our test compares favorably in various scenarios of interest. We apply the test to data from the 1000 Genomes Project, demonstrating how it can be employed to assess exchangeability of the genetic sample, or find optimal linkage disequilibrium (LD) splits for downstream analysis. For exchangeability assessment, we find that removing rare variants can substantially increase the p-value of the test statistic. For optimal LD splitting, the V test reports different optimal splits than previous approaches not relying on hypothesis testing. Software for our methods is available in R (CRAN: flintyR) and Python (PyPI: flintyPy).

Submitted 30 August, 2023; v1 submitted 30 September, 2021; originally announced September 2021.

Comments: 24 pages. Supplementary Information file (38 pages, contains mathematical proofs) is available at https://github.com/songlab-cal/flinty/

MSC Class: 62G10; 62H15; 62P10

ACM Class: G.3

arXiv:2107.03502 [pdf, other]

CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation

Authors: Yusuke Tashiro, Jiaming Song, Yang Song, Stefano Ermon

Abstract: The imputation of missing values in time series has many applications in healthcare and finance. While autoregressive models are natural candidates for time series imputation, score-based diffusion models have recently outperformed existing counterparts including autoregressive models in many tasks such as image generation and audio synthesis, and would be promising for time series imputation. In this paper, we propose Conditional Score-based Diffusion models for Imputation (CSDI), a novel time series imputation method that utilizes score-based diffusion models conditioned on observed data. Unlike existing score-based approaches, the conditional diffusion model is explicitly trained for imputation and can exploit correlations between observed values. On healthcare and environmental data, CSDI improves by 40-65% over existing probabilistic imputation methods on popular performance metrics. In addition, deterministic imputation by CSDI reduces the error by 5-20% compared to the state-of-the-art deterministic imputation methods. Furthermore, CSDI can also be applied to time series interpolation and probabilistic forecasting, and is competitive with existing baselines. The code is available at https://github.com/ermongroup/CSDI.

Submitted 27 October, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

Comments: NeurIPS 2021

arXiv:2106.13097 [pdf, other]

Understanding the Spread of COVID-19 Epidemic: A Spatio-Temporal Point Process View

Authors: Shuang Li, Lu Wang, Xinyun Chen, Yixiang Fang, Yan Song

Abstract: Since the first coronavirus case was identified in the U.S. on Jan. 21, more than 1 million people in the U.S. have confirmed cases of COVID-19. This infectious respiratory disease has spread rapidly across more than 3000 counties and 50 states in the U.S. and has exhibited evolutionary clustering and complex triggering patterns. It is essential to understand the complex spacetime intertwined propagation of this disease so that accurate prediction or smart external intervention can be carried out. In this paper, we model the propagation of COVID-19 as spatio-temporal point processes and propose a generative and intensity-free model to track the spread of the disease. We further adopt a generative adversarial imitation learning framework to learn the model parameters. In comparison with the traditional likelihood-based learning methods, this imitation learning framework does not need to prespecify an intensity function, which alleviates model misspecification. Moreover, the adversarial learning procedure bypasses the difficult-to-evaluate integral involved in the likelihood evaluation, which makes the model inference more scalable with the data and variables. We showcase the dynamic learning performance on the COVID-19 confirmed cases in the U.S. and evaluate the social distancing policy based on the learned generative model.

Submitted 24 June, 2021; originally announced June 2021.

arXiv:2105.10590 [pdf, other]

Parallelizing Contextual Bandits

Authors: Jeffrey Chan, Aldo Pacchiano, Nilesh Tripuraneni, Yun S. Song, Peter Bartlett, Michael I. Jordan

Abstract: Standard approaches to decision-making under uncertainty focus on sequential exploration of the space of decisions. However, simultaneously proposing a batch of decisions, which leverages available resources for parallel experimentation, has the potential to rapidly accelerate exploration. We present a family of (parallel) contextual bandit algorithms applicable to problems with bounded eluder dimension whose regret is nearly identical to their perfectly sequential counterparts -- given access to the same total number of oracle queries -- up to a lower-order "burn-in" term. We further show these algorithms can be specialized to the class of linear reward functions where we introduce and analyze several new linear bandit algorithms which explicitly introduce diversity into their action selection. Finally, we also present an empirical evaluation of these parallel algorithms in several domains, including materials discovery and biological sequence design problems, to demonstrate the utility of parallelized bandits in practical settings.

Submitted 5 February, 2023; v1 submitted 21 May, 2021; originally announced May 2021.

arXiv:2104.10029 [pdf, other]

Multiple Sclerosis Lesion Analysis in Brain Magnetic Resonance Images: Techniques and Clinical Applications

Authors: Yang Ma, Chaoyi Zhang, Mariano Cabezas, Yang Song, Zihao Tang, Dongnan Liu, Weidong Cai, Michael Barnett, Chenyu Wang

Abstract: Multiple sclerosis (MS) is a chronic inflammatory and degenerative disease of the central nervous system, characterized by the appearance of focal lesions in the white and gray matter that topographically correlate with an individual patient's neurological symptoms and signs. Magnetic resonance imaging (MRI) provides detailed in-vivo structural information, permitting the quantification and categorization of MS lesions that critically inform disease management. Traditionally, MS lesions have been manually annotated on 2D MRI slices, a process that is inefficient and prone to inter-/intra-observer errors. Recently, automated statistical imaging analysis techniques have been proposed to detect and segment MS lesions based on MRI voxel intensity. However, their effectiveness is limited by the heterogeneity of both MRI data acquisition techniques and the appearance of MS lesions. By learning complex lesion representations directly from images, deep learning techniques have achieved remarkable breakthroughs in the MS lesion segmentation task. Here, we provide a comprehensive review of state-of-the-art automatic statistical and deep-learning MS segmentation methods and discuss current and future clinical applications. Further, we review technical strategies, such as domain adaptation, to enhance MS lesion segmentation in real-world clinical settings.

Submitted 27 January, 2022; v1 submitted 20 April, 2021; originally announced April 2021.

Comments: Accepted to appear in IEEE Journal of Biomedical And Health Informatics

arXiv:2102.06286 [pdf]

doi 10.1016/j.aap.2021.106017

Automated Vehicle Crash Sequences: Patterns and Potential Uses in Safety Testing

Authors: Yu Song, Madhav V. Chitturi, David A. Noyce

Abstract: With safety being one of the primary motivations for developing automated vehicles (AVs), extensive field and simulation tests are being carried out to ensure AVs can operate safely on roadways. Since 2014, the California DMV has been collecting AV collision and disengagement reports, which are valuable data sources for studying AV crash patterns. In this study, crash sequence data extracted from California AV collision reports were used to investigate patterns and how they may be used to develop AV test scenarios. Employing sequence analysis, this study evaluated 168 AV crashes (with AV in automatic driving mode before disengagement or collision) from 2015 to 2019. Analysis of subsequences showed that the most representative pattern in AV crashes was the (collision following AV stop) type. Analysis of event transition showed that disengagement, as an event in 24 percent of all studied AV crash sequences, had a transition probability of 68 percent to an immediate collision. Cluster analysis characterized AV crash sequences into seven groups with distinctive crash dynamic features. Cross-tabulation analysis showed that sequence groups were significantly associated with variables measuring crash outcomes and describing environmental conditions. Crash sequences are useful for developing AV test scenarios. Based on the findings, a scenario-based AV safety testing framework was proposed with sequence of events embedded as a core component.

Submitted 11 February, 2021; originally announced February 2021.

Journal ref: Accident Analysis & Prevention, 153, p.106017 (2021)

arXiv:2102.05291 [pdf, other]

Clusterability as an Alternative to Anchor Points When Learning with Noisy Labels

Authors: Zhaowei Zhu, Yiwen Song, Yang Liu

Abstract: The label noise transition matrix, characterizing the probabilities of a training instance being wrongly annotated, is crucial to designing popular solutions to learning with noisy labels. Existing works heavily rely on finding "anchor points" or their approximates, defined as instances belonging to a particular class almost surely. Nonetheless, finding anchor points remains a non-trivial task, and the estimation accuracy is also often throttled by the number of available anchor points. In this paper, we propose an alternative option to the above task. Our main contribution is the discovery of an efficient estimation procedure based on a clusterability condition. We prove that with clusterable representations of features, using up to third-order consensuses of noisy labels among neighbor representations is sufficient to estimate a unique transition matrix. Compared with methods using anchor points, our approach uses substantially more instances and benefits from a much better sample complexity. We demonstrate the estimation accuracy and advantages of our estimates using both synthetic noisy labels (on CIFAR-10/100) and real human-level noisy labels (on Clothing1M and our self-collected human-annotated CIFAR-10). Our code and human-level noisy CIFAR-10 labels are available at https://github.com/UCSC-REAL/HOC.

Submitted 13 July, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

Comments: ICML 2021

arXiv:2102.03450 [pdf, other]

Wasserstein Graph Neural Networks for Graphs with Missing Attributes

Authors: Zhixian Chen, Tengfei Ma, Yangqiu Song, Yang Wang

Abstract: Missing node attributes is a common problem in real-world graphs. Graph neural networks have demonstrated power in graph representation learning, while their performance is affected by the completeness of graph information. Most of them are not specified for missing-attribute graphs and fail to leverage incomplete attribute information effectively. In this paper, we propose an innovative node representation learning framework, Wasserstein Graph Neural Network (WGNN), to mitigate the problem. To make the most of limited observed attribute information and capture the uncertainty caused by missing values, we express nodes as low-dimensional distributions derived from the decomposition of the attribute matrix. Furthermore, we strengthen the expressiveness of representations by developing a novel message passing schema that aggregates distributional information from neighbors in the Wasserstein space. We test WGNN in node classification tasks under two missing-attribute cases on both synthetic and real-world datasets. In addition, we find WGNN suitable to recover missing values and adapt them to tackle matrix completion problems with graphs of users and items. Experimental results on both tasks demonstrate the superiority of our method.

Submitted 16 February, 2022; v1 submitted 5 February, 2021; originally announced February 2021.

arXiv:2101.09258 [pdf, other]

Maximum Likelihood Training of Score-Based Diffusion Models

Authors: Yang Song, Conor Durkan, Iain Murray, Stefano Ermon

Abstract: Score-based diffusion models synthesize samples by reversing a stochastic process that diffuses data to noise, and are trained by minimizing a weighted combination of score matching losses. The log-likelihood of score-based diffusion models can be tractably computed through a connection to continuous normalizing flows, but log-likelihood is not directly optimized by the weighted combination of score matching losses. We show that for a specific weighting scheme, the objective upper bounds the negative log-likelihood, thus enabling approximate maximum likelihood training of score-based diffusion models. We empirically observe that maximum likelihood training consistently improves the likelihood of score-based diffusion models across multiple datasets, stochastic processes, and model architectures. Our best models achieve negative log-likelihoods of 2.83 and 3.76 bits/dim on CIFAR-10 and ImageNet 32x32 without any data augmentation, on a par with state-of-the-art autoregressive models on these tasks.

Submitted 20 October, 2021; v1 submitted 22 January, 2021; originally announced January 2021.

Comments: NeurIPS 2021 (Spotlight)

arXiv:2101.03288 [pdf, other]

How to Train Your Energy-Based Models

Authors: Yang Song, Diederik P. Kingma

Abstract: Energy-Based Models (EBMs), also known as non-normalized probabilistic models, specify probability density or mass functions up to an unknown normalizing constant. Unlike most other probabilistic models, EBMs do not place a restriction on the tractability of the normalizing constant, thus are more flexible to parameterize and can model a more expressive family of probability distributions. However, the unknown normalizing constant of EBMs makes training particularly difficult. Our goal is to provide a friendly introduction to modern approaches for EBM training. We start by explaining maximum likelihood training with Markov chain Monte Carlo (MCMC), and proceed to elaborate on MCMC-free approaches, including Score Matching (SM) and Noise Contrastive Estimation (NCE). We highlight theoretical connections among these three approaches, and end with a brief survey on alternative training methods, which are still under active research. Our tutorial is targeted at an audience with basic understanding of generative models who want to apply EBMs or start a research project in this direction.

Submitted 17 February, 2021; v1 submitted 8 January, 2021; originally announced January 2021.

arXiv:2101.03098 [pdf, other]

Optimization Models for Integrated Biorefinery Operations

Authors: Berkay Gulcan, Sandra D. Eksioglu, Yongjia Song, Mohammad Roni, Qiushi Chen

Abstract: Variations of physical and chemical characteristics of biomass lead to an uneven flow of biomass in a biorefinery, which reduces equipment utilization and increases operational costs. Uncertainty of biomass supply and high processing costs increase the risk of investing in the US's cellulosic biofuel industry. We propose a stochastic programming model to streamline processes within a biorefinery. A chance constraint models the system's reliability requirement that the reactor is operating at a high utilization rate given uncertain biomass moisture content, particle size distribution, and equipment failure. The model identifies operating conditions of equipment and inventory level to maintain a continuous flow of biomass to the reactor. The Sample Average Approximation method approximates the chance constraint and a bisection search-based heuristic solves this approximation. A case study is developed using real-life data collected at Idaho National Laboratory's pilot biomass processing facility. An extensive computational analysis indicates that sequencing of biomass bales based on moisture level, increasing storage capacity, and managing particle size distribution increase utilization of the reactor and reduce operational costs.

Submitted 8 January, 2021; originally announced January 2021.

arXiv:2012.08125 [pdf, other]

Learning Energy-Based Models by Diffusion Recovery Likelihood

Authors: Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, Diederik P. Kingma

Abstract: While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained with recovery likelihood, which maximizes the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. Optimizing recovery likelihood is more tractable than marginal likelihood, as sampling from the conditional distributions is much easier than sampling from the marginal distributions. After training, synthesized images can be generated by the sampling process that initializes from Gaussian white noise distribution and progressively samples the conditional distributions at decreasingly lower noise levels. Our method generates high fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.58 and inception score 8.30, superior to the majority of GANs. Moreover, we demonstrate that unlike previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets. Our implementation is available at https://github.com/ruiqigao/recovery_likelihood.

Submitted 27 March, 2021; v1 submitted 15 December, 2020; originally announced December 2020.

arXiv:2012.03761 [pdf, ps, other]

Adaptive Sequential SAA for Solving Two-stage Stochastic Linear Programs

Authors: Raghu Pasupathy, Yongjia Song

Abstract: We present adaptive sequential SAA (sample average approximation) algorithms to solve large-scale two-stage stochastic linear programs. The iterative algorithm framework we propose is organized into outer and inner iterations as follows: during each outer iteration, a sample-path problem is implicitly generated using a sample of observations or "scenarios," and solved only imprecisely, to within a tolerance that is chosen adaptively, by balancing the estimated statistical error against solution error. The solutions from prior iterations serve as warm starts to aid efficient solution of the (piecewise linear convex) sample-path optimization problems generated on subsequent iterations. The generated scenarios can be independent and identically distributed (iid), or dependent, as in Monte Carlo generation using Latin-hypercube sampling, antithetic variates, or randomized quasi-Monte Carlo. We first characterize the almost-sure convergence (and convergence in mean) of the optimality gap and the distance of the generated stochastic iterates to the true solution set. We then characterize the corresponding iteration complexity and work complexity rates as a function of the sample size schedule, demonstrating that the best achievable work complexity rate is Monte Carlo canonical and analogous to the generic $\mathcal{O}(\epsilon^{-2})$ optimal complexity for non-smooth convex optimization. We report extensive numerical tests that indicate favorable performance, due primarily to the use of a sequential framework with an optimal sample size schedule, and the use of warm starts. The proposed algorithm can be stopped in finite-time to return a solution endowed with a probabilistic guarantee on quality.

Submitted 7 December, 2020; originally announced December 2020.

arXiv:2011.13456 [pdf, other]

Score-Based Generative Modeling through Stochastic Differential Equations

Authors: Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole

Abstract: Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in score-based generative modeling and diffusion probabilistic modeling, allowing for new sampling procedures and new modeling capabilities. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, but additionally enables exact likelihood computation, and improved sampling efficiency. In addition, we provide a new way to solve inverse problems with score-based models, as demonstrated with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity generation of 1024 x 1024 images for the first time from a score-based generative model.

Submitted 10 February, 2021; v1 submitted 26 November, 2020; originally announced November 2020.

Comments: ICLR 2021 (Oral)
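
As a rough illustration of the reverse-time SDE idea described in this abstract, the sketch below integrates a one-dimensional reverse-time SDE with a plain Euler-Maruyama step. It is a toy under stated assumptions, not the authors' implementation: the data distribution is taken to be a Gaussian so the score is available in closed form (a trained network would replace `score` in practice), the noise schedule is a geometric variance-exploding one, and every name and constant (`DATA_MEAN`, `SIGMA_MAX`, `reverse_sde_sample`, and so on) is hypothetical.

```python
import math
import random

# Toy setup: all values are illustrative assumptions, not taken from the paper.
DATA_MEAN, DATA_STD = 2.0, 0.5      # data distribution N(2, 0.5^2)
SIGMA_MIN, SIGMA_MAX = 0.01, 10.0   # noise scale at t=0 and t=1


def sigma(t):
    """Noise scale of a variance-exploding forward SDE, geometric in t."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def diffusion_sq(t):
    """g(t)^2 = d[sigma(t)^2]/dt for the forward SDE dx = g(t) dw."""
    return 2.0 * sigma(t) ** 2 * math.log(SIGMA_MAX / SIGMA_MIN)


def score(x, t):
    """Exact score of the perturbed density N(DATA_MEAN, DATA_STD^2 + sigma(t)^2).

    Closed form only because the toy data are Gaussian; a learned score
    model would stand in for this function in a real setting.
    """
    return -(x - DATA_MEAN) / (DATA_STD ** 2 + sigma(t) ** 2)


def reverse_sde_sample(n_samples=1000, n_steps=500, seed=0):
    """Draw samples by Euler-Maruyama integration of the reverse-time SDE."""
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, SIGMA_MAX)   # prior sample, approximately the t=1 marginal
        for k in range(n_steps):
            t = 1.0 - k * dt            # walk from t=1 down toward t=0
            g2 = diffusion_sq(t)
            # Reverse-time step: drift g^2 * score, plus fresh Gaussian noise.
            x += g2 * score(x, t) * dt + math.sqrt(g2 * dt) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


if __name__ == "__main__":
    xs = reverse_sde_sample()
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    print(f"sample mean {mean:.3f}, sample std {std:.3f}")
```

With these settings the sample mean and standard deviation should land close to the assumed data values of 2.0 and 0.5, up to Monte Carlo and discretization error. The predictor-corrector and ODE samplers mentioned in the abstract are not shown here.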

arXiv:2010.12810 [pdf, other]

Autoregressive Score Matching

Authors: Chenlin Meng, Lantao Yu, Yang Song, Jiaming Song, Stefano Ermon

Abstract: Autoregressive models use the chain rule to define a joint probability distribution as a product of conditionals. These conditionals need to be normalized, imposing constraints on the functional families that can be used. To increase flexibility, we propose autoregressive conditional score models (AR-CSM) where we parameterize the joint distribution in terms of the derivatives of univariate log-conditionals (scores), which need not be normalized. To train AR-CSM, we introduce a new divergence between distributions named Composite Score Matching (CSM). For AR-CSM models, this divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training. Compared to previous score matching algorithms, our method is more scalable to high dimensional data and more stable to optimize. We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.

Submitted 24 October, 2020; originally announced October 2020.

Comments: NeurIPS 2020

arXiv:2010.09808 [pdf, other]

Imitation with Neural Density Models

Authors: Kuno Kim, Akshat Jindal, Yang Song, Jiaming Song, Yanan Sui, Stefano Ermon

Abstract: We propose a new framework for Imitation Learning (IL) via density estimation of the expert's occupancy measure followed by Maximum Occupancy Entropy Reinforcement Learning (RL) using the density as a reward. Our approach maximizes a non-adversarial model-free RL objective that provably lower bounds reverse Kullback-Leibler divergence between occupancy measures of the expert and imitator. We present a practical IL algorithm, Neural Density Imitation (NDI), which obtains state-of-the-art demonstration efficiency on benchmark control tasks.

Submitted 19 October, 2020; originally announced October 2020.

arXiv:2009.11409 [pdf, other]

Bayesian Hierarchical Models for High-Dimensional Mediation Analysis with Coordinated Selection of Correlated Mediators

Authors: Yanyi Song, Xiang Zhou, Jian Kang, Max T. Aung, Min Zhang, Wei Zhao, Belinda L. Needham, Sharon L. R. Kardia, Yongmei Liu, John D. Meeker, Jennifer A. Smith, Bhramar Mukherjee

Abstract: We consider Bayesian high-dimensional mediation analysis to identify among a large set of correlated potential mediators the active ones that mediate the effect from an exposure variable to an outcome of interest. Correlations among mediators are commonly observed in modern data analysis; examples include the activated voxels within connected regions in brain image data, regulatory signals driven by gene networks in genome data and correlated exposure data from the same source. When correlations are present among active mediators, mediation analysis that fails to account for such correlation can be sub-optimal and may lead to a loss of power in identifying active mediators. Building upon a recent high-dimensional mediation analysis framework, we propose two Bayesian hierarchical models, one with a Gaussian mixture prior that enables correlated mediator selection and the other with a Potts mixture prior that accounts for the correlation among active mediators in mediation analysis. We develop efficient sampling algorithms for both methods. Various simulations demonstrate that our methods enable effective identification of correlated active mediators, which could be missed by using existing methods that assume prior independence among active mediators. The proposed methods are applied to the LIFECODES birth cohort and the Multi-Ethnic Study of Atherosclerosis (MESA) and identified new active mediators with important biological implications.

Submitted 23 September, 2020; originally announced September 2020.

Showing 1–50 of 138 results for author: Song, Y