Search | arXiv e-print repository

On Discrete Prompt Optimization for Diffusion Models

Authors: Ruochen Wang, Ting Liu, Cho-Jui Hsieh, Boqing Gong

Abstract: This paper introduces the first gradient-based framework for prompt optimization in text-to-image diffusion models. We formulate prompt engineering as a discrete optimization problem over the language space. Two major challenges arise in efficiently finding a solution to this problem: (1) Enormous Domain Space: Setting the domain to the entire language space poses significant difficulty to the optimization process. (2) Text Gradient: Efficiently computing the text gradient is challenging, as it requires backpropagating through the inference steps of the diffusion model and a non-differentiable embedding lookup table. Beyond the problem formulation, our main technical contributions lie in solving the above challenges. First, we design a family of dynamically generated compact subspaces comprised of only the most relevant words to user input, substantially restricting the domain space. Second, we introduce "Shortcut Text Gradient" -- an effective replacement for the text gradient that can be obtained with constant memory and runtime. Empirical evaluation on prompts collected from diverse sources (DiffusionDB, ChatGPT, COCO) suggests that our method can discover prompts that substantially improve (prompt enhancement) or destroy (adversarial attack) the faithfulness of images generated by the text-to-image diffusion model.

Submitted 26 June, 2024; originally announced July 2024.

Comments: ICML 2024. Code available at https://github.com/ruocwang/dpo-diffusion

MSC Class: 68T01

Journal ref: Proceedings of the 41st International Conference on Machine Learning (ICML 2024)

arXiv:2405.00917 [pdf, other]

Semiparametric mean and variance joint models with clipped-Laplace link functions for bounded integer-valued time series

Authors: Tianqing Liu, Xiaohui Yuan

Abstract: We present a novel approach for modeling bounded count time series data, by deriving accurate upper and lower bounds for the variance of a bounded count random variable while maintaining a fixed mean. Leveraging these bounds, we propose semiparametric mean and variance joint (MVJ) models utilizing a clipped-Laplace link function. These models offer a flexible and feasible structure for both mean and variance, accommodating various scenarios of under-dispersion, equi-dispersion, or over-dispersion in bounded time series. The proposed MVJ models feature a linear mean structure with positive regression coefficients summing to one and allow for negative regression coefficients and autocorrelations. We demonstrate that the autocorrelation structure of MVJ models mirrors that of an autoregressive moving-average (ARMA) process, provided the proposed clipped-Laplace link functions with nonnegative regression coefficients summing to one are utilized. We establish conditions ensuring the stationarity and ergodicity properties of the MVJ process, along with demonstrating the consistency and asymptotic normality of the conditional least squares estimators. To aid model selection and diagnostics, we introduce two model selection criteria and apply two model diagnostics statistics. Finally, we conduct simulations and real data analyses to investigate the finite-sample properties of the proposed MVJ models, providing insights into their efficacy and applicability in practical scenarios.

Submitted 1 May, 2024; originally announced May 2024.

Comments: arXiv admin note: text overlap with arXiv:2404.18421

arXiv:2404.18421 [pdf, other]

Semiparametric mean and variance joint models with Laplace link functions for count time series

Authors: Tianqing Liu, Xiaohui Yuan

Abstract: Count time series data are frequently analyzed by modeling their conditional means, while the conditional variance is often considered to be a deterministic function of the corresponding conditional mean and is not typically modeled independently. We propose a semiparametric mean and variance joint model, called the random rounded count-valued generalized autoregressive conditional heteroskedastic (RRC-GARCH) model, to address this limitation. The RRC-GARCH model and its variations allow for the joint modeling of both the conditional mean and variance and offer a flexible framework for capturing various mean-variance structures (MVSs). One main feature of this model is its ability to accommodate negative values for regression coefficients and autocorrelation functions. The autocorrelation structure of the RRC-GARCH model using the proposed Laplace link functions with nonnegative regression coefficients is the same as that of an autoregressive moving-average (ARMA) process. For the new model, the stationarity and ergodicity are established and the consistency and asymptotic normality of the conditional least squares estimator are proved. Model selection criteria are proposed to evaluate the RRC-GARCH models. The performance of the RRC-GARCH model is assessed through analyses of both simulated and real data sets. The results indicate that the model can effectively capture the MVS of count time series data and generate accurate forecast means and variances.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2403.08635 [pdf, other]

Human Alignment of Large Language Models through Online Preference Optimisation

Authors: Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot

Abstract: Ensuring alignment of language models' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2402.03941 [pdf, other]

Discovery of the Hidden World with Large Language Models

Authors: Chenxi Liu, Yongqiang Chen, Tongliang Liu, Mingming Gong, James Cheng, Bo Han, Kun Zhang

Abstract: Science originates with discovering new causal knowledge from a combination of known facts and observations. Traditional causal discovery approaches mainly rely on high-quality measured variables, usually given by human experts, to find causal relations. However, the causal variables are usually unavailable in a wide range of real-world applications. The rise of large language models (LLMs), which are trained to learn rich knowledge from the massive observations of the world, provides a new opportunity to assist with discovering high-level hidden variables from the raw observational data. Therefore, we introduce COAT: Causal representatiOn AssistanT. COAT incorporates LLMs as a factor proposer that extracts the potential causal factors from unstructured data. Moreover, LLMs can also be instructed to provide additional information used to collect data values (e.g., annotation criteria) and to further parse the raw unstructured data into structured data. The annotated data will be fed to a causal learning module (e.g., the FCI algorithm) that provides both rigorous explanations of the data, as well as useful feedback to further improve the extraction of causal factors by LLMs. We verify the effectiveness of COAT in uncovering the underlying causal system with two case studies of review rating analysis and neuropathic diagnosis.

Submitted 6 February, 2024; originally announced February 2024.

Comments: Preliminary version of an ongoing project; Chenxi and Yongqiang contributed equally; 26 pages, 41 figures; Project page: https://causalcoat.github.io/

arXiv:2310.18910 [pdf, other]

InstanT: Semi-supervised Learning with Instance-dependent Thresholds

Authors: Muyang Li, Runze Wu, Haoyu Liu, Jun Yu, Xun Yang, Bo Han, Tongliang Liu

Abstract: Semi-supervised learning (SSL) has been a fundamental challenge in machine learning for decades. The primary family of SSL algorithms, known as pseudo-labeling, involves assigning pseudo-labels to confident unlabeled instances and incorporating them into the training set. Therefore, the selection criteria of confident instances are crucial to the success of SSL. Recently, there has been growing interest in the development of SSL methods that use dynamic or adaptive thresholds. Yet, these methods typically apply the same threshold to all samples, or use class-dependent thresholds for instances belonging to a certain class, while neglecting instance-level information. In this paper, we propose the study of instance-dependent thresholds, which has the highest degree of freedom compared with existing methods. Specifically, we devise a novel instance-dependent threshold function for all unlabeled instances by utilizing their instance-level ambiguity and the instance-dependent error rates of pseudo-labels, so instances that are more likely to have incorrect pseudo-labels will have higher thresholds. Furthermore, we demonstrate that our instance-dependent threshold function provides a bounded probabilistic guarantee for the correctness of the pseudo-labels it assigns.

Submitted 29 October, 2023; originally announced October 2023.

Comments: Accepted as poster for NeurIPS 2023
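
The pseudo-labeling rule described in the InstanT abstract above (keep a pseudo-label only when the model's confidence clears a per-instance threshold) can be sketched as follows. This is a minimal illustration, not the paper's method: the function name `select_pseudo_labels` is hypothetical, and the thresholds are taken as given inputs because the abstract does not state the threshold function itself.

```python
from typing import List, Optional, Sequence


def select_pseudo_labels(
    probs: Sequence[Sequence[float]],
    thresholds: Sequence[float],
) -> List[Optional[int]]:
    """Assign a pseudo-label to each instance whose top class probability
    reaches that instance's own threshold; otherwise return None.

    `thresholds` is a hypothetical input: one value per instance, assumed
    to be higher for instances more likely to be mislabeled.
    """
    labels: List[Optional[int]] = []
    for p, tau in zip(probs, thresholds):
        p = list(p)
        confidence = max(p)
        # Keep the arg-max class only if it clears the instance threshold.
        labels.append(p.index(confidence) if confidence >= tau else None)
    return labels
```

With a single shared threshold this reduces to ordinary fixed-threshold pseudo-labeling; passing a different value per instance is the only change.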

arXiv:2310.18286 [pdf, other]

Optimal Transport for Treatment Effect Estimation

Authors: Hao Wang, Zhichao Chen, Jiajun Fan, Haoxuan Li, Tianqiao Liu, Weiming Liu, Quanyu Dai, Yichao Wang, Zhenhua Dong, Ruiming Tang

Abstract: Estimating conditional average treatment effect from observational data is highly challenging due to the existence of treatment selection bias. Prevalent methods mitigate this issue by aligning distributions of different treatment groups in the latent space. However, there are two critical problems that these methods fail to address: (1) mini-batch sampling effects (MSE), which cause misalignment in non-ideal mini-batches with outcome imbalance and outliers; (2) unobserved confounder effects (UCE), which result in inaccurate discrepancy calculation due to the neglect of unobserved confounders. To tackle these problems, we propose a principled approach named Entire Space CounterFactual Regression (ESCFR), which is a new take on optimal transport in the context of causality. Specifically, based on the framework of stochastic optimal transport, we propose a relaxed mass-preserving regularizer to address the MSE issue and design a proximal factual outcome regularizer to handle the UCE issue. Extensive experiments demonstrate that our proposed ESCFR can successfully tackle the treatment selection bias and achieve significantly better performance than state-of-the-art methods.

Submitted 27 October, 2023; originally announced October 2023.

Comments: Accepted as NeurIPS 2023 Poster

arXiv:2310.13232 [pdf, other]

Interaction Screening and Pseudolikelihood Approaches for Tensor Learning in Ising Models

Authors: Tianyu Liu, Somabha Mukherjee

Abstract: In this paper, we study two well-known methods of Ising structure learning, namely the pseudolikelihood approach and the interaction screening approach, in the context of tensor recovery in $k$-spin Ising models. We show that both these approaches, with proper regularization, retrieve the underlying hypernetwork structure using a sample size logarithmic in the number of network nodes, and exponential in the maximum interaction strength and maximum node-degree. We also track down the exact dependence of the rate of tensor recovery on the interaction order $k$, which is allowed to grow with the number of samples and nodes, for both approaches. Finally, we provide a comparative discussion of the performance of the two approaches based on simulation studies, which also demonstrate the exponential dependence of the tensor recovery rate on the maximum coupling strength.

Submitted 19 October, 2023; originally announced October 2023.

Comments: 17 pages, 5 figures

arXiv:2310.07999 [pdf, other]

LEMON: Lossless model expansion

Authors: Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Haibin Lin, Ruoyu Sun, Hongxia Yang

Abstract: Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present LosslEss MOdel ExpansioN (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.

Submitted 11 October, 2023; originally announced October 2023.

Comments: Preprint

arXiv:2307.01389 [pdf, other]

Identification of Causal Relationship between Amyloid-beta Accumulation and Alzheimer's Disease Progression via Counterfactual Inference

Authors: Haixing Dai, Mengxuan Hu, Qing Li, Lu Zhang, Lin Zhao, Dajiang Zhu, Ibai Diez, Jorge Sepulcre, Fan Zhang, Xingyu Gao, Manhua Liu, Quanzheng Li, Sheng Li, Tianming Liu, Xiang Li

Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder that begins with amyloidosis, followed by neuronal loss and deterioration in structure, function, and cognition. The accumulation of amyloid-beta in the brain, measured through 18F-florbetapir (AV45) positron emission tomography (PET) imaging, has been widely used for early diagnosis of AD. However, the relationship between amyloid-beta accumulation and AD pathophysiology remains unclear, and causal inference approaches are needed to uncover how amyloid-beta levels can impact AD development. In this paper, we propose a graph varying coefficient neural network (GVCNet) for estimating the individual treatment effect with continuous treatment levels using a graph convolutional neural network. We highlight the potential of causal inference approaches, including GVCNet, for measuring the regional causal connections between amyloid-beta accumulation and AD pathophysiology, which may serve as a robust tool for early diagnosis and tailored care.

Submitted 3 July, 2023; originally announced July 2023.

arXiv:2306.14019 [pdf, other]

Instrumental Variable Approach to Estimating Individual Causal Effects in N-of-1 Trials: Application to ISTOP Study

Authors: Kexin Qu, Christopher H. Schmid, Tao Liu

Abstract: An N-of-1 trial is a multiple crossover trial conducted in a single individual to provide evidence to directly inform personalized treatment decisions. Advancements in wearable devices greatly improved the feasibility of adopting these trials to identify optimal individual treatment plans, particularly when treatments differ among individuals and responses are highly heterogeneous. Our work was motivated by the I-STOP-AFib Study, which examined the impact of different triggers on atrial fibrillation (AF) occurrence. We described a causal framework for 'N-of-1' trial using potential treatment selection paths and potential outcome paths. Two estimands of individual causal effect were defined: (a) the effect of continuous exposure, and (b) the effect of an individual observed behavior. We addressed three challenges: (a) imperfect compliance to the randomized treatment assignment; (b) binary treatments and binary outcomes which led to the 'non-collapsibility' issue of estimating odds ratios; and (c) serial inference in the longitudinal observations. We adopted the Bayesian IV approach where the study randomization was the IV as it impacted the choice of exposure of a subject but not directly the outcome. Estimations were through a system of two parametric Bayesian models to estimate the individual causal effect. Our model got around the non-collapsibility and non-consistency by modeling the confounding mechanism through latent structural models and by inferring with Bayesian posterior of functionals. Autocorrelation present in the repeated measurements was also accounted for. The simulation study showed our method largely reduced bias and greatly improved the coverage of the estimated causal effect, compared to existing methods (ITT, PP, and AT). We applied the method to I-STOP-AFib Study to estimate the individual effect of alcohol on AF occurrence.

Submitted 24 June, 2023; originally announced June 2023.

arXiv:2306.05751 [pdf, other]

Advancing Counterfactual Inference through Nonlinear Quantile Regression

Authors: Shaoan Xie, Biwei Huang, Bin Gu, Tongliang Liu, Kun Zhang

Abstract: The capacity to address counterfactual "what if" inquiries is crucial for understanding and making use of causal influences. Traditional counterfactual inference, under Pearl's counterfactual framework, typically depends on having access to or estimating a structural causal model. Yet, in practice, this causal model is often unknown and might be challenging to identify. Hence, this paper aims to perform reliable counterfactual inference based solely on observational data and the (learned) qualitative causal structure, without necessitating a predefined causal model or even direct estimations of conditional distributions. To this end, we establish a novel connection between counterfactual inference and quantile regression and show that counterfactual inference can be reframed as an extended quantile regression problem. Building on this insight, we propose a practical framework for efficient and effective counterfactual inference implemented with neural networks under a bi-level optimization scheme. The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data, thereby providing an upper bound on the generalization error. Furthermore, empirical evidence demonstrates its superior statistical efficiency in comparison to existing methods. Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.

Submitted 27 February, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

arXiv:2305.14076 [pdf, other]

Towards Understanding the Dynamics of Gaussian-Stein Variational Gradient Descent

Authors: Tianle Liu, Promit Ghosal, Krishnakumar Balasubramanian, Natesh S. Pillai

Abstract: Stein Variational Gradient Descent (SVGD) is a nonparametric particle-based deterministic sampling algorithm. Despite its wide usage, understanding the theoretical properties of SVGD has remained a challenging problem. For sampling from a Gaussian target, the SVGD dynamics with a bilinear kernel will remain Gaussian as long as the initializer is Gaussian. Inspired by this fact, we undertake a detailed theoretical study of the Gaussian-SVGD, i.e., SVGD projected to the family of Gaussian distributions via the bilinear kernel, or equivalently Gaussian variational inference (GVI) with SVGD. We present a complete picture by considering both the mean-field PDE and discrete particle systems. When the target is strongly log-concave, the mean-field Gaussian-SVGD dynamics is proven to converge linearly to the Gaussian distribution closest to the target in KL divergence. In the finite-particle setting, there is both uniform in time convergence to the mean-field limit and linear convergence in time to the equilibrium if the target is Gaussian. In the general case, we propose a density-based and a particle-based implementation of the Gaussian-SVGD, and show that several recent algorithms for GVI, proposed from different perspectives, emerge as special cases of our unified framework. Interestingly, one of the new particle-based instances from this framework empirically outperforms existing approaches. Our results make concrete contributions towards obtaining a deeper understanding of both SVGD and GVI.

Submitted 27 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: NeurIPS 2023; 60 pages, 8 figures
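
To make the particle update in the Gaussian-SVGD abstract above concrete, here is a minimal one-dimensional sketch of the standard SVGD direction for a Gaussian target. It assumes the bilinear kernel takes the form k(x, y) = xy + 1, which the abstract does not spell out; the function names `svgd_direction` and `svgd_step` and the step size are illustrative choices, not the authors' implementation.

```python
from typing import List, Sequence


def svgd_direction(particles: Sequence[float], mu: float, var: float) -> List[float]:
    """Standard SVGD update direction in 1D for the target N(mu, var).

    Assumed kernel: k(x, y) = x * y + 1 (a bilinear kernel).
    phi(x_i) = (1/n) * sum_j [ k(x_j, x_i) * score(x_j) + d/dx_j k(x_j, x_i) ]
    """
    n = len(particles)
    # Score of the Gaussian target: d/dx log p(x) = -(x - mu) / var.
    scores = [-(x - mu) / var for x in particles]
    direction = []
    for xi in particles:
        total = 0.0
        for xj, sj in zip(particles, scores):
            kernel = xj * xi + 1.0
            grad_kernel = xi  # derivative of (xj * xi + 1) with respect to xj
            total += kernel * sj + grad_kernel
        direction.append(total / n)
    return direction


def svgd_step(particles: Sequence[float], mu: float, var: float,
              step: float = 0.01) -> List[float]:
    """Move every particle a small step along the SVGD direction."""
    phi = svgd_direction(particles, mu, var)
    return [x + step * d for x, d in zip(particles, phi)]
```

Two particles at +1 and -1 already match the mean and variance of N(0, 1), so the direction vanishes there; particles at +2 and -2 are too spread out and are pushed back toward the origin.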

arXiv:2305.00876 [pdf, ps, other]

Exactly Tight Information-Theoretic Generalization Error Bound for the Quadratic Gaussian Problem

Authors: Ruida Zhou, Chao Tian, Tie Liu

Abstract: We provide a new information-theoretic generalization error bound that is exactly tight (i.e., matching even the constant) for the canonical quadratic Gaussian (location) problem. Most existing bounds are order-wise loose in this setting, which has raised concerns about the fundamental capability of information-theoretic bounds in reasoning about the generalization behavior for machine learning. The proposed new bound adopts the individual-sample-based approach proposed by Bu et al., but also has several key new ingredients. Firstly, instead of applying the change of measure inequality on the loss function, we apply it to the generalization error function itself; secondly, the bound is derived in a conditional manner; lastly, a reference distribution is introduced. The combination of these components produces a KL-divergence-based generalization error bound. We show that although the latter two new ingredients can help make the bound exactly tight, removing them does not significantly degrade the bound, leading to an asymptotically tight mutual-information-based bound. We further consider the vector Gaussian setting, where a direct application of the proposed bound again does not lead to tight bounds except in special cases. A refined bound is then proposed for decomposable loss functions, leading to a tight bound for the vector setting.

Submitted 12 November, 2023; v1 submitted 1 May, 2023; originally announced May 2023.

arXiv:2304.07005 [pdf, ps, other]

doi 10.1109/LSP.2023.3270080

Detector Design and Performance Analysis for Target Detection in Subspace Interference

Authors: Weijian Liu, Jun Liu, Tao Liu, Hui Chen, Yong-Liang Wang

Abstract: It is often difficult to obtain sufficient training data for adaptive signal detection, which is required to calculate the unknown noise covariance matrix. Additionally, interference is frequently present, which complicates the detection problem. We provide a two-step method, termed interference cancellation before detection (ICBD), to address the issue of signal detection in unknown Gaussian noise and subspace interference. The first step involves projecting the test and training data onto the interference-orthogonal subspace in order to suppress the interference. The second step applies traditional adaptive detector design ideas. Due to the smaller dimension of the projected data, the ICBD-based detectors can function with little training data. ICBD-based detectors have two additional benefits over conventional ones: a lower computational burden and proper operation when interference is present in the training data. We also give the statistical properties of the ICBD-based detectors and demonstrate their equivalence with the traditional ones in the special case of a large amount of training data containing no interference.

Submitted 14 April, 2023; originally announced April 2023.

Comments: This manuscript is submitted to IEEE SPL with paper ID SPL-35580-2023 and the decision "AQ - Publish In Minor, Required Changes"

arXiv:2304.00530 [pdf, other]

Tensor Recovery in High-Dimensional Ising Models

Authors: Tianyu Liu, Somabha Mukherjee, Rahul Biswas

Abstract: The $k$-tensor Ising model is an exponential family on a $p$-dimensional binary hypercube for modeling dependent binary data, where the sufficient statistic consists of all $k$-fold products of the observations, and the parameter is an unknown $k$-fold tensor, designed to capture higher-order interactions between the binary variables. In this paper, we describe an approach based on a penalization technique that helps us recover the signed support of the tensor parameter with high probability, assuming that no entry of the true tensor is too close to zero. The method is based on an $\ell_1$-regularized node-wise logistic regression, that recovers the signed neighborhood of each node with high probability. Our analysis is carried out in the high-dimensional regime, that allows the dimension $p$ of the Ising model, as well as the interaction factor $k$ to potentially grow to $\infty$ with the sample size $n$. We show that if the minimum interaction strength is not too small, then consistent recovery of the entire signed support is possible if one takes $n = \Omega((k!)^8 d^3 \log \binom{p-1}{k-1})$ samples, where $d$ denotes the maximum degree of the hypernetwork in question. Our results are validated in two simulation settings, and applied on a real neurobiological dataset consisting of multi-array electro-physiological recordings from the mouse visual cortex, to model higher-order interactions between the brain regions.

Submitted 23 July, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

Comments: 28 pages, 7 figures

arXiv:2303.05506 [pdf, other]

TANGOS: Regularizing Tabular Neural Networks through Gradient Orthogonalization and Specialization

Authors: Alan Jeffares, Tennison Liu, Jonathan Crabbé, Fergus Imrie, Mihaela van der Schaar

Abstract: Despite their success with unstructured data, deep neural networks are not yet a panacea for structured tabular data. In the tabular domain, their efficiency crucially relies on various forms of regularization to prevent overfitting and provide strong generalization performance. Existing regularization techniques include broad modelling decisions such as choice of architecture, loss functions, and optimization methods. In this work, we introduce Tabular Neural Gradient Orthogonalization and Specialization (TANGOS), a novel framework for regularization in the tabular setting built on latent unit attributions. The gradient attribution of an activation with respect to a given input feature suggests how the neuron attends to that feature, and is often employed to interpret the predictions of deep networks. In TANGOS, we take a different approach and incorporate neuron attributions directly into training to encourage orthogonalization and specialization of latent attributions in a fully-connected network. Our regularizer encourages neurons to focus on sparse, non-overlapping input features and results in a set of diverse and specialized latent units. In the tabular domain, we demonstrate that our approach can lead to improved out-of-sample generalization performance, outperforming other popular regularization methods. We provide insight into why our regularizer is effective and demonstrate that TANGOS can be applied jointly with existing methods to achieve even greater generalization performance.

Submitted 9 March, 2023; originally announced March 2023.

Comments: Published at International Conference on Learning Representations (ICLR) 2023

arXiv:2212.02125 [pdf, other]

TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from Mixed Datasets

Authors: Yuanying Cai, Chuheng Zhang, Li Zhao, Wei Shen, Xuyun Zhang, Lei Song, Jiang Bian, Tao Qin, Tieyan Liu

Abstract: We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies. There are two challenges for this setting: 1) The optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal changes on different states due to the variation of the action coverage induced by different behavior policies. Previous methods fail to handle this by only controlling the global trade-off. 2) For a given state, the action distribution generated by different behavior policies may have multiple modes. The BC regularizers in many previous methods are mean-seeking, resulting in policies that select out-of-distribution (OOD) actions in the middle of the modes. In this paper, we address both challenges by using adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer based on the TD3 algorithm. Our method not only trades off the RL and BC signals with per-state weights (i.e., strong BC regularization on the states with narrow action coverage, and vice versa) but also avoids selecting OOD actions thanks to the mode-seeking property of reverse KL. Empirically, our algorithm can outperform existing offline RL algorithms in the MuJoCo locomotion tasks with the standard D4RL datasets as well as the mixed datasets that combine the standard datasets.

Submitted 5 December, 2022; originally announced December 2022.

Comments: Accepted by ICDM-22 (Best Student Paper Runner-Up Awards)

arXiv:2211.06812 [pdf, other]

FedRule: Federated Rule Recommendation System with Graph Neural Networks

Authors: Yuhang Yao, Mohammad Mahdi Kamani, Zhongwei Cheng, Lin Chen, Carlee Joe-Wong, Tianqiang Liu

Abstract: Much of the value that IoT (Internet-of-Things) devices bring to "smart" homes lies in their ability to automatically trigger other devices' actions: for example, a smart camera triggering a smart lock to unlock a door. Manually setting up these rules for smart devices or applications, however, is time-consuming and inefficient. Rule recommendation systems can automatically suggest rules for users by learning which rules are popular based on those previously deployed (e.g., in others' smart homes). Conventional recommendation formulations require a central server to record the rules used in many users' homes, which compromises their privacy and leaves them vulnerable to attacks on the central server's database of rules. Moreover, these solutions typically leverage generic user-item matrix methods that do not fully exploit the structure of the rule recommendation problem. In this paper, we propose a new rule recommendation system, dubbed as FedRule, to address these challenges. One graph is constructed per user upon the rules s/he is using, and the rule recommendation is formulated as a link prediction task in these graphs. This formulation enables us to design a federated training algorithm that is able to keep users' data private. Extensive experiments corroborate our claims by demonstrating that FedRule has comparable performance as the centralized setting and outperforms conventional solutions.

Submitted 12 November, 2022; originally announced November 2022.

arXiv:2211.06138 [pdf, other]

Practical Approaches for Fair Learning with Multitype and Multivariate Sensitive Attributes

Authors: Tennison Liu, Alex J. Chan, Boris van Breugel, Mihaela van der Schaar

Abstract: It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences. Fair ML has largely focused on the protection of single attributes in the simpler setting where both attributes and target outcomes are binary. However, the practical application in many a real-world problem entails the simultaneous protection of multiple sensitive attributes, which are often not simply binary, but continuous or categorical. To address this more challenging task, we introduce FairCOCCO, a fairness measure built on cross-covariance operators on reproducing kernel Hilbert Spaces. This leads to two practical tools: first, the FairCOCCO Score, a normalised metric that can quantify fairness in settings with single or multiple sensitive attributes of arbitrary type; and second, a subsequent regularisation term that can be incorporated into arbitrary learning objectives to obtain fair predictors. These contributions address crucial gaps in the algorithmic fairness literature, and we empirically demonstrate consistent improvements against state-of-the-art techniques in balancing predictive power and fairness on real-world datasets.

Submitted 11 November, 2022; originally announced November 2022.

arXiv:2211.02315 [pdf, other]

Spatial-Temporal Convolutional Attention for Mapping Functional Brain Networks

Authors: Yiheng Liu, Enjie Ge, Ning Qiang, Tianming Liu, Bao Ge

Abstract: Using functional magnetic resonance imaging (fMRI) and deep learning to explore functional brain networks (FBNs) has attracted many researchers. However, most of these studies are still based on the temporal correlation between the sources and voxel signals, and research on the dynamics of brain function is lacking. Due to the widespread local correlations in the volumes, FBNs can be generated directly in the spatial domain in a self-supervised manner by using spatial-wise attention (SA), and the resulting FBNs have a higher spatial similarity with templates compared to the classical method. Therefore, we proposed a novel Spatial-Temporal Convolutional Attention (STCA) model to discover the dynamic FBNs by using the sliding windows. To validate the performance of the proposed method, we evaluate the approach on the HCP-rest dataset. The results indicate that STCA can be used to discover FBNs in a dynamic way, which provides a novel approach to better understand the human brain.

Submitted 4 November, 2022; originally announced November 2022.

Comments: 5 pages, 5 figures, submitted to 20th IEEE International Symposium on Biomedical Imaging (ISBI 2023)

arXiv:2210.15801 [pdf, ps, other]

Clustering High-dimensional Data via Feature Selection

Authors: Tianqi Liu, Yu Lu, Biqing Zhu, Hongyu Zhao

Abstract: High-dimensional clustering analysis is a challenging problem in statistics and machine learning, with broad applications such as the analysis of microarray data and RNA-seq data. In this paper, we propose a new clustering procedure called Spectral Clustering with Feature Selection (SC-FS), where we first obtain an initial estimate of labels via spectral clustering, then select a small fraction of features with the largest R-squared with these labels, i.e., the proportion of variation explained by group labels, and conduct clustering again using selected features. Under mild conditions, we prove that the proposed method identifies all informative features with high probability and achieves minimax optimal clustering error rate for the sparse Gaussian mixture model. Applications of SC-FS to four real world data sets demonstrate its usefulness in clustering high-dimensional data.

Submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted at Biometrics Journal (https://onlinelibrary.wiley.com/doi/epdf/10.1111/biom.13665)

arXiv:2210.13258 [pdf, other]

A comparative study to alternatives to the log-rank test

Authors: Ina Dormuth, Tiantian Liu, Jin Xu, Markus Pauly, Marc Ditzhaus

Abstract: Studies to compare the survival of two or more groups using time-to-event data are of high importance in medical research. The gold standard is the log-rank test, which is optimal under proportional hazards. As the latter is no simple regularity assumption, we are interested in evaluating the power of various statistical tests under different settings including proportional and non-proportional hazards with a special emphasis on crossing hazards. This challenge has been going on for many years now and multiple methods have already been investigated in extensive simulation studies. However, in recent years new omnibus tests and methods based on the restricted mean survival time appeared that have been strongly recommended in biometric literature. Thus, to give updated recommendations, we perform a vast simulation study to compare tests that showed high power in previous studies with these more recent approaches. We thereby analyze various simulation settings with varying survival and censoring distributions, unequal censoring between groups, small sample sizes and unbalanced group sizes. Overall, omnibus tests are more robust in terms of power against deviations from the proportional hazards assumption.

Submitted 24 October, 2022; originally announced October 2022.

arXiv:2210.08486 [pdf, other]

Streaming PAC-Bayes Gaussian process regression with a performance guarantee for online decision making

Authors: Tianyu Liu, Jie Lu, Zheng Yan, Guangquan Zhang

Abstract: As a powerful Bayesian non-parameterized algorithm, the Gaussian process (GP) has performed a significant role in Bayesian optimization and signal processing. GPs have also advanced online decision-making systems because their posterior distribution has a closed-form solution. However, its training and inference process requires all historic data to be stored and the GP model to be trained from scratch. For those reasons, several online GP algorithms, such as O-SGPR and O-SVGP, have been specifically designed for streaming settings. In this paper, we present a new theoretical framework for online GPs based on the online probably approximately correct (PAC) Bayes theory. The framework offers both a guarantee of generalized performance and good accuracy. Instead of minimizing the marginal likelihood, our algorithm optimizes both the empirical risk function and a regularization item, which is in proportion to the divergence between the prior distribution and posterior distribution of parameters. In addition to its theoretical appeal, the algorithm performs well empirically on several regression datasets. Compared to other online GP algorithms, ours yields a generalization guarantee and very competitive accuracy.

Submitted 26 October, 2022; v1 submitted 16 October, 2022; originally announced October 2022.

arXiv:2210.05955 [pdf, other]

Identifiability and Asymptotics in Learning Homogeneous Linear ODE Systems from Discrete Observations

Authors: Yuanyuan Wang, Wei Huang, Mingming Gong, Xi Geng, Tongliang Liu, Kun Zhang, Dacheng Tao

Abstract: Ordinary Differential Equations (ODEs) have recently gained a lot of attention in machine learning. However, the theoretical aspects, e.g., identifiability and asymptotic properties of statistical estimation are still obscure. This paper derives a sufficient condition for the identifiability of homogeneous linear ODE systems from a sequence of equally-spaced error-free observations sampled from a single trajectory. When observations are disturbed by measurement noise, we prove that under mild conditions, the parameter estimator based on the Nonlinear Least Squares (NLS) method is consistent and asymptotic normal with $n^{-1/2}$ convergence rate. Based on the asymptotic normality property, we construct confidence sets for the unknown system parameters and propose a new method to infer the causal structure of the ODE system, i.e., inferring whether there is a causal link between system variables. Furthermore, we extend the results to degraded observations, including aggregated and time-scaled ones. To the best of our knowledge, our work is the first systematic study of the identifiability and asymptotic properties in learning linear ODE systems. We also construct simulations with various system dimensions to illustrate the established theoretical results.

Submitted 2 June, 2024; v1 submitted 12 October, 2022; originally announced October 2022.

Journal ref: Journal of Machine Learning Research 25 (2024) 1-50

arXiv:2210.01765 [pdf, other]

One Transformer Can Understand Both 2D & 3D Molecular Data

Authors: Shengjie Luo, Tianlang Chen, Yixian Xu, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

Abstract: Unlike vision and language data which usually has a unique format, molecules can naturally be characterized using different chemical formulations. One can view a molecule as a 2D graph or define it as a collection of atoms located in a 3D space. For molecular representation learning, most previous works designed neural networks only for a particular data format, making the learned models likely to fail for other data formats. We believe a general-purpose neural network model for chemistry should be able to handle molecular tasks across data modalities. To achieve this goal, in this work, we develop a novel Transformer-based Molecular model called Transformer-M, which can take molecular data of 2D or 3D formats as input and generate meaningful semantic representations. Using the standard Transformer as the backbone architecture, Transformer-M develops two separated channels to encode 2D and 3D structural information and incorporate them with the atom features in the network modules. When the input data is in a particular format, the corresponding channel will be activated, and the other will be disabled. By training on 2D and 3D molecular data with properly designed supervised signals, Transformer-M automatically learns to leverage knowledge from different data modalities and correctly capture the representations. We conducted extensive experiments for Transformer-M. All empirical results show that Transformer-M can simultaneously achieve strong performance on 2D and 3D tasks, suggesting its broad applicability. The code and models will be made publicly available at https://github.com/lsj2408/Transformer-M.

Submitted 27 March, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

Comments: 20 pages; ICLR 2023, Camera Ready Version; Code: https://github.com/lsj2408/Transformer-M

arXiv:2209.15466 [pdf, other]

Sparsity-Constrained Optimal Transport

Authors: Tianlin Liu, Joan Puigcerver, Mathieu Blondel

Abstract: Regularized optimal transport (OT) is now increasingly used as a loss or as a matching layer in neural networks. Entropy-regularized OT can be computed using the Sinkhorn algorithm but it leads to fully-dense transportation plans, meaning that all sources are (fractionally) matched with all targets. To address this issue, several works have investigated quadratic regularization instead. This regularization preserves sparsity and leads to unconstrained and smooth (semi) dual objectives, that can be solved with off-the-shelf gradient methods. Unfortunately, quadratic regularization does not give direct control over the cardinality (number of nonzeros) of the transportation plan. We propose in this paper a new approach for OT with explicit cardinality constraints on the transportation plan. Our work is motivated by an application to sparse mixture of experts, where OT can be used to match input tokens such as image patches with expert models such as neural networks. Cardinality constraints ensure that at most $k$ tokens are matched with an expert, which is crucial for computational performance reasons. Despite the nonconvexity of cardinality constraints, we show that the corresponding (semi) dual problems are tractable and can be solved with first-order gradient methods. Our method can be thought of as a middle ground between unregularized OT (recovered in the limit case $k=1$) and quadratically-regularized OT (recovered when $k$ is large enough). The smoothness of the objectives increases as $k$ increases, giving rise to a trade-off between convergence speed and sparsity of the optimal plan.

Submitted 14 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: Camera-ready ICLR 2023

arXiv:2209.07303 [pdf, other]

Differentially Private Estimation of Hawkes Process

Authors: Simiao Zuo, Tianyi Liu, Tuo Zhao, Hongyuan Zha

Abstract: Point process models are of great importance in real world applications. In certain critical applications, estimation of point process models involves large amounts of sensitive personal data from users. Privacy concerns naturally arise which have not been addressed in the existing literature. To bridge this glaring gap, we propose the first general differentially private estimation procedure for point process models. Specifically, we take the Hawkes process as an example, and introduce a rigorous definition of differential privacy for event stream data based on a discretized representation of the Hawkes process. We then propose two differentially private optimization algorithms, which can efficiently estimate Hawkes process models with the desired privacy and utility guarantees under two different settings. Experiments are provided to back up our theoretical analysis.

Submitted 15 September, 2022; originally announced September 2022.

arXiv:2207.07985 [pdf]

doi 10.1177/00420980221101707

Home-made blues: Residential crowding and mental health in Beijing, China

Authors: Xize Wang, Tao Liu

Abstract: Although residential crowding has many well-being implications, its connection to mental health is yet to be widely examined. Using survey data from 1613 residents in Beijing, China, we find that living in a crowded place - measured by both square metres per person and persons per bedroom - is significantly associated with a higher risk of depression. We test for the mechanisms of such associations and find that the residential crowding-depression link arises through increased living space-specific stress rather than increased life stress. We also identify the following subgroups that have relatively stronger residential crowding-depression associations: females, those living with children, those not living with parents, and those living in non-market housing units. Our findings show that inequality in living space among urban residents not only is an important social justice issue but also has health implications.

Submitted 16 July, 2022; originally announced July 2022.

Journal ref: Urban Studies (2022)

arXiv:2207.02540 [pdf, other]

Design-based theory for cluster rerandomization

Authors: Xin Lu, Tianle Liu, Hanzhong Liu, Peng Ding

Abstract: Complete randomization balances covariates on average, but covariate imbalance often exists in finite samples. Rerandomization can ensure covariate balance in the realized experiment by discarding the undesired treatment assignments. Many field experiments in public health and social sciences assign the treatment at the cluster level due to logistical constraints or policy considerations. Moreover, they are frequently combined with rerandomization in the design stage. We refer to cluster rerandomization as a cluster-randomized experiment compounded with rerandomization to balance covariates at the individual or cluster level. Existing asymptotic theory can only deal with rerandomization with treatments assigned at the individual level, leaving that for cluster rerandomization an open problem. To fill the gap, we provide a design-based theory for cluster rerandomization. Moreover, we compare two cluster rerandomization schemes that use prior information on the importance of the covariates: one based on the weighted Euclidean distance and the other based on the Mahalanobis distance with tiers of covariates. We demonstrate that the former dominates the latter with optimal weights and orthogonalized covariates. Last but not least, we discuss the role of covariate adjustment in the analysis stage and recommend covariate-adjusted procedures that can be conveniently implemented by least squares with the associated robust standard errors.

Submitted 6 July, 2022; originally announced July 2022.

arXiv:2206.13033 [pdf, other]

Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization

Authors: Xiaodong Yang, Huishuai Zhang, Wei Chen, Tie-Yan Liu

Abstract: By ensuring differential privacy in the learning algorithms, one can rigorously mitigate the risk of large models memorizing sensitive training data. In this paper, we study two algorithms for this purpose, i.e., DP-SGD and DP-NSGD, which first clip or normalize per-sample gradients to bound the sensitivity and then add noise to obfuscate the exact information. We analyze the convergence behavior of these two algorithms in the non-convex optimization setting with two common assumptions and achieve a rate $\mathcal{O}\left(\sqrt[4]{\frac{d\log(1/δ)}{N^2ε^2}}\right)$ of the gradient norm for a $d$-dimensional model, $N$ samples and $(ε,δ)$-DP, which improves over previous bounds under much weaker assumptions. Specifically, we introduce a regularizing factor in DP-NSGD and show that it is crucial in the convergence proof and subtly controls the bias and noise trade-off. Our proof deliberately handles the per-sample gradient clipping and normalization that are specified for the private setting. Empirically, we demonstrate that these two algorithms achieve similar best accuracy while DP-NSGD is comparatively easier to tune than DP-SGD and hence may help further save the privacy budget when accounting the tuning effort.

Submitted 26 June, 2022; originally announced June 2022.

Comments: 25 pages, under review

arXiv:2206.07769 [pdf, other]

HyperImpute: Generalized Iterative Imputation with Automatic Model Selection

Authors: Daniel Jarrett, Bogdan Cebere, Tennison Liu, Alicia Curth, Mihaela van der Schaar

Abstract: Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.

Submitted 15 June, 2022; originally announced June 2022.

Journal ref: In Proc. 39th International Conference on Machine Learning (ICML 2022)

arXiv:2206.05643 [pdf, other]

Density Regression and Uncertainty Quantification with Bayesian Deep Noise Neural Networks

Authors: Daiwei Zhang, Tianci Liu, Jian Kang

Abstract: Deep neural network (DNN) models have achieved state-of-the-art predictive accuracy in a wide range of supervised learning applications. However, accurately quantifying the uncertainty in DNN predictions remains a challenging task. For continuous outcome variables, an even more difficult problem is to estimate the predictive density function, which not only provides a natural quantification of the predictive uncertainty, but also fully captures the random variation in the outcome. In this work, we propose the Bayesian Deep Noise Neural Network (B-DeepNoise), which generalizes standard Bayesian DNNs by extending the random noise variable from the output layer to all hidden layers. The latent random noise equips B-DeepNoise with the flexibility to approximate highly complex predictive distributions and accurately quantify predictive uncertainty. For posterior computation, the unique structure of B-DeepNoise leads to a closed-form Gibbs sampling algorithm that iteratively simulates from the posterior full conditional distributions of the model parameters, circumventing computationally intensive Metropolis-Hastings methods. A theoretical analysis of B-DeepNoise establishes a recursive representation of the predictive distribution and decomposes the predictive variance with respect to the latent parameters. We evaluate B-DeepNoise against existing methods on benchmark regression datasets, demonstrating its superior performance in terms of prediction accuracy, uncertainty quantification accuracy, and uncertainty quantification efficiency.
To illustrate our method's usefulness in scientific studies, we apply B-DeepNoise to predict general intelligence from neuroimaging features in the Adolescent Brain Cognitive Development (ABCD) project.

Submitted 11 June, 2022; originally announced June 2022.

arXiv:2206.02617 [pdf, other]

Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent

Authors: Da Yu, Gautam Kamath, Janardhan Kulkarni, Tie-Yan Liu, Jian Yin, Huishuai Zhang

Abstract: Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose output-specific $(\varepsilon,δ)$-DP to characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We also design an efficient algorithm to investigate individual privacy across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 44.2\% higher than that of the class with the highest accuracy.

Submitted 2 September, 2023; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: Published in Transactions on Machine Learning Research (TMLR)

arXiv:2205.13869 [pdf, other]

MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models

Authors: Erdun Gao, Ignavier Ng, Mingming Gong, Li Shen, Wei Huang, Tongliang Liu, Kun Zhang, Howard Bondell

Abstract: State-of-the-art causal discovery methods usually assume that the observational data is complete. However, the missing data problem is pervasive in many practical scenarios such as clinical trials, economics, and biology. One straightforward way to address the missing data problem is first to impute the data using off-the-shelf imputation methods and then apply existing causal discovery methods. However, such a two-step method may suffer from suboptimality, as the imputation algorithm may introduce bias for modeling the underlying data distribution. In this paper, we develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. Focusing mainly on the assumptions of ignorable missingness and the identifiable additive noise models (ANMs), MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization (EM) framework. In the E-step, in cases where computing the posterior distributions of parameters in closed-form is not feasible, Monte Carlo EM is leveraged to approximate the likelihood. In the M-step, MissDAG leverages the density transformation to model the noise distributions with simpler and specific formulations by virtue of the ANMs and uses a likelihood-based causal discovery algorithm with directed acyclic graph constraint. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.

Submitted 16 January, 2023; v1 submitted 27 May, 2022; originally announced May 2022.

Comments: Accepted to NeurIPS22

arXiv:2205.13401 [pdf, other]

Your Transformer May Not be as Powerful as You Expect

Authors: Shengjie Luo, Shanda Li, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

Abstract: Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of the RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason lies in that most RPEs are placed in the softmax attention that always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With the theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators.
Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications. The code will be made publicly available at https://github.com/lsj2408/URPE.

Submitted 28 October, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: 22 pages; NeurIPS 2022, Camera Ready Version

arXiv:2205.12418 [pdf, other]

Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and Constant Regret

Authors: Jiawei Huang, Li Zhao, Tao Qin, Wei Chen, Nan Jiang, Tie-Yan Liu

Abstract: We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications, where the users can be divided into two groups based on their different tolerance on exploration risks and should be treated separately. In this setting, we simultaneously maintain two policies $π^{\text{O}}$ and $π^{\text{E}}$: $π^{\text{O}}$ ("O" for "online") interacts with more risk-tolerant users from the first tier and minimizes regret by balancing exploration and exploitation as usual, while $π^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for risk-averse users from the second tier utilizing the data collected so far. An important question is whether such a separation yields advantages over the standard online setting (i.e., $π^{\text{E}}=π^{\text{O}}$) for the risk-averse users. We individually consider the gap-independent vs. gap-dependent settings. For the former, we prove that the separation is indeed not beneficial from a minimax perspective. For the latter, we show that if choosing Pessimistic Value Iteration as the exploitation algorithm to produce $π^{\text{E}}$, we can achieve a constant regret for risk-averse users independent of the number of episodes $K$, which is in sharp contrast to the $Ω(\log K)$ regret for any online RL algorithms in the same setting, while the regret of $π^{\text{O}}$ (almost) maintains its online regret optimality and does not need to compromise for the success of $π^{\text{E}}$.

Submitted 26 February, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

Comments: 38 pages; NeurIPS 2022

arXiv:2203.17262 [pdf]

Length L-function for Network-Constrained Point Data

Authors: Zidong Fang, Ci Song, Hua Shu, Jie Chen, Tianyu Liu, Xi Wang, Xiao Chen, Tao Pei

Abstract: Network constrained points are referred to as points restricted to road networks, such as taxi pick up and drop off locations. A significant pattern of network constrained points is referred to as an aggregation; e.g., the aggregation of pick up points may indicate a high taxi demand in a particular area. Although the network K function using the shortest path network distance has been proposed to detect point aggregation, its statistical unit is still radius based. R neighborhood, in particular, has inconsistent network length owing to the complex configuration of road networks which cause unfair counts and identification errors in networks (e.g., the length of the r neighborhood located at an intersection is longer than that on straight roads, which may include more points). In this study, we derived the length L function for network constrained points to identify the aggregation by designing a novel neighborhood as the statistical unit; the total length of this is consistent throughout the network. Compared to the network K function, our method can detect a true to life aggregation scale, identify the aggregation with higher network density, as well as identify the aggregations that the network K function cannot. We validated our method using taxi trips pick up location data within Zhongguancun Area in Beijing, analyzing differences in maximal aggregation between workdays and weekends to understand taxi demand in the morning and evening peak.

Submitted 29 March, 2022; originally announced March 2022.

arXiv:2203.07681 [pdf, other]

DEPTS: Deep Expansion Learning for Periodic Time Series Forecasting

Authors: Wei Fan, Shun Zheng, Xiaohan Yi, Wei Cao, Yanjie Fu, Jiang Bian, Tie-Yan Liu

Abstract: Periodic time series (PTS) forecasting plays a crucial role in a variety of industries to foster critical tasks, such as early warning, pre-planning, resource scheduling, etc. However, the complicated dependencies of the PTS signal on its inherent periodicity as well as the sophisticated composition of various periods hinder the performance of PTS forecasting. In this paper, we introduce a deep expansion learning framework, DEPTS, for PTS forecasting. DEPTS starts with a decoupled formulation by introducing the periodic state as a hidden variable, which stimulates us to make two dedicated modules to tackle the aforementioned two challenges. First, we develop an expansion module on top of residual learning to perform a layer-by-layer expansion of those complicated dependencies. Second, we introduce a periodicity module with a parameterized periodic function that holds sufficient capacity to capture diversified periods. Moreover, our two customized modules also have certain interpretable capabilities, such as attributing the forecasts to either local momenta or global periodicity and characterizing certain core periodic properties, e.g., amplitudes and frequencies. Extensive experiments on both synthetic data and real-world data demonstrate the effectiveness of DEPTS on handling PTS. In most cases, DEPTS achieves significant improvements over the best baseline. Specifically, the error reduction can even reach up to 20% for a few cases. Finally, all codes are publicly available.

Submitted 15 March, 2022; originally announced March 2022.

Comments: ICLR22 Spotlight

arXiv:2202.08928 [pdf, other]

"Back to the future" projections for COVID-19 surges

Authors: J. Sunil Rao, Tianhao Liu, Daniel Andrés Díaz-Pachón

Abstract: We argue that information from countries who had earlier COVID-19 surges can be used to inform another country's current model, then generating what we call back-to-the-future (BTF) projections. We show that these projections can be used to accurately predict future COVID-19 surges prior to an inflection point of the daily infection curve. We show, across 12 different countries from all populated continents around the world, that our method can often predict future surges in scenarios where the traditional approaches would always predict no future surges. However, as expected, BTF projections cannot accurately predict a surge due to the emergence of a new variant. To generate BTF projections, we make use of a matching scheme for asynchronous time series combined with a response coaching SIR model.

Submitted 17 February, 2022; originally announced February 2022.

Comments: 21 pages, 7 figures

MSC Class: 92D25 (Primary) 92C60 92B15 62P10 62M10 (Secondary)

arXiv:2202.08057 [pdf, other]

Understanding and Improving Graph Injection Attack by Promoting Unnoticeability

Authors: Yongqiang Chen, Han Yang, Yonggang Zhang, Kaili Ma, Tongliang Liu, Bo Han, James Cheng

Abstract: Recently Graph Injection Attack (GIA) emerges as a practical attack scenario on Graph Neural Networks (GNNs), where the adversary can merely inject few malicious nodes instead of modifying existing nodes or edges, i.e., Graph Modification Attack (GMA). Although GIA has achieved promising results, little is known about why it is successful and whether there is any pitfall behind the success. To understand the power of GIA, we compare it with GMA and find that GIA can be provably more harmful than GMA due to its relatively high flexibility. However, the high flexibility will also lead to great damage to the homophily distribution of the original graph, i.e., similarity among neighbors. Consequently, the threats of GIA can be easily alleviated or even prevented by homophily-based defenses designed to recover the original homophily. To mitigate the issue, we introduce a novel constraint -- homophily unnoticeability that enforces GIA to preserve the homophily, and propose Harmonious Adversarial Objective (HAO) to instantiate it. Extensive experiments verify that GIA with HAO can break homophily-based defenses and outperform previous GIA attacks by a significant margin. We believe our methods can serve for a more reliable evaluation of the robustness of GNNs.

Submitted 5 April, 2022; v1 submitted 16 February, 2022; originally announced February 2022.

Comments: ICLR2022, 42 pages, 22 figures

arXiv:2202.06450 [pdf, other]

Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality

Authors: Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, Tie-Yan Liu

Abstract: Deployment efficiency is an important criterion for many real-world applications of reinforcement learning (RL). Despite the community's increasing interest, there lacks a formal theoretical formulation for the problem. In this paper, we propose such a formulation for deployment-efficient RL (DE-RL) from an "optimization with constraints" perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal \emph{deployment complexity}, whereas in each deployment the policy can sample a large batch of data. Using finite-horizon linear MDPs as a concrete structural model, we reveal the fundamental limit in achieving deployment efficiency by establishing information-theoretic lower bounds, and provide algorithms that achieve the optimal deployment efficiency. Moreover, our formulation for DE-RL is flexible and can serve as a building block for other practically relevant settings; we give "Safe DE-RL" and "Sample-Efficient DE-RL" as two examples, which may be worth future investigation.

Submitted 30 August, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: 49 Pages; ICLR 2022

arXiv:2201.12739 [pdf, other]

Do We Need to Penalize Variance of Losses for Learning with Label Noise?

Authors: Yexiong Lin, Yu Yao, Yuxuan Du, Jun Yu, Bo Han, Mingming Gong, Tongliang Liu

Abstract: Algorithms which minimize the averaged loss have been widely designed for dealing with noisy labels. Intuitively, when there is a finite training sample, penalizing the variance of losses will improve the stability and generalization of the algorithms. Interestingly, we found that the variance should be increased for the problem of learning with noisy labels. Specifically, increasing the variance will boost the memorization effects and reduce the harmfulness of incorrect labels. By exploiting the label noise transition matrix, regularizers can be easily designed to reduce the variance of losses and be plugged in many existing algorithms. Empirically, the proposed method by increasing the variance of losses significantly improves the generalization ability of baselines on both synthetic and real-world datasets.

Submitted 30 January, 2022; originally announced January 2022.

arXiv:2112.03555 [pdf, other]

FedDAG: Federated DAG Structure Learning

Authors: Erdun Gao, Junjia Chen, Li Shen, Tongliang Liu, Mingming Gong, Howard Bondell

Abstract: To date, most directed acyclic graphs (DAGs) structure learning approaches require data to be stored in a central server. However, due to the consideration of privacy protection, data owners gradually refuse to share their personalized raw data to avoid private information leakage, making this task more troublesome by cutting off the first step. Thus, a puzzle arises: \textit{how do we discover the underlying DAG structure from decentralized data?} In this paper, focusing on the additive noise models (ANMs) assumption of data generation, we take the first step in developing a gradient-based learning framework named FedDAG, which can learn the DAG structure without directly touching the local data and also can naturally handle the data heterogeneity. Our method benefits from a two-level structure of each local model. The first level structure learns the edges and directions of the graph and communicates with the server to get the model information from other clients during the learning procedure, while the second level structure approximates the mechanisms among variables and personally updates on its own data to accommodate the data heterogeneity. Moreover, FedDAG formulates the overall learning task as a continuous optimization problem by taking advantage of an equality acyclicity constraint, which can be solved by gradient descent methods to boost the searching efficiency. Extensive experiments on both synthetic and real-world datasets verify the efficacy of the proposed method.

Submitted 16 January, 2023; v1 submitted 7 December, 2021; originally announced December 2021.

Comments: Accepted to Transactions on Machine Learning Research

arXiv:2111.13164 [pdf, other]

Neural network stochastic differential equation models with applications to financial data forecasting

Authors: Luxuan Yang, Ting Gao, Yubin Lu, Jinqiao Duan, Tao Liu

Abstract: In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series which has big jump properties. Our contributions are, first, we propose a model called Lévy induced stochastic differential equation network, which explores compounded stochastic differential equations with $α$-stable Lévy motion to model complex time series data and solve the problem through neural network approximation. Second, we theoretically prove that the numerical solution through our algorithm converges in probability to the solution of corresponding stochastic differential equation, without curse of dimensionality. Finally, we illustrate our method by applying it to real financial time series data and find the accuracy increases through the use of non-Gaussian Lévy processes. We also present detailed comparisons in terms of data patterns, various models, different shapes of Lévy motion and the prediction lengths.

Submitted 3 November, 2022; v1 submitted 25 November, 2021; originally announced November 2021.

Comments: 18 pages, 38 figures

arXiv:2110.13750 [pdf, other]

Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD

Authors: Bohan Wang, Huishuai Zhang, Jieyu Zhang, Qi Meng, Wei Chen, Tie-Yan Liu

Abstract: Recently, the information-theoretical framework has been proven to be able to obtain non-vacuous generalization bounds for large models trained by Stochastic Gradient Langevin Dynamics (SGLD) with isotropic noise. In this paper, we optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD. We prove that with constraint to guarantee low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance if both the prior and the posterior are jointly optimized. This validates that the optimal noise is quite close to the empirical gradient covariance. Technically, we develop a new information-theoretical bound that enables such an optimization analysis. We then apply matrix analysis to derive the form of optimal noise covariance. Presented constraint and results are validated by the empirical observations.

Submitted 2 November, 2021; v1 submitted 26 October, 2021; originally announced October 2021.

Comments: Accepted by Neurips 2021

arXiv:2110.12088 [pdf, other]

Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations

Authors: Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, Yang Liu

Abstract: Existing research on learning with noisy labels mainly focuses on synthetic label noise. Synthetic noise, though has clean structures which greatly enabled statistical analyses, often fails to model real-world noise patterns. The recent literature has observed several efforts to offer real-world noisy datasets, yet the existing efforts suffer from two caveats: (1) The lack of ground-truth verification makes it hard to theoretically study the property and treatment of real-world label noise; (2) These efforts are often of large scales, which may result in unfair comparisons of robust methods within reasonable and accessible computation power. To better understand real-world label noise, it is crucial to build controllable and moderate-sized real-world noisy datasets with both ground-truth and noisy labels. This work presents two new benchmark datasets CIFAR-10N, CIFAR-100N, equipping the training datasets of CIFAR-10, CIFAR-100 with human-annotated real-world noisy labels we collected from Amazon Mechanical Turk. We quantitatively and qualitatively show that real-world noisy labels follow an instance-dependent pattern rather than the classically assumed and adopted ones (e.g., class-dependent label noise). We then initiate an effort to benchmarking a subset of the existing solutions using CIFAR-10N and CIFAR-100N. We further proceed to study the memorization of correct and wrong predictions, which further illustrates the difference between human noise and class-dependent synthetic noise.
We show indeed the real-world noise patterns impose new and outstanding challenges as compared to synthetic label noise. These observations require us to rethink the treatment of noisy labels, and we hope the availability of these two datasets would facilitate the development and evaluation of future learning with noisy label solutions. Datasets and leaderboards are available at http://noisylabels.com.

Submitted 27 March, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

Comments: Published as a conference paper at ICLR 2022

arXiv:2109.12784 [pdf, other]

Learning from Few Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales

Authors: Tao Liu, P. R. Kumar, Ruida Zhou, Xi Liu

Abstract: Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.

Submitted 22 October, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

Comments: Will appear in NeurIPS 2022

arXiv:2109.02986 [pdf, other]

Instance-dependent Label-noise Learning under a Structural Causal Model

Authors: Yu Yao, Tongliang Liu, Mingming Gong, Bo Han, Gang Niu, Kun Zhang

Abstract: Label noise degrades the performance of deep learning algorithms because deep neural networks easily overfit label errors. Let X and Y denote the instance and clean label, respectively. When Y is a cause of X, according to which many datasets have been constructed, e.g., SVHN and CIFAR, the distributions of P(X) and P(Y|X) are entangled. This means that the unsupervised instances are helpful to learn the classifier and thus reduce the side effect of label noise. However, it remains unclear how to exploit the causal information to handle the label noise problem. In this paper, by leveraging a structural causal model, we propose a novel generative approach for instance-dependent label-noise learning. In particular, we show that properly modeling the instances will contribute to the identifiability of the label noise transition matrix and thus lead to a better classifier. Empirically, our method outperforms all state-of-the-art methods on both synthetic and real-world label-noise datasets.

Submitted 3 June, 2022; v1 submitted 7 September, 2021; originally announced September 2021.

arXiv:2108.09042 [pdf]

Identifying Aggregation Artery Architecture of constrained Origin-Destination flows using Manhattan L-function

Authors: Zidong Fang, Hua Shu, Ci Song, Jie Chen, Tianyu Liu, Xiaohan Liu, Tao Pei

Abstract: The movement of humans and goods in cities can be represented by constrained flow, which is defined as the movement of objects between origin and destination in road networks. Flow aggregation, namely origins and destinations aggregated simultaneously, is one of the most common patterns; for example, the aggregated origin-to-destination flows between two transport hubs may indicate great traffic demand between the two sites. Developing a clustering method for constrained flows is crucial for determining urban flow aggregation. Among existing methods for identifying flow aggregation, the L-function of flows is the major one. Nevertheless, this method depends on the aggregation scale, a key parameter detected by the Euclidean L-function, which does not adapt to road networks. As a result, the extracted aggregation may be overestimated and dispersed. Therefore, we propose a clustering method based on the L-function in Manhattan space, which consists of three major steps. The first is to detect aggregation scales by the Manhattan L-function. The second is to determine core flows possessing the highest local L-function values at different scales. The final step is to take the intersection of the core flows' neighbourhoods, the extent of which depends on the corresponding scale. By setting the number of core flows, we could concentrate the aggregation and thus highlight the Aggregation Artery Architecture (AAA), which depicts road sections that contain the projection of key flow clusters on the road networks. Experiments using taxi flows showed that AAA could clarify the resident movement types of the identified aggregated flows. Our method also helps in selecting locations for distribution sites, thereby supporting accurate analysis of urban interactions.

Submitted 20 August, 2021; originally announced August 2021.

Comments: 29 pages, 12 figures

Showing 1–50 of 213 results for author: Liu, T