-
EM Distillation for One-step Diffusion Models
Authors:
Sirui Xie,
Zhisheng Xiao,
Diederik P Kingma,
Tingbo Hou,
Ying Nian Wu,
Kevin Patrick Murphy,
Tim Salimans,
Ben Poole,
Ruiqi Gao
Abstract:
While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.
Submitted 27 May, 2024;
originally announced May 2024.
-
Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization
Authors:
Shuo Xie,
Zhiyuan Li
Abstract:
Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this advantage is not theoretically well-understood. One challenge here is that though intuitively Adam with $\ell_2$ regularization optimizes the $\ell_2$ regularized loss, it is not clear if AdamW optimizes a specific objective. In this work, we make progress toward understanding the benefit of AdamW by showing that it implicitly performs constrained optimization. More concretely, we show in the full-batch setting, if AdamW converges with any non-increasing learning rate schedule whose partial sum diverges, it must converge to a KKT point of the original loss under the constraint that the $\ell_\infty$ norm of the parameter is bounded by the inverse of the weight decay factor. This result is built on the observation that Adam can be viewed as a smoothed version of SignGD, which is the normalized steepest descent with respect to $\ell_\infty$ norm, and a surprising connection between normalized steepest descent with weight decay and Frank-Wolfe.
Submitted 5 April, 2024;
originally announced April 2024.
-
Robust Conformal Prediction under Distribution Shift via Physics-Informed Structural Causal Model
Authors:
Rui Xu,
Yue Sun,
Chao Chen,
Parv Venkitasubramaniam,
Sihong Xie
Abstract:
Uncertainty is critical to reliable decision-making with machine learning. Conformal prediction (CP) handles uncertainty by predicting a set on a test input, hoping that the set covers the true label with at least $(1-α)$ confidence. This coverage can be guaranteed on test data even if the marginal distributions $P_X$ differ between calibration and test datasets. However, as is common in practice, when the conditional distribution $P_{Y|X}$ is different on calibration and test data, the coverage is not guaranteed and it is essential to measure and minimize the coverage loss under distributional shift at all possible confidence levels. To address these issues, we upper bound the coverage difference at all levels using the cumulative density functions of calibration and test conformal scores and Wasserstein distance. Inspired by the invariance of physics across data distributions, we propose a physics-informed structural causal model (PI-SCM) to reduce the upper bound. We validated that PI-SCM can improve coverage robustness along confidence level and test domain on a traffic speed prediction task and an epidemic spread task with multiple real-world datasets.
Submitted 22 March, 2024;
originally announced March 2024.
-
Causal Representation Learning from Multiple Distributions: A General Setting
Authors:
Kun Zhang,
Shaoan Xie,
Ignavier Ng,
Yujia Zheng
Abstract:
In many problems, the measured variables (e.g., image pixels) are just mathematical functions of the hidden causal variables (e.g., the underlying concepts or objects). For the purpose of making predictions in changing environments or making proper changes to the system, it is helpful to recover the hidden causal variables $Z_i$ and their causal relations represented by graph $\mathcal{G}_Z$. This problem has recently been known as causal representation learning. This paper is concerned with a general, completely nonparametric setting of causal representation learning from multiple distributions (arising from heterogeneous data or nonstationary time series), without assuming hard interventions behind distribution changes. We aim to develop general solutions in this fundamental case; as a by-product, this helps reveal the unique benefit offered by other assumptions such as parametric causal models or hard interventions. We show that under the sparsity constraint on the recovered graph over the latent variables and suitable sufficient change conditions on the causal influences, interestingly, one can recover the moralized graph of the underlying directed acyclic graph, and the recovered latent variables and their relations are related to the underlying causal model in a specific, nontrivial way. In some cases, each latent variable can even be recovered up to component-wise transformations. Experimental results verify our theoretical claims.
Submitted 9 April, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights
Authors:
Yichuan Deng,
Zhao Song,
Shenghao Xie,
Chiwun Yang
Abstract:
In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.
Submitted 19 October, 2023;
originally announced October 2023.
-
Learning Energy-Based Prior Model with Diffusion-Amortized MCMC
Authors:
Peiyu Yu,
Yaxuan Zhu,
Sirui Xie,
Xiaojian Ma,
Ruiqi Gao,
Song-Chun Zhu,
Ying Nian Wu
Abstract:
Latent space Energy-Based Models (EBMs), also known as energy-based priors, have drawn growing interest in the field of generative modeling due to their flexibility in formulation and strong modeling power of the latent space. However, the common practice of learning latent space EBMs with non-convergent short-run MCMC for prior and posterior sampling is hindering the model from further progress; the degenerate MCMC sampling quality in practice often leads to degraded generation quality and instability in training, especially with highly multi-modal and/or high-dimensional target distributions. To remedy this sampling issue, in this paper we introduce a simple but effective diffusion-based amortization method for long-run MCMC sampling and develop a novel learning algorithm for the latent space EBM based on it. We provide theoretical evidence that the learned amortization of MCMC is a valid long-run MCMC sampler. Experiments on several image modeling benchmark datasets demonstrate the superior performance of our method compared with strong counterparts.
Submitted 4 October, 2023;
originally announced October 2023.
-
Convergence of Two-Layer Regression with Nonlinear Units
Authors:
Yichuan Deng,
Zhao Song,
Shenghao Xie
Abstract:
Large language models (LLMs), such as ChatGPT and GPT4, have shown outstanding performance in many tasks in human life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we calculate a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuity and the positive semidefiniteness (PSD) of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution. Last, we relax the Lipschitz condition and prove convergence in the sense of loss value.
Submitted 16 August, 2023;
originally announced August 2023.
-
Partial Identifiability for Domain Adaptation
Authors:
Lingjing Kong,
Shaoan Xie,
Weiran Yao,
Yujia Zheng,
Guangyi Chen,
Petar Stojanov,
Victor Akinwande,
Kun Zhang
Abstract:
Unsupervised domain adaptation is critical to many real-world applications where label information is unavailable in the target domain. In general, without further assumptions, the joint distribution of the features and the label is not identifiable in the target domain. To address this issue, we rely on the property of minimal changes of causal mechanisms across domains to minimize unnecessary influences of distribution shifts. To encode this property, we first formulate the data-generating process using a latent variable model with two partitioned latent subspaces: invariant components whose distributions stay the same across domains and sparse changing components that vary across domains. We further constrain the domain shift to have a restrictive influence on the changing components. Under mild conditions, we show that the latent variables are partially identifiable, from which it follows that the joint distribution of data and labels in the target domain is also identifiable. Given the theoretical insights, we propose a practical domain adaptation framework called iMSDA. Extensive experimental results reveal that iMSDA outperforms state-of-the-art domain adaptation algorithms on benchmark datasets, demonstrating the effectiveness of our framework.
Submitted 10 June, 2023;
originally announced June 2023.
-
Advancing Counterfactual Inference through Nonlinear Quantile Regression
Authors:
Shaoan Xie,
Biwei Huang,
Bin Gu,
Tongliang Liu,
Kun Zhang
Abstract:
The capacity to address counterfactual "what if" inquiries is crucial for understanding and making use of causal influences. Traditional counterfactual inference, under Pearl's counterfactual framework, typically depends on having access to or estimating a structural causal model. Yet, in practice, this causal model is often unknown and might be challenging to identify. Hence, this paper aims to perform reliable counterfactual inference based solely on observational data and the (learned) qualitative causal structure, without necessitating a predefined causal model or even direct estimations of conditional distributions. To this end, we establish a novel connection between counterfactual inference and quantile regression and show that counterfactual inference can be reframed as an extended quantile regression problem. Building on this insight, we propose a practical framework for efficient and effective counterfactual inference implemented with neural networks under a bi-level optimization scheme. The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data, thereby providing an upper bound on the generalization error. Furthermore, empirical evidence demonstrates its superior statistical efficiency in comparison to existing methods. Empirical results on multiple datasets offer compelling support for our theoretical assertions.
Submitted 27 February, 2024; v1 submitted 9 June, 2023;
originally announced June 2023.
-
Noisy Tensor Completion via Low-rank Tensor Ring
Authors:
Yuning Qiu,
Guoxu Zhou,
Qibin Zhao,
Shengli Xie
Abstract:
Tensor completion is a fundamental tool for incomplete data analysis, where the goal is to predict missing entries from partial observations. However, existing methods often make the explicit or implicit assumption that the observed entries are noise-free to provide a theoretical guarantee of exact recovery of missing entries, which is quite restrictive in practice. To remedy such drawbacks, this paper proposes a novel noisy tensor completion model, which addresses the shortcomings of existing works in handling the degeneration of high-order and noisy observations. Specifically, the tensor ring nuclear norm (TRNN) and least-squares estimator are adopted to regularize the underlying tensor and the observed entries, respectively. In addition, a non-asymptotic upper bound of estimation error is provided to depict the statistical performance of the proposed estimator. Two efficient algorithms are developed to solve the optimization problem with convergence guarantee, one of which is specially tailored to handle large-scale tensors by replacing the minimization of TRNN of the original tensor equivalently with that of a much smaller one in a heterogeneous tensor decomposition framework. Experimental results on both synthetic and real-world data demonstrate the effectiveness and efficiency of the proposed model in recovering noisy incomplete tensor data compared with state-of-the-art tensor completion models.
Submitted 14 March, 2022;
originally announced March 2022.
-
Extending the WILDS Benchmark for Unsupervised Adaptation
Authors:
Shiori Sagawa,
Pang Wei Koh,
Tony Lee,
Irena Gao,
Sang Michael Xie,
Kendrick Shen,
Ananya Kumar,
Weihua Hu,
Michihiro Yasunaga,
Henrik Marklund,
Sara Beery,
Etienne David,
Ian Stavness,
Wei Guo,
Jure Leskovec,
Kate Saenko,
Tatsunori Hashimoto,
Sergey Levine,
Chelsea Finn,
Percy Liang
Abstract:
Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as the evaluation metrics. On these datasets, we systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development and evaluation, we provide an open-source package that automates data loading and contains all of the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.
Submitted 23 April, 2022; v1 submitted 9 December, 2021;
originally announced December 2021.
-
Evaluating Effectiveness of Public Health Intervention Strategies for Mitigating COVID-19 Pandemic
Authors:
Shanghong Xie,
Wenbo Wang,
Qinxia Wang,
Yuanjia Wang,
Donglin Zeng
Abstract:
The coronavirus disease 2019 (COVID-19) pandemic is an unprecedented global public health challenge. In the United States (US), state governments have implemented various non-pharmaceutical interventions (NPIs), such as physical distance closure (lockdown), stay-at-home orders, and mandatory facial masks in public, in response to the rapid spread of COVID-19. To evaluate the effectiveness of these NPIs, we propose a nested case-control design with propensity score weighting under the quasi-experiment framework to estimate the average intervention effect on disease transmission across states. We further develop a method to test for factors that moderate intervention effect to assist precision public health intervention. Our method takes account of the underlying dynamics of disease transmission and balances state-level pre-intervention characteristics. We prove that our estimator provides causal intervention effect under assumptions. We apply this method to analyze US COVID-19 incidence cases to estimate the effects of six interventions. We show that lockdown has the largest effect on reducing transmission and reopening bars significantly increases transmission. States with a higher percentage of non-white population are at greater risk of increased $R_t$ associated with reopening bars.
Submitted 20 July, 2021;
originally announced July 2021.
-
Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning
Authors:
Colin Wei,
Sang Michael Xie,
Tengyu Ma
Abstract:
Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language. We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM because task-relevant information is easier to recover from the long-term memory. Experiments on synthetically generated data from HMMs back our theoretical findings.
Submitted 20 April, 2022; v1 submitted 16 June, 2021;
originally announced June 2021.
-
An Early Stopping Bayesian Data Assimilation Approach for Mixed-Logit Estimation
Authors:
Shanshan Xie,
Tim Hillel,
Ying Jin
Abstract:
The mixed-logit model is a flexible tool in transportation choice analysis, which provides valuable insights into inter and intra-individual behavioural heterogeneity. However, applications of mixed-logit models are limited by the high computational and data requirements for model estimation. When estimating on small samples, the Bayesian estimation approach becomes vulnerable to over and under-fitting. This is problematic for investigating the behaviour of specific population sub-groups or market segments with low data availability. Similar challenges arise when transferring an existing model to a new location or time period, e.g., when estimating post-pandemic travel behaviour. We propose an Early Stopping Bayesian Data Assimilation (ESBDA) simulator for estimation of mixed-logit which combines a Bayesian statistical approach with Machine Learning methodologies. The aim is to improve the transferability of mixed-logit models and to enable the estimation of robust choice models with low data availability. This approach can provide new insights into choice behaviour where the traditional estimation of mixed-logit models was not possible due to low data availability, and open up new opportunities for investment and planning decision support. The ESBDA estimator is benchmarked against the Direct Application approach, a basic Bayesian simulator with random starting parameter values and a Bayesian Data Assimilation (BDA) simulator without early stopping. The ESBDA approach is found to effectively overcome under and over-fitting and non-convergence issues in simulation. Its resulting models clearly outperform those of the reference simulators in predictive accuracy. Furthermore, models estimated with ESBDA tend to be more robust, with significant parameters with signs and values consistent with behavioural theory, even when estimated on small samples.
Submitted 26 January, 2021;
originally announced January 2021.
-
In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness
Authors:
Sang Michael Xie,
Ananya Kumar,
Robbie Jones,
Fereshte Khani,
Tengyu Ma,
Percy Liang
Abstract:
Consider a prediction setting with few in-distribution labeled examples and many unlabeled examples both in- and out-of-distribution (OOD). The goal is to learn a model which performs well both in-distribution and OOD. In these settings, auxiliary information is often cheaply available for every input. How should we best leverage this auxiliary information for the prediction task? Empirically across three image and time-series datasets, and theoretically in a multi-task linear regression setting, we show that (i) using auxiliary information as input features improves in-distribution error but can hurt OOD error; but (ii) using auxiliary information as outputs of auxiliary pre-training tasks improves OOD error. To get the best of both worlds, we introduce In-N-Out, which first trains a model with auxiliary inputs and uses it to pseudolabel all the in-distribution inputs, then pre-trains a model on OOD auxiliary outputs and fine-tunes this model with the pseudolabels (self-training). We show both theoretically and empirically that In-N-Out outperforms auxiliary inputs or outputs alone on both in-distribution and OOD error.
Submitted 7 April, 2021; v1 submitted 8 December, 2020;
originally announced December 2020.
-
GTEA: Inductive Representation Learning on Temporal Interaction Graphs via Temporal Edge Aggregation
Authors:
Siyue Xie,
Yiming Li,
Da Sun Handason Tam,
Xiaxin Liu,
Qiu Fang Ying,
Wing Cheong Lau,
Dah Ming Chiu,
Shou Zhi Chen
Abstract:
In this paper, we propose the Graph Temporal Edge Aggregation (GTEA) framework for inductive learning on Temporal Interaction Graphs (TIGs). Different from previous works, GTEA models the temporal dynamics of interaction sequences in the continuous-time space and simultaneously takes advantage of both rich node and edge/interaction attributes in the graph. Concretely, we integrate a sequence model with a time encoder to learn pairwise interactional dynamics between two adjacent nodes. This helps capture complex temporal interactional patterns of a node pair along the history, which generates edge embeddings that can be fed into a GNN backbone. By aggregating features of neighboring nodes and the corresponding edge embeddings, GTEA jointly learns both topological and temporal dependencies of a TIG. In addition, a sparsity-inducing self-attention scheme is incorporated for neighbor aggregation, which highlights more important neighbors and suppresses trivial noise for GTEA. By jointly optimizing the sequence model and the GNN backbone, GTEA learns more comprehensive node representations capturing both temporal and graph structural characteristics. Extensive experiments on five large-scale real-world datasets demonstrate the superiority of GTEA over other inductive models.
Submitted 3 May, 2023; v1 submitted 11 September, 2020;
originally announced September 2020.
-
Understanding the wiring evolution in differentiable neural architecture search
Authors:
Sirui Xie,
Shoukang Hu,
Xinjiang Wang,
Chunxiao Liu,
Jianping Shi,
Xunying Liu,
Dahua Lin
Abstract:
Controversy exists on whether differentiable neural architecture search methods discover wiring topology effectively. To understand how wiring topology evolves, we study the underlying mechanism of several existing differentiable NAS frameworks. Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are preferred over deeper ones; 3) no edges are selected in bi-level optimization. To anatomize these phenomena, we propose a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization. Based on this reformulation, we conduct empirical and theoretical analyses, revealing implicit inductive biases in the cost's assignment mechanism and evolution dynamics that cause the observed phenomena. These biases indicate strong discrimination towards certain topologies. To this end, we pose questions that future differentiable methods for neural wiring discovery need to confront, hoping to evoke discussion and a rethinking of how much bias has been enforced implicitly in existing NAS methods.
Submitted 25 February, 2021; v1 submitted 2 September, 2020;
originally announced September 2020.
-
Graph Structure of Neural Networks
Authors:
Jiaxuan You,
Jure Leskovec,
Kaiming He,
Saining Xie
Abstract:
Neural networks are often represented as graphs of connections between neurons. However, despite their wide use, there is currently little understanding of the relationship between the graph structure of the neural network and its predictive performance. Here we systematically investigate how the graph structure of neural networks affects their predictive performance. To this end, we develop a novel graph-based representation of neural networks called relational graph, where layers of neural network computation correspond to rounds of message exchange along the graph structure. Using this representation we show that: (1) a "sweet spot" of relational graphs leads to neural networks with significantly improved predictive performance; (2) a neural network's performance is approximately a smooth function of the clustering coefficient and average path length of its relational graph; (3) our findings are consistent across many different tasks and datasets; (4) the sweet spot can be identified efficiently; (5) top-performing neural networks have graph structure surprisingly similar to those of real biological neural networks. Our work opens new directions for the design of neural architectures and the understanding of neural networks in general.
Submitted 27 August, 2020; v1 submitted 13 July, 2020;
originally announced July 2020.
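The abstract above ties predictive performance to two graph measures: the clustering coefficient and the average path length. A minimal sketch of how these two quantities can be computed on a small undirected graph follows; the toy graph and the function names are illustrative assumptions and are not taken from the paper.

```python
from collections import deque
from itertools import combinations

# Toy undirected graph (hypothetical): a 4-cycle 0-1-2-3 plus the chord 0-2.
GRAPH = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}


def average_clustering(graph):
    """Mean over nodes of (edges among neighbors) / (possible neighbor pairs)."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += links / (k * (k - 1) / 2)
    return total / len(graph)


def average_path_length(graph):
    """Mean BFS shortest-path length over ordered node pairs (assumes a connected graph)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


clustering = average_clustering(GRAPH)
path_length = average_path_length(GRAPH)
```

On this toy graph, nodes 1 and 3 each have both neighbors connected, while nodes 0 and 2 have two of their three neighbor pairs connected; only the pair (1, 3) is two hops apart.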
-
Learning to Reweight with Deep Interactions
Authors:
Yang Fan,
Yingce Xia,
Lijun Wu,
Shufang Xie,
Weiqing Liu,
Jiang Bian,
Tao Qin,
Xiang-Yang Li
Abstract:
Recently, the concept of teaching has been introduced into machine learning, in which a teacher model is used to guide the training of a student model (which will be used in real tasks) through data selection, loss function design, etc. Learning to reweight, which is a specific kind of teaching that reweights training data using a teacher model, receives much attention due to its simplicity and effectiveness. In existing learning to reweight works, the teacher model only utilizes shallow/surface information such as training iteration number and loss/accuracy of the student model from training/validation sets, but ignores the internal states of the student model, which limits the potential of learning to reweight. In this work, we propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model, and the teacher model returns adaptive weights of training samples to enhance the training of the student model. The teacher model is jointly trained with the student model using meta gradients propagated from a validation set. Experiments on image classification with clean/noisy labels and neural machine translation empirically demonstrate that our algorithm makes significant improvement over previous methods.
Submitted 12 January, 2021; v1 submitted 9 July, 2020;
originally announced July 2020.
-
Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for Improved Generalization
Authors:
Sang Michael Xie,
Tengyu Ma,
Percy Liang
Abstract:
We focus on prediction problems with structured outputs that are subject to output validity constraints, e.g. pseudocode-to-code translation where the code must compile. While labeled input-output pairs are expensive to obtain, "unlabeled" outputs, i.e. outputs without corresponding inputs, are freely available (e.g. code on GitHub) and provide information about output validity. We can capture the output structure by pre-training a denoiser to denoise corrupted versions of unlabeled outputs. We first show that standard fine-tuning after pre-training destroys some of this structure. We then propose composed fine-tuning, which fine-tunes a predictor composed with the pre-trained denoiser, which is frozen to preserve output structure. For two-layer ReLU networks, we prove that composed fine-tuning significantly reduces the complexity of the predictor, thus improving generalization. Empirically, we show that composed fine-tuning improves over standard fine-tuning on two pseudocode-to-code translation datasets (3% and 6% relative). The improvement from composed fine-tuning is magnified on out-of-distribution (OOD) examples (4% and 25% relative).
Submitted 24 October, 2023; v1 submitted 29 June, 2020;
originally announced June 2020.
-
Understanding and Mitigating the Tradeoff Between Robustness and Accuracy
Authors:
Aditi Raghunathan,
Sang Michael Xie,
Fanny Yang,
John Duchi,
Percy Liang
Abstract:
Adversarial training augments the training set with perturbations to improve the robust error (over worst-case perturbations), but it often leads to an increase in the standard error (on unperturbed test inputs). Previous explanations for this tradeoff rely on the assumption that no predictor in the hypothesis class has low standard and robust error. In this work, we precisely characterize the effect of augmentation on the standard error in linear regression when the optimal linear predictor has zero standard and robust error. In particular, we show that the standard error could increase even when the augmented perturbations have noiseless observations from the optimal linear predictor. We then prove that the recently proposed robust self-training (RST) estimator improves robust error without sacrificing standard error for noiseless linear regression. Empirically, for neural networks, we find that RST with different adversarial training methods improves both standard and robust error for random and adversarial rotations and adversarial $\ell_\infty$ perturbations in CIFAR-10.
Submitted 6 July, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
-
DSNAS: Direct Neural Architecture Search without Parameter Retraining
Authors:
Shoukang Hu,
Sirui Xie,
Hehui Zheng,
Chunxiao Liu,
Jianping Shi,
Xunying Liu,
Dahua Lin
Abstract:
If NAS methods are solutions, what is the problem? Most existing NAS methods require two-stage parameter optimization. However, performance of the same architecture in the two stages correlates poorly. In this work, we propose a new problem definition for NAS, task-specific end-to-end, based on this observation. We argue that given a computer vision task for which a NAS method is expected, this definition can reduce the vaguely-defined NAS evaluation to i) accuracy of this task and ii) the total computation consumed to finally obtain a model with satisfying accuracy. Seeing that most existing methods do not solve this problem directly, we propose DSNAS, an efficient differentiable NAS framework that simultaneously optimizes architecture and parameters with a low-biased Monte Carlo estimate. Child networks derived from DSNAS can be deployed directly without parameter retraining. Compared with two-stage methods, DSNAS successfully discovers networks with comparable accuracy (74.4%) on ImageNet in 420 GPU hours, reducing the total time by more than 34%. Our implementation is available at https://github.com/SNAS-Series/SNAS-Series.
Submitted 31 March, 2020; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Direct and Indirect Effects -- An Information Theoretic Perspective
Authors:
Gabriel Schamberg,
William Chapman,
Shang-Ping Xie,
Todd P. Coleman
Abstract:
Information theoretic (IT) approaches to quantifying causal influences have experienced some popularity in the literature, in both theoretical and applied (e.g. neuroscience and climate science) domains. While these causal measures are desirable in that they are model agnostic and can capture non-linear interactions, they are fundamentally different from common statistical notions of causal influence in that they (1) compare distributions over the effect rather than values of the effect and (2) are defined with respect to random variables representing a cause rather than specific values of a cause. We here present IT measures of direct, indirect, and total causal effects. The proposed measures are unlike existing IT techniques in that they enable measuring causal effects that are defined with respect to specific values of a cause while still offering the flexibility and general applicability of IT techniques. We provide an identifiability result and demonstrate application of the proposed measures in estimating the causal effect of the El Niño-Southern Oscillation on temperature anomalies in the North American Pacific Northwest.
Submitted 28 July, 2020; v1 submitted 22 December, 2019;
originally announced December 2019.
-
Neural Network Based Parameter Estimation Method for the Pareto/NBD Model
Authors:
Shao-Ming Xie
Abstract:
Whether stochastic or parametric, the Pareto/NBD model can only be utilized for an in-sample prediction rather than an out-of-sample prediction. This research thus provides a neural network based extension of the Pareto/NBD model to estimate the out-of-sample parameters, which overrides the estimation burden and the application dilemma of the Pareto/NBD approach. The empirical results indicate that the Pareto/NBD model and neural network algorithms have similar predictability for identifying inactive customers. Even with a strong trend fitting on the customer count of each repeat purchase point, the Pareto/NBD model underestimates repeat purchases at both the individual and aggregate levels. Nonetheless, when embedding the likelihood function of the Pareto/NBD model into the loss function, the proposed parameter estimation method shows extraordinary predictability on repeat purchases at these two levels. Furthermore, the proposed neural network based method is highly efficient and resource-friendly and can be deployed in cloud computing to handle big data analysis.
Submitted 5 November, 2019;
originally announced November 2019.
-
Sample-Efficient Neural Architecture Search by Learning Action Space
Authors:
Linnan Wang,
Saining Xie,
Teng Li,
Rodrigo Fonseca,
Yuandong Tian
Abstract:
Neural Architecture Search (NAS) has emerged as a promising technique for automatic neural network design. However, existing MCTS based NAS approaches often utilize manually designed action space, which is not directly related to the performance metric to be optimized (e.g., accuracy), leading to sample-inefficient explorations of architectures. To improve the sample efficiency, this paper proposes Latent Action Neural Architecture Search (LaNAS), which learns actions to recursively partition the search space into good or bad regions that contain networks with similar performance metrics. During the search phase, as different action sequences lead to regions with different performance, the search efficiency can be significantly improved by biasing towards the good regions. On three NAS tasks, empirical results demonstrate that LaNAS is at least an order more sample efficient than baseline methods including evolutionary algorithms, Bayesian optimizations, and random search. When applied in practice, both one-shot and regular LaNAS consistently outperform existing results. Particularly, LaNAS achieves 99.0% accuracy on CIFAR-10 and 80.8% top1 accuracy at 600 MFLOPS on ImageNet in only 800 samples, significantly outperforming AmoebaNet with 33x fewer samples. Our code is publicly available at https://github.com/facebookresearch/LaMCTS.
Submitted 31 March, 2021; v1 submitted 16 June, 2019;
originally announced June 2019.
-
Adversarial Training Can Hurt Generalization
Authors:
Aditi Raghunathan,
Sang Michael Xie,
Fanny Yang,
John C. Duchi,
Percy Liang
Abstract:
While adversarial training can improve robust accuracy (against an adversary), it sometimes hurts standard accuracy (when there is no adversary). Previous work has studied this tradeoff between standard and robust accuracy, but only in the setting where no predictor performs well on both objectives in the infinite data limit. In this paper, we show that even when the optimal predictor with infinite data performs well on both objectives, a tradeoff can still manifest itself with finite data. Furthermore, since our construction is based on a convex learning problem, we rule out optimization concerns, thus laying bare a fundamental tension between robustness and generalization. Finally, we show that robust self-training mostly eliminates this tradeoff by leveraging unlabeled data.
Submitted 26 August, 2019; v1 submitted 14 June, 2019;
originally announced June 2019.
-
Reparameterizable Subset Sampling via Continuous Relaxations
Authors:
Sang Michael Xie,
Stefano Ermon
Abstract:
Many machine learning tasks require sampling a subset of items from a collection based on a parameterized distribution. The Gumbel-softmax trick can be used to sample a single item, and allows for low-variance reparameterized gradients with respect to the parameters of the underlying distribution. However, stochastic optimization involving subset sampling is typically not reparameterizable. To overcome this limitation, we define a continuous relaxation of subset sampling that provides reparameterization gradients by generalizing the Gumbel-max trick. We use this approach to sample subsets of features in an instance-wise feature selection task for model interpretability, subsets of neighbors to implement a deep stochastic k-nearest neighbors model, and sub-sequences of neighbors to implement parametric t-SNE by directly comparing the identities of local neighbors. We improve performance in all these tasks by incorporating subset sampling in end-to-end training.
Submitted 26 February, 2021; v1 submitted 29 January, 2019;
originally announced January 2019.
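The abstract above builds on the Gumbel-max trick. As a rough illustration of the discrete starting point only (not the continuous relaxation the paper proposes), the sketch below perturbs log-weights with Gumbel noise and keeps the k largest, which draws a size-k subset without replacement. The function names and the example weights are assumptions made for this sketch.

```python
import math
import random


def gumbel_noise(rng):
    """Draw one Gumbel(0, 1) sample via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def gumbel_top_k(log_weights, k, rng):
    """Return the indices of the k largest Gumbel-perturbed log-weights.

    With k = 1 this is the Gumbel-max trick: index i is returned with
    probability proportional to exp(log_weights[i]).
    """
    perturbed = [lw + gumbel_noise(rng) for lw in log_weights]
    order = sorted(range(len(log_weights)), key=lambda i: perturbed[i], reverse=True)
    return order[:k]


rng = random.Random(0)
log_weights = [math.log(w) for w in (0.5, 0.3, 0.1, 0.1)]  # hypothetical item weights
subset = gumbel_top_k(log_weights, k=2, rng=rng)
```

The hard `sorted(...)[:k]` step is not differentiable; a relaxation such as the one described in the abstract replaces it with a smooth approximation so that gradients can flow to the weights.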
-
Deep Features Analysis with Attention Networks
Authors:
Shipeng Xie,
Da Chen,
Rong Zhang,
Hui Xue
Abstract:
Deep neural network models have recently drawn a lot of attention, as they consistently produce impressive results in many computer vision tasks such as image classification and object detection. However, interpreting such a model and showing why it performs well remains a challenging question. In this paper, we propose a novel method to interpret neural network models with an attention mechanism. Inspired by heatmap visualization, we analyze the relation between classification accuracy and the attention-based heatmap. An improved attention-based method is also included, illustrating that a better classifier can be interpreted by the attention-based heatmap.
Submitted 20 January, 2019;
originally announced January 2019.
-
SNAS: Stochastic Neural Architecture Search
Authors:
Sirui Xie,
Hehui Zheng,
Chunxiao Liu,
Liang Lin
Abstract:
We propose Stochastic Neural Architecture Search (SNAS), an economical end-to-end solution to Neural Architecture Search (NAS) that trains neural operation parameters and architecture distribution parameters in the same round of back-propagation, while maintaining the completeness and differentiability of the NAS pipeline. In this work, NAS is reformulated as an optimization problem on parameters of a joint distribution for the search space in a cell. To leverage the gradient information in generic differentiable loss for architecture search, a novel search gradient is proposed. We prove that this search gradient optimizes the same objective as reinforcement-learning-based NAS, but assigns credits to structural decisions more efficiently. This credit assignment is further augmented with locally decomposable reward to enforce a resource-efficient constraint. In experiments on CIFAR-10, SNAS takes fewer epochs to find a cell architecture with state-of-the-art accuracy than non-differentiable evolution-based and reinforcement-learning-based NAS, which is also transferable to ImageNet. It is also shown that child networks of SNAS can maintain the validation accuracy in searching, with which attention-based NAS requires parameter retraining to compete, exhibiting potential to stride towards efficient NAS on big datasets. We have released our implementation at https://github.com/SNAS-Series/SNAS-Series.
Submitted 31 March, 2020; v1 submitted 24 December, 2018;
originally announced December 2018.
-
NADPEx: An on-policy temporally consistent exploration method for deep reinforcement learning
Authors:
Sirui Xie,
Junning Huang,
Lanxin Lei,
Chunxiao Liu,
Zheng Ma,
Wei Zhang,
Liang Lin
Abstract:
Reinforcement learning agents need exploratory behaviors to escape from local optima. These behaviors may include both immediate dithering perturbation and temporally consistent exploration. To achieve these, a stochastic policy model that is inherently consistent through a period of time is desired, especially for tasks with either sparse rewards or long term information. In this work, we introduce a novel on-policy temporally consistent exploration strategy - Neural Adaptive Dropout Policy Exploration (NADPEx) - for deep reinforcement learning agents. Modeled as a global random variable for conditional distribution, dropout is incorporated into reinforcement learning policies, equipping them with inherent temporal consistency, even when the reward signals are sparse. Two factors, gradients' alignment with the objective and KL constraint in policy space, are discussed to guarantee NADPEx policy's stable improvement. Our experiments demonstrate that NADPEx solves tasks with sparse reward while naive exploration and parameter noise fail. It yields as well or even faster convergence in the standard MuJoCo benchmark for continuous control.
Submitted 24 December, 2018; v1 submitted 21 December, 2018;
originally announced December 2018.
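The abstract above treats dropout as a global random variable so that exploration stays consistent over time. One way to read that idea, shown here as a toy sketch rather than the authors' implementation, is to sample a dropout mask once per episode and reuse it at every step. The two-layer policy, its weights, and all names below are hypothetical.

```python
import random


def sample_mask(n_units, keep_prob, rng):
    """Sample one binary dropout mask over the hidden units."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n_units)]


def policy_action(state, weights, mask, keep_prob):
    """Tiny two-layer policy: ReLU hidden layer, masked and rescaled, then a linear output."""
    hidden = [max(0.0, sum(w * s for w, s in zip(row, state))) for row in weights["w1"]]
    dropped = [h * m / keep_prob for h, m in zip(hidden, mask)]
    return sum(w * h for w, h in zip(weights["w2"], dropped))


def run_episode(states, weights, keep_prob, rng):
    """Sample the mask once, then hold it fixed for every step of the episode."""
    mask = sample_mask(len(weights["w1"]), keep_prob, rng)
    actions = [policy_action(s, weights, mask, keep_prob) for s in states]
    return mask, actions


rng = random.Random(0)
weights = {"w1": [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]], "w2": [1.0, -1.0, 0.5]}
states = [[1.0, 2.0]] * 4  # the same state repeated, to make the consistency visible
mask, actions = run_episode(states, weights, keep_prob=0.5, rng=rng)
```

Because the mask is fixed within an episode, identical states map to identical actions there, whereas resampling the mask at every step would give step-wise dithering instead.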
-
Semi-supervised Deep Representation Learning for Multi-View Problems
Authors:
Vahid Noroozi,
Sara Bahaadini,
Lei Zheng,
Sihong Xie,
Weixiang Shao,
Philip S. Yu
Abstract:
While neural networks for learning representation of multi-view data have been previously proposed as one of the state-of-the-art multi-view dimension reduction techniques, how to make the representation discriminative with only a small amount of labeled data is not well-studied. We introduce a semi-supervised neural network model, named Multi-view Discriminative Neural Network (MDNN), for multi-view problems. MDNN finds nonlinear view-specific mappings by projecting samples to a common feature space using multiple coupled deep networks. It is capable of leveraging both labeled and unlabeled data to project multi-view data so that samples from different classes are separated and those from the same class are clustered together. It also uses the inter-view correlation between views to exploit the available information in both the labeled and unlabeled data. Extensive experiments conducted on four datasets demonstrate the effectiveness of the proposed algorithm for multi-view semi-supervised learning.
Submitted 11 November, 2018;
originally announced November 2018.
-
Securing Behavior-based Opinion Spam Detection
Authors:
Shuaijun Ge,
Guixiang Ma,
Sihong Xie,
Philip S. Yu
Abstract:
Review spams are prevalent in e-commerce to manipulate product ranking and customers' decisions maliciously. While spams generated based on a simple spamming strategy can be detected effectively, hardened spammers can evade regular detectors via more advanced spamming strategies. Previous work gave more attention to evasion against text and graph-based detectors, but evasions against behavior-based detectors are largely ignored, leading to vulnerabilities in spam detection systems. Since real evasion data are scarce, we first propose EMERAL (Evasion via Maximum Entropy and Rating sAmpLing) to generate evasive spams to certain existing detectors. EMERAL can simulate spammers with different goals and levels of knowledge about the detectors, targeting different stages of the life cycle of target products. We show that in the evasion-defense dynamic, only a few evasion types are meaningful to the spammers, and no spammer will be able to evade too many detection signals at the same time. We reveal that some evasions are quite insidious and can fail all detection signals. We then propose DETER (Defense via Evasion generaTion using EmeRal), based on model re-training on diverse evasive samples generated by EMERAL. Experiments confirm that DETER is more accurate in detecting both suspicious time windows and individual spamming reviews. In terms of security, DETER is versatile enough to be vaccinated against diverse and unexpected evasions, is agnostic about evasion strategy and can be released without privacy concern.
Submitted 8 November, 2018;
originally announced November 2018.
-
Domain Confusion with Self Ensembling for Unsupervised Adaptation
Authors:
Jiawei Wang,
Zhaoshui He,
Chengjian Feng,
Zhouping Zhu,
Qinzhuang Lin,
Jun Lv,
Shengli Xie
Abstract:
Data collection and annotation are time-consuming in machine learning, especially for large-scale problems. A common approach for this problem is to transfer knowledge from a related labeled domain to a target one. There are two popular ways to achieve this goal: adversarial learning and self training. In this article, we first analyze the training instability problem and the mistaken confusion issue in the adversarial learning process. Then, inspired by domain confusion and self-ensembling methods, we propose a combined model to learn feature and class jointly invariant representation, namely Domain Confusion with Self Ensembling (DCSE). The experiments verified that our proposed approach can offer better performance than empirical art in a variety of unsupervised domain adaptation benchmarks.
Submitted 8 July, 2020; v1 submitted 10 October, 2018;
originally announced October 2018.
-
Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data by Minimizing Predictive Variance
Authors:
Neal Jean,
Sang Michael Xie,
Stefano Ermon
Abstract:
Large amounts of labeled data are typically required to train deep learning models. For many real-world problems, however, acquiring additional data can be expensive or even impossible. We present semi-supervised deep kernel learning (SSDKL), a semi-supervised regression model based on minimizing predictive variance in the posterior regularization framework. SSDKL combines the hierarchical representation learning of neural networks with the probabilistic modeling capabilities of Gaussian processes. By leveraging unlabeled data, we show improvements on a diverse set of real-world regression tasks over supervised deep kernel learning and semi-supervised methods such as VAT and mean teacher adapted for regression.
Submitted 4 March, 2019; v1 submitted 25 May, 2018;
originally announced May 2018.
-
Beyond Unfolding: Exact Recovery of Latent Convex Tensor Decomposition under Reshuffling
Authors:
Chao Li,
Mohammad Emtiyaz Khan,
Zhun Sun,
Gang Niu,
Bo Han,
Shengli Xie,
Qibin Zhao
Abstract:
Exact recovery of tensor decomposition (TD) methods is a desirable property in both unsupervised learning and scientific data analysis. The numerical defects of TD methods, however, limit their practical applications on real-world data. As an alternative, convex tensor decomposition (CTD) was proposed to alleviate these problems, but its exact-recovery property is not properly addressed so far. To this end, we focus on latent convex tensor decomposition (LCTD), a practically widely-used CTD model, and rigorously prove a sufficient condition for its exact-recovery property. Furthermore, we show that such property can be also achieved by a more general model than LCTD. In the new model, we generalize the classic tensor (un-)folding into reshuffling operation, a more flexible mapping to relocate the entries of the matrix into a tensor. Armed with the reshuffling operations and exact-recovery property, we explore a totally novel application for (generalized) LCTD, i.e., image steganography. Experimental results on synthetic data validate our theory, and results on image steganography show that our method outperforms the state-of-the-art methods.
Submitted 28 January, 2020; v1 submitted 22 May, 2018;
originally announced May 2018.
-
Cox Regression Model Under Dependent Truncation
Authors:
Lior Rennert,
Sharon X. Xie
Abstract:
Truncation is a statistical phenomenon that occurs in many time to event studies. For example, autopsy-confirmed studies of neurodegenerative diseases are subject to an inherent left and right truncation, also known as double truncation. When the goal is to study the effect of risk factors on survival, the standard Cox regression model cannot be used when the data is subject to truncation. Existing methods which adjust for both left and right truncation in the Cox regression model require independence between the survival times and truncation times, which may not be a reasonable assumption in practice. We propose an expectation-maximization algorithm to relax the independence assumption in the Cox regression model under left, right, or double truncation, to an assumption of conditional independence. The resulting regression coefficient estimators are consistent and asymptotically normal. We demonstrate through extensive simulations that the proposed estimators have little bias and, in most practical situations, have a lower mean-squared error compared to existing estimators. We implement our approach to assess the effect of occupation on survival in subjects with autopsy-confirmed Alzheimer's disease.
Submitted 26 March, 2018;
originally announced March 2018.
-
SEVEN: Deep Semi-supervised Verification Networks
Authors:
Vahid Noroozi,
Lei Zheng,
Sara Bahaadini,
Sihong Xie,
Philip S. Yu
Abstract:
Verification determines whether two samples belong to the same class or not, and has important applications such as face and fingerprint verification, where thousands or millions of categories are present but each category has scarce labeled examples, presenting two major challenges for existing deep learning models. We propose a deep semi-supervised model named SEmi-supervised VErification Network (SEVEN) to address these challenges. The model consists of two complementary components. The generative component addresses the lack of supervision within each category by learning general salient structures from a large amount of data across categories. The discriminative component exploits the learned general features to mitigate the lack of supervision within categories, and also directs the generative component to find more informative structures of the whole data manifold. The two components are tied together in SEVEN to allow an end-to-end training of the two components. Extensive experiments on four verification tasks demonstrate that SEVEN significantly outperforms other state-of-the-art deep semi-supervised techniques when labeled data are in short supply. Furthermore, SEVEN is competitive with fully supervised baselines trained with a larger amount of labeled data. It indicates the importance of the generative component in SEVEN.
Submitted 14 June, 2017; v1 submitted 12 June, 2017;
originally announced June 2017.
-
Deeply-Supervised Nets
Authors:
Chen-Yu Lee,
Saining Xie,
Patrick Gallagher,
Zhengyou Zhang,
Zhuowen Tu
Abstract:
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce "companion objective" to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
Submitted 25 September, 2014; v1 submitted 18 September, 2014;
originally announced September 2014.
-
Efficient Nonnegative Tucker Decompositions: Algorithms and Uniqueness
Authors:
Guoxu Zhou,
Andrzej Cichocki,
Qibin Zhao,
Shengli Xie
Abstract:
Nonnegative Tucker decomposition (NTD) is a powerful tool for the extraction of nonnegative parts-based and physically meaningful latent components from high-dimensional tensor data while preserving the natural multilinear structure of data. However, as the data tensor often has multiple modes and is large-scale, existing NTD algorithms suffer from a very high computational complexity in terms of both storage and computation time, which has been one major obstacle for practical applications of NTD. To overcome these disadvantages, we show how low (multilinear) rank approximation (LRA) of tensors is able to significantly simplify the computation of the gradients of the cost function, upon which a family of efficient first-order NTD algorithms are developed. Besides dramatically reducing the storage complexity and running time, the new algorithms are quite flexible and robust to noise because any well-established LRA approaches can be applied. We also show how nonnegativity incorporating sparsity substantially improves the uniqueness property and partially alleviates the curse of dimensionality of the Tucker decompositions. Simulation results on synthetic and real-world data justify the validity and high efficiency of the proposed NTD algorithms.
Submitted 16 September, 2015; v1 submitted 16 April, 2014;
originally announced April 2014.
-
Multilabel Consensus Classification
Authors:
Sihong Xie,
Xiangnan Kong,
Jing Gao,
Wei Fan,
Philip S. Yu
Abstract:
In the era of big data, a large amount of noisy and incomplete data can be collected from multiple sources for prediction tasks. Combining multiple models or data sources helps to counteract the effects of low data quality and the bias of any single model or data source, and thus can improve the robustness and the performance of predictive models. Out of privacy, storage and bandwidth considerations, in certain circumstances one has to combine the predictions from multiple models or data sources to obtain the final predictions without accessing the raw data. Consensus-based prediction combination algorithms are effective for such situations. However, current research on prediction combination focuses on the single label setting, where an instance can have one and only one label. Nonetheless, data nowadays are usually multilabeled, such that more than one label have to be predicted at the same time. Direct applications of existing prediction combination methods to multilabel settings can lead to degenerated performance. In this paper, we address the challenges of combining predictions from multiple multilabel classifiers and propose two novel algorithms, MLCM-r (MultiLabel Consensus Maximization for ranking) and MLCM-a (MLCM for microAUC). These algorithms can capture label correlations that are common in multilabel classifications, and optimize corresponding performance metrics. Experimental results on popular multilabel classification tasks verify the theoretical analysis and effectiveness of the proposed methods.
Submitted 15 October, 2013;
originally announced October 2013.