-
Scalable Differentiable Causal Discovery in the Presence of Latent Confounders with Skeleton Posterior (Extended Version)
Authors:
**chuan Ma,
Rui Ding,
Qiang Fu,
Jiaru Zhang,
Shuai Wang,
Shi Han,
Dongmei Zhang
Abstract:
Differentiable causal discovery has made significant advancements in the learning of directed acyclic graphs. However, its application to real-world datasets remains restricted due to the ubiquity of latent confounders and the requirement to learn maximal ancestral graphs (MAGs). To date, existing differentiable MAG learning algorithms have been limited to small datasets and failed to scale to larger ones (e.g., with more than 50 variables).
The key insight in this paper is that the causal skeleton, which is the undirected version of the causal graph, has potential for improving accuracy and reducing the search space of the optimization procedure, thereby enhancing the performance of differentiable causal discovery. Therefore, we seek to address a two-fold challenge to harness the potential of the causal skeleton for differentiable causal discovery in the presence of latent confounders: (1) scalable and accurate estimation of skeleton and (2) universal integration of skeleton estimation with differentiable causal discovery.
To this end, we propose SPOT (Skeleton Posterior-guided OpTimization), a two-phase framework that harnesses the skeleton posterior for differentiable causal discovery in the presence of latent confounders. In contrast to a point estimate, SPOT seeks to estimate the posterior distribution of skeletons given the dataset. It first formulates the posterior inference as an instance of an amortized inference problem and concretizes it with a supervised causal learning (SCL)-enabled solution to estimate the skeleton posterior. To incorporate the skeleton posterior with differentiable causal discovery, SPOT then features a skeleton posterior-guided stochastic optimization procedure to guide the optimization of MAGs. [abridged due to length limit]
Submitted 15 June, 2024;
originally announced June 2024.
-
Statistical Depth Function Random Variables for Univariate Distributions and induced Divergences
Authors:
Rui Ding
Abstract:
In this paper, we show that the halfspace depth random variable for samples from a univariate distribution with a notion of center is distributed as a uniform distribution on the interval [0,1/2]. The simplicial depth random variable has a distribution that first-order stochastically dominates that of the halfspace depth random variable and relates to a Beta distribution. Depth-induced divergences between two univariate distributions can be defined using divergences on the distributions for the statistical depth random variables in-between these two distributions. We discuss the properties of such induced divergences, particularly the depth-induced total variation distance (TVD) based on halfspace or simplicial depth functions, and how empirical two-sample estimators benefit from such transformations.
Submitted 25 April, 2023;
originally announced April 2023.
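The uniformity claim in this abstract can be checked numerically. The sketch below is not taken from the paper: it assumes the usual population halfspace depth for a continuous univariate CDF F, D(x) = min(F(x), 1 - F(x)), and an Exponential(1) example distribution chosen only for illustration. If D(X) is uniform on [0, 1/2], then P(D(X) <= t) = 2t.

```python
import math
import random


def halfspace_depth(x, cdf):
    """Univariate halfspace depth of x under a continuous CDF (assumed form)."""
    u = cdf(x)
    return min(u, 1.0 - u)


def empirical_cdf_at(values, t):
    """Fraction of values that are <= t."""
    return sum(v <= t for v in values) / len(values)


# Assumed example distribution: Exponential(1), with CDF 1 - exp(-x).
rng = random.Random(0)
exp_cdf = lambda x: 1.0 - math.exp(-x)
depths = [halfspace_depth(rng.expovariate(1.0), exp_cdf) for _ in range(50_000)]

# Compare the empirical CDF of the depths with the Uniform[0, 1/2] CDF, 2t.
for t in (0.1, 0.25, 0.4):
    print(t, round(empirical_cdf_at(depths, t), 3), 2 * t)
```

The two printed columns should agree up to sampling noise; swapping in another continuous distribution and its CDF should leave the result unchanged, since F(X) is itself uniform on [0, 1].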
-
f-Betas and Portfolio Optimization with f-Divergence induced Risk Measures
Authors:
Rui Ding
Abstract:
In this paper, we build on the class of f-divergence induced coherent risk measures for portfolio optimization and derive its necessary optimality conditions formulated in CAPM format. We derive a new f-Beta similar to the Standard Betas and also extend it to previous work on Drawdown Betas. The f-Beta evaluates portfolio performance under an optimally perturbed market probability measure, and this family of Beta metrics gives various degrees of flexibility and interpretability. We conduct numerical experiments using selected stocks against a chosen S&P 500 market index as the optimal portfolio to demonstrate the new perspectives provided by Hellinger-Beta as compared with Standard Beta and Drawdown Betas. In our experiments, the squared Hellinger distance is chosen as the f-divergence function in the f-divergence induced risk measures and f-Betas. We calculate Hellinger-Beta metrics based on deviation measures and further extend this approach to calculate Hellinger-Betas based on drawdown measures, resulting in another new metric, termed Hellinger-Drawdown Beta. We compare the resulting Hellinger-Beta values under various choices of the risk aversion parameter to study their sensitivity to increasing stress levels.
Submitted 12 May, 2023; v1 submitted 1 February, 2023;
originally announced February 2023.
-
ML4C: Seeing Causality Through Latent Vicinity
Authors:
Haoyue Dai,
Rui Ding,
Yuanyuan Jiang,
Shi Han,
Dongmei Zhang
Abstract:
Supervised Causal Learning (SCL) aims to learn causal relations from observational data by accessing previously seen datasets associated with ground truth causal relations. This paper presents a first attempt at addressing a fundamental question: What are the benefits from supervision and how does it benefit? Starting from the observation that SCL is no better than random guessing if the learning target is non-identifiable a priori, we propose a two-phase paradigm for SCL by explicitly considering structure identifiability. Following this paradigm, we tackle the problem of SCL on discrete data and propose ML4C. The core of ML4C is a binary classifier with a novel learning target: it classifies whether an Unshielded Triple (UT) is a v-structure or not. Specifically, starting from an input dataset with the corresponding skeleton provided, ML4C orients each UT once it is classified as a v-structure. These v-structures are together used to construct the final output. To address the fundamental question of SCL, we propose a principled method for ML4C featurization: we exploit the vicinity of a given UT (i.e., the neighbors of UT in skeleton), and derive features by considering the conditional dependencies and structural entanglement within the vicinity. We further prove that ML4C is asymptotically correct. Last but not least, thorough experiments conducted on benchmark datasets demonstrate that ML4C remarkably outperforms other state-of-the-art algorithms in terms of accuracy, reliability, robustness and tolerance. In summary, ML4C shows promising results on validating the effectiveness of supervision for causal learning. Our code is publicly available at https://github.com/microsoft/ML4C.
Submitted 16 April, 2023; v1 submitted 1 October, 2021;
originally announced October 2021.
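The orientation step described in this abstract (find each unshielded triple in the given skeleton, then orient it if a classifier calls it a v-structure) can be sketched as follows. This is an illustrative sketch, not the ML4C implementation: the function names, the adjacency-dict representation, and `toy_classifier` are all hypothetical, and the real classifier's vicinity-based features are not modeled.

```python
from itertools import combinations


def unshielded_triples(skeleton):
    """Return triples (x, t, y) where x - t and t - y are adjacent but x, y are not.

    `skeleton` maps each node to the set of its neighbours (undirected graph).
    """
    triples = []
    for t, nbrs in skeleton.items():
        for x, y in combinations(sorted(nbrs), 2):
            if y not in skeleton[x]:
                triples.append((x, t, y))
    return triples


def orient_v_structures(skeleton, is_v_structure):
    """Orient x -> t <- y for every unshielded triple the classifier accepts."""
    directed = set()
    for x, t, y in unshielded_triples(skeleton):
        if is_v_structure(x, t, y):
            directed.add((x, t))
            directed.add((y, t))
    return directed


# Toy skeleton: a star centred on C.
skeleton = {"A": {"C"}, "B": {"C"}, "C": {"A", "B", "D"}, "D": {"C"}}
# Hypothetical stand-in for the learned binary classifier.
toy_classifier = lambda x, t, y: (x, t, y) == ("A", "C", "B")

print(unshielded_triples(skeleton))
print(sorted(orient_v_structures(skeleton, toy_classifier)))
```

In this toy graph every pair of C's neighbours forms an unshielded triple, and only the one accepted by the stand-in classifier contributes directed edges.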
-
Profile control chart based on maximum entropy
Authors:
Seyedeh Azadeh Fallah Mortezanejad,
Ruochen Wang,
Gholamreza Mohtashami Borzadaran,
Renkai Ding,
Kim Phuc Tran
Abstract:
Monitoring a process over time is important in manufacturing to reduce the waste of money and time. Charts such as Shewhart, CUSUM, and EWMA are commonly used to monitor a process with a single intended attribute, and they are applied to different kinds of processes with various ranges of shifts. In some cases, the process quality is characterized by different types of profiles. The purpose of this article is to monitor profile coefficients instead of a process mean. Two methods are proposed and compared for simultaneously monitoring the intercept and slope of a simple linear profile. The first is linear regression, and the second is the maximum entropy principle. Hotelling's T² statistic is used to map the two coefficients to a scalar. A simulation study compares the two methods in terms of type II error and average run length. Finally, two real examples are presented to demonstrate the applicability of the proposed chart. The first concerns semiconductors, and the second concerns pharmaceutical production processes. The performance of the methods is relatively similar. Maximum entropy plays an important role in correctly identifying differences in the pharmaceutical example, while linear regression did not correctly detect these changes.
Submitted 31 October, 2023; v1 submitted 28 December, 2020;
originally announced December 2020.
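The step of reducing the intercept and slope to one scalar can be illustrated with the standard Hotelling form T² = (b - μ)ᵀ Σ⁻¹ (b - μ). The sketch below assumes known in-control values for the mean vector μ and covariance Σ of the estimated coefficients; the numbers are made up for illustration, and the paper's estimation procedure (regression or maximum entropy) and control limits are not reproduced.

```python
def hotelling_t2(coeffs, mean, cov):
    """T^2 = (b - mu)^T Sigma^{-1} (b - mu) for a 2-vector b = (intercept, slope)."""
    d0 = coeffs[0] - mean[0]
    d1 = coeffs[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    return (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det


# Hypothetical in-control parameters and one estimated profile.
in_control_mean = (3.0, 2.0)
in_control_cov = ((0.04, 0.0), (0.0, 0.01))
estimated = (3.2, 2.1)

print(hotelling_t2(estimated, in_control_mean, in_control_cov))
```

A chart would plot this scalar for each sampled profile and signal when it exceeds an upper control limit, which is typically taken from a chi-square or F quantile depending on whether the in-control parameters are known or estimated.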
-
Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations
Authors:
Weihao Gao,
Xiangjun Fan,
Chong Wang,
Jiankai Sun,
Kai Jia,
Wenzhi Xiao,
Ruofan Ding,
Xingyan Bin,
Hui Yang,
Xiaobing Liu
Abstract:
One of the core problems in large-scale recommendations is to retrieve top relevant candidates accurately and efficiently, preferably in sub-linear time. Previous approaches are mostly based on a two-step procedure: first learn an inner-product model, and then use some approximate nearest neighbor (ANN) search algorithm to find top candidates. In this paper, we present Deep Retrieval (DR), to learn a retrievable structure directly with user-item interaction data (e.g. clicks) without resorting to the Euclidean space assumption in ANN algorithms. DR's structure encodes all candidate items into a discrete latent space. Those latent codes for the candidates are model parameters and learnt together with other neural network parameters to maximize the same objective function. With the model learnt, a beam search over the structure is performed to retrieve the top candidates for reranking. Empirically, we first demonstrate that DR, with sub-linear computational complexity, can achieve almost the same accuracy as the brute-force baseline on two public datasets. Moreover, we show that, in a live production recommendation system, a deployed DR approach significantly outperforms a well-tuned ANN baseline in terms of engagement metrics. To the best of our knowledge, DR is among the first non-ANN algorithms successfully deployed at the scale of hundreds of millions of items for industrial recommendation systems.
Submitted 18 May, 2021; v1 submitted 12 July, 2020;
originally announced July 2020.
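The retrieval step mentioned in this abstract, a beam search over a discrete structure, can be sketched generically. The code below is an assumption-laden illustration rather than the DR model: it assumes a layered structure with `depth` layers of `width` nodes each, and `step_log_prob` is a hypothetical stand-in for whatever learned model scores the next node given the path so far.

```python
import heapq
import math


def beam_search(step_log_prob, depth, width, beam_size):
    """Return the `beam_size` best (score, path) pairs of length `depth`.

    step_log_prob(prefix, node) gives the log-probability of choosing `node`
    (an int in range(width)) after the partial path `prefix` (a tuple).
    """
    beams = [(0.0, ())]
    for _ in range(depth):
        candidates = [
            (score + step_log_prob(prefix, node), prefix + (node,))
            for score, prefix in beams
            for node in range(width)
        ]
        # Keep only the highest-scoring partial paths, best first.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return beams


# Toy scorer: per-layer probabilities that ignore the prefix.
layer_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
toy_scorer = lambda prefix, node: math.log(layer_probs[len(prefix)][node])

top = beam_search(toy_scorer, depth=2, width=3, beam_size=2)
print([path for _, path in top])
```

Each step scores only `beam_size * width` extensions instead of all `width ** depth` paths, which is why this kind of search can avoid scanning every candidate; the retrieved paths would then be mapped back to the items assigned to them before reranking.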
-
Spatio-Temporal Point Processes with Attention for Traffic Congestion Event Modeling
Authors:
Shixiang Zhu,
Ruyi Ding,
Minghe Zhang,
Pascal Van Hentenryck,
Yao Xie
Abstract:
We present a novel framework for modeling traffic congestion events over road networks. Using multi-modal data that combines count data from traffic sensors with police reports of traffic incidents, we aim to capture two types of triggering effects for congestion events. Current traffic congestion at one location may cause future congestion over the road network, and traffic incidents may cause congestion to spread. To model the non-homogeneous temporal dependence of the event on the past, we use a novel attention-based mechanism based on neural network embedding for point processes. To incorporate the directional spatial dependence induced by the road network, we adapt the "tail-up" model from the context of spatial statistics to the traffic network setting. We demonstrate our approach's superior performance compared to the state-of-the-art methods for both synthetic and real data.
Submitted 31 May, 2021; v1 submitted 15 May, 2020;
originally announced May 2020.
-
Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems
Authors:
Weijie Zhao,
Ronglai Jia,
Yulei Qian,
Ruiquan Ding,
Mingming Sun,
** Li
Abstract:
Neural networks of ads systems usually take input from multiple resources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in online advertising industries can have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory on a computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB of parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system. A 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster. In addition, the price-performance ratio of our proposed system is 4-9 times better than an MPI-cluster solution.
Submitted 12 March, 2020;
originally announced March 2020.
-
Deep Fourier Kernel for Self-Attentive Point Processes
Authors:
Shixiang Zhu,
Minghe Zhang,
Ruyi Ding,
Yao Xie
Abstract:
We present a novel attention-based model for discrete event data to capture complex non-linear temporal dependence structures. We borrow the idea from the attention mechanism and incorporate it into the point processes' conditional intensity function. We further introduce a novel score function using Fourier kernel embedding, whose spectrum is represented using neural networks, which drastically differs from the traditional dot-product kernel and can capture a more complex similarity structure. We establish our approach's theoretical properties and demonstrate our approach's competitive performance compared to the state-of-the-art for synthetic and real data.
Submitted 21 February, 2021; v1 submitted 17 February, 2020;
originally announced February 2020.
-
Single-Path Mobile AutoML: Efficient ConvNet Design and NAS Hyperparameter Optimization
Authors:
Dimitrios Stamoulis,
Ruizhou Ding,
Di Wang,
Dimitrios Lymberopoulos,
Bodhi Priyantha,
Jie Liu,
Diana Marculescu
Abstract:
Can we reduce the search cost of Neural Architecture Search (NAS) from days down to only a few hours? NAS methods automate the design of Convolutional Networks (ConvNets) under hardware constraints and they have emerged as key components of AutoML frameworks. However, the NAS problem remains challenging due to the combinatorially large design space and the significant search time (at least 200 GPU-hours). In this work, we reduce the NAS search cost to less than 3 hours, while achieving state-of-the-art image classification results under mobile latency constraints. We propose a novel differentiable NAS formulation, namely Single-Path NAS, that uses one single-path over-parameterized ConvNet to encode all architectural decisions based on shared convolutional kernel parameters, hence drastically decreasing the search overhead. Single-Path NAS achieves state-of-the-art top-1 ImageNet accuracy (75.62%), hence outperforming existing mobile NAS methods in similar latency settings (~80ms). In particular, we enhance the accuracy-runtime trade-off in differentiable NAS by treating the Squeeze-and-Excitation path as a fully searchable operation with our novel single-path encoding. Our method has an overall cost of only 8 epochs (24 TPU-hours), which is up to 5,000x faster compared to prior work. Moreover, we study how different NAS formulation choices affect the performance of the designed ConvNets. Furthermore, we exploit the efficiency of our method to answer an interesting question: instead of empirically tuning the hyperparameters of the NAS solver (as in prior work), can we automatically find the hyperparameter values that yield the desired accuracy-runtime trade-off? We open-source our entire codebase at: https://github.com/dstamoulis/single-path-nas.
Submitted 1 July, 2019;
originally announced July 2019.
-
Single-Path NAS: Device-Aware Efficient ConvNet Design
Authors:
Dimitrios Stamoulis,
Ruizhou Ding,
Di Wang,
Dimitrios Lymberopoulos,
Bodhi Priyantha,
Jie Liu,
Diana Marculescu
Abstract:
Can we automatically design a Convolutional Network (ConvNet) with the highest image classification accuracy under the latency constraint of a mobile device? Neural Architecture Search (NAS) for ConvNet design is a challenging problem due to the combinatorially large design space and search time (at least 200 GPU-hours). To alleviate this complexity, we propose Single-Path NAS, a novel differentiable NAS method for designing device-efficient ConvNets in less than 4 hours. 1. Novel NAS formulation: our method introduces a single-path, over-parameterized ConvNet to encode all architectural decisions with shared convolutional kernel parameters. 2. NAS efficiency: Our method decreases the NAS search cost down to 8 epochs (30 TPU-hours), i.e., up to 5,000x faster compared to prior work. 3. On-device image classification: Single-Path NAS achieves 74.96% top-1 accuracy on ImageNet with 79ms inference latency on a Pixel 1 phone, which is state-of-the-art accuracy compared to NAS methods with similar latency (<80ms).
Submitted 10 May, 2019;
originally announced May 2019.
-
Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours
Authors:
Dimitrios Stamoulis,
Ruizhou Ding,
Di Wang,
Dimitrios Lymberopoulos,
Bodhi Priyantha,
Jie Liu,
Diana Marculescu
Abstract:
Can we automatically design a Convolutional Network (ConvNet) with the highest image classification accuracy under the runtime constraint of a mobile device? Neural architecture search (NAS) has revolutionized the design of hardware-efficient ConvNets by automating this process. However, the NAS problem remains challenging due to the combinatorially large design space, which leads to a significant search time (at least 200 GPU-hours). To alleviate this complexity, we propose Single-Path NAS, a novel differentiable NAS method for designing hardware-efficient ConvNets in less than 4 hours. Our contributions are as follows: 1. Single-path search space: Compared to previous differentiable NAS methods, Single-Path NAS uses one single-path over-parameterized ConvNet to encode all architectural decisions with shared convolutional kernel parameters, hence drastically decreasing the number of trainable parameters and reducing the search cost to a few epochs. 2. Hardware-efficient ImageNet classification: Single-Path NAS achieves 74.96% top-1 accuracy on ImageNet with 79ms latency on a Pixel 1 phone, which is state-of-the-art accuracy compared to NAS methods with similar constraints (<80ms). 3. NAS efficiency: Single-Path NAS search cost is only 8 epochs (30 TPU-hours), which is up to 5,000x faster compared to prior work. 4. Reproducibility: Unlike all recent mobile-efficient NAS methods which only release pretrained models, we open-source our entire codebase at: https://github.com/dstamoulis/single-path-nas.
Submitted 5 April, 2019;
originally announced April 2019.