-
A Unified Statistical And Computational Framework For Ex-Post Harmonisation Of Aggregate Statistics
Authors:
Cynthia A. Huang
Abstract:
Ex-post harmonisation is one of many data preprocessing processes used to combine the increasingly vast and diverse sources of data available for research and analysis. Documenting provenance and ensuring the quality of multi-source datasets is vital for ensuring trustworthy scientific research and encouraging reuse of existing harmonisation efforts. However, capturing and communicating statistically relevant properties of harmonised datasets is difficult without a universal standard for describing harmonisation operations. Our paper combines mathematical and computer science perspectives to address this need. The Crossmaps Framework defines a new approach for transforming existing variables collected under a specific measurement or classification standard to an imputed counterfactual variable indexed by some target standard. It uses computational graphs to separate intended transformation logic from actual data transformations, and avoid the risk of syntactically valid data manipulation scripts resulting in statistically questionable data. In this paper, we introduce the Crossmaps Framework through the example of ex-post harmonisation of aggregated statistics in the social sciences. We define a new provenance task abstraction, the crossmap transform, and formalise two associated objects, the shared mass array and the crossmap. We further define graph, matrix and list encodings of crossmaps and discuss resulting implications for understanding statistical properties of ex-post harmonisation and designing error minimising workflows.
Submitted 20 June, 2024;
originally announced June 2024.
-
Distribution-Free Predictive Inference under Unknown Temporal Drift
Authors:
Elise Han,
Chengpiao Huang,
Kaizheng Wang
Abstract:
Distribution-free prediction sets play a pivotal role in uncertainty quantification for complex statistical models. Their validity hinges on reliable calibration data, which may not be readily available as real-world environments often undergo unknown changes over time. In this paper, we propose a strategy for choosing an adaptive window and use the data therein to construct prediction sets. The window is selected by optimizing an estimated bias-variance tradeoff. We provide sharp coverage guarantees for our method, showing its adaptivity to the underlying temporal drift. We also illustrate its efficacy through numerical experiments on synthetic and real data.
Submitted 10 June, 2024;
originally announced June 2024.
-
An Efficient Quasi-Random Sampling for Copulas
Authors:
Sumin Wang,
Chenxian Huang,
Yongdao Zhou,
Min-Qian Liu
Abstract:
This paper examines an efficient method for quasi-random sampling of copulas in Monte Carlo computations. Traditional methods, like conditional distribution methods (CDM), have limitations when dealing with high-dimensional or implicit copulas, which refer to those that cannot be accurately represented by existing parametric copulas. Instead, this paper proposes the use of generative models, such as Generative Adversarial Networks (GANs), to generate quasi-random samples for any copula. GANs are a type of implicit generative models used to learn the distribution of complex data, thus facilitating easy sampling. In our study, GANs are employed to learn the mapping from a uniform distribution to copulas. Once this mapping is learned, obtaining quasi-random samples from the copula only requires inputting quasi-random samples from the uniform distribution. This approach offers a more flexible method for any copula. Additionally, we provide theoretical analysis of quasi-Monte Carlo estimators based on quasi-random samples of copulas. Through simulated and practical applications, particularly in the field of risk management, we validate the proposed method and demonstrate its superiority over various existing methods.
Submitted 8 March, 2024;
originally announced March 2024.
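The abstract above describes pushing quasi-random uniform points through a mapping onto a copula. The following is only a minimal sketch of that idea, not the authors' method: the paper learns the mapping with a GAN, whereas this sketch substitutes the closed-form conditional inverse of a bivariate Clayton copula (the kind of conditional distribution method the abstract contrasts with), purely so the example is self-contained. The function names, the Halton sequence as the quasi-random source, and the parameter `theta = 2.0` are all assumptions made for illustration.

```python
def van_der_corput(i: int, base: int) -> float:
    """Radical inverse of the integer i in the given base, a value in [0, 1)."""
    result, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        result += digit / denom
    return result


def halton_2d(n: int) -> list[tuple[float, float]]:
    """First n points of the 2-D Halton sequence (bases 2 and 3).

    Indexing starts at 1 so that no coordinate is exactly 0.
    """
    return [(van_der_corput(i, 2), van_der_corput(i, 3)) for i in range(1, n + 1)]


def clayton_map(v1: float, v2: float, theta: float = 2.0) -> tuple[float, float]:
    """Map a uniform point (v1, v2) to a point on a bivariate Clayton copula.

    Stand-in for a learned mapping: inverts the conditional distribution
    of the second coordinate given the first. theta > 0 is a hypothetical
    dependence parameter.
    """
    u1 = v1
    inner = u1 ** (-theta) * (v2 ** (-theta / (1.0 + theta)) - 1.0) + 1.0
    u2 = inner ** (-1.0 / theta)
    return u1, u2


# Quasi-random uniforms in, quasi-random copula samples out.
points = halton_2d(512)
samples = [clayton_map(v1, v2) for v1, v2 in points]
```

Replacing `clayton_map` with any other deterministic map from the unit square to the copula (for example, a trained generator) leaves the rest of the sketch unchanged, which is the design point the abstract makes.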
-
Model Assessment and Selection under Temporal Distribution Shift
Authors:
Elise Han,
Chengpiao Huang,
Kaizheng Wang
Abstract:
We investigate model assessment and selection in a changing environment, by synthesizing datasets from both the current time period and historical epochs. To tackle unknown and potentially arbitrary temporal distribution shift, we develop an adaptive rolling window approach to estimate the generalization error of a given model. This strategy also facilitates the comparison between any two candidate models by estimating the difference of their generalization errors. We further integrate pairwise comparisons into a single-elimination tournament, achieving near-optimal model selection from a collection of candidates. Theoretical analyses and numerical experiments demonstrate the adaptivity of our proposed methods to the non-stationarity in data.
Submitted 3 June, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
Factor Importance Ranking and Selection using Total Indices
Authors:
Chaofan Huang,
V. Roshan Joseph
Abstract:
Factor importance measures the impact of each feature on output prediction accuracy. Many existing works focus on the model-based importance, but an important feature in one learning algorithm may hold little significance in another model. Hence, a factor importance measure ought to characterize the feature's predictive potential without relying on a specific prediction algorithm. Such algorithm-agnostic importance is termed as intrinsic importance in Williamson et al. (2023), but their estimator again requires model fitting. To bypass the modeling step, we present the equivalence between predictiveness potential and total Sobol' indices from global sensitivity analysis, and introduce a novel consistent estimator that can be directly estimated from noisy data. Integrating with forward selection and backward elimination gives rise to FIRST, Factor Importance Ranking and Selection using Total (Sobol') indices. Extensive simulations are provided to demonstrate the effectiveness of FIRST on regression and binary classification problems, and a clear advantage over the state-of-the-art methods.
Submitted 11 January, 2024; v1 submitted 1 January, 2024;
originally announced January 2024.
-
Towards Human-like Perception: Learning Structural Causal Model in Heterogeneous Graph
Authors:
Tianqianjin Lin,
Kaisong Song,
Zhuoren Jiang,
Yangyang Kang,
Weikang Yuan,
Xurui Li,
Changlong Sun,
Cui Huang,
Xiaozhong Liu
Abstract:
Heterogeneous graph neural networks have become popular in various domains. However, their generalizability and interpretability are limited due to the discrepancy between their inherent inference flows and human reasoning logic or underlying causal relationships for the learning problem. This study introduces a novel solution, HG-SCM (Heterogeneous Graph as Structural Causal Model). It can mimic the human perception and decision process through two key steps: constructing intelligible variables based on semantics derived from the graph schema and automatically learning task-level causal relationships among these variables by incorporating advanced causal discovery techniques. We compared HG-SCM to seven state-of-the-art baseline models on three real-world datasets, under three distinct and ubiquitous out-of-distribution settings. HG-SCM achieved the highest average performance rank with minimal standard deviation, substantiating its effectiveness and superiority in terms of both predictive power and generalizability. Additionally, the visualization and analysis of the auto-learned causal diagrams for the three tasks aligned well with domain knowledge and human cognition, demonstrating prominent interpretability. HG-SCM's human-like nature and its enhanced generalizability and interpretability make it a promising solution for special scenarios where transparency and trustworthiness are paramount.
Submitted 9 December, 2023;
originally announced December 2023.
-
A Stability Principle for Learning under Non-Stationarity
Authors:
Chengpiao Huang,
Kaizheng Wang
Abstract:
We develop a versatile framework for statistical learning in non-stationary environments. In each time period, our approach applies a stability principle to select a look-back window that maximizes the utilization of historical data while keeping the cumulative bias within an acceptable range relative to the stochastic error. Our theory showcases the adaptability of this approach to unknown non-stationarity. The regret bound is minimax optimal up to logarithmic factors when the population losses are strongly convex, or Lipschitz only. At the heart of our analysis lie two novel components: a measure of similarity between functions and a segmentation technique for dividing the non-stationary data sequence into quasi-stationary pieces.
Submitted 22 January, 2024; v1 submitted 27 October, 2023;
originally announced October 2023.
-
Enhancing Sample Quality through Minimum Energy Importance Weights
Authors:
Chaofan Huang,
V. Roshan Joseph
Abstract:
Importance sampling is a powerful tool for correcting the distributional mismatch in many statistical and machine learning problems, but in practice its performance is limited by the usage of simple proposals whose importance weights can be computed analytically. To address this limitation, Liu and Lee (2017) proposed a Black-Box Importance Sampling (BBIS) algorithm that computes the importance weights for arbitrary simulated samples by minimizing the kernelized Stein discrepancy. However, this requires knowing the score function of the target distribution, which is not easy to compute for many Bayesian problems. Hence, in this paper we propose another novel BBIS algorithm using minimum energy design, BBIS-MED, that requires only the unnormalized density function, which can be utilized as a post-processing step to improve the quality of Markov Chain Monte Carlo samples. We demonstrate the effectiveness and wide applicability of our proposed BBIS-MED algorithm on extensive simulations and a real-world Bayesian model calibration problem where the score function cannot be derived analytically.
Submitted 31 December, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
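As background for the abstract above, the following minimal sketch shows ordinary self-normalized importance sampling, where the weights need only an unnormalized target density. It is not BBIS or BBIS-MED: here the proposal density is known analytically, which is exactly the restriction the paper aims to remove. The target (a normal density centred at 1, known up to a constant), the proposal scale, and all function names are assumptions chosen for illustration.

```python
import math
import random


def unnormalized_target(x: float) -> float:
    """Hypothetical target: N(1, 1) density up to its normalizing constant."""
    return math.exp(-0.5 * (x - 1.0) ** 2)


def proposal_pdf(x: float, scale: float = 2.0) -> float:
    """Density of the N(0, scale^2) proposal, which must be known analytically."""
    return math.exp(-0.5 * (x / scale) ** 2) / (scale * math.sqrt(2.0 * math.pi))


def self_normalized_is(n: int, seed: int = 0, scale: float = 2.0):
    """Draw n proposal samples and return them with normalized importance weights."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, scale) for _ in range(n)]
    raw = [unnormalized_target(x) / proposal_pdf(x, scale) for x in xs]
    total = sum(raw)
    # Dividing by the total cancels the unknown normalizing constant of the target.
    weights = [w / total for w in raw]
    return xs, weights


xs, weights = self_normalized_is(20000)
# Weighted average of the samples estimates the target mean.
mean_estimate = sum(w * x for w, x in zip(weights, xs))
```

Because the weights are normalized to sum to one, the unknown constant of the target never has to be computed; the estimate should land near the assumed target mean of 1.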
-
Asset Bundling for Wind Power Forecasting
Authors:
Hanyu Zhang,
Mathieu Tanneau,
Chaofan Huang,
V. Roshan Joseph,
Shangkun Wang,
Pascal Van Hentenryck
Abstract:
The growing penetration of intermittent, renewable generation in US power grids, especially wind and solar generation, results in increased operational uncertainty. In that context, accurate forecasts are critical, especially for wind generation, which exhibits large variability and is historically harder to predict. To overcome this challenge, this work proposes a novel Bundle-Predict-Reconcile (BPR) framework that integrates asset bundling, machine learning, and forecast reconciliation techniques. The BPR framework first learns an intermediate hierarchy level (the bundles), then predicts wind power at the asset, bundle, and fleet level, and finally reconciles all forecasts to ensure consistency. This approach effectively introduces an auxiliary learning task (predicting the bundle-level time series) to help the main learning tasks. The paper also introduces new asset-bundling criteria that capture the spatio-temporal dynamics of wind power time series. Extensive numerical experiments are conducted on an industry-size dataset of 283 wind farms in the MISO footprint. The experiments consider short-term and day-ahead forecasts, and evaluate a large variety of forecasting models that include weather predictions as covariates. The results demonstrate the benefits of BPR, which consistently and significantly improves forecast accuracy over baselines, especially at the fleet level.
Submitted 28 September, 2023;
originally announced September 2023.
-
Quantile regression outcome-adaptive lasso: variable selection for causal quantile treatment effect estimation
Authors:
Yahang Liu,
Kecheng Wei,
Chen Huang,
Yongfu Yu,
Guoyou Qin
Abstract:
Quantile treatment effects (QTEs) can characterize the potentially heterogeneous causal effect of a treatment on different points of the entire outcome distribution. Propensity score (PS) methods are commonly employed for estimating QTEs in non-randomized studies. Empirical and theoretical studies have shown that insufficient and unnecessary adjustment for covariates in PS models can lead to bias and efficiency loss in estimating treatment effects. Striking a balance between bias and efficiency through variable selection is a crucial concern in causal inference. It is essential to acknowledge that the covariates related to treatment and outcome may vary across different quantiles of the outcome distribution. However, previous studies have neglected to adjust for different covariates separately in the PS models when estimating different QTEs. In this article, we propose the quantile regression outcome-adaptive lasso (QROAL) method to select covariates that can provide unbiased and efficient estimates of QTEs. A distinctive feature of our proposed method is the utilization of linear quantile regression models for constructing penalty weights, enabling covariate selection in PS models separately when estimating different QTEs. We conducted simulation studies to show the superiority of our proposed method over the outcome-adaptive lasso (OAL) method in variable selection. Moreover, the proposed method exhibited favorable performance compared to the OAL method in terms of root mean square error in a range of settings, including both homogeneous and heterogeneous scenarios. Additionally, we applied the QROAL method to datasets from the China Health and Retirement Longitudinal Study (CHARLS) to explore the impact of smoking status on the severity of depression symptoms.
Submitted 14 August, 2023; v1 submitted 10 August, 2023;
originally announced August 2023.
-
On an Interpretation of ResNets via Solution Constructions
Authors:
Changcun Huang
Abstract:
This paper first constructs a typical solution of ResNets for multi-category classifications by the principle of gate-network controls and deep-layer classifications, from which a general interpretation of the ResNet architecture is given and the performance mechanism is explained. We then use more solutions to further demonstrate the generality of that interpretation. The universal-approximation capability of ResNets is proved.
Submitted 23 December, 2022; v1 submitted 11 December, 2022;
originally announced December 2022.
-
Adaptive Exploration and Optimization of Materials Crystal Structures
Authors:
Arvind Krishna,
Huan Tran,
Chaofan Huang,
Rampi Ramprasad,
V. Roshan Joseph
Abstract:
A central problem of materials science is to determine whether a hypothetical material is stable without being synthesized, which is mathematically equivalent to a global optimization problem on a highly non-linear and multi-modal potential energy surface (PES). This optimization problem poses multiple outstanding challenges, including the exceedingly high dimensionality of the PES and that PES must be constructed from a reliable, sophisticated, parameters-free, and thus, very expensive computational method, for which density functional theory (DFT) is an example. DFT is a quantum mechanics based method that can predict, among other things, the total potential energy of a given configuration of atoms. DFT, while accurate, is computationally expensive. In this work, we propose a novel expansion-exploration-exploitation framework to find the global minimum of the PES. Starting from a few atomic configurations, this ``known'' space is expanded to construct a big candidate set. The expansion begins in a non-adaptive manner, where new configurations are added without considering their potential energy. A novel feature of this step is that it tends to generate a space-filling design without the knowledge of the boundaries of the domain space. If needed, the non-adaptive expansion of the space of configurations is followed by adaptive expansion, where ``promising regions'' of the domain space (those with low energy configurations) are further expanded. Once a candidate set of configurations is obtained, it is simultaneously explored and exploited using Bayesian optimization to find the global minimum. The methodology is demonstrated using a problem of finding the most stable crystal structure of Aluminum.
Submitted 1 December, 2022;
originally announced December 2022.
-
Quantifying the Impact of Label Noise on Federated Learning
Authors:
Shuqi Ke,
Chao Huang,
Xin Liu
Abstract:
Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.
Submitted 3 April, 2023; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Optimal Sub-sampling to Boost Power of Kernel Sequential Change-point Detection
Authors:
Song Wei,
Chaofan Huang
Abstract:
We present a novel scheme to boost detection power for kernel maximum mean discrepancy based sequential change-point detection procedures. Our proposed scheme features an optimal sub-sampling of the history data before the detection procedure, in order to tackle the power loss incurred by the random sub-sample from the enormous history data. We apply our proposed scheme to both Scan $B$ and Kernel Cumulative Sum (CUSUM) procedures, and improved performance is observed from extensive numerical experiments.
Submitted 13 January, 2023; v1 submitted 26 October, 2022;
originally announced October 2022.
-
From Local to Global: Spectral-Inspired Graph Neural Networks
Authors:
Ningyuan Huang,
Soledad Villar,
Carey E. Priebe,
Da Zheng,
Chengyue Huang,
Lin Yang,
Vladimir Braverman
Abstract:
Graph Neural Networks (GNNs) are powerful deep learning methods for Non-Euclidean data. Popular GNNs are message-passing algorithms (MPNNs) that aggregate and combine signals in a local graph neighborhood. However, shallow MPNNs tend to miss long-range signals and perform poorly on some heterophilous graphs, while deep MPNNs can suffer from issues like over-smoothing or over-squashing. To mitigate such issues, existing works typically borrow normalization techniques from training neural networks on Euclidean data or modify the graph structures. Yet these approaches are not well-understood theoretically and could increase the overall computational complexity. In this work, we draw inspirations from spectral graph embedding and propose $\texttt{PowerEmbed}$ -- a simple layer-wise normalization technique to boost MPNNs. We show $\texttt{PowerEmbed}$ can provably express the top-$k$ leading eigenvectors of the graph operator, which prevents over-smoothing and is agnostic to the graph topology; meanwhile, it produces a list of representations ranging from local features to global signals, which avoids over-squashing. We apply $\texttt{PowerEmbed}$ in a wide range of simulated and real graphs and demonstrate its competitive performance, particularly for heterophilous graphs.
Submitted 4 November, 2022; v1 submitted 24 September, 2022;
originally announced September 2022.
-
BioKlustering: a web app for semi-supervised learning of maximally imbalanced genomic data
Authors:
Samuel Ozminkowski,
Yuke Wu,
Liule Yang,
Zhiwen Xu,
Luke Selberg,
Chunrong Huang,
Claudia Solis-Lemus
Abstract:
Summary: Accurate phenotype prediction from genomic sequences is a highly coveted task in biological and medical research. While machine-learning holds the key to accurate prediction in a variety of fields, the complexity of biological data can render many methodologies inapplicable. We introduce BioKlustering, a user-friendly open-source and publicly available web app for unsupervised and semi-supervised learning specialized for cases when sequence alignment and/or experimental phenotyping of all classes are not possible. Among its main advantages, BioKlustering 1) allows for maximally imbalanced settings of partially observed labels including cases when only one class is observed, which is currently prohibited in most semi-supervised methods, 2) takes unaligned sequences as input and thus, allows learning for widely diverse sequences (impossible to align) such as virus and bacteria, 3) is easy to use for anyone with little or no programming expertise, and 4) works well with small sample sizes.
Availability and Implementation: BioKlustering (https://bioklustering.wid.wisc.edu) is a freely available web app implemented with Django, a Python-based framework, with all major browsers supported. The web app does not need any installation, and it is publicly available and open-source (https://github.com/solislemuslab/bioklustering).
Submitted 26 September, 2022; v1 submitted 23 September, 2022;
originally announced September 2022.
-
Research on spatial information transmission efficiency and capability of safe evacuation signs
Authors:
Ruiwen Fan,
Zhangyin Dai,
Shixiang Tian,
Ting Xia,
Hui Zhou,
Congbao Huang
Abstract:
Safety evacuation signs are indispensable indicators of spatial direction during emergency evacuation, and the spatial relationship between the signs and evacuees affects evacuees' response time and evacuation efficiency. This paper takes two common kinds of safety evacuation signs, hangtag-type and embedded, as its research objects, and studies the efficiency and capability of their spatial direction information transmission through a simulation experiment and a fire drill. The results show that the spatial angle of the hangtag-type safety evacuation sign is inversely proportional to the efficiency and capability of spatial direction information transmission, and the fire drill confirms this conclusion. When the spatial angle of the embedded safety evacuation sign is 5°, the spatial direction information transmission efficiency and capability increase. At the same time, the average escape time of the participants in the fire drill was lower, and the percentage choosing unfamiliar exits increased. The change in spatial angle has no significant effect on the response intention of subjects of different genders; when choosing the direction, males are more easily affected by the change of spatial angle than females, while the confidence level of females' choices is more easily affected by spatial angle. In addition, based on these results, corresponding three-dimensional safety evacuation signs are designed. The functional structure of the safety evacuation signs is improved, which can effectively improve the efficiency of fire emergency evacuation.
Submitted 22 April, 2022;
originally announced April 2022.
-
The Effects of Dynamic Learning and the Forgetting Process on an Optimizing Modelling for Full-Service Repair Pricing Contracts for Medical Devices
Authors:
Aiping Jiang,
Lin Li,
Xuemin Xu,
David Y. C. Huang
Abstract:
In order to improve the profitability and customer service management of original equipment manufacturers (OEMs) in a market where full-service (FS) and on-call service (OS) co-exist, this article extends the optimization model for pricing FS repair contracts with the effects of dynamic learning and forgetting. Along with considering autonomous learning in maintenance practice, this study also analyses how induced learning and the forgetting process in a workplace affect the pricing optimization model of FS contracts in a portfolio of FS and OS. A numerical analysis based on real data from the medical industry shows that the enhanced FS pricing model discussed here has two main advantages: (1) it could prominently improve repair efficiency, and (2) it helps OEMs gain better profits compared to the original FS model and sole OS maintenance. Sensitivity analysis shows that if the internal failure rate increases, the optimized FS price rises gradually until reaching its maximum value, and profitability to the OEM increases overall; if the frequency of induced learning goes up, the optimal FS price rises after a short-term downward trend, with stable profitability to the OEM.
Submitted 12 April, 2022;
originally announced April 2022.
-
Network of Low-cost Air Quality Sensor for Monitoring Indoor, Outdoor, and Personal PM2.5 Exposure in Seattle during the 2020 Wildfire Season
Authors:
Jiayang He,
Ching-Hsuan Huang,
Nanhsun Yuan,
Elena Austin,
Edmund Seto,
Igor Novosselov
Abstract:
The increased frequency of wildfires in the Western United States has raised public concerns. Exposure to wildfire smoke has been linked to an increased risk of cancer and cardiorespiratory morbidity. Evidence-driven interventions can alleviate the adverse health impact of wildfire smoke. Public health guidance during wildfires is based on regional air quality data with limited spatiotemporal resolution. We demonstrate the use of a network of low-cost particulate matter (PM) sensors to gather indoor, outdoor, and personal PM2.5 exposure data from seven locations in the urban Seattle area, along with a personal exposure monitor worn by a resident living in one of these locations during the 2020 Washington wildfire event. The data were used to determine PM concentration indoor/outdoor (I/O) ratios, PM reduction, and personal exposure levels. The results show that locations equipped with high-efficiency particulate air (HEPA) filters and HVAC filtration systems had significantly lower I/O ratios (median I/O = 0.43) than those without air filtration (median I/O = 0.82). The median PM2.5 reduction for the locations with HEPA is 58%, compared to 20% for the locations without HEPA. The outdoor PM sensors showed a high correlation with the nearby regional air quality monitoring stations ($R^2 = 0.93$). The personal monitor showed high variance in PM measurements as the user moved through different microenvironments, and this exposure could not be fully characterized by the network of indoor or outdoor monitors. The findings imply that evidence-based interventions for reducing pollution exposure can be developed based on a combination of indoor and outdoor sensors. Personal exposure monitoring in individuals' breathing zones provided the highest-fidelity data, capturing temporal spikes in PM exposure.
Submitted 26 March, 2022;
originally announced March 2022.
-
A flexible approach for variable selection in large-scale healthcare database studies with missing covariate and outcome data
Authors:
Jung-Yi Joyce Lin,
Liangyuan Hu,
Chuyue Huang,
Steven Lawrence,
Usha Govindarajulu
Abstract:
Prior work has shown that combining bootstrap imputation with tree-based machine learning variable selection methods can recover the good performance achievable on fully observed data when covariate and outcome data are missing at random (MAR). This approach, however, is computationally expensive, especially on large-scale datasets. We propose an inference-based method, called RR-BART, which leverages the likelihood-based Bayesian machine learning technique, Bayesian additive regression trees, and uses Rubin's rule to combine the estimates and variances of the variable importance measures on multiply imputed datasets for variable selection in the presence of MAR data. We conduct a representative simulation study to investigate the practical operating characteristics of RR-BART, and compare it with the bootstrap imputation based methods. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome among middle-aged women using data from the Study of Women's Health Across the Nation (SWAN). The simulation study suggests that even in complex conditions of nonlinearity and nonadditivity with a large percentage of missingness, RR-BART can reasonably recover both the prediction and variable selection performance achievable on the fully observed data. RR-BART provides the best performance that the bootstrap imputation based methods can achieve with the optimal selection threshold value. In addition, RR-BART demonstrates a substantially stronger ability to detect discrete predictors. Furthermore, RR-BART offers substantial computational savings. When implemented on the SWAN data, RR-BART adds to the literature by selecting a set of predictors that had been less commonly identified as risk factors but had substantial biological justifications.
Submitted 13 April, 2022; v1 submitted 20 July, 2021;
originally announced July 2021.
-
Non-Homogeneity Estimation and Universal Kriging on the Sphere
Authors:
Nicholas W. Bussberg,
Jacob Shields,
Chunfeng Huang
Abstract:
Kriging is a widely recognized method for making spatial predictions. On the sphere, popular methods such as ordinary kriging assume that the spatial process is intrinsically homogeneous. However, intrinsic homogeneity is too strict in many cases. This research uses intrinsic random function (IRF) theory to relax the homogeneity assumption. A key component of modeling IRF processes is estimating the degree of non-homogeneity. A graphical approach is proposed to accomplish this estimation. With the ability to estimate non-homogeneity, an IRF universal kriging procedure can be developed. Results from simulation studies are provided to demonstrate the advantage of using IRF universal kriging as opposed to ordinary kriging when the underlying process is not intrinsically homogeneous.
Submitted 6 July, 2021;
originally announced July 2021.
-
MAGI-X: Manifold-Constrained Gaussian Process Inference for Unknown System Dynamics
Authors:
Chaofan Huang,
Simin Ma,
Shihao Yang
Abstract:
Ordinary differential equations (ODEs), commonly used to characterize dynamic systems, are difficult to propose in closed form for many complicated scientific applications, even with the help of domain experts. We propose a fast and accurate data-driven method, MAGI-X, to learn the unknown dynamics from observation data in a non-parametric fashion, without the need for any domain knowledge. Unlike existing methods that mainly rely on costly numerical integration, MAGI-X uses a neural network as a powerful function approximator to learn the unknown nonlinear dynamics within the MAnifold-constrained Gaussian process Inference (MAGI) framework, which completely circumvents numerical integration. Compared against state-of-the-art methods on three realistic examples, MAGI-X achieves competitive accuracy in both fitting and forecasting while taking only a fraction of the computational time. Moreover, MAGI-X provides a practical solution for the inference of partially observed systems, which no previous method is able to handle.
Submitted 19 October, 2021; v1 submitted 26 May, 2021;
originally announced May 2021.
-
Uniform Inference on High-dimensional Spatial Panel Networks
Authors:
Victor Chernozhukov,
Chen Huang,
Weining Wang
Abstract:
We propose employing a debiased-regularized, high-dimensional generalized method of moments (GMM) framework to perform inference on large-scale spatial panel networks. In particular, network structure with a flexible sparse deviation, which can be regarded either as latent or as misspecified from a predetermined adjacency matrix, is estimated using a debiased machine learning approach. The theoretical analysis establishes the consistency and asymptotic normality of our proposed estimator, taking into account general temporal and spatial dependency inherent in the data-generating processes. The dimensionality allowance in the presence of dependency is discussed. A primary contribution of our study is the development of uniform inference theory that enables hypothesis testing on the parameters of interest, including zero or non-zero elements in the network structure. Additionally, the asymptotic properties of the estimator are derived for both linear and nonlinear moments. Simulations demonstrate superior performance of our proposed approach. Lastly, we apply our methodology to investigate the spatial network effect of stock returns.
Submitted 7 September, 2023; v1 submitted 16 May, 2021;
originally announced May 2021.
-
Constrained Minimum Energy Designs
Authors:
Chaofan Huang,
V. Roshan Joseph,
Douglas M. Ray
Abstract:
Space-filling designs are important in computer experiments, which are critical for building a cheap surrogate model that adequately approximates an expensive computer code. Many design construction techniques in the existing literature are only applicable to rectangular bounded spaces, but in real-world applications, the input space can often be non-rectangular because of constraints on the input variables. One solution to generate designs in a constrained space is to first generate uniformly distributed samples in the feasible region, and then use them as the candidate set to construct the designs. Sequentially Constrained Monte Carlo (SCMC) is the state-of-the-art technique for candidate generation, but it still requires a large number of constraint evaluations, which is problematic especially when the constraints are expensive to evaluate. Thus, to reduce constraint evaluations and improve efficiency, we propose the Constrained Minimum Energy Design (CoMinED), which utilizes recent advances in deterministic sampling methods. Extensive simulation results on 15 benchmark problems with dimensions ranging from 2 to 13 demonstrate the improved performance of CoMinED over the existing methods.
Submitted 24 April, 2021;
originally announced April 2021.
-
Regression Modeling for Recurrent Events Using R Package reReg
Authors:
Sy Han Chiou,
Gongjun Xu,
Jun Yan,
Chiung-Yu Huang
Abstract:
Recurrent event analyses have found a wide range of applications in biomedicine, public health, and engineering, among others, where study subjects may experience a sequence of events of interest during follow-up. The R package reReg (Chiou and Huang 2021) offers a comprehensive collection of practical and easy-to-use tools for regression analysis of recurrent events, possibly in the presence of an informative terminal event. The regression framework is a general scale-change model which encompasses the popular Cox-type model, the accelerated rate model, and the accelerated mean model as special cases. Informative censoring is accommodated through a subject-specific frailty with no need for parametric specification. Different regression models are allowed for the recurrent event process and the terminal event. Also included are visualization and simulation tools.
Submitted 20 August, 2022; v1 submitted 23 April, 2021;
originally announced April 2021.
-
Hidden Technical Debts for Fair Machine Learning in Financial Services
Authors:
Chong Huang,
Arash Nourian,
Kevin Griest
Abstract:
Recent advancements in machine learning (ML) have demonstrated the potential for providing a powerful solution to build complex prediction systems in a short time. However, in highly regulated industries, such as financial technology (Fintech), people have raised concerns about the risk of ML systems discriminating against specific protected groups or individuals. To address these concerns, researchers have introduced various mathematical fairness metrics and bias mitigation algorithms. This paper discusses hidden technical debts and challenges of building fair ML systems in a production environment for Fintech. We explore various stages that require attention for fairness in the ML system development and deployment life cycle. To identify hidden technical debts that exist in building fair ML systems for Fintech, we focus on key pipeline stages including data preparation, model development, system monitoring and integration in production. Our analysis shows that enforcing fairness for production-ready ML systems in Fintech requires specific engineering commitments at different stages of the ML system life cycle. We also propose several initial starting points to mitigate these technical debts for deploying fair ML systems in production.
Submitted 21 March, 2021; v1 submitted 18 March, 2021;
originally announced March 2021.
-
Towards Practical Robustness Analysis for DNNs based on PAC-Model Learning
Authors:
Renjue Li,
Pengfei Yang,
Cheng-Chao Huang,
Youcheng Sun,
Bai Xue,
Lijun Zhang
Abstract:
To analyse local robustness properties of deep neural networks (DNNs), we present a practical framework from a model learning perspective. Based on black-box model learning with scenario optimisation, we abstract the local behaviour of a DNN via an affine model with the probably approximately correct (PAC) guarantee. From the learned model, we can infer the corresponding PAC-model robustness property. The innovation of our work is the integration of model learning into PAC robustness analysis: that is, we construct a PAC guarantee on the model level instead of sample distribution, which induces a more faithful and accurate robustness evaluation. This is in contrast to existing statistical methods without model learning. We implement our method in a prototypical tool named DeepPAC. As a black-box method, DeepPAC is scalable and efficient, especially when DNNs have complex structures or high-dimensional inputs. We extensively evaluate DeepPAC, with 4 baselines (using formal verification, statistical methods, testing and adversarial attack) and 20 DNN models across 3 datasets, including MNIST, CIFAR-10, and ImageNet. It is shown that DeepPAC outperforms the state-of-the-art statistical method PROVERO, and it achieves more practical robustness analysis than the formal verification tool ERAN. Also, its results are consistent with existing DNN testing work like DeepGini.
Submitted 13 April, 2022; v1 submitted 25 January, 2021;
originally announced January 2021.
-
Population Quasi-Monte Carlo
Authors:
Chaofan Huang,
V. Roshan Joseph,
Simon Mak
Abstract:
Monte Carlo methods are widely used for approximating complicated, multidimensional integrals for Bayesian inference. Population Monte Carlo (PMC) is an important class of Monte Carlo methods, which utilizes a population of proposals to generate weighted samples that approximate the target distribution. The generic PMC framework iterates over three steps: samples are simulated from a set of proposals, weights are assigned to such samples to correct for mismatch between the proposal and target distributions, and the proposals are then adapted via resampling from the weighted samples. When the target distribution is expensive to evaluate, the PMC has its computational limitation since the convergence rate is $\mathcal{O}(N^{-1/2})$. To address this, we propose in this paper a new Population Quasi-Monte Carlo (PQMC) framework, which integrates Quasi-Monte Carlo ideas within the sampling and adaptation steps of PMC. A key novelty in PQMC is the idea of importance support points resampling, a deterministic method for finding an "optimal" subsample from the weighted proposal samples. Moreover, within the PQMC framework, we develop an efficient covariance adaptation strategy for multivariate normal proposals. Lastly, a new set of correction weights is introduced for the weighted PMC estimator to improve the efficiency from the standard PMC estimator. We demonstrate the improved empirical convergence of PQMC over PMC in extensive numerical simulations and a friction drilling application.
Submitted 26 December, 2020;
originally announced December 2020.
-
Guiding Neural Network Initialization via Marginal Likelihood Maximization
Authors:
Anthony S. Tai,
Chunfeng Huang
Abstract:
We propose a simple, data-driven approach to help guide hyperparameter selection for neural network initialization. We leverage the relationship between neural network and Gaussian process models having corresponding activation and covariance functions to infer the hyperparameter values desirable for model initialization. Our experiment shows that marginal likelihood maximization provides recommendations that yield near-optimal prediction performance on the MNIST classification task under experiment constraints. Furthermore, our empirical results indicate consistency in the proposed technique, suggesting that the computational cost of the procedure could be significantly reduced with smaller training sets.
Submitted 17 December, 2020;
originally announced December 2020.
-
Deep Unsupervised Image Anomaly Detection: An Information Theoretic Framework
Authors:
Fei Ye,
Huangjie Zheng,
Chaoqin Huang,
Ya Zhang
Abstract:
Surrogate task based methods have recently shown great promise for unsupervised image anomaly detection. However, there is no guarantee that the surrogate tasks share a consistent optimization direction with anomaly detection. In this paper, we return to a direct objective function for anomaly detection with information theory, which maximizes the distance between normal and anomalous data in terms of the joint distribution of images and their representation. Unfortunately, this objective function is not directly optimizable under the unsupervised setting where no anomalous data is provided during training. Through mathematical analysis of the above objective function, we manage to decompose it into four components. In order to optimize in an unsupervised fashion, we show that, under the assumption that the distributions of the normal and anomalous data are separable in the latent space, its lower bound can be considered as a function which weights the trade-off between mutual information and entropy. This objective function is able to explain why the surrogate task based methods are effective for anomaly detection and further points out the potential direction of improvement. Based on this objective function, we introduce a novel information theoretic framework for unsupervised image anomaly detection. Extensive experiments have demonstrated that the proposed framework significantly outperforms several state-of-the-art methods on multiple benchmark data sets.
Submitted 8 December, 2020;
originally announced December 2020.
-
RealCause: Realistic Causal Inference Benchmarking
Authors:
Brady Neal,
Chin-Wei Huang,
Sunand Raghupathi
Abstract:
There are many different causal effect estimators in causal inference. However, it is unclear how to choose between these estimators because there is no ground-truth for causal effects. A commonly used option is to simulate synthetic data, where the ground-truth is known. However, the best causal estimators on synthetic data are unlikely to be the best causal estimators on real data. An ideal benchmark for causal estimators would both (a) yield ground-truth values of the causal effects and (b) be representative of real data. Using flexible generative models, we provide a benchmark that both yields ground-truth and is realistic. Using this benchmark, we evaluate over 1500 different causal estimators and provide evidence that it is rational to choose hyperparameters for causal estimators using predictive metrics.
Submitted 29 March, 2021; v1 submitted 30 November, 2020;
originally announced November 2020.
-
Freecyto: Quantized Flow Cytometry Analysis for the Web
Authors:
Nathan Wong,
Daehwan Kim,
Zachery Robinson,
Connie Huang,
Irina M. Conboy
Abstract:
Flow cytometry (FCM) is an analytic technique that is capable of detecting and recording the emission of fluorescence and light scattering of cells or particles (that are collectively called "events") in a population. A typical FCM experiment can produce a large array of data making the analysis computationally intensive. Current FCM data analysis platforms (FlowJo, etc.), while very useful, do not allow interactive data processing online due to the data size limitations. Here we report a more effective way to analyze FCM data. Freecyto is a free, easy-to-learn, Python-flask-based web application that uses a weighted k-means clustering algorithm to facilitate the interactive analysis of flow cytometry data. A key limitation of web browsers is their inability to interactively display large amounts of data. Freecyto addresses this bottleneck through the use of the k-means algorithm to quantize the data, allowing the user to access a representative set of data points for interactive visualization of complex datasets. Moreover, Freecyto enables the interactive analyses of large complex datasets while preserving the standard FCM visualization features, such as the generation of scatterplots (dotplots), histograms, heatmaps, boxplots, as well as a SQL-based sub-population gating feature. We also show that Freecyto can be applied to the analysis of various experimental setups that frequently require the use of FCM. Finally, we demonstrate that the data accuracy is preserved when Freecyto is compared to conventional FCM software.
Submitted 19 November, 2020;
originally announced November 2020.
-
Dynamic Risk Prediction Triggered by Intermediate Events Using Survival Tree Ensembles
Authors:
Yifei Sun,
Sy Han Chiou,
Colin O. Wu,
Meghan McGarry,
Chiung-Yu Huang
Abstract:
With the availability of massive amounts of data from electronic health records and registry databases, incorporating time-varying patient information to improve risk prediction has attracted great attention. To exploit the growing amount of predictor information over time, we develop a unified framework for landmark prediction using survival tree ensembles, where an updated prediction can be performed when new information becomes available. Compared to conventional landmark prediction with fixed landmark times, our methods allow the landmark times to be subject-specific and triggered by an intermediate clinical event. Moreover, the nonparametric approach circumvents the thorny issue of model incompatibility at different landmark times. In our framework, both the longitudinal predictors and the event time outcome are subject to right censoring, and thus existing tree-based approaches cannot be directly applied. To tackle the analytical challenges, we propose a risk-set-based ensemble procedure by averaging martingale estimating equations from individual trees. Extensive simulation studies are conducted to evaluate the performance of our methods. The methods are applied to the Cystic Fibrosis Foundation Patient Registry (CFFPR) data to perform dynamic prediction of lung disease in cystic fibrosis patients and to identify important prognostic factors.
Submitted 25 August, 2022; v1 submitted 13 November, 2020;
originally announced November 2020.
-
Multivariate Time-series Anomaly Detection via Graph Attention Network
Authors:
Hang Zhao,
Yujing Wang,
Juanyong Duan,
Congrui Huang,
Defu Cao,
Yunhai Tong,
Bixiong Xu,
Jing Bai,
Jie Tong,
Qi Zhang
Abstract:
Anomaly detection on multivariate time-series is of great importance in both data mining research and industrial applications. Recent approaches have achieved significant progress on this topic, but limitations remain. One major limitation is that they do not capture the relationships between different time-series explicitly, resulting in inevitable false alarms. In this paper, we propose a novel self-supervised framework for multivariate time-series anomaly detection to address this issue. Our framework considers each univariate time-series as an individual feature and includes two graph attention layers in parallel to learn the complex dependencies of multivariate time-series in both temporal and feature dimensions. In addition, our approach jointly optimizes a forecasting-based model and a reconstruction-based model, obtaining better time-series representations through a combination of single-timestamp prediction and reconstruction of the entire time-series. We demonstrate the efficacy of our model through extensive experiments. The proposed method outperforms other state-of-the-art models on three real-world datasets. Further analysis shows that our method has good interpretability and is useful for anomaly diagnosis.
Submitted 4 September, 2020;
originally announced September 2020.
-
Recommender Systems for the Internet of Things: A Survey
Authors:
May Altulyan,
Lina Yao,
Xianzhi Wang,
Chaoran Huang,
Salil S Kanhere,
Quan Z Sheng
Abstract:
Recommendation represents a vital stage in developing and promoting the benefits of the Internet of Things (IoT). Traditional recommender systems fail to exploit ever-growing, dynamic, and heterogeneous IoT data. This paper presents a comprehensive review of state-of-the-art recommender systems, as well as related techniques and applications in the vibrant field of IoT. We discuss several limitations of applying recommendation systems to IoT and propose a reference framework for comparing existing studies to guide future research and practices.
Submitted 13 July, 2020;
originally announced July 2020.
-
A Benchmark of Medical Out of Distribution Detection
Authors:
Tianshi Cao,
Chin-Wei Huang,
David Yu-Tung Hui,
Joseph Paul Cohen
Abstract:
Motivation: Deep learning models deployed for use on medical tasks can be equipped with Out-of-Distribution Detection (OoDD) methods in order to avoid erroneous predictions. However it is unclear which OoDD method should be used in practice. Specific Problem: Systems trained for one particular domain of images cannot be expected to perform accurately on images of a different domain. These images should be flagged by an OoDD method prior to diagnosis. Our approach: This paper defines 3 categories of OoD examples and benchmarks popular OoDD methods in three domains of medical imaging: chest X-ray, fundus imaging, and histology slides. Results: Our experiments show that despite methods yielding good results on some categories of out-of-distribution samples, they fail to recognize images close to the training distribution. Conclusion: We find a simple binary classifier on the feature representation has the best accuracy and AUPRC on average. Users of diagnostic tools which employ these OoDD methods should still remain vigilant that images very close to the training distribution yet not in it could yield unexpected results.
Submitted 4 August, 2020; v1 submitted 8 July, 2020;
originally announced July 2020.
-
AR-DAE: Towards Unbiased Neural Entropy Gradient Estimation
Authors:
Jae Hyun Lim,
Aaron Courville,
Christopher Pal,
Chin-Wei Huang
Abstract:
Entropy is ubiquitous in machine learning, but it is in general intractable to compute the entropy of the distribution of an arbitrary continuous random variable. In this paper, we propose the amortized residual denoising autoencoder (AR-DAE) to approximate the gradient of the log density function, which can be used to estimate the gradient of entropy. Amortization allows us to significantly reduce the error of the gradient approximator by approaching asymptotic optimality of a regular DAE, in which case the estimation is in theory unbiased. We conduct theoretical and experimental analyses on the approximation error of the proposed method, as well as extensive studies on heuristics to ensure its robustness. Finally, using the proposed gradient approximator to estimate the gradient of entropy, we demonstrate state-of-the-art performance on density estimation with variational autoencoders and continuous control with soft actor-critic.
Submitted 9 June, 2020;
originally announced June 2020.
-
Recurrent Events Analysis With Data Collected at Informative Clinical Visits in Electronic Health Records
Authors:
Yifei Sun,
Charles E. McCulloch,
Kieren A. Marr,
Chiung-Yu Huang
Abstract:
Although increasingly used as a data resource for assembling cohorts, electronic health records (EHRs) pose many analytic challenges. In particular, a patient's health status influences when and what data are recorded, generating sampling bias in the collected data. In this paper, we consider recurrent event analysis using EHR data. Conventional regression methods for event risk analysis usually require the values of covariates to be observed throughout the follow-up period. In EHR databases, time-dependent covariates are intermittently measured during clinical visits, and the timing of these visits is informative in the sense that it depends on the disease course. Simple methods, such as the last-observation-carried-forward approach, can lead to biased estimation. On the other hand, complex joint models require additional assumptions on the covariate process and cannot be easily extended to handle multiple longitudinal predictors. By incorporating sampling weights derived from estimating the observation time process, we develop a novel estimation procedure based on inverse-rate-weighting and kernel-smoothing for the semiparametric proportional rate model of recurrent events. The proposed methods do not require model specifications for the covariate processes and can easily handle multiple time-dependent covariates. Our methods are applied to a kidney transplant study for illustration.
Submitted 25 April, 2020;
originally announced April 2020.
-
Knowledge-guided Deep Reinforcement Learning for Interactive Recommendation
Authors:
Xiaocong Chen,
Chaoran Huang,
Lina Yao,
Xianzhi Wang,
Wei Liu,
Wenjie Zhang
Abstract:
Interactive recommendation aims to learn from dynamic interactions between items and users to achieve responsiveness and accuracy. Reinforcement learning is inherently advantageous for coping with dynamic environments and thus has attracted increasing attention in interactive recommendation research. Inspired by knowledge-aware recommendation, we propose Knowledge-Guided deep Reinforcement learning (KGRL) to harness the advantages of both reinforcement learning and knowledge graphs for interactive recommendation. This model is implemented upon the actor-critic network framework. It maintains a local knowledge network to guide decision-making and employs the attention mechanism to capture long-term semantics between items. We have conducted comprehensive experiments in a simulated online environment with six public real-world datasets and demonstrated the superiority of our model over several state-of-the-art methods.
Submitted 17 April, 2020;
originally announced April 2020.
-
Augmented Normalizing Flows: Bridging the Gap Between Generative Flows and Latent Variable Models
Authors:
Chin-Wei Huang,
Laurent Dinh,
Aaron Courville
Abstract:
In this work, we propose a new family of generative flows on an augmented data space, with an aim to improve expressivity without drastically increasing the computational cost of sampling and evaluation of a lower bound on the likelihood. Theoretically, we prove the proposed flow can approximate a Hamiltonian ODE as a universal transport map. Empirically, we demonstrate state-of-the-art performance on standard benchmarks of flow-based generative modeling.
Submitted 17 February, 2020;
originally announced February 2020.
-
Representation Learning on Variable Length and Incomplete Wearable-Sensory Time Series
Authors:
Xian Wu,
Chao Huang,
Pablo Roblesgranda,
Nitesh Chawla
Abstract:
The prevalence of wearable sensors (e.g., smart wristband) is creating unprecedented opportunities to not only inform health and wellness states of individuals, but also assess and infer personal attributes, including demographic and personality attributes. However, the data captured from wearables, such as heart rate or number of steps, present two key challenges: 1) the time series is often of variable length and incomplete due to different data collection periods (e.g., wearing behavior varies by person); and 2) inter-individual variability to external factors like stress and environment. This paper addresses these challenges and brings us closer to the potential of personalized insights about an individual, taking the leap from quantified self to qualified self. Specifically, HeartSpace proposed in this paper encodes time series data with variable length and missing values via the integration of a time series encoding module and a pattern aggregation network. Additionally, HeartSpace implements a Siamese-triplet network to optimize representations by jointly capturing intra- and inter-series correlations during the embedding learning process. The empirical evaluation over two different real-world datasets presents significant performance gains over state-of-the-art baselines in a variety of applications, including personality prediction, demographics inference, and user identification.
Submitted 27 May, 2020; v1 submitted 10 February, 2020;
originally announced February 2020.
-
Faster On-Device Training Using New Federated Momentum Algorithm
Authors:
Zhouyuan Huo,
Qian Yang,
Bin Gu,
Lawrence Carin,
Heng Huang
Abstract:
Mobile crowdsensing has gained significant attention in recent years and has become a critical paradigm for emerging Internet of Things applications. The sensing devices continuously generate a significant quantity of data, which provide tremendous opportunities to develop innovative intelligent applications. To utilize these data to train machine learning models while not compromising user privacy, federated learning has become a promising solution. However, there is little understanding of whether federated learning algorithms are guaranteed to converge. We reconsider model averaging in federated learning and formulate it as a gradient-based method with biased gradients. This novel perspective assists analysis of its convergence rate and provides a new direction for more acceleration. We prove for the first time that the federated averaging algorithm is guaranteed to converge for non-convex problems, without imposing additional assumptions. We further propose a novel accelerated federated learning algorithm and provide a convergence guarantee. Simulated federated learning experiments are conducted to train deep neural networks on benchmark datasets, and experimental results show that our proposed method converges faster than previous approaches.
Submitted 5 February, 2020;
originally announced February 2020.
-
AutoShrink: A Topology-aware NAS for Discovering Efficient Neural Architecture
Authors:
Tunhou Zhang,
Hsin-Pai Cheng,
Zhenwen Li,
Feng Yan,
Chengyu Huang,
Hai Li,
Yiran Chen
Abstract:
Resource is an important constraint when deploying Deep Neural Networks (DNNs) on mobile and edge devices. Existing works commonly adopt the cell-based search approach, which limits the flexibility of network patterns in learned cell structures. Moreover, due to the topology-agnostic nature of existing works, including both cell-based and node-based approaches, the search process is time-consuming and the performance of the found architecture may be sub-optimal. To address these problems, we propose AutoShrink, a topology-aware Neural Architecture Search (NAS) for searching efficient building blocks of neural architectures. Our method is node-based and thus can learn flexible network patterns in cell structures within a topological search space. Directed Acyclic Graphs (DAGs) are used to abstract DNN architectures and progressively optimize the cell structure through edge shrinking. As the search space intrinsically reduces as the edges are progressively shrunk, AutoShrink explores a more flexible search space with even less search time. We evaluate AutoShrink on image classification and language tasks by crafting ShrinkCNN and ShrinkRNN models. ShrinkCNN is able to achieve up to 48% parameter reduction and save 34% Multiply-Accumulates (MACs) on ImageNet-1K with comparable accuracy of state-of-the-art (SOTA) models. Specifically, both ShrinkCNN and ShrinkRNN are crafted within 1.5 GPU hours, which is 7.2x and 6.7x faster than the crafting time of SOTA CNN and RNN models, respectively.
Submitted 20 November, 2019;
originally announced November 2019.
-
Heterogeneous Deep Graph Infomax
Authors:
Yuxiang Ren,
Bo Liu,
Chao Huang,
Peng Dai,
Liefeng Bo,
Jiawei Zhang
Abstract:
Graph representation learning aims to learn universal node representations that preserve both node attributes and structural information. The derived node representations can be used to serve various downstream tasks, such as node classification and node clustering. When a graph is heterogeneous, the problem becomes more challenging than the homogeneous graph node learning problem. Inspired by the emerging information theoretic-based learning algorithm, in this paper we propose an unsupervised graph neural network Heterogeneous Deep Graph Infomax (HDGI) for heterogeneous graph representation learning. We use the meta-path structure to analyze the connections involving semantics in heterogeneous graphs and utilize graph convolution module and semantic-level attention mechanism to capture local representations. By maximizing local-global mutual information, HDGI effectively learns high-level node representations that can be utilized in downstream graph-related tasks. Experimental results show that HDGI remarkably outperforms state-of-the-art unsupervised graph representation learning methods on both classification and clustering tasks. By feeding the learned representations into a parametric model, such as logistic regression, we even achieve comparable performance in node classification tasks when comparing with state-of-the-art supervised end-to-end GNN models.
Submitted 13 November, 2020; v1 submitted 19 November, 2019;
originally announced November 2019.
-
ERNet Family: Hardware-Oriented CNN Models for Computational Imaging Using Block-Based Inference
Authors:
Chao-Tsung Huang
Abstract:
Convolutional neural networks (CNNs) demand huge DRAM bandwidth for computational imaging tasks, and block-based processing has recently been applied to greatly reduce the bandwidth. However, the induced additional computation for feature recomputing or the large SRAM for feature reusing will degrade the performance or even forbid the usage of state-of-the-art models. In this paper, we address these issues by considering the overheads and hardware constraints in advance when constructing CNNs. We investigate a novel model family---ERNet---which includes temporary layer expansion as another means for increasing model capacity. We analyze three ERNet variants in terms of hardware requirement and introduce a hardware-aware model optimization procedure. Evaluations on Full HD and 4K UHD applications will be given to show the effectiveness in terms of image quality, pixel throughput, and SRAM usage. The results also show that, for block-based inference, ERNet can outperform the state-of-the-art FFDNet and EDSR-baseline models for image denoising and super-resolution respectively.
Submitted 30 January, 2020; v1 submitted 13 October, 2019;
originally announced October 2019.
-
Generating Fair Universal Representations using Adversarial Models
Authors:
Peter Kairouz,
Jiachun Liao,
Chong Huang,
Maunil Vyas,
Monica Welfert,
Lalitha Sankar
Abstract:
We present a data-driven framework for learning fair universal representations (FUR) that guarantee statistical fairness for any learning task that may not be known a priori. Our framework leverages recent advances in adversarial learning to allow a data holder to learn representations in which a set of sensitive attributes are decoupled from the rest of the dataset. We formulate this as a constrained minimax game between an encoder and an adversary where the constraint ensures a measure of usefulness (utility) of the representation. The resulting problem is that of censoring, i.e., finding a representation that is least informative about the sensitive attributes given a utility constraint. For appropriately chosen adversarial loss functions, our censoring framework precisely clarifies the optimal adversarial strategy against strong information-theoretic adversaries; it also achieves the fairness measure of demographic parity for the resulting constrained representations. We evaluate the performance of our proposed framework on both synthetic and publicly available datasets. For these datasets, we use two tradeoff measures: censoring vs. representation fidelity and fairness vs. utility for downstream tasks, to amply demonstrate that multiple sensitive features can be effectively censored even as the resulting fair representations ensure accuracy for multiple downstream tasks.
Submitted 11 May, 2022; v1 submitted 27 September, 2019;
originally announced October 2019.
-
Learning physics-based reduced-order models for a single-injector combustion process
Authors:
Renee Swischuk,
Boris Kramer,
Cheng Huang,
Karen Willcox
Abstract:
This paper presents a physics-based data-driven method to learn predictive reduced-order models (ROMs) from high-fidelity simulations, and illustrates it in the challenging context of a single-injector combustion process. The method combines the perspectives of model reduction and machine learning. Model reduction brings in the physics of the problem, constraining the ROM predictions to lie on a subspace defined by the governing equations. This is achieved by defining the ROM in proper orthogonal decomposition (POD) coordinates, which embed the rich physics information contained in solution snapshots of a high-fidelity computational fluid dynamics (CFD) model. The machine learning perspective brings the flexibility to use transformed physical variables to define the POD basis. This is in contrast to traditional model reduction approaches that are constrained to use the physical variables of the high-fidelity code. Combining the two perspectives, the approach identifies a set of transformed physical variables that expose quadratic structure in the combustion governing equations and learns a quadratic ROM from transformed snapshot data. This learning does not require access to the high-fidelity model implementation. Numerical experiments show that the ROM accurately predicts temperature, pressure, velocity, species concentrations, and the limit-cycle amplitude, with speedups of more than five orders of magnitude over high-fidelity models. Our ROM simulation is shown to be predictive 200% past the training interval. Moreover, ROM-predicted pressure traces accurately match the phase of the pressure signal and yield good approximations of the limit-cycle amplitude.
Submitted 11 July, 2020; v1 submitted 9 August, 2019;
originally announced August 2019.
-
The Bach Doodle: Approachable music composition with machine learning at scale
Authors:
Cheng-Zhi Anna Huang,
Curtis Hawthorne,
Adam Roberts,
Monica Dinculescu,
James Wexler,
Leon Hong,
Jacob Howcroft
Abstract:
To make music composition more approachable, we designed the first AI-powered Google Doodle, the Bach Doodle, where users can create their own melody and have it harmonized by a machine learning model Coconet (Huang et al., 2017) in the style of Bach. For users to input melodies, we designed a simplified sheet-music based interface. To support an interactive experience at scale, we re-implemented Coconet in TensorFlow.js (Smilkov et al., 2019) to run in the browser and reduced its runtime from 40s to 2s by adopting dilated depth-wise separable convolutions and fusing operations. We also reduced the model download size to approximately 400KB through post-training weight quantization. We calibrated a speed test based on partial model evaluation time to determine if the harmonization request should be performed locally or sent to remote TPU servers. In three days, people spent 350 years worth of time playing with the Bach Doodle, and Coconet received more than 55 million queries. Users could choose to rate their compositions and contribute them to a public dataset, which we are releasing with this paper. We hope that the community finds this dataset useful for applications ranging from ethnomusicological studies, to music education, to improving machine learning models.
Submitted 14 July, 2019;
originally announced July 2019.
-
vGraph: A Generative Model for Joint Community Detection and Node Representation Learning
Authors:
Fan-Yun Sun,
Meng Qu,
Jordan Hoffmann,
Chin-Wei Huang,
Jian Tang
Abstract:
This paper focuses on two fundamental tasks of graph analysis: community detection and node representation learning, which capture the global and local structures of graphs, respectively. In the current literature, these two tasks are usually independently studied while they are actually highly correlated. We propose a probabilistic generative model called vGraph to learn community membership and node representation collaboratively. Specifically, we assume that each node can be represented as a mixture of communities, and each community is defined as a multinomial distribution over nodes. Both the mixing coefficients and the community distribution are parameterized by the low-dimensional representations of the nodes and communities. We designed an effective variational inference algorithm which regularizes the community membership of neighboring nodes to be similar in the latent space. Experimental results on multiple real-world graphs show that vGraph is very effective in both community detection and node representation learning, outperforming many competitive baselines in both tasks. We show that the framework of vGraph is quite flexible and can be easily extended to detect hierarchical communities.
Submitted 17 September, 2019; v1 submitted 18 June, 2019;
originally announced June 2019.
-
Interpretations of Deep Learning by Forests and Haar Wavelets
Authors:
Changcun Huang
Abstract:
This paper presents a basic property of region dividing of ReLU (rectified linear unit) deep learning when new layers are successively added, by which two new perspectives of interpreting deep learning are given. The first is related to decision trees and forests; we construct a deep learning structure equivalent to a forest in classification abilities, which means that certain kinds of ReLU deep learning can be considered as forests. The second perspective is that Haar wavelet represented functions can be approximated by ReLU deep learning with arbitrary precision; and then a general conclusion of function approximation abilities of ReLU deep learning is given. Finally, we generalize some of the conclusions of ReLU deep learning to the case of sigmoid-unit deep learning.
Submitted 6 December, 2019; v1 submitted 16 June, 2019;
originally announced June 2019.