-
Bayesian Intervention Optimization for Causal Discovery
Authors:
Yuxuan Wang,
Mingzhou Liu,
Xinwei Sun,
Wei Wang,
Yizhou Wang
Abstract:
Causal discovery is crucial for understanding complex systems and informing decisions. While observational data can uncover causal relationships under certain assumptions, it often falls short, making active interventions necessary. Current methods, such as Bayesian and graph-theoretical approaches, do not prioritize decision-making and often rely on ideal conditions or information gain, which is not directly related to hypothesis testing. We propose a novel Bayesian optimization-based method inspired by Bayes factors that aims to maximize the probability of obtaining decisive and correct evidence. Our approach uses observational data to estimate causal models under different hypotheses, evaluates potential interventions pre-experimentally, and iteratively updates priors to refine interventions. We demonstrate the effectiveness of our method through various experiments. Our contributions provide a robust framework for efficient causal discovery through active interventions, enhancing the practical application of theoretical advancements.
Submitted 16 June, 2024;
originally announced June 2024.
-
Counterfactual Explanations for Multivariate Time-Series without Training Datasets
Authors:
Xiangyu Sun,
Raquel Aoki,
Kevin H. Wilson
Abstract:
Machine learning (ML) methods have experienced significant growth in the past decade, yet their practical application in high-impact real-world domains has been hindered by their opacity. When ML methods are responsible for making critical decisions, stakeholders often require insights into how to alter these decisions. Counterfactual explanations (CFEs) have emerged as a solution, offering interpretations of opaque ML models and providing a pathway to transition from one decision to another. However, most existing CFE methods require access to the model's training dataset, few methods can handle multivariate time-series, and none can handle multivariate time-series without training datasets. These limitations can be formidable in many scenarios. In this paper, we present CFWoT, a novel reinforcement-learning-based CFE method that generates CFEs when training datasets are unavailable. CFWoT is model-agnostic and suitable for both static and multivariate time-series datasets with continuous and discrete features. Users have the flexibility to specify non-actionable, immutable, and preferred features, as well as causal constraints which CFWoT guarantees will be respected. We demonstrate the performance of CFWoT against four baselines on several datasets and find that, despite not having access to a training dataset, CFWoT finds CFEs that make significantly fewer and significantly smaller changes to the input time-series. These properties make CFEs more actionable, as the magnitude of change required to alter an outcome is vastly reduced.
Submitted 28 May, 2024;
originally announced May 2024.
-
Zero-inflated Smoothing Spline (ZISS) Models for Individual-level Single-cell Temporal Data
Authors:
Yifu Tang,
Yi Zhang,
Yue Wang,
Jingyi Zhang,
Xiaoxiao Sun
Abstract:
Recent advancements in single-cell RNA-sequencing (scRNA-seq) have enhanced our understanding of cell heterogeneity at a high resolution. With the ability to sequence over 10,000 cells per hour, researchers can collect large scRNA-seq datasets for different participants, offering an opportunity to study the temporal progression of individual-level single-cell data. However, the presence of excessive zeros, a common issue in scRNA-seq, significantly impacts regression/association analysis, potentially leading to biased estimates in downstream analysis. Addressing these challenges, we introduce the Zero Inflated Smoothing Spline (ZISS) method, specifically designed to model single-cell temporal data. The ZISS method encompasses two components for modeling gene expression patterns over time and handling excessive zeros. Our approach employs the smoothing spline ANOVA model, providing robust estimates of mean functions and zero probabilities for irregularly observed single-cell temporal data compared to existing methods in our simulation studies and real data analysis.
Submitted 27 January, 2024;
originally announced January 2024.
-
Patient-Oriented Unsupervised Learning to Unlock Patterns of Multimorbidity Associated with Stroke using Primary Care Electronic Health Records
Authors:
Marc Delord,
Xiaohui Sun,
Annastazia Learoyd,
Vasa Curcin,
Iain Marshall,
Charles Wolfe,
Mark Ashworth,
Abdel Douiri
Abstract:
Background: Identifying and characterising the longitudinal patterns of multimorbidity associated with stroke is needed to better understand patients' needs and inform new models of care.
Methods: We used an unsupervised patient-oriented clustering approach to analyse primary care electronic health records (EHR) of 30 common long-term conditions (LTC), in patients with stroke aged over 18, registered in 41 general practices in south London between 2005 and 2021.
Results: Of 849,968 registered patients, 9,847 (1.16%) had a record of stroke; 46.5% were female and the median age at record was 65.0 years (IQR: 51.5 to 77.0). The median number of LTCs in addition to stroke was 3 (IQR: 2 to 5). Patients were stratified into eight clusters. These clusters revealed contrasting patterns of multimorbidity, socio-demographic characteristics (age, gender and ethnicity) and risk factors. Besides a core of 3 clusters associated with conventional stroke risk factors, minor clusters exhibited less common but recurrent combinations of LTCs including mental health conditions, asthma, osteoarthritis and sickle cell anaemia. Importantly, complex profiles combining mental health conditions, infectious diseases and substance dependency emerged.
Conclusion: This patient-oriented approach to EHRs uncovers the heterogeneity of profiles of multimorbidity and socio-demographic characteristics associated with stroke. It highlights the importance of conventional stroke risk factors as well as the association of mental health conditions in complex profiles of multimorbidity displayed in a significant proportion of patients. These results address the need for a better understanding of stroke-associated multimorbidity and complexity to inform more efficient and patient-oriented healthcare models.
Submitted 3 January, 2024;
originally announced January 2024.
-
The Causal Impact of Credit Lines on Spending Distributions
Authors:
Yijun Li,
Cheuk Hang Leung,
Xiangqian Sun,
Chaoqun Wang,
Yiyan Huang,
Xing Yan,
Qi Wu,
Dongdong Wang,
Zhixiang Huang
Abstract:
Consumer credit services offered by e-commerce platforms provide customers with convenient loan access during shopping and have the potential to stimulate sales. To understand the causal impact of credit lines on spending, previous studies have employed causal estimators based on direct regression (DR), inverse propensity weighting (IPW), and double machine learning (DML) to estimate the treatment effect. However, these estimators do not consider the notion that an individual's spending can be understood and represented as a distribution, which captures the range and pattern of amounts spent across different orders. By disregarding the outcome as a distribution, valuable insights embedded within the outcome distribution might be overlooked. This paper develops a distribution-valued estimator framework that extends existing real-valued DR-, IPW-, and DML-based estimators to distribution-valued estimators within Rubin's causal framework. We establish their consistency and apply them to a real dataset from a large e-commerce platform. Our findings reveal that credit lines positively influence spending across all quantiles; however, as credit lines increase, consumers allocate more to luxuries (higher quantiles) than necessities (lower quantiles).
Submitted 16 December, 2023;
originally announced December 2023.
-
SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space
Authors:
Yunchen Li,
Zhou Yu,
Gaoqi He,
Yunhang Shen,
Ke Li,
Xing Sun,
Shaohui Lin
Abstract:
Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on $E(X|y)$, where $y$ is a vector and $X$ is an SPD matrix. However, these methods struggle with large-scale data, as they need to access and process the whole dataset. In this paper, inspired by the denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing a Gaussian distribution in the SPD space to estimate $E(X|y)$. Moreover, our model is able to estimate $p(X)$ unconditionally and flexibly without being given $y$. On the one hand, the model conditionally learns $p(X|y)$ and utilizes the mean of samples to obtain $E(X|y)$ as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data $p(X)$ and generates samples that conform to this distribution. Furthermore, we propose a new SPD net which is much deeper than the previous networks and allows for the inclusion of conditional factors. Experiment results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both conditionally and unconditionally and provide accurate predictions.
Submitted 13 December, 2023;
originally announced December 2023.
-
Split Knockoffs for Multiple Comparisons: Controlling the Directional False Discovery Rate
Authors:
Yang Cao,
Xinwei Sun,
Yuan Yao
Abstract:
Multiple comparisons in hypothesis testing often encounter structural constraints in various applications. For instance, in structural Magnetic Resonance Imaging for Alzheimer's Disease, the focus extends beyond examining atrophic brain regions to include comparisons of anatomically adjacent regions. These constraints can be modeled as linear transformations of parameters, where the sign patterns play a crucial role in estimating directional effects. This class of problems, encompassing total variations, wavelet transforms, fused LASSO, trend filtering, and more, presents an open challenge in effectively controlling the directional false discovery rate. In this paper, we propose an extended Split Knockoff method specifically designed to address the control of directional false discovery rate under linear transformations. Our proposed approach relaxes the stringent linear manifold constraint to its neighborhood, employing a variable splitting technique commonly used in optimization. This methodology yields an orthogonal design that benefits both power and directional false discovery rate control. By incorporating a sample splitting scheme, we achieve effective control of the directional false discovery rate, with a notable reduction to zero as the relaxed neighborhood expands. To demonstrate the efficacy of our method, we conduct simulation experiments and apply it to two real-world scenarios: Alzheimer's Disease analysis and human age comparisons.
Submitted 11 October, 2023;
originally announced October 2023.
-
The Blessings of Multiple Treatments and Outcomes in Treatment Effect Estimation
Authors:
Yong Wu,
Mingzhou Liu,
Jing Yan,
Yanwei Fu,
Shouyan Wang,
Yizhou Wang,
Xinwei Sun
Abstract:
Assessing causal effects in the presence of unobserved confounding is a challenging problem. Existing studies leveraged proxy variables or multiple treatments to adjust for the confounding bias. In particular, the latter approach attributes the impact on a single outcome to multiple treatments, allowing estimating latent variables for confounding control. Nevertheless, these methods primarily focus on a single outcome, whereas in many real-world scenarios, there is greater interest in studying the effects on multiple outcomes. Besides, these outcomes are often coupled with multiple treatments. Examples include the intensive care unit (ICU), where health providers evaluate the effectiveness of therapies on multiple health indicators. To accommodate these scenarios, we consider a new setting dubbed as multiple treatments and multiple outcomes. We then show that parallel studies of multiple outcomes involved in this setting can assist each other in causal identification, in the sense that we can exploit other treatments and outcomes as proxies for each treatment effect under study. We proceed with a causal discovery method that can effectively identify such proxies for causal estimation. The utility of our method is demonstrated in synthetic data and sepsis disease.
Submitted 14 October, 2023; v1 submitted 29 September, 2023;
originally announced September 2023.
-
NSOTree: Neural Survival Oblique Tree
Authors:
Xiaotong Sun,
Peijie Qiu
Abstract:
Survival analysis is a statistical method employed to scrutinize the duration until a specific event of interest transpires, known as time-to-event information characterized by censorship. Recently, deep learning-based methods have dominated this field due to their representational capacity and state-of-the-art performance. However, the black-box nature of the deep neural network hinders its interpretability, which is desired in real-world survival applications but has been largely neglected by previous works. In contrast, conventional tree-based methods are advantageous with respect to interpretability, while consistently grappling with an inability to approximate the global optima due to greedy expansion. In this paper, we leverage the strengths of both neural networks and tree-based methods, capitalizing on their ability to approximate intricate functions while maintaining interpretability. To this end, we propose a Neural Survival Oblique Tree (NSOTree) for survival analysis. Specifically, the NSOTree was derived from the ReLU network and can be easily incorporated into existing survival models in a plug-and-play fashion. Evaluations on both simulated and real survival datasets demonstrated the effectiveness of the proposed method in terms of performance and interpretability.
Submitted 24 September, 2023;
originally announced September 2023.
-
Doubly Robust Proximal Causal Learning for Continuous Treatments
Authors:
Yong Wu,
Yanwei Fu,
Shouyan Wang,
Xinwei Sun
Abstract:
Proximal causal learning is a promising framework for identifying the causal effect under the existence of unmeasured confounders. Within this framework, the doubly robust (DR) estimator was derived and has shown its effectiveness in estimation, especially when the model assumption is violated. However, the current form of the DR estimator is restricted to binary treatments, while the treatment can be continuous in many real-world applications. The primary obstacle to continuous treatments resides in the delta function present in the original DR estimator, making it infeasible in causal effect estimation and introducing a heavy computational burden in nuisance function estimation. To address these challenges, we propose a kernel-based DR estimator that can well handle continuous treatments. Equipped with its smoothness, we show that its oracle form is a consistent approximation of the influence function. Further, we propose a new approach to efficiently solve the nuisance functions. We then provide a comprehensive convergence analysis in terms of the mean square error. We demonstrate the utility of our estimator on synthetic datasets and real-world applications.
Submitted 10 March, 2024; v1 submitted 22 September, 2023;
originally announced September 2023.
-
Learning graph geometry and topology using dynamical systems based message-passing
Authors:
Dhananjay Bhaskar,
Yanlei Zhang,
Charles Xu,
Xingzhi Sun,
Oluwadamilola Fasina,
Guy Wolf,
Maximilian Nickel,
Michael Perlmutter,
Smita Krishnaswamy
Abstract:
In this paper we introduce DYMAG: a message passing paradigm for GNNs built on the expressive power of continuous, multiscale graph dynamics. Standard discrete-time message passing algorithms implicitly make use of simplistic graph dynamics and aggregation schemes, which limit their ability to capture fundamental graph topological properties. By contrast, DYMAG makes use of complex graph dynamics based on the heat and wave equations as well as a more complex equation which admits chaotic solutions. The continuous nature of the dynamics is leveraged to generate multiscale (dynamic-time snapshot) representations which we prove are linked to various graph topological and spectral properties. We demonstrate experimentally that DYMAG achieves superior performance in recovering the generating parameters of Erdős-Rényi and stochastic block model random graphs and the persistent homology of synthetic graphs and citation networks. Since the behavior of proteins and biomolecules is sensitive to graph topology and exhibits important structure at multiple scales, we find that DYMAG outperforms other methods at predicting salient features of various biomolecules.
Submitted 12 June, 2024; v1 submitted 18 September, 2023;
originally announced September 2023.
-
Circular Clustering with Polar Coordinate Reconstruction
Authors:
Xiaoxiao Sun,
Paul Sajda
Abstract:
There is a growing interest in characterizing circular data found in biological systems. Such data are wide ranging and varied, from signal phase in neural recordings to nucleotide sequences in round genomes. Traditional clustering algorithms are often inadequate due to their limited ability to distinguish differences in the periodic component. Current clustering schemes that work in a polar coordinate system have limitations, such as being only angle-focused or lacking generality. To overcome these limitations, we propose a new analysis framework that utilizes projections onto a cylindrical coordinate system to better represent objects in a polar coordinate system. Using the mathematical properties of circular data, we show our approach always finds the correct clustering result within the reconstructed dataset, given sufficient periodic repetitions of the data. Our approach is generally applicable and adaptable and can be incorporated into most state-of-the-art clustering algorithms. We demonstrate on synthetic and real data that our method generates more appropriate and consistent clustering results compared to standard methods. In summary, our proposed analysis framework overcomes the limitations of existing polar coordinate-based clustering methods and provides a more accurate and efficient way to cluster circular data.
Submitted 15 September, 2023;
originally announced September 2023.
-
Causal Discovery via Conditional Independence Testing with Proxy Variables
Authors:
Mingzhou Liu,
Xinwei Sun,
Yu Qiao,
Yizhou Wang
Abstract:
Distinguishing causal connections from correlations is important in many scenarios. However, the presence of unobserved variables, such as the latent confounder, can introduce bias in conditional independence testing commonly employed in constraint-based causal discovery for identifying causal relations. To address this issue, existing methods introduced proxy variables to adjust for the bias caused by unobservedness. However, these methods were either limited to categorical variables or relied on strong parametric assumptions for identification. In this paper, we propose a novel hypothesis-testing procedure that can effectively examine the existence of the causal relationship over continuous variables, without any parametric constraint. Our procedure is based on discretization, which, under completeness conditions, is able to asymptotically establish a linear equation whose coefficient vector is identifiable under the causal null hypothesis. Based on this, we introduce our test statistic and demonstrate its asymptotic level and power. We validate the effectiveness of our procedure using both synthetic and real-world data.
Submitted 1 May, 2024; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Causal Discovery from Subsampled Time Series with Proxy Variables
Authors:
Mingzhou Liu,
Xinwei Sun,
Lingjing Hu,
Yizhou Wang
Abstract:
Inferring causal structures from time series data is the central interest of many scientific inquiries. A major barrier to such inference is the problem of subsampling, i.e., the frequency of measurement is much lower than that of causal influence. To overcome this problem, numerous methods have been proposed, yet they were either limited to the linear case or failed to achieve identifiability. In this paper, we propose a constraint-based algorithm that can identify the entire causal structure from subsampled time series, without any parametric constraint. Our observation is that the challenge of subsampling arises mainly from hidden variables at the unobserved time steps. Meanwhile, every hidden variable has an observed proxy, which is essentially itself at some observable time in the future, benefiting from the temporal structure. Based on these, we can leverage the proxies to remove the bias induced by the hidden variables and hence achieve identifiability. Following this intuition, we propose a proxy-based causal discovery algorithm. Our algorithm is nonparametric and can achieve full causal identification. Theoretical advantages are reflected in synthetic and real-world experiments.
Submitted 24 December, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Identifying roadway departure crash patterns on rural two-lane highways under different lighting conditions: association knowledge using data mining approach
Authors:
Ahmed Hossain,
Xiaoduan Sun,
Shahrin Islam,
Shah Alam,
Md Mahmud Hossain
Abstract:
More than half of all fatalities on U.S. highways occur due to roadway departure (RwD) each year. Previous research has explored various risk factors that contribute to RwD crashes; however, a comprehensive investigation considering the effect of lighting conditions has been insufficiently addressed. Using the Louisiana Department of Transportation and Development crash database, fatal and injury RwD crashes occurring on rural two-lane (R2L) highways between 2008 and 2017 were analyzed based on daylight and dark (with/without streetlight) conditions. This research employed a safe system approach to explore meaningful complex interactions among multidimensional crash risk factors. To accomplish this, an unsupervised data mining algorithm, association rules mining (ARM), was utilized. Based on the generated rules, the findings reveal several interesting crash patterns in the daylight, dark-with-streetlight, and dark-no-streetlight conditions, emphasizing the importance of investigating RwD crash patterns depending on the lighting conditions. In daylight, fatal RwD crashes are associated with cloudy weather conditions, distracted drivers, standing water on the roadway, no seat belt use, and construction zones. In dark lighting conditions (with/without streetlight), the majority of the RwD crashes are associated with alcohol/drug involvement, young drivers (15-24 years), driver condition (e.g., inattentive, distracted, ill/fatigued/asleep) and colliding with animal(s). The findings reveal how certain driver behavior patterns are connected to RwD crashes, such as a strong association between alcohol/drug intoxication and no seat belt usage in the dark-no-streetlight condition. Based on the identified crash patterns and behavioral characteristics under different lighting conditions, the findings could aid researchers and safety specialists in developing the most effective RwD crash mitigation strategies.
Submitted 28 February, 2023;
originally announced February 2023.
-
Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing
Authors:
Xiangyu Sun,
Oliver Schulte
Abstract:
A fundamental problem of causal discovery is cause-effect inference, learning the correct causal direction between two random variables. Significant progress has been made through modelling the effect as a function of its cause and a noise term, which allows us to leverage assumptions about the generating function class. The recently introduced heteroscedastic location-scale noise functional models (LSNMs) combine expressive power with identifiability guarantees. LSNM model selection based on maximizing likelihood achieves state-of-the-art accuracy, when the noise distributions are correctly specified. However, through an extensive empirical evaluation, we demonstrate that the accuracy deteriorates sharply when the form of the noise distribution is misspecified by the user. Our analysis shows that the failure occurs mainly when the conditional variance in the anti-causal direction is smaller than that in the causal direction. As an alternative, we find that causal model selection through residual independence testing is much more robust to noise misspecification and misleading conditional variance.
Submitted 25 October, 2023; v1 submitted 26 January, 2023;
originally announced January 2023.
-
Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems
Authors:
Lu Yang,
Xiuwen Sun,
Boumediene Hamzi,
Houman Owhadi,
Naiming Xie
Abstract:
Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector field with a data-adapted kernel, which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy when half of the data is used (the objective function could be a short-term prediction or some other objective for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of \emph{Sparse Kernel Flows} in order to learn the ``best'' kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.
Submitted 27 February, 2023; v1 submitted 24 January, 2023;
originally announced January 2023.
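The premise stated in the abstract above (a kernel is good if dropping half of the data costs little accuracy) is often written as a ratio of interpolant norms. The following is a minimal sketch of that criterion for a single Gaussian kernel, not the Sparse Kernel Flows method itself. The function names, the choice of an RBF kernel, the `jitter` regularizer, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale):
    """Gaussian (RBF) kernel matrix between 1-D point sets a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * lengthscale ** 2))


def kernel_flows_rho(x, y, lengthscale, rng, jitter=1e-6):
    """Relative loss incurred by interpolating with a random half of the data.

    rho = 1 - (y_h^T K_h^{-1} y_h) / (y^T K^{-1} y); values near 0 mean the
    half-data interpolant is almost as good as the full-data one.
    """
    n = len(x)
    half = rng.choice(n, size=n // 2, replace=False)
    # Hypothetical regularization: jitter keeps the solves well conditioned.
    K = rbf_kernel(x, x, lengthscale) + jitter * np.eye(n)
    K_half = K[np.ix_(half, half)]
    full_norm = y @ np.linalg.solve(K, y)
    half_norm = y[half] @ np.linalg.solve(K_half, y[half])
    return 1.0 - half_norm / full_norm


rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 40)  # toy inputs (assumption)
y = np.sin(x)                          # toy targets (assumption)
rho = kernel_flows_rho(x, y, lengthscale=0.5, rng=rng)
print(f"rho = {rho:.4f}")
```

In a training loop one would minimize `rho` over the kernel parameters across many random halves; a sparse variant would additionally penalize the weights of a dictionary of kernels, which this sketch does not attempt.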
-
Towards data-driven modeling and real-time prediction of solar flares and coronal mass ejections
Authors:
M. Rempel,
Y. Fan,
M. Dikpati,
A. Malanushenko,
M. D. Kazachenko,
M. C. M. Cheung,
G. Chintzoglou,
X. Sun,
G. H. Fisher,
T. Y. Chen
Abstract:
Modeling of transient events in the solar atmosphere requires the confluence of 3 critical elements: (1) model sophistication, (2) data availability, and (3) data assimilation. This white paper describes required advances that will enable statistical flare and CME forecasting (e.g. eruption probability and timing, estimation of strength, and CME details, such as speed and magnetic field orientation) similar to weather prediction on Earth.
Submitted 29 December, 2022;
originally announced December 2022.
-
Applying Association Rules Mining to Investigate Pedestrian Fatal and Injury Crash Patterns Under Different Lighting Conditions
Authors:
Ahmed Hossain,
Xiaoduan Sun,
Raju Thapa,
Julius Codjoe
Abstract:
The pattern of pedestrian crashes varies greatly depending on lighting circumstances, emphasizing the need to examine pedestrian crashes in various lighting conditions. Using Louisiana pedestrian fatal and injury crash data (2010-2019), this study applied Association Rules Mining (ARM) to identify the hidden pattern of crash risk factors according to three different lighting conditions (daylight, dark-with-streetlight, and dark-no-streetlight). Based on the generated rules, the results show that daylight pedestrian crashes are associated with children (less than 15 years), senior pedestrians (greater than 64 years), older drivers (>64 years), and other driving behaviors such as failure to yield, inattentive/distracted, and illness/fatigue/asleep. Additionally, young drivers (15-24 years) are involved in severe pedestrian crashes in daylight conditions. This study also found pedestrian alcohol/drug involvement to be the most frequent item in the dark-with-streetlight condition. This crash type is particularly associated with pedestrian action (crossing intersection/midblock), driver age (55-64 years), speed limit (30-35 mph), and specific area type (business with mixed residential area). Fatal pedestrian crashes are found to be associated with roadways with high speed limits (>50 mph) in the dark-no-streetlight condition. Other risk factors linked with high-speed-limit crashes are pedestrians walking with/against the traffic, pedestrian dark clothing, and pedestrian alcohol/drug involvement. The research findings are expected to provide an improved understanding of the underlying relationships between pedestrian crash risk factors and specific lighting conditions. Highway safety experts can use these findings when selecting effective countermeasures to reduce pedestrian crashes strategically.
Submitted 6 November, 2022;
originally announced November 2022.
-
A New Causal Decomposition Paradigm towards Health Equity
Authors:
Xinwei Sun,
Xiangyu Zheng,
Jim Weinstein
Abstract:
Causal decomposition has provided a powerful tool to analyze health disparity problems, by assessing the proportion of disparity caused by each mediator. However, most of these methods lack \emph{policy implications}, as they fail to account for all sources of disparities caused by the mediator. Besides, their estimation relies on a \emph{pre-specified} covariate set (\emph{a.k.a.} the admissible set) for the strong ignorability condition to hold, which can be problematic as some variables in this set may induce new spurious features. To resolve these issues, under the framework of the structural causal model, we propose to decompose the total effect into adjusted and unadjusted effects, with the former being able to include all types of disparity by adjusting each mediator's distribution from the disadvantaged group to the advantaged ones. Besides, equipped with a maximal ancestral graph and context variables, we can automatically identify the admissible set, followed by an efficient algorithm for estimation. Theoretical correctness and the efficacy of our method are demonstrated on a synthetic dataset and a spine disease dataset.
Submitted 20 February, 2023; v1 submitted 24 July, 2022;
originally announced July 2022.
-
Local False Discovery Rate Estimation with Competition-Based Procedures for Variable Selection
Authors:
Xiaoya Sun,
Yan Fu
Abstract:
Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, e.g., selecting significant variables and controlling the selection error rate. The most prevalent measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of assessing the confidence of individual hypotheses. However, most methods estimate fdr through p-values or statistics with known null distributions, which are sometimes not available or reliable. Adopting the innovative methodology of competition-based procedures, e.g., the knockoff filter, this paper proposes a new approach, named TDfdr, to local false discovery rate estimation, which is free of p-values and known null distributions. Simulation results demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. In real data analysis, the power of TDfdr on variable selection is verified on two biological datasets.
Submitted 6 June, 2022;
originally announced June 2022.
-
Bounded Memory Adversarial Bandits with Composite Anonymous Delayed Feedback
Authors:
Zongqi Wan,
Xiaoming Sun,
Jialin Zhang
Abstract:
We study the adversarial bandit problem with composite anonymous delayed feedback. In this setting, losses of an action are split into $d$ components, spreading over consecutive rounds after the action is chosen. In each round, the algorithm observes the aggregation of losses that come from the latest $d$ rounds. Previous works focus on the oblivious adversarial setting, while we investigate the harder non-oblivious setting. We show that the non-oblivious setting incurs $\Omega(T)$ pseudo regret even when the loss sequence is bounded memory. However, we propose a wrapper algorithm which enjoys $o(T)$ policy regret on many adversarial bandit problems with the assumption that the loss sequence is bounded memory. In particular, for $K$-armed bandit and bandit convex optimization, we have an $\mathcal{O}(T^{2/3})$ policy regret bound. We also prove a matching lower bound for $K$-armed bandit. Our lower bound works even when the loss sequence is oblivious but the delay is non-oblivious. It answers the open problem proposed in \cite{wang2021adaptive}, showing that non-oblivious delay is enough to incur $\tilde{\Omega}(T^{2/3})$ regret.
Submitted 27 April, 2022; v1 submitted 27 April, 2022;
originally announced April 2022.
-
Validating CircaCP: a Generic Sleep-Wake Cycle Detection Algorithm
Authors:
Shanshan Chen,
Xinxin Sun
Abstract:
Sleep-wake cycle detection is a key step when extrapolating sleep patterns from actigraphy data. Numerous supervised detection algorithms have been developed with parameters estimated from and optimized for a particular dataset, yet their generalizability from sensor to sensor or study to study is unknown. In this paper, we propose and validate an unsupervised algorithm -- CircaCP -- to detect sleep-wake cycles from minute-by-minute actigraphy data. It first uses a robust cosinor model to estimate circadian rhythm, then searches for a single change point (CP) within each cycle. We used CircaCP to estimate sleep/wake onset times (S/WOTs) from 2125 individuals' data in the MESA Sleep study and compared the estimated S/WOTs against self-reported S/WOT event markers. Lastly, we quantified the biases between estimated and self-reported S/WOTs, as well as variation in S/WOTs contributed by the two methods, using linear mixed-effects models and variance component analysis.
On average, SOTs estimated by CircaCP were five minutes behind those reported by event markers, and WOTs estimated by CircaCP were less than one minute behind those reported by markers. These differences accounted for less than 0.2% variability in SOTs and in WOTs, taking into account other sources of between-subject variations. By focusing on the commonality in human circadian rhythms captured by actigraphy, our algorithm transferred seamlessly from hip-worn ActiGraph data collected from children in our previous study to wrist-worn Actiwatch data collected from adults. The large between- and within-subject variability highlights the need for estimating individual-level S/WOTs when conducting actigraphy research. The generalizability of our algorithm also suggests that it could be widely applied to actigraphy data collected by other wearable sensors.
Submitted 29 November, 2021;
originally announced November 2021.
-
B-scaling: A Novel Nonparametric Data Fusion Method
Authors:
Yiwen Liu,
Xiaoxiao Sun,
Wenxuan Zhong,
Bing Li
Abstract:
Very often for the same scientific question, there may exist different techniques or experiments that measure the same numerical quantity. Historically, various methods have been developed to exploit the information within each type of data independently. However, statistical data fusion methods that could effectively integrate multi-source data under a unified framework are lacking. In this paper, we propose a novel data fusion method, called B-scaling, for integrating multi-source data. Consider $K$ measurements that are generated from different sources but measure the same latent variable through some linear or nonlinear ways. We seek to find a representation of the latent variable, named B-mean, which captures the common information contained in the $K$ measurements while taking into account the nonlinear mappings between them and the latent variable. We also establish the asymptotic property of the B-mean and apply the proposed method to integrate multiple histone modifications and DNA methylation levels for characterizing the epigenomic landscape. Both numerical and empirical studies show that B-scaling is a powerful data fusion method with broad applications.
Submitted 20 September, 2021;
originally announced September 2021.
-
NTS-NOTEARS: Learning Nonparametric DBNs With Prior Knowledge
Authors:
Xiangyu Sun,
Oliver Schulte,
Guiliang Liu,
Pascal Poupart
Abstract:
We describe NTS-NOTEARS, a score-based structure learning method for time-series data to learn dynamic Bayesian networks (DBNs) that capture nonlinear, lagged (inter-slice) and instantaneous (intra-slice) relations among variables. NTS-NOTEARS utilizes 1D convolutional neural networks (CNNs) to model the dependence of child variables on their parents; a 1D CNN is a neural function approximation model well-suited for sequential data. DBN-CNN structure learning is formulated as a continuous optimization problem with an acyclicity constraint, following the NOTEARS DAG learning approach. We show how prior knowledge of dependencies (e.g., forbidden and required edges) can be included as additional optimization constraints. Empirical evaluation on simulated and benchmark data shows that NTS-NOTEARS achieves state-of-the-art DAG structure quality compared to both parametric and nonparametric baseline methods, with improvement in the range of 10-20% on the F1-score. We also evaluate NTS-NOTEARS on complex real-world data acquired from professional ice hockey games that contain a mixture of continuous and discrete variables. The code is available online.
Submitted 1 March, 2023; v1 submitted 9 September, 2021;
originally announced September 2021.
-
Bayesian Lifetime Regression with Multi-type Group-shared Latent Heterogeneity
Authors:
Xuxue Sun,
Mingyang Li
Abstract:
Products manufactured from the same batch or utilized in the same region often exhibit correlated lifetime observations due to the latent heterogeneity caused by the influence of shared but unobserved covariates. The unavailable group-shared covariates involve multiple different types (e.g., discrete, continuous, or mixed-type) and induce different structures of indispensable group-shared latent heterogeneity. Without carefully capturing such latent heterogeneity, the lifetime modeling accuracy will be significantly undermined. In this work, we propose a generic Bayesian lifetime modeling approach by comprehensively investigating the structures of group-shared latent heterogeneity caused by different types of group-shared unobserved covariates. The proposed approach is flexible enough to characterize multi-type group-shared latent heterogeneity in lifetime data. Besides, it can handle the case where group membership information is lacking and address the issue of limited sample size. A Bayesian sampling algorithm with a data augmentation technique is further developed to jointly quantify the influence of observed covariates and group-shared latent heterogeneity. Further, we conduct a comprehensive numerical study to demonstrate the improved performance of the proposed modeling approach via comparison with alternative models. We also present empirical study results to investigate the impacts of group number and sample size per group on estimating the group-shared latent heterogeneity and to demonstrate model identifiability of the proposed approach for different structures of unobserved group-shared covariates. Finally, we present a real case study to illustrate the effectiveness of the proposed approach.
Submitted 14 July, 2021;
originally announced July 2021.
-
Oversampling Divide-and-conquer for Response-skewed Kernel Ridge Regression
Authors:
Jingyi Zhang,
Xiaoxiao Sun
Abstract:
The divide-and-conquer method has been widely used for computing large-scale kernel ridge regression estimates. Unfortunately, when the response variable is highly skewed, the divide-and-conquer kernel ridge regression (dacKRR) may overlook the underrepresented region and produce unacceptable results. We combine a novel response-adaptive partition strategy with the oversampling technique synergistically to overcome the limitation. Through the proposed novel algorithm, we allocate some carefully identified informative observations to multiple nodes (local processors). Although the oversampling technique has been widely used for addressing discrete label skewness, extending it to the dacKRR setting is nontrivial. We provide both theoretical and practical guidance on how to effectively over-sample the observations under the dacKRR setting. Furthermore, we show the proposed estimate has a smaller risk than that of the classical dacKRR estimate under mild conditions. Our theoretical findings are supported by both simulated and real-data analyses.
Submitted 10 November, 2021; v1 submitted 13 July, 2021;
originally announced July 2021.
-
Which Invariance Should We Transfer? A Causal Minimax Learning Approach
Authors:
Mingzhou Liu,
Xiangyu Zheng,
Xinwei Sun,
Fang Fang,
Yizhou Wang
Abstract:
A major barrier to deploying current machine learning models lies in their unreliability under dataset shifts. To resolve this problem, most existing studies attempted to transfer stable information to unseen environments. Particularly, independent causal mechanisms-based methods proposed to remove mutable causal mechanisms via the do-operator. Compared to previous methods, the obtained stable predictors are more effective in identifying stable information. However, a key question remains: which subset of this whole stable information should the model transfer, in order to achieve optimal generalization ability? To answer this question, we present a comprehensive minimax analysis from a causal perspective. Specifically, we first provide a graphical condition for the whole stable set to be optimal. When this condition fails, we surprisingly find with an example that this whole stable set, although it can fully exploit stable information, is not the optimal one to transfer. To identify the optimal subset in this case, we propose to estimate the worst-case risk with a novel optimization scheme over the intervention functions on mutable causal mechanisms. We then propose an efficient algorithm to search for the subset with minimal worst-case risk, based on a newly defined equivalence relation between stable subsets. Compared to the exponential cost of exhaustively searching over all subsets, our searching strategy enjoys a polynomial complexity. The effectiveness and efficiency of our methods are demonstrated on synthetic data and the diagnosis of Alzheimer's disease.
Submitted 30 May, 2023; v1 submitted 5 July, 2021;
originally announced July 2021.
-
Parameter Estimation for the SEIR Model Using Recurrent Nets
Authors:
Chun Fan,
Yuxian Meng,
Xiaofei Sun,
Fei Wu,
Tianwei Zhang,
Jiwei Li
Abstract:
The standard way to estimate the parameters $Θ_\text{SEIR}$ (e.g., the transmission rate $β$) of an SEIR model is to use grid search, where simulations are performed on each set of parameters, and the parameter set leading to the least $L_2$ distance between predicted number of infections and observed infections is selected. This brute-force strategy is not only time consuming, as simulations are slow when the population is large, but also inaccurate, since it is impossible to enumerate all parameter combinations. To address these issues, in this paper, we propose to transform the non-differentiable problem of finding optimal $Θ_\text{SEIR}$ to a differentiable one, where we first train a recurrent net to fit a small number of simulation data. Next, based on this recurrent net that is able to generalize SEIR simulations, we are able to transform the objective to a differentiable one with respect to $Θ_\text{SEIR}$, and straightforwardly obtain its optimal value. The proposed strategy is both time efficient as it only relies on a small number of SEIR simulations, and accurate as we are able to find the optimal $Θ_\text{SEIR}$ based on the differentiable objective. On two COVID-19 datasets, we observe that the proposed strategy leads to significantly better parameter estimations with a smaller number of simulations.
Submitted 30 May, 2021;
originally announced May 2021.
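The grid-search baseline described in the abstract above (simulate for each candidate parameter set, keep the one with the smallest $L_2$ distance to the observed infections) can be sketched as follows. This is only an illustration of that baseline, not the recurrent-net method the paper proposes. The deterministic daily-step simulator, the function names, the fixed `sigma` and `gamma`, and the synthetic "observed" curve are all assumptions.

```python
import numpy as np


def simulate_seir(beta, sigma, gamma, n_days, population=10_000, initial_infected=10):
    """Deterministic SEIR with one Euler step per day; returns infectious counts."""
    s = float(population - initial_infected)
    e, i, r = 0.0, float(initial_infected), 0.0
    infected = np.empty(n_days)
    for day in range(n_days):
        infected[day] = i
        new_exposed = beta * s * i / population   # S -> E
        new_infectious = sigma * e                # E -> I
        new_recovered = gamma * i                 # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
    return infected


def grid_search_beta(observed, beta_grid, sigma, gamma):
    """Return the beta on the grid whose simulation is closest to `observed` in L2."""
    distances = [
        np.linalg.norm(simulate_seir(b, sigma, gamma, len(observed)) - observed)
        for b in beta_grid
    ]
    return float(beta_grid[int(np.argmin(distances))])


# Synthetic "observations" generated with a known transmission rate (assumption).
observed = simulate_seir(beta=0.4, sigma=0.2, gamma=0.1, n_days=60)
beta_grid = np.linspace(0.1, 0.8, 15)  # 15 candidates, spacing 0.05
best_beta = grid_search_beta(observed, beta_grid, sigma=0.2, gamma=0.1)
print(f"best beta on grid: {best_beta:.2f}")
```

Each grid point costs one full simulation, and the answer can be no finer than the grid spacing, which is the cost and accuracy limitation the abstract points to.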
-
Data-driven discovery of interpretable causal relations for deep learning material laws with uncertainty propagation
Authors:
Xiao Sun,
Bahador Bahmani,
Nikolaos N. Vlassis,
WaiChing Sun,
Yanxun Xu
Abstract:
This paper presents a computational framework that generates ensemble predictive mechanics models with uncertainty quantification (UQ). We first develop a causal discovery algorithm to infer causal relations among time-history data measured during each representative volume element (RVE) simulation through a directed acyclic graph (DAG). With multiple plausible sets of causal relationships estimated from multiple RVE simulations, the predictions are propagated in the derived causal graph while using a deep neural network equipped with dropout layers as a Bayesian approximation for uncertainty quantification. We select two representative numerical examples (traction-separation laws for frictional interfaces, elastoplasticity models for granular assemblies) to examine the accuracy and robustness of the proposed causal discovery method for the common material law predictions in civil engineering applications.
Submitted 20 May, 2021;
originally announced May 2021.
-
Controlling the False Discovery Rate in Transformational Sparsity: Split Knockoffs
Authors:
Yang Cao,
Xinwei Sun,
Yuan Yao
Abstract:
Controlling the False Discovery Rate (FDR) in a variable selection procedure is critical for reproducible discoveries, and it has been extensively studied in sparse linear models. However, it remains largely open in scenarios where the sparsity constraint is not directly imposed on the parameters but on a linear transformation of the parameters to be estimated. Examples of such scenarios include total variations, wavelet transforms, fused LASSO, and trend filtering. In this paper, we propose a data-adaptive FDR control method, called the Split Knockoff method, for this transformational sparsity setting. The proposed method exploits both variable and data splitting. The linear transformation constraint is relaxed to its Euclidean proximity in a lifted parameter space, which yields an orthogonal design that enables the orthogonal Split Knockoff construction. To overcome the challenge that exchangeability fails due to the heterogeneous noise brought by the transformation, new inverse supermartingale structures are developed via data splitting for provable FDR control without sacrificing power. Simulation experiments demonstrate that the proposed methodology achieves the desired FDR and power. We also provide an application to an Alzheimer's Disease study, where atrophied brain regions and their abnormal connections can be discovered based on a structural Magnetic Resonance Imaging dataset (ADNI).
Submitted 16 October, 2023; v1 submitted 30 March, 2021;
originally announced March 2021.
-
Bayesian Poisson Mortality Projections with Incomplete Data
Authors:
Rui Gong,
Xiaoqian Sun,
Leping Liu,
Yu-Bo Wang
Abstract:
The missing data problem is pervasive in statistical applications. Even data as simple as the counts used in mortality projections may be unavailable for certain age-and-year groups due to budget limitations or difficulties in tracing research units, resulting in inaccurate downstream estimation and prediction. To circumvent this data-driven challenge, we extend the Poisson log-normal Lee-Carter model to accommodate a more flexible time structure, and develop a new sampling algorithm that improves the MCMC convergence when dealing with incomplete mortality data. Via the overdispersion term and Gibbs sampler, the extended model can be re-written as a dynamic linear model so that both Kalman and sequential Kalman filters can be incorporated into the sampling scheme. Additionally, our meticulous prior settings avoid the re-scaling step in each MCMC iteration, and allow model selection to be conducted simultaneously with estimation and prediction. The proposed method is applied to the mortality data of Chinese males during the period 1995-2016 to yield mortality rate forecasts for 2017-2039. The results are comparable to those based on the imputed data set, suggesting that our approach could handle incomplete data well.
Submitted 9 March, 2021;
originally announced March 2021.
-
A Bayesian Spatial Modeling Approach to Mortality Forecasting
Authors:
Zhen Liu,
Xiaoqian Sun,
Yu-Bo Wang
Abstract:
This paper extends Bayesian mortality projection models for multiple populations considering the stochastic structure and the effect of spatial autocorrelation among the observations. We explain high levels of overdispersion according to adjacent locations based on the conditional autoregressive model. In an empirical study, we compare different hierarchical projection models for the analysis of geographical diversity in mortality between the Japanese counties in multiple years, according to age. By a Markov chain Monte Carlo (MCMC) computation, results have demonstrated the flexibility and predictive performance of our proposed model.
Submitted 5 March, 2021; v1 submitted 23 February, 2021;
originally announced February 2021.
-
Hierarchical Variational Auto-Encoding for Unsupervised Domain Generalization
Authors:
Xudong Sun,
Florian Buettner
Abstract:
We address the task of domain generalization, where the goal is to train a predictive model such that it is able to generalize to a new, previously unseen domain. We choose a hierarchical generative approach within the framework of variational autoencoders and propose a domain-unsupervised algorithm that is able to generalize to new domains without domain supervision. We show that our method is able to learn representations that disentangle domain-specific information from class-label specific information even in complex settings where domain structure is not observed during training. Our interpretable method outperforms previously proposed generative algorithms for domain generalization as well as other non-generative state-of-the-art approaches in several hierarchical domain settings including sequential overlapped near continuous domain shift. It also achieves competitive performance on the standard domain generalization benchmark dataset PACS compared to state-of-the-art approaches which rely on observing domain-specific information during training, as well as another domain-unsupervised method. Additionally, we propose model selection based purely on the Evidence Lower Bound (ELBO), as well as weak domain supervision, where implicit domain information can be added into the algorithm.
Submitted 14 May, 2021; v1 submitted 23 January, 2021;
originally announced January 2021.
-
A Degradation Performance Model With Mixed-type Covariates and Latent Heterogeneity
Authors:
Xuxue Sun,
Wenjun Cai,
Qiong Zhang,
Mingyang Li
Abstract:
Successful modeling of degradation performance data is essential for accurate reliability assessment and failure predictions of highly reliable product units. The degradation performance measurements over time are highly heterogeneous. Such heterogeneity can be partially attributed to external factors, such as accelerated/environmental conditions, and can also be attributed to internal factors, such as material microstructure characteristics of product units. The latent heterogeneity due to the unobserved/unknown factors shared within each product unit may also exist and needs to be considered as well. Existing degradation models often fail to consider (i) the influence of both external accelerated/environmental conditions and internal material information, and (ii) the influence of unobserved/unknown factors within each unit. In this work, we propose a generic degradation performance modeling framework with mixed-type covariates and latent heterogeneity to account for the influences of both observed internal and external factors and unobserved factors. An effective estimation algorithm is also developed to jointly quantify the influences of mixed-type covariates and individual latent heterogeneity, and to examine the potential interaction between mixed-type covariates. Functional data analysis and data augmentation techniques are employed to address a series of estimation issues. A real case study is further provided to demonstrate the superior performance of the proposed approach over several alternative modeling approaches. In addition, the proposed degradation performance modeling framework provides interpretable findings.
Submitted 10 January, 2021;
originally announced January 2021.
-
A Latent Survival Analysis Enabled Simulation Platform For Nursing Home Staffing Strategy Evaluation
Authors:
Xuxue Sun,
Nan Kong,
Nazmus Sakib,
Chao Meng,
Kathryn Hyer,
Hongdao Meng,
Chris Masterson,
Mingyang Li
Abstract:
Nursing homes are critical facilities for caring for frail older adults with round-the-clock formal care and personal assistance. To ensure quality care for nursing home residents, an adequate staffing level is of great importance. Current nursing home staffing practice is mainly based on experience and regulation. The objective of this paper is to investigate the viability of experience-based and regulation-based strategies, as well as alternative staffing strategies to minimize labor costs subject to heterogeneous service demand of nursing home residents under various scenarios of census. We propose a data-driven analysis framework to model heterogeneous service demand of nursing home residents and further identify appropriate staffing strategies by combining survival modeling and computer simulation techniques as well as domain knowledge. Specifically, in the analysis, we develop an agent-based simulation tool consisting of four main modules, namely individual length of stay predictor, individual daily staff time generator, facility level staffing strategy evaluator, and graphical user interface. We use real nursing home data to validate the proposed model, and demonstrate that the identified staffing strategy significantly reduces the total labor cost of certified nursing assistants compared to the benchmark strategies. Additionally, the proposed length of stay predictive model that considers multiple discharge dispositions exhibits superior accuracy and offers better staffing decisions than those without the consideration. Further, we construct different census scenarios of nursing home residents to demonstrate the capability of the proposed framework in helping nursing home administrators adjust staffing decisions in various realistic settings.
Submitted 8 February, 2021; v1 submitted 8 January, 2021;
originally announced January 2021.
-
Latent Causal Invariant Model
Authors:
Xinwei Sun,
Botong Wu,
Xiangyu Zheng,
Chang Liu,
Wei Chen,
Tao Qin,
Tie-yan Liu
Abstract:
Current supervised learning can learn spurious correlation during the data-fitting process, imposing issues regarding interpretability, out-of-distribution (OOD) generalization, and robustness. To avoid spurious correlation, we propose a Latent Causal Invariance Model (LaCIM) which pursues causal prediction. Specifically, we introduce latent variables that are separated into (a) output-causative factors and (b) others that are spuriously correlated to the output via confounders, to model the underlying causal factors. We further assume the generating mechanisms from latent space to observed data to be causally invariant. We give the identifiable claim of such invariance, particularly the disentanglement of output-causative factors from others, as a theoretical guarantee for precise inference and avoiding spurious correlation. We propose a Variational-Bayesian-based method for estimation and to optimize over the latent space for prediction. The utility of our approach is verified by improved interpretability, prediction power on various OOD scenarios (including healthcare) and robustness on security.
Submitted 27 April, 2021; v1 submitted 4 November, 2020;
originally announced November 2020.
-
Learning Causal Semantic Representation for Out-of-Distribution Prediction
Authors:
Chang Liu,
Xinwei Sun,
Jindong Wang,
Haoyue Tang,
Tao Li,
Tao Qin,
Wei Chen,
Tie-Yan Liu
Abstract:
Conventional supervised learning methods, especially deep ones, are found to be sensitive to out-of-distribution (OOD) examples, largely because the learned representation mixes the semantic factor with the variation factor due to their domain-specific correlation, while only the semantic factor causes the output. To address the problem, we propose a Causal Semantic Generative model (CSG) based on a causal reasoning so that the two factors are modeled separately, and develop methods for OOD prediction from a single training domain, which is common and challenging. The methods are based on the causal invariance principle, with a novel design in variational Bayes for both efficient learning and easy prediction. Theoretically, we prove that under certain conditions, CSG can identify the semantic factor by fitting training data, and this semantic-identification guarantees the boundedness of OOD generalization error and the success of adaptation. Empirical study shows improved OOD performance over prevailing baselines.
Submitted 1 November, 2021; v1 submitted 3 November, 2020;
originally announced November 2020.
-
Bayesian Poisson Log-normal Model with Regularized Time Structure for Mortality Projection of Multi-population
Authors:
Zhen Liu,
Xiaoqian Sun,
Leping Liu,
Yu-Bo Wang
Abstract:
The improvement of mortality projection is a pivotal topic in the diverse branches related to insurance, demography, and public policy. Motivated by the thread of Lee-Carter related models, we propose a Bayesian model to estimate and predict mortality rates for multiple populations. This new model features information borrowing among populations and properly reflects variations in the data. It also provides a solution to a long-overlooked problem: model selection for dependence structures of population-specific time parameters. By introducing a Dirac spike function, simultaneous model selection and estimation for population-specific time effects can be achieved without much extra computational cost. We use the Japanese mortality data from the Human Mortality Database to illustrate the desirable properties of our model.
Submitted 9 October, 2020;
originally announced October 2020.
-
Learning with Instance-Dependent Label Noise: A Sample Sieve Approach
Authors:
Hao Cheng,
Zhaowei Zhu,
Xingyu Li,
Yifei Gong,
Xing Sun,
Yang Liu
Abstract:
Human-annotated labels are often prone to noise, and the presence of such noise will degrade the performance of the resulting deep neural network (DNN) models. Much of the literature (with several recent exceptions) on learning with noisy labels focuses on the case when the label noise is independent of features. In practice, annotation errors tend to be instance-dependent and often depend on the difficulty levels of recognizing a certain task. Applying existing results from instance-independent settings would require a significant amount of estimation of noise rates. Therefore, providing theoretically rigorous solutions for learning with instance-dependent label noise remains a challenge. In this paper, we propose CORES$^{2}$ (COnfidence REgularized Sample Sieve), which progressively sieves out corrupted examples. The implementation of CORES$^{2}$ does not require specifying noise rates and yet we are able to provide theoretical guarantees of CORES$^{2}$ in filtering out the corrupted examples. This high-quality sample sieve allows us to treat clean examples and the corrupted ones separately in training a DNN solution, and such a separation is shown to be advantageous in the instance-dependent noise setting. We demonstrate the performance of CORES$^{2}$ on CIFAR10 and CIFAR100 datasets with synthetic instance-dependent label noise and Clothing1M with real-world human noise. Of independent interest, our sample sieve provides a generic machinery for anatomizing noisy datasets and provides a flexible interface for various robust training techniques to further improve the performance. Code is available at https://github.com/UCSC-REAL/cores.
Submitted 22 March, 2021; v1 submitted 5 October, 2020;
originally announced October 2020.
-
Collaborative Group Learning
Authors:
Shaoxiong Feng,
Hongshen Chen,
Xuancheng Ren,
Zhuoye Ding,
Kan Li,
Xu Sun
Abstract:
Collaborative learning has successfully applied knowledge transfer to guide a pool of small student networks towards robust local minima. However, previous approaches typically struggle with drastically aggravated student homogenization when the number of students rises. In this paper, we propose Collaborative Group Learning, an efficient framework that aims to diversify the feature representation and conduct an effective regularization. Intuitively, similar to the human group study mechanism, we induce students to learn and exchange different parts of course knowledge as collaborative groups. First, each student is established by randomly routing on a modular neural network, which facilitates flexible knowledge communication between students due to random levels of representation sharing and branching. Second, to resist the student homogenization, students first compose diverse feature sets by exploiting the inductive bias from sub-sets of training data, and then aggregate and distill different complementary knowledge by imitating a random sub-group of students at each time step. Overall, the above mechanisms are beneficial for maximizing the student population to further improve the model generalization without sacrificing computational efficiency. Empirical evaluations on both image and text tasks indicate that our method significantly outperforms various state-of-the-art collaborative approaches whilst enhancing computational efficiency.
Submitted 21 February, 2021; v1 submitted 16 September, 2020;
originally announced September 2020.
-
Kernel Interpolation of High Dimensional Scattered Data
Authors:
Shao-Bo Lin,
Xiangyu Chang,
** Sun
Abstract:
Data sites selected from modeling high-dimensional problems often appear scattered in non-paternalistic ways. Except for sporadic clustering at some spots, they become relatively far apart as the dimension of the ambient space grows. These features defy any theoretical treatment that requires local or global quasi-uniformity of distribution of data sites. Incorporating a recently-developed application of integral operator theory in machine learning, we propose and study in the current article a new framework to analyze kernel interpolation of high dimensional data, which features bounding stochastic approximation error by the spectrum of the underlying kernel matrix. Both theoretical analysis and numerical simulations show that spectra of kernel matrices are reliable and stable barometers for gauging the performance of kernel-interpolation methods for high dimensional data.
Submitted 27 September, 2021; v1 submitted 3 September, 2020;
originally announced September 2020.
-
Open Set Recognition with Conditional Probabilistic Generative Models
Authors:
Xin Sun,
Chi Zhang,
Guosheng Lin,
Keck-Voon Ling
Abstract:
Deep neural networks have made breakthroughs in a wide range of visual understanding tasks. A typical challenge that hinders their real-world applications is that unknown samples may be fed into the system during the testing phase, but traditional deep neural networks will wrongly recognize these unknown samples as one of the known classes. Open set recognition (OSR) is a potential solution to overcome this problem, where the open set classifier should have the flexibility to reject unknown samples and meanwhile maintain high classification accuracy in known classes. Probabilistic generative models, such as Variational Autoencoders (VAE) and Adversarial Autoencoders (AAE), are popular methods to detect unknowns, but they cannot provide discriminative representations for known classification. In this paper, we propose a novel framework, called Conditional Probabilistic Generative Models (CPGM), for open set recognition. The core insight of our work is to add discriminative information into the probabilistic generative models, such that the proposed models can not only detect unknown samples but also classify known classes by forcing different latent features to approximate conditional Gaussian distributions. We discuss many model variants and provide comprehensive experiments to study their characteristics. Experiment results on multiple benchmark datasets reveal that the proposed method significantly outperforms the baselines and achieves new state-of-the-art performance.
Submitted 9 February, 2021; v1 submitted 12 August, 2020;
originally announced August 2020.
-
Fast Graphlet Transform of Sparse Graphs
Authors:
Dimitris Floros,
Nikos Pitsianis,
Xiaobai Sun
Abstract:
We introduce the computational problem of graphlet transform of a sparse large graph. Graphlets are fundamental topology elements of all graphs/networks. They can be used as coding elements to encode graph-topological information at multiple granularity levels for classifying vertices on the same graph/network as well as for making differentiation or connection across different networks. Network/graph analysis using graphlets has growing applications. We recognize the universality and increased encoding capacity in using multiple graphlets, we address the arising computational complexity issues, and we present a fast method for exact graphlet transform. The fast graphlet transform establishes a few remarkable records at once in high computational efficiency, low memory consumption, and ready translation to high-performance program and implementation. It is intended to enable and advance network/graph analysis with graphlets, and to introduce the relatively new analysis apparatus to graph theory, high-performance graph computation, and broader applications.
Submitted 31 August, 2020; v1 submitted 21 July, 2020;
originally announced July 2020.
-
Optimization from Structured Samples for Coverage Functions
Authors:
Wei Chen,
Xiaoming Sun,
Jialin Zhang,
Zhijie Zhang
Abstract:
We revisit the optimization from samples (OPS) model, which studies the problem of optimizing objective functions directly from the sample data. Previous results showed that we cannot obtain a constant approximation ratio for the maximum coverage problem using polynomially many independent samples of the form $\{S_i, f(S_i)\}_{i=1}^t$ (Balkanski et al., 2017), even if coverage functions are $(1 - ε)$-PMAC learnable using these samples (Badanidiyuru et al., 2012), which means most of the function values can be approximately learned very well with high probability. In this work, to circumvent the impossibility result of OPS, we propose a stronger model called optimization from structured samples (OPSS) for coverage functions, where the data samples encode the structural information of the functions. We show that under three general assumptions on the sample distributions, we can design efficient OPSS algorithms that achieve a constant approximation for the maximum coverage problem. We further prove a constant lower bound under these assumptions, which is tight when not considering computational efficiency. Moreover, we also show that if we remove any one of the three assumptions, OPSS for the maximum coverage problem has no constant approximation.
Submitted 6 July, 2020;
originally announced July 2020.
-
DessiLBI: Exploring Structural Sparsity of Deep Networks via Differential Inclusion Paths
Authors:
Yanwei Fu,
Chen Liu,
Donghao Li,
Xinwei Sun,
Jinshan Zeng,
Yuan Yao
Abstract:
Over-parameterization is ubiquitous nowadays in training neural networks to benefit both optimization in seeking global optima and generalization in reducing prediction error. However, compressive networks are desired in many real world applications and direct training of small networks may be trapped in local optima. In this paper, instead of pruning or distilling over-parameterized models to compressive ones, we propose a new approach based on differential inclusions of inverse scale spaces. Specifically, it generates a family of models from simple to complex ones that couples a pair of parameters to simultaneously train over-parameterized deep models and structural sparsity on weights of fully connected and convolutional layers. Such a differential inclusion scheme has a simple discretization, proposed as Deep structurally splitting Linearized Bregman Iteration (DessiLBI), for which a global convergence analysis in deep learning establishes that, from any initialization, the algorithmic iterations converge to a critical point of the empirical risk. Experimental evidence shows that DessiLBI achieves comparable and even better performance than the competitive optimizers in exploring the structural sparsity of several widely used backbones on the benchmark datasets. Remarkably, with early stopping, DessiLBI unveils "winning tickets" in early epochs: the effective sparse structure with comparable test accuracy to fully trained over-parameterized models.
Submitted 4 July, 2020;
originally announced July 2020.
-
Adversarial Attacks for Multi-view Deep Models
Authors:
Xuli Sun,
Shiliang Sun
Abstract:
Recent work has highlighted the vulnerability of many deep machine learning models to adversarial examples. This has drawn increasing attention to adversarial attacks, which can be used to evaluate the security and robustness of models before they are deployed. However, to the best of our knowledge, there is no specific research on adversarial attacks for multi-view deep models. This paper proposes two multi-view attack strategies, two-stage attack (TSA) and end-to-end attack (ETEA). With the mild assumption that the single-view model on which the target multi-view model is based is known, we first propose the TSA strategy. The main idea of TSA is to attack the multi-view model with adversarial examples generated by attacking the associated single-view model, by which state-of-the-art single-view attack methods are directly extended to the multi-view scenario. Then we further propose the ETEA strategy for the case when the multi-view model is provided publicly. The ETEA is applied to accomplish direct attacks on the target multi-view model, where we develop three effective multi-view attack methods. Finally, based on the fact that adversarial examples generalize well among different models, this paper takes the adversarial attack on the multi-view convolutional neural network as an example to validate the effectiveness of the proposed multi-view attacks. Extensive experimental results demonstrate that our multi-view attack strategies are capable of attacking the multi-view deep models, and we additionally find that multi-view models are more robust than single-view models.
Submitted 19 June, 2020;
originally announced June 2020.
-
Exploring the Vulnerability of Deep Neural Networks: A Study of Parameter Corruption
Authors:
Xu Sun,
Zhiyuan Zhang,
Xuancheng Ren,
Ruixuan Luo,
Liangyou Li
Abstract:
We argue that the vulnerability of model parameters is of crucial value to the study of model robustness and generalization but little research has been devoted to understanding this matter. In this work, we propose an indicator to measure the robustness of neural network parameters by exploiting their vulnerability via parameter corruption. The proposed indicator describes the maximum loss variation in the non-trivial worst-case scenario under parameter corruption. For practical purposes, we give a gradient-based estimation, which is far more effective than random corruption trials that can hardly induce the worst accuracy degradation. Equipped with theoretical support and empirical validation, we are able to systematically investigate the robustness of different model parameters and reveal vulnerability of deep neural networks that has been rarely paid attention to before. Moreover, we can enhance the models accordingly with the proposed adversarial corruption-resistant training, which not only improves the parameter robustness but also translates into accuracy elevation.
Submitted 10 December, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Cracking the Black Box: Distilling Deep Sports Analytics
Authors:
Xiangyu Sun,
Jack Davis,
Oliver Schulte,
Guiliang Liu
Abstract:
This paper addresses the trade-off between Accuracy and Transparency for deep learning applied to sports analytics. Neural nets achieve great predictive accuracy through deep learning, and are popular in sports analytics. But it is hard to interpret a neural net model and harder still to extract actionable insights from the knowledge implicit in it. Therefore, we built a simple and transparent model that mimics the output of the original deep learning model and represents the learned knowledge in an explicit interpretable way. Our mimic model is a linear model tree, which combines a collection of linear models with a regression-tree structure. The tree version of a neural network achieves high fidelity, explains itself, and produces insights for expert stakeholders such as athletes and coaches. We propose and compare several scalable model tree learning heuristics to address the computational challenge from datasets with millions of data points.
Submitted 29 June, 2020; v1 submitted 3 June, 2020;
originally announced June 2020.
-
An Asympirical Smoothing Parameters Selection Approach for Smoothing Spline ANOVA Models in Large Samples
Authors:
Xiaoxiao Sun,
Wenxuan Zhong,
Ping Ma
Abstract:
Large samples have been generated routinely from various sources. Classic statistical models, such as smoothing spline ANOVA models, are not well equipped to analyze such large samples due to expensive computational costs. In particular, the daunting computational costs of selecting smoothing parameters render smoothing spline ANOVA models impractical. In this article, we develop an asympirical, i.e., asymptotic and empirical, smoothing parameters selection approach for smoothing spline ANOVA models in large samples. The idea of this approach is to use asymptotic analysis to show that the optimal smoothing parameter is a polynomial function of the sample size and an unknown constant. The unknown constant is then estimated through empirical subsample extrapolation. The proposed method significantly reduces the computational costs of selecting smoothing parameters in high-dimensional and large samples. We show smoothing parameters chosen by the proposed method tend to the optimal smoothing parameters that minimise a specific risk function. In addition, the estimator based on the proposed smoothing parameters achieves the optimal convergence rate. Extensive simulation studies demonstrate the numerical advantage of the proposed method over competing methods in terms of relative efficacies and running time. On an application to molecular dynamics data with nearly one million observations, the proposed method has the best prediction performance.
Submitted 21 April, 2020;
originally announced April 2020.