Search | arXiv e-print repository

Deciphering interventional dynamical causality from non-intervention systems

Authors: Jifan Shi, Yang Li, Juan Zhao, Siyang Leng, Kazuyuki Aihara, Luonan Chen, Wei Lin

Abstract: Detecting and quantifying causality is a focal topic in the fields of science, engineering, and interdisciplinary studies. However, causal studies on non-intervention systems attract much attention but remain extremely challenging. To address this challenge, we propose a framework named Interventional Dynamical Causality (IntDC) for such non-intervention systems, along with its computational criterion, Interventional Embedding Entropy (IEE), to quantify causality. The IEE criterion theoretically and numerically enables the deciphering of IntDC solely from observational (non-interventional) time-series data, without requiring any knowledge of dynamical models or real interventions in the considered system. Demonstrations of performance showed the accuracy and robustness of IEE on benchmark simulated systems as well as real-world systems, including the neural connectomes of C. elegans, COVID-19 transmission networks in Japan, and regulatory networks surrounding key circadian genes.

Submitted 28 June, 2024; originally announced July 2024.

arXiv:2406.04150 [pdf, other]

A novel robust meta-analysis model using the $t$ distribution for outlier accommodation and detection

Authors: Yue Wang, Jianhua Zhao, Fen Jiang, Lei Shi, Jianxin Pan

Abstract: Random effects meta-analysis model is an important tool for integrating results from multiple independent studies. However, the standard model is based on the assumption of normal distributions for both random effects and within-study errors, making it susceptible to outlying studies. Although robust modeling using the $t$ distribution is an appealing idea, the existing work, that explores the use of the $t$ distribution only for random effects, involves complicated numerical integration and numerical optimization. In this paper, a novel robust meta-analysis model using the $t$ distribution is proposed ($t$Meta). The novelty is that the marginal distribution of the effect size in $t$Meta follows the $t$ distribution, enabling that $t$Meta can simultaneously accommodate and detect outlying studies in a simple and adaptive manner. A simple and fast EM-type algorithm is developed for maximum likelihood estimation. Due to the mathematical tractability of the $t$ distribution, $t$Meta frees from numerical integration and allows for efficient optimization. Experiments on real data demonstrate that $t$Meta is compared favorably with related competitors in situations involving mild outliers. Moreover, in the presence of gross outliers, while related competitors may fail, $t$Meta continues to perform consistently and robustly.

Submitted 6 June, 2024; originally announced June 2024.

Comments: 15 pages, 7 figures

MSC Class: 62P10

ACM Class: I.2.6

arXiv:2406.03849 [pdf]

A Noise-robust Multi-head Attention Mechanism for Formation Resistivity Prediction: Frequency Aware LSTM

Authors: Yongan Zhang, Junfeng Zhao, Jian Li, Xuanran Wang, Youzhuang Sun, Yuntian Chen, Dongxiao Zhang

Abstract: The prediction of formation resistivity plays a crucial role in the evaluation of oil and gas reservoirs, identification and assessment of geothermal energy resources, groundwater detection and monitoring, and carbon capture and storage. However, traditional well logging techniques fail to measure accurate resistivity in cased boreholes, and the transient electromagnetic method for cased borehole resistivity logging encounters challenges of high-frequency disaster (the problem of inadequate learning by neural networks in high-frequency features) and noise interference, badly affecting accuracy. To address these challenges, frequency-aware framework and temporal anti-noise block are proposed to build frequency aware LSTM (FAL). The frequency-aware framework implements a dual-stream structure through wavelet transformation, allowing the neural network to simultaneously handle high-frequency and low-frequency flows of time-series data, thus avoiding high-frequency disaster. The temporal anti-noise block integrates multiple attention mechanisms and soft-threshold attention mechanisms, enabling the model to better distinguish noise from redundant features. Ablation experiments demonstrate that the frequency-aware framework and temporal anti-noise block contribute significantly to performance improvement. FAL achieves a 24.3% improvement in R^2 over LSTM, reaching the highest value of 0.91 among all models. In robustness experiments, the impact of noise on FAL is approximately 1/8 of the baseline, confirming the noise resistance of FAL. The proposed FAL effectively reduces noise interference in predicting formation resistivity from cased transient electromagnetic well logging curves, better learns high-frequency features, and thereby enhances the prediction accuracy and noise resistance of the neural network model.

Submitted 6 June, 2024; originally announced June 2024.

arXiv:2406.00701 [pdf, other]

Profiled Transfer Learning for High Dimensional Linear Model

Authors: Ziqian Lin, Junlong Zhao, Fang Wang, Hansheng Wang

Abstract: We develop here a novel transfer learning methodology called Profiled Transfer Learning (PTL). The method is based on the \textit{approximate-linear} assumption between the source and target parameters. Compared with the commonly assumed \textit{vanishing-difference} assumption and \textit{low-rank} assumption in the literature, the \textit{approximate-linear} assumption is more flexible and less stringent. Specifically, the PTL estimator is constructed by two major steps. Firstly, we regress the response on the transferred feature, leading to the profiled responses. Subsequently, we learn the regression relationship between profiled responses and the covariates on the target data. The final estimator is then assembled based on the \textit{approximate-linear} relationship. To theoretically support the PTL estimator, we derive the non-asymptotic upper bound and minimax lower bound. We find that the PTL estimator is minimax optimal under appropriate regularity conditions. Extensive simulation studies are presented to demonstrate the finite sample performance of the new method. A real data example about sentence prediction is also presented with very encouraging results.

Submitted 5 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

arXiv:2404.04709 [pdf, other]

Two-Sided Flexibility in Platforms

Authors: Daniel Freund, Sébastien Martin, Jiayu Kamessi Zhao

Abstract: Flexibility is a cornerstone of operations management, crucial to hedge stochasticity in product demands, service requirements, and resource allocation. In two-sided platforms, flexibility is also two-sided and can be viewed as the compatibility of agents on one side with agents on the other side. Platform actions often influence the flexibility on either the demand or the supply side. But how should flexibility be jointly allocated across different sides? Whereas the literature has traditionally focused on only one side at a time, our work initiates the study of two-sided flexibility in matching platforms. We propose a parsimonious matching model in random graphs and identify the flexibility allocation that optimizes the expected size of a maximum matching. Our findings reveal that flexibility allocation is a first-order issue: for a given flexibility budget, the resulting matching size can vary greatly depending on how the budget is allocated. Moreover, even in the simple and symmetric settings we study, the quest for the optimal allocation is complicated. In particular, easy and costly mistakes can be made if the flexibility decisions on the demand and supply side are optimized independently (e.g., by two different teams in the company), rather than jointly. To guide the search for optimal flexibility allocation, we uncover two effects, flexibility cannibalization, and flexibility abundance, that govern when the optimal design places the flexibility budget only on one side or equally on both sides. In doing so we identify the study of two-sided flexibility as a significant aspect of platform efficiency.

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2403.07288 [pdf, other]

Efficient and Model-Agnostic Parameter Estimation Under Privacy-Preserving Post-randomization Data

Authors: Qinglong Tian, Jiwei Zhao

Abstract: Protecting individual privacy is crucial when releasing sensitive data for public use. While data de-identification helps, it is not enough. This paper addresses parameter estimation in scenarios where data are perturbed using the Post-Randomization Method (PRAM) to enhance privacy. Existing methods for parameter estimation under PRAM data suffer from limitations like being parameter-specific, model-dependent, and lacking efficiency guarantees. We propose a novel, efficient method that overcomes these limitations. Our method is applicable to general parameters defined through estimating equations and makes no assumptions about the underlying data model. We further prove that the proposed estimator achieves the semiparametric efficiency bound, making it optimal in terms of asymptotic variance.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2401.16410 [pdf, other]

ReTaSA: A Nonparametric Functional Estimation Approach for Addressing Continuous Target Shift

Authors: Hwanwoo Kim, Xin Zhang, Jiwei Zhao, Qinglong Tian

Abstract: The presence of distribution shifts poses a significant challenge for deploying modern machine learning models in real-world applications. This work focuses on the target shift problem in a regression setting (Zhang et al., 2013; Nguyen et al., 2016). More specifically, the target variable y (also known as the response variable), which is continuous, has different marginal distributions in the training source and testing domain, while the conditional distribution of features x given y remains the same. While most literature focuses on classification tasks with finite target space, the regression problem has an infinite dimensional target space, which makes many of the existing methods inapplicable. In this work, we show that the continuous target shift problem can be addressed by estimating the importance weight function from an ill-posed integral equation. We propose a nonparametric regularized approach named ReTaSA to solve the ill-posed integral equation and provide theoretical justification for the estimated importance weight function. The effectiveness of the proposed method has been demonstrated with extensive numerical studies on synthetic and real-world datasets.

Submitted 29 January, 2024; originally announced January 2024.

Comments: Accepted by ICLR 2024

arXiv:2401.09259 [pdf, other]

Mitigating distribution shift in machine learning-augmented hybrid simulation

Authors: Jiaxi Zhao, Qianxiao Li

Abstract: We study the problem of distribution shift generally arising in machine-learning augmented hybrid simulation, where parts of simulation algorithms are replaced by data-driven surrogates. We first establish a mathematical framework to understand the structure of machine-learning augmented hybrid simulation problems, and the cause and effect of the associated distribution shift. We show correlations between distribution shift and simulation error both numerically and theoretically. Then, we propose a simple methodology based on tangent-space regularized estimator to control the distribution shift, thereby improving the long-term accuracy of the simulation results. In the linear dynamics case, we provide a thorough theoretical analysis to quantify the effectiveness of the proposed method. Moreover, we conduct several numerical experiments, including simulating a partially known reaction-diffusion equation and solving Navier-Stokes equations using the projection method with a data-driven pressure solver. In all cases, we observe marked improvements in simulation accuracy under the proposed method, especially for systems with high degrees of distribution shift, such as those with relatively strong non-linear reaction mechanisms, or flows at large Reynolds numbers.

Submitted 17 January, 2024; originally announced January 2024.

MSC Class: 68T99; 65M15; 37M05

arXiv:2401.07267 [pdf, other]

Inference for high-dimensional linear expectile regression with de-biased method

Authors: Xiang Li, Yu-Ning Li, Li-Xin Zhang, Jun Zhao

Abstract: In this paper, we address the inference problem in high-dimensional linear expectile regression. We transform the expectile loss into a weighted-least-squares form and apply a de-biased strategy to establish Wald-type tests for multiple constraints within a regularized framework. Simultaneously, we construct an estimator for the pseudo-inverse of the generalized Hessian matrix in high dimension with general amenable regularizers including Lasso and SCAD, and demonstrate its consistency through a new proof technique. We conduct simulation studies and real data applications to demonstrate the efficacy of our proposed test statistic in both homoscedastic and heteroscedastic scenarios.

Submitted 14 January, 2024; originally announced January 2024.

Comments: 34 pages

MSC Class: 62F05; 62F12; 62J12

arXiv:2401.07000 [pdf, other]

Counterfactual Slope and Its Applications to Social Stratification

Authors: Ang Yu, Jiwei Zhao

Abstract: This paper addresses two prominent theses in social stratification research, the great equalizer thesis and Mare's (1980) school transition thesis. Both theses are premised on a descriptive regularity: the association between socioeconomic background and an outcome variable changes when conditioning on an intermediate treatment. The interpretation of this descriptive regularity is complicated by social actors' differential selection into treatment based on their potential outcomes under treatment. In particular, if the descriptive regularity is driven by selection, then the theses do not have a substantive interpretation. We propose a set of novel counterfactual slope estimands, which capture the two theses under the hypothetical scenario where differential selection into treatment is eliminated. Thus, we use the counterfactual slopes to construct selection-free tests for the two theses. Compared with the existing literature, we are the first to provide explicit, nonparametric, and causal estimands, which enable us to conduct principled selection-free tests. We develop efficient and robust estimators by deriving the efficient influence functions of the estimands. We apply our framework to a nationally representative dataset in the United States and re-evaluate the two theses. Findings from our selection-free tests show that the descriptive regularity of the two theses is misleading for substantive interpretations.

Submitted 13 January, 2024; originally announced January 2024.

arXiv:2401.02203 [pdf, other]

Robust bilinear factor analysis based on the matrix-variate $t$ distribution

Authors: Xuan Ma, Jianhua Zhao, Changchun Shang, Fen Jiang, Philip L. H. Yu

Abstract: Factor Analysis based on multivariate $t$ distribution ($t$fa) is a useful robust tool for extracting common factors on heavy-tailed or contaminated data. However, $t$fa is only applicable to vector data. When $t$fa is applied to matrix data, it is common to first vectorize the matrix observations. This introduces two challenges for $t$fa: (i) the inherent matrix structure of the data is broken, and (ii) robustness may be lost, as vectorized matrix data typically results in a high data dimension, which could easily lead to the breakdown of $t$fa. To address these issues, starting from the intrinsic matrix structure of matrix data, a novel robust factor analysis model, namely bilinear factor analysis built on the matrix-variate $t$ distribution ($t$bfa), is proposed in this paper. The novelty is that it is capable to simultaneously extract common factors for both row and column variables of interest on heavy-tailed or contaminated matrix data. Two efficient algorithms for maximum likelihood estimation of $t$bfa are developed. Closed-form expression for the Fisher information matrix to calculate the accuracy of parameter estimates are derived. Empirical studies are conducted to understand the proposed $t$bfa model and compare with related competitors. The results demonstrate the superiority and practicality of $t$bfa. Importantly, $t$bfa exhibits a significantly higher breakdown point than $t$fa, making it more suitable for matrix data.

Submitted 4 January, 2024; originally announced January 2024.

arXiv:2311.14220 [pdf, other]

Assumption-lean and Data-adaptive Post-Prediction Inference

Authors: Jiacheng Miao, Xinran Miao, Yixuan Wu, Jiwei Zhao, Qiongshi Lu

Abstract: A primary challenge facing modern scientific research is the limited availability of gold-standard data which can be both costly and labor-intensive to obtain. With the rapid development of machine learning (ML), scientists have relied on ML algorithms to predict these gold-standard outcomes with easily obtained covariates. However, these predicted outcomes are often used directly in subsequent statistical analyses, ignoring imprecision and heterogeneity introduced by the prediction procedure. This will likely result in false positive findings and invalid scientific conclusions. In this work, we introduce an assumption-lean and data-adaptive Post-Prediction Inference (POP-Inf) procedure that allows valid and powerful inference based on ML-predicted outcomes. Its "assumption-lean" property guarantees reliable statistical inference without assumptions on the ML-prediction, for a wide range of statistical quantities. Its "data-adaptive" feature guarantees an efficiency gain over existing post-prediction inference methods, regardless of the accuracy of ML-prediction. We demonstrate the superiority and applicability of our method through simulations and large-scale genomic data.

Submitted 6 February, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

arXiv:2311.07972 [pdf, other]

Residual Importance Weighted Transfer Learning For High-dimensional Linear Regression

Authors: Junlong Zhao, Shengbin Zheng, Chenlei Leng

Abstract: Transfer learning is an emerging paradigm for leveraging multiple sources to improve the statistical inference on a single target. In this paper, we propose a novel approach named residual importance weighted transfer learning (RIW-TL) for high-dimensional linear models built on penalized likelihood. Compared to existing methods such as Trans-Lasso that selects sources in an all-in-all-out manner, RIW-TL includes samples via importance weighting and thus may permit more effective sample use. To determine the weights, remarkably RIW-TL only requires the knowledge of one-dimensional densities dependent on residuals, thus overcoming the curse of dimensionality of having to estimate high-dimensional densities in naive importance weighting. We show that the oracle RIW-TL provides a faster rate than its competitors and develop a cross-fitting procedure to estimate this oracle. We discuss variants of RIW-TL by adopting different choices for residual weighting. The theoretical properties of RIW-TL and its variants are established and compared with those of LASSO and Trans-Lasso. Extensive simulation and a real data analysis confirm its advantages.

Submitted 3 January, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

arXiv:2310.07990 [pdf]

Multi-View Variational Autoencoder for Missing Value Imputation in Untargeted Metabolomics

Authors: Chen Zhao, Kuan-Jui Su, Chong Wu, Xuewei Cao, Qiuying Sha, Wu Li, Zhe Luo, Tian Qin, Chuan Qiu, Lan Juan Zhao, Anqi Liu, Lindong Jiang, Xiao Zhang, Hui Shen, Weihua Zhou, Hong-Wen Deng

Abstract: Background: Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies. Method: In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-view variational autoencoder to jointly model the burden score, polygenetic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information. Results: We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using 35 template metabolites derived burden scores, PGS and LD-pruned SNPs, the proposed methods achieved R^2-scores > 0.01 for 71.55% of metabolites. Conclusion: The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.

Submitted 12 March, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

Comments: 19 pages, 3 figures

arXiv:2309.12997 [pdf, other]

Scaling Limits of the Wasserstein information matrix on Gaussian Mixture Models

Authors: Wuchen Li, Jiaxi Zhao

Abstract: We consider the Wasserstein metric on the Gaussian mixture models (GMMs), which is defined as the pullback of the full Wasserstein metric on the space of smooth probability distributions with finite second moment. It derives a class of Wasserstein metrics on probability simplices over one-dimensional bounded homogeneous lattices via a scaling limit of the Wasserstein metric on GMMs. Specifically, for a sequence of GMMs whose variances tend to zero, we prove that the limit of the Wasserstein metric exists after certain renormalization. Generalizations of this metric in general GMMs are established, including inhomogeneous lattice models whose lattice gaps are not the same, extended GMMs whose mean parameters of Gaussian components can also change, and the second-order metric containing high-order information of the scaling limit. We further study the Wasserstein gradient flows on GMMs for three typical functionals: potential, internal, and interaction energies. Numerical examples demonstrate the effectiveness of the proposed GMM models for approximating Wasserstein gradient flows.

Submitted 22 September, 2023; originally announced September 2023.

Comments: 32 pages, 3 figures

MSC Class: 62B11; 41A60

arXiv:2309.08808 [pdf, other]

Adaptive Neyman Allocation

Authors: Jinglong Zhao

Abstract: In experimental design, Neyman allocation refers to the practice of allocating subjects into treated and control groups, potentially in unequal numbers proportional to their respective standard deviations, with the objective of minimizing the variance of the treatment effect estimator. This widely recognized approach increases statistical power in scenarios where the treated and control groups have different standard deviations, as is often the case in social experiments, clinical trials, marketing research, and online A/B testing. However, Neyman allocation cannot be implemented unless the standard deviations are known in advance. Fortunately, the multi-stage nature of the aforementioned applications allows the use of earlier stage observations to estimate the standard deviations, which further guide allocation decisions in later stages. In this paper, we introduce a competitive analysis framework to study this multi-stage experimental design problem. We propose a simple adaptive Neyman allocation algorithm, which almost matches the information-theoretic limit of conducting experiments. Using online A/B testing data from a social media site, we demonstrate the effectiveness of our adaptive Neyman allocation algorithm, highlighting its practicality especially when applied with only a limited number of stages.

Submitted 21 September, 2023; v1 submitted 15 September, 2023; originally announced September 2023.
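
Note: the classical Neyman allocation rule described in the abstract above (group sizes proportional to the groups' standard deviations) can be sketched in a few lines. This is a minimal illustration that assumes the standard deviations are known in advance; it is not the paper's adaptive multi-stage algorithm, and the function names `neyman_allocation` and `estimator_variance` are hypothetical.

```python
def neyman_allocation(n_total, sd_treated, sd_control):
    """Split n_total subjects in proportion to the two standard deviations.

    Assumes both standard deviations are known and positive (illustrative only).
    """
    if n_total < 2 or sd_treated <= 0 or sd_control <= 0:
        raise ValueError("need n_total >= 2 and positive standard deviations")
    share = sd_treated / (sd_treated + sd_control)
    n_treated = round(n_total * share)
    # Keep at least one subject in each group.
    n_treated = min(max(n_treated, 1), n_total - 1)
    return n_treated, n_total - n_treated


def estimator_variance(n_treated, n_control, sd_treated, sd_control):
    """Variance of the difference-in-means treatment effect estimator."""
    return sd_treated ** 2 / n_treated + sd_control ** 2 / n_control


# Example: the treated group is three times as noisy as the control group.
n_t, n_c = neyman_allocation(100, sd_treated=3.0, sd_control=1.0)
print(n_t, n_c, estimator_variance(n_t, n_c, 3.0, 1.0))
print(estimator_variance(50, 50, 3.0, 1.0))  # equal split, for comparison
```

With unequal standard deviations, the proportional split gives a smaller estimator variance than the equal split, which is the motivation the abstract states.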

arXiv:2308.08152 [pdf, other]

Estimating Effects of Long-Term Treatments

Authors: Shan Huang, Chen Wang, Yuan Yuan, Jinglong Zhao, Jingjing Zhang

Abstract: Estimating the effects of long-term treatments in A/B testing presents a significant challenge. Such treatments -- including updates to product functions, user interface designs, and recommendation algorithms -- are intended to remain in the system for a long period after their launches. On the other hand, given the constraints of conducting long-term experiments, practitioners often rely on short-term experimental results to make product launch decisions. It remains an open question how to accurately estimate the effects of long-term treatments using short-term experimental data. To address this question, we introduce a longitudinal surrogate framework. We show that, under standard assumptions, the effects of long-term treatments can be decomposed into a series of functions, which depend on the user attributes, the short-term intermediate metrics, and the treatment assignments. We describe the identification assumptions, the estimation strategies, and the inference technique under this framework. Empirically, we show that our approach outperforms existing solutions by leveraging two real-world experiments, each involving millions of users on WeChat, one of the world's largest social networking platforms.

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2307.12226 [pdf, other]

Geometry-Aware Adaptation for Pretrained Models

Authors: Nicholas Roberts, Xintong Li, Dyah Adila, Sonia Cromp, Tzu-Heng Huang, Jitian Zhao, Frederic Sala

Abstract: Machine learning models -- including prominent zero-shot models -- are often trained on datasets whose labels are only a small proportion of a larger label space. Such spaces are commonly equipped with a metric that relates the labels via distances between them. We propose a simple approach to exploit this information to adapt the trained model to reliably predict new classes -- or, in the case of zero-shot prediction, to improve its performance -- without any additional training. Our technique is a drop-in replacement of the standard prediction rule, swapping argmax with the Fréchet mean. We provide a comprehensive theoretical analysis for this approach, studying (i) learning-theoretic results trading off label space diameter, sample complexity, and model dimension, (ii) characterizations of the full range of scenarios in which it is possible to predict any unobserved class, and (iii) an optimal active learning-like next class selection procedure to obtain optimal training classes for when it is not possible to predict the entire range of unobserved classes. Empirically, using easily-available external metrics, our proposed approach, Loki, gains up to 29.7% relative improvement over SimCLR on ImageNet and scales to hundreds of thousands of classes. When no such metric is available, Loki can use self-derived metrics from class embeddings and obtains a 10.5% improvement on pretrained zero-shot models such as CLIP.
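
The prediction rule mentioned in this abstract (replacing argmax with a Fréchet mean) can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: it assumes the model's class probabilities are used as weights and that the Fréchet mean is taken over a finite candidate label set with squared distances. The function name and the toy label space are hypothetical.

```python
def frechet_mean_prediction(probabilities, distances):
    """Return the candidate label minimizing the probability-weighted squared distance.

    probabilities[k]: model probability for observed class k (assumed input).
    distances[j][k]:  metric distance between candidate label j and observed class k.
    """
    best_label, best_cost = None, float("inf")
    for label, row in enumerate(distances):
        cost = sum(p * d * d for p, d in zip(probabilities, row))
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


# Toy label space: three labels on a line at positions 0, 1, 2.
# The model was trained only on the classes at positions 0 and 2.
distances = [
    [0.0, 2.0],  # candidate label 0
    [1.0, 1.0],  # candidate label 1 (never observed in training)
    [2.0, 0.0],  # candidate label 2
]
print(frechet_mean_prediction([0.5, 0.5], distances))  # picks the unobserved middle label
print(frechet_mean_prediction([0.9, 0.1], distances))  # picks label 0
```

With evenly split probabilities the rule selects the middle label, which plain argmax over the two trained classes could never return.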

Submitted 27 November, 2023; v1 submitted 23 July, 2023; originally announced July 2023.

Comments: NeurIPS 2023

arXiv:2307.04250 [pdf, ps, other]

Doubly Flexible Estimation under Label Shift

Authors: Seong-ho Lee, Yanyuan Ma, Jiwei Zhao

Abstract: In studies ranging from clinical medicine to policy research, complete data are usually available from a population $\mathscr{P}$, but the quantity of interest is often sought for a related but different population $\mathscr{Q}$ which only has partial data. In this paper, we consider the setting that both outcome $Y$ and covariate ${\bf X}$ are available from $\mathscr{P}$ whereas only ${\bf X}$ is available from $\mathscr{Q}$, under the so-called label shift assumption, i.e., the conditional distribution of ${\bf X}$ given $Y$ remains the same across the two populations. To estimate the parameter of interest in $\mathscr{Q}$ via leveraging the information from $\mathscr{P}$, the following three ingredients are essential: (a) the common conditional distribution of ${\bf X}$ given $Y$, (b) the regression model of $Y$ given ${\bf X}$ in $\mathscr{P}$, and (c) the density ratio of $Y$ between the two populations. We propose an estimation procedure that only needs standard nonparametric technique to approximate the conditional expectations with respect to (a), while by no means needs an estimate or model for (b) or (c); i.e., doubly flexible to the possible model misspecifications of both (b) and (c). This is conceptually different from the well-known doubly robust estimation in that, double robustness allows at most one model to be misspecified whereas our proposal can allow both (b) and (c) to be misspecified. This is of particular interest in our setting because estimating (c) is difficult, if not impossible, by virtue of the absence of the $Y$-data in $\mathscr{Q}$. Furthermore, even though the estimation of (b) is sometimes off-the-shelf, it can face curse of dimensionality or computational challenges. We develop the large sample theory for the proposed estimator, and examine its finite-sample performance through simulation studies as well as an application to the MIMIC-III database.

Submitted 9 July, 2023; originally announced July 2023.

arXiv:2307.01908 [pdf, other]

Efficient Estimation of Average Treatment Effect on the Treated under Endogenous Treatment Assignment

Authors: Trinetri Ghosh, Menggang Yu, Jiwei Zhao

Abstract: In this paper, we consider estimation of average treatment effect on the treated (ATT), an interpretable and relevant causal estimand to policy makers when treatment assignment is endogenous. By considering shadow variables that are unrelated to the treatment assignment but related to interested outcomes, we establish identification of the ATT. Then we focus on efficient estimation of the ATT by characterizing the geometric structure of the likelihood, deriving the semiparametric efficiency bound for ATT estimation and proposing an estimator that can achieve this bound. We rigorously establish the theoretical results of the proposed estimator. The finite sample performance of the proposed estimator is studied through comprehensive simulation studies as well as an application to our motivating study.

Submitted 4 July, 2023; originally announced July 2023.

Comments: 34 pages, 2 figures

arXiv:2307.00205 [pdf, other]

A Transparent and Nonlinear Method for Variable Selection

Authors: Keyao Wang, Huiwen Wang, Jichang Zhao, Lihong Wang

Abstract: Variable selection is a procedure to attain the truly important predictors from inputs. Complex nonlinear dependencies and strong coupling pose great challenges for variable selection in high-dimensional data. In addition, real-world applications have increased demands for interpretability of the selection process. A pragmatic approach should not only attain the most predictive covariates, but also provide ample and easy-to-understand grounds for removing certain covariates. In view of these requirements, this paper puts forward an approach for transparent and nonlinear variable selection. In order to transparently decouple information within the input predictors, a three-step heuristic search is designed, via which the input predictors are grouped into four subsets: the relevant to be selected, and the uninformative, redundant, and conditionally independent to be removed. A nonlinear partial correlation coefficient is introduced to better identify the predictors which have nonlinear functional dependence with the response. The proposed method is model-free and the selected subset can be competent input for commonly used predictive models. Experiments demonstrate the superior performance of the proposed method against the state-of-the-art baselines in terms of prediction accuracy and model interpretability.

Submitted 30 June, 2023; originally announced July 2023.

arXiv:2306.06443 [pdf, other]

Sufficient Identification Conditions and Semiparametric Estimation under Missing Not at Random Mechanisms

Authors: Anna Guo, Jiwei Zhao, Razieh Nabi

Abstract: Conducting valid statistical analyses is challenging in the presence of missing-not-at-random (MNAR) data, where the missingness mechanism is dependent on the missing values themselves even conditioned on the observed data. Here, we consider a MNAR model that generalizes several prior popular MNAR models in two ways: first, it is less restrictive in terms of statistical independence assumptions imposed on the underlying joint data distribution, and second, it allows for all variables in the observed sample to have missing values. This MNAR model corresponds to a so-called criss-cross structure considered in the literature on graphical models of missing data that prevents nonparametric identification of the entire missing data model. Nonetheless, part of the complete-data distribution remains nonparametrically identifiable. By exploiting this fact and considering a rich class of exponential family distributions, we establish sufficient conditions for identification of the complete-data distribution as well as the entire missingness mechanism. We then propose methods for testing the independence restrictions encoded in such models using odds ratio as our parameter of interest. We adopt two semiparametric approaches for estimating the odds ratio parameter and establish the corresponding asymptotic theories: one involves maximizing a conditional likelihood with order statistics and the other uses estimating equations. The utility of our methods is illustrated via simulation studies.

Submitted 10 June, 2023; originally announced June 2023.

Journal ref: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), 2023

arXiv:2305.19123 [pdf, other]

ELSA: Efficient Label Shift Adaptation through the Lens of Semiparametric Models

Authors: Qinglong Tian, Xin Zhang, Jiwei Zhao

Abstract: We study the domain adaptation problem with label shift in this work. Under the label shift context, the marginal distribution of the label varies across the training and testing datasets, while the conditional distribution of features given the label is the same. Traditional label shift adaptation methods either suffer from large estimation errors or require cumbersome post-prediction calibrations. To address these issues, we first propose a moment-matching framework for adapting the label shift based on the geometry of the influence function. Under such a framework, we propose a novel method named Efficient Label Shift Adaptation (ELSA), in which the adaptation weights can be estimated by solving linear systems. Theoretically, the ELSA estimator is $\sqrt{n}$-consistent ($n$ is the sample size of the source data) and asymptotically normal. Empirically, we show that ELSA can achieve state-of-the-art estimation performances without post-prediction calibrations, thus, gaining computational efficiency.

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2305.15545 [pdf, other]

Reconstructing Transit Vehicle Trajectory Using High-Resolution GPS Data

Authors: Yuzhu Huang, Awad Abdelhalim, Anson Stewart, Jinhua Zhao, Haris Koutsopoulos

Abstract: High-resolution location ("heartbeat") data of transit fleet vehicles is a relatively new data source for many transit agencies. On its surface, the heartbeat data can provide a wealth of information about all operational details of a recorded transit vehicle trip, from its location trajectory to its speed and acceleration profiles. Previous studies have mainly focused on decomposing the total trip travel time into different components by vehicle state and then extracting measures of delays to draw conclusions on the performance of a transit route. This study delves into the task of reconstructing a complete, continuous and smooth transit vehicle trajectory from the heartbeat data that allows for the extraction of operational information of a bus at any point in time into its trip. Using only the latitude, longitude, and timestamp fields of the heartbeat data, the authors demonstrate that a continuous, smooth, and monotonic vehicle trajectory can be reconstructed using local regression in combination with monotonic cubic spline interpolation. The resultant trajectory can be used to evaluate transit performance and identify locations of bus delay near infrastructure such as traffic signals, pedestrian crossings, and bus stops.

Submitted 15 August, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 7 pages, to be published in IEEE ITSC-2023

arXiv:2305.11323 [pdf, other]

Cumulative differences between paired samples

Authors: Isabel Kloumann, Hannah Korevaar, Chris McConnell, Mark Tygert, Jessica Zhao

Abstract: The simplest, most common paired samples consist of observations from two populations, with each observed response from one population corresponding to an observed response from the other population at the same value of an ordinal covariate. The pair of observed responses (one from each population) at the same value of the covariate is known as a "matched pair" (with the matching based on the value of the covariate). A graph of cumulative differences between the two populations reveals differences in responses as a function of the covariate. Indeed, the slope of the secant line connecting two points on the graph becomes the average difference over the wide interval of values of the covariate between the two points; i.e., slope of the graph is the average difference in responses. ("Average" refers to the weighted average if the samples are weighted.) Moreover, a simple statistic known as the Kuiper metric summarizes into a single scalar the overall differences over all values of the covariate. The Kuiper metric is the absolute value of the total difference in responses between the two populations, totaled over the interval of values of the covariate for which the absolute value of the total is greatest. The total should be normalized such that it becomes the (weighted) average over all values of the covariate when the interval over which the total is taken is the entire range of the covariate (i.e., the sum for the total gets divided by the total number of observations, if the samples are unweighted, or divided by the total weight, if the samples are weighted). This cumulative approach is fully nonparametric and uniquely defined (with only one right way to construct the graphs and scalar summary statistics), unlike traditional methods such as reliability diagrams or parametric or semi-parametric regressions, which typically obscure significant differences due to their parameter settings.
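
The Kuiper metric as described in this abstract can be sketched for the unweighted case as follows. This is an illustrative reading of the abstract, not the authors' code: the function name is hypothetical, and the sketch assumes one matched pair per distinct covariate value. The largest absolute total over any interval equals the range (maximum minus minimum) of the normalized cumulative differences, with the cumulative sum starting at zero.

```python
def kuiper_metric(covariate, responses_a, responses_b):
    """Range of the normalized cumulative differences between matched pairs.

    Assumes unweighted samples and one matched pair per covariate value.
    """
    n = len(covariate)
    order = sorted(range(n), key=lambda i: covariate[i])
    cumulative = [0.0]  # cumulative difference before the first observation
    for i in order:
        step = (responses_a[i] - responses_b[i]) / n  # normalize by sample size
        cumulative.append(cumulative[-1] + step)
    return max(cumulative) - min(cumulative)


# Population A responds more at low covariate values, B at high values.
covariate = [1, 2, 3, 4]
responses_a = [1, 1, 0, 0]
responses_b = [0, 0, 1, 1]
print(kuiper_metric(covariate, responses_a, responses_b))
```

In this example the overall average difference is zero, yet the metric is positive because the two populations differ over sub-intervals of the covariate.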

Submitted 8 April, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: 19 pages, 9 figures

arXiv:2303.06186 [pdf, other]

The impacts of remote work on travel: insights from nearly three years of monthly surveys

Authors: Nicholas S. Caros, Xiaotong Guo, Yunhan Zheng, Jinhua Zhao

Abstract: Remote work has expanded dramatically since 2020, upending longstanding travel patterns and behavior. More fundamentally, the flexibility for remote workers to choose when and where to work has created much stronger connections between travel behavior and organizational behavior. This paper uses a large and comprehensive monthly longitudinal survey over nearly three years to identify new trends in work location choice, mode choice and departure time of remote workers. The travel behavior of remote workers is found to be highly associated with employer characteristics, task characteristics, employer remote work policies, coordination between colleagues and attitudes towards remote work. Approximately one third of all remote work hours are shown to take place outside of the home, accounting for over one third of all commuting trips. These commutes to "third places" are shorter, less likely to occur during peak periods, and more likely to use sustainable travel modes than commutes to an employer's primary workplace. Hybrid work arrangements are also associated with a greater number of non-work trips than fully remote and fully in-person arrangements. Implications of this research for policy makers, shared mobility providers and land use planning are discussed.

Submitted 10 March, 2023; originally announced March 2023.

arXiv:2303.06012 [pdf, other]

Examining the interactions between working from home, travel behavior and change in car ownership due to the impact of COVID-19

Authors: Yunhan Zheng, Nicholas Caros, Jim Aloisi, Jinhua Zhao

Abstract: COVID-19 has disrupted society and changed how people learn, work and live. The availability of vaccines in the spring of 2021, however, led to a gradual return of many pre-pandemic activities in Massachusetts in the fall of 2021. Leveraging data that were collected using a map-based survey tool in the Greater Boston area in the fall of 2021, this study explores changes in travel behavior due to COVID-19 and investigates the underlying factors contributing to these changes. First, a structural equation modeling technique is developed to capture the interactions between various travel choices, including working from home, travel mode use and change in car ownership. Moreover, attitudinal factors such as risk perceptions and attitudes towards WFH are incorporated into the framework to explain behavior changes. Second, a discrete choice modeling approach is taken to study shifts in commuting mode choices in the fall of 2021. The results show that in the fall of 2021, people became more likely to use their cars to commute, and for those who bought cars during the pandemic, they tended to work on-site more. Our findings can provide planners and policymakers with information upon which to base travel demand management decisions in the post-pandemic era.

Submitted 10 March, 2023; originally announced March 2023.

arXiv:2303.04040 [pdf, other]

Uncertainty Quantification of Spatiotemporal Travel Demand with Probabilistic Graph Neural Networks

Authors: Qingyi Wang, Shenhao Wang, Dingyi Zhuang, Haris Koutsopoulos, Jinhua Zhao

Abstract: Recent studies have significantly improved the prediction accuracy of travel demand using graph neural networks. However, these studies largely ignored uncertainty that inevitably exists in travel demand prediction. To fill this gap, this study proposes a framework of probabilistic graph neural networks (Prob-GNN) to quantify the spatiotemporal uncertainty of travel demand. This Prob-GNN framework is substantiated by deterministic and probabilistic assumptions, and empirically applied to the task of predicting the transit and ridesharing demand in Chicago. We found that the probabilistic assumptions (e.g. distribution tail, support) have a greater impact on uncertainty prediction than the deterministic ones (e.g. deep modules, depth). Among the family of Prob-GNNs, the GNNs with truncated Gaussian and Laplace distributions achieve the highest performance in transit and ridesharing data. Even under significant domain shifts, Prob-GNNs can predict the ridership uncertainty in a stable manner, when the models are trained on pre-COVID data and tested across multiple periods during and after the COVID-19 pandemic. Prob-GNNs also reveal the spatiotemporal pattern of uncertainty, which is concentrated on the afternoon peak hours and the areas with large travel volumes. Overall, our findings highlight the importance of incorporating randomness into deep learning for spatiotemporal ridership prediction. Future research should continue to investigate versatile probabilistic assumptions to capture behavioral randomness, and further develop methods to quantify uncertainty to build resilient cities.

Submitted 22 February, 2024; v1 submitted 7 March, 2023; originally announced March 2023.

arXiv:2302.08298 [pdf, other]

Unleashing the Potential of Acquisition Functions in High-Dimensional Bayesian Optimization

Authors: Jiayu Zhao, Renyu Yang, Shenghao Qiu, Zheng Wang

Abstract: Bayesian optimization (BO) is widely used to optimize expensive-to-evaluate black-box functions. BO first builds a surrogate model to represent the objective function and assesses its uncertainty. It then decides where to sample by maximizing an acquisition function (AF) based on the surrogate model. However, when dealing with high-dimensional problems, finding the global maximum of the AF becomes increasingly challenging. In such cases, the initialization of the AF maximizer plays a pivotal role, as an inadequate setup can severely hinder the effectiveness of the AF. This paper investigates a largely understudied problem concerning the impact of AF maximizer initialization on exploiting AFs' capability. Our large-scale empirical study shows that the widely used random initialization strategy often fails to harness the potential of an AF. In light of this, we propose a better initialization approach by employing multiple heuristic optimizers to leverage the historical data of black-box optimization to generate initial points for the AF maximizer. We evaluate our approach with a range of heavily studied synthetic functions and real-world applications. Experimental results show that our techniques, while simple, can significantly enhance the standard BO and outperform state-of-the-art methods by a large margin in most test cases.

Submitted 23 January, 2024; v1 submitted 16 February, 2023; originally announced February 2023.

Comments: Accepted by Transactions on Machine Learning Research (TMLR)

arXiv:2301.03808 [pdf, other]

Passenger Path Choice Estimation Using Smart Card Data: A Latent Class Approach with Panel Effects Across Days

Authors: Baichuan Mo, ZhenLiang Ma, Haris N. Koutsopoulos, Jinhua Zhao

Abstract: Understanding passengers' path choice behavior in urban rail systems is a prerequisite for effective operations and planning. This paper attempts bridging the gap by proposing a probabilistic approach to infer passengers' path choice behavior in urban rail systems using large-scale smart card data. The model uses latent classes and panel effects to capture passengers' implicit behavior heterogeneity and longitudinal correlations, key research gaps in big data driven behavior studies. We formulate the probability of each individual's arrival time at a destination based on their path choice behavior, and estimate corresponding path choice model parameters as a maximum likelihood estimation problem. The original likelihood function is intractable due to the exponential computation complexity. We derive a tractable likelihood function and propose a numerical integral approach to efficiently estimate the model. Also, we propose a method to calculate the t-statistic of the estimated choice parameters based on the numerically estimated Hessian matrix and Cramer-Rao bound (the lower bound on the coefficient variance). Case studies using synthetic data validate the model performance and its robustness against parameter initialization and input errors, and highlight the importance of incorporating crowding impact in path choice estimation. Applications using actual data from the Mass Transit Railway, Hong Kong reveal two latent groups of passengers: time-sensitive (TS) and comfort-aware (CA). TS passengers are those who are more likely to choose paths with short travel times. Most of them are regular commuters with high travel frequency and less schedule flexibility. CA passengers care more about the travel comfort experience and choose paths with less walking and waiting times. The proposed approach is data-driven and general to accommodate other discrete choice structures.

Submitted 10 January, 2023; originally announced January 2023.

arXiv:2301.02594 [pdf, other]

Modeling Virus Transmission Risks in Commuting with Emerging Mobility Services: A Case Study of COVID-19

Authors: Baichuan Mo, Peyman Noursalehi, Haris N. Koutsopoulos, Jinhua Zhao

Abstract: Commuting is an important part of daily life. With the gradual recovery from COVID-19 and more people returning to work from the office, the transmission of COVID-19 during commuting becomes a concern. Recent emerging mobility services (such as ride-hailing and bike-sharing) further deteriorate the infection risks due to shared vehicles or spaces during travel. Hence, it is important to quantify the infection risks in commuting. This paper proposes a probabilistic framework to estimate the risk of infection during an individual's commute considering different travel modes, including public transit, ride-share, bike, and walking. The objective is to evaluate the probability of infection as well as the estimation errors (i.e., uncertainty quantification) given the origin-destination (OD), departure time, and travel mode. We first define a general trip planning function to generate trip trajectories and probabilities of choosing different paths according to the OD, departure time, and travel mode. Then, we consider two channels of infections: 1) infection by close contact and 2) infection by touching surfaces. The infection risks are calculated on a trip segment basis. Different sources of data (such as smart card data, travel surveys, and population data) are used to estimate the potential interactions between the individual and the infectious environment. The model is implemented in the MIT community as a case study. We evaluate the commute infection risks for employees and students. Results show that most of the individuals have an infection probability close to zero. The maximum infection probability is around 0.8%, implying that the probability of getting infected during the commuting process is low. Individuals with larger travel distances, traveling in transit, and traveling during peak hours are more likely to get infected.

Submitted 6 January, 2023; originally announced January 2023.

arXiv:2211.04915 [pdf, other]

Inferring Mobility of Care Travel Behavior From Transit Origin-Destination Data

Authors: Daniela Shuman, Awad Abdelhalim, Anson F Stewart, Kayleigh B Campbell, Mira Patel, Ines Sanchez de Madariaga, Jinhua Zhao

Abstract: There are substantial differences in travel behavior by gender on public transit. Studies have concluded that these differences are largely attributable to household responsibilities typically falling disproportionately on women, leading to women being more likely to utilize transit for purposes referred to by the umbrella concept of "mobility of care". In contrast to past studies that have quantified the impact of gender using survey and qualitative data, we propose a novel data-driven workflow utilizing a combination of previously developed origin, destination, and transfer inference (ODX) based on individual transit fare card transactions, name-based gender inference, and geospatial analysis as a framework to identify mobility of care trip making. We apply this framework to data from the Washington Metropolitan Area Transit Authority (WMATA). Analyzing data from millions of journeys conducted in the first quarter of 2019, the results of this study show that our proposed workflow can identify mobility of care travel behavior, detecting times and places of interest where the share of women travelers in an equally-sampled subset (on basis of inferred gender) of transit users is 10% - 15% higher than that of men. The workflow presented in this study provides a blueprint for combining transit origin-destination data, inferred customer demographics, and geospatial analyses enabling public transit agencies to assess, at the fare card level, the gendered impacts of different policy and operational decisions.

Submitted 10 April, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

Comments: Updated reference formatting and discussion points

arXiv:2208.05908 [pdf, other]

doi 10.1145/3534678.3539093

Uncertainty Quantification of Sparse Travel Demand Prediction with Spatial-Temporal Graph Neural Networks

Authors: Dingyi Zhuang, Shenhao Wang, Haris N. Koutsopoulos, Jinhua Zhao

Abstract: Origin-Destination (O-D) travel demand prediction is a fundamental challenge in transportation. Recently, spatial-temporal deep learning models demonstrate the tremendous potential to enhance prediction accuracy. However, few studies tackled the uncertainty and sparsity issues in fine-grained O-D matrices. This presents a serious problem, because a vast number of zeros deviate from the Gaussian assumption underlying the deterministic deep learning models. To address this issue, we design a Spatial-Temporal Zero-Inflated Negative Binomial Graph Neural Network (STZINB-GNN) to quantify the uncertainty of the sparse travel demand. It analyzes spatial and temporal correlations using diffusion and temporal convolution networks, which are then fused to parameterize the probabilistic distributions of travel demand. The STZINB-GNN is examined using two real-world datasets with various spatial and temporal resolutions. The results demonstrate the superiority of STZINB-GNN over benchmark models, especially under high spatial-temporal resolutions, because of its high accuracy, tight confidence intervals, and interpretable parameters. The sparsity parameter of the STZINB-GNN has physical interpretation for various transportation applications.

Submitted 11 August, 2022; originally announced August 2022.

Comments: Accepted by KDD 2022

arXiv:2208.03291 [pdf]

Comparing Unit Trains versus Manifest Trains for the Risk of Rail Transport of Hazardous Materials -- Part II: Application and Case Study

Authors: Di Kang, Jiaxi Zhao, C. Tyler Dick, Xiang Liu, Zheyong Bian, Steven W. Kirkpatrick, Chen-Yu Lin

Abstract: Built upon the risk analysis methodology (presented in the part I paper), this part II paper focuses on applying this methodology. Five illustrative scenarios were used to analyze the best or worst cases and compare the transportation risk differences between service options using unit trains and manifest trains. The comparison results indicate that if all tank cars are placed at the positions with the lowest probability of derailing and if switching tank cars alone in classification yards, it could provide the lowest risk estimate given the same transportation demand (i.e., number of tank cars to transport). This paper also shows that based on the data and parameters in the case study, risks during arrival/departure events and yard switching events could be as significant as risks that on mainlines. This paper provides a way to use the risk analysis methodology for rail safety decisions. The methodology and its application can be tailored to specific infrastructure and rolling stock characteristics.

Submitted 4 July, 2022; originally announced August 2022.

arXiv:2207.02113 [pdf]

Comparing Unit Trains versus Manifest Trains for the Risk of Rail Transport of Hazardous Materials -- Part I: Risk Analysis Methodology

Authors: Di Kang, Jiaxi Zhao, C. Tyler Dick, Xiang Liu, Zheyong Bian, Steven W. Kirkpatrick, Chen-Yu Lin

Abstract: Transporting hazardous materials (hazmats) using tank cars has more significant economic benefits than other transportation modes. Although railway transportation is roughly four times more fuel-efficient than roadway transportation, a train derailment has greater potential to cause more disastrous consequences than a truck incident. Train types, such as unit train or manifest train (also called mixed train), can influence transport risks in several ways. For example, unit trains only experience risks on mainlines and when arriving at or departing from terminals, while manifest trains experience additional switching risks in yards. Based on prior studies and various data sources covering the years 1996-2018, this paper constructs event chains for line-haul risks on mainlines (for both unit trains and manifest trains), arrival/departure risks in terminals (for unit trains) and yards (for manifest trains), and yard switching risks for manifest trains using various probabilistic models, and finally determines expected casualties as the consequences of a potential train derailment and release incident. This is the first analysis to quantify the total risks a train may encounter throughout the shipment process, either on mainlines or in yards/terminals, distinguishing train types. It provides a methodology applicable to any train to calculate the expected risks (quantified as expected casualties in this paper) from an origin to a destination.

Submitted 4 July, 2022; originally announced July 2022.

arXiv:2204.09904 [pdf, other]

doi 10.1111/cgf.14527

Infographics Wizard: Flexible Infographics Authoring and Design Exploration

Authors: Anjul Tyagi, Jian Zhao, Pushkar Patel, Swasti Khurana, Klaus Mueller

Abstract: Infographics are an aesthetic visual representation of information following specific design principles of human perception. Designing infographics can be a tedious process for non-experts and time-consuming, even for professional designers. With the help of designers, we propose a semi-automated infographic framework for general structured and flow-based infographic design generation. For novice designers, our framework automatically creates and ranks infographic designs for a user-provided text with no requirement for design input. However, expert designers can still provide custom design inputs to customize the infographics. We will also contribute an individual visual group (VG) designs dataset (in SVG), along with a 1k complete infographic image dataset with segmented VGs in this work. Evaluation results confirm that by using our framework, designers from all expertise levels can generate generic infographic designs faster than existing methods while maintaining the same quality as hand-designed infographics templates.

Submitted 8 May, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

Comments: Preprint of the EUROVIS 22 accepted paper. arXiv admin note: substantial text overlap with arXiv:2108.11914

ACM Class: H.5.2; I.4.6; J.5

Journal ref: Computer Graphics Forum, 2022, 41: 121-132

arXiv:2204.09086 [pdf, other]

Choosing the number of factors in factor analysis with incomplete data via a hierarchical Bayesian information criterion

Authors: Jianhua Zhao, Changchun Shang, Shulan Li, Ling Xin, Philip L. H. Yu

Abstract: The Bayesian information criterion (BIC), defined as the observed data log likelihood minus a penalty term based on the sample size $N$, is a popular model selection criterion for factor analysis with complete data. This definition has also been suggested for incomplete data. However, the penalty term based on the `complete' sample size $N$ is the same no matter whether in a complete or incomplete data case. For incomplete data, there are often only $N_i<N$ observations for variable $i$, which means that using the `complete' sample size $N$ implausibly ignores the amounts of missing information inherent in incomplete data. Given this observation, a novel criterion called hierarchical BIC (HBIC) for factor analysis with incomplete data is proposed. The novelty is that it only uses the actual amounts of observed information, namely $N_i$'s, in the penalty term. Theoretically, it is shown that HBIC is a large sample approximation of variational Bayesian (VB) lower bound, and BIC is a further approximation of HBIC, which means that HBIC shares the theoretical consistency of BIC. Experiments on synthetic and real data sets are conducted to access the finite sample performance of HBIC, BIC, and related criteria with various missing rates. The results show that HBIC and BIC perform similarly when the missing rate is small, but HBIC is more accurate when the missing rate is not small.

Submitted 19 April, 2022; originally announced April 2022.

Comments: 16 pages, 4 figures

MSC Class: 62H25 ACM Class: G.3; I.2.6

arXiv:2203.01171 [pdf, other]

Imitation of Manipulation Skills Using Multiple Geometries

Authors: Boyang Ti, Yongsheng Gao, Jie Zhao, Sylvain Calinon

Abstract: Daily manipulation tasks are characterized by geometric primitives related to actions and object shapes. Such geometric descriptors are poorly represented by only using Cartesian coordinate systems. In this paper, we propose a learning approach to extract the optimal representation from a dictionary of coordinate systems to encode an observed movement/behavior. This is achieved by using an extension of Gaussian distributions on Riemannian manifolds, which is used to analyse a set of user demonstrations statistically, by considering multiple geometries as candidate representations of the task. We formulate the reproduction problem as a general optimal control problem based on an iterative linear quadratic regulator (iLQR), where the Gaussian distribution in the extracted coordinate systems are used to define the cost function. We apply our approach to object grasping and box opening tasks in simulation and on a 7-axis Franka Emika robot. The results show that the robot can exploit several geometries to execute the manipulation task and generalize it to new situations, by maintaining the invariant characteristics of the task in the coordinate system(s) of interest.

Submitted 21 July, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

arXiv:2202.13188 [pdf, other]

doi 10.1016/j.ins.2023.119872

Regularized Bilinear Discriminant Analysis for Multivariate Time Series Data

Authors: Jianhua Zhao, Haiye Liang, Shulan Li, Zhiji Yang, Zhen Wang

Abstract: In recent years, the methods on matrix-based or bilinear discriminant analysis (BLDA) have received much attention. Despite their advantages, it has been reported that the traditional vector-based regularized LDA (RLDA) is still quite competitive and could outperform BLDA on some benchmark datasets. Nevertheless, it is also noted that this finding is mainly limited to image data. In this paper, we propose regularized BLDA (RBLDA) and further explore the comparison between RLDA and RBLDA on another type of matrix data, namely multivariate time series (MTS). Unlike image data, MTS typically consists of multiple variables measured at different time points. Although many methods for MTS data classification exist within the literature, there is relatively little work in exploring the matrix data structure of MTS data. Moreover, the existing BLDA can not be performed when one of its within-class matrices is singular. To address the two problems, we propose RBLDA for MTS data classification, where each of the two within-class matrices is regularized via one parameter. We develop an efficient implementation of RBLDA and an efficient model selection algorithm with which the cross validation procedure for RBLDA can be performed efficiently. Experiments on a number of real MTS data sets are conducted to evaluate the proposed algorithm and compare RBLDA with several closely related methods, including RLDA and BLDA. The results reveal that RBLDA achieves the best overall recognition performance and the proposed model selection algorithm is efficient; Moreover, RBLDA can produce better visualization of MTS data than RLDA.

Submitted 26 February, 2022; originally announced February 2022.

Comments: 14 pages, 2 figures

MSC Class: 68T10 ACM Class: I.5.2

arXiv:2201.12936 [pdf, other]

Pigeonhole Design: Balancing Sequential Experiments from an Online Matching Perspective

Authors: Jinglong Zhao, Zijie Zhou

Abstract: Practitioners and academics have long appreciated the benefits of covariate balancing when they conduct randomized experiments. For web-facing firms running online A/B tests, however, it still remains challenging in balancing covariate information when experimental subjects arrive sequentially. In this paper, we study an online experimental design problem, which we refer to as the "Online Blocking Problem." In this problem, experimental subjects with heterogeneous covariate information arrive sequentially and must be immediately assigned into either the control or the treated group. The objective is to minimize the total discrepancy, which is defined as the minimum weight perfect matching between the two groups. To solve this problem, we propose a randomized design of experiment, which we refer to as the "Pigeonhole Design." The pigeonhole design first partitions the covariate space into smaller spaces, which we refer to as pigeonholes, and then, when the experimental subjects arrive at each pigeonhole, balances the number of control and treated subjects for each pigeonhole. We analyze the theoretical performance of the pigeonhole design and show its effectiveness by comparing against two well-known benchmark designs: the match-pair design and the completely randomized design. We identify scenarios when the pigeonhole design demonstrates more benefits over the benchmark design. To conclude, we conduct extensive simulations using Yahoo! data to show a 10.2% reduction in variance if we use the pigeonhole design to estimate the average treatment effect.

Submitted 23 May, 2024; v1 submitted 30 January, 2022; originally announced January 2022.

arXiv:2201.01281 [pdf, other]

The emerging spectrum of flexible work locations: implications for travel demand and carbon emissions

Authors: Nicholas S. Caros, Xiaotong Guo, Jinhua Zhao

Abstract: Many studies of the effect of remote work on travel demand assume that remote work takes place entirely at home. Recent evidence, however, shows that in the United States, remote workers are choosing to spend approximately one third of their remote work hours outside of the home at cafes, co-working spaces or the homes of friends and family. Commutes to these "third places" could offset much of the reduction in congestion and carbon emissions from commuting that could be expected from greater shares of remote work. To estimate the impact of third places on congestion and carbon emission from commuting, this study uses a national survey of thousands of remote workers and large-scale mobile trace data to predict current and future commuting patterns for the Chicago metropolitan area. The study reveals that ignoring third places leads to an underestimation of carbon emissions from commute-based travel demand by 470 gigatons per year, or 24% of the total true emissions. Moreover, if workers' latent desire for greater levels of remote work are realized in the future, the emissions benefits will be reduced further. The spatial analyses imply that there is a decrease in visits to the city center and outskirts, but an increase in visits to near suburban areas. Implications of these results for urban transportation and land use policy are discussed.

Submitted 10 March, 2023; v1 submitted 4 January, 2022; originally announced January 2022.

arXiv:2201.01229 [pdf, other]

Impact of unplanned service disruptions on urban public transit systems

Authors: Baichuan Mo, Max Y von Franque, Haris N. Koutsopoulos, John Attanucci, Jinhua Zhao

Abstract: This paper proposes a general unplanned incident analysis framework for public transit systems from the supply and demand sides using automated fare collection (AFC) and automated vehicle location (AVL) data. Specifically, on the supply side, we propose an incident-based network redundancy index to analyze the network's ability to provide alternative services under a specific rail disruption. The impacts on operations are analyzed through the headway changes. On the demand side, the analysis takes place at two levels: aggregate flows and individual response. We calculate the demand changes of different rail lines, rail stations, bus routes, and bus stops to better understand the passenger flow redistribution under incidents. Individual behavior is analyzed using a binary logit model based on inferred passengers' mode choices and socio-demographics using AFC data. The public transit system of the Chicago Transit Authority is used as a case study. Two rail disruption cases are analyzed, one with high network redundancy around the impacted stations and the other with low. Results show that the service frequency of the incident line was largely reduced (by around 30% ~ 70%) during the incident time. Nearby rail lines with substitutional functions were also slightly affected. Passengers showed different behavioral responses in the two incident scenarios. In the low redundancy case, most of the passengers chose to use nearby buses to move, either to their destinations or to the nearby rail lines. In the high redundancy case, most of the passengers transferred directly to nearby lines. Corresponding policy implications and operating suggestions are discussed.

Submitted 3 January, 2022; originally announced January 2022.

arXiv:2112.06760 [pdf, other]

doi 10.1016/j.csda.2022.107657

Robust factored principal component analysis for matrix-valued outlier accommodation and detection

Authors: Xuan Ma, Jianhua Zhao, Yue Wang

Abstract: Principal component analysis (PCA) is a popular dimension reduction technique for vector data. Factored PCA (FPCA) is a probabilistic extension of PCA for matrix data, which can substantially reduce the number of parameters in PCA while yield satisfactory performance. However, FPCA is based on the Gaussian assumption and thereby susceptible to outliers. Although the multivariate $t$ distribution as a robust modeling tool for vector data has a very long history, its application to matrix data is very limited. The main reason is that the dimension of the vectorized matrix data is often very high and the higher the dimension, the lower the breakdown point that measures the robustness. To solve the robustness problem suffered by FPCA and make it applicable to matrix data, in this paper we propose a robust extension of FPCA (RFPCA), which is built upon a $t$-type distribution called matrix-variate $t$ distribution. Like the multivariate $t$ distribution, the matrix-variate $t$ distribution can adaptively down-weight outliers and yield robust estimates. We develop a fast EM-type algorithm for parameter estimation. Experiments on synthetic and real-world datasets reveal that RFPCA is compared favorably with several related methods and RFPCA is a simple but powerful tool for matrix-valued outlier detection.

Submitted 13 December, 2021; originally announced December 2021.

Comments: 37 pages

MSC Class: 62H25

arXiv:2110.02588 [pdf, ps, other]

Hypothesis Testing of One-Sample Mean Vector in Distributed Frameworks

Authors: Bin Du, Junlong Zhao

Abstract: Distributed frameworks are widely used to handle massive data, where sample size $n$ is very large, and data are often stored in $k$ different machines. For a random vector $X\in \mathbb{R}^p$ with expectation $\mu$, testing the mean vector $H_0: \mu=\mu_0$ vs $H_1: \mu\ne \mu_0$ for a given vector $\mu_0$ is a basic problem in statistics. The centralized test statistics require heavy communication costs, which can be a burden when $p$ or $k$ is large. To reduce the communication cost, distributed test statistics are proposed in this paper for this problem based on the divide and conquer technique, a commonly used approach for distributed statistical inference. Specifically, we extend two commonly used centralized test statistics to the distributed ones, that apply to low and high dimensional cases, respectively. Comparing the power of centralized test statistics and the distributed ones, it is observed that there is a fundamental tradeoff between communication costs and the powers of the tests. This is quite different from the application of the divide and conquer technique in many other problems such as estimation, where the associated distributed statistics can be as good as the centralized ones. Numerical results confirm the theoretical findings.

Submitted 6 October, 2021; originally announced October 2021.

arXiv:2109.12422 [pdf, other]

Equality of opportunity in travel behavior prediction with deep neural networks and discrete choice models

Authors: Yunhan Zheng, Shenhao Wang, Jinhua Zhao

Abstract: Although researchers increasingly adopt machine learning to model travel behavior, they predominantly focus on prediction accuracy, ignoring the ethical challenges embedded in machine learning algorithms. This study introduces an important missing dimension - computational fairness - to travel behavior analysis. We first operationalize computational fairness by equality of opportunity, then differentiate between the bias inherent in data and the bias introduced by modeling. We then demonstrate the prediction disparities in travel behavior modeling using the 2017 National Household Travel Survey (NHTS) and the 2018-2019 My Daily Travel Survey in Chicago. Empirically, deep neural network (DNN) and discrete choice models (DCM) reveal consistent prediction disparities across multiple social groups: both over-predict the false negative rate of frequent driving for the ethnic minorities, the low-income and the disabled populations, and falsely predict a higher travel burden of the socially disadvantaged groups and the rural populations than reality. Comparing DNN with DCM, we find that DNN can outperform DCM in prediction disparities because of DNN's smaller misspecification error. To mitigate prediction disparities, this study introduces an absolute correlation regularization method, which is evaluated with synthetic and real-world data. The results demonstrate the prevalence of prediction disparities in travel behavior modeling, and the disparities still persist regarding a variety of model specifics such as the number of DNN layers, batch size and weight initialization. Since these prediction disparities can exacerbate social inequity if prediction results without fairness adjustment are used for transportation policy making, we advocate for careful consideration of the fairness problem in travel behavior modeling, and the use of bias mitigation algorithms for fair transport decisions.

Submitted 25 September, 2021; originally announced September 2021.

arXiv:2108.11589 [pdf, other]

A Statistical Inference Framework for the Minimal Clinically Important Difference

Authors: Zehua Zhou, Leslie J. Bisson, Jiwei Zhao

Abstract: In clinical research, the effect of a treatment or intervention is widely assessed through clinical importance, instead of statistical significance. In this paper, we propose a principled statistical inference framework to learning the minimal clinically important difference (MCID), a vital concept in assessing clinical importance. We formulate the scientific question into a novel statistical learning problem, develop an efficient algorithm for parameter estimation, and establish the asymptotic theory for the proposed estimator. We conduct comprehensive simulation studies to examine the finite sample performance of the proposed method. We also re-analyze the ChAMP (Chondral Lesions And Meniscus Procedures) trial, where the primary outcome is the patient-reported pain score and the ultimate goal is to determine whether there exists a significant difference in post-operative knee pain between patients undergoing debridement versus observation of chondral lesions during the surgery. Some previous analysis of this trial exhibited that the effect of debriding the chondral lesions does not reach a statistical significance. Our analysis reinforces this conclusion that the effect of debriding the chondral lesions is not only statistically non-significant, but also clinically un-important.

Submitted 1 March, 2022; v1 submitted 26 August, 2021; originally announced August 2021.

Comments: 36 Pages, 5 figures, 3 tables, submitted to Statistics in Biosciences

arXiv:2108.04966 [pdf, ps, other]

Avoid Estimating the Unknown Function in a Semiparametric Nonignorable Propensity Model

Authors: Samidha Shetty, Yanyuan Ma, Jiwei Zhao

Abstract: We study the problem of estimating a functional or a parameter in the context where outcome is subject to nonignorable missingness. We completely avoid modeling the regression relation, while allowing the propensity to be modeled by a semiparametric logistic relation where the dependence on covariates is unspecified. We discover a surprising phenomenon in that the estimation of the parameter in the propensity model as well as the functional estimation can be carried out without assessing the missingness dependence on covariates. This allows us to propose a general class of estimators for both model parameter estimation and functional estimation, including estimating the outcome mean. The robustness of the estimators are nonstandard and are established rigorously through theoretical derivations, and are supported by simulations and a data application.

Submitted 10 August, 2021; originally announced August 2021.

Comments: 21 pages

arXiv:2108.04306 [pdf, other]

Test of Significance for High-dimensional Thresholds with Application to Individualized Minimal Clinically Important Difference

Authors: Huijie Feng, Jingyi Duan, Yang Ning, Jiwei Zhao

Abstract: This work is motivated by learning the individualized minimal clinically important difference, a vital concept to assess clinical importance in various biomedical studies. We formulate the scientific question into a high-dimensional statistical problem where the parameter of interest lies in an individualized linear threshold. The goal is to develop a hypothesis testing procedure for the significance of a single element in this parameter as well as of a linear combination of this parameter. The difficulty dues to the high-dimensional nuisance in developing such a testing procedure, and also stems from the fact that this high-dimensional threshold model is nonregular and the limiting distribution of the corresponding estimator is nonstandard. To deal with these challenges, we construct a test statistic via a new bias-corrected smoothed decorrelated score approach, and establish its asymptotic distributions under both null and local alternative hypotheses. We propose a double-smoothing approach to select the optimal bandwidth in our test statistic and provide theoretical guarantees for the selected bandwidth. We conduct simulation studies to demonstrate how our proposed procedure can be applied in empirical studies. We apply the proposed method to a clinical trial where the scientific goal is to assess the clinical importance of a surgery procedure.

Submitted 26 March, 2023; v1 submitted 9 August, 2021; originally announced August 2021.

arXiv:2108.02196 [pdf, other]

Synthetic Controls for Experimental Design

Authors: Alberto Abadie, Jinglong Zhao

Abstract: This article studies experimental design in settings where the experimental units are large aggregate entities (e.g., markets), and only one or a small number of units can be exposed to the treatment. In such settings, randomization of the treatment may result in treated and control groups with very different characteristics at baseline, inducing biases. We propose a variety of synthetic control designs (Abadie, Diamond and Hainmueller, 2010; Abadie and Gardeazabal, 2003) as experimental designs to select treated units in non-randomized experiments with large aggregate units, as well as the untreated units to be used as a control group. Average potential outcomes are estimated as weighted averages of treated units (for potential outcomes with treatment) and of control units (for potential outcomes without treatment). We analyze the properties of estimators based on synthetic control designs and propose new inferential techniques. We show that in experimental settings with aggregate units, synthetic control designs can substantially reduce estimation biases in comparison to randomization of the treatment.

Submitted 6 December, 2023; v1 submitted 4 August, 2021; originally announced August 2021.

arXiv:2107.02043 [pdf]

An extended watershed-based zonal statistical AHP model for flood risk estimation: Constraining runoff converging related indicators by sub-watersheds

Authors: Hongping Zhang, Zhenfeng Shao, Jinqi Zhao, Xiao Huang, Jie Yang, Bin Hu, Wenfu Wu

Abstract: Floods are highly uncertain events, occurring in different regions, with varying prerequisites and intensities. A highly reliable flood disaster risk map can help reduce the impact of floods, supporting flood management, disaster reduction, and urbanization resilience. In flood risk estimation, the widely used analytic hierarchy process (AHP) usually adopts the pixel as its basic unit, so it cannot capture the similar threat posed by neighboring source flooding cells at the sub-watershed scale. Thus, an extended watershed-based zonal statistical AHP model constraining runoff converging related indicators by sub-watersheds (WZSAHP-Slope & Stream) is proposed to fill this gap. Taking the Chaohu basin as a test case, we validated the proposed method against a real flood area extracted in July 2020. The results indicate that the WZSAHP-Slope & Stream model, which uses multiple-flow-direction watershed division and the maximum statistic method to calculate statistics of distance from stream and slope, outperformed the other tested methods. Compared with the pixel-based AHP method, the proposed method improves the correct ratio by 16 percentage points (from 67% to 83%) and the fit ratio by 1 percentage point (from 13% to 14%) in validation 1, and improves the correct ratio by 37 percentage points (from 23% to 60%) and the fit ratio by 6 percentage points (from 12% to 18%) in validation 2.

Submitted 5 July, 2021; originally announced July 2021.

Comments: Research paper, 40 pages, 8 figures. A modest contribution to the ongoing discussion of the accuracy of flood risk estimation via the AHP model, improved by replacing pixels with sub-watersheds as the basic unit

MSC Class: 86A05; ACM Class: H.1

Showing 1–50 of 123 results for author: Zhao, J