Showing 1–50 of 56 results for author: Gao, Z

Searching in archive stat.

  1. arXiv:2406.07868  [pdf, other]

    stat.ME

    Bridging multiple worlds: multi-marginal optimal transport for causal partial-identification problem

    Authors: Zijun Gao, Shu Ge, Jian Qian

    Abstract: Under the prevalent potential outcome model in causal inference, each unit is associated with multiple potential outcomes but at most one of which is observed, leading to many causal quantities being only partially identified. The inherent missing data issue echoes the multi-marginal optimal transport (MOT) problem, where marginal distributions are known, but how the marginals couple to form the j…

    Submitted 12 June, 2024; originally announced June 2024.

  2. arXiv:2405.07026  [pdf, other]

    stat.ME

    Selective Randomization Inference for Adaptive Experiments

    Authors: Tobias Freidling, Qingyuan Zhao, Zijun Gao

    Abstract: Adaptive experiments use preliminary analyses of the data to inform further course of action and are commonly used in many disciplines including medical and social sciences. Because the null hypothesis and experimental design are not pre-specified, it has long been recognized that statistical inference for adaptive experiments is not straightforward. Most existing methods only apply to specific ad…

    Submitted 11 May, 2024; originally announced May 2024.

  3. arXiv:2405.00424  [pdf, other]

    econ.EM stat.ME stat.ML

    Optimal Bias-Correction and Valid Inference in High-Dimensional Ridge Regression: A Closed-Form Solution

    Authors: Zhaoxing Gao

    Abstract: Ridge regression is an indispensable tool in big data econometrics but suffers from bias issues affecting both statistical efficiency and scalability. We introduce an iterative strategy to correct the bias effectively when the dimension $p$ is less than the sample size $n$. For $p>n$, our method optimally reduces the bias to a level unachievable through linear transformations of the response. We e…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: 53 pages, 10 figures

  4. arXiv:2401.16651  [pdf, other]

    stat.ME math.ST stat.AP stat.CO

    A constructive approach to selective risk control

    Authors: Zijun Gao, Wenjie Hu, Qingyuan Zhao

    Abstract: Many modern applications require the use of data to both select the statistical tasks and make valid inference after selection. In this article, we provide a unifying approach to control for a class of selective risks. Our method is motivated by a reformulation of the celebrated Benjamini-Hochberg (BH) procedure for multiple hypothesis testing as the iterative limit of the Benjamini-Yekutieli (BY)…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: 8 figures, 2 tables

  5. arXiv:2310.19167  [pdf, other]

    cs.LG cs.AI stat.ML

    Rare Event Probability Learning by Normalizing Flows

    Authors: Zhengqi Gao, Dinghuai Zhang, Luca Daniel, Duane S. Boning

    Abstract: A rare event is defined by a low probability of occurrence. Accurate estimation of such small probabilities is of utmost importance across diverse domains. Conventional Monte Carlo methods are inefficient, demanding an exorbitant number of samples to achieve reliable estimates. Inspired by the exact sampling capabilities of normalizing flows, we revisit this challenge and propose normalizing flow…

    Submitted 29 October, 2023; originally announced October 2023.

    Comments: 16 pages, 5 figures, 2 tables

  6. arXiv:2310.17844  [pdf, other]

    math.NA stat.CO stat.ML

    Adaptive operator learning for infinite-dimensional Bayesian inverse problems

    Authors: Zhiwei Gao, Liang Yan, Tao Zhou

    Abstract: The fundamental computational issues in Bayesian inverse problems (BIP) governed by partial differential equations (PDEs) stem from the requirement of repeated forward model evaluations. A popular strategy to reduce such costs is to replace expensive model simulations with computationally efficient approximations using operator learning, motivated by recent progress in deep learning. However, usin…

    Submitted 4 March, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

  7. arXiv:2310.12349  [pdf, other]

    cs.CE stat.AP

    Developing 3D Virtual Safety Risk Terrain for UAS Operations in Complex Urban Environments

    Authors: Zhenyu Gao, John-Paul Clarke, Javid Mardanov, Karen Marais

    Abstract: Unmanned Aerial Systems (UAS), an integral part of the Advanced Air Mobility (AAM) vision, are capable of performing a wide spectrum of tasks in urban environments. The societal integration of UAS is a pivotal challenge, as these systems must operate harmoniously within the constraints imposed by regulations and societal concerns. In complex urban environments, UAS safety has been a perennial obst…

    Submitted 18 October, 2023; originally announced October 2023.

    Comments: 33 pages, 19 figures

  8. arXiv:2310.06357  [pdf, other]

    stat.ME stat.AP

    Adaptive Storey's null proportion estimator

    Authors: Zijun Gao

    Abstract: False discovery rate (FDR) is a commonly used criterion in multiple testing and the Benjamini-Hochberg (BH) procedure is arguably the most popular approach with FDR guarantee. To improve power, the adaptive BH procedure has been proposed by incorporating various null proportion estimators, among which Storey's estimator has gained substantial popularity. The performance of Storey's estimator hinge…

    Submitted 10 October, 2023; originally announced October 2023.

    Comments: 17 pages, 4 figures, 1 table

  9. arXiv:2309.02674  [pdf, other]

    stat.ME

    Denoising and Multilinear Dimension-Reduction of High-Dimensional Matrix-Variate Time Series via a Factor Model

    Authors: Zhaoxing Gao, Ruey S. Tsay

    Abstract: This paper proposes a new multilinear projection method for dimension-reduction in modeling high-dimensional matrix-variate time series. It assumes that a $p_1\times p_2$ matrix-variate time series consists of a dynamically dependent, lower-dimensional matrix-variate factor process and a $p_1\times p_2$ matrix white noise series. Covariance matrix of the vectorized white noises assumes a Kronecker…

    Submitted 5 September, 2023; originally announced September 2023.

    Comments: 57 Pages, 7 figures, 7 tables. arXiv admin note: text overlap with arXiv:2011.09029

  10. arXiv:2307.07689  [pdf, other]

    econ.EM q-fin.ST stat.ME

    Supervised Dynamic PCA: Linear Dynamic Forecasting with Many Predictors

    Authors: Zhaoxing Gao, Ruey S. Tsay

    Abstract: This paper proposes a novel dynamic forecasting method using a new supervised Principal Component Analysis (PCA) when a large number of predictors are available. The new supervised PCA provides an effective way to bridge the gap between predictors and the target variable of interest by scaling and combining the predictors and their lagged values, resulting in an effective dynamic forecasting. Unli…

    Submitted 14 July, 2023; originally announced July 2023.

    Comments: 58 pages, 7 figures

    Journal ref: Journal of the American Statistical Association, 2024

  11. arXiv:2306.15444  [pdf, other]

    math.OC cs.LG stat.ML

    Limited-Memory Greedy Quasi-Newton Method with Non-asymptotic Superlinear Convergence Rate

    Authors: Zhan Gao, Aryan Mokhtari, Alec Koppel

    Abstract: Non-asymptotic convergence analysis of quasi-Newton methods has gained attention with a landmark result establishing an explicit local superlinear rate of $O((1/\sqrt{t})^t)$. The methods that obtain this rate, however, exhibit a well-known drawback: they require the storage of the previous Hessian approximation matrix or all past curvature information to form the current Hessian inverse approxima…

    Submitted 18 October, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

  12. arXiv:2306.13830  [pdf, other]

    cs.LG stat.AP

    Improved Aircraft Environmental Impact Segmentation via Metric Learning

    Authors: Zhenyu Gao, Dimitri N. Mavris

    Abstract: Accurate modeling of aircraft environmental impact is pivotal to the design of operational procedures and policies to mitigate negative aviation environmental impact. Aircraft environmental impact segmentation is a process which clusters aircraft types that have similar environmental impact characteristics based on a set of aircraft features. This practice helps model a large population of aircraf…

    Submitted 10 September, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

    Comments: 32 pages, 11 figures

  13. arXiv:2306.10656  [pdf, other]

    cs.LG cs.AI stat.ML

    Virtual Human Generative Model: Masked Modeling Approach for Learning Human Characteristics

    Authors: Kenta Oono, Nontawat Charoenphakdee, Kotatsu Bito, Zhengyan Gao, Yoshiaki Ota, Shoichiro Yamaguchi, Yohei Sugawara, Shin-ichi Maeda, Kunihiko Miyoshi, Yuki Saito, Koki Tsuda, Hiroshi Maruyama, Kohei Hayashi

    Abstract: Identifying the relationship between healthcare attributes, lifestyles, and personality is vital for understanding and improving physical and mental conditions. Machine learning approaches are promising for modeling their relationships and offering actionable suggestions. In this paper, we propose Virtual Human Generative Model (VHGM), a machine learning model for estimating attributes about healt…

    Submitted 14 August, 2023; v1 submitted 18 June, 2023; originally announced June 2023.

    Comments: 14 pages, 4 figures

  14. arXiv:2304.12134  [pdf, other]

    econ.EM stat.ME

    Determination of the effective cointegration rank in high-dimensional time-series predictive regressions

    Authors: Puyi Fang, Zhaoxing Gao, Ruey S. Tsay

    Abstract: This paper proposes a new approach to identifying the effective cointegration rank in high-dimensional unit-root (HDUR) time series from a prediction perspective using reduced-rank regression. For a HDUR process $\mathbf{x}_t\in \mathbb{R}^N$ and a stationary series $\mathbf{y}_t\in \mathbb{R}^p$ of interest, our goal is to predict future values of $\mathbf{y}_t$ using $\mathbf{x}_t$ and lagged va…

    Submitted 24 April, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

  15. arXiv:2304.09723  [pdf, other]

    stat.AP

    A Review of Bayesian Methods in Electronic Design Automation

    Authors: Zhengqi Gao, Duane S. Boning

    Abstract: The utilization of Bayesian methods has been widely acknowledged as a viable solution for tackling various challenges in electronic integrated circuit (IC) design under stochastic process variation, including circuit performance modeling, yield/failure rate estimation, and circuit optimization. As the post-Moore era brings about new technologies (such as silicon photonics and quantum circuits), ma…

    Submitted 13 March, 2023; originally announced April 2023.

    Comments: 24 pages, a draft version. We welcome comments and feedback, which can be sent to [email protected]

  16. arXiv:2303.01552  [pdf, other]

    stat.ME math.ST stat.AP

    Simultaneous Hypothesis Testing Using Internal Negative Controls with An Application to Proteomics

    Authors: Zijun Gao, Qingyuan Zhao

    Abstract: Negative control is a common technique in scientific investigations and broadly refers to the situation where a null effect ("negative result") is expected. Motivated by a real proteomic dataset, we will present three promising and closely connected methods of using negative controls to assist simultaneous hypothesis testing. The first method uses negative controls to construct a permutation p-v…

    Submitted 19 March, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 41 pages, 10 figures, 3 tables

  17. arXiv:2302.01529  [pdf, other]

    math.NA stat.ML

    Failure-informed adaptive sampling for PINNs, Part II: combining with re-sampling and subset simulation

    Authors: Zhiwei Gao, Tao Tang, Liang Yan, Tao Zhou

    Abstract: This is the second part of our series works on failure-informed adaptive sampling for physics-informed neural networks (FI-PINNs). In our previous work \cite{gao2022failure}, we have presented an adaptive sampling framework by using the failure probability as the posterior error indicator, where the truncated Gaussian model has been adopted for estimating the indicator. In this work, we present two…

    Submitted 28 February, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

  18. arXiv:2211.04918  [pdf, other]

    cs.CR stat.AP stat.ME

    Detection of Sparse Anomalies in High-Dimensional Network Telescope Signals

    Authors: Rafail Kartsioukas, Rajat Tandon, Zheng Gao, Jelena Mirkovic, Michalis Kallitsis, Stilian Stoev

    Abstract: Network operators and system administrators are increasingly overwhelmed with incessant cyber-security threats ranging from malicious network reconnaissance to attacks such as distributed denial of service and data breaches. A large number of these attacks could be prevented if the network operators were better equipped with threat intelligence information that would allow them to block or throttl…

    Submitted 22 June, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

  19. arXiv:2210.00279  [pdf, other]

    math.NA stat.ML

    Failure-informed adaptive sampling for PINNs

    Authors: Zhiwei Gao, Liang Yan, Tao Zhou

    Abstract: Physics-informed neural networks (PINNs) have emerged as an effective technique for solving PDEs in a wide range of domains. It is noticed, however, the performance of PINNs can vary dramatically with different sampling procedures. For instance, a fixed set of (prior chosen) training points may fail to capture the effective solution region (especially for problems with singularities). To overcome…

    Submitted 15 January, 2023; v1 submitted 1 October, 2022; originally announced October 2022.

  20. arXiv:2111.00929  [pdf, other]

    cs.LG stat.ML

    Bounds all around: training energy-based models with bidirectional bounds

    Authors: Cong Geng, Jia Wang, Zhiyong Gao, Jes Frellsen, Søren Hauberg

    Abstract: Energy-based models (EBMs) provide an elegant framework for density estimation, but they are notoriously difficult to train. Recent work has established links to generative adversarial networks, where the EBM is trained through a minimax game with a variational value function. We propose a bidirectional bound on the EBM log-likelihood, such that we maximize a lower bound and minimize an upper boun…

    Submitted 2 November, 2021; v1 submitted 1 November, 2021; originally announced November 2021.

    Comments: This paper has been accepted by NeurIPS 2021

  21. arXiv:2110.09823  [pdf, other]

    cs.LG stat.AP stat.ME

    An Empirical Study: Extensive Deep Temporal Point Process

    Authors: Haitao Lin, Cheng Tan, Lirong Wu, Zhangyang Gao, Stan. Z. Li

    Abstract: Temporal point process as the stochastic process on continuous domain of time is commonly used to model the asynchronous event sequence featuring with occurrence timestamps. Thanks to the strong expressivity of deep neural networks, they are emerging as a promising choice for capturing the patterns in asynchronous sequences, in the context of temporal point process. In this paper, we first review…

    Submitted 21 December, 2021; v1 submitted 19 October, 2021; originally announced October 2021.

    Comments: 22 pages, 8 figures

  22. arXiv:2108.01485  [pdf, other]

    cs.LG stat.ML

    Fast Estimation Method for the Stability of Ensemble Feature Selectors

    Authors: Rina Onda, Zhengyan Gao, Masaaki Kotera, Kenta Oono

    Abstract: It is preferred that feature selectors be stable for better interpretability and robust prediction. Ensembling is known to be effective for improving the stability of feature selectors. Since ensembling is time-consuming, it is desirable to reduce the computational cost to estimate the stability of the ensemble feature selectors. We propose a simulator of a feature selector, and apply it to…

    Submitted 3 August, 2021; originally announced August 2021.

    Comments: 7 pages. Supplementary material 9 pages. Accepted in ICML2021 Workshop, Subset Selection in Machine Learning: From Theory to Practice (SubSetML) URL: https://sites.google.com/view/icml-2021-subsetml

  23. arXiv:2107.12713  [pdf, other]

    stat.ME

    LinCDE: Conditional Density Estimation via Lindsey's Method

    Authors: Zijun Gao, Trevor Hastie

    Abstract: Conditional density estimation is a fundamental problem in statistics, with scientific and practical applications in biology, economics, finance and environmental studies, to name a few. In this paper, we propose a conditional density estimator based on gradient boosting and Lindsey's method (LinCDE). LinCDE admits flexible modeling of the density family and can capture distributional characterist…

    Submitted 31 December, 2021; v1 submitted 27 July, 2021; originally announced July 2021.

    Comments: 50 pages, 20 figures

  24. arXiv:2106.11793  [pdf, other]

    stat.AP

    Identifying intercity freight trip ends of heavy trucks from GPS data

    Authors: Yitao Yang, Bin Jia, Xiao-Yong Yan, Jiangtao Li, Zhenzhen Yang, Ziyou Gao

    Abstract: The intercity freight trips of heavy trucks are important data for transportation system planning and urban agglomeration management. In recent decades, the extraction of freight trips from GPS data has gradually become the main alternative to traditional surveys. Identifying the trip ends (origin and destination, OD) is the first task in trip extraction. In previous trip end identification method…

    Submitted 22 June, 2021; originally announced June 2021.

  25. arXiv:2103.14626  [pdf, other]

    stat.ME econ.EM

    Divide-and-Conquer: A Distributed Hierarchical Factor Approach to Modeling Large-Scale Time Series Data

    Authors: Zhaoxing Gao, Ruey S. Tsay

    Abstract: This paper proposes a hierarchical approximate-factor approach to analyzing high-dimensional, large-scale heterogeneous time series data using distributed computing. The new method employs a multiple-fold dimension reduction procedure using Principal Component Analysis (PCA) and shows great promises for modeling large-scale data that cannot be stored nor analyzed by a single machine. Each computer…

    Submitted 26 March, 2021; originally announced March 2021.

    Comments: 48 pages, 10 figures

    Journal ref: Journal of the American Statistical Association, 2022

  26. arXiv:2103.04277  [pdf, other]

    stat.ME

    Estimating Heterogeneous Treatment Effects for General Responses

    Authors: Zijun Gao, Trevor Hastie

    Abstract: Heterogeneous treatment effect models allow us to compare treatments at subgroup and individual levels, and are of increasing popularity in applications like personalized medicine, advertising, and education. In this talk, we first survey different causal estimands used in practice, which focus on estimating the difference in conditional means. We then propose DINA, the difference in natural param…

    Submitted 27 January, 2022; v1 submitted 7 March, 2021; originally announced March 2021.

  27. arXiv:2011.09029  [pdf, ps, other]

    econ.EM stat.ME

    A Two-Way Transformed Factor Model for Matrix-Variate Time Series

    Authors: Zhaoxing Gao, Ruey S. Tsay

    Abstract: We propose a new framework for modeling high-dimensional matrix-variate time series by a two-way transformation, where the transformed data consist of a matrix-variate factor process, which is dynamically dependent, and three other blocks of white noises. Specifically, for a given $p_1\times p_2$ matrix-variate time series, we seek common nonsingular transformations to project the rows and columns…

    Submitted 17 November, 2020; originally announced November 2020.

    Comments: 49 pages, 6 figures

    Journal ref: Econometrics and Statistics 2021

  28. arXiv:2009.11612  [pdf, other]

    cs.LG stat.ML

    Clustering Based on Graph of Density Topology

    Authors: Zhangyang Gao, Haitao Lin, Stan. Z Li

    Abstract: Data clustering with uneven distribution in high level noise is challenging. Currently, HDBSCAN is considered as the SOTA algorithm for this problem. In this paper, we propose a novel clustering algorithm based on what we call graph of density topology (GDT). GDT jointly considers the local and global structures of data samples: firstly forming local clusters based on a density growing process wit…

    Submitted 24 September, 2020; originally announced September 2020.

  29. arXiv:2009.05872  [pdf, ps, other]

    cs.LG cs.CR stat.ML

    Certified Robustness of Graph Classification against Topology Attack with Randomized Smoothing

    Authors: Zhidong Gao, Rui Hu, Yanmin Gong

    Abstract: Graph classification has practical applications in diverse fields. Recent studies show that graph-based machine learning models are especially vulnerable to adversarial perturbations due to the non i.i.d nature of graph data. By adding or deleting a small number of edges in the graph, adversaries could greatly change the graph label predicted by a graph classification model. In this work, we propo…

    Submitted 12 September, 2020; originally announced September 2020.

    Comments: Accepted to IEEE GLOBECOM 2020

  30. arXiv:2006.06376  [pdf, other]

    cs.LG stat.ML

    Wide and Deep Graph Neural Networks with Distributed Online Learning

    Authors: Zhan Gao, Fernando Gama, Alejandro Ribeiro

    Abstract: Graph neural networks (GNNs) learn representations from network data with naturally distributed architectures, rendering them well-suited candidates for decentralized learning. Oftentimes, this decentralized graph support changes with time due to link failures or topology variations. These changes create a mismatch between the graphs on which GNNs were trained and the ones on which they are tested…

    Submitted 24 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

  31. Modeling High-Dimensional Unit-Root Time Series

    Authors: Zhaoxing Gao, Ruey S. Tsay

    Abstract: This paper proposes a new procedure to build factor models for high-dimensional unit-root time series by postulating that a $p$-dimensional unit-root process is a nonsingular linear transformation of a set of unit-root processes, a set of stationary common factors, which are dynamically dependent, and some idiosyncratic white noise components. For the stationary components, we assume that the fact…

    Submitted 11 August, 2020; v1 submitted 5 May, 2020; originally announced May 2020.

    Comments: 45 pages, 11 figures. arXiv admin note: text overlap with arXiv:1808.07932

    Journal ref: International Journal of Forecasting 2020

  32. arXiv:2004.10657  [pdf, other]

    cs.PL cs.LG stat.ML

    Typilus: Neural Type Hints

    Authors: Miltiadis Allamanis, Earl T. Barr, Soline Ducousso, Zheng Gao

    Abstract: Type inference over partial contexts in dynamically typed languages is challenging. In this work, we present a graph neural network model that predicts types by probabilistically reasoning over a program's structure, names, and patterns. The network uses deep similarity learning to learn a TypeSpace -- a continuous relaxation of the discrete space of types -- and how to embed the type properties o…

    Submitted 6 April, 2020; originally announced April 2020.

    Comments: Accepted to PLDI 2020

  33. arXiv:2004.04618  [pdf, other]

    cs.LG eess.SP stat.ML

    Deep Reinforcement Learning (DRL): Another Perspective for Unsupervised Wireless Localization

    Authors: You Li, Xin Hu, Yuan Zhuang, Zhouzheng Gao, Peng Zhang, Naser El-Sheimy

    Abstract: Location is key to spatialize internet-of-things (IoT) data. However, it is challenging to use low-cost IoT devices for robust unsupervised localization (i.e., localization without training data that have known location labels). Thus, this paper proposes a deep reinforcement learning (DRL) based unsupervised wireless-localization method. The main contributions are as follows. (1) This paper propos…

    Submitted 9 April, 2020; originally announced April 2020.

  34. arXiv:2003.10375  [pdf, other]

    eess.SP cs.LG stat.ML

    FTT-NAS: Discovering Fault-Tolerant Convolutional Neural Architecture

    Authors: Xuefei Ning, Guangjun Ge, Wenshuo Li, Zhenhua Zhu, Yin Zheng, Xiaoming Chen, Zhen Gao, Yu Wang, Huazhong Yang

    Abstract: With the fast evolvement of embedded deep-learning computing systems, applications powered by deep learning are moving from the cloud to the edge. When deploying neural networks (NNs) onto the devices under complex environments, there are various types of possible faults: soft errors caused by cosmic radiation and radioactive impurities, voltage instability, aging, temperature variations, and mali…

    Submitted 12 April, 2021; v1 submitted 20 March, 2020; originally announced March 2020.

    Comments: 24 pages; to appear in TODAES

  35. arXiv:2003.06365  [pdf]

    q-fin.PM cs.LG stat.ML

    Application of Deep Q-Network in Portfolio Management

    Authors: Ziming Gao, Yuan Gao, Yi Hu, Zhengyong Jiang, Jionglong Su

    Abstract: Machine Learning algorithms and Neural Networks are widely applied to many different areas such as stock market prediction, face recognition and population analysis. This paper will introduce a strategy based on the classic Deep Reinforcement Learning algorithm, Deep Q-Network, for portfolio management in stock market. It is a type of deep neural network which is optimized by Q Learning. To make t…

    Submitted 13 March, 2020; originally announced March 2020.

  36. arXiv:2003.03881  [pdf, other]

    stat.ME stat.AP

    Assessment of Heterogeneous Treatment Effect Estimation Accuracy via Matching

    Authors: Zijun Gao, Trevor Hastie, Robert Tibshirani

    Abstract: We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distan…

    Submitted 8 March, 2020; originally announced March 2020.

  37. arXiv:2002.06471  [pdf, ps, other]

    math.ST stat.ME

    Minimax Optimal Nonparametric Estimation of Heterogeneous Treatment Effects

    Authors: Zijun Gao, Yanjun Han

    Abstract: A central goal of causal inference is to detect and estimate the treatment effects of a given treatment or intervention on an outcome variable of interest, where a member known as the heterogeneous treatment effect (HTE) is of growing popularity in recent practical applications such as the personalized medicine. In this paper, we model the HTE as a smooth nonparametric difference between two less…

    Submitted 24 October, 2020; v1 submitted 15 February, 2020; originally announced February 2020.

    Comments: To appear at NeurIPS 2020 as a spotlight presentation

  38. arXiv:2002.04829  [pdf, other]

    cs.CV stat.ML

    Uniform Interpolation Constrained Geodesic Learning on Data Manifold

    Authors: Cong Geng, Jia Wang, Li Chen, Wenbo Bao, Chu Chu, Zhiyong Gao

    Abstract: In this paper, we propose a method to learn a minimizing geodesic within a data manifold. Along the learned geodesic, our method can generate high-quality interpolations between two given data samples. Specifically, we use an autoencoder network to map data samples into latent space and perform interpolation via an interpolation network. We add prior geometric information to regularize our autoenc…

    Submitted 14 August, 2020; v1 submitted 12 February, 2020; originally announced February 2020.

    Comments: submitted to NIPS 2020

  39. arXiv:2002.03382  [pdf, other]

    stat.ME

    Segmenting High-dimensional Matrix-valued Time Series via Sequential Transformations

    Authors: Zhaoxing Gao

    Abstract: Modeling matrix-valued time series is an interesting and important research topic. In this paper, we extend the method of Chang et al. (2017) to matrix-valued time series. For any given $p\times q$ matrix-valued time series, we look for linear transformations to segment the matrix into many small sub-matrices for which each of them are uncorrelated with the others both contemporaneously and serial…

    Submitted 9 February, 2020; originally announced February 2020.

  40. arXiv:2001.07072  [pdf]

    cs.LG stat.ML

    Projection based Active Gaussian Process Regression for Pareto Front Modeling

    Authors: Zhengqi Gao, Jun Tao, Yangfeng Su, Dian Zhou, Xuan Zeng

    Abstract: Pareto Front (PF) modeling is essential in decision making problems across all domains such as economics, medicine or engineering. In Operation Research literature, this task has been addressed based on multi-objective optimization algorithms. However, without learning models for PF, these methods cannot examine whether a new provided point locates on PF or not. In this paper, we reconsider the ta…

    Submitted 20 January, 2020; originally announced January 2020.

  41. arXiv:1910.05701  [pdf, ps, other]

    math.ST stat.ME

    Phase Transitions in Genome-wide Association Studies and Categorical Variable Screenings

    Authors: Zheng Gao

    Abstract: Motivated by genome-wide association screening studies (GWAS), we study high-dimensional marginal screenings of categorical variables where test statistics have approximate chi-square distributions. We characterize four new phase transitions in high-dimensional chi-square models, and derive the signal sizes necessary and sufficient for statistical procedures to simultaneously control false discove…

    Submitted 3 June, 2022; v1 submitted 13 October, 2019; originally announced October 2019.

    Comments: 40 pages, 8 figures

    MSC Class: 62G10; 62G20

  42. arXiv:1910.03203  [pdf]

    cs.LG stat.AP stat.ML

    Random forest model identifies serve strength as a key predictor of tennis match outcome

    Authors: Zijian Gao, Amanda Kowalczyk

    Abstract: Tennis is a popular sport worldwide, boasting millions of fans and numerous national and international tournaments. Like many sports, tennis has benefitted from the popularity of rigorous record-keeping of game and player information, as well as the growth of machine learning methods for use in sports analytics. Of particular interest to bettors and betting companies alike is potential use of spor…

    Submitted 8 October, 2019; originally announced October 2019.

    Comments: 12 pages, 5 figures, 2 tables

  43. Self-paced Ensemble for Highly Imbalanced Massive Data Classification

    Authors: Zhining Liu, Wei Cao, Zhifeng Gao, Jiang Bian, Hechang Chen, Yi Chang, Tie-Yan Liu

    Abstract: Many real-world applications reveal difficulties in learning classifiers from imbalanced data. The rising big data era has been witnessing more classification tasks with large-scale but extremely imbalance and low-quality datasets. Most of existing learning methods suffer from poor performance or low computation efficiency under such a scenario. To tackle this problem, we conduct deep investigatio…

    Submitted 17 October, 2020; v1 submitted 8 September, 2019; originally announced September 2019.

    Comments: IEEE 36th International Conference on Data Engineering (ICDE 2020)

    Journal ref: 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 2020: 841-852

  44. arXiv:1908.08616  [pdf, other]

    cs.LG math.OC stat.ML

    Quadratic Surface Support Vector Machine with L1 Norm Regularization

    Authors: Ahmad Mousavi, Zheming Gao, Lanshan Han, Alvin Lim

    Abstract: We propose $\ell_1$ norm regularized quadratic surface support vector machine models for binary classification in supervised learning. We establish their desired theoretical properties, including the existence and uniqueness of the optimal solution, reduction to the standard SVMs over (almost) linearly separable data sets, and detection of true sparsity pattern over (almost) quadratically separabl… ▽ More

    Submitted 30 January, 2021; v1 submitted 22 August, 2019; originally announced August 2019.

  45. arXiv:1907.13353  [pdf, other]

    cs.LG stat.ML

    A Novel Multiple Classifier Generation and Combination Framework Based on Fuzzy Clustering and Individualized Ensemble Construction

    Authors: Zhen Gao, Maryam Zand, Jianhua Ruan

    Abstract: Multiple classifier system (MCS) has become a successful alternative for improving classification performance. However, studies have shown inconsistent results for different MCSs, and it is often difficult to predict which MCS algorithm works the best on a particular problem. We believe that the two crucial steps of MCS - base classifier generation and multiple classifier combination, need to be d… ▽ More

    Submitted 31 July, 2019; originally announced July 2019.

  46. arXiv:1907.06582  [pdf, other]

    cs.LG stat.ML

    AMAD: Adversarial Multiscale Anomaly Detection on High-Dimensional and Time-Evolving Categorical Data

    Authors: Zheng Gao, Lin Guo, Chi Ma, Xiao Ma, Kai Sun, Hang Xiang, Xiaoqiang Zhu, Hongsong Li, Xiaozhong Liu

    Abstract: Anomaly detection is facing with emerging challenges in many important industry domains, such as cyber security and online recommendation and advertising. The recent trend in these areas calls for anomaly detection on time-evolving data with high-dimensional categorical features without labeled samples. Also, there is an increasing demand for identifying and monitoring irregular patterns at multip… ▽ More

    Submitted 12 July, 2019; originally announced July 2019.

    Comments: Accepted by 2019 KDD Workshop on Deep Learning Practice for High-Dimensional Sparse Data

  47. arXiv:1906.09981  [pdf, other]

    eess.SP stat.ML

    Optimal WDM Power Allocation via Deep Learning for Radio on Free Space Optics Systems

    Authors: Zhan Gao, Mark Eisen, Alejandro Ribeiro

    Abstract: Radio on Free Space Optics (RoFSO), as a universal platform for heterogeneous wireless services, is able to transmit multiple radio frequency signals at high rates in free space optical networks. This paper investigates the optimal design of power allocation for Wavelength Division Multiplexing (WDM) transmission in RoFSO systems. The proposed problem is a weighted total capacity maximization prob… ▽ More

    Submitted 21 June, 2019; originally announced June 2019.

  48. arXiv:1904.03779  [pdf, ps, other]

    cs.LG stat.ML

    Cluster Developing 1-Bit Matrix Completion

    Authors: Chengkun Zhang, Junbin Gao, Stephen Lu

    Abstract: Matrix completion has a long-time history of usage as the core technique of recommender systems. In particular, 1-bit matrix completion, which considers the prediction as a ``Recommended'' or ``Not Recommended'' question, has proved its significance and validity in the field. However, while customers and products aggregate into interacted clusters, state-of-the-art model-based 1-bit recommender sy… ▽ More

    Submitted 7 April, 2019; originally announced April 2019.

    Comments: 16 Pages

  49. arXiv:1904.01763  [pdf, other]

    stat.ML cs.IT cs.LG

    Batched Multi-armed Bandits Problem

    Authors: Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou

    Abstract: In this paper, we study the multi-armed bandit problem in the batched setting where the employed policy must split data into a small number of batches. While the minimax regret for the two-armed stochastic bandits has been completely characterized in \cite{perchet2016batched}, the effect of the number of arms on the regret for the multi-armed case is still open. Moreover, the question whether adap… ▽ More

    Submitted 26 October, 2019; v1 submitted 3 April, 2019; originally announced April 2019.

    Comments: To appear in NeurIPS 2019 as an oral presentation

  50. arXiv:1810.03445  [pdf]

    cs.CL cs.LG stat.ML

    Building a language evolution tree based on word vector combination model

    Authors: Zhu Gao, Yanhui Jiang, Junhui Gao

    Abstract: In this paper, we try to explore the evolution of language through case calculations. First, we chose the novels of eleven British writers from 1400 to 2005 and found the corresponding works; Then, we use the natural language processing tool to construct the corresponding eleven corpora, and calculate the respective word vectors of 100 high-frequency words in eleven corpora; Next, for each corpus,… ▽ More

    Submitted 4 October, 2018; originally announced October 2018.