Search | arXiv e-print repository

arXiv:2405.20838 [pdf, other]

einspace: Searching for Neural Architectures from Fundamental Operations

Authors: Linus Ericsson, Miguel Espinosa, Chenhongyi Yang, Antreas Antoniou, Amos Storkey, Shay B. Cohen, Steven McDonagh, Elliot J. Crowley

Abstract: Neural architecture search (NAS) finds high performing networks for a given task. Yet the results of NAS are fairly prosaic; they did not e.g. create a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren't diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shift… ▽ More Neural architecture search (NAS) finds high performing networks for a given task. Yet the results of NAS are fairly prosaic; they did not e.g. create a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren't diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shifts, we need a novel expressive search space design which is built from more fundamental operations. To this end, we introduce einspace, a search space based on a parameterised probabilistic context-free grammar. Our space is versatile, supporting architectures of various sizes and complexities, while also containing diverse network operations which allow it to model convolutions, attention components and more. It contains many existing competitive architectures, and provides flexibility for discovering new ones. Using this search space, we perform experiments to find novel architectures as well as improvements on existing ones on the diverse Unseen NAS datasets. We show that competitive architectures can be obtained by searching from scratch, and we consistently find large improvements when initialising the search with strong baselines. We believe that this work is an important advancement towards a transformative NAS paradigm where search space expressivity and strategic search initialisation play key roles. △ Less

Submitted 31 May, 2024; originally announced May 2024.

Comments: Project page at https://linusericsson.github.io/einspace/

arXiv:2405.20550 [pdf]

Uncertainty Quantification for Deep Learning

Authors: Peter Jan van Leeuwen, J. Christine Chiu, C. Kevin Yang

Abstract: A complete and statistically consistent uncertainty quantification for deep learning is provided, including the sources of uncertainty arising from (1) the new input data, (2) the training and testing data (3) the weight vectors of the neural network, and (4) the neural network because it is not a perfect predictor. Using Bayes Theorem and conditional probability densities, we demonstrate how each… ▽ More A complete and statistically consistent uncertainty quantification for deep learning is provided, including the sources of uncertainty arising from (1) the new input data, (2) the training and testing data (3) the weight vectors of the neural network, and (4) the neural network because it is not a perfect predictor. Using Bayes Theorem and conditional probability densities, we demonstrate how each uncertainty source can be systematically quantified. We also introduce a fast and practical way to incorporate and combine all sources of errors for the first time. For illustration, the new method is applied to quantify errors in cloud autoconversion rates, predicted from an artificial neural network that was trained by aircraft cloud probe measurements in the Azores and the stochastic collection equation formulated as a two-moment bin model. For this specific example, the output uncertainty arising from uncertainty in the training and testing data is dominant, followed by uncertainty in the input data, in the trained neural network, and uncertainty in the weights. We discuss the usefulness of the methodology for machine learning practice, and how, through inclusion of uncertainty in the training data, the new methodology is less sensitive to input data that falls outside of the training data set. △ Less

Submitted 30 May, 2024; originally announced May 2024.

Comments: 25 pages 4 figures, submitted to Environmental data Science

MSC Class: 62D99 ACM Class: G.3

arXiv:2405.02539 [pdf, ps, other]

Distributed Iterative Hard Thresholding for Variable Selection in Tobit Models

Authors: Changxin Yang, Zhongyi Zhu, Heng Lian

Abstract: While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converge… ▽ More While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converges with a near-optimal minimax rate. Additionally, we extend the method to a distributed setting, requiring only a few rounds of communication while retaining the estimation rate of the centralized version. Simulation results show that the IHT algorithm for the Tobit model achieves superior accuracy in predictions and subset selection, with the distributed estimator closely matching that of the centralized estimator. When applied to high-dimensional left-censored HIV viral load data, our method also exhibits similar superiority. △ Less

Submitted 3 May, 2024; originally announced May 2024.

arXiv:2404.09729 [pdf]

Amplitude-Phase Fusion for Enhanced Electrocardiogram Morphological Analysis

Authors: Shuaicong Hu, Yanan Wang, Jian Liu, **gyu Lin, Shengmei Qin, Zhenning Nie, Zhifeng Yao, Wenjie Cai, Cuiwei Yang

Abstract: Considering the variability of amplitude and phase patterns in electrocardiogram (ECG) signals due to cardiac activity and individual differences, existing entropy-based studies have not fully utilized these two patterns and lack integration. To address this gap, this paper proposes a novel fusion entropy metric, morphological ECG entropy (MEE) for the first time, specifically designed for ECG mor… ▽ More Considering the variability of amplitude and phase patterns in electrocardiogram (ECG) signals due to cardiac activity and individual differences, existing entropy-based studies have not fully utilized these two patterns and lack integration. To address this gap, this paper proposes a novel fusion entropy metric, morphological ECG entropy (MEE) for the first time, specifically designed for ECG morphology, to comprehensively describe the fusion of amplitude and phase patterns. MEE is computed based on beat-level samples, enabling detailed analysis of each cardiac cycle. Experimental results demonstrate that MEE achieves rapid, accurate, and label-free localization of abnormal ECG arrhythmia regions. Furthermore, MEE provides a method for assessing sample diversity, facilitating compression of imbalanced training sets (via representative sample selection), and outperforms random pruning. Additionally, MEE exhibits the ability to describe areas of poor quality. By discussing, it proves the robustness of MEE value calculation to noise interference and its low computational complexity. Finally, we integrate this method into a clinical interactive interface to provide a more convenient and intuitive user experience. These findings indicate that MEE serves as a valuable clinical descriptor for ECG characterization. The implementation code can be referenced at the following link: https://github.com/fdu-harry/ECG-MEE-metric. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: 16 pages, 12 figures

ACM Class: I.5.2

arXiv:2404.02120 [pdf, other]

DEMO: Dose Exploration, Monitoring, and Optimization Using a Biological Mediator for Clinical Outcomes

Authors: Cheng-Han Yang, Peter F. Thall, Ruitao Lin

Abstract: Phase 1-2 designs provide a methodological advance over phase 1 designs for dose finding by using both clinical response and toxicity. A phase 1-2 trial still may fail to select a truly optimal dose. because early response is not a perfect surrogate for long term therapeutic success. To address this problem, a generalized phase 1-2 design first uses a phase 1-2 design's components to identify a se… ▽ More Phase 1-2 designs provide a methodological advance over phase 1 designs for dose finding by using both clinical response and toxicity. A phase 1-2 trial still may fail to select a truly optimal dose. because early response is not a perfect surrogate for long term therapeutic success. To address this problem, a generalized phase 1-2 design first uses a phase 1-2 design's components to identify a set of candidate doses, adaptively randomizes patients among the candidates, and after longer follow up selects a dose to maximize long-term success rate. In this paper, we extend this paradigm by proposing a design that exploits an early treatment-related, real-valued biological outcome, such as pharmacodynamic activity or an immunological effect, that may act as a mediator between dose and clinical outcomes, including tumor response, toxicity, and survival time. We assume multivariate dose-outcome models that include effects appearing in causal pathways from dose to the clinical outcomes. Bayesian model selection is used to identify and eliminate biologically inactive doses. At the end of the trial, a therapeutically optimal dose is chosen from the set of doses that are acceptably safe, clinically effective, and biologically active to maximize restricted mean survival time. Results of a simulation study show that the proposed design may provide substantial improvements over designs that ignore the biological variable. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2403.11960 [pdf, other]

CASPER: Causality-Aware Spatiotemporal Graph Neural Networks for Spatiotemporal Time Series Imputation

Authors: Baoyu **g, Dawei Zhou, Kan Ren, Carl Yang

Abstract: Spatiotemporal time series is the foundation of understanding human activities and their impacts, which is usually collected via monitoring sensors placed at different locations. The collected data usually contains missing values due to various failures, which have significant impact on data analysis. To impute the missing values, a lot of methods have been introduced. When recovering a specific d… ▽ More Spatiotemporal time series is the foundation of understanding human activities and their impacts, which is usually collected via monitoring sensors placed at different locations. The collected data usually contains missing values due to various failures, which have significant impact on data analysis. To impute the missing values, a lot of methods have been introduced. When recovering a specific data point, most existing methods tend to take into consideration all the information relevant to that point regardless of whether they have a cause-and-effect relationship. During data collection, it is inevitable that some unknown confounders are included, e.g., background noise in time series and non-causal shortcut edges in the constructed sensor network. These confounders could open backdoor paths between the input and output, in other words, they establish non-causal correlations between the input and output. Over-exploiting these non-causal correlations could result in overfitting and make the model vulnerable to noises. In this paper, we first revisit spatiotemporal time series imputation from a causal perspective, which shows the causal relationships among the input, output, embeddings and confounders. Next, we show how to block the confounders via the frontdoor adjustment. Based on the results of the frontdoor adjustment, we introduce a novel Causality-Aware SPatiotEmpoRal graph neural network (CASPER), which contains a novel Spatiotemporal Causal Attention (SCA) and a Prompt Based Decoder (PBD). PBD could reduce the impact of confounders and SCA could discover the sparse causal relationships among embeddings. Theoretical analysis reveals that SCA discovers causal relationships based on the values of gradients. We evaluate Casper on three real-world datasets, and the experimental results show that Casper outperforms the baselines and effectively discovers causal relationships. △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: Preprint. Work in progress

arXiv:2403.10424 [pdf, other]

Structured Evaluation of Synthetic Tabular Data

Authors: Scott Cheng-Hsin Yang, Baxter Eaves, Michael Schmidt, Ken Swanson, Patrick Shafto

Abstract: Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data; however, we lack an objective, coherent interpretation of the many metrics. To address this issue, we propose an evaluation framework with a single, mathematica… ▽ More Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data; however, we lack an objective, coherent interpretation of the many metrics. To address this issue, we propose an evaluation framework with a single, mathematical objective that posits that the synthetic data should be drawn from the same distribution as the observed data. Through various structural decomposition of the objective, this framework allows us to reason for the first time the completeness of any set of metrics, as well as unifies existing metrics, including those that stem from fidelity considerations, downstream application, and model-based approaches. Moreover, the framework motivates model-free baselines and a new spectrum of metrics. We evaluate structurally informed synthesizers and synthesizers powered by deep learning. Using our structured framework, we show that synthetic data generators that explicitly represent tabular structure outperform other methods, especially on smaller datasets. △ Less

Submitted 29 March, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

arXiv:2402.13187 [pdf, other]

Testing Calibration in Nearly-Linear Time

Authors: Lunjia Hu, Arun Jambulapati, Kevin Tian, Chutong Yang

Abstract: In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we… ▽ More In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given $n$ draws from a distribution $\mathcal{D}$ on $(predictions, binary outcomes)$, our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated, and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph, and design an exact dynamic programming-based solver for it which runs in time $O(n\log^2(n))$, and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring $Ω(n^ω)$ time, where $ω> 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem improving upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes. △ Less

Submitted 21 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.11283 [pdf, other]

Deep adaptive sampling for surrogate modeling without labeled data

Authors: Xili Wang, Kejun Tang, Jiayu Zhai, Xiaoliang Wan, Chao Yang

Abstract: Surrogate modeling is of great practical significance for parametric differential equation systems. In contrast to classical numerical methods, using physics-informed deep learning methods to construct simulators for such systems is a promising direction due to its potential to handle high dimensionality, which requires minimizing a loss over a training set of random samples. However, the random s… ▽ More Surrogate modeling is of great practical significance for parametric differential equation systems. In contrast to classical numerical methods, using physics-informed deep learning methods to construct simulators for such systems is a promising direction due to its potential to handle high dimensionality, which requires minimizing a loss over a training set of random samples. However, the random samples introduce statistical errors, which may become the dominant errors for the approximation of low-regularity and high-dimensional problems. In this work, we present a deep adaptive sampling method for surrogate modeling ($\text{DAS}^2$), where we generalize the deep adaptive sampling (DAS) method [62] [Tang, Wan and Yang, 2023] to build surrogate models for low-regularity parametric differential equations. In the parametric setting, the residual loss function can be regarded as an unnormalized probability density function (PDF) of the spatial and parametric variables. This PDF is approximated by a deep generative model, from which new samples are generated and added to the training set. Since the new samples match the residual-induced distribution, the refined training set can further reduce the statistical error in the current approximate solution. We demonstrate the effectiveness of $\text{DAS}^2$ with a series of numerical experiments, including the parametric lid-driven 2D cavity flow problem with a continuous range of Reynolds numbers from 100 to 1000. △ Less

Submitted 17 February, 2024; originally announced February 2024.

arXiv:2311.14652 [pdf, other]

One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

Authors: Raghav Addanki, Chenyang Li, Zhao Song, Chiwun Yang

Abstract: Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring substantial computational resources. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we f… ▽ More Attention computation takes both the time complexity of $O(n^2)$ and the space complexity of $O(n^2)$ simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring substantial computational resources. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we focus on the memory-efficient issue when context length $n$ is much greater than 128K ($n \gg 2^d$). Considering a single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$, the polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$. It accomplishes this by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$ to expedite attention ${\sf Attn}(Q, K, V)$ computation within $n^{1+o(1)}$ time executions. Despite this, computing the approximated attention matrix $U_1U_2^\top \in \mathbb{R}^{n \times n}$ still necessitates $O(n^2)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that only reads one pass of the data in a streaming fashion. This method employs sublinear space $o(n)$ to store three sketch matrices, alleviating the need for exact $K, V$ storage. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens. As the token length $n$ increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications. △ Less

Submitted 5 February, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

arXiv:2310.12462 [pdf, other]

Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights

Authors: Yichuan Deng, Zhao Song, Shenghao Xie, Chiwun Yang

Abstract: In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and… ▽ More In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: Can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data. △ Less

Submitted 19 October, 2023; originally announced October 2023.

arXiv:2309.16240 [pdf, other]

Beyond Reverse KL: Generalizing Direct Preference Optimization with Diverse Divergence Constraints

Authors: Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen

Abstract: The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and depe… ▽ More The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but concurrently amplify safety concerns, such as potential misuse of AI systems, necessitating effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model. Direct Preference Optimization (DPO) has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint. This paper presents $f$-DPO, a generalized approach to DPO by incorporating diverse divergence constraints. We show that under certain $f$-divergences, including Jensen-Shannon divergence, forward KL divergences and $α$-divergences, the complex relationship between the reward and optimal policy can also be simplified by addressing the Karush-Kuhn-Tucker conditions. This eliminates the need for estimating the normalizing constant in the Bradley-Terry model and enables a tractable map** between the reward function and the optimal policy. Our approach optimizes LLMs to align with human preferences in a more efficient and supervised manner under a broad set of divergence constraints. Empirically, adopting these divergences ensures a balance between alignment performance and generation diversity. Importantly, $f$-DPO outperforms PPO-based methods in divergence efficiency, and divergence constraints directly influence expected calibration error (ECE). △ Less

Submitted 28 September, 2023; originally announced September 2023.

Comments: Preprint

arXiv:2309.08642 [pdf, other]

A Stochastic Online Forecast-and-Optimize Framework for Real-Time Energy Dispatch in Virtual Power Plants under Uncertainty

Authors: Wei Jiang, Zhongkai Yi, Li Wang, Hanwei Zhang, Jihai Zhang, Fangquan Lin, Cheng Yang

Abstract: Aggregating distributed energy resources in power systems significantly increases uncertainties, in particular caused by the fluctuation of renewable energy generation. This issue has driven the necessity of widely exploiting advanced predictive control techniques under uncertainty to ensure long-term economics and decarbonization. In this paper, we propose a real-time uncertainty-aware energy dis… ▽ More Aggregating distributed energy resources in power systems significantly increases uncertainties, in particular caused by the fluctuation of renewable energy generation. This issue has driven the necessity of widely exploiting advanced predictive control techniques under uncertainty to ensure long-term economics and decarbonization. In this paper, we propose a real-time uncertainty-aware energy dispatch framework, which is composed of two key elements: (i) A hybrid forecast-and-optimize sequential task, integrating deep learning-based forecasting and stochastic optimization, where these two stages are connected by the uncertainty estimation at multiple temporal resolutions; (ii) An efficient online data augmentation scheme, jointly involving model pre-training and online fine-tuning stages. In this way, the proposed framework is capable to rapidly adapt to the real-time data distribution, as well as to target on uncertainties caused by data drift, model discrepancy and environment perturbations in the control process, and finally to realize an optimal and robust dispatch solution. The proposed framework won the championship in CityLearn Challenge 2022, which provided an influential opportunity to investigate the potential of AI application in the energy domain. In addition, comprehensive experiments are conducted to interpret its effectiveness in the real-life scenario of smart building energy management. △ Less

Submitted 14 September, 2023; originally announced September 2023.

Comments: Preprint. Accepted by CIKM 23

arXiv:2308.11138 [pdf, ps, other]

NLP-based detection of systematic anomalies among the narratives of consumer complaints

Authors: Peiheng Gao, Ning Sun, Xuefeng Wang, Chen Yang, Ričardas Zitikis

Abstract: We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations… ▽ More We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau. △ Less

Submitted 26 March, 2024; v1 submitted 21 August, 2023; originally announced August 2023.

arXiv:2308.06106 [pdf, other]

Hawkes Processes with Delayed Granger Causality

Authors: Chao Yang, Hengyuan Miao, Shuang Li

Abstract: We aim to explicitly model the delayed Granger causal effects based on multivariate Hawkes processes. The idea is inspired by the fact that a causal event usually takes some time to exert an effect. Studying this time lag itself is of interest. Given the proposed model, we first prove the identifiability of the delay parameter under mild conditions. We further investigate a model estimation method… ▽ More We aim to explicitly model the delayed Granger causal effects based on multivariate Hawkes processes. The idea is inspired by the fact that a causal event usually takes some time to exert an effect. Studying this time lag itself is of interest. Given the proposed model, we first prove the identifiability of the delay parameter under mild conditions. We further investigate a model estimation method under a complex setting, where we want to infer the posterior distribution of the time lags and understand how this distribution varies across different scenarios. We treat the time lags as latent variables and formulate a Variational Auto-Encoder (VAE) algorithm to approximate the posterior distribution of the time lags. By explicitly modeling the time lags in Hawkes processes, we add flexibility to the model. The inferred time-lag posterior distributions are of scientific meaning and help trace the original causal time that supports the root cause analysis. We empirically evaluate our model's event prediction and time-lag inference accuracy on synthetic and real data, achieving promising results. △ Less

Submitted 11 August, 2023; originally announced August 2023.

Comments: 19 pages

arXiv:2306.17584 [pdf, other]

Flexible and Accurate Methods for Estimation and Inference of Gaussian Graphical Models with Applications

Authors: Yueqi Qian, Xianghong Hu, Can Yang

Abstract: The Gaussian graphical model (GGM) incorporates an undirected graph to represent the conditional dependence between variables, with the precision matrix encoding partial correlation between pair of variables given the others. To achieve flexible and accurate estimation and inference of GGM, we propose the novel method FLAG, which utilizes the random effects model for pairwise conditional regressio… ▽ More The Gaussian graphical model (GGM) incorporates an undirected graph to represent the conditional dependence between variables, with the precision matrix encoding partial correlation between pair of variables given the others. To achieve flexible and accurate estimation and inference of GGM, we propose the novel method FLAG, which utilizes the random effects model for pairwise conditional regression to estimate the precision matrix and applies statistical tests to recover the graph. Compared with existing methods, FLAG has several unique advantages: (i) it provides accurate estimation without sparsity assumptions on the precision matrix, (ii) it allows for element-wise inference of the precision matrix, (iii) it achieves computational efficiency by develo** an efficient PX-EM algorithm and a MM algorithm accelerated with low-rank updates, and (iv) it enables joint estimation of multiple graphs using FLAG-Meta or FLAG-CA. The proposed methods are evaluated using various simulation settings and real data applications, including gene expression in the human brain, term association in university websites, and stock prices in the U.S. financial market. The results demonstrate that FLAG and its extensions provide accurate precision estimation and graph recovery. △ Less

Submitted 30 June, 2023; originally announced June 2023.

arXiv:2305.18702 [pdf, other]

Adversarial Adaptive Sampling: Unify PINN and Optimal Transport for the Approximation of PDEs

Authors: Kejun Tang, Jiayu Zhai, Xiaoliang Wan, Chao Yang

Abstract: Solving partial differential equations (PDEs) is a central task in scientific computing. Recently, neural network approximation of PDEs has received increasing attention due to its flexible meshless discretization and its potential for high-dimensional problems. One fundamental numerical difficulty is that random samples in the training set introduce statistical errors into the discretization of l… ▽ More Solving partial differential equations (PDEs) is a central task in scientific computing. Recently, neural network approximation of PDEs has received increasing attention due to its flexible meshless discretization and its potential for high-dimensional problems. One fundamental numerical difficulty is that random samples in the training set introduce statistical errors into the discretization of loss functional which may become the dominant error in the final approximation, and therefore overshadow the modeling capability of the neural network. In this work, we propose a new minmax formulation to optimize simultaneously the approximate solution, given by a neural network model, and the random samples in the training set, provided by a deep generative model. The key idea is to use a deep generative model to adjust random samples in the training set such that the residual induced by the approximate PDE solution can maintain a smooth profile when it is being minimized. Such an idea is achieved by implicitly embedding the Wasserstein distance between the residual-induced distribution and the uniform distribution into the loss, which is then minimized together with the residual. A nearly uniform residual profile means that its variance is small for any normalized weight function such that the Monte Carlo approximation error of the loss functional is reduced significantly for a certain sample size. The adversarial adaptive sampling (AAS) approach proposed in this work is the first attempt to formulate two essential components, minimizing the residual and seeking the optimal training set, into one minmax objective functional for the neural network approximation of PDEs. △ Less

Submitted 14 March, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: ICLR, 2024

arXiv:2305.14612 [pdf]

Assessment of Anterior Cruciate Ligament Injury Risk Based on Human Key Points Detection Algorithm

Authors: Ziyu Gong, Xiong Zhao, Chen Yang

Abstract: This paper aims to detect the potential injury risk of the anterior cruciate ligament (ACL) by proposing an ACL potential injury risk assessment algorithm based on key points of the human body detected using computer vision technology. To obtain the key points data of the human body in each frame, OpenPose, an open source computer vision algorithm, was employed. The obtained data underwent preproc… ▽ More This paper aims to detect the potential injury risk of the anterior cruciate ligament (ACL) by proposing an ACL potential injury risk assessment algorithm based on key points of the human body detected using computer vision technology. To obtain the key points data of the human body in each frame, OpenPose, an open source computer vision algorithm, was employed. The obtained data underwent preprocessing and were then fed into an ACL potential injury feature extraction model based on the Landing Error Evaluation System (LESS). This model extracted several important parameters, including the knee flexion angle, the trunk flexion on the sagittal plane, trunk flexion angle on the frontal plane, the ankle knee horizontal distance, and the ankle shoulder horizontal distance. Each of these features was assigned a threshold interval, and a segmented evaluation function was utilized to score them accordingly. To calculate the final score of the participant, the score values were input into a weighted scoring model designed based on the Analytic Hierarchy Process (AHP). The AHP based model takes into account the relative importance of each feature in the overall assessment. The results demonstrate that the proposed algorithm effectively detects the potential risk of ACL injury. The proposed algorithm demonstrates its effectiveness in detecting ACL injury risk, offering valuable insights for injury prevention and intervention strategies in sports and related fields. Code is available at: https://github.com/ZiyuGong-proj/Assessment-of-ACL-Injury-Risk-Based-on-Openpose △ Less

Submitted 23 May, 2023; originally announced May 2023.

Comments: 17 pages,and 6 figures

arXiv:2303.10388 [pdf]

doi 10.1093/bioinformatics/btad470

ggpicrust2: an R package for PICRUSt2 predicted functional profile analysis and visualization

Authors: Chen Yang, Jiahao Mai, Xuan Cao, Aaron Burberry, Fabio Cominelli, Liangliang Zhang

Abstract: Microbiome research is now moving beyond the compositional analysis of microbial taxa in a sample. Increasing evidence from large human microbiome studies suggests that functional consequences of changes in the intestinal microbiome may provide more power for studying their impact on inflammation and immune responses. Although 16S rRNA analysis is one of the most popular and a cost-effective metho… ▽ More Microbiome research is now moving beyond the compositional analysis of microbial taxa in a sample. Increasing evidence from large human microbiome studies suggests that functional consequences of changes in the intestinal microbiome may provide more power for studying their impact on inflammation and immune responses. Although 16S rRNA analysis is one of the most popular and a cost-effective method to profile the microbial compositions, marker-gene sequencing cannot provide direct information about the functional genes that are present in the genomes of community members. Bioinformatic tools have been developed to predict microbiome function with 16S rRNA gene data. Among them, PICRUSt2 has become one of the most popular functional profile prediction tools, which generates community-wide pathway abundances. However, no state-of-art inference tools are available to test the differences in pathway abundances between comparison groups. We have developed ggpicrust2, an R package, to do extensive differential abundance(DA) analyses and provide publishable visualization to highlight the signals. △ Less

Submitted 9 April, 2023; v1 submitted 18 March, 2023; originally announced March 2023.

Comments: 4 pages, 1 figure

arXiv:2303.05341 [pdf, other]

Penalized Deep Partially Linear Cox Models with Application to CT Scans of Lung Cancer Patients

Authors: Yuming Sun, Jian Kang, Chinmay Haridas, Nicholas R. Mayne, Alexandra L. Potter, Chi-Fu Jeffrey Yang, David C. Christiani, Yi Li

Abstract: Lung cancer is a leading cause of cancer mortality globally, highlighting the importance of understanding its mortality risks to design effective patient-centered therapies. The National Lung Screening Trial (NLST) employed computed tomography texture analysis, which provides objective measurements of texture patterns on CT scans, to quantify the mortality risks of lung cancer patients. Partially… ▽ More Lung cancer is a leading cause of cancer mortality globally, highlighting the importance of understanding its mortality risks to design effective patient-centered therapies. The National Lung Screening Trial (NLST) employed computed tomography texture analysis, which provides objective measurements of texture patterns on CT scans, to quantify the mortality risks of lung cancer patients. Partially linear Cox models have gained popularity for survival analysis by dissecting the hazard function into parametric and nonparametric components, allowing for the effective incorporation of both well-established risk factors (such as age and clinical variables) and emerging risk factors (e.g., image features) within a unified framework. However, when the dimension of parametric components exceeds the sample size, the task of model fitting becomes formidable, while nonparametric modeling grapples with the curse of dimensionality. We propose a novel Penalized Deep Partially Linear Cox Model (Penalized DPLC), which incorporates the SCAD penalty to select important texture features and employs a deep neural network to estimate the nonparametric component of the model. We prove the convergence and asymptotic properties of the estimator and compare it to other methods through extensive simulation studies, evaluating its performance in risk prediction and feature selection. The proposed method is applied to the NLST study dataset to uncover the effects of key clinical and imaging risk factors on patients' survival. Our findings provide valuable insights into the relationship between these factors and survival outcomes. △ Less

Submitted 29 September, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

arXiv:2303.02566 [pdf, other]

MFAI: A Scalable Bayesian Matrix Factorization Approach to Leveraging Auxiliary Information

Authors: Zhiwei Wang, Fa Zhang, Cong Zheng, Xianghong Hu, Mingxuan Cai, Can Yang

Abstract: In various practical situations, matrix factorization methods suffer from poor data quality, such as high data sparsity and low signal-to-noise ratio (SNR). Here, we consider a matrix factorization problem by utilizing auxiliary information, which is massively available in real-world applications, to overcome the challenges caused by poor data quality. Unlike existing methods that mainly rely on s… ▽ More In various practical situations, matrix factorization methods suffer from poor data quality, such as high data sparsity and low signal-to-noise ratio (SNR). Here, we consider a matrix factorization problem by utilizing auxiliary information, which is massively available in real-world applications, to overcome the challenges caused by poor data quality. Unlike existing methods that mainly rely on simple linear models to combine auxiliary information with the main data matrix, we propose to integrate gradient boosted trees in the probabilistic matrix factorization framework to effectively leverage auxiliary information (MFAI). Thus, MFAI naturally inherits several salient features of gradient boosted trees, such as the capability of flexibly modeling nonlinear relationships and robustness to irrelevant features and missing values in auxiliary information. The parameters in MFAI can be automatically determined under the empirical Bayes framework, making it adaptive to the utilization of auxiliary information and immune to overfitting. Moreover, MFAI is computationally efficient and scalable to large datasets by exploiting variational inference. We demonstrate the advantages of MFAI through comprehensive numerical results from simulation studies and real data analyses. Our approach is implemented in the R package mfair available at https://github.com/YangLabHKUST/mfair. △ Less

Submitted 12 February, 2024; v1 submitted 4 March, 2023; originally announced March 2023.

arXiv:2302.13231 [pdf]

A Synthetic Texas Backbone Power System with Climate-Dependent Spatio-Temporal Correlated Profiles

Authors: ** Lu, Xingpeng Li, Hongyi Li, Taher Chegini, Carlos Gamarra, Y. C. Ethan Yang, Margaret Cook, Gavin Dillingham

Abstract: Most power system test cases only have electrical parameters and can be used only for studies based on a snapshot of system profiles. To facilitate more comprehensive and practical studies, a synthetic power system including spatio-temporal correlated profiles for the entire year of 2019 at one-hour resolution has been created in this work. This system, referred to as the synthetic Texas 123-bus b… ▽ More Most power system test cases only have electrical parameters and can be used only for studies based on a snapshot of system profiles. To facilitate more comprehensive and practical studies, a synthetic power system including spatio-temporal correlated profiles for the entire year of 2019 at one-hour resolution has been created in this work. This system, referred to as the synthetic Texas 123-bus backbone transmission (TX-123BT) system, has very similar temporal and spatial characteristics with the actual Electric Reliability Council of Texas (ERCOT) system. It has a backbone network consisting of only high-voltage transmission lines in Texas, which is obtained by the K-medoids clustering method. The climate data extracted from the North American Land Data Assimilation System (NLDAS) are used to create the climate-dependent profiles of renewable generation and transmission thermal limits. Two climate-dependent models are implemented to determine wind and solar power production pro-files respectively. In addition, two sets of climate-dependent dy-namic line rating (DLR) profiles are created with the actual climate information: (i) daily DLR and (ii) hourly DLR. Simulation results of security-constrained unit commitment (SCUC) conducted on each of the daily system profiles have validated the developed one-year hourly time series dataset. △ Less

Submitted 25 February, 2023; originally announced February 2023.

Comments: 10 pages, 14 figures, 12 tables

arXiv:2302.06807 [pdf, other]

Horospherical Decision Boundaries for Large Margin Classification in Hyperbolic Space

Authors: Xiran Fan, Chun-Hao Yang, Baba C. Vemuri

Abstract: Hyperbolic spaces have been quite popular in the recent past for representing hierarchically organized data. Further, several classification algorithms for data in these spaces have been proposed in the literature. These algorithms mainly use either hyperplanes or geodesics for decision boundaries in a large margin classifiers setting leading to a non-convex optimization problem. In this paper, we… ▽ More Hyperbolic spaces have been quite popular in the recent past for representing hierarchically organized data. Further, several classification algorithms for data in these spaces have been proposed in the literature. These algorithms mainly use either hyperplanes or geodesics for decision boundaries in a large margin classifiers setting leading to a non-convex optimization problem. In this paper, we propose a novel large margin classifier based on horospherical decision boundaries that leads to a geodesically convex optimization problem that can be optimized using any Riemannian gradient descent technique guaranteeing a globally optimal solution. We present several experiments depicting the competitive performance of our classifier in comparison to SOTA. △ Less

Submitted 28 September, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

Comments: To appear at Neural Information Processing Systems (NeurIPS) 2023

arXiv:2211.08881 [pdf]

A foretaste of Qatar 2022: decreased playing time of internationals after the Africa Cup of Nations

Authors: Otto Kolbinger, Chenyuyan Yang, Martin Lames

Abstract: Due to the unfavourable climatic conditions in Qatar during summertime, the FIFA World Cup 2022 will be played during on-going seasons of the major European leagues. This study investigates how national teams' tournaments scheduled at such a time window impact the playing time of released players, using data from the Africa Cups of Nations (AFCON). For 262 internationals playing at the 2013, 2015… ▽ More Due to the unfavourable climatic conditions in Qatar during summertime, the FIFA World Cup 2022 will be played during on-going seasons of the major European leagues. This study investigates how national teams' tournaments scheduled at such a time window impact the playing time of released players, using data from the Africa Cups of Nations (AFCON). For 262 internationals playing at the 2013, 2015 and 2021 AFCON, we compared the share of possible games and minutes played before and after the tournament using Mann-Whitney-U tests. We found a significant decrease of 3.3% for games (p=.029, CL_Effect_Size=44.5%) and 3.1% for minutes played respectively (p=.018, CL_Effect_Size=44.9%). For a subsample of 163 players, which played for the same club the preceding seasons, we found that these players tend to have played more in the second half of the previous season, resulting in a net decrease of 6.8% for games (p=.011, CL_Effect_Size=42.3%) and 7.1% for minutes played (p=.007, CL_Effect_Size=41.9%). Conclusions for the FIFA World Cup 2022 should only be drawn carefully as the number of released players was comparatively low. However, the findings give some indication that releasing clubs might suffer the rest of the season after this tournament. △ Less

Submitted 16 November, 2022; originally announced November 2022.

Comments: 15 pages, 2 figures

arXiv:2207.13248 [pdf, ps, other]

Tail maximal dependence in bivariate models: estimation and applications

Authors: Ning Sun, Chen Yang, Ričardas Zitikis

Abstract: Assessing dependence within co-movements of financial instruments has been of much interest in risk management. Typically, indices of tail dependence are used to quantify the strength of such dependence, although many of the indices underestimate the strength. Hence, we advocate the use of a statistical procedure designed to estimate the maximal strength of dependence that can possibly occur among… ▽ More Assessing dependence within co-movements of financial instruments has been of much interest in risk management. Typically, indices of tail dependence are used to quantify the strength of such dependence, although many of the indices underestimate the strength. Hence, we advocate the use of a statistical procedure designed to estimate the maximal strength of dependence that can possibly occur among the co-movements. We illustrate the procedure using simulated and real data-sets. △ Less

Submitted 19 September, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

MSC Class: 62P05

arXiv:2204.07615 [pdf, other]

TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets

Authors: Chengrun Yang, Gabriel Bender, Hanxiao Liu, Pieter-Jan Kindermans, Madeleine Udell, Yifeng Lu, Quoc Le, Da Huang

Abstract: The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate re… ▽ More The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-sha** methods: it finds better models that obey the constraints. △ Less

Submitted 20 October, 2022; v1 submitted 15 April, 2022; originally announced April 2022.

Comments: NeurIPS 2022; 30 pages, 15 figures, 7 tables

arXiv:2203.15804 [pdf]

Improving The Diagnosis of Thyroid Cancer by Machine Learning and Clinical Data

Authors: Nan Miles Xi, Lin Wang, Chuanjia Yang

Abstract: Thyroid cancer is a common endocrine carcinoma that occurs in the thyroid gland. Much effort has been invested in improving its diagnosis, and thyroidectomy remains the primary treatment method. A successful operation without unnecessary side injuries relies on an accurate preoperative diagnosis. Current human assessment of thyroid nodule malignancy is prone to errors and may not guarantee an accu… ▽ More Thyroid cancer is a common endocrine carcinoma that occurs in the thyroid gland. Much effort has been invested in improving its diagnosis, and thyroidectomy remains the primary treatment method. A successful operation without unnecessary side injuries relies on an accurate preoperative diagnosis. Current human assessment of thyroid nodule malignancy is prone to errors and may not guarantee an accurate preoperative diagnosis. This study proposed a machine framework to predict thyroid nodule malignancy based on a novel clinical dataset we collected. The 10-fold cross-validation, bootstrap analysis, and permutation predictor importance were applied to estimate and interpret the model performance under uncertainty. The comparison between model prediction and expert assessment shows the advantage of our framework over human judgment in predicting thyroid nodule malignancy. Our method is accurate, interpretable, and thus useable as additional evidence in the preoperative diagnosis for thyroid cancer. △ Less

Submitted 27 March, 2022; originally announced March 2022.

arXiv:2201.09433 [pdf, other]

Active Learning Polynomial Threshold Functions

Authors: Omri Ben-Eliezer, Max Hopkins, Chutong Yang, Hantao Yu

Abstract: We initiate the study of active learning polynomial threshold functions (PTFs). While traditional lower bounds imply that even univariate quadratics cannot be non-trivially actively learned, we show that allowing the learner basic access to the derivatives of the underlying classifier circumvents this issue and leads to a computationally efficient algorithm for active learning degree-$d$ univariat… ▽ More We initiate the study of active learning polynomial threshold functions (PTFs). While traditional lower bounds imply that even univariate quadratics cannot be non-trivially actively learned, we show that allowing the learner basic access to the derivatives of the underlying classifier circumvents this issue and leads to a computationally efficient algorithm for active learning degree-$d$ univariate PTFs in $\tilde{O}(d^3\log(1/\varepsilonδ))$ queries. We also provide near-optimal algorithms and analyses for active learning PTFs in several average case settings. Finally, we prove that access to derivatives is insufficient for active learning multivariate PTFs, even those of just two variables. △ Less

Submitted 1 October, 2022; v1 submitted 23 January, 2022; originally announced January 2022.

MSC Class: 68Q32

arXiv:2112.14038 [pdf, other]

DAS-PINNs: A deep adaptive sampling method for solving high-dimensional partial differential equations

Authors: Kejun Tang, Xiaoliang Wan, Chao Yang

Abstract: In this work we propose a deep adaptive sampling (DAS) method for solving partial differential equations (PDEs), where deep neural networks are utilized to approximate the solutions of PDEs and deep generative models are employed to generate new collocation points that refine the training set. The overall procedure of DAS consists of two components: solving the PDEs by minimizing the residual loss… ▽ More In this work we propose a deep adaptive sampling (DAS) method for solving partial differential equations (PDEs), where deep neural networks are utilized to approximate the solutions of PDEs and deep generative models are employed to generate new collocation points that refine the training set. The overall procedure of DAS consists of two components: solving the PDEs by minimizing the residual loss on the collocation points in the training set and generating a new training set to further improve the accuracy of current approximate solution. In particular, we treat the residual as a probability density function and approximate it with a deep generative model, called KRnet. The new samples from KRnet are consistent with the distribution induced by the residual, i.e., more samples are located in the region of large residual and less samples are located in the region of small residual. Analogous to classical adaptive methods such as the adaptive finite element, KRnet acts as an error indicator that guides the refinement of the training set. Compared to the neural network approximation obtained with uniformly distributed collocation points, the developed algorithms can significantly improve the accuracy, especially for low regularity and high-dimensional problems. We demonstrate the effectiveness of the proposed DAS method with numerical experiments. △ Less

Submitted 5 July, 2022; v1 submitted 28 December, 2021; originally announced December 2021.

arXiv:2112.03402 [pdf, other]

Nested Hyperbolic Spaces for Dimensionality Reduction and Hyperbolic NN Design

Authors: Xiran Fan, Chun-Hao Yang, Baba C. Vemuri

Abstract: Hyperbolic neural networks have been popular in the recent past due to their ability to represent hierarchical data sets effectively and efficiently. The challenge in develo** these networks lies in the nonlinearity of the embedding space namely, the Hyperbolic space. Hyperbolic space is a homogeneous Riemannian manifold of the Lorentz group. Most existing methods (with some exceptions) use loca… ▽ More Hyperbolic neural networks have been popular in the recent past due to their ability to represent hierarchical data sets effectively and efficiently. The challenge in develo** these networks lies in the nonlinearity of the embedding space namely, the Hyperbolic space. Hyperbolic space is a homogeneous Riemannian manifold of the Lorentz group. Most existing methods (with some exceptions) use local linearization to define a variety of operations paralleling those used in traditional deep neural networks in Euclidean spaces. In this paper, we present a novel fully hyperbolic neural network which uses the concept of projections (embeddings) followed by an intrinsic aggregation and a nonlinearity all within the hyperbolic space. The novelty here lies in the projection which is designed to project data on to a lower-dimensional embedded hyperbolic space and hence leads to a nested hyperbolic space representation independently useful for dimensionality reduction. The main theoretical contribution is that the proposed embedding is proved to be isometric and equivariant under the Lorentz transformations. This projection is computationally efficient since it can be expressed by simple linear operations, and, due to the aforementioned equivariance property, it allows for weight sharing. The nested hyperbolic space representation is the core component of our network and therefore, we first compare this ensuing nested hyperbolic space representation with other dimensionality reduction methods such as tangent PCA, principal geodesic analysis (PGA) and HoroPCA. Based on this equivariant embedding, we develop a novel fully hyperbolic graph convolutional neural network architecture to learn the parameters of the projection. Finally, we present experiments demonstrating comparative performance of our network on several publicly available data sets. △ Less

Submitted 2 December, 2021; originally announced December 2021.

Comments: 19 pages, 6 figures

arXiv:2111.13378 [pdf, other]

Differentially Private Methods for Releasing Results of Stability Analyses

Authors: Chengxin Yang, Jerome P. Reiter

Abstract: Data stewards and analysts can promote transparent and trustworthy science and policy-making by facilitating assessments of the sensitivity of published results to alternate analysis choices. For example, researchers may want to assess whether the results change substantially when different subsets of data points (e.g., sets formed by demographic characteristics) are used in the analysis, or when… ▽ More Data stewards and analysts can promote transparent and trustworthy science and policy-making by facilitating assessments of the sensitivity of published results to alternate analysis choices. For example, researchers may want to assess whether the results change substantially when different subsets of data points (e.g., sets formed by demographic characteristics) are used in the analysis, or when different models (e.g., with or without log transformations) are estimated on the data. Releasing the results of such stability analyses leaks information about the data subjects. When the underlying data are confidential, the data stewards and analysts may seek to bound this information leakage. We present methods for stability analyses that can satisfy differential privacy, a definition of data confidentiality providing such bounds. We use regression modeling as the motivating example. The basic idea is to split the data into disjoint subsets, compute a measure summarizing the difference between the published and alternative analysis on each subset, aggregate these subset estimates, and add noise to the aggregated value to satisfy differential privacy. We illustrate the methods using regressions in which an analyst compares coefficient estimates for different groups in the data, and in which analysts fit two different models on the data. △ Less

Submitted 23 August, 2023; v1 submitted 26 November, 2021; originally announced November 2021.

Comments: 30 pages, 3 figures

arXiv:2109.14236 [pdf, other]

LightSecAgg: a Lightweight and Versatile Design for Secure Aggregation in Federated Learning

Authors: **hyun So, Chaoyang He, Chien-Sheng Yang, Songze Li, Qian Yu, Ramy E. Ali, Basak Guler, Salman Avestimehr

Abstract: Secure model aggregation is a key component of federated learning (FL) that aims at protecting the privacy of each user's individual model while allowing for their global aggregation. It can be applied to any aggregation-based FL approach for training a global or personalized model. Model aggregation needs to also be resilient against likely user dropouts in FL systems, making its design substanti… ▽ More Secure model aggregation is a key component of federated learning (FL) that aims at protecting the privacy of each user's individual model while allowing for their global aggregation. It can be applied to any aggregation-based FL approach for training a global or personalized model. Model aggregation needs to also be resilient against likely user dropouts in FL systems, making its design substantially more complex. State-of-the-art secure aggregation protocols rely on secret sharing of the random-seeds used for mask generations at the users to enable the reconstruction and cancellation of those belonging to the dropped users. The complexity of such approaches, however, grows substantially with the number of dropped users. We propose a new approach, named LightSecAgg, to overcome this bottleneck by changing the design from "random-seed reconstruction of the dropped users" to "one-shot aggregate-mask reconstruction of the active users via mask encoding/decoding". We show that LightSecAgg achieves the same privacy and dropout-resiliency guarantees as the state-of-the-art protocols while significantly reducing the overhead for resiliency against dropped users. We also demonstrate that, unlike existing schemes, LightSecAgg can be applied to secure aggregation in the asynchronous FL setting. Furthermore, we provide a modular system design and optimized on-device parallelization for scalable implementation, by enabling computational overlap** between model training and on-device encoding, as well as improving the speed of concurrent receiving and sending of chunked masks. We evaluate LightSecAgg via extensive experiments for training diverse models on various datasets in a realistic FL system with large number of users and demonstrate that LightSecAgg significantly reduces the total training time. △ Less

Submitted 1 February, 2022; v1 submitted 29 September, 2021; originally announced September 2021.

Comments: This paper is accepted to the 5th MLSys Conference, Santa Clara, CA, USA, 2022

arXiv:2106.15743 [pdf, other]

BONuS: Multiple multivariate testing with a data-adaptivetest statistic

Authors: Chiao-Yu Yang, Lihua Lei, Nhat Ho, Will Fithian

Abstract: We propose a new adaptive empirical Bayes framework, the Bag-Of-Null-Statistics (BONuS) procedure, for multiple testing where each hypothesis testing problem is itself multivariate or nonparametric. BONuS is an adaptive and interactive knockoff-type method that helps improve the testing power while controlling the false discovery rate (FDR), and is closely connected to the "counting knockoffs" pro… ▽ More We propose a new adaptive empirical Bayes framework, the Bag-Of-Null-Statistics (BONuS) procedure, for multiple testing where each hypothesis testing problem is itself multivariate or nonparametric. BONuS is an adaptive and interactive knockoff-type method that helps improve the testing power while controlling the false discovery rate (FDR), and is closely connected to the "counting knockoffs" procedure analyzed in Weinstein et al. (2017). Contrary to procedures that start with a $p$-value for each hypothesis, our method analyzes the entire data set to adaptively estimate an optimal $p$-value transform based on an empirical Bayes model. Despite the extra adaptivity, our method controls FDR in finite samples even if the empirical Bayes model is incorrect or the estimation is poor. An extension, the Double BONuS procedure, validates the empirical Bayes model to guard against power loss due to model misspecification. △ Less

Submitted 1 July, 2021; v1 submitted 29 June, 2021; originally announced June 2021.

arXiv:2106.13423 [pdf, other]

Federated Graph Classification over Non-IID Graphs

Authors: Han Xie, **g Ma, Li Xiong, Carl Yang

Abstract: Federated learning has emerged as an important paradigm for training machine learning models in different domains. For graph-level tasks such as graph classification, graphs can also be regarded as a special type of data samples, which can be collected and stored in separate local systems. Similar to other domains, multiple local systems, each holding a small set of graphs, may benefit from collab… ▽ More Federated learning has emerged as an important paradigm for training machine learning models in different domains. For graph-level tasks such as graph classification, graphs can also be regarded as a special type of data samples, which can be collected and stored in separate local systems. Similar to other domains, multiple local systems, each holding a small set of graphs, may benefit from collaboratively training a powerful graph mining model, such as the popular graph neural networks (GNNs). To provide more motivation towards such endeavors, we analyze real-world graphs from different domains to confirm that they indeed share certain graph properties that are statistically significant compared with random graphs. However, we also find that different sets of graphs, even from the same domain or same dataset, are non-IID regarding both graph structures and node features. To handle this, we propose a graph clustered federated learning (GCFL) framework that dynamically finds clusters of local systems based on the gradients of GNNs, and theoretically justify that such clusters can reduce the structure and feature heterogeneity among graphs owned by the local systems. Moreover, we observe the gradients of GNNs to be rather fluctuating in GCFL which impedes high-quality clustering, and design a gradient sequence-based clustering mechanism based on dynamic time war** (GCFL+). Extensive experimental results and in-depth analysis demonstrate the effectiveness of our proposed frameworks. △ Less

Submitted 7 November, 2021; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: Accepted to NeurIPS 2021

arXiv:2106.08171 [pdf, other]

Evaluating Modules in Graph Contrastive Learning

Authors: Ganqu Cui, Yufeng Du, Cheng Yang, Jie Zhou, Liang Xu, Xing Zhou, Xingyi Cheng, Zhiyuan Liu

Abstract: The recent emergence of contrastive learning approaches facilitates the application on graph representation learning (GRL), introducing graph contrastive learning (GCL) into the literature. These methods contrast semantically similar and dissimilar sample pairs to encode the semantics into node or graph embeddings. However, most existing works only performed \textbf{model-level} evaluation, and di… ▽ More The recent emergence of contrastive learning approaches facilitates the application on graph representation learning (GRL), introducing graph contrastive learning (GCL) into the literature. These methods contrast semantically similar and dissimilar sample pairs to encode the semantics into node or graph embeddings. However, most existing works only performed \textbf{model-level} evaluation, and did not explore the combination space of modules for more comprehensive and systematic studies. For effective \textbf{module-level} evaluation, we propose a framework that decomposes GCL models into four modules: (1) a \textbf{sampler} to generate anchor, positive and negative data samples (nodes or graphs); (2) an \textbf{encoder} and a \textbf{readout} function to get sample embeddings; (3) a \textbf{discriminator} to score each sample pair (anchor-positive and anchor-negative); and (4) an \textbf{estimator} to define the loss function. Based on this framework, we conduct controlled experiments over a wide range of architectural designs and hyperparameter settings on node and graph classification tasks. Specifically, we manage to quantify the impact of a single module, investigate the interaction between modules, and compare the overall performance with current model architectures. Our key findings include a set of module-level guidelines for GCL, e.g., simple samplers from LINE and DeepWalk are strong and robust; an MLP encoder associated with Sum readout could achieve competitive performance on graph classification. Finally, we release our implementations and results as OpenGCL, a modularized toolkit that allows convenient reproduction, standard model and module evaluation, and easy extension. OpenGCL is available at \url{https://github.com/thunlp/OpenGCL}. △ Less

Submitted 2 June, 2022; v1 submitted 15 June, 2021; originally announced June 2021.

arXiv:2105.09670 [pdf, other]

Ensemble machine learning approach for screening of coronary heart disease based on echocardiography and risk factors

Authors: **gyi Zhang, Huolan Zhu, Yongkai Chen, Chenguang Yang, Huimin Cheng, Yi Li, Wenxuan Zhong, Fang Wang

Abstract: Background: Extensive clinical evidence suggests that a preventive screening of coronary heart disease (CHD) at an earlier stage can greatly reduce the mortality rate. We use 64 two-dimensional speckle tracking echocardiography (2D-STE) features and seven clinical features to predict whether one has CHD. Methods: We develop a machine learning approach that integrates a number of popular classifica… ▽ More Background: Extensive clinical evidence suggests that a preventive screening of coronary heart disease (CHD) at an earlier stage can greatly reduce the mortality rate. We use 64 two-dimensional speckle tracking echocardiography (2D-STE) features and seven clinical features to predict whether one has CHD. Methods: We develop a machine learning approach that integrates a number of popular classification methods together by model stacking, and generalize the traditional stacking method to a two-step stacking method to improve the diagnostic performance. Results: By borrowing strengths from multiple classification models through the proposed method, we improve the CHD classification accuracy from around 70% to 87.7% on the testing set. The sensitivity of the proposed method is 0.903 and the specificity is 0.843, with an AUC of 0.904, which is significantly higher than those of the individual classification models. Conclusions: Our work lays a foundation for the deployment of speckle tracking echocardiography-based screening tools for coronary heart disease. △ Less

Submitted 20 May, 2021; originally announced May 2021.

Comments: 30 pages, 5 figures, 5 tables

arXiv:2101.00323 [pdf, other]

TenIPS: Inverse Propensity Sampling for Tensor Completion

Authors: Chengrun Yang, Lijun Ding, Ziyang Wu, Madeleine Udell

Abstract: Tensors are widely used to represent multiway arrays of data. The recovery of missing entries in a tensor has been extensively studied, generally under the assumption that entries are missing completely at random (MCAR). However, in most practical settings, observations are missing not at random (MNAR): the probability that a given entry is observed (also called the propensity) may depend on other… ▽ More Tensors are widely used to represent multiway arrays of data. The recovery of missing entries in a tensor has been extensively studied, generally under the assumption that entries are missing completely at random (MCAR). However, in most practical settings, observations are missing not at random (MNAR): the probability that a given entry is observed (also called the propensity) may depend on other entries in the tensor or even on the value of the missing entry. In this paper, we study the problem of completing a partially observed tensor with MNAR observations, without prior information about the propensities. To complete the tensor, we assume that both the original tensor and the tensor of propensities have low multilinear rank. The algorithm first estimates the propensities using a convex relaxation and then predicts missing values using a higher-order SVD approach, reweighting the observed tensor by the inverse propensities. We provide finite-sample error bounds on the resulting complete tensor. Numerical experiments demonstrate the effectiveness of our approach. △ Less

Submitted 22 April, 2021; v1 submitted 1 January, 2021; originally announced January 2021.

Comments: AISTATS 2021

arXiv:2011.10195 [pdf, ps, other]

Detecting systematic anomalies affecting systems when inputs are stationary time series

Authors: Ning Sun, Chen Yang, Ričardas Zitikis

Abstract: We develop an anomaly-detection method when systematic anomalies, possibly statistically very similar to genuine inputs, are affecting control systems at the input and/or output stages. The method allows anomaly-free inputs (i.e., those before contamination) to originate from a wide class of random sequences, thus opening up possibilities for diverse applications. To illustrate how the method work… ▽ More We develop an anomaly-detection method when systematic anomalies, possibly statistically very similar to genuine inputs, are affecting control systems at the input and/or output stages. The method allows anomaly-free inputs (i.e., those before contamination) to originate from a wide class of random sequences, thus opening up possibilities for diverse applications. To illustrate how the method works on data, and how to interpret its results and make decisions, we analyze several actual time series, which are originally non-stationary but in the process of analysis are converted into stationary. As a further illustration, we provide a controlled experiment with anomaly-free inputs following an ARMA time series model under various contamination scenarios. △ Less

Submitted 31 January, 2022; v1 submitted 19 November, 2020; originally announced November 2020.

MSC Class: 94A12; 94A13; 62P30; 62P35

arXiv:2010.14589 [pdf, other]

Nested Grassmannians for Dimensionality Reduction with Applications

Authors: Chun-Hao Yang, Baba C. Vemuri

Abstract: In the recent past, nested structures in Riemannian manifolds has been studied in the context of dimensionality reduction as an alternative to the popular principal geodesic analysis (PGA) technique, for example, the principal nested spheres. In this paper, we propose a novel framework for constructing a nested sequence of homogeneous Riemannian manifolds. Common examples of homogeneous Riemannian… ▽ More In the recent past, nested structures in Riemannian manifolds has been studied in the context of dimensionality reduction as an alternative to the popular principal geodesic analysis (PGA) technique, for example, the principal nested spheres. In this paper, we propose a novel framework for constructing a nested sequence of homogeneous Riemannian manifolds. Common examples of homogeneous Riemannian manifolds include the $n$-sphere, the Stiefel manifold, the Grassmann manifold and many others. In particular, we focus on applying the proposed framework to the Grassmann manifold, giving rise to the nested Grassmannians (NG). An important application in which Grassmann manifolds are encountered is planar shape analysis. Specifically, each planar (2D) shape can be represented as a point in the complex projective space which is a complex Grass-mann manifold. Some salient features of our framework are: (i) it explicitly exploits the geometry of the homogeneous Riemannian manifolds and (ii) the nested lower-dimensional submanifolds need not be geodesic. With the proposed NG structure, we develop algorithms for the supervised and unsupervised dimensionality reduction problems respectively. The proposed algorithms are compared with PGA via simulation studies and real data experiments and are shown to achieve a higher ratio of expressed variance compared to PGA. △ Less

Submitted 1 March, 2022; v1 submitted 27 October, 2020; originally announced October 2020.

Comments: 21 pages, 9 figures. Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://www.melba-journal.org

arXiv:2010.12636 [pdf, ps, other]

Nonseparable Symplectic Neural Networks

Authors: Shiying Xiong, Yun** Tong, Xingzhe He, Shuqi Yang, Cheng Yang, Bo Zhu

Abstract: Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fl… ▽ More Predicting the behaviors of Hamiltonian systems has been drawing increasing attention in scientific machine learning. However, the vast majority of the literature was focused on predicting separable Hamiltonian systems with their kinematic and potential energy terms being explicitly decoupled while building data-driven paradigms to predict nonseparable Hamiltonian systems that are ubiquitous in fluid dynamics and quantum mechanics were rarely explored. The main computational challenge lies in the effective embedding of symplectic priors to describe the inherently coupled evolution of position and momentum, which typically exhibits intricate dynamics. To solve the problem, we propose a novel neural network architecture, Nonseparable Symplectic Neural Networks (NSSNNs), to uncover and embed the symplectic structure of a nonseparable Hamiltonian system from limited observation data. The enabling mechanics of our approach is an augmented symplectic time integrator to decouple the position and momentum energy terms and facilitate their evolution. We demonstrated the efficacy and versatility of our method by predicting a wide range of Hamiltonian systems, both separable and nonseparable, including chaotic vortical flows. We showed the unique computational merits of our approach to yield long-term, accurate, and robust predictions for large-scale Hamiltonian systems by rigorously enforcing symplectomorphism. △ Less

Submitted 19 February, 2022; v1 submitted 23 October, 2020; originally announced October 2020.

Comments: ICLR2021

arXiv:2009.05204 [pdf, other]

Transfer Learning of Graph Neural Networks with Ego-graph Information Maximization

Authors: Qi Zhu, Carl Yang, Yidan Xu, Haonan Wang, Chao Zhang, Jiawei Han

Abstract: Graph neural networks (GNNs) have achieved superior performance in various applications, but training dedicated GNNs can be costly for large-scale graphs. Some recent work started to study the pre-training of GNNs. However, none of them provide theoretical insights into the design of their frameworks, or clear requirements and guarantees towards their transferability. In this work, we establish a… ▽ More Graph neural networks (GNNs) have achieved superior performance in various applications, but training dedicated GNNs can be costly for large-scale graphs. Some recent work started to study the pre-training of GNNs. However, none of them provide theoretical insights into the design of their frameworks, or clear requirements and guarantees towards their transferability. In this work, we establish a theoretically grounded and practically useful framework for the transfer learning of GNNs. Firstly, we propose a novel view towards the essential graph information and advocate the capturing of it as the goal of transferable GNN training, which motivates the design of EGI (Ego-Graph Information maximization) to analytically achieve this goal. Secondly, when node features are structure-relevant, we conduct an analysis of EGI transferability regarding the difference between the local graph Laplacians of the source and target graphs. We conduct controlled synthetic experiments to directly justify our theoretical conclusions. Comprehensive experiments on two real-world network datasets show consistent results in the analyzed setting of direct-transfering, while those on large-scale knowledge graphs show promising results in the more practical setting of transfering with fine-tuning. △ Less

Submitted 26 October, 2021; v1 submitted 10 September, 2020; originally announced September 2020.

Comments: NeurIPS 2021

arXiv:2007.01594 [pdf, other]

doi 10.1145/3394486.3403140

Adaptive Graph Encoder for Attributed Graph Embedding

Authors: Ganqu Cui, Jie Zhou, Cheng Yang, Zhiyuan Liu

Abstract: Attributed graph embedding, which learns vector representations from graph topology and node features, is a challenging task for graph analysis. Recently, methods based on graph convolutional networks (GCNs) have made great progress on this task. However,existing GCN-based methods have three major drawbacks. Firstly,our experiments indicate that the entanglement of graph convolutional filters and… ▽ More Attributed graph embedding, which learns vector representations from graph topology and node features, is a challenging task for graph analysis. Recently, methods based on graph convolutional networks (GCNs) have made great progress on this task. However,existing GCN-based methods have three major drawbacks. Firstly,our experiments indicate that the entanglement of graph convolutional filters and weight matrices will harm both the performance and robustness. Secondly, we show that graph convolutional filters in these methods reveal to be special cases of generalized Laplacian smoothing filters, but they do not preserve optimal low-pass characteristics. Finally, the training objectives of existing algorithms are usually recovering the adjacency matrix or feature matrix, which are not always consistent with real-world applications. To address these issues, we propose Adaptive Graph Encoder (AGE), a novel attributed graph embedding framework. AGE consists of two modules: (1) To better alleviate the high-frequency noises in the node features, AGE first applies a carefully-designed Laplacian smoothing filter. (2) AGE employs an adaptive encoder that iteratively strengthens the filtered features for better node embeddings. We conduct experiments using four public benchmark datasets to validate AGE on node clustering and link prediction tasks. Experimental results show that AGE consistently outperforms state-of-the-art graph embedding methods considerably on these tasks. △ Less

Submitted 3 July, 2020; originally announced July 2020.

Comments: To appear in KDD 2020

arXiv:2006.08816 [pdf, other]

Signed Graph Metric Learning via Gershgorin Disc Perfect Alignment

Authors: Cheng Yang, Gene Cheung, Wei Hu

Abstract: Given a convex and differentiable objective $Q(\M)$ for a real symmetric matrix $\M$ in the positive definite (PD) cone -- used to compute Mahalanobis distances -- we propose a fast general metric learning framework that is entirely projection-free. We first assume that $\M$ resides in a space $\cS$ of generalized graph Laplacian matrices corresponding to balanced signed graphs. $\M \in \cS$ that… ▽ More Given a convex and differentiable objective $Q(\M)$ for a real symmetric matrix $\M$ in the positive definite (PD) cone -- used to compute Mahalanobis distances -- we propose a fast general metric learning framework that is entirely projection-free. We first assume that $\M$ resides in a space $\cS$ of generalized graph Laplacian matrices corresponding to balanced signed graphs. $\M \in \cS$ that is also PD is called a graph metric matrix. Unlike low-rank metric matrices common in the literature, $\cS$ includes the important diagonal-only matrices as a special case. The key theorem to circumvent full eigen-decomposition and enable fast metric matrix optimization is Gershgorin disc perfect alignment (GDPA): given $\M \in \cS$ and diagonal matrix $§$, where $S_{ii} = 1/v_i$ and $\v$ is $\M$'s first eigenvector, we prove that Gershgorin disc left-ends of similarity transform $\B = §\M §^{-1}$ are perfectly aligned at the smallest eigenvalue $λ_{\min}$. Using this theorem, we replace the PD cone constraint in the metric learning problem with tightest possible linear constraints per iteration, so that the alternating optimization of the diagonal / off-diagonal terms in $\M$ can be solved efficiently as linear programs via the Frank-Wolfe method. We update $\v$ using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) with warm start as entries in $\M$ are optimized successively. Experiments show that our graph metric optimization is significantly faster than cone-projection schemes, and produces competitive binary classification performance. △ Less

Submitted 10 June, 2021; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: code available: https://github.com/bobchengyang/SGML

arXiv:2006.06983 [pdf, other]

Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone Data

Authors: Chengxu Yang, Qipeng Wang, Mengwei Xu, Zhenpeng Chen, Kaigui Bian, Yunxin Liu, Xuanzhe Liu

Abstract: Federated learning (FL) is an emerging, privacy-preserving machine learning paradigm, drawing tremendous attention in both academia and industry. A unique characteristic of FL is heterogeneity, which resides in the various hardware specifications and dynamic states across the participating devices. Theoretically, heterogeneity can exert a huge influence on the FL training process, e.g., causing a… ▽ More Federated learning (FL) is an emerging, privacy-preserving machine learning paradigm, drawing tremendous attention in both academia and industry. A unique characteristic of FL is heterogeneity, which resides in the various hardware specifications and dynamic states across the participating devices. Theoretically, heterogeneity can exert a huge influence on the FL training process, e.g., causing a device unavailable for training or unable to upload its model updates. Unfortunately, these impacts have never been systematically studied and quantified in existing FL literature. In this paper, we carry out the first empirical study to characterize the impacts of heterogeneity in FL. We collect large-scale data from 136k smartphones that can faithfully reflect heterogeneity in real-world settings. We also build a heterogeneity-aware FL platform that complies with the standard FL protocol but with heterogeneity in consideration. Based on the data and the platform, we conduct extensive experiments to compare the performance of state-of-the-art FL algorithms under heterogeneity-aware and heterogeneity-unaware settings. Results show that heterogeneity causes non-trivial performance degradation in FL, including up to 9.2% accuracy drop, 2.32x lengthened training time, and undermined fairness. Furthermore, we analyze potential impact factors and find that device failure and participant bias are two potential factors for performance degradation. Our study provides insightful implications for FL practitioners. On the one hand, our findings suggest that FL algorithm designers consider necessary heterogeneity during the evaluation. On the other hand, our findings urge system providers to design specific mechanisms to mitigate the impacts of heterogeneity. △ Less

Submitted 12 March, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

arXiv:2006.04239 [pdf, other]

doi 10.1145/3394486.3403196

Unsupervised Differentiable Multi-aspect Network Embedding

Authors: Chanyoung Park, Carl Yang, Qi Zhu, Donghyun Kim, Hwanjo Yu, Jiawei Han

Abstract: Network embedding is an influential graph mining technique for representing nodes in a graph as distributed vectors. However, the majority of network embedding methods focus on learning a single vector representation for each node, which has been recently criticized for not being capable of modeling multiple aspects of a node. To capture the multiple aspects of each node, existing studies mainly r… ▽ More Network embedding is an influential graph mining technique for representing nodes in a graph as distributed vectors. However, the majority of network embedding methods focus on learning a single vector representation for each node, which has been recently criticized for not being capable of modeling multiple aspects of a node. To capture the multiple aspects of each node, existing studies mainly rely on offline graph clustering performed prior to the actual embedding, which results in the cluster membership of each node (i.e., node aspect distribution) fixed throughout training of the embedding model. We argue that this not only makes each node always have the same aspect distribution regardless of its dynamic context, but also hinders the end-to-end training of the model that eventually leads to the final embedding quality largely dependent on the clustering. In this paper, we propose a novel end-to-end framework for multi-aspect network embedding, called asp2vec, in which the aspects of each node are dynamically assigned based on its local context. More precisely, among multiple aspects, we dynamically assign a single aspect to each node based on its current context, and our aspect selection module is end-to-end differentiable via the Gumbel-Softmax trick. We also introduce the aspect regularization framework to capture the interactions among the multiple aspects in terms of relatedness and diversity. We further demonstrate that our proposed framework can be readily extended to heterogeneous networks. Extensive experiments towards various downstream tasks on various types of homogeneous networks and a heterogeneous network demonstrate the superiority of asp2vec. △ Less

Submitted 7 July, 2020; v1 submitted 7 June, 2020; originally announced June 2020.

Comments: KDD 2020 (Research Track). 9 Pages + Appendix (2 Pages). Source code can be found https://github.com/pcy1302/asp2vec. Typo fixed in Fig.2

arXiv:2006.04216 [pdf, other]

Efficient AutoML Pipeline Search with Matrix and Tensor Factorization

Authors: Chengrun Yang, Jicong Fan, Ziyang Wu, Madeleine Udell

Abstract: Data scientists seeking a good supervised learning model on a new dataset have many choices to make: they must preprocess the data, select features, possibly reduce the dimension, select an estimation algorithm, and choose hyperparameters for each of these pipeline components. With new pipeline components comes a combinatorial explosion in the number of choices! In this work, we design a new AutoM… ▽ More Data scientists seeking a good supervised learning model on a new dataset have many choices to make: they must preprocess the data, select features, possibly reduce the dimension, select an estimation algorithm, and choose hyperparameters for each of these pipeline components. With new pipeline components comes a combinatorial explosion in the number of choices! In this work, we design a new AutoML system to address this challenge: an automated system to design a supervised learning pipeline. Our system uses matrix and tensor factorization as surrogate models to model the combinatorial pipeline search space. Under these models, we develop greedy experiment design protocols to efficiently gather information about a new dataset. Experiments on large corpora of real-world classification problems demonstrate the effectiveness of our approach. △ Less

Submitted 7 June, 2020; originally announced June 2020.

Comments: This is an extended version of AutoML Pipeline Selection: Efficiently Navigating the Combinatorial Space (DOI: 10.1145/3394486.3403197) at KDD 2020

arXiv:2005.13974 [pdf]

On the Bound of Cumulative Return in Trading Series and the Verification Using Technical Trading Rules

Authors: Can Yang, Junjie Zhai, Helong Li

Abstract: Although there is a wide use of technical trading rules in stock markets, the profitability of them still remains controversial. This paper first presents and proves the upper bound of cumulative return, and then introduces many of conventional technical trading rules. Furthermore, with the help of bootstrap methodology, we investigate the profitability of technical trading rules on different inte… ▽ More Although there is a wide use of technical trading rules in stock markets, the profitability of them still remains controversial. This paper first presents and proves the upper bound of cumulative return, and then introduces many of conventional technical trading rules. Furthermore, with the help of bootstrap methodology, we investigate the profitability of technical trading rules on different international stock markets, including developed markets and emerging markets. At last, the results show that the technical trading rules are hard to beat the market, and even less profitable than the random trading strategy. △ Less

Submitted 19 May, 2020; originally announced May 2020.

arXiv:2005.04843 [pdf, other]

Semi-supervised Hypergraph Node Classification on Hypergraph Line Expansion

Authors: Chaoqi Yang, Ruijie Wang, Shuochao Yao, Tarek Abdelzaher

Abstract: Previous hypergraph expansions are solely carried out on either vertex level or hyperedge level, thereby missing the symmetric nature of data co-occurrence, and resulting in information loss. To address the problem, this paper treats vertices and hyperedges equally and proposes a new hypergraph formulation named the \emph{line expansion (LE)} for hypergraphs learning. The new expansion bijectively… ▽ More Previous hypergraph expansions are solely carried out on either vertex level or hyperedge level, thereby missing the symmetric nature of data co-occurrence, and resulting in information loss. To address the problem, this paper treats vertices and hyperedges equally and proposes a new hypergraph formulation named the \emph{line expansion (LE)} for hypergraphs learning. The new expansion bijectively induces a homogeneous structure from the hypergraph by treating vertex-hyperedge pairs as "line nodes". By reducing the hypergraph to a simple graph, the proposed \emph{line expansion} makes existing graph learning algorithms compatible with the higher-order structure and has been proven as a unifying framework for various hypergraph expansions. We evaluate the proposed line expansion on five hypergraph datasets, the results show that our method beats SOTA baselines by a significant margin. △ Less

Submitted 13 April, 2023; v1 submitted 10 May, 2020; originally announced May 2020.

Comments: CIKM 2022, GitHub: https://github.com/ycq091044/LEGCN

arXiv:2005.01317 [pdf, ps, other]

doi 10.1109/TSP.2021.3062988

Robust Non-Linear Matrix Factorization for Dictionary Learning, Denoising, and Clustering

Authors: Jicong Fan, Chengrun Yang, Madeleine Udell

Abstract: Low dimensional nonlinear structure abounds in datasets across computer vision and machine learning. Kernelized matrix factorization techniques have recently been proposed to learn these nonlinear structures for denoising, classification, dictionary learning, and missing data imputation, by observing that the image of the matrix in a sufficiently large feature space is low-rank. However, these non… ▽ More Low dimensional nonlinear structure abounds in datasets across computer vision and machine learning. Kernelized matrix factorization techniques have recently been proposed to learn these nonlinear structures for denoising, classification, dictionary learning, and missing data imputation, by observing that the image of the matrix in a sufficiently large feature space is low-rank. However, these nonlinear methods fail in the presence of sparse noise or outliers. In this work, we propose a new robust nonlinear factorization method called Robust Non-Linear Matrix Factorization (RNLMF). RNLMF constructs a dictionary for the data space by factoring a kernelized feature space; a noisy matrix can then be decomposed as the sum of a sparse noise matrix and a clean data matrix that lies in a low dimensional nonlinear manifold. RNLMF is robust to sparse noise and outliers and scales to matrices with thousands of rows and columns. Empirically, RNLMF achieves noticeable improvements over baseline methods in denoising and clustering. △ Less

Submitted 2 December, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

Journal ref: IEEE Transactions on Signal Processing 69, 1755-1770 (2021)

arXiv:2005.00072 [pdf, other]

Two Burning Questions on COVID-19: Did shutting down the economy help? Can we (partially) reopen the economy without risking the second wave?

Authors: Anish Agarwal, Abdullah Alomar, Arnab Sarker, Devavrat Shah, Dennis Shen, Cindy Yang

Abstract: As we reach the apex of the COVID-19 pandemic, the most pressing question facing us is: can we even partially reopen the economy without risking a second wave? We first need to understand if shutting down the economy helped. And if it did, is it possible to achieve similar gains in the war against the pandemic while partially opening up the economy? To do so, it is critical to understand the effec… ▽ More As we reach the apex of the COVID-19 pandemic, the most pressing question facing us is: can we even partially reopen the economy without risking a second wave? We first need to understand if shutting down the economy helped. And if it did, is it possible to achieve similar gains in the war against the pandemic while partially opening up the economy? To do so, it is critical to understand the effects of the various interventions that can be put into place and their corresponding health and economic implications. Since many interventions exist, the key challenge facing policy makers is understanding the potential trade-offs between them, and choosing the particular set of interventions that works best for their circumstance. In this memo, we provide an overview of Synthetic Interventions (a natural generalization of Synthetic Control), a data-driven and statistically principled method to perform what-if scenario planning, i.e., for policy makers to understand the trade-offs between different interventions before having to actually enact them. In essence, the method leverages information from different interventions that have already been enacted across the world and fits it to a policy maker's setting of interest, e.g., to estimate the effect of mobility-restricting interventions on the U.S., we use daily death data from countries that enforced severe mobility restrictions to create a "synthetic low mobility U.S." and predict the counterfactual trajectory of the U.S. if it had indeed applied a similar intervention. Using Synthetic Interventions, we find that lifting severe mobility restrictions and only retaining moderate mobility restrictions (at retail and transit locations), seems to effectively flatten the curve. We hope this provides guidance on weighing the trade-offs between the safety of the population, strain on the healthcare system, and impact on the economy. △ Less

Submitted 10 May, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

Showing 1–50 of 118 results for author: Yang, C