
Optimal subsampling algorithm for the marginal model with large longitudinal data

Abstract: Big data is ubiquitous in practice, and it has also led to a heavy computational burden. To reduce the calculation cost and ensure the effectiveness of parameter estimators, an optimal subset sampling method is proposed to estimate the parameters in marginal models with massive longitudinal data. The optimal subsampling probabilities are derived, and the corresponding asymptotic properties are established to ensure the consistency and asymptotic normality of the estimator. Extensive simulation studies are carried out to evaluate the performance of the proposed method for continuous, binary and count data and with four different working correlation matrices. A depression dataset is used to illustrate the proposed method.

Submitted 15 November, 2023; originally announced November 2023.

arXiv:2306.08979 [pdf, other]

Ranking and Selection in Large-Scale Inference of Heteroscedastic Units

Authors: Bowen Gang, Luella Fu, Gareth James, Wenguang Sun

Abstract: The allocation of limited resources to a large number of potential candidates presents a pervasive challenge. In the context of ranking and selecting top candidates from heteroscedastic units, conventional methods often result in over-representations of subpopulations, and this issue is further exacerbated in large-scale settings where thousands of candidates are considered simultaneously. To address this challenge, we propose a new multiple comparison framework that incorporates a modified power notion to prioritize the selection of important effects and employs a novel ranking metric to assess the relative importance of units. We develop both oracle and data-driven algorithms, and demonstrate their effectiveness in controlling the error rates and achieving optimality. We evaluate the numerical performance of our proposed method using simulated and real data. The results show that our framework enables a more balanced selection of effects that are both statistically significant and practically important, and results in an objective and relevant ranking scheme that is well-suited to practical scenarios.

Submitted 15 June, 2023; originally announced June 2023.

Comments: 54 pages, 11 figures

arXiv:2111.03885 [pdf, other]

An Empirical Bayes Approach to Controlling the False Discovery Exceedance

Authors: Pallavi Basu, Luella Fu, Alessio Saretto, Wenguang Sun

Abstract: In large-scale multiple hypothesis testing problems, the false discovery exceedance (FDX) provides a desirable alternative to the widely used false discovery rate (FDR) when the false discovery proportion (FDP) is highly variable. We develop an empirical Bayes approach to control the FDX. We show that, for independent hypotheses from a two-group model and dependent hypotheses from a Gaussian model fulfilling the exchangeability condition, an oracle decision rule based on ranking and thresholding the local false discovery rate (lfdr) is optimal in the sense that the power is maximized subject to the FDX constraint. We propose a data-driven FDX procedure that uses carefully designed computational shortcuts to emulate the oracle rule. We investigate the empirical performance of the proposed method using both simulated and real data and study the merits of FDX control through an application for identifying abnormal stock trading strategies.

Submitted 20 April, 2023; v1 submitted 6 November, 2021; originally announced November 2021.

Comments: Updated

arXiv:2105.13600 [pdf, ps, other]

Placement Optimization and Power Control in Intelligent Reflecting Surface Aided Multiuser System

Authors: Bifeng Ling, Jiangbin Lyu, Liqun Fu

Abstract: Intelligent reflecting surface (IRS) is a new and revolutionary technology capable of reconfiguring the wireless propagation environment by controlling its massive low-cost passive reflecting elements. Different from prior works that focus on optimizing IRS reflection coefficients or single-IRS placement, we aim to maximize the minimum throughput of a single-cell multiuser system aided by multiple IRSs, by joint multi-IRS placement and power control at the access point (AP), which is a mixed-integer non-convex problem whose complexity increases drastically with the number of IRSs/users. To tackle this challenge, a ring-based IRS placement scheme is proposed along with a power control policy that equalizes the users' non-outage probability. An efficient searching algorithm is further proposed to obtain a close-to-optimal solution for an arbitrary number of IRSs/rings. Numerical results validate our analysis and show that our proposed scheme significantly outperforms the benchmark schemes without IRS and/or with other power control policies. Moreover, it is shown that the IRSs are preferably deployed near the AP for coverage range extension, while with more IRSs, they tend to spread out over the cell to cover more area and get closer to target users.

Submitted 4 November, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: To appear in GLOBECOM 2021. This paper focuses on the multi-IRS placement optimization and downlink AP power control for achieving max-min throughput in a single-cell multi-user system. A ring-based IRS placement scheme is proposed which utilizes the near-AP/near-user deployment modes. Closed-form power control policy is devised to equalize the users' non-outage probability

arXiv:2103.10613 [pdf, ps, other]

Robust penalized empirical likelihood in high dimensional longitudinal data analysis

Authors: Jiaqi Li, Liya Fu

Abstract: As an effective nonparametric method, empirical likelihood (EL) is appealing in combining estimating equations flexibly and adaptively for incorporating data information. To select important variables and estimating equations in the sparse high-dimensional model, we consider a penalized EL method based on robust estimating functions by applying two penalty functions for regularizing the regression parameters and the associated Lagrange multipliers simultaneously, which allows the dimensionalities of both regression parameters and estimating equations to grow exponentially with the sample size. A first inspection of how the robustness of estimating equations contributes to estimating equation selection and variable selection is discussed in this paper from both a theoretical perspective and intuitive simulation results. The proposed method can improve the robustness and effectiveness when the data have underlying outliers or heavy tails in the response variables and/or covariates. The robustness of the estimator is measured via the bounded influence function, and the oracle properties are also established under some regularity conditions. Extensive simulation studies and a yeast cell dataset are used to evaluate the performance of the proposed method. The numerical results reveal that the robustness of sparse estimating equations selection fundamentally enhances variable selection accuracy when the data have heavy tails and/or include underlying outliers.

Submitted 30 June, 2021; v1 submitted 18 March, 2021; originally announced March 2021.

Comments: 25 pages, 4 Tables

arXiv:2011.06241 [pdf, other]

doi 10.1002/sim.9213

Robust approach for variable selection with high dimensional Logitudinal data analysis

Authors: Liya Fu, Jiaqi Li, You-Gan Wang

Abstract: This paper proposes a new robust smooth-threshold estimating equation to select important variables and automatically estimate parameters for high dimensional longitudinal data. A novel working correlation matrix is proposed to capture correlations within the same subject. The proposed procedure works well when the number of covariates p increases as the number of subjects n increases. The proposed estimates are competitive with the estimates obtained with the true correlation structure, especially when the data are contaminated. Moreover, the proposed method is robust against outliers in the response variables and/or covariates. Furthermore, the oracle properties for robust smooth-threshold estimating equations under "large n, diverging p" are established under some regularity conditions. Extensive simulation studies and a yeast cell cycle dataset are used to evaluate the performance of the proposed method, and results show that our proposed method is competitive with existing robust variable selection procedures.

Submitted 18 May, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: 32 pages, 7 tables, 5 figures

Journal ref: Statistics in Medicine (2021) 1-20

arXiv:2009.00387 [pdf, other]

doi 10.1145/3442442.3452323

Boosting Share Routing for Multi-task Learning

Authors: Xiaokai Chen, Xiaoguang Gu, Libo Fu

Abstract: Multi-task learning (MTL) aims to make full use of the knowledge contained in multi-task supervision signals to improve the overall performance. How to make the knowledge of multiple tasks shared appropriately is an open problem for MTL. Most existing deep MTL models are based on parameter sharing. However, a suitable sharing mechanism is hard to design as the relationship among tasks is complicated. In this paper, we propose a general framework called Multi-Task Neural Architecture Search (MTNAS) to efficiently find a suitable sharing route for a given MTL problem. MTNAS modularizes the sharing part into multiple layers of sub-networks. It allows sparse connection among these sub-networks, and soft sharing based on gating is enabled for a certain route. Benefiting from such a setting, each candidate architecture in our search space defines a dynamic sparse sharing route which is more flexible compared with full-sharing in previous approaches. We show that existing typical sharing approaches are sub-graphs in our search space. Extensive experiments on three real-world recommendation datasets demonstrate that MTNAS achieves consistent improvement compared with single-task models and typical multi-task methods while maintaining high computation efficiency. Furthermore, in-depth experiments demonstrate that MTNAS can learn a suitable sparse route to mitigate negative transfer.

Submitted 1 March, 2021; v1 submitted 1 September, 2020; originally announced September 2020.

arXiv:2008.07438 [pdf, ps, other]

Analysis and Optimization for Large-Scale LoRa Networks: Throughput Fairness and Scalability

Authors: Jiangbin Lyu, Dan Yu, Liqun Fu

Abstract: LoRa networks are pivotally enabling Long Range connectivity to low-cost and power-constrained user equipments (UEs) in a wide area, whereas a critical issue is to effectively allocate wireless resources to support potentially massive UEs while resolving the prominent near-far fairness issue, which is challenging due to the lack of tractable analytical model and the practical requirement for low-complexity and low-overhead design. Leveraging on stochastic geometry, especially the Poisson rain model, we derive (semi-) closed-form formulas for the aggregate interference distribution, packet success probability and hence system throughput in both single-cell and multi-cell setups with frequency reuse, by accounting for channel fading, random UE distribution, partial packet overlapping, and/or multi-gateway packet reception. The analytical formulas require only average channel statistics and spatial UE distribution, which enable tractable network performance evaluation and incubate our proposed Iterative Balancing (IB) method that quickly yields high-level policies of joint spreading factor (SF) allocation, power control, and duty cycle adjustment for gauging the average max-min UE throughput or supported UE density with rate requirements. Numerical results validate the analytical formulas and the effectiveness of our proposed optimization scheme, which greatly alleviates the near-far fairness issue and reduces the spatial power consumption, while significantly improving the cell-edge throughput as well as the spatial (sum) throughput for the majority of UEs, by adapting to the UE/gateway densities.

Submitted 5 November, 2021; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: To appear in IEEE IOT Journal. Stochastic geometry-based framework to model/analyze large-scale LoRa networks with channel fading/aggregate interference/packet overlapping/multi-GW reception. Jointly optimize SF/Tx-power/duty-cycle based on channel statistics and UE distribution. Achieve both fairness/power savings and improve cell-edge throughput and spatial (sum) throughput for majority of UEs. arXiv admin note: text overlap with arXiv:1904.12300

arXiv:2005.04288 [pdf, other]

Incremental Learning for End-to-End Automatic Speech Recognition

Authors: Li Fu, Xiaoxiao Li, Libo Zi, Zhengchen Zhang, Youzheng Wu, Xiaodong He, Bowen Zhou

Abstract: In this paper, we propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR) which enables an ASR system to perform well on new tasks while maintaining the performance on its originally learned ones. To mitigate catastrophic forgetting during incremental learning, we design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions. Our method works without access to the training data of original tasks, which addresses the cases where the previous data is no longer available or joint training is costly. Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting. Furthermore, in two practical scenarios, compared to the target-reference joint training method, the performance drop of our method is 0.02% Character Error Rate (CER), which is 97% smaller than the drops of the baseline methods.

Submitted 15 September, 2021; v1 submitted 11 May, 2020; originally announced May 2020.

Comments: ASRU 2021

arXiv:2003.03948 [pdf, ps, other]

An efficient Gehan-type estimation for the accelerated failure time model with clustered and censored data

Authors: Liya Fu, Zhuoran Yang, Yan Zhou, You-Gan Wang

Abstract: In medical studies, the collected covariates usually contain underlying outliers. For clustered/longitudinal data with censored observations, the traditional Gehan-type estimator is robust to outliers existing in response but sensitive to outliers in the covariate domain, and it also ignores the within-cluster correlations. To take account of within-cluster correlations, varying cluster sizes, and outliers in covariates, we propose weighted Gehan-type estimating functions for parameter estimation in the accelerated failure time model for clustered data. We provide the asymptotic properties of the resulting estimators and carry out simulation studies to evaluate the performance of the proposed method under a variety of realistic settings. The simulation results demonstrate that the proposed method is robust to the outliers existing in the covariate domain and leads to much more efficient estimators when a strong within-cluster correlation exists. Finally, the proposed method is applied to a medical dataset and more reliable and convincing results are hence obtained.

Submitted 9 March, 2020; originally announced March 2020.

Comments: ready for submission

MSC Class: 62F35 ACM Class: G.3

arXiv:2002.12586 [pdf, other]

Nonparametric Empirical Bayes Estimation on Heterogeneous Data

Authors: Trambak Banerjee, Luella J. Fu, Gareth M. James, Gourab Mukherjee, Wenguang Sun

Abstract: The simultaneous estimation of many parameters based on data collected from corresponding studies is a key research problem that has received renewed attention in the high-dimensional setting. Many practical situations involve heterogeneous data where heterogeneity is captured by a nuisance parameter. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale estimation problems. We address this issue by introducing the "Nonparametric Empirical Bayes Structural Tweedie" (NEST) estimator, which efficiently estimates the unknown effect sizes and properly adjusts for heterogeneity via a generalized version of Tweedie's formula. For the normal means problem, NEST simultaneously handles the two main selection biases introduced by heterogeneity: one, the selection bias in the mean, which cannot be effectively corrected without also correcting for, two, selection bias in the variance. We develop theory to show that NEST is asymptotically as good as the optimal Bayes rule that uniquely minimizes a weighted squared error loss. In our simulation studies NEST outperforms competing methods, with large efficiency gains in many settings. The proposed method is demonstrated on estimating the batting averages of baseball players and Sharpe ratios of mutual fund returns. Extensions to other members of the two-parameter exponential family are discussed.

Submitted 14 August, 2023; v1 submitted 28 February, 2020; originally announced February 2020.

Comments: Citations corrected and a new author added. No change in content!

MSC Class: 62G08; 62G05; 62G20 ACM Class: G.3

arXiv:1911.08784 [pdf]

Deep-seismic-prior-based reconstruction of seismic data using convolutional neural networks

Authors: Qun Liu, Lihua Fu, Meng Zhang

Abstract: Reconstruction of seismic data with missing traces is a long-standing issue in seismic data processing. In recent years, rank reduction operations have been commonly used to overcome this problem, but they require the rank of the seismic data to be known a priori. However, the rank of field data is unknown; usually it requires much time to manually adjust the rank and just obtain an approximated rank. Methods based on deep learning require very large datasets for training; however, acquiring large datasets is difficult owing to physical or financial constraints in practice. Therefore, in this work, we developed a novel method based on unsupervised learning using the intrinsic properties of a convolutional neural network known as U-net, without training datasets. Only one undersampled seismic dataset was needed, and the deep seismic prior of input data could be exploited by the network itself, thus making the reconstruction convenient. Furthermore, this method can handle both irregular and regular seismic data. Synthetic and field data were tested to assess the performance of the proposed algorithm (DSPRecon algorithm); the advantages of using our method were evaluated by comparing it with the singular spectrum analysis (SSA) method for irregular data reconstruction and de-aliased Cadzow method for regular data reconstruction. Experimental results showed that our method provided better reconstruction performance than the SSA or Cadzow methods. The recovered signal-to-noise ratios (SNRs) were 32.68 dB and 19.11 dB for the DSPRecon and SSA algorithms, respectively. Those for the DSPRecon and Cadzow methods were 35.91 dB and 15.32 dB, respectively.

Submitted 20 November, 2019; originally announced November 2019.

Comments: 5 pages, 12 figures

arXiv:1910.08107 [pdf, other]

Heterocedasticity-Adjusted Ranking and Thresholding for Large-Scale Multiple Testing

Authors: Luella Fu, Bowen Gang, Gareth M. James, Wenguang Sun

Abstract: Standardization has been a widely adopted practice in multiple testing, for it takes into account the variability in sampling and makes the test statistics comparable across different study units. However, despite conventional wisdom to the contrary, we show that there can be a significant loss in information from basing hypothesis tests on standardized statistics rather than the full data. We develop a new class of heteroscedasticity-adjusted ranking and thresholding (HART) rules that aim to improve existing methods by simultaneously exploiting commonalities and adjusting heterogeneities among the study units. The main idea of HART is to bypass standardization by directly incorporating both the summary statistic and its variance into the testing procedure. A key message is that the variance structure of the alternative distribution, which is subsumed under standardized statistics, is highly informative and can be exploited to achieve higher power. The proposed HART procedure is shown to be asymptotically valid and optimal for false discovery rate (FDR) control. Our simulation results demonstrate that HART achieves substantial power gain over existing methods at the same FDR level. We illustrate the implementation through a microarray analysis of myeloma.

Submitted 5 March, 2020; v1 submitted 17 October, 2019; originally announced October 2019.

Comments: 55 pages, 13 figures

arXiv:1812.07410 [pdf]

An Improved Deep Belief Network Model for Road Safety Analyses

Authors: Guangyuan Pan, Liping Fu, Lalita Thakali, Matthew Muresan, Ming Yu

Abstract: Crash prediction is a critical component of road safety analyses. A widely adopted approach to crash prediction is the application of regression-based techniques. The underlying calibration process is often time-consuming, requires significant domain knowledge and expertise, and cannot be easily automated. This paper introduces a new machine learning (ML) based approach as an alternative to the traditional techniques. The proposed ML model is called regularized deep belief network, which is a deep neural network with two training steps: it is first trained using an unsupervised learning algorithm and then fine-tuned by initializing a Bayesian neural network with the trained weights from the first step. The resulting model is expected to have improved prediction power and reduced need for the time-consuming human intervention. In this paper, we attempt to demonstrate the potential of this new model for crash prediction through two case studies including a collision data set from an 800 km stretch of Highway 401 and other highways in Ontario, Canada. Our intention is to show the performance of this ML approach in comparison to various traditional models including negative binomial (NB) model, kernel regression (KR), and Bayesian neural network (Bayesian NN). We also attempt to address other related issues such as the effect of training data size and training parameters.

Submitted 17 December, 2018; originally announced December 2018.

Journal ref: Transportation Research Board 97th Annual Meeting, 2018

Showing 1–14 of 14 results for author: Fu, L