Search | arXiv e-print repository

Fast Inference Using Automatic Differentiation and Neural Transport in Astroparticle Physics

Authors: Dorian W. P. Amaral, Shixiao Liang, Juehang Qin, Christopher Tunnell

Abstract: Multi-dimensional parameter spaces are commonly encountered in astroparticle physics theories that attempt to capture novel phenomena. However, they often possess complicated posterior geometries that are expensive to traverse using techniques traditional to this community. Effectively sampling these spaces is crucial to bridge the gap between experiment and theory. Several recent innovations, which are only beginning to make their way into this field, have made navigating such complex posteriors possible. These include GPU acceleration, automatic differentiation, and neural-network-guided reparameterization. We apply these advancements to astroparticle physics experimental results in the context of novel neutrino physics and benchmark their performances against traditional nested sampling techniques. Compared to nested sampling alone, we find that these techniques increase performance for both nested sampling and Hamiltonian Monte Carlo, accelerating inference by factors of $\sim 100$ and $\sim 60$, respectively. As nested sampling also evaluates the Bayesian evidence, these advancements can be exploited to improve model comparison performance while retaining compatibility with existing implementations that are widely used in the natural sciences.

Submitted 23 May, 2024; originally announced May 2024.

Comments: 20 pages, 7 figures, 4 tables, 6 appendices

arXiv:2401.07206 [pdf, other]

Probabilistic Reduced-Dimensional Vector Autoregressive Modeling with Oblique Projections

Authors: Yanfang Mo, S. Joe Qin

Abstract: In this paper, we propose a probabilistic reduced-dimensional vector autoregressive (PredVAR) model to extract low-dimensional dynamics from high-dimensional noisy data. The model utilizes an oblique projection to partition the measurement space into a subspace that accommodates the reduced-dimensional dynamics and a complementary static subspace. An optimal oblique decomposition is derived for the best predictability regarding prediction error covariance. Building on this, we develop an iterative PredVAR algorithm using maximum likelihood and the expectation-maximization (EM) framework. This algorithm alternately updates the estimates of the latent dynamics and optimal oblique projection, yielding dynamic latent variables with rank-ordered predictability and an explicit latent VAR model that is consistent with the outer projection model. The superior performance and efficiency of the proposed approach are demonstrated using data sets from a synthesized Lorenz system and an industrial process from Eastman Chemical.

Submitted 14 January, 2024; originally announced January 2024.

Comments: 16 pages, 5 figures

arXiv:2311.06517 [pdf, other]

BClean: A Bayesian Data Cleaning System

Authors: Jianbin Qin, Sifan Huang, Yaoshu Wang, Jing Zhu, Yifan Zhang, Yukai Miao, Rui Mao, Makoto Onizuka, Chuan Xiao

Abstract: There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of the prevalent approaches is probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., Gaussian distribution), which is frequently underfitted in practice, or they necessitate experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as a Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference.
To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.

Submitted 11 November, 2023; originally announced November 2023.

Comments: Our source code is available at https://github.com/yyssl88/BClean

arXiv:2309.01161 [pdf, other]

Probabilistic Reduced-Dimensional Vector Autoregressive Modeling for Dynamics Prediction and Reconstruction with Oblique Projections

Authors: Yanfang Mo, Jiaxin Yu, S. Joe Qin

Abstract: In this paper, we propose a probabilistic reduced-dimensional vector autoregressive (PredVAR) model with oblique projections. This model partitions the measurement space into a dynamic subspace and a static subspace that do not need to be orthogonal. The partition allows us to apply an oblique projection to extract dynamic latent variables (DLVs) from high-dimensional data with maximized predictability. We develop an alternating iterative PredVAR algorithm that exploits the interaction between updating the latent VAR dynamics and estimating the oblique projection, using expectation maximization (EM) and a statistical constraint. In addition, the noise covariance matrices are estimated as a natural outcome of the EM method. A simulation case study of the nonlinear Lorenz oscillation system illustrates the advantages of the proposed approach over two alternatives.

Submitted 3 September, 2023; originally announced September 2023.

arXiv:2306.12925 [pdf, other]

AudioPaLM: A Large Language Model That Can Speak and Listen

Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, et al. (5 additional authors not shown)

Abstract: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples

Submitted 22 June, 2023; originally announced June 2023.

Comments: Technical report

arXiv:2306.04730 [pdf, other]

Stochastic Natural Thresholding Algorithms

Authors: Rachel Grotheer, Shuang Li, Anna Ma, Deanna Needell, Jing Qin

Abstract: Sparse signal recovery is one of the most fundamental problems in various applications, including medical imaging and remote sensing. Many greedy algorithms based on the family of hard thresholding operators have been developed to solve the sparse signal recovery problem. More recently, Natural Thresholding (NT) has been proposed with improved computational efficiency. This paper proposes and discusses convergence guarantees for stochastic natural thresholding algorithms by extending the NT from the deterministic version with linear measurements to the stochastic version with a general objective function. We also conduct various numerical experiments on linear and nonlinear measurements to demonstrate the performance of StoNT.

Submitted 7 June, 2023; originally announced June 2023.
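
For readers unfamiliar with the hard thresholding operators that the abstract above takes as its starting point, here is a minimal sketch of the classical operator, which keeps the k largest-magnitude entries of a vector and zeroes the rest. It is not the Natural Thresholding or StoNT procedure from the paper; the function name `hard_threshold` and the example vector are illustrative assumptions.

```python
def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and set the others to zero."""
    if k <= 0:
        return [0.0] * len(x)
    # Indices ordered by decreasing magnitude; Python's sort is stable,
    # so ties are resolved in favour of the earlier position.
    order = sorted(range(len(x)), key=lambda i: -abs(x[i]))
    keep = set(order[:k])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


# Illustrative vector (an assumption, not data from the paper).
signal = [0.1, -3.0, 0.5, 2.0, -0.2]
print(hard_threshold(signal, 2))  # [0.0, -3.0, 0.0, 2.0, 0.0]
```

Greedy sparse-recovery schemes typically alternate a gradient-type step with a projection of this kind onto the set of k-sparse vectors.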

arXiv:2212.03070 [pdf, ps, other]

doi 10.5705/ss.202020.0318

Hypothesis test on a mixture forward-incubation-time epidemic model with application to COVID-19 outbreak

Authors: Chunlin Wang, Pengfei Li, Yukun Liu, Xiao-Hua Zhou, Jing Qin

Abstract: The distribution of the incubation period of the novel coronavirus disease that emerged in 2019 (COVID-19) has crucial clinical implications for understanding this disease and devising effective disease-control measures. Qin et al. (2020) designed a cross-sectional and forward follow-up study to collect the duration times between a specific observation time and the onset of COVID-19 symptoms for a number of individuals. They further proposed a mixture forward-incubation-time epidemic model, which is a mixture of an incubation-period distribution and a forward time distribution, to model the collected duration times and to estimate the incubation-period distribution of COVID-19. In this paper, we provide sufficient conditions for the identifiability of the unknown parameters in the mixture forward-incubation-time epidemic model when the incubation period follows a two-parameter distribution. Under the same setup, we propose a likelihood ratio test (LRT) for testing the null hypothesis that the mixture forward-incubation-time epidemic model is a homogeneous exponential distribution. The testing problem is non-regular because a nuisance parameter is present only under the alternative. We establish the limiting distribution of the LRT and identify an explicit representation for it. The limiting distribution of the LRT under a sequence of local alternatives is also obtained. Our simulation results indicate that the LRT has desirable type I errors and powers, and we analyze a COVID-19 outbreak dataset from China to illustrate the usefulness of the LRT.

Submitted 5 December, 2022; originally announced December 2022.

Comments: 34 pages, 2 figures, 2 tables

Journal ref: Statistica Sinica (2023)

arXiv:2209.04569 [pdf, other]

Nearly optimal capture-recapture sampling and empirical likelihood weighting estimation for M-estimation with big data

Authors: Yan Fan, Yang Liu, Yukun Liu, Jing Qin

Abstract: Subsampling techniques can reduce the computational costs of processing big data. Practical subsampling plans typically involve initial uniform sampling and refined sampling. With a subsample, big data inferences are generally built on the inverse probability weighting (IPW), which becomes unstable when the probability weights are close to zero and cannot incorporate auxiliary information. First, we consider capture-recapture sampling, which combines an initial uniform sampling with a second Poisson sampling. Under this sampling plan, we propose an empirical likelihood weighting (ELW) estimation approach to an M-estimation parameter. Second, based on the ELW method, we construct a nearly optimal capture-recapture sampling plan that balances estimation efficiency and computational costs. Third, we derive methods for determining the smallest sample sizes with which the proposed sampling-and-estimation method produces estimators of guaranteed precision. Our ELW method overcomes the instability of IPW by circumventing the use of inverse probabilities, and utilizes auxiliary information including the size and certain sample moments of big data. We show that the proposed ELW method produces more efficient estimators than IPW, leading to more efficient optimal sampling plans and more economical sample sizes for a prespecified estimation precision. These advantages are confirmed through simulation studies and real data analyses.

Submitted 9 September, 2022; originally announced September 2022.
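
The instability of inverse probability weighting that the abstract above mentions can be seen in a small sketch. The code below is a generic Horvitz-Thompson style estimate of a population mean, not the ELW method proposed in the paper; the function name `ipw_mean` and all numbers are hypothetical.

```python
def ipw_mean(values, probs, population_size):
    """IPW estimate of a population mean from a subsample.

    values: sampled observations
    probs: inclusion probability of each sampled observation
    population_size: size N of the full data set
    """
    # Each sampled value is weighted by the inverse of its inclusion probability.
    return sum(v / p for v, p in zip(values, probs)) / population_size


# With moderate probabilities the weights stay small: (2/0.5 + 4/0.5) / 4.
print(ipw_mean([2.0, 4.0], [0.5, 0.5], 4))  # 3.0

# An inclusion probability near zero produces one very large weight, so a
# single observation dominates the estimate.
print(ipw_mean([2.0, 4.0], [0.5, 0.001], 4))
```

The second call illustrates why estimators that avoid dividing by small probabilities can be more stable.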

arXiv:2106.15735 [pdf, other]

Active-set algorithms based statistical inference for shape-restricted generalized additive Cox regression models

Authors: Geng Deng, Guangning Xu, Qiang Fu, Xindong Wang, Jing Qin

Abstract: Recently the shape-restricted inference has gained popularity in statistical and econometric literature in order to relax the linear or quadratic covariate effect in regression analyses. The typical shape-restricted covariate effect includes monotonic increasing, decreasing, convexity or concavity. In this paper, we introduce the shape-restricted inference to the celebrated Cox regression model (SR-Cox), in which the covariate response is modeled as shape-restricted additive functions. The SR-Cox regression approximates the shape-restricted functions using a spline basis expansion with data driven choice of knots. The underlying minimization of negative log-likelihood function is formulated as a convex optimization problem, which is solved with an active-set optimization algorithm. The highlight of this algorithm is that it eliminates the superfluous knots automatically. When covariate effects include combinations of convex or concave terms with unknown forms and linear terms, the most interesting finding is that SR-Cox produces accurate linear covariate effect estimates which are comparable to the maximum partial likelihood estimates if indeed the forms are known. We conclude that concave or convex SR-Cox models could significantly improve nonlinear covariate response recovery and model goodness of fit.

Submitted 2 July, 2021; v1 submitted 29 June, 2021; originally announced June 2021.

Comments: Updated with new latex template, 33 pages

MSC Class: 62GXX

arXiv:2105.08677 [pdf, ps, other]

Maximum profile binomial likelihood estimation for the semiparametric Box--Cox power transformation model

Authors: Pengfei Li, Tao Yu, Baojiang Chen, Jing Qin

Abstract: The Box--Cox transformation model has been widely applied for many years. The parametric version of this model assumes that the random error follows a parametric distribution, say the normal distribution, and estimates the model parameters using the maximum likelihood method. The semiparametric version assumes that the distribution of the random error is completely unknown; existing methods either need strong assumptions, or are less effective when the distribution of the random error significantly deviates from the normal distribution. We adopt the semiparametric assumption and propose a maximum profile binomial likelihood method. We theoretically establish the joint distribution of the estimators of the model parameters. Through extensive numerical studies, we demonstrate that our method has an advantage over existing methods, especially when the distribution of the random error deviates from the normal distribution. Furthermore, we compare the performance of our method and existing methods on an HIV data set.

Submitted 18 May, 2021; v1 submitted 18 May, 2021; originally announced May 2021.

Comments: 70 pages, 1 figure
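
As background for the entry above, the standard Box-Cox power transformation maps a positive response y to (y^λ − 1)/λ for λ ≠ 0 and to log y for λ = 0. The sketch below implements only this textbook transform; it does not reproduce the paper's maximum profile binomial likelihood estimator, and the function name `box_cox` is an illustrative assumption.

```python
import math


def box_cox(y, lam):
    """Standard Box-Cox power transform of a positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox transform requires y > 0")
    if lam == 0:
        # The lam -> 0 limit of (y**lam - 1) / lam is log(y).
        return math.log(y)
    return (y ** lam - 1.0) / lam


print(box_cox(4.0, 0.5))  # (sqrt(4) - 1) / 0.5 = 2.0
print(box_cox(1.0, 0))    # log(1) = 0.0
```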

arXiv:2103.13435 [pdf, ps, other]

Maximum pairwise-rank-likelihood-based inference for the semiparametric transformation model

Authors: Tao Yu, Pengfei Li, Baojiang Chen, Ao Yuan, Jing Qin

Abstract: In this paper, we study the linear transformation model in the most general setup. This model includes many important and popular models in statistics and econometrics as special cases. Although it has been studied for many years, the methods in the literature are based on kernel-smoothing techniques or make use of only the ranks of the responses in the estimation of the parametric components. The former approach needs a tuning parameter, which is not easily optimally specified in practice; and the latter is computationally expensive and may not make full use of the information in the data. In this paper, we propose two methods: a pairwise rank likelihood method and a score-function-based method based on this pairwise rank likelihood. We also explore the theoretical properties of the proposed estimators. Via extensive numerical studies, we demonstrate that our methods are appealing in that the estimators are not only robust to the distribution of the random errors but also lead to mean square errors that are in many cases comparable to or smaller than those of existing methods.

Submitted 24 March, 2021; originally announced March 2021.

Comments: 6 tables and 2 figures

MSC Class: 62J02; 62J05; 62J86

arXiv:2101.00105 [pdf, ps, other]

A selective review on calibration information from similar studies based on parametric likelihood or empirical likelihood

Authors: Jing Qin, Yukun Liu, Pengfei Li

Abstract: In multi-center clinical trials, due to various reasons, the individual-level data are strictly restricted to be assessed publicly. Instead, the summarized information is widely available from published results. With the advance of computational technology, it has become very common in data analyses to run on hundreds or thousands of machines simultaneously, with the data distributed across those machines and no longer available in a single central location. How to effectively assemble the summarized clinical data information or information from each machine in parallel computation has become a challenging task for statisticians and computer scientists. In this paper, we selectively review some recently-developed statistical methods, including communication efficient distributed statistical inference, and renewal estimation and incremental inference, which can be regarded as the latest development of calibration information methods in the era of big data. Even though those methods were developed in different fields and in different statistical frameworks, in principle, they are asymptotically equivalent to those well known methods developed in meta analysis. Almost no or little information is lost compared with the case when full data are available. As a general tool to integrate information, we also review the generalized method of moments and estimating equations approach by using empirical likelihood method.

Submitted 31 December, 2020; originally announced January 2021.

Comments: 37 pages

arXiv:2010.07204 [pdf, other]

Incorporating survival data into case-control studies with incident and prevalent cases

Authors: Soutrik Mandal, Jing Qin, Ruth M. Pfeiffer

Abstract: Typically, case-control studies to estimate odds-ratios associating risk factors with disease incidence from logistic regression only include cases with newly diagnosed disease. Recently proposed methods allow incorporating information on prevalent cases, individuals who survived from disease diagnosis to sampling, into cross-sectionally sampled case-control studies under parametric assumptions for the survival time after diagnosis. Here we propose and study methods to additionally use prospectively observed survival times from prevalent and incident cases to adjust logistic models for the time between disease diagnosis and sampling, the backward time, for prevalent cases. This adjustment yields unbiased odds-ratio estimates from case-control studies that include prevalent cases. We propose a computationally simple two-step generalized method-of-moments estimation procedure. First, we estimate the survival distribution based on a semi-parametric Cox model using an expectation-maximization algorithm that yields fully efficient estimates and accommodates left truncation for the prevalent cases and right censoring. Then, we use the estimated survival distribution in an extension of the logistic model to three groups (controls, incident and prevalent cases), to accommodate the survival bias in prevalent cases. In simulations, when the amount of censoring was modest, odds-ratios from the two-step procedure were equally efficient as those estimated by jointly optimizing the logistic and survival data likelihoods under parametric assumptions.
Even with 90% censoring they were as efficient as estimates obtained using only cross-sectionally available information under parametric assumptions. This indicates that utilizing prospective survival data from the cases lessens model dependency and improves precision of association estimates for case-control studies with prevalent cases.

Submitted 16 October, 2020; v1 submitted 14 October, 2020; originally announced October 2020.

arXiv:2009.12836 [pdf, other]

Normalization Techniques in Training DNNs: Methodology, Analysis and Application

Authors: Lei Huang, Jie Qin, Yi Zhou, Fan Zhu, Li Liu, Ling Shao

Abstract: Normalization techniques are essential for accelerating the training and improving the generalization of deep neural networks (DNNs), and have successfully been used in various applications. This paper reviews and comments on the past, present and future of normalization methods in the context of DNN training. We provide a unified picture of the main motivation behind different approaches from the perspective of optimization, and present a taxonomy for understanding the similarities and differences between them. Specifically, we decompose the pipeline of the most representative normalizing activation methods into three components: the normalization area partitioning, normalization operation and normalization representation recovery. In doing so, we provide insight for designing new normalization techniques. Finally, we discuss the current progress in understanding normalization methods, and provide a comprehensive review of the applications of normalization for particular tasks, in which it can effectively solve the key issues.

Submitted 27 September, 2020; originally announced September 2020.

Comments: 20 pages
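
The three components named in the abstract above can be illustrated with a small batch-normalization-style sketch. This is one reading of those terms, not code from the survey: the "area" is taken to be all values passed in together, the "operation" is standardization to zero mean and unit variance, and the "recovery" is an affine map with parameters `gamma` and `beta`. The function name `normalize` and the default `eps` are assumptions.

```python
import math


def normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize a group of activations, then apply an affine recovery."""
    # Area partitioning: every value in `values` shares the same statistics.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # Operation: standardize; eps guards against division by zero.
    standardized = [(v - mean) / math.sqrt(var + eps) for v in values]
    # Representation recovery: scale and shift, usually with learned parameters.
    return [gamma * s + beta for s in standardized]


activations = [1.0, 2.0, 3.0, 4.0]
print(normalize(activations))
print(normalize(activations, gamma=2.0, beta=1.0))
```

Changing which values are grouped together (per feature across a batch, per sample across features, and so on) is what distinguishes the batch, layer, and group variants of this scheme.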

arXiv:2007.04873 [pdf, other]

Invertible Zero-Shot Recognition Flows

Authors: Yuming Shen, Jie Qin, Lei Huang

Abstract: Deep generative models have been successfully applied to Zero-Shot Learning (ZSL) recently. However, the underlying drawbacks of GANs and VAEs (e.g., the hardness of training with ZSL-oriented regularizers and the limited generation quality) hinder the existing generative ZSL models from fully bypassing the seen-unseen bias. To tackle the above limitations, for the first time, this work incorporates a new family of generative models (i.e., flow-based models) into ZSL. The proposed Invertible Zero-shot Flow (IZF) learns factorized data embeddings (i.e., the semantic factors and the non-semantic ones) with the forward pass of an invertible flow network, while the reverse pass generates data samples. This procedure theoretically extends conventional generative flows to a factorized conditional scheme. To explicitly solve the bias problem, our model enlarges the seen-unseen distributional discrepancy based on negative sample-based distance measurement. Notably, IZF works flexibly with either a naive Bayesian classifier or a held-out trainable one for zero-shot recognition. Experiments on widely-adopted ZSL benchmarks demonstrate the significant performance gain of IZF over existing methods, in both classic and generalized settings.

Submitted 9 July, 2020; originally announced July 2020.

Comments: ECCV2020

arXiv:2003.05928 [pdf, ps, other]

On the Convergence of the Dynamic Inner PCA Algorithm

Authors: Sungho Shin, Alex D. Smith, S. Joe Qin, Victor M. Zavala

Abstract: Dynamic inner principal component analysis (DiPCA) is a powerful method for the analysis of time-dependent multivariate data. DiPCA extracts dynamic latent variables that capture the most dominant temporal trends by solving a large-scale, dense, and nonconvex nonlinear program (NLP). A scalable decomposition algorithm has been recently proposed in the literature to solve these challenging NLPs. The decomposition algorithm performs well in practice but its convergence properties are not well understood. In this work, we show that this algorithm is a specialized variant of a coordinate maximization algorithm. This observation allows us to explain why the decomposition algorithm might work (or not) in practice and can guide improvements. We compare the performance of the decomposition strategies with that of the off-the-shelf solver Ipopt. The results show that decomposition is more scalable and, surprisingly, delivers higher quality solutions.

Submitted 12 March, 2020; originally announced March 2020.

Journal ref: In Proceedings of Foundations of Process Analytics and Machine Learning, 2019

arXiv:2002.06442 [pdf, other]

doi 10.1145/3318464.3380570

Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach

Authors: Yaoshu Wang, Chuan Xiao, Jianbin Qin, Xin Cao, Yifang Sun, Wei Wang, Makoto Onizuka

Abstract: Due to the outstanding capability of capturing underlying data distributions, deep learning techniques have been recently utilized for a series of traditional database problems. In this paper, we investigate the possibilities of utilizing deep learning for cardinality estimation of similarity selection. Answering this problem accurately and efficiently is essential to many data management applications, especially for query optimization. Moreover, in some applications the estimated cardinality is supposed to be consistent and interpretable. Hence a monotonic estimation w.r.t. the query threshold is preferred. We propose a novel and generic method that can be applied to any data type and distance function. Our method consists of a feature extraction model and a regression model. The feature extraction model transforms original data and threshold to a Hamming space, in which a deep learning-based regression model is utilized to exploit the incremental property of cardinality w.r.t. the threshold for both accuracy and monotonicity. We develop a training strategy tailored to our model as well as techniques for fast estimation. We also discuss how to handle updates. We demonstrate the accuracy and the efficiency of our method through experiments, and show how it improves the performance of a query optimizer.

Submitted 24 September, 2021; v1 submitted 15 February, 2020; originally announced February 2020.

ACM Class: H.2.4; I.5.1

arXiv:1910.09323 [pdf, other]

Recurrent Attentive Neural Process for Sequential Data

Authors: Shenghao Qin, Jiacheng Zhu, Jimmy Qin, Wenshuo Wang, Ding Zhao

Abstract: Neural processes (NPs) learn stochastic processes and predict the distribution of target output adaptively conditioned on a context set of observed input-output pairs. Furthermore, the Attentive Neural Process (ANP) improved the prediction accuracy of NPs by incorporating an attention mechanism among contexts and targets. In a number of real-world applications such as robotics, finance, speech, and biology, it is critical to learn the temporal order and recurrent structure from sequential data. However, the capability of NPs to capture these properties is limited due to their inherent permutation invariance. In this paper, we propose the Recurrent Attentive Neural Process (RANP), or alternatively, Attentive Neural Process-Recurrent Neural Network (ANP-RNN), in which the ANP is incorporated into a recurrent neural network. The proposed model encapsulates both the inductive biases of recurrent neural networks and the strength of NPs for modelling uncertainty. We demonstrate that RANP can effectively model sequential data and outperforms NPs and LSTMs remarkably in a 1D regression toy example as well as autonomous-driving applications.

Submitted 17 October, 2019; originally announced October 2019.

Comments: 12 pages, 6 figures, NeurIPS 2019 Workshop

arXiv:1909.02344 [pdf, other]

An Active Learning Approach for Reducing Annotation Cost in Skin Lesion Analysis

Authors: Xueying Shi, Qi Dou, Cheng Xue, **g Qin, Hao Chen, Pheng-Ann Heng

Abstract: Automated skin lesion analysis is crucial in clinical practice, as skin cancer is among the most common human malignancies. Existing approaches with deep learning have achieved remarkable performance on this challenging task, but rely heavily on large-scale labelled datasets. In this paper, we present a novel active learning framework for cost-effective skin lesion analysis. The goal is to effectively select and utilize far fewer labelled samples, while the network can still achieve state-of-the-art performance. Our sample selection criteria complementarily consider both informativeness and representativeness, derived from decoupled aspects of measuring model certainty and covering sample diversity. To make wise use of the selected samples, we further design a simple yet effective strategy to aggregate intra-class images in pixel space, as a new form of data augmentation. We validate our proposed method on data of the ISIC 2017 Skin Lesion Classification Challenge for two tasks. Using only up to 50% of samples, our approach can achieve state-of-the-art performances on both tasks, which are comparable to or exceed the accuracies with full-data training, and outperform other well-known active learning methods by a large margin.

Submitted 5 September, 2019; originally announced September 2019.

Comments: Accepted by MIML2019

arXiv:1908.08479 [pdf, other]

Iterative Hard Thresholding for Low CP-rank Tensor Models

Authors: Rachel Grotheer, Shuang Li, Anna Ma, Deanna Needell, **g Qin

Abstract: Recovery of low-rank matrices from a small number of linear measurements is now well-known to be possible under various model assumptions on the measurements. Such results demonstrate robustness and are backed with provable theoretical guarantees. However, extensions to tensor recovery have only recently begun to be studied and developed, despite an abundance of practical tensor applications. Recently, a tensor variant of the Iterative Hard Thresholding method was proposed and theoretical results were obtained that guarantee exact recovery of tensors with low Tucker rank. In this paper, we utilize the same tensor version of the Restricted Isometry Property (RIP) to extend these results for tensors with low CANDECOMP/PARAFAC (CP) rank. In doing so, we leverage recent results on efficient approximations of CP decompositions that remove the need for challenging assumptions in prior works. We complement our theoretical findings with empirical results that showcase the potential of the approach.

Submitted 22 August, 2019; originally announced August 2019.

arXiv:1908.01260 [pdf, ps, other]

Full-semiparametric-likelihood-based inference for non-ignorable missing data

Authors: Yukun Liu, Pengfei Li, **g Qin

Abstract: During the past few decades, missing-data problems have been studied extensively, with a focus on the ignorable missing case, where the missing probability depends only on observable quantities. By contrast, research into non-ignorable missing data problems is quite limited. The main difficulty in solving such problems is that the missing probability and the regression likelihood function are tangled together in the likelihood presentation, and the model parameters may not be identifiable even under strong parametric model assumptions. In this paper we discuss a semiparametric model for non-ignorable missing data and propose a maximum full semiparametric likelihood estimation method, which is an efficient combination of the parametric conditional likelihood and the marginal nonparametric biased sampling likelihood. The extra marginal likelihood contribution can not only produce efficiency gain but also identify the underlying model parameters without additional assumptions. We further show that the proposed estimators for the underlying parameters and the response mean are semiparametrically efficient. Extensive simulations and a real data analysis demonstrate the advantage of the proposed method over competing methods.

Submitted 3 August, 2019; originally announced August 2019.

Comments: 45 pages

arXiv:1905.10115 [pdf, ps, other]

Multi-Kernel Correntropy for Robust Learning

Authors: Badong Chen, Yuqing Xie, Xin Wang, Zejian Yuan, Pengju Ren, **g Qin

Abstract: As a novel similarity measure that is defined as the expectation of a kernel function between two random variables, correntropy has been successfully applied in robust machine learning and signal processing to combat large outliers. The kernel function in correntropy is usually a zero-mean Gaussian kernel. In a recent work, the concept of mixture correntropy (MC) was proposed to improve the learning performance, where the kernel function is a mixture Gaussian kernel, namely a linear combination of several zero-mean Gaussian kernels with different widths. In both correntropy and mixture correntropy, the center of the kernel function is, however, always located at zero. In the present work, to further improve the learning performance, we propose the concept of multi-kernel correntropy (MKC), in which each component of the mixture Gaussian kernel can be centered at a different location. The properties of the MKC are investigated and an efficient approach is proposed to determine the free parameters in MKC. Experimental results show that the learning algorithms under the maximum multi-kernel correntropy criterion (MMKCC) can outperform those under the original maximum correntropy criterion (MCC) and the maximum mixture correntropy criterion (MMCC).

Submitted 5 September, 2021; v1 submitted 24 May, 2019; originally announced May 2019.

Comments: 12 pages, 5 figures

arXiv:1905.01402 [pdf, ps, other]

Test for homogeneity with unordered paired observations

Authors: Jiahua Chen, Pengfei Li, **g Qin, Tao Yu

Abstract: In some applications, an experimental unit is composed of two distinct but related subunits. The response from such a unit is $(X_{1}, X_{2})$ but we observe only $Y_1 = \min\{X_{1},X_{2}\}$ and $Y_2 = \max\{X_{1},X_{2}\}$, i.e., the subunit identities are not observed. We call $(Y_1, Y_2)$ unordered paired observations. Based on unordered paired observations $\{(Y_{1i}, Y_{2i})\}_{i=1}^n$, we are interested in whether the marginal distributions for $X_1$ and $X_2$ are identical. Testing methods are available in the literature under the assumptions that $Var(X_1) = Var(X_2)$ and $Cov(X_1, X_2) = 0$. However, by extensive simulation studies, we observe that when one or both assumptions are violated, these methods have inflated type I errors or much lower powers. In this paper, we study the likelihood ratio test statistics for various scenarios and explore their limiting distributions without these restrictive assumptions. Furthermore, we develop Bartlett correction formulae for these statistics to enhance their precision when the sample size is not large. Simulation studies and real-data examples are used to illustrate the efficacy of the proposed methods.

Submitted 3 May, 2019; originally announced May 2019.

Comments: 30 pages, 1 figure

arXiv:1902.08295 [pdf, other]

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

Authors: Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu, Yanzhang He, Jan Chorowski, Smit Hinsu, Stella Laurenzo, James Qin, Orhan Firat, Wolfgang Macherey, Suyog Gupta, Ankur Bapna, Shuyuan Zhang, Ruoming Pang, Ron J. Weiss, Rohit Prabhavalkar, Qiao Liang, Benoit Jacob , et al. (66 additional authors not shown)

Abstract: Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus towards sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.

Submitted 21 February, 2019; originally announced February 2019.

arXiv:1810.08032 [pdf, other]

Augmenting Adjusted Plus-Minus in Soccer with FIFA Ratings

Authors: Francesca Matano, Lee F. Richardson, Taylor Pospisil, Collin Eubanks, **ing Qin

Abstract: In basketball and hockey, state-of-the-art player value statistics are often variants of Adjusted Plus-Minus (APM). But APM hasn't had the same impact in soccer, since soccer games are low scoring with a low number of substitutions. In soccer, perhaps the most comprehensive player value statistics come from video games, and in particular FIFA. FIFA ratings combine the subjective evaluations of over 9000 scouts, coaches, and season-ticket holders into ratings for over 18,000 players. This paper combines FIFA ratings and APM into a single metric, which we call Augmented APM. The key idea is recasting APM into a Bayesian framework, and incorporating FIFA ratings into the prior distribution. We show that Augmented APM predicts better than both standard APM and a model using only FIFA ratings. We also show that Augmented APM decorrelates players that are highly collinear.

Submitted 18 October, 2018; originally announced October 2018.

arXiv:1809.02403 [pdf, other]

Deep Recurrent Survival Analysis

Authors: Kan Ren, Jiarui Qin, Lei Zheng, Zhengyu Yang, Weinan Zhang, Lin Qiu, Yong Yu

Abstract: Survival analysis is a hotspot in statistical research for modeling time-to-event information with data censorship handling, which has been widely used in many applications such as clinical research, information systems and other fields with survivorship bias. Many works have been proposed for survival analysis, ranging from traditional statistical methods to machine learning models. However, the existing methodologies either utilize counting-based statistics on the segmented data, or have a pre-assumption on the event probability distribution w.r.t. time. Moreover, few works consider sequential patterns within the feature space. In this paper, we propose a Deep Recurrent Survival Analysis model which combines deep learning for conditional probability prediction at a fine-grained level of the data, and survival analysis for tackling the censorship. By capturing the time dependency through modeling the conditional probability of the event for each sample, our method predicts the likelihood of the true event occurrence and estimates the survival rate over time, i.e., the probability of the non-occurrence of the event, for the censored data. Meanwhile, without assuming any specific form of the event probability distribution, our model shows great advantages over the previous works on fitting various sophisticated data distributions. In the experiments on the three real-world tasks from different fields, our model significantly outperforms the state-of-the-art solutions under various metrics.

Submitted 13 November, 2018; v1 submitted 7 September, 2018; originally announced September 2018.

Comments: AAAI 2019. Supplemental material, slides, code: https://github.com/rk2900/drsa

arXiv:1808.02430 [pdf, ps, other]

doi 10.1109/LSP.2019.2890973

Granger Causality Analysis Based on Quantized Minimum Error Entropy Criterion

Authors: Badong Chen, Rong** Ma, Siyu Yu, Shaoyi Du, **g Qin

Abstract: Linear regression model (LRM) based on mean square error (MSE) criterion is widely used in Granger causality analysis (GCA), which is the most commonly used method to detect the causality between a pair of time series. However, when signals are seriously contaminated by non-Gaussian noises, the LRM coefficients will be inaccurately identified. This may cause the GCA to detect a wrong causal relationship. Minimum error entropy (MEE) criterion can be used to replace the MSE criterion to deal with the non-Gaussian noises. But its calculation requires a double summation operation, which brings computational bottlenecks to GCA especially when sizes of the signals are large. To address the aforementioned problems, in this study we propose a new method called GCA based on the quantized MEE (QMEE) criterion (GCA-QMEE), in which the QMEE criterion is applied to identify the LRM coefficients and the quantized error entropy is used to calculate the causality indexes. Compared with the traditional GCA, the proposed GCA-QMEE not only makes the results more discriminative, but also more robust. Its computational complexity is also not high because of the quantization operation. Illustrative examples on synthetic and EEG datasets are provided to verify the desirable performance and the availability of the GCA-QMEE.

Submitted 7 August, 2018; originally announced August 2018.

Comments: 5 pages, 2 figures, 3 tables

arXiv:1803.06365 [pdf, ps, other]

Inference for case-control studies with incident and prevalent cases

Authors: Marlena Maziarz, Yukun Liu, **g Qin, Ruth Pfeiffer

Abstract: We propose and study a fully efficient method to estimate associations of an exposure with disease incidence when both incident cases and prevalent cases (i.e., individuals who were diagnosed with the disease at some prior time point and are alive at the time of sampling) are included in a case-control study. We extend the exponential tilting model for the relationship between exposure and case status to accommodate two case groups, and correct for the survival bias in the prevalent cases through a tilting term that depends on the parametric distribution of the backward time, i.e., the time from disease diagnosis to study enrollment. We construct an empirical likelihood that also incorporates the observed backward times for prevalent cases, obtain efficient estimates of odds ratio parameters that relate exposure to disease incidence, and propose a likelihood ratio test for model parameters that has a standard chi-squared distribution. We quantify the changes in efficiency of association parameters when incident cases are supplemented with, or replaced by, prevalent cases in simulations. We illustrate our methods by estimating associations of single nucleotide polymorphisms (SNPs) with breast cancer incidence in a sample of controls and incident and prevalent cases from the U.S. Radiologic Technologists Health Study.

Submitted 16 March, 2018; originally announced March 2018.

Comments: 19 pages, 4 figures

arXiv:1704.08165 [pdf, other]

A Generalization of Convolutional Neural Networks to Graph-Structured Data

Authors: Yotam Hechtlinger, Purvasha Chakravarti, **ing Qin

Abstract: This paper introduces a generalization of Convolutional Neural Networks (CNNs) from low-dimensional grid data, such as images, to graph-structured data. We propose a novel spatial convolution utilizing a random walk to uncover the relations within the input, analogous to the way the standard convolution uses the spatial neighborhood of a pixel on the grid. The convolution has an intuitive interpretation, is efficient and scalable and can also be used on data with varying graph structure. Furthermore, this generalization can be applied to many standard regression or classification problems, by learning the underlying graph. We empirically demonstrate the performance of the proposed CNN on MNIST, and challenge the state-of-the-art on the Merck molecular activity data set.

Submitted 26 April, 2017; originally announced April 2017.

arXiv:1612.07019 [pdf, ps, other]

Robust Learning with Kernel Mean p-Power Error Loss

Authors: Badong Chen, Lei Xing, Xin Wang, **g Qin, Nanning Zheng

Abstract: Correntropy is a second-order statistical measure in kernel space, which has been successfully applied in robust learning and signal processing. In this paper, we define a non-second-order statistical measure in kernel space, called the kernel mean p-power error (KMPE), including the correntropic loss (CLoss) as a special case. Some basic properties of KMPE are presented. In particular, we apply the KMPE to extreme learning machine (ELM) and principal component analysis (PCA), and develop two robust learning algorithms, namely ELM-KMPE and PCA-KMPE. Experimental results on synthetic and benchmark data show that the developed algorithms can achieve consistently better performance when compared with some existing methods.

Submitted 21 December, 2016; originally announced December 2016.

Comments: 11 pages, 7 figures, 10 tables

arXiv:1611.00326 [pdf, other]

Enhanced Factored Three-Way Restricted Boltzmann Machines for Speech Detection

Authors: Pengfei Sun, Jun Qin

Abstract: In this letter, we propose enhanced factored three-way restricted Boltzmann machines (EFTW-RBMs) for speech detection. The proposed model incorporates conditional feature learning by multiplying the dynamical state of the third unit, which allows a modulation over the visible-hidden node pairs. Instead of stacking previous frames of speech as the third unit in a recursive manner, the correlation-related weighting coefficients are assigned to the contextual neighboring frames. Specifically, a threshold function is designed to capture the long-term features and blend the globally stored speech structure. A factored low-rank approximation is introduced to reduce the parameters of the three-dimensional interaction tensor, on which a non-negative constraint is imposed to address the sparsity characteristic. The validations through the area-under-ROC-curve (AUC) and signal distortion ratio (SDR) show that our approach outperforms several existing 1D and 2D (i.e., time and time-frequency domain) speech detection algorithms in various noisy environments.

Submitted 20 April, 2017; v1 submitted 1 November, 2016; originally announced November 2016.

Comments: 8 pages, Pattern Recognition Letter 2016

arXiv:1605.03868 [pdf, other]

A Nonparametric Likelihood Approach for Inference in Instrumental Variable Models

Authors: Kwonsang Lee, Bhaswar B. Bhattacharya, **g Qin, Dylan S. Small

Abstract: Instrumental variable methods allow for inference about the treatment effect by controlling for unmeasured confounding in randomized experiments with noncompliance. However, many studies do not consider the observed compliance behavior in the testing procedure, which can lead to a loss of power. In this paper, we propose a novel nonparametric likelihood approach, referred to as the binomial likelihood (BL) method, that incorporates information on compliance behavior while overcoming several limitations of previous techniques and utilizing the advantages of likelihood methods. Our proposed method produces proper estimates of the counterfactual distribution functions by maximizing the binomial likelihood over the space of distribution functions. Using this we propose two versions of a binomial likelihood ratio test for the null hypothesis of no treatment effect. We show that both versions are more powerful to detect any distributional change than existing methods in finite sample cases, and are asymptotically equivalent to the two-sample Anderson-Darling test. We also develop an efficient algorithm for computing our estimates, and apply the binomial likelihood method to a study of the effect of Medicaid coverage on mental health using the Oregon Health Insurance Experiment.

Submitted 10 June, 2020; v1 submitted 12 May, 2016; originally announced May 2016.

Comments: Major changes. Updated BL method. New theorems and data analysis added

arXiv:1407.8412 [pdf, ps, other]

doi 10.1214/14-AOAS730

Combining isotonic regression and EM algorithm to predict genetic risk under monotonicity constraint

Authors: **g Qin, Tanya P. Garcia, Yanyuan Ma, Ming-Xin Tang, Karen Marder, Yuanjia Wang

Abstract: In certain genetic studies, clinicians and genetic counselors are interested in estimating the cumulative risk of a disease for individuals with and without a rare deleterious mutation. Estimating the cumulative risk is difficult, however, when the estimates are based on family history data. Often, the genetic mutation status in many family members is unknown; instead, only estimated probabilities of a patient having a certain mutation status are available. Also, ages of disease-onset are subject to right censoring. Existing methods to estimate the cumulative risk using such family-based data only provide estimation at individual time points, and are not guaranteed to be monotonic or nonnegative. In this paper, we develop a novel method that combines Expectation-Maximization and isotonic regression to estimate the cumulative risk across the entire support. Our estimator is monotonic, satisfies self-consistent estimating equations and has high power in detecting differences between the cumulative risks of different populations. Application of our estimator to a Parkinson's disease (PD) study provides the age-at-onset distribution of PD in PARK2 mutation carriers and noncarriers, and reveals a significant difference between the distribution in compound heterozygous carriers compared to noncarriers, but not between heterozygous carriers and noncarriers.

Submitted 31 July, 2014; originally announced July 2014.

Comments: Published at http://dx.doi.org/10.1214/14-AOAS730 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS730

Journal ref: Annals of Applied Statistics 2014, Vol. 8, No. 2, 1182-1208

arXiv:1407.3152 [pdf, ps, other]

Maximum Smoothed Likelihood Component Density Estimation in Mixture Models with Known Mixing Proportions

Authors: Tao Yu, Pengfei Li, **g Qin

Abstract: In this paper, we propose a maximum smoothed likelihood method to estimate the component density functions of mixture models, in which the mixing proportions are known and may differ among observations. The proposed estimates maximize a smoothed log likelihood function and inherit all the important properties of probability density functions. A majorization-minimization algorithm is suggested to compute the proposed estimates numerically. In theory, we show that starting from any initial value, this algorithm increases the smoothed likelihood function and further leads to estimates that maximize the smoothed likelihood function. This indicates the convergence of the algorithm. Furthermore, we theoretically establish the asymptotic convergence rate of our proposed estimators. An adaptive procedure is suggested to choose the bandwidths in our estimation procedure. Simulation studies show that the proposed method is more efficient than the existing method in terms of integrated squared errors. A real data example is further analyzed.

Submitted 11 July, 2014; originally announced July 2014.

Showing 1–34 of 34 results for author: Qin, J