Search | arXiv e-print repository

Regularized Estimation of Sparse Spectral Precision Matrices

Authors: Navonil Deb, Amy Kuceyeski, Sumanta Basu

Abstract: Spectral precision matrix, the inverse of a spectral density matrix, is an object of central interest in frequency-domain analysis of multivariate time series. Estimation of spectral precision matrix is a key step in calculating partial coherency and graphical model selection of stationary time series. When the dimension of a multivariate time series is moderate to large, traditional estimators of spectral density matrices such as averaged periodograms tend to be severely ill-conditioned, and one needs to resort to suitable regularization strategies involving optimization over complex variables. In this work, we propose complex graphical Lasso (CGLASSO), an $\ell_1$-penalized estimator of spectral precision matrix based on local Whittle likelihood maximization. We develop fast $\textit{pathwise coordinate descent}$ algorithms for implementing CGLASSO on large dimensional time series data sets. At its core, our algorithmic development relies on a ring isomorphism between complex and real matrices that helps map a number of optimization problems over complex variables to similar optimization problems over real variables. This finding may be of independent interest and more broadly applicable for high-dimensional statistical analysis with complex-valued data. We also present a complete non-asymptotic theory of our proposed estimator which shows that consistent estimation is possible in high-dimensional regime as long as the underlying spectral precision matrix is suitably sparse. We compare the performance of CGLASSO with competing alternatives on simulated data sets, and use it to construct partial coherence network among brain regions from a real fMRI data set.

Submitted 30 April, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

Comments: 55 pages, 8 figures

MSC Class: 62H12; 62J07; 62M10; 62M15 ACM Class: G.3; I.5.2
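
The ring isomorphism mentioned in the abstract can be illustrated with the standard embedding that sends a complex matrix $A + iB$ to the real block matrix $[[A, -B], [B, A]]$. The sketch below checks that this map preserves matrix products. Whether the paper uses exactly this map is an assumption, and the helper names `to_real` and `matmul` are hypothetical.

```python
def to_real(Z):
    """Map a p x p complex matrix Z = A + iB to the 2p x 2p real matrix [[A, -B], [B, A]]."""
    A = [[z.real for z in row] for row in Z]
    B = [[z.imag for z in row] for row in Z]
    top = [a_row + [-b for b in b_row] for a_row, b_row in zip(A, B)]
    bottom = [b_row + a_row for a_row, b_row in zip(A, B)]
    return top + bottom


def matmul(X, Y):
    """Plain matrix product for nested lists (works for real or complex entries)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]
```

For two complex matrices `Z1` and `Z2` of the same size, `to_real(matmul(Z1, Z2))` should equal `matmul(to_real(Z1), to_real(Z2))`, which is what lets a complex-valued problem be restated over real variables.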

arXiv:2312.16241

Analysis of Pleiotropy for Testosterone and Lipid Profiles in Males and Females

Authors: Srijan Chattopadhyay, Swapnaneel Bhattacharyya, Sevantee Basu

Abstract: In modern scientific studies, it is often imperative to determine whether a set of phenotypes is affected by a single factor. If such an influence is identified, it becomes essential to discern whether this effect is contingent upon categories such as sex or age group, and importantly, to understand whether this dependence is rooted in purely non-environmental reasons. The exploration of such dependencies often involves studying pleiotropy, a phenomenon wherein a single genetic locus impacts multiple traits. This heightened interest in uncovering dependencies by pleiotropy is fueled by the growing accessibility of summary statistics from genome-wide association studies (GWAS) and the establishment of thoroughly phenotyped sample collections. This advancement enables a systematic and comprehensive exploration of the genetic connections among various traits and diseases. Additive genetic correlation illuminates the genetic connection between two traits, providing valuable insights into the shared biological pathways and underlying causal relationships between them. In this paper, we present a novel method to analyze such dependencies by studying additive genetic correlations between pairs of traits under consideration. Subsequently, we employ matrix comparison techniques to discern and elucidate sex-specific or age-group-specific associations, contributing to a deeper understanding of the nuanced dependencies within the studied traits. Our proposed method is computationally handy and requires only GWAS summary statistics. We validate our method by applying it to the UK Biobank data and present the results.

Submitted 21 March, 2024; v1 submitted 25 December, 2023; originally announced December 2023.

Comments: The authors have withdrawn this manuscript because the work was performed in the lab of Anasuya Chakrabarty, but the manuscript was submitted without her knowledge or consent. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.

arXiv:2312.10926 [pdf, other]

A Random Effects Model-based Method of Moments Estimation of Causal Effect in Mendelian Randomization Studies

Authors: Wenhao Cao, Saonli Basu

Abstract: Recent advances in genotyping technology have delivered a wealth of genetic data, which is rapidly advancing our understanding of the underlying genetic architecture of complex diseases. Mendelian Randomization (MR) leverages such genetic data to estimate the causal effect of an exposure factor on an outcome from observational studies. In this paper, we utilize genetic correlations to summarize information on a large set of genetic variants associated with the exposure factor. Our proposed approach is a generalization of the MR-inverse variance weighting (IVW) approach where we can accommodate many weak and pleiotropic effects. Our approach quantifies the variation explained by all valid instrumental variables (IVs) instead of estimating the individual effects and thus could accommodate weak IVs. This is particularly useful for performing MR estimation in small studies, or minority populations where the selection of valid IVs is unreliable and thus has a large influence on the MR estimation. Through simulation and real data analysis, we demonstrate that our approach provides a robust alternative to the existing MR methods. We illustrate the robustness of our proposed approach under the violation of MR assumptions and compare the performance with several existing approaches.

Submitted 17 December, 2023; originally announced December 2023.

Comments: 24 pages, 5 figures

arXiv:2311.15384 [pdf, other]

Robust and Automatic Data Clustering: Dirichlet Process meets Median-of-Means

Authors: Supratik Basu, Jyotishka Ray Choudhury, Debolina Paul, Swagatam Das

Abstract: Clustering stands as one of the most prominent challenges within the realm of unsupervised machine learning. Among the array of centroid-based clustering algorithms, the classic $k$-means algorithm, rooted in Lloyd's heuristic, takes center stage as one of the extensively employed techniques in the literature. Nonetheless, both $k$-means and its variants grapple with noteworthy limitations. These encompass a heavy reliance on initial cluster centroids, susceptibility to converging into local minima of the objective function, and sensitivity to outliers and noise in the data. When confronted with data containing noisy or outlier-laden observations, the Median-of-Means (MoM) estimator emerges as a stabilizing force for any centroid-based clustering framework. On a different note, a prevalent constraint among existing clustering methodologies resides in the prerequisite knowledge of the number of clusters prior to analysis. Utilizing model-based methodologies, such as Bayesian nonparametric models, offers the advantage of infinite mixture models, thereby circumventing the need for such requirements. Motivated by these facts, in this article, we present an efficient and automatic clustering technique by integrating the principles of model-based and centroid-based methodologies that mitigates the effect of noise on the quality of clustering while ensuring that the number of clusters need not be specified in advance. Statistical guarantees on the upper bound of clustering error, and rigorous assessment through simulated and real datasets suggest the advantages of our proposed method over existing state-of-the-art clustering algorithms.

Submitted 26 November, 2023; originally announced November 2023.
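
To show why the Median-of-Means estimator named in the abstract is less sensitive to outliers than a plain average, here is a minimal univariate sketch. It is not the paper's clustering procedure: the function name `median_of_means` and the choice of contiguous, equal-sized blocks are assumptions made for illustration.

```python
from statistics import mean, median


def median_of_means(values, n_blocks):
    """Split values into n_blocks contiguous blocks, average each, return the median of the averages."""
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    block_size = len(values) // n_blocks
    # Any leftover values beyond n_blocks * block_size are dropped in this sketch.
    blocks = [values[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    return median(mean(block) for block in blocks)
```

With `values = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]` and `n_blocks = 3`, the block means are 1.5, 3.5 and 502.5, so the estimate is 3.5, whereas the plain mean is pulled above 169 by the single outlier.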

arXiv:2308.09166 [pdf, other]

Sparse reconstruction of ordinary differential equations with inference

Authors: Sara Venkatraman, Sumanta Basu, Martin T. Wells

Abstract: Sparse regression has emerged as a popular technique for learning dynamical systems from temporal data, beginning with the SINDy (Sparse Identification of Nonlinear Dynamics) framework proposed by arXiv:1509.03580. Quantifying the uncertainty inherent in differential equations learned from data remains an open problem, thus we propose leveraging recent advances in statistical inference for sparse regression to address this issue. Focusing on systems of ordinary differential equations (ODEs), SINDy assumes that each equation is a parsimonious linear combination of a few candidate functions, such as polynomials, and uses methods such as sequentially-thresholded least squares or the Lasso to identify a small subset of these functions that govern the system's dynamics. We instead employ bias-corrected versions of the Lasso and ridge regression estimators, as well as an empirical Bayes variable selection technique known as SEMMS, to estimate each ODE as a linear combination of terms that are statistically significant. We demonstrate through simulations that this approach allows us to recover the functional terms that correctly describe the dynamics more often than existing methods that do not account for uncertainty.

Submitted 17 August, 2023; originally announced August 2023.

arXiv:2305.15343 [pdf, other]

Modeling Multiple Irregularly Spaced Financial Time Series

Authors: Chiranjit Dutta, Nalini Ravishanker, Sumanta Basu

Abstract: In this paper we propose univariate volatility models for irregularly spaced financial time series by modifying the regularly spaced stochastic volatility models. We also extend this approach to propose multivariate stochastic volatility (MSV) models for multiple irregularly spaced time series by modifying the MSV model that was used with daily data. We use these proposed models for modeling intraday logarithmic returns from health sector stocks data obtained from Trade and Quotes (TAQ) database at Wharton Research Data Services (WRDS).

Submitted 24 May, 2023; originally announced May 2023.

arXiv:2305.14639 [pdf, other]

Restricted Mean Survival Time Estimation Using Bayesian Nonparametric Dependent Mixture Models

Authors: Ruizhe Chen, Sanjib Basu, Qian Shi

Abstract: Restricted mean survival time (RMST) is an intuitive summary statistic for time-to-event random variables, and can be used for measuring treatment effects. Compared to hazard ratio, its estimation procedure is robust against the non-proportional hazards assumption. We propose nonparametric Bayesian (BNP) estimators for RMST using a dependent stick-breaking process prior mixture model that adjusts for mixed-type covariates. The proposed Bayesian estimators can yield both group-level causal estimate and subject-level predictions. Besides, we propose a novel dependent stick-breaking process prior that on average results in narrower credible intervals while maintaining similar coverage probability compared to a dependent probit stick-breaking process prior. We conduct simulation studies to investigate the performance of the proposed BNP RMST estimators compared to existing frequentist approaches and under different Bayesian modeling choices. The proposed framework is applied to estimate the treatment effect of an immunotherapy among KRAS wild-type colorectal cancer patients.

Submitted 23 May, 2023; originally announced May 2023.
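
For readers unfamiliar with RMST, it is the area under the survival curve up to a horizon $\tau$, that is $\int_0^\tau S(t)\,dt$. The sketch below computes it from a Kaplan-Meier curve, a standard frequentist estimate. It is not the Bayesian nonparametric estimator proposed in the paper, and the function name `km_rmst` and its argument layout are assumptions.

```python
def km_rmst(times, events, tau):
    """Area under the Kaplan-Meier survival curve on [0, tau].

    times  : observed follow-up times
    events : 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv, last_t, area = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        deaths = leaving = 0
        # Collect everyone whose follow-up ends exactly at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        area += surv * (t - last_t)          # curve is flat between event times
        surv *= 1.0 - deaths / n_at_risk     # Kaplan-Meier step at time t
        n_at_risk -= leaving
        last_t = t
    return area + surv * (tau - last_t)      # final flat piece up to tau
```

With four subjects who have events at times 1, 2, 3 and 4 and $\tau = 4$, the curve takes the values 1, 0.75, 0.5 and 0.25 on successive unit intervals, so the RMST is about 2.5.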

arXiv:2212.06353 [pdf, ps, other]

Bayesian Arc Length Survival Analysis Model (BALSAM): Theory and Application to an HIV/AIDS Clinical Trial

Authors: Yan Gao, Rodney A. Sparapani, Sanjib Basu

Abstract: Stochastic volatility often implies increasing risks that are difficult to capture given the dynamic nature of real-world applications. We propose using arc length, a mathematical concept, to quantify cumulative variations (the total variability over time) to more fully characterize stochastic volatility. The hazard rate, as defined by the Cox proportional hazards model in survival analysis, is assumed to be impacted by the instantaneous value of a longitudinal variable. However, when cumulative variations pose a significant impact on the hazard, this assumption is questionable. Our proposed Bayesian Arc Length Survival Analysis Model (BALSAM) infuses arc length into a united statistical framework by synthesizing three parallel components (joint models, distributed lag models, and arc length). We illustrate the use of BALSAM in simulation studies and also apply it to an HIV/AIDS clinical trial to assess the impact of cumulative variations of CD4 count (a critical longitudinal biomarker) on mortality while accounting for measurement errors and relevant variables.

Submitted 12 December, 2022; originally announced December 2022.
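
The idea of using arc length to summarize cumulative variation can be sketched for a discretely observed trajectory by joining consecutive measurements with straight segments and summing their lengths. This is only an illustration of the concept: how BALSAM actually computes and models arc length is not stated in the abstract, and the function name `arc_length` is hypothetical.

```python
import math


def arc_length(times, values):
    """Length of the piecewise-linear path through the points (times[i], values[i])."""
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    points = list(zip(times, values))
    return sum(
        math.hypot(t1 - t0, v1 - v0)
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    )
```

A flat trajectory has arc length equal to the elapsed time, while a trajectory that fluctuates over the same period is longer. The result depends on the units chosen for time and for the measurement, so the two axes would typically be rescaled before comparison.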

arXiv:2206.05374 [pdf, other]

Modeling Multivariate Positive-Valued Time Series Using R-INLA

Authors: Chiranjit Dutta, Nalini Ravishanker, Sumanta Basu

Abstract: In this paper we describe fast Bayesian statistical analysis of vector positive-valued time series, with application to interesting financial data streams. We discuss a flexible level correlated model (LCM) framework for building hierarchical models for vector positive-valued time series. The LCM allows us to combine marginal gamma distributions for the positive-valued component responses, while accounting for association among the components at a latent level. We use integrated nested Laplace approximation (INLA) for fast approximate Bayesian modeling via the R-INLA package, building custom functions to handle this setup. We use the proposed method to model interdependencies between realized volatility measures from several stock indexes.

Submitted 2 July, 2022; v1 submitted 10 June, 2022; originally announced June 2022.

Comments: 19 pages, 1 figure

arXiv:2205.10662 [pdf, other]

Equivariant Mesh Attention Networks

Authors: Sourya Basu, Jose Gallego-Posada, Francesco Viganò, James Rowbottom, Taco Cohen

Abstract: Equivariance to symmetries has proven to be a powerful inductive bias in deep learning research. Recent works on mesh processing have concentrated on various kinds of natural symmetries, including translations, rotations, scaling, node permutations, and gauge transformations. To date, no existing architecture is equivariant to all of these transformations. In this paper, we present an attention-based architecture for mesh data that is provably equivariant to all transformations mentioned above. Our pipeline relies on the use of relative tangential features: a simple, effective, equivariance-friendly alternative to raw node positions as inputs. Experiments on the FAUST and TOSCA datasets confirm that our proposed architecture achieves improved performance on these benchmarks and is indeed equivariant, and therefore robust, to a wide variety of local/global transformations.

Submitted 27 August, 2022; v1 submitted 21 May, 2022; originally announced May 2022.

Comments: Published in Transactions on Machine Learning Research (08/2022). Official code made available at https://github.com/gallego-posada/eman - For the OpenReview entry, see https://openreview.net/forum?id=3IqqJh2Ycy

arXiv:2204.12001 [pdf]

Measuring Discrepancies in Airbnb Guest Acceptance Rates Using Anonymized Demographic Data

Authors: Siddhartha Basu, Ruthie Berman, Adam Bloomston, John Campbell, Anne Diaz, Nanako Era, Benjamin Evans, Sukhada Palkar, Skyler Wharton

Abstract: In order to make technological systems and platforms more equitable, organizations must be able to measure the scale of potential inequities as well as the efficacy of proposed solutions. In this paper, we present a system that measures discrepancies in platform user experience that are attributable to perceived race (experience gaps) using anonymized data. This allows for progress to be made in this area while limiting any potential privacy risk. Specifically, the system enforces the privacy model of p-sensitive k-anonymity to conduct measurement without ever storing or having access to a 1:1 mapping between user identifiers and perceived race. We test this system in the context of the Airbnb guest booking experience. Our simulation-based power analysis shows that the system can measure the efficacy of proposed platform-wide interventions with comparable precision to non-anonymized data. Our work establishes that measurement of experience gaps with anonymized data is feasible and can be used to guide the development of policies to promote equitable outcomes for users of Airbnb as well as other technology platforms.

Submitted 25 April, 2022; originally announced April 2022.

Comments: 51 pages, 24 figures

ACM Class: E.0; H.4; J.4
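
The privacy model named in the abstract, p-sensitive k-anonymity, is commonly defined as follows: every group of records sharing the same quasi-identifier values contains at least k records and at least p distinct values of the sensitive attribute. A minimal check under that definition is sketched below. The function name, the dictionary-based record layout, and the field names in the usage note are hypothetical and do not describe Airbnb's system.

```python
from collections import defaultdict


def is_p_sensitive_k_anonymous(rows, quasi_ids, sensitive, k, p):
    """Check the property on a list of dict records.

    quasi_ids : names of the quasi-identifier fields that define each group
    sensitive : name of the sensitive field
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        groups[key].append(row[sensitive])
    # Every group needs at least k records and at least p distinct sensitive values.
    return all(len(vals) >= k and len(set(vals)) >= p for vals in groups.values())
```

For example, two records that share the same `city` value but differ in a sensitive field `label` satisfy the property for k = 2 and p = 2, but fail it for p = 2 if both carry the same label.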

arXiv:2112.15326 [pdf, other]

doi 10.1101/2021.07.08.451684

An empirical Bayes approach to estimating dynamic models of co-regulated gene expression

Authors: Sara Venkatraman, Sumanta Basu, Andrew G. Clark, Sofie Delbare, Myung Hee Lee, Martin T. Wells

Abstract: Time-course gene expression datasets provide insight into the dynamics of complex biological processes, such as immune response and organ development. It is of interest to identify genes with similar temporal expression patterns because such genes are often biologically related. However, this task is challenging due to the high dimensionality of these datasets and the nonlinearity of gene expression time dynamics. We propose an empirical Bayes approach to estimating ordinary differential equation (ODE) models of gene expression, from which we derive a similarity metric between genes called the Bayesian lead-lag $R^2$ (LLR2). Importantly, the calculation of the LLR2 leverages biological databases that document known interactions amongst genes; this information is automatically used to define informative prior distributions on the ODE model's parameters. As a result, the LLR2 is a biologically-informed metric that can be used to identify clusters or networks of functionally-related genes with co-moving or time-delayed expression patterns. We then derive data-driven shrinkage parameters from Stein's unbiased risk estimate that optimally balance the ODE model's fit to both data and external biological information. Using real gene expression data, we demonstrate that our methodology allows us to recover interpretable gene clusters and sparse networks. These results reveal new insights about the dynamics of biological systems.

Submitted 31 December, 2021; originally announced December 2021.

arXiv:2112.05041 [pdf, other]

Bayesian Functional Data Analysis over Dependent Regions and Its Application for Identification of Differentially Methylated Regions

Authors: Suvo Chatterjee, Shrabanti Chowdhury, Duchwan Ryu, Sanjib Basu

Abstract: We consider a Bayesian functional data analysis for observations measured as extremely long sequences. Splitting the sequence into a number of small windows with manageable length, the windows may not be independent especially when they are neighboring to each other. We propose to utilize Bayesian smoothing splines to estimate individual functional patterns within each window and to establish transition models for parameters involved in each window to address the dependent structure between windows. The functional difference of groups of individuals at each window can be evaluated by Bayes Factor based on Markov Chain Monte Carlo samples in the analysis. In this paper, we examine the proposed method through simulation studies and apply it to identify differentially methylated genetic regions in TCGA lung adenocarcinoma data.

Submitted 9 December, 2021; originally announced December 2021.

arXiv:2107.14754 [pdf, other]

A Survey of Estimation Methods for Sparse High-dimensional Time Series Models

Authors: Sumanta Basu, David S. Matteson

Abstract: High-dimensional time series datasets are becoming increasingly common in many areas of biological and social sciences. Some important applications include gene regulatory network reconstruction using time course gene expression data, brain connectivity analysis from neuroimaging data, structural analysis of a large panel of macroeconomic indicators, and studying linkages among financial firms for more robust financial regulation. These applications have led to renewed interest in developing principled statistical methods and theory for estimating large time series models given only a relatively small number of temporally dependent samples. Sparse modeling approaches have gained popularity over the last two decades in statistics and machine learning for their interpretability and predictive accuracy. Although there is a rich literature on several sparsity inducing methods when samples are independent, research on the statistical properties of these methods for estimating time series models is still in progress. We survey some recent advances in this area, focusing on empirically successful lasso based estimation methods for two canonical multivariate time series models - stochastic regression and vector autoregression. We discuss key technical challenges arising in high-dimensional time series analysis and outline several interesting research directions.

Submitted 30 July, 2021; originally announced July 2021.

arXiv:2103.07501 [pdf, other]

Beyond $\log^2(T)$ Regret for Decentralized Bandits in Matching Markets

Authors: Soumya Basu, Karthik Abinav Sankararaman, Abishek Sankararaman

Abstract: We design decentralized algorithms for regret minimization in the two-sided matching market with one-sided bandit feedback that significantly improve upon the prior works (Liu et al. 2020a, 2020b, Sankararaman et al. 2020). First, for general markets, for any $\varepsilon > 0$, we design an algorithm that achieves a $O(\log^{1+\varepsilon}(T))$ regret to the agent-optimal stable matching, with unknown time horizon $T$, improving upon the $O(\log^{2}(T))$ regret achieved in (Liu et al. 2020b). Second, we provide the optimal $\Theta(\log(T))$ agent-optimal regret for markets satisfying uniqueness consistency -- markets where leaving participants don't alter the original stable matching. Previously, $\Theta(\log(T))$ regret was achievable (Sankararaman et al. 2020, Liu et al. 2020b) in the much restricted serial dictatorship setting, when all arms have the same preference over the agents. We propose a phase-based algorithm where, in each phase, besides deleting the globally communicated dominated arms, the agents locally delete arms with which they collide often. This local deletion is pivotal in breaking deadlocks arising from rank heterogeneity of agents across arms. We further demonstrate the superiority of our algorithm over existing works through simulations.

Submitted 12 March, 2021; originally announced March 2021.

arXiv:2102.08554 [pdf, other]

Recoverability Landscape of Tree Structured Markov Random Fields under Symmetric Noise

Authors: Ashish Katiyar, Soumya Basu, Vatsal Shah, Constantine Caramanis

Abstract: We study the problem of learning tree-structured Markov random fields (MRF) on discrete random variables with common support when the observations are corrupted by a $k$-ary symmetric noise channel with unknown probability of error. For Ising models (support size = 2), past work has shown that graph structure can only be recovered up to the leaf clusters (a leaf node, its parent, and its siblings form a leaf cluster) and exact recovery is impossible. No prior work has addressed the setting of support size of 3 or more, and indeed this setting is far richer. As we show, when the support size is 3 or more, the structure of the leaf clusters may be partially or fully identifiable. We provide a precise characterization of this phenomenon and show that the extent of recoverability is dictated by the joint PMF of the random variables. In particular, we provide necessary and sufficient conditions for exact recoverability. Furthermore, we present a polynomial time, sample efficient algorithm that recovers the exact tree when this is possible, or up to the unidentifiability as promised by our characterization, when full recoverability is impossible. Finally, we demonstrate the efficacy of our algorithm experimentally.

Submitted 14 June, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

arXiv:2011.14066 [pdf, other]

On Generalization of Adaptive Methods for Over-parameterized Linear Regression

Authors: Vatsal Shah, Soumya Basu, Anastasios Kyrillidis, Sujay Sanghavi

Abstract: Over-parameterization and adaptive methods have played a crucial role in the success of deep learning in the last decade. The widespread use of over-parameterization has forced us to rethink generalization by bringing forth new phenomena, such as implicit regularization of optimization algorithms and double descent with training progression. A series of recent works have started to shed light on these areas in the quest to understand -- why do neural networks generalize well? The setting of over-parameterized linear regression has provided key insights into understanding this mysterious behavior of neural networks. In this paper, we aim to characterize the performance of adaptive methods in the over-parameterized linear regression setting. First, we focus on two sub-classes of adaptive methods depending on their generalization performance. For the first class of adaptive methods, the parameter vector remains in the span of the data and converges to the minimum norm solution like gradient descent (GD). On the other hand, for the second class of adaptive methods, the gradient rotation caused by the pre-conditioner matrix results in an in-span component of the parameter vector that converges to the minimum norm solution and the out-of-span component that saturates. Our experiments on over-parameterized linear regression and deep neural networks support this theory.

Submitted 27 November, 2020; originally announced November 2020.

Comments: arXiv admin note: substantial text overlap with arXiv:1811.07055
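
As a small illustration of the minimum-norm claim in the abstract above (this is not code from the paper), the sketch below runs plain gradient descent from a zero initialization on a hypothetical 2x3 over-parameterized least-squares problem and compares the result with the minimum-norm interpolant $X^\top (X X^\top)^{-1} y$. The data, step size, iteration count, and helper names are all assumptions chosen for illustration.

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    """Return the transpose of A as a list of rows."""
    return [list(col) for col in zip(*A)]


def gradient_descent(X, y, lr=0.1, steps=2000):
    """Plain GD on 0.5 * ||Xw - y||^2, started at w = 0 (assumed init)."""
    w = [0.0] * len(X[0])
    Xt = transpose(X)
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(X, w), y)]
        grad = matvec(Xt, residual)  # gradient is X^T (Xw - y)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def min_norm_solution(X, y):
    """X^T (X X^T)^{-1} y for a two-row X, using an explicit 2x2 inverse."""
    G = [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    alpha = [
        (G[1][1] * y[0] - G[0][1] * y[1]) / det,
        (-G[1][0] * y[0] + G[0][0] * y[1]) / det,
    ]
    return matvec(transpose(X), alpha)


# Hypothetical data: 2 samples, 3 parameters (more parameters than samples).
X = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
y = [1.0, 2.0]

w_gd = gradient_descent(X, y)
w_mn = min_norm_solution(X, y)
print("GD solution:      ", w_gd)
print("Min-norm solution:", w_mn)
```

Because every gradient is a linear combination of the rows of X, the iterate started at zero never leaves the span of the data, which is why the two vectors agree here. A preconditioned update would generally not have this property.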

arXiv:2010.12977 [pdf]

Effects of West Coast forest fire emissions on atmospheric environment: A coupled satellite and ground-based assessment

Authors: Srikanta Sannigrahi, Qi Zhang, Francesco Pilla, Bidroha Basu, Arunima Sarkar Basu

Abstract: Forest fires have a profound impact on the atmospheric environment and air quality across ecosystems. The recent west coast forest fire in the United States of America (USA) has broken all past records and caused severe environmental and public health burdens. As of mid-September, nearly 6 million acres of forest area had burned, and more than 25 casualties had been reported. In this study, both satellite and in-situ air pollution data were utilized to examine the effects of this unprecedented wildfire on the atmospheric environment. The spatiotemporal concentrations of a total of six air pollutants, i.e. carbon monoxide (CO), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3), particulate matter (PM2.5 and PM10), and aerosol index (AI), were measured for the periods of 15 August to 15 September for 2020 (fire year) and 2019 (reference year). The in-situ data-led measurements show that the highest increases in CO (ppm), PM2.5, and PM10 concentrations (μg/m3) were clustered around the west coastal fire-prone states during the 15 August - 15 September period. The average CO concentration (ppm) increased most significantly in Oregon (1147.10), followed by Washington (812.76) and California (13.17). Meanwhile, the concentration (μg/m3) of particulate matter (both PM2.5 and PM10) increased in all three states severely affected by wildfires. Changes (positive) in both PM2.5 and PM10 were highest in Washington (45.83 and 88.47 for PM2.5 and PM10), followed by Oregon (41.99 and 62.75 for PM2.5 and PM10) and California (31.27 and 35.04 for PM2.5 and PM10). The average level of exposure to CO, PM2.5, and PM10 was also measured for all three fire-prone states. The results of the exposure assessment revealed a strong tradeoff association between wildland fire and local/regional air quality standard.

Submitted 24 October, 2020; originally announced October 2020.

arXiv:2010.06090 [pdf, other]

A Model-free Approach for Testing Association

Authors: Saptarshi Chatterjee, Shrabanti Chowdhury, Sanjib Basu

Abstract: The question of association between outcome and feature is generally framed in the context of a model on functional and distributional forms. Our motivating application is that of identifying serum biomarkers of angiogenesis, energy metabolism, apoptosis, and inflammation, predictive of recurrence after lung resection in node-negative non-small cell lung cancer patients with tumor stage T2a or less. We propose an omnibus approach for testing association that is free of assumptions on functional forms and distributions and can be used as a black box method. This proposed maximal permutation test is based on the idea of thresholding, is readily implementable and is computationally efficient. We illustrate that the proposed omnibus tests maintain their levels and have strong power as black box tests for detecting linear, nonlinear and quantile-based associations, even with outlier-prone and heavy-tailed error distributions and under nonparametric setting. We additionally illustrate the use of this approach in model-free feature screening and further examine the level and power of these tests for binary outcome. We compare the performance of the proposed omnibus tests with comparator methods in our motivating application to identify preoperative serum biomarkers associated with non-small cell lung cancer recurrence in early stage patients.

Submitted 12 October, 2020; originally announced October 2020.

Comments: 20 pages, 7 figures

arXiv:2008.09983 [pdf, other]

doi 10.14778/3415478.3415559

Leveraging Organizational Resources to Adapt Models to New Data Modalities

Authors: Sahaana Suri, Raghuveer Chanda, Neslihan Bulut, Pradyumna Narayana, Yemao Zeng, Peter Bailis, Sugato Basu, Girija Narlikar, Christopher Re, Abishek Sethi

Abstract: As applications in large organizations evolve, the machine learning (ML) models that power them must adapt the same predictive tasks to newly arising data modalities (e.g., a new video content launch in a social media application requires existing text or image models to extend to video). To solve this problem, organizations typically create ML pipelines from scratch. However, this fails to utilize the domain expertise and data they have cultivated from developing tasks for existing modalities. We demonstrate how organizational resources, in the form of aggregate statistics, knowledge bases, and existing services that operate over related tasks, enable teams to construct a common feature space that connects new and existing data modalities. This allows teams to apply methods for training data curation (e.g., weak supervision and label propagation) and model training (e.g., forms of multi-modal learning) across these different data modalities. We study how this use of organizational resources composes at production scale in over 5 classification tasks at Google, and demonstrate how it reduces the time needed to develop models for new modalities from months to weeks to days.

Submitted 23 August, 2020; originally announced August 2020.

Journal ref: PVLDB,13(12): 3396-3410, 2020

arXiv:2008.08993 [pdf]

Effect of COVID-19 on noise pollution change in Dublin, Ireland

Authors: Bidroha Basu, Enda Murphy, Anna Molter, Arunima Sarkar Basu, Srikanta Sannigrahi, Miguel Belmonte, Francesco Pilla

Abstract: Noise pollution is considered to be the third most hazardous pollution after air and water pollution by the World Health Organization (WHO). Short as well as long-term exposure to noise pollution has several adverse effects on humans, ranging from psychiatric disorders such as anxiety and depression, hypertension, hormonal dysfunction, and blood pressure rise leading to cardiovascular disease. One of the major sources of noise pollution is road traffic. The WHO reports that around 40% of Europe's population are currently exposed to high noise levels. This study investigates noise pollution in Dublin, Ireland before and after the lockdown imposed as a result of the COVID-19 pandemic. The analysis was performed using 2020 hourly data from 12 noise monitoring stations. More than 80% of stations recorded high noise levels for more than 60% of the time before the lockdown in Dublin. However, a significant reduction in average and minimum noise levels was observed at all stations during the lockdown period and this can be attributed to reductions in both road and air traffic movements.

Submitted 20 August, 2020; originally announced August 2020.

Comments: 20 pages, 8 figures

arXiv:2007.15421 [pdf, other]

Random Forests for dependent data

Authors: Arkajyoti Saha, Sumanta Basu, Abhirup Datta

Abstract: Random forest (RF) is one of the most popular methods for estimating regression functions. The local nature of the RF algorithm, based on intra-node means and variances, is ideal when errors are i.i.d. For dependent error processes like time series and spatial settings where data in all the nodes will be correlated, operating locally ignores this dependence. Also, RF will involve resampling of correlated data, violating the principles of bootstrap. Theoretically, consistency of RF has been established for i.i.d. errors, but little is known about the case of dependent errors. We propose RF-GLS, a novel extension of RF for dependent error processes in the same way Generalized Least Squares (GLS) fundamentally extends Ordinary Least Squares (OLS) for linear models under dependence. The key to this extension is the equivalent representation of the local decision-making in a regression tree as a global OLS optimization which is then replaced with a GLS loss to create a GLS-style regression tree. This also synergistically addresses the resampling issue, as the use of GLS loss amounts to resampling uncorrelated contrasts (pre-whitened data) instead of the correlated data. For spatial settings, RF-GLS can be used in conjunction with Gaussian Process correlated errors to generate kriging predictions at new locations. RF becomes a special case of RF-GLS with an identity working covariance matrix. We establish consistency of RF-GLS under beta- (absolutely regular) mixing error processes and show that this general result subsumes important cases like autoregressive time series and spatial Matern Gaussian Processes. As a byproduct, we also establish consistency of RF for beta-mixing processes, which to our knowledge, is the first such result for RF under dependence. We empirically demonstrate the improvement achieved by RF-GLS over RF for both estimation and prediction under dependence.

Submitted 28 June, 2021; v1 submitted 30 July, 2020; originally announced July 2020.

arXiv:2007.04511 [pdf, ps, other]

Causal Effects in Twin Studies: the Role of Interference

Authors: Bonnie Smith, Elizabeth L. Ogburn, Matt McGue, Saonli Basu, Daniel O. Scharfstein

Abstract: The use of twins designs to address causal questions is becoming increasingly popular. A standard assumption is that there is no interference between twins---that is, no twin's exposure has a causal impact on their co-twin's outcome. However, there may be settings in which this assumption would not hold, and this would (1) impact the causal interpretation of parameters obtained by commonly used existing methods; (2) change which effects are of greatest interest; and (3) impact the conditions under which we may estimate these effects. We explore these issues, and we derive semi-parametric efficient estimators for causal effects in the presence of interference between twins. Using data from the Minnesota Twin Family Study, we apply our estimators to assess whether twins' consumption of alcohol in early adolescence may have a causal impact on their co-twins' substance use later in life.

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2006.15166 [pdf, other]

Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship

Authors: Abishek Sankararaman, Soumya Basu, Karthik Abinav Sankararaman

Abstract: Online learning in a two-sided matching market, with demand side agents continuously competing to be matched with supply side (arms), abstracts the complex interactions under partial information on matching platforms (e.g. UpWork, TaskRabbit). We study the decentralized serial dictatorship setting, a two-sided matching market where the demand side agents have unknown and heterogeneous valuation over the supply side (arms), while the arms have known uniform preference over the demand side (agents). We design the first decentralized algorithm -- UCB with Decentralized Dominant-arm Deletion (UCB-D3), for the agents, that does not require any knowledge of reward gaps or time horizon. UCB-D3 works in phases, where in each phase, agents delete \emph{dominated arms} -- the arms preferred by higher ranked agents, and play only from the non-dominated arms according to the UCB. At the end of the phase, agents broadcast in a decentralized fashion, their estimated preferred arms through {\em pure exploitation}. We prove both, a new regret lower bound for the decentralized serial dictatorship model, and that UCB-D3 is order optimal.

Submitted 12 March, 2021; v1 submitted 26 June, 2020; originally announced June 2020.

Comments: AISTATS, 2021

arXiv:2006.14651 [pdf, other]

Influence Functions in Deep Learning Are Fragile

Authors: Samyadeep Basu, Philip Pope, Soheil Feizi

Abstract: Influence functions approximate the effect of training samples in test-time predictions and have a wide variety of applications in machine learning interpretability and uncertainty estimation. A commonly-used (first-order) influence function can be implemented efficiently as a post-hoc method requiring access only to the gradients and Hessian of the model. For linear models, influence functions are well-defined due to the convexity of the underlying loss function and are generally accurate even across difficult settings where model changes are fairly large such as estimating group influences. Influence functions, however, are not well-understood in the context of deep learning with non-convex loss functions. In this paper, we provide a comprehensive and large-scale empirical study of successes and failures of influence functions in neural network models trained on datasets such as Iris, MNIST, CIFAR-10 and ImageNet. Through our extensive experiments, we show that the network architecture, its depth and width, as well as the extent of model parameterization and regularization techniques have strong effects in the accuracy of influence functions. In particular, we find that (i) influence estimates are fairly accurate for shallow networks, while for deeper networks the estimates are often erroneous; (ii) for certain network architectures and datasets, training with weight-decay regularization is important to get high-quality influence estimates; and (iii) the accuracy of influence estimates can vary significantly depending on the examined test points. These results suggest that in general influence functions in deep learning are fragile and call for developing improved influence estimation methods to mitigate these issues in non-convex setups.

Submitted 10 February, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

Comments: ICLR 2021

arXiv:2003.03426 [pdf, other]

Contextual Blocking Bandits

Authors: Soumya Basu, Orestis Papadigenopoulos, Constantine Caramanis, Sanjay Shakkottai

Abstract: We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards. However, playing an arm blocks it (across all contexts) for a fixed and known number of future time steps. The above contextual setting, which captures important scenarios such as recommendation systems or ad placement with diverse users, invalidates greedy solution techniques that are effective for its non-contextual counterpart (Basu et al., NeurIPS19). Assuming knowledge of the context distribution and the mean reward of each arm-context pair, we cast the problem as an online bipartite matching problem, where the right-vertices (contexts) arrive stochastically and the left-vertices (arms) are blocked for a finite number of rounds each time they are matched. This problem has been recently studied in the full-information case, where competitive ratio bounds have been derived. We focus on the bandit setting, where the reward distributions are initially unknown; we propose a UCB-based variant of the full-information algorithm that guarantees a $\mathcal{O}(\log T)$-regret w.r.t. an $α$-optimal strategy in $T$ time steps, matching the $Ω(\log(T))$ regret lower bound in this setting. Due to the time correlations caused by blocking, existing techniques for upper bounding regret fail. For proving our regret bounds, we introduce the novel concepts of delayed exploitation and opportunistic subsampling and combine them with ideas from combinatorial bandits and non-stationary Markov chains coupling.

Submitted 17 June, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

arXiv:2002.08405 [pdf, other]

On Under-exploration in Bandits with Mean Bounds from Confounded Data

Authors: Nihal Sharma, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai

Abstract: We study a variant of the multi-armed bandit problem where side information in the form of bounds on the mean of each arm is provided. We develop the novel non-optimistic Global Under-Explore (GLUE) algorithm which uses the provided mean bounds (across all the arms) to infer pseudo-variances for each arm, which in turn decide the rate of exploration for the arms. We analyze the regret of GLUE and prove regret upper bounds that are never worse than that of the standard UCB algorithm. Furthermore, we show that GLUE improves upon regret guarantees that exist in the literature for structured bandit problems (both theoretically and empirically). Finally, we study the practical setting of learning adaptive interventions using prior data that has been confounded by unrecorded variables that affect rewards. We show that mean bounds can be inferred naturally from such logs and can thus be used to improve the learning process. We validate our findings through semi-synthetic experiments on data derived from real data sets.

Submitted 10 June, 2021; v1 submitted 19 February, 2020; originally announced February 2020.

arXiv:1911.07921 [pdf, other]

Privacy Leakage Avoidance with Switching Ensembles

Authors: Rauf Izmailov, Peter Lin, Chris Mesterharm, Samyadeep Basu

Abstract: We consider membership inference attacks, one of the main privacy issues in machine learning. These recently developed attacks have been proven successful in determining, with confidence better than a random guess, whether a given sample belongs to the dataset on which the attacked machine learning model was trained. Several approaches have been developed to mitigate this privacy leakage but the tradeoff performance implications of these defensive mechanisms (i.e., accuracy and utility of the defended machine learning model) are not well studied yet. We propose a novel approach of privacy leakage avoidance with switching ensembles (PASE), which both protects against current membership inference attacks and does that with very small accuracy penalty, while requiring acceptable increase in training and inference time. We test our PASE method, along with the current state-of-the-art PATE approach, on three calibration image datasets and analyze their tradeoffs.

Submitted 18 November, 2019; originally announced November 2019.

arXiv:1911.00418 [pdf, other]

On Second-Order Group Influence Functions for Black-Box Predictions

Authors: Samyadeep Basu, Xuchen You, Soheil Feizi

Abstract: With the rapid adoption of machine learning systems in sensitive applications, there is an increasing need to make black-box models explainable. Often we want to identify an influential group of training samples in a particular test prediction for a given machine learning model. Existing influence functions tackle this problem by using first-order approximations of the effect of removing a sample from the training set on model parameters. To compute the influence of a group of training samples (rather than an individual point) in model predictions, the change in optimal model parameters after removing that group from the training set can be large. Thus, in such cases, the first-order approximation can be loose. In this paper, we address this issue and propose second-order influence functions for identifying influential groups in test-time predictions. For linear models, across different sizes and types of groups, we show that using the proposed second-order influence function improves the correlation between the computed influence values and the ground truth ones. We also show that second-order influence functions could be used with optimization techniques to improve the selection of the most influential group for a test-sample.

Submitted 6 July, 2020; v1 submitted 1 November, 2019; originally announced November 2019.

Comments: To Appear in ICML 2020

arXiv:1910.04257 [pdf, other]

Membership Model Inversion Attacks for Deep Networks

Authors: Samyadeep Basu, Rauf Izmailov, Chris Mesterharm

Abstract: With the increasing adoption of AI, inherent security and privacy vulnerabilities for machine learning systems are being discovered. One such vulnerability makes it possible for an adversary to obtain private information about the types of instances used to train the targeted machine learning model. This so-called model inversion attack is based on sequential leveraging of classification scores towards obtaining high confidence representations for various classes. However, for deep networks, such procedures usually lead to unrecognizable representations that are useless for the adversary. In this paper, we introduce a more realistic definition of model inversion, where the adversary is aware of the general purpose of the attacked model (for instance, whether it is an OCR system or a facial recognition system), and the goal is to find realistic class representations within the corresponding lower-dimensional manifold (of, respectively, general symbols or general faces). To that end, we leverage properties of generative adversarial networks for constructing a connected lower-dimensional manifold, and demonstrate the efficiency of our model inversion attack that is carried out within that manifold.

Submitted 9 October, 2019; originally announced October 2019.

Comments: NeurIPS 2019, Workshop on Privacy in Machine Learning

arXiv:1910.03225 [pdf, other]

NGBoost: Natural Gradient Boosting for Probabilistic Prediction

Authors: Tony Duan, Anand Avati, Daisy Yi Ding, Khanh K. Thai, Sanjay Basu, Andrew Y. Ng, Alejandro Schuler

Abstract: We present Natural Gradient Boosting (NGBoost), an algorithm for generic probabilistic prediction via gradient boosting. Typical regression models return a point estimate, conditional on covariates, but probabilistic regression models output a full probability distribution over the outcome space, conditional on the covariates. This allows for predictive uncertainty estimation -- crucial in applications like healthcare and weather forecasting. NGBoost generalizes gradient boosting to probabilistic regression by treating the parameters of the conditional distribution as targets for a multiparameter boosting algorithm. Furthermore, we show how the Natural Gradient is required to correct the training dynamics of our multiparameter boosting approach. NGBoost can be used with any base learner, any family of distributions with continuous parameters, and any scoring rule. NGBoost matches or exceeds the performance of existing methods for probabilistic prediction while offering additional benefits in flexibility, scalability, and usability. An open-source implementation is available at github.com/stanfordmlgroup/ngboost.

Submitted 9 June, 2020; v1 submitted 8 October, 2019; originally announced October 2019.

Comments: Accepted for ICML 2020

arXiv:1907.11975 [pdf, other]

Blocking Bandits

Authors: Soumya Basu, Rajat Sen, Sujay Sanghavi, Sanjay Shakkottai

Abstract: We consider a novel stochastic multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter. This models situations where reusing an arm too often is undesirable (e.g. making the same product recommendation repeatedly) or infeasible (e.g. compute job scheduling on machines). We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm (in the number of arms) unless randomized exponential time hypothesis is false, by mapping to the PINWHEEL scheduling problem. Subsequently, we show that a simple greedy algorithm that plays the available arm with the highest reward is asymptotically $(1-1/e)$ optimal. When the rewards are unknown, we design a UCB based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm, leveraging the free exploration of arms due to the unavailability. Finally, when all the delays are equal the problem reduces to Combinatorial Semi-bandits providing us with a lower bound of $c' \log T+ ω(\log T)$.

Submitted 27 July, 2019; originally announced July 2019.
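
The greedy rule described in the abstract above ("play the available arm with the highest reward") can be sketched as follows. This is a minimal illustration with known, deterministic mean rewards, not the authors' implementation. The function name, the example rewards and delays, and the convention that an arm played at slot `t` with delay `d` is next available at slot `t + d + 1` are assumptions made for this sketch.

```python
def greedy_blocking_schedule(rewards, delays, horizon):
    """Return the arm played in each slot (None if every arm is blocked).

    rewards[i]: known mean reward of arm i (assumed given).
    delays[i]:  number of slots arm i stays unavailable after being played
                (assumed convention for this sketch).
    """
    next_free = [0] * len(rewards)  # earliest slot at which each arm is playable
    schedule = []
    for t in range(horizon):
        available = [i for i in range(len(rewards)) if next_free[i] <= t]
        if not available:
            schedule.append(None)
            continue
        # Greedy choice: highest-reward arm among those currently available.
        best = max(available, key=lambda i: rewards[i])
        schedule.append(best)
        next_free[best] = t + delays[best] + 1
    return schedule


# Hypothetical instance: three arms with decreasing rewards and delays.
rewards = [1.0, 0.6, 0.3]
delays = [2, 1, 0]
schedule = greedy_blocking_schedule(rewards, delays, horizon=6)
total_reward = sum(rewards[i] for i in schedule if i is not None)
print(schedule, total_reward)
```

On this instance the best arm is blocked for two slots after each play, so the schedule cycles through the lower-reward arms while it is unavailable. In the unknown-reward setting the abstract describes, the `rewards` list would be replaced by upper-confidence estimates.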

arXiv:1906.10845 [pdf, other]

A Debiased MDI Feature Importance Measure for Random Forests

Authors: Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, Bin Yu

Abstract: Tree ensembles such as Random Forests have achieved impressive empirical success across a wide variety of applications. To understand how these models make predictions, people routinely turn to feature importance measures calculated from tree ensembles. It has long been known that Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance, incorrectly assigns high importance to noisy features, leading to systematic bias in feature selection. In this paper, we address the feature selection bias of MDI from both theoretical and methodological perspectives. Based on the original definition of MDI by Breiman et al. for a single tree, we derive a tight non-asymptotic bound on the expected bias of MDI importance of noisy features, showing that deep trees have higher (expected) feature selection bias than shallow ones. However, it is not clear how to reduce the bias of MDI using its existing analytical expression. We derive a new analytical expression for MDI, and based on this new expression, we are able to propose a debiased MDI feature importance measure using out-of-bag samples, called MDI-oob. For both the simulated data and a genomic ChIP dataset, MDI-oob achieves state-of-the-art performance in feature selection from Random Forests for both deep and shallow trees.

Submitted 26 October, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

Comments: NeurIPS'19. The first two authors contributed equally to this paper

arXiv:1906.06057 [pdf, ps, other]

Learning Mixtures of Graphs from Epidemic Cascades

Authors: Jessica Hoffmann, Soumya Basu, Surbhi Goel, Constantine Caramanis

Abstract: We consider the problem of learning the weighted edges of a balanced mixture of two undirected graphs from epidemic cascades. While mixture models are popular modeling tools, algorithmic development with rigorous guarantees has lagged. Graph mixtures are apparently no exception: until now, very little is known about whether this problem is solvable. To the best of our knowledge, we establish the first necessary and sufficient conditions for this problem to be solvable in polynomial time on edge-separated graphs. When the conditions are met, i.e., when the graphs are connected with at least three edges, we give an efficient algorithm for learning the weights of both graphs with optimal sample complexity (up to log factors). We give complementary results and provide sample-optimal (up to log factors) algorithms for mixtures of directed graphs of out-degree at least three, and for mixtures of undirected graphs with unbalanced and/or unknown priors.

Submitted 29 January, 2020; v1 submitted 14 June, 2019; originally announced June 2019.

Comments: 29 pages

arXiv:1904.10689 [pdf, other]

Layer Dynamics of Linearised Neural Nets

Authors: Saurav Basu, Koyel Mukherjee, Shrihari Vasudevan

Abstract: Despite the phenomenal success of deep learning in recent years, there remains a gap in understanding the fundamental mechanics of neural nets. More research is focussed on handcrafting complex and larger networks, and the design decisions are often ad-hoc and based on intuition. Some recent research has aimed to demystify the learning dynamics in neural nets by attempting to build a theory from first principles, such as characterising the non-linear dynamics of specialised \textit{linear} deep neural nets (such as orthogonal networks). In this work, we expand and derive properties of learning dynamics respected by general multi-layer linear neural nets. Although an over-parameterisation of a single layer linear network, linear multi-layer neural nets offer interesting insights that explain how learning dynamics proceed in small pockets of the data space. We show in particular that multiple layers in linear nets grow at approximately the same rate, and there are distinct phases of learning with markedly different layer growth. We then apply a linearisation process to a general ReLU neural net and show how nonlinearity breaks down the growth symmetry observed in linear neural nets. Overall, our work can be viewed as an initial step in building a theory for understanding the effect of layer design on the learning dynamics from first principles.

Submitted 24 April, 2019; originally announced April 2019.

arXiv:1901.10061 [pdf, other]

A Framework for Deep Constrained Clustering -- Algorithms and Advances

Authors: Hong**g Zhang, Sugato Basu, Ian Davidson

Abstract: The area of constrained clustering has been extensively explored by researchers and used by practitioners. Constrained clustering formulations exist for popular algorithms such as k-means, mixture models, and spectral clustering but have several limitations. A fundamental strength of deep learning is its flexibility, and here we explore a deep learning framework for constrained clustering and in particular explore how it can extend the field of constrained clustering. We show that our framework can not only handle standard together/apart constraints (without the well documented negative effects reported earlier) generated from labeled side information but also more complex constraints generated from new types of side information such as continuous values and high-level domain knowledge.

Submitted 19 December, 2019; v1 submitted 28 January, 2019; originally announced January 2019.

Comments: Updated for ECML/PKDD 2019

arXiv:1812.03568 [pdf, other]

doi 10.1109/TSP.2018.2887401

Low Rank and Structured Modeling of High-dimensional Vector Autoregressions

Authors: Sumanta Basu, Xianqi Li, George Michailidis

Abstract: Network modeling of high-dimensional time series data is a key learning task due to its widespread use in a number of application areas, including macroeconomics, finance and neuroscience. While the problem of sparse modeling based on vector autoregressive models (VAR) has been investigated in depth in the literature, more complex network structures that involve low rank and group sparse components have received considerably less attention, despite their presence in data. Failure to account for low-rank structures results in spurious connectivity among the observed time series, which may lead practitioners to draw incorrect conclusions about pertinent scientific or policy questions. In order to accurately estimate a network of Granger causal interactions after accounting for latent effects, we introduce a novel approach for estimating low-rank and structured sparse high-dimensional VAR models. We introduce a regularized framework involving a combination of nuclear norm and lasso (or group lasso) penalty. We subsequently establish non-asymptotic upper bounds on the estimation error rates of the low-rank and the structured sparse components. We also introduce a fast estimation algorithm and finally demonstrate the performance of the proposed modeling framework over standard sparse VAR estimates through numerical experiments on synthetic and real data.

Submitted 9 December, 2018; originally announced December 2018.

arXiv:1812.00532 [pdf, other]

Large Spectral Density Matrix Estimation by Thresholding

Authors: Yiming Sun, Yige Li, Amy Kuceyeski, Sumanta Basu

Abstract: Spectral density matrix estimation of multivariate time series is a classical problem in time series and signal processing. In modern neuroscience, spectral density based metrics are commonly used for analyzing functional connectivity among brain regions. In this paper, we develop a non-asymptotic theory for regularized estimation of high-dimensional spectral density matrices of Gaussian and linear processes using thresholded versions of averaged periodograms. Our theoretical analysis ensures that consistent estimation of spectral density matrix of a $p$-dimensional time series using $n$ samples is possible under high-dimensional regime $\log p / n \rightarrow 0$ as long as the true spectral density is approximately sparse. A key technical component of our analysis is a new concentration inequality of average periodogram around its expectation, which is of independent interest. Our estimation consistency results complement existing results for shrinkage based estimators of multivariate spectral density, which require no assumption on sparsity but only ensure consistent estimation in a regime $p^2/n \rightarrow 0$. In addition, our proposed thresholding based estimators perform consistent and automatic edge selection when learning coherence networks among the components of a multivariate time series. We demonstrate the advantage of our estimators using simulation studies and a real data application on functional connectivity analysis with fMRI data.

Submitted 2 December, 2018; originally announced December 2018.

arXiv:1810.08223 [pdf, other]

Micro-Browsing Models for Search Snippets

Authors: Muhammad Asiful Islam, Ramakrishnan Srikant, Sugato Basu

Abstract: Click-through rate (CTR) is a key signal of relevance for search engine results, both organic and sponsored. CTR of a result has two core components: (a) the probability of examination of a result by a user, and (b) the perceived relevance of the result given that it has been examined by the user. There has been considerable work on user browsing models, to model and analyze both the examination and the relevance components of CTR. In this paper, we propose a novel formulation: a micro-browsing model for how users read result snippets. The snippet text of a result often plays a critical role in the perceived relevance of the result. We study how particular words within a line of a snippet can influence user behavior. We validate this new micro-browsing user model by considering the problem of predicting which snippet will yield higher CTR, and show that classification accuracy is dramatically higher with our micro-browsing user model. The key insight in this paper is that varying relatively few words within a snippet, and even their location within a snippet, can have a significant influence on the clickthrough of a snippet.

Submitted 18 October, 2018; originally announced October 2018.

arXiv:1810.07287 [pdf, other]

Signed iterative random forests to identify enhancer-associated transcription factor binding

Authors: Karl Kumbier, Sumanta Basu, Erwin Frise, Susan E. Celniker, James B. Brown, Susan Celniker, Bin Yu

Abstract: Standard ChIP-seq peak calling pipelines seek to differentiate biochemically reproducible signals of individual genomic elements from background noise. However, reproducibility alone does not imply functional regulation (e.g., enhancer activation, alternative splicing). Here we present a general-purpose, interpretable machine learning method: signed iterative random forests (siRF), which we use to infer regulatory interactions among transcription factors and functional binding signatures surrounding enhancer elements in Drosophila melanogaster.

Submitted 12 July, 2023; v1 submitted 16 October, 2018; originally announced October 2018.

arXiv:1808.09521 [pdf, other]

Bounds on the conditional and average treatment effect with unobserved confounding factors

Authors: Steve Yadlowsky, Hongseok Namkoong, Sanjay Basu, John Duchi, Lu Tian

Abstract: For observational studies, we study the sensitivity of causal inference when treatment assignments may depend on unobserved confounders. We develop a loss minimization approach for estimating bounds on the conditional average treatment effect (CATE) when unobserved confounders have a bounded effect on the odds ratio of treatment selection. Our approach is scalable and allows flexible use of model classes in estimation, including nonparametric and black-box machine learning methods. Based on these bounds for the CATE, we propose a sensitivity analysis for the average treatment effect (ATE). Our semi-parametric estimator extends/bounds the augmented inverse propensity weighted (AIPW) estimator for the ATE under bounded unobserved confounding. By constructing a Neyman orthogonal score, our estimator of the bound for the ATE is a regular root-$n$ estimator so long as the nuisance parameters are estimated at the $o_p(n^{-1/4})$ rate. We complement our methodology with optimality results showing that our proposed bounds are tight in certain cases. We demonstrate our method on simulated and real data examples, and show accurate coverage of our confidence intervals in practical finite sample regimes with rich covariate information.

Submitted 9 March, 2022; v1 submitted 28 August, 2018; originally announced August 2018.

arXiv:1806.08819 [pdf, other]

doi 10.1017/dmp.2019.73

Forecasting Internally Displaced Population Migration Patterns in Syria and Yemen

Authors: Benjamin Q. Huynh, Sanjay Basu

Abstract: Armed conflict has led to an unprecedented number of internally displaced persons (IDPs) - individuals who are forced out of their homes but remain within their country. IDPs often urgently require shelter, food, and healthcare, yet prediction of when large fluxes of IDPs will cross into an area remains a major challenge for aid delivery organizations. Accurate forecasting of IDP migration would empower humanitarian aid groups to more effectively allocate resources during conflicts. We show that monthly flow of IDPs from province to province in both Syria and Yemen can be accurately forecasted one month in advance, using publicly available data. We model monthly IDP flow using food price, fuel price, wage, geospatial, and news data. We find that machine learning approaches can more accurately forecast migration trends than baseline persistence models. Our findings thus potentially enable proactive aid allocation for IDPs in anticipation of forecasted arrivals.

Submitted 22 June, 2018; originally announced June 2018.

arXiv:1804.08472 [pdf, other]

doi 10.1142/S2010139220500172

High-Dimensional Estimation, Basis Assets, and the Adaptive Multi-Factor Model

Authors: Liao Zhu, Sumanta Basu, Robert A. Jarrow, Martin T. Wells

Abstract: The paper proposes a new algorithm for high-dimensional financial data -- the Groupwise Interpretable Basis Selection (GIBS) algorithm -- to estimate a new Adaptive Multi-Factor (AMF) asset pricing model, implied by the recently developed Generalized Arbitrage Pricing Theory, which relaxes the convention that the number of risk-factors is small. We first obtain an adaptive collection of basis assets and then simultaneously test which basis assets correspond to which securities, using high-dimensional methods. The AMF model, along with the GIBS algorithm, is shown to have significantly better fitting and prediction power than the Fama-French 5-factor model.

Submitted 10 December, 2021; v1 submitted 23 April, 2018; originally announced April 2018.

Journal ref: The Quarterly Journal of Finance. Vol. 10, No. 04, 2050017 (2020)

arXiv:1802.01141 [pdf, other]

doi 10.1038/s41598-023-35379-y

Simultaneous Selection of Multiple Important Single Nucleotide Polymorphisms in Familial Genome Wide Association Studies Data

Authors: Subhabrata Majumdar, Saonli Basu, Matt McGue, Snigdhansu Chatterjee

Abstract: We propose a resampling-based fast variable selection technique for detecting relevant single nucleotide polymorphisms (SNP) in a multi-marker mixed effect model. Due to computational complexity, current practice primarily involves testing the effect of one SNP at a time, commonly termed 'single SNP association analysis'. Joint modeling of genetic variants within a gene or pathway may have better power to detect associated genetic variants, especially the ones with weak effects. In this paper, we propose a computationally efficient model selection approach -- based on the e-values framework -- for single SNP detection in families while utilizing information on multiple SNPs simultaneously. To overcome the computational bottleneck of traditional model selection methods, our method trains one single model, and utilizes a fast and scalable bootstrap procedure. We illustrate through numerical studies that our proposed method is more effective in detecting SNPs associated with a trait than either single-marker analysis using family data or model selection methods that ignore the familial dependency structure. Further, we perform gene-level analysis in the Minnesota Center for Twin and Family Research (MCTFR) dataset using our method to detect several SNPs that have been implicated to be associated with alcohol consumption.

Submitted 20 May, 2023; v1 submitted 4 February, 2018; originally announced February 2018.

Comments: Published in Scientific Reports

arXiv:1711.03623 [pdf, other]

Interpretable Vector AutoRegressions with Exogenous Time Series

Authors: Ines Wilms, Sumanta Basu, Jacob Bien, David S. Matteson

Abstract: The Vector AutoRegressive (VAR) model is fundamental to the study of multivariate time series. Although VAR models are intensively investigated by many researchers, practitioners often show more interest in analyzing VARX models that incorporate the impact of unmodeled exogenous variables (X) into the VAR. However, since the parameter space grows quadratically with the number of time series, estimation quickly becomes challenging. While several proposals have been made to sparsely estimate large VAR models, the estimation of large VARX models is under-explored. Moreover, typically these sparse proposals involve a lasso-type penalty and do not incorporate lag selection into the estimation procedure. As a consequence, the resulting models may be difficult to interpret. In this paper, we propose a lag-based hierarchically sparse estimator, called "HVARX", for large VARX models. We illustrate the usefulness of HVARX on a cross-category management marketing application. Our results show how it provides a highly interpretable model, and improves out-of-sample forecast accuracy compared to a lasso-type approach.

Submitted 9 November, 2017; originally announced November 2017.

Comments: Presented at NIPS 2017 Symposium on Interpretable Machine Learning

arXiv:1710.09326 [pdf, other]

A Robust and Unified Framework for Estimating Heritability in Twin Studies using Generalized Estimating Equations

Authors: Jaron Arbet, Matt McGue, Saonli Basu

Abstract: The development of a complex disease is an intricate interplay of genetic and environmental factors. "Heritability" is defined as the proportion of total trait variance due to genetic factors within a given population. Studies with monozygotic (MZ) and dizygotic (DZ) twins allow us to estimate heritability by fitting an "ACE" model which estimates the proportion of trait variance explained by additive genetic (A), common shared environment (C), and unique non-shared environmental (E) latent effects, thus helping us better understand disease risk and etiology. In this paper, we develop a flexible generalized estimating equations framework ("GEE2") for fitting twin ACE models that requires minimal distributional assumptions; rather, only the first two moments need to be correctly specified. We prove that two commonly used methods for estimating heritability, the normal ACE model ("NACE") and Falconer's method, can both be fit within this unified GEE2 framework, which additionally provides robust standard errors. Although the traditional Falconer's method cannot directly adjust for covariates, we show that the corresponding GEE2 version ("GEE2-Falconer") can incorporate covariate effects for both mean and variance-level parameters (e.g. let heritability vary by sex or age). Given non-normal data, we show that the GEE2 models attain significantly better coverage of the true heritability compared to the traditional NACE and Falconer's methods. Finally, we demonstrate an important scenario where the NACE model produces biased estimates of heritability while Falconer's method remains unbiased. Overall, we recommend using the robust and flexible GEE2-Falconer model for estimating heritability in twin studies.

Submitted 17 October, 2018; v1 submitted 25 October, 2017; originally announced October 2017.

arXiv:1707.09208 [pdf, other]

Sparse Identification and Estimation of Large-Scale Vector AutoRegressive Moving Averages

Authors: Ines Wilms, Sumanta Basu, Jacob Bien, David S. Matteson

Abstract: The Vector AutoRegressive Moving Average (VARMA) model is fundamental to the theory of multivariate time series; however, identifiability issues have led practitioners to abandon it in favor of the simpler but more restrictive Vector AutoRegressive (VAR) model. We narrow this gap with a new optimization-based approach to VARMA identification built upon the principle of parsimony. Among all equivalent data-generating models, we use convex optimization to seek the parameterization that is "simplest" in a certain sense. A user-specified strongly convex penalty is used to measure model simplicity, and that same penalty is then used to define an estimator that can be efficiently computed. We establish consistency of our estimators in a double-asymptotic regime. Our non-asymptotic error bound analysis accommodates both model specification and parameter estimation steps, a feature that is crucial for studying large-scale VARMA algorithms. Our analysis also provides new results on penalized estimation of infinite-order VAR, and elastic net regression under a singular covariance structure of regressors, which may be of independent interest. We illustrate the advantage of our method over VAR alternatives on three real data examples.

Submitted 8 June, 2021; v1 submitted 28 July, 2017; originally announced July 2017.

arXiv:1706.08457 [pdf, other]

doi 10.1073/pnas.1711236115

Iterative Random Forests to detect predictive and stable high-order interactions

Authors: Sumanta Basu, Karl Kumbier, James B. Brown, Bin Yu

Abstract: Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.

Submitted 23 December, 2017; v1 submitted 26 June, 2017; originally announced June 2017.

arXiv:1605.02699 [pdf, other]

A Theoretical Analysis of Deep Neural Networks for Texture Classification

Authors: Saikat Basu, Manohar Karki, Robert DiBiano, Supratik Mukhopadhyay, Sangram Ganguly, Ramakrishna Nemani, Shreekant Gayaka

Abstract: We investigate the use of Deep Neural Networks for the classification of image datasets where texture features are important for generating class-conditional discriminative representations. To this end, we first derive the size of the feature space for some standard textural features extracted from the input dataset and then use the theory of Vapnik-Chervonenkis dimension to show that hand-crafted feature extraction creates low-dimensional representations which help in reducing the overall excess error rate. As a corollary to this analysis, we derive for the first time upper bounds on the VC dimension of Convolutional Neural Network as well as Dropout and Dropconnect networks and the relation between excess error rate of Dropout and Dropconnect networks. The concept of intrinsic dimension is used to validate the intuition that texture-based datasets are inherently higher dimensional as compared to handwritten digits or other object recognition datasets and hence more difficult to be shattered by neural networks. We then derive the mean distance from the centroid to the nearest and farthest sampling points in an n-dimensional manifold and show that the Relative Contrast of the sample data vanishes as dimensionality of the underlying vector space tends to infinity.

Submitted 21 June, 2016; v1 submitted 9 May, 2016; originally announced May 2016.

Comments: Accepted in International Joint Conference on Neural Networks, IJCNN 2016

arXiv:1601.00736 [pdf, other]

Penalized Maximum Likelihood Estimation of Multi-layered Gaussian Graphical Models

Authors: Jiahe Lin, Sumanta Basu, Moulinath Banerjee, George Michailidis

Abstract: Analyzing multi-layered graphical models provides insight into understanding the conditional relationships among nodes within layers after adjusting for and quantifying the effects of nodes from other layers. We obtain the penalized maximum likelihood estimator for Gaussian multi-layered graphical models, based on a computational approach involving screening of variables, iterative estimation of the directed edges between layers and undirected edges within layers and a final refitting and stability selection step that provides improved performance in finite sample settings. We establish the consistency of the estimator in a high-dimensional setting. To obtain this result, we develop a strategy that leverages the biconvexity of the likelihood function to ensure convergence of the developed iterative algorithm to a stationary point, as well as careful uniform error control of the estimates over iterations. The performance of the maximum likelihood estimator is illustrated on synthetic data.

Submitted 5 January, 2016; originally announced January 2016.

Showing 1–50 of 53 results for author: Basu, S