-
Functional Time Transformation Model with Applications to Digital Health
Authors:
Rahul Ghosal,
Marcos Matabuena,
Sujit K. Ghosh
Abstract:
The advent of wearable and sensor technologies now leads to functional predictors which are intrinsically infinite dimensional. While the existing approaches for functional data and survival outcomes lean on the well-established Cox model, the proportional hazards (PH) assumption might not always be suitable in real-world applications. Motivated by physiological signals encountered in digital medicine, we develop a more general and flexible functional time-transformation model for estimating the conditional survival function with both functional and scalar covariates. A partially functional regression model is used to directly model the survival time on the covariates through an unknown monotone transformation and a known error distribution. We use Bernstein polynomials to model the monotone transformation function and the smooth functional coefficients. A sieve method of maximum likelihood is employed for estimation. Numerical simulations illustrate a satisfactory performance of the proposed method in estimation and inference. We demonstrate the application of the proposed model through two case studies involving wearable data: i) understanding the association between diurnal physical activity pattern and all-cause mortality based on accelerometer data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014 and ii) modelling time-to-hypoglycemia events in a cohort of diabetic patients based on distributional representation of continuous glucose monitoring (CGM) data. The results provide important epidemiological insights into the direct association between survival times and the physiological signals and also exhibit superior predictive performance compared to traditional summary-based biomarkers in the CGM study.
Submitted 28 June, 2024;
originally announced June 2024.
-
Lossless Image Compression Using Multi-level Dictionaries: Binary Images
Authors:
Samar Agnihotri,
Renu Rameshan,
Ritwik Ghosal
Abstract:
Lossless image compression is required in various applications to reduce storage or transmission costs of images, while requiring the reconstructed images to have zero information loss compared to the original. Existing lossless image compression methods either have simple design but poor compression performance, or complex design, better performance, but with no performance guarantees. In our endeavor to develop a lossless image compression method with low complexity and guaranteed performance, we argue that compressibility of a color image is essentially derived from the patterns in its spatial structure, intensity variations, and color variations. Thus, we divide the overall design of a lossless image compression scheme into three parts that exploit corresponding redundancies. We further argue that the binarized version of an image captures its fundamental spatial structure and in this work, we propose a scheme for lossless compression of binary images.
The proposed scheme first learns dictionaries of $16\times16$, $8\times8$, $4\times4$, and $2\times 2$ square pixel patterns from various datasets of binary images. It then uses these dictionaries to encode binary images. These dictionaries have various interesting properties that are further exploited to construct an efficient scheme. Our preliminary results show that the proposed scheme consistently outperforms existing conventional and learning based lossless compression approaches, and provides, on average, as much as $1.5\times$ better performance than a common general purpose lossless compression scheme (WebP), more than $3\times$ better performance than a state of the art learning based scheme, and better performance than a specialized scheme for binary image compression (JBIG2).
Submitted 5 June, 2024;
originally announced June 2024.
-
Conformal uncertainty quantification using kernel depth measures in separable Hilbert spaces
Authors:
Marcos Matabuena,
Rahul Ghosal,
Pavlo Mozharovskyi,
Oscar Hernan Madrid Padilla,
Jukka-Pekka Onnela
Abstract:
Depth measures have gained popularity in the statistical literature for defining level sets in complex data structures like multivariate data, functional data, and graphs. Despite their versatility, integrating depth measures into regression modeling for establishing prediction regions remains underexplored. To address this gap, we propose a novel method utilizing a model-free uncertainty quantification algorithm based on conditional depth measures and conditional kernel mean embeddings. This enables the creation of tailored prediction and tolerance regions in regression models handling complex statistical responses and predictors in separable Hilbert spaces. Our focus in this paper is exclusively on examples where the response is a functional data object. To enhance practicality, we introduce a conformal prediction algorithm, providing non-asymptotic guarantees in the derived prediction region. Additionally, we establish both conditional and unconditional consistency results and fast convergence rates in some special homoscedastic cases. We evaluate the model's finite-sample performance in extensive simulation studies with different functional objects such as probability distributions and functional data. Finally, we apply the approach in a digital health application related to physical activity, aiming to offer personalized recommendations in the U.S. population based on individuals' characteristics.
Submitted 22 May, 2024;
originally announced May 2024.
-
Generative AI-Based Text Generation Methods Using Pre-Trained GPT-2 Model
Authors:
Rohit Pandey,
Hetvi Waghela,
Sneha Rakshit,
Aparna Rangari,
Anjali Singh,
Rahul Kumar,
Ratnadeep Ghosal,
Jaydip Sen
Abstract:
This work delved into the realm of automatic text generation, exploring a variety of techniques ranging from traditional deterministic approaches to more modern stochastic methods. Through analysis of greedy search, beam search, top-k sampling, top-p sampling, contrastive searching, and locally typical searching, this work has provided valuable insights into the strengths, weaknesses, and potential applications of each method. Each text-generating method is evaluated using several standard metrics and a comparative study has been made on the performance of the approaches. Finally, some future directions of research in the field of automatic text generation are also identified.
Submitted 2 April, 2024;
originally announced April 2024.
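As a rough illustration of three of the decoding strategies named in this abstract (greedy search, top-k sampling, and top-p sampling), the following sketch operates on a toy next-token probability vector. It is not the authors' code and does not call GPT-2; the function names `top_k_filter` and `top_p_filter` and the example probabilities are assumptions made for illustration.

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize the rest to zero."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]          # indices of the k largest probabilities
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose total mass reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]             # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # tokens needed to reach mass p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

# Toy next-token distribution over a four-token vocabulary (hypothetical values).
next_token_probs = [0.5, 0.3, 0.15, 0.05]

# Greedy search is deterministic: always pick the most probable token.
greedy_token = int(np.argmax(next_token_probs))

# The stochastic methods sample from the filtered, renormalized distribution.
rng = np.random.default_rng(0)
sampled_top_k = rng.choice(4, p=top_k_filter(next_token_probs, k=2))
sampled_top_p = rng.choice(4, p=top_p_filter(next_token_probs, p=0.7))
```

Top-k fixes the number of candidate tokens, whereas top-p lets that number vary with how concentrated the distribution is.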
-
Deep Learning Framework with Uncertainty Quantification for Survey Data: Assessing and Predicting Diabetes Mellitus Risk in the American Population
Authors:
Marcos Matabuena,
Juan C. Vidal,
Rahul Ghosal,
Jukka-Pekka Onnela
Abstract:
Complex survey designs are commonly employed in many medical cohorts. In such scenarios, developing case-specific predictive risk score models that reflect the unique characteristics of the study design is essential. This approach is key to minimizing potential selective biases in results. The objectives of this paper are: (i) To propose a general predictive framework for regression and classification using neural network (NN) modeling, which incorporates survey weights into the estimation process; (ii) To introduce an uncertainty quantification algorithm for model prediction, tailored for data from complex survey designs; (iii) To apply this method in developing robust risk score models to assess the risk of Diabetes Mellitus in the US population, utilizing data from the NHANES 2011-2014 cohort. The theoretical properties of our estimators are designed to ensure minimal bias and statistical consistency, thereby ensuring that our models yield reliable predictions and contribute novel scientific insights in diabetes research. While focused on diabetes, this NN predictive framework is adaptable to create clinical models for a diverse range of diseases and medical cohorts. The software and the data used in this paper are publicly available on GitHub.
Submitted 28 March, 2024;
originally announced March 2024.
-
Personalized Imputation in metric spaces via conformal prediction: Applications in Predicting Diabetes Development with Continuous Glucose Monitoring Information
Authors:
Marcos Matabuena,
Carla Díaz-Louzao,
Rahul Ghosal,
Francisco Gude
Abstract:
The challenge of handling missing data is widespread in modern data analysis, particularly during the preprocessing phase and in various inferential modeling tasks. Although numerous algorithms exist for imputing missing data, the assessment of imputation quality at the patient level often lacks personalized statistical approaches. Moreover, there is a scarcity of imputation methods for metric space based statistical objects. The aim of this paper is to introduce a novel two-step framework that comprises: (i) an imputation method for statistical objects taking values in metric spaces, and (ii) a criterion for personalizing imputation using conformal inference techniques. This work is motivated by the need to impute distributional functional representations of continuous glucose monitoring (CGM) data within the context of a longitudinal study on diabetes, where a significant fraction of patients do not have available CGM profiles. The importance of these methods is illustrated by evaluating the effectiveness of CGM data as new digital biomarkers to predict the time to diabetes onset in healthy populations. To address these scientific challenges, we propose: (i) a new regression algorithm for missing responses; (ii) novel conformal prediction algorithms tailored for metric spaces with a focus on density responses within the 2-Wasserstein geometry; (iii) a broadly applicable personalized imputation method criterion, designed to enhance both of the aforementioned strategies, yet valid across any statistical model and data structure. Our findings reveal that incorporating CGM data into diabetes time-to-event analysis, augmented with a novel personalization phase of imputation, significantly enhances predictive accuracy by over ten percent compared to traditional predictive models for time to diabetes.
Submitted 26 March, 2024;
originally announced March 2024.
-
A Generalized Acquisition Function for Preference-based Reward Learning
Authors:
Evan Ellis,
Gaurav R. Ghosal,
Stuart J. Russell,
Anca Dragan,
Erdem Bıyık
Abstract:
Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task. Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency. The information gain criterion focuses on precisely identifying all parameters of the reward function. This can potentially be wasteful as many parameters may result in the same reward, and many rewards may result in the same behavior in the downstream tasks. Instead, we show that it is possible to optimize for learning the reward function up to a behavioral equivalence class, such as inducing the same ranking over behaviors, distribution over choices, or other related definitions of what makes two rewards similar. We introduce a tractable framework that can capture such definitions of similarity. Our experiments in a synthetic environment, an assistive robotics environment with domain transfer, and a natural language processing problem with real datasets demonstrate the superior performance of our querying method over the state-of-the-art information gain method.
Submitted 9 March, 2024;
originally announced March 2024.
-
Multivariate Scalar on Multidimensional Distribution Regression
Authors:
Rahul Ghosal,
Marcos Matabuena
Abstract:
We develop a new method for multivariate scalar on multidimensional distribution regression. Traditional approaches typically analyze isolated univariate scalar outcomes or consider unidimensional distributional representations as predictors. However, these approaches are sub-optimal because: i) they fail to utilize the dependence between the distributional predictors; ii) they neglect the correlation structure of the response. To overcome these limitations, we propose a multivariate distributional analysis framework that harnesses the power of multivariate density functions and multitask learning. We develop a computationally efficient semiparametric estimation method for modelling the effect of the latent joint density on the multivariate response of interest. Additionally, we introduce a new conformal algorithm for quantifying the uncertainty of regression models with multivariate responses and distributional predictors, providing valuable insights into the conditional distribution of the response. We have validated the effectiveness of our proposed method through comprehensive numerical simulations, clearly demonstrating its superior performance compared to traditional methods. The application of the proposed method is demonstrated on tri-axial accelerometer data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014 for modelling the association between cognitive scores across various domains and distributional representation of physical activity among the older adult population. Our results highlight the advantages of the proposed approach, emphasizing the significance of incorporating complete spatial information derived from the accelerometer device.
Submitted 16 October, 2023;
originally announced October 2023.
-
Functional Principal Component Analysis for Continuous non-Gaussian, Truncated, and Discrete Functional Data
Authors:
Debangan Dey,
Rahul Ghosal,
Kathleen Merikangas,
Vadim Zipunnikov
Abstract:
Mobile health studies often collect multiple within-day self-reported assessments of participants' behavior and well-being on different scales such as physical activity (continuous), pain levels (truncated), mood states (ordinal), and life events (binary). These assessments, when indexed by time of day, can be treated as functional data of different types - continuous, truncated, ordinal, and binary. We develop a functional principal component analysis that deals with all four types of functional data in a unified manner. It employs a semiparametric Gaussian copula model, assuming a generalized latent non-paranormal process as the underlying mechanism for these four types of functional data. We specify latent temporal dependence using a covariance estimated through Kendall's tau bridging method, incorporating smoothness during the bridging process. Simulation studies demonstrate the method's competitive performance under both dense and sparse sampling conditions. We then apply this approach to data from 497 participants in the National Institute of Mental Health Family Study of the Mood Disorder Spectrum to characterize within-day temporal patterns of mood differences among individuals with major mood disorder subtypes, including Major Depressive Disorder, Type 1, and Type 2 Bipolar Disorder.
Submitted 21 September, 2023; v1 submitted 26 June, 2023;
originally announced June 2023.
-
Functional proportional hazards mixture cure model and its application to modelling the association between cancer mortality and physical activity in NHANES 2003-2006
Authors:
Rahul Ghosal,
Marcos Matabuena,
Jiajia Zhang
Abstract:
We develop a functional proportional hazards mixture cure (FPHMC) model with scalar and functional covariates measured at the baseline. The mixture cure model, useful in studying populations with a cure fraction of a particular event of interest is extended to functional data. We employ the EM algorithm and develop a semiparametric penalized spline-based approach to estimate the dynamic functional coefficients of the incidence and the latency part. The proposed method is computationally efficient and simultaneously incorporates smoothness in the estimated functional coefficients via roughness penalty. Simulation studies illustrate a satisfactory performance of the proposed method in accurately estimating the model parameters and the baseline survival function. Finally, the clinical potential of the model is demonstrated in two real data examples that incorporate rich high-dimensional biomedical signals as functional covariates measured at the baseline and constitute novel domains to apply cure survival models in contemporary medical situations. In particular, we analyze i) minute-by-minute physical activity data from the National Health and Nutrition Examination Survey (NHANES) 2003-2006 to study the association between diurnal patterns of physical activity (PA) at baseline and all cancer mortality through 2019 while adjusting for other biological factors; ii) the impact of daily functional measures of disease severity collected in the intensive care unit on post ICU recovery and mortality event. Our findings provide novel epidemiological insights into the association between daily patterns of PA and cancer mortality. Software implementation and illustration of the proposed estimation method is provided in R.
Submitted 30 March, 2023; v1 submitted 14 February, 2023;
originally announced February 2023.
-
Distributional outcome regression via quantile functions and its application to modelling continuously monitored heart rate and physical activity
Authors:
Rahul Ghosal,
Sujit K. Ghosh,
Jennifer A. Schrack,
Vadim Zipunnikov
Abstract:
Modern clinical and epidemiological studies widely employ wearables to record parallel streams of real-time data on human physiology and behavior. With recent advances in distributional data analysis, these high-frequency data are now often treated as distributional observations resulting in novel regression settings. Motivated by these modelling setups, we develop a distributional outcome regression via quantile functions (DORQF) that expands existing literature with three key contributions: i) handling both scalar and distributional predictors, ii) ensuring jointly monotone regression structure without enforcing monotonicity on individual functional regression coefficients, iii) providing statistical inference via asymptotic projection-based joint confidence bands and a statistical test of global significance to quantify uncertainty of the estimated functional regression coefficients. The method is motivated by and applied to Actiheart component of Baltimore Longitudinal Study of Aging that collected one week of minute-level heart rate (HR) and physical activity (PA) data on 781 older adults to gain deeper understanding of age-related changes in daily life heart rate reserve, defined as a distribution of daily HR, while accounting for daily distribution of physical activity, age, gender, and body composition. Intriguingly, the results provide novel insights in epidemiology of daily life heart rate reserve.
Submitted 14 February, 2024; v1 submitted 26 January, 2023;
originally announced January 2023.
-
Shape-constrained Estimation in Functional Regression with Bernstein Polynomials
Authors:
Rahul Ghosal,
Sujit Ghosh,
Jacek Urbanek,
Jennifer A. Schrack,
Vadim Zipunnikov
Abstract:
Shape restrictions on functional regression coefficients such as non-negativity, monotonicity, convexity or concavity are often available in the form of prior knowledge or required to maintain a structural consistency in functional regression models. A new estimation method is developed in shape-constrained functional regression models using Bernstein polynomials. Specifically, estimation approaches from nonparametric regression are extended to functional data, properly accounting for shape-constraints in a large class of functional regression models such as scalar-on-function regression (SOFR), function-on-scalar regression (FOSR), and function-on-function regression (FOFR). Theoretical results establish the asymptotic consistency of the constrained estimators under standard regularity conditions. A projection based approach provides point-wise asymptotic confidence intervals for the constrained estimators. A bootstrap test is developed facilitating testing of the shape constraints. Numerical analysis using simulations illustrates improvement in efficiency of the estimators from the use of the proposed method under shape constraints. Two applications include i) modeling a drug effect in a mental health study via shape-restricted FOSR and ii) modeling subject-specific quantile functions of accelerometry-estimated physical activity in the Baltimore Longitudinal Study of Aging (BLSA) as outcomes via shape-restricted quantile-function on scalar regression (QFOSR). R software implementation and illustration of the proposed estimation method and the test are provided.
Submitted 9 September, 2022;
originally announced September 2022.
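To make the role of Bernstein polynomials in shape-constrained estimation more concrete, the sketch below builds the Bernstein basis on [0, 1] and uses the standard sufficient condition that nondecreasing basis coefficients give a nondecreasing function. It is an illustration under stated assumptions, not the authors' R implementation: the names `bernstein_basis` and `monotone_coefficients`, and the reparametrization through exponentiated increments, are choices made here for illustration only.

```python
import numpy as np
from math import comb

def bernstein_basis(t, degree):
    """Evaluate the degree+1 Bernstein basis functions at points t in [0, 1].

    Column k holds C(degree, k) * t**k * (1 - t)**(degree - k).
    """
    t = np.asarray(t, dtype=float)
    k = np.arange(degree + 1)
    binom = np.array([comb(degree, j) for j in k], dtype=float)
    return binom * t[:, None] ** k * (1.0 - t[:, None]) ** (degree - k)

def monotone_coefficients(theta):
    """Map an unconstrained vector to strictly increasing coefficients.

    Assumed reparametrization: the first coefficient is free and each
    later coefficient adds a positive increment exp(theta[j]).
    """
    theta = np.asarray(theta, dtype=float)
    increments = np.exp(theta[1:])
    return np.concatenate(([theta[0]], theta[0] + np.cumsum(increments)))

# Hypothetical unconstrained parameters for a degree-4 expansion.
grid = np.linspace(0.0, 1.0, 50)
basis = bernstein_basis(grid, degree=4)
coefficients = monotone_coefficients([0.0, -1.0, 0.5, 0.2, -0.3])
fitted_curve = basis @ coefficients   # nondecreasing in t by construction
```

Because the constraint becomes a set of linear inequalities on the coefficients (or disappears entirely under a reparametrization like the one assumed here), the shape-restricted fit can be handled by ordinary constrained or unconstrained optimizers.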
-
The Effect of Modeling Human Rationality Level on Learning Rewards from Multiple Feedback Types
Authors:
Gaurav R. Ghosal,
Matthew Zurek,
Daniel S. Brown,
Anca D. Dragan
Abstract:
When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human behavior. Prior work typically sets the rationality level to a constant value, regardless of the type, or quality, of human feedback. However, in many settings, giving one type of feedback (e.g. a demonstration) may be much more difficult than a different type of feedback (e.g. answering a comparison query). Thus, we expect to see more or less noise depending on the type of human feedback. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this in both simulated experiments and in a user study with real human feedback. We find that overestimating human rationality can have dire effects on reward learning accuracy and regret. We also find that fitting the rationality coefficient to human data enables better reward learning, even when the human deviates significantly from the noisy-rational choice model due to systematic biases. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative -- when the human acts very suboptimally, comparisons actually become more informative, even when the rationality level is the same for both. Ultimately, our results emphasize the importance and advantage of paying attention to the assumed human-rationality level, especially when agents actively learn from multiple types of human feedback.
Submitted 9 March, 2023; v1 submitted 22 August, 2022;
originally announced August 2022.
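The "rationality coefficient" in this abstract can be illustrated with a common formulation of noisy-rational choice, in which the probability of picking an option is proportional to the exponential of its reward scaled by a coefficient. The abstract does not spell out the exact model, so the Boltzmann form, the function name `choice_probabilities`, and the parameter name `beta` below are assumptions for illustration.

```python
import numpy as np

def choice_probabilities(rewards, beta):
    """Boltzmann (softmax) choice model: P(option i) is proportional to exp(beta * reward_i).

    beta = 0 gives uniformly random choices; larger beta concentrates
    probability on the highest-reward option.
    """
    rewards = np.asarray(rewards, dtype=float)
    logits = beta * rewards
    logits -= logits.max()            # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical rewards for three candidate behaviors.
option_rewards = [1.0, 2.0, 3.0]
noisy_human = choice_probabilities(option_rewards, beta=0.5)
near_optimal_human = choice_probabilities(option_rewards, beta=5.0)
```

Under this kind of model, assuming a coefficient that is too high treats noisy feedback as if it were nearly optimal, which is one way to read the abstract's warning about overestimating human rationality.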
-
GRiD: GPU-Accelerated Rigid Body Dynamics with Analytical Gradients
Authors:
Brian Plancher,
Sabrina M. Neuman,
Radhika Ghosal,
Scott Kuindersma,
Vijay Janapa Reddi
Abstract:
We introduce GRiD: a GPU-accelerated library for computing rigid body dynamics with analytical gradients. GRiD was designed to accelerate the nonlinear trajectory optimization subproblem used in state-of-the-art robotic planning, control, and machine learning, which requires tens to hundreds of naturally parallel computations of rigid body dynamics and their gradients at each iteration. GRiD leverages URDF parsing and code generation to deliver optimized dynamics kernels that not only expose GPU-friendly computational patterns, but also take advantage of both fine-grained parallelism within each computation and coarse-grained parallelism between computations. Through this approach, when performing multiple computations of rigid body dynamics algorithms, GRiD provides as much as a 7.2x speedup over a state-of-the-art, multi-threaded CPU implementation, and maintains as much as a 2.5x speedup when accounting for I/O overhead. We release GRiD as an open-source library for use by the wider robotics community.
Submitted 25 February, 2022; v1 submitted 14 September, 2021;
originally announced September 2021.
-
RoboRun: A Robot Runtime to Exploit Spatial Heterogeneity
Authors:
Behzad Boroujerdian,
Radhika Ghosal,
Jonathan Cruz,
Brian Plancher,
Vijay Janapa Reddi
Abstract:
The limited onboard energy of autonomous mobile robots poses a tremendous challenge for practical deployment. Hence, efficient computing solutions are imperative. A crucial shortcoming of state-of-the-art computing solutions is that they ignore the robot's operating environment heterogeneity and make static, worst-case assumptions. As this heterogeneity impacts the system's computing payload, an optimal system must dynamically capture these changes in the environment and adjust its computational resources accordingly. This paper introduces RoboRun, a mobile-robot runtime that dynamically exploits the compute-environment synergy to improve performance and energy. We implement RoboRun in the Robot Operating System (ROS) and evaluate it on autonomous drones. We compare RoboRun against a state-of-the-art static design and show 4.5X and 4X improvements in mission time and energy, respectively, as well as a 36% reduction in CPU utilization.
Submitted 30 August, 2021;
originally announced August 2021.
-
Bayesian Inference for Generalized Linear Model with Linear Inequality Constraints
Authors:
Rahul Ghosal,
Sujit K. Ghosh
Abstract:
Bayesian statistical inference for Generalized Linear Models (GLMs) with parameters lying on a constrained space is of general interest (e.g., in monotonic or convex regression), but often constructing valid prior distributions supported on a subspace spanned by a set of linear inequality constraints can be challenging, especially when some of the constraints might be binding leading to a lower dimensional subspace. For the general case with canonical link, it is shown that a generalized truncated multivariate normal supported on a desired subspace can be used. Moreover, it is shown that such prior distribution facilitates the construction of a general purpose product slice sampling method to obtain (approximate) samples from corresponding posterior distribution, making the inferential method computationally efficient for a wide class of GLMs with an arbitrary set of linear inequality constraints. The proposed product slice sampler is shown to be uniformly ergodic, having a geometric convergence rate under a set of mild regularity conditions satisfied by many popular GLMs (e.g., logistic and Poisson regressions with constrained coefficients). One of the primary advantages of the proposed Bayesian estimation method over classical methods is that uncertainty of parameter estimates is easily quantified by using the samples simulated from the path of the Markov Chain of the slice sampler. Numerical illustrations using simulated data sets are presented to illustrate the superiority of the proposed methods compared to some existing methods in terms of sampling bias and variances. In addition, real case studies are presented using data sets for fertilizer-crop production and estimating the SCRAM rate in nuclear power plants.
Submitted 23 August, 2021;
originally announced August 2021.
-
Multi-Modal Prototype Learning for Interpretable Multivariable Time Series Classification
Authors:
Gaurav R. Ghosal,
Reza Abbasi-Asl
Abstract:
Multivariable time series classification problems are increasing in prevalence and complexity in a variety of domains, such as biology and finance. While deep learning methods are an effective tool for these problems, they often lack interpretability. In this work, we propose a novel modular prototype learning framework for multivariable time series classification. In the first stage of our framework, encoders extract features from each variable independently. Prototype layers identify single-variable prototypes in the resulting feature spaces. The next stage of our framework represents the multivariable time series sample points in terms of their similarity to these single-variable prototypes. This results in an inherently interpretable representation of multivariable patterns, on which prototype learning is applied to extract representative examples i.e. multivariable prototypes. Our framework is thus able to explicitly identify both informative patterns in the individual variables, as well as the relationships between the variables. We validate our framework on a simulated dataset with embedded patterns, as well as a real human activity recognition problem. Our framework attains comparable or superior classification performance to existing time series classification methods on these tasks. On the simulated dataset, we find that our model returns interpretations consistent with the embedded patterns. Moreover, the interpretations learned on the activity recognition dataset align with domain knowledge.
Submitted 17 June, 2021;
originally announced June 2021.
-
Widening Access to Applied Machine Learning with TinyML
Authors:
Vijay Janapa Reddi,
Brian Plancher,
Susan Kennedy,
Laurence Moroney,
Pete Warden,
Anant Agarwal,
Colby Banbury,
Massimo Banzi,
Matthew Bennett,
Benjamin Brown,
Sharad Chitlangia,
Radhika Ghosal,
Sarah Grafman,
Rupert Jaeger,
Srivatsan Krishnan,
Maximilian Lam,
Daniel Leiker,
Cara Mann,
Mark Mazumder,
Dominic Pajak,
Dhilan Ramaprasad,
J. Evan Smith,
Matthew Stewart,
Dustin Tingley
Abstract:
Broadening access to both computational and educational resources is critical to diffusing machine-learning (ML) innovation. However, today, most ML resources and experts are siloed in a few countries and organizations. In this paper, we describe our pedagogical approach to increasing access to applied ML through a massive open online course (MOOC) on Tiny Machine Learning (TinyML). We suggest that TinyML, ML on resource-constrained embedded devices, is an attractive means to widen access because TinyML both leverages low-cost and globally accessible hardware, and encourages the development of complete, self-contained applications, from data collection to deployment. To this end, a collaboration between academia (Harvard University) and industry (Google) produced a four-part MOOC that provides application-oriented instruction on how to develop solutions using TinyML. The series is openly available on the edX MOOC platform, has no prerequisites beyond basic programming, and is designed for learners from a global variety of backgrounds. It introduces pupils to real-world applications, ML algorithms, data-set engineering, and the ethical considerations of these technologies via hands-on programming and deployment of TinyML applications in both the cloud and their own microcontrollers. To facilitate continued learning, community building, and collaboration beyond the courses, we launched a standalone website, a forum, a chat, and an optional course-project competition. We also released the course materials publicly, hoping they will inspire the next generation of ML practitioners and educators and further broaden access to cutting-edge ML technologies.
Submitted 9 June, 2021; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Scalar on time-by-distribution regression and its application for modelling associations between daily-living physical activity and cognitive functions in Alzheimer's Disease
Authors:
Rahul Ghosal,
Vijay R. Varma,
Dmitri Volfson,
Jacek Urbanek,
Jeffrey M. Hausdorff,
Amber Watts,
Vadim Zipunnikov
Abstract:
Wearable data is a rich source of information that can provide deeper understanding of links between human behaviours and human health. Existing modelling approaches use wearable data summarized at subject level via scalar summaries using regression techniques, temporal (time-of-day) curves using functional data analysis (FDA), and distributions using distributional data analysis (DDA). We propose to capture temporally local distributional information in wearable data using subject-specific time-by-distribution (TD) data objects. Specifically, we propose scalar on time-by-distribution regression (SOTDR) to model associations between scalar response of interest such as health outcomes or disease status and TD predictors. We show that TD data objects can be parsimoniously represented via a collection of time-varying L-moments that capture distributional changes over the time-of-day. The proposed method is applied to the accelerometry study of mild Alzheimer's disease (AD). Mild AD is found to be significantly associated with reduced maximal level of physical activity, particularly during morning hours. It is also demonstrated that TD predictors attain much stronger associations with clinical cognitive scales of attention, verbal memory, and executive function when compared to predictors summarized via scalar total activity counts, temporal functional curves, and quantile functions. Taken together, the present results suggest that the SOTDR analysis provides novel insights into cognitive function and AD.
Submitted 7 June, 2021;
originally announced June 2021.
-
MAVFI: An End-to-End Fault Analysis Framework with Anomaly Detection and Recovery for Micro Aerial Vehicles
Authors:
Yu-Shun Hsiao,
Zishen Wan,
Tianyu Jia,
Radhika Ghosal,
Abdulrahman Mahmoud,
Arijit Raychowdhury,
David Brooks,
Gu-Yeon Wei,
Vijay Janapa Reddi
Abstract:
Safety and resilience are critical for autonomous unmanned aerial vehicles (UAVs). We introduce MAVFI, the micro aerial vehicles (MAVs) resilience analysis methodology to assess the effect of silent data corruption (SDC) on UAVs' mission metrics, such as flight time and success rate, for accurately measuring system resilience. To enhance the safety and resilience of robot systems bound by size, weight, and power (SWaP), we offer two low-overhead anomaly-based SDC detection and recovery algorithms based on Gaussian statistical models and autoencoder neural networks. Our anomaly error protection techniques are validated in numerous simulated environments. We demonstrate that the autoencoder-based technique can recover up to all failure cases in our studied scenarios with a computational overhead of no more than 0.0062%. Our application-aware resilience analysis framework, MAVFI, can be utilized to comprehensively test the resilience of other Robot Operating System (ROS)-based applications and is publicly available at https://github.com/harvard-edge/MAVBench/tree/mavfi.
Submitted 30 January, 2023; v1 submitted 26 May, 2021;
originally announced May 2021.
-
Distributional data analysis via quantile functions and its application to modelling digital biomarkers of gait in Alzheimer's Disease
Authors:
Rahul Ghosal,
Vijay R. Varma,
Dmitri Volfson,
Inbar Hillel,
Jacek Urbanek,
Jeffrey M. Hausdorff,
Amber Watts,
Vadim Zipunnikov
Abstract:
With the advent of continuous health monitoring with wearable devices, users now generate their unique streams of continuous data such as minute-level step counts or heartbeats. Summarizing these streams via scalar summaries often ignores the distributional nature of wearable data and almost unavoidably leads to the loss of critical information. We propose to capture the distributional nature of wearable data via user-specific quantile functions (QF) and use these QFs as predictors in scalar-on-quantile-function-regression (SOQFR). As an alternative approach, we also propose to represent QFs via user-specific L-moments, robust rank-based analogs of traditional moments, and use L-moments as predictors in SOQFR (SOQFR-L). These two approaches provide two mutually consistent interpretations: in terms of quantile levels by SOQFR and in terms of L-moments by SOQFR-L. We also demonstrate how to deal with multi-modal distributional data via Joint and Individual Variation Explained (JIVE) using L-moments. The proposed methods are illustrated in a study of association of digital gait biomarkers with cognitive function in Alzheimer's disease (AD). Our analysis shows that the proposed methods demonstrate higher predictive performance and attain much stronger associations with clinical cognitive scales compared to simple distributional summaries.
Submitted 25 October, 2021; v1 submitted 22 February, 2021;
originally announced February 2021.
-
Variable Selection in Functional Linear Concurrent Regression
Authors:
Rahul Ghosal,
Arnab Maity,
Timothy Clark,
Stefano B Longo
Abstract:
We propose a novel method for variable selection in functional linear concurrent regression. Our research is motivated by a fisheries footprint study where the goal is to identify important time-varying socio-structural drivers influencing patterns of seafood consumption, and hence fisheries footprint, over time, as well as estimating their dynamic effects. We develop a variable selection method in functional linear concurrent regression extending the classically used scalar on scalar variable selection methods like LASSO, SCAD, and MCP. We show in functional linear concurrent regression the variable selection problem can be addressed as a group LASSO, and their natural extension; group SCAD or a group MCP problem. Through simulations, we illustrate our method, particularly with group SCAD or group MCP penalty, can pick out the relevant variables with high accuracy and has minuscule false positive and false negative rate even when data is observed sparsely, is contaminated with noise and the error process is highly non-stationary. We also demonstrate two real data applications of our method in studies of dietary calcium absorption and fisheries footprint in the selection of influential time-varying covariates.
Submitted 31 October, 2019; v1 submitted 17 April, 2019;
originally announced April 2019.
-
A Score Based Test for Functional Linear Concurrent Regression
Authors:
Rahul Ghosal,
Arnab Maity
Abstract:
We propose a novel method for testing the null hypothesis of no effect of a covariate on the response in the context of functional linear concurrent regression. We establish an equivalent random effects formulation of our functional regression model under which our testing problem reduces to testing for zero variance component for random effects. For this purpose, we use a one-sided score test approach, which is an extension of the classical score test. We provide theoretical justification as to why our testing procedure has the right levels (asymptotically) under null using standard assumptions. Using numerical simulations, we show that our testing method has the desired type I error rate and gives higher power compared to a bootstrapped F test currently existing in the literature. Our model and testing procedure are shown to give good performances even when the data is sparsely observed, and the covariate is contaminated with noise. Applications of the proposed testing method are demonstrated on gait study and a dietary calcium absorption data.
Submitted 12 December, 2019; v1 submitted 11 December, 2018;
originally announced December 2018.
-
Estimating menarcheal age distribution from partially recalled data
Authors:
Sedigheh Mirzaei Salehabadi,
Debasis Sengupta,
Rahul Ghosal
Abstract:
In a cross-sectional study, adolescent and young adult females were asked to recall the time of menarche, if experienced. Some respondents recalled the date exactly, some recalled only the month or the year of the event, and some were unable to recall anything. We consider estimation of the menarcheal age distribution from this interval censored data. A complicated interplay between age-at-event and calendar time, together with the evident fact of memory fading with time, makes the censoring informative. We propose a model where the probabilities of various types of recall would depend on the time since menarche. For parametric estimation we model these probabilities using a multinomial regression function. Establishing consistency and asymptotic normality of the parametric MLE requires a bit of tweaking of the standard asymptotic theory, as the data format varies from case to case. We also provide a non-parametric MLE, propose a computationally simpler approximation, and establish the consistency of both these estimators under mild conditions. We study the small sample performance of the parametric and non-parametric estimators through Monte Carlo simulations. Moreover, we provide a graphical check of the assumption of the multinomial model for the recall probabilities, which appears to hold for the menarcheal data set. Our analysis shows that the use of the partially recalled part of the data indeed leads to smaller confidence intervals of the survival function.
Submitted 3 March, 2019; v1 submitted 10 October, 2018;
originally announced October 2018.
-
A Statistical Exploration of Duckworth-Lewis Method Using Bayesian Inference
Authors:
Indrabati Bhattacharya,
Rahul Ghosal,
Sujit Ghosh
Abstract:
The Duckworth-Lewis (D/L) method is the incumbent rain rule used to decide the result of a limited overs cricket match should it not be able to reach its natural conclusion. Duckworth and Lewis (1998) devised a two-factor relationship between the number of overs a team had remaining and the number of wickets they had lost in order to quantify the percentage of resources a team has at any stage of the match. As the number of remaining overs decreases and the number of lost wickets increases, the resources are expected to decrease. The resource table, which is still being used by the ICC (International Cricket Council) for 50-over cricket matches, suffers from a lack of monotonicity both in the number of overs left and in the number of wickets lost. We apply Bayesian inference to build a resource table which overcomes the non-monotonicity problem of the current D/L resource table and show that it gives better predictions of teams' first innings scores and hence is more suitable for use in rain-affected matches.
Submitted 1 October, 2018;
originally announced October 2018.
-
Privacy Preserving Multi-Server k-means Computation over Horizontally Partitioned Data
Authors:
Riddhi Ghosal,
Sanjit Chatterjee
Abstract:
The k-means clustering is one of the most popular clustering algorithms in data mining. Recently a lot of research has been concentrated on the algorithm when the dataset is divided into multiple parties or when the dataset is too large to be handled by the data owner. In the latter case, usually some servers are hired to perform the task of clustering. The dataset is divided by the data owner among the servers who together perform the k-means and return the cluster labels to the owner. The major challenge in this method is to prevent the servers from gaining substantial information about the actual data of the owner. Several algorithms have been designed in the past that provide cryptographic solutions to perform privacy preserving k-means. We provide a new method to perform k-means over a large set using multiple servers. Our technique avoids heavy cryptographic computations and instead we use a simple randomization technique to preserve the privacy of the data. The k-means computed has exactly the same efficiency and accuracy as the k-means computed over the original dataset without any randomization. We argue that our algorithm is secure against honest but curious and passive adversary.
Submitted 28 June, 2019; v1 submitted 11 August, 2018;
originally announced August 2018.
-
Analysing Relations involving small number of Monomials in AES S-Box
Authors:
Riddhi Ghosal
Abstract:
In the present day, AES is one of the most widely used and most secure encryption systems prevailing. So, naturally, a lot of research work is going on to mount a significant attack on AES. Many different forms of linear and differential cryptanalysis have been performed on AES. Of late, an active area of research has been algebraic cryptanalysis of AES, where although fast progress is being made, there is still considerable scope for research and improvement. One of the major reasons for this is that algebraic cryptanalysis mainly depends on I/O relations of the AES S-Box (a major component of AES). As is already known, key recovery for AES can be expressed as an MQ problem, which is itself considered hard. Solving these equations depends on our ability to reduce them into linear forms which are easily solvable with our current computational power. The lower the degree of these equations, the easier it is for us to linearize them, and hence the attack complexity reduces. The aim of this paper is to analyze the various relations involving a small number of monomials of the AES S-Box and to answer the question of whether it is actually possible to have such monomial equations for the S-Box if we restrict the degree of the monomials. In other words, this paper aims to study such equations and see if they are applicable to AES.
Submitted 14 June, 2017;
originally announced August 2017.