-
EKM: An exact, polynomial-time algorithm for the $K$-medoids problem
Authors:
Xi He,
Max A. Little
Abstract:
The $K$-medoids problem is a challenging combinatorial clustering task, widely used in data analysis applications. While numerous algorithms have been proposed to solve this problem, none of these are able to obtain an exact (globally optimal) solution for the problem in polynomial time. In this paper, we present EKM: a novel algorithm for solving this problem exactly with worst-case $O\left(N^{K+1}\right)$ time complexity. EKM is developed according to recent advances in transformational programming and combinatorial generation, using formal program derivation steps. The derived algorithm is provably correct by construction. We demonstrate the effectiveness of our algorithm by comparing it against various approximate methods on numerous real-world datasets. We show that the wall-clock run time of our algorithm matches the worst-case time complexity analysis on synthetic datasets, clearly outperforming the exponential time complexity of benchmark branch-and-bound based MIP solvers. To our knowledge, this is the first, rigorously-proven polynomial time, practical algorithm for this ubiquitous problem.
Submitted 16 May, 2024;
originally announced May 2024.
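For orientation, the objective that the EKM abstract refers to can be stated with a naive exhaustive baseline. The sketch below is not the EKM algorithm and makes no use of its derivation; it only enumerates every size-$K$ subset of the data as candidate medoids and keeps the cheapest one. The function name `exhaustive_k_medoids` and the choice of Euclidean distance are assumptions made for illustration.

```python
from itertools import combinations


def exhaustive_k_medoids(points, k):
    """Naive exact K-medoids: try every K-subset of the data as medoids.

    Illustrative baseline only (hypothetical helper, not the EKM algorithm).
    Assumes Euclidean distance between points given as tuples of numbers.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n = len(points)
    # Precompute all pairwise distances once.
    d = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

    best_cost, best_medoids = float("inf"), None
    for medoids in combinations(range(n), k):
        # Each point is assigned to its nearest medoid.
        cost = sum(min(d[i][m] for m in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    return best_medoids, best_cost
```

There are $\binom{N}{K}$ candidate subsets and each is scored in $O(NK)$ time, so for fixed $K$ this baseline is also polynomial in $N$; it is shown only to make the objective concrete.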
-
INDUS: Effective and Efficient Language Models for Scientific Applications
Authors:
Bishwaranjan Bhattacharjee,
Aashka Trivedi,
Masayasu Muraoka,
Muthukumaran Ramasubramanian,
Takuma Udagawa,
Iksha Gurung,
Rong Zhang,
Bharath Dandala,
Rahul Ramachandran,
Manil Maskey,
Kaylin Bugbee,
Mike Little,
Elizabeth Fancher,
Lauren Sanders,
Sylvain Costes,
Sergi Blanco-Cuaresma,
Kelly Lockhart,
Thomas Allen,
Felix Grezes,
Megan Ansdell,
Alberto Accomazzi,
Yousef El-Kurdi,
Davis Wertheimer,
Birgit Pfitzmann,
Cesar Berrospi Ramis
, et al. (9 additional authors not shown)
Abstract:
Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained using a diverse set of datasets drawn from multiple sources to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation techniques to address applications which have latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE-NER (entity recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as existing benchmark tasks in the domains of interest.
Submitted 20 May, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
Algorithmic syntactic causal identification
Authors:
Dhurim Cakiqi,
Max A. Little
Abstract:
Causal identification in causal Bayes nets (CBNs) is an important tool in causal inference allowing the derivation of interventional distributions from observational distributions where this is possible in principle. However, most existing formulations of causal identification using techniques such as d-separation and do-calculus are expressed within the mathematical language of classical probability theory on CBNs. However, there are many causal settings where probability theory and hence current causal identification techniques are inapplicable such as relational databases, dataflow programs such as hardware description languages, distributed systems and most modern machine learning algorithms. We show that this restriction can be lifted by replacing the use of classical probability theory with the alternative axiomatic foundation of symmetric monoidal categories. In this alternative axiomatization, we show how an unambiguous and clean distinction can be drawn between the general syntax of causal models and any specific semantic implementation of that causal model. This allows a purely syntactic algorithmic description of general causal identification by a translation of recent formulations of the general ID algorithm through fixing. Our description is given entirely in terms of the non-parametric ADMG structure specifying a causal model and the algebraic signature of the corresponding monoidal category, to which a sequence of manipulations is then applied so as to arrive at a modified monoidal category in which the desired, purely syntactic interventional causal model, is obtained. We use this idea to derive purely syntactic analogues of classical back-door and front-door causal adjustment, and illustrate an application to a more complex causal model.
Submitted 14 March, 2024;
originally announced March 2024.
-
Derm-T2IM: Harnessing Synthetic Skin Lesion Data via Stable Diffusion Models for Enhanced Skin Disease Classification using ViT and CNN
Authors:
Muhammad Ali Farooq,
Wang Yao,
Michael Schukat,
Mark A Little,
Peter Corcoran
Abstract:
This study explores the utilization of dermatoscopic synthetic data generated through stable diffusion models as a strategy for enhancing the robustness of machine learning model training. Synthetic data generation plays a pivotal role in mitigating challenges associated with limited labeled datasets, thereby facilitating more effective model training. In this context, we aim to incorporate enhanced data transformation techniques by extending the recent success of few-shot learning and a small amount of data representation in text-to-image latent diffusion models. The optimally tuned model is further used for rendering high-quality skin lesion synthetic data with diverse and realistic characteristics, providing a valuable supplement and diversity to the existing training data. We investigate the impact of incorporating newly generated synthetic data into the training pipeline of state-of-the-art machine learning models, assessing its effectiveness in enhancing model performance and generalization to unseen real-world data. Our experimental results demonstrate that the synthetic data generated through stable diffusion models helps improve the robustness and adaptability of end-to-end CNN and vision transformer models on two different real-world skin lesion datasets.
Submitted 10 January, 2024;
originally announced January 2024.
-
A generalisation of the method of regression calibration and comparison with Bayesian and frequentist model averaging methods
Authors:
Mark P Little,
Nobuyuki Hamada,
Lydia B Zablotska
Abstract:
For many cancer sites low-dose risks are not known and must be extrapolated from those observed in groups exposed at much higher levels of dose. Measurement error can substantially alter the dose-response shape and hence the extrapolated risk. Recently, there has been considerable attention paid to methods of dealing with shared errors, which are particularly important in occupational and environmental settings. In this paper we test Bayesian model averaging (BMA) and frequentist model averaging (FMA) methods, the first of these similar to the so-called Bayesian two-dimensional Monte Carlo (2DMC) method, and both fairly recently proposed, against a very newly proposed modification of the regression calibration method, the extended regression calibration (ERC) method. The quasi-2DMC+BMA method performs well when a linear model is assumed, but poorly when a linear-quadratic model is assumed. FMA performs as well as quasi-2DMC+BMA when a linear model is assumed, and generally much better with a linear-quadratic model, although the coverage probability for the quadratic coefficient is uniformly too high. ERC yields coverage probabilities that are too low when shared and unshared Berkson errors are both large (50%), although otherwise it performs well, and coverage is generally better than the quasi-2DMC+BMA or FMA methods, particularly for the linear-quadratic model. The bias of predicted relative risk at a variety of doses is generally smallest for ERC, and largest for quasi-2DMC+BMA and FMA, with standard regression calibration and Monte Carlo maximum likelihood exhibiting bias in predicted relative risk generally somewhat intermediate between ERC and the other two methods. In general ERC performs best in the scenarios presented, and should be the method of choice in situations where there may be substantial shared error, or suspected curvature in the dose response.
Submitted 13 March, 2024; v1 submitted 3 December, 2023;
originally announced December 2023.
-
An efficient, provably exact, practical algorithm for the 0-1 loss linear classification problem
Authors:
Xi He,
Waheed Ul Rahman,
Max A. Little
Abstract:
Algorithms for solving the linear classification problem have a long history, dating back at least to 1936 with linear discriminant analysis. For linearly separable data, many algorithms can obtain the exact solution to the corresponding 0-1 loss classification problem efficiently, but for data which is not linearly separable, it has been shown that this problem, in full generality, is NP-hard. Alternative approaches all involve approximations of some kind, including the use of surrogates for the 0-1 loss (for example, the hinge or logistic loss) or approximate combinatorial search, none of which can be guaranteed to solve the problem exactly. Finding efficient algorithms to obtain an exact (i.e. globally optimal) solution for the 0-1 loss linear classification problem with fixed dimension remains an open problem. In the research we report here, we detail the rigorous construction of a new algorithm, incremental cell enumeration (ICE), that can solve the 0-1 loss classification problem exactly in polynomial time. We prove correctness using concepts from the theory of hyperplane arrangements and oriented matroids. We demonstrate the effectiveness of this algorithm on synthetic and real-world datasets, showing optimal accuracy both in and out-of-sample, in practical computational time. We also empirically demonstrate how the use of an approximate upper bound leads to polynomial-time run-time improvements to the algorithm whilst retaining exactness. To our knowledge, this is the first, rigorously-proven polynomial time, practical algorithm for this long-standing problem.
Submitted 2 August, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
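The distinction the abstract draws between the 0-1 loss and its surrogates can be made concrete with a small sketch. This is not the ICE algorithm; it only evaluates a fixed linear classifier under both losses. The function names, and the convention that labels are +1 or -1 and that a point exactly on the hyperplane counts as an error, are assumptions for illustration.

```python
def margin(w, b, x, t):
    """Signed margin t * (w . x + b) for a label t in {+1, -1}."""
    return t * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def zero_one_loss(w, b, X, y):
    """Number of misclassified points (assumption: margin <= 0 is an error)."""
    return sum(1 for x, t in zip(X, y) if margin(w, b, x, t) <= 0)


def hinge_loss(w, b, X, y):
    """Hinge surrogate: also penalizes correct points with margin below 1."""
    return sum(max(0.0, 1.0 - margin(w, b, x, t)) for x, t in zip(X, y))
```

For example, with `X = [(2.0,), (0.5,), (-1.0,), (-3.0,)]`, `y = [1, 1, -1, 1]`, `w = (1.0,)` and `b = 0.0`, the classifier makes one error, yet the hinge loss also charges the correctly classified point at 0.5 and grows with the distance of the misclassified point from the boundary. Minimizing the surrogate is therefore not guaranteed to minimize the error count.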
-
TSDF: A simple yet comprehensive, unified data storage and exchange format standard for digital biosensor data in health applications
Authors:
Kasper Claes,
Valentina Ticcinelli,
Reham Badawy,
Yordan P. Raykov,
Luc J. W. Evers,
Max A. Little
Abstract:
Digital sensors are increasingly being used to monitor the change over time of physiological processes in biological health and disease, often using wearable devices. This generates very large amounts of digital sensor data, for which a consensus on a common storage, exchange and archival data format standard has yet to be reached. To address this gap, we propose Time Series Data Format (TSDF): a unified, standardized format for storing all types of physiological sensor data, across diverse disease areas. We pose a series of format design criteria and review in detail current storage and exchange formats. When judged against these criteria, we find these current formats lacking, and propose a very simple, intuitive standard for both numerical sensor data and metadata, based on raw binary data and JSON-format text files, for sensor measurements/timestamps and metadata, respectively. By focusing on the common characteristics of diverse biosensor data, we define a set of necessary and sufficient metadata fields for storing, processing, exchanging, archiving and reliably interpreting multi-channel biological time series data. Our aim is for this standardized format to increase the interpretability and exchangeability of data, thereby contributing to scientific reproducibility in studies where digital biosensor data forms a key evidence base.
Submitted 22 November, 2022; v1 submitted 21 November, 2022;
originally announced November 2022.
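The general pattern described in the TSDF abstract, raw binary samples paired with a JSON metadata file, can be sketched as follows. The file names and every metadata key below are hypothetical placeholders chosen for illustration; they are not taken from the TSDF specification, which should be consulted for the actual required fields.

```python
import json
import struct
from pathlib import Path


def write_recording(directory, samples, channels, units):
    """Store samples as little-endian float32 plus a JSON sidecar.

    All file names and metadata keys are illustrative assumptions,
    not the fields defined by the TSDF standard.
    """
    directory = Path(directory)
    flat = [value for row in samples for value in row]
    (directory / "values.bin").write_bytes(struct.pack(f"<{len(flat)}f", *flat))
    meta = {
        "file_name": "values.bin",
        "data_type": "float",
        "bits": 32,
        "endianness": "little",
        "rows": len(samples),
        "channels": channels,
        "units": units,
    }
    (directory / "meta.json").write_text(json.dumps(meta, indent=2))


def read_recording(directory):
    """Read the sidecar first, then use it to interpret the binary file."""
    directory = Path(directory)
    meta = json.loads((directory / "meta.json").read_text())
    raw = (directory / meta["file_name"]).read_bytes()
    count = len(raw) // (meta["bits"] // 8)
    flat = struct.unpack(f"<{count}f", raw)
    width = len(meta["channels"])
    return meta, [list(flat[i:i + width]) for i in range(0, count, width)]
```

The point of the split is that the binary file carries no self-description at all: a reader can only recover channel count, sample width and byte order from the metadata, which is why a fixed set of metadata fields matters.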
-
Patient-Specific Game-Based Transfer Method for Parkinson's Disease Severity Prediction
Authors:
Zaifa Xue,
Huibin Lu,
Tao Zhang,
Max A. Little
Abstract:
Dysphonia is one of the early symptoms of Parkinson's disease (PD). Most existing methods use feature selection methods to find the optimal subset of voice features for all PD patients. Few have considered the heterogeneity between patients, which implies the need to provide specific prediction models for different patients. However, building a patient-specific model faces the challenge of small sample size, which limits its generalization ability. Instance transfer is an effective way to solve this problem. Therefore, this paper proposes a patient-specific game-based transfer (PSGT) method for PD severity prediction. First, a selection mechanism is used to select PD patients with similar disease trends to the target patient from the source domain, which greatly reduces the risk of negative transfer. Then, the contribution of the transferred subjects and their instances to the disease estimation of the target subject is fairly evaluated by the Shapley value, which improves the interpretability of the method. Next, the proportion of valid instances in the transferred subjects is determined, and the instances with higher contribution are transferred to further reduce the difference between the transferred instance subset and the target subject. Finally, the selected subset of instances is added to the training set of the target subject, and the extended data is fed into the random forest to improve the performance of the method. The Parkinson's telemonitoring dataset is used to evaluate feasibility and effectiveness. Experimental results show that PSGT performs better than the compared methods in both prediction error and stability.
Submitted 12 August, 2022; v1 submitted 6 August, 2022;
originally announced August 2022.
-
Dynamic programming by polymorphic semiring algebraic shortcut fusion
Authors:
Max A. Little,
Xi He,
Ugur Kayas
Abstract:
Dynamic programming (DP) is an algorithmic design paradigm for the efficient, exact solution of otherwise intractable, combinatorial problems. However, DP algorithm design is often presented in an ad-hoc manner. It is sometimes difficult to justify algorithm correctness. To address this issue, this paper presents a rigorous algebraic formalism for systematically deriving DP algorithms, based on semiring polymorphism. We start with a specification, construct an algorithm to compute the required solution which is self-evidently correct because it exhaustively generates and evaluates all possible solutions meeting the specification. We then derive, through the use of shortcut fusion, an implementation of this algorithm which is both efficient and correct. We also demonstrate how, with the use of semiring lifting, the specification can be augmented with combinatorial constraints, showing how these constraints can be fused with the algorithm. We furthermore demonstrate how existing DP algorithms for a given combinatorial problem can be abstracted from their original context and re-purposed.
This approach can be applied to the full scope of combinatorial problems expressible in terms of semirings. This includes, for example: optimal probability and Viterbi decoding, probabilistic marginalization, logical inference, fuzzy sets, differentiable softmax, relational and provenance queries. The approach, building on ideas from the existing literature on constructive algorithmics, exploits generic properties of polymorphic functions, tupling and formal sums and algebraic simplifications arising from constraint algebras. We demonstrate the effectiveness of this formalism for some example applications arising in signal processing, bioinformatics and reliability engineering. Python software implementing these algorithms can be downloaded from: http://www.maxlittle.net/software/dppolyalg.zip.
Submitted 4 January, 2024; v1 submitted 4 July, 2021;
originally announced July 2021.
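The idea of semiring polymorphism mentioned in the abstract above can be illustrated with a small sketch: a single forward recursion over a hidden-state chain, written against an abstract semiring, yields probabilistic marginalization under sum-product and the Viterbi (best-path) probability under max-product. This is not the paper's derivation and does not involve shortcut fusion or the authors' released software; the `Semiring` class and `chain_dp` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """A semiring given by its two operations and their identities."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


# Sum over all state paths of the product of their factors.
SUM_PRODUCT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
# Maximum over all state paths of the product of their factors.
MAX_PRODUCT = Semiring(max, lambda a, b: a * b, 0.0, 1.0)


def chain_dp(sr, init, trans, emit, obs):
    """Forward recursion over a hidden-state chain, evaluated in semiring sr.

    init[s], trans[s][t] and emit[s][o] are semiring values; obs is a
    non-empty sequence of observation indices.
    """
    states = range(len(init))
    alpha = [sr.times(init[s], emit[s][obs[0]]) for s in states]
    for o in obs[1:]:
        new_alpha = []
        for t in states:
            acc = sr.zero
            for s in states:
                acc = sr.plus(acc, sr.times(alpha[s], trans[s][t]))
            new_alpha.append(sr.times(acc, emit[t][o]))
        alpha = new_alpha
    total = sr.zero
    for a in alpha:
        total = sr.plus(total, a)
    return total
```

Only the `sr` argument changes between the two uses; the recursion itself is written once. With two states, uniform `init`, `trans` and `emit` entries of 0.5, and two observations, each of the four state paths has weight 0.0625, so sum-product gives 0.25 and max-product gives 0.0625.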
-
Few-shot time series segmentation using prototype-defined infinite hidden Markov models
Authors:
Yazan Qarout,
Yordan P. Raykov,
Max A. Little
Abstract:
We propose a robust framework for interpretable, few-shot analysis of non-stationary sequential data based on flexible graphical models to express the structured distribution of sequential events, using prototype radial basis function (RBF) neural network emissions. A motivational link is demonstrated between prototypical neural network architectures for few-shot learning and the proposed RBF network infinite hidden Markov model (RBF-iHMM). We show that RBF networks can be efficiently specified via prototypes allowing us to express complex nonstationary patterns, while hidden Markov models are used to infer principled high-level Markov dynamics. The utility of the framework is demonstrated on biomedical signal processing applications such as automated seizure detection from EEG data where RBF networks achieve state-of-the-art performance using a fraction of the data needed to train long-short-term memory variational autoencoders.
Submitted 7 February, 2021;
originally announced February 2021.
-
Detecting Parkinson's Disease From an Online Speech-task
Authors:
Wasifur Rahman,
Sangwu Lee,
Md. Saiful Islam,
Victor Nikhil Antony,
Harshil Ratnu,
Mohammad Rafayet Ali,
Abdullah Al Mamun,
Ellen Wagner,
Stella Jensen-Roberts,
Max A. Little,
Ray Dorsey,
Ehsan Hoque
Abstract:
In this paper, we envision a web-based framework that can help anyone, anywhere around the world record a short speech task, and analyze the recorded data to screen for Parkinson's disease (PD). We collected data from 726 unique participants (262 PD, 38% female; 464 non-PD, 65% female; average age: 61) -- from all over the US and beyond. A small portion of the data was collected in a lab setting to compare quality. The participants were instructed to utter a popular pangram containing all the letters in the English alphabet, "the quick brown fox jumps over the lazy dog". We extracted both standard acoustic features (Mel Frequency Cepstral Coefficients (MFCC), jitter and shimmer variants) and deep learning based features from the speech data. Using these features, we trained several machine learning algorithms. We achieved 0.75 AUC (Area Under The Curve) performance on determining presence of self-reported Parkinson's disease by modeling the standard acoustic features through XGBoost -- a gradient-boosted decision tree model. Further analysis reveals that the widely used MFCC features and a subset of previously validated dysphonia features designed for detecting Parkinson's from a verbal phonation task (pronouncing 'ahh') contain the most distinct information. Our model performed equally well on data collected in a controlled lab environment and 'in the wild', across different gender and age groups. Using this tool, we can collect data from almost anyone anywhere with a video/audio enabled device, contributing to equity and access in neurological care.
Submitted 15 December, 2020; v1 submitted 2 September, 2020;
originally announced September 2020.
-
Pneumonia after bacterial or viral infection preceded or followed by radiation exposure -- a reanalysis of older radiobiological data and implications for low dose radiotherapy for COVID-19 pneumonia
Authors:
Mark P Little,
Wei Zhang,
Roy van Dusen,
Nobuyuki Hamada
Abstract:
Currently, there are 14 ongoing clinical studies on low dose radiotherapy (LDRT) for COVID-19 pneumonia. An underlying assumption is that irradiation of about 1 Gy is effective at ameliorating viral pneumonia. Its rationale, however, relies on early human case series or animal studies mostly obtained in the pre-antibiotic era, where rigorous statistical analyses were not performed. It therefore remains unclear whether those early data support such assumptions. With standard statistical survival models, and based on a systematic literature review, we re-analyzed 14 radiobiological animal datasets in which animals received mostly fractionated doses of radiation before or after bacterial/viral inoculation, and assessing various health endpoints (mortality, pneumonia morbidity). In most datasets absorbed doses did not exceed 7 Gy. Various different model systems and types of challenging infection are considered. For 7 studies that evaluated post-inoculation radiation exposure (more relevant to LDRT for COVID-19 pneumonia) the results are heterogeneous, with 2 studies showing a significant increase (p<0.001) and another showing a significant decrease (p<0.001) in mortality associated with radiation exposure. For pre-inoculation exposure the results are also heterogeneous, with 6 datasets showing a significant increase (p<0.01) in mortality risk associated with radiation exposure and the other 2 showing a significant decrease (p<0.05) in mortality risk. Collectively, these data do not provide clear support for reductions in morbidity or mortality associated with post-infection radiation exposure. For pre-infection radiation exposure the inconsistency of direction of effect makes this body of data difficult to interpret. Nevertheless, one must be cautious about adducing evidence from the published reports of these old animal datasets.
Submitted 5 September, 2020; v1 submitted 6 August, 2020;
originally announced August 2020.
-
Crystallography companion agent for high-throughput materials discovery
Authors:
Phillip M. Maffettone,
Lars Banko,
Peng Cui,
Yury Lysogorskiy,
Marc A. Little,
Daniel Olds,
Alfred Ludwig,
Andrew I. Cooper
Abstract:
The discovery of new structural and functional materials is driven by phase identification, often using X-ray diffraction (XRD). Automation has accelerated the rate of XRD measurements, greatly outpacing XRD analysis techniques that remain manual, time-consuming, error-prone, and impossible to scale. With the advent of autonomous robotic scientists or self-driving labs, contemporary techniques prohibit the integration of XRD. Here, we describe a computer program for the autonomous characterization of XRD data, driven by artificial intelligence (AI), for the discovery of new materials. Starting from structural databases, we train an ensemble model using a physically accurate synthetic dataset, which outputs probabilistic classifications -- rather than absolutes -- to overcome the overconfidence in traditional neural networks. This AI agent behaves as a companion to the researcher, improving accuracy and offering significant time savings. It was demonstrated on a diverse set of organic and inorganic materials characterization challenges. This innovation is directly applicable to inverse design approaches and robotic discovery systems, and can be immediately considered for other forms of characterization such as spectroscopy and the pair distribution function.
Submitted 17 March, 2021; v1 submitted 1 August, 2020;
originally announced August 2020.
-
Modelling remote epidemic transmission in Western Australia and implications for pandemic response
Authors:
Michael Small,
Orlando Porras,
Michael Little,
David Cavanagh,
Harry Nicholas
Abstract:
We develop an agent-based model of disease transmission in remote communities in Western Australia. Despite extreme isolation, we show that the movement of people amongst a large number of small but isolated communities has the effect of causing transmission to spread quickly. Significant movement between remote communities, and regional and urban centres allows for infection to quickly spread to and then among these remote communities. Our conclusions are based on two characteristic features of remote communities in Western Australia: (1) high mobility of people amongst these communities, and (2) relatively high proportion of travellers from very small communities to major population centres. In models of infection initiated in the state capital, Perth, these remote communities are collectively and uniquely vulnerable. Our model and analysis do not account for possibly heightened impact due to preexisting conditions; such additional assumptions would only make the projections of this model more dire. We advocate stringent monitoring and control of movement to prevent significant impact on the indigenous population of Western Australia.
Submitted 14 July, 2020;
originally announced July 2020.
-
Controlling for sparsity in sparse factor analysis models: adaptive latent feature sharing for piecewise linear dimensionality reduction
Authors:
Adam Farooq,
Yordan P. Raykov,
Petar Raykov,
Max A. Little
Abstract:
Ubiquitous linear Gaussian exploratory tools such as principal component analysis (PCA) and factor analysis (FA) remain widely used for exploratory analysis, pre-processing, data visualization and related tasks. However, due to their rigid assumptions including crowding of high dimensional data, they have been replaced in many settings by more flexible and still interpretable latent feature models. Feature allocation is usually modelled using discrete latent variables assumed to follow either a parametric Beta-Bernoulli distribution or a Bayesian nonparametric prior. In this work we propose a simple and tractable parametric feature allocation model which can address key limitations of current latent feature decomposition techniques. The new framework allows for explicit control over the number of features used to express each point and enables a more flexible set of allocation distributions including feature allocations with different sparsity levels. This approach is used to derive a novel adaptive factor analysis (aFA), as well as an adaptive probabilistic principal component analysis (aPPCA), capable of flexible structure discovery and dimensionality reduction in a wide range of scenarios. We derive both a standard Gibbs sampler and an expectation-maximization inference algorithm that converges orders of magnitude faster to a reasonable point estimate solution. The utility of the proposed aPPCA model is demonstrated for standard PCA tasks such as feature learning, data visualization and data whitening. We show that aPPCA and aFA can infer interpretable high level features both when applied on raw MNIST and when applied for interpreting autoencoder features. We also demonstrate an application of the aPPCA to more robust blind source separation for functional magnetic resonance imaging (fMRI).
△ Less
Submitted 28 February, 2021; v1 submitted 22 June, 2020;
originally announced June 2020.
-
Probabilistic modelling of gait for robust passive monitoring in daily life
Authors:
Yordan P. Raykov,
Luc J. W. Evers,
Reham Badawy,
Bastiaan Bloem,
Tom M. Heskes,
Marjan Meinders,
Kasper Claes,
Max A. Little
Abstract:
Passive monitoring in daily life may provide invaluable insights about a person's health throughout the day. Wearable sensor devices are likely to play a key role in enabling such monitoring in a non-obtrusive fashion. However, sensor data collected in daily life reflects multiple health and behavior related factors together. This creates the need for structured principled analysis to produce reliable and interpretable predictions that can be used to support clinical diagnosis and treatment. In this work we develop a principled modelling approach for free-living gait (walking) analysis. Gait is a promising target for non-obtrusive monitoring because it is common and indicative of various movement disorders such as Parkinson's disease (PD), yet its analysis has largely been limited to experimentally controlled lab settings. To locate and characterize stationary gait segments in free living using accelerometers, we present an unsupervised statistical framework designed to segment signals into differing gait and non-gait patterns. Our flexible probabilistic framework combines empirical assumptions about gait into a principled graphical model with all of its merits. We demonstrate the approach on a new video-referenced dataset including unscripted daily living activities of 25 PD patients and 25 controls, in and around their own houses. We evaluate our ability to detect gait and predict medication induced fluctuations in PD patients based on modelled gait. Our evaluation includes a comparison between sensors attached at multiple body locations including wrist, ankle, trouser pocket and lower back.
Submitted 6 April, 2020;
originally announced April 2020.
-
Causal bootstrapping
Authors:
Max A. Little,
Reham Badawy
Abstract:
To draw scientifically meaningful conclusions and build reliable models of quantitative phenomena, cause and effect must be taken into consideration (either implicitly or explicitly). This is particularly challenging when the measurements are not from controlled experimental (interventional) settings, since cause and effect can be obscured by spurious, indirect influences. Modern predictive techniques from machine learning are capable of capturing high-dimensional, nonlinear relationships between variables while relying on few parametric or probabilistic model assumptions. However, since these techniques are associational, applied to observational data they are prone to picking up spurious influences from non-experimental (observational) data, making their predictions unreliable. Techniques from causal inference, such as probabilistic causal diagrams and do-calculus, provide powerful (nonparametric) tools for drawing causal inferences from such observational data. However, these techniques are often incompatible with modern, nonparametric machine learning algorithms since they typically require explicit probabilistic models. Here, we develop causal bootstrapping for augmenting classical nonparametric bootstrap resampling with information on the causal relationship between variables. This makes it possible to resample observational data such that, if it is possible to identify an interventional relationship from that data, new data representing that relationship can be simulated from the original observational data. In this way, we can use modern machine learning algorithms unaltered to make statistically powerful, yet causally-robust, predictions. We develop several causal bootstrapping algorithms for drawing interventional inferences from observational data, for classification and regression problems, and demonstrate, using synthetic and real-world examples, the value of this approach.
Submitted 9 December, 2020; v1 submitted 21 October, 2019;
originally announced October 2019.
-
Automatic Quality Control and Enhancement for Voice-Based Remote Parkinson's Disease Detection
Authors:
Amir Hossein Poorjam,
Mathew Shaji Kavalekalam,
Liming Shi,
Yordan P. Raykov,
Jesper Rindom Jensen,
Max A. Little,
Mads Græsbøll Christensen
Abstract:
The performance of voice-based Parkinson's disease (PD) detection systems degrades when there is an acoustic mismatch between training and operating conditions caused mainly by degradation in test signals. In this paper, we address this mismatch by considering three types of degradation commonly encountered in remote voice analysis, namely background noise, reverberation and nonlinear distortion, and investigate how these degradations influence the performance of a PD detection system. Given that the specific degradation is known, we explore the effectiveness of a variety of enhancement algorithms in compensating this mismatch and improving the PD detection accuracy. Then, we propose two approaches to automatically control the quality of recordings by identifying the presence and type of short-term and long-term degradations and protocol violations in voice signals. Finally, we experiment with using the proposed quality control methods to inform the choice of enhancement algorithm. Experimental results using the voice recordings of the mPower mobile PD data set under different degradation conditions show the effectiveness of the quality control approaches in selecting an appropriate enhancement method and, consequently, in improving the PD detection accuracy. This study is a step towards the development of a remote PD detection system capable of operating in unseen acoustic environments.
Submitted 31 May, 2019; v1 submitted 28 May, 2019;
originally announced May 2019.
-
Adaptive probabilistic principal component analysis
Authors:
Adam Farooq,
Yordan P. Raykov,
Luc Evers,
Max A. Little
Abstract:
Using the linear Gaussian latent variable model as a starting point, we relax some of the constraints it imposes by deriving a nonparametric latent feature Gaussian variable model. This model introduces additional discrete latent variables to the original structure. The Bayesian nonparametric nature of this new model allows it to adapt complexity as more data is observed and project each data point onto a varying number of subspaces. The linear relationship between the continuous latent and observed variables makes the proposed model straightforward to interpret, resembling a locally adaptive probabilistic PCA (A-PPCA). We propose two alternative Gibbs sampling procedures for inference in the new model and demonstrate its applicability on sensor data for passive health monitoring.
Submitted 27 May, 2019;
originally announced May 2019.
-
Bayesian Pitch Tracking Based on the Harmonic Model
Authors:
Liming Shi,
Jesper Kjaer Nielsen,
Jesper Rindom Jensen,
Max A. Little,
Mads Graesboll Christensen
Abstract:
Fundamental frequency is one of the most important characteristics of speech and audio signals. Harmonic model-based fundamental frequency estimators offer a higher estimation accuracy and robustness against noise than the widely used autocorrelation-based methods. However, the traditional harmonic model-based estimators do not take the temporal smoothness of the fundamental frequency, the model order, and the voicing into account as they process each data segment independently. In this paper, a fully Bayesian fundamental frequency tracking algorithm based on the harmonic model and a first-order Markov process model is proposed. Smoothness priors are imposed on the fundamental frequencies, model orders, and voicing using first-order Markov process models. Using these Markov models, fundamental frequency estimation and voicing detection errors can be reduced. Using the harmonic model, the proposed fundamental frequency tracker has an improved robustness to noise. An analytical form of the likelihood function, which can be computed efficiently, is derived. Compared to the state-of-the-art neural network and non-parametric approaches, the proposed fundamental frequency tracking algorithm reduces the mean absolute errors and gross errors by 15% and 20% on the Keele pitch database and 36% and 26% on sustained /a/ sounds from a database of Parkinson's disease voices under 0 dB white Gaussian noise. A MATLAB version of the proposed algorithm is made freely available for reproduction of the results (an implementation of the proposed algorithm using MATLAB may be found at https://tinyurl.com/yxn4a543).
Submitted 21 May, 2019;
originally announced May 2019.
-
Probabilistic modelling of gait for remote passive monitoring applications
Authors:
Yordan P. Raykov,
Luc J. W. Evers,
Reham Badawy,
Marjan J. Faber,
Bastiaan R. Bloem,
Kasper Claes,
Max A. Little
Abstract:
Passive and non-obtrusive health monitoring using wearables can potentially bring new insights into the user's health status throughout the day and may support clinical diagnosis and treatment. However, identifying segments of free-living data that sufficiently reflect the user's health is challenging. In this work we have studied the problem of modelling real-life gait which is a very indicative behaviour for multiple movement disorders including Parkinson's disease (PD). We have developed a probabilistic framework for unsupervised analysis of the gait, clustering it into different types, which can be used to evaluate gait abnormalities occurring in daily life. Using a unique dataset which contains sensor and video recordings of people with and without PD in their own living environment, we show that our model driven approach achieves high accuracy gait detection and can capture clinical improvement after medication intake.
Submitted 30 January, 2019; v1 submitted 4 December, 2018;
originally announced December 2018.
-
Investigating Voice as a Biomarker for leucine-rich repeat kinase 2-Associated Parkinson's Disease
Authors:
S. Arora,
N. P. Visanji,
T. A. Mestre,
A. Tsanas,
A. AlDakheel,
B. S. Connolly,
C. Gasca-Salas,
D. S. Kern,
J. Jain,
E. J. Slow,
A. Faust-Socher,
A. E. Lang,
M. A. Little,
C. Marras
Abstract:
We investigate the potential association between leucine-rich repeat kinase 2 (LRRK2) mutations and voice. Sustained phonations ('aaah' sounds) were recorded from 7 individuals with LRRK2-associated Parkinson's disease (PD), 17 participants with idiopathic PD (iPD), 20 non-manifesting LRRK2-mutation carriers, 25 related non-carriers, and 26 controls. In distinguishing LRRK2-associated PD and iPD, the mean sensitivity was 95.4% (SD 17.8%) and mean specificity was 89.6% (SD 26.5%). Voice features for non-manifesting carriers, related non-carriers, and controls were much less discriminatory. Vocal deficits in LRRK2-associated PD may be different than those in iPD. These preliminary results warrant longitudinal analyses and replication in larger cohorts.
Submitted 20 October, 2018;
originally announced October 2018.
-
A unified algorithm framework for quality control of sensor data for behavioural clinimetric testing
Authors:
Reham Badawy,
Yordan P. Raykov,
Max A. Little
Abstract:
The use of smartphone and wearable sensing technology for objective, non-invasive and remote clinimetric testing of symptoms has considerable potential. However, the clinimetric accuracy achievable with such technology is highly reliant on separating the useful from irrelevant or confounded sensor data. Monitoring patient symptoms using digital sensors outside of controlled, clinical lab settings creates a variety of practical challenges, such as unavoidable and unexpected user behaviours. These behaviours often violate the assumptions of clinimetric testing protocols, where these protocols are designed to probe for specific symptoms. Such violations are frequent outside the lab, and can affect the accuracy of the subsequent data analysis and scientific conclusions. At the same time, curating sensor data by hand after the collection process is inherently subjective, laborious and error-prone. To address these problems, we report on a unified algorithmic framework for automated sensor data quality control, which can identify those parts of the sensor data which are sufficiently reliable for further analysis. Algorithms which are special cases of this framework for different sensor data types (e.g. accelerometer, digital audio) detect the extent to which the sensor data adheres to the assumptions of the test protocol for a variety of clinimetric tests. The approach is general enough to be applied to a large set of clinimetric tests and we demonstrate its performance on walking, balance and voice smartphone-based tests, designed to monitor the symptoms of Parkinson's disease.
Submitted 23 November, 2017; v1 submitted 20 November, 2017;
originally announced November 2017.
-
High Frequency Remote Monitoring of Parkinson's Disease via Smartphone: Platform Overview and Medication Response Detection
Authors:
Andong Zhan,
Max A. Little,
Denzil A. Harris,
Solomon O. Abiola,
E. Ray Dorsey,
Suchi Saria,
Andreas Terzis
Abstract:
Objective: The aim of this study is to develop a smartphone-based high-frequency remote monitoring platform, assess its feasibility for remote monitoring of symptoms in Parkinson's disease, and demonstrate the value of data collected using the platform by detecting dopaminergic medication response. Methods: We have developed HopkinsPD, a novel smartphone-based monitoring platform, which measures symptoms actively (i.e. data are collected when a suite of tests is initiated by the individual at specific times during the day), and passively (i.e. data are collected continuously in the background). After data collection, we extract features to assess measures of five key behaviors related to PD symptoms: voice, balance, gait, dexterity, and reaction time. A random forest classifier is used to discriminate measurements taken after a dose of medication (treatment) versus before the medication dose (baseline). Results: A worldwide study for remote PD monitoring was established using HopkinsPD in July 2014. This study used entirely remote, online recruitment and installation, demonstrating highly cost-effective scalability. In six months, 226 individuals (121 PD and 105 controls) contributed over 46,000 hours of passive monitoring data and approximately 8,000 instances of structured tests of voice, balance, gait, reaction, and dexterity. To the best of our knowledge, this is the first study to have collected data at such a scale for remote PD monitoring. Moreover, we demonstrate the initial ability to discriminate treatment from baseline with 71.0 (±0.4)% accuracy, which suggests medication response can be monitored remotely via smartphone-based measures.
Submitted 5 January, 2016;
originally announced January 2016.
-
Simple approximate MAP Inference for Dirichlet processes
Authors:
Yordan P. Raykov,
Alexis Boukouvalas,
Max A. Little
Abstract:
The Dirichlet process mixture (DPM) is a ubiquitous, flexible Bayesian nonparametric statistical model. However, full probabilistic inference in this model is analytically intractable, so that computationally intensive techniques such as Gibbs sampling are required. As a result, DPM-based methods, which have considerable potential, are restricted to applications in which computational resources and time for inference are plentiful. For example, they would not be practical for digital signal processing on embedded hardware, where computational resources are at a serious premium. Here, we develop simplified yet statistically rigorous approximate maximum a-posteriori (MAP) inference algorithms for DPMs. This algorithm is as simple as K-means clustering and performs in experiments as well as Gibbs sampling, while requiring only a fraction of the computational effort. Unlike related small variance asymptotics, our algorithm is non-degenerate and so inherits the "rich get richer" property of the Dirichlet process. It also retains a non-degenerate closed-form likelihood which enables standard tools such as cross-validation to be used. This is a well-posed approximation to the MAP solution of the probabilistic DPM model.
Submitted 4 November, 2014;
originally announced November 2014.
-
Highly comparative time-series analysis: The empirical structure of time series and their methods
Authors:
Ben D. Fulcher,
Max A. Little,
Nick S. Jones
Abstract:
The process of collecting and organizing sets of observations represents a common theme throughout the history of science. However, despite the ubiquity of scientists measuring, recording, and analyzing the dynamics of different processes, an extensive organization of scientific time-series data and analysis methods has never been performed. Addressing this, annotated collections of over 35 000 real-world and model-generated time series and over 9000 time-series analysis algorithms are analyzed in this work. We introduce reduced representations of both time series, in terms of their properties measured by diverse scientific methods, and of time-series analysis methods, in terms of their behaviour on empirical time series, and use them to organize these interdisciplinary resources. This new approach to comparing across diverse scientific data and methods allows us to organize time-series datasets automatically according to their properties, retrieve alternatives to particular analysis methods developed in other scientific disciplines, and automate the selection of useful methods for time-series classification and regression tasks. The broad scientific utility of these tools is demonstrated on datasets of electroencephalograms, self-affine time series, heart beat intervals, speech signals, and others, in each case contributing novel analysis techniques to the existing literature. Highly comparative techniques that compare across an interdisciplinary literature can thus be used to guide more focused research in time-series analysis for applications across the scientific disciplines.
Submitted 3 April, 2013;
originally announced April 2013.
-
Deterministically driven random walks on a finite state space
Authors:
Colin M. W. Little
Abstract:
We introduce the concept of a deterministic walk. Confining our attention to the finite state case, we establish hypotheses that ensure that the deterministic walk is transitive, and show that this property is in some sense robust. We also establish conditions that ensure the existence of asymptotic occupation times.
Submitted 14 January, 2013;
originally announced January 2013.
-
Deterministically driven random walks in a random environment on Z
Authors:
Colin M. W. Little
Abstract:
We introduce the concept of a deterministic walk in a deterministic environment on a countable state space (DWDE). For the deterministic walk in a fixed environment we establish properties analogous to those found in Markov chain theory, but for systems that do not in general have the Markov property. In particular, we establish hypotheses ensuring that a DWDE on $\Z$ is either recurrent or transient. An immediate consequence of this result is that a symmetric DWDE on $\Z$ is recurrent. Moreover, in the transient case, we show that the probability that the DWDE diverges to $+ \infty$ is either 0 or 1. In certain cases we compute the direction of divergence in the transient case.
Submitted 14 January, 2013;
originally announced January 2013.
-
Generalized Methods and Solvers for Noise Removal from Piecewise Constant Signals
Authors:
Max A. Little,
Nick S. Jones
Abstract:
Removing noise from piecewise constant (PWC) signals is a challenging signal processing problem arising in many practical contexts. For example, in exploration geosciences, noisy drill hole records need separating into stratigraphic zones, and in biophysics, jumps between molecular dwell states need extracting from noisy fluorescence microscopy signals. Many PWC denoising methods exist, including total variation regularization, mean shift clustering, stepwise jump placement, running medians, convex clustering shrinkage and bilateral filtering; conventional linear signal processing methods, however, are fundamentally unsuited. This paper shows that most of these methods are associated with a special case of a generalized functional, minimized to achieve PWC denoising. The minimizer can be obtained by diverse solver algorithms, including stepwise jump placement, convex programming, finite differences, iterated running medians, least angle regression, regularization path following, and coordinate descent. We introduce novel PWC denoising methods, which, for example, combine global mean shift clustering with local total variation smoothing. Head-to-head comparisons between these methods are performed on synthetic data, revealing that our new methods have a useful role to play. Finally, overlaps between the methods of this paper and others such as wavelet shrinkage, hidden Markov models, and piecewise smooth filtering are touched on.
Submitted 4 January, 2011; v1 submitted 22 December, 2010;
originally announced December 2010.
-
Steps and bumps: precision extraction of discrete states of molecular machines using physically-based, high-throughput time series analysis
Authors:
Max A. Little,
Bradley C. Steel,
Fan Bai,
Yoshiyuki Sowa,
Thomas Bilyard,
David M. Mueller,
Richard M. Berry,
Nick S. Jones
Abstract:
We report new statistical time-series analysis tools providing significant improvements in the rapid, precision extraction of discrete state dynamics from large databases of experimental observations of molecular machines. By building physical knowledge and statistical innovations into analysis tools, we demonstrate new techniques for recovering discrete state transitions buried in highly correlated molecular noise. We demonstrate the effectiveness of our approach on simulated and real examples of step-like rotation of the bacterial flagellar motor and the F1-ATPase enzyme. We show that our method can clearly identify molecular steps, symmetries and cascaded processes that are too weak for existing algorithms to detect, and can do so much faster than existing algorithms. Our techniques represent a major advance in the drive towards automated, precision, high-throughput studies of molecular machine dynamics. Modular, open-source software that implements these techniques is provided at http://www.eng.ox.ac.uk/samp/members/max/software/
Submitted 7 April, 2010;
originally announced April 2010.
-
Sparse bayesian step-filtering for high-throughput analysis of molecular machine dynamics
Authors:
Max A. Little,
Nick S. Jones
Abstract:
Nature has evolved many molecular machines such as kinesin, myosin, and the rotary flagellar motor powered by an ion current from the mitochondria. Direct observation of the step-like motion of these machines with time series from novel experimental assays has recently become possible. These time series are corrupted by molecular and experimental noise that requires removal, but classical signal processing is of limited use for recovering such step-like dynamics. This paper reports simple, novel Bayesian filters that are robust to step-like dynamics in noise, and introduces an L1-regularized, global filter whose sparse solution can be rapidly obtained by standard convex optimization methods. We show that these techniques outperform classical filters on simulated time series in terms of their ability to accurately recover the underlying step dynamics. To show the techniques in action, we extract step-like speed transitions from Rhodobacter sphaeroides flagellar motor time series. Code implementing these algorithms is available from http://www.eng.ox.ac.uk/samp/members/max/software/.
Submitted 29 March, 2010;
originally announced March 2010.
-
Monocyte and T-lymphocyte trans-endothelial migration in relation to cardiovascular disease: some alternative boundary conditions in a model recently proposed by Little et al. (PLoS Comput Biol 2009 5(10) e1000539)
Authors:
M. P. Little
Abstract:
We consider a slight modification to the monocyte and T-lymphocyte boundary conditions of Little et al. (PLoS Comput Biol 2009 5(10) e1000539) and derive alternative parameter estimates. No changes to the results and conclusions of the paper of Little et al. (PLoS Comput Biol 2009 5(10) e1000539) are implied.
Submitted 28 February, 2010;
originally announced March 2010.
-
Variant assumptions made in deriving equilibrium solutions to Little et al (PLoS Comput Biol 2009 5(10) e1000539)
Authors:
Mark P Little,
Anna Gola,
Ioanna Tzoulaki,
Wendy Vandoolaeghe
Abstract:
The paper of Little et al. (PloS Comput Biol 2009 5(10) e1000539) outlined a system of reaction-diffusion equations that were used to describe induction of atherosclerotic disease. These were solved by considering an equilibrium solution and small perturbations around this equilibrium. Here we consider slight variant sets of assumptions that could be used to derive equilibrium solutions. In gene…
▽ More
The paper of Little et al. (PloS Comput Biol 2009 5(10) e1000539) outlined a system of reaction-diffusion equations that were used to describe induction of atherosclerotic disease. These were solved by considering an equilibrium solution and small perturbations around this equilibrium. Here we consider slight variant sets of assumptions that could be used to derive equilibrium solutions. In general they do not imply any change in the numerical results relating to monocyte chemo-attractant protein-1 (MCP-1) presented in that paper.
Submitted 29 December, 2009;
originally announced December 2009.
-
Parameter identifiability and redundancy: theoretical considerations
Authors:
Mark P. Little,
Wolfgang F. Heidenreich,
Guangquan Li
Abstract:
In this paper we outline general considerations on parameter identifiability, and introduce the notions of weak local identifiability and gradient weak local identifiability. These are based on local properties of the likelihood, in particular the rank of the Hessian matrix. We relate these to the notions of parameter identifiability and redundancy previously introduced by Rothenberg (Econometrica 39 (1971) 577-591) and Catchpole and Morgan (Biometrika 84 (1997) 187-196). Within the exponential family, parameter irredundancy, local identifiability, gradient weak local identifiability and weak local identifiability are shown to be equivalent. We consider applications to a recently developed class of cancer models of Little and Wright (Math Biosciences 183 (2003) 111-134) and Little et al. (J Theoret Biol 254 (2008) 229-238) that generalize a large number of other recently used quasi-biological cancer models, in particular those of Armitage and Doll (Br J Cancer 8 (1954) 1-12) and the two-mutation model (Moolgavkar and Venzon Math Biosciences 47 (1979) 55-77).
Submitted 28 February, 2010; v1 submitted 26 December, 2008;
originally announced December 2008.
-
Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection
Authors:
Max A Little,
Patrick E McSharry,
Stephen J Roberts,
Declan AE Costello,
Irene M Moroz
Abstract:
Voice disorders affect patients profoundly, and acoustic tools can potentially measure voice function objectively. Nonetheless, existing tools are limited to analysing voices displaying near periodicity, and do not account for inherent biophysical nonlinearity and non-Gaussian randomness. They do not directly measure complex nonlinear aperiodicity, and turbulent, aeroacoustic, non-Gaussian randomness. Often these tools have limited clinical usefulness. This paper introduces two new tools to speech analysis: recurrence and fractal scaling, which overcome the range limitations of existing tools by addressing directly these two symptoms of disorder, and a simple bootstrapped classifier distinguishes normal from disordered voices to 91.8% overall accuracy on a large database of subjects with a wide variety of voice disorders. They are widely applicable to the whole range of disordered voice phenomena by design. These new measures could therefore be used for a variety of practical clinical purposes.
Submitted 30 June, 2007;
originally announced July 2007.
-
Testing the assumptions of linear prediction analysis in normal vowels
Authors:
Max Little,
Patrick E. McSharry,
Irene M. Moroz,
Stephen J. Roberts
Abstract:
This paper develops an improved surrogate data test to show experimental evidence, for all the simple vowels of US English, for both male and female speakers, that Gaussian linear prediction analysis, a ubiquitous technique in current speech technologies, cannot be used to extract all the dynamical structure of real speech time series. The test provides robust evidence undermining the validity of these linear techniques, supporting the assumptions of either dynamical nonlinearity and/or non-Gaussianity common to more recent, complex efforts at dynamical modelling of speech time series. However, an additional finding is that the classical assumptions cannot be ruled out entirely, and plausible evidence is given to explain the success of the linear Gaussian theory as a weak approximation to the true, nonlinear/non-Gaussian dynamics. This supports the use of appropriate hybrid linear/nonlinear/non-Gaussian modelling. With a calibrated calculation of the statistic and a particular choice of experimental protocol, some of the known systematic problems of the method of surrogate data testing are circumvented to obtain results that support the conclusions to a high level of significance.
Submitted 4 January, 2006;
originally announced January 2006.
-
Chaotic Root-Finding for a Small Class of Polynomials
Authors:
Max Little,
Daniel Heesch
Abstract:
In this paper we present a new closed-form solution to a chaotic difference equation, y(n+1) = a2 y(n)^2 + a1 y(n) + a0 with coefficient a0 = (a1 - 4)(a1 + 2) / (4 a2) and using this solution, show how corresponding exact roots to a special set of related polynomials of order 2^p, p in the naturals, with two independent parameters can be generated, for any p.
Submitted 17 July, 2004;
originally announced July 2004.
-
A thought experiment on Quantum Mechanics and Distributed Failure Detection
Authors:
Mark C. Little
Abstract:
One of the biggest problems in current distributed systems is that presented by one machine attempting to determine the liveness of another in a timely manner. Unfortunately, the symptoms exhibited by a failed machine can also be the result of other causes, e.g., an overloaded machine or network which drops messages, making it impossible to detect a machine failure with certainty until that machine recovers. This is a well understood problem and one which has led to a large body of research into failure suspectors: since it is not possible to detect a failure, the best one can do is suspect a failure and program accordingly. However, one machine's suspicions may not be the same as another's; therefore, these algorithms spend considerable effort in ensuring a consistent view among all available machines of who is suspected of having failed. This paper describes a thought experiment on how quantum mechanics may be used to provide a failure detector that is guaranteed to give both accurate and instantaneous information about the liveness of machines, no matter the distances involved.
Submitted 15 September, 2003;
originally announced September 2003.