-
Blind Biological Sequence Denoising with Self-Supervised Set Learning
Authors:
Nathan Ng,
Ji Won Park,
Jae Hyeon Lee,
Ryan Lewis Kelly,
Stephen Ra,
Kyunghyun Cho
Abstract:
Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of $\leq 6$ subreads with 17% fewer errors and large reads of $>6$ subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.
Submitted 4 September, 2023;
originally announced September 2023.
-
A Pareto-optimal compositional energy-based model for sampling and optimization of protein sequences
Authors:
Nataša Tagasovska,
Nathan C. Frey,
Andreas Loukas,
Isidro Hötzel,
Julien Lafrance-Vanasse,
Ryan Lewis Kelly,
Yan Wu,
Arvind Rajpal,
Richard Bonneau,
Kyunghyun Cho,
Stephen Ra,
Vladimir Gligorijević
Abstract:
Deep generative models have emerged as a popular machine learning-based approach for inverse design problems in the life sciences. However, these problems often require sampling new designs that satisfy multiple properties of interest in addition to learning the data distribution. This multi-objective optimization becomes more challenging when properties are independent or orthogonal to each other. In this work, we propose a Pareto-compositional energy-based model (pcEBM), a framework that uses multiple gradient descent for sampling new designs that adhere to various constraints in optimizing distinct properties. We demonstrate its ability to learn non-convex Pareto fronts and generate sequences that simultaneously satisfy multiple desired properties across a series of real-world antibody design tasks.
Submitted 19 October, 2022;
originally announced October 2022.
-
A Human-Centered Machine-Learning Approach for Muscle-Tendon Junction Tracking in Ultrasound Images
Authors:
Christoph Leitner,
Robert Jarolim,
Bernhard Englmair,
Annika Kruse,
Karen Andrea Lara Hernandez,
Andreas Konrad,
Eric Su,
Jörg Schröttner,
Luke A. Kelly,
Glen A. Lichtwark,
Markus Tilp,
Christian Baumgartner
Abstract:
Biomechanical and clinical gait research observes muscles and tendons in limbs to study their functions and behaviour. Therefore, movements of distinct anatomical landmarks, such as muscle-tendon junctions, are frequently measured. We propose a reliable and time-efficient machine-learning approach to track these junctions in ultrasound videos and support clinical biomechanists in gait analysis. To facilitate this process, we introduce a method based on deep learning. We gathered an extensive dataset covering 3 functional movements and 2 muscles, collected on 123 healthy and 38 impaired subjects with 3 different ultrasound systems, providing a total of 66864 annotated ultrasound images for network training. Furthermore, we used data collected across independent laboratories and curated by researchers with varying levels of experience. For the evaluation of our method, a diverse test set was selected and independently verified by four specialists. We show that our model achieves similar performance scores to the four human specialists in identifying the muscle-tendon junction position. Our method provides time-efficient tracking of muscle-tendon junctions, with prediction times of up to 0.078 seconds per frame (approx. 100 times faster than manual labeling). All our code, trained models, and test set are publicly available, and our model is provided as a free-to-use online service at https://deepmtj.org/.
Submitted 10 February, 2022;
originally announced February 2022.
-
Multimodal Approach for Assessing Neuromotor Coordination in Schizophrenia Using Convolutional Neural Networks
Authors:
Yashish M. Siriwardena,
Chris Kitchen,
Deanna L. Kelly,
Carol Espy-Wilson
Abstract:
This study investigates speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g. hallucinations and delusions), using two distinct channel-delay correlation methods. We show that schizophrenic subjects with strong positive symptoms who are markedly ill exhibit more complex articulatory coordination patterns in facial and speech gestures than those observed in healthy subjects. This distinction in speech coordination patterns is used to train a multimodal convolutional neural network (CNN) which uses video and audio data during speech to distinguish schizophrenic patients with strong positive symptoms from healthy subjects. We also show that the vocal tract variables (TVs) which correspond to place of articulation and glottal source outperform the Mel-frequency Cepstral Coefficients (MFCCs) when fused with Facial Action Units (FAUs) in the proposed multimodal network. For the clinical dataset we collected, our best performing multimodal network improves the mean F1 score for detecting schizophrenia by around 18% with respect to the full vocal tract coordination (FVTC) baseline method implemented by fusing FAUs and MFCCs.
Submitted 8 October, 2021;
originally announced October 2021.
-
Deepfake Videos in the Wild: Analysis and Detection
Authors:
Jiameng Pu,
Neal Mangaokar,
Lauren Kelly,
Parantapa Bhattacharya,
Kavya Sundaram,
Mobin Javed,
Bolun Wang,
Bimal Viswanath
Abstract:
AI-manipulated videos, commonly known as deepfakes, are an emerging problem. Recently, researchers in academia and industry have contributed several (self-created) benchmark deepfake datasets and deepfake detection algorithms. However, little effort has gone towards understanding deepfake videos in the wild, leading to a limited understanding of the real-world applicability of research contributions in this space. Even if detection schemes are shown to perform well on existing datasets, it is unclear how well the methods generalize to real-world deepfakes. To bridge this gap in knowledge, we make the following contributions: First, we collect and present the largest dataset of deepfake videos in the wild, containing 1,869 videos from YouTube and Bilibili, and extract over 4.8M frames of content. Second, we present a comprehensive analysis of the growth patterns, popularity, creators, manipulation strategies, and production methods of deepfake content in the real world. Third, we systematically evaluate existing defenses using our new dataset, and observe that they are not ready for deployment in the real world. Fourth, we explore the potential for transfer learning schemes and competition-winning techniques to improve defenses.
Submitted 10 March, 2021; v1 submitted 6 March, 2021;
originally announced March 2021.
-
Deep Generative Pattern-Set Mixture Models for Nonignorable Missingness
Authors:
Sahra Ghalebikesabi,
Rob Cornish,
Luke J. Kelly,
Chris Holmes
Abstract:
We propose a variational autoencoder architecture to model both ignorable and nonignorable missing data using pattern-set mixtures as proposed by Little (1993). Our model explicitly learns to cluster the missing data into missingness pattern sets based on the observed data and missingness masks. Underpinning our approach is the assumption that the data distribution under missingness is probabilistically semi-supervised by samples from the observed data distribution. Our setup trades off the characteristics of ignorable and nonignorable missingness and can thus be applied to data of both types. We evaluate our method on a wide range of data sets with different types of missingness and achieve state-of-the-art imputation performance. Our model outperforms many common imputation algorithms, especially when the amount of missing data is high and the missingness mechanism is nonignorable.
Submitted 5 March, 2021;
originally announced March 2021.
-
Report on the 2019 Workshop on Smart Farming and Data Analytics (SFDAI)
Authors:
Liadh Kelly,
Simone van der Burg,
Aine Regan,
Peter Mooney
Abstract:
The 1st National Workshop on Smart Farming and Data Analytics took place at Maynooth University in Ireland on June 12, 2019. The workshop included two invited keynote presentations, invited talks, and breakout group discussions. The workshop attracted on the order of 50 participants, consisting of a mixture of computer scientists, general scientists, farmers, farm advisors, and agricultural business representatives. This allowed for lively discussion and cross-fertilization of ideas, and showed the significant interest in the smart farming domain, the many research challenges faced in the space, and the potential for data analytics and information retrieval here.
Submitted 7 September, 2020;
originally announced September 2020.
-
Predicting Injectable Medication Adherence via a Smart Sharps Bin and Machine Learning
Authors:
Yingqi Gu,
Akshay Zalkikar,
Lara Kelly,
Kieran Daly,
Tomas E. Ward
Abstract:
Medication non-adherence is a widespread problem affecting over 50% of people who have chronic illness and need chronic treatment. Non-adherence exacerbates health risks and drives significant increases in treatment costs. In order to address these challenges, the importance of predicting patients' adherence has been recognised. In other words, it is important to improve the efficiency of interventions of the current healthcare system by prioritizing resources to the patients who are most likely to be non-adherent. Our objective in this work is to make predictions regarding individual patients' behaviour in terms of taking their medication on time during their next scheduled medication opportunity. We do this by leveraging a number of machine learning models. In particular, we demonstrate the use of a connected IoT device, a "Smart Sharps Bin" invented by HealthBeacon Ltd., to monitor and track injection disposal of patients in their home environment. Using extensive data collected from these devices, five machine learning models, namely Extra Trees Classifier, Random Forest, XGBoost, Gradient Boosting, and Multilayer Perceptron, were trained and evaluated on a large dataset comprising 165,223 historic injection disposal records collected from 5,915 HealthBeacon units over the course of 3 years. The testing work was conducted on real-time data generated by the smart device over a time period after the model training was complete, i.e. true future data. The proposed machine learning approach demonstrated very good predictive performance, exhibiting an Area Under the Receiver Operating Characteristic Curve (ROC AUC) of 0.86.
Submitted 2 April, 2020;
originally announced April 2020.