RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects

Abstract: Large Language Models (LLMs) have demonstrated potential in assisting with Register Transfer Level (RTL) design tasks. Nevertheless, there remains a significant gap in benchmarks that accurately reflect the complexity of real-world RTL projects. To address this, this paper presents RTL-Repo, a benchmark specifically designed to evaluate LLMs on large-scale RTL design projects. RTL-Repo includes a comprehensive dataset of more than 4000 Verilog code samples extracted from public GitHub repositories, with each sample providing the full context of the corresponding repository. We evaluate several state-of-the-art models on the RTL-Repo benchmark, including GPT-4, GPT-3.5, Starcoder2, alongside Verilog-specific models like VeriGen and RTLCoder, and compare their performance in generating Verilog code for complex projects. The RTL-Repo benchmark provides a valuable resource for the hardware design community to assess and compare LLMs' performance in real-world RTL design scenarios and train LLMs specifically for Verilog code generation in complex, multi-file RTL projects. RTL-Repo is open-source and publicly available on GitHub.

Submitted 27 May, 2024; originally announced May 2024.

arXiv:2405.03327 [pdf, other]

Clustering of Disease Trajectories with Explainable Machine Learning: A Case Study on Postoperative Delirium Phenotypes

Authors: Xiaochen Zheng, Manuel Schürch, Xingyu Chen, Maria Angeliki Komninou, Reto Schüpbach, Ahmed Allam, Jan Bartussek, Michael Krauthammer

Abstract: The identification of phenotypes within complex diseases or syndromes is a fundamental component of precision medicine, which aims to adapt healthcare to individual patient characteristics. Postoperative delirium (POD) is a complex neuropsychiatric condition with significant heterogeneity in its clinical manifestations and underlying pathophysiology. We hypothesize that POD comprises several distinct phenotypes, which cannot be directly observed in clinical practice. Identifying these phenotypes could enhance our understanding of POD pathogenesis and facilitate the development of targeted prevention and treatment strategies. In this paper, we propose an approach that combines supervised machine learning for personalized POD risk prediction with unsupervised clustering techniques to uncover potential POD phenotypes. We first demonstrate our approach using synthetic data, where we simulate patient cohorts with predefined phenotypes based on distinct sets of informative features. We aim to mimic any clinical disease with our synthetic data generation method. By training a predictive model and applying SHAP, we show that clustering patients in the SHAP feature importance space successfully recovers the true underlying phenotypes, outperforming clustering in the raw feature space. We then present a case study using real-world data from a cohort of elderly surgical patients.
The results showcase the utility of our approach in uncovering clinically relevant subtypes of complex disorders like POD, paving the way for more precise and personalized treatment strategies.

Submitted 6 May, 2024; originally announced May 2024.

arXiv:2311.08149 [pdf, other]

Modeling Complex Disease Trajectories using Deep Generative Models with Semi-Supervised Latent Processes

Authors: Cécile Trottet, Manuel Schürch, Ahmed Allam, Imon Barua, Liubov Petelytska, Oliver Distler, Anna-Maria Hoffmann-Vold, Michael Krauthammer, the EUSTAR collaborators

Abstract: In this paper, we propose a deep generative time series approach using latent temporal processes for modeling and holistically analyzing complex disease trajectories. We aim to find meaningful temporal latent representations of an underlying generative process that explain the observed disease trajectories in an interpretable and comprehensive way. To enhance the interpretability of these latent temporal processes, we develop a semi-supervised approach for disentangling the latent space using established medical concepts. By combining the generative approach with medical knowledge, we leverage the ability to discover novel aspects of the disease while integrating medical concepts into the model. We show that the learned temporal latent processes can be utilized for further data analysis and clinical hypothesis testing, including finding similar patients and clustering the disease into new sub-types. Moreover, our method enables personalized online monitoring and prediction of multivariate time series including uncertainty quantification. We demonstrate the effectiveness of our approach in modeling systemic sclerosis, showcasing the potential of our machine learning model to capture complex disease trajectories and acquire new medical knowledge.

Submitted 29 January, 2024; v1 submitted 14 November, 2023; originally announced November 2023.

Comments: Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2023, December 10th, 2023, New Orleans, United States, 23 pages

arXiv:2311.07744 [pdf, other]

Two-Stage Aggregation with Dynamic Local Attention for Irregular Time Series

Authors: Xingyu Chen, Xiaochen Zheng, Amina Mollaysa, Manuel Schürch, Ahmed Allam, Michael Krauthammer

Abstract: Irregular multivariate time series data is characterized by varying time intervals between consecutive observations of measured variables/signals (i.e., features) and varying sampling rates (i.e., recordings/measurement) across these features. Modeling time series while taking into account these irregularities is still a challenging task for machine learning methods. Here, we introduce TADA, a Two-stage Aggregation process with Dynamic local Attention to harmonize time-wise and feature-wise irregularities in multivariate time series. In the first stage, the irregular time series undergoes temporal embedding (TE) using all available features at each time step. This process preserves the contribution of each available feature and generates a fixed-dimensional representation per time step. The second stage introduces a dynamic local attention (DLA) mechanism with adaptive window sizes. DLA aggregates time recordings using feature-specific windows to harmonize irregular time intervals, capturing feature-specific sampling rates. Then hierarchical MLP mixer layers process the output of DLA through multiscale patching to leverage information at various scales for the downstream tasks. TADA outperforms state-of-the-art methods on three real-world datasets, including the latest MIMIC IV dataset, and highlights its effectiveness in handling irregular multivariate time series and its potential for various real-world applications.

Submitted 25 April, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: A short version of this paper has been accepted for presentation at the Findings of Machine Learning for Health (ML4H) 2023 conference

arXiv:2311.07636 [pdf, other]

Attention-based Multi-task Learning for Base Editor Outcome Prediction

Authors: Amina Mollaysa, Ahmed Allam, Michael Krauthammer

Abstract: Human genetic diseases often arise from point mutations, emphasizing the critical need for precise genome editing techniques. Among these, base editing stands out as it allows targeted alterations at the single nucleotide level. However, its clinical application is hindered by low editing efficiency and unintended mutations, necessitating extensive trial-and-error experimentation in the laboratory. To speed up this process, we present an attention-based two-stage machine learning model that learns to predict the likelihood of all possible editing outcomes for a given genomic target sequence. We further propose a multi-task learning schema to jointly learn multiple base editors (i.e. variants) at once. Our model's predictions consistently demonstrated a strong correlation with the actual experimental results on multiple datasets and base editor variants. These results provide further validation for the models' capacity to enhance and accelerate the process of refining base editing designs.

Submitted 15 November, 2023; v1 submitted 13 November, 2023; originally announced November 2023.

Comments: Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2023, December 10th, 2023, New Orleans, United States, 15 pages. arXiv admin note: substantial text overlap with arXiv:2310.02919

arXiv:2310.02919 [pdf, other]

Attention-based Multi-task Learning for Base Editor Outcome Prediction

Authors: Amina Mollaysa, Ahmed Allam, Michael Krauthammer

Abstract: Human genetic diseases often arise from point mutations, emphasizing the critical need for precise genome editing techniques. Among these, base editing stands out as it allows targeted alterations at the single nucleotide level. However, its clinical application is hindered by low editing efficiency and unintended mutations, necessitating extensive trial-and-error experimentation in the laboratory. To speed up this process, we present an attention-based two-stage machine learning model that learns to predict the likelihood of all possible editing outcomes for a given genomic target sequence. We further propose a multi-task learning schema to jointly learn multiple base editors (i.e. variants) at once. Our model's predictions consistently demonstrated a strong correlation with the actual experimental results on multiple datasets and base editor variants. These results provide further validation for the models' capacity to enhance and accelerate the process of refining base editing designs.

Submitted 10 November, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

arXiv:2309.16521 [pdf, other]

Generating Personalized Insulin Treatments Strategies with Deep Conditional Generative Time Series Models

Authors: Manuel Schürch, Xiang Li, Ahmed Allam, Giulia Rathmes, Amina Mollaysa, Claudia Cavelti-Weder, Michael Krauthammer

Abstract: We propose a novel framework that combines deep generative time series models with decision theory for generating personalized treatment strategies. It leverages historical patient trajectory data to jointly learn the generation of realistic personalized treatment and future outcome trajectories through deep generative time series models. In particular, our framework enables the generation of novel multivariate treatment strategies tailored to the personalized patient history and trained for optimal expected future outcomes based on conditional expected utility maximization. We demonstrate our framework by generating personalized insulin treatment strategies and blood glucose predictions for hospitalized diabetes patients, showcasing the potential of our approach for generating improved personalized treatment strategies. Keywords: deep generative model, probabilistic decision support, personalized treatment generation, insulin and blood glucose prediction

Submitted 13 November, 2023; v1 submitted 28 September, 2023; originally announced September 2023.

Comments: Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2023, December 10th, 2023, New Orleans, United States, 17 pages

Journal ref: Machine Learning for Health (ML4H) 2023

arXiv:2303.18205 [pdf, other]

SimTS: Rethinking Contrastive Representation Learning for Time Series Forecasting

Authors: Xiaochen Zheng, Xingyu Chen, Manuel Schürch, Amina Mollaysa, Ahmed Allam, Michael Krauthammer

Abstract: Contrastive learning methods have shown an impressive ability to learn meaningful representations for image or time series classification. However, these methods are less effective for time series forecasting, as optimization of instance discrimination is not directly applicable to predicting the future state from the history context. Moreover, the construction of positive and negative pairs in current technologies strongly relies on specific time series characteristics, restricting their generalization across diverse types of time series data. To address these limitations, we propose SimTS, a simple representation learning approach for improving time series forecasting by learning to predict the future from the past in the latent space. SimTS does not rely on negative pairs or specific assumptions about the characteristics of the particular time series. Our extensive experiments on several benchmark time series forecasting datasets show that SimTS achieves competitive performance compared to existing contrastive learning methods. Furthermore, we show the shortcomings of the current contrastive learning framework used for time series forecasting through a detailed ablation study. Overall, our work suggests that SimTS is a promising alternative to other contrastive learning approaches for time series forecasting.

Submitted 31 March, 2023; originally announced March 2023.

Comments: 13 pages, 6 figures

arXiv:2302.04208 [pdf, other]

Exploratory Analysis of Federated Learning Methods with Differential Privacy on MIMIC-III

Authors: Aron N. Horvath, Matteo Berchier, Farhad Nooralahzadeh, Ahmed Allam, Michael Krauthammer

Abstract: Background: Federated learning methods offer the possibility of training machine learning models on privacy-sensitive data sets, which cannot be easily shared. Multiple regulations pose strict requirements on the storage and usage of healthcare data, leading to data being in silos (i.e. locked-in at healthcare facilities). The application of federated algorithms on these datasets could accelerate disease diagnostics and drug development, as well as improve patient care. Methods: We present an extensive evaluation of the impact of different federation and differential privacy techniques when training models on the open-source MIMIC-III dataset. We analyze a set of parameters influencing a federated model's performance, namely data distribution (homogeneous and heterogeneous), communication strategies (communication rounds vs. local training epochs), and federation strategies (FedAvg vs. FedProx). Furthermore, we assess and compare two differential privacy (DP) techniques during model training: a stochastic gradient descent-based differential privacy algorithm (DP-SGD), and a sparse vector differential privacy technique (DP-SVT). Results: Our experiments show that extreme data distributions across sites (imbalance either in the number of patients or the positive label ratios between sites) lead to a deterioration of model performance when trained using the FedAvg strategy. This issue is resolved when using FedProx with appropriate hyperparameter tuning.
Furthermore, the results show that both differential privacy techniques can reach model performances similar to those of models trained without DP, but at the expense of a large quantifiable privacy leakage. Conclusions: We evaluate empirically the benefits of two federation strategies and propose optimal strategies for the choice of parameters when using differential privacy techniques.
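
As a rough illustration of the FedAvg strategy named in the abstract above, the following minimal sketch shows the commonly described aggregation step: each site's parameter vector is averaged with a weight proportional to its local sample count. The function name `fedavg` and the flat-list parameter format are assumptions made for this example; this is not the authors' code, and FedProx (which adds a proximal term to each site's local objective) is not shown.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted average of per-site parameter vectors.

    Hypothetical helper for illustration: parameters are flat lists of floats.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, n in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same parameter shape")
        share = n / total  # larger sites contribute proportionally more
        for i, w in enumerate(weights):
            averaged[i] += share * w
    return averaged


if __name__ == "__main__":
    # Two sites with a 1:3 patient imbalance; the larger site dominates.
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Because the weights follow site size, a strong imbalance in patient counts lets one site dominate the global model, which is one plausible reading of why extreme data distributions are reported to hurt FedAvg.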

Submitted 8 February, 2023; originally announced February 2023.

arXiv:2210.00802 [pdf, other]

DDoS: A Graph Neural Network based Drug Synergy Prediction Algorithm

Authors: Kyriakos Schwarz, Alicia Pliego-Mendieta, Amina Mollaysa, Lara Planas-Paz, Chantal Pauli, Ahmed Allam, Michael Krauthammer

Abstract: Drug synergy arises when the combined impact of two drugs exceeds the sum of their individual effects. While single-drug effects on cell lines are well-documented, the scarcity of data on drug synergy, considering the vast array of potential drug combinations, prompts a growing interest in computational approaches for predicting synergies in untested drug pairs. We introduce a Graph Neural Network (GNN) based model for drug synergy prediction, which utilizes drug chemical structures and cell line gene expression data. We extract data from the largest available drug combination database (DrugComb) and generate multiple synergy scores (commonly used in the literature) to create seven datasets that serve as a reliable benchmark with high confidence. In contrast to conventional models relying on pre-computed chemical features, our GNN-based approach learns task-specific drug representations directly from the graph structure of the drugs, providing superior performance in predicting drug synergies. Our work suggests that learning task-specific drug representations and leveraging a diverse dataset is a promising approach to advancing our understanding of drug-drug interaction and synergy.
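
The definition of synergy in the abstract above can be made concrete with a small sketch. It computes the excess of an observed combination effect over two reference expectations: the plain sum of single-drug effects (the wording used in the abstract) and the Bliss independence expectation, a widely used alternative. The function names, the choice of fractional inhibition in [0, 1] as the effect scale, and the inclusion of Bliss are assumptions for illustration; the abstract does not state which synergy scores the authors generate.

```python
def additive_excess(effect_a: float, effect_b: float, effect_combo: float) -> float:
    """Observed combination effect minus the sum of the single-drug effects."""
    return effect_combo - (effect_a + effect_b)


def bliss_excess(effect_a: float, effect_b: float, effect_combo: float) -> float:
    """Observed combination effect minus the Bliss independence expectation.

    Assumes effects are fractional inhibitions in [0, 1], so the expected
    combined inhibition of two independent drugs is a + b - a * b.
    """
    for value in (effect_a, effect_b, effect_combo):
        if not 0.0 <= value <= 1.0:
            raise ValueError("effects must be fractions in [0, 1]")
    expected = effect_a + effect_b - effect_a * effect_b
    return effect_combo - expected


if __name__ == "__main__":
    # Hypothetical inhibitions: 20% and 30% alone, 60% in combination.
    print(additive_excess(0.2, 0.3, 0.6))  # positive value suggests synergy
    print(bliss_excess(0.2, 0.3, 0.6))
```

A positive excess indicates synergy under the chosen reference model, a negative one antagonism; the two references can disagree, which is one reason benchmarks often report several scores.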

Submitted 26 April, 2024; v1 submitted 3 October, 2022; originally announced October 2022.

arXiv:2205.12008 [pdf, other]

doi 10.23919/ACC53348.2022.9867490

Core-shell enhanced single particle model for LiFePO$_4$ batteries

Authors: Aki Takahashi, Gabriele Pozzato, Anirudh Allam, Vahid Azimi, Xueyan Li, Donghoon Lee, Johan Ko, Simona Onori

Abstract: In this paper, a novel electrochemical model for LiFePO$_4$ battery cells that accounts for the positive particle lithium intercalation and deintercalation dynamics is proposed. Starting from the enhanced single particle model, mass transport and balance equations along with suitable boundary conditions are introduced to model the phase transformation phenomena during lithiation and delithiation in the positive electrode material. The lithium-poor and lithium-rich phases are modeled using the core-shell principle, where a core composition is encapsulated with a shell composition. The coupled partial differential equations describing the phase transformation are discretized using the finite difference method, from which a system of ordinary differential equations written in state-space representation is obtained. Finally, model parameter identification is performed using experimental data from a 49Ah LFP pouch cell.

Submitted 20 May, 2022; originally announced May 2022.

arXiv:2203.04249 [pdf]

Evaluating feasibility of batteries for second-life applications using machine learning

Authors: Aki Takahashi, Anirudh Allam, Simona Onori

Abstract: This paper presents a combination of machine learning techniques to enable prompt evaluation of retired electric vehicle batteries, to decide whether to retain those batteries for a second-life application and extend their operation beyond the original intent, or to send them to recycling facilities. The proposed algorithm generates features from available battery current and voltage measurements with simple statistics, selects and ranks the features using correlation analysis, and employs Gaussian Process Regression enhanced with bagging. This approach is validated over publicly available aging datasets of more than 200 cells with slow and fast charging, with different cathode chemistries, and for diverse operating conditions. Promising results are observed based on multiple training-test partitions, wherein the means of the Root Mean Squared Percent Error and Mean Percent Error are found to be less than 1.48% and 1.29%, respectively, in the worst-case scenarios.
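
The two error measures quoted in the abstract above are not defined there. The sketch below uses common textbook definitions as an assumption: RMSPE as the root mean square of relative errors and MPE as the signed mean of relative errors, both expressed in percent. The function name `percent_errors` and the sign convention (predicted minus actual) are choices made for this example, and the paper may define the metrics differently, for instance with absolute values.

```python
import math
from typing import Sequence, Tuple


def percent_errors(actual: Sequence[float],
                   predicted: Sequence[float]) -> Tuple[float, float]:
    """Return (RMSPE, MPE) in percent under the assumed definitions.

    Actual values must be nonzero because each error is divided by them.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("actual values must be nonzero")

    relative = [(p - a) / a for a, p in zip(actual, predicted)]
    n = len(relative)
    rmspe = 100.0 * math.sqrt(sum(r * r for r in relative) / n)
    mpe = 100.0 * sum(relative) / n  # signed: over- and under-shoot can cancel
    return rmspe, mpe


if __name__ == "__main__":
    # Hypothetical capacity values in Ah, for illustration only.
    rmspe, mpe = percent_errors([100.0, 200.0], [110.0, 190.0])
    print(f"RMSPE = {rmspe:.2f}%, MPE = {mpe:.2f}%")
```

Under these definitions RMSPE penalizes large individual misses, while a signed MPE mainly reveals systematic bias, so the two figures are usually read together.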

Submitted 7 April, 2023; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: 23 pages

arXiv:2203.04226 [pdf, other]

Extending life of Lithium-ion battery systems by embracing heterogeneities via an optimal control-based active balancing strategy

Authors: Vahid Azimi, Anirudh Allam, Simona Onori

Abstract: This paper formulates and solves a multi-objective fast charging-minimum degradation optimal control problem (OCP) for a lithium-ion battery module made of series-connected cells equipped with active balancing circuitry. The cells in the module are subject to heterogeneity induced by manufacturing defects and non-uniform operating conditions. Each cell is expressed via a coupled nonlinear electrochemical, thermal, and aging model and the direct collocation approach is employed to transcribe the OCP into a nonlinear programming problem (NLP). The proposed OCP is formulated under two different schemes of charging operation: (i) same-charging-time (OCP-SCT) and (ii) different-charging-time (OCP-DCT). The former assumes simultaneous charging of all cells irrespective of their initial conditions, whereas the latter allows for different charging times of the cells to account for heterogeneous initial conditions. The problem is solved for a module with two series-connected cells with intrinsic heterogeneity among them in terms of state of charge and state of health. Results show that the OCP-DCT scheme provides more flexibility to deal with heterogeneity, with lower temperature increase, charging current amplitudes, and degradation. Finally, comparison with the common practice of constant current (CC) charging over a long-term cycling operation shows that promising savings, in terms of retained capacity, are attainable under both control schemes (OCP-SCT and OCP-DCT).

Submitted 2 September, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: 16 pages

arXiv:2012.13248 [pdf, other]

AttentionDDI: Siamese Attention-based Deep Learning method for drug-drug interaction predictions

Authors: Kyriakos Schwarz, Ahmed Allam, Nicolas Andres Perez Gonzalez, Michael Krauthammer

Abstract: Background: Drug-drug interactions (DDIs) refer to processes triggered by the administration of two or more drugs leading to side effects beyond those observed when drugs are administered by themselves. Due to the massive number of possible drug pairs, it is nearly impossible to experimentally test all combinations and discover previously unobserved side effects. Therefore, machine learning based methods are being used to address this issue. Methods: We propose a Siamese self-attention multi-modal neural network for DDI prediction that integrates multiple drug similarity measures that have been derived from a comparison of drug characteristics including drug targets, pathways and gene expression profiles. Results: Our proposed DDI prediction model provides multiple advantages: 1) It is trained end-to-end, overcoming limitations of models composed of multiple separate steps, 2) it offers model explainability via an Attention mechanism for identifying salient input features and 3) it achieves similar or better prediction performance (AUPR scores ranging from 0.77 to 0.92) compared to state-of-the-art DDI models when tested on various benchmark datasets. Novel DDI predictions are further validated using independent data resources. Conclusions: We find that a Siamese multi-modal neural network is able to accurately predict DDIs and that an Attention mechanism, typically used in the Natural Language Processing domain, can be beneficially applied to aid in DDI model explainability.

Submitted 24 December, 2020; originally announced December 2020.

arXiv:2008.10467 [pdf, other]

doi 10.1109/TCST.2020.3017566

On-line Capacity Estimation for Lithium-ion Battery Cells via an Electrochemical Model-based Adaptive Interconnected Observer

Authors: Anirudh Allam, Simona Onori

Abstract: Battery aging is a natural process that contributes to capacity and power fade, resulting in a gradual performance degradation over time and usage. State of Charge (SOC) and State of Health (SOH) monitoring of an aging battery poses a challenging task to the Battery Management System (BMS) due to the lack of direct measurements. Estimation algorithms based on an electrochemical model that take into account the impact of aging on physical battery parameters can provide accurate information on lithium concentration and cell capacity over a battery's usable lifespan. A temperature-dependent electrochemical model, the Enhanced Single Particle Model (ESPM), forms the basis for the synthesis of an adaptive interconnected observer that exploits the relationship between capacity and power fade, due to the growth of the Solid Electrolyte Interphase (SEI) layer, to enable combined estimation of states (lithium concentration in both electrodes and cell capacity) and aging-sensitive transport parameters (anode diffusion coefficient and SEI layer ionic conductivity). The practical stability conditions for the adaptive observer are derived using Lyapunov's theory. Validation results against experimental data show a bounded capacity estimation error within 2% of its true value. Further, effectiveness of capacity estimation is tested for two cells at different stages of aging. Robustness of capacity estimates under measurement noise and sensor bias is studied.

Submitted 24 June, 2021; v1 submitted 24 August, 2020; originally announced August 2020.

Comments: 16 pages

Journal ref: IEEE Transactions on Control Systems Technology, vol. 29, no. 4 (2021) 1636 - 1651

arXiv:2005.06630 [pdf]

Patient Similarity Analysis with Longitudinal Health Data

Authors: Ahmed Allam, Matthias Dittberner, Anna Sintsova, Dominique Brodbeck, Michael Krauthammer

Abstract: Healthcare professionals have long envisioned using the enormous processing powers of computers to discover new facts and medical knowledge locked inside electronic health records. These vast medical archives contain time-resolved information about medical visits, tests and procedures, as well as outcomes, which together form individual patient journeys. By assessing the similarities among these journeys, it is possible to uncover clusters of common disease trajectories with shared health outcomes. The assignment of patient journeys to specific clusters may in turn serve as the basis for personalized outcome prediction and treatment selection. This procedure is a non-trivial computational problem, as it requires the comparison of patient data with multi-dimensional and multi-modal features that are captured at different times and resolutions. In this review, we provide a comprehensive overview of the tools and methods that are used in patient similarity analysis with longitudinal data and discuss its potential for improving clinical decision making.

Submitted 14 May, 2020; originally announced May 2020.

arXiv:1912.12999 [pdf, other]

AutoDiscern: Rating the Quality of Online Health Information with Hierarchical Encoder Attention-based Neural Networks

Authors: Laura Kinkead, Ahmed Allam, Michael Krauthammer

Abstract: Patients increasingly turn to search engines and online content before, or in place of, talking with a health professional. Low quality health information, which is common on the internet, presents risks to the patient in the form of misinformation and a possibly poorer relationship with their physician. To address this, the DISCERN criteria (developed at University of Oxford) are used to evaluate the quality of online health information. However, patients are unlikely to take the time to apply these criteria to the health websites they visit. We built an automated implementation of the DISCERN instrument (Brief version) using machine learning models. We compared the performance of a traditional model (Random Forest) with that of a hierarchical encoder attention-based neural network (HEA) model using two language embeddings, BERT and BioBERT. The HEA BERT and BioBERT models achieved average F1-macro scores across all criteria of 0.75 and 0.74, respectively, outperforming the Random Forest model (average F1-macro = 0.69). Overall, the neural network based models achieved 81% and 86% average accuracy at 100% and 80% coverage, respectively, compared to 94% manual rating accuracy. The attention mechanism implemented in the HEA architectures not only provided 'model explainability' by identifying reasonable supporting sentences for the documents fulfilling the Brief DISCERN criteria, but also boosted F1 performance by 0.05 compared to the same architecture without an attention mechanism.
Our research suggests that it is feasible to automate online health information quality assessment, which is an important step towards empowering patients to become informed partners in the healthcare process.

Submitted 26 May, 2020; v1 submitted 30 December, 2019; originally announced December 2019.

arXiv:1812.09549 [pdf, other]

Neural networks versus Logistic regression for 30 days all-cause readmission prediction

Authors: Ahmed Allam, Mate Nagy, George Thoma, Michael Krauthammer

Abstract: Heart failure (HF) is one of the leading causes of hospital admissions in the US. Readmission within 30 days after a HF hospitalization is both a recognized indicator for disease progression and a source of considerable financial burden to the healthcare system. Consequently, the identification of patients at risk for readmission is a key step in improving disease management and patient outcome. In this work, we used a large administrative claims dataset to (1) explore the systematic application of neural network-based models versus logistic regression for predicting 30 days all-cause readmission after discharge from a HF admission, and (2) examine the additive value of patients' hospitalization timelines on prediction performance. Based on data from 272,778 (49% female) patients with a mean (SD) age of 73 years (14) and 343,328 HF admissions (67% of total admissions), we trained and tested our predictive readmission models following a stratified 5-fold cross-validation scheme. Among the deep learning approaches, a recurrent neural network (RNN) combined with conditional random fields (CRF) model (RNNCRF) achieved the best performance in readmission prediction with 0.642 AUC (95% CI, 0.640-0.645). Other models, such as those based on RNN, convolutional neural networks and CRF alone, had lower performance, with a non-timeline based model (MLP) performing worst. A competitive model based on logistic regression with LASSO achieved a performance of 0.643 AUC (95% CI, 0.640-0.646).
We conclude that data from patient timelines improve 30 day readmission prediction for neural network-based models, that a logistic regression with LASSO has equal performance to the best neural network model and that the use of administrative data results in competitive performance compared to published approaches based on richer clinical datasets.

Submitted 22 December, 2018; originally announced December 2018.

arXiv:1509.05045 [pdf, other]

doi 10.1051/0004-6361/201527377

SDSS-IV eBOSS emission-line galaxy pilot survey

Authors: J. Comparat, T. Delubac, S. Jouvel, A. Raichoor, J-P. Kneib, C. Yeche, F. B. Abdalla, C. Le Cras, C. Maraston, D. M. Wilkinson, G. Zhu, E. Jullo, F. Prada, D. Schlegel, Z. Xu, H. Zou, J. Bautista, D. Bizyaev, A. Bolton, J. R. Brownstein, K. S. Dawson, S. Escoffier, P. Gaulme, K. Kinemuchi, E. Malanushenko, V. Malanushenko, et al. (61 additional authors not shown)

Abstract: The Sloan Digital Sky Survey IV extended Baryonic Oscillation Spectroscopic Survey (SDSS-IV/eBOSS) will observe 195,000 emission-line galaxies (ELGs) to measure the Baryonic Acoustic Oscillation standard ruler (BAO) at redshift 0.9. To test different ELG selection algorithms, 9,000 spectra were observed with the SDSS spectrograph as a pilot survey based on data from several imaging surveys. First, using visual inspection and redshift quality flags, we show that the automated spectroscopic redshifts assigned by the pipeline meet the quality requirements for a reliable BAO measurement. We also show the correlations between sky emission, signal-to-noise ratio in the emission lines, and redshift error. Then we provide a detailed description of each target selection algorithm we tested and compare them with the requirements of the eBOSS experiment. As a result, we provide reliable redshift distributions for the different target selection schemes we tested. Finally, we determine the target selection algorithm that is best suited to be applied to DECam photometry because it fulfills the eBOSS survey efficiency requirements.

Submitted 21 June, 2016; v1 submitted 16 September, 2015; originally announced September 2015.

Comments: 19 pages. Accepted in A&A

Journal ref: A&A 592, A121 (2016)

arXiv:1109.6884 [pdf, other]

ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings

Authors: Essam Mansour, Amin Allam, Spiros Skiadopoulos, Panos Kalnis

Abstract: The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient. This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERa indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.

Submitted 30 September, 2011; originally announced September 2011.

Comments: VLDB2012

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 1, pp. 49-60 (2011)

Showing 1–20 of 20 results for author: Allam, A