Search | arXiv e-print repository

Coding historical causes of death data with Large Language Models

Authors: Bjørn Pedersen, Maisha Islam, Doris Tove Kristoffersen, Lars Ailo Bongo, Eilidh Garrett, Alice Reid, Hilde Sommerseth

Abstract: This paper investigates the feasibility of using pre-trained generative Large Language Models (LLMs) to automate the assignment of ICD-10 codes to historical causes of death. Due to the complex narratives often found in historical causes of death, this task has traditionally been manually performed by coding experts. We evaluate the ability of GPT-3.5, GPT-4, and Llama 2 LLMs to accurately assign ICD-10 codes on the HiCaD dataset, which contains causes of death recorded in the civil death register entries of 19,361 individuals from Ipswich, Kilmarnock, and the Isle of Skye in the UK between 1861 and 1901. Our findings show that GPT-3.5, GPT-4, and Llama 2 assign the correct code for 69%, 83%, and 40% of causes, respectively. However, we achieve a maximum accuracy of 89% with standard machine learning techniques. All LLMs performed better for causes of death that contained terms still in use today, compared to archaic terms. They also performed better for short causes (1-2 words) than for longer causes. LLMs therefore do not currently perform well enough for historical ICD-10 code assignment tasks. We suggest further fine-tuning or alternative frameworks to achieve adequate performance.

Submitted 13 May, 2024; originally announced May 2024.

Comments: 18 pages, 1 figure in main text, 3 figures in appendix

arXiv:2405.02913 [pdf]

Fast TILs estimation in lung cancer WSIs based on semi-stochastic patch sampling

Authors: Nikita Shvetsov, Anders Sildnes, Lill-Tove Rasmussen Busund, Stig Dalen, Kajsa Møllersen, Lars Ailo Bongo, Thomas K. Kilvaer

Abstract: Addressing the critical need for accurate prognostic biomarkers in cancer treatment, quantifying tumor-infiltrating lymphocytes (TILs) in non-small cell lung cancer (NSCLC) presents considerable challenges. Manual TIL quantification in whole slide images (WSIs) is laborious and subject to variability, potentially undermining patient outcomes. Our study introduces an automated pipeline that utilizes semi-stochastic patch sampling, patch classification to retain prognostically relevant patches, and cell quantification using the HoVer-Net model to streamline the TIL evaluation process. This pipeline efficiently excludes approximately 70% of areas not relevant for prognosis and requires only 5% of the remaining patches to maintain prognostic accuracy (c-index 0.65 ± 0.01). The computational efficiency achieved does not sacrifice prognostic accuracy, as demonstrated by the TILs score's strong correlation with patient survival, which surpasses traditional CD8 IHC scoring methods. While the pipeline demonstrates potential for enhancing NSCLC prognostication and personalization of treatment, comprehensive clinical validation is still required. Future research should focus on verifying its broader clinical utility and investigating additional biomarkers to improve NSCLC prognosis.

Submitted 5 May, 2024; originally announced May 2024.

Comments: 18 pages, 7 figures, 6 appendix pages

MSC Class: 68T07 ACM Class: I.4.6; I.4.9; J.3

arXiv:2308.02613 [pdf, other]

Interoperable synthetic health data with SyntHIR to enable the development of CDSS tools

Authors: Pavitra Chauhan, Mohsen Gamal Saad Askar, Bjørn Fjukstad, Lars Ailo Bongo, Edvard Pedersen

Abstract: There is a great opportunity to use high-quality patient journals and health registers to develop machine learning-based Clinical Decision Support Systems (CDSS). To implement a CDSS tool in a clinical workflow, there is a need to integrate, validate and test this tool on the Electronic Health Record (EHR) systems used to store and manage patient data. However, it is often not possible to get the necessary access to an EHR system due to legal compliance. We propose an architecture for generating and using synthetic EHR data for CDSS tool development. The architecture is implemented in a system called SyntHIR. The SyntHIR system uses the Fast Healthcare Interoperability Resources (FHIR) standards for data interoperability, the Gretel framework for generating synthetic data, the Microsoft Azure FHIR server as the FHIR-based EHR system and the SMART on FHIR framework for tool transportability. We demonstrate the usefulness of SyntHIR by developing a machine learning-based CDSS tool using data from the Norwegian Patient Register (NPR) and Norwegian Patient Prescriptions (NorPD). We demonstrate the development of the tool on the SyntHIR system and then lift it to the Open DIPS environment. In conclusion, SyntHIR provides a generic architecture for CDSS tool development using synthetic FHIR data and a testing environment before implementing it in a clinical setting. However, there is scope for improvement in terms of the quality of the synthetic data generated. The code is open source and available at https://github.com/potter-coder89/SyntHIR.git.

Submitted 4 August, 2023; originally announced August 2023.

arXiv:2306.16126 [pdf]

More efficient manual review of automatically transcribed tabular data

Authors: Bjørn-Richard Pedersen, Rigmor Katrine Johansen, Einar Holsbø, Hilde Sommerseth, Lars Ailo Bongo

Abstract: Machine learning methods have proven useful in transcribing historical data. However, results from even highly accurate methods require manual verification and correction. Such manual review can be time-consuming and expensive; the objective of this paper was therefore to make it more efficient. Previously, we used machine learning to transcribe 2.3 million handwritten occupation codes from the Norwegian 1950 census with high accuracy (97%). We manually reviewed the 90,000 (3%) codes with the lowest model confidence. We allocated those 90,000 codes to human reviewers, who used our annotation tool to review the codes. To assess reviewer agreement, some codes were assigned to multiple reviewers. We then analyzed the review results to understand the relationship between accuracy improvements and effort. Additionally, we interviewed the reviewers to improve the workflow. The reviewers corrected 62.8% of the labels and agreed with the model label in 31.9% of cases. About 0.2% of the images could not be assigned a label, while for 5.1% the reviewers were uncertain, or they assigned an invalid label. 9,000 images were independently reviewed by multiple reviewers, resulting in an agreement of 86.43% and disagreement of 8.96%. We learned that our automatic transcription is biased towards the most frequent codes, with a higher degree of misclassification for the lowest frequency codes. Our interview findings show that the reviewers did internal quality control and found our custom tool well-suited. We conclude that only one reviewer is needed, but that they should report uncertainty.
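The review step described in this abstract, sending the lowest-confidence share of model outputs to human reviewers, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `select_for_review`, the `(item_id, label, confidence)` record layout, and the sample values are all hypothetical and chosen only for the illustration.

```python
import math


def select_for_review(predictions, fraction=0.03):
    """Return the share of predictions with the lowest model confidence.

    `predictions` is a list of (item_id, label, confidence) tuples.
    This record layout is an assumption made for the sketch.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    # Round up so that a non-zero fraction never selects zero items
    # from a non-empty list.
    n_review = math.ceil(len(predictions) * fraction)
    # Sort ascending by confidence so the least certain items come first.
    ranked = sorted(predictions, key=lambda p: p[2])
    return ranked[:n_review]


if __name__ == "__main__":
    # Hypothetical model outputs: (image id, predicted code, confidence).
    sample = [
        ("img-001", "111", 0.99),
        ("img-002", "245", 0.41),
        ("img-003", "111", 0.87),
        ("img-004", "902", 0.12),
    ]
    for item_id, label, confidence in select_for_review(sample, fraction=0.5):
        print(item_id, label, confidence)
```

With the abstract's figures, a fraction of 0.03 applied to 2.3 million codes selects on the order of the 90,000 codes that were sent for review.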

Submitted 28 June, 2023; originally announced June 2023.

Comments: 19 pages, 5 figures, 1 table

arXiv:2306.01546 [pdf, other]

Publicly available datasets of breast histopathology H&E whole-slide images: A scoping review

Authors: Masoud Tafavvoghi, Lars Ailo Bongo, Nikita Shvetsov, Lill-Tove Rasmussen Busund, Kajsa Møllersen

Abstract: Advancements in digital pathology and computing resources have made a significant impact in the field of computational pathology for breast cancer diagnosis and treatment. However, access to high-quality labeled histopathological images of breast cancer is a big challenge that limits the development of accurate and robust deep learning models. In this scoping review, we identified the publicly available datasets of breast H&E stained whole-slide images (WSI) that can be used to develop deep learning algorithms. We systematically searched nine scientific literature databases and nine research data repositories and found 17 publicly available datasets containing 10385 H&E WSIs of breast cancer. Moreover, we reported image metadata and characteristics for each dataset to assist researchers in selecting proper datasets for specific tasks in breast cancer computational pathology. In addition, we compiled two lists of breast H&E patches and private datasets as supplementary resources for researchers. Notably, only 28% of the included articles utilized multiple datasets, and only 14% used an external validation set, suggesting that the performance of other developed models may be susceptible to overestimation. The TCGA-BRCA was used in 52% of the selected studies. This dataset has a considerable selection bias that can impact the robustness and generalizability of the trained algorithms. There is also a lack of consistent metadata reporting of breast WSI datasets that can be an issue in developing accurate deep learning models, indicating the necessity of establishing explicit guidelines for documenting breast WSI dataset characteristics and metadata.

Submitted 6 December, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: 27 pages (including references), 8 figures, 3 tables, 5 supporting information materials

MSC Class: 68T01 General topics in artificial intelligence ACM Class: I.2.0

arXiv:2202.08794 [pdf]

Social network analysis of Staphylococcus aureus carriage in a general youth population

Authors: Dina Benedicte Stensen, Rafael Adolfo Nozal Cañadas, Lars Småbrekke, Karina Olsen, Christopher Sivert Nielsen, Kristian Svendsen, Anne Merethe Hanssen, Johanna Sollid, Gunnar Skov Simonsen, Lars Ailo Bongo, Anne-Sofie Furberg

Abstract: Staphylococcus aureus nasal carriage increases risk of infection and has been associated with lifestyle behavior and biological host characteristics. We used social network analysis to evaluate whether contacts have the same S. aureus genotype, or whether contagiousness is an indirect effect of contacts sharing the same lifestyle or characteristics. The Fit Futures 1 study collected data on social contact among 1038 first level students in the same high school district in Norway. S. aureus persistent carriage was determined from two nasal swab cultures and genotype from spa-typing of a positive throat swab culture. Bootstrap, t-tests, logistic regression, and autocorrelation were used to evaluate social network influence on host risk factors and S. aureus carriage. Both persistent carriage and spa-type were transmitted in the social network (p<0.001). The probability of carriage increased by 3.7% and 5.0% for each additional S. aureus positive friend, in univariable regression and multivariable autocorrelation analysis respectively. Male sex was associated with a 15% lower risk of transmission compared to female sex, although the prevalence of carriage was higher for men (36% versus 24%). Students with medium physical activity, medium or high alcohol use, or normal weight had a higher number of contacts and an increased risk of transmission (p<0.002). We demonstrate direct social transmission of S. aureus in a general youth population. Lifestyle factors are associated with risk of transmission, suggesting indirect social group effects from having more similar environmental exposures. The male predominance in carriage is determined by sex-specific predisposing host characteristics, as social transmission is less frequent in males than in females. Better understanding of how social interactions influence S. aureus carriage dynamics in the population is important for developing new preventive measures.

Submitted 17 February, 2022; originally announced February 2022.

Comments: 37 pages, 9 figures, 10 tables

arXiv:2202.06590 [pdf]

doi 10.3390/cancers14122974

A Pragmatic Machine Learning Approach to Quantify Tumor Infiltrating Lymphocytes in Whole Slide Images

Authors: Nikita Shvetsov, Morten Grønnesby, Edvard Pedersen, Kajsa Møllersen, Lill-Tove Rasmussen Busund, Ruth Schwienbacher, Lars Ailo Bongo, Thomas K. Kilvaer

Abstract: Increased levels of tumor infiltrating lymphocytes (TILs) in cancer tissue indicate favourable outcomes in many types of cancer. Manual quantification of immune cells is inaccurate and time consuming for pathologists. Our aim is to leverage a computational solution to automatically quantify TILs in whole slide images (WSIs) of standard diagnostic haematoxylin and eosin stained sections (H&E slides) from lung cancer patients. Our approach is to transfer an open source machine learning method for segmentation and classification of nuclei in H&E slides trained on public data to TIL quantification without manual labeling of our data. Our results show that additional augmentation improves model transferability when training on few samples/limited tissue types. Models trained with sufficient samples/tissue types do not benefit from our additional augmentation policy. Further, the resulting TIL quantification correlates with patient prognosis and compares favorably to the current state-of-the-art method for immune cell detection in non-small cell lung cancer (current standard CD8 cells in DAB stained TMAs: HR 0.34, 95% CI 0.17-0.68; TILs in H&E WSIs: HoVer-Net PanNuke Aug model HR 0.30, 95% CI 0.15-0.60; HoVer-Net MoNuSAC Aug model HR 0.27, 95% CI 0.14-0.53). Moreover, we implemented a cloud based system to train, deploy and visually inspect machine learning based annotation for H&E slides. Our pragmatic approach bridges the gap between machine learning research, translational clinical research and clinical implementation. However, validation in prospective studies is needed to confirm that the method works in a clinical setting.

Submitted 14 February, 2022; originally announced February 2022.

Comments: 19 pages, 5 figures, 2 tables, 11 supplementary pages

MSC Class: 68T07 ACM Class: I.4.6; I.4.9; J.3

Journal ref: Cancers, 14 (2022) 12, 2974

arXiv:2109.02937 [pdf, other]

GeneNet VR: Interactive visualization of large-scale biological networks using a standalone headset

Authors: Álvaro Martínez Fernández, Lars Ailo Bongo, Edvard Pedersen

Abstract: Visualizations are an essential part of biomedical analysis result interpretation. Often, interactive networks are used to visualize the data. However, the high interconnectivity and high dimensionality of the data often result in information overload, making it hard to interpret the results. To address the information overload problem, existing solutions typically either use data reduction, reduced interactivity, or expensive hardware. We propose using the affordable Oculus Quest Virtual Reality (VR) headset for interactive visualization of large-scale biological networks. We present the design and implementation of our solution, GeneNet VR, and we evaluate its scalability and usability using large gene-to-gene interaction networks. We achieve the 72 FPS required by the Oculus performance guidelines for the largest of our networks (2693 genes) using both a GPU and the Oculus Quest standalone. We found from our interviews with biomedical researchers that GeneNet VR is innovative, interesting, and easy to use for novice VR users. We believe affordable hardware like the Oculus Quest has great potential for biological data analysis. However, additional work is required to evaluate its benefits to improve knowledge discovery for real data analysis use cases. GeneNet VR is open-sourced: https://github.com/kolibrid/GeneNet-VR. A video demonstrating GeneNet VR used to explore large biological networks: https://youtu.be/N4QDZiZqVNY.

Submitted 7 September, 2021; originally announced September 2021.

arXiv:2106.03996 [pdf]

doi 10.51964/hlcs11331

Lessons learned developing and using a machine learning model to automatically transcribe 2.3 million handwritten occupation codes

Authors: Bjørn-Richard Pedersen, Einar Holsbø, Trygve Andersen, Nikita Shvetsov, Johan Ravn, Hilde Leikny Sommerseth, Lars Ailo Bongo

Abstract: Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned may be useful for other transcription projects that plan to use machine learning in production. The source code is available at: https://github.com/uit-hdl/rhd-codes

Submitted 1 December, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2005.09890 [pdf]

doi 10.1145/3388440.3414862

Interactive exploration of population scale pharmacoepidemiology datasets

Authors: Tengel Ekrem Skar, Einar Holsbø, Kristian Svendsen, Lars Ailo Bongo

Abstract: Population-scale drug prescription data linked with adverse drug reaction (ADR) data supports the fitting of models large enough to detect drug use and ADR patterns that are not detectable using traditional methods on smaller datasets. However, detecting ADR patterns in large datasets requires tools for scalable data processing, machine learning for data analysis, and interactive visualization. To our knowledge no existing pharmacoepidemiology tool supports all three requirements. We have therefore created a tool for interactive exploration of patterns in prescription datasets with millions of samples. We use Spark to preprocess the data for machine learning and for analyses using SQL queries. We have implemented models in Keras and the scikit-learn framework. The model results are visualized and interpreted using live Python coding in Jupyter. We apply our tool to explore a data set of 384 million prescriptions from the Norwegian Prescription Database combined with 62 million prescriptions for elders who were hospitalized. We preprocess the data in two minutes, train models in seconds, and plot the results in milliseconds. Our results show the power of combining computational power, short computation times, and ease of use for analysis of population scale pharmacoepidemiology datasets. The code is open source and available at: https://github.com/uit-hdl/norpd_prescription_analyses

Submitted 20 May, 2020; originally announced May 2020.

arXiv:1903.10251 [pdf]

doi 10.3390/s19081798

Convolutional neural network for breathing phase detection in lung sounds

Authors: Cristina Jácome, Johan Ravn, Einar Holsbø, Juan Carlos Aviles-Solis, Hasse Melbye, Lars Ailo Bongo

Abstract: We applied deep learning to create an algorithm for breathing phase detection in lung sound recordings, and we compared the breathing phases detected by the algorithm with those manually annotated by two experienced lung sound researchers. Our algorithm uses a convolutional neural network with spectrograms as the features, removing the need to specify features explicitly. We trained and evaluated the algorithm using three subsets that are larger than previously seen in the literature. We evaluated the performance of the algorithm in two ways. First, a discrete count of agreed breathing phases (using 50% overlap between a pair of boxes) shows a mean agreement with lung sound experts of 97% for inspiration and 87% for expiration. Second, the fraction of time of agreement (in seconds) gives higher pseudo-kappa values for inspiration (0.73-0.88) than expiration (0.63-0.84), showing an average sensitivity of 97% and an average specificity of 84%. With both evaluation methods, the agreement between the annotators and the algorithm shows human level performance for the algorithm. The developed algorithm is valid for detecting breathing phases in lung sound recordings.
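The first evaluation method in this abstract counts a detected breathing phase as agreed when its box overlaps an annotated box by at least 50%. A minimal Python sketch of such a count follows. The abstract does not say how overlap is measured, so the sketch assumes intersection over union of one-dimensional time intervals; the function names and the sample intervals are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two (start, end) intervals in seconds.

    Using intersection over union is an assumption of this sketch; the
    abstract only states that a 50% overlap threshold was applied.
    """
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def agreement(reference, detected, threshold=0.5):
    """Fraction of reference intervals matched by any detected interval."""
    if not reference:
        return 0.0
    matched = sum(
        1
        for ref in reference
        if any(overlap_ratio(ref, det) >= threshold for det in detected)
    )
    return matched / len(reference)


if __name__ == "__main__":
    # Hypothetical inspiration phases: annotator versus algorithm.
    annotated = [(0.0, 2.0), (3.0, 5.0)]
    detected = [(0.0, 1.5), (10.0, 11.0)]
    print(agreement(annotated, detected))
```

In practice the count would be computed separately for inspiration and expiration, which is how the abstract reports its agreement figures.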

Submitted 25 March, 2019; originally announced March 2019.

arXiv:1901.05240 [pdf, other]

doi 10.1145/3304221.3325527

Teaching Electronics and Programming in Norwegian Schools Using the air:bit Sensor Kit

Authors: Bjørn Fjukstad, Nina Angelvik, Morten Grønnesby, Maria Wulff Hauglann, Hedinn Gunhildrud, Fredrik Høisæther Rasch, Julianne Iversen, Margaret Dalseng, Lars Ailo Bongo

Abstract: We describe lessons learned from using the air:bit project to introduce more than 150 students in the Norwegian upper secondary school to computer programming, engineering and environmental sciences. In the air:bit project, students build and code portable air quality sensor kits, and use their air:bit to collect data to investigate patterns in air quality in their local environment. When the project ended, students had collected more than 400,000 measurements with their air:bit kits, and could describe local patterns in air quality. Students participate in all parts of the project, from soldering components and programming the sensors, to analyzing the air quality measurements. We conducted a survey after the project and describe our lessons learned from the project. The results show that the project successfully taught the students fundamental concepts in computer programming, electronics, and the scientific method. In addition, all the participating teachers reported that their students had shown good learning outcomes.

Submitted 16 January, 2019; originally announced January 2019.

arXiv:1803.04337 [pdf, other]

doi 10.1371/journal.pone.0217541

Replication study: Development and validation of deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs

Authors: Mike Voets, Kajsa Møllersen, Lars Ailo Bongo

Abstract: Replication studies are essential for validation of new methods, and are crucial to maintain the high standards of scientific publications, and to use the results in practice. We have attempted to replicate the main method in 'Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs' published in JAMA 2016; 316(22). We re-implemented the method since the source code is not available, and we used publicly available data sets. The original study used non-public fundus images from EyePACS and three hospitals in India for training. We used a different EyePACS data set from Kaggle. The original study used the benchmark data set Messidor-2 to evaluate the algorithm's performance. We used the same data set. In the original study, ophthalmologists re-graded all images for diabetic retinopathy, macular edema, and image gradability. There was one diabetic retinopathy grade per image for our data sets, and we assessed image gradability ourselves. Hyper-parameter settings were not described in the original study, but some of these were later published. We were not able to replicate the original study. Our algorithm's area under the receiver operating characteristic curve (AUC) of 0.94 on the Kaggle EyePACS test set and 0.80 on Messidor-2 did not come close to the reported AUC of 0.99 in the original study. This may be caused by the use of a single grade per image, different data, or different hyper-parameter settings that were not described. This study shows the challenges of replicating deep learning methods, and the need for more replication studies to validate them, especially for medical image analysis. Our source code and instructions are available at: https://github.com/mikevoets/jama16-retina-replication

Submitted 29 August, 2018; v1 submitted 12 March, 2018; originally announced March 2018.

Comments: The third version of this paper includes results from replication after certain hyper-parameters were published in a later article. 16 pages, 6 figures, 1 table, presented at NOBIM 2018

arXiv:1706.00005 [pdf]

Feature Extraction for Machine Learning Based Crackle Detection in Lung Sounds from a Health Survey

Authors: Morten Grønnesby, Juan Carlos Aviles Solis, Einar Holsbø, Hasse Melbye, Lars Ailo Bongo

Abstract: In recent years, many innovative solutions for recording and viewing sounds from a stethoscope have become available. However, to fully utilize such devices, there is a need for an automated approach for detecting abnormal lung sounds, which is better than the existing methods that typically have been developed and evaluated using a small and non-diverse dataset. We propose a machine learning based approach for detecting crackles in lung sounds recorded using a stethoscope in a large health survey. Our method is trained and evaluated using 209 files with crackles classified by expert listeners. Our analysis pipeline is based on features extracted from small windows in audio files. We evaluated several feature extraction methods and classifiers. We evaluated the pipeline using a training set of 175 crackle windows and 208 normal windows. We did 100 cycles of cross validation where we shuffled training sets between cycles. For all cycles, the division between training and evaluation data was 70%-30%. We found and evaluated a 5-dimensional vector with four features from the time domain and one from the spectrum domain. We evaluated several classifiers and found SVM with a Radial Basis Function Kernel to perform best. Our approach had a precision of 86% and recall of 84% for classifying a crackle in a window, which is more accurate than found in studies of health personnel. The low-dimensional feature vector makes the SVM very fast. The model can be trained on a regular computer in 1.44 seconds, and 319 crackles can be classified in 1.08 seconds. Our approach detects and visualizes individual crackles in recorded audio files. It is accurate, fast, and has low resource requirements. It can be used to train health personnel or as part of a smartphone application for Bluetooth stethoscopes.
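The validation scheme in this abstract, 100 cycles with reshuffled data and a 70%-30% division between training and evaluation, can be sketched in Python using only the standard library. This is an illustration and not the authors' code: the function name `repeated_splits`, the seed, and the placeholder window labels are assumptions, and the classifier itself is omitted.

```python
import random


def repeated_splits(samples, cycles=100, train_fraction=0.7, seed=0):
    """Yield (train, evaluation) lists from freshly shuffled copies.

    Each cycle reshuffles all samples and cuts them at `train_fraction`.
    The fixed seed is an assumption added so that runs are repeatable.
    """
    rng = random.Random(seed)
    samples = list(samples)
    n_train = round(len(samples) * train_fraction)
    for _ in range(cycles):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # Placeholder labels standing in for the 175 crackle windows and
    # 208 normal windows mentioned in the abstract.
    windows = ["crackle"] * 175 + ["normal"] * 208
    for train, evaluation in repeated_splits(windows, cycles=3):
        print(len(train), len(evaluation))
```

A classifier would be fitted on `train` and scored on `evaluation` inside the loop, with precision and recall averaged over the cycles.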

Submitted 23 December, 2017; v1 submitted 31 May, 2017; originally announced June 2017.

arXiv:1609.03750 [pdf]

nsroot: Minimalist Process Isolation Tool Implemented With Linux Namespaces

Authors: Inge Alexander Raknes, Bjørn Fjukstad, Lars Ailo Bongo

Abstract: Data analyses in the life sciences are moving from tools run on a personal computer to services run on large computing platforms. This creates a need to package tools and dependencies for easy installation, configuration and deployment on distributed platforms. In addition, for secure execution there is a need for process isolation on a shared platform. Existing virtual machine and container technologies are often more complex than traditional Unix utilities, like chroot, and often require root privileges in order to set up or use. This is especially challenging on HPC systems where users typically do not have root access. We therefore present nsroot, a lightweight Linux namespaces based process isolation tool. It allows restricting the runtime environment of data analysis tools that may not have been designed with security as a top priority, in order to reduce the risk and consequences of security breaches, without requiring any special privileges. The codebase of nsroot is small, and it provides a command line interface similar to chroot. It can be used on all Linux kernels that implement user namespaces. In addition, we propose combining nsroot with the AppImage format for secure execution of packaged applications. nsroot is open sourced and available at: https://github.com/uit-no/nsroot

Submitted 13 September, 2016; originally announced September 2016.

arXiv:1604.04103 [pdf]

META-pipe - Pipeline Annotation, Analysis and Visualization of Marine Metagenomic Sequence Data

Authors: Espen Mikal Robertsen, Tim Kahlke, Inge Alexander Raknes, Edvard Pedersen, Erik Kjærner Semb, Martin Ernstsen, Lars Ailo Bongo, Nils Peder Willassen

Abstract: The marine environment is one of the most important sources for microbial biodiversity on the planet. These microbes are drivers for many biogeochemical processes, and their enormous genetic potential is still not fully explored or exploited. Marine metagenomics (DNA shotgun sequencing) not only offers opportunities for studying structure and function of microbial communities, but also identification of novel biocatalysts and bioactive compounds. However, data analysis, management, storage, processing and interpretation are significant challenges in marine metagenomics due to the high diversity in samples and the size of the marine flagship projects. We provide a new pipeline, META-pipe, for marine metagenomics analysis. It offers pre-processing, assembly, taxonomic classification and functional analysis. To reduce the effort to develop and deploy it, we have integrated existing biological analysis frameworks, and compute and storage infrastructure resources. Our current META-pipe web service provides integration with identity provider services, distributed storage, computation on a Supercomputer, Galaxy workflows, and interactive data visualizations. We have evaluated the scalability and performance of the analysis pipeline. Our results demonstrate how to develop and deploy a pipeline on distributed compute and storage resources, and we discuss important challenges related to this process.

Submitted 14 April, 2016; originally announced April 2016.

Comments: 22 pages, 10 figures

arXiv:1503.07759 [pdf, other]

doi 10.1016/j.future.2016.02.010

Large-scale Biological Meta-database Management

Authors: Edvard Pedersen, Lars Ailo Bongo

Abstract: Up-to-date meta-databases are vital for the analysis of biological data. However, the current exponential increase in biological data leads to exponentially increasing meta-database sizes. Large-scale meta-database management is therefore an important challenge for production platforms providing services for biological data analysis. In particular, there is often a need either to run an analysis with a particular version of a meta-database, or to rerun an analysis with an updated meta-database. We present our GeStore approach for biological meta-database management. It provides efficient storage and runtime generation of specific meta-database versions, and efficient incremental updates for biological data analysis tools. The approach is transparent to the tools, and we provide a framework that makes it easy to integrate GeStore with biological data analysis frameworks. We present the GeStore system, an evaluation of the performance characteristics of the system, and an evaluation of the benefits for a biological data analysis workflow.

Submitted 22 February, 2016; v1 submitted 26 March, 2015; originally announced March 2015.

Comments: 10 pages, 6 figures, 4 tables

Showing 1–17 of 17 results for author: Bongo, L A