Search | arXiv e-print repository

Enabling Global Image Data Sharing in the Life Sciences

Authors: Peter Bajcsy, Sreenivas Bhattiprolu, Katy Boerner, Beth A Cimini, Lucy Collinson, Jan Ellenberg, Reto Fiolka, Maryellen Giger, Wojtek Goscinski, Matthew Hartley, Nathan Hotaling, Rick Horwitz, Florian Jug, Anna Kreshuk, Emma Lundberg, Aastha Mathur, Kedar Narayan, Shuichi Onami, Anne L. Plant, Fred Prior, Jason Swedlow, Adam Taylor, Antje Keppler

Abstract: Coordinated collaboration is essential to realize the added value of and infrastructure requirements for global image data sharing in the life sciences. In this White Paper, we take a first step at presenting some of the most common use cases as well as critical/emerging use cases of (including the use of artificial intelligence for) biological and medical image data, which would benefit tremendously from better frameworks for sharing (including technical, resourcing, legal, and ethical aspects). In the second half of this paper, we paint an ideal world scenario for how global image data sharing could work and benefit all life sciences and beyond. As this is still a long way off, we conclude by suggesting several concrete measures directed toward our institutions, existing imaging communities and data initiatives, and national funders, as well as publishers. Our vision is that within the next ten years, most researchers in the world will be able to make their datasets openly available and use quality image data of interest to them for their research and benefit. This paper is published in parallel with a companion White Paper entitled Harmonizing the Generation and Pre-publication Stewardship of FAIR Image Data, which addresses challenges and opportunities related to producing well-documented and high-quality image data that is ready to be shared. The driving goal is to address remaining challenges and democratize access to everyday practices and tools for a spectrum of biomedical researchers, regardless of their expertise, access to resources, and geographical location.

Submitted 2 February, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: This manuscript (arXiv:2401.13023) is published with a closely related companion entitled, Harmonizing the Generation and Pre-publication Stewardship of FAIR Image Data, which can be found at the following link, arXiv:2401.13022

arXiv:2312.08701 [pdf, other]

Enabling End-to-End Secure Federated Learning in Biomedical Research on Heterogeneous Computing Environments with APPFLx

Authors: Trung-Hieu Hoang, Jordan Fuhrman, Ravi Madduri, Miao Li, Pranshu Chaturvedi, Zilinghan Li, Kibaek Kim, Minseok Ryu, Ryan Chard, E. A. Huerta, Maryellen Giger

Abstract: Facilitating large-scale, cross-institutional collaboration in biomedical machine learning projects requires a trustworthy and resilient federated learning (FL) environment to ensure that sensitive information such as protected health information is kept confidential. In this work, we introduce APPFLx, a low-code FL framework that enables the easy setup, configuration, and running of FL experiments across organizational and administrative boundaries while providing secure end-to-end communication, privacy-preserving functionality, and identity management. APPFLx is completely agnostic to the underlying computational infrastructure of participating clients. We demonstrate the capability of APPFLx as an easy-to-use framework for accelerating biomedical studies across institutions and healthcare systems while maintaining the protection of private medical data in two case studies: (1) predicting participant age from electrocardiogram (ECG) waveforms, and (2) detecting COVID-19 disease from chest radiographs. These experiments were performed securely across heterogeneous compute resources, including a mixture of on-premise high-performance computing and cloud computing, and highlight the role of federated learning in improving model generalizability and performance when aggregating data from multiple healthcare systems. Finally, we demonstrate that APPFLx serves as a convenient and easy-to-use framework for accelerating biomedical studies across institutions and healthcare systems while maintaining the protection of private medical data.

Submitted 14 December, 2023; originally announced December 2023.
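
Illustrative note on the entry above: the abstract does not say which aggregation rule the two case studies used, so the following is only a minimal sketch of generic federated averaging, in which a server combines client model parameters weighted by each client's local sample count. It is not the APPFLx API; the function name `federated_average` and the toy parameter values are hypothetical.

```python
def federated_average(client_updates):
    """Weighted average of client parameter vectors.

    client_updates: list of (parameters, num_samples) pairs, where
    parameters is a list of floats and num_samples is that client's
    local dataset size. Hypothetical helper, not part of APPFLx.
    """
    if not client_updates:
        raise ValueError("at least one client update is required")
    dim = len(client_updates[0][0])
    if any(len(params) != dim for params, _ in client_updates):
        raise ValueError("all clients must send vectors of the same length")
    total = sum(n for _, n in client_updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    # Each coordinate is the sample-weighted mean across clients;
    # raw data never leaves a client, only parameters are shared.
    return [
        sum(params[i] * n for params, n in client_updates) / total
        for i in range(dim)
    ]


# Toy example: two sites with 1 and 3 local samples (made-up numbers).
global_params = federated_average([([1.0, 2.0], 1), ([3.0, 4.0], 3)])
```

Weighting by sample count lets larger sites contribute proportionally more to the global model, which is one common design choice; other rules are possible.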

arXiv:2308.08786 [pdf, other]

APPFLx: Providing Privacy-Preserving Cross-Silo Federated Learning as a Service

Authors: Zilinghan Li, Shilan He, Pranshu Chaturvedi, Trung-Hieu Hoang, Minseok Ryu, E. A. Huerta, Volodymyr Kindratenko, Jordan Fuhrman, Maryellen Giger, Ryan Chard, Kibaek Kim, Ravi Madduri

Abstract: Cross-silo privacy-preserving federated learning (PPFL) is a powerful tool to collaboratively train robust and generalized machine learning (ML) models without sharing sensitive (e.g., healthcare or financial) local data. To ease and accelerate the adoption of PPFL, we introduce APPFLx, a ready-to-use platform that provides privacy-preserving cross-silo federated learning as a service. APPFLx employs Globus authentication to allow users to easily and securely invite trustworthy collaborators for PPFL, implements several synchronous and asynchronous FL algorithms, streamlines the FL experiment launch process, and enables tracking and visualizing the life cycle of FL experiments, allowing domain experts and ML practitioners to easily orchestrate and evaluate cross-silo FL under one platform. APPFLx is available online at https://appflx.link

Submitted 17 August, 2023; originally announced August 2023.

arXiv:2303.10501 [pdf]

Longitudinal assessment of demographic representativeness in the Medical Imaging and Data Resource Center Open Data Commons

Authors: Heather M. Whitney, Natalie Baughan, Kyle J. Myers, Karen Drukker, Judy Gichoya, Brad Bower, Weijie Chen, Nicholas Gruszauskas, Jayashree Kalpathy-Cramer, Sanmi Koyejo, Rui C. Sá, Berkman Sahiner, Zi Zhang, Maryellen L. Giger

Abstract: Purpose: The Medical Imaging and Data Resource Center (MIDRC) open data commons was launched to accelerate the development of artificial intelligence (AI) algorithms to help address the COVID-19 pandemic. The purpose of this study was to quantify longitudinal representativeness of the demographic characteristics of the primary imaging dataset compared to the United States general population (US Census) and COVID-19 positive case counts from the Centers for Disease Control and Prevention (CDC). Approach: The Jensen-Shannon distance (JSD) was used to longitudinally measure the similarity of the distribution of (1) all unique patients in the MIDRC data to the 2020 US Census and (2) all unique COVID-19 positive patients in the MIDRC data to the case counts reported by the CDC. The distributions were evaluated in the demographic categories of age at index, sex, race, ethnicity, and the intersection of race and ethnicity. Results: Representativeness of the MIDRC data by ethnicity and the intersection of race and ethnicity was impacted by the percentage of CDC case counts for which data in these categories is not reported. The distributions by sex and race have retained their level of representativeness over time. Conclusion: The representativeness of the open medical imaging datasets in the curated public data commons at MIDRC has evolved over time as both the number of contributing institutions and overall number of subjects has grown. The use of metrics such as the JSD supports measurement of representativeness, one step needed for fair and generalizable AI algorithm development.

Submitted 18 March, 2023; originally announced March 2023.

Comments: 33 pages, 8 figures, 5 supplemental figures, submitted to Journal of Medical Imaging
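
Illustrative note on the entry above: the Jensen-Shannon distance named in the abstract is the square root of the Jensen-Shannon divergence between two distributions. The sketch below shows the standard definition for discrete distributions with a base-2 logarithm, so the result lies between 0 (identical) and 1 (disjoint). The choice of base, the function name, and the category proportions are assumptions for illustration, not values or code from the paper.

```python
import math


def jensen_shannon_distance(p, q, base=2.0):
    """Jensen-Shannon distance between two discrete distributions.

    p and q are sequences of non-negative weights over the same
    categories; they are normalized to sum to 1 before comparison.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("each distribution must have positive total weight")
    p = [x / sum_p for x in p]
    q = [x / sum_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


# Hypothetical proportions over three demographic categories (made up).
dataset_share = [0.50, 0.30, 0.20]
reference_share = [0.60, 0.25, 0.15]
distance = jensen_shannon_distance(dataset_share, reference_share)
```

A smaller distance indicates that the dataset's distribution is closer to the reference population; tracking the value over successive data releases gives the longitudinal view the abstract describes.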

arXiv:2302.02425 [pdf, ps, other]

Principles and Guidelines for Sharing Biomedical Data for Secondary Use: The University of Chicago Perspective

Authors: Robert L. Grossman, Maryellen L. Giger, Julie A. Johnson, Jeremy D. Marks, Jessica P. Ridgway, Julian Solway, Walter M. Stadler

Abstract: Academic medical centers are generating an increasing amount of biomedical data and there is an increasing demand for biomedical data for research purposes by research projects, research consortia, companies, and other third parties. At the same time, as the number of patients grows and the amount of data per patient grows, there is an increasing possibility that some information about some patients may become available if the data is shared with third parties and the third parties have a data breach or violate the terms of the data use agreement. Balancing the importance of research that may result in improved patient outcomes with the importance of protecting patient data is challenging. The article discusses the principles, considerations about risks and mitigating risks, and guidelines used at the University of Chicago for making decisions about sharing biomedical data with third parties.

Submitted 5 February, 2023; originally announced February 2023.

Comments: 6 pages

arXiv:1911.03022 [pdf, other]

Transfer Learning in 4D for Breast Cancer Diagnosis using Dynamic Contrast-Enhanced Magnetic Resonance Imaging

Authors: Qiyuan Hu, Heather M. Whitney, Maryellen L. Giger

Abstract: Deep transfer learning using dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) has shown strong predictive power in characterization of breast lesions. However, pretrained convolutional neural networks (CNNs) require 2D inputs, limiting the ability to exploit the rich 4D (volumetric and temporal) image information inherent in DCE-MRI that is clinically valuable for lesion assessment. Training 3D CNNs from scratch, a common method to utilize high-dimensional information in medical images, is computationally expensive and is not best suited for moderately sized healthcare datasets. Therefore, we propose a novel approach using transfer learning that incorporates the 4D information from DCE-MRI, where volumetric information is collapsed at feature level by max pooling along the projection perpendicular to the transverse slices and the temporal information is contained either in second post-contrast subtraction images. Our methodology yielded an area under the receiver operating characteristic curve of 0.89+/-0.01 on a dataset of 1161 breast lesions, significantly outperforming a previous approach that incorporates the 4D information in DCE-MRI by the use of maximum intensity projection (MIP) images.

Submitted 7 November, 2019; originally announced November 2019.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract
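
Illustrative note on the entry above: the abstract describes collapsing volumetric information at the feature level by max pooling across transverse slices. A minimal sketch of that pooling step follows, assuming each slice has already been turned into a fixed-length feature vector by some pretrained 2D CNN (the extraction step is omitted). The function name and the feature values are hypothetical, and this is not the authors' code.

```python
def max_pool_over_slices(slice_features):
    """Element-wise maximum across per-slice feature vectors.

    slice_features: list of equal-length lists of floats, one list per
    transverse slice of a lesion volume. Returns a single feature
    vector of the same length, so a 3D volume is summarized by one
    vector regardless of how many slices it contains.
    """
    if not slice_features:
        raise ValueError("at least one slice is required")
    length = len(slice_features[0])
    if any(len(features) != length for features in slice_features):
        raise ValueError("all slices must have feature vectors of equal length")
    # zip(*...) groups the i-th feature of every slice together.
    return [max(values) for values in zip(*slice_features)]


# Made-up features for a lesion spanning two slices.
pooled = max_pool_over_slices([[0.1, 0.9, 0.3], [0.4, 0.2, 0.8]])
```

Pooling features rather than pixels differs from a maximum intensity projection, which takes the maximum over raw voxel intensities before any feature extraction.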

arXiv:1701.03882 [pdf, other]

Multi-task Learning in the Computerized Diagnosis of Breast Cancer on DCE-MRIs

Authors: Natalia Antropova, Benjamin Huynh, Maryellen Giger

Abstract: Hand-crafted features extracted from dynamic contrast-enhanced magnetic resonance images (DCE-MRIs) have shown strong predictive abilities in characterization of breast lesions. However, heterogeneity across medical image datasets hinders the generalizability of these features. One of the sources of the heterogeneity is the variation of MR scanner magnet strength, which has a strong influence on image quality, leading to variations in the extracted image features. Thus, statistical decision algorithms need to account for such data heterogeneity. Despite the variations, we hypothesize that there exist underlying relationships between the features extracted from the datasets acquired with different magnet strength MR scanners. We compared the use of a multi-task learning (MTL) method that incorporates those relationships during the classifier training to support vector machines run on a merged dataset that includes cases with various MRI strength images. As a result, higher predictive power is achieved with the MTL method.

Submitted 14 January, 2017; originally announced January 2017.

Showing 1–7 of 7 results for author: Giger, M