Search | arXiv e-print repository

Dataset Representativeness and Downstream Task Fairness

Authors: Victor Borza, Andrew Estornell, Chien-Ju Ho, Bradley Malin, Yevgeniy Vorobeychik

Abstract: Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are bi… ▽ More Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are biased and not representative, i.e., the collected dataset does not accurately reflect the distribution of demographics of the true population. This is a concern because subgroups within the population can be under- or over-represented in a dataset, which may harm generalizability and lead to an unequal distribution of benefits and harms from downstream tasks that use such datasets (e.g., algorithmic bias in medical decision-making algorithms). In this paper, we assess the relationship between dataset representativeness and group-fairness of classifiers trained on that dataset. We demonstrate that there is a natural tension between dataset representativeness and classifier fairness; empirically we observe that training datasets with better representativeness can frequently result in classifiers with higher rates of unfairness. We provide some intuition as to why this occurs via a set of theoretical results in the case of univariate classifiers. We also find that over-sampling underrepresented groups can result in classifiers which exhibit greater bias to those groups. Lastly, we observe that fairness-aware sampling strategies (i.e., those which are specifically designed to select data with high downstream fairness) will often over-sample members of majority groups. These results demonstrate that the relationship between dataset representativeness and downstream classifier fairness is complex; balancing these two quantities requires special care from both model- and dataset-designers. △ Less

Submitted 28 June, 2024; originally announced July 2024.

Comments: 48 pages, 32 figures

arXiv:2406.01811 [pdf, other]

A Game-Theoretic Approach to Privacy-Utility Tradeoff in Sharing Genomic Summary Statistics

Authors: Tao Zhang, Rajagopal Venkatesaramani, Rajat K. De, Bradley A. Malin, Yevgeniy Vorobeychik

Abstract: The advent of online genomic data-sharing services has sought to enhance the accessibility of large genomic datasets by allowing queries about genetic variants, such as summary statistics, aiding care providers in distinguishing between spurious genomic variations and those with clinical significance. However, numerous studies have demonstrated that even sharing summary genomic information exposes… ▽ More The advent of online genomic data-sharing services has sought to enhance the accessibility of large genomic datasets by allowing queries about genetic variants, such as summary statistics, aiding care providers in distinguishing between spurious genomic variations and those with clinical significance. However, numerous studies have demonstrated that even sharing summary genomic information exposes individual members of such datasets to a significant privacy risk due to membership inference attacks. While several approaches have emerged that reduce privacy risks by adding noise or reducing the amount of information shared, these typically assume non-adaptive attacks that use likelihood ratio test (LRT) statistics. We propose a Bayesian game-theoretic framework for optimal privacy-utility tradeoff in the sharing of genomic summary statistics. Our first contribution is to prove that a very general Bayesian attacker model that anchors our game-theoretic approach is more powerful than the conventional LRT-based threat models in that it induces worse privacy loss for the defender who is modeled as a von Neumann-Morgenstern (vNM) decision-maker. We show this to be true even when the attacker uses a non-informative subjective prior. Next, we present an analytically tractable approach to compare the Bayesian attacks with arbitrary subjective priors and the Neyman-Pearson optimal LRT attacks under the Gaussian mechanism common in differential privacy frameworks. Finally, we propose an approach for approximating Bayes-Nash equilibria of the game using deep neural network generators to implicitly represent player mixed strategies. Our experiments demonstrate that the proposed game-theoretic framework yields both stronger attacks and stronger defense strategies than the state of the art. △ Less

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2402.11347 [pdf, other]

PhaseEvo: Towards Unified In-Context Prompt Optimization for Large Language Models

Authors: Wendi Cui, Jiaxin Zhang, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley Malin, Sricharan Kumar

Abstract: Crafting an ideal prompt for Large Language Models (LLMs) is a challenging task that demands significant resources and expert human input. Existing work treats the optimization of prompt instruction and in-context learning examples as distinct problems, leading to sub-optimal prompt performance. This research addresses this limitation by establishing a unified in-context prompt optimization framew… ▽ More Crafting an ideal prompt for Large Language Models (LLMs) is a challenging task that demands significant resources and expert human input. Existing work treats the optimization of prompt instruction and in-context learning examples as distinct problems, leading to sub-optimal prompt performance. This research addresses this limitation by establishing a unified in-context prompt optimization framework, which aims to achieve joint optimization of the prompt instruction and examples. However, formulating such optimization in the discrete and high-dimensional natural language space introduces challenges in terms of convergence and computational efficiency. To overcome these issues, we present PhaseEvo, an efficient automatic prompt optimization framework that combines the generative capability of LLMs with the global search proficiency of evolution algorithms. Our framework features a multi-phase design incorporating innovative LLM-based mutation operators to enhance search efficiency and accelerate convergence. We conduct an extensive evaluation of our approach across 35 benchmark tasks. The results demonstrate that PhaseEvo significantly outperforms the state-of-the-art baseline methods by a large margin whilst maintaining good efficiency. △ Less

Submitted 17 February, 2024; originally announced February 2024.

Comments: 50 pages, 9 figures, 26 tables

arXiv:2401.02132 [pdf, other]

DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

Authors: Wendi Cui, Jiaxin Zhang, Zhuohang Li, Lopez Damien, Kamalika Das, Bradley Malin, Sricharan Kumar

Abstract: Evaluating the quality and variability of text generated by Large Language Models (LLMs) poses a significant, yet unresolved research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, which measure token similarity, often fail to capture the holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in… ▽ More Evaluating the quality and variability of text generated by Large Language Models (LLMs) poses a significant, yet unresolved research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, which measure token similarity, often fail to capture the holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance where reliability, safety, and robust decision-making are highly critical. This work proposes DCR, an automated framework for evaluating and improving the consistency of LLM-generated texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (DCE) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated based on predefined criteria. To facilitate this approach, we introduce an automatic metric converter (AMC) that translates the output from DCE into an interpretable numeric score. Beyond the consistency evaluation, we further present a reason-assisted improver (RAI) that leverages the analytical reasons with explanations identified by DCE to generate new responses aimed at reducing these inconsistencies. Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the consistency of LLM generation across multiple benchmarks in semantic, factual, and summarization consistency tasks. Our approach also substantially reduces nearly 90% of output inconsistencies, showing promise for effective hallucination mitigation. △ Less

Submitted 4 January, 2024; originally announced January 2024.

arXiv:2311.11211 [pdf]

Leveraging Generative AI for Clinical Evidence Summarization Needs to Ensure Trustworthiness

Authors: Gongbo Zhang, Qiao **, Denis Jered McInerney, Yong Chen, Fei Wang, Curtis L. Cole, Qian Yang, Yanshan Wang, Bradley A. Malin, Mor Peleg, Byron C. Wallace, Zhiyong Lu, Chunhua Weng, Yifan Peng

Abstract: Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, ho… ▽ More Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, develo** accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence. △ Less

Submitted 31 March, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

arXiv:2311.01740 [pdf, other]

SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency

Authors: Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, Sricharan Kumar

Abstract: Hallucination detection is a critical step toward understanding the trustworthiness of modern language models (LMs). To achieve this goal, we re-examine existing detection approaches based on the self-consistency of LMs and uncover two types of hallucinations resulting from 1) question-level and 2) model-level, which cannot be effectively identified through self-consistency check alone. Building u… ▽ More Hallucination detection is a critical step toward understanding the trustworthiness of modern language models (LMs). To achieve this goal, we re-examine existing detection approaches based on the self-consistency of LMs and uncover two types of hallucinations resulting from 1) question-level and 2) model-level, which cannot be effectively identified through self-consistency check alone. Building upon this discovery, we propose a novel sampling-based method, i.e., semantic-aware cross-check consistency (SAC3) that expands on the principle of self-consistency checking. Our SAC3 approach incorporates additional mechanisms to detect both question-level and model-level hallucinations by leveraging advances including semantically equivalent question perturbation and cross-model response consistency checking. Through extensive and systematic empirical analysis, we demonstrate that SAC3 outperforms the state of the art in detecting both non-factual and factual statements across multiple question-answering and open-domain generation benchmarks. △ Less

Submitted 18 February, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: EMNLP 2023

arXiv:2309.00154 [pdf, other]

Learning From Peers: A Survey of Perception and Utilization of Online Peer Support Among Informal Dementia Caregivers

Authors: Zhijun Yin, Lauren Stratton, Qingyuan Song, Congning Ni, Lijun Song, Patricia A. Commiskey, Qingxia Chen, Monica Moreno, Sam Fazio, Bradley A. Malin

Abstract: Informal dementia caregivers are those who care for a person living with dementia (PLWD) without receiving payment (e.g., family members, friends, or other unpaid caregivers). These informal caregivers are subject to substantial mental, physical, and financial burdens. Online communities enable these caregivers to exchange caregiving strategies and communicate experiences with other caregivers who… ▽ More Informal dementia caregivers are those who care for a person living with dementia (PLWD) without receiving payment (e.g., family members, friends, or other unpaid caregivers). These informal caregivers are subject to substantial mental, physical, and financial burdens. Online communities enable these caregivers to exchange caregiving strategies and communicate experiences with other caregivers whom they generally do not know in real life. Research has demonstrated the benefits of peer support in online communities, but they are limited in focusing merely on caregivers who are already online users. In this paper, we designed and administered a survey to investigate the perception and utilization of online peer support from 140 informal dementia caregivers (with 100 online-community caregivers). Our findings show that the behavior to access any online community is only significantly associated with their belief in the value of online peer support (p = 0.006). Moreover, 33 (83%) of the 40 non-online-community caregivers had a belief score above 24, a score assigned when a neutral option is selected for each belief question. The reasons most articulated for not accessing any online community were no time to do so (14; 10%), and insufficient online information searching skills (9; 6%). Our findings suggest that online peer support is valuable, but practical strategies are needed to assist informal dementia caregivers who have limited time or searching skills. △ Less

Submitted 31 August, 2023; originally announced September 2023.

arXiv:2308.11027 [pdf, other]

Split Learning for Distributed Collaborative Training of Deep Learning Models in Health Informatics

Authors: Zhuohang Li, Chao Yan, Xinmeng Zhang, Gharib Gharibi, Zhijun Yin, Xiaoqian Jiang, Bradley A. Malin

Abstract: Deep learning continues to rapidly evolve and is now demonstrating remarkable potential for numerous medical prediction tasks. However, realizing deep learning models that generalize across healthcare organizations is challenging. This is due, in part, to the inherent siloed nature of these organizations and patient privacy requirements. To address this problem, we illustrate how split learning ca… ▽ More Deep learning continues to rapidly evolve and is now demonstrating remarkable potential for numerous medical prediction tasks. However, realizing deep learning models that generalize across healthcare organizations is challenging. This is due, in part, to the inherent siloed nature of these organizations and patient privacy requirements. To address this problem, we illustrate how split learning can enable collaborative training of deep learning models across disparate and privately maintained health datasets, while kee** the original records and model parameters private. We introduce a new privacy-preserving distributed learning framework that offers a higher level of privacy compared to conventional federated learning. We use several biomedical imaging and electronic health record (EHR) datasets to show that deep learning models trained via split learning can achieve highly similar performance to their centralized and federated counterparts while greatly improving computational efficiency and reducing privacy risks. △ Less

Submitted 21 August, 2023; originally announced August 2023.

arXiv:2302.01763 [pdf, other]

Enabling Trade-offs in Privacy and Utility in Genomic Data Beacons and Summary Statistics

Authors: Rajagopal Venkatesaramani, Zhiyu Wan, Bradley A. Malin, Yevgeniy Vorobeychik

Abstract: The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web-services called Beacons. However, even… ▽ More The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web-services called Beacons. However, even such limited releases are susceptible to likelihood-ratio-based membership-inference attacks. Several approaches have been proposed to preserve privacy, which either suppress a subset of genomic variants or modify query responses for specific variants (e.g., adding noise, as in differential privacy). However, many of these approaches result in a significant utility loss, either suppressing many variants or adding a substantial amount of noise. In this paper, we introduce optimization-based approaches to explicitly trade off the utility of summary data or Beacon responses and privacy with respect to membership-inference attacks based on likelihood-ratios, combining variant suppression and modification. We consider two attack models. In the first, an attacker applies a likelihood-ratio test to make membership-inference claims. In the second model, an attacker uses a threshold that accounts for the effect of the data release on the separation in scores between individuals in the dataset and those who are not. We further introduce highly scalable approaches for approximately solving the privacy-utility tradeoff problem when information is either in the form of summary statistics or presence/absence queries. Finally, we show that the proposed approaches outperform the state of the art in both utility and privacy through an extensive evaluation with public datasets. △ Less

Submitted 11 January, 2023; originally announced February 2023.

arXiv:2210.09975 [pdf]

Risk of re-identification for shared clinical speech recordings

Authors: Daniela A. Wiepert, Bradley A. Malin, Joseph R. Duffy, Rene L. Utianski, John L. Stricker, David T. Jones, Hugo Botha

Abstract: Large, curated datasets are required to leverage speech-based tools in healthcare. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (i.e., voiceprints), sharing recordings raises privacy concerns. We examine the re-identification risk for speech recordings, without reference to demographic or metadata, using a state-of-the-ar… ▽ More Large, curated datasets are required to leverage speech-based tools in healthcare. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (i.e., voiceprints), sharing recordings raises privacy concerns. We examine the re-identification risk for speech recordings, without reference to demographic or metadata, using a state-of-the-art speaker recognition system. We demonstrate that the risk is inversely related to the number of comparisons an adversary must consider, i.e., the search space. Risk is high for a small search space but drops as the search space grows ($precision >0.85$ for $<1*10^{6}$ comparisons, $precision <0.5$ for $>3*10^{6}$ comparisons). Next, we show that the nature of a speech recording influences re-identification risk, with non-connected speech (e.g., vowel prolongation) being harder to identify. Our findings suggest that speaker recognition systems can be used to re-identify participants in specific circumstances, but in practice, the re-identification risk appears low. △ Less

Submitted 21 August, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: 24 pages, 6 figures

arXiv:2208.01636 [pdf, ps, other]

A Roadmap for Greater Public Use of Privacy-Sensitive Government Data: Workshop Report

Authors: Chris Clifton, Bradley Malin, Anna Oganian, Ramesh Raskar, Vivek Sharma

Abstract: Government agencies collect and manage a wide range of ever-growing datasets. While such data has the potential to support research and evidence-based policy making, there are concerns that the dissemination of such data could infringe upon the privacy of the individuals (or organizations) from whom such data was collected. To appraise the current state of data sharing, as well as learn about oppo… ▽ More Government agencies collect and manage a wide range of ever-growing datasets. While such data has the potential to support research and evidence-based policy making, there are concerns that the dissemination of such data could infringe upon the privacy of the individuals (or organizations) from whom such data was collected. To appraise the current state of data sharing, as well as learn about opportunities for stimulating such sharing at a faster pace, a virtual workshop was held on May 21st and 26th, 2021, sponsored by the National Science Foundation and National Institute of Standards and Technologies, where a multinational collection of researchers and practitioners were brought together to discuss their experiences and learn about recently developed technologies for managing privacy while sharing data. The workshop specifically focused on challenges and successes in government data sharing at various levels. The first day focused on successful examples of new technology applied to sharing of public data, including formal privacy techniques, synthetic data, and cryptographic approaches. Day two emphasized brainstorming sessions on some of the challenges and directions to address them. △ Less

Submitted 17 June, 2022; originally announced August 2022.

Comments: 23 pages

arXiv:2208.01230 [pdf]

doi 10.1038/s41467-022-35295-1

A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models

Authors: Chao Yan, Yao Yan, Zhiyu Wan, Ziqi Zhang, Larsson Omberg, Justin Guinney, Sean D. Mooney, Bradley A. Malin

Abstract: Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial networks (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic ass… ▽ More Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial networks (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records (EHRs) data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic EHR data. The results further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context. △ Less

Submitted 1 August, 2022; originally announced August 2022.

arXiv:2207.02445 [pdf, other]

Distillation to Enhance the Portability of Risk Models Across Institutions with Large Patient Claims Database

Authors: Steve Nyemba, Chao Yan, Ziqi Zhang, Amol Rajmane, Pablo Meyer, Prithwish Chakraborty, Bradley Malin

Abstract: Artificial intelligence, and particularly machine learning (ML), is increasingly developed and deployed to support healthcare in a variety of settings. However, clinical decision support (CDS) technologies based on ML need to be portable if they are to be adopted on a broad scale. In this respect, models developed at one institution should be reusable at another. Yet there are numerous examples of… ▽ More Artificial intelligence, and particularly machine learning (ML), is increasingly developed and deployed to support healthcare in a variety of settings. However, clinical decision support (CDS) technologies based on ML need to be portable if they are to be adopted on a broad scale. In this respect, models developed at one institution should be reusable at another. Yet there are numerous examples of portability failure, particularly due to naive application of ML models. Portability failure can lead to suboptimal care and medical errors, which ultimately could prevent the adoption of ML-based CDS in practice. One specific healthcare challenge that could benefit from enhanced portability is the prediction of 30-day readmission risk. Research to date has shown that deep learning models can be effective at modeling such risk. In this work, we investigate the practicality of model portability through a cross-site evaluation of readmission prediction models. To do so, we apply a recurrent neural network, augmented with self-attention and blended with expert features, to build readmission prediction models for two independent large scale claims datasets. We further present a novel transfer learning technique that adapts the well-known method of born-again network (BAN) training. Our experiments show that direct application of ML models trained at one institution and tested at another institution perform worse than models trained and tested at the same institution. We further show that the transfer learning approach based on the BAN produces models that are better than those trained on just a single institution's data. Notably, this improvement is consistent across both sites and occurs after a single retraining, which illustrates the potential for a cheap and general model transfer mechanism of readmission risk prediction. △ Less

Submitted 6 July, 2022; originally announced July 2022.

arXiv:2112.13301 [pdf, other]

Defending Against Membership Inference Attacks on Beacon Services

Authors: Rajagopal Venkatesaramani, Zhiyu Wan, Bradley A. Malin, Yevgeniy Vorobeychik

Abstract: Large genomic datasets are now created through numerous activities, including recreational genealogical investigations, biomedical research, and clinical care. At the same time, genomic data has become valuable for reuse beyond their initial point of collection, but privacy concerns often hinder access. Over the past several years, Beacon services have emerged to broaden accessibility to such data… ▽ More Large genomic datasets are now created through numerous activities, including recreational genealogical investigations, biomedical research, and clinical care. At the same time, genomic data has become valuable for reuse beyond their initial point of collection, but privacy concerns often hinder access. Over the past several years, Beacon services have emerged to broaden accessibility to such data. These services enable users to query for the presence of a particular minor allele in a private dataset, information that can help care providers determine if genomic variation is spurious or has some known clinical indication. However, various studies have shown that even this limited access model can leak if individuals are members in the underlying dataset. Several approaches for mitigating this vulnerability have been proposed, but they are limited in that they 1) typically rely on heuristics and 2) offer probabilistic privacy guarantees, but neglect utility. In this paper, we present a novel algorithmic framework to ensure privacy in a Beacon service setting with a minimal number of query response flips (e.g., changing a positive response to a negative). Specifically, we represent this problem as combinatorial optimization in both the batch setting (where queries arrive all at once), as well as the online setting (where queries arrive sequentially). The former setting has been the primary focus in prior literature, whereas real Beacons allow sequential queries, motivating the latter investigation. We present principled algorithms in this framework with both privacy and, in some cases, worst-case utility guarantees. Moreover, through an extensive experimental evaluation, we show that the proposed approaches significantly outperform the state of the art in terms of privacy and utility. △ Less

Submitted 25 December, 2021; originally announced December 2021.

arXiv:2106.14649 [pdf]

doi 10.1093/jamia/ocac011

Dynamically Adjusting Case Reporting Policy to Maximize Privacy and Utility in the Face of a Pandemic

Authors: J. Thomas Brown, Chao Yan, Weiyi Xia, Zhijun Yin, Zhiyu Wan, Aris Gkoulalas-Divanis, Murat Kantarcioglu, Bradley A. Malin

Abstract: Supporting public health research and the public's situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and recent state-level regulations, permits sharing de-identified person-level data; however, current de-identification approaches are limite… ▽ More Supporting public health research and the public's situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and recent state-level regulations, permits sharing de-identified person-level data; however, current de-identification approaches are limited. namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt de-identification for near-real time sharing of person-level surveillance data. The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the re-identification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.1. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework's effectiveness in maintaining the PK!1 threshold of 0.01. When sharing COVID-19 county-level case data across all US counties, the framework's approach meets the threshold for 96.2% of daily data releases, while a policy based on current de-identification techniques meets the threshold for 32.3%. Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features. △ Less

Submitted 25 February, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

Comments: Updated to peer-reviewed version. Main text only without figures. Complete version is available in the Journal of the American Medical Informatics Association at https://doi.org/10.1093/jamia/ocac011

arXiv:2104.04377 [pdf, other]

Blending Knowledge in Deep Recurrent Networks for Adverse Event Prediction at Hospital Discharge

Authors: Prithwish Chakraborty, James Codella, Piyush Madan, Ying Li, Hu Huang, Yoonyoung Park, Chao Yan, Ziqi Zhang, Cheng Gao, Steve Nyemba, Xu Min, Sanjib Basak, Mohamed Ghalwash, Zach Shahn, Parthasararathy Suryanarayanan, Italo Buleje, Shannon Harrer, Sarah Miller, Amol Rajmane, Colin Walsh, Jonathan Wanderer, Gigi Yuen Reed, Kenney Ng, Daby Sow, Bradley A. Malin

Abstract: Deep learning architectures have an extremely high-capacity for modeling complex data in a wide variety of domains. However, these architectures have been limited in their ability to support complex prediction problems using insurance claims data, such as readmission at 30 days, mainly due to data sparsity issue. Consequently, classical machine learning methods, especially those that embed domain… ▽ More Deep learning architectures have an extremely high-capacity for modeling complex data in a wide variety of domains. However, these architectures have been limited in their ability to support complex prediction problems using insurance claims data, such as readmission at 30 days, mainly due to data sparsity issue. Consequently, classical machine learning methods, especially those that embed domain knowledge in handcrafted features, are often on par with, and sometimes outperform, deep learning approaches. In this paper, we illustrate how the potential of deep learning can be achieved by blending domain knowledge within deep learning architectures to predict adverse events at hospital discharge, including readmissions. More specifically, we introduce a learning architecture that fuses a representation of patient data computed by a self-attention based recurrent neural network, with clinically relevant features. We conduct extensive experiments on a large claims dataset and show that the blended method outperforms the standard machine learning approaches. △ Less

Submitted 9 April, 2021; originally announced April 2021.

Comments: Presented at the AMIA 2021 Virtual Informatics Summit

arXiv:2102.08557 [pdf, other]

doi 10.1126/sciadv.abg3296

Re-identification of Individuals in Genomic Datasets Using Public Face Images

Authors: Rajagopal Venkatesaramani, Bradley A. Malin, Yevgeniy Vorobeychik

Abstract: DNA sequencing is becoming increasingly commonplace, both in medical and direct-to-consumer settings. To promote discovery, collected genomic data is often de-identified and shared, either in public repositories, such as OpenSNP, or with researchers through access-controlled repositories. However, recent studies have suggested that genomic data can be effectively matched to high-resolution three-d… ▽ More DNA sequencing is becoming increasingly commonplace, both in medical and direct-to-consumer settings. To promote discovery, collected genomic data is often de-identified and shared, either in public repositories, such as OpenSNP, or with researchers through access-controlled repositories. However, recent studies have suggested that genomic data can be effectively matched to high-resolution three-dimensional face images, which raises a concern that the increasingly ubiquitous public face images can be linked to shared genomic data, thereby re-identifying individuals in the genomic data. While these investigations illustrate the possibility of such an attack, they assume that those performing the linkage have access to extremely well-curated data. Given that this is unlikely to be the case in practice, it calls into question the pragmatic nature of the attack. As such, we systematically study this re-identification risk from two perspectives: first, we investigate how successful such linkage attacks can be when real face images are used, and second, we consider how we can empower individuals to have better control over the associated re-identification risk. We observe that the true risk of re-identification is likely substantially smaller for most individuals than prior literature suggests. In addition, we demonstrate that the addition of a small amount of carefully crafted noise to images can enable a controlled trade-off between re-identification success and the quality of shared images, with risk typically significantly lowered even with noise that is imperceptible to humans. △ Less

Submitted 16 February, 2021; originally announced February 2021.

arXiv:2012.10020 [pdf, other]

EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

Authors: Siddharth Biswal, Soumya Ghosh, Jon Duke, Bradley Malin, Walter Stewart, Jimeng Sun

Abstract: Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems value deeply patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are suscept… ▽ More Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems value deeply patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and its volume is also limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and can be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250, 000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate the generated EHR sequences are realistic. We confirmed the performance of predictive models trained on the synthetic data are similar with those trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance - improving the best baseline by as much as 8% in top-20 recall. △ Less

Submitted 17 December, 2020; originally announced December 2020.

arXiv:2010.08855 [pdf, other]

GOAT: GPU Outsourcing of Deep Learning Training With Asynchronous Probabilistic Integrity Verification Inside Trusted Execution Environment

Authors: Aref Asvadishireh**i, Murat Kantarcioglu, Bradley Malin

Abstract: Machine learning models based on Deep Neural Networks (DNNs) are increasingly deployed in a wide range of applications ranging from self-driving cars to COVID-19 treatment discovery. To support the computational power necessary to learn a DNN, cloud environments with dedicated hardware support have emerged as critical infrastructure. However, there are many integrity challenges associated with out… ▽ More Machine learning models based on Deep Neural Networks (DNNs) are increasingly deployed in a wide range of applications ranging from self-driving cars to COVID-19 treatment discovery. To support the computational power necessary to learn a DNN, cloud environments with dedicated hardware support have emerged as critical infrastructure. However, there are many integrity challenges associated with outsourcing computation. Various approaches have been developed to address these challenges, building on trusted execution environments (TEE). Yet, no existing approach scales up to support realistic integrity-preserving DNN model training for heavy workloads (deep architectures and millions of training examples) without sustaining a significant performance hit. To mitigate the time gap between pure TEE (full integrity) and pure GPU (no integrity), we combine random verification of selected computation steps with systematic adjustments of DNN hyper-parameters (e.g., a narrow gradient clip** range), hence limiting the attacker's ability to shift the model parameters significantly provided that the step is not selected for verification during its training phase. Experimental results show the new approach achieves 2X to 20X performance improvement over pure TEE based solution while guaranteeing a very high probability of integrity (e.g., 0.999) with respect to state-of-the-art DNN backdoor attacks. △ Less

Submitted 17 October, 2020; originally announced October 2020.

arXiv:2003.07904 [pdf, other]

Generating Electronic Health Records with Multiple Data Types and Constraints

Authors: Chao Yan, Ziqi Zhang, Steve Nyemba, Bradley A. Malin

Abstract: Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, pr… ▽ More Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, procedures or vital signs) and 2) do not represent constraints between features. In this paper, we introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks. Our analysis with over $770,000$ EHRs from Vanderbilt University Medical Center demonstrates that the new model achieves higher performance in terms of retaining basic statistics, cross-feature correlations, latent structural properties, feature constraints and associated patterns from real data, without sacrificing privacy. △ Less

Submitted 23 March, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

arXiv:2001.08529 [pdf, other]

Leveraging Blockchain for Immutable Logging and Querying Across Multiple Sites

Authors: Mustafa Safa Ozdayi, Murat Kantarcioglu, Bradley Malin

Abstract: Blockchain has emerged as a decentralized and distributed framework that enables tamper-resilience and, thus, practical immutability for stored data. This immutability property is important in scenarios where auditability is desired, such as in maintaining access logs for sensitive healthcare and biomedical data.However, the underlying data structure of blockchain, by default, does not provide cap… ▽ More Blockchain has emerged as a decentralized and distributed framework that enables tamper-resilience and, thus, practical immutability for stored data. This immutability property is important in scenarios where auditability is desired, such as in maintaining access logs for sensitive healthcare and biomedical data.However, the underlying data structure of blockchain, by default, does not provide capabilities to efficiently query the stored data. In this investigation, we show that it is possible to efficiently run complex audit queries over the access log data stored on blockchains by using additional key-value stores. This paper specifically reports on the approach we designed for the blockchain track of iDASH Privacy & Security Workshop 2018 competition.Particularly, we implemented our solution and compared its loading and query-response performance with SQLite, a commonly used relational database, using the data provided by the iDASH 2018 organizers. Depending on the query type and the data size, the run time difference between blockchain based query-response and SQLite based query-response ranged from 0.2 seconds to 6 seconds. A deeper inspection revealed that range queries were the bottleneck of our solution which, nevertheless, scales up linearly. Concretely, this investigation demonstrates that blockchain-based systems can provide reasonable query-response times to complex queries even if they only use simple key-value stores to manage their data. Consequently, we show that blockchains may be useful for maintaining data with auditability and immutability requirements across multiple sites. △ Less

Submitted 5 March, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

arXiv:1905.06946 [pdf, other]

To Warn or Not to Warn: Online Signaling in Audit Games

Authors: Chao Yan, Haifeng Xu, Yevgeniy Vorobeychik, Bo Li, Daniel Fabbri, Bradley Malin

Abstract: Routine operational use of sensitive data is often governed by law and regulation. For instance, in the medical domain, there are various statues at the state and federal level that dictate who is permitted to work with patients' records and under what conditions. To screen for potential privacy breaches, logging systems are usually deployed to trigger alerts whenever suspicious access is detected… ▽ More Routine operational use of sensitive data is often governed by law and regulation. For instance, in the medical domain, there are various statues at the state and federal level that dictate who is permitted to work with patients' records and under what conditions. To screen for potential privacy breaches, logging systems are usually deployed to trigger alerts whenever suspicious access is detected. However, such mechanisms are often inefficient because 1) the vast majority of triggered alerts are false positives, 2) small budgets make it unlikely that a real attack will be detected, and 3) attackers can behave strategically, such that traditional auditing mechanisms cannot easily catch them. To improve efficiency, information systems may invoke signaling, so that whenever a suspicious access request occurs, the system can, in real time, warn the user that the access may be audited. Then, at the close of a finite period, a selected subset of suspicious accesses are audited. This gives rise to an online problem in which one needs to determine 1) whether a warning should be triggered and 2) the likelihood that the data request event will be audited. In this paper, we formalize this auditing problem as a Signaling Audit Game (SAG), in which we model the interactions between an auditor and an attacker in the context of signaling and the usability cost is represented as a factor of the auditor's payoff. We study the properties of its Stackelberg equilibria and develop a scalable approach to compute its solution. We show that a strategic presentation of warnings adds value in that SAGs realize significantly higher utility for the auditor than systems without signaling. We illustrate the value of the proposed auditing model and the consistency of its advantages over existing baseline methods. △ Less

Submitted 21 October, 2019; v1 submitted 16 May, 2019; originally announced May 2019.

ACM Class: I.2.0; H.4.0; H.2.7; J.1; J.3

arXiv:1904.02065 [pdf, other]

Health and Kinship Matter: Learning About Direct-To-Consumer Genetic Testing User Experiences via Online Discussions

Authors: Zhijun Yin, Lijun Song, Ellen Clayton, Bradley Malin

Abstract: Direct-to-consumer (DTC) genetic testing has gained in popularity over the past decade, with over 12 million consumers to date. Given its increasing stature in society, along with weak regulatory oversight, it is important to learn about actual consumers' testing experiences. Traditional interviews or survey-based studies have been limited in that they had small sample sizes or lacked detailed des… ▽ More Direct-to-consumer (DTC) genetic testing has gained in popularity over the past decade, with over 12 million consumers to date. Given its increasing stature in society, along with weak regulatory oversight, it is important to learn about actual consumers' testing experiences. Traditional interviews or survey-based studies have been limited in that they had small sample sizes or lacked detailed descriptions of personal experiences. Yet many people are now sharing their DTC genetic testing experiences via online social media platforms. In this paper, we focused on one particularly lively online discussion forum, \textit{r/23andme} subreddit, where, as of before March 2018, 5,857 users published 37,183 posts. We applied topic modeling to the posts and examined the identified topics and temporal posting trends. We further applied regression analysis to learn the association between the attention that a submission received, in terms of votes and comments, and the posting content. Our findings indicate that bursts of the increase of such online discussion in 2017 may correlate with the Food and Drug Administration's authorization for marketing of 23andMe genetic test on health risks, as well as the hot sale of 23andMe's products on Black Friday. While ancestry composition was a popular subject, kinship was steadily growing towards a major online discussion topic. Moreover, compared with other topics, health and kinship were more likely to receive attention, in terms of votes, while testing reports were more likely to receive attention, in term of comments. Our findings suggest that people may not always be prepared to deal with the unexpected consequences of DTC genetic testing. Moreover, it appears that the users in this subreddit might not sufficiently consider privacy when taking a test or seeking an interpretation from a third-party service provider. △ Less

Submitted 3 April, 2019; originally announced April 2019.

arXiv:1808.02602 [pdf, other]

PIVETed-Granite: Computational Phenotypes through Constrained Tensor Factorization

Authors: Jette Henderson, Bradley A. Malin, Joyce C. Ho, Joydeep Ghosh

Abstract: It has been recently shown that sparse, nonnegative tensor factorization of multi-modal electronic health record data is a promising approach to high-throughput computational phenoty**. However, such approaches typically do not leverage available domain knowledge while extracting the phenotypes; hence, some of the suggested phenotypes may not map well to clinical concepts or may be very similar… ▽ More It has been recently shown that sparse, nonnegative tensor factorization of multi-modal electronic health record data is a promising approach to high-throughput computational phenoty**. However, such approaches typically do not leverage available domain knowledge while extracting the phenotypes; hence, some of the suggested phenotypes may not map well to clinical concepts or may be very similar to other suggested phenotypes. To address these issues, we present a novel, automatic approach called PIVETed-Granite that mines existing biomedical literature (PubMed) to obtain cannot-link constraints that are then used as side-information during a tensor-factorization based computational phenoty** process. The resulting improvements are clearly observed in experiments using a large dataset from VUMC to identify phenotypes for hypertensive patients. △ Less

Submitted 7 August, 2018; originally announced August 2018.

arXiv:1801.07215 [pdf, other]

Get Your Workload in Order: Game Theoretic Prioritization of Database Auditing

Authors: Chao Yan, Bo Li, Yevgeniy Vorobeychik, Aron Laszka, Daniel Fabbri, Bradley Malin

Abstract: For enhancing the privacy protections of databases, where the increasing amount of detailed personal data is stored and processed, multiple mechanisms have been developed, such as audit logging and alert triggers, which notify administrators about suspicious activities; however, the two main limitations in common are: 1) the volume of such alerts is often substantially greater than the capabilitie… ▽ More For enhancing the privacy protections of databases, where the increasing amount of detailed personal data is stored and processed, multiple mechanisms have been developed, such as audit logging and alert triggers, which notify administrators about suspicious activities; however, the two main limitations in common are: 1) the volume of such alerts is often substantially greater than the capabilities of resource-constrained organizations, and 2) strategic attackers may disguise their actions or carefully choosing which records they touch, making incompetent the statistical detection models. For solving them, we introduce a novel approach to database auditing that explicitly accounts for adversarial behavior by 1) prioritizing the order in which types of alerts are investigated and 2) providing an upper bound on how much resource to allocate for each type. We model the interaction between a database auditor and potential attackers as a Stackelberg game in which the auditor chooses an auditing policy and attackers choose which records to target. A corresponding approach combining linear programming, column generation, and heuristic search is proposed to derive an auditing policy. For testing the policy-searching performance, a publicly available credit card application dataset are adopted, on which it shows that our methods produce high-quality mixed strategies as database audit policies, and our general approach significantly outperforms non-game-theoretic baselines. △ Less

Submitted 22 January, 2018; originally announced January 2018.

ACM Class: D.4.6; H.2.0; K.6.5; J.1; I.2

arXiv:1712.02193 [pdf, other]

Systematizing Genome Privacy Research: A Privacy-Enhancing Technologies Perspective

Authors: Alexandros Mittos, Bradley Malin, Emiliano De Cristofaro

Abstract: Rapid advances in human genomics are enabling researchers to gain a better understanding of the role of the genome in our health and well-being, stimulating hope for more effective and cost efficient healthcare. However, this also prompts a number of security and privacy concerns stemming from the distinctive characteristics of genomic data. To address them, a new research community has emerged an… ▽ More Rapid advances in human genomics are enabling researchers to gain a better understanding of the role of the genome in our health and well-being, stimulating hope for more effective and cost efficient healthcare. However, this also prompts a number of security and privacy concerns stemming from the distinctive characteristics of genomic data. To address them, a new research community has emerged and produced a large number of publications and initiatives. In this paper, we rely on a structured methodology to contextualize and provide a critical analysis of the current knowledge on privacy-enhancing technologies used for testing, storing, and sharing genomic data, using a representative sample of the work published in the past decade. We identify and discuss limitations, technical challenges, and issues faced by the community, focusing in particular on those that are inherently tied to the nature of the problem and are harder for the community alone to address. Finally, we report on the importance and difficulty of the identified challenges based on an online survey of genome data privacy experts △ Less

Submitted 17 August, 2018; v1 submitted 6 December, 2017; originally announced December 2017.

Comments: To appear in the Proceedings on Privacy Enhancing Technologies (PoPETs), Vol. 2019, Issue 1

arXiv:1706.00487 [pdf]

Learning Bundled Care Opportunities from Electronic Medical Records

Authors: You Chen, Abel N. Kho, David Liebovitz, Catherine Ivory, Sarah Osmundson, Jiang Bian, Bradley A. Malin

Abstract: Objectives: The fee-for-service approach to healthcare leads to the management of a patient's conditions in an independent manner, inducing various negative consequences. It is recognized that a bundled care approach to healthcare-one that manages a collection of health conditions together-may enable greater efficacy and cost savings. However, it is not always evident which sets of conditions shou… ▽ More Objectives: The fee-for-service approach to healthcare leads to the management of a patient's conditions in an independent manner, inducing various negative consequences. It is recognized that a bundled care approach to healthcare-one that manages a collection of health conditions together-may enable greater efficacy and cost savings. However, it is not always evident which sets of conditions should be managed in a bundled program. Study Design: Retrospective inference of clusters of health conditions from an electronic medical record (EMR) system. A survey of healthcare experts to ascertain the plausibility of the clusters for bundled care programs. Methods: We designed a data-driven framework to infer clusters of health conditions via their shared clinical workflows according to EMR utilization by healthcare employees. We evaluated the framework with approximately 16,500 inpatient stays from a large medical center. The plausibility of the clusters for bundled care was assessed through a survey of a panel of healthcare experts using an analysis of variance (ANOVA) under a 95% confidence interval. Results: The framework inferred four condition clusters: 1) fetal abnormalities, 2) late pregnancies, 3) prostate problems, and 4) chronic diseases (with congestive heart failure featuring prominently). Each cluster was deemed plausible by the experts for bundled care. Conclusions: The findings suggest that data from EMRs may provide a basis for discovering new directions in bundled care. Still, translating such findings into actual care management will require further refinement, implementation, and evaluation. △ Less

Submitted 26 May, 2017; originally announced June 2017.

Comments: 27 pages, 3 figures, 3 tables

arXiv:1705.09713 [pdf]

A Data-Driven Analysis of the Influence of Care Coordination on Trauma Outcome

Authors: You Chen, Mayur B. Patel, Candace D. McNaughton, Bradley A. Malin

Abstract: OBJECTIVE: To test the hypothesis that variation in care coordination is related to LOS. DESIGN We applied a spectral co-clustering methodology to simultaneously infer groups of patients and care coordination patterns, in the form of interaction networks of health care professionals, from electronic medical record (EMR) utilization data. The care coordination pattern for each patient group was rep… ▽ More OBJECTIVE: To test the hypothesis that variation in care coordination is related to LOS. DESIGN We applied a spectral co-clustering methodology to simultaneously infer groups of patients and care coordination patterns, in the form of interaction networks of health care professionals, from electronic medical record (EMR) utilization data. The care coordination pattern for each patient group was represented by standard social network characteristics and its relationship with hospital LOS was assessed via a negative binomial regression with a 95% confidence interval. SETTING AND PATIENTS This study focuses on 5,588 adult patients hospitalized for trauma at the Vanderbilt University Medical Center. The EMRs were accessed by healthcare professionals from 179 operational areas during 158,467 operational actions. MAIN OUTCOME MEASURES: Hospital LOS for trauma inpatients, as an indicator of care coordination efficiency. RESULTS: Three general types of care coordination patterns were discovered, each of which was affiliated with a specific patient group. The first patient group exhibited the shortest hospital LOS and was managed by a care coordination pattern that involved the smallest number of operational areas (102 areas, as opposed to 125 and 138 for the other patient groups), but exhibited the largest number of collaborations between operational areas (e.g., an average of 27.1 connections per operational area compared to 22.5 and 23.3 for the other two groups). The hospital LOS for the second and third patient groups was 14 hours (P = 0.024) and 10 hours (P = 0.042) longer than the first patient group, respectively. △ Less

Submitted 26 May, 2017; originally announced May 2017.

Comments: 25 pages, 1 figure, 2 tables

arXiv:1703.06490 [pdf, other]

Generating Multi-label Discrete Patient Records using Generative Adversarial Networks

Authors: Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, Jimeng Sun

Abstract: Access to electronic health record (EHR) data has motivated computational advances in medical research. However, various concerns, particularly over privacy, can limit access to and collaborative use of EHR data. Sharing synthetic EHR data could mitigate risk. In this paper, we propose a new approach, medical Generative Adversarial Network (medGAN), to generate realistic synthetic patient records.… ▽ More Access to electronic health record (EHR) data has motivated computational advances in medical research. However, various concerns, particularly over privacy, can limit access to and collaborative use of EHR data. Sharing synthetic EHR data could mitigate risk. In this paper, we propose a new approach, medical Generative Adversarial Network (medGAN), to generate realistic synthetic patient records. Based on input real patient records, medGAN can generate high-dimensional discrete variables (e.g., binary and count features) via a combination of an autoencoder and generative adversarial networks. We also propose minibatch averaging to efficiently avoid mode collapse, and increase the learning efficiency with batch normalization and shortcut connections. To demonstrate feasibility, we showed that medGAN generates synthetic patient records that achieve comparable performance to real data on many experiments including distribution statistics, predictive modeling tasks and a medical expert review. We also empirically observe a limited privacy risk in both identity and attribute disclosure using medGAN. △ Less

Submitted 11 January, 2018; v1 submitted 19 March, 2017; originally announced March 2017.

Comments: Accepted at Machine Learning in Health Care (MLHC) 2017

arXiv:1609.04466 [pdf, other]

Phenoty** using Structured Collective Matrix Factorization of Multi--source EHR Data

Authors: Suriya Gunasekar, Joyce C. Ho, Joydeep Ghosh, Stephanie Kreml, Abel N Kho, Joshua C Denny, Bradley A Malin, Jimeng Sun

Abstract: The increased availability of electronic health records (EHRs) have spearheaded the initiative for precision medicine using data driven approaches. Essential to this effort is the ability to identify patients with certain medical conditions of interest from simple queries on EHRs, or EHR-based phenotypes. Existing rule--based phenoty** approaches are extremely labor intensive. Instead, dimension… ▽ More The increased availability of electronic health records (EHRs) have spearheaded the initiative for precision medicine using data driven approaches. Essential to this effort is the ability to identify patients with certain medical conditions of interest from simple queries on EHRs, or EHR-based phenotypes. Existing rule--based phenoty** approaches are extremely labor intensive. Instead, dimensionality reduction and latent factor estimation techniques from machine learning can be adapted for phenotype extraction with no (or minimal) human supervision. We propose to identify an easily interpretable latent space shared across various sources of EHR data as potential candidates for phenotypes. By incorporating multiple EHR data sources (e.g., diagnosis, medications, and lab reports) available in heterogeneous datatypes in a generalized \textit{Collective Matrix Factorization (CMF)}, our methods can generate rich phenotypes. Further, easy interpretability in phenoty** application requires sparse representations of the candidate phenotypes, for example each phenotype derived from patients' medication and diagnosis data should preferably be represented by handful of diagnosis and medications, ($5$--$10$ active components). We propose a constrained formulation of CMF for estimating sparse phenotypes. We demonstrate the efficacy of our model through an extensive empirical study on EHR data from Vanderbilt University Medical Center. △ Less

Submitted 14 September, 2016; originally announced September 2016.

arXiv:1605.00300 [pdf, ps, other]

CheapSMC: A Framework to Minimize SMC Cost in Cloud

Authors: Erman Pattuk, Murat Kantarcioglu, Huseyin Ulusoy, Bradley Malin

Abstract: Secure multi-party computation (SMC) techniques are increasingly becoming more efficient and practical thanks to many recent novel improvements. The recent work have shown that different protocols that are implemented using different sharing mechanisms (e.g., boolean, arithmetic sharings, etc.) may have different computational and communication costs. Although there are some works that automatical… ▽ More Secure multi-party computation (SMC) techniques are increasingly becoming more efficient and practical thanks to many recent novel improvements. The recent work have shown that different protocols that are implemented using different sharing mechanisms (e.g., boolean, arithmetic sharings, etc.) may have different computational and communication costs. Although there are some works that automatically mix protocols of different sharing schemes to fasten execution, none of them provide a generic optimization framework to find the cheapest mixed-protocol SMC execution for cloud deployment. In this work, we propose a generic SMC optimization framework CheapSMC that can use any mixed-protocol SMC circuit evaluation tool as a black-box to find the cheapest SMC cloud deployment option. To find the cheapest SMC protocol, CheapSMC runs one time benchmarks for the target cloud service and gathers performance statistics for basic circuit components. Using these performance statistics, optimization layer of CheapSMC runs multiple heuristics to find the cheapest mix-protocol circuit evaluation. Later on, the optimized circuit is passed to a mix-protocol SMC tool for actual executable generation. Our empirical results gathered by running different cases studies show that significant cost savings could be achieved using our optimization framework. △ Less

Submitted 1 May, 2016; originally announced May 2016.

arXiv:1405.1891 [pdf, other]

Privacy in the Genomic Era

Authors: Muhammad Naveed, Erman Ayday, Ellen W. Clayton, Jacques Fellay, Carl A. Gunter, Jean-Pierre Hubaux, Bradley A. Malin, XiaoFeng Wang

Abstract: Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly-detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has… ▽ More Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly-detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has major implications for personal privacy; notably because the genome has certain essential features, which include (but are not limited to) (i) an association with traits and certain diseases, (ii) identification capability (e.g., forensics), and (iii) revelation of family relationships. Moreover, direct-to-consumer DNA testing increases the likelihood that genome data will be made available in less regulated environments, such as the Internet and for-profit companies. The problem of genome data privacy thus resides at the crossroads of computer science, medicine, and public policy. While the computer scientists have addressed data privacy for various data types, there has been less attention dedicated to genomic data. Thus, the goal of this paper is to provide a systematization of knowledge for the computer science community. In doing so, we address some of the (sometimes erroneous) beliefs of this field and we report on a survey we conducted about genome data privacy with biomedical specialists. Then, after characterizing the genome privacy problem, we review the state-of-the-art regarding privacy attacks on genomic data and strategies for mitigating such attacks, as well as contextualizing these attacks from the perspective of medicine and public policy. This paper concludes with an enumeration of the challenges for genome data privacy and presents a framework to systematize the analysis of threats and the design of countermeasures as the field moves forward. △ Less

Submitted 17 June, 2015; v1 submitted 8 May, 2014; originally announced May 2014.

ACM Class: K.6.5

arXiv:0912.2548 [pdf, ps, other]

Towards Utility-driven Anonymization of Transactions

Authors: Grigorios Loukides, Aris Gkoulalas-Divanis, Bradley Malin

Abstract: Publishing person-specific transactions in an anonymous form is increasingly required by organizations. Recent approaches ensure that potentially identifying information (e.g., a set of diagnosis codes) cannot be used to link published transactions to persons' identities, but all are limited in application because they incorporate coarse privacy requirements (e.g., protecting a certain set of m… ▽ More Publishing person-specific transactions in an anonymous form is increasingly required by organizations. Recent approaches ensure that potentially identifying information (e.g., a set of diagnosis codes) cannot be used to link published transactions to persons' identities, but all are limited in application because they incorporate coarse privacy requirements (e.g., protecting a certain set of m diagnosis codes requires protecting all m-sized sets), do not integrate utility requirements, and tend to explore a small portion of the solution space. In this paper, we propose a more general framework for anonymizing transactional data under specific privacy and utility requirements. We model such requirements as constraints, investigate how these constraints can be specified, and propose COAT (COnstraint-based Anonymization of Transactions), an algorithm that anonymizes transactions using a flexible hierarchy-free generalization scheme to meet the specified constraints. Experiments with benchmark datasets verify that COAT significantly outperforms the current state-of-the-art algorithm in terms of data utility, while being comparable in terms of efficiency. The effectiveness of our approach is also demonstrated in a real-world scenario, which requires disseminating a private, patient-specific transactional dataset in a way that preserves both privacy and utility in intended studies. △ Less

Submitted 26 January, 2010; v1 submitted 13 December, 2009; originally announced December 2009.

Showing 1–33 of 33 results for author: Malin, B