Showing 1–33 of 33 results for author: Malin, B

  1. arXiv:2407.00170  [pdf, other]

    cs.LG cs.CY

    Dataset Representativeness and Downstream Task Fairness

    Authors: Victor Borza, Andrew Estornell, Chien-Ju Ho, Bradley Malin, Yevgeniy Vorobeychik

    Abstract: Our society collects data on people for a wide range of applications, from building a census for policy evaluation to running meaningful clinical trials. To collect data, we typically sample individuals with the goal of accurately representing a population of interest. However, current sampling processes often collect data opportunistically from data sources, which can lead to datasets that are bi… ▽ More

    Submitted 28 June, 2024; originally announced July 2024.

    Comments: 48 pages, 32 figures

  2. arXiv:2406.01811  [pdf, other]

    cs.CR

    A Game-Theoretic Approach to Privacy-Utility Tradeoff in Sharing Genomic Summary Statistics

    Authors: Tao Zhang, Rajagopal Venkatesaramani, Rajat K. De, Bradley A. Malin, Yevgeniy Vorobeychik

    Abstract: The advent of online genomic data-sharing services has sought to enhance the accessibility of large genomic datasets by allowing queries about genetic variants, such as summary statistics, aiding care providers in distinguishing between spurious genomic variations and those with clinical significance. However, numerous studies have demonstrated that even sharing summary genomic information exposes… ▽ More

    Submitted 3 June, 2024; originally announced June 2024.

  3. arXiv:2402.11347  [pdf, other]

    cs.CL

    PhaseEvo: Towards Unified In-Context Prompt Optimization for Large Language Models

    Authors: Wendi Cui, Jiaxin Zhang, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley Malin, Sricharan Kumar

    Abstract: Crafting an ideal prompt for Large Language Models (LLMs) is a challenging task that demands significant resources and expert human input. Existing work treats the optimization of prompt instruction and in-context learning examples as distinct problems, leading to sub-optimal prompt performance. This research addresses this limitation by establishing a unified in-context prompt optimization framew… ▽ More

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: 50 pages, 9 figures, 26 tables

  4. arXiv:2401.02132  [pdf, other]

    cs.CL cs.AI

    DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

    Authors: Wendi Cui, Jiaxin Zhang, Zhuohang Li, Lopez Damien, Kamalika Das, Bradley Malin, Sricharan Kumar

    Abstract: Evaluating the quality and variability of text generated by Large Language Models (LLMs) poses a significant, yet unresolved research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, which measure token similarity, often fail to capture the holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in… ▽ More

    Submitted 4 January, 2024; originally announced January 2024.

  5. arXiv:2311.11211  [pdf]

    cs.AI

    Leveraging Generative AI for Clinical Evidence Summarization Needs to Ensure Trustworthiness

    Authors: Gongbo Zhang, Qiao Jin, Denis Jered McInerney, Yong Chen, Fei Wang, Curtis L. Cole, Qian Yang, Yanshan Wang, Bradley A. Malin, Mor Peleg, Byron C. Wallace, Zhiyong Lu, Chunhua Weng, Yifan Peng

    Abstract: Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, ho… ▽ More

    Submitted 31 March, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

  6. arXiv:2311.01740  [pdf, other]

    cs.CL

    SAC3: Reliable Hallucination Detection in Black-Box Language Models via Semantic-aware Cross-check Consistency

    Authors: Jiaxin Zhang, Zhuohang Li, Kamalika Das, Bradley A. Malin, Sricharan Kumar

    Abstract: Hallucination detection is a critical step toward understanding the trustworthiness of modern language models (LMs). To achieve this goal, we re-examine existing detection approaches based on the self-consistency of LMs and uncover two types of hallucinations resulting from 1) question-level and 2) model-level, which cannot be effectively identified through self-consistency check alone. Building u… ▽ More

    Submitted 18 February, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

    Comments: EMNLP 2023

  7. arXiv:2309.00154  [pdf, other]

    cs.CY

    Learning From Peers: A Survey of Perception and Utilization of Online Peer Support Among Informal Dementia Caregivers

    Authors: Zhijun Yin, Lauren Stratton, Qingyuan Song, Congning Ni, Lijun Song, Patricia A. Commiskey, Qingxia Chen, Monica Moreno, Sam Fazio, Bradley A. Malin

    Abstract: Informal dementia caregivers are those who care for a person living with dementia (PLWD) without receiving payment (e.g., family members, friends, or other unpaid caregivers). These informal caregivers are subject to substantial mental, physical, and financial burdens. Online communities enable these caregivers to exchange caregiving strategies and communicate experiences with other caregivers who… ▽ More

    Submitted 31 August, 2023; originally announced September 2023.

  8. arXiv:2308.11027  [pdf, other]

    cs.LG cs.CR

    Split Learning for Distributed Collaborative Training of Deep Learning Models in Health Informatics

    Authors: Zhuohang Li, Chao Yan, Xinmeng Zhang, Gharib Gharibi, Zhijun Yin, Xiaoqian Jiang, Bradley A. Malin

    Abstract: Deep learning continues to rapidly evolve and is now demonstrating remarkable potential for numerous medical prediction tasks. However, realizing deep learning models that generalize across healthcare organizations is challenging. This is due, in part, to the inherent siloed nature of these organizations and patient privacy requirements. To address this problem, we illustrate how split learning ca… ▽ More

    Submitted 21 August, 2023; originally announced August 2023.

  9. arXiv:2302.01763  [pdf, other]

    cs.CR cs.AI

    Enabling Trade-offs in Privacy and Utility in Genomic Data Beacons and Summary Statistics

    Authors: Rajagopal Venkatesaramani, Zhiyu Wan, Bradley A. Malin, Yevgeniy Vorobeychik

    Abstract: The collection and sharing of genomic data are becoming increasingly commonplace in research, clinical, and direct-to-consumer settings. The computational protocols typically adopted to protect individual privacy include sharing summary statistics, such as allele frequencies, or limiting query responses to the presence/absence of alleles of interest using web-services called Beacons. However, even… ▽ More

    Submitted 11 January, 2023; originally announced February 2023.

  10. arXiv:2210.09975  [pdf]

    eess.AS cs.CR cs.LG cs.SD

    Risk of re-identification for shared clinical speech recordings

    Authors: Daniela A. Wiepert, Bradley A. Malin, Joseph R. Duffy, Rene L. Utianski, John L. Stricker, David T. Jones, Hugo Botha

    Abstract: Large, curated datasets are required to leverage speech-based tools in healthcare. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (i.e., voiceprints), sharing recordings raises privacy concerns. We examine the re-identification risk for speech recordings, without reference to demographic or metadata, using a state-of-the-ar… ▽ More

    Submitted 21 August, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: 24 pages, 6 figures

  11. arXiv:2208.01636  [pdf, ps, other]

    cs.CR cs.CV cs.CY cs.LG

    A Roadmap for Greater Public Use of Privacy-Sensitive Government Data: Workshop Report

    Authors: Chris Clifton, Bradley Malin, Anna Oganian, Ramesh Raskar, Vivek Sharma

    Abstract: Government agencies collect and manage a wide range of ever-growing datasets. While such data has the potential to support research and evidence-based policy making, there are concerns that the dissemination of such data could infringe upon the privacy of the individuals (or organizations) from whom such data was collected. To appraise the current state of data sharing, as well as learn about oppo… ▽ More

    Submitted 17 June, 2022; originally announced August 2022.

    Comments: 23 pages

  12. arXiv:2208.01230  [pdf]

    cs.LG cs.AI cs.CY

    A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models

    Authors: Chao Yan, Yao Yan, Zhiyu Wan, Ziqi Zhang, Larsson Omberg, Justin Guinney, Sean D. Mooney, Bradley A. Malin

    Abstract: Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial networks (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic ass… ▽ More

    Submitted 1 August, 2022; originally announced August 2022.

  13. arXiv:2207.02445  [pdf, other]

    cs.LG cs.AI cs.CY

    Distillation to Enhance the Portability of Risk Models Across Institutions with Large Patient Claims Database

    Authors: Steve Nyemba, Chao Yan, Ziqi Zhang, Amol Rajmane, Pablo Meyer, Prithwish Chakraborty, Bradley Malin

    Abstract: Artificial intelligence, and particularly machine learning (ML), is increasingly developed and deployed to support healthcare in a variety of settings. However, clinical decision support (CDS) technologies based on ML need to be portable if they are to be adopted on a broad scale. In this respect, models developed at one institution should be reusable at another. Yet there are numerous examples of… ▽ More

    Submitted 6 July, 2022; originally announced July 2022.

  14. arXiv:2112.13301  [pdf, other]

    cs.CR q-bio.GN

    Defending Against Membership Inference Attacks on Beacon Services

    Authors: Rajagopal Venkatesaramani, Zhiyu Wan, Bradley A. Malin, Yevgeniy Vorobeychik

    Abstract: Large genomic datasets are now created through numerous activities, including recreational genealogical investigations, biomedical research, and clinical care. At the same time, genomic data has become valuable for reuse beyond their initial point of collection, but privacy concerns often hinder access. Over the past several years, Beacon services have emerged to broaden accessibility to such data… ▽ More

    Submitted 25 December, 2021; originally announced December 2021.

  15. Dynamically Adjusting Case Reporting Policy to Maximize Privacy and Utility in the Face of a Pandemic

    Authors: J. Thomas Brown, Chao Yan, Weiyi Xia, Zhijun Yin, Zhiyu Wan, Aris Gkoulalas-Divanis, Murat Kantarcioglu, Bradley A. Malin

    Abstract: Supporting public health research and the public's situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and recent state-level regulations, permits sharing de-identified person-level data; however, current de-identification approaches are limite… ▽ More

    Submitted 25 February, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

    Comments: Updated to peer-reviewed version. Main text only without figures. Complete version is available in the Journal of the American Medical Informatics Association at https://doi.org/10.1093/jamia/ocac011

  16. arXiv:2104.04377  [pdf, other]

    cs.LG

    Blending Knowledge in Deep Recurrent Networks for Adverse Event Prediction at Hospital Discharge

    Authors: Prithwish Chakraborty, James Codella, Piyush Madan, Ying Li, Hu Huang, Yoonyoung Park, Chao Yan, Ziqi Zhang, Cheng Gao, Steve Nyemba, Xu Min, Sanjib Basak, Mohamed Ghalwash, Zach Shahn, Parthasararathy Suryanarayanan, Italo Buleje, Shannon Harrer, Sarah Miller, Amol Rajmane, Colin Walsh, Jonathan Wanderer, Gigi Yuen Reed, Kenney Ng, Daby Sow, Bradley A. Malin

    Abstract: Deep learning architectures have an extremely high-capacity for modeling complex data in a wide variety of domains. However, these architectures have been limited in their ability to support complex prediction problems using insurance claims data, such as readmission at 30 days, mainly due to data sparsity issue. Consequently, classical machine learning methods, especially those that embed domain… ▽ More

    Submitted 9 April, 2021; originally announced April 2021.

    Comments: Presented at the AMIA 2021 Virtual Informatics Summit

  17. arXiv:2102.08557  [pdf, other]

    cs.LG cs.CR cs.CY

    Re-identification of Individuals in Genomic Datasets Using Public Face Images

    Authors: Rajagopal Venkatesaramani, Bradley A. Malin, Yevgeniy Vorobeychik

    Abstract: DNA sequencing is becoming increasingly commonplace, both in medical and direct-to-consumer settings. To promote discovery, collected genomic data is often de-identified and shared, either in public repositories, such as OpenSNP, or with researchers through access-controlled repositories. However, recent studies have suggested that genomic data can be effectively matched to high-resolution three-d… ▽ More

    Submitted 16 February, 2021; originally announced February 2021.

  18. arXiv:2012.10020  [pdf, other]

    cs.LG cs.AI

    EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

    Authors: Siddharth Biswal, Soumya Ghosh, Jon Duke, Bradley Malin, Walter Stewart, Jimeng Sun

    Abstract: Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems value deeply patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are suscept… ▽ More

    Submitted 17 December, 2020; originally announced December 2020.

  19. arXiv:2010.08855  [pdf, other]

    cs.CR cs.LG

    GOAT: GPU Outsourcing of Deep Learning Training With Asynchronous Probabilistic Integrity Verification Inside Trusted Execution Environment

    Authors: Aref Asvadishirehjini, Murat Kantarcioglu, Bradley Malin

    Abstract: Machine learning models based on Deep Neural Networks (DNNs) are increasingly deployed in a wide range of applications ranging from self-driving cars to COVID-19 treatment discovery. To support the computational power necessary to learn a DNN, cloud environments with dedicated hardware support have emerged as critical infrastructure. However, there are many integrity challenges associated with out… ▽ More

    Submitted 17 October, 2020; originally announced October 2020.

  20. arXiv:2003.07904  [pdf, other]

    cs.LG cs.CY stat.ML

    Generating Electronic Health Records with Multiple Data Types and Constraints

    Authors: Chao Yan, Ziqi Zhang, Steve Nyemba, Bradley A. Malin

    Abstract: Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, pr… ▽ More

    Submitted 23 March, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

  21. arXiv:2001.08529  [pdf, other]

    cs.DB cs.CR cs.DC

    Leveraging Blockchain for Immutable Logging and Querying Across Multiple Sites

    Authors: Mustafa Safa Ozdayi, Murat Kantarcioglu, Bradley Malin

    Abstract: Blockchain has emerged as a decentralized and distributed framework that enables tamper-resilience and, thus, practical immutability for stored data. This immutability property is important in scenarios where auditability is desired, such as in maintaining access logs for sensitive healthcare and biomedical data. However, the underlying data structure of blockchain, by default, does not provide cap… ▽ More

    Submitted 5 March, 2020; v1 submitted 13 January, 2020; originally announced January 2020.

  22. arXiv:1905.06946  [pdf, other]

    cs.CR cs.DB cs.GT

    To Warn or Not to Warn: Online Signaling in Audit Games

    Authors: Chao Yan, Haifeng Xu, Yevgeniy Vorobeychik, Bo Li, Daniel Fabbri, Bradley Malin

    Abstract: Routine operational use of sensitive data is often governed by law and regulation. For instance, in the medical domain, there are various statutes at the state and federal level that dictate who is permitted to work with patients' records and under what conditions. To screen for potential privacy breaches, logging systems are usually deployed to trigger alerts whenever suspicious access is detected… ▽ More

    Submitted 21 October, 2019; v1 submitted 16 May, 2019; originally announced May 2019.

    ACM Class: I.2.0; H.4.0; H.2.7; J.1; J.3

  23. arXiv:1904.02065  [pdf, other]

    cs.CY

    Health and Kinship Matter: Learning About Direct-To-Consumer Genetic Testing User Experiences via Online Discussions

    Authors: Zhijun Yin, Lijun Song, Ellen Clayton, Bradley Malin

    Abstract: Direct-to-consumer (DTC) genetic testing has gained in popularity over the past decade, with over 12 million consumers to date. Given its increasing stature in society, along with weak regulatory oversight, it is important to learn about actual consumers' testing experiences. Traditional interviews or survey-based studies have been limited in that they had small sample sizes or lacked detailed des… ▽ More

    Submitted 3 April, 2019; originally announced April 2019.

  24. arXiv:1808.02602  [pdf, other]

    cs.LG stat.ML

    PIVETed-Granite: Computational Phenotypes through Constrained Tensor Factorization

    Authors: Jette Henderson, Bradley A. Malin, Joyce C. Ho, Joydeep Ghosh

    Abstract: It has been recently shown that sparse, nonnegative tensor factorization of multi-modal electronic health record data is a promising approach to high-throughput computational phenotyping. However, such approaches typically do not leverage available domain knowledge while extracting the phenotypes; hence, some of the suggested phenotypes may not map well to clinical concepts or may be very similar… ▽ More

    Submitted 7 August, 2018; originally announced August 2018.

  25. arXiv:1801.07215  [pdf, other]

    cs.AI cs.CR cs.DB cs.GT cs.MA

    Get Your Workload in Order: Game Theoretic Prioritization of Database Auditing

    Authors: Chao Yan, Bo Li, Yevgeniy Vorobeychik, Aron Laszka, Daniel Fabbri, Bradley Malin

    Abstract: For enhancing the privacy protections of databases, where the increasing amount of detailed personal data is stored and processed, multiple mechanisms have been developed, such as audit logging and alert triggers, which notify administrators about suspicious activities; however, the two main limitations in common are: 1) the volume of such alerts is often substantially greater than the capabilitie… ▽ More

    Submitted 22 January, 2018; originally announced January 2018.

    ACM Class: D.4.6; H.2.0; K.6.5; J.1; I.2

  26. arXiv:1712.02193  [pdf, other]

    cs.CR

    Systematizing Genome Privacy Research: A Privacy-Enhancing Technologies Perspective

    Authors: Alexandros Mittos, Bradley Malin, Emiliano De Cristofaro

    Abstract: Rapid advances in human genomics are enabling researchers to gain a better understanding of the role of the genome in our health and well-being, stimulating hope for more effective and cost efficient healthcare. However, this also prompts a number of security and privacy concerns stemming from the distinctive characteristics of genomic data. To address them, a new research community has emerged an… ▽ More

    Submitted 17 August, 2018; v1 submitted 6 December, 2017; originally announced December 2017.

    Comments: To appear in the Proceedings on Privacy Enhancing Technologies (PoPETs), Vol. 2019, Issue 1

  27. arXiv:1706.00487  [pdf]

    cs.CY

    Learning Bundled Care Opportunities from Electronic Medical Records

    Authors: You Chen, Abel N. Kho, David Liebovitz, Catherine Ivory, Sarah Osmundson, Jiang Bian, Bradley A. Malin

    Abstract: Objectives: The fee-for-service approach to healthcare leads to the management of a patient's conditions in an independent manner, inducing various negative consequences. It is recognized that a bundled care approach to healthcare, one that manages a collection of health conditions together, may enable greater efficacy and cost savings. However, it is not always evident which sets of conditions shou… ▽ More

    Submitted 26 May, 2017; originally announced June 2017.

    Comments: 27 pages, 3 figures, 3 tables

  28. arXiv:1705.09713  [pdf]

    cs.CY

    A Data-Driven Analysis of the Influence of Care Coordination on Trauma Outcome

    Authors: You Chen, Mayur B. Patel, Candace D. McNaughton, Bradley A. Malin

    Abstract: OBJECTIVE: To test the hypothesis that variation in care coordination is related to LOS. DESIGN: We applied a spectral co-clustering methodology to simultaneously infer groups of patients and care coordination patterns, in the form of interaction networks of health care professionals, from electronic medical record (EMR) utilization data. The care coordination pattern for each patient group was rep… ▽ More

    Submitted 26 May, 2017; originally announced May 2017.

    Comments: 25 pages, 1 figure, 2 tables

  29. arXiv:1703.06490  [pdf, other]

    cs.LG cs.NE

    Generating Multi-label Discrete Patient Records using Generative Adversarial Networks

    Authors: Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, Jimeng Sun

    Abstract: Access to electronic health record (EHR) data has motivated computational advances in medical research. However, various concerns, particularly over privacy, can limit access to and collaborative use of EHR data. Sharing synthetic EHR data could mitigate risk. In this paper, we propose a new approach, medical Generative Adversarial Network (medGAN), to generate realistic synthetic patient records.… ▽ More

    Submitted 11 January, 2018; v1 submitted 19 March, 2017; originally announced March 2017.

    Comments: Accepted at Machine Learning in Health Care (MLHC) 2017

  30. arXiv:1609.04466  [pdf, other]

    stat.AP

    Phenotyping using Structured Collective Matrix Factorization of Multi-source EHR Data

    Authors: Suriya Gunasekar, Joyce C. Ho, Joydeep Ghosh, Stephanie Kreml, Abel N Kho, Joshua C Denny, Bradley A Malin, Jimeng Sun

    Abstract: The increased availability of electronic health records (EHRs) has spearheaded the initiative for precision medicine using data-driven approaches. Essential to this effort is the ability to identify patients with certain medical conditions of interest from simple queries on EHRs, or EHR-based phenotypes. Existing rule-based phenotyping approaches are extremely labor intensive. Instead, dimension… ▽ More

    Submitted 14 September, 2016; originally announced September 2016.

  31. arXiv:1605.00300  [pdf, ps, other]

    cs.CR

    CheapSMC: A Framework to Minimize SMC Cost in Cloud

    Authors: Erman Pattuk, Murat Kantarcioglu, Huseyin Ulusoy, Bradley Malin

    Abstract: Secure multi-party computation (SMC) techniques are becoming increasingly efficient and practical thanks to many recent improvements. Recent work has shown that different protocols that are implemented using different sharing mechanisms (e.g., boolean, arithmetic sharings, etc.) may have different computational and communication costs. Although there are some works that automatical… ▽ More

    Submitted 1 May, 2016; originally announced May 2016.

  32. arXiv:1405.1891  [pdf, other]

    cs.CR

    Privacy in the Genomic Era

    Authors: Muhammad Naveed, Erman Ayday, Ellen W. Clayton, Jacques Fellay, Carl A. Gunter, Jean-Pierre Hubaux, Bradley A. Malin, XiaoFeng Wang

    Abstract: Genome sequencing technology has advanced at a rapid pace and it is now possible to generate highly-detailed genotypes inexpensively. The collection and analysis of such data has the potential to support various applications, including personalized medical services. While the benefits of the genomics revolution are trumpeted by the biomedical community, the increased availability of such data has… ▽ More

    Submitted 17 June, 2015; v1 submitted 8 May, 2014; originally announced May 2014.

    ACM Class: K.6.5

  33. arXiv:0912.2548  [pdf, ps, other]

    cs.DB cs.CR

    Towards Utility-driven Anonymization of Transactions

    Authors: Grigorios Loukides, Aris Gkoulalas-Divanis, Bradley Malin

    Abstract: Publishing person-specific transactions in an anonymous form is increasingly required by organizations. Recent approaches ensure that potentially identifying information (e.g., a set of diagnosis codes) cannot be used to link published transactions to persons' identities, but all are limited in application because they incorporate coarse privacy requirements (e.g., protecting a certain set of m… ▽ More

    Submitted 26 January, 2010; v1 submitted 13 December, 2009; originally announced December 2009.