arXiv:2406.19531 [pdf, other]

Forward and Backward State Abstractions for Off-policy Evaluation

Authors: Meiling Hao, Pingfan Su, Liyuan Hu, Zoltan Szabo, Qingyuan Zhao, Chengchun Shi

Abstract: Off-policy evaluation (OPE) is crucial for evaluating a target policy's impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions, originally designed for policy learning, in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE. (ii) We derive sufficient conditions for achieving irrelevance in Q-functions and marginalized importance sampling ratios, the latter obtained by constructing a time-reversed Markov decision process (MDP) based on the observed MDP. (iii) We propose a novel two-step procedure that sequentially projects the original state space into a smaller space, which substantially simplifies the sample complexity of OPE arising from high cardinality.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 42 pages, 5 figures

ACM Class: G.3; I.2.6; G.1.2

arXiv:2405.15403 [pdf, other]

Fine-Grained Dynamic Framework for Bias-Variance Joint Optimization on Data Missing Not at Random

Authors: Mingming Ha, Xuewen Tao, Wenfang Lin, Qionxu Ma, Wujiang Xu, Linxun Chen

Abstract: In most practical applications, such as recommendation systems and display advertising, the collected data often contain missing values, and those values are generally missing not at random, which degrades the prediction performance of models. Some existing estimators and regularizers attempt to achieve unbiased estimation to improve predictive performance. However, the variances and generalization bounds of these methods are generally unbounded when the propensity scores tend to zero, compromising their stability and robustness. In this paper, we first theoretically reveal the limitations of regularization techniques. We further illustrate that, for more general estimators, unbiasedness will inevitably lead to unbounded variance. These general laws suggest that estimator design is not merely about eliminating bias, reducing variance, or simply achieving a bias-variance trade-off. Instead, it involves a quantitative joint optimization of bias and variance. We then develop a systematic fine-grained dynamic learning framework to jointly optimize bias and variance, which adaptively selects an appropriate estimator for each user-item pair according to a predefined objective function. With this operation, the generalization bounds and variances of models are reduced and bounded with theoretical guarantees. Extensive experiments are conducted to verify the theoretical results and the effectiveness of the proposed dynamic learning framework.

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2207.14753 [pdf, other]

Estimating Causal Effects with Hidden Confounding using Instrumental Variables and Environments

Authors: James P. Long, Hongxu Zhu, Kim-Anh Do, Min Jin Ha

Abstract: Recent works have proposed regression models which are invariant across data collection environments. These estimators often have a causal interpretation under conditions on the environments and type of invariance imposed. One recent example, the Causal Dantzig (CD), is consistent under hidden confounding and represents an alternative to classical instrumental variable estimators such as Two Stage Least Squares (TSLS). In this work we derive the CD as a generalized method of moments (GMM) estimator. The GMM representation leads to several practical results, including 1) creation of the Generalized Causal Dantzig (GCD) estimator, which can be applied to problems with continuous environments where the CD cannot be fit; 2) a Hybrid (GCD-TSLS combination) estimator, which has properties superior to GCD or TSLS alone; and 3) straightforward asymptotic results for all methods using GMM theory. We compare the CD, GCD, TSLS, and Hybrid estimators in simulations and an application to a Flow Cytometry data set. The newly proposed GCD and Hybrid estimators have superior performance to existing methods in many settings.

Submitted 9 November, 2023; v1 submitted 29 July, 2022; originally announced July 2022.

Comments: 32 pages, 7 figures, 4 tables

arXiv:2111.11529 [pdf, other]

Bayesian Robust Learning in Chain Graph Models for Integrative Pharmacogenomics

Authors: Moumita Chakraborty, Veerabhadran Baladandayuthapani, Anindya Bhadra, Min Jin Ha

Abstract: Integrative analysis of multi-level pharmacogenomic data for modeling dependencies across various biological domains is crucial for developing genomic-testing based treatments. Chain graphs characterize conditional dependence structures of such multi-level data where variables are naturally partitioned into multiple ordered layers, consisting of both directed and undirected edges. Existing literature mostly focuses on Gaussian chain graphs, which are ill-suited for non-normal distributions with heavy-tailed marginals, potentially leading to inaccurate inferences. We propose a Bayesian robust chain graph model (RCGM) based on random transformations of marginals using Gaussian scale mixtures to account for node-level non-normality in continuous multivariate data. This flexible modeling strategy facilitates identification of conditional sign dependencies among non-normal nodes while still being able to infer conditional dependencies among normal nodes. In simulations, we demonstrate that RCGM outperforms existing Gaussian chain graph inference methods in data generated from various non-normal mechanisms. We apply our method to genomic, transcriptomic and proteomic data to understand underlying biological processes holistically for drug response and resistance in lung cancer cell lines. Our analysis reveals inter- and intra-platform dependencies of key signaling pathways to monotherapies of icotinib, erlotinib and osimertinib among other drugs, along with shared patterns of molecular mechanisms behind drug actions.

Submitted 22 November, 2021; originally announced November 2021.

Comments: 35 pages, 5 figures; Supplementary material follows after the main document

arXiv:2110.14374 [pdf, other]

A2I Transformer: Permutation-equivariant attention network for pairwise and many-body interactions with minimal featurization

Authors: Ji Woong Yu, Min Young Ha, Bumjoon Seo, Won Bo Lee

Abstract: The combination of neural network potential (NNP) with molecular simulations plays an important role in an efficient and thorough understanding of a molecular system's potential energy surface (PES). However, grasping the interplay between input features and their local contribution to NNP is increasingly elusive due to heavy featurization. In this work, we suggest an end-to-end model which directly predicts per-atom energy from the coordinates of particles, avoiding expert-guided featurization of the network input. Employing self-attention as the main workhorse, our model is intrinsically equivariant under the permutation operation, resulting in the invariance of the total potential energy. We tested our model against several challenges in molecular simulation problems, including periodic boundary condition (PBC), $n$-body interaction, and binary composition. Our model yielded stable predictions in all tested systems with errors significantly smaller than the potential energy fluctuation acquired from molecular dynamics simulations. Thus, our work provides a minimal baseline model that encodes complex interactions in a condensed phase system to facilitate the data-driven analysis of physicochemical systems.

Submitted 27 October, 2021; originally announced October 2021.

arXiv:2108.00968 [pdf, other]

Robust Semantic Segmentation with Superpixel-Mix

Authors: Gianni Franchi, Nacim Belkhir, Mai Lan Ha, Yufei Hu, Andrei Bursuc, Volker Blanz, Angela Yao

Abstract: Along with predictive performance and runtime speed, reliability is a key requirement for real-world semantic segmentation. Reliability encompasses robustness, predictive uncertainty and reduced bias. To improve reliability, we introduce Superpixel-mix, a new superpixel-based data augmentation method with teacher-student consistency training. Unlike other mixing-based augmentation techniques, mixing superpixels between images is aware of object boundaries, while yielding consistent gains in segmentation accuracy. Our proposed technique achieves state-of-the-art results in semi-supervised semantic segmentation on the Cityscapes dataset. Moreover, Superpixel-mix improves the reliability of semantic segmentation by reducing network uncertainty and bias, as confirmed by competitive results under strong distribution shifts (adverse weather, image corruptions) and when facing out-of-distribution data.

Submitted 21 October, 2021; v1 submitted 2 August, 2021; originally announced August 2021.

Comments: Accepted to BMVC2021

arXiv:2106.01921 [pdf, ps, other]

doi 10.1002/sam.11559

Sample Selection Bias in Evaluation of Prediction Performance of Causal Models

Authors: James P. Long, Min Jin Ha

Abstract: Causal models are notoriously difficult to validate because they make untestable assumptions regarding confounding. New scientific experiments offer the possibility of evaluating causal models using prediction performance. Prediction performance measures are typically robust to violations in causal assumptions. However, prediction performance does depend on the selection of training and test sets. Biased training sets can lead to optimistic assessments of model performance. In this work, we revisit the prediction performance of several recently proposed causal models tested on a genetic perturbation data set of Kemmeren. We find that sample selection bias is likely a key driver of model performance. We propose using a less-biased evaluation set for assessing prediction performance and compare models on this new set. In this setting, the causal models have similar or worse performance compared to standard association-based estimators such as Lasso. Finally, we compare the performance of causal estimators in simulation studies that reproduce the Kemmeren structure of genetic knockout experiments but without any sample selection bias. These results provide an improved understanding of the performance of several causal models and offer guidance on how future studies should use Kemmeren.

Submitted 26 October, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

Comments: 12 pages, 4 figures, 2 tables

arXiv:2011.06061 [pdf, other]

A Framework for Mediation Analysis with Multiple Exposures, Multivariate Mediators, and Non-Linear Response Models

Authors: James P. Long, Ehsan Irajizad, James D. Doecke, Kim-Anh Do, Min Jin Ha

Abstract: Mediation analysis seeks to identify and quantify the paths by which an exposure affects an outcome. Intermediate variables which are affected by the exposure and which affect the outcome are known as mediators. There exists extensive work on mediation analysis in the context of models with a single mediator and continuous and binary outcomes. However, these methods are often not suitable for multi-omic data that include highly interconnected variables measuring biological mechanisms and various types of outcome variables such as censored survival responses. In this article, we develop a general framework for causal mediation analysis with multiple exposures, multivariate mediators, and continuous, binary, and survival responses. We estimate mediation effects on several scales including the mean difference, odds ratio, and restricted mean scale as appropriate for various outcome models. Our estimation method avoids imposing constraints on model parameters such as the rare disease assumption while accommodating continuous exposures. We evaluate the framework and compare it to other methods in extensive simulation studies by assessing bias, type I error and power at a range of sample sizes, disease prevalences, and number of false mediators. Using Kidney Renal Clear Cell Carcinoma data from The Cancer Genome Atlas, we identify proteins which mediate the effect of metabolic gene expression on survival. Software for implementing this unified framework is made available in an R package (https://github.com/longjp/mediateR).

Submitted 11 November, 2020; originally announced November 2020.

Comments: 17 pages, 5 figures

arXiv:2002.07122 [pdf, other]

Bayesian Structure Learning in Multi-layered Genomic Networks

Authors: Min Jin Ha, Francesco Stingo, Veerabhadran Baladandayuthapani

Abstract: Integrative network modeling of data arising from multiple genomic platforms provides insight into the holistic picture of the interactive system, as well as the flow of information across many disease domains including cancer. The basic data structure consists of a sequence of hierarchically ordered datasets for each individual subject, which facilitates integration of diverse inputs, such as genomic, transcriptomic, and proteomic data. A primary analytical task in such contexts is to model the layered architecture of networks where the vertices can be naturally partitioned into ordered layers, dictated by multiple platforms, and exhibit both undirected and directed relationships. We propose a multi-layered Gaussian graphical model (mlGGM) to investigate conditional independence structures in such multi-level genomic networks in human cancers. We implement a Bayesian node-wise selection (BANS) approach based on variable selection techniques that coherently accounts for the multiple types of dependencies in mlGGM; this flexible strategy exploits edge-specific prior knowledge and selects sparse and interpretable models. Through simulated data generated under various scenarios, we demonstrate that BANS outperforms other existing multivariate regression-based methodologies. Our integrative genomic network analysis for key signaling pathways across multiple cancer types highlights commonalities and differences of p53 integrative networks and epigenetic effects of BRCA2 on p53 and its interaction with T68 phosphorylated CHK2, which may have translational utility for finding biomarkers and therapeutic targets.

Submitted 17 February, 2020; originally announced February 2020.

Comments: 39 pages with 8 figures and 1 table

arXiv:1811.02629 [pdf, other]

Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge

Authors: Spyridon Bakas, Mauricio Reyes, Andras Jakab, Stefan Bauer, Markus Rempfler, Alessandro Crimi, Russell Takeshi Shinohara, Christoph Berger, Sung Min Ha, Martin Rozycki, Marcel Prastawa, Esther Alberts, Jana Lipkova, John Freymann, Justin Kirby, Michel Bilello, Hassan Fathallah-Shaykh, Roland Wiest, Jan Kirschke, Benedikt Wiestler, Rivka Colen, Aikaterini Kotrotsou, Pamela Lamontagne, Daniel Marcus, Mikhail Milchenko, et al. (402 additional authors not shown)

Abstract: Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.

Submitted 23 April, 2019; v1 submitted 5 November, 2018; originally announced November 2018.

Comments: The International Multimodal Brain Tumor Segmentation (BraTS) Challenge

arXiv:1405.1603 [pdf, other]

PenPC: A Two-step Approach to Estimate the Skeletons of High Dimensional Directed Acyclic Graphs

Authors: Min Jin Ha, Wei Sun, Jichun Xie

Abstract: Estimation of the skeleton of a directed acyclic graph (DAG) is of great importance for understanding the underlying DAG, and causal effects can be assessed from the skeleton when the DAG is not identifiable. We propose a novel method named PenPC to estimate the skeleton of a high-dimensional DAG by a two-step approach. We first estimate the non-zero entries of a concentration matrix using penalized regression, and then fix the difference between the concentration matrix and the skeleton by evaluating a set of conditional independence hypotheses. For high-dimensional problems where the number of vertices $p$ is in polynomial or exponential scale of sample size $n$, we study the asymptotic property of PenPC on two types of graphs: traditional random graphs where all the vertices have the same expected number of neighbors, and scale-free graphs where a few vertices may have a large number of neighbors. As illustrated by extensive simulations and applications on gene expression data of cancer patients, PenPC has higher sensitivity and specificity than the state-of-the-art method, the PC-stable algorithm.

Submitted 7 May, 2014; originally announced May 2014.

Showing 1–11 of 11 results for author: Ha, M