Search | arXiv e-print repository

AI Competitions and Benchmarks: Dataset Development

Authors: Romain Egele, Julio C. S. Jacques Junior, Jan N. van Rijn, Isabelle Guyon, Xavier Baró, Albert Clapés, Prasanna Balaprakash, Sergio Escalera, Thomas Moeslund, Jun Wan

Abstract: Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even in today's digital era, where substantial data is generated daily, it is uncommon for it to be readily usable; most often, it necessitates meticulous manual dat… ▽ More Machine learning is now used in many applications thanks to its ability to predict, generate, or discover patterns from large quantities of data. However, the process of collecting and transforming data for practical use is intricate. Even in today's digital era, where substantial data is generated daily, it is uncommon for it to be readily usable; most often, it necessitates meticulous manual data preparation. The haste in develo** new models can frequently result in various shortcomings, potentially posing risks when deployed in real-world scenarios (eg social discrimination, critical failures), leading to the failure or substantial escalation of costs in AI-based projects. This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience, in the development of datasets for machine learning. Initially, we develop the tasks involved in dataset development and offer insights into their effective management (including requirements, design, implementation, evaluation, distribution, and maintenance). Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation. Finally, we address practical considerations regarding dataset distribution and maintenance. △ Less

Submitted 15 April, 2024; originally announced April 2024.

Comments: Preprint version of the 3rd Chapter of the book: Competitions and Benchmarks, the science behind the contests (https://sites.google.com/chalearn.org/book/home)

arXiv:2403.06238 [pdf, other]

Quantifying the Uncertainty of Imputed Demographic Disparity Estimates: The Dual-Bootstrap

Authors: Benjamin Lu, Jia Wan, Derek Ouyang, Jacob Goldin, Daniel E. Ho

Abstract: Measuring average differences in an outcome across racial or ethnic groups is a crucial first step for equity assessments, but researchers often lack access to data on individuals' races and ethnicities to calculate them. A common solution is to impute the missing race or ethnicity labels using proxies, then use those imputations to estimate the disparity. Conventional standard errors mischaracter… ▽ More Measuring average differences in an outcome across racial or ethnic groups is a crucial first step for equity assessments, but researchers often lack access to data on individuals' races and ethnicities to calculate them. A common solution is to impute the missing race or ethnicity labels using proxies, then use those imputations to estimate the disparity. Conventional standard errors mischaracterize the resulting estimate's uncertainty because they treat the imputation model as given and fixed, instead of as an unknown object that must be estimated with uncertainty. We propose a dual-bootstrap approach that explicitly accounts for measurement uncertainty and thus enables more accurate statistical inference, which we demonstrate via simulation. In addition, we adapt our approach to the commonly used Bayesian Improved Surname Geocoding (BISG) imputation algorithm, where direct bootstrap** is infeasible because the underlying Census Bureau data are unavailable. In simulations, we find that measurement uncertainty is generally insignificant for BISG except in particular circumstances; bias, not variance, is likely the predominant source of error. We apply our method to quantify the uncertainty of prevalence estimates of common health conditions by race using data from the American Family Cohort. △ Less

Submitted 10 March, 2024; originally announced March 2024.

Comments: 31 pages; 7 figures; CRIW Race, Ethnicity, and Economic Statistics for the 21st Century, Spring 2024

arXiv:2312.16607 [pdf, other]

A Polarization and Radiomics Feature Fusion Network for the Classification of Hepatocellular Carcinoma and Intrahepatic Cholangiocarcinoma

Authors: Jia Dong, Yao Yao, Liyan Lin, Yang Dong, Jiachen Wan, Ran Peng, Chao Li, Hui Ma

Abstract: Classifying hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) is a critical step in treatment selection and prognosis evaluation for patients with liver diseases. Traditional histopathological diagnosis poses challenges in this context. In this study, we introduce a novel polarization and radiomics feature fusion network, which combines polarization features obtained from Mu… ▽ More Classifying hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) is a critical step in treatment selection and prognosis evaluation for patients with liver diseases. Traditional histopathological diagnosis poses challenges in this context. In this study, we introduce a novel polarization and radiomics feature fusion network, which combines polarization features obtained from Mueller matrix images of liver pathological samples with radiomics features derived from corresponding pathological images to classify HCC and ICC. Our fusion network integrates a two-tier fusion approach, comprising early feature-level fusion and late classification-level fusion. By harnessing the strengths of polarization imaging techniques and image feature-based machine learning, our proposed fusion network significantly enhances classification accuracy. Notably, even at reduced imaging resolutions, the fusion network maintains robust performance due to the additional information provided by polarization features, which may not align with human visual perception. Our experimental results underscore the potential of this fusion network as a powerful tool for computer-aided diagnosis of HCC and ICC, showcasing the benefits and prospects of integrating polarization imaging techniques into the current image-intensive digital pathological diagnosis. We aim to contribute this innovative approach to top-tier journals, offering fresh insights and valuable tools in the fields of medical imaging and cancer diagnosis. By introducing polarization imaging into liver cancer classification, we demonstrate its interdisciplinary potential in addressing challenges in medical image analysis, promising advancements in medical imaging and cancer diagnosis. △ Less

Submitted 27 December, 2023; originally announced December 2023.

arXiv:2311.02146 [pdf, other]

Bayesian Optimization of Function Networks with Partial Evaluations

Authors: Poompol Buathong, Jiayue Wan, Raul Astudillo, Samuel Daulton, Maximilian Balandat, Peter I. Frazier

Abstract: Bayesian optimization is a powerful framework for optimizing functions that are expensive or time-consuming to evaluate. Recent work has considered Bayesian optimization of function networks (BOFN), where the objective function is given by a network of functions, each taking as input the output of previous nodes in the network as well as additional parameters. Leveraging this network structure has… ▽ More Bayesian optimization is a powerful framework for optimizing functions that are expensive or time-consuming to evaluate. Recent work has considered Bayesian optimization of function networks (BOFN), where the objective function is given by a network of functions, each taking as input the output of previous nodes in the network as well as additional parameters. Leveraging this network structure has been shown to yield significant performance improvements. Existing BOFN algorithms for general-purpose networks evaluate the full network at each iteration. However, many real-world applications allow for evaluating nodes individually. To exploit this, we propose a novel knowledge gradient acquisition function that chooses which node and corresponding inputs to evaluate in a cost-aware manner, thereby reducing query costs by evaluating only on a part of the network at each step. We provide an efficient approach to optimizing our acquisition function and show that it outperforms existing BOFN methods and other benchmarks across several synthetic and real-world problems. Our acquisition function is the first to enable cost-aware optimization of a broad class of function networks. △ Less

Submitted 12 June, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

Comments: 34 pages, 15 figures, 3 tables

arXiv:2111.07517 [pdf, other]

Correlation Improves Group Testing: Capturing the Dilution Effect

Authors: Jiayue Wan, Yujia Zhang, Peter I. Frazier

Abstract: Population-wide screening to identify and isolate infectious individuals is a powerful tool for controlling COVID-19 and other infectious diseases. Group testing can enable such screening despite limited testing resources. Samples' viral loads are often positively correlated, either because prevalence and sample collection are both correlated with geography, or through intentional enhancement, e.g… ▽ More Population-wide screening to identify and isolate infectious individuals is a powerful tool for controlling COVID-19 and other infectious diseases. Group testing can enable such screening despite limited testing resources. Samples' viral loads are often positively correlated, either because prevalence and sample collection are both correlated with geography, or through intentional enhancement, e.g., by pooling samples from people in similar risk groups. Such correlation is known to improve test efficiency in mathematical models with fixed sensitivity. In reality, however, dilution degrades a pooled test's sensitivity by an amount that varies with the number of positives in the pool. In the presence of this dilution effect, we study the impact of correlation on the most widely-used group testing procedure, the Dorfman procedure. We show that correlation's effects are significantly altered by the dilution effect. We prove that under a general correlation structure, pooling correlated samples together (called correlated pooling) achieves higher sensitivity but can degrade test efficiency compared to independently pooling the samples (called naive pooling) using the same pool size. We identify an alternative measure of test resource usage, the number of positives found per test consumed, which we argue is better aligned with infection control, and show that correlated pooling outperforms naive pooling on this measure. We build a realistic agent-based simulation to contextualize our theoretical results within an epidemic control framework. We argue that the dilution effect makes it even more important for policy-makers evaluating group testing protocols for large-scale screening to incorporate naturally arising correlation and to intentionally maximize correlation. △ Less

Submitted 5 November, 2023; v1 submitted 14 November, 2021; originally announced November 2021.

Comments: 66 pages, 10 figures, 15 tables

arXiv:2102.03497 [pdf, other]

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

Authors: Ziquan Liu, Yufei Cui, Jia Wan, Yu Mao, Antoni B. Chan

Abstract: Weight decay is often used to ensure good generalization in the training practice of deep neural networks with batch normalization (BN-DNNs), where some convolution layers are invariant to weight rescaling due to the normalization. In this paper, we demonstrate that the practical usage of weight decay still has some unsolved problems in spite of existing theoretical work on explaining the effect o… ▽ More Weight decay is often used to ensure good generalization in the training practice of deep neural networks with batch normalization (BN-DNNs), where some convolution layers are invariant to weight rescaling due to the normalization. In this paper, we demonstrate that the practical usage of weight decay still has some unsolved problems in spite of existing theoretical work on explaining the effect of weight decay in BN-DNNs. On the one hand, when the non-adaptive learning rate e.g. SGD with momentum is used, the effective learning rate continues to increase even after the initial training stage, which leads to an overfitting effect in many neural architectures. On the other hand, in both SGDM and adaptive learning rate optimizers e.g. Adam, the effect of weight decay on generalization is quite sensitive to the hyperparameter. Thus, finding an optimal weight decay parameter requires extensive parameter searching. To address those weaknesses, we propose to regularize the weight norm using a simple yet effective weight rescaling (WRS) scheme as an alternative to weight decay. WRS controls the weight norm by explicitly rescaling it to the unit norm, which prevents a large increase to the gradient but also ensures a sufficiently large effective learning rate to improve generalization. On a variety of computer vision applications including image classification, object detection, semantic segmentation and crowd counting, we show the effectiveness and robustness of WRS compared with weight decay, implicit weight rescaling (weight standardization) and gradient projection (AdamP). △ Less

Submitted 17 June, 2022; v1 submitted 5 February, 2021; originally announced February 2021.

Comments: Preprint

arXiv:2008.06642 [pdf, other]

Group Testing Enables Asymptomatic Screening for COVID-19 Mitigation: Feasibility and Optimal Pool Size Selection with Dilution Effects

Authors: Yifan Lin, Yuxuan Ren, **gyuan Wan, Massey Cashore, Jiayue Wan, Yujia Zhang, Peter Frazier, Enlu Zhou

Abstract: Repeated asymptomatic screening for SARS-CoV-2 promises to control spread of the virus but would require too many resources to implement at scale. Group testing is promising for screening more people with fewer test resources: multiple samples tested together in one pool can be excluded with one negative test result. Existing approaches to group testing design for SARS-CoV-2 asymptomatic screening… ▽ More Repeated asymptomatic screening for SARS-CoV-2 promises to control spread of the virus but would require too many resources to implement at scale. Group testing is promising for screening more people with fewer test resources: multiple samples tested together in one pool can be excluded with one negative test result. Existing approaches to group testing design for SARS-CoV-2 asymptomatic screening, however, do not consider dilution effects: that false negatives become more common with larger pools. As a consequence, they may recommend pool sizes that are too large or misestimate the benefits of screening. Modeling dilution effects, we derive closed-form expressions for the expected number of tests and false negative/positives per person screened under two popular group testing methods: the linear and square array methods. We find that test error correlation induced by a common viral load across an individual's samples results in many fewer false negatives than would be expected from less realistic but more widely assumed independent errors. This insight also suggests that false positives can be controlled through repeated tests without significantly increasing false negatives. Using these closed-form expressions to trace a Pareto frontier over error rates and tests, we design testing protocols for repeated asymptomatic screening of a large population. We minimize disease prevalence by optimizing a time-varying pool sizes and screening frequency constrained by daily test capacity and a false positive limit. This provides a testing protocol practitioners can use for mitigating COVID-19. In a case study, we demonstrate the effectiveness of this methodology in controlling spread. △ Less

Submitted 16 November, 2020; v1 submitted 14 August, 2020; originally announced August 2020.

arXiv:1909.11532 [pdf, other]

Deep Neural Network Framework Based on Backward Stochastic Differential Equations for Pricing and Hedging American Options in High Dimensions

Authors: Yangang Chen, Justin W. L. Wan

Abstract: We propose a deep neural network framework for computing prices and deltas of American options in high dimensions. The architecture of the framework is a sequence of neural networks, where each network learns the difference of the price functions between adjacent timesteps. We introduce the least squares residual of the associated backward stochastic differential equation as the loss function. Our… ▽ More We propose a deep neural network framework for computing prices and deltas of American options in high dimensions. The architecture of the framework is a sequence of neural networks, where each network learns the difference of the price functions between adjacent timesteps. We introduce the least squares residual of the associated backward stochastic differential equation as the loss function. Our proposed framework yields prices and deltas on the entire spacetime, not only at a given point. The computational cost of the proposed approach is quadratic in dimension, which addresses the curse of dimensionality issue that state-of-the-art approaches suffer. Our numerical simulations demonstrate these contributions, and show that the proposed neural network framework outperforms state-of-the-art approaches in high dimensions. △ Less

Submitted 25 September, 2019; originally announced September 2019.

Comments: 35 pages, 11 figures, 15 tables

arXiv:1811.07143 [pdf, other]

High Quality Prediction of Protein Q8 Secondary Structure by Diverse Neural Network Architectures

Authors: Iddo Drori, Isht Dwivedi, Pranav Shrestha, Jeffrey Wan, Yueqi Wang, Yunchu He, Anthony Mazza, Hugh Krogh-Freeman, Dimitri Leggas, Kendal Sandridge, Linyong Nan, Kaveri Thakoor, Chinmay Joshi, Sonam Goenka, Chen Keasar, Itsik Pe'er

Abstract: We tackle the problem of protein secondary structure prediction using a common task framework. This lead to the introduction of multiple ideas for neural architectures based on state of the art building blocks, used in this task for the first time. We take a principled machine learning approach, which provides genuine, unbiased performance measures, correcting longstanding errors in the applicatio… ▽ More We tackle the problem of protein secondary structure prediction using a common task framework. This lead to the introduction of multiple ideas for neural architectures based on state of the art building blocks, used in this task for the first time. We take a principled machine learning approach, which provides genuine, unbiased performance measures, correcting longstanding errors in the application domain. We focus on the Q8 resolution of secondary structure, an active area for continuously improving methods. We use an ensemble of strong predictors to achieve accuracy of 70.7% (on the CB513 test set using the CB6133filtered training set). These results are statistically indistinguishable from those of the top existing predictors. In the spirit of reproducible research we make our data, models and code available, aiming to set a gold standard for purity of training and testing sets. Such good practices lower entry barriers to this domain and facilitate reproducible, extendable research. △ Less

Submitted 17 November, 2018; originally announced November 2018.

Comments: NIPS 2018 Workshop on Machine Learning for Molecules and Materials, 10 pages

arXiv:1104.1671 [pdf, ps, other]

Density-based Monte Carlo filter and its applications in estimation of unobservable variables and pharmacokinetic parameters

Authors: Guanghui Huang, Jian** Wan, Hui Chen

Abstract: Nonlinear stochastic differential equation models with unobservable variables are now widely used in the analysis of PK/PD data. The unobservable variables are often estimated with extended Kalman filter (EKF), and the unknown pharmacokinetic parameters are usually estimated by maximum likelihood estimator. However, EKF is inadequate for nonlinear PK/PD models, and MLE is known to be biased downwa… ▽ More Nonlinear stochastic differential equation models with unobservable variables are now widely used in the analysis of PK/PD data. The unobservable variables are often estimated with extended Kalman filter (EKF), and the unknown pharmacokinetic parameters are usually estimated by maximum likelihood estimator. However, EKF is inadequate for nonlinear PK/PD models, and MLE is known to be biased downwards. A density-based Monte Carlo filter (DMF) is proposed to estimate the unobservable variables, and a simulation-based procedure is proposed to estimate the unknown parameters in this paper, where a genetic algorithm is designed to search the optimal values of pharmacokinetic parameters. The performances of EKF and DMF are compared through simulations, and it is found that the results based on DMF are more accurate than those given by EKF with respect to mean absolute error. △ Less

Submitted 5 March, 2012; v1 submitted 9 April, 2011; originally announced April 2011.

Comments: 15 pages, 1 figure, 2 tables

Showing 1–10 of 10 results for author: Wan, J