-
Bell vs Bell: a ding-dong battle over quantum incompleteness
Authors:
Michael J. W. Hall
Abstract:
Does determinism (or even the incompleteness of quantum mechanics) follow from locality and perfect correlations? In a 1964 paper John Bell gave the first demonstration that quantum mechanics is incompatible with local hidden variables. Since then a vigorous debate has rung out over whether he relied on an assumption of determinism or instead, as he later claimed in a 1981 paper, derived determini…
▽ More
Does determinism (or even the incompleteness of quantum mechanics) follow from locality and perfect correlations? In a 1964 paper John Bell gave the first demonstration that quantum mechanics is incompatible with local hidden variables. Since then a vigorous debate has rung out over whether he relied on an assumption of determinism or instead, as he later claimed in a 1981 paper, derived determinism from assumptions of locality and perfect correlation. This paper aims to bring clarity to the debate via simple examples and rigorous results. It is first recalled, via quantum and classical counterexamples, that the weakest statistical form of locality consistent with Bell's 1964 paper (parameter independence) is insufficient for the derivation of determinism. Attention is then turned to critically assess Bell's appealing to the Einstein-Rosen-Podolsky incompleteness argument to support his claim. It is shown this argument is itself incomplete, via counterexamples that expose two logical gaps. However, closing these gaps via a strong ``counterfactual'' reality criterion enables a rigorous derivation of each of quantum incompleteness, determinism and parameter independence, and in this sense justifies Bell's claim. Consequences for quantum interpretations are briefly discussed.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Why quantum correlations are shocking
Authors:
Michael J. W. Hall
Abstract:
A simple minimalist argument is given for why some correlations between quantum systems boggle our classical intuition. The argument relies on two elementary physical assumptions, and recovers the standard experimentally-testable Bell inequality in a form that applies equally well to correlations between six-sided dice and between photon polarizations. The first assumption, that measurement select…
▽ More
A simple minimalist argument is given for why some correlations between quantum systems boggle our classical intuition. The argument relies on two elementary physical assumptions, and recovers the standard experimentally-testable Bell inequality in a form that applies equally well to correlations between six-sided dice and between photon polarizations. The first assumption, that measurement selection in a first lab leaves the measurement statistics in a remote lab invariant (no-signaling), has been empirically verified, and is shown to be equivalent to the existence of a joint probability distribution for quantities measured in the first lab. The observed violation of the Bell inequality is then equivalent to failure of a second assumption, that measurement selection in the remote lab leaves this joint distribution invariant. Indeed, the degree of violation lower-bounds the variation of the joint distribution. It directly follows there are just three possible physical mechanisms underlying such violations -- action-at-a-distance (superluminality), unavoidable common factors linking measurement choice and distant properties (conspiracy), and intrinsically incompatible physical quantities (complementarity). The argument extends to all Bell inequalities, and is briefly compared with other derivations.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Decomposed evaluations of geographic disparities in text-to-image models
Authors:
Abhishek Sureddy,
Dishant Padalia,
Nandhinee Periyakaruppa,
Oindrila Saha,
Adina Williams,
Adriana Romero-Soriano,
Megan Richards,
Polina Kirichenko,
Melissa Hall
Abstract:
Recent work has identified substantial disparities in generated images of different geographic regions, including stereotypical depictions of everyday objects like houses and cars. However, existing measures for these disparities have been limited to either human evaluations, which are time-consuming and costly, or automatic metrics evaluating full images, which are unable to attribute these dispa…
▽ More
Recent work has identified substantial disparities in generated images of different geographic regions, including stereotypical depictions of everyday objects like houses and cars. However, existing measures for these disparities have been limited to either human evaluations, which are time-consuming and costly, or automatic metrics evaluating full images, which are unable to attribute these disparities to specific parts of the generated images. In this work, we introduce a new set of metrics, Decomposed Indicators of Disparities in Image Generation (Decomposed-DIG), that allows us to separately measure geographic disparities in the depiction of objects and backgrounds in generated images. Using Decomposed-DIG, we audit a widely used latent diffusion model and find that generated images depict objects with better realism than backgrounds and that backgrounds in generated images tend to contain larger regional disparities than objects. We use Decomposed-DIG to pinpoint specific examples of disparities, such as stereotypical background generation in Africa, struggling to generate modern vehicles in Africa, and unrealistically placing some objects in outdoor settings. Informed by our metric, we use a new prompting structure that enables a 52% worst-region improvement and a 20% average improvement in generated background diversity.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Consistency-diversity-realism Pareto fronts of conditional image generative models
Authors:
Pietro Astolfi,
Marlene Careil,
Melissa Hall,
Oscar Mañas,
Matthew Muckley,
Jakob Verbeek,
Adriana Romero Soriano,
Michal Drozdzal
Abstract:
Building world models that accurately and comprehensively represent the real world is the utmost aspiration for conditional image generative models as it would enable their use as world simulators. For these models to be successful world models, they should not only excel at image quality and prompt-image consistency but also ensure high representation diversity. However, current research in gener…
▽ More
Building world models that accurately and comprehensively represent the real world is the utmost aspiration for conditional image generative models as it would enable their use as world simulators. For these models to be successful world models, they should not only excel at image quality and prompt-image consistency but also ensure high representation diversity. However, current research in generative models mostly focuses on creative applications that are predominantly concerned with human preferences of image quality and aesthetics. We note that generative models have inference time mechanisms - or knobs - that allow the control of generation consistency, quality, and diversity. In this paper, we use state-of-the-art text-to-image and image-and-text-to-image models and their knobs to draw consistency-diversity-realism Pareto fronts that provide a holistic view on consistency-diversity-realism multi-objective. Our experiments suggest that realism and consistency can both be improved simultaneously; however there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing significantly the representation diversity. By computing Pareto fronts on a geodiverse dataset, we find that the first version of latent diffusion models tends to perform better than more recent models in all axes of evaluation, and there exist pronounced consistency-diversity-realism disparities between geographical regions. Overall, our analysis clearly shows that there is no best model and the choice of model should be determined by the downstream application. With this analysis, we invite the research community to consider Pareto fronts as an analytical tool to measure progress towards world models.
△ Less
Submitted 14 June, 2024;
originally announced June 2024.
-
You are what you eat? Feeding foundation models a regionally diverse food dataset of World Wide Dishes
Authors:
Jabez Magomere,
Shu Ishida,
Tejumade Afonja,
Aya Salama,
Daniel Kochin,
Foutse Yuehgoh,
Imane Hamzaoui,
Raesetje Sefala,
Aisha Alaagib,
Elizaveta Semenova,
Lauren Crais,
Siobhan Mackenzie Hall
Abstract:
Foundation models are increasingly ubiquitous in our daily lives, used in everyday tasks such as text-image searches, interactions with chatbots, and content generation. As use increases, so does concern over the disparities in performance and fairness of these models for different people in different parts of the world. To assess these growing regional disparities, we present World Wide Dishes, a…
▽ More
Foundation models are increasingly ubiquitous in our daily lives, used in everyday tasks such as text-image searches, interactions with chatbots, and content generation. As use increases, so does concern over the disparities in performance and fairness of these models for different people in different parts of the world. To assess these growing regional disparities, we present World Wide Dishes, a mixed text and image dataset consisting of 765 dishes, with dish names collected in 131 local languages. World Wide Dishes has been collected purely through human contribution and decentralised means, by creating a website widely distributed through social networks. Using the dataset, we demonstrate a novel means of operationalising capability and representational biases in foundation models such as language models and text-to-image generative models. We enrich these studies with a pilot community review to understand, from a first-person perspective, how these models generate images for people in five African countries and the United States.
We find that these models generally do not produce quality text and image outputs of dishes specific to different regions. This is true even for the US, which is typically considered to be more well-resourced in training data - though the generation of US dishes does outperform that of the investigated African countries. The models demonstrate a propensity to produce outputs that are inaccurate as well as culturally misrepresentative, flattening, and insensitive. These failures in capability and representational bias have the potential to further reinforce stereotypes and disproportionately contribute to erasure based on region. The dataset and code are available at https://github.com/oxai/world-wide-dishes/.
△ Less
Submitted 13 June, 2024;
originally announced June 2024.
-
Improving Geo-diversity of Generated Images with Contextualized Vendi Score Guidance
Authors:
Reyhane Askari Hemmat,
Melissa Hall,
Alicia Sun,
Candace Ross,
Michal Drozdzal,
Adriana Romero-Soriano
Abstract:
With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects…
▽ More
With the growing popularity of text-to-image generative models, there has been increasing focus on understanding their risks and biases. Recent work has found that state-of-the-art models struggle to depict everyday objects with the true diversity of the real world and have notable gaps between geographic regions. In this work, we aim to increase the diversity of generated images of common objects such that per-region variations are representative of the real world. We introduce an inference time intervention, contextualized Vendi Score Guidance (c-VSG), that guides the backwards steps of latent diffusion models to increase the diversity of a sample as compared to a "memory bank" of previously generated images while constraining the amount of variation within that of an exemplar set of real-world contextualizing images. We evaluate c-VSG with two geographically representative datasets and find that it substantially increases the diversity of generated images, both for the worst performing regions and on average, while simultaneously maintaining or improving image quality and consistency. Additionally, qualitative analyses reveal that diversity of generated images is significantly improved, including along the lines of reductive region portrayals present in the original model. We hope that this work is a step towards text-to-image generative models that reflect the true geographic diversity of the world.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
An Introduction to Vision-Language Modeling
Authors:
Florian Bordes,
Richard Yuanzhe Pang,
Anurag Ajay,
Alexander C. Li,
Adrien Bardes,
Suzanne Petryk,
Oscar Mañas,
Zhiqiu Lin,
Anas Mahmoud,
Bargav Jayaraman,
Mark Ibrahim,
Melissa Hall,
Yunyang Xiong,
Jonathan Lebensold,
Candace Ross,
Srihari Jayakumar,
Chuan Guo,
Diane Bouchacourt,
Haider Al-Tahan,
Karthik Padthe,
Vasu Sharma,
Hu Xu,
Xiaoqing Ellen Tan,
Megan Richards,
Samuel Lavoie
, et al. (16 additional authors not shown)
Abstract:
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technol…
▽ More
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind map** vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on map** images to language, we also discuss extending VLMs to videos.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Controlling dephasing of coupled qubits via shared-bath coherence
Authors:
L. M. J. Hall,
L. S. Sirkina,
A. Morreau,
W. Langbein,
E. A. Muljarov
Abstract:
The interaction of a quantum system with its environment limits qubit coherence times and restricts its utility in quantum information processing applications. In this Letter, we show that the decoherence of a coupled qubit system can be minimized, or even eliminated by exploiting the quantum coherence of the bath itself. We investigate the dephasing in a system of two spatially separated, electro…
▽ More
The interaction of a quantum system with its environment limits qubit coherence times and restricts its utility in quantum information processing applications. In this Letter, we show that the decoherence of a coupled qubit system can be minimized, or even eliminated by exploiting the quantum coherence of the bath itself. We investigate the dephasing in a system of two spatially separated, electronically decoupled qubits, with direct or mediated coupling, interacting with a shared bath. For illustration we treat Förster or cavity-mediated coupling between semiconductor quantum dots interacting with acoustic phonons. Using the rigorous method of Trotter's decomposition with cumulant expansion, we demonstrate a reduction in the dephasing rates at specific distances. This control is a coherent effect of the shared bath and is absent for independent baths. It can be understood in terms of phonon-assisted transitions between the entangled qubit states of the coupled system.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
Towards Geographic Inclusion in the Evaluation of Text-to-Image Models
Authors:
Melissa Hall,
Samuel J. Bell,
Candace Ross,
Adina Williams,
Michal Drozdzal,
Adriana Romero Soriano
Abstract:
Rapid progress in text-to-image generative models coupled with their deployment for visual content creation has magnified the importance of thoroughly evaluating their performance and identifying potential biases. In pursuit of models that generate images that are realistic, diverse, visually appealing, and consistent with the given prompt, researchers and practitioners often turn to automated met…
▽ More
Rapid progress in text-to-image generative models coupled with their deployment for visual content creation has magnified the importance of thoroughly evaluating their performance and identifying potential biases. In pursuit of models that generate images that are realistic, diverse, visually appealing, and consistent with the given prompt, researchers and practitioners often turn to automated metrics to facilitate scalable and cost-effective performance profiling. However, commonly-used metrics often fail to account for the full diversity of human preference; often even in-depth human evaluations face challenges with subjectivity, especially as interpretations of evaluation criteria vary across regions and cultures. In this work, we conduct a large, cross-cultural study to study how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images from state-of-the art public APIs. We collect over 65,000 image annotations and 20 survey responses. We contrast human annotations with common automated metrics, finding that human preferences vary notably across geographic location and that current metrics do not fully account for this diversity. For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are considered geographically representative. In addition, the utility of automatic evaluations is dependent on assumptions about their set-up, such as the alignment of feature extractors with human perception of object similarity or the definition of "appeal" captured in reference datasets used to ground evaluations. We recommend steps for improved automatic and human evaluations.
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Authors:
Oscar Mañas,
Pietro Astolfi,
Melissa Hall,
Candace Ross,
Jack Urbanek,
Adina Williams,
Aishwarya Agrawal,
Adriana Romero-Soriano,
Michal Drozdzal
Abstract:
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions…
▽ More
Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
△ Less
Submitted 26 March, 2024;
originally announced March 2024.
-
Defending Against Indirect Prompt Injection Attacks With Spotlighting
Authors:
Keegan Hines,
Gary Lopez,
Matthew Hall,
Federico Zarfati,
Yonatan Zunger,
Emre Kiciman
Abstract:
Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embeddin…
▽ More
Large Language Models (LLMs), while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
Large language models surpass human experts in predicting neuroscience results
Authors:
Xiaoliang Luo,
Akilles Rechardt,
Guangzhi Sun,
Kevin K. Nejad,
Felipe Yáñez,
Bati Yilmaz,
Kangjoo Lee,
Alexandra O. Cohen,
Valentina Borghesani,
Anton Pashkov,
Daniele Marinazzo,
Jonathan Nicholas,
Alessandro Salatiello,
Ilia Sucholutsky,
Pasquale Minervini,
Sepehr Razavi,
Roberta Rocca,
Elkhan Yusifov,
Tereza Okalova,
Nianlong Gu,
Martin Ferianc,
Mikail Khona,
Kaustubh R. Patil,
Pui-Shee Lee,
Rui Mata
, et al. (14 additional authors not shown)
Abstract:
Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. To evaluate this possibility, we created Brain…
▽ More
Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. To evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs were confident in their predictions, they were more likely to be correct, which presages a future where humans and LLMs team together to make discoveries. Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavors.
△ Less
Submitted 21 June, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.
-
Integrating ytopt and libEnsemble to Autotune OpenMC
Authors:
Xingfu Wu,
John R. Tramm,
Jeffrey Larson,
John-Luke Navarro,
Prasanna Balaprakash,
Brice Videau,
Michael Kruse,
Paul Hovland,
Valerie Taylor,
Mary Hall
Abstract:
ytopt is a Python machine-learning-based autotuning software package developed within the ECP PROTEAS-TUNE project. The ytopt software adopts an asynchronous search framework that consists of sampling a small number of input parameter configurations and progressively fitting a surrogate model over the input-output space until exhausting the user-defined maximum number of evaluations or the wall-cl…
▽ More
ytopt is a Python machine-learning-based autotuning software package developed within the ECP PROTEAS-TUNE project. The ytopt software adopts an asynchronous search framework that consists of sampling a small number of input parameter configurations and progressively fitting a surrogate model over the input-output space until exhausting the user-defined maximum number of evaluations or the wall-clock time. libEnsemble is a Python toolkit for coordinating workflows of asynchronous and dynamic ensembles of calculations across massively parallel resources developed within the ECP PETSc/TAO project. libEnsemble helps users take advantage of massively parallel resources to solve design, decision, and inference problems and expands the class of problems that can benefit from increased parallelism. In this paper we present our methodology and framework to integrate ytopt and libEnsemble to take advantage of massively parallel resources to accelerate the autotuning process. Specifically, we focus on using the proposed framework to autotune the ECP ExaSMR application OpenMC, an open source Monte Carlo particle transport code. OpenMC has seven tunable parameters some of which have large ranges such as the number of particles in-flight, which is in the range of 100,000 to 8 million, with its default setting of 1 million. Setting the proper combination of these parameter values to achieve the best performance is extremely time-consuming. Therefore, we apply the proposed framework to autotune the MPI/OpenMP offload version of OpenMC based on a user-defined metric such as the figure of merit (FoM) (particles/s) or energy efficiency energy-delay product (EDF) on the OLCF Frontier TDS system Crusher. The experimental results show that we achieve improvement up to 29.49% in FoM and up to 30.44% in EDP.
△ Less
Submitted 14 February, 2024;
originally announced February 2024.
-
Fully automated planning for anatomical fetal brain MRI on 0.55T
Authors:
Sara Neves Silva,
Sarah McElroy,
Jordina Aviles Verdera,
Kathleen Colford,
Kamilah St Clair,
Raphael Tomi-Tricot,
Alena Uus,
Valery Ozenne,
Megan Hall,
Lisa Story,
Kuberan Pushparajah,
Mary A Rutherford,
Joseph V Hajnal,
Jana Hutter
Abstract:
Purpose: Widening the availability of fetal MRI with fully automatic real-time planning of radiological brain planes on 0.55T MRI. Methods: Deep learning-based detection of key brain landmarks on a whole-uterus EPI scan enables the subsequent fully automatic planning of the radiological single-shot Turbo Spin Echo acquisitions. The landmark detection pipeline was trained on over 120 datasets from…
▽ More
Purpose: Widening the availability of fetal MRI with fully automatic real-time planning of radiological brain planes on 0.55T MRI. Methods: Deep learning-based detection of key brain landmarks on a whole-uterus EPI scan enables the subsequent fully automatic planning of the radiological single-shot Turbo Spin Echo acquisitions. The landmark detection pipeline was trained on over 120 datasets from varying field strength, echo times and resolutions and quantitatively evaluated. The entire automatic planning solution was tested prospectively in nine fetal subjects between 20 and 37 weeks. Comprehensive evaluation of all steps, the distance between manual and automatic landmarks, the planning quality and the resulting image quality was conducted. Results: Prospective automatic planning was performed in real-time without latency in all subjects. The landmark detection accuracy was 4.21+-2.56 mm for the fetal eyes and 6.47+-3.23 for the cerebellum, planning quality was 2.44/3 (compared to 2.56/3 for manual planning) and diagnostic image quality was 2.14 compared to 2.07 for manual planning. Conclusions: Real-time automatic planning of all three key fetal brain planes was successfully achieved and will pave the way towards simplifying the acquisition of fetal MRI thereby widening the availability of this modality in non-specialist centres.
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
Transfer-Learning-Based Autotuning Using Gaussian Copula
Authors:
Thomas Randall,
Jaehoon Koo,
Brice Videau,
Michael Kruse,
Xingfu Wu,
Paul Hovland,
Mary Hall,
Rong Ge,
Prasanna Balaprakash
Abstract:
As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computatio…
▽ More
As diverse high-performance computing (HPC) systems are built, many opportunities arise for applications to solve larger problems than ever before. Given the significantly increased complexity of these HPC systems and application tuning, empirical performance tuning, such as autotuning, has emerged as a promising approach in recent years. Despite its effectiveness, autotuning is often a computationally expensive approach. Transfer learning (TL)-based autotuning seeks to address this issue by leveraging the data from prior tuning. Current TL methods for autotuning spend significant time modeling the relationship between parameter configurations and performance, which is ineffective for few-shot (that is, few empirical evaluations) tuning on new tasks. We introduce the first generative TL-based autotuning approach based on the Gaussian copula (GC) to model the high-performing regions of the search space from prior data and then generate high-performing configurations for new tasks. This allows a sampling-based approach that maximizes few-shot performance and provides the first probabilistic estimation of the few-shot budget for effective TL-based autotuning. We compare our generative TL approach with state-of-the-art autotuning techniques on several benchmarks. We find that the GC is capable of achieving 64.37% of peak few-shot performance in its first evaluation. Furthermore, the GC model can determine a few-shot transfer budget that yields up to 33.39$\times$ speedup, a dramatic improvement over the 20.58$\times$ speedup using prior techniques.
△ Less
Submitted 9 January, 2024;
originally announced January 2024.
-
Microscaling Data Formats for Deep Learning
Authors:
Bita Darvish Rouhani,
Ritchie Zhao,
Ankit More,
Mathew Hall,
Alireza Khodamoradi,
Summer Deng,
Dhruv Choudhary,
Marius Cornea,
Eric Dellinger,
Kristof Denolf,
Stosic Dusan,
Venmugil Elango,
Maximilian Golub,
Alexander Heinecke,
Phil James-Roxby,
Dharmesh Jani,
Gaurav Kolhe,
Martin Langhammer,
Ada Li,
Levi Melnick,
Maral Mesmakhosroshahi,
Andres Rodriguez,
Michael Schulte,
Rasoul Shafipour,
Lei Shao
, et al. (8 additional authors not shown)
Abstract:
Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical result…
▽ More
Narrow bit-width data formats are key to reducing the computational and storage costs of modern deep learning applications. This paper evaluates Microscaling (MX) data formats that combine a per-block scaling factor with narrow floating-point and integer types for individual elements. MX formats balance the competing needs of hardware efficiency, model accuracy, and user friction. Empirical results on over two dozen benchmarks demonstrate practicality of MX data formats as a drop-in replacement for baseline FP32 for AI inference and training with low user friction. We also show the first instance of training generative language models at sub-8-bit weights, activations, and gradients with minimal accuracy loss and no modifications to the training recipe.
△ Less
Submitted 19 October, 2023; v1 submitted 16 October, 2023;
originally announced October 2023.
-
Quantifying and mitigating the impact of label errors on model disparity metrics
Authors:
Julius Adebayo,
Melissa Hall,
Bowen Yu,
Bobbie Chern
Abstract:
Errors in labels obtained via human annotation adversely affect a model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's disparity metrics. Here we study the effect of label error on a model's disparity metrics. We empirically characterize how varying levels of label error, in b…
▽ More
Errors in labels obtained via human annotation adversely affect a model's performance. Existing approaches propose ways to mitigate the effect of label error on a model's downstream accuracy, yet little is known about its impact on a model's disparity metrics. Here we study the effect of label error on a model's disparity metrics. We empirically characterize how varying levels of label error, in both training and test data, affect these disparity metrics. We find that group calibration and other metrics are sensitive to train-time and test-time label error -- particularly for minority groups. This disparate effect persists even for models trained with noise-aware algorithms. To mitigate the impact of training-time label error, we present an approach to estimate the influence of a training input's label on a model's group disparity metric. We empirically assess the proposed approach on a variety of datasets and find significant improvement, compared to alternative approaches, in identifying training inputs that improve a model's disparity metric. We complement the approach with an automatic relabel-and-finetune scheme that produces updated models with, provably, improved group calibration error.
△ Less
Submitted 3 October, 2023;
originally announced October 2023.
-
VPA: Fully Test-Time Visual Prompt Adaptation
Authors:
Jiachen Sun,
Mark Ibrahim,
Melissa Hall,
Ivan Evtimov,
Z. Morley Mao,
Cristian Canton Ferrer,
Caner Hazirbas
Abstract:
Textual prompt tuning has demonstrated significant performance improvements in adapting natural language processing models to a variety of downstream tasks by treating hand-engineered prompts as trainable parameters. Inspired by the success of textual prompting, several studies have investigated the efficacy of visual prompt tuning. In this work, we present Visual Prompt Adaptation (VPA), the firs…
▽ More
Textual prompt tuning has demonstrated significant performance improvements in adapting natural language processing models to a variety of downstream tasks by treating hand-engineered prompts as trainable parameters. Inspired by the success of textual prompting, several studies have investigated the efficacy of visual prompt tuning. In this work, we present Visual Prompt Adaptation (VPA), the first framework that generalizes visual prompting with test-time adaptation. VPA introduces a small number of learnable tokens, enabling fully test-time and storage-efficient adaptation without necessitating source-domain information. We examine our VPA design under diverse adaptation settings, encompassing single-image, batched-image, and pseudo-label adaptation. We evaluate VPA on multiple tasks, including out-of-distribution (OOD) generalization, corruption robustness, and domain adaptation. Experimental results reveal that VPA effectively enhances OOD generalization by 3.3% across various models, surpassing previous test-time approaches. Furthermore, we show that VPA improves corruption robustness by 6.5% compared to strong baselines. Finally, we demonstrate that VPA also boosts domain adaptation performance by relatively 5.2%. Our VPA also exhibits marked effectiveness in improving the robustness of zero-shot recognition for vision-language models.
△ Less
Submitted 26 September, 2023;
originally announced September 2023.
-
Limits of economy and fidelity for programmable assembly of size-controlled triply-periodic polyhedra
Authors:
Carlos M. Duque,
Douglas M. Hall,
Botond Tyukodi,
Michael F. Hagan,
Christian D. Santangelo,
Gregory M. Grason
Abstract:
We propose and investigate an extension of the Caspar-Klug symmetry principles for viral capsid assembly to the programmable assembly of size-controlled triply-periodic polyhedra, discrete variants of the Primitive, Diamond, and Gyroid cubic minimal surfaces. Inspired by a recent class of programmable DNA origami colloids, we demonstrate that the economy of design in these crystalline assemblies -…
▽ More
We propose and investigate an extension of the Caspar-Klug symmetry principles for viral capsid assembly to the programmable assembly of size-controlled triply-periodic polyhedra, discrete variants of the Primitive, Diamond, and Gyroid cubic minimal surfaces. Inspired by a recent class of programmable DNA origami colloids, we demonstrate that the economy of design in these crystalline assemblies -- in terms of the growth of the number of distinct particle species required with the increased size-scale (e.g. periodicity) -- is comparable to viral shells. We further test the role of geometric specificity in these assemblies via dynamical assembly simulations, which show that conditions for simultaneously efficient and high-fidelity assembly require an intermediate degree of flexibility of local angles and lengths in programmed assembly. Off-target misassembly occurs via incorporation of a variant of disclination defects, generalized to the case of hyperbolic crystals. The possibility of these topological defects is a direct consequence of the very same symmetry principles that underlie the economical design, exposing a basic tradeoff between design economy and fidelity of programmable, size controlled assembly.
△ Less
Submitted 8 September, 2023;
originally announced September 2023.
-
FACET: Fairness in Computer Vision Evaluation Benchmark
Authors:
Laura Gustafson,
Chloe Rolland,
Nikhila Ravi,
Quentin Duval,
Aaron Adcock,
Cheng-Yang Fu,
Melissa Hall,
Candace Ross
Abstract:
Computer vision models have known performance disparities across attributes such as gender and skin tone. This means during tasks such as classification and detection, model performance differs for certain classes based on the demographics of the people in the image. These disparities have been shown to exist, but until now there has not been a unified approach to measure these differences for com…
▽ More
Computer vision models have known performance disparities across attributes such as gender and skin tone. This means during tasks such as classification and detection, model performance differs for certain classes based on the demographics of the people in the image. These disparities have been shown to exist, but until now there has not been a unified approach to measure these differences for common use-cases of computer vision models. We present a new benchmark named FACET (FAirness in Computer Vision EvaluaTion), a large, publicly available evaluation set of 32k images for some of the most common vision tasks - image classification, object detection and segmentation. For every image in FACET, we hired expert reviewers to manually annotate person-related attributes such as perceived skin tone and hair type, manually draw bounding boxes and label fine-grained person-related classes such as disk jockey or guitarist. In addition, we use FACET to benchmark state-of-the-art vision models and present a deeper understanding of potential performance disparities and challenges across sensitive demographic attributes. With the exhaustive annotations collected, we probe models using single demographics attributes as well as multiple attributes using an intersectional approach (e.g. hair color and perceived skin tone). Our results show that classification, detection, segmentation, and visual grounding models exhibit performance disparities across demographic attributes and intersections of attributes. These harms suggest that not all people represented in datasets receive fair and equitable treatment in these vision tasks. We hope current and future results using our benchmark will contribute to fairer, more robust vision models. FACET is available publicly at https://facet.metademolab.com/
△ Less
Submitted 31 August, 2023;
originally announced September 2023.
-
DIG In: Evaluating Disparities in Image Generations with Indicators for Geographic Diversity
Authors:
Melissa Hall,
Candace Ross,
Adina Williams,
Nicolas Carion,
Michal Drozdzal,
Adriana Romero Soriano
Abstract:
The unprecedented photorealistic results achieved by recent text-to-image generative systems and their increasing use as plug-and-play content creation solutions make it crucial to understand their potential biases. In this work, we introduce three indicators to evaluate the realism, diversity and prompt-generation consistency of text-to-image generative systems when prompted to generate objects f…
▽ More
The unprecedented photorealistic results achieved by recent text-to-image generative systems and their increasing use as plug-and-play content creation solutions make it crucial to understand their potential biases. In this work, we introduce three indicators to evaluate the realism, diversity and prompt-generation consistency of text-to-image generative systems when prompted to generate objects from across the world. Our indicators complement qualitative analysis of the broader impact of such systems by enabling automatic and efficient benchmarking of geographic disparities, an important step towards building responsible visual content creation systems. We use our proposed indicators to analyze potential geographic biases in state-of-the-art visual content creation systems and find that: (1) models have less realism and diversity of generations when prompting for Africa and West Asia than Europe, (2) prompting with geographic information comes at a cost to prompt-consistency and diversity of generated images, and (3) models exhibit more region-level disparities for some objects than others. Perhaps most interestingly, our indicators suggest that progress in image generation quality has come at the cost of real-world geographic representation. Our comprehensive evaluation constitutes a crucial step towards ensuring a positive experience of visual content creation for everyone.
△ Less
Submitted 18 March, 2024; v1 submitted 11 August, 2023;
originally announced August 2023.
-
An automated pipeline for quantitative T2* fetal body MRI and segmentation at low field
Authors:
Kelly Payette,
Alena Uus,
Jordina Aviles Verdera,
Carla Avena Zampieri,
Megan Hall,
Lisa Story,
Maria Deprez,
Mary A. Rutherford,
Joseph V. Hajnal,
Sebastien Ourselin,
Raphael Tomi-Tricot,
Jana Hutter
Abstract:
Fetal Magnetic Resonance Imaging at low field strengths is emerging as an exciting direction in perinatal health. Clinical low field (0.55T) scanners are beneficial for fetal imaging due to their reduced susceptibility-induced artefacts, increased T2* values, and wider bore (widening access for the increasingly obese pregnant population). However, the lack of standard automated image processing to…
▽ More
Fetal Magnetic Resonance Imaging at low field strengths is emerging as an exciting direction in perinatal health. Clinical low field (0.55T) scanners are beneficial for fetal imaging due to their reduced susceptibility-induced artefacts, increased T2* values, and wider bore (widening access for the increasingly obese pregnant population). However, the lack of standard automated image processing tools such as segmentation and reconstruction hampers wider clinical use. In this study, we introduce a semi-automatic pipeline using quantitative MRI for the fetal body at low field strength resulting in fast and detailed quantitative T2* relaxometry analysis of all major fetal body organs. Multi-echo dynamic sequences of the fetal body were acquired and reconstructed into a single high-resolution volume using deformable slice-to-volume reconstruction, generating both structural and quantitative T2* 3D volumes. A neural network trained using a semi-supervised approach was created to automatically segment these fetal body 3D volumes into ten different organs (resulting in dice values > 0.74 for 8 out of 10 organs). The T2* values revealed a strong relationship with GA in the lungs, liver, and kidney parenchyma (R^2>0.5). This pipeline was used successfully for a wide range of GAs (17-40 weeks), and is robust to motion artefacts. Low field fetal MRI can be used to perform advanced MRI analysis, and is a viable option for clinical scanning.
△ Less
Submitted 9 August, 2023;
originally announced August 2023.
-
Rule By Example: Harnessing Logical Rules for Explainable Hate Speech Detection
Authors:
Christopher Clarke,
Matthew Hall,
Gaurav Mittal,
Ye Yu,
Sandra Sajeev,
Jason Mars,
Mei Chen
Abstract:
Classic approaches to content moderation typically apply a rule-based heuristic approach to flag content. While rules are easily customizable and intuitive for humans to interpret, they are inherently fragile and lack the flexibility or robustness needed to moderate the vast amount of undesirable content found online today. Recent advances in deep learning have demonstrated the promise of using hi…
▽ More
Classic approaches to content moderation typically apply a rule-based heuristic approach to flag content. While rules are easily customizable and intuitive for humans to interpret, they are inherently fragile and lack the flexibility or robustness needed to moderate the vast amount of undesirable content found online today. Recent advances in deep learning have demonstrated the promise of using highly effective deep neural models to overcome these challenges. However, despite the improved performance, these data-driven models lack transparency and explainability, often leading to mistrust from everyday users and a lack of adoption by many platforms. In this paper, we present Rule By Example (RBE): a novel exemplar-based contrastive learning approach for learning from logical rules for the task of textual content moderation. RBE is capable of providing rule-grounded predictions, allowing for more explainable and customizable predictions compared to typical deep learning-based approaches. We demonstrate that our approach is capable of learning rich rule embedding representations using only a few data examples. Experimental results on 3 popular hate speech classification datasets show that RBE is able to outperform state-of-the-art deep learning classifiers as well as the use of rules in both supervised and unsupervised settings while providing explainable model predictions via rule-grounding.
△ Less
Submitted 24 July, 2023;
originally announced July 2023.
-
Chronic pain detection from resting-state raw EEG signals using improved feature selection
Authors:
Jean Li,
Dirk De Ridder,
Divya Adhia,
Matthew Hall,
Jeremiah D. Deng
Abstract:
We present an automatic approach that works on resting-state raw EEG data for chronic pain detection. A new feature selection algorithm - modified Sequential Floating Forward Selection (mSFFS) - is proposed. The improved feature selection scheme is rather compact but displays better class separability as indicated by the Bhattacharyya distance measures and better visualization results. It also out…
▽ More
We present an automatic approach that works on resting-state raw EEG data for chronic pain detection. A new feature selection algorithm - modified Sequential Floating Forward Selection (mSFFS) - is proposed. The improved feature selection scheme is rather compact but displays better class separability as indicated by the Bhattacharyya distance measures and better visualization results. It also outperforms selections generated by other benchmark methods, boosting the test accuracy to 97.5% and yielding a test accuracy of 81.4% on an external dataset that contains different types of chronic pain
△ Less
Submitted 27 June, 2023;
originally announced June 2023.
-
VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution
Authors:
Siobhan Mackenzie Hall,
Fernanda Gonçalves Abrantes,
Hanwen Zhu,
Grace Sodunke,
Aleksandar Shtedritski,
Hannah Rose Kirk
Abstract:
We introduce VisoGender, a novel dataset for benchmarking gender bias in vision-language models. We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas, where each image is associated with a caption containing a pronoun relationship of subjects and objects in the scene. VisoGender is balanced by gender representation in profess…
▽ More
We introduce VisoGender, a novel dataset for benchmarking gender bias in vision-language models. We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas, where each image is associated with a caption containing a pronoun relationship of subjects and objects in the scene. VisoGender is balanced by gender representation in professional roles, supporting bias evaluation in two ways: i) resolution bias, where we evaluate the difference between pronoun resolution accuracies for image subjects with gender presentations perceived as masculine versus feminine by human annotators and ii) retrieval bias, where we compare ratios of professionals perceived to have masculine and feminine gender presentations retrieved for a gender-neutral search query. We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes. While the direction and magnitude of gender bias depends on the task and the model being evaluated, captioning models are generally less biased than Vision-Language Encoders. Dataset and code are available at https://github.com/oxai/visogender
△ Less
Submitted 12 December, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
-
Develo** Digital Twins for Earth Systems: Purpose, Requisites, and Benefits
Authors:
Yuhan Rao,
Rob Redmon,
Kirstine Dale,
Sue E. Haupt,
Aaron Hopkinson,
Ann Bostrom,
Sid Boukabara,
Thomas Geenen,
David M. Hall,
Benjamin D. Smith,
Dev Niyogi,
V. Ramaswamy,
Eric A. Kihn
Abstract:
The accelerated change in our planet due to human activities has led to grand societal challenges including health crises, intensified extreme weather events, food security, environmental injustice, etc. Digital twin systems combined with emerging technologies such as artificial intelligence and edge computing provide opportunities to support planning and decision-making to address these challenge…
▽ More
The accelerated change in our planet due to human activities has led to grand societal challenges including health crises, intensified extreme weather events, food security, environmental injustice, etc. Digital twin systems combined with emerging technologies such as artificial intelligence and edge computing provide opportunities to support planning and decision-making to address these challenges. Digital twins for Earth systems (DT4ESs) are defined as the digital representation of the complex integrated Earth system including both natural processes and human activities. They have the potential to enable a diverse range of users to explore what-if scenarios across spatial and temporal scales to improve our understanding, prediction, mitigation, and adaptation to grand societal challenges. The 4th NOAA AI Workshop convened around 100 members who are develo** or interested in participating in the development of DT4ES to discuss a shared community vision and path forward on fostering a future ecosystem of interoperable DT4ES. This paper summarizes the workshop discussions around DT4ES. We first defined the foundational features of a viable digital twins for Earth system that can be used to guide the development of various use cases of DT4ES. Finally, we made practical recommendations for the community on different aspects of collaboration in order to enable a future ecosystem of interoperable DT4ES, including equity-centered use case development, community-driven investigation of interoperability for DT4ES, trust-oriented co-development, and develo** a community of practice.
△ Less
Submitted 19 June, 2023;
originally announced June 2023.
-
First direct measurement constraining the $^{34}$Ar($α$,p)$^{37}$K reaction cross section for mixed hydrogen and helium burning in accreting neutron stars
Authors:
J. Browne,
K. A. Chipps,
K. Schmidt,
H. Schatz,
S. Ahn,
S. D. Pain,
F. Montes,
W. J. Ong,
U. Greife,
J. Allen,
D. W. Bardayan,
J. C. Blackmon,
D. Blankstein,
S. Cha,
K. Y. Chae,
M. Febbraro,
M. R. Hall,
K. L. Jones,
A. Kontos,
Z. Meisel,
P. D. O'Malley,
K. T. Schmitt,
K. Smith,
M. S. Smith,
P. Thompson
, et al. (3 additional authors not shown)
Abstract:
The rate of the final step in the astrophysical $α$p-process, the $^{34}$Ar($α$,\textit{p})$^{37}$K reaction, suffers from large uncertainties due to lack of experimental data, despite having a considerable impact on the observable light curves of x-ray bursts and the composition of the ashes of hydrogen and helium burning on accreting neutron stars. We present the first direct measurement constra…
▽ More
The rate of the final step in the astrophysical $α$p-process, the $^{34}$Ar($α$,\textit{p})$^{37}$K reaction, suffers from large uncertainties due to lack of experimental data, despite having a considerable impact on the observable light curves of x-ray bursts and the composition of the ashes of hydrogen and helium burning on accreting neutron stars. We present the first direct measurement constraining the $^{34}$Ar($α$,p)$^{37}$K reaction cross section, using the Jet Experiments in Nuclear Structure and Astrophysics (JENSA) gas jet target. The combined cross section for the $^{34}$Ar,Cl($α$,p)$^{37}$K,Ar reaction is found to agree well with Hauser-Feshbach predictions. The $^{34}$Ar($α$,2p)$^{36}$Ar cross section, which can be exclusively attributed to the $^{34}$Ar beam component, also agrees to within the typical uncertainties quoted for statistical models. This indicates the applicability of the statistical model for predicting astrophysical ($α$,p) reaction rates in this part of the $α$p process, in contrast to earlier findings from indirect reaction studies indicating orders-of-magnitude discrepancies. This removes a significant uncertainty in models of hydrogen and helium burning on accreting neutron stars.
△ Less
Submitted 30 May, 2023;
originally announced May 2023.
-
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets
Authors:
Brandon Smith,
Miguel Farinha,
Siobhan Mackenzie Hall,
Hannah Rose Kirk,
Aleksandar Shtedritski,
Max Bain
Abstract:
Vision-language models are growing in popularity and public visibility to generate, edit, and caption images at scale; but their outputs can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. Although debiasing methods have been proposed, we argue that these measurements of model bias lack validity due to dataset bias. We demonstrate…
▽ More
Vision-language models are growing in popularity and public visibility to generate, edit, and caption images at scale; but their outputs can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. Although debiasing methods have been proposed, we argue that these measurements of model bias lack validity due to dataset bias. We demonstrate there are spurious correlations in COCO Captions, the most commonly used dataset for evaluating bias, between background context and the gender of people in-situ. This is problematic because commonly-used bias metrics (such as Bias@K) rely on per-gender base rates. To address this issue, we propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets, where only the gender of the subject is edited and the background is fixed. However, existing image editing methods have limitations and sometimes produce low-quality images; so, we introduce a method to automatically filter the generated images based on their similarity to real images. Using our balanced synthetic contrast sets, we benchmark bias in multiple CLIP-based models, demonstrating how metrics are skewed by imbalance in the original COCO images. Our results indicate that the proposed approach improves the validity of the evaluation, ultimately contributing to more realistic understanding of bias in vision-language models.
△ Less
Submitted 24 May, 2023;
originally announced May 2023.
-
Rethinking Multimodal Content Moderation from an Asymmetric Angle with Mixed-modality
Authors:
Jialin Yuan,
Ye Yu,
Gaurav Mittal,
Matthew Hall,
Sandra Sajeev,
Mei Chen
Abstract:
There is a rapidly growing need for multimodal content moderation (CM) as more and more content on social media is multimodal in nature. Existing unimodal CM systems may fail to catch harmful content that crosses modalities (e.g., memes or videos), which may lead to severe consequences. In this paper, we present a novel CM model, Asymmetric Mixed-Modal Moderation (AM3), to target multimodal and un…
▽ More
There is a rapidly growing need for multimodal content moderation (CM) as more and more content on social media is multimodal in nature. Existing unimodal CM systems may fail to catch harmful content that crosses modalities (e.g., memes or videos), which may lead to severe consequences. In this paper, we present a novel CM model, Asymmetric Mixed-Modal Moderation (AM3), to target multimodal and unimodal CM tasks. Specifically, to address the asymmetry in semantics between vision and language, AM3 has a novel asymmetric fusion architecture that is designed to not only fuse the common knowledge in both modalities but also to exploit the unique information in each modality. Unlike previous works that focus on representing the two modalities into a similar feature space while overlooking the intrinsic difference between the information conveyed in multimodality and in unimodality (asymmetry in modalities), we propose a novel cross-modality contrastive loss to learn the unique knowledge that only appears in multimodality. This is critical as some harmful intent may only be conveyed through the intersection of both modalities. With extensive experiments, we show that AM3 outperforms all existing state-of-the-art methods on both multimodal and unimodal CM benchmarks.
△ Less
Submitted 13 December, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
Asymmetry and tighter uncertainty relations for Rényi entropies via quantum-classical decompositions of resource measures
Authors:
Michael J. W. Hall
Abstract:
It is known that the variance and entropy of quantum observables decompose into intrinsically quantum and classical contributions. Here a general method of constructing quantum-classical decompositions of resources such as uncertainty is discussed, with the quantum contribution specified by a measure of the noncommutativity of a given set of operators relative to the quantum state, and the classic…
▽ More
It is known that the variance and entropy of quantum observables decompose into intrinsically quantum and classical contributions. Here a general method of constructing quantum-classical decompositions of resources such as uncertainty is discussed, with the quantum contribution specified by a measure of the noncommutativity of a given set of operators relative to the quantum state, and the classical contribution generated by the mixedness of the state. Suitable measures of noncommutativity or 'quantumness' include quantum Fisher information, and the asymmetry of a given set, group or algebra of operators, and are generalised to nonprojective observables and quantum channels. Strong entropic uncertainty relations and lower bounds for Rényi entropies are obtained, valid for arbitrary discrete observables, that take the mixedness of the state into account via a classical contribution to the lower bound. These relations can also be interpreted without reference to quantum-classical decompositions, as tradeoff relations that bound the asymmetry of one observable in terms of the entropy of another.
△ Less
Submitted 26 May, 2023; v1 submitted 12 April, 2023;
originally announced April 2023.
-
Pinpointing Why Object Recognition Performance Degrades Across Income Levels and Geographies
Authors:
Laura Gustafson,
Megan Richards,
Melissa Hall,
Caner Hazirbas,
Diane Bouchacourt,
Mark Ibrahim
Abstract:
Despite impressive advances in object-recognition, deep learning systems' performance degrades significantly across geographies and lower income levels raising pressing concerns of inequity. Addressing such performance gaps remains a challenge, as little is understood about why performance degrades across incomes or geographies. We take a step in this direction by annotating images from Dollar Str…
▽ More
Despite impressive advances in object-recognition, deep learning systems' performance degrades significantly across geographies and lower income levels raising pressing concerns of inequity. Addressing such performance gaps remains a challenge, as little is understood about why performance degrades across incomes or geographies. We take a step in this direction by annotating images from Dollar Street, a popular benchmark of geographically and economically diverse images, labeling each image with factors such as color, shape, and background. These annotations unlock a new granular view into how objects differ across incomes and regions. We then use these object differences to pinpoint model vulnerabilities across incomes and regions. We study a range of modern vision models, finding that performance disparities are most associated with differences in texture, occlusion, and images with darker lighting. We illustrate how insights from our factor labels can surface mitigations to improve models' performance disparities. As an example, we show that mitigating a model's vulnerability to texture can improve performance on the lower income level. We release all the factor annotations along with an interactive dashboard to facilitate research into more equitable vision systems.
△ Less
Submitted 11 April, 2023;
originally announced April 2023.
-
ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales
Authors:
Xingfu Wu,
Prasanna Balaprakash,
Michael Kruse,
Jaehoon Koo,
Brice Videau,
Paul Hovland,
Valerie Taylor,
Brad Geltz,
Siddhartha Jana,
Mary Hall
Abstract:
As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application r…
▽ More
As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime and power/energy for energy efficient application execution, then use this framework to autotune four ECP proxy applications -- XSBench, AMG, SWFFT, and SW4lite. Our approach uses Bayesian optimization with a Random Forest surrogate model to effectively search parameter spaces with up to 6 million different configurations on two large-scale production systems, Theta at Argonne National Laboratory and Summit at Oak Ridge National Laboratory. The experimental results show that our autotuning framework at large scales has low overhead and achieves good scalability. Using the proposed autotuning framework to identify the best configurations, we achieve up to 91.59% performance improvement, up to 21.2% energy savings, and up to 37.84% EDP improvement on up to 4,096 nodes.
△ Less
Submitted 28 March, 2023;
originally announced March 2023.
-
Towards Reliable Assessments of Demographic Disparities in Multi-Label Image Classifiers
Authors:
Melissa Hall,
Bobbie Chern,
Laura Gustafson,
Denisse Ventura,
Harshad Kulkarni,
Candace Ross,
Nicolas Usunier
Abstract:
Disaggregated performance metrics across demographic groups are a hallmark of fairness assessments in computer vision. These metrics successfully incentivized performance improvements on person-centric tasks such as face analysis and are used to understand risks of modern models. However, there is a lack of discussion on the vulnerabilities of these measurements for more complex computer vision ta…
▽ More
Disaggregated performance metrics across demographic groups are a hallmark of fairness assessments in computer vision. These metrics successfully incentivized performance improvements on person-centric tasks such as face analysis and are used to understand risks of modern models. However, there is a lack of discussion on the vulnerabilities of these measurements for more complex computer vision tasks. In this paper, we consider multi-label image classification and, specifically, object categorization tasks. First, we highlight design choices and trade-offs for measurement that involve more nuance than discussed in prior computer vision literature. These challenges are related to the necessary scale of data, definition of groups for images, choice of metric, and dataset imbalances. Next, through two case studies using modern vision models, we demonstrate that naive implementations of these assessments are brittle. We identify several design choices that look merely like implementation details but significantly impact the conclusions of assessments, both in terms of magnitude and direction (on which group the classifiers work best) of disparities. Based on ablation studies, we propose some recommendations to increase the reliability of these assessments. Finally, through a qualitative analysis we find that concepts with large disparities tend to have varying definitions and representations between groups, with inconsistencies across datasets and annotators. While this result suggests avenues for mitigation through more consistent data collection, it also highlights that ambiguous label definitions remain a challenge when performing model assessments. Vision models are expanding and becoming more ubiquitous; it is even more important that our disparity assessments accurately reflect the true performance of models.
△ Less
Submitted 16 February, 2023;
originally announced February 2023.
-
With Shared Microexponents, A Little Shifting Goes a Long Way
Authors:
Bita Rouhani,
Ritchie Zhao,
Venmugil Elango,
Rasoul Shafipour,
Mathew Hall,
Maral Mesmakhosroshahi,
Ankit More,
Levi Melnick,
Maximilian Golub,
Girish Varatkar,
Lei Shao,
Gaurav Kolhe,
Dimitry Melts,
Jasmine Klar,
Renee L'Heureux,
Matt Perry,
Doug Burger,
Eric Chung,
Zhaoxia Deng,
Sam Naghshineh,
Jongsoo Park,
Maxim Naumov
Abstract:
This paper introduces Block Data Representations (BDR), a framework for exploring and evaluating a wide spectrum of narrow-precision formats for deep learning. It enables comparison of popular quantization standards, and through BDR, new formats based on shared microexponents (MX) are identified, which outperform other state-of-the-art quantization approaches, including narrow-precision floating-p…
▽ More
This paper introduces Block Data Representations (BDR), a framework for exploring and evaluating a wide spectrum of narrow-precision formats for deep learning. It enables comparison of popular quantization standards, and through BDR, new formats based on shared microexponents (MX) are identified, which outperform other state-of-the-art quantization approaches, including narrow-precision floating-point and block floating-point. MX utilizes multiple levels of quantization scaling with ultra-fine scaling factors based on shared microexponents in the hardware. The effectiveness of MX is demonstrated on real-world models including large-scale generative pretraining and inferencing, and production-scale recommendation systems.
△ Less
Submitted 12 April, 2023; v1 submitted 15 February, 2023;
originally announced February 2023.
-
Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based Disparities
Authors:
Melissa Hall,
Laura Gustafson,
Aaron Adcock,
Ishan Misra,
Candace Ross
Abstract:
We explore the extent to which zero-shot vision-language models exhibit gender bias for different vision tasks. Vision models traditionally required task-specific labels for representing concepts, as well as finetuning; zero-shot models like CLIP instead perform tasks with an open-vocabulary, meaning they do not need a fixed set of labels, by using text embeddings to represent concepts. With these…
▽ More
We explore the extent to which zero-shot vision-language models exhibit gender bias for different vision tasks. Vision models traditionally required task-specific labels for representing concepts, as well as finetuning; zero-shot models like CLIP instead perform tasks with an open-vocabulary, meaning they do not need a fixed set of labels, by using text embeddings to represent concepts. With these capabilities in mind, we ask: Do vision-language models exhibit gender bias when performing zero-shot image classification, object detection and semantic segmentation? We evaluate different vision-language models with multiple datasets across a set of concepts and find (i) all models evaluated show distinct performance differences based on the perceived gender of the person co-occurring with a given concept in the image and that aggregating analyses over all concepts can mask these concerns; (ii) model calibration (i.e. the relationship between accuracy and confidence) also differs distinctly by perceived gender, even when evaluating on similar representations of concepts; and (iii) these observed disparities align with existing gender biases in word embeddings from language models. These findings suggest that, while language greatly expands the capability of vision tasks, it can also contribute to social biases in zero-shot vision settings. Furthermore, biases can further propagate when foundational models like CLIP are used by other models to enable zero-shot capabilities.
△ Less
Submitted 26 January, 2023;
originally announced January 2023.
-
Coarse-grained modeling of polymers with end-on and side-on liquid crystal moieties: effect of architecture
Authors:
Diego Becerra,
Pranav R. Jois,
Lisa M. Hall
Abstract:
Mesogens, which are typically stiff rodlike or disklike molecules, are able to self-organize into liquid crystal (LC) phases in a certain temperature range. Such mesogens, or LC groups, can be attached to polymer chains in various configurations including within the backbone (main-chain LC polymers) or at the ends of side-chains attached to the backbone in an end-on or side-on configuration (side-…
▽ More
Mesogens, which are typically stiff rodlike or disklike molecules, are able to self-organize into liquid crystal (LC) phases in a certain temperature range. Such mesogens, or LC groups, can be attached to polymer chains in various configurations including within the backbone (main-chain LC polymers) or at the ends of side-chains attached to the backbone in an end-on or side-on configuration (side-chain LC polymers or SCLCPs), which can display synergistic properties arising from both their LC and polymeric character. At lower temperatures, chain conformations may be significantly altered due to the mesoscale LC ordering, thus, when heating from the LC ordered state through the LC to isotropic phase transition, the chains return from a more stretched to a more random coil conformation. This can cause macroscopic shape changes, which depend significantly on the type of LC attachment and other architectural properties of the polymer. Here, to study the structure-property relationships for SCLCPs with a range of different architectures, we develop a coarse-grained model that includes torsional potentials along with LC interactions of a Gay--Berne form. We create systems of different side chain lengths, chain stiffnesses, and LC attachment types, and track their structural properties as a function of temperature. Our modeled systems indeed form a variety of well-organized mesophase structures at low temperatures, and we predict higher LC to isotropic transition temperatures for the end-on side-chain systems than for analogous side-on side-chain systems. Understanding these phase transitions and their dependence on polymer architecture can be useful in designing materials with reversible and controllable deformations.
△ Less
Submitted 19 December, 2022;
originally announced December 2022.
-
Extending the Use of Information Theory in Segregation Analyses to Construct Comprehensive Models of Segregation
Authors:
Boris Barron,
Yunus A. Kinkhabwala,
Chriss Hess,
Matthew Hall,
Itai Cohen,
Tomás A. Arias
Abstract:
The traditional approach to the quantitative study of segregation is to employ indices that are selected by ``desirable properties''. Here, we detail how information theory underpins entropy-based indices and demonstrate how desirable properties can be used to systematically construct models of segregation. The resulting models capture all indices which satisfy the selected properties and provide…
▽ More
The traditional approach to the quantitative study of segregation is to employ indices that are selected by ``desirable properties''. Here, we detail how information theory underpins entropy-based indices and demonstrate how desirable properties can be used to systematically construct models of segregation. The resulting models capture all indices which satisfy the selected properties and provide new insights, such as how the entropy index presumes a particular form of intergroup interactions and how the dissimilarity index depends on the regional composition. Additionally, our approach reveals that functions, rather than indices, tend to be necessary mathematical tools for a comprehensive quantification of segregation. We then proceed with exploratory considerations of two-group residential segregation, finding striking similarities in major U.S. cities, subtle segregation patterns that correlate with minority group diversity, and substantive reductions in segregation that may be overlooked with traditional approaches. Finally, we explore the promise of our approach for segregation forecasting.
△ Less
Submitted 13 December, 2022;
originally announced December 2022.
-
Better Heisenberg limits, coherence bounds, and energy-time tradeoffs via quantum Rényi information
Authors:
Michael J. W. Hall
Abstract:
An uncertainty relation for the Rényi entropies of conjugate quantum observables is used to obtain a strong Heisenberg limit of the form ${\rm RMSE} \geq f(α)/(\langle N\rangle+\frac12)$, bounding the root mean square error of any estimate of a random optical phase shift in terms of average photon number, where $f(α)$ is maximised for non-Shannon entropies. Related simple yet strong uncertainty re…
▽ More
An uncertainty relation for the Rényi entropies of conjugate quantum observables is used to obtain a strong Heisenberg limit of the form ${\rm RMSE} \geq f(α)/(\langle N\rangle+\frac12)$, bounding the root mean square error of any estimate of a random optical phase shift in terms of average photon number, where $f(α)$ is maximised for non-Shannon entropies. Related simple yet strong uncertainty relations linking phase uncertainty to the photon number distribution, such as $ΔΦ\geq \max_n p_n$, are also obtained. These results are significantly strengthened via upper and lower bounds on the Rényi mutual information of quantum communication channels, related to asymmetry and convolution, and applied to the estimation (with prior information) of unitary shift parameters such as rotation angle and time, and to obtain strong bounds on measures of coherence. Sharper Rényi entropic uncertainty relations are also obtained, including time-energy uncertainty relations for Hamiltonians with discrete spectra. In the latter case almost-periodic Rényi entropies are introduced for nonperiodic systems.
△ Less
Submitted 17 November, 2022; v1 submitted 26 October, 2022;
originally announced October 2022.
-
Building blocks of non-Euclidean ribbons: Size-controlled self-assembly via discrete, frustrated particles
Authors:
Douglas M. Hall,
Mark J. Stevens,
Gregory M. Grason
Abstract:
Geometric frustration offers a pathway to soft matter self-assembly with controllable finite sizes. While the understanding of frustration in soft matter assembly derives almost exclusively from continuum elastic descriptions, a current challenge is to understand the connection between microscopic physical properties of misfitting ``building blocks" and emergent assembly behavior at mesoscale. We…
▽ More
Geometric frustration offers a pathway to soft matter self-assembly with controllable finite sizes. While the understanding of frustration in soft matter assembly derives almost exclusively from continuum elastic descriptions, a current challenge is to understand the connection between microscopic physical properties of misfitting ``building blocks" and emergent assembly behavior at mesoscale. We present and analyze a particle-based description of what is arguably the best studied example for frustrated soft matter assembly, negative-curvature ribbon assembly, observed in both assemblies of chiral surfactants and shape-frustrated nanoparticles. Based on our particle model, known as {\it saddle wedge monomers}, we numerically test the connection between microscopic shape and interactions of the misfitting subunits and the emergent behavior at the supra-particle scale, specifically focusing on the propagation and relaxation of inter-particle strains, the emergent role of extrinsic shape on frustrated ribbons and the equilibrium regime of finite width selection. Beyond the intuitive role of shape misfit, we show that self-limitation is critically dependent on the finite range of cohesive interactions, with larger size finite assemblies requiring increasing short-range interparticle forces. Additionally, we demonstrate that non-linearities arising from discrete particle interactions alter self-limiting behavior due to both strain-softening in shape-flattened assembly and partial yielding of highly strained bonds, which in turn may give rise to states of hierarchical, multidomain assembly. Tracing the regimes of frustration-limited assembly to the specific microscopic features of misfitting particle shapes and interactions provides necessary guidance for translating the theory of size-programmable assembly into design of intentionally-frustrated colloidal particles.
△ Less
Submitted 15 October, 2022;
originally announced October 2022.
-
Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-Iteration
Authors:
Tuowen Zhao,
Tobi Popoola,
Mary Hall,
Catherine Olschanowsky,
Michelle Mills Strout
Abstract:
This paper presents a code generator for sparse tensor contraction computations. It leverages a mathematical representation of loop nest computations in the sparse polyhedral framework (SPF), which extends the polyhedral model to support non-affine computations, such as arise in sparse tensors. SPF is extended to perform layout specification, optimization, and code generation of sparse tensor code…
▽ More
This paper presents a code generator for sparse tensor contraction computations. It leverages a mathematical representation of loop nest computations in the sparse polyhedral framework (SPF), which extends the polyhedral model to support non-affine computations, such as arise in sparse tensors. SPF is extended to perform layout specification, optimization, and code generation of sparse tensor code: 1) we develop a polyhedral layout specification that decouples iteration spaces for layout and computation; and, 2) we develop efficient co-iteration of sparse tensors by combining polyhedra scanning over the layout of one sparse tensor with the synthesis of code to find corresponding elements in other tensors through an SMT solver.
We compare the generated code with that produced by a state-of-the-art tensor compiler, TACO. We achieve on average 1.63$\times$ faster parallel performance than TACO on sparse-sparse co-iteration and describe how to improve that to 2.72$\times$ average speedup by switching the find algorithms. We also demonstrate that decoupling iteration spaces of layout and computation enables additional layout and computation combinations to be supported.
△ Less
Submitted 24 August, 2022;
originally announced August 2022.
-
Applications of Blockchain for the Governance of Integrated Project Delivery: A Crypto Commons Approach
Authors:
Jens J. Hunhevicz,
Pierre-Antoine Brasey,
Marcella M. M. Bonanomi,
Daniel M. Hall,
Martin Fischer
Abstract:
This paper outlines why and how blockchain can digitally support and evolve the governance of collaborative project deliveries, such as integrated project deliveries (IPDs), to provide the foundation for novel and disruptive forms of organizational collaboration in the construction industry. Previous work has conceptualized IPDs as a common pool resource (CPR) scenario, where shared resources are…
▽ More
This paper outlines why and how blockchain can digitally support and evolve the governance of collaborative project deliveries, such as integrated project deliveries (IPDs), to provide the foundation for novel and disruptive forms of organizational collaboration in the construction industry. Previous work has conceptualized IPDs as a common pool resource (CPR) scenario, where shared resources are collectively governed. Through the use of blockchain and smart contracts for trustworthy peer-to-peer transactions and execution logic, Ostrom's design principles can be digitally encoded to scale CPR scenarios. Building on the identified connections, the paper 1) synthesizes fourteen blockchain-based mechanisms to govern CPRs, 2) identifies twenty-two applications of these mechanisms to govern IPDs, and 3) introduces a conceptualization of the above relationships towards a holistic understanding of collaborative project deliveries on the crypto commons for novel collective organization of construction project delivery between both humans and machines.
△ Less
Submitted 14 July, 2022;
originally announced July 2022.
-
Quenching of Single-Particle Strength in A=15 Nuclei
Authors:
B. P. Kay,
T. L. Tang,
I. A. Tolstukhin,
G. B. Roderick,
A. J. Mitchell,
Y. Ayyad,
S. A. Bennett,
J. Chen,
K. A. Chipps,
H. L. Crawford,
S. J. Freeman,
K. Garrett,
M. D. Gott,
M. R. Hall,
C. R. Hoffman,
H. Jayatissa,
A. O. Macchiavelli,
P. T. MacGregor,
D. K. Sharp,
G. L. Wilson
Abstract:
Absolute cross sections for the addition of $s$- and $d$-wave neutrons to $^{14}$C and $^{14}$N have been determined simultaneously via the ($d$,$p$) reaction at 10 MeV/u. The difference between the neutron and proton separation energies, $ΔS$, is around $-20$ MeV for the $^{14}$C$+$$n$ system and $+8$ MeV for $^{14}$N$+$$n$. The population of the $1s_{1/2}$ and $0d_{5/2}$ orbitals for both system…
▽ More
Absolute cross sections for the addition of $s$- and $d$-wave neutrons to $^{14}$C and $^{14}$N have been determined simultaneously via the ($d$,$p$) reaction at 10 MeV/u. The difference between the neutron and proton separation energies, $ΔS$, is around $-20$ MeV for the $^{14}$C$+$$n$ system and $+8$ MeV for $^{14}$N$+$$n$. The population of the $1s_{1/2}$ and $0d_{5/2}$ orbitals for both systems is reduced by a factor of approximately 0.5 compared to the independent single-particle model, or about 0.6 when compared to the shell model. This finding strongly contrasts with results deduced from intermediate-energy knockout reactions between similar nuclei on targets of $^{9}$Be and $^{12}$C. The simultaneous technique used removes many systematic uncertainties.
△ Less
Submitted 5 July, 2022;
originally announced July 2022.
-
Simple precession calculation for Mercury: a linearization approach
Authors:
Michael J. W. Hall
Abstract:
The additional precession of Mercury due to general relativity can be calculated by a method that is no more difficult than solving for the Newtonian orbit. The method relies on linearizing the relativistic orbit equation, is simpler than standard textbook methods, and is closely related to Newton's theorem on revolving orbits. The main result is accurate to all orders in $\tfrac{1}{c}$ for near-c…
▽ More
The additional precession of Mercury due to general relativity can be calculated by a method that is no more difficult than solving for the Newtonian orbit. The method relies on linearizing the relativistic orbit equation, is simpler than standard textbook methods, and is closely related to Newton's theorem on revolving orbits. The main result is accurate to all orders in $\tfrac{1}{c}$ for near-circular orbits.
△ Less
Submitted 23 June, 2022;
originally announced June 2022.
-
A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative
Authors:
Elena Casiraghi,
Rachel Wong,
Margaret Hall,
Ben Coleman,
Marco Notaro,
Michael D. Evans,
Jena S. Tronieri,
Hannah Blau,
Bryan Laraway,
Tiffany J. Callahan,
Lauren E. Chan,
Carolyn T. Bramante,
John B. Buse,
Richard A. Moffitt,
Til Sturmer,
Steven G. Johnson,
Yu Raymond Shao,
Justin Reese,
Peter N. Robinson,
Alberto Paccanaro,
Giorgio Valentini,
Jared D. Huling,
Kenneth Wilkins,
:,
Tell Bennet
, et al. (12 additional authors not shown)
Abstract:
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been propose…
▽ More
Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been proposed to attempt to recover the missing information. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithms works best in a given scenario. Furthermore, the selection of each algorithm parameters and data-related modelling choices are also both crucial and challenging. In this paper, we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type-2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case Inverse Probability Weighted models. The experiments presented here show that our approach could effectively highlight the most valid and performant missing-data handling strategy for our case study. Moreover, our methodology allowed us to gain an understanding of the behavior of the different models and of how it changed as we modified their parameters. Our method is general and can be applied to different research fields and on datasets containing heterogeneous types.
△ Less
Submitted 25 September, 2022; v1 submitted 13 June, 2022;
originally announced June 2022.
-
"I'm sorry to hear that": Finding New Biases in Language Models with a Holistic Descriptor Dataset
Authors:
Eric Michael Smith,
Melissa Hall,
Melanie Kambadur,
Eleonora Presani,
Adina Williams
Abstract:
As language models grow in popularity, it becomes increasingly important to clearly measure all possible markers of demographic identity in order to avoid perpetuating existing societal harms. Many datasets for measuring bias currently exist, but they are restricted in their coverage of demographic axes and are commonly used with preset bias tests that presuppose which types of biases models can e…
▽ More
As language models grow in popularity, it becomes increasingly important to clearly measure all possible markers of demographic identity in order to avoid perpetuating existing societal harms. Many datasets for measuring bias currently exist, but they are restricted in their coverage of demographic axes and are commonly used with preset bias tests that presuppose which types of biases models can exhibit. In this work, we present a new, more inclusive bias measurement dataset, HolisticBias, which includes nearly 600 descriptor terms across 13 different demographic axes. HolisticBias was assembled in a participatory process including experts and community members with lived experience of these terms. These descriptors combine with a set of bias measurement templates to produce over 450,000 unique sentence prompts, which we use to explore, identify, and reduce novel forms of bias in several generative models. We demonstrate that HolisticBias is effective at measuring previously undetectable biases in token likelihoods from language models, as well as in an offensiveness classifier. We will invite additions and amendments to the dataset, which we hope will serve as a basis for more easy-to-use and standardized methods for evaluating bias in NLP models.
△ Less
Submitted 27 October, 2022; v1 submitted 18 May, 2022;
originally announced May 2022.
-
Constraining the $^{30}$P($p,γ)^{31}$S reaction rate in ONe novae via the weak, low-energy, $β$-delayed proton decay of $^{31}$Cl
Authors:
T. Budner,
M. Friedman,
C. Wrede,
B. A. Brown,
J. José,
D. Pérez-Loureiro,
L. J. Sun,
J. Surbrook,
Y. Ayyad,
D. W. Bardayan,
K. Chae,
A. A. Chen,
K. A. Chipps,
M. Cortesi,
B. Glassman,
M. R. Hall,
M. Janasik,
J. Liang,
P. O'Malley,
E. Pollacco,
A. Psaltis,
J. Stomps,
T. Wheeler
Abstract:
The $^{30}$P$(p,γ)^{31}$S reaction plays an important role in understanding nucleosynthesis of $A\geq 30$ nuclides in oxygen-neon novae. The Gaseous Detector with Germanium Tagging was used to measure $^{31}$Cl $β$-delayed proton decay through the key $J^π=3/2^{+}$, 260-keV resonance. The intensity $I^{260}_{βp} = 8.3^{+1.2}_{-0.9} \times 10^{-6}$ represents the weakest $β$-delayed, charged-partic…
▽ More
The $^{30}$P$(p,γ)^{31}$S reaction plays an important role in understanding nucleosynthesis of $A\geq 30$ nuclides in oxygen-neon novae. The Gaseous Detector with Germanium Tagging was used to measure $^{31}$Cl $β$-delayed proton decay through the key $J^π=3/2^{+}$, 260-keV resonance. The intensity $I^{260}_{βp} = 8.3^{+1.2}_{-0.9} \times 10^{-6}$ represents the weakest $β$-delayed, charged-particle emission ever measured below 400 keV, resulting in a proton branching ratio of $Γ_p / Γ= 2.5^{+0.4}_{-0.3} \times 10^{-4}$. By combining this measurement with shell-model calculations for $Γ_γ$ and past work on other resonances, the total $^{30}$P$(p,γ)^{31}$S rate has been determined with reduced uncertainty. The new rate has been used in hydrodynamic simulations to model the composition of nova ejecta, leading to a concrete prediction of $^{30}$Si/$^{28}$Si excesses in presolar nova grains and the calibration of nuclear thermometers.
△ Less
Submitted 11 April, 2022;
originally announced April 2022.
-
Stress accumulation versus shape flattening in frustrated, warped-jigsaw particle assemblies
Authors:
Isaac R. Spivack,
Douglas M. Hall,
Gregory M. Grason
Abstract:
Geometrically frustrated assembly has emerged as an attractive paradigm for understanding and engineering assemblies with self-limiting, finite equilibrium dimensions. We propose and study a novel 2D particle based on a so-called "warped jigsaw" (WJ) shape design: directional bonds in a tapered particle favor curvature along multi-particle rows that frustrate 2D lattice order. We investigate how l…
▽ More
Geometrically frustrated assembly has emerged as an attractive paradigm for understanding and engineering assemblies with self-limiting, finite equilibrium dimensions. We propose and study a novel 2D particle based on a so-called "warped jigsaw" (WJ) shape design: directional bonds in a tapered particle favor curvature along multi-particle rows that frustrate 2D lattice order. We investigate how large-scale intra-assembly stress gradients emerge from the microscopic properties of the particles using a combination of numerical simulation and continuum elasticity. WJ particles can favor anisotropic ribbon assemblies, whose lateral width may be self-limiting depending on the relative strength of cohesive to elastic forces in the assembly, which we show to be controlled by the range of interactions and degree of shape misfit. The upper limits of self-limited size are controlled by the crossover between two elastic modes in assembly: the accumulation of shear with increasing width at small widths giving way to unbending of preferred row curvature, permitting assembly to grow to unlimited sizes. We show that the stiffness controlling distinct elastic modes is governed by combination and placement of repulsive and attractive binding regions, providing a means to extend the range of accumulating stress to sizes that are far in excess of the single particle size, which we corroborate via numerical studies of discrete particles of variable interactions. Lastly, we relate the ground-state energetics of the model to lower and upper limits on equilibrium assembly size control set by the fluctuations of width along the ribbon boundary.
△ Less
Submitted 8 April, 2022;
originally announced April 2022.
-
Understanding out-of-distribution accuracies through quantifying difficulty of test samples
Authors:
Berfin Simsek,
Melissa Hall,
Levent Sagun
Abstract:
Existing works show that although modern neural networks achieve remarkable generalization performance on the in-distribution (ID) dataset, the accuracy drops significantly on the out-of-distribution (OOD) datasets \cite{recht2018cifar, recht2019imagenet}. To understand why a variety of models consistently make more mistakes in the OOD datasets, we propose a new metric to quantify the difficulty o…
▽ More
Existing works show that although modern neural networks achieve remarkable generalization performance on the in-distribution (ID) dataset, the accuracy drops significantly on the out-of-distribution (OOD) datasets \cite{recht2018cifar, recht2019imagenet}. To understand why a variety of models consistently make more mistakes in the OOD datasets, we propose a new metric to quantify the difficulty of the test images (either ID or OOD) that depends on the interaction of the training dataset and the model. In particular, we introduce \textit{confusion score} as a label-free measure of image difficulty which quantifies the amount of disagreement on a given test image based on the class conditional probabilities estimated by an ensemble of trained models. Using the confusion score, we investigate CIFAR-10 and its OOD derivatives. Next, by partitioning test and OOD datasets via their confusion scores, we predict the relationship between ID and OOD accuracies for various architectures. This allows us to obtain an estimator of the OOD accuracy of a given model only using ID test labels. Our observations indicate that the biggest contribution to the accuracy drop comes from images with high confusion scores. Upon further inspection, we report on the nature of the misclassified images grouped by their confusion scores: \textit{(i)} images with high confusion scores contain \textit{weak spurious correlations} that appear in multiple classes in the training data and lack clear \textit{class-specific features}, and \textit{(ii)} images with low confusion scores exhibit spurious correlations that belong to another class, namely \textit{class-specific spurious correlations}.
△ Less
Submitted 28 March, 2022;
originally announced March 2022.
-
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
Authors:
Hugo Berg,
Siobhan Mackenzie Hall,
Yash Bhalgat,
Wonsuk Yang,
Hannah Rose Kirk,
Aleksandar Shtedritski,
Max Bain
Abstract:
Vision-language models can encode societal biases and stereotypes, but there are challenges to measuring and mitigating these multimodal harms due to lacking measurement robustness and feature degradation. To address these challenges, we investigate bias measures and apply ranking metrics for image-text representations. We then investigate debiasing methods and show that prepending learned embeddi…
▽ More
Vision-language models can encode societal biases and stereotypes, but there are challenges to measuring and mitigating these multimodal harms due to lacking measurement robustness and feature degradation. To address these challenges, we investigate bias measures and apply ranking metrics for image-text representations. We then investigate debiasing methods and show that prepending learned embeddings to text queries that are jointly trained with adversarial debiasing and a contrastive loss reduces various bias measures with minimal degradation to the image-text representation.
△ Less
Submitted 25 October, 2022; v1 submitted 22 March, 2022;
originally announced March 2022.
-
Experimentally ruling out joint reality based on operational completeness
Authors:
Qiuxin Zhang,
Yu Xiang,
Xiaoting Gao,
Chenhao Zhu,
Yuxin Wang,
Liangyu Ding,
Xiang Zhang,
Shuaning Zhang,
Shuming Cheng,
Michael J. W. Hall,
Qiongyi He,
Wei Zhang
Abstract:
Whether the observables of a physical system admit real values is of fundamental importance to a deep understanding of nature. In this work, we report a device-independent experiment to confirm that the joint reality of two observables on a single two-level system is incompatible with the assumption of operational completeness, which is strictly weaker than that of preparation noncontextuality. We…
▽ More
Whether the observables of a physical system admit real values is of fundamental importance to a deep understanding of nature. In this work, we report a device-independent experiment to confirm that the joint reality of two observables on a single two-level system is incompatible with the assumption of operational completeness, which is strictly weaker than that of preparation noncontextuality. We implement two observables on a trapped $^{171}{\rm Yb}^{+}$ ion to test this incompatibility via violation of certain inequalities derived from both linear and nonlinear criteria. Moreover, by introducing a highly controllable dephasing channel, we show that the nonlinear criterion is more robust against noise. Our results push the fundamental limit to delineate the quantum-classical boundary and pave the way for exploring relevant problems in other scenarios.
△ Less
Submitted 3 February, 2024; v1 submitted 10 March, 2022;
originally announced March 2022.