GELDA: A generative language annotation framework to reveal visual biases in datasets

Authors: Krish Kabra, Kathleen M. Lewis, Guha Balakrishnan

Abstract: Bias analysis is a crucial step in the process of creating fair datasets for training and evaluating computer vision models. The bottleneck in dataset analysis is annotation, which typically requires: (1) specifying a list of attributes relevant to the dataset domain, and (2) classifying each image-attribute pair. While the second step has made rapid progress in automation, the first has remained human-centered, requiring an experimenter to compile lists of in-domain attributes. However, an experimenter may have limited foresight leading to annotation "blind spots," which in turn can lead to flawed downstream dataset analyses. To combat this, we propose GELDA, a nearly automatic framework that leverages large generative language models (LLMs) to propose and label various attributes for a domain. GELDA takes a user-defined domain caption (e.g., "a photo of a bird," "a photo of a living room") and uses an LLM to hierarchically generate attributes. In addition, GELDA uses the LLM to decide which of a set of vision-language models (VLMs) to use to classify each attribute in images. Results on real datasets show that GELDA can generate accurate and diverse visual attribute suggestions, and uncover biases such as confounding between class labels and background features. Results on synthetic datasets demonstrate that GELDA can be used to evaluate the biases of text-to-image diffusion models and generative adversarial networks. Overall, we show that while GELDA is not accurate enough to replace human annotators, it can serve as a complementary tool to help humans analyze datasets in a cheap, low-effort, and flexible manner.

Submitted 29 November, 2023; originally announced November 2023.

Comments: 21 pages, 15 figures, 9 tables

arXiv:2307.11315 [pdf, other]

GIST: Generating Image-Specific Text for Fine-grained Object Classification

Authors: Kathleen M. Lewis, Emily Mu, Adrian V. Dalca, John Guttag

Abstract: Recent vision-language models outperform vision-only models on many image classification tasks. However, because of the absence of paired text/image descriptions, it remains difficult to fine-tune these models for fine-grained image classification. In this work, we propose a method, GIST, for generating image-specific fine-grained text descriptions from image-only datasets, and show that these text descriptions can be used to improve classification. Key parts of our method include 1. prompting a pretrained large language model with domain-specific prompts to generate diverse fine-grained text descriptions for each class and 2. using a pretrained vision-language model to match each image to label-preserving text descriptions that capture relevant visual features in the image. We demonstrate the utility of GIST by fine-tuning vision-language models on the image-and-generated-text pairs to learn an aligned vision-language representation space for improved classification. We evaluate our learned representation space in full-shot and few-shot scenarios across four diverse fine-grained classification datasets, each from a different domain. Our method achieves an average improvement of $4.1\%$ in accuracy over CLIP linear probes and an average of $1.1\%$ improvement in accuracy over the previous state-of-the-art image-text classification method on the full-shot datasets. Our method achieves similar improvements across few-shot regimes. Code is available at https://github.com/emu1729/GIST.

Submitted 4 August, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: The first two authors contributed equally to this work and are listed in alphabetical order

arXiv:2211.02892 [pdf, other]

SizeGAN: Improving Size Representation in Clothing Catalogs

Authors: Kathleen M. Lewis, John Guttag

Abstract: Online clothing catalogs lack diversity in body shape and garment size. Brands commonly display their garments on models of one or two sizes, rarely including plus-size models. To our knowledge, our paper presents the first method for generating images of garments and models in a new target size to tackle the size under-representation problem. Our primary technical contribution is a conditional generative adversarial network that learns deformation fields at multiple resolutions to realistically change the size of models and garments. Results from our two user studies show SizeGAN outperforms alternative methods along three dimensions -- realism, garment faithfulness, and size -- which are all important for real world use.

Submitted 26 June, 2023; v1 submitted 5 November, 2022; originally announced November 2022.

arXiv:2102.08540 [pdf, other]

Intuitively Assessing ML Model Reliability through Example-Based Explanations and Editing Model Inputs

Authors: Harini Suresh, Kathleen M. Lewis, John V. Guttag, Arvind Satyanarayan

Abstract: Interpretability methods aim to help users build trust in and understand the capabilities of machine learning models. However, existing approaches often rely on abstract, complex visualizations that poorly map to the task at hand or require non-trivial ML expertise to interpret. Here, we present two visual analytics modules that facilitate an intuitive assessment of model reliability. To help users better characterize and reason about a model's uncertainty, we visualize raw and aggregate information about a given input's nearest neighbors. Using an interactive editor, users can manipulate this input in semantically-meaningful ways, determine the effect on the output, and compare against their prior expectations. We evaluate our interface using an electrocardiogram beat classification case study. Compared to a baseline feature importance interface, we find that 14 physicians are better able to align the model's uncertainty with domain-relevant factors and build intuition about its capabilities and limitations.

Submitted 9 July, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

arXiv:2101.02285 [pdf, other]

TryOnGAN: Body-Aware Try-On via Layered Interpolation

Authors: Kathleen M Lewis, Srivatsan Varadharajan, Ira Kemelmacher-Shlizerman

Abstract: Given a pair of images (target person and garment on another person), we automatically generate the target person in the given garment. Previous methods mostly focused on texture transfer via paired data training, while overlooking body shape deformations, skin color, and seamless blending of garment with the person. This work focuses on those three components, while also not requiring paired data training. We designed a pose conditioned StyleGAN2 architecture with a clothing segmentation branch that is trained on images of people wearing garments. Once trained, we propose a new layered latent space interpolation method that allows us to preserve and synthesize skin color and target body shape while transferring the garment from a different person. We demonstrate results on high resolution 512x512 images, and extensively compare to state of the art in try-on on both latent space generated and real images.

Submitted 2 June, 2021; v1 submitted 6 January, 2021; originally announced January 2021.

arXiv:2001.01026 [pdf, other]

Painting Many Pasts: Synthesizing Time Lapse Videos of Paintings

Authors: Amy Zhao, Guha Balakrishnan, Kathleen M. Lewis, Frédo Durand, John V. Guttag, Adrian V. Dalca

Abstract: We introduce a new video synthesis task: synthesizing time lapse videos depicting how a given painting might have been created. Artists paint using unique combinations of brushes, strokes, and colors. There are often many possible ways to create a given painting. Our goal is to learn to capture this rich range of possibilities. Creating distributions of long-term videos is a challenge for learning-based video synthesis methods. We present a probabilistic model that, given a single image of a completed painting, recurrently synthesizes steps of the painting process. We implement this model as a convolutional neural network, and introduce a novel training scheme to enable learning from a limited dataset of painting time lapses. We demonstrate that this model can be used to sample many time steps, enabling long-term stochastic video synthesis. We evaluate our method on digital and watercolor paintings collected from video websites, and show that human raters find our synthetic videos to be similar to time lapse videos produced by real artists. Our code is available at https://xamyzhao.github.io/timecraft.

Submitted 25 April, 2020; v1 submitted 3 January, 2020; originally announced January 2020.

Comments: 10 pages, CVPR 2020

arXiv:1812.06932 [pdf, other]

doi 10.1145/3368555.3384462

Fast Learning-based Registration of Sparse 3D Clinical Images

Authors: Kathleen M. Lewis, Natalia S. Rost, John Guttag, Adrian V. Dalca

Abstract: We introduce SparseVM, a method that registers clinical-quality 3D MR scans both faster and more accurately than previously possible. Deformable alignment, or registration, of clinical scans is a fundamental task for many clinical neuroscience studies. However, most registration algorithms are designed for high-resolution research-quality scans. In contrast to research-quality scans, clinical scans are often sparse, missing up to 86% of the slices available in research-quality scans. Existing methods for registering these sparse images are either inaccurate or extremely slow. We present a learning-based registration method, SparseVM, that is more accurate and orders of magnitude faster than the most accurate clinical registration methods. To our knowledge, it is the first method to use deep learning specifically tailored to registering clinical images. We demonstrate our method on a clinically-acquired MRI dataset of stroke patients and on a simulated sparse MRI dataset. Our code is available as part of the VoxelMorph package at http://voxelmorph.mit.edu/.

Submitted 6 April, 2020; v1 submitted 17 December, 2018; originally announced December 2018.

Comments: This version was accepted to CHIL. It builds on the previous version of the paper and includes more experimental results.

Showing 1–7 of 7 results for author: Lewis, K M