Showing 1–50 of 54 results for author: Plummer, B A

  1. arXiv:2406.07822  [pdf, other]

    cs.CV cs.CL

    Tell Me What's Next: Textual Foresight for Generic UI Representations

    Authors: Andrea Burns, Kate Saenko, Bryan A. Plummer

    Abstract: Mobile app user interfaces (UIs) are rich with action, text, structure, and image content that can be utilized to learn generic UI representations for tasks like automating user commands, summarizing content, and evaluating the accessibility of user interfaces. Prior work has learned strong visual representations with local or global captioning losses, but fails to retain both granularities. To co…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 Findings. Data and code to be released at https://github.com/aburns4/textualforesight

  2. arXiv:2406.01449  [pdf, other]

    cs.CV

    SLANT: Spurious Logo ANalysis Toolkit

    Authors: Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

    Abstract: Online content is filled with logos, from ads and social media posts to website branding and product placements. Consequently, these logos are prevalent in the extensive web-scraped datasets used to pretrain Vision-Language Models, which are used for a wide array of tasks (content moderation, object classification). While these models have been shown to learn harmful correlations in various tasks,…

    Submitted 3 June, 2024; originally announced June 2024.

  3. arXiv:2405.16419  [pdf, other]

    cs.CV cs.AI

    Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers

    Authors: Chau Pham, Bryan A. Plummer

    Abstract: Multi-Channel Imaging (MCI) contains an array of challenges for encoding useful feature representations not present in traditional images. For example, images from two different satellites may both contain RGB channels, but the remaining channels can be different for each imaging source. Thus, MCI models must support a variety of channel configurations at test time. Recent work has extended tradit…

    Submitted 25 May, 2024; originally announced May 2024.

  4. arXiv:2404.04346  [pdf, other]

    cs.CV

    Koala: Key frame-conditioned long video-LLM

    Authors: Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko

    Abstract: Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to unde…

    Submitted 3 May, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024 as a poster highlight

  5. arXiv:2402.11744  [pdf, other]

    cs.CL

    Machine-Generated Text Localization

    Authors: Zhongping Zhang, Wenda Qin, Bryan A. Plummer

    Abstract: Machine-Generated Text (MGT) detection aims to identify a piece of text as machine or human written. Prior work has primarily formulated MGT detection as a binary classification task over an entire document, with limited work exploring cases where only part of a document is machine generated. This paper provides the first in-depth study of MGT that localizes the portions of a document that were ma…

    Submitted 10 June, 2024; v1 submitted 18 February, 2024; originally announced February 2024.

    Comments: ACL 2024 (findings)

  6. arXiv:2402.00626  [pdf, other]

    cs.CV cs.CR cs.LG

    Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks

    Authors: Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

    Abstract: Typographic Attacks, which involve pasting misleading text onto an image, were noted to harm the performance of Vision-Language Models like CLIP. However, the susceptibility of recent Large Vision-Language Models to these attacks remains understudied. Furthermore, prior work's Typographic attacks against CLIP randomly sample a misleading class from a predefined set of categories. However, this sim…

    Submitted 16 February, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  7. arXiv:2312.14985  [pdf, other]

    cs.CV

    UniHuman: A Unified Model for Editing Human Images in the Wild

    Authors: Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, Zhe Lin

    Abstract: Human image editing includes tasks like changing a person's pose, their clothing, or editing the image according to a text prompt. However, prior work often tackles these tasks separately, overlooking the benefit of mutual reinforcement from learning them jointly. In this paper, we propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings. To en…

    Submitted 31 March, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024

  8. arXiv:2312.01629  [pdf, other]

    cs.CV

    CLAMP: Contrastive LAnguage Model Prompt-tuning

    Authors: Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim

    Abstract: Large language models (LLMs) have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks like image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classifying an image into a set…

    Submitted 26 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

  9. arXiv:2312.01274  [pdf, other]

    cs.CV

    Learning to Compose SuperWeights for Neural Parameter Allocation Search

    Authors: Piotr Teterwak, Soren Nelson, Nikoli Dryden, Dina Bashkirova, Kate Saenko, Bryan A. Plummer

    Abstract: Neural parameter allocation search (NPAS) automates parameter sharing by obtaining weights for a network given an arbitrary, fixed parameter budget. Prior work has two major drawbacks we aim to address. First, there is a disconnect in the sharing pattern between the search and training steps, where weights are warped for layers of different sizes during the search to measure similarity, but not du…

    Submitted 2 December, 2023; originally announced December 2023.

    Comments: Accepted at IEEE Winter Conference on Applications of Computer Vision (WACV) 2024

  10. arXiv:2312.00827  [pdf, other]

    cs.CV

    A Unified Framework for Connecting Noise Modeling to Boost Noise Detection

    Authors: Siqi Wang, Chau Pham, Bryan A. Plummer

    Abstract: Noisy labels can impair model performance, making the study of learning with noisy labels an important topic. Two conventional approaches are noise modeling and noise detection. However, these two methods are typically studied independently, and there has been limited work on their collaboration. In this work, we explore the integration of these two approaches, proposing an interconnected structur…

    Submitted 30 November, 2023; originally announced December 2023.

  11. arXiv:2311.04251  [pdf, other]

    cs.LG cs.AI cs.CV

    MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters

    Authors: Chau Pham, Piotr Teterwak, Soren Nelson, Bryan A. Plummer

    Abstract: Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: Accepted at IEEE Winter Conference on Applications of Computer Vision (WACV) 2024

  12. arXiv:2310.19224  [pdf, other]

    cs.CV

    CHAMMI: A benchmark for channel-adaptive models in microscopy imaging

    Authors: Zitong Chen, Chau Pham, Siqi Wang, Michael Doron, Nikita Moshkov, Bryan A. Plummer, Juan C. Caicedo

    Abstract: Most neural networks assume that input images have a fixed number of channels (three for RGB images). However, there are many settings where the number of channels may vary, such as microscopy images where the number of channels changes depending on instruments and experimental goals. Yet, there has not been a systematic attempt to create and evaluate neural networks that are invariant to the number…

    Submitted 16 January, 2024; v1 submitted 29 October, 2023; originally announced October 2023.

    Comments: Accepted at NeurIPS Track on Datasets and Benchmarks, 2023

  13. arXiv:2310.06272  [pdf, other]

    cs.CL cs.AI cs.LG

    Let Models Speak Ciphers: Multiagent Debate through Embeddings

    Authors: Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, Hongxia Yang

    Abstract: Discussion and debate among Large Language Models (LLMs) have gained considerable attention due to their potential to enhance the reasoning ability of LLMs. Although natural language is an obvious choice for communication due to LLMs' language understanding capability, the token sampling step needed when generating natural language poses a potential risk of information loss, as it uses only one to…

    Submitted 26 February, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: Accepted to ICLR 2024

  14. arXiv:2308.16741  [pdf, other]

    cs.AI cs.CV

    Socratis: Are large multimodal models emotionally aware?

    Authors: Katherine Deng, Arijit Ray, Reuben Tan, Saadia Gabriel, Bryan A. Plummer, Kate Saenko

    Abstract: Existing emotion prediction benchmarks contain coarse emotion labels which do not consider the diversity of emotions that an image and text can elicit in humans due to various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactio…

    Submitted 2 November, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: ICCV 2023 WECIA

  15. arXiv:2308.04553  [pdf, other]

    cs.CV cs.LG

    From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Bias

    Authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer

    Abstract: Visual recognition models are prone to learning spurious correlations induced by a biased training set where certain conditions $B$ (e.g., Indoors) are over-represented in certain classes $Y$ (e.g., Big Dogs). Synthetic data from generative models offers a promising direction to mitigate this issue by augmenting underrepresented conditions in the real dataset. However, this introduces another potent…

    Submitted 29 September, 2023; v1 submitted 8 August, 2023; originally announced August 2023.

  16. arXiv:2307.12854  [pdf, other]

    cs.CV

    Multiscale Video Pretraining for Long-Term Activity Forecasting

    Authors: Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani

    Abstract: Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issu…

    Submitted 24 July, 2023; originally announced July 2023.

  17. arXiv:2306.11911  [pdf, other]

    cs.CV

    LNL+K: Learning with Noisy Labels and Noise Source Distribution Knowledge

    Authors: Siqi Wang, Bryan A. Plummer

    Abstract: Learning with noisy labels (LNL) is challenging as the model tends to memorize noisy labels, which can lead to overfitting. Many LNL methods detect clean samples by maximizing the similarity between samples in each category, which does not make any assumptions about likely noise sources. However, we often have some knowledge about the potential source(s) of noisy labels. For example, an image misl…

    Submitted 20 June, 2023; originally announced June 2023.

  18. arXiv:2305.17489  [pdf, other]

    cs.CV

    Text-to-image Editing by Image Information Removal

    Authors: Zhongping Zhang, Jian Zheng, Jacob Zhiyuan Fang, Bryan A. Plummer

    Abstract: Diffusion models have demonstrated impressive performance in text-guided image generation. Current methods that leverage the knowledge of these models for image editing either fine-tune them using the input image (e.g., Imagic) or incorporate structure information as additional constraints (e.g., ControlNet). However, fine-tuning large-scale diffusion models on a single image can lead to severe ov…

    Submitted 7 November, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: Full paper is accepted by WACV2024; Best paper runner-up of AI4CC@CVPR 2023

  19. arXiv:2305.05432  [pdf, other]

    cs.CL cs.CV

    WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

    Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

    Abstract: Webpages have been a rich resource for language and vision-language tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. Webpage tasks have consequently received little attention, and structured image-text data remains underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite; the first…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at the WikiWorkshop 2023. Data is readily available at https://github.com/google-research-datasets/wit/blob/main/wikiweb2m.md. arXiv admin note: text overlap with arXiv:2305.03668

  20. arXiv:2305.03689  [pdf, other]

    cs.CV

    COLA: A Benchmark for Compositional Text-to-image Retrieval

    Authors: Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko

    Abstract: Compositional reasoning is a hallmark of human visual intelligence. Yet, despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve i…

    Submitted 2 November, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted to NeurIPS 2023. Webpage: https://cs-people.bu.edu/array/research/cola/

  21. arXiv:2305.03668  [pdf, other]

    cs.CL cs.CV

    A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

    Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

    Abstract: Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept in existing datasets: image-caption pairs, long text articles, or raw HTML, never all in one place. Webpage tasks have consequently received little attention, and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia…

    Submitted 20 October, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted in EMNLP 2023, revision contains camera ready edits. Data can be downloaded at https://github.com/google-research-datasets/wit/blob/main/wikiweb2m.md

  22. arXiv:2304.01973  [pdf, other]

    cs.LG cs.CV

    ERM++: An Improved Baseline for Domain Generalization

    Authors: Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Kate Saenko, Bryan A. Plummer

    Abstract: Domain Generalization (DG) measures a classifier's ability to generalize to new distributions of data it was not trained on. Recent work has shown that a hyperparameter-tuned Empirical Risk Minimization (ERM) training procedure, that is simply minimizing the empirical risk on the source domains, can outperform most existing DG methods. ERM has achieved such strong results while only tuning hyper-p…

    Submitted 26 March, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: An improved baseline for Domain Generalization

  23. arXiv:2303.16342  [pdf, other]

    cs.CV cs.AI cs.CL

    Language-Guided Audio-Visual Source Separation via Trimodal Consistency

    Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

    Abstract: We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to…

    Submitted 23 September, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

    Comments: Accepted at CVPR 2023

  24. arXiv:2211.12112  [pdf, other]

    cs.CV cs.AI cs.LG

    Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

    Authors: Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin, Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer, Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, Iddo Drori

    Abstract: We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models, on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: NeurIPS 2022 Workshop on Human Evaluation of Generative Models (HEGM)

  25. arXiv:2210.01887  [pdf, other]

    cs.CV

    Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures

    Authors: Nannan Li, Kevin J. Shih, Bryan A. Plummer

    Abstract: Human pose transfer synthesizes new view(s) of a person for a given pose. Recent work achieves this via self-reconstruction, which disentangles a person's pose and texture information by breaking the person down into parts, then recombines them for reconstruction. However, part-level disentanglement preserves some pose information that can create unwanted artifacts. In this paper, we propose Pose…

    Submitted 30 August, 2023; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: Accepted to ICCV 2023

  26. arXiv:2209.15605  [pdf, other]

    cs.CV

    Bias Mimicking: A Simple Sampling Approach for Bias Mitigation

    Authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer

    Abstract: Prior work has shown that Visual Recognition datasets frequently underrepresent bias groups $B$ (e.g., Female) within class labels $Y$ (e.g., Programmers). This dataset bias can lead to models that learn spurious correlations between class labels and bias groups such as age, gender, or race. Most recent methods that address this problem require significant architectural changes or additional loss func…

    Submitted 27 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: Accepted at CVPR 2023

  27. arXiv:2207.13061  [pdf, other]

    cs.CV cs.AI cs.CL

    NewsStories: Illustrating articles with visual summaries

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, JP Lewis, Avneesh Sud, Thomas Leung

    Abstract: Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume that there is a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visu…

    Submitted 14 August, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

    Comments: Accepted at ECCV 2022

  28. arXiv:2207.06555  [pdf, other]

    cs.CV

    Supervised Attribute Information Removal and Reconstruction for Image Manipulation

    Authors: Nannan Li, Bryan A. Plummer

    Abstract: The goal of attribute manipulation is to control specified attribute(s) in given images. Prior work approaches this problem by learning disentangled representations for each attribute, which enable a model to manipulate the encoded source attributes into the target attributes. However, encoded attributes are often correlated with relevant image content. Thus, the source attribute information can often be…

    Submitted 13 July, 2022; originally announced July 2022.

    Comments: Accepted at ECCV 2022

  29. arXiv:2203.13281  [pdf, other]

    cs.CV

    Movie Genre Classification by Language Augmentation and Shot Sampling

    Authors: Zhongping Zhang, Yiwen Gu, Bryan A. Plummer, Xin Miao, Jiayi Liu, Huayan Wang

    Abstract: Video-based movie genre classification has garnered considerable attention due to its various applications in recommendation systems. Prior work has typically addressed this task by adapting models from traditional video classification tasks, such as action recognition or event detection. However, these models often neglect language elements (e.g., narrations or conversations) present in videos, w…

    Submitted 7 November, 2023; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted at WACV2024

  30. arXiv:2203.12849  [pdf, other]

    cs.CV

    Complex Scene Image Editing by Scene Graph Comprehension

    Authors: Zhongping Zhang, Huiwen He, Bryan A. Plummer, Zhenyu Liao, Huayan Wang

    Abstract: Conditional diffusion models have demonstrated impressive performance on various tasks like text-guided semantic image editing. Prior work requires image regions to be identified manually by human users or uses an object detector that only performs well for object-centric manipulations. For example, if an input image contains multiple objects with the same semantic meaning (such as a group of birds)…

    Submitted 19 September, 2023; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to BMVC 2023

  31. arXiv:2202.02312  [pdf, other]

    cs.CL cs.CV cs.HC

    A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

    Authors: Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer

    Abstract: Vision-language navigation (VLN), in which an agent follows language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset Mobile app Tasks with Itera…

    Submitted 14 August, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

    Comments: Accepted at the European Conference on Computer Vision (ECCV) 2022. This is a new version of the paper with additional experimental results and a few prior implementation bugs fixed

  32. arXiv:2112.05917  [pdf, other]

    cs.CL

    Show, Write, and Retrieve: Entity-aware Article Generation and Retrieval

    Authors: Zhongping Zhang, Yiwen Gu, Bryan A. Plummer

    Abstract: Article comprehension is an important challenge in natural language processing with many applications such as article generation or image-to-article retrieval. Prior work typically encodes all tokens in articles uniformly using pretrained language models. However, in many applications, such as understanding news stories, these articles are based on real-world events and may reference many named en…

    Submitted 20 October, 2023; v1 submitted 11 December, 2021; originally announced December 2021.

    Comments: Accepted at EMNLP 2023 Findings

  33. arXiv:2112.03237  [pdf, other]

    cs.CV

    From Coarse to Fine-grained Concept based Discrimination for Phrase Detection

    Authors: Maan Qraitem, Bryan A. Plummer

    Abstract: Phrase detection requires methods to identify if a phrase is relevant to an image and localize it, if applicable. A key challenge for training more discriminative detection models is sampling negatives. Sampling techniques from prior work focus primarily on hard, often noisy, negatives disregarding the broader distribution of negative samples. Our proposed CFCD-Net addresses this through two novel…

    Submitted 14 November, 2022; v1 submitted 6 December, 2021; originally announced December 2021.

  34. arXiv:2112.03208  [pdf, other]

    cs.LG

    Anchoring to Exemplars for Training Mixture-of-Expert Cell Embeddings

    Authors: Siqi Wang, Manyuan Lu, Nikita Moshkov, Juan C. Caicedo, Bryan A. Plummer

    Abstract: Analyzing the morphology of cells in microscopy images can provide insights into the mechanism of compounds or the function of genes. Addressing this task requires methods that can not only extract biological information from the images, but also ignore technical variations, i.e., changes in experimental procedure or differences between the equipment used to collect microscopy images. We propose Treatm…

    Submitted 6 December, 2021; originally announced December 2021.

  35. arXiv:2110.10596  [pdf, other]

    cs.CV cs.LG

    Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell

    Abstract: We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.…

    Submitted 2 December, 2021; v1 submitted 20 October, 2021; originally announced October 2021.

    Comments: Accepted at NeurIPS 2021 (Spotlight)

  36. arXiv:2104.08560  [pdf, other]

    cs.CL cs.CV

    Mobile App Tasks with Iterative Feedback (MoTIF): Addressing Task Feasibility in Interactive Visual Environments

    Authors: Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer

    Abstract: In recent years, vision-language research has shifted to study tasks which require more complex reasoning, such as interactive question answering, visual common sense reasoning, and question-answer plausibility prediction. However, the datasets used for these problems fail to capture the complexity of real inputs and multimodal environments, such as ambiguous natural language requests and diverse…

    Submitted 17 April, 2021; originally announced April 2021.

    Comments: Accepted at the workshop on Visually Grounded Interaction and Language (ViGIL) at NAACL 2021

  37. Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko

    Abstract: Large-scale dissemination of disinformation online intended to mislead or deceive the general population is a major societal problem. Rapid progression in image, video, and natural language generative models has only exacerbated this situation and intensified our need for an effective defense mechanism. While existing approaches have been proposed to defend against neural fake news, they are gener…

    Submitted 21 October, 2020; v1 submitted 16 September, 2020; originally announced September 2020.

    Comments: Accepted at EMNLP 2020

  38. arXiv:2008.00348  [pdf, other]

    cs.CV

    Self-supervised Visual Attribute Learning for Fashion Compatibility

    Authors: Donghyun Kim, Kuniaki Saito, Samarth Mishra, Stan Sclaroff, Kate Saenko, Bryan A. Plummer

    Abstract: Many self-supervised learning (SSL) methods have been successful in learning semantically meaningful visual representations by solving pretext tasks. However, prior work in SSL focuses on tasks like object recognition or detection, which aim to learn object shapes and assume that the features should be invariant to concepts like colors and textures. Thus, these SSL methods perform poorly on downst…

    Submitted 11 August, 2021; v1 submitted 1 August, 2020; originally announced August 2020.

    Comments: Accepted to VIPriors Workshop ICCV 2021

  39. arXiv:2006.10598  [pdf, other]

    cs.LG cs.CL cs.CV stat.ML

    Neural Parameter Allocation Search

    Authors: Bryan A. Plummer, Nikoli Dryden, Julius Frost, Torsten Hoefler, Kate Saenko

    Abstract: Training neural networks requires increasing amounts of memory. Parameter sharing can reduce memory and communication costs, but existing methods assume networks have many identical layers and utilize hand-crafted sharing strategies that fail to generalize. We introduce Neural Parameter Allocation Search (NPAS), a novel task where the goal is to train a neural network given an arbitrary, fixed par…

    Submitted 15 March, 2022; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: Accepted at ICLR 2022

  40. arXiv:2004.04312  [pdf, other]

    cs.CV cs.CL

    Learning to Scale Multilingual Representations for Vision-Language Tasks

    Authors: Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, Bryan A. Plummer

    Abstract: Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed…

    Submitted 27 August, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

    Comments: ECCV 2020 accepted spotlight paper

  41. arXiv:2003.08264  [pdf, other]

    cs.CV

    Cross-domain Self-supervised Learning for Domain Adaptation with Few Source Labels

    Authors: Donghyun Kim, Kuniaki Saito, Tae-Hyun Oh, Bryan A. Plummer, Stan Sclaroff, Kate Saenko

    Abstract: Existing unsupervised domain adaptation methods aim to transfer knowledge from a label-rich source domain to an unlabeled target domain. However, obtaining labels for some source domains may be very expensive, making complete labeling as used in prior work impractical. In this work, we investigate a new domain adaptation scenario with sparsely labeled source data, where only a few examples in the…

    Submitted 18 March, 2020; originally announced March 2020.

  42. arXiv:2002.07362  [pdf, other]

    cs.CV

    MILA: Multi-Task Learning from Videos via Efficient Inter-Frame Attention

    Authors: Donghyun Kim, Tian Lan, Chuhang Zou, Ning Xu, Bryan A. Plummer, Stan Sclaroff, Jayan Eledath, Gerard Medioni

    Abstract: Prior work in multi-task learning has mainly focused on predictions on a single image. In this work, we present a new approach for multi-task learning from videos via efficient inter-frame local attention (MILA). Our approach contains a novel inter-frame attention module which allows learning of task-specific attention across frames. We embed the attention module in a "slow-fast" architecture, w…

    Submitted 10 October, 2021; v1 submitted 17 February, 2020; originally announced February 2020.

    Comments: Accepted in ICCV 2021 MTL Workshop

  43. arXiv:1909.13784  [pdf, other]

    cs.CV cs.LG eess.IV

    LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval

    Authors: Reuben Tan, Huijuan Xu, Kate Saenko, Bryan A. Plummer

    Abstract: The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to the given natural language query without access to temporal annotations during training. Prior strongly- and weakly-supervised approaches often leverage co-attention mechanisms to learn visual-semantic representations for localization. However, while such approaches tend to focus on identifying r…

    Submitted 28 March, 2020; v1 submitted 27 September, 2019; originally announced September 2019.

  44. arXiv:1909.03493  [pdf, other]

    cs.CV cs.CL

    MULE: Multimodal Universal Language Embedding

    Authors: Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan A. Plummer

    Abstract: Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. The…

    Submitted 28 December, 2019; v1 submitted 8 September, 2019; originally announced September 2019.

    Comments: Accepted as an oral at AAAI 2020

  45. arXiv:1908.08589  [pdf, other]

    cs.CV

    Learning Similarity Conditions Without Explicit Supervision

    Authors: Reuben Tan, Mariya I. Vasileva, Kate Saenko, Bryan A. Plummer

    Abstract: Many real-world tasks require models to compare images along multiple similarity conditions (e.g. similarity in color, category or shape). Existing methods often reason about these complex similarity relationships by learning condition-aware embeddings. While such embeddings aid models in learning different notions of similarity, they also limit their capability to generalize to unseen categories…

    Submitted 22 August, 2019; originally announced August 2019.

    Comments: Accepted at ICCV 2019

  46. arXiv:1908.06327  [pdf, other]

    cs.CV cs.CL

    Language Features Matter: Effective Language Representations for Vision-Language Tasks

    Authors: Andrea Burns, Reuben Tan, Kate Saenko, Stan Sclaroff, Bryan A. Plummer

    Abstract: Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings trained on text-only data or are learned from scratch. We believe that language features deserve more attention, and conduct experiments which compare different word…

    Submitted 17 August, 2019; originally announced August 2019.

    Comments: ICCV 2019 accepted paper

  47. arXiv:1905.10797  [pdf, other]

    cs.CV

    Why do These Match? Explaining the Behavior of Image Similarity Models

    Authors: Bryan A. Plummer, Mariya I. Vasileva, Vitali Petsiuk, Kate Saenko, David Forsyth

    Abstract: Explaining a deep learning model can help users understand its behavior and allow researchers to discern its shortcomings. Recent work has primarily focused on explaining models for tasks like image classification or visual question answering. In this paper, we introduce Salient Attributes for Network Explanation (SANE) to explain image similarity models, where a model's output is a score measurin…

    Submitted 24 August, 2020; v1 submitted 26 May, 2019; originally announced May 2019.

    Comments: Accepted at ECCV 2020

  48. Revisiting Image-Language Networks for Open-ended Phrase Detection

    Authors: Bryan A. Plummer, Kevin J. Shih, Yichen Li, Ke Xu, Svetlana Lazebnik, Stan Sclaroff, Kate Saenko

    Abstract: Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to…

    Submitted 12 October, 2020; v1 submitted 17 November, 2018; originally announced November 2018.

    Comments: Accepted to TPAMI

  49. arXiv:1809.08714  [pdf, other]

    cs.CV

    Give me a hint! Navigating Image Databases using Human-in-the-loop Feedback

    Authors: Bryan A. Plummer, M. Hadi Kiapour, Shuai Zheng, Robinson Piramuthu

    Abstract: In this paper, we introduce an attribute-based interactive image search which can leverage human-in-the-loop feedback to iteratively refine image search results. We study active image search where human feedback is solicited exclusively in visual form, without using relative attribute annotations used by prior work which are not typically found in many datasets. In order to optimize the image sele…

    Submitted 23 September, 2018; originally announced September 2018.

  50. arXiv:1804.05113  [pdf, other]

    cs.CV

    Multilevel Language and Vision Integration for Text-to-Clip Retrieval

    Authors: Huijuan Xu, Kun He, Bryan A. Plummer, Leonid Sigal, Stan Sclaroff, Kate Saenko

    Abstract: We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on whe…

    Submitted 25 December, 2018; v1 submitted 13 April, 2018; originally announced April 2018.

    Comments: AAAI 2019