Showing 1–50 of 171 results for author: Saenko, K

  1. arXiv:2406.07822  [pdf, other]

    cs.CV cs.CL

    Tell Me What's Next: Textual Foresight for Generic UI Representations

    Authors: Andrea Burns, Kate Saenko, Bryan A. Plummer

    Abstract: Mobile app user interfaces (UIs) are rich with action, text, structure, and image content that can be utilized to learn generic UI representations for tasks like automating user commands, summarizing content, and evaluating the accessibility of user interfaces. Prior work has learned strong visual representations with local or global captioning losses, but fails to retain both granularities. To co…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 Findings. Data and code to be released at https://github.com/aburns4/textualforesight

  2. arXiv:2406.01449  [pdf, other]

    cs.CV

    SLANT: Spurious Logo ANalysis Toolkit

    Authors: Maan Qraitem, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

    Abstract: Online content is filled with logos, from ads and social media posts to website branding and product placements. Consequently, these logos are prevalent in the extensive web-scraped datasets used to pretrain Vision-Language Models, which are used for a wide array of tasks (content moderation, object classification). While these models have been shown to learn harmful correlations in various tasks,…

    Submitted 3 June, 2024; originally announced June 2024.

  3. arXiv:2405.17247  [pdf, other]

    cs.LG

    An Introduction to Vision-Language Modeling

    Authors: Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Alexander C. Li, Adrien Bardes, Suzanne Petryk, Oscar Mañas, Zhiqiu Lin, Anas Mahmoud, Bargav Jayaraman, Mark Ibrahim, Melissa Hall, Yunyang Xiong, Jonathan Lebensold, Candace Ross, Srihari Jayakumar, Chuan Guo, Diane Bouchacourt, Haider Al-Tahan, Karthik Padthe, Vasu Sharma, Hu Xu, Xiaoqing Ellen Tan, Megan Richards, Samuel Lavoie, et al. (16 additional authors not shown)

    Abstract: Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technol…

    Submitted 27 May, 2024; originally announced May 2024.

  4. arXiv:2404.13706  [pdf, other]

    cs.CV cs.AI cs.LG

    Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models

    Authors: Vitali Petsiuk, Kate Saenko

    Abstract: Motivated by ethical and legal concerns, the scientific community is actively developing methods to limit the misuse of Text-to-Image diffusion models for reproducing copyrighted, violent, explicit, or personal information in the generated images. Simultaneously, researchers put these newly developed safety measures to the test by assuming the role of an adversary to find vulnerabilities and backd…

    Submitted 21 April, 2024; originally announced April 2024.

  5. arXiv:2404.04346  [pdf, other]

    cs.CV

    Koala: Key frame-conditioned long video-LLM

    Authors: Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko

    Abstract: Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However, despite being trained on millions of short seconds-long videos, vLLMs are unable to unde…

    Submitted 3 May, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024 as a poster highlight

  6. arXiv:2402.00626  [pdf, other]

    cs.CV cs.CR cs.LG

    Vision-LLMs Can Fool Themselves with Self-Generated Typographic Attacks

    Authors: Maan Qraitem, Nazia Tasnim, Piotr Teterwak, Kate Saenko, Bryan A. Plummer

    Abstract: Typographic Attacks, which involve pasting misleading text onto an image, were noted to harm the performance of Vision-Language Models like CLIP. However, the susceptibility of recent Large Vision-Language Models to these attacks remains understudied. Furthermore, prior work's Typographic attacks against CLIP randomly sample a misleading class from a predefined set of categories. However, this sim…

    Submitted 16 February, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  7. arXiv:2401.00420  [pdf, other]

    cs.CV cs.AI

    SynCDR : Training Cross Domain Retrieval Models with Synthetic Data

    Authors: Samarth Mishra, Carlos D. Castillo, Hongcheng Wang, Kate Saenko, Venkatesh Saligrama

    Abstract: In cross-domain retrieval, a model is required to identify images from the same semantic category across two visual domains. For instance, given a sketch of an object, a model needs to retrieve a real image of it from an online store's catalog. A standard approach for such a problem is learning a feature space of images where Euclidean distances reflect similarity. Even without human annotations,…

    Submitted 19 March, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

    Comments: Pre-print

  8. arXiv:2312.01629  [pdf, other]

    cs.CV

    CLAMP: Contrastive LAnguage Model Prompt-tuning

    Authors: Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim

    Abstract: Large language models (LLMs) have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks like image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classifying an image into a set…

    Submitted 26 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

  9. arXiv:2312.01274  [pdf, other]

    cs.CV

    Learning to Compose SuperWeights for Neural Parameter Allocation Search

    Authors: Piotr Teterwak, Soren Nelson, Nikoli Dryden, Dina Bashkirova, Kate Saenko, Bryan A. Plummer

    Abstract: Neural parameter allocation search (NPAS) automates parameter sharing by obtaining weights for a network given an arbitrary, fixed parameter budget. Prior work has two major drawbacks we aim to address. First, there is a disconnect in the sharing pattern between the search and training steps, where weights are warped for layers of different sizes during the search to measure similarity, but not du…

    Submitted 2 December, 2023; originally announced December 2023.

    Comments: Accepted at IEEE Winter Conference on Applications of Computer Vision (WACV) 2024

  10. arXiv:2312.00833  [pdf, other]

    cs.CV

    Lasagna: Layered Score Distillation for Disentangled Object Relighting

    Authors: Dina Bashkirova, Arijit Ray, Rupayan Mallick, Sarah Adel Bargal, Jianming Zhang, Ranjay Krishna, Kate Saenko

    Abstract: Professional artists, photographers, and other visual content creators use object relighting to establish their photo's desired effect. Unfortunately, manual tools that allow relighting have a steep learning curve and are difficult to master. Although generative editing methods now enable some forms of image editing, relighting is still beyond today's capabilities; existing methods struggle to kee…

    Submitted 30 November, 2023; originally announced December 2023.

  11. arXiv:2310.18946  [pdf, other]

    cs.CV cs.MM

    Video Frame Interpolation with Many-to-many Splatting and Spatial Selective Refinement

    Authors: Ping Hu, Simon Niklaus, Lu Zhang, Stan Sclaroff, Kate Saenko

    Abstract: In this work, we first propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step before fusing overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger…

    Submitted 29 October, 2023; originally announced October 2023.

    Comments: T-PAMI. arXiv admin note: substantial text overlap with arXiv:2204.03513

  12. arXiv:2308.16741  [pdf, other]

    cs.AI cs.CV

    Socratis: Are large multimodal models emotionally aware?

    Authors: Katherine Deng, Arijit Ray, Reuben Tan, Saadia Gabriel, Bryan A. Plummer, Kate Saenko

    Abstract: Existing emotion prediction benchmarks contain coarse emotion labels which do not consider the diversity of emotions that an image and text can elicit in humans due to various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactio…

    Submitted 2 November, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: ICCV 2023 WECIA

  13. arXiv:2308.12949  [pdf, other]

    cs.LG cs.CV

    Label Budget Allocation in Multi-Task Learning

    Authors: Ximeng Sun, Kihyuk Sohn, Kate Saenko, Clayton Mellina, Xiao Bian

    Abstract: The cost of labeling data often limits the performance of machine learning systems. In multi-task learning, related tasks provide information to each other and improve overall performance, but the label cost can vary among tasks. How should the label budget (i.e. the amount of money spent on labeling) be allocated among different tasks to achieve optimal multi-task performance? We are the first to…

    Submitted 24 August, 2023; originally announced August 2023.

  14. arXiv:2308.04553  [pdf, other]

    cs.CV cs.LG

    From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Bias

    Authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer

    Abstract: Visual recognition models are prone to learning spurious correlations induced by a biased training set where certain conditions $B$ (e.g., Indoors) are over-represented in certain classes $Y$ (e.g., Big Dogs). Synthetic data from generative models offers a promising direction to mitigate this issue by augmenting underrepresented conditions in the real dataset. However, this introduces another potent…

    Submitted 29 September, 2023; v1 submitted 8 August, 2023; originally announced August 2023.

  15. arXiv:2308.01890  [pdf, other]

    cs.CV cs.LG

    DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations

    Authors: Ping Hu, Ximeng Sun, Stan Sclaroff, Kate Saenko

    Abstract: Multi-label image recognition in the low-label regime is a task of great challenge and practical significance. Previous works have focused on learning the alignment between textual and visual spaces to compensate for limited image labels, yet may suffer from reduced accuracy due to the scarcity of high-quality multi-label annotations. In this research, we leverage the powerful alignment between te…

    Submitted 13 December, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

    Comments: TPAMI. arXiv admin note: substantial text overlap with arXiv:2206.09541

  16. arXiv:2307.12854  [pdf, other]

    cs.CV

    Multiscale Video Pretraining for Long-Term Activity Forecasting

    Authors: Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A. Plummer, Kate Saenko, Karl Ridgeway, Lorenzo Torresani

    Abstract: Long-term activity forecasting is an especially challenging research problem because it requires understanding the temporal relationships between observed actions, as well as the variability and complexity of human activities. Despite relying on strong supervision via expensive human annotations, state-of-the-art forecasting approaches often generalize poorly to unseen data. To alleviate this issu…

    Submitted 24 July, 2023; originally announced July 2023.

  17. arXiv:2306.17848  [pdf, other]

    cs.CV

    Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

    Authors: Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz

    Abstract: Vision transformers (ViTs) have significantly changed the computer vision landscape and have periodically exhibited superior performance in vision tasks compared to convolutional neural networks (CNNs). Although the jury is still out on which model type is superior, each has unique inductive biases that shape their learning and generalization performance. For example, ViTs have interesting propert…

    Submitted 30 June, 2023; originally announced June 2023.

  18. arXiv:2305.05432  [pdf, other]

    cs.CL cs.CV

    WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset

    Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

    Abstract: Webpages have been a rich resource for language and vision-language tasks. Yet only pieces of webpages are kept: image-caption pairs, long text articles, or raw HTML, never all in one place. Webpage tasks have resultingly received little attention and structured image-text data underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage 2M (WikiWeb2M) suite; the first…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at the WikiWorkshop 2023. Data is readily available at https://github.com/google-research-datasets/wit/blob/main/wikiweb2m.md. arXiv admin note: text overlap with arXiv:2305.03668

  19. arXiv:2305.03689  [pdf, other]

    cs.CV

    COLA: A Benchmark for Compositional Text-to-image Retrieval

    Authors: Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko

    Abstract: Compositional reasoning is a hallmark of human visual intelligence. Yet, despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve i…

    Submitted 2 November, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted to NeurIPS 2023. Webpage: https://cs-people.bu.edu/array/research/cola/

  20. arXiv:2305.03668  [pdf, other]

    cs.CL cs.CV

    A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding

    Authors: Andrea Burns, Krishna Srinivasan, Joshua Ainslie, Geoff Brown, Bryan A. Plummer, Kate Saenko, Jianmo Ni, Mandy Guo

    Abstract: Webpages have been a rich, scalable resource for vision-language and language only tasks. Yet only pieces of webpages are kept in existing datasets: image-caption pairs, long text articles, or raw HTML, never all in one place. Webpage tasks have resultingly received little attention and structured image-text data left underused. To study multimodal webpage understanding, we introduce the Wikipedia…

    Submitted 20 October, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted in EMNLP 2023, revision contains camera ready edits. Data can be downloaded at https://github.com/google-research-datasets/wit/blob/main/wikiweb2m.md

  21. arXiv:2304.01973  [pdf, other]

    cs.LG cs.CV

    ERM++: An Improved Baseline for Domain Generalization

    Authors: Piotr Teterwak, Kuniaki Saito, Theodoros Tsiligkaridis, Kate Saenko, Bryan A. Plummer

    Abstract: Domain Generalization (DG) measures a classifier's ability to generalize to new distributions of data it was not trained on. Recent work has shown that a hyperparameter-tuned Empirical Risk Minimization (ERM) training procedure, that is simply minimizing the empirical risk on the source domains, can outperform most existing DG methods. ERM has achieved such strong results while only tuning hyper-p…

    Submitted 26 March, 2024; v1 submitted 4 April, 2023; originally announced April 2023.

    Comments: An improved baseline for Domain Generalization

  22. arXiv:2303.18232  [pdf, other]

    cs.CV

    DIME-FM: DIstilling Multimodal and Efficient Foundation Models

    Authors: Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, Xide Xia

    Abstract: Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency and fixed architectures. Unfortunately, recent work shows training a small custom VLFM…

    Submitted 14 August, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

    Comments: Accepted to ICCV 2023

  23. arXiv:2303.16342  [pdf, other]

    cs.CV cs.AI cs.CL

    Language-Guided Audio-Visual Source Separation via Trimodal Consistency

    Authors: Reuben Tan, Arijit Ray, Andrea Burns, Bryan A. Plummer, Justin Salamon, Oriol Nieto, Bryan Russell, Kate Saenko

    Abstract: We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to…

    Submitted 23 September, 2023; v1 submitted 28 March, 2023; originally announced March 2023.

    Comments: Accepted at CVPR 2023

  24. arXiv:2303.14828  [pdf, other]

    cs.CV

    VisDA 2022 Challenge: Domain Adaptation for Industrial Waste Sorting

    Authors: Dina Bashkirova, Samarth Mishra, Diala Lteif, Piotr Teterwak, Donghyun Kim, Fadi Alladkani, James Akl, Berk Calli, Sarah Adel Bargal, Kate Saenko, Daehan Kim, Minseok Seo, Youngjin Jeon, Dong-Geol Choi, Shahaf Ettedgui, Raja Giryes, Shady Abu-Hussein, Binhui Xie, Shuang Li

    Abstract: Label-efficient and reliable semantic segmentation is essential for many real-life applications, especially for industrial settings with high visual diversity, such as waste sorting. In industrial waste sorting, one of the biggest challenges is the extreme diversity of the input stream depending on factors like the location of the sorting facility, the equipment available in the facility, and the…

    Submitted 26 March, 2023; originally announced March 2023.

    Comments: Proceedings of Machine Learning Research

  25. arXiv:2303.14744  [pdf, other]

    cs.CV

    Mind the Backbone: Minimizing Backbone Distortion for Robust Object Detection

    Authors: Kuniaki Saito, Donghyun Kim, Piotr Teterwak, Rogerio Feris, Kate Saenko

    Abstract: Building object detectors that are robust to domain shifts is critical for real-world applications. Prior approaches fine-tune a pre-trained backbone and risk overfitting it to in-distribution (ID) data and distorting features useful for out-of-distribution (OOD) generalization. We propose to use Relative Gradient Norm (RGN) as a way to measure the vulnerability of a backbone to feature distortion…

    Submitted 15 May, 2023; v1 submitted 26 March, 2023; originally announced March 2023.

    Comments: Project page: http://ai.bu.edu/mind_back/

  26. arXiv:2302.05496  [pdf, other]

    cs.CV cs.AI

    MaskSketch: Unpaired Structure-guided Masked Image Generation

    Authors: Dina Bashkirova, Jose Lezama, Kihyuk Sohn, Kate Saenko, Irfan Essa

    Abstract: Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding…

    Submitted 10 February, 2023; originally announced February 2023.

  27. arXiv:2302.03084  [pdf, other]

    cs.CV

    Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval

    Authors: Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister

    Abstract: In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-S…

    Submitted 15 May, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    Comments: CVPR2023

  28. arXiv:2301.06987  [pdf, other]

    cs.RO cs.LG

    The SwaNNFlight System: On-the-Fly Sim-to-Real Adaptation via Anchored Learning

    Authors: Bassel El Mabsout, Shahin Roozkhosh, Siddharth Mysore, Kate Saenko, Renato Mancuso

    Abstract: Reinforcement Learning (RL) agents trained in simulated environments and then deployed in the real world are often sensitive to the differences in dynamics presented, commonly termed the sim-to-real gap. With the goal of minimizing this gap on resource-constrained embedded systems, we train and live-adapt agents on quadrotors built from off-the-shelf hardware. In achieving this we developed three…

    Submitted 17 January, 2023; originally announced January 2023.

  29. arXiv:2211.16499  [pdf, other]

    cs.CV cs.AI cs.LG

    Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing

    Authors: Nataniel Ruiz, Sarah Adel Bargal, Cihang Xie, Kate Saenko, Stan Sclaroff

    Abstract: Modern deep neural networks tend to be evaluated on static test sets. One shortcoming of this is the fact that these deep neural networks cannot be easily evaluated for robustness issues with respect to specific scene variations. For example, it is hard to study the robustness of these networks to variations of object scale, object pose, scene lighting and 3D occlusions. The main reason is that co…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: Published at the Conference on Neural Information Processing Systems (NeurIPS) 2022

  30. arXiv:2211.14703  [pdf, other]

    cs.CV

    Exploring Consistency in Cross-Domain Transformer for Domain Adaptive Semantic Segmentation

    Authors: Kaihong Wang, Donghyun Kim, Rogerio Feris, Kate Saenko, Margrit Betke

    Abstract: While transformers have greatly boosted performance in semantic segmentation, domain adaptive transformers are not yet well explored. We identify that the domain gap can cause discrepancies in self-attention. Due to this gap, the transformer attends to spurious regions or pixels, which deteriorates accuracy on the target domain. We propose to perform adaptation on attention maps with cross-domain…

    Submitted 20 December, 2022; v1 submitted 26 November, 2022; originally announced November 2022.

  31. arXiv:2211.12112  [pdf, other]

    cs.CV cs.AI cs.LG

    Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

    Authors: Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin, Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer, Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, Iddo Drori

    Abstract: We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models, on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid…

    Submitted 22 November, 2022; originally announced November 2022.

    Comments: NeurIPS 2022 Workshop on Human Evaluation of Generative Models (HEGM)

  32. arXiv:2209.15605  [pdf, other]

    cs.CV

    Bias Mimicking: A Simple Sampling Approach for Bias Mitigation

    Authors: Maan Qraitem, Kate Saenko, Bryan A. Plummer

    Abstract: Prior work has shown that Visual Recognition datasets frequently underrepresent bias groups $B$ (e.g., Female) within class labels $Y$ (e.g., Programmers). This dataset bias can lead to models that learn spurious correlations between class labels and bias groups such as age, gender, or race. Most recent methods that address this problem require significant architectural changes or additional loss func…

    Submitted 27 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: Accepted at CVPR 2023

  33. arXiv:2209.03648  [pdf, other]

    cs.CV

    FETA: Towards Specializing Foundation Models for Expert Task Applications

    Authors: Amit Alfassy, Assaf Arbelle, Oshri Halimi, Sivan Harary, Roei Herzig, Eli Schwartz, Rameswar Panda, Michele Dolfi, Christoph Auer, Kate Saenko, Peter W. J. Staar, Rogerio Feris, Leonid Karlinsky

    Abstract: Foundation Models (FMs) have demonstrated unprecedented capabilities including zero-shot learning, high fidelity data synthesis, and out of domain generalization. However, as we show in this paper, FMs still have poor out-of-the-box performance on expert tasks (e.g. retrieval of car manuals technical illustrations from language queries), data for which is either unseen or belonging to a long-tail…

    Submitted 19 December, 2022; v1 submitted 8 September, 2022; originally announced September 2022.

  34. arXiv:2207.13061  [pdf, other]

    cs.CV cs.AI cs.CL

    NewsStories: Illustrating articles with visual summaries

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, JP Lewis, Avneesh Sud, Thomas Leung

    Abstract: Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume that there is one-to-one correspondence between its images and their (short) captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visu…

    Submitted 14 August, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

    Comments: Accepted at ECCV 2022

  35. arXiv:2206.09541  [pdf, other]

    cs.CV

    DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations

    Authors: Ximeng Sun, Ping Hu, Kate Saenko

    Abstract: Solving multi-label recognition (MLR) for images in the low-label regime is a challenging task with many real-world applications. Recent work learns an alignment between textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. In this work, we utilize the strong alignment of textual and visual features pre…

    Submitted 19 June, 2022; originally announced June 2022.

  36. arXiv:2206.01125  [pdf, other]

    cs.CV

    Prefix Conditioning Unifies Language and Label Supervision

    Authors: Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister

    Abstract: Image-classification datasets have been used to pretrain image recognition models. Recently, web-scale image-caption datasets have emerged as a source of powerful pretraining alternative. Image-caption datasets are more "open-domain", containing a wider variety of scene types and vocabulary words than traditional classification datasets, and models trained on these datasets have demonstrated str…

    Submitted 15 May, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

    Comments: CVPR2023

  37. arXiv:2204.11929  [pdf, other]

    cs.CV

    Temporal Relevance Analysis for Video Action Models

    Authors: Quanfu Fan, Donghyun Kim, Chun-Fu Chen, Stan Sclaroff, Kate Saenko, Sarah Adel Bargal

    Abstract: In this paper, we provide a deep analysis of temporal modeling for action recognition, an important but underexplored problem in the literature. We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models based on layer-wise relevance propagation. We then conduct comprehensive experiments and in-depth analysis to provide a better unders…

    Submitted 25 April, 2022; originally announced April 2022.

  38. arXiv:2204.03513  [pdf, other]

    cs.CV cs.AI cs.MM

    Many-to-many Splatting for Efficient Video Frame Interpolation

    Authors: Ping Hu, Simon Niklaus, Stan Sclaroff, Kate Saenko

    Abstract: Motion-based video frame interpolation commonly relies on optical flow to warp pixels from the inputs to the desired interpolation instant. Yet due to the inherent challenges of motion estimation (e.g. occlusions and discontinuities), most state-of-the-art interpolation approaches require subsequent refinement of the warped result to generate satisfying outputs, which drastically decreases the eff…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: CVPR2022, Project: https://github.com/feinanshan/M2M_VFI

  39. arXiv:2204.00172  [pdf, other]

    cs.CV cs.LG

    A Unified Framework for Domain Adaptive Pose Estimation

    Authors: Donghyun Kim, Kaihong Wang, Kate Saenko, Margrit Betke, Stan Sclaroff

    Abstract: While pose estimation is an important computer vision task, it requires expensive annotation and suffers from domain shift. In this paper, we investigate the problem of domain adaptive 2D pose estimation that transfers knowledge learned on a synthetic source domain to a target domain without supervision. While several domain adaptive pose estimation models have been proposed recently, they are not…

    Submitted 5 August, 2022; v1 submitted 31 March, 2022; originally announced April 2022.

  40. arXiv:2203.11819  [pdf, other]

    cs.CV

    A Broad Study of Pre-training for Domain Generalization and Adaptation

    Authors: Donghyun Kim, Kaihong Wang, Stan Sclaroff, Kate Saenko

    Abstract: Deep models must learn robust and transferable representations in order to perform well on new domains. While domain transfer methods (e.g., domain adaptation, domain generalization) have been proposed to learn transferable representations across domains, they are typically applied to ResNet backbones pre-trained on ImageNet. Thus, existing works pay little attention to the effects of pre-training…

    Submitted 20 July, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

  41. arXiv:2202.04800  [pdf, other]

    cs.CV cs.CL

    The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

    Authors: Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi

    Abstract: Humans have remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can't help but draw probable inferences beyond the literal scene based on our everyday experience and knowledge about the world. For example, if we see a "20 mph" sign alongside a road, we might as…

    Submitted 25 July, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

    Comments: code, data, models at http://visualabduction.com/

    Journal ref: ECCV 2022

  42. arXiv:2202.02312  [pdf, other]

    cs.CL cs.CV cs.HC

    A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

    Authors: Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer

    Abstract: Vision-language navigation (VLN), in which an agent follows language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset Mobile app Tasks with Itera…

    Submitted 14 August, 2022; v1 submitted 4 February, 2022; originally announced February 2022.

    Comments: Accepted at the European Conference on Computer Vision (ECCV) 2022. This is a new version of the paper with additional experimental results and a few prior implementation bugs fixed

  43. arXiv:2201.12462  [pdf, other]

    cs.LG cs.AI cs.HC cs.RO

    Explaining Reinforcement Learning Policies through Counterfactual Trajectories

    Authors: Julius Frost, Olivia Watkins, Eric Weiner, Pieter Abbeel, Trevor Darrell, Bryan Plummer, Kate Saenko

    Abstract: In order for humans to confidently decide where to employ RL agents for real-world tasks, a human developer must validate that the agent will perform well at test-time. Some policy interpretability methods facilitate this by capturing the policy's decision making in a set of agent rollouts. However, even the most informative trajectories of training time behavior may give little insight into the a…

    Submitted 18 March, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Accepted at ICML HILL 2021 Workshop

    ACM Class: I.2.6

  44. arXiv:2112.05090  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Extending the WILDS Benchmark for Unsupervised Adaptation

    Authors: Shiori Sagawa, Pang Wei Koh, Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, Weihua Hu, Michihiro Yasunaga, Henrik Marklund, Sara Beery, Etienne David, Ian Stavness, Wei Guo, Jure Leskovec, Kate Saenko, Tatsunori Hashimoto, Sergey Levine, Chelsea Finn, Percy Liang

    Abstract: Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribu…

    Submitted 23 April, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

  45. arXiv:2112.02300  [pdf, other]

    cs.CV

    Unsupervised Domain Generalization by Learning a Bridge Across Domains

    Authors: Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

    Abstract: The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generaliz…

    Submitted 17 May, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

  46. arXiv:2112.01698  [pdf, other]

    cs.CV

    Learning to Detect Every Thing in an Open World

    Authors: Kuniaki Saito, Ping Hu, Trevor Darrell, Kate Saenko

    Abstract: Many open-world applications require the detection of novel objects, yet state-of-the-art object detection and instance segmentation networks do not excel at this task. The key issue lies in their assumption that regions without any annotations should be suppressed as negatives, which teaches the model to treat the unannotated objects as background. To address this issue, we propose a simple yet s…

    Submitted 12 April, 2022; v1 submitted 2 December, 2021; originally announced December 2021.

    Comments: Project page is available at https://ksaito-ut.github.io/openworld_ldet/

  47. arXiv:2112.00054  [pdf, other]

    cs.CV cs.LG

    Task2Sim : Towards Effective Pre-training and Transfer from Synthetic Data

    Authors: Samarth Mishra, Rameswar Panda, Cheng Perng Phoo, Chun-Fu Chen, Leonid Karlinsky, Kate Saenko, Venkatesh Saligrama, Rogerio S. Feris

    Abstract: Pre-training models on Imagenet or other massive datasets of real images has led to major advances in computer vision, albeit accompanied with shortcomings related to curation cost, privacy, usage rights, and ethical issues. In this paper, for the first time, we study the transferability of pre-trained models based on synthetic data generated by graphics simulators to downstream tasks from very di…

    Submitted 28 March, 2022; v1 submitted 30 November, 2021; originally announced December 2021.

    Comments: Accepted to CVPR'22

  48. arXiv:2111.13279  [pdf, other]

    cs.CV

    Disentangled Unsupervised Image Translation via Restricted Information Flow

    Authors: Ben Usman, Dina Bashkirova, Kate Saenko

    Abstract: Unsupervised image-to-image translation methods aim to map images from one domain into plausible examples from another domain while preserving structures shared across two domains. In the many-to-many setting, an additional guidance example from the target domain is used to determine domain-specific attributes of the generated image. In the absence of attribute annotations, methods have to infer w…

    Submitted 25 November, 2021; originally announced November 2021.

  49. arXiv:2110.15128  [pdf, other]

    cs.CV

    Contrast and Mix: Temporal Contrastive Video Domain Adaptation with Background Mixing

    Authors: Aadarsh Sahoo, Rutav Shah, Rameswar Panda, Kate Saenko, Abir Das

    Abstract: Unsupervised domain adaptation which aims to adapt models trained on a labeled source domain to a completely unlabeled target domain has attracted much attention in recent years. While many domain adaptation techniques have been proposed for images, the problem of unsupervised domain adaptation in videos remains largely underexplored. In this paper, we introduce Contrast and Mix (CoMix), a new con…

    Submitted 28 October, 2021; originally announced October 2021.

    Comments: Accepted to NeurIPS 2021. Project page: https://cvir.github.io/projects/comix

  50. arXiv:2110.10596  [pdf, other]

    cs.CV cs.LG

    Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos

    Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, Hailin Jin, Bryan Russell

    Abstract: We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. To achieve this goal, we propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.…

    Submitted 2 December, 2021; v1 submitted 20 October, 2021; originally announced October 2021.

    Comments: Accepted at NeurIPS 2021 (Spotlight)