Showing 1–25 of 25 results for author: Lal, V

Searching in archive cs.
  1. arXiv:2407.02333  [pdf, other]

    cs.CL cs.CV

    Why do LLaVA Vision-Language Models Reply to Images in English?

    Authors: Musashi Hinck, Carolin Holtermann, Matthew Lyle Olson, Florian Schneider, Sungduk Yu, Anahita Bhiwandiwalla, Anne Lauscher, Shaoyen Tseng, Vasudev Lal

    Abstract: We uncover a surprising multilingual bias occurring in a popular class of multimodal vision-language models (VLMs). Including an image in the query to a LLaVA-style VLM significantly increases the likelihood of the model returning an English response, regardless of the language of the query. This paper investigates the causes of this loss with a two-pronged approach that combines extensive ablatio…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Pre-print

  2. arXiv:2406.19593  [pdf, other]

    cs.CL cs.CV

    SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs

    Authors: Xin Su, Man Luo, Kris W Pan, Tien Pei Chou, Vasudev Lal, Phillip Howard

    Abstract: Synthetic data generation has gained significant attention recently for its utility in training large vision and language models. However, the application of synthetic data to the training of multimodal context-augmented generation systems has been relatively unexplored. This gap in existing work is important because existing vision and language models (VLMs) are not trained specifically for conte…

    Submitted 27 June, 2024; originally announced June 2024.

  3. arXiv:2406.01843  [pdf, other]

    cs.CV

    L-MAGIC: Language Model Assisted Generation of Images with Coherence

    Authors: Zhipeng Cai, Matthias Mueller, Reiner Birkl, Diana Wofk, Shao-Yen Tseng, JunDa Cheng, Gabriela Ben-Melech Stan, Vasudev Lal, Michael Paulitsch

    Abstract: In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted to CVPR 2024

  4. arXiv:2404.03118  [pdf, other]

    cs.CV

    LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models

    Authors: Gabriela Ben Melech Stan, Estelle Aflalo, Raanan Yehezkel Rohekar, Anahita Bhiwandiwalla, Shao-Yen Tseng, Matthew Lyle Olson, Yaniv Gurwicz, Chenfei Wu, Nan Duan, Vasudev Lal

    Abstract: In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, y…

    Submitted 24 June, 2024; v1 submitted 3 April, 2024; originally announced April 2024.

  5. arXiv:2404.01331  [pdf, other]

    cs.CL cs.AI

    LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

    Authors: Musashi Hinck, Matthew L. Olson, David Cobbley, Shao-Yen Tseng, Vasudev Lal

    Abstract: We train a suite of multimodal foundation models (MMFM) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pre…

    Submitted 10 June, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

    Comments: CVPR 2024, MMFM workshop. Authors 1 and 2 contributed equally. Models available at https://huggingface.co/intel/llava-gemma-2b/ and https://huggingface.co/intel/llava-gemma-7b/. Training code at https://github.com/IntelLabs/multimodal_cognitive_ai/tree/main/LLaVA-Gemma

  6. arXiv:2404.01197  [pdf, other]

    cs.CV

    Getting it Right: Improving Spatial Consistency in Text-to-Image Models

    Authors: Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, Yezhou Yang

    Abstract: One of the key shortcomings in current text-to-image (T2I) models is their inability to consistently generate images which faithfully follow the spatial relationships specified in the text prompt. In this paper, we offer a comprehensive investigation of this limitation, while also developing datasets and methods that achieve state-of-the-art performance. First, we find that current vision-language…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: Project webpage: https://spright-t2i.github.io/

  7. arXiv:2312.00825  [pdf, other]

    cs.CV cs.AI

    SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples

    Authors: Phillip Howard, Avinash Madasu, Tiep Le, Gustavo Lujan Moreno, Anahita Bhiwandiwalla, Vasudev Lal

    Abstract: While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually while ignoring biases associated with intersections between social attributes. This could be…

    Submitted 9 April, 2024; v1 submitted 30 November, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024. arXiv admin note: text overlap with arXiv:2310.02988

  8. arXiv:2311.12229  [pdf, other]

    cs.AI

    NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation

    Authors: Shachar Rosenman, Vasudev Lal, Phillip Howard

    Abstract: Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive framework that automatically enhances a user's prompt to improve the quality of generations produced by text-to-image models. Our framework utilizes constrained…

    Submitted 5 April, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

    Comments: Accepted to EACL 2024 System Demonstration Track

  9. arXiv:2311.03226  [pdf, other]

    cs.CV cs.AI

    LDM3D-VR: Latent Diffusion Model for 3D VR

    Authors: Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, Vasudev Lal

    Abstract: Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: Accepted to Workshop on Diffusion Models, NeurIPS 2023

  10. arXiv:2310.04914  [pdf, other]

    cs.CV cs.AI cs.CL

    Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

    Authors: Avinash Madasu, Anahita Bhiwandiwalla, Vasudev Lal

    Abstract: Foundational multimodal models pre-trained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is not always feasible due to the difficulty in collecting large-scale clean and aligned data, and exponential computational costs involved in the pretraining phase. The…

    Submitted 24 November, 2023; v1 submitted 7 October, 2023; originally announced October 2023.

  11. arXiv:2310.02988  [pdf, other]

    cs.CV cs.AI

    Probing Intersectional Biases in Vision-Language Models with Counterfactual Examples

    Authors: Phillip Howard, Avinash Madasu, Tiep Le, Gustavo Lujan Moreno, Vasudev Lal

    Abstract: While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually while ignoring biases associated with intersections between social attributes. This could be…

    Submitted 4 October, 2023; originally announced October 2023.

  12. arXiv:2309.14356  [pdf, other]

    cs.LG cs.CL cs.CV

    COCO-Counterfactuals: Automatically Constructed Counterfactual Examples for Image-Text Pairs

    Authors: Tiep Le, Vasudev Lal, Phillip Howard

    Abstract: Counterfactual examples have proven to be valuable in the field of natural language processing (NLP) for both evaluating and improving the robustness of language models to spurious correlations in datasets. Despite their demonstrated utility for NLP, multimodal counterfactual examples have been relatively unexplored due to the difficulty of creating paired image-text data with minimal counterfactu…

    Submitted 31 October, 2023; v1 submitted 22 September, 2023; originally announced September 2023.

    Comments: Accepted to NeurIPS 2023 Datasets and Benchmarks Track

  13. arXiv:2306.16533  [pdf, other]

    cs.CV cs.AI cs.CL

    ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models

    Authors: Avinash Madasu, Vasudev Lal

    Abstract: Video retrieval (VR) involves retrieving the ground truth video from the video database given a text caption or vice-versa. The two important components of compositionality, objects & attributes and actions, are joined using correct syntax to form a proper text query. These components (objects & attributes, actions and syntax) each play an important role to help distinguish among videos and retriev…

    Submitted 10 June, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

  14. arXiv:2306.00103  [pdf, other]

    cs.CV cs.CL cs.LG

    ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

    Authors: Xiao Xu, Bei Li, Chenfei Wu, Shao-Yen Tseng, Anahita Bhiwandiwalla, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan

    Abstract: Two-Tower Vision-Language (VL) models have shown promising improvements on various downstream VL tasks. Although the most advanced work improves performance by building bridges between encoders, it suffers from ineffective layer-by-layer utilization of uni-modal representations and cannot flexibly exploit different levels of uni-modal semantic knowledge. In this work, we propose ManagerTower, a no…

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: Accepted by ACL 2023 Main Conference, Oral

  15. arXiv:2305.12248  [pdf, other]

    cs.CL cs.CV

    Brain encoding models based on multimodal transformers can transfer across language and vision

    Authors: Jerry Tang, Meng Du, Vy A. Vo, Vasudev Lal, Alexander G. Huth

    Abstract: Encoding models have been used to assess how the human brain represents concepts in language and vision. While language and vision rely on similar concept representations, current encoding models are typically trained and tested on brain responses to each modality in isolation. Recent advances in multimodal pretraining have produced transformers that can extract aligned representations of concepts…

    Submitted 20 May, 2023; originally announced May 2023.

  16. arXiv:2305.10853  [pdf, other]

    cs.CV

    LDM3D: Latent Diffusion Model for 3D

    Authors: Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal

    Abstract: This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that generates both image and depth map data from a given text prompt, allowing users to generate RGBD images from text prompts. The LDM3D model is fine-tuned on a dataset of tuples containing an RGB image, depth map and caption, and validated through extensive experiments. We also develop an application called DepthFusion, which…

    Submitted 21 May, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

  17. arXiv:2305.04978  [pdf, other]

    cs.CL

    NeuroComparatives: Neuro-Symbolic Distillation of Comparative Knowledge

    Authors: Phillip Howard, Junlin Wang, Vasudev Lal, Gadi Singer, Yejin Choi, Swabha Swayamdipta

    Abstract: Comparative knowledge (e.g., steel is stronger and heavier than styrofoam) is an essential component of our world knowledge, yet understudied in prior literature. In this paper, we harvest the dramatic improvements in knowledge capabilities of language models into a large-scale comparative knowledge base. While the ease of acquisition of such comparative knowledge is much higher from extreme-scale…

    Submitted 5 April, 2024; v1 submitted 8 May, 2023; originally announced May 2023.

    Comments: Accepted to NAACL 2024 Findings

  18. Thrill-K Architecture: Towards a Solution to the Problem of Knowledge Based Understanding

    Authors: Gadi Singer, Joscha Bach, Tetiana Grinberg, Nagib Hakim, Phillip Howard, Vasudev Lal, Zev Rivlin

    Abstract: While end-to-end learning systems are rapidly gaining capabilities and popularity, the increasing computational demands for deploying such systems, along with a lack of flexibility, adaptability, explainability, reasoning and verification capabilities, require new types of architectures. Here we introduce a classification of hybrid systems which, based on an analysis of human knowledge and intelli…

    Submitted 28 February, 2023; originally announced March 2023.

    Comments: Artificial General Intelligence: 15th International Conference, AGI 2022, Seattle, WA, USA, August 2022, Proceedings

    Journal ref: Springer Lecture Notes in Computer Science, vol 13539, 2023

  19. arXiv:2302.05016  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Is Multimodal Vision Supervision Beneficial to Language?

    Authors: Avinash Madasu, Vasudev Lal

    Abstract: Vision (image and video)-Language (VL) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks like image retrieval, video retrieval, visual question answering, etc. These models are trained in an unsupervised way and greatly benefit from the complementary modality supervision. In this paper, we explore if the language representations trained using…

    Submitted 14 April, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

  20. arXiv:2210.12365  [pdf, other]

    cs.CL

    NeuroCounterfactuals: Beyond Minimal-Edit Counterfactuals for Richer Data Augmentation

    Authors: Phillip Howard, Gadi Singer, Vasudev Lal, Yejin Choi, Swabha Swayamdipta

    Abstract: While counterfactual data augmentation offers a promising step towards robust generalization in natural language processing, producing a set of counterfactuals that offer valuable inductive bias for models remains a challenge. Most existing approaches for producing counterfactuals, manual or automated, rely on small perturbations via minimal edits, resulting in simplistic changes. We introduce Neu…

    Submitted 22 October, 2022; originally announced October 2022.

    Comments: Findings of EMNLP 2022

  21. Cross-Domain Aspect Extraction using Transformers Augmented with Knowledge Graphs

    Authors: Phillip Howard, Arden Ma, Vasudev Lal, Ana Paula Simoes, Daniel Korat, Oren Pereg, Moshe Wasserblat, Gadi Singer

    Abstract: The extraction of aspect terms is a critical step in fine-grained sentiment analysis of text. Existing approaches for this task have yielded impressive results when the training and testing data are from the same domain. However, these methods show a drastic decrease in performance when applied to cross-domain settings where the domain of the testing data differs from that of the training data. To…

    Submitted 18 October, 2022; originally announced October 2022.

    ACM Class: I.2.7

    Journal ref: Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM 2022). Association for Computing Machinery, New York, NY, USA, 780-790

  22. arXiv:2208.11553  [pdf, other]

    cs.CV

    MuMUR : Multilingual Multimodal Universal Retrieval

    Authors: Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal

    Abstract: Multi-modal retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval. We first use state-of-th…

    Submitted 19 September, 2023; v1 submitted 24 August, 2022; originally announced August 2022.

    Comments: This is an extension of the previous MKTVR paper (for which you can find a reference here: https://dl.acm.org/doi/abs/10.1007/978-3-031-28244-7_42 or in a previous version on arXiv). This version was published in the Information Retrieval Journal

  23. arXiv:2206.08657  [pdf, other]

    cs.CV cs.CL cs.LG

    BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning

    Authors: Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, Nan Duan

    Abstract: Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cr…

    Submitted 26 March, 2024; v1 submitted 17 June, 2022; originally announced June 2022.

    Comments: Accepted by AAAI 2023, Oral

  24. arXiv:2203.17247  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

    Authors: Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, Vasudev Lal

    Abstract: Breakthroughs in transformer-based models have revolutionized not only the NLP field, but also vision and multimodal systems. However, although visualization and interpretability tools have become available for NLP models, internal mechanisms of vision and multimodal transformers remain largely opaque. With the success of these transformers, it is increasingly critical to understand their inner wo…

    Submitted 22 August, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Best Demo Award at CVPR 2022

  25. arXiv:2109.10504  [pdf, other]

    cs.CV

    KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

    Authors: Yongfei Liu, Chenfei Wu, Shao-yen Tseng, Vasudev Lal, Xuming He, Nan Duan

    Abstract: Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performances on a broad scope of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy relying on external object detectors to encode images in a multi-modal Transformer framework…

    Submitted 7 August, 2022; v1 submitted 21 September, 2021; originally announced September 2021.