Showing 1–50 of 57 results for author: Wang, X E

  1. arXiv:2406.19263  [pdf, other]

    cs.CL cs.CV

    Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

    Authors: Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang

    Abstract: Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid acce…

    Submitted 27 June, 2024; originally announced June 2024.

  2. arXiv:2406.12831  [pdf, other]

    cs.CV cs.AI cs.MM

    VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

    Authors: Jing Gu, Yuwei Fang, Ivan Skorokhodov, Peter Wonka, Xinya Du, Sergey Tulyakov, Xin Eric Wang

    Abstract: Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemp…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 13 pages, 11 figures

  3. arXiv:2406.09305  [pdf, other]

    cs.CV

    Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

    Authors: Yufan Zhou, Ruiyi Zhang, Kaizhi Zheng, Nanxuan Zhao, Jiuxiang Gu, Zichao Wang, Xin Eric Wang, Tong Sun

    Abstract: In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for a specific subject from an arbitrary testing image in a zero-shot manner. They even outperform methods which require additional fine-tuning on testing imag…

    Submitted 13 June, 2024; originally announced June 2024.

  4. arXiv:2406.08407  [pdf, other]

    cs.CV cs.AI cs.CL

    MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

    Authors: Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

    Abstract: Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multi…

    Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  5. arXiv:2405.20421  [pdf, other]

    cs.AI

    Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

    Authors: Qianqi Yan, Xuehai He, Xiang Yue, Xin Eric Wang

    Abstract: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address thi…

    Submitted 21 June, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  6. arXiv:2405.04834  [pdf, other]

    cs.CV

    FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

    Authors: Xuehai He, Jian Zheng, Jacob Zhiyuan Fang, Robinson Piramuthu, Mohit Bansal, Vicente Ordonez, Gunnar A Sigurdsson, Nanyun Peng, Xin Eric Wang

    Abstract: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexibl…

    Submitted 21 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  7. arXiv:2404.05717  [pdf, other]

    cs.CV cs.AI

    SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

    Authors: Jing Gu, Yilin Wang, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang

    Abstract: Effective editing of personal content holds a pivotal role in enabling individuals to express their creativity, weave captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, w…

    Submitted 6 May, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

    Comments: 18 pages, 16 figures, 3 tables

  8. arXiv:2402.19441  [pdf]

    cs.GR

    3D Gaussian Model for Animation and Texturing

    Authors: Xiangzhi Eric Wang, Zackary P. T. Sin

    Abstract: 3D Gaussian Splatting has made a marked impact on neural rendering by achieving impressive fidelity and performance. Despite this achievement, however, it is not readily applicable to developing interactive applications. Real-time applications like XR apps and games require functions such as animation, UV-mapping, and model editing simultaneously manipulated through the usage of a 3D model. We pro…

    Submitted 29 February, 2024; originally announced February 2024.

  9. arXiv:2401.15847  [pdf, other]

    cs.CV cs.AI cs.CL

    Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA

    Authors: Yue Fan, Jing Gu, Kaiwen Zhou, Qianqi Yan, Shan Jiang, Ching-Chen Kuo, Xinze Guan, Xin Eric Wang

    Abstract: Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning i…

    Submitted 27 June, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

    Comments: ACL 2024

  10. arXiv:2310.05872  [pdf, other]

    cs.CV cs.AI cs.CL

    ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

    Authors: Kaiwen Zhou, Kwonjoon Lee, Teruhisa Misu, Xin Eric Wang

    Abstract: In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) on visual commonsense reasoning (VCR) problems. We find that VLMs and LLM-based decision pipelines are good at different kinds of VCR problems. Pre-trained VLMs exhibit strong performance for problems involving understanding the literal visual content, which we no…

    Submitted 17 May, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

  11. arXiv:2310.03903  [pdf, other]

    cs.CL cs.MA

    LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models

    Authors: Saaket Agashe, Yue Fan, Anthony Reyna, Xin Eric Wang

    Abstract: The emergent reasoning and Theory of Mind (ToM) abilities demonstrated by Large Language Models (LLMs) make them promising candidates for developing coordination agents. In this study, we introduce a new LLM-Coordination Benchmark aimed at a detailed analysis of LLMs within the context of Pure Coordination Games, where participating agents need to cooperate for the most gain. This benchmark evalua…

    Submitted 2 April, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

  12. arXiv:2310.02239  [pdf, other]

    cs.CV cs.AI

    MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

    Authors: Kaizhi Zheng, Xuehai He, Xin Eric Wang

    Abstract: The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements c…

    Submitted 15 March, 2024; v1 submitted 3 October, 2023; originally announced October 2023.

    Comments: 23 pages, 10 figures

  13. arXiv:2306.00905  [pdf, other]

    cs.CL cs.AI cs.CV

    T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation

    Authors: Jialu Wang, Xinyue Gabby Liu, Zonglin Di, Yang Liu, Xin Eric Wang

    Abstract: Warning: This paper contains content that may be toxic, harmful, or offensive. In the last few years, text-to-image generative models have gained remarkable success in generating images with unprecedented quality accompanied by a breakthrough of inference speed. Despite their rapid progress, human biases that manifest in the training examples, particularly with regard to common stereoty…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: ACL 2023

    ACM Class: I.2.6

  14. arXiv:2305.18286  [pdf, other]

    cs.CV cs.AI

    Photoswap: Personalized Subject Swapping in Images

    Authors: Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang

    Abstract: In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present Photoswap, a novel approach that enables th…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: 14 pages

  15. arXiv:2305.15393  [pdf, other]

    cs.CV cs.AI

    LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

    Authors: Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang

    Abstract: Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual genera…

    Submitted 28 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  16. arXiv:2305.14260  [pdf, other]

    cs.CL

    R2H: Building Multimodal Navigation Helpers that Respond to Help Requests

    Authors: Yue Fan, Jing Gu, Kaizhi Zheng, Xin Eric Wang

    Abstract: Intelligent navigation-helper agents are critical as they can navigate users in unknown areas through environmental awareness and conversational ability, serving as potential accessibility tools for individuals with disabilities. In this work, we first introduce a novel benchmark, Respond to Help Requests (R2H), to promote the development of multi-modal navigation helpers capable of responding to…

    Submitted 17 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  17. arXiv:2305.11317  [pdf, other]

    cs.CL

    Collaborative Generative AI: Integrating GPT-k for Efficient Editing in Text-to-Image Generation

    Authors: Wanrong Zhu, Xinyi Wang, Yujie Lu, Tsu-Jui Fu, Xin Eric Wang, Miguel Eckstein, William Yang Wang

    Abstract: The field of text-to-image (T2I) generation has garnered significant attention both within the research community and among everyday users. Despite the advancements of T2I models, a common issue encountered by users is the need for repetitive editing of input prompts in order to receive a satisfactory image, which is time-consuming and labor-intensive. Given the demonstrated text generation power…

    Submitted 28 October, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  18. arXiv:2305.11116  [pdf, other]

    cs.CV cs.CL

    LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation

    Authors: Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, William Yang Wang

    Abstract: Existing automatic evaluation on text-to-image synthesis can only provide an image-text matching score, without considering the object-level compositionality, which results in poor correlation with human judgments. In this work, we propose LLMScore, a new framework that offers evaluation scores with multi-granularity compositionality. LLMScore leverages large language models (LLMs) to evaluate…

    Submitted 18 May, 2023; originally announced May 2023.

  19. arXiv:2305.10722  [pdf, other]

    cs.CV

    Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

    Authors: Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang

    Abstract: Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To an…

    Submitted 24 April, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  20. arXiv:2305.03510  [pdf, other]

    cs.CL cs.AI

    Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment

    Authors: Zhen Zhang, Jialu Wang, Xin Eric Wang

    Abstract: Pre-trained vision and language models such as CLIP have witnessed remarkable success in connecting images and texts with a primary focus on English texts. Despite recent efforts to extend CLIP to support other languages, disparities in performance among different languages have been observed due to uneven resource availability. Additionally, current cross-lingual transfer methods of those pre-tra…

    Submitted 28 October, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: Findings of EMNLP

  21. arXiv:2305.01795  [pdf, other]

    cs.CL

    Multimodal Procedural Planning via Dual Text-Image Prompting

    Authors: Yujie Lu, Pan Lu, Zhiyu Chen, Wanrong Zhu, Xin Eric Wang, William Yang Wang

    Abstract: Embodied agents have achieved prominent performance in following human instructions to complete tasks. However, the potential of providing instructions informed by texts and images to assist humans in completing tasks remains underexplored. To uncover this capability, we present the multimodal procedural planning (MPP) task, in which models are given a high-level goal and generate plans of paired…

    Submitted 2 May, 2023; originally announced May 2023.

  22. arXiv:2305.00581  [pdf, other]

    cs.CV cs.AI cs.CL

    Multimodal Graph Transformer for Multimodal Question Answering

    Authors: Xuehai He, Xin Eric Wang

    Abstract: Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and pro…

    Submitted 30 April, 2023; originally announced May 2023.

  23. arXiv:2301.13166  [pdf, other]

    cs.AI cs.CV cs.LG cs.RO

    ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation

    Authors: Kaiwen Zhou, Kaizhi Zheng, Connor Pryor, Yilin Shen, Hongxia Jin, Lise Getoor, Xin Eric Wang

    Abstract: The ability to accurately locate and navigate to a specific object is a crucial capability for embodied agents that operate in the real world and interact with objects to complete tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects, which generalizes poorly to novel objects in unknown environments. In this work, we present a novel ze…

    Submitted 6 July, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

  24. arXiv:2212.05032  [pdf, other]

    cs.CV cs.CL

    Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

    Authors: Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

    Abstract: Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribute-binding and compositional capabilities are still considered major challenging issues, especially when involving multiple objects. In this work, we improve the compositional skills of T2I models, s…

    Submitted 28 February, 2023; v1 submitted 9 December, 2022; originally announced December 2022.

    Comments: ICLR 2023 Camera Ready version

  25. arXiv:2211.14769  [pdf, other]

    cs.AI cs.CL cs.CR cs.CV

    Navigation as Attackers Wish? Towards Building Robust Embodied Agents under Federated Learning

    Authors: Yunchao Zhang, Zonglin Di, Kaiwen Zhou, Cihang Xie, Xin Eric Wang

    Abstract: Federated embodied agent learning protects the data privacy of individual visual environments by keeping data locally at each client (the individual environment) during training. However, since the local data is inaccessible to the server under federated learning, attackers may easily poison the training data of the local client to build a backdoor in the agent without notice. Deploying such an ag…

    Submitted 16 March, 2024; v1 submitted 27 November, 2022; originally announced November 2022.

  26. arXiv:2211.13854  [pdf, other]

    cs.CV cs.AI cs.CL

    ComCLIP: Training-Free Compositional Image and Text Matching

    Authors: Kenan Jiang, Xuehai He, Ruize Xu, Xin Eric Wang

    Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching -- a more challenging image and text matching task requiring the model to understand compositional word concepts and visual components. Towards bett…

    Submitted 12 April, 2024; v1 submitted 24 November, 2022; originally announced November 2022.

  27. arXiv:2210.10362  [pdf, other]

    cs.CV cs.AI cs.CL

    CPL: Counterfactual Prompt Learning for Vision and Language Models

    Authors: Xuehai He, Diji Yang, Weixi Feng, Tsu-Jui Fu, Arjun Akula, Varun Jampani, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang

    Abstract: Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a no…

    Submitted 4 November, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

  28. arXiv:2210.03765  [pdf, other]

    cs.CL cs.AI

    Visualize Before You Write: Imagination-Guided Open-Ended Text Generation

    Authors: Wanrong Zhu, An Yan, Yujie Lu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang

    Abstract: Recent advances in text-to-image synthesis make it possible to visualize machine imaginations for a given context. On the other hand, when generating text, human writers are gifted at creative visualization, which enhances their writings by forming imaginations as blueprints before putting down the stories in words. Inspired by such a cognitive process, we ask the natural question of whether we ca…

    Submitted 14 February, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: EACL 2023

  29. arXiv:2209.04725  [pdf, other]

    cs.CV cs.CL

    Anticipating the Unseen Discrepancy for Vision and Language Navigation

    Authors: Yujie Lu, Huiliang Zhang, Ping Nie, Weixi Feng, Wenda Xu, Xin Eric Wang, William Yang Wang

    Abstract: Vision-Language Navigation requires the agent to follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well. Previous studies propose data augmentation methods to mitigate the data bias explicitly or implicitly and provide improvements in generalization. However, they try to memor…

    Submitted 10 September, 2022; originally announced September 2022.

  30. arXiv:2208.13266  [pdf, other]

    cs.AI cs.CL cs.CV cs.RO

    JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

    Authors: Kaizhi Zheng, Kaiwen Zhou, Jing Gu, Yue Fan, Jialu Wang, Zonglin Di, Xuehai He, Xin Eric Wang

    Abstract: Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task…

    Submitted 7 September, 2022; v1 submitted 28 August, 2022; originally announced August 2022.

    Comments: 20 pages

  31. arXiv:2206.15437  [pdf, other]

    cs.LG cs.CY

    Understanding Instance-Level Impact of Fairness Constraints

    Authors: Jialu Wang, Xin Eric Wang, Yang Liu

    Abstract: A variety of fairness constraints have been proposed in the literature to mitigate group-level statistical bias. Their impacts have been largely evaluated for different groups of populations corresponding to a set of sensitive attributes, such as race or gender. Nonetheless, the community has not observed sufficient explorations for how imposing fairness constraints fares at an instance level. Buil…

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: 17 pages, 6 figures, ICML 2022

    ACM Class: I.2.6

  32. arXiv:2206.08522  [pdf, other]

    cs.RO cs.CL cs.CV

    VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

    Authors: Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, Xin Eric Wang

    Abstract: Benefiting from language flexibility and compositionality, humans naturally intend to use language to command an embodied agent for complex tasks such as navigation and object manipulation. In this work, we aim to fill the blank of the last mile of embodied agents -- object manipulation by following human guidance, e.g., "move the red mug next to the box while keeping it upright." To this end, we…

    Submitted 17 August, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

  33. arXiv:2206.02928  [pdf, other]

    cs.CL cs.AI cs.LG

    Neuro-Symbolic Procedural Planning with Commonsense Prompting

    Authors: Yujie Lu, Weixi Feng, Wanrong Zhu, Wenda Xu, Xin Eric Wang, Miguel Eckstein, William Yang Wang

    Abstract: Procedural planning aims to implement complex high-level goals by decomposition into sequential simpler low-level steps. Although procedural planning is a basic skill set for humans in daily life, it remains a challenge for large language models (LLMs) that lack a deep understanding of the cause-effect relations in procedures. Previous methods require manual exemplars to acquire procedural plannin…

    Submitted 16 February, 2023; v1 submitted 6 June, 2022; originally announced June 2022.

    Comments: ICLR 2023 Spotlight

  34. arXiv:2205.12219  [pdf, other]

    cs.CV cs.AI cs.CL

    Aerial Vision-and-Dialog Navigation

    Authors: Yue Fan, Winson Chen, Tongzhou Jiang, Chun Zhou, Yi Zhang, Xin Eric Wang

    Abstract: The ability to converse with humans and follow natural language commands is crucial for intelligent unmanned aerial vehicles (a.k.a. drones). It can relieve people's burden of holding a controller all the time, allow multitasking, and make drone control more accessible for people with disabilities or with their hands occupied. To this end, we introduce Aerial Vision-and-Dialog Navigation (AVDN), t…

    Submitted 1 June, 2023; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: Accepted by ACL 2023 Findings

  35. arXiv:2204.08535  [pdf, other]

    cs.CL

    Imagination-Augmented Natural Language Understanding

    Authors: Yujie Lu, Wanrong Zhu, Xin Eric Wang, Miguel Eckstein, William Yang Wang

    Abstract: Human brains integrate linguistic and perceptual information simultaneously to understand natural language, and hold the critical ability to render imaginations. Such abilities enable us to construct new abstract concepts or concrete objects, and are essential in involving practical knowledge to solve problems in low-resource scenarios. However, most existing methods for Natural Language Understan…

    Submitted 3 May, 2022; v1 submitted 18 April, 2022; originally announced April 2022.

    Comments: NAACL 2022 Main Conference

  36. arXiv:2203.16329  [pdf, other]

    cs.CV cs.AI

    Parameter-efficient Model Adaptation for Vision Transformers

    Authors: Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, Xin Eric Wang

    Abstract: In computer vision, great transfer learning performance has been achieved by adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classificati…

    Submitted 13 July, 2023; v1 submitted 29 March, 2022; originally announced March 2022.

  37. arXiv:2203.14936  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG

    FedVLN: Privacy-preserving Federated Vision-and-Language Navigation

    Authors: Kaiwen Zhou, Xin Eric Wang

    Abstract: Data privacy is a central problem for embodied agents that can perceive the environment, communicate with humans, and act in the real world. While helping humans complete tasks, the agent may observe and process sensitive information of users, such as house environments, human activities, etc. In this work, we introduce privacy-preserving embodied agent learning for the task of Vision-and-Language…

    Submitted 23 September, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: Accepted by ECCV 2022

  38. arXiv:2203.14474  [pdf, other]

    cs.CL

    Interpretable Research Replication Prediction via Variational Contextual Consistency Sentence Masking

    Authors: Tianyi Luo, Rui Meng, Xin Eric Wang, Yang Liu

    Abstract: Research Replication Prediction (RRP) is the task of predicting whether a published research result can be replicated or not. Building an interpretable neural text classifier for RRP promotes the understanding of why a research paper is predicted as replicable or non-replicable and therefore makes its real-world application more reliable and trustworthy. However, the prior works on model interpret…

    Submitted 27 March, 2022; originally announced March 2022.

  39. arXiv:2203.13049  [pdf, other]

    cs.CV

    Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

    Authors: Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang

    Abstract: Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in lingu…

    Submitted 28 March, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022

  40. arXiv:2203.12667  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions

    Authors: Jing Gu, Eliana Stefani, Qi Wu, Jesse Thomason, Xin Eric Wang

    Abstract: A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from natural language processing, computer vision, robotics, and machine learning c…

    Submitted 3 June, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

    Comments: 19 pages. Accepted to ACL 2022

    Journal ref: ACL 2022, Long, pages 7606–7623, Dublin, Ireland. Association for Computational Linguistics

  41. arXiv:2112.00967  [pdf, other]

    cs.CV cs.AI cs.CL

    Relational Graph Learning for Grounded Video Description Generation

    Authors: Wenqiao Zhang, Xin Eric Wang, Siliang Tang, Haizhou Shi, Haocheng Shi, Jun Xiao, Yueting Zhuang, William Yang Wang

    Abstract: Grounded video description (GVD) encourages captioning models to attend to appropriate video regions (e.g., objects) dynamically and generate a description. Such a setting can help explain the decisions of captioning models and prevent the model from hallucinating object words in its description. However, such design mainly focuses on object word generation and thus may ignore fine-grained inform…

    Submitted 1 December, 2021; originally announced December 2021.

    Comments: 10 pages, 5 figures, ACM MM 2020

  42. arXiv:2109.05433  [pdf, other]

    cs.CV cs.CL

    Are Gender-Neutral Queries Really Gender-Neutral? Mitigating Gender Bias in Image Search

    Authors: Jialu Wang, Yang Liu, Xin Eric Wang

    Abstract: Internet search affects people's cognition of the world, so mitigating biases in search results and learning fair models is imperative for social good. We study a unique gender bias in image search in this work: the search images are often gender-imbalanced for gender-neutral natural language queries. We diagnose two typical image search models, the specialized model trained on in-domain datasets…

    Submitted 12 September, 2021; originally announced September 2021.

    Comments: 14 pages, EMNLP 2021

    ACM Class: I.2.7

  43. arXiv:2106.10852  [pdf, other]

    cs.CV

    CUDA-GHR: Controllable Unsupervised Domain Adaptation for Gaze and Head Redirection

    Authors: Swati Jindal, Xin Eric Wang

    Abstract: The robustness of gaze and head pose estimation models is highly dependent on the amount of labeled data. Recently, generative modeling has shown excellent results in generating photo-realistic images, which can alleviate the need for annotations. However, adopting such generative models to new domains while maintaining their ability to provide fine-grained control over different image attributes,…

    Submitted 19 September, 2022; v1 submitted 21 June, 2021; originally announced June 2021.

    Comments: Accepted at WACV2023, Camera-ready version

  44. arXiv:2106.06683  [pdf, other]

    cs.CL cs.AI cs.LG

    Assessing Multilingual Fairness in Pre-trained Multimodal Representations

    Authors: Jialu Wang, Yang Liu, Xin Eric Wang

    Abstract: Recently pre-trained multimodal models, such as CLIP, have shown exceptional capabilities towards connecting images and natural language. The textual representations in English can be desirably transferred to multilingualism and support downstream multimodal tasks for different languages. Nevertheless, the principle of multilingual fairness is rarely scrutinized: do multilingual multimodal models…

    Submitted 16 March, 2022; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: 15 pages, 18 figures

  45. arXiv:2106.05970  [pdf, other]

    cs.CL cs.AI cs.CV

    ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation

    Authors: Wanrong Zhu, Xin Eric Wang, An Yan, Miguel Eckstein, William Yang Wang

    Abstract: Automatic evaluations for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with text references. This differs from human language processing, for which visual imagination often improves comprehension. In this work, we propose ImaginE, an imagination-based automatic evaluation metric for natural language generation. With the help of StableDiffusion…

    Submitted 14 February, 2023; v1 submitted 10 June, 2021; originally announced June 2021.

    Comments: EACL 2023

  46. arXiv:2106.04632  [pdf, other]

    cs.CV cs.CL

    VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

    Authors: Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai, Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg, Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu

    Abstract: Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task. In reality, a truly useful VidL system is expected to be easily generalizable to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 p…

    Submitted 18 August, 2021; v1 submitted 8 June, 2021; originally announced June 2021.

    Comments: To appear in 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks

  47. arXiv:2106.00178  [pdf, other]

    cs.CV

    Language-Driven Image Style Transfer

    Authors: Tsu-Jui Fu, Xin Eric Wang, William Yang Wang

    Abstract: Despite having promising results, style transfer, which requires preparing style images in advance, may result in lack of creativity and accessibility. Following human instruction, on the other hand, is the most natural way to perform artistic style transfer that can significantly improve controllability for visual effect applications. We introduce a new task, language-driven artistic style transf…

    Submitted 17 July, 2022; v1 submitted 31 May, 2021; originally announced June 2021.

    Comments: ECCV'22

  48. arXiv:2104.01122  [pdf, other]

    cs.CV

    M3L: Language-based Video Editing via Multi-Modal Multi-Level Transformers

    Authors: Tsu-Jui Fu, Xin Eric Wang, Scott T. Grafton, Miguel P. Eckstein, William Yang Wang

    Abstract: Video editing tools are widely used nowadays for digital design. Although the demand for these tools is high, the prior knowledge required makes it difficult for novices to get started. Systems that could follow natural language instructions to perform automatic editing would significantly improve accessibility. This paper introduces the language-based video editing (LBVE) task, which allows the m…

    Submitted 18 March, 2022; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: CVPR'22

  49. arXiv:2103.16561  [pdf, other]

    cs.CV cs.AI cs.CL

    Diagnosing Vision-and-Language Navigation: What Really Matters

    Authors: Wanrong Zhu, Yuankai Qi, Pradyumna Narayana, Kazoo Sone, Sugato Basu, Xin Eric Wang, Qi Wu, Miguel Eckstein, William Yang Wang

    Abstract: Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments. Multiple setups have been proposed, and researchers apply new model architectures or training techniques to boost navigation performance. However, there still exist non-negligible gaps between machines' performance and human benchmarks. Moreover, the…

    Submitted 4 May, 2022; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: NAACL 2022

  50. arXiv:2102.01860  [pdf, other]

    cs.CV cs.CL

    L2C: Describing Visual Differences Needs Semantic Understanding of Individuals

    Authors: An Yan, Xin Eric Wang, Tsu-Jui Fu, William Yang Wang

    Abstract: Recent advances in language and vision push forward the research of captioning a single image to describing visual differences between image pairs. Suppose there are two images, I_1 and I_2, and the task is to generate a description W_{1,2} comparing them, existing methods directly model { I_1, I_2 } -> W_{1,2} mapping without the semantic understanding of individuals. In this paper, we introduce…

    Submitted 2 February, 2021; originally announced February 2021.

    Comments: EACL-2021 short