Showing 1–50 of 220 results for author: Wang, W Y

  1. arXiv:2406.16851

    cs.CL cs.AI cs.CV

    Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts

    Authors: Aditya Sharma, Michael Saxon, William Yang Wang

    Abstract: We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Under review

  2. arXiv:2406.14867

    cs.LG cs.AI cs.CL

    DistiLRR: Transferring Code Repair for Low-Resource Programming Languages

    Authors: Kyle Wong, Alfonso Amayuelas, Liangming Pan, William Yang Wang

    Abstract: Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent application of LLMs for code generation is iterative code repair, where a model fixes an incorrect program by rationalizing about errors and generating a new program. However, code repair is primarily studied on high-resource languages like Python, and the framework's efficacy is under-explored on low…

    Submitted 21 June, 2024; originally announced June 2024.

  3. arXiv:2406.13869

    cs.LG q-bio.BM

    Global Human-guided Counterfactual Explanations for Molecular Properties via Reinforcement Learning

    Authors: Danqing Wang, Antonis Antoniades, Kha-Dinh Luong, Edwin Zhang, Mert Kosan, Jiachen Li, Ambuj Singh, William Yang Wang, Lei Li

    Abstract: Counterfactual explanations of Graph Neural Networks (GNNs) offer a powerful way to understand data that can naturally be represented by a graph structure. Furthermore, in many domains, it is highly desirable to derive data-driven global explanations or rules that can better explain the high-level properties of the models and data in question. However, evaluating global counterfactual explanations…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by KDD 2024

  4. arXiv:2406.12168

    cs.LG cs.AI cs.CL

    BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM

    Authors: Wenda Xu, Jiachen Li, William Yang Wang, Lei Li

    Abstract: Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of onli…

    Submitted 19 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Wenda Xu and Jiachen Li contributed equally

  5. arXiv:2406.11069

    cs.CV cs.AI cs.CL

    WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

    Authors: Yujie Lu, Dongfu Jiang, Wenhu Chen, William Yang Wang, Yejin Choi, Bill Yuchen Lin

    Abstract: Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: link: https://hf.co/spaces/WildVision/vision-arena

  6. arXiv:2406.08656

    cs.CV cs.AI cs.CL

    TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

    Authors: Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang

    Abstract: Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world v…

    Submitted 12 June, 2024; originally announced June 2024.

  7. arXiv:2406.08407

    cs.CV cs.AI cs.CL

    MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

    Authors: Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

    Abstract: Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multi…

    Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  8. arXiv:2406.07546

    cs.CV cs.AI cs.CL

    Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

    Authors: Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth

    Abstract: We present a novel task and benchmark for evaluating the ability of text-to-image (T2I) generation models to produce images that fit commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I model…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Text-to-Image Generation, Commonsense, Project Url: https://zeyofu.github.io/CommonsenseT2I/

  9. arXiv:2405.20535

    cs.AI cs.CL

    Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning

    Authors: Xinlu Zhang, Zhiyu Zoey Chen, Xi Ye, Xianjun Yang, Lichang Chen, William Yang Wang, Linda Ruth Petzold

    Abstract: Instruction Fine-Tuning (IFT) significantly enhances the zero-shot capabilities of pretrained Large Language Models (LLMs). While coding data is known to boost reasoning abilities during LLM pretraining, its role in activating internal reasoning capacities during IFT remains understudied. This paper investigates a key question: How does coding data impact LLMs' reasoning capacities during the IFT…

    Submitted 30 May, 2024; originally announced May 2024.

  10. arXiv:2405.18750

    cs.CV

    T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

    Authors: Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang

    Abstract: Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achiev…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Project page: https://t2v-turbo.github.io/

  11. arXiv:2405.17978

    cs.CL cs.AI

    FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm

    Authors: Xiaobao Wu, Thong Nguyen, Delvin Ce Zhang, William Yang Wang, Anh Tuan Luu

    Abstract: Topic models have been evolving rapidly over the years, from conventional to recent neural models. However, existing topic models generally struggle with either effectiveness, efficiency, or stability, highly impeding their practical applications. In this paper, we propose FASTopic, a fast, adaptive, stable, and transferable topic model. FASTopic follows a new paradigm: Dual Semantic-relation Reco…

    Submitted 28 May, 2024; originally announced May 2024.

  12. arXiv:2405.14213

    cs.CV cs.CL

    From Text to Pixel: Advancing Long-Context Understanding in MLLMs

    Authors: Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, William Yang Wang

    Abstract: The rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, the integration of multiple images and extensive textual contexts remains a challenge due to the inherent limitation of the models' capacity to handle long input sequences efficiently. In this paper, we introduce SEEKER,…

    Submitted 23 May, 2024; originally announced May 2024.

  13. arXiv:2405.01769

    cs.CL

    A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law

    Authors: Zhiyu Zoey Chen, Jing Ma, Xinlu Zhang, Nan Hao, An Yan, Armineh Nourbakhsh, Xianjun Yang, Julian McAuley, Linda Petzold, William Yang Wang

    Abstract: In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high-stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 35 pages, 6 figures

  14. arXiv:2404.15271

    cs.CV cs.AI cs.CL

    Automatic Layout Planning for Visually-Rich Documents with Instruction-Following Models

    Authors: Wanrong Zhu, Jennifer Healey, Ruiyi Zhang, William Yang Wang, Tong Sun

    Abstract: Recent advancements in instruction-following models have made user interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts due to limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning, allowing use…

    Submitted 23 April, 2024; originally announced April 2024.

  15. arXiv:2404.07973

    cs.CV

    Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

    Authors: Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, Yinfei Yang

    Abstract: While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any resolution grounding and…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Preprint. 14 pages, 4 figures

  16. arXiv:2404.04251

    cs.CV cs.AI cs.CL

    Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)

    Authors: Michael Saxon, Fatima Jahara, Mahsa Khoshnoodi, Yujie Lu, Aditya Sharma, William Yang Wang

    Abstract: With advances in the quality of text-to-image (T2I) models has come interest in benchmarking their prompt faithfulness, the semantic coherence of generated images to the prompts they were conditioned on. A variety of T2I faithfulness metrics have been proposed, leveraging advances in cross-modal embeddings and vision-language models (VLMs). However, these metrics are not rigorously compared and ben…

    Submitted 22 May, 2024; v1 submitted 5 April, 2024; originally announced April 2024.

    Comments: 10 pages main, 12 pages appendices, 13 figures, 3 tables

  17. arXiv:2404.01295

    cs.CL cs.AI

    Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models

    Authors: Yi-Lin Tuan, Xilun Chen, Eric Michael Smith, Louis Martin, Soumya Batra, Asli Celikyilmaz, William Yang Wang, Daniel M. Bikel

    Abstract: As large language models (LLMs) become easily accessible nowadays, the trade-off between safety and helpfulness can significantly impact user experience. A model that prioritizes safety will cause users to feel less engaged and assisted while prioritizing helpfulness will potentially cause harm. Possible harms include teaching people how to build a bomb, exposing youth to inappropriate content, an…

    Submitted 1 April, 2024; originally announced April 2024.

  18. arXiv:2403.11092

    cs.CL cs.AI cs.CV cs.CY eess.IV

    Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts

    Authors: Michael Saxon, Yiran Luo, Sharon Levy, Chitta Baral, Yezhou Yang, William Yang Wang

    Abstract: Benchmarks of the multilingual capabilities of text-to-image (T2I) models compare generated images prompted in a test language to an expected image distribution over a concept set. One such benchmark, "Conceptual Coverage Across Languages" (CoCo-CroLa), assesses the tangible noun inventory of T2I models by prompting them to generate pictures from a concept list translated to seven languages and co…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: NAACL 2024 Main Conference

  19. arXiv:2403.11027

    cs.CV cs.AI

    Reward Guided Latent Consistency Distillation

    Authors: Jiachen Li, Weixi Feng, Wenhu Chen, William Yang Wang

    Abstract: Latent Consistency Distillation (LCD) has emerged as a promising paradigm for efficient text-to-image synthesis. By distilling a latent consistency model (LCM) from a pre-trained teacher latent diffusion model (LDM), LCD facilitates the generation of high-fidelity images within merely 2 to 4 inference steps. However, the LCM's efficient inference is obtained at the cost of the sample quality. In t…

    Submitted 16 March, 2024; originally announced March 2024.

    Comments: Project page: https://rg-lcd.github.io/

  20. arXiv:2402.18909

    cs.CL cs.AI

    Updating Language Models with Unstructured Facts: Towards Practical Knowledge Editing

    Authors: Xiaobao Wu, Liangming Pan, William Yang Wang, Anh Tuan Luu

    Abstract: Knowledge editing aims to inject knowledge updates into language models to keep them correct and up-to-date. However, its current evaluation strategies are notably impractical: they solely update with well-curated structured facts (triplets with subjects, relations, and objects), whereas real-world knowledge updates commonly emerge in unstructured texts like news articles. In this paper, we propos…

    Submitted 29 February, 2024; originally announced February 2024.

  21. arXiv:2402.18025

    cs.CL

    Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions

    Authors: Kexun Zhang, Yee Man Choi, Zhenqiao Song, Taiqi He, William Yang Wang, Lei Li

    Abstract: How can large language models (LLMs) process and translate endangered languages? Many languages lack a large corpus to train a decent LLM; therefore existing LLMs rarely perform well in unseen, endangered languages. On the contrary, we observe that 2000 endangered languages, though without a large corpus, have a grammar book or a dictionary. We propose LINGOLLM, a training-free approach to enable…

    Submitted 27 February, 2024; originally announced February 2024.

  22. arXiv:2402.16827

    cs.CL cs.LG

    A Survey on Data Selection for Language Models

    Authors: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

    Abstract: A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the am…

    Submitted 8 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: Paper list available at https://github.com/alon-albalak/data-selection-survey

  23. arXiv:2402.11436

    cs.CL cs.AI

    Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement

    Authors: Wenda Xu, Guanglei Zhu, Xuandong Zhao, Liangming Pan, Lei Li, William Yang Wang

    Abstract: Recent studies show that large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others. We discovered that this contrast is due to LLMs' bias in evaluating their own output. In this paper, we formally define LLM's self-bias - the tendency to favor its own generation - using two statistics. We analyze six LLMs (GPT-4, GPT-3.5, Gemini, LLaMA2…

    Submitted 18 June, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

  24. arXiv:2402.03268

    cs.LG cs.AI cs.CL

    Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation

    Authors: Xinyi Wang, Alfonso Amayuelas, Kexun Zhang, Liangming Pan, Wenhu Chen, William Yang Wang

    Abstract: Pre-trained language models (LMs) are able to perform complex reasoning without explicit fine-tuning. To understand how pre-training with a next-token prediction objective contributes to the emergence of such reasoning capability, we propose that we can view an LM as deriving new conclusions by aggregating indirect reasoning paths seen at pre-training time. We found this perspective effective in t…

    Submitted 20 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

    Comments: Accepted to ICML 2024

  25. arXiv:2401.17256

    cs.CL

    Weak-to-Strong Jailbreaking on Large Language Models

    Authors: Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang

    Abstract: Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient method to attack aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and align…

    Submitted 5 February, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

  26. arXiv:2401.13782

    cs.DL cs.AI cs.CL cs.CV cs.LG cs.SI

    Tweets to Citations: Unveiling the Impact of Social Media Influencers on AI Research Visibility

    Authors: Iain Xie Weissburg, Mehir Arora, Xinyi Wang, Liangming Pan, William Yang Wang

    Abstract: As the number of accepted papers at AI and ML conferences reaches into the thousands, it has become unclear how researchers access and read research publications. In this paper, we investigate the role of social media influencers in enhancing the visibility of machine learning research, particularly the citation counts of papers they share. We have compiled a comprehensive dataset of over 8,000 pa…

    Submitted 3 March, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: 10 Pages, 14 Figures

  27. arXiv:2312.02406

    cs.CL cs.LG

    Efficient Online Data Mixing For Language Model Pre-Training

    Authors: Alon Albalak, Liangming Pan, Colin Raffel, William Yang Wang

    Abstract: The data used to pretrain large language models has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and…

    Submitted 8 December, 2023; v1 submitted 4 December, 2023; originally announced December 2023.

  28. arXiv:2311.17647

    cs.CV cs.AI cs.CL

    Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

    Authors: Xiujun Li, Yujie Lu, Zhe Gan, Jianfeng Gao, William Yang Wang, Yejin Choi

    Abstract: Recent multimodal large language models (MLLMs) have shown promising instruction following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM), and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight be…

    Submitted 10 June, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

    Comments: Github: https://github.com/VIM-Bench/VIM_TOOL, Model and Data: https://huggingface.co/VIM-Bench

  29. arXiv:2311.09336

    cs.CL

    LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback

    Authors: Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, Markus Freitag

    Abstract: Recent large language models (LLM) are leveraging human feedback to improve their generation quality. However, human feedback is costly to obtain, especially during inference. In this work, we propose LLMRefine, an inference time optimization method to refine LLM's output. The core idea is to use a learned fine-grained feedback model to pinpoint defects and guide LLM to refine them iteratively. Us…

    Submitted 2 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: Accepted to NAACL 2024

  30. arXiv:2311.01361

    cs.CV cs.CL

    GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks

    Authors: Xinlu Zhang, Yujie Lu, Weizhi Wang, An Yan, Jun Yan, Lianke Qin, Heng Wang, Xifeng Yan, William Yang Wang, Linda Ruth Petzold

    Abstract: Automatically evaluating vision-language tasks is challenging, especially when it comes to reflecting human judgments due to limitations in accounting for fine-grained details. Although GPT-4V has shown promising results in various multi-modal tasks, leveraging GPT-4V as a generalist evaluator for these tasks has not yet been systematically explored. We comprehensively validate GPT-4V's capabiliti…

    Submitted 2 November, 2023; originally announced November 2023.

  31. arXiv:2310.15654

    cs.CL cs.AI cs.CY cs.HC cs.LG

    A Survey on Detection of LLMs-Generated Content

    Authors: Xianjun Yang, Liangming Pan, Xuandong Zhao, Haifeng Chen, Linda Petzold, William Yang Wang, Wei Cheng

    Abstract: The burgeoning capabilities of advanced large language models (LLMs) such as ChatGPT have led to an increase in synthetic content generation with implications across a variety of sectors, including media, cybersecurity, public discourse, and education. As such, the ability to detect LLMs-generated content has become of paramount importance. We aim to provide a detailed overview of existing detecti…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: We will keep updating at https://github.com/Xianjun-Yang/Awesome_papers_on_LLMs_detection.git

    Report number: 20 pages

  32. arXiv:2310.12426

    cs.CL cs.AI

    MAF: Multi-Aspect Feedback for Improving Reasoning in Large Language Models

    Authors: Deepak Nathani, David Wang, Liangming Pan, William Yang Wang

    Abstract: Language Models (LMs) have shown impressive performance in various natural language tasks. However, when it comes to natural language reasoning, LMs still face challenges such as hallucination, generating incorrect intermediate reasoning steps, and making mathematical errors. Recent research has focused on enhancing LMs through self-improvement using feedback. Nevertheless, existing approaches rel…

    Submitted 18 October, 2023; originally announced October 2023.

    Comments: Accepted at EMNLP 2023 Main Conference, Camera Ready

  33. arXiv:2310.09676

    cs.RO cs.AI

    Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

    Authors: Jiachen Li, Qiaozi Gao, Michael Johnston, Xiaofeng Gao, Xuehai He, Suhaila Shakiah, Hangjie Shi, Reza Ghanadan, William Yang Wang

    Abstract: Prompt-based learning has been demonstrated as a compelling paradigm contributing to the tremendous success of large language models (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts, interleaving vision signals with text de…

    Submitted 27 May, 2024; v1 submitted 14 October, 2023; originally announced October 2023.

    Comments: Accepted by ICML 2024. Project page: https://midas-icml.github.io

  34. arXiv:2310.09624

    cs.CL cs.AI cs.LG

    ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models

    Authors: Alex Mei, Sharon Levy, William Yang Wang

    Abstract: As large language models are integrated into society, robustness toward a suite of prompts is increasingly important to maintain reliability in a high-variance environment. Robustness evaluations must comprehensively encapsulate the various settings in which a user may invoke an intelligent system. This paper proposes ASSERT, Automated Safety Scenario Red Teaming, consisting of three methods -- sem…

    Submitted 11 November, 2023; v1 submitted 14 October, 2023; originally announced October 2023.

    Comments: In Findings of the 2023 Conference on Empirical Methods in Natural Language Processing

  35. arXiv:2310.07146

    cs.CL

    Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting

    Authors: Zhiyu Chen, Yujie Lu, William Yang Wang

    Abstract: Mental illness remains one of the most critical public health issues of our time, due to the severe scarcity and limited accessibility of professionals. Psychotherapy requires high-level expertise to conduct deep, complex reasoning and analysis on the cognition modeling of the patients. In the era of Large Language Models, we believe it is the right time to develop AI assistance for computational ps…

    Submitted 10 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Findings

  36. arXiv:2310.05707

    cs.CL cs.AI cs.LG

    Guiding Language Model Math Reasoning with Planning Tokens

    Authors: Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi Yuan, William Yang Wang, Alessandro Sordoni

    Abstract: Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks, such as chain-of-thought reasoning. However, most of the existing approaches to enhance this ability rely heavily on data-driven methods, while neglecting the structural aspects of the model's reasoning capacity. We find that while LLMs can manage individual reasoning st…

    Submitted 5 February, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

  37. arXiv:2310.05103

    cs.CL cs.AI cs.CR cs.LG

    Zero-Shot Detection of Machine-Generated Codes

    Authors: Xianjun Yang, Kexun Zhang, Haifeng Chen, Linda Petzold, William Yang Wang, Wei Cheng

    Abstract: This work proposes a training-free approach for the detection of LLMs-generated codes, mitigating the risks associated with their indiscriminate usage. To the best of our knowledge, our research is the first to investigate zero-shot detection techniques applied to code generated by advanced black-box LLMs like ChatGPT. Firstly, we find that existing training-based or zero-shot text detectors are i…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: work in progress

  38. arXiv:2310.02949

    cs.CL cs.AI cs.CR cs.LG

    Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models

    Authors: Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin

    Abstract: Warning: This paper contains examples of harmful language, and reader discretion is recommended. The increasing open release of powerful large language models (LLMs) has facilitated the development of downstream applications by reducing the essential cost of data annotation and computation. To ensure AI safety, extensive safety-alignment measures have been conducted to armor these models against m…

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: Work in progress

  39. arXiv:2309.17102

    cs.CV

    Guiding Instruction-based Image Editing via Multimodal Large Language Models

    Authors: Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, Zhe Gan

    Abstract: Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation…

    Submitted 5 February, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ICLR'24 (Spotlight) ; Project at https://mllm-ie.github.io ; Code at https://github.com/tsujuifu/pytorch_mgie

  40. arXiv:2308.03188

    cs.CL cs.AI cs.LG

    Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies

    Authors: Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang

    Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks. However, their efficacy is undermined by undesired and inconsistent behaviors, including hallucination, unfaithful reasoning, and toxic content. A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output. Technique…

    Submitted 29 August, 2023; v1 submitted 6 August, 2023; originally announced August 2023.

    Comments: Work in Progress. Version 2

  41. arXiv:2307.06082

    cs.AI cs.CL cs.CV

    VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

    Authors: Raphael Schumann, Wanrong Zhu, Weixi Feng, Tsu-Jui Fu, Stefan Riezler, William Yang Wang

    Abstract: Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation (VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in…

    Submitted 24 January, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: Accepted at AAAI 2024

  42. arXiv:2306.01735  [pdf, other]

    cs.CL cs.AI cs.CV eess.IV

    Multilingual Conceptual Coverage in Text-to-Image Models

    Authors: Michael Saxon, William Yang Wang

    Abstract: We propose "Conceptual Coverage Across Languages" (CoCo-CroLa), a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns. For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: ACL 2023 main conference; 16 pages, 13 figures

  43. arXiv:2305.18842  [pdf, other]

    cs.CL cs.AI cs.CV

    Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge

    Authors: Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramuditha Perera, Henghui Zhu, Yuhao Zhang, Alexander Hanbo Li, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Dan Roth, Bing Xiang

    Abstract: The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLM) such as GPT-3 have been applied to the task and shown to be powerful world knowledge sources. However, these methods suffer from low knowledge coverage caused by PLM bias -- the tendency to generate certa…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 Findings

  44. arXiv:2305.17359  [pdf, other]

    cs.CL cs.AI

    DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text

    Authors: Xianjun Yang, Wei Cheng, Yue Wu, Linda Petzold, William Yang Wang, Haifeng Chen

    Abstract: Large language models (LLMs) have notably enhanced the fluency and diversity of machine-generated text. However, this progress also presents a significant challenge in detecting the origin of a given text, and current research on detection methods lags behind the rapid evolution of LLMs. Conventional training-based methods have limitations in flexibility, particularly when adapting to new domains,…

    Submitted 4 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Updates

  45. arXiv:2305.15393  [pdf, other]

    cs.CV cs.AI

    LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

    Authors: Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang

    Abstract: Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual genera…

    Submitted 28 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  46. arXiv:2305.14591  [pdf, other]

    cs.CL cs.SE

    ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle Verifiers

    Authors: Kexun Zhang, Danqing Wang, Jingtao Xia, William Yang Wang, Lei Li

    Abstract: Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems that require not only implementation but also identification of the suitable algorithm. Moreover, LLM-generated programs lack guaranteed correctness and require human verification. To address these challenges, we propose ALGO, a framework that synthesizes Algorithmic progr…

    Submitted 7 December, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  47. arXiv:2305.14312  [pdf, other]

    cs.CV

    Text-guided 3D Human Generation from 2D Collections

    Authors: Tsu-Jui Fu, Wenhan Xiong, Yixin Nie, Jingyu Liu, Barlas Oğuz, William Yang Wang

    Abstract: 3D human modeling has been widely used for engaging interaction in gaming, film, and animation. The customization of these characters is crucial for creativity and scalability, which highlights the importance of controllability. In this work, we introduce Text-guided 3D Human Generation (T3H), where a model is to generate a 3D human, guided by the fashion description. There are two goals:…

    Submitted 20 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP'23 (Findings) ; Project website: https://text-3dh.github.io/

  48. arXiv:2305.14282  [pdf, other]

    cs.CL cs.AI

    INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback

    Authors: Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Yang Wang, Lei Li

    Abstract: Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics can not explain their verdict or associate the scores with defects in generated text. To address this limitation, we present InstructScore, an explainable evaluation metric for text generation. By harnessing both explicit human instructi…

    Submitted 26 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP2023 Main Conference

  49. arXiv:2305.13903  [pdf, other]

    cs.CL cs.CV

    Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

    Authors: Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang

    Abstract: Despite exciting recent results showing vision-language systems' capacity to reason about images using natural language, their capacity for video reasoning remains under-explored. We motivate framing video reasoning as the sequential understanding of a small number of keyframes, thereby leveraging the power and robustness of vision-language while alleviating the computational complexities of proce…

    Submitted 9 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted to the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

  50. arXiv:2305.13669  [pdf, other]

    cs.CL cs.AI

    The Knowledge Alignment Problem: Bridging Human and External Knowledge for Large Language Models

    Authors: Shuo Zhang, Liangming Pan, Junzhou Zhao, William Yang Wang

    Abstract: Large language models often necessitate grounding on external knowledge to generate faithful and reliable answers. Yet even with the correct groundings in the reference, they can ignore them and rely on wrong groundings or their inherent biases to hallucinate when users, being largely unaware of the specifics of the stored information, pose questions that might not directly correlate with the retr…

    Submitted 12 June, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: ACL 2024, Findings