Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Review

Pranab Sahoo1, Prabhash Meharia1, Akash Ghosh1, Sriparna Saha1, Vinija Jain2, Aman Chadha2,3
1Department of Computer Science And Engineering, Indian Institute of Technology Patna
2Stanford University, 3Amazon GenAI
   Work does not relate to position at Amazon.
Abstract

The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: the potential to generate hallucinated outputs, particularly in high-stakes applications. The tendency of foundation models to produce hallucinated content arguably represents the biggest hindrance to their widespread adoption in real-world scenarios, especially in domains where reliability and accuracy are paramount. This survey paper presents a comprehensive overview of recent developments that aim to identify and mitigate the problem of hallucination in FMs, spanning text, image, video, and audio modalities. By synthesizing recent advancements in detecting and mitigating hallucination across various modalities, the paper aims to provide valuable insights for researchers, developers, and practitioners. Essentially, it establishes a clear framework encompassing definition, taxonomy, and detection strategies for addressing hallucination in multimodal foundation models, laying the foundation for future research in this pivotal area.

1 Introduction

The rapid progress in large-scale foundation models (FMs), spanning language, image, audio, and video domains, has revolutionized the field of artificial intelligence (AI). Models such as GPT-3 Brown et al. (2020), MiniGPT-4 Zhu et al. (2023), AudioLLM Borsos et al. (2023), and LaViLa Zhao et al. (2022) have demonstrated remarkable abilities across diverse tasks, from text generation to multimodal understanding. As these models find wider applications in critical domains, there is a growing imperative to comprehend and alleviate their propensity to produce hallucinated outputs.

1.1 Hallucination

Hallucination refers to instances where FMs generate content that appears plausible and emulates human-like patterns but lacks a coherent understanding of the underlying context or factual grounding Xu et al. (2024b). These hallucinated outputs can manifest in various forms, ranging from trivial factual inaccuracies to more severe cases where the model generates entirely imaginary or absurd content. This issue is not limited to textual modalities but extends across diverse domains, including images, videos, and audio generated by large-scale foundation models. The root causes of hallucination are multifaceted, potentially stemming from biases in training data, limited access to current information, or the model’s inherent constraints in comprehending and generating contextually precise responses. Deploying these powerful models without addressing their hallucination tendencies can have severe consequences, perpetuating misinformation, leading to incorrect conclusions, and potentially causing adverse effects in critical applications. Addressing hallucinations in FMs has therefore become an active research area and a key focus of development efforts. Strategies employed include fine-tuning models on domain-specific data, leveraging diverse training data to enhance model robustness, and developing improved evaluation metrics to identify and reduce hallucination tendencies.

1.2 Types of Hallucination

Hallucinations in large foundation models can manifest in various forms, each posing unique challenges and requiring tailored strategies for mitigation. These types of hallucinations can be categorized as follows. Contextual disconnection Zhang et al. (2023d) describes a situation in which the output produced by a model, in any modality, is inconsistent or out of sync with the context that the user or the input data provided or expected. Semantic distortion Tjio et al. (2022) refers to an inconsistency or error in generated content where the semantics or underlying meaning of the input is misrepresented or altered in the output. Content hallucination Moernaut et al. (2018) describes a phenomenon in generative models where features or elements generated as output are either unreal given the context or absent from the input data. Factual inaccuracy Zhang et al. (2023d) describes an error in generative models where information that is inaccurate, deceptive, or at odds with known facts appears in the generated output. Figure 1 illustrates various types of hallucinations.

Figure 1: Types of hallucinations illustrated: This image shows several types of hallucinations. Proper explanations of hallucinations are indicated as hallucinated elements (HE) and are highlighted in bold red text.

1.3 Motivation and Contributions

Most of the existing survey papers have explored hallucination in the context of large language models (LLMs) Huang et al. (2023); Tonmoy et al. (2024). Recent studies have shown that hallucination can also occur in vision, audio, and video foundation models, highlighting the need for a comprehensive understanding of this challenge across multiple modalities Liu et al. (2024a); Sahoo et al. (2024); Rawte et al. (2023b). To address this gap, our survey aims to provide a holistic and multimodal perspective on the hallucination challenge in FMs. This review comprehensively examines existing research across language, vision, video, and audio domains to understand the mechanisms, detection methods, and mitigation strategies for hallucination in FMs. It serves as a vital resource for researchers and practitioners, aiding in the development of more robust AI solutions. Additionally, it includes a detailed taxonomy diagram in Figure 2 and a summary in Table 1 of recent advancements across different modalities. The contributions of this survey paper are as follows:

  • Establish a precise definition and structured taxonomy of hallucination in the context of large-scale foundation models.

  • Identify the key factors and mechanisms that contribute to the emergence of hallucination across different modalities.

  • Present the various detection and mitigation strategies that have been proposed to address the hallucination problem in a multimodal setting.

  • Provide a comprehensive summary of the methodologies for handling hallucination in large foundation models in Table 1, detailing their approaches to hallucination detection and mitigation, task considerations, datasets utilized, and evaluation metrics employed. This offers readers a concise overview of recent advancements in this field.

The paper is structured as follows: Section 2 delves into hallucinations in LLMs, while Section 3 explores large vision language models. Additionally, Section 4 and Section 5 examine hallucinations in large video and large audio models, respectively. Section 6 presents a discourse on the ramifications of hallucinations, exploring whether they pose advantages or disadvantages. Subsequently, Section 7 addresses the limitations of the current work, while Section 8 outlines potential future directions. Finally, Section 9 presents the concluding remarks.

Hallucinations
  • Text (§2)
    Detection: FACTOID Rawte et al. (2024b); FACTOR Muhlgay et al. (2023); FactCHD Chen et al. (2023b); HalluQA Cheng et al. (2023); HaluEval Li et al. (2023b); SelfCheckGPT Manakul et al. (2023); FACTSCORE Min et al. (2023); FacTool Chern et al. (2023)
    Mitigation: DoLa Chuang et al. (2023); CoK Li et al. (2023d); MixAlign Knowledge; SELF-FAMILIARITY Luo et al. (2023); CoVe Dhuliawala et al. (2023); HALOCHECK Elaraby et al. (2023); Instructional Prompting Varshney et al. (2023); PURR Chen et al. (2023a); LLM-AUGMENTER Peng et al. (2023)
  • Image (§3)
    Detection: HallusionBench Guan et al. (2023); GAVIE Liu et al. (2023); PhD Liu et al. (2024b); VHTest Huang et al. (2024); M-HalDetect Gunjal et al. (2024); ChartBench Xu et al. (2023b); MSG-MCQ Lu et al. (2024); MMVP Tong et al. (2024); REVO-LION Liao et al. (2023); CIEM Hu et al. (2023); FAITHSCORE Jing et al. (2023); POPE Li et al. (2023f); VQA Changpinyo et al. (2022); HaELM Wang et al. (2023); NOPE Lovenia et al. (2023); MMHAL-BENCH Sun et al. (2023); TouchStone Bai et al. (2023)
    Mitigation: HalluciDoctor Yu et al. (2023a); LURE Zhou et al. (2023); MARINE Zhao et al. (2024); FDPO Gunjal et al. (2024); ViGoR Yan et al. (2024); HA-DPO Zhao et al. (2023); VIGC Wang et al. (2024); Data Centric Approach Lu et al. (2024); MoF Tong et al. (2024); InternLM-XComposer Zhang et al. (2023b); VCD Leng et al. (2023); Factually Augmented RLHF Sun et al. (2023); ObjMLM Dai et al. (2022)
  • Video (§4)
    Detection: EMScore Shi et al. (2022)
    Mitigation: FactVC Liu and Wan (2023); CLearViD Chuang and Fazli (2023); MGAT He et al. (2022)
  • Audio (§5)
    Detection: PAM Deshmukh et al. (2024); CompA Ghosh et al. (2023)
    Mitigation: MusicLDM Chen et al. (2024); Re-AudioLDM Yuan et al. (2024); SECap Xu et al. (2024a); RECAP Ghosh et al. (2024); Cacophony Zhu and Duan (2024); EnCLAP Kim et al. (2024)

Figure 2: Taxonomy of hallucination in large foundation models, organized around detection and mitigation techniques.
Figure 3: LLM responses showing the types of hallucinations, highlighted in red, green, and blue Zhang et al. (2023d).

2 Hallucination in Large Language Models

Despite the progress of LLMs, their proneness to hallucinate remains a notable challenge that impedes practical deployment. Figure 3 shows an example LLM response that exhibits hallucination.

2.1 Hallucination Detection and Mitigation

Identifying hallucinations in LLMs is crucial for ensuring the credibility and reliability of their results, especially in scenarios requiring factual correctness. Existing fact-checking methods often rely on complex modules or external databases, requiring either access to output probability distributions or an interface to external sources. SelfCheckGPT Manakul et al. (2023) offers a zero-resource black-box solution for detecting hallucinations in any LLM without relying on external resources. This method operates on the principle that an LLM familiar with a topic will produce consistent and comparable facts across its responses. In contrast, randomly sampled responses on an unfamiliar topic are likely to contain contradictory and hallucinated facts. Continuing the exploration of methods for passage-level hallucination detection, Yang et al. (2023) proposed a novel self-check approach based on reverse validation, aiming to automatically identify factual errors without external resources. They introduced a benchmark, Passage-level Hallucination Detection (PHD), generated using ChatGPT and annotated by human experts to assess different methods. Assessing the accuracy of long text generated by LLMs is challenging because it often contains both accurate and inaccurate information, making simple quality judgments insufficient. To address this, Min et al. (2023) introduced FACTSCORE (Factual Precision in Atomicity Score), a novel evaluation method that breaks down text into individual facts and measures their reliability. Huang and Chang (2023) introduced a unique strategy to mitigate hallucination risks in LLMs by drawing parallels with established web systems. They identified the absence of a "citation" mechanism, which refers to acknowledging or referencing sources or evidence, as a significant gap.
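The consistency principle behind SelfCheckGPT can be illustrated with a minimal sketch. It assumes the main response and several additional sampled responses to the same prompt are already available as strings. The scoring below uses plain word overlap as a stand-in for the similarity measures of the original method, so it is an illustration of the idea only, not the authors' implementation; the function names and the example texts are hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens; a crude stand-in for a learned similarity model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support(sentence, sample):
    """Fraction of the sentence's tokens that also occur in one sampled response."""
    sentence_tokens = tokens(sentence)
    if not sentence_tokens:
        return 1.0
    return len(sentence_tokens & tokens(sample)) / len(sentence_tokens)


def inconsistency_scores(response, samples):
    """Score each sentence of the main response.

    0.0 means every sample supports the sentence; values near 1.0 mean
    the samples do not repeat it, which hints at a possible hallucination.
    """
    if not samples:
        raise ValueError("at least one sampled response is required")
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    scores = []
    for sentence in sentences:
        average = sum(support(sentence, sample) for sample in samples) / len(samples)
        scores.append((sentence, 1.0 - average))
    return scores


# Hypothetical main response and sampled responses for the same prompt.
response = "Marie Curie won two Nobel Prizes. She was born in Vienna."
samples = [
    "Marie Curie won two Nobel Prizes and was born in Warsaw.",
    "Born in Warsaw, Marie Curie won two Nobel Prizes.",
    "Marie Curie, born in Warsaw, won Nobel Prizes in physics and chemistry.",
]
for sentence, score in inconsistency_scores(response, samples):
    print(f"{score:.2f}  {sentence}")
```

In this toy setup the sentence that the samples do not repeat receives the higher score. A practical system would replace the word-overlap function with a stronger comparison, such as an entailment model.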

Addressing the need to identify factual inaccuracies in LLM-generated content, Rawte et al. (2024b) developed a multi-task learning (MTL) framework, integrating advanced long text embeddings like e5-mistral-7b-instruct, along with models such as GPT-3, SpanBERT, and RoFormer. This MTL approach exhibited a substantial performance gain, achieving an average accuracy improvement of 40% on the FACTOID benchmark when compared to leading textual entailment methods. Hallucination mitigation efforts have predominantly relied on empirical methods, leaving uncertainty regarding the possibility of complete elimination. To tackle this challenge, Xu et al. (2024b) introduced a formal framework defining hallucination as discrepancies observed between computable LLMs and a ground truth function. Through this framework, the study examines existing hallucination mitigation strategies and their practical implications for real-world LLM deployment. Rawte et al. (2024c) introduced the Sorry, Come Again (SCA) prompting technique to address hallucination in contemporary LLMs. SCA enhances comprehension by employing optimal paraphrasing and injecting [PAUSE] tokens to delay LLM generation. It analyzes linguistic nuances in prompts and their impact on hallucinated generation, highlighting the difficulties posed by prompts characterized by lower readability, formality, or concreteness. Rawte et al. (2023a) investigated how LLMs respond to factually correct and incorrect prompts, categorizing their hallucinations into mild, moderate, and alarming subcategories. Additionally, the paper introduced the Hallucination eLiciTation dataset, comprising 75,000 text snippets annotated by humans, and a novel Hallucination Vulnerability Index metric.

2.2 Domain Specific Works

Hallucinations pose severe risks in critical fields such as healthcare, finance, and law. Reliability and accuracy are crucial in these sectors, as any form of hallucination can lead to significant and adverse consequences.

2.2.1 Healthcare

In response to hallucinations in medical-domain LLMs, Pal et al. (2023) introduced the Medical Domain Hallucination Test (Med-HALT), a specialized benchmark dataset aimed at assessing and mitigating hallucinations. Med-HALT consists of a varied multinational dataset drawn from medical records spanning numerous countries, encompassing a total of seven datasets. Ahmad et al. (2023) outlined essential steps for creating dependable, trustworthy, and unbiased models, emphasizing the need for quantifying, validating, and mitigating hallucinations within the healthcare context. Ji et al. (2023) introduced an interactive self-reflection approach aimed at improving the accuracy and coherence of answers generated by medical question-answering systems using LLMs. Through knowledge acquisition and feedback on answer generation, this methodology enhances the factuality, consistency, and logical progression of responses.

2.2.2 Finance

Kang and Liu (2023) conducted an empirical investigation into the hallucination tendencies of LLMs in financial tasks. Their study assessed LLMs’ proficiency in explaining financial concepts and querying historical stock prices, and examined the efficacy of methods like few-shot learning and prompt-based tool learning in mitigating hallucination. Roychowdhury et al. (2023) proposed a novel Langchain-based approach aimed at transforming data tables into hierarchical textual data chunks, facilitating versatile financial question answering. The framework involves classifying user queries by intention, retrieving relevant data chunks, generating customized LLM prompts, and evaluating responses for hallucinations and confidence.
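To make the idea of turning a data table into hierarchical textual chunks more concrete, the following is a minimal sketch in plain Python. It does not use Langchain and is not the pipeline of Roychowdhury et al. (2023). The table contents, the column names, the path-style chunk format, and the keyword-overlap retrieval step are all assumptions made for illustration.

```python
import re


def table_to_chunks(table_name, rows, group_key, label_key):
    """Group flat table rows under a parent heading and render each group as text."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)
    chunks = []
    for group, members in groups.items():
        # The first line names the parent level; each following line is one child row.
        lines = [f"{table_name} > {group}"]
        for row in members:
            details = ", ".join(
                f"{key} = {value}"
                for key, value in row.items()
                if key not in (group_key, label_key)
            )
            lines.append(f"{table_name} > {group} > {row[label_key]}: {details}")
        chunks.append("\n".join(lines))
    return chunks


def words(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(chunks, query):
    """Return the chunk sharing the most words with the query (toy retrieval)."""
    query_words = words(query)
    return max(chunks, key=lambda chunk: len(query_words & words(chunk)))


# Hypothetical balance-sheet rows.
rows = [
    {"section": "Assets", "item": "Cash", "FY2022": 100, "FY2023": 120},
    {"section": "Assets", "item": "Inventory", "FY2022": 40, "FY2023": 35},
    {"section": "Liabilities", "item": "Long-term debt", "FY2022": 80, "FY2023": 90},
]
chunks = table_to_chunks("Balance sheet", rows, group_key="section", label_key="item")
print(retrieve(chunks, "What was the long-term debt in FY2023?"))
```

The retrieved chunk would then be placed in the LLM prompt so that the answer can be checked against the figures it contains.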

2.2.3 Legal

The conventional approach to abstractive text summarization typically employs an encoder-decoder architecture, wherein the encoder encapsulates the essence of the source text while the decoder generates the summary. However, this method may produce summaries containing irrelevant or inaccurate information, which poses a significant concern in legal contexts where accuracy is crucial. To address these issues, Feijo and Moreira (2023) introduced LegalSumm, which creates distinct "views" of the source text, trains summarization models to produce independent summaries, and employs an entailment module to assess their fidelity to the source. Deroy et al. (2023) investigated the readiness of LLMs for generating abstractive summaries of case judgments by applying SOTA models to Indian court cases. While abstractive models generally scored slightly higher, the authors noted inconsistencies and hallucinations in the generated summaries. Understanding the meaning of open-ended legal terms is very important for legal professionals, who often look at how such terms were used and interpreted in previous court cases. Savelka et al. (2023) evaluated GPT-4’s performance in generating factually accurate, clear, and relevant explanations of legal terms in legislation. They compared a baseline approach, where GPT-4 directly explains a legal term, with an augmented approach employing a legal information retrieval module to provide contextual sentences from case law. Dahl et al. (2024) present the initial evidence on the frequency and types of inaccuracies in the legal domain, providing valuable insights for evaluating LLMs in legal contexts. By examining the structured format of American case law, the study assesses three major LLMs: GPT-3.5, PaLM 2, and Llama.

2.3 Benchmark Evaluation

In certain instances, LLMs engage in a phenomenon termed "hallucination snowballing," where they fabricate false claims to rationalize prior hallucinations despite acknowledging their inaccuracy. To empirically explore this phenomenon, Zhang et al. (2023a) devise three question-answering datasets spanning diverse domains, wherein ChatGPT and GPT-4 often furnish inaccurate answers alongside explanations featuring at least one false claim. Significantly, the study suggested that the language model can discern these false claims as incorrect. Another benchmark dataset, FactCHD Chen et al. (2023b), was introduced to detect fact-conflicting hallucinations within intricate inferential contexts. It encompasses a range of datasets capturing different factual patterns and integrates fact-based evidence chains to improve assessment accuracy. Li et al. (2023b) introduced a dataset to assess the capability of LLMs to identify and recognize hallucinated or incorrect information. The outcomes highlighted ChatGPT’s inclination to produce hallucinated content, particularly on certain topics, introducing unverifiable information.

Figure 4: Examples of vision hallucination types from Liu et al. (2024b), including images, the matching question-answer pairs, and hallucinated elements. Words annotated in red do not exist in, or do not correspond to elements within, the image, whereas words highlighted in green have correspondences within the image. Questions and answers are denoted by the letters Q and A, respectively.

3 Hallucination in Large Vision-Language Models

Large Vision-Language Models (LVLMs) have garnered significant attention in the AI community for their capacity to handle visual and textual data simultaneously. Nonetheless, similar to LLMs, LVLMs also confront the issue of hallucination. Figure 4 illustrates an example of visual hallucination.

3.1 Hallucination Detection and Mitigation

Dai et al. (2022) investigate the issue of object hallucinations in Vision-Language Pre-training (VLP) models, where textual descriptions generated by these models contain non-existent or inaccurate objects based on input images. Li et al. (2023f) reveal widespread and severe object hallucination issues and suggest that visual instructions may influence hallucination. They observe that objects frequently depicted in visual instructions or co-occurring with image objects are more susceptible to hallucination. To enhance the evaluation process of object hallucination, the authors introduced a polling-based query method called POPE, which demonstrates improved stability and flexibility in assessing object hallucination. The absence of a standardized metric for assessing object hallucination has hindered progress in understanding and addressing this issue. To address this gap, Lovenia et al. (2023) introduced NOPE (Negative Object Presence Evaluation), a benchmark for evaluating object hallucination in vision-language models (VLMs) through visual question answering (VQA). Utilizing LLMs, the study generated a dataset of 29.5k synthetic negative pronoun (NegP) instances for NOPE. It thoroughly assessed the capability of 10 VLMs in detecting the absence of objects in visual questions, in addition to their typical performance on visual questions across nine other VQA datasets. Existing research focuses primarily on object hallucination, overlooking other types of LVLM hallucination. Liu et al. (2024b) delve into Intrinsic Vision-Language Hallucination (IVL-Hallu) and propose several novel IVL-Hallu tasks, including attribute, object, multi-modal conflicting, and counter-common-sense hallucination. They introduced a challenging benchmark dataset to assess and explore IVL-Hallu, conducting experiments on five LVLMs that revealed their limited effectiveness in addressing the proposed tasks.
To mitigate object hallucination in LVLMs without resorting to costly training or API reliance, Zhao et al. (2024) introduced MARINE, a training-free and API-free solution. MARINE enhances LVLMs’ visual comprehension by combining existing open-source vision models and leveraging classifier-free guidance to incorporate object grounding features, thereby enhancing the precision of generated outputs. Evaluations across six LVLMs revealed MARINE’s effectiveness in reducing hallucinations and enhancing output detail, validated through assessments using GPT-4V.
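The polling-based evaluation used by POPE, mentioned earlier in this section, can be sketched as a set of yes/no questions about object presence followed by standard binary-classification metrics. The snippet below is a simplified illustration under stated assumptions: the question template, the object lists, and the model answers are hypothetical, and the strategy for sampling absent objects is left out entirely.

```python
def build_polls(present_objects, absent_objects):
    """Create (question, ground-truth answer) pairs for one image."""
    template = "Is there a {} in the image?"  # assumed wording of the poll
    polls = [(template.format(name), "yes") for name in present_objects]
    polls += [(template.format(name), "no") for name in absent_objects]
    return polls


def polling_metrics(labels, predictions):
    """Accuracy, precision, recall, F1, and the share of 'yes' answers."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for label, pred in pairs if label == "yes" and pred == "yes")
    fp = sum(1 for label, pred in pairs if label == "no" and pred == "yes")
    tn = sum(1 for label, pred in pairs if label == "no" and pred == "no")
    fn = sum(1 for label, pred in pairs if label == "yes" and pred == "no")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(pairs),
    }


# Hypothetical image annotations and model answers.
polls = build_polls(present_objects=["dog", "ball"], absent_objects=["cat", "car"])
labels = [answer for _, answer in polls]
model_answers = ["yes", "yes", "yes", "no"]  # the model wrongly affirms "cat"
print(polling_metrics(labels, model_answers))
```

A high share of "yes" answers combined with low precision indicates a model that tends to affirm objects that are not in the image, which is the behavior object hallucination benchmarks aim to expose.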

HalluciDoctor Yu et al. (2023a) tackled hallucinations in Multi-modal Large Language Models (MLLMs) by using human error detection to identify and eliminate various types of hallucinations. Through rebalancing data distribution via counterfactual visual instruction expansion, they successfully mitigate 44.6% of hallucinations while maintaining competitive performance. Despite proficiency in visual semantic comprehension and meme humor, MLLMs struggle with chart analysis and understanding. Addressing this, Xu et al. (2023b) propose ChartBench, a benchmark assessing chart comprehension. ChartBench exposes MLLMs’ limited reasoning with complex charts, prompting the need for novel evaluation metrics like Acc+ and a handcrafted prompt, ChartCoT. Zhang et al. (2023b) introduced InternLM-XComposer, an LVLM designed to address the challenge of hallucination in image-text comprehension and composition. The performance of InternLM-XComposer’s text-image composition is evaluated through a robust procedure involving both human assessment and comparison to GPT4-Vision, with the model demonstrating competitive performance against solutions like GPT4-V and GPT3.5.

Cutting-edge LVLMs like InstructBLIP Dai et al. (2024), while generating visually grounded responses, often include inaccuracies like fictional items and flawed relationships. To enhance accuracy, Gunjal et al. (2024) introduced M-HalDetect, a pioneering multi-modal fine-grained hallucination detection dataset. This dataset serves as a benchmark for training LVLMs, leading to more precise outputs. Using fine-grained multi-modal reward models and enhancing FDPO significantly reduced InstructBLIP’s hallucination rate. These methods not only improve accuracy across LVLMs like LLaVA and mPLUG-OWL but also highlight the versatility and effectiveness of M-HalDetect in identifying and minimizing hallucinations.

Despite advancements in multi-modal tasks, LMMs often generate descriptions that are inconsistent with the accompanying image or human instructions. To address this, Liu et al. (2023) develop LRV-Instruction, a comprehensive dataset comprising 400k visual instructions across 16 tasks. This dataset includes both positive and negative instructions in various styles and semantic levels. Through LRV-Instruction, the issue of hallucination in existing LMMs was extensively examined, confirming its effectiveness in enhancing visual instruction tuning. Additionally, they introduced GAVIE, a novel method for evaluating visual instruction tuning without the need for human-labeled answers, which can be adapted to different types of instructions. The LVLM Hallucination Revisor (LURE) algorithm Zhou et al. (2023) was developed to correct object hallucination in LVLMs by refining the descriptions to produce more accurate and less hallucinatory outputs. Its methodology is based on an in-depth statistical analysis that identifies key factors contributing to object hallucination, such as the co-occurrence of certain objects in images, the uncertainty associated with objects during LVLM decoding, and the tendency for hallucinations to occur towards the end of the generated text. LURE is designed for seamless integration with a variety of LVLMs. When tested across multiple LVLMs, the integration of LURE resulted in significant enhancements in object hallucination correction, consistently outperforming other methods in both GPT and human evaluations based on various metrics.

3.2 Benchmark Evaluation

The current methods of developing LVLMs rely heavily on annotated benchmark datasets, which can exhibit domain bias and limit model generative capabilities. To address this, Li et al. (2023e) propose a novel data collection approach that synthesizes images and dialogues synchronously for visual instruction tuning, yielding a large dataset of image-dialogue pairs and multi-image instances. Huang et al. (2024) introduced VHTest, a benchmark dataset with 1,200 diverse visual hallucination (VH) instances across 8 VH modes. Evaluation of three SOTA MLLMs showed varying performance, with GPT-4V exhibiting lower hallucination than MiniGPT-v2. Rawte et al. (2024a) categorized visual hallucination in VLMs into eight orientations and introduced a dataset of 2,000 samples covering these types. They proposed three main categories of methods to mitigate hallucination: data-driven approaches, training adjustments, and post-processing techniques. Additionally, Wang et al. (2024) proposed the Visual Instruction Generation and Correction (VIGC) framework to address the shortage of high-quality instruction-tuning data for MLLMs. VIGC enables MLLMs to generate diverse instruction-tuning data while iteratively refining its quality through Visual Instruction Correction (VIC), mitigating hallucination risks. The framework produces diverse, high-quality data for fine-tuning models, validated through evaluations, improving benchmark performance, and overcoming language-only data limitations.

Figure 5: A video featuring descriptions generated by VLTinT model and ground truth (GT) with description errors highlighted in red italics Chuang and Fazli (2023).

4 Hallucinations in Large Video Models

Large Video Models (LVMs) represent a significant advancement, allowing for the processing of video data at scale. Despite their potential for various applications like video understanding and generation, LVMs face challenges with hallucinations, where misinterpretations of video frames can result in artificial or inaccurate visual data. This issue arises due to the video data complexity, which requires the model to process and comprehend it thoroughly. Figure 5 demonstrates the instances of hallucination observed in LVMs.

4.1 Hallucination Detection and Mitigation

The intricate task of dense video captioning, involving the creation of descriptions for multiple events within a continuous video, necessitates a thorough understanding of video content and contextual reasoning to ensure accurate description generation. However, this endeavor faces numerous challenges, potentially resulting in inaccuracies and hallucinations Iashin and Rahtu (2020); Suin and Rajagopalan (2020). Traditional methods detect event proposals first, then caption subsets, risking hallucinations due to overlooking temporal dependencies. To address this, Mun et al. (2019) introduce a novel approach to modeling temporal dependencies and leveraging context for coherent storytelling. By integrating an event sequence generation network and a sequential video captioning network trained with reinforcement learning and two-level rewards, the model captures contextual information more effectively, yielding coherent and accurate captions while minimizing the risk of hallucinations. Liu and Wan (2023) introduce a novel weakly-supervised, model-based factuality metric called FactVC, which outperforms previous metrics. Furthermore, they provide two annotated datasets to promote further research in assessing the factuality of video captions. Wu and Gao (2023) proposed a context-aware model that incorporates information from past and future events to influence the description of the current event conditionally. Their approach utilizes a robust pre-trained context encoder to encode information about the surrounding context events, which is then integrated into the captioning module using a gate-attention mechanism. Experimental findings on the YouCookII and ActivityNet datasets demonstrate that the proposed context-aware model outperforms existing context-aware and pre-trained models significantly. To enhance dense video captioning, Zhou et al. (2024) introduce a streaming model comprising a memory module for long video handling and a streaming decoding algorithm enabling predictions before video completion. This approach notably boosts performance on prominent dense video captioning benchmarks, such as YouCook2, ActivityNet, and ViTT.

Video infilling and prediction tasks are crucial for assessing a model’s ability to comprehend and anticipate the temporal dynamics within video sequences Höppe et al. (2022). To address this, Himakunthala et al. (2023) introduced an inference-time challenge dataset whose keyframes are supplemented with unstructured dense captions and structured FAMOUS (Focus, Action, Mood, Objects, and Setting) scene descriptions, providing contextual information that supports the models’ understanding of the video content. They evaluated language models such as GPT-3, GPT-4, and Vicuna with greedy decoding to mitigate hallucination risks.
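
Greedy decoding, as mentioned above, selects the single most probable next token at each step instead of sampling, which makes the output deterministic. The following is a minimal sketch of that idea only; the bigram table `TOY_BIGRAMS` and the function `toy_model` are hypothetical stand-ins for a real language model and are not taken from Himakunthala et al. (2023).

```python
from typing import Callable, Dict, List

# Hypothetical toy "model": next-token probabilities keyed by the last token.
TOY_BIGRAMS: Dict[str, Dict[str, float]] = {
    "a": {"person": 0.6, "dog": 0.4},
    "person": {"pours": 0.7, "flies": 0.3},
    "pours": {"water": 0.8, "lava": 0.2},
    "water": {"<eos>": 0.9, "a": 0.1},
}


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Return a next-token distribution given the tokens generated so far."""
    return TOY_BIGRAMS.get(tokens[-1], {"<eos>": 1.0})


def greedy_decode(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    eos: str = "<eos>",
    max_new_tokens: int = 20,
) -> List[str]:
    """Extend the prompt by always taking the highest-probability token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # argmax over the distribution
        if best == eos:
            break
        tokens.append(best)
    return tokens
```

Because the lower-probability continuations ("flies", "lava") are never sampled, the decoded sequence follows only the most likely path of the toy model.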

Video inpainting has seen notable progress recently, especially in settings where explicit guidance such as optical flow helps propagate missing pixels across frames Ouyang et al. (2021). However, difficulties arise when cross-frame information is lacking. Rather than relying on pixels from other frames, Yu et al. (2023b) tackle this deficiency case directly. Their method, the Deficiency-aware Masked Transformer (DMT), is a dual-modality-compatible inpainting framework. It improves the handling of scenarios with incomplete information by pre-training an image inpainting model that serves as a prior for training the video model.

Understanding scene affordances, which involve potential actions and interactions within a scene, is crucial for comprehending images and videos. Kulal et al. (2023) introduces a method for realistically inserting people into scenes. The model seamlessly integrates individuals into scenes by deducing realistic poses based on the context and ensuring visually pleasing compositions. Chuang and Fazli (2023) introduces CLearViD, a transformer-based model that utilizes curriculum learning techniques to enhance performance. By adopting this approach, the model acquires more robust and generalizable features. Furthermore, CLearViD incorporates the Mish activation function to address issues like vanishing gradients, thereby reducing the risk of hallucinations by introducing nonlinearity and non-monotonicity. Extensive experiments and ablation studies validate the effectiveness of CLearViD, with evaluations on ActivityNet Captions and YouCook2 datasets showcasing significant improvements over existing SOTA models in terms of diversity metrics.
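
For reference, the Mish activation mentioned above is defined as mish(x) = x · tanh(softplus(x)). The sketch below evaluates it in plain Python to illustrate the non-monotonic behaviour for negative inputs; it is an illustration of the activation function only and does not reproduce any part of the CLearViD implementation.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    # For negative inputs the curve dips and then rises back toward zero,
    # so the function is not monotonic.
    for x in (-5.0, -1.0, -0.1, 0.0, 1.0, 5.0):
        print(f"mish({x:+.1f}) = {mish(x):+.4f}")
```

Unlike ReLU, Mish passes small negative values through with a non-zero gradient, and for large positive inputs it approaches the identity.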

4.2 Benchmark Evaluation

Zhang et al. (2006) created a two-level hierarchical fusion method to hallucinate facial expression sequences from training video samples using only one frontal face image with a neutral expression. To train the system, they introduced a dataset designed for facial expression hallucination, comprising 112 video sequences that cover four types of facial expressions (happy, angry, surprise, and fear) from 28 individuals. The method generates reasonable facial expression sequences in both the temporal and spatial domains with fewer artifacts. Zhou et al. (2018) assembled the YouCook2 dataset, an extensive set of cooking videos with temporally localized and described procedural segments, to facilitate procedure learning tasks. In video understanding, end-to-end chat-centric systems have become a growing area of interest. Li et al. (2023c) introduced "VideoChat", which integrates video foundation models and LLMs through a learnable neural interface to enhance spatiotemporal reasoning, event localization, and causal relationship inference. The researchers constructed a video-centric instruction dataset with detailed descriptions and conversations, emphasizing spatiotemporal reasoning and causal relationships. To counteract model hallucination, they employed a multi-step process that condenses video descriptions into coherent narratives using GPT-4 and then refines them to improve clarity and coherence. To explore the challenge of deducing scene affordances, Kulal et al. (2023) curated a dataset of 2.4M video clips showcasing a variety of plausible poses that align with the scene context.

Figure 6: Audio hallucination examples for each class. Type A: hallucinations of both objects and actions; Type B: accurate objects but hallucinated actions; Type C: correct actions but hallucinated objects Nishimura et al. (2024).

5 Hallucinations in Large Audio Models

Large audio models (LAMs) have emerged as a powerful tool for audio processing and generation, with applications such as speech recognition, music analysis, audio synthesis, and captioning Latif et al. (2023); Ghosal et al. (2023). Although these models have demonstrated remarkable capabilities across various domains, they are susceptible to hallucinations. These can take several forms, from creating unrealistic audio by piecing together fabricated snippets to injecting false information, such as quotes or facts, into summaries. Additionally, the models may fail to accurately capture inherent features of audio signals, such as timbre, pitch, or background noise Shen et al. (2023).

5.1 Hallucination Detection and Mitigation

In audio captioning, where natural language descriptions of audio clips are generated automatically, a significant challenge arises from over-reliance on the visual modality during the pre-training of audio-text models. This reliance introduces data noise and hallucinations, ultimately undermining the accuracy of the resulting captions. To address this issue, Xu et al. (2023a) introduce an AudioSet tag-guided model designed to bootstrap large-scale audio-text data (BLAT). Notably, this model avoids incorporating video, thus minimizing the noise associated with the visual modality. Experimental findings across a range of tasks, including retrieval, generation, and classification, validate the effectiveness of BLAT in mitigating hallucination issues.

Speech emotions play a crucial role in human communication and find extensive applications in areas such as speech synthesis and natural language understanding. However, traditional categorization approaches may fall short of capturing the nuanced and intricate nature of emotions conveyed in human speech Jiang et al. (2019); Han et al. (2021); Ye et al. (2021). SECap Xu et al. (2024a), a framework designed for speech emotion captioning, aims to capture the emotional nuances of speech using natural language. SECap uses LLaMA as the text decoder, HuBERT as the audio encoder, and Q-Former as the Bridge-Net to generate coherent emotion captions based on speech features. Audio-language models, despite strong zero-shot inference performance, still face challenges such as hallucinating task-specific details. To address this, Elizalde et al. (2024) introduce the Contrastive Language-Audio Pretraining (CLAP) model. Pre-trained with 4.6 million diverse audio-text pairs, CLAP features a dual-encoder architecture that enhances representation learning for improved task generalization across sound, music, and speech domains.
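
The contrastive objective behind a dual-encoder audio-text model can be sketched as a symmetric cross-entropy over the pairwise similarity matrix of a batch, where the i-th audio clip and the i-th caption form the positive pair. The code below is a minimal, library-free illustration under stated assumptions: the embeddings are toy vectors (assumed non-zero), the fixed `temperature=0.07` is an illustrative value, and none of the function names come from the CLAP codebase.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(
    audio_emb: List[Sequence[float]],
    text_emb: List[Sequence[float]],
    temperature: float = 0.07,
) -> float:
    """Average of the audio-to-text and text-to-audio contrastive losses."""
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Cosine similarity of every audio/text pair, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

The loss is small when each audio embedding is closest to its own caption and large when the pairing is scrambled, which is the signal that pulls matching audio and text together in the shared space.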

5.2 Benchmark Evaluation

To address the scarcity of data in the specific domain of music captioning, Doh et al. (2023) introduced LP-MusicCaps, a comprehensive dataset comprising 0.5 million audio clips accompanied by approximately 2.2 million captions. Leveraging LLMs, they train a transformer-based music captioning model on the dataset and assess its performance under zero-shot and transfer-learning scenarios, demonstrating its superiority over supervised baseline models. Nishimura et al. (2024) investigated audio hallucinations in large audio-video language models, where audio descriptions are generated primarily from visual information while the audio content is neglected. They classify these hallucinations into three types: (A) hallucinations of both objects and actions, (B) accurate objects but hallucinated actions, and (C) correct actions but hallucinated objects, as shown in Fig. 6. In their investigation, they gathered 1,000 sentences by querying for audio information and annotated each one to determine whether it contained an audio hallucination, further recording the type of hallucination when one was detected. To assess compositional reasoning in LAMs, Ghosh et al. (2023) introduced CompA, consisting of two expert-annotated benchmarks primarily focused on real-world audio samples. This benchmark is employed to fine-tune CompA-CLAP with a novel learning approach, enhancing its compositional reasoning skills and demonstrating substantial improvement over all baseline models on tasks requiring compositional reasoning.
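
The three-way taxonomy of Nishimura et al. (2024) can be expressed as a small decision rule over two annotation flags. The sketch below is illustrative: the names `HallucinationType` and `classify_audio_hallucination` and the boolean flags are assumptions, not the authors' annotation tooling.

```python
from enum import Enum
from typing import Optional


class HallucinationType(Enum):
    A = "both objects and actions hallucinated"
    B = "objects accurate, actions hallucinated"
    C = "actions correct, objects hallucinated"


def classify_audio_hallucination(
    object_hallucinated: bool, action_hallucinated: bool
) -> Optional[HallucinationType]:
    """Map per-sentence annotation flags to Type A, B, or C (None if clean)."""
    if object_hallucinated and action_hallucinated:
        return HallucinationType.A
    if action_hallucinated:
        return HallucinationType.B
    if object_hallucinated:
        return HallucinationType.C
    return None  # no hallucination annotated for this sentence
```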

| Modality | Paper | Detection | Mitigation | Task | Dataset(s) | Evaluation Metric(s) |
|---|---|---|---|---|---|---|
| TEXT | Manakul et al. (2023) | Yes | No | QA | Wikibio | Entropy |
| TEXT | Li et al. (2022) | Yes | Yes | QA, Dialog summarization | Halueval | Automatic |
| TEXT | Mündler et al. | Yes | Yes | Text generation | Manual | F1 Score |
| TEXT | Chen et al. (2023a) | No | Yes | Editing for attribution | MCQ, Dialog | Attribution, Preservation |
| TEXT | Zhang et al. (2023c) | No | Yes | Question knowledge alignment | Fuzzy QA | Attributable to Identified Sources |
| TEXT | Zhang et al. (2023a) | Yes | No | QA | Manual | Accuracy |
| TEXT | Peng et al. (2023) | No | Yes | Task-oriented dialog | News, Customer service | F1 Score, Bleu-4 |
| TEXT | Cui et al. (2023) | No | Yes | QA | Manual | Ranking |
| TEXT | Azaria and Mitchell (2023) | Yes | No | Classification | Manual | Accuracy |
| TEXT | Li et al. (2023d) | Yes | Yes | Knowledge-intensive tasks | Fever, QA | Accuracy |
| TEXT | Elaraby et al. (2023) | Yes | Yes | Consistency, Actuality, QA | Manual NBA domain | Pearson Correlation Coefficient |
| TEXT | Varshney et al. | Yes | Yes | Text generation | Wikibio | Percentage of mitigated hallucination |
| TEXT | Jha et al. (2023) | Yes | No | Dialog | N/A | N/A |
| TEXT | Pal et al. (2023) | No | No | Reasoning hallucination | Med-Halt | Accuracy, Pointwise Score |
| TEXT | McKenna et al. (2023) | Yes | No | Textual entailment | Altered Directional Inference | Entailment Probability |
| TEXT | Guerreiro et al. (2023) | Yes | Yes | MT | FLores 101, WMT, TICO | BLEU |
| TEXT | Huang and Chang (2023) | Yes | Yes | N/A | N/A | N/A |
| TEXT | Luo et al. (2023) | Yes | Yes | Concept extraction | Concept-7 | AUC, Accuracy, F1 Score |
| TEXT | Gao et al. (2022) | Yes | Yes | Editing attribution | NQ, SQA | Auto-AIS (Attr_auto) |
| TEXT | Yang et al. (2023) | Yes | No | Detect factual errors automatically | PHD, WikiBio-GPT3 | Precision, Recall, F1 Score, Accuracy |
| TEXT | Min et al. (2023) | Yes | Yes | Fact verification | Manual (Wikipedia) | FActScore |
| TEXT | Rawte et al. (2024b) | Yes | Yes | Factual inaccuracies detection | FACTOID | HVI_auto |
| TEXT | Ahmad et al. (2023) | Yes | Yes | Hallucination in healthcare | N/A | FActScores |
| TEXT | Ji et al. (2023) | Yes | Yes | Generative and knowledge-intensive | PubMedQA, MEDIQA2019, MedQuAD, and MASH-QA | Unigram F1, ROUGE-L, Med-NLI, and CTRLEval |
| TEXT | Kang and Liu (2023) | Yes | Yes | Hallucination in finance | N/A | FActScores |
| TEXT | Roychowdhury (2024) | No | Yes | QA | N/A | N/A |
| TEXT | Savelka et al. (2023) | No | Yes | Factual evaluation in legislation | N/A | N/A |
| TEXT | Dahl et al. (2024) | Yes | No | Legal hallucination | Manual | N/A |
| TEXT | Rawte et al. (2024c) | No | Yes | Comprehension enhancement | SCA-90K | Cosine similarity |
| IMAGE | Li et al. (2023f) | Yes | No | Evaluation of object hallucination | MSCOCO | CHAIR, POPE |
| IMAGE | Gunjal et al. (2024) | Yes | Yes | VQA | M-Hall Detect | Accuracy |
| IMAGE | Dai et al. (2022) | No | Yes | Image captioning | CHAIR | CIDEr |
| IMAGE | Lovenia et al. (2023) | Yes | No | Object hallucination | NOPE | METEOR, Exact match accuracy, NegP Accuracy |
| IMAGE | Liu et al. (2024b) | Yes | No | Intrinsic vision-language hallucination | PhD | Accuracy |
| IMAGE | Zhao et al. (2024) | Yes | Yes | Non-existing object hallucination | MSCOCO | CHAIR, POPE, GPT-4V, recall |
| IMAGE | Huang et al. (2024) | Yes | No | Visual hallucination | YNQ, OEQ | Accuracy |
| IMAGE | Rawte et al. (2024a) | Yes | No | Video captioning | ActivityNet-Fact, YouCook2-Fact | FactVC |
| IMAGE | Wang et al. (2024) | No | Yes | Generate instruction data for vision-language | VIGC-LLaVA-COCO, VIGC-LLaVA-Objects365 | Conv, Detail, Complex |
| IMAGE | Yu et al. (2023a) | Yes | Yes | Machine-generated visual instruction | LLaVA-Instruction-158K | CHAIR |
| IMAGE | Guan et al. (2023) | No | Yes | Visual questions | HallusionBench | Accuracy |
| IMAGE | Liu et al. (2023) | Yes | Yes | Vision language | LRV-Instruction | GAVIE |
| IMAGE | Xu et al. (2023b) | Yes | No | Evaluation of MLLMs on chart comprehension | ChartBench | Acc+ |
| IMAGE | Lu et al. (2024) | Yes | Yes | Vision language | MSG-MCQ | Accuracy |
| IMAGE | Tong et al. (2024) | Yes | No | Visual question answering | MMVP, VQA | Accuracy |
| IMAGE | Liao et al. (2023) | Yes | No | Vision language | REVO-LION | Meta Quality (MQ) |
| IMAGE | Hu et al. (2023) | Yes | Yes | Visual captioning, Visual question answering | CIEM | Accuracy, Precision, Recall, F1 Score |
| IMAGE | Jing et al. (2023) | Yes | No | Meta-evaluation | LLaVA-1k, MSCOCO-Cap | FAITHSCORE |
| IMAGE | Changpinyo et al. (2022) | No | Yes | Multilingual visual question answering | MaXM | Accuracy |
| IMAGE | Wang et al. (2023) | Yes | No | Content generation | N/A | Precision, Recall, F1 Score |
| IMAGE | Sun et al. (2023) | No | Yes | Visual-language alignment | MMHAL-BENCH | N/A |
| IMAGE | Bai et al. (2023) | Yes | No | Evaluate hallucination of vision language model | TouchStone | Hallucination Score |
| IMAGE | Zhou et al. (2023) | No | Yes | Hallucination mitigation in LVMs | MSCOCO | CHAIR, BLEU, CLIP |
| IMAGE | Yan et al. (2024) | No | Yes | Visual grounding | MMViG | HL, CA, AA, RA, RL, RS, DL |
| IMAGE | Zhao et al. (2023) | Yes | Yes | Overcome hallucination in LVMs | POPE, SHR | Accuracy, Precision, F1 Score |
| IMAGE | Zhang et al. (2023b) | No | Yes | Image text comprehension and composition | MMBench, SeedBench, QBench, MMBench-CN, Chinese Bench | LR, AR, RR, FP-C, FP-S, CP |
| VIDEO | Kulal et al. (2023) | No | Yes | Affordance prediction | Manual | FID, FCKh |
| VIDEO | Himakunthala et al. (2023) | No | Yes | Video infilling, Scene prediction | Manual | N/A |
| VIDEO | Li et al. (2023c) | No | Yes | Visual dialogue | Manual | N/A |
| VIDEO | Zhou et al. (2024) | No | Yes | Video captioning | ActivityNet Captions, YouCook2, ViTT | CIDER, METEOR, SODAc |
| VIDEO | Höppe et al. (2022) | Yes | No | Video prediction | BAIR, Kinetics 600, UCF-101 | Frechet Video Distance |
| VIDEO | Chuang and Fazli (2023) | No | Yes | Video description | ActivityNet Captions, YouCook2 | METEOR, ROUGE_L, CIDER, BLEU_4, DIV-2, RE_4 |
| AUDIO | Li et al. (2023a) | No | Yes | Classification | Manual | Mean avg precision |
| AUDIO | Doh et al. (2023) | No | Yes | Audio captioning | LP-MusicCaps | BLEU |
| AUDIO | Xu et al. (2023a) | No | Yes | Caption generation | AudioCaps | R@K, COCO & FENCE |
| AUDIO | Liu and Wan (2023) | No | Yes | Audio captioning | MusicCaps | BLEU |
| AUDIO | Nishimura et al. (2024) | Yes | No | Evaluation of LAMs | LAION_CLAP, MS_CLAP | Recall, Precision, F1 Score |
Table 1: Overview of the hallucination detection and mitigation landscape in FMs across modalities (Text, Image, Video, and Audio). Each work is categorized based on factors such as detection, mitigation, tasks, datasets, and evaluation metrics.
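
Several image entries in Table 1 report CHAIR, which measures object hallucination in captions: the instance-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score is the fraction of captions containing at least one such object. The following is a simplified sketch under stated assumptions: objects are given as already-extracted sets of unique names per caption (real evaluations must first extract object mentions and map synonyms to the annotation vocabulary), and the function name `chair_scores` is hypothetical.

```python
from typing import List, Set, Tuple


def chair_scores(
    mentioned_objects: List[Set[str]],
    ground_truth_objects: List[Set[str]],
) -> Tuple[float, float]:
    """Return (instance-level, sentence-level) hallucination rates.

    mentioned_objects[i] holds the objects named in caption i, and
    ground_truth_objects[i] holds the objects annotated in image i.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, truth in zip(mentioned_objects, ground_truth_objects):
        hallucinated = mentioned - truth  # named in caption, absent from image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        captions_with_hallucination += bool(hallucinated)
    n_captions = len(mentioned_objects)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / n_captions if n_captions else 0.0
    return chair_i, chair_s
```

For example, if two captions mention four objects in total and one of them ("frisbee") is absent from its image, the instance-level score is 1/4 and the sentence-level score is 1/2. Lower values indicate less object hallucination.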

6 Hallucination: Good or Bad?

Hallucinations in large-scale models present a complex interplay between creativity and uncertainty. On one hand, the ability to move beyond conventional data boundaries can lead to novel and innovative outputs. Hallucinations can spark exploratory learning, revealing unexpected patterns and features within the data. They can also serve as a form of stress testing, improving the model’s robustness and adaptability. Furthermore, these unexpected outputs can inspire human creativity, serving as a springboard for new ideas and perspectives Rawte et al. (2023b). On the other hand, this dual nature introduces significant drawbacks. The quality and coherence of hallucinatory outputs can be questionable, posing challenges in applications where accuracy and reliability are paramount. Hallucinations can also propagate misinformation and biases present in the model’s training data, potentially reinforcing existing prejudices and eroding user trust. The reduced interpretability of these outputs can further undermine the model’s credibility and adoption. Ethical concerns arise when hallucinations produce inappropriate, offensive, or harmful content, so careful monitoring and control mechanisms are essential to prevent outputs that could cause harm or distress to users. Navigating this balance between exploration and fidelity is crucial for maximizing the utility of large models while mitigating the risks associated with unexpected outputs. Overall, the phenomenon of hallucination in large-scale models calls for a nuanced understanding and strategic management of these capabilities.

7 Limitations

Previous survey papers primarily focused on hallucination in Large Language Models and did not extensively cover hallucinations in the vision, audio, and video modalities. In this survey, we aim to provide a comprehensive overview of hallucinations across all modalities, since hallucinations can occur in any large foundation model. Despite our efforts to summarize recent advances in hallucination detection and mitigation across all foundation models, we acknowledge that we may have missed some relevant work in the field.

8 Future Directions

Researchers are actively investigating hallucination mitigation techniques, since hallucination can be critical in sensitive areas where generating fictional or incorrect content could have serious consequences Tonmoy et al. (2024); Rawte et al. (2023b). Potential directions for addressing hallucination in FMs include the following:

Data Resources: Recent studies have highlighted the efficacy of simple fine-tuning on carefully curated high-quality samples for reducing hallucinations, surpassing the impact of large-scale fine-tuning and reinforcement learning approaches. For knowledge-intensive domains, the development of entity-centered fine-tuned instructions that integrate structured knowledge from knowledge graphs shows promise in enhancing accuracy and relevance. Additionally, employing alignment techniques tailored to specific tasks or fields has proven effective in mitigating hallucinations. As research in this area progresses, more resources focused on improving alignment through task-specific or domain-adapted approaches are expected, further bolstering the reliability of language models in generating factual and trustworthy content.

Automated Evaluation: Developing specialized evaluation metrics that consider factors such as factual accuracy and coherence can be useful for hallucination detection. Combining automated evaluation with human judgments via crowdsourcing can capture nuanced aspects that are challenging for automated systems alone. Adversarial testing methodologies are also being developed to expose AI systems to crafted inputs, helping to identify weaknesses and enhance resilience against hallucination. Furthermore, fine-tuning FMs on datasets emphasizing fact-checking and accuracy offers another avenue to improve content reliability and reduce the occurrence of hallucinations.

Improving Detection and Mitigation techniques: Mitigating hallucinations in FMs necessitates a multifaceted approach that leverages reasoning mechanisms, knowledge graph integration, specialized fact-checking models, bias mitigation techniques, and active learning methodologies. Emerging techniques like Chain of Thought (CoT) Wei et al. (2022) and Tree of Thought (ToT) Yao et al. (2024) bolster these models’ reasoning capabilities, potentially reducing hallucinations. Integrating knowledge graphs enhances understanding of factual information and concept relationships, aiding content generation and fact-checking. Specialized verification models cross-reference outputs against curated knowledge to identify inaccuracies, while bias detection and mitigation techniques promote fairness. Finally, ethical guidelines and regulatory frameworks governing the responsible use of curated knowledge in AI development mitigate risks and foster public trust, collectively improving the quality, accuracy, and trustworthiness of AI-generated content.

Multimodal Hallucination: Addressing hallucinations in multimodal large foundation models requires a comprehensive approach spanning data-centric initiatives, cross-modal alignment efforts, architectural innovations, standardized benchmarking, reframing of hallucination, and enhanced interpretability and trust. Data-centric techniques for robust data collection, augmentation, and calibration ensure diverse and high-quality training data. Cross-modal alignment focuses on aligning representations across modalities through sophisticated architectures. Architectural advancements involve designing specialized models capable of handling complex linguistic and visual inputs effectively. Establishing unified metrics and standardized benchmarks enables accurate assessment of hallucination and reliable performance evaluations. Reframing hallucination as a feature explores its integration into downstream applications, optimizing for the human experience. Finally, developing techniques for interpreting model behavior, visualizing internals, and improving reliability assessment fosters trust in MLLMs. This multifaceted approach collectively addresses the critical hallucination challenge, paving the way for more reliable and trustworthy multimodal AI systems.
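
As an illustration of the verification idea raised under detection and mitigation techniques (cross-referencing model outputs against curated knowledge), the sketch below checks subject-relation-object claims against a small lookup table. It is a toy under stated assumptions: claims are assumed to be already extracted from the generated text, and the knowledge base contents and the name `check_claims` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def check_claims(
    claims: Iterable[Triple],
    knowledge_base: Dict[Tuple[str, str], str],
) -> Dict[Triple, str]:
    """Label each claim as supported, contradicted, or unverifiable."""
    verdicts: Dict[Triple, str] = {}
    for subject, relation, obj in claims:
        known = knowledge_base.get((subject, relation))
        if known is None:
            verdicts[(subject, relation, obj)] = "unverifiable"
        elif known == obj:
            verdicts[(subject, relation, obj)] = "supported"
        else:
            verdicts[(subject, relation, obj)] = "contradicted"
    return verdicts


# Hypothetical curated knowledge used only for this example.
TOY_KB = {
    ("Paris", "capital_of"): "France",
    ("water", "boils_at_celsius"): "100",
}
```

Claims labeled "contradicted" are candidate hallucinations, while "unverifiable" claims fall outside the curated knowledge and would need another source or human review.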

9 Conclusion

This survey paper systematically categorizes existing research on hallucination within FMs, providing comprehensive insights into critical facets, including detection, mitigation, tasks, datasets, and evaluation metrics. It addresses the pervasive impact of hallucination in FMs across diverse domains. By examining recent advancements in detection and mitigation techniques, the paper underscores the importance of addressing this challenge, given FMs’ indispensable role in critical tasks. Its primary contribution lies in introducing a structured taxonomy for classifying hallucination in FMs, spanning text, image, video, and audio domains.

References

  • Ahmad et al. (2023) Muhammad Aurangzeb Ahmad, Ilker Yaramis, and Taposh Dutta Roy. 2023. Creating trustworthy llms: Dealing with hallucinations in healthcare ai. arXiv preprint arXiv:2311.01463.
  • Azaria and Mitchell (2023) Amos Azaria and Tom Mitchell. 2023. The internal state of an llm knows when its lying. arXiv preprint arXiv:2304.13734.
  • Bai et al. (2023) Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, and Jingren Zhou. 2023. Touchstone: Evaluating vision-language models by language models. arXiv preprint arXiv:2308.16890.
  • Borsos et al. (2023) Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2023. Audiolm: a language modeling approach to audio generation.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Changpinyo et al. (2022) Soravit Changpinyo, Linting Xue, Idan Szpektor, Ashish V Thapliyal, Julien Amelot, Michal Yarom, Xi Chen, and Radu Soricut. 2022. Maxm: Towards multilingual visual question answering. arXiv preprint arXiv:2209.05401.
  • Chen et al. (2023a) Anthony Chen, Panupong Pasupat, Sameer Singh, Hongrae Lee, and Kelvin Guu. 2023a. Purr: Efficiently editing language model hallucinations by denoising language model corruptions. arXiv preprint arXiv:2305.14908.
  • Chen et al. (2024) Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2024. Musicldm: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1206–1210. IEEE.
  • Chen et al. (2023b) Xiang Chen, Duanzheng Song, Honghao Gui, Chengxi Wang, Ningyu Zhang, Fei Huang, Chengfei Lv, Dan Zhang, and Huajun Chen. 2023b. Unveiling the siren’s song: Towards reliable fact-conflicting hallucination detection. arXiv preprint arXiv:2310.12086.
  • Cheng et al. (2023) Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, et al. 2023. Evaluating hallucinations in chinese large language models. arXiv preprint arXiv:2310.03368.
  • Chern et al. (2023) I Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, Pengfei Liu, et al. 2023. Factool: Factuality detection in generative ai–a tool augmented framework for multi-task and multi-domain scenarios. arXiv preprint arXiv:2307.13528.
  • Chuang and Fazli (2023) Cheng-Yu Chuang and Pooyan Fazli. 2023. Clearvid: Curriculum learning for video description. arXiv preprint arXiv:2311.04480.
  • Chuang et al. (2023) Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. 2023. Dola: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883.
  • Cui et al. (2023) Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092.
  • Dahl et al. (2024) Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. 2024. Large legal fictions: Profiling legal hallucinations in large language models. arXiv preprint arXiv:2401.01301.
  • Dai et al. (2024) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. 2024. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36.
  • Dai et al. (2022) Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, and Pascale Fung. 2022. Plausible may not be faithful: Probing object hallucination in vision-language pre-training. arXiv preprint arXiv:2210.07688.
  • Deroy et al. (2023) Aniket Deroy, Kripabandhu Ghosh, and Saptarshi Ghosh. 2023. How ready are pre-trained abstractive models and llms for legal case judgement summarization? arXiv preprint arXiv:2306.01248.
  • Deshmukh et al. (2024) Soham Deshmukh, Dareen Alharthi, Benjamin Elizalde, Hannes Gamper, Mahmoud Al Ismail, Rita Singh, Bhiksha Raj, and Huaming Wang. 2024. Pam: Prompting audio-language models for audio quality assessment. arXiv preprint arXiv:2402.00282.
  • Dhuliawala et al. (2023) Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.
  • Doh et al. (2023) SeungHeon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. 2023. Lp-musiccaps: Llm-based pseudo music captioning. arXiv preprint arXiv:2307.16372.
  • Elaraby et al. (2023) Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, and Shizhu Liu. 2023. Halo: Estimation and reduction of hallucinations in open-source weak large language models. arXiv preprint arXiv:2308.11764.
  • Elizalde et al. (2024) Benjamin Elizalde, Soham Deshmukh, and Huaming Wang. 2024. Natural language supervision for general-purpose audio representations. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 336–340. IEEE.
  • Feijo and Moreira (2023) Diego de Vargas Feijo and Viviane P Moreira. 2023. Improving abstractive summarization of legal rulings through textual entailment. Artificial intelligence and law, 31(1):91–113.
  • Gao et al. (2022) Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. 2022. Rarr: Researching and revising what language models say, using language models. arXiv preprint arXiv:2210.08726.
  • Ghosal et al. (2023) Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. 2023. Text-to-audio generation using instruction-tuned llm and latent diffusion model. arXiv preprint arXiv:2304.13731.
  • Ghosh et al. (2024) Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, and Dinesh Manocha. 2024. Recap: retrieval-augmented audio captioning. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1161–1165. IEEE.
  • Ghosh et al. (2023) Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S Ramaneswaran, S Sakshi, Oriol Nieto, Ramani Duraiswami, and Dinesh Manocha. 2023. Compa: Addressing the gap in compositional reasoning in audio-language models. arXiv preprint arXiv:2310.08753.
  • Guan et al. (2023) Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. 2023. Hallusionbench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. arXiv preprint arXiv:2310.14566.
  • Guerreiro et al. (2023) Nuno M Guerreiro, Duarte M Alves, Jonas Waldendorf, Barry Haddow, Alexandra Birch, Pierre Colombo, and André FT Martins. 2023. Hallucinations in large multilingual translation models. Transactions of the Association for Computational Linguistics, 11:1500–1517.
  • Gunjal et al. (2024) Anisha Gunjal, Jihan Yin, and Erhan Bas. 2024. Detecting and preventing hallucinations in large vision language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18135–18143.
  • Han et al. (2021) Qichen Han, Weiqiang Yuan, Dong Liu, Xiang Li, and Zhen Yang. 2021. Automated audio captioning with weakly supervised pre-training and word selection methods. In DCASE, pages 6–10.
  • He et al. (2022) Mengge He, Wenjing Du, Zhiquan Wen, Qing Du, Yutong Xie, and Qi Wu. 2022. Multi-granularity aggregation transformer for joint video-audio-text representation learning. IEEE Transactions on Circuits and Systems for Video Technology.
  • Himakunthala et al. (2023) Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, and William Wang. 2023. Let’s think frame by frame with vip: A video infilling and prediction dataset for evaluating video chain-of-thought. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 204–219.
  • Höppe et al. (2022) Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. 2022. Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696.
  • Hu et al. (2023) Hongyu Hu, Jiyuan Zhang, Minyi Zhao, and Zhenbang Sun. 2023. Ciem: Contrastive instruction evaluation method for better instruction tuning. arXiv preprint arXiv:2309.02301.
  • Huang and Chang (2023) Jie Huang and Kevin Chen-Chuan Chang. 2023. Citation: A key to building responsible and accountable large language models. arXiv preprint arXiv:2307.02185.
  • Huang et al. (2023) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.
  • Huang et al. (2024) Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. 2024. Visual hallucinations of multi-modal large language models. arXiv preprint arXiv:2402.14683.
  • Iashin and Rahtu (2020) Vladimir Iashin and Esa Rahtu. 2020. Multi-modal dense video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 958–959.
  • Jha et al. (2023) Susmit Jha, Sumit Kumar Jha, Patrick Lincoln, Nathaniel D Bastian, Alvaro Velasquez, and Sandeep Neema. 2023. Dehallucinating large language models using formal methods guided iterative prompting. In 2023 IEEE International Conference on Assured Autonomy (ICAA), pages 149–152. IEEE.
  • Ji et al. (2023) Ziwei Ji, Tiezheng Yu, Yan Xu, Nayeon Lee, Etsuko Ishii, and Pascale Fung. 2023. Towards mitigating llm hallucination via self reflection. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843.
  • Jiang et al. (2019) Pengxu Jiang, Hongliang Fu, Huawei Tao, Peizhi Lei, and Li Zhao. 2019. Parallelized convolutional recurrent neural network with spectral features for speech emotion recognition. IEEE Access, 7:90368–90377.
  • Jing et al. (2023) Liqiang Jing, Ruosen Li, Yunmo Chen, Mengzhao Jia, and Xinya Du. 2023. Faithscore: Evaluating hallucinations in large vision-language models. arXiv preprint arXiv:2311.01477.
  • Kang and Liu (2023) Haoqiang Kang and Xiao-Yang Liu. 2023. Deficiency of large language models in finance: An empirical examination of hallucination. arXiv preprint arXiv:2311.15548.
  • Kim et al. (2024) Jaeyeon Kim, Jaeyoon Jung, Jinjoo Lee, and Sang Hoon Woo. 2024. Enclap: Combining neural audio codec and audio-text joint embedding for automated audio captioning. arXiv preprint arXiv:2401.17690.
  • (47) Grounding Knowledge. The knowledge alignment problem: Bridging human and external knowledge for large language models.
  • Kulal et al. (2023) Sumith Kulal, Tim Brooks, Alex Aiken, Jiajun Wu, Jimei Yang, Jingwan Lu, Alexei A Efros, and Krishna Kumar Singh. 2023. Putting people in their place: Affordance-aware human insertion into scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17089–17099.
  • Latif et al. (2023) Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Heriberto Cuayáhuitl, and Björn W Schuller. 2023. Sparks of large audio models: A survey and outlook. arXiv preprint arXiv:2308.12792.
  • Leng et al. (2023) Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2023. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. arXiv preprint arXiv:2311.16922.
  • Li et al. (2023a) Juncheng B Li, Jackson Sam Michaels, Laura Yao, Lijun Yu, Zach Wood-Doughty, and Florian Metze. 2023a. Audio-journey: Efficient visual+llm-aided audio encodec diffusion. In Workshop on Efficient Systems for Foundation Models @ ICML 2023.
  • Li et al. (2023b) Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023b. Halueval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464.
  • Li et al. (2023c) KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023c. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355.
  • Li et al. (2023d) Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Lidong Bing, Shafiq Joty, and Soujanya Poria. 2023d. Chain of knowledge: A framework for grounding large language models with structured knowledge bases. arXiv preprint arXiv:2305.13269.
  • Li et al. (2022) Y Li, R Panda, Y Kim, C Chen, R Feris, D Cox, and N Vasconcelos. 2022. Valhalla: Visual hallucination for machine translation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5206–5216.
  • Li et al. (2023e) Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, and Yunchao Wei. 2023e. Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data. arXiv preprint arXiv:2308.10253.
  • Li et al. (2023f) Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023f. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.
  • Liao et al. (2023) Ning Liao, Shaofeng Zhang, Renqiu Xia, Bo Zhang, Min Cao, Yu Qiao, and Junchi Yan. 2023. Revo-lion: Evaluating and refining vision-language instruction tuning datasets. arXiv preprint arXiv:2310.06594.
  • Liu et al. (2023) Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. 2023. Mitigating hallucination in large multi-modal models via robust instruction tuning. In The Twelfth International Conference on Learning Representations.
  • Liu et al. (2024a) Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. 2024a. A survey on hallucination in large vision-language models.
  • Liu and Wan (2023) Hui Liu and Xiaojun Wan. 2023. Models see hallucinations: Evaluating the factuality in video captioning. arXiv preprint arXiv:2303.02961.
  • Liu et al. (2024b) Jiazhen Liu, Yuhan Fu, Ruobing Xie, Runquan Xie, Xingwu Sun, Fengzong Lian, Zhanhui Kang, and Xirong Li. 2024b. Phd: A prompted visual hallucination evaluation dataset. arXiv preprint arXiv:2403.11116.
  • Lovenia et al. (2023) Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. 2023. Negative object presence evaluation (nope) to measure object hallucination in vision-language models. arXiv preprint arXiv:2310.05338.
  • Lu et al. (2024) Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, and Jie Yang. 2024. Evaluation and enhancement of semantic grounding in large vision-language models. In AAAI-ReLM Workshop.
  • Luo et al. (2023) Junyu Luo, Cao Xiao, and Fenglong Ma. 2023. Zero-resource hallucination prevention for large language models. arXiv preprint arXiv:2309.02654.
  • Manakul et al. (2023) Potsawee Manakul, Adian Liusie, and Mark JF Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
  • McKenna et al. (2023) Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark Steedman. 2023. Sources of hallucination by large language models on inference tasks. arXiv preprint arXiv:2305.14552.
  • Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
  • Moernaut et al. (2018) Nienke Moernaut, Stijn Vanheule, and Jasper Feyaerts. 2018. Content matters, a qualitative analysis of verbal hallucinations. Frontiers in Psychology, 9:1958.
  • Muhlgay et al. (2023) Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, and Yoav Shoham. 2023. Generating benchmarks for factuality evaluation of language models. arXiv preprint arXiv:2307.06908.
  • Mun et al. (2019) Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, and Bohyung Han. 2019. Streamlined dense video captioning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6588–6597.
  • (72) Niels Mündler, Jingxuan He, Slobodan Jenko, and Martin Vechev. 2023. Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv [cs.CL].
  • Nishimura et al. (2024) Taichi Nishimura, Shota Nakada, and Masayoshi Kondo. 2024. On the audio hallucinations in large audio-video language models. arXiv preprint arXiv:2401.09774.
  • Ouyang et al. (2021) Hao Ouyang, Tengfei Wang, and Qifeng Chen. 2021. Internal video inpainting by implicit long-range propagation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 14579–14588.
  • Pal et al. (2023) Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2023. Med-halt: Medical domain hallucination test for large language models.
  • Peng et al. (2023) Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813.
  • Rawte et al. (2023a) Vipula Rawte, Swagata Chakraborty, Agnibh Pathak, Anubhav Sarkar, S. M Towhidul Islam Tonmoy, Aman Chadha, Amit P. Sheth, and Amitava Das. 2023a. The troubling emergence of hallucination in large language models – an extensive definition, quantification, and prescriptive remediations.
  • Rawte et al. (2024a) Vipula Rawte, Anku Rani, Harshad Sharma, Neeraj Anand, Krishnav Rajbangshi, Amit Sheth, and Amitava Das. 2024a. Visual hallucination: Definition, quantification, and prescriptive remediations. arXiv preprint arXiv:2403.17306.
  • Rawte et al. (2023b) Vipula Rawte, Amit Sheth, and Amitava Das. 2023b. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922.
  • Rawte et al. (2024b) Vipula Rawte, SM Tonmoy, Krishnav Rajbangshi, Shravani Nag, Aman Chadha, Amit P Sheth, and Amitava Das. 2024b. Factoid: Factual entailment for hallucination detection. arXiv preprint arXiv:2403.19113.
  • Rawte et al. (2024c) Vipula Rawte, SM Tonmoy, SM Zaman, Prachi Priya, Aman Chadha, Amit P Sheth, and Amitava Das. 2024c. "Sorry, come again?" prompting – enhancing comprehension and diminishing hallucination with [pause]-injected optimal paraphrasing. arXiv preprint arXiv:2403.18976.
  • Roychowdhury (2024) Sohini Roychowdhury. 2024. Journey of hallucination-minimized generative ai solutions for financial decision makers. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 1180–1181.
  • Roychowdhury et al. (2023) Sohini Roychowdhury, Andres Alvarez, Brian Moore, Marko Krema, Maria Paz Gelpi, Punit Agrawal, Federico Martín Rodríguez, Ángel Rodríguez, José Ramón Cabrejas, Pablo Martínez Serrano, et al. 2023. Hallucination-minimized data-to-answer framework for financial decision-makers. In 2023 IEEE International Conference on Big Data (BigData), pages 4693–4702. IEEE.
  • Sahoo et al. (2024) Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A systematic survey of prompt engineering in large language models: Techniques and applications.
  • Savelka et al. (2023) Jaromir Savelka, Kevin D Ashley, Morgan A Gray, Hannes Westermann, and Huihui Xu. 2023. Explaining legal concepts with augmented large language models (gpt-4). arXiv preprint arXiv:2306.09525.
  • Shen et al. (2023) Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, and Jiang Bian. 2023. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116.
  • Shi et al. (2022) Yaya Shi, Xu Yang, Haiyang Xu, Chunfeng Yuan, Bing Li, Weiming Hu, and Zheng-Jun Zha. 2022. Emscore: Evaluating video captioning via coarse-grained and fine-grained embedding matching. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17929–17938.
  • Suin and Rajagopalan (2020) Maitreya Suin and AN Rajagopalan. 2020. An efficient framework for dense video captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12039–12046.
  • Sun et al. (2023) Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. 2023. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525.
  • Tjio et al. (2022) Gabriel Tjio, Ping Liu, Joey Tianyi Zhou, and Rick Siow Mong Goh. 2022. Adversarial semantic hallucination for domain generalized semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 318–327.
  • Tong et al. (2024) Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. 2024. Eyes wide shut? exploring the visual shortcomings of multimodal llms. arXiv preprint arXiv:2401.06209.
  • Tonmoy et al. (2024) SM Tonmoy, SM Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. 2024. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313.
  • (93) Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by actively validating low-confidence generation.
  • Varshney et al. (2023) Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. 2023. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. arXiv preprint arXiv:2307.03987.
  • Wang et al. (2024) Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, et al. 2024. Vigc: Visual instruction generation and correction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5309–5317.
  • Wang et al. (2023) Junyang Wang, Yiyang Zhou, Guohai Xu, Pengcheng Shi, Chenlin Zhao, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang, Jihua Zhu, et al. 2023. Evaluation and analysis of hallucination in large vision-language models. arXiv preprint arXiv:2308.15126.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
  • Wu and Gao (2023) Weilun Wu and Yang Gao. 2023. A context-aware model with a pre-trained context encoder for dense video captioning. In International Conference on Cyber Security, Artificial Intelligence, and Digital Economy (CSAIDE 2023), volume 12718, pages 387–396. SPIE.
  • Xu et al. (2023a) Xuenan Xu, Zhiling Zhang, Zelin Zhou, Pingyue Zhang, Zeyu Xie, Mengyue Wu, and Kenny Q Zhu. 2023a. Blat: Bootstrapping language-audio pre-training based on audioset tag-guided synthetic data. In Proceedings of the 31st ACM International Conference on Multimedia, pages 2756–2764.
  • Xu et al. (2024a) Yaoxun Xu, Hangting Chen, Jianwei Yu, Qiaochu Huang, Zhiyong Wu, Shi-Xiong Zhang, Guangzhi Li, Yi Luo, and Rongzhi Gu. 2024a. Secap: Speech emotion captioning with large language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19323–19331.
  • Xu et al. (2023b) Zhengzhuo Xu, Sinan Du, Yiyan Qi, Chengjin Xu, Chun Yuan, and Jian Guo. 2023b. Chartbench: A benchmark for complex visual reasoning in charts. arXiv preprint arXiv:2312.15915.
  • Xu et al. (2024b) Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. 2024b. Hallucination is inevitable: An innate limitation of large language models. arXiv preprint arXiv:2401.11817.
  • Yan et al. (2024) Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, and Li Erran Li. 2024. Vigor: Improving visual grounding of large vision language models with fine-grained reward modeling. arXiv preprint arXiv:2402.06118.
  • Yang et al. (2023) Shiping Yang, Renliang Sun, and Xiaojun Wan. 2023. A new benchmark and reverse validation method for passage-level hallucination detection. arXiv preprint arXiv:2310.06498.
  • Yao et al. (2024) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2024. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36.
  • Ye et al. (2021) Zhongjie Ye, Helin Wang, Dongchao Yang, and Yuexian Zou. 2021. Improving the performance of automated audio captioning via integrating the acoustic and semantic information. arXiv preprint arXiv:2110.06100.
  • Yu et al. (2023a) Qifan Yu, Juncheng Li, Longhui Wei, Liang Pang, Wentao Ye, Bosheng Qin, Siliang Tang, Qi Tian, and Yueting Zhuang. 2023a. Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data. arXiv preprint arXiv:2311.13614.
  • Yu et al. (2023b) Yongsheng Yu, Heng Fan, and Libo Zhang. 2023b. Deficiency-aware masked transformer for video inpainting. arXiv preprint arXiv:2307.08629.
  • Yuan et al. (2024) Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D Plumbley, and Wenwu Wang. 2024. Retrieval-augmented text-to-audio generation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 581–585. IEEE.
  • Zhang et al. (2006) Jian Zhang, Yueting Zhuang, and Fei Wu. 2006. Video-based facial expression hallucination: A two-level hierarchical fusion approach. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 513–521. Springer.
  • Zhang et al. (2023a) Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. 2023a. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534.
  • Zhang et al. (2023b) Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. 2023b. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112.
  • Zhang et al. (2023c) Shuo Zhang, Liangming Pan, Junzhou Zhao, and William Yang Wang. 2023c. Mitigating language model hallucination with interactive question-knowledge alignment. arXiv preprint arXiv:2305.13669.
  • Zhang et al. (2023d) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023d. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
  • Zhao et al. (2024) Linxi Zhao, Yihe Deng, Weitong Zhang, and Quanquan Gu. 2024. Mitigating object hallucination in large vision-language models via classifier-free guidance. arXiv preprint arXiv:2402.08680.
  • Zhao et al. (2022) Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. 2022. Learning video representations from large language models.
  • Zhao et al. (2023) Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. 2023. Beyond hallucinations: Enhancing lvlms through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839.
  • Zhou et al. (2018) Luowei Zhou, Chenliang Xu, and Jason Corso. 2018. Towards automatic learning of procedures from web instructional videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Zhou et al. (2024) Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, and Cordelia Schmid. 2024. Streaming dense video captioning. arXiv preprint arXiv:2404.01297.
  • Zhou et al. (2023) Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. 2023. Analyzing and mitigating object hallucination in large vision-language models. arXiv preprint arXiv:2310.00754.
  • Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models.
  • Zhu and Duan (2024) Ge Zhu and Zhiyao Duan. 2024. Cacophony: An improved contrastive audio-text model. arXiv preprint arXiv:2402.06986.