Showing 1–8 of 8 results for author: Kamoi, R

  1. arXiv:2406.01297

    cs.CL

    When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs

    Authors: Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, Rui Zhang

    Abstract: Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during inference. Prior work has proposed various self-correction frameworks using different sources of feedback, including self-evaluation and external feedback. However, there is still no consensus on the question of when LLMs can correct their own mistakes, as recent stud…

    Submitted 3 June, 2024; originally announced June 2024.

  2. arXiv:2404.03602

    cs.CL

    Evaluating LLMs at Detecting Errors in LLM Responses

    Authors: Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang

    Abstract: With Large Language Models (LLMs) being widely used across various tasks, detecting errors in their responses is increasingly crucial. However, little research has been conducted on error detection of LLM responses. Collecting error annotations on LLM responses is challenging due to the subjective nature of many NLP tasks, and thus previous research focuses on tasks of little practical value (e.g.…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Benchmark and code: https://github.com/psunlpgroup/ReaLMistake

  3. arXiv:2311.09805

    cs.CL

    DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data

    Authors: Yilun Zhao, Yitao Long, Hongjun Liu, Linyong Nan, Lyuhao Chen, Ryo Kamoi, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan

    Abstract: Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, is still largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning and problem-solving capa…

    Submitted 16 November, 2023; originally announced November 2023.

    Comments: Work in progress

  4. arXiv:2311.07884

    cs.CL

    Fair Abstractive Summarization of Diverse Perspectives

    Authors: Yusen Zhang, Nan Zhang, Yixin Liu, Alexander Fabbri, Junru Liu, Ryo Kamoi, Xiaoxin Lu, Caiming Xiong, Jieyu Zhao, Dragomir Radev, Kathleen McKeown, Rui Zhang

    Abstract: People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work in summarization metrics and Large Language Models (LLMs) evaluation h…

    Submitted 29 March, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: NAACL 2024

  5. arXiv:2303.01432

    cs.CL

    WiCE: Real-World Entailment for Claims in Wikipedia

    Authors: Ryo Kamoi, Tanya Goyal, Juan Diego Rodriguez, Greg Durrett

    Abstract: Textual entailment models are increasingly applied in settings like fact-checking, presupposition verification in question answering, or summary evaluation. However, these represent a significant domain shift from existing entailment datasets, and models underperform as a result. We propose WiCE, a new fine-grained textual entailment dataset built on natural claim and evidence pairs extracted from…

    Submitted 22 October, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: EMNLP 2023

  6. arXiv:2210.06748

    cs.CL

    Shortcomings of Question Answering Based Factuality Frameworks for Error Localization

    Authors: Ryo Kamoi, Tanya Goyal, Greg Durrett

    Abstract: Despite recent progress in abstractive summarization, models often generate summaries with factual errors. Numerous approaches to detect these errors have been proposed, the most popular of which are question answering (QA)-based factuality metrics. These have been shown to work well at predicting summary-level factuality and have potential to localize errors within summaries, but this latter capa…

    Submitted 11 February, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: EACL 2023

  7. arXiv:2003.00402

    stat.ML cs.CV cs.LG

    Why is the Mahalanobis Distance Effective for Anomaly Detection?

    Authors: Ryo Kamoi, Kei Kobayashi

    Abstract: The Mahalanobis distance-based confidence score, a recently proposed anomaly detection method for pre-trained neural classifiers, achieves state-of-the-art performance on both out-of-distribution (OoD) and adversarial examples detection. This work analyzes why this method exhibits such strong performance in practical settings while imposing an implausible assumption; namely, that class conditional…

    Submitted 30 April, 2020; v1 submitted 29 February, 2020; originally announced March 2020.

  8. arXiv:1911.06515

    stat.ML cs.LG

    Likelihood Assignment for Out-of-Distribution Inputs in Deep Generative Models is Sensitive to Prior Distribution Choice

    Authors: Ryo Kamoi, Kei Kobayashi

    Abstract: Recent work has shown that deep generative models assign higher likelihood to out-of-distribution inputs than to training data. We show that a factor underlying this phenomenon is a mismatch between the nature of the prior distribution and that of the data distribution, a problem found in widely used deep generative models such as VAEs and Glow. While a typical choice for a prior distribution is a…

    Submitted 15 November, 2019; originally announced November 2019.