Showing 1–2 of 2 results for author: Ngu, N

Search v0.5.6 released 2020-02-24

arXiv:2308.11189 [pdf, other]

cs.CL cs.AI cs.LG

Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries

Authors: Noel Ngu, Nathaniel Lee, Paulo Shakarian

Abstract: Error prediction in large language models often relies on domain-specific information. In this paper, we present measures for quantification of error in the response of a large language model based on the diversity of responses to a given prompt - hence independent of the underlying application. We describe how three such measures - based on entropy, Gini impurity, and centroid distance - can be employed. We perform a suite of experiments on multiple datasets and temperature settings to demonstrate that these measures strongly correlate with the probability of failure. Additionally, we present empirical results demonstrating how these measures can be applied to few-shot prompting, chain-of-thought reasoning, and error detection.

Submitted 22 August, 2023; originally announced August 2023.

Report number: Accepted to IEEE ICSC '24

arXiv:2302.13814 [pdf, other]

cs.CL cs.AI cs.LG

An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)

Authors: Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, Lakshmivihari Mareedu

Abstract: We study the performance of a commercially available large language model (LLM) known as ChatGPT on math word problems (MWPs) from the dataset DRAW-1K. To our knowledge, this is the first independent evaluation of ChatGPT. We found that ChatGPT's performance changes dramatically based on the requirement to show its work, failing 20% of the time when it provides work compared with 84% when it does not. Further, several factors about MWPs relating to the number of unknowns and number of operations lead to a higher probability of failure when compared with the prior; specifically, we note (across all experiments) that the probability of failure increases linearly with the number of addition and subtraction operations. We have also released the dataset of ChatGPT's responses to the MWPs to support further work on the characterization of LLM performance, and we present baseline machine learning models to predict if ChatGPT can correctly answer an MWP.

Submitted 27 February, 2023; v1 submitted 23 February, 2023; originally announced February 2023.

Journal ref: AAAI Spring Symposium 2023 (MAKE)
