Turbulence: Systematically and Automatically Testing Instruction-Tuned Large Language Models for Code

Authors: Shahin Honarvar, Mark van der Wilk, Alastair Donaldson

Abstract: We present a method for systematically evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence. Turbulence consists of a large set of natural language $\textit{question templates}$, each of which is a programming problem, parameterised so that it can be asked in many different forms. Each question template has an associated $\textit{test oracle}$ that judges whether a code solution returned by an LLM is correct. Thus, from a single question template, it is possible to ask an LLM a $\textit{neighbourhood}$ of very similar programming questions, and assess the correctness of the result returned for each question. This allows gaps in an LLM's code generation abilities to be identified, including $\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$ questions in a neighbourhood but fails for particular parameter instantiations. We present experiments against five LLMs from OpenAI, Cohere and Meta, each at two temperature configurations. Our findings show that, across the board, Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond merely highlighting that LLMs sometimes produce wrong code (which is no surprise): by systematically identifying cases where LLMs are able to solve some problems in a neighbourhood but do not manage to generalise to solve the whole neighbourhood, our method is effective at highlighting $\textit{robustness}$ issues. We present data and examples that shed light on the kinds of mistakes that LLMs make when they return incorrect code results.
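The idea of a question template, its test oracle, and a neighbourhood of instantiated questions can be illustrated with a minimal sketch. This is not the Turbulence implementation: the `QuestionTemplate` class, the example prompt, the oracle's test inputs, and the `fake_llm_solution` stand-in (which deliberately fails for one parameter value) are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Solution = Callable[[List[int]], int]


@dataclass
class QuestionTemplate:
    """A parameterised programming question plus an oracle (hypothetical structure)."""
    text: str                                # prompt with a {n} placeholder
    oracle: Callable[[int, Solution], bool]  # judges a candidate solution for a given n

    def instantiate(self, n: int) -> str:
        return self.text.format(n=n)


def sum_first_n_oracle(n: int, candidate: Solution) -> bool:
    """Check the candidate against a reference on a few fixed inputs (assumed test data)."""
    cases = [list(range(10)), [5] * 12, [], [1, 2]]
    return all(candidate(xs) == sum(xs[:n]) for xs in cases)


template = QuestionTemplate(
    text="Write a function that returns the sum of the first {n} elements of a list.",
    oracle=sum_first_n_oracle,
)


def fake_llm_solution(n: int) -> Solution:
    """Stand-in for an LLM call: correct everywhere except n == 3 (off-by-one)."""
    if n == 3:
        return lambda xs: sum(xs[: n + 1])
    return lambda xs: sum(xs[:n])


def evaluate_neighbourhood(
    tmpl: QuestionTemplate,
    params: Iterable[int],
    solve: Callable[[int], Solution],
) -> Dict[int, bool]:
    """Ask every question in the neighbourhood and record the oracle's verdict."""
    return {n: tmpl.oracle(n, solve(n)) for n in params}


results = evaluate_neighbourhood(template, range(1, 6), fake_llm_solution)
failing = [n for n, ok in results.items() if not ok]
print(template.instantiate(3))
print(results, failing)
```

In this toy neighbourhood the stand-in solves four of the five instantiations and fails only for `n == 3`, which is the kind of isolated failure the abstract calls an anomaly.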

Submitted 14 January, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

Comments: Modified a typo in the conclusion section regarding the impact of temperature reduction on the diversity of errors

arXiv:2210.04173 [pdf, other]

doi 10.1093/mnras/stad558

Thin accretion disk luminosity and its image around rotating black holes in perfect fluid dark matter

Authors: Malihe Heydari-Fard, Sara Ghassemi Honarvar, Mohaddese Heydari-Fard

Abstract: Motivated by the fact that the universe is dominated by dark matter and dark energy, we consider rotating black holes surrounded by perfect fluid dark matter and study the accretion process in a thin disk around such black holes. Here, we are interested in how the presence of dark matter affects the properties of the electromagnetic radiation emitted from a thin accretion disk. For this purpose, we use the Novikov-Thorne model to obtain the electromagnetic spectrum of an accretion disk around a rotating black hole in perfect fluid dark matter and compare it with the general relativistic case. The results indicate that, for the small values of the dark matter parameter considered here, the size of the innermost stable circular orbits would decrease and thus the electromagnetic spectrum of the accretion disk increases. Therefore, disks in the presence of perfect fluid dark matter are hotter and more luminous than in general relativity. Finally, we construct thin accretion disk images around these black holes using the numerical ray-tracing technique. We show that the inclination angle has a remarkable effect on the images, while the effect of the dark matter parameter is small.

Submitted 8 March, 2023; v1 submitted 9 October, 2022; originally announced October 2022.

Comments: 9 pages (two columns), 9 figures

Journal ref: Monthly Notices of the Royal Astronomical Society, 521 (2023) 708-716

arXiv:2110.10791 [pdf, other]

Modeling Human-Human Collaboration: A Connection Between Inter-Personal Motor Synergy and Consensus Algorithms

Authors: Sara Honarvar, Jin-Oh Hahn, Tim Kiemel, Jae Kun Shim, Yancy Diaz-Mercado

Abstract: Many day-to-day activities involve people working collaboratively toward reaching a desired outcome. Previous research in motor control and neuroscience has proposed inter-personal motor synergy (IPMS) as a mechanism of collaboration between people, referring to the idea of how two or more people may work together "as if they were one" to coordinate their motion. In the motor control literature, the uncontrolled manifold (UCM) approach is used for quantifying IPMS. According to this approach, coordinated motion is achieved through stabilization of a performance variable (e.g., an output in a collaborative output tracking task). We show that the UCM approach is closely related to the well-studied consensus approach in multi-agent systems that concerns processes by which a set of interacting agents agree on a shared objective. To explore the connection between these two approaches, in this paper, we provide a control-theoretic model that represents the systems-level behaviors in a collaborative task. In particular, we utilize the consensus protocol and show how the model can be systematically tuned to reproduce the behavior exhibited by human-human collaboration experiments. We discuss the association between the proposed control law and the UCM approach and validate our model using experimental results previously collected from an inter-personal finger force production task.
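For readers unfamiliar with the consensus approach mentioned in the abstract, the textbook discrete-time averaging protocol can be sketched as follows. This is the standard form from the multi-agent systems literature, not the paper's tuned model: the two-agent graph, the step size `eps`, and the initial values are assumptions picked for illustration.

```python
from typing import Dict, List


def consensus_step(x: List[float], neighbours: Dict[int, List[int]], eps: float) -> List[float]:
    """One update: each agent moves toward its neighbours' current values."""
    return [
        xi + eps * sum(x[j] - xi for j in neighbours[i])
        for i, xi in enumerate(x)
    ]


def run_consensus(
    x0: List[float],
    neighbours: Dict[int, List[int]],
    eps: float = 0.2,
    steps: int = 50,
) -> List[float]:
    """Iterate the protocol from the initial state x0 for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        x = consensus_step(x, neighbours, eps)
    return x


# Two agents that can each observe the other (assumed undirected graph).
neighbours = {0: [1], 1: [0]}
final = run_consensus([2.0, 6.0], neighbours)
print(final)
```

With a symmetric graph and a sufficiently small step size, the agents' values converge to the average of their initial states, which is the sense in which interacting agents "agree" under this protocol.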

Submitted 20 October, 2021; originally announced October 2021.

Showing 1–3 of 3 results for author: Honarvar, S