
Showing 1–3 of 3 results for author: Panickssery, A

  1. arXiv:2407.04108  [pdf, other]

    cs.CR cs.CL cs.LG

    Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

    Authors: Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland

    Abstract: Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often only contains information about events that have already occurred, a component of a simple backdoor trigger could be a model recognizing data that is i…

    Submitted 4 July, 2024; originally announced July 2024.

  2. arXiv:2404.13076  [pdf, other]

    cs.CL cs.AI

    LLM Evaluators Recognize and Favor Their Own Generations

    Authors: Arjun Panickssery, Samuel R. Bowman, Shi Feng

    Abstract: Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators cons…

    Submitted 15 April, 2024; originally announced April 2024.

  3. arXiv:2401.05604  [pdf, other]

    cs.CL cs.AI cs.CV cs.CY

    REBUS: A Robust Evaluation Benchmark of Understanding Symbols

    Authors: Andrew Gritsevskiy, Arjun Panickssery, Aaron Kirtland, Derik Kauffman, Hans Gundlach, Irina Gritsevskaya, Joe Cavanagh, Jonathan Chiang, Lydia La Roux, Michelle Hung

    Abstract: We propose a new benchmark evaluating the performance of multimodal large language models on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To achieve good performance on the benchmark of identifying the clued word or phrase, models must combine image recognition and string manipulation with h…

    Submitted 3 June, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: 20 pages, 5 figures. For code, see http://github.com/cvndsh/rebus