Showing 1–1 of 1 results for author: Aljeraisy, L

  1. arXiv:2311.04235  [pdf, other]

    cs.AI cs.CL cs.LG

    Can LLMs Follow Simple Rules?

    Authors: Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Basel Alomair, Dan Hendrycks, David Wagner

    Abstract: As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Existing evaluations of adversarial attack…

    Submitted 8 March, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

    Comments: Project website: https://eecs.berkeley.edu/~normanmu/llm_rules; revised content