DPP-Based Adversarial Prompt Searching for Language Models

Zhang, Xu; Wan, Xiaojun

Abstract: Language models risk generating mindless and offensive content, which hinders their safe deployment. It is therefore crucial to discover and modify potentially toxic outputs of pre-trained language models before deployment. In this work, we elicit toxic content by automatically searching for a prompt that directs a pre-trained language model towards generating a specific target output. The problem is challenging because textual data is discrete and because even a single forward pass of the language model requires considerable computational resources. To address these challenges, we introduce Auto-regressive Selective Replacement Ascent (ASRA), a discrete optimization algorithm that selects prompts based on both quality and similarity using a determinantal point process (DPP). Experimental results on six different pre-trained language models demonstrate the efficacy of ASRA for eliciting toxic content. Furthermore, our analysis reveals a strong correlation between the success rate of ASRA attacks and the perplexity of target outputs, while indicating only a limited association with the number of model parameters.
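The abstract does not spell out how ASRA's selection step works, so the following is only a minimal sketch of what selecting candidates by quality and similarity with a DPP can look like in general. It builds a quality-weighted kernel L[i][j] = q_i * sim(i, j) * q_j and greedily picks the subset whose kernel submatrix has the largest determinant. The function names, the cosine similarity, the toy scores, and the greedy procedure are all assumptions made for illustration, not the authors' implementation.

```python
import math


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]  # work on a copy
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # singular (or numerically singular) matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def build_kernel(quality, features):
    """Hypothetical DPP kernel: L[i][j] = q_i * sim(i, j) * q_j."""
    n = len(quality)
    return [
        [quality[i] * cosine(features[i], features[j]) * quality[j] for j in range(n)]
        for i in range(n)
    ]


def greedy_dpp_select(kernel, k):
    """Greedily add the item that maximizes det(L_S) until k items are chosen."""
    selected = []
    remaining = list(range(len(kernel)))
    while remaining and len(selected) < k:
        best, best_det = None, 0.0
        for i in remaining:
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every extension is degenerate; stop early
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (assumed): candidates 0 and 1 are near-duplicates with high
# quality, candidate 2 is dissimilar but has lower quality.
quality = [0.9, 0.85, 0.5]
features = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]
chosen = greedy_dpp_select(build_kernel(quality, features), k=2)
print(chosen)
```

In this toy setting the determinant penalizes the near-duplicate pair, so the selection trades some quality for diversity. A real prompt search would derive quality and similarity from model signals rather than hand-written numbers.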

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2403.00292 [cs.CL]
	(or arXiv:2403.00292v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.00292
