ACL-Findings’24
$k$-SemStamp: A Clustering-Based Semantic Watermark for
Detection of Machine-Generated Text

Abe Bohan Hou♣  **gyu Zhang♣  Yichen Wang♢
Daniel Khashabi♣  Tianxing He♡
♣Johns Hopkins University  ♡University of Washington  ♢Xi’an Jiaotong University
{bhou4, jzhan237}@jhu.edu [email protected]
Abstract

Recent watermarked generation algorithms inject detectable signatures during language generation to facilitate post-hoc detection. While token-level watermarks are vulnerable to paraphrase attacks, SemStamp (Hou et al., 2023) applies the watermark to the semantic representation of sentences and demonstrates promising robustness. SemStamp employs locality-sensitive hashing (LSH) to partition the semantic space with arbitrary hyperplanes, which may lead to a suboptimal trade-off between robustness and speed. We propose $k$-SemStamp, a simple yet effective enhancement of SemStamp that uses $k$-means clustering as an alternative to LSH, partitioning the embedding space with awareness of its inherent semantic structure. Experimental results indicate that $k$-SemStamp markedly improves robustness and sampling efficiency while preserving generation quality, advancing a more effective tool for machine-generated text detection.

1 Introduction

To facilitate the detection of machine-generated text (Mitchell et al., 2019), recent watermarked generation algorithms usually inject detectable signatures (Kuditipudi et al., 2023; Yoo et al., 2023; Wang et al., 2023; Christ et al., 2023; Fu et al., 2023; Hou et al., 2023, i.a.). A major concern for these approaches is their robustness to potential attacks, since a malicious user could attempt to remove the watermark with text perturbations such as editing and paraphrasing (Wang et al., 2024; Krishna et al., 2023; Sadasivan et al., 2023; Kirchenbauer et al., 2023b; Zhao et al., 2023). Hou et al. (2023) propose SemStamp, a paraphrase-robust, sentence-level watermark that assigns a signature to each watermarked sentence according to a locality-sensitive hashing (LSH) (Indyk and Motwani, 1998) partitioning of the semantic space (see §2.1). While demonstrating promising robustness against paraphrase attacks, SemStamp arbitrarily partitions the semantic space with a set of random hyperplanes, possibly splitting semantically similar sentences into different partitions (see Fig. 1).

Figure 1: Illustrations of the semantic space. Sentence embeddings with close meanings share similar colors. (Left) Random planes from LSH arbitrarily partition the semantic space and split similar sentences into different regions. (Right) Margin-based rejection in $k$-SemStamp. Sentence embeddings which fall into the gray-shaded areas of a valid region will be rejected.

This limitation motivates our proposed method, $k$-SemStamp (detailed in §2.2), which partitions the space via $k$-means clustering (Lloyd, 1982) on the semantic structure of a given text domain (e.g., news, narratives, etc.). In §3, we show that the clustering-based partitioning in $k$-SemStamp greatly improves its robustness against sentence-level paraphrase attacks and its sampling efficiency.¹

¹ We have released the code for reproducibility. Corresponding authors: Abe Hou, **gyu Zhang, and Tianxing He.

Figure 2: An overview of the proposed $k$-SemStamp algorithm. $k$-means clustering partitions the semantic space into semantically similar regions. The sentence generation is accepted if the closest cluster of its sentence embedding corresponds to a "valid" region in the semantic space.

Prompt: In Chapter 18, Richard begins at Kenge and Carboy’s.
Non-Watermarked Generation: He goes to the inn where Mr. Kenge has been let off by the landlord. There, he meets a woman named Hannah, who is looking for him. He asks her where he is wanted.
SStamp: He meets up with Lydgate, who is there to see if the money from the deal is still there. The lawyers are ready to go to trial, but Richard says he has a better plan. He wants to leave Middlemarch for good.
$k$-SStamp: He also sees Adam for the first time since his imprisonment. They discuss the latest updates in their respective personal lives. Adam is living with Dinah and is still angry with Adam for having to leave him.

Figure 3: Generation examples of $k$-SemStamp compared with SemStamp. Both generations are contextually sensible and coherent as compared to non-watermarked generations. Additional examples after paraphrase are presented in Figure 5 in the Appendix.
Algorithm 1 $k$-SemStamp text generation algorithm and subroutines
Input: language model $P_{\text{LM}}$, prompt $s^{(0)}$, the text domain $\mathcal{D}$, the number of sentences to generate $T$.
Params: sentence embedding model fine-tuned on $\mathcal{D}$, $M_{\text{embd}}^{\mathcal{D}}$, with embedding dimension $h$, maxout number $N_{\text{max}}$, margin $m>0$, valid region ratio $\gamma\in(0,1)$, the number of $k$-means clusters $K$, a large prime number $p$, an integer $N$.
Output: generated sequence $s^{(1)}\dots s^{(T)}$.
procedure $k$-SemStamp
     $C_K \leftarrow \textsc{Initialize}(\mathcal{D}, K)$ to initialize $K$ cluster centroids based on $\mathcal{D}$.
     for $t=1,2,\dots,T$ do
       1. Find the index of the closest cluster centroid of the previously generated sentence, $q^{(t-1)} \leftarrow \textsc{Assign}(s^{(t-1)}, C_K)$, and use $q^{(t-1)}\cdot p$ as the seed to randomly divide the index set of clusters $C_K$ into a “valid region set” $G^{(t)}$ of size $\gamma\cdot K$ and a “blocked region set” $R^{(t)}$ of size $(1-\gamma)\cdot K$.
       2. repeat Sample a new sentence from LM,
          until the index of the closest cluster centroid of the new sentence, $q^{(t)}$, is in the “valid region set”, and the margin requirement $\textsc{Margin}(s^{(t)}, m)$ is satisfied or sampling has repeated over $N_{\text{max}}$ times.
       3. Append the selected sentence $s^{(t)}$ to context.
     end for
     return $s^{(1)}\dots s^{(T)}$
end procedure
function Initialize($\mathcal{D}, K$)
     $\mathcal{D}'_N \sim \mathcal{D}$ // sample $N$ sentences from $\mathcal{D}$
     $C_K \leftarrow \textsc{k-means}(\mathcal{D}'_N, K)$ // obtain $k$ cluster centroids
     return $C_K$
end function
function Assign($s, C_K$) // find the index of the closest centroid by cosine distance
     return $\operatorname*{arg\,min}_{i=1,\dots,K} d_{\cos}(v, c_i)$, where $c_i \in C_K$
end function
Algorithm 2 $k$-SemStamp detection algorithm
Input: a piece of text $T$, saved $k$-means cluster centroids $C_K$
Params: sentence embedding model fine-tuned on $\mathcal{D}$, $M_{\text{embd}}^{\mathcal{D}}$, $z$-threshold range $Z$, human-written texts $H$, a large prime number $p$, valid region ratio $\gamma\in(0,1)$, number of $k$-means clusters $K$.
Output: a $z$-score based on the ratio of detected sentences.
procedure Detect($T, C_K$)
     $s_1, \dots, s_N \leftarrow \textsc{Sentence-Tokenize}(T)$
     $q^{(1)} \leftarrow \textsc{Assign}(s_1, C_K)$
     $\texttt{seed} \leftarrow q^{(1)}\cdot p$
     $G^{(1)} \leftarrow \textsc{Random-Sample}(\texttt{seed}, K, \gamma)$ // pseudo-randomly sample a set of cluster centroid indices of size $K\cdot\gamma$, where the randomness of sampling is controlled by seed.
     for $t=2,\dots,N$ do
          $q^{(t)} \leftarrow \textsc{Assign}(s_t, C_K)$
          if $q^{(t)} \in G^{(t-1)}$ then
               $S_V$ += 1
          end if
          $\texttt{seed} \leftarrow q^{(t)}\cdot p$
          $G^{(t)} \leftarrow \textsc{Random-Sample}(\texttt{seed}, K, \gamma)$
     end for
end procedure
$z \leftarrow \frac{S_V - \gamma N}{\sqrt{\gamma(1-\gamma)N}}$
return $z$

2 Approach

We first review the existing watermark algorithms for machine-generated text detection (§2.1) and introduce our proposed watermark (§2.2).

2.1 Preliminaries

Token-Level Watermark

Kirchenbauer et al. (2023a) develop a notable token-level watermark algorithm. Given a token history $w_{1:t-1}$, the vocabulary $V$ is pseudo-randomly divided into a “green list” $G^{(t)}$ and a “red list” $R^{(t)}$, where a hash of the previous token $w_{t-1}$ is used as the seed of the partition. The algorithm then adds a bias to the logits of all tokens in the green list and samples the next token, which is now more likely to come from the green list. For a given piece of text, the watermark can be detected by conducting a one-proportion $z$-test (detailed in §C) on the number of green-list tokens.
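The green-list mechanism can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the implementation of Kirchenbauer et al. (2023a): the names `green_list`, `biased_probs`, `gamma`, `delta`, and `key` are hypothetical, and Python's `random.Random` stands in for whatever hash-seeded generator a real system would use.

```python
import math
import random

def green_list(prev_token_id, vocab_size, gamma=0.5, key=15485863):
    """Pseudo-randomly pick green-list token ids, seeded by the previous token."""
    rng = random.Random(prev_token_id * key)  # assumed form of the "hash"
    n_green = int(gamma * vocab_size)
    return set(rng.sample(range(vocab_size), n_green))

def biased_probs(logits, green, delta=2.0):
    """Add a bias delta to green-list logits, then apply a softmax."""
    shifted = [x + delta if i in green else x for i, x in enumerate(logits)]
    top = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0] * 10  # toy uniform logits over a 10-token vocabulary
green = green_list(prev_token_id=3, vocab_size=10)
probs = biased_probs(logits, green)
green_mass = sum(probs[i] for i in green)  # probability of sampling a green token
```

With uniform logits, half the vocabulary in the green list would receive half the probability mass without the bias; after the bias the green list receives most of it, which is the statistical signal the $z$-test later picks up.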

SemStamp

Building on the intuition that sentence-level paraphrasing typically modifies tokens but preserves sentence meaning, Hou et al. (2023) introduce SemStamp, which applies the watermark to sentence semantics by partitioning the embedding space with locality-sensitive hashing (LSH).

To initialize the LSH partitioning, $d$ normal vectors are randomly sampled from a Gaussian distribution to specify $d$ hyperplanes in the semantic space $\mathbb{R}^{h}$. For an embedding vector $v\in\mathbb{R}^{h}$, a $d$-bit binary LSH signature is assigned, where each digit specifies the position of $v$ in relation to each hyperplane. Each signature $c\in\{0,1\}^{d}$ indexes a region consisting of all vectors with signature $c$.
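The signature computation described above can be sketched as follows. This is a minimal illustration, assuming that a bit is 1 when the embedding lies on the positive side of a hyperplane; the function names and the toy dimensions are hypothetical and not taken from the SemStamp code.

```python
import random

def sample_hyperplanes(d, h, seed=0):
    """Draw d Gaussian normal vectors in R^h, one per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(h)] for _ in range(d)]

def lsh_signature(v, normals):
    """Return the d-bit signature: one bit per hyperplane side."""
    bits = []
    for n in normals:
        dot = sum(a * b for a, b in zip(n, v))
        bits.append(1 if dot > 0 else 0)
    return tuple(bits)

normals = sample_hyperplanes(d=3, h=4)
v = [0.2, -1.0, 0.5, 0.3]
sig = lsh_signature(v, normals)
# Rescaling a vector by a positive factor keeps it on the same side of every
# hyperplane through the origin, so the signature is unchanged.
scaled_sig = lsh_signature([10 * x for x in v], normals)
```

Because the hyperplanes are drawn independently of the data, two embeddings that are close in meaning can still receive different signatures when a hyperplane happens to pass between them.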

During generation, given a sentence history denoted by $s^{(0)}\dots s^{(t-1)}$, the space of signatures is pseudorandomly partitioned into a set of “valid” regions $G^{(t)}$ and a set of “blocked” regions $R^{(t)}$. The LSH signature of the last generated sentence is used as the random seed of this partition. A newly generated sentence $s^{(t)}$ is accepted if its embedding belongs to any valid region, and rejected otherwise. To detect the watermark in a given piece of text, a one-proportion $z$-test is performed on the number of sentences whose signatures belong to valid regions (see §C).

2.2 $k$-SemStamp

As discussed earlier, SemStamp partitions the semantic space with random planes, which could potentially separate semantically similar sentences into two different regions, as shown in Fig. 1. Paraphrasing sentences near the margins of regions may shift their sentence embeddings to a nearby region, resulting in suboptimal watermark strength. This weakness motivates our proposed $k$-SemStamp, a simple yet effective enhancement of SemStamp that partitions the semantic space with $k$-means clustering (Lloyd, 1982).

To initialize $k$-SemStamp, we assume the language model generates text in a specific domain $\mathcal{D}$ (e.g., news articles, scientific articles, etc.). We aim to model the semantic structure of $\mathcal{D}$ and partition its semantic space into $k$ regions. Concretely, we first randomly sample a large amount of data from $\mathcal{D}$. We obtain the sentence embeddings of this sample with a robust sentence encoder fine-tuned on $\mathcal{D}$ with contrastive learning (detailed in §A). We cluster the sentence embeddings into $K$ clusters with $k$-means (Lloyd, 1982) and save the cluster centroids. We index a region with $i\in\{1,\dots,K\}$, representing the set of all vectors assigned to the $i$-th centroid.
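The clustering step can be sketched with a plain version of Lloyd's algorithm. This is a toy illustration only: it clusters 2-D points with squared Euclidean distance, whereas the paper clusters sentence embeddings from a fine-tuned encoder; the function name `kmeans` and the synthetic data are assumptions made for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm: returns k centroids for equal-length vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Two well-separated synthetic blobs standing in for two semantic clusters.
rng = random.Random(1)
blob_a = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]
blob_b = [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(50)]
centroids = kmeans(blob_a + blob_b, k=2)
```

Only the saved centroids are needed afterwards: both generation and detection assign a sentence to a region by finding its closest centroid.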

The generation process is analogous to SemStamp (Hou et al., 2023), as illustrated in Fig. 2: given a sentence history $s^{(0)}\dots s^{(t-1)}$, the $K$ regions are pseudorandomly partitioned into a set of valid regions $G^{(t)}$ of size $\gamma\cdot K$ and a set of blocked regions $R^{(t)}$ of size $(1-\gamma)\cdot K$, where $\gamma\in(0,1)$ is the ratio of valid regions. The cluster assignment of $s^{(t-1)}$, $C(s^{(t-1)})$, seeds the randomness of the partition at time step $t$, where $C(\cdot)$ returns the cluster index by finding the closest cluster centroid of the input sentence embedding. We then conduct rejection sampling: only sentences whose embeddings fall into a valid region (i.e., $C(s)\in G^{(t)}$) are accepted, and the rest are rejected. If no valid sentence is accepted after a preset maxout number ($N_{\text{max}}$) of tries, the last decoded sentence is chosen. The full algorithm is presented in Algo 1.
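One generation step of this rejection-sampling loop can be sketched as below. It is a simplified illustration under several assumptions: `sample_fn` and `assign_fn` are hypothetical stand-ins for the language model and for the encoder plus nearest-centroid lookup, the prime `p` is an arbitrary choice, Python's `random.Random` plays the role of the seeded partition, and the margin requirement is omitted.

```python
import random

def valid_regions(prev_cluster, K, gamma, p=15485863):
    """Seed a PRNG with prev_cluster * p and pick gamma * K valid cluster indices."""
    rng = random.Random(prev_cluster * p)
    return set(rng.sample(range(K), int(gamma * K)))

def generate_sentence(sample_fn, assign_fn, prev_sentence, K=8, gamma=0.25, n_max=100):
    """Rejection-sample until a candidate lands in a valid region or n_max is hit."""
    valid = valid_regions(assign_fn(prev_sentence), K, gamma)
    candidate = None
    for _ in range(n_max):
        candidate = sample_fn()
        if assign_fn(candidate) in valid:
            break
    return candidate  # after n_max failures this is the last decoded sentence

# Toy stand-ins: "sentences" are integers and the cluster index is value mod 8.
rng = random.Random(0)
sample_fn = lambda: rng.randrange(1000)
assign_fn = lambda s: s % 8
sentence = generate_sentence(sample_fn, assign_fn, prev_sentence=42)
```

Since the detector can recompute the same partition from the previous sentence's cluster index alone, no state other than the centroids, `p`, `K`, and `gamma` has to be shared between generation and detection.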

Cluster Margin Constraint

To prevent the sampled sentences from being assigned to a nearby cluster after paraphrasing, we propose a cluster margin constraint similar to that of Hou et al. (2023). We constrain the sentence embeddings to be sufficiently far from the cluster boundaries (visualized in Fig. 1). Concretely, the cosine distance ($d_{\cos}$) from the candidate sentence embedding ($v$) to its closest centroid ($c_q$) needs to be smaller than its distance to every other cluster centroid by at least a margin $m$:

$$d_{\cos}(v, c_q) < \min_{i\in\{1,\dots,K\}\setminus q} d_{\cos}(v, c_i) - m, \qquad (1)$$

where $q$ is the index of the closest cluster centroid to $v$, i.e., $q=\operatorname*{arg\,min}_{i=1,\dots,K} d_{\cos}(v, c_i)$, and $v=M_{\text{embd}}(s^{(t)})$ is the embedding of the generated sentence at time step $t$ by a robust sentence embedder $M_{\text{embd}}$.
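Eq. (1) translates directly into a small check. The sketch below assumes at least two centroids and uses toy 2-D vectors in place of real sentence embeddings; the helper names `cosine_distance` and `satisfies_margin` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def satisfies_margin(v, centroids, m):
    """Eq. (1): the closest centroid must beat every other centroid by at least m."""
    dists = [cosine_distance(v, c) for c in centroids]
    q = dists.index(min(dists))  # index of the closest centroid
    runner_up = min(d for i, d in enumerate(dists) if i != q)
    return dists[q] < runner_up - m

centroids = [[1.0, 0.0], [0.0, 1.0]]
near_centroid = satisfies_margin([1.0, 0.05], centroids, m=0.035)  # True
near_boundary = satisfies_margin([1.0, 0.98], centroids, m=0.035)  # False
```

The second vector is almost equidistant from both centroids, so a small shift in its embedding after paraphrasing could flip its cluster assignment; rejecting such candidates is what the margin is for.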

The detection procedure of $k$-SemStamp is analogous to that of SemStamp, which uses a one-proportion $z$-test on the number of sentences that belong to valid regions, as explained in §C and Algo 2.
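The detection statistic can be sketched as follows, working directly on a list of cluster indices (one per sentence) rather than on raw text. The $z$-score follows the formula given in Algo 2; the seeding via `random.Random` and the prime `p` are assumptions made for this example, and a real detector would have to use the same partition function as the generator.

```python
import math
import random

def valid_regions(cluster, K, gamma, p=15485863):
    """Recompute the valid-region set seeded by a sentence's cluster index."""
    rng = random.Random(cluster * p)
    return set(rng.sample(range(K), int(gamma * K)))

def detect(cluster_ids, K=8, gamma=0.25):
    """Return the z-score; cluster_ids[t] is the cluster index of sentence t."""
    n = len(cluster_ids)
    s_v = 0  # number of sentences that fall in a valid region
    valid = valid_regions(cluster_ids[0], K, gamma)
    for q in cluster_ids[1:]:
        if q in valid:
            s_v += 1
        valid = valid_regions(q, K, gamma)
    return (s_v - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)

# Toy check: a sequence that always lands in a valid region vs. one that never does.
marked, unmarked = [0], [0]
for _ in range(8):
    marked.append(min(valid_regions(marked[-1], 8, 0.25)))
    blocked = set(range(8)) - valid_regions(unmarked[-1], 8, 0.25)
    unmarked.append(min(blocked))
z_marked = detect(marked)      # 8 of 9 sentences counted as valid
z_unmarked = detect(unmarked)  # 0 of 9 sentences counted as valid
```

Under the null hypothesis of unwatermarked text, roughly a fraction `gamma` of sentences is expected to land in valid regions, so a large positive $z$-score indicates the watermark.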

3 Experiments

3.1 Experimental Setup

Following Hou et al. (2023), we conduct paraphrase attack experiments and compare the detection robustness of watermarked generations.

Task and Metrics

We evaluate 1000 watermarked generations after paraphrase on each of the RealNews subset of the C4 dataset (Raffel et al., 2020) and the BookSum dataset (Kryściński et al., 2021). We paraphrase watermarked generations sentence-by-sentence with the Pegasus paraphraser (Zhang et al., 2020), Parrot as used in Sadasivan et al. (2023), and GPT-3.5-Turbo (OpenAI, 2022). We also implement the strong bigram paraphrase attack as detailed in Hou et al. (2023). Detection robustness of paraphrased watermarked generations is measured with the area under the receiver operating characteristic curve (AUC) and the true positive rate when the false positive rate is at 1% and 5% (TP@1%, TP@5%).² Generation quality is measured with perplexity (PPL) (using OPT-2.7B (Zhang et al., 2022)), trigram text entropy (Zhang et al., 2018) (Ent-3), i.e., the entropy of the trigram frequency distribution of the generated text, and Sem-Ent (Han et al., 2022), an automatic metric for semantic diversity. Following the setup in Han et al. (2022), we perform $k$-means clustering ($k=50$) with the last hidden states of OPT-2.7B on text generations, and Sem-Ent is defined as the entropy of semantic cluster assignments of test generations. We also measure the paraphrase quality with BERTScore (Zhang et al., 2019) between original generations and their paraphrases.

² We denote machine-generated text as the “positive” class and human text as the “negative” class. A piece of text is classified as machine-generated when its $z$-score exceeds a threshold chosen based on a given false positive rate. See §C.
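The TP@1% and TP@5% metrics can be computed along the following lines. This is a minimal sketch, assuming the threshold is set from the $z$-scores of human-written texts so that at most the requested fraction of them is flagged, and assuming `fpr` is below 1; the function name and the toy scores are hypothetical and do not come from the paper's experiments.

```python
def tpr_at_fpr(human_z, machine_z, fpr):
    """True positive rate at the z-threshold giving at most `fpr` on human text."""
    ranked = sorted(human_z, reverse=True)
    # Number of human texts allowed above the threshold; the small epsilon
    # guards against floating-point round-off in the product.
    allowed = int(fpr * len(ranked) + 1e-9)
    threshold = ranked[allowed]  # texts scoring strictly above this are flagged
    hits = sum(1 for z in machine_z if z > threshold)
    return hits / len(machine_z)

# Toy scores: 100 human texts with z from 0.00 to 2.97, four machine texts.
human_z = [i * 0.03 for i in range(100)]
machine_z = [1.0, 2.9, 3.5, 4.0]
tp_at_1 = tpr_at_fpr(human_z, machine_z, fpr=0.01)  # 0.5
tp_at_5 = tpr_at_fpr(human_z, machine_z, fpr=0.05)  # 0.75
```

A stricter false positive rate raises the threshold, which is why TP@1% is never higher than TP@5% for the same set of scores.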

Each cell: AUC ↑ / TP@1% ↑ / TP@5% ↑

| Domain | Algorithm | No Paraphrase | Pegasus | Pegasus-bigram | Parrot | Parrot-bigram | GPT3.5 | GPT3.5-bigram |
|---|---|---|---|---|---|---|---|---|
| RealNews | KGW | 99.6 / 98.4 / 98.9 | 95.9 / 82.1 / 91.0 | 92.1 / 42.7 / 72.9 | 88.5 / 31.5 / 55.4 | 83.0 / 15.0 / 39.9 | 82.8 / 17.4 / 46.7 | 75.1 / 5.9 / 26.3 |
| RealNews | SIR | 99.9 / 99.4 / 99.9 | 94.4 / 79.2 / 85.4 | 94.1 / 72.6 / 82.6 | 93.2 / 62.8 / 75.9 | 95.2 / 66.4 / 80.2 | 80.2 / 24.7 / 42.7 | 77.7 / 20.9 / 36.4 |
| RealNews | SemStamp | 99.2 / 93.9 / 97.1 | 97.8 / 83.7 / 92.0 | 96.5 / 76.7 / 86.8 | 93.3 / 56.2 / 75.5 | 93.1 / 54.4 / 74.0 | 83.3 / 33.9 / 52.9 | 82.2 / 31.3 / 48.7 |
| RealNews | $k$-SemStamp | 99.6 / 98.1 / 98.7 | 99.5 / 92.7 / 96.5 | 99.0 / 88.4 / 94.3 | 97.8 / 78.7 / 89.4 | 97.5 / 78.3 / 87.3 | 90.8 / 55.5 / 71.8 | 88.9 / 50.2 / 66.1 |
| BookSum | KGW | 99.6 / 99.0 / 99.2 | 97.3 / 89.7 / 95.3 | 96.5 / 56.6 / 85.3 | 94.6 / 42.0 / 75.8 | 93.1 / 37.4 / 71.2 | 87.6 / 17.2 / 52.1 | 77.1 / 4.4 / 27.1 |
| BookSum | SIR | 1.0 / 99.8 / 1.0 | 93.1 / 79.3 / 85.9 | 93.7 / 69.9 / 81.5 | 96.5 / 72.9 / 85.1 | 97.2 / 76.5 / 88.0 | 80.9 / 39.9 / 23.6 | 75.8 / 19.9 / 35.4 |
| BookSum | SemStamp | 99.6 / 98.3 / 98.8 | 99.0 / 94.3 / 97.0 | 98.6 / 90.6 / 95.5 | 98.3 / 83.0 / 91.5 | 98.4 / 85.7 / 92.5 | 89.6 / 45.6 / 62.4 | 86.2 / 37.4 / 53.8 |
| BookSum | $k$-SemStamp | 99.9 / 99.1 / 99.4 | 99.3 / 94.1 / 97.3 | 99.1 / 92.5 / 96.9 | 98.4 / 86.3 / 93.9 | 98.8 / 88.9 / 94.9 | 95.6 / 65.7 / 83.0 | 95.7 / 64.5 / 81.4 |

Table 1: Detection results against various paraphrase attacks. All numbers in each cell are in percentages and correspond to AUC, TP@1%, and TP@5%, respectively. All three metrics prefer higher values. KGW and SIR refer to the watermarks in Kirchenbauer et al. (2023a) and Liu et al. (2023). $k$-SemStamp is more robust than SemStamp and KGW across most paraphrasers and their bigram attack variants and both datasets.
Each cell: AUC ↑ / TP@1% ↑ / TP@5% ↑

| Algorithm | Train Domain | Test Domain | Pegasus | Pegasus-bigram | Parrot | Parrot-bigram |
|---|---|---|---|---|---|---|
| KGW | N/A | BookSum | 97.3 / 89.7 / 95.3 | 96.5 / 56.6 / 85.3 | 94.6 / 42.0 / 75.8 | 93.1 / 37.4 / 71.2 |
| SIR | N/A | BookSum | 93.1 / 79.3 / 85.9 | 93.7 / 69.9 / 81.5 | 96.5 / 72.9 / 85.1 | 97.2 / 76.5 / 88.0 |
| $k$-SStamp | RealNews | BookSum | 98.2 / 78.2 / 94.9 | 97.3 / 70.7 / 93.8 | 96.8 / 65.5 / 90.9 | 96.4 / 61.9 / 89.2 |
| $k$-SStamp | BookSum | BookSum | 99.3 / 94.1 / 97.3 | 99.1 / 92.5 / 96.9 | 98.4 / 86.3 / 93.9 | 98.8 / 88.9 / 94.9 |

Table 2: Ablation study on the detection robustness of $k$-SemStamp (shown as $k$-SStamp) to domain shifts. Bold text marks the highest and underlined text marks the second-highest result. In the face of domain shifts, $k$-SemStamp suffers a drop in performance yet still retains some robustness over the baselines we compare with.

Generation

We use OPT-1.3B (Zhang et al., 2022) as our base autoregressive LM. To obtain robust sentence encoders specific to text domains for k-SemStamp generations, we fine-tune two versions of $M_{\text{embd}}$, one on the RealNews (Raffel et al., 2020) dataset and one on the BookSum (Kryściński et al., 2021) dataset (see §A for the specific procedure and parameter choices).

Following Hou et al. (2023) and Kirchenbauer et al. (2023a), we sample at a temperature of 0.7 with a repetition penalty of 1.05, a prompt length of 32, and a default generation length of 200. Results with various lengths are included in Fig. 4. For k-SemStamp, we perform k-means clustering on embeddings of sentences in 8k paragraphs, separately for RealNews and BookSum. We keep $k = 8$ and a valid region ratio $\gamma = 0.25$, which is consistent with the number of regions in SemStamp, and we use a rejection margin $m = 0.035$.
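As a rough illustration of the partitioning step described above, the sketch below runs Lloyd's k-means on toy 2-D points and marks a $\gamma$ fraction of the $k$ clusters as valid. The toy data, the helper names (`kmeans`, `nearest`, `pick_valid`), the squared Euclidean distance, and the single integer seed used to choose the valid clusters are all assumptions made for illustration. The actual procedure operates on fine-tuned sentence embeddings and is specified in Algorithm 1; the rejection margin is not modeled here.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    # Index of the centroid closest to `point`.
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's algorithm: assign points, then recompute cluster means.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

def pick_valid(k, gamma, seed):
    # Hypothetical helper: pseudo-randomly mark round(gamma * k) clusters as valid.
    rng = random.Random(seed)
    n_valid = max(1, round(gamma * k))
    return set(rng.sample(range(k), n_valid))

# Toy stand-in for sentence embeddings (assumption: 2-D Gaussian points).
data_rng = random.Random(1)
points = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]

centroids = kmeans(points, k=8)
valid = pick_valid(k=8, gamma=0.25, seed=42)
region = nearest(points[0], centroids)
print(f"{len(centroids)} centroids, valid regions {sorted(valid)}, first point in region {region}")
```

With $k = 8$ and $\gamma = 0.25$, two of the eight regions are valid at any time, so a candidate sentence whose embedding lands in either of them would be accepted in this simplified picture.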

Baselines

Our baselines include the popular watermarking algorithm of Kirchenbauer et al. (2023a) (KGW), SemStamp (Hou et al., 2023), Unigram-Watermark (Zhao et al., 2023), and the Semantic Invariant Robust (SIR) watermark of Liu et al. (2023), each implemented with its recommended setup.

3.2 Results

Detection

Detection results in Table 1 show that k-SemStamp is more robust to paraphrase attacks than KGW (Kirchenbauer et al., 2023a) and SemStamp across the Pegasus, Parrot, and GPT-3.5-Turbo paraphrasers and their bigram attack variants, as measured by AUC, TP@1%, and TP@5%. In particular, k-SemStamp demonstrates considerable robustness against GPT-3.5, against which neither SemStamp nor KGW performs strongly. While Unigram-Watermark (Zhao et al., 2023) also demonstrates strong robustness against paraphrase, it has a critical vulnerability to reverse-engineering attacks. We discuss this vulnerability and the experimental results in §D. The BERTScores of paraphrases are presented in Table 5.

Domain Shifts

Since k-SemStamp fine-tunes its sentence embedder on a specified text domain, we investigate how robust the fine-tuned sentence embedder is to inputs from a different domain. In Table 2, we show that k-SemStamp experiences a drop in robustness when using a cross-domain sentence embedder. Nevertheless, k-SemStamp retains some robustness compared to KGW and SIR, staying especially resilient against Pegasus-bigram attacks.

Sampling Efficiency

k-SemStamp not only demonstrates stronger paraphrastic robustness, but also generates sentences with higher sampling efficiency. To produce the results on BookSum (Kryściński et al., 2021) in Table 1, k-SemStamp samples 13.3 sentences on average to accept one valid sentence, which is 36.2% fewer than the 20.9 sentences sampled on average by SemStamp. Analyzing why candidate sentences are rejected, we find that around 42.0% (k-SemStamp) and 80.7% (SemStamp) of the sentences are rejected due to the margin requirement. Since k-SemStamp determines the cluster centroids by k-means clustering on the semantic structure of a given text domain, the embeddings of most candidate sentences generated in this text domain are closer to the centroids and farther from the margins, and they are less likely to relocate to a blocked region after paraphrase.

Quality

Table 3 shows that the perplexity, text diversity, and semantic diversity of both SemStamp and k-SemStamp generations are on par with the base model without watermarking, while KGW and SIR notably degrade perplexity. Qualitative examples of k-SemStamp are presented in Figures 3 and 5. Compared to non-watermarked generation, k-SemStamp conveys the same level of coherence and contextual sensibility. The Ent-3 and Sem-Ent metrics also show that k-SemStamp preserves the token and semantic diversity of generation compared to non-watermarked generation.

Algorithm | PPL ↓ | Ent-3 ↑ | Sem-Ent ↑
No watermark | 11.89 | 11.43 | 2.98
KGW | 14.92 | 11.32 | 2.95
SIR | 20.34 | 11.57 | 3.18
SemStamp | 12.49 | 11.48 | 3.00
k-SemStamp | 11.82 | 11.48 | 2.98
Table 3: Quality evaluation of generations on BookSum. ↑ and ↓ indicate the direction of preference (higher and lower). k-SemStamp generation quality is on par with non-watermarked generations.
Figure 4: Detection results (AUC) under different generation lengths. k-SemStamp is more robust than SemStamp and KGW across lengths of 100-400 tokens in most cases.

Generation Length

As shown in Fig. 4, k-SemStamp has a higher AUC than both KGW (Kirchenbauer et al., 2023a) and SemStamp across most generation lengths, measured in number of tokens.

4 Conclusion

We propose k-SemStamp, a simple but effective enhancement of SemStamp. To watermark generated sentences, k-SemStamp maps embeddings of candidate sentences to a semantic space partitioned by k-means clustering, and only accepts sampled sentences whose embeddings fall into a valid region. This variant greatly improves paraphrastic robustness and sampling speed.

Limitations

A core component of k-SemStamp is performing k-means clustering on a particular text domain and partitioning the semantic space according to the semantic structure of that domain. However, this requires specifying the text domain of generation to initialize k-SemStamp. If the k-means clusters and the sentence embedder are not specific to the text domain, k-SemStamp suffers from a minor drop in paraphrastic robustness (see Table 2 for experimental results with k-SemStamp using a sentence embedder trained on RealNews).

Ethical Considerations

The proliferation of large language models capable of generating realistic texts has drastically increased the need to detect machine-generated text. By proposing k-SemStamp, we hope that practitioners will use it as a tool for governing model-generated texts. Although k-SemStamp shows promising paraphrastic robustness, it is not perfect against all kinds of attacks and thus should not be solely relied on in all scenarios. Finally, we hope this work motivates future research on not only semantic watermarking but also general adversarially robust methods for AI governance.

Acknowledgement

We would like to thank Brian Lu and the following members of the Intelligence Amplification Lab: Yining Lu, Nikil Sharma, Jiefu Ou, and Tianjian Li, for their support and constructive feedback on this work. We are also grateful for the insightful advice from the broader JHU CLSP community and from our anonymous reviewers and senior members at ACL.

References

  • Christ et al. (2023) Miranda Christ, Sam Gunn, and Or Zamir. 2023. Undetectable watermarks for language models. ArXiv, abs/2306.09194.
  • Fu et al. (2023) Yu Fu, Deyi Xiong, and Yue Dong. 2023. Watermarking conditional text generation for ai detection: Unveiling challenges and a semantic-aware watermark remedy. ArXiv, abs/2307.13808.
  • Han et al. (2022) Seungju Han, Beomsu Kim, and Buru Chang. 2022. Measuring and improving semantic diversity of dialogue generation. In Findings of the Association for Computational Linguistics: EMNLP 2022.
  • Hou et al. (2023) Abe Bohan Hou, Jingyu Zhang, Tianxing He, Yichen Wang, Yung-Sung Chuang, Hongwei Wang, Lingfeng Shen, Benjamin Van Durme, Daniel Khashabi, and Yulia Tsvetkov. 2023. Semstamp: A semantic watermark with paraphrastic robustness for text generation. arXiv preprint arXiv:2310.03991.
  • Indyk and Motwani (1998) Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, page 604–613, New York, NY, USA. Association for Computing Machinery.
  • Kirchenbauer et al. (2023a) John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2023a. A watermark for large language models. arXiv preprint arXiv:2301.10226.
  • Kirchenbauer et al. (2023b) John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, and Tom Goldstein. 2023b. On the reliability of watermarks for large language models.
  • Krishna et al. (2023) Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. 2023. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. arXiv preprint arXiv:2303.13408.
  • Kryściński et al. (2021) Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. 2021. Booksum: A collection of datasets for long-form narrative summarization. arXiv preprint arXiv:2105.08209.
  • Kuditipudi et al. (2023) Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. 2023. Robust distortion-free watermarks for language models. ArXiv, abs/2307.15593.
  • Liu et al. (2023) Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, and Lijie Wen. 2023. A semantic invariant robust watermark for large language models. arXiv preprint arXiv:2310.06356.
  • Lloyd (1982) Seth Lloyd. 1982. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137.
  • Mitchell et al. (2019) Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT*’19, page 220–229, New York, NY, USA. Association for Computing Machinery.
  • OpenAI (2022) OpenAI. 2022. ChatGPT.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR).
  • Sadasivan et al. (2023) Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. 2023. Can ai-generated text be reliably detected?
  • Wang et al. (2023) Lean Wang, Wenkai Yang, Deli Chen, Haozhe Zhou, Yankai Lin, Fandong Meng, Jie Zhou, and Xu Sun. 2023. Towards codable text watermarking for large language models. ArXiv, abs/2307.15992.
  • Wang et al. (2024) Yichen Wang, Shangbin Feng, Abe Bohan Hou, Xiao Pu, Chao Shen, Xiaoming Liu, Yulia Tsvetkov, and Tianxing He. 2024. Stumbling blocks: Stress testing the robustness of machine-generated text detectors under attacks. ArXiv, abs/2402.11638.
  • Wieting et al. (2022) John Wieting, Kevin Gimpel, Graham Neubig, and Taylor Berg-kirkpatrick. 2022. Paraphrastic representations at scale. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 379–388, Abu Dhabi, UAE. Association for Computational Linguistics.
  • Yoo et al. (2023) Kiyoon Yoo, Wonhyuk Ahn, Jiho Jang, and No Jun Kwak. 2023. Robust multi-bit natural language watermarking through invariant features. In Annual Meeting of the Association for Computational Linguistics.
  • Zhang et al. (2020) Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning (ICML).
  • Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068.
  • Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations (ICLR).
  • Zhang et al. (2018) Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and William B. Dolan. 2018. Generating informative and diverse conversational responses via adversarial information maximization. In NeurIPS.
  • Zhao et al. (2023) Xuandong Zhao, Prabhanjan Ananth, Lei Li, and Yu-Xiang Wang. 2023. Provable robust watermarking for ai-generated text. arXiv preprint arXiv:2306.17439.

Supplemental Materials

Appendix A Contrastive Learning and Sentence Encoder Fine-tuning

To make sentence encoders robust to paraphrase, we fine-tune them following the procedure in Hou et al. (2023) and Wieting et al. (2022).

First, we paraphrase 8000 paragraphs from RealNews (Raffel et al., 2020) and BookSum (Kryściński et al., 2021) using the Pegasus paraphraser (Zhang et al., 2020) through beam search with 25 beams. We then fine-tune two SBERT models (sentence-transformers/all-mpnet-base-v1) with an embedding dimension $h = 768$ for 3 epochs with a learning rate of $4 \times 10^{-5}$, using the contrastive learning objective with a margin $\delta = 0.8$:

$$\min_{\theta} \sum_{i} \max\Bigl\{\delta - f_{\theta}(s_i, t_i) + f_{\theta}(s_i, t'_i),\; 0\Bigr\}, \tag{2}$$

where $f_{\theta}$ measures the cosine similarity between sentence embeddings, $f_{\theta}(s, t) = \cos\bigl(M_{\theta}(s), M_{\theta}(t)\bigr)$, and $M_{\theta}$ is the sentence encoder, parameterized by $\theta$, that is to be fine-tuned.
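To make the objective in Eq. (2) concrete, the following sketch evaluates the per-example hinge term on toy embedding vectors. It assumes the embeddings $M_{\theta}(\cdot)$ have already been computed. Reading $t_i$ as a paraphrase of $s_i$ and $t'_i$ as a contrasting sentence is our interpretation of the symbols, and the function names `cosine` and `margin_loss` are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_loss(s, t, t_neg, delta=0.8):
    # One summand of Eq. (2): max{delta - f(s, t) + f(s, t'), 0}.
    return max(delta - cosine(s, t) + cosine(s, t_neg), 0.0)

# Toy embeddings (assumption): s and t point the same way, t_neg is orthogonal.
s, t, t_neg = [1.0, 0.0], [2.0, 0.0], [0.0, 3.0]

good = margin_loss(s, t, t_neg)   # paraphrase close, negative far: no penalty
bad = margin_loss(s, t_neg, t)    # roles swapped: large penalty
print(good, bad)
```

The term is zero once the similarity to $t_i$ exceeds the similarity to $t'_i$ by at least $\delta$, so only pairs that violate the margin contribute to the sum.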

Appendix B Algorithms

The generation algorithm of k-SemStamp is presented in Algorithm 1.

Appendix C Watermark Detection

The detection of both SemStamp and k-SemStamp follows the one-proportion $z$-test framework proposed by Kirchenbauer et al. (2023a). In Kirchenbauer et al. (2023a), the $z$-test is performed on the number of green-list tokens, assuming the following null hypothesis:

Null Hypothesis 1.

A piece of text, $T$, is generated (or written by a human) without knowledge of the watermarking green-list rule.

The green-list token z𝑧zitalic_z-score is computed by:

$$z = \frac{N_G - \gamma N_T}{\sqrt{\gamma (1 - \gamma) N_T}}, \tag{3}$$

where $N_G$ denotes the number of green tokens, $N_T$ refers to the total number of tokens contained in the given piece of text $T$, and $\gamma$ is a chosen ratio of green tokens.

The $z$-test rejects the null hypothesis when the green-list token $z$-score exceeds a given threshold $M$. During the detection of each piece of text, the number of green tokens is counted. A higher ratio of detected green tokens after normalization implies a higher $z$-score, meaning that the text is classified as machine-generated with more confidence.

Hou et al. (2023) adapt this $z$-test to detect SemStamp, based on the number of valid sentences rather than green-list tokens.

Null Hypothesis 2.

A piece of text, $T$, is generated (or written by a human) without knowledge of the rule of valid and blocked partitions in the semantic space.

$$z = \frac{S_V - \gamma S_T}{\sqrt{\gamma (1 - \gamma) S_T}}, \tag{4}$$

where $S_V$ refers to the number of valid sentences, and $\gamma$ is the ratio of valid sentences out of the total number of sentences $S_T$ in a piece of text $T$. To detect SemStamp, the given piece of text $T$ is first broken into sentences, and the number of valid sentences $S_V$ is counted to calculate the $z$-score. Likewise, Null Hypothesis 2 is rejected when the $z$-score exceeds a threshold $M$.
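The sentence-level statistic in Eq. (4) is short enough to check by hand. The sketch below computes it for a hypothetical text with 12 sentences, 9 of which are valid, at $\gamma = 0.25$. The function name `z_score`, the counts, and the threshold value are illustrative assumptions, not values from the experiments.

```python
import math

def z_score(n_valid, n_total, gamma):
    # One-proportion z-statistic: (S_V - gamma * S_T) / sqrt(gamma * (1 - gamma) * S_T).
    return (n_valid - gamma * n_total) / math.sqrt(gamma * (1 - gamma) * n_total)

# Hypothetical text: 12 sentences, 9 of them in valid regions.
z = z_score(n_valid=9, n_total=12, gamma=0.25)

M = 3.0  # illustrative threshold (assumption)
print(z, z > M)
```

Here the expected number of valid sentences under the null hypothesis is $0.25 \times 12 = 3$ and the standard deviation is $\sqrt{0.25 \times 0.75 \times 12} = 1.5$, so observing 9 valid sentences gives $z = 6 / 1.5 = 4$. The same function applies to Eq. (3) when token counts are passed instead of sentence counts.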

The detection procedure of k-SemStamp is analogous to that of SemStamp. We break a text into sentences and count the number of valid sentences to calculate the $z$-score; only the determination of whether a sentence falls into a valid region differs. k-SemStamp assigns each generated sentence to its closest cluster centroid and checks whether the index of that centroid belongs to a valid partition. See Algorithm 2 for the full detection algorithm.
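The counting step described above can be sketched as follows. This is a simplification under stated assumptions: closeness is measured by cosine similarity, sentence embeddings are given as toy 2-D vectors, and a single fixed set `valid_ids` is used for every sentence, whereas the full procedure in Algorithm 2 determines validity per sentence. The helper names are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def closest_centroid(embedding, centroids):
    # Index of the centroid with the highest cosine similarity to the embedding.
    return max(range(len(centroids)), key=lambda j: cosine(embedding, centroids[j]))

def count_valid_sentences(embeddings, centroids, valid_ids):
    # S_V: number of sentences whose closest centroid index is marked valid.
    return sum(1 for e in embeddings if closest_centroid(e, centroids) in valid_ids)

# Toy setup (assumption): four centroids on the axes, clusters 0 and 1 are valid.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
valid_ids = {0, 1}
embeddings = [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.1], [0.6, -0.2]]

s_v = count_valid_sentences(embeddings, centroids, valid_ids)
print(s_v, "of", len(embeddings), "sentences fall in valid regions")
```

The resulting count plays the role of $S_V$ in Eq. (4), with $S_T$ equal to the number of sentences.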

For a comprehensive evaluation of detection robustness, we consider a range of possible thresholds $M_f \in [0, 6.0]$, where each $M_f$ is determined by a given false positive rate $r_f$, and we treat machine-generated text as "positive" and human text as "negative", as in a traditional classification setting. We estimate the false positive rate of a given $M_f$ by computing the misclassification rate on human text samples. We let $r_f = 0.01$ and $r_f = 0.05$ to measure the TP@1% and TP@5% metrics in Table 1, respectively.
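One way to realize this evaluation is to pick the threshold from the empirical distribution of human-text $z$-scores and then measure the fraction of machine-generated texts above it. The sketch below does this on toy scores. The function names, the toy score lists, and the use of $r_f = 0.25$ (chosen only because the toy set has eight human samples) are assumptions for illustration; the restriction of $M_f$ to $[0, 6.0]$ is not modeled.

```python
def threshold_at_fpr(human_scores, fpr):
    # Pick M_f so that at most a fraction `fpr` (in [0, 1)) of human z-scores exceed it.
    ranked = sorted(human_scores, reverse=True)
    allowed = int(fpr * len(ranked))  # human texts we may misclassify
    return ranked[allowed]

def tp_at_fpr(machine_scores, human_scores, fpr):
    # True positive rate: share of machine-text z-scores strictly above M_f.
    m_f = threshold_at_fpr(human_scores, fpr)
    return sum(z > m_f for z in machine_scores) / len(machine_scores)

# Toy z-scores (assumption): 8 human texts and 4 watermarked texts.
human = [-1.0, -0.5, 0.0, 0.3, 0.8, 1.2, 1.9, 2.6]
machine = [4.0, 2.5, 1.0, 5.0]

m_f = threshold_at_fpr(human, fpr=0.25)
tp = tp_at_fpr(machine, human, fpr=0.25)
print(m_f, tp)
```

In this toy case two of the eight human scores lie above the chosen threshold, matching the 25% false positive budget, and three of the four machine scores are detected.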

Prompt: Pompey has a meeting with the pirates and they discuss the fact that Antony is in Egypt with Cleopatra.
SemStamp: The pirates want to ransom him, but they aren’t sure how high the price would be. They have heard that Antony is a powerful monarch who can do whatever he wants. The pirates also say that he has a lot of money. Pompey says it wouldn’t matter if he had 1,000 times as much money as Antony; he would still be a tyrant. The conversation turns to the bribes Antony was accused of taking. One pirate says that if Antony were a man of honor, he would have given up his rights to the Sea-Sir by now.
Paraphrased SemStamp: The pirates don’t know how much they would get for him. They have heard that Antony can do whatever he wants. He has a lot of money according to the pirates. He would still be a tyrant if he had more money than Antony. Antony was accused of taking bribes. According to a pirate, Antony would have given up his rights to the Sea-Sir by now if he were a man of honor.

k-SemStamp: They also discuss Caesar Augustus. The captain invites Pompey to share his thoughts on Caesar. He says that he doesn’t like him very much. In fact, he thinks he’s probably a bad leader. He goes on to say that Brutus should be his guest. Brutus replies that he can’t because he’s already got a guest for that matter - the Senate’s spaghetti-spilling friend, Publius Cornelius.
Paraphrased k-SemStamp: They talked about Caesar Augustus. Pompey was invited by the captain to share his thoughts on Caesar. He doesn’t like him very much. He thinks he’s a bad leader. He said that he should be his guest. Publius Cornelius is the Senate’s spaghetti-spilling friend and he can’t because he’s already there.

Figure 5: Examples of k-SemStamp generations after being paraphrased by the Pegasus paraphraser (Zhang et al., 2020). Green and plain sentences are detected, while red and underlined sentences are not. k-SemStamp generations are more robust to paraphrase, having a higher detection $z$-score than SemStamp.
Cell values: AUC / TP@1% / TP@5%
Algorithm | Domain | Pegasus | Pegasus-bigram | Parrot | Parrot-bigram
Unigram-Watermark | RealNews | 99.1 / 92.2 / 96.4 | 98.4 / 87.9 / 94.3 | 98.9 / 82.7 / 94.0 | 98.7 / 79.6 / 91.5
Unigram-Watermark | BookSum | 99.4 / 96.4 / 99.0 | 99.7 / 91.6 / 98.2 | 99.5 / 91.6 / 97.7 | 99.6 / 87.8 / 97.2
Table 4: Detection results of Unigram-Watermark in Zhao et al. (2023)
Algorithm | RealNews Pegasus | RealNews Parrot | RealNews GPT3.5 | BookSum Pegasus | BookSum Parrot | BookSum GPT3.5
KGW | 71.0 / 66.6 | 57.1 / 58.4 | 54.8 / 53.3 | 71.8 / 69.3 | 62.0 / 61.8 | 60.3 / 56.7
SStamp | 72.2 / 69.7 | 57.2 / 57.4 | 55.1 / 53.8 | 73.0 / 71.3 | 64.4 / 67.1 | 55.4 / 50.0
k-SStamp | 71.9 / 67.8 | 55.8 / 56.1 | 54.8 / 53.3 | 73.5 / 71.5 | 64.2 / 67.1 | 35.7 / 33.4
Table 5: BERTScore (Zhang et al., 2019) between original and paraphrased generations under different watermark algorithms and paraphrasers. All numbers are expressed as percentages. The first number in each entry is the result under the regular sentence-level paraphrase attack in Hou et al. (2023), while the second number is the result under the bigram paraphrase attack. Compared to regular paraphrase attacks, the bigram paraphrase attack only slightly corrupts the semantic similarity between paraphrased outputs and original generations.

Appendix D Additional Experimental Results

Table 4 shows the detection results of Unigram-Watermark (Zhao et al., 2023) against paraphrase attacks, demonstrating more robustness compared to SemStamp and k-SemStamp. However, Unigram-Watermark has the key vulnerability of being readily reverse-engineered by an adversary. Unigram-Watermark can be understood as a variant of the watermark in Kirchenbauer et al. (2023a), but with only one fixed greenlist initialized at the onset of generation. An adversary can therefore reverse-engineer this greenlist with $|V|$ brute-force submissions to the detection API, where each submission is a repetition of a single token $w_i$, $i \in \{1, \dots, |V|\}$, drawn without replacement from the vocabulary $V$ of the tokenizer. Upon each submission, the adversary can tell whether the submitted token is in the greenlist. After $|V|$ submissions, the entire greenlist is recovered. In contrast, such attacks are not applicable to SemStamp and k-SemStamp, since neither algorithm fixes the list of valid and blocked regions during generation. In summary, despite its strong robustness against various paraphrase attacks, Unigram-Watermark has a notable vulnerability that may limit its applicability in high-stakes domains where adversaries can conduct reverse-engineering.
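The brute-force attack described above can be simulated against a mock detector. Everything in this sketch is an assumption made for illustration: the detector is a toy stand-in that holds one fixed greenlist over integer token ids and returns a raw $z$-score, and the names `make_detector` and `recover_greenlist` are hypothetical. It is not the Unigram-Watermark implementation or any real detection API.

```python
import math
import random

def make_detector(vocab_size, gamma, seed):
    # Mock detector with a single fixed greenlist, as in the attack scenario.
    rng = random.Random(seed)
    green = set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

    def detect(tokens):
        # Token-level z-score of Eq. (3) for the submitted sequence.
        n = len(tokens)
        n_green = sum(t in green for t in tokens)
        return (n_green - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)

    return detect, green

def recover_greenlist(detect, vocab_size, repeats=50):
    # One query per token: a repeated green token yields a positive z-score.
    return {w for w in range(vocab_size) if detect([w] * repeats) > 0}

detect, secret = make_detector(vocab_size=1000, gamma=0.5, seed=7)
recovered = recover_greenlist(detect, vocab_size=1000)
print(len(recovered), recovered == secret)
```

Because each submission contains only one distinct token, its score is either strongly positive or strongly negative, so $|V|$ queries suffice in this toy setting. A detector that returns only a binary decision would leak the same information as long as the repeated-token score crosses its threshold.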

Computing Infrastructure and Budget

We ran sampling and paraphrase attack jobs on 8 A40 and 4 A100 GPUs, taking up a total of around 200 GPU hours.