This repository contains the code accompanying our ICML 2024 paper:
Nikola Jovanović, Robin Staab, and Martin Vechev. 2024. Watermark Stealing in Large Language Models. In Proceedings of ICML ’24.
For an overview of the work, check out our project website: watermark-stealing.org.
To set up the project, clone the repository and execute the steps from `setup.sh`. Running all steps will install Conda, create the `ws` environment, install the dependencies listed in `env.yaml`, install Flash Attention, and download the PSP model needed for the scrubbing evaluation. On top of this, make sure to set the `OAI_API_KEY` environment variable to your OpenAI API key (to use GPT-as-a-judge evaluation).
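As a rough sketch, the setup might look as follows (assuming a bash shell and that the steps from `setup.sh` can be run directly; the repository URL and key value are placeholders):

```bash
# Sketch of the setup flow described above; see setup.sh for the authoritative steps.
git clone <repository-url> && cd watermark-stealing   # replace with the actual repository URL
bash setup.sh                  # installs Conda, creates the `ws` env from env.yaml, Flash Attention, PSP model
conda activate ws
export OAI_API_KEY="sk-..."    # your OpenAI API key, used for GPT-as-a-judge evaluation
```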
The project structure is as follows:

- `main.py` is the main entry point for the code.
- `src/` contains the rest of the code, namely:
  - `src/attackers` contains all our attacker code for all 3 steps of the watermark stealing attack (see below under "Running the Code").
  - `src/config` contains definitions of our Pydantic configuration files. Refer to `ws_config.py` for detailed explanations of each field.
  - `src/models` contains model classes for our server, attacker, judge, and PSP models.
  - `src/utils` contains utility functions for file handling, logging, and the use of GPT as a judge.
  - `src/watermarks` contains watermark implementations to be used on the server.
  - `evaluator.py` implements all evaluation code for the attacker; we are primarily interested in the `targeted` evaluation mode.
  - `gradio.py` contains the (experimental) Gradio interface used for debugging; this is not used in our experiments.
  - `server.py` contains the code for the server, i.e., the watermarked model.
- `configs/` contains YAML configuration files (corresponding to `src/config/ws_config.py`) for our main experiments reported in the paper.
- `data/` holds static data files for some datasets used in the experiments.
Our code can be run by providing a path to a YAML configuration file. For example:
python3 main.py configs/spoofing/llama7b/mistral_selfhash.yaml
This example will run watermark stealing with `Llama-7B` as the watermarked server model using the `KGW2-SelfHash` scheme, and `Mistral-7B` as the attacker model, evaluated on a spoofing attack. If `use_neptune` is set to `true`, the experiment will be logged to neptune.ai; to enable this, set the `NEPTUNE_API_TOKEN` environment variable and replace `ORG/PROJ` in `src/config/ws_config.py` with your project ID to set it as the default, or add it to the config file for each of your runs.
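For instance, enabling neptune logging could look like this (the token value is a placeholder):

```bash
# Optional: neptune.ai logging for experiment tracking
export NEPTUNE_API_TOKEN="<your-neptune-api-token>"
# Then replace ORG/PROJ in src/config/ws_config.py with your own project ID,
# or set the corresponding field in each YAML config you run.
```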
This executes the following three key steps, also visible in each config file:

- `querying`: The attacker queries the watermarked server with a set of prompts and saves the resulting responses as `json` files. This step can be skipped by downloading all watermarked server outputs used in our experimental evaluation from this link, and setting `skip: true` in the relevant section of the config file (done by default). Extract the archive such that `out_mistral`, `out_llama`, and `out_llama13b` are in the root of the project.
- `learning`: The attacker loads the responses and uses our algorithm to learn an internal model of the watermarking rules.
- `generation`: The attacker mounts a scrubbing or a spoofing attack using the logit processors defined in `src/attackers/processors.py`. The `evaluator` section of the config file defines the relevant evaluation setting. To evaluate a scrubbing attack, first execute a server run (see the `server_*.yaml` files) to produce watermarked responses and log them as a neptune experiment, whose ID should be set in the `get_server_prompts_from` field of the config file of the main run (see the example commands after this list). The code can be easily extended to use local storage if neptune is not available.
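To make the scrubbing workflow concrete, the sequence of runs might look roughly as follows; the config paths are hypothetical examples and should be replaced with the actual `server_*.yaml` and scrubbing config files you use:

```bash
# 1) Server run: produce watermarked responses and log them as a neptune experiment
python3 main.py configs/scrubbing/llama7b/server_selfhash.yaml    # hypothetical path

# 2) Put the resulting neptune experiment ID into the `get_server_prompts_from`
#    field of the main run's config, then mount the scrubbing attack
python3 main.py configs/scrubbing/llama7b/mistral_selfhash.yaml   # hypothetical path
```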
To obtain the results reported in the paper, we have postprocessed the results of runs such as the one above to compute the FPR/FNR metrics under a specific FPR setting (as detailed in the paper). We have also recomputed the PPL of all texts using `Llama-13B` for consistency across experiments.
Nikola Jovanović, [email protected]
Robin Staab, [email protected]
Martin Vechev
If you use our code, please cite the following:
@inproceedings{jovanovic2024watermarkstealing,
    author = {Jovanović, Nikola and Staab, Robin and Vechev, Martin},
    title = {Watermark Stealing in Large Language Models},
    booktitle = {{ICML}},
    year = {2024}
}