🚧 Work in Progress 🚧
Verbatim memorization refers to LLMs outputting long sequences of text that exactly match their training examples. In our work, we show that verbatim memorization is intertwined with an LM's general capabilities and is therefore very difficult to isolate and suppress without degrading model quality.
This repo contains:
- A framework for studying verbatim memorization in a controlled setting by continuing pre-training from LLM checkpoints with injected sequences (see the sketch after this list).
- Scripts using causal interventions to analyze how verbatim memorized sequences are encoded in the model representations.
- Stress testing evaluation for unlearning methods that aim to remove the verbatim memorized information.
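As a rough illustration of the setup (not the repo's actual code), the sketch below checks whether an injected sequence is verbatim memorized by greedy-decoding from its prefix and comparing the output to the true continuation. The checkpoint name and prefix length are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the experiments continue pre-training from Pythia checkpoints.
MODEL_NAME = "EleutherAI/pythia-160m-deduped"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def is_verbatim_memorized(sequence: str, prefix_len: int = 32) -> bool:
    """Greedy-decode from the first `prefix_len` tokens and test for an exact match."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids[0]
    prefix, target = ids[:prefix_len], ids[prefix_len:]
    with torch.no_grad():
        out = model.generate(
            prefix.unsqueeze(0),
            max_new_tokens=len(target),
            do_sample=False,  # greedy decoding
        )
    return torch.equal(out[0, prefix_len:], target)
```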
The data directory contains the following datasets:
- Pile data: 1M sequences sampled from the Pile, along with continuations generated by the `pythia-6.9b-deduped` model.
- Sequence injection data: 100 sequences sampled from Internet content published after Dec 2020.
- Stress testing data: 140K perturbed prefixes to evaluate whether unlearning methods truly remove the verbatim memorized information.
The pre-training data can be generated with the `batch_viewer` script, which allows you to extract Pythia training data between two given training steps.
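For orientation, the step-to-row arithmetic that such an extraction relies on is sketched below. It assumes Pythia's standard setup of 1024 sequences per optimization step, read sequentially from the pre-shuffled training data; verify the batch size against the checkpoint's training config.

```python
# Sketch of the mapping from training steps to rows of the pre-shuffled training data.
SEQUENCES_PER_STEP = 1024  # Pythia batch size (assumption; check the training config)

def step_range_to_rows(start_step: int, end_step: int) -> range:
    """Dataset row indices seen during training steps [start_step, end_step)."""
    return range(start_step * SEQUENCES_PER_STEP, end_step * SEQUENCES_PER_STEP)

# Example: rows covered by steps 80,000 through 80,009.
rows = step_range_to_rows(80_000, 80_010)
print(len(rows), rows.start, rows.stop)  # 10240 81920000 81930240
```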
The training script is at `scripts/train_with_injection.py`. For the single-shot verbatim memorization experiment, the training script is at `scripts/train_with_injection_single_shot.py`.
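The sketch below illustrates the general idea of injection training (not the actual scripts): mix the injected sequence into the continued pre-training stream every `injection_interval` steps, or exactly once for the single-shot setting. All names, hyperparameters, and the batching scheme are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train_with_injection(model, pretrain_dataset, injected_ids, optimizer,
                         injection_interval=100, single_shot=False):
    """Continue pre-training, periodically presenting the injected sequence as a batch."""
    loader = DataLoader(pretrain_dataset, batch_size=8, shuffle=False)
    injected_batch = injected_ids.unsqueeze(0)  # shape: (1, seq_len)
    model.train()
    for step, batch in enumerate(loader):
        if step % injection_interval == 0 and (not single_shot or step == 0):
            input_ids = injected_batch       # inject the target sequence
        else:
            input_ids = batch["input_ids"]   # regular pre-training data
        loss = model(input_ids=input_ids, labels=input_ids).loss  # standard LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```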
We use causal interventions to analyze the causal dependencies between the trigger and verbatim memorized tokens. You can find the script for causal dependency analysis on Colab:
Below is an example of a sequence verbatim memorized by `pythia-6.9b-deduped`: the first sentence of the book *Harry Potter and the Philosopher's Stone*. The trigger sequence is "Mr and Mrs Dursley, of", i.e., the model can generate the full sentence given only the trigger. Yet not all generated tokens are actually causally dependent on the trigger; e.g., the prediction of the token "you" only depends on representations of the token "thank".
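To make "causally dependent" concrete, here is a minimal activation-patching sketch (not the Colab script): run the model on the original prompt and on a corrupted prompt, overwrite the hidden state at a chosen layer and token position with its corrupted counterpart, and check whether the prediction of a later memorized token changes. The layer and position arguments are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-6.9b-deduped"  # the model from the example above
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def patched_next_token(clean_prompt: str, corrupt_prompt: str, layer: int, position: int) -> str:
    """Predict the next token for `clean_prompt`, with the hidden state at
    (`layer`, `position`) replaced by its value from the corrupted run."""
    clean_ids = tokenizer(clean_prompt, return_tensors="pt").input_ids
    corrupt_ids = tokenizer(corrupt_prompt, return_tensors="pt").input_ids

    # Cache the corrupted run's residual-stream state after the target layer.
    with torch.no_grad():
        corrupt_hidden = model(corrupt_ids, output_hidden_states=True).hidden_states[layer]

    def patch_hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, position] = corrupt_hidden[:, position]  # overwrite one position
        return output

    handle = model.gpt_neox.layers[layer - 1].register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            logits = model(clean_ids).logits
    finally:
        handle.remove()
    return tokenizer.decode(logits[0, -1].argmax())
```

If the predicted memorized token changes when trigger positions are patched but not when later positions are, that token is causally dependent on the trigger.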
The evaluation scripts, including the script for generating perturbed prefixes, are available below:
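As a rough sketch of the stress-test idea (not the released scripts): perturb a token of the memorized prefix and re-run a memorization check, such as the one sketched earlier, on the unlearned model. The perturbation strategy here (one random token substitution) is an assumption.

```python
import random
import torch

def perturb_prefix(prefix_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Replace one randomly chosen prefix token with a random vocabulary token."""
    perturbed = prefix_ids.clone()
    position = random.randrange(len(perturbed))
    perturbed[position] = random.randrange(vocab_size)
    return perturbed
```

If the unlearned model still completes the target continuation from many perturbed prefixes, the verbatim memorized information was suppressed rather than truly removed.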
If you find this repo helpful, please consider citing our work:
```bibtex
@misc{huang2024demystifying,
      title={Demystifying Verbatim Memorization in Large Language Models},
      author={Jing Huang and Diyi Yang and Christopher Potts},
      year={2024},
      eprint={2407.17817},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.17817},
}
```