🚧 Work in Progress 🚧
Verbatim memorization refers to LLMs outputting long sequences of text that exactly match their training examples. In our work, we show that verbatim memorization is intertwined with an LM's general capabilities and is therefore very difficult to isolate and suppress without degrading model quality.
This repo contains:
- A framework for studying verbatim memorization in a controlled setting by continuing pre-training from LLM checkpoints with injected sequences (see the sketch after this list).
- Scripts using causal interventions to analyze how verbatim memorized sequences are encoded in the model representations.
- Stress testing evaluation for unlearning methods that aim to remove the verbatim memorized information.
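As a rough illustration of the setup (not the repo's actual code), the sketch below checks whether an injected sequence is verbatim memorized by greedy-decoding from its prefix and comparing the output to the true continuation. The checkpoint name and prefix length are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the experiments continue pre-training from Pythia checkpoints.
MODEL_NAME = "EleutherAI/pythia-160m-deduped"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def is_verbatim_memorized(sequence: str, prefix_len: int = 32) -> bool:
    """Greedy-decode from the first `prefix_len` tokens and test for an exact match."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids[0]
    prefix, target = ids[:prefix_len], ids[prefix_len:]
    with torch.no_grad():
        out = model.generate(
            prefix.unsqueeze(0),
            max_new_tokens=len(target),
            do_sample=False,  # greedy decoding
        )
    return torch.equal(out[0, prefix_len:], target)
```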
The data directory contains the following datasets:
- Pile data: 1M sequences sampled from the Pile, along with continuations generated by the `pythia-6.9b-deduped` model.
- Sequence injection data: 100 sequences sampled from Internet content published after Dec 2020.
- Stress testing data: 140K perturbed prefixes to evaluate whether unlearning methods truly remove the verbatim memorized information.
The pre-training data can be generated with the `batch_viewer` script, which allows you to extract Pythia training data between two given training steps.
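For orientation, the step-to-row arithmetic that such an extraction relies on is sketched below. It assumes Pythia's standard setup of 1024 sequences per optimization step, read sequentially from the pre-shuffled training data; verify the batch size against the checkpoint's training config.

```python
# Sketch of the mapping from training steps to rows of the pre-shuffled training data.
SEQUENCES_PER_STEP = 1024  # Pythia batch size (assumption; check the training config)

def step_range_to_rows(start_step: int, end_step: int) -> range:
    """Dataset row indices seen during training steps [start_step, end_step)."""
    return range(start_step * SEQUENCES_PER_STEP, end_step * SEQUENCES_PER_STEP)

# Example: rows covered by steps 80,000 through 80,009.
rows = step_range_to_rows(80_000, 80_010)
print(len(rows), rows.start, rows.stop)  # 10240 81920000 81930240
```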
The training script is at `scripts/train_with_injection.py`. For the single-shot verbatim memorization experiment, the training script is at `scripts/train_with_injection_single_shot.py`.
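The sketch below illustrates the general idea of injection training (not the actual scripts): mix the injected sequence into the continued pre-training stream every `injection_interval` steps, or exactly once for the single-shot setting. All names, hyperparameters, and the batching scheme are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train_with_injection(model, pretrain_dataset, injected_ids, optimizer,
                         injection_interval=100, single_shot=False):
    """Continue pre-training, periodically presenting the injected sequence as a batch."""
    loader = DataLoader(pretrain_dataset, batch_size=8, shuffle=False)
    injected_batch = injected_ids.unsqueeze(0)  # shape: (1, seq_len)
    model.train()
    for step, batch in enumerate(loader):
        if step % injection_interval == 0 and (not single_shot or step == 0):
            input_ids = injected_batch       # inject the target sequence
        else:
            input_ids = batch["input_ids"]   # regular pre-training data
        loss = model(input_ids=input_ids, labels=input_ids).loss  # standard LM loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```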
We use causal interventions to analyze the causal dependencies between the trigger and verbatim memorized tokens. You can find the script for causal dependency analysis on Colab:
Below is an example of a sequence verbatim memorized by `pythia-6.9b-deduped`: the first sentence of the book *Harry Potter and the Philosopher's Stone*. The trigger sequence is "Mr and Mrs Dursley, of", i.e., the model can generate the full sentence given only the trigger. Yet not all generated tokens are actually causally dependent on the trigger; e.g., the prediction of the token "you" only depends on representations of the token "thank".
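To make "causally dependent" concrete, here is a minimal activation-patching sketch (not the Colab script): run the model on the original prompt and on a corrupted prompt, overwrite the hidden state at a chosen layer and token position with its corrupted counterpart, and check whether the prediction of a later memorized token changes. The layer and position arguments are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-6.9b-deduped"  # the model from the example above
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def patched_next_token(clean_prompt: str, corrupt_prompt: str, layer: int, position: int) -> str:
    """Predict the next token for `clean_prompt`, with the hidden state at
    (`layer`, `position`) replaced by its value from the corrupted run."""
    clean_ids = tokenizer(clean_prompt, return_tensors="pt").input_ids
    corrupt_ids = tokenizer(corrupt_prompt, return_tensors="pt").input_ids

    # Cache the corrupted run's residual-stream state after the target layer.
    with torch.no_grad():
        corrupt_hidden = model(corrupt_ids, output_hidden_states=True).hidden_states[layer]

    def patch_hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, position] = corrupt_hidden[:, position]  # overwrite one position
        return output

    handle = model.gpt_neox.layers[layer - 1].register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            logits = model(clean_ids).logits
    finally:
        handle.remove()
    return tokenizer.decode(logits[0, -1].argmax())
```

If the predicted memorized token changes when trigger positions are patched but not when later positions are, that token is causally dependent on the trigger.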
The evaluation scripts, including the script for generating perturbed prefixes, are available below:
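As a rough sketch of the stress-test idea (not the released scripts): perturb a token of the memorized prefix and re-run a memorization check, such as the one sketched earlier, on the unlearned model. The perturbation strategy here (one random token substitution) is an assumption.

```python
import random
import torch

def perturb_prefix(prefix_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Replace one randomly chosen prefix token with a random vocabulary token."""
    perturbed = prefix_ids.clone()
    position = random.randrange(len(perturbed))
    perturbed[position] = random.randrange(vocab_size)
    return perturbed
```

If the unlearned model still completes the target continuation from many perturbed prefixes, the verbatim memorized information was suppressed rather than truly removed.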
If you find this repo helpful, please consider citing our work:
```bibtex
@misc{huang2024demystifying,
      title={Demystifying Verbatim Memorization in Large Language Models},
      author={Jing Huang and Diyi Yang and Christopher Potts},
      year={2024},
      eprint={2407.17817},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.17817},
}
```