Official codebase for the paper: Analyzing the Generalization and Reliability of Steering Vectors
This repository contains code and instructions for running the layer sweep and steering experiments described in our paper.
First, install dependencies:
git clone https://github.com/dtch1997/steering-bench
cd steering-bench
pip install -e .
This repository facilitates running the following experiments:
- Layer sweep: Extract and apply steering vectors at many different layers in order to select the 'best' layer (by steerability).
- Steering generalization: Run steering on a given task with variations in user and system prompts to evaluate generalization.
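As a rough illustration of the layer sweep idea, the sketch below scores every candidate layer and keeps the one with the highest score. It does not use this repository's API: `layer_sweep`, `score_layer`, and `toy_score` are hypothetical names, and the toy scoring function stands in for a real steerability measurement that would extract a steering vector at that layer, apply it, and evaluate the result.

```python
from typing import Callable, Dict, Iterable, Tuple


def layer_sweep(
    layers: Iterable[int],
    score_layer: Callable[[int], float],
) -> Tuple[int, Dict[int, float]]:
    """Score every candidate layer and return (best_layer, all_scores).

    `score_layer` is a hypothetical callable that returns the steerability
    obtained when steering at the given layer.
    """
    scores = {layer: score_layer(layer) for layer in layers}
    if not scores:
        raise ValueError("layers must not be empty")
    # Pick the layer with the highest score (first one wins on ties).
    best_layer = max(scores, key=scores.get)
    return best_layer, scores


def toy_score(layer: int) -> float:
    """Hypothetical stand-in for a real measurement; peaks at layer 13."""
    return 1.0 - abs(layer - 13) / 32


best_layer, scores = layer_sweep(range(32), toy_score)
print(best_layer)
```

In a real experiment, the same loop could be repeated over variations in user and system prompts to compare how the per-layer scores change.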
This repository also provides off-the-shelf components that make it easy to run a custom steering experiment:
- Pipeline, a wrapper around a (possibly-steered) model
- Formatter, an abstraction for prompt-based scaffolding
- PipelineHook, an abstraction for generic steering interventions
- SteeringHook, an implementation that applies steering vectors using our steering vectors library
- Steerability metrics used in the paper
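To give a feel for what a steerability metric might compute, here is a minimal sketch that assumes steerability is summarized as the slope of a least-squares line fitted to a model's propensity (for example, a mean logit difference) as a function of the steering multiplier. That definition is an assumption made for illustration, and `steerability_slope` is a hypothetical helper rather than a function from this repository; refer to the paper for the exact metrics.

```python
from typing import Sequence


def steerability_slope(
    multipliers: Sequence[float],
    propensities: Sequence[float],
) -> float:
    """Slope of the least-squares line through (multiplier, propensity) points.

    Assumption: a larger positive slope means the behaviour responds more
    strongly to the steering multiplier.
    """
    n = len(multipliers)
    if n != len(propensities) or n < 2:
        raise ValueError("need at least two paired measurements")
    mean_x = sum(multipliers) / n
    mean_y = sum(propensities) / n
    covariance = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(multipliers, propensities)
    )
    variance = sum((x - mean_x) ** 2 for x in multipliers)
    if variance == 0:
        raise ValueError("multipliers must not all be identical")
    return covariance / variance


# Made-up measurements: propensity grows linearly with the multiplier.
multipliers = [-1.0, -0.5, 0.0, 0.5, 1.0]
propensities = [-2.0, -1.0, 0.0, 1.0, 2.0]
slope = steerability_slope(multipliers, propensities)
print(slope)
```

A slope near zero would suggest the steering vector has little effect on the measured behaviour, while a negative slope would suggest it pushes in the opposite direction.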
This codebase has been simplified to improve readability, and was not directly used in generating results for the paper. If you would like to reproduce specific plots in our paper, refer to our original codebase.
If you find this codebase useful, consider citing our paper:
@misc{tan2024analyzinggeneralizationreliabilitysteering,
title={Analyzing the Generalization and Reliability of Steering Vectors},
author={Daniel Tan and David Chanin and Aengus Lynch and Dimitrios Kanoulas and Brooks Paige and Adria Garriga-Alonso and Robert Kirk},
year={2024},
eprint={2407.12404},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2407.12404},
}
This work was made possible by the generous support of:
- FAR AI
- The UCL AI Centre
- UCL DARK Lab
- The Agency for Science, Technology, and Research