Code for our paper "Investigating Generalization of One-shot LLM Steering Vectors"

This repository contains code for reproducing the results of our paper "Investigating Generalization of One-shot LLM Steering Vectors".

Requirements

Python 3.10 and Anaconda are required.

To create an Anaconda environment named one_shot with the rest of the requirements, run the command:

conda create --name one_shot --file packages.txt
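
Once the environment is active (for example via conda activate one_shot), a quick check like the one below can confirm that the interpreter matches the stated requirement before running the notebooks. This is only a minimal sketch: the helper python_version_ok is hypothetical and not part of this repository, and it assumes that "Python 3.10" means any 3.10.x release.

```python
import sys

# Assumption: "Python 3.10" in the requirements means any 3.10.x release.
REQUIRED = (3, 10)


def python_version_ok(version_info=None, required=REQUIRED):
    """Return True if the (major, minor) part of version_info equals `required`.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if python_version_ok():
        print(f"Python {found}: OK")
    else:
        print(f"Python {found}: expected {REQUIRED[0]}.{REQUIRED[1]}.x")
```

If the check reports a mismatch, the most likely cause is that the one_shot environment is not the active one.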

Repository organization

The code is organized by paper section. The Jupyter notebook poser.ipynb reproduces the results in Section 3 of our paper, refusal.ipynb reproduces the results in Section 4, and fictitious_info_retraction.ipynb reproduces the results in Section 5. Instructions for using the other scripts in this repository are given in the notebooks where they are used.

Info on our standalone steering optimization repository

If you're interested in optimizing steering vectors for your own purposes, we recommend the standalone repository llm-steering-opt. Its steering optimization functions are better documented and, importantly, it will continue to be updated as we continue our research into steering vector optimization, whereas this repository is primarily intended as a snapshot to enable reproduction of this specific paper. (We've already updated llm-steering-opt beyond the files present in this repository.)

Citation

Please cite this work as:

@misc{dunefsky2025oneshot,
  title={Investigating Generalization of One-shot LLM Steering Vectors}, 
  author={Jacob Dunefsky and Arman Cohan},
  year={2025},
  eprint={2502.18862},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.18862}, 
}

For any questions/comments/concerns, feel free to reach out to jacob [dot] dunefsky [at] yale [dot] edu.
