Evaluate interpretability methods on localizing and disentangling concepts in LLMs.
[SIGIR 2022] Source code and datasets for "Bias Mitigation for Evidence-aware Fake News Detection by Causal Intervention".
Demystifying Verbatim Memorization in Large Language Models
[EMNLP 2023] A Causal View of Entity Bias in (Large) Language Models
A framework for evaluating auto-interp pipelines, i.e., natural language explanations of neurons.
A causal intervention framework to learn robust and interpretable character representations inside subword-based language models.
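Several of the repositories above apply causal interventions to a language model's internal activations. As a rough illustration of the shared idea, here is a minimal activation-patching sketch in PyTorch, assuming a GPT-2-style Hugging Face model; the layer index, prompts, and patching rule are illustrative choices of mine, not taken from any of the listed projects.

```python
# Minimal causal-intervention (activation patching) sketch; assumptions:
# GPT-2 via Hugging Face transformers, a single hand-picked layer, toy prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6  # hypothetical transformer block to intervene on

def run(prompt, hook=None):
    """Run the model on a prompt, optionally with a forward hook on one block."""
    handle = model.transformer.h[LAYER].register_forward_hook(hook) if hook else None
    try:
        with torch.no_grad():
            out = model(**tok(prompt, return_tensors="pt"))
    finally:
        if handle is not None:
            handle.remove()
    return out.logits[0, -1]  # next-token logits at the final position

# 1. Cache the chosen block's hidden states on a "clean" prompt.
cache = {}
def save_hook(module, inputs, output):
    cache["h"] = output[0].detach()

run("The Eiffel Tower is in the city of", hook=save_hook)

# 2. Re-run a "corrupted" prompt, patching in the clean final-token activation.
def patch_hook(module, inputs, output):
    h = output[0].clone()
    h[:, -1, :] = cache["h"][:, -1, :]  # swap in the clean run's final-token state
    return (h,) + output[1:]

logits = run("The Colosseum is in the city of", hook=patch_hook)
print(tok.decode(logits.argmax().item()))  # did the intervened concept carry over?
```

Real evaluation pipelines typically sweep such interventions over many (layer, position) pairs and compare effect sizes; this sketch shows only a single intervention point.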