[ACL2024] 🛌InBedder: Instruction-following Text Embedder

This repository contains the code, dataset and pre-trained models for our paper Answer is All You Need: Instruction-following Text Embedding via Answering the Question.

We introduce 🛌InBedder, a text embedder designed to follow instructions. An instruction-following text embedder captures the characteristics of a text that a user instruction asks about. InBedder offers a novel viewpoint: it treats the instruction as a question about the input text and encodes the expected answer to obtain the representation. Across different evaluation tasks, we show that InBedder is aware of instructions.

(Logo credit: DALL·E 3)

**************************** Updates ****************************

⚡ Quick Start

Check out UseCase.ipynb for a quick trial of our model!

📦 Installation

Follow these steps to set up InBedder.

conda create -n inbedder python=3.9
conda activate inbedder
python -m pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
python -m pip install -r requirements.txt
python -m pip install flash-attn --no-build-isolation
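
After installation, a quick sanity check (our own addition, not part of the original setup instructions) is to confirm that PyTorch sees the GPU and that flash-attn imports cleanly:

# optional sanity check after installation
import torch
import flash_attn  # should import without error if the wheel built correctly

print(torch.__version__)          # expect 2.1.0
print(torch.cuda.is_available())  # expect True on a CUDA 11.8 machine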

🚀 Getting Started

Load Model

from lm_encoders_hf import CausalLMEncoder, MaskedLMEncoder

# decoder-based InBedder (e.g. the llama-2 and opt checkpoints)
model = CausalLMEncoder(
    model_name_or_path="BrandonZYW/llama-2-7b-InBedder",
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=3,
    do_sample=True
)
# or, alternatively, the encoder-based InBedder (roberta-large)
model = MaskedLMEncoder(
    model_name_or_path="BrandonZYW/roberta-large-InBedder",
    mask_length=3
)

Remember to set the output value to the last layer (layer 32 for llama-2-7b), for example:

model.set_output_value("fst_gen_layer_32")

Check out the demos for more usage examples.

Add instructions

pattern = "### Input:\n{input}\n\n### Instruction:\n{instruction}\n\n### Response:"
corpus = [pattern.replace('{input}', s).replace('{instruction}', instruction) for s in corpus]
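
For instance, with a made-up instruction and corpus (purely illustrative, not from the repository), the formatted prompts look like this:

# illustrative example; the instruction and texts are invented for demonstration
instruction = "What is the sentiment of this review?"
corpus = [
    "The movie was a delightful surprise from start to finish.",
    "I walked out halfway through; the plot made no sense.",
]
pattern = "### Input:\n{input}\n\n### Instruction:\n{instruction}\n\n### Response:"
corpus = [pattern.replace('{input}', s).replace('{instruction}', instruction) for s in corpus]
print(corpus[0])
# ### Input:
# The movie was a delightful surprise from start to finish.
#
# ### Instruction:
# What is the sentiment of this review?
#
# ### Response: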

The encode function

embeddings, generations = model.encode(
    corpus,
    batch_size=32,
    cache_dir=None, # useful when you want to reuse the embeddings
    return_generations=True # useful if you want to look at your generations
)
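
Putting the pieces together, here is a minimal end-to-end sketch; the instruction, the two texts, and the cosine-similarity step at the end are our own illustration rather than part of the repository's API:

import numpy as np
from lm_encoders_hf import CausalLMEncoder

model = CausalLMEncoder(
    model_name_or_path="BrandonZYW/llama-2-7b-InBedder",
    temperature=0.6, top_p=0.9, max_new_tokens=3, do_sample=True
)
model.set_output_value("fst_gen_layer_32")  # last layer of llama-2-7b

instruction = "What is the topic of this text?"  # illustrative instruction
texts = [
    "The striker scored twice in the final minutes of the match.",
    "The central bank raised interest rates for the third time this year.",
]
pattern = "### Input:\n{input}\n\n### Instruction:\n{instruction}\n\n### Response:"
corpus = [pattern.replace('{input}', s).replace('{instruction}', instruction) for s in texts]

embeddings, generations = model.encode(
    corpus, batch_size=2, cache_dir=None, return_generations=True
)

# cosine similarity between the two instruction-conditioned embeddings
a, b = np.asarray(embeddings[0]), np.asarray(embeddings[1])
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
print(generations)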

📊 Model List

We release a series of InBedder checkpoints in different sizes. You can easily load these models with Hugging Face.

Model                     Avg. Score
llama-2-7b-InBedder       58.80
opt-2.7b-InBedder         56.57
opt-1.3b-InBedder         54.99
roberta-large-InBedder    53.06
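
If you prefer to pull the raw checkpoints yourself, they can also be loaded directly with transformers; the sketch below assumes the checkpoints are standard Hugging Face causal/masked LMs hosted under the BrandonZYW namespace, as the model IDs used above suggest:

from transformers import AutoTokenizer, AutoModelForCausalLM

# decoder-based checkpoint; use AutoModelForMaskedLM for roberta-large-InBedder
tokenizer = AutoTokenizer.from_pretrained("BrandonZYW/llama-2-7b-InBedder")
model = AutoModelForCausalLM.from_pretrained("BrandonZYW/llama-2-7b-InBedder")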

💡 Use Case

We show how to use InBedder for personalized clustering in propose.py. Execute it by running

bash scripts/propose.sh

Additionally, analyze_propose_results.py and gather_cluster_results.py help you extract the top words from each cluster and compare them with the label components.
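
To convey the idea (this is not what propose.py does internally, just an illustrative sketch), instruction-conditioned embeddings can be clustered directly, e.g. with scikit-learn's KMeans; different instructions yield different, "personalized" clusterings of the same corpus:

from sklearn.cluster import KMeans

# `embeddings` come from model.encode(...) on your corpus, formatted under an
# instruction such as "What is the topic of this text?" (illustrative choice);
# pick n_clusters according to your corpus size
kmeans = KMeans(n_clusters=5, random_state=0, n_init=10).fit(embeddings)
print(kmeans.labels_)  # one cluster assignment per document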

πŸ‹οΈβ€β™‚οΈ Training

Data

Please check out our training dataset here.

Train InBedder

We follow stanford_alpaca for training.

cd alpaca_train
bash scripts/train.sh # this is for roberta-large, opt-1.3b
bash scripts/train_2.7b.sh # this is for opt-2.7b
bash scripts/train_7b.sh # this is for llama

✅ Evaluation

Data

To facilitate future research, we are happy to release the evaluation data we used and created.

Evaluation Code

The evaluation code is in evaluation.py. To run the evaluation and reproduce the results in the paper, use scripts/evaluation.sh: uncomment the line for the setting you want, fill in the CUDA device id, and then run

bash scripts/evaluation.sh

Note that all available configs are listed in the configs folder. Additionally, if you want to evaluate with the instruction-robustness tests, there is a section named "robustness" in the script that executes them.

🐞 Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Yuwei (yuz163@ucsd.edu) and Letian (lepeng@ucsd.edu).

📑 Citation

If you find our work helpful, please cite us:

@article{DBLP:journals/corr/abs-2402.09642,
  author       = {Letian Peng and
                  Yuwei Zhang and
                  Zilong Wang and
                  Jayanth Srinivasa and
                  Gaowen Liu and
                  Zihan Wang and
                  Jingbo Shang},
  title        = {Answer is All You Need: Instruction-following Text Embedding via Answering the Question},
  journal      = {CoRR},
  volume       = {abs/2402.09642},
  year         = {2024},
  url          = {https://arxiv.org/abs/2402.09642},
  eprinttype   = {arXiv},
  eprint       = {2402.09642},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2402.09642.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
