This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

How do the passage embeddings use the 'title' of the passage #224

Open
jigsaw2212 opened this issue Jul 25, 2022 · 1 comment


@jigsaw2212

Hi,
I would like to better understand how the codebase uses the passage 'title' when generating the passage embeddings.


xhluca commented Aug 1, 2022

You can see how it's ingested here:

DPR/dpr/models/hf_models.py, lines 293 to 300 in d9f3e41:

    token_ids = self.tokenizer.encode(
        title,
        text_pair=text,
        add_special_tokens=add_special_tokens,
        max_length=self.max_length if apply_max_len else 10000,
        pad_to_max_length=False,
        truncation=True,
    )

Hugging Face tokenizers accept pairs of sequences (e.g. for question answering, NLI, etc.). I believe they usually place a separator token between the two, i.e. {text} [SEP] {text_pair}. In this case, text=title and text_pair=the passage text, so the input should look like {title} [SEP] {passage}. With a BERT-style tokenizer and special tokens enabled, that becomes [CLS] {title} [SEP] {passage} [SEP], but the exact layout ultimately depends on how the tokenizer implements it.
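To make the layout concrete, here is a minimal toy sketch of how a BERT-style tokenizer arranges a (title, passage) pair. It is not DPR or `transformers` code: `build_pair_input` is a hypothetical helper, whitespace splitting stands in for WordPiece, and the example title and passage are made up. It only illustrates where the special tokens and segment ids fall.

```python
# Toy illustration only: whitespace "tokenization" stands in for WordPiece,
# and build_pair_input is a hypothetical helper, not part of DPR or transformers.
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_pair_input(title: str, text: str) -> Tuple[List[str], List[int]]:
    """Lay out a (title, text) pair the way a BERT-style tokenizer does."""
    title_tokens = title.lower().split()
    text_tokens = text.lower().split()
    # First segment: [CLS] + title + [SEP]; second segment: passage text + [SEP].
    tokens = [CLS] + title_tokens + [SEP] + text_tokens + [SEP]
    # Segment ids: 0 for the title segment, 1 for the passage segment.
    segment_ids = [0] * (len(title_tokens) + 2) + [1] * (len(text_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input("Aaron", "Aaron is a prophet")
print(" ".join(tokens))  # [CLS] aaron [SEP] aaron is a prophet [SEP]
print(segment_ids)       # [0, 0, 0, 1, 1, 1, 1, 1]
```

With the real tokenizer, decoding the ids returned by `encode` (for example with `tokenizer.decode(token_ids)`) is a quick way to check the layout your particular tokenizer produces.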
