How do the passage embeddings use the 'title' of the passage #224

jigsaw2212 · 2022-07-25T07:11:49Z

Hi,
I'd like to better understand how the codebase uses a passage's 'title' when generating the passage embeddings.

xhluca · 2022-08-01T23:21:28Z

You can see how it's ingested here:

DPR/dpr/models/hf_models.py

Lines 293 to 300 in d9f3e41

    
    token_ids = self.tokenizer.encode(
        title,
        text_pair=text,
        add_special_tokens=add_special_tokens,
        max_length=self.max_length if apply_max_len else 10000,
        pad_to_max_length=False,
        truncation=True,
    )

Hugging Face tokenizers accept a pair of sequences (e.g. for question answering, NLI, etc.). I believe the pair is usually joined with a separator token, i.e. {text} [SEP] {text_pair}. Here the first argument is the title and text_pair is the passage text, so the encoded input should look like {title} [SEP] {passage text}, plus whatever other special tokens the tokenizer adds (for BERT, a leading [CLS] and a trailing [SEP]). Ultimately, the exact layout depends on how the tokenizer implements it.
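To make the layout concrete, here is a minimal sketch of how a BERT-style tokenizer arranges a (title, text) pair. It is a toy illustration, not the Hugging Face implementation: `build_pair_input` is a hypothetical helper, and whitespace splitting stands in for real WordPiece tokenization. Truncation and conversion to ids are left out.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_pair_input(title: str, text: str) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: lay out a (title, text) pair the way BERT does.

    Assumption: lowercased whitespace splitting replaces WordPiece, so the
    tokens differ from a real tokenizer's, but the special-token layout
    ([CLS] title [SEP] text [SEP]) matches BERT's pair format.
    """
    title_tokens = title.lower().split()
    text_tokens = text.lower().split()

    # First segment: [CLS] + title + [SEP]; second segment: text + [SEP].
    tokens = [CLS] + title_tokens + [SEP] + text_tokens + [SEP]

    # Segment ids (token_type_ids in Hugging Face): 0 for the first
    # segment, 1 for the second.
    segment_ids = [0] * (len(title_tokens) + 2) + [1] * (len(text_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input("Aaron", "Aaron is a prophet")
print(tokens)
print(segment_ids)
```

So the title is not embedded separately: it is part of the same input sequence as the passage text, and the encoder sees both when it produces the passage embedding.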
