'dpr_all_documents' is not defined #249

golubovic · 2023-09-21T12:08:18Z

Hi,

I am running into an issue with the global variable 'dpr_all_documents' that appears to involve tokenizer parallelism; please see the logs below. This issue has been raised before for the DPR repo.

Note that:

The all_docs size is as expected (for testing purposes I use a test document set of 5 entries rather than the wiki dataset; please see the log below).
validation_workers is set to 1 in dense_retriever.yaml (that setting isn't the problem; I set it to 1 only as a safety measure).
I have tried setting TOKENIZERS_PARALLELISM=false (it makes no difference). NOTE: Transformers library "0.8.0rc4" currently has an issue with this setting not taking effect.
I have tried downgrading the transformers and tokenizers libraries to previous versions (no success). A good article/comment by [Allohvk] on what is going on with the Rust tokenizers used by Hugging Face can be found here: https://stackoverflow.com/questions/62691279/how-to-disable-tokenizers-parallelism-true-false-warning
I have tried refactoring dpr_all_documents into a regular function parameter and removing the 'global' definition; however, that results in a KeyError exception for the id_prefix of the datasource defined in default_sources.yaml.

Please let me know if you have any questions.

Thanks,
Mladen

Logs:
[2023-09-21 07:35:40,997][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:43,260][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:44,405][root][INFO] - Loading saved model state ...
[2023-09-21 07:35:44,611][root][INFO] - Selecting standard question encoder
[2023-09-21 07:35:44,677][root][INFO] - Encoder vector_size=768
[2023-09-21 07:35:44,677][root][INFO] - qa_dataset: dpr_ds_retreiving_questions
[2023-09-21 07:35:44,680][root][INFO] - questions len 6
[2023-09-21 07:35:44,680][root][INFO] - questions_text len 0
[2023-09-21 07:35:44,680][root][INFO] - Local Index class <class 'dpr.indexer.faiss_indexers.DenseFlatIndexer'>
[2023-09-21 07:35:44,680][root][INFO] - Using special token None
[2023-09-21 07:35:45,875][root][INFO] - Total encoded queries tensor torch.Size([6, 768])
[2023-09-21 07:35:45,877][root][INFO] - ctx_sources: <class 'dpr.data.retriever_data.CsvCtxSrc'>
[2023-09-21 07:35:45,877][root][INFO] - id_prefixes per dataset: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,877][root][INFO] - ctx_files_patterns: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Embeddings files id prefixes: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,878][root][INFO] - Reading all passages data from files: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Reading file /Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0
[2023-09-21 07:35:45,880][root][INFO] - data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Total data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Data indexing completed.
[2023-09-21 07:35:45,880][root][INFO] - Serializing index to /Users/directory/Developer/DPR-main/checkpoints/faiss_index_ctx
[2023-09-21 07:35:45,883][root][INFO] - index search time: 0.002260 sec.
[2023-09-21 07:35:45,884][dpr.data.retriever_data][INFO] - Reading file /Users/directory/Developer/DPR-main/dpr/downloads/data/wikipedia_split/psgs_w100-s.tsv
[2023-09-21 07:35:45,885][root][INFO] - Loaded ctx data: 5
[2023-09-21 07:35:45,885][root][INFO] - validating passages. size=5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - all_docs size 5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - dpr_all_documents size 5
[2023-09-21 07:35:45,925][dpr.data.qa_validation][INFO] - Matching answers in top docs...
2023-09-21 07:35:49,689 [INFO] faiss.loader: Loading faiss with AVX2 support.
2023-09-21 07:35:49,717 [INFO] faiss.loader: Successfully loaded faiss with AVX2 support.
/Users/directory/Developer/DPR-main/dense_retriever.py:472: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_path="conf", config_name="dense_retriever")
Error executing job with overrides: []
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 127, in check_answer
    doc = dpr_all_documents[doc_id]
NameError: name 'dpr_all_documents' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 628, in main
    questions_doc_hits = validate(
  File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 309, in validate
    match_stats = calculate_matches(passages, answers, result_ctx_ids, workers_num, match_type)
  File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 68, in calculate_matches
    scores = processes.map(get_score_partial, questions_answers_docs)
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
NameError: name 'dpr_all_documents' is not defined

Hannibal046 · 2023-10-16T06:21:48Z

dpr_all_documents is defined here, and it works on my side:
https://github.com/facebookresearch/DPR/blob/a31212dc0a54dfa85d8bfa01e1669f149ac832b7/dpr/data/qa_validation.py#L56C1-L57

golubovic · 2023-10-18T18:24:52Z

The issue with dpr_all_documents arises when running dense_retriever.py with a small input dataset. The log example above uses an input dataset of six questions in total. When the dataset is very small, the issue surfaces: dpr_all_documents is not available to all of the processes that try to access it.
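A possible contributing factor, which is an assumption and not something confirmed in this thread: Python's multiprocessing only lets workers inherit globals assigned at runtime when the start method is "fork". Under "spawn" (the default on macOS since Python 3.8, and on Windows) each worker re-imports the module, so a global assigned inside a function of the parent process does not exist there. The helper name workers_inherit_globals below is hypothetical; the sketch only reports which start method is active.

```python
import multiprocessing as mp


def workers_inherit_globals(start_method: str) -> bool:
    """Return True if pool workers see globals the parent assigned at runtime.

    Only "fork" copies the parent's memory into the child. "spawn" and
    "forkserver" start from a fresh interpreter state instead.
    """
    return start_method == "fork"


if __name__ == "__main__":
    method = mp.get_start_method()
    print(f"start method: {method}")
    print(f"workers inherit runtime globals: {workers_inherit_globals(method)}")
```

If this reports "spawn", the NameError would be expected regardless of dataset size, which may be worth ruling out before attributing the problem to the input size.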

A simple (though not optimal) workaround is to pass the variable as an additional parameter inside the calculate_matches function (please see below). The consequence, of course, is inefficient use of memory.

get_score_partial = partial(check_answer, match_type=match_type, tokenizer=tokenizer, dpr_all_documents=dpr_all_documents)
