This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
I am experiencing an issue with the global variable ‘dpr_all_documents’, apparently involving tokenizer parallelism; please see the logs below. This issue has been raised before for the DPR repo.
Note that:
- The all_docs size is as expected (for testing purposes I use a test document set of 5 entries rather than the wiki dataset; please see the log below).
- validation_workers is set to 1 in dense_retriever.yaml (that setting isn't the problem; I set it to 1 only as a safety measure).
- I have tried setting TOKENIZERS_PARALLELISM=false; it makes no difference. NOTE: Transformers library "0.8.0rc4" currently has an issue with this setting not taking effect.
- I have tried refactoring dpr_all_documents into a regular method/function parameter and removing the ‘global’ definition; that, however, results in a ‘KeyError’ exception for the id_prefix of the datasource defined in default_sources.yaml.
Please let me know if you have any questions.
Thanks,
Mladen
Logs:
[2023-09-21 07:35:40,997][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:43,260][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=bert-base-uncased
[2023-09-21 07:35:44,405][root][INFO] - Loading saved model state ...
[2023-09-21 07:35:44,611][root][INFO] - Selecting standard question encoder
[2023-09-21 07:35:44,677][root][INFO] - Encoder vector_size=768
[2023-09-21 07:35:44,677][root][INFO] - qa_dataset: dpr_ds_retreiving_questions
[2023-09-21 07:35:44,680][root][INFO] - questions len 6
[2023-09-21 07:35:44,680][root][INFO] - questions_text len 0
[2023-09-21 07:35:44,680][root][INFO] - Local Index class <class 'dpr.indexer.faiss_indexers.DenseFlatIndexer'>
[2023-09-21 07:35:44,680][root][INFO] - Using special token None
[2023-09-21 07:35:45,875][root][INFO] - Total encoded queries tensor torch.Size([6, 768])
[2023-09-21 07:35:45,877][root][INFO] - ctx_sources: <class 'dpr.data.retriever_data.CsvCtxSrc'>
[2023-09-21 07:35:45,877][root][INFO] - id_prefixes per dataset: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,877][root][INFO] - ctx_files_patterns: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Embeddings files id prefixes: ['ds_default_sources_yaml_prefix:']
[2023-09-21 07:35:45,878][root][INFO] - Reading all passages data from files: ['/Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0']
[2023-09-21 07:35:45,878][root][INFO] - Reading file /Users/directory/Developer/DPR-main/checkpoints/generated_embeddings_0
[2023-09-21 07:35:45,880][root][INFO] - data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Total data indexed 5
[2023-09-21 07:35:45,880][root][INFO] - Data indexing completed.
[2023-09-21 07:35:45,880][root][INFO] - Serializing index to /Users/directory/Developer/DPR-main/checkpoints/faiss_index_ctx
[2023-09-21 07:35:45,883][root][INFO] - index search time: 0.002260 sec.
[2023-09-21 07:35:45,884][dpr.data.retriever_data][INFO] - Reading file /Users/directory/Developer/DPR-main/dpr/downloads/data/wikipedia_split/psgs_w100-s.tsv
[2023-09-21 07:35:45,885][root][INFO] - Loaded ctx data: 5
[2023-09-21 07:35:45,885][root][INFO] - validating passages. size=5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - all_docs size 5
[2023-09-21 07:35:45,885][dpr.data.qa_validation][INFO] - dpr_all_documents size 5
[2023-09-21 07:35:45,925][dpr.data.qa_validation][INFO] - Matching answers in top docs...
2023-09-21 07:35:49,689 [INFO] faiss.loader: Loading faiss with AVX2 support.
2023-09-21 07:35:49,717 [INFO] faiss.loader: Successfully loaded faiss with AVX2 support.
/Users/directory/Developer/DPR-main/dense_retriever.py:472: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_path="conf", config_name="dense_retriever")
Error executing job with overrides: []
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 127, in check_answer
doc = dpr_all_documents[doc_id]
NameError: name 'dpr_all_documents' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 628, in main
questions_doc_hits = validate(
File "/Users/directory/Developer/DPR-main/dense_retriever.py", line 309, in validate
match_stats = calculate_matches(passages, answers, result_ctx_ids, workers_num, match_type)
File "/Users/directory/Developer/DPR-main/dpr/data/qa_validation.py", line 68, in calculate_matches
scores = processes.map(get_score_partial, questions_answers_docs)
File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 364, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Users/directory/.pyenv/versions/3.9.12/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
NameError: name 'dpr_all_documents' is not defined
The issue with dpr_all_documents arises when running dense_retriever.py with a small input dataset. The log above comes from an input dataset of six questions in total. When the dataset is very small, the issue surfaces: dpr_all_documents is not available to all of the processes that try to access it.
A simple (though not optimal) workaround is to pass the variable as an additional parameter in the calculate_matches function. That, of course, uses memory inefficiently.
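For anyone landing here, a minimal sketch of that workaround follows. It is not the DPR code: the signatures, the `Docs`/`Item` aliases, the substring matching, and the single-worker shortcut are simplified, hypothetical stand-ins for illustration only. The idea is to bind the documents to the worker function with `functools.partial` rather than reading a module-level global. As an aside, a `NameError` inside a pool worker is also what one would expect under the `spawn` start method (the default on macOS since Python 3.8), where workers re-import the module and never see a global assigned at runtime in the parent; that is an assumption about the cause, not something the logs confirm.

```python
from functools import partial
from multiprocessing import Pool
from typing import Dict, List, Tuple

# Hypothetical, simplified stand-ins for the DPR data: doc_id -> (text, title)
Docs = Dict[str, Tuple[str, str]]
# One work item: (answers for a question, ids of its retrieved documents)
Item = Tuple[List[str], List[str]]


def check_answer(item: Item, all_docs: Docs) -> List[bool]:
    """Return one hit flag per retrieved document for a single question."""
    answers, doc_ids = item
    hits = []
    for doc_id in doc_ids:
        text, _title = all_docs[doc_id]  # explicit argument, no module-level global
        hits.append(any(a.lower() in text.lower() for a in answers))
    return hits


def calculate_matches(all_docs: Docs, items: List[Item], workers_num: int) -> List[List[bool]]:
    # Binding all_docs with partial means it is pickled and sent to the workers
    # along with the task chunks, which is the memory cost mentioned above.
    get_score = partial(check_answer, all_docs=all_docs)
    if workers_num <= 1:
        # Assumption of this sketch: a single worker does not need a pool.
        return [get_score(item) for item in items]
    with Pool(processes=workers_num) as pool:
        return pool.map(get_score, items)


if __name__ == "__main__":
    docs = {
        "p:1": ("Paris is the capital of France.", "France"),
        "p:2": ("Berlin is the capital of Germany.", "Germany"),
    }
    items = [(["Paris"], ["p:1", "p:2"]), (["Rome"], ["p:2"])]
    print(calculate_matches(docs, items, workers_num=2))
```

Because the documents travel with the tasks, this should behave the same under `fork` and `spawn`. For a large passage collection, a pool `initializer` that stores the documents once per worker would likely be cheaper, at the cost of bringing a per-process global back.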