Megatron BERT Embedding conversion inconsistency #11970

aditya-malte · 2025-01-28T01:42:43Z

Describe the bug

There is a conversion inconsistency in both scripts/checkpoint_converters/convert_bert_hf_to_nemo.py and scripts/checkpoint_converters/convert_bert_nemo_to_hf.py.

Slack thread for reference: https://nvidia.enterprise.slack.com/archives/C0271E234TB/p1737581333923069

Steps/Code to reproduce bug

HF to NeMo conversion (tried with both MCore and non-MCore, both had poor performance):

python /opt/NeMo/scripts/checkpoint_converters/convert_bert_hf_to_nemo.py \
     --input_name_or_path ${BASE_PATH}/model/base/hf/e5_v5 \
     --output_path ${BASE_PATH}/model/base/nemo/mcore/e5_v5_mcore.nemo \
     --precision bf16 \
     --mcore True;

Low performance of converted .nemo checkpoint:

Reference numbers on test dataset - numbers attained by HF model:
{'NDCG@1': 0.54878, 'NDCG@3': 0.67963, 'NDCG@5': 0.71479, 'NDCG@10': 0.73648, 'NDCG@100': 0.75289, 'NDCG@1000': 0.75289}, {'MAP@1': 0.54878, 'MAP@3': 0.64634, 'MAP@5': 0.66585, 'MAP@10': 0.67442, 'MAP@100': 0.67852, 'MAP@1000': 0.67852}, {'Recall@1': 0.54878, 'Recall@3': 0.77642, 'Recall@5': 0.86179, 'Recall@10': 0.93089, 'Recall@100': 1.0, 'Recall@1000': 1.0}, {'P@1': 0.54878, 'P@3': 0.25881, 'P@5': 0.17236, 'P@10': 0.09309, 'P@100': 0.01, 'P@1000': 0.001}

Observed numbers on test dataset - numbers by converted NeMo model:
{'NDCG@1': 0.17073, 'NDCG@3': 0.25561, 'NDCG@5': 0.27344, 'NDCG@10': 0.30846, 'NDCG@100': 0.37972, 'NDCG@1000': 0.40227}, {'MAP@1': 0.17073, 'MAP@3': 0.23442, 'MAP@5': 0.24397, 'MAP@10': 0.25815, 'MAP@100': 0.27021, 'MAP@1000': 0.27124}, {'Recall@1': 0.17073, 'Recall@3': 0.31707, 'Recall@5': 0.36179, 'Recall@10': 0.47154, 'Recall@100': 0.83333, 'Recall@1000': 1.0}, {'P@1': 0.17073, 'P@3': 0.10569, 'P@5': 0.07236, 'P@10': 0.04715, 'P@100': 0.00833, 'P@1000': 0.001}
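
To put the gap in concrete terms, the following minimal sketch computes the absolute and relative drop for a few of the metrics reported above. The choice of metrics and the helper name `metric_drop` are illustrative assumptions, not part of the NeMo or BEIR tooling.

```python
# Subset of the metrics reported above (values copied from the issue).
hf_metrics = {"NDCG@10": 0.73648, "MAP@10": 0.67442, "Recall@10": 0.93089}
nemo_metrics = {"NDCG@10": 0.30846, "MAP@10": 0.25815, "Recall@10": 0.47154}


def metric_drop(reference, converted):
    """Return {metric: (absolute_drop, relative_drop)}.

    Every metric in `reference` must also be present in `converted`.
    """
    drops = {}
    for name, ref_value in reference.items():
        absolute = ref_value - converted[name]
        drops[name] = (absolute, absolute / ref_value)
    return drops


drops = metric_drop(hf_metrics, nemo_metrics)
for name, (absolute, relative) in drops.items():
    print(f"{name}: -{absolute:.5f} absolute, -{relative:.1%} relative")
```

With these numbers, NDCG@10 falls by more than half of its reference value.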

NeMo to HF conversion (paths redacted):

python ${BASE_PATH}/convert_bert_nemo_to_hf.py \
     --input_name_or_path e5_v5.nemo \
     --output_path  converted_to_hf;

Difference between the HF embedding output and the NeMo embedding output, as reported by the sanity test at the end of the script run (the screenshot is OCRed below):

HF Embedding: tensor([[ 0.0101, -0.0079, -0.0468, ..., -0.0086, 0.0292, -0.0387], [-0.0157, 0.0243, -0.0222, ..., 0.0096, 0.0449, -0.0374], [-0.0061, -0.0177, -0.0331, ..., -0.0209, -0.0064, -0.0333], [-0.0220, -0.0206, -0.0567, ..., -0.0187, -0.0042, 0.0092]], device='cuda:0')

NeMo Embeddings: tensor([[ 0.0007, -0.0075, -0.0460, ..., -0.0147, 0.0217, -0.0054], [-0.0172, 0.0221, -0.0213, ..., 0.0003, 0.0395, -0.0173], [-0.0091, -0.0281, -0.0338, ..., -0.0303, -0.0118, 0.0094], [-0.0276, -0.0348, -0.0681, ..., -0.0218, -0.0139, 0.0500]], device='cuda:0', dtype=torch.float16)

Difference between reference embedding and converted embedding results: tensor([[-0.0093, 0.0004, 0.0008, ..., -0.0061, -0.0075, 0.0334], [-0.0015, -0.0022, 0.0009, ..., -0.0093, -0.0054, 0.0201], [-0.0030, -0.0103, -0.0007, ..., -0.0094, -0.0054, 0.0427], [-0.0056, -0.0142, -0.0114, ..., -0.0031, -0.0097, 0.0409]], device='cuda:0')
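
As a cross-check of the OCRed output, the sketch below recomputes the element-wise difference for the first row using only the six entries visible in the printout. The middle of each vector is elided in the source and is not filled in here. The sign convention (NeMo minus HF) is inferred from the printed values, not confirmed from the script.

```python
# Visible entries of the first row only; the elided middle ("...") is not reconstructed.
hf_row = [0.0101, -0.0079, -0.0468, -0.0086, 0.0292, -0.0387]
nemo_row = [0.0007, -0.0075, -0.0460, -0.0147, 0.0217, -0.0054]

# Assumed sign convention: NeMo value minus HF value.
diff = [n - h for h, n in zip(hf_row, nemo_row)]
max_abs_diff = max(abs(d) for d in diff)

print([round(d, 4) for d in diff])
print(f"max abs diff over visible entries: {max_abs_diff:.4f}")
```

The recomputed values agree with the printed difference row to within the four-decimal rounding of the printout, and the largest visible gap is at the last element.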

Expected behavior

For HF to NeMo conversion, the NeMo model's performance should be close to that of the HF model.

For NeMo to HF conversion, the conversion error should not be high and should ideally be close to zero.
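
One way to turn "close to zero" into a pass/fail check is an absolute-tolerance comparison. The sketch below is a plain-Python illustration: the function name `within_tolerance` and the tolerance of 1e-3 are assumptions chosen for this example, not values used by the conversion scripts. It is applied to the visible entries of the second row of the reported embeddings.

```python
import math


def within_tolerance(reference, converted, atol=1e-3):
    """True if every pair of elements differs by at most atol (hypothetical threshold)."""
    return all(
        math.isclose(r, c, rel_tol=0.0, abs_tol=atol)
        for r, c in zip(reference, converted)
    )


# Visible entries of the second row from the reported output (elided middle omitted).
hf_row = [-0.0157, 0.0243, -0.0222, 0.0096, 0.0449, -0.0374]
nemo_row = [-0.0172, 0.0221, -0.0213, 0.0003, 0.0395, -0.0173]

print(within_tolerance(hf_row, hf_row))    # identical inputs pass
print(within_tolerance(hf_row, nemo_row))  # reported values differ by more than 1e-3
```

On real tensors, `torch.allclose(a, b, atol=...)` serves the same purpose.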

Environment overview (please complete the following information)

Ran inside NeMo FW container 24.07, with pip install beir


aditya-malte added the bug Something isn't working label Jan 28, 2025
