Megatron BERT Embedding conversion inconsistency #11970

Status: Open
Labels: bug (Something isn't working)

aditya-malte (Contributor) opened this issue Jan 28, 2025 · 0 comments
aditya-malte commented Jan 28, 2025

Describe the bug

There is a conversion inconsistency in both scripts/checkpoint_converters/convert_bert_hf_to_nemo.py and scripts/checkpoint_converters/convert_bert_nemo_to_hf.py.

Slack thread for reference: https://nvidia.enterprise.slack.com/archives/C0271E234TB/p1737581333923069

Steps/Code to reproduce bug

  1. HF to NeMo conversion (tried with both MCore and non-MCore; both had poor performance):

     python /opt/NeMo/scripts/checkpoint_converters/convert_bert_hf_to_nemo.py \
         --input_name_or_path ${BASE_PATH}/model/base/hf/e5_v5 \
         --output_path ${BASE_PATH}/model/base/nemo/mcore/e5_v5_mcore.nemo \
         --precision bf16 \
         --mcore True

Low performance of the converted .nemo checkpoint:

Reference numbers on the test dataset (attained by the HF model):
{'NDCG@1': 0.54878, 'NDCG@3': 0.67963, 'NDCG@5': 0.71479, 'NDCG@10': 0.73648, 'NDCG@100': 0.75289, 'NDCG@1000': 0.75289}, {'MAP@1': 0.54878, 'MAP@3': 0.64634, 'MAP@5': 0.66585, 'MAP@10': 0.67442, 'MAP@100': 0.67852, 'MAP@1000': 0.67852}, {'Recall@1': 0.54878, 'Recall@3': 0.77642, 'Recall@5': 0.86179, 'Recall@10': 0.93089, 'Recall@100': 1.0, 'Recall@1000': 1.0}, {'P@1': 0.54878, 'P@3': 0.25881, 'P@5': 0.17236, 'P@10': 0.09309, 'P@100': 0.01, 'P@1000': 0.001}

Observed numbers on the test dataset (attained by the converted NeMo model):
{'NDCG@1': 0.17073, 'NDCG@3': 0.25561, 'NDCG@5': 0.27344, 'NDCG@10': 0.30846, 'NDCG@100': 0.37972, 'NDCG@1000': 0.40227}, {'MAP@1': 0.17073, 'MAP@3': 0.23442, 'MAP@5': 0.24397, 'MAP@10': 0.25815, 'MAP@100': 0.27021, 'MAP@1000': 0.27124}, {'Recall@1': 0.17073, 'Recall@3': 0.31707, 'Recall@5': 0.36179, 'Recall@10': 0.47154, 'Recall@100': 0.83333, 'Recall@1000': 1.0}, {'P@1': 0.17073, 'P@3': 0.10569, 'P@5': 0.07236, 'P@10': 0.04715, 'P@100': 0.00833, 'P@1000': 0.001}
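For context, the NDCG/MAP/Recall/P numbers above come from a BEIR-style retrieval evaluation (the container was set up with pip install beir, see the environment section). A minimal sketch of how the reference HF numbers can be reproduced is below; the dataset folder is a placeholder, the encoder path is the HF checkpoint used in the command above, and the converted NeMo numbers were obtained analogously with the converted model as the encoder.

```python
# Sketch only: BEIR exact-search dense retrieval evaluation.
# "path/to/beir_dataset" is a placeholder for a BEIR-format dataset
# (corpus.jsonl, queries.jsonl, qrels/test.tsv).
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

corpus, queries, qrels = GenericDataLoader(data_folder="path/to/beir_dataset").load(split="test")

# Wrap the HF checkpoint as a dense encoder (sentence-transformers style, mean pooling assumed).
encoder = DRES(models.SentenceBERT("model/base/hf/e5_v5"), batch_size=64)
retriever = EvaluateRetrieval(encoder, score_function="cos_sim")

# Brute-force retrieval, then NDCG/MAP/Recall/P at k = 1, 3, 5, 10, 100, 1000.
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, _map, recall, precision)
```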

  2. NeMo to HF conversion (paths redacted):

     python ${BASE_PATH}/convert_bert_nemo_to_hf.py \
         --input_name_or_path e5_v5.nemo \
         --output_path converted_to_hf

Error/difference between the HF embedding output and the NeMo embedding output, printed as part of the sanity test at the end of the script run (screenshot OCR'd below):

HF Embedding: tensor([[ 0.0101, -0.0079, -0.0468, ..., -0.0086,  0.0292, -0.0387],
        [-0.0157,  0.0243, -0.0222, ...,  0.0096,  0.0449, -0.0374],
        [-0.0061, -0.0177, -0.0331, ..., -0.0209, -0.0064, -0.0333],
        [-0.0220, -0.0206, -0.0567, ..., -0.0187, -0.0042,  0.0092]], device='cuda:0')

NeMo Embeddings: tensor([[ 0.0007, -0.0075, -0.0460, ..., -0.0147,  0.0217, -0.0054],
        [-0.0172,  0.0221, -0.0213, ...,  0.0003,  0.0395, -0.0173],
        [-0.0091, -0.0281, -0.0338, ..., -0.0303, -0.0118,  0.0094],
        [-0.0276, -0.0348, -0.0681, ..., -0.0218, -0.0139,  0.0500]], device='cuda:0', dtype=torch.float16)

Difference between reference embedding and converted embedding results: tensor([[-0.0093,  0.0004,  0.0008, ..., -0.0061, -0.0075,  0.0334],
        [-0.0015, -0.0022,  0.0009, ..., -0.0093, -0.0054,  0.0201],
        [-0.0030, -0.0103, -0.0007, ..., -0.0094, -0.0054,  0.0427],
        [-0.0056, -0.0142, -0.0114, ..., -0.0031, -0.0097,  0.0409]], device='cuda:0')
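The same sanity check can be reproduced outside the converter by encoding an identical batch with the original HF checkpoint and the converted-back folder, then comparing the pooled embeddings. A rough sketch, assuming mean pooling over the last hidden state (as E5 uses); the paths and sentences are placeholders:

```python
# Sketch only: compare embeddings from the original HF checkpoint and the
# folder produced by convert_bert_nemo_to_hf.py (--output_path converted_to_hf).
import torch
from transformers import AutoModel, AutoTokenizer

REFERENCE = "model/base/hf/e5_v5"   # placeholder path to the original HF model
CONVERTED = "converted_to_hf"       # folder written by the NeMo-to-HF converter

tokenizer = AutoTokenizer.from_pretrained(REFERENCE)
sentences = ["query: example sentence one", "passage: example sentence two"]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

def embed(path):
    model = AutoModel.from_pretrained(path).eval()
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state             # [batch, seq, hidden]
    mask = batch["attention_mask"].unsqueeze(-1)               # [batch, seq, 1]
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # masked mean pooling
    return torch.nn.functional.normalize(pooled, dim=-1)

ref, conv = embed(REFERENCE), embed(CONVERTED)
print("max abs diff:", (ref - conv).abs().max().item())
print("cosine sim  :", torch.nn.functional.cosine_similarity(ref, conv).tolist())
```

If the conversion were lossless (up to precision), the max abs diff should be on the order of bf16/fp16 rounding error rather than the roughly 0.01 to 0.04 differences shown above.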

Expected behavior

For HF to NeMo conversion, the converted NeMo model's performance should be close to that of the original HF model.

For NeMo to HF conversion, the conversion error should not be high and should ideally be close to zero.

Environment overview (please complete the following information)

Ran inside the NeMo FW container 24.07, with beir installed via pip install beir.
