Different versions of COMET code give different scores with the same model and data. #204

bhaddow · 2024-02-27T16:28:54Z

🐛 Bug

Using COMET 2.2.1 (with Python 3.9) and the wmt22-comet-da model I get a score of 0.7982, but using COMET 1.1.2 (with Python 3.7) I get a score of 0.8618. Both runs use exactly the same source, target and reference files.

I appreciate that 1.1.2 is an old version and should not be used, but many people will have old versions installed and be unaware that they should not be used with new models. The consequence of this bug is that research papers should report both the COMET model used and the version of the software.
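As a rough illustration of what reporting both pieces of information could look like, here is a minimal sketch that builds a one-line description of the setup. The helper name `describe_comet_setup` is made up for this example, and it assumes COMET was installed from PyPI under the distribution name `unbabel-comet`. It relies on `importlib.metadata`, which needs Python 3.8 or newer, so on the Python 3.7 environment it would need the `importlib_metadata` backport instead.

```python
import sys
from importlib import metadata  # standard library from Python 3.8 onwards


def describe_comet_setup(model_name, package="unbabel-comet"):
    """Return a one-line description of the metric setup for a paper or log.

    `describe_comet_setup` is a hypothetical helper, and the default
    distribution name "unbabel-comet" is an assumption about how COMET
    was installed.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "not installed"
    python = ".".join(str(part) for part in sys.version_info[:3])
    return f"model={model_name}; {package}={version}; python={python}"


if __name__ == "__main__":
    print(describe_comet_setup("wmt22-comet-da"))
```

Running this in each environment and pasting the line next to the reported score would make mismatches like the one above easier to spot.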

To Reproduce

Install COMET 2.2.1 on Python 3.9 and score the test files. I used an en->mt translation of NTREX produced with this model: https://huggingface.co/HPLT/mt-mt-en-v1.0-hplt_opus. The src, hypo and ref files are attached.

Install COMET 1.1.2 on Python 3.7, score the same files.

Compare scores.
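The steps above could be scripted roughly as follows, so that the same file runs in both environments. This is only a sketch, not the exact commands I ran. The helper names are hypothetical, and it assumes that `predict()` returns a `(segment_scores, system_score)` pair in 1.x and an object with a `system_score` attribute in 2.x, which is my understanding of the two releases. The model identifier passed as `model_name` may also need to differ between versions.

```python
def load_samples(src_path, mt_path, ref_path):
    """Read three line-aligned files into a list of src/mt/ref dicts."""
    columns = []
    for path in (src_path, mt_path, ref_path):
        with open(path, encoding="utf-8") as handle:
            columns.append([line.rstrip("\n") for line in handle])
    if len({len(column) for column in columns}) != 1:
        raise ValueError("src, hypo and ref files must have the same number of lines")
    return [{"src": s, "mt": m, "ref": r} for s, m, r in zip(*columns)]


def system_score(prediction):
    """Extract the corpus-level score from either assumed return style."""
    if hasattr(prediction, "system_score"):  # assumed 2.x-style result object
        return float(prediction.system_score)
    _segment_scores, corpus_score = prediction  # assumed 1.x-style pair
    return float(corpus_score)


def score_files(model_name, src_path, mt_path, ref_path, gpus=0):
    """Score the files with whichever COMET version is installed."""
    # Imported lazily so the helpers above also work without COMET installed.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model(model_name))
    samples = load_samples(src_path, mt_path, ref_path)
    return system_score(model.predict(samples, batch_size=8, gpus=gpus))
```

Calling `score_files(model_name, "src.txt", "hypo.txt", "ref.txt")` in each environment and comparing the two numbers reproduces the comparison.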

Expected behaviour

With the same model and data, COMET should give the same scores.

Environment

OS: Ubuntu 20.04.6 LTS
COMET versions: 2.2.1 and 1.1.2
Model: wmt22-comet-da

hypo.txt
ref.txt
src.txt

bhaddow added the bug label Feb 27, 2024

bhaddow closed this as completed Feb 27, 2024
