Different versions of COMET code give different scores with the same model and data. #204

Closed
bhaddow opened this issue Feb 27, 2024 · 0 comments
Labels
bug Something isn't working

bhaddow commented Feb 27, 2024

🐛 Bug

Using COMET 2.2.1 (with Python 3.9) and the wmt22-comet-da model I get a score of 0.7982, but using COMET 1.1.2 (with Python 3.7) I get a score of 0.8618. This is on exactly the same source, target and reference files.

I appreciate that 1.1.2 is an old version and should not be used, but many people will have old versions installed and be unaware that they should not be used with new models. The consequence of this bug is that research papers should report both the COMET model used and the version of the software.
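One way to make that reporting routine is to log the installed package version next to every score. The following is a minimal sketch, assuming COMET was installed from PyPI under the distribution name `unbabel-comet`; the helper names `installed_version` and `report_line` are hypothetical and not part of COMET.

```python
def installed_version(dist_name="unbabel-comet"):
    """Return the installed version of a distribution, or "not installed"."""
    try:
        # Standard library from Python 3.8 onwards.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Assumption: on Python 3.7 the importlib_metadata backport is available.
        from importlib_metadata import PackageNotFoundError, version
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


def report_line(model_name, system_score, dist_name="unbabel-comet"):
    """Format a score together with the model name and software version."""
    return "{} (COMET {}): {:.4f}".format(
        model_name, installed_version(dist_name), system_score
    )
```

Calling `report_line("wmt22-comet-da", 0.7982)` in each environment would then produce a line that states which software version the score came from.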

To Reproduce

Install COMET 2.2.1 on Python 3.9 and score the test files. I used an en->mt translation of NTREX produced with this model: https://huggingface.co/HPLT/mt-mt-en-v1.0-hplt_opus. I have attached the src, hypo and ref files.

Install COMET 1.1.2 on Python 3.7, score the same files.

Compare scores.
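The steps above can be sketched in Python roughly as follows. This is a hedged sketch, not the exact commands I ran: `load_samples` and `system_score` are hypothetical helpers, `checkpoint_path` is assumed to point at an already downloaded wmt22-comet-da checkpoint, and the handling of the two return types reflects my understanding that 2.x returns an object with a `system_score` attribute while 1.x returns a `(segment_scores, system_score)` tuple.

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_samples(src_path, hypo_path, ref_path):
    """Build the list of {"src", "mt", "ref"} dicts that COMET takes as input."""
    src, mt, ref = (read_lines(p) for p in (src_path, hypo_path, ref_path))
    if not (len(src) == len(mt) == len(ref)):
        raise ValueError("src, hypo and ref files must have the same number of lines")
    return [{"src": s, "mt": m, "ref": r} for s, m, r in zip(src, mt, ref)]


def system_score(checkpoint_path, samples, batch_size=8):
    """Score samples with whichever COMET version is installed (CPU only)."""
    # Imported here so the file helpers above work without COMET installed.
    from comet import load_from_checkpoint

    model = load_from_checkpoint(checkpoint_path)
    output = model.predict(samples, batch_size=batch_size, gpus=0)
    # Assumption: 2.x exposes .system_score; 1.x returns (segment_scores, system_score).
    if hasattr(output, "system_score"):
        return output.system_score
    return output[1]
```

Running `system_score(checkpoint_path, load_samples("src.txt", "hypo.txt", "ref.txt"))` once in each environment gives the two numbers to compare.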

Expected behaviour

With the same model and data, COMET should give the same scores.

Environment

OS: Ubuntu 20.04.6 LTS
COMET versions: 2.2.1 (Python 3.9) and 1.1.2 (Python 3.7)
Model: wmt22-comet-da

hypo.txt
ref.txt
src.txt

bhaddow added the bug label Feb 27, 2024
bhaddow closed this as completed Feb 27, 2024