For translation quality estimation with COMET, I don't think there is a hard limit on text length. However, from my own experience, I doubt the estimate stays accurate when the text is very long.
So, what text length (for source, reference, and hypothesis) do you recommend?
Hi @foreveronehundred! The code does not break when running very large segments, but the models truncate the input if it goes above 512 tokens. For models like wmt22-cometkiwi-da the input is shared between source and translation, which means that the total number of tokens from source and translation should not exceed 512.
Still, 512 tokens is a long input. It's more than enough to input several sentences together and evaluate entire paragraphs, though maybe not enough for an entire 2-page document.
A quick way to test this is to tokenize both inputs and count their tokens:
from transformers import XLMRobertaTokenizer

source = ["Hello, how are you?", "This is a test sentence."]
translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]

# This is the same for most COMET models
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Tokenize and count tokens for each pair
for src, trans in zip(source, translations):
    # Tokenize sentences separately
    src_tokens = tokenizer.encode(src, add_special_tokens=False)
    trans_tokens = tokenizer.encode(trans, add_special_tokens=False)

    # Jointly encode the pair (this is what gets truncated at 512 tokens)
    joint_tokens = tokenizer.encode(src, trans, add_special_tokens=True, truncation=True)

    # Output token counts
    print(f"Source: {src}\nTranslation: {trans}")
    print(f"Source tokens: {len(src_tokens)}")
    print(f"Translation tokens: {len(trans_tokens)}")
    print(f"Jointly encoded tokens: {len(joint_tokens)}")
    print("=" * 30)
Thanks for the reply. I think that length is enough for general cases.
By the way, I would like to know the token lengths of the training data. Could you share some statistics (mean, standard deviation, etc.)?
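In case it helps, a minimal sketch of how such statistics could be computed with the same tokenizer (the train_sources / train_translations lists below are hypothetical placeholders for the actual training segments):

import numpy as np
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical placeholders; replace with the real training segments
train_sources = ["Hello, how are you?", "This is a test sentence."]
train_translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]

# Joint token length of each source/translation pair, without truncation
lengths = [
    len(tokenizer.encode(src, mt, add_special_tokens=True))
    for src, mt in zip(train_sources, train_translations)
]

print(f"Mean: {np.mean(lengths):.1f}")
print(f"STD:  {np.std(lengths):.1f}")
print(f"Max:  {np.max(lengths)}")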