For translation quality estimation with COMET, I don't think there is a hard limit on text length. However, from my own experience, I doubt the estimate stays accurate when the text is very long.
So, what text length (for source, reference, and hypothesis) do you recommend?
Hi @foreveronehundred! The code does not break when running very large segments, but the models truncate the input if it goes above 512 tokens. For models like wmt22-cometkiwi-da the input is shared between source and translation, which means that the total number of tokens from source and translation should not exceed 512.
Still, 512 tokens is a long input. It's more than enough to input several sentences together and evaluate entire paragraphs, though maybe not enough for an entire 2-page document.
A quick way to test this is to tokenize both inputs and count their tokens:
from transformers import XLMRobertaTokenizer

source = ["Hello, how are you?", "This is a test sentence."]
translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]

# This is the same for most COMET models
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Tokenize and count tokens for each pair
for src, trans in zip(source, translations):
    # Tokenize sentences separately
    src_tokens = tokenizer.encode(src, add_special_tokens=False)
    trans_tokens = tokenizer.encode(trans, add_special_tokens=False)

    # Jointly encode the pair (this is what gets truncated at 512 tokens)
    joint_tokens = tokenizer.encode(src, trans, add_special_tokens=True, truncation=True)

    # Output token counts
    print(f"Source: {src}\nTranslation: {trans}")
    print(f"Source tokens: {len(src_tokens)}")
    print(f"Translation tokens: {len(trans_tokens)}")
    print(f"Jointly encoded tokens: {len(joint_tokens)}")
    print("=" * 30)
Thanks for the reply. I think that length is enough for general cases.
By the way, I would like to know the token lengths of the training data. Could you share some statistics (mean, standard deviation, etc.)?
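In case it helps, a minimal sketch of how such statistics could be computed with the same tokenizer (the train_sources / train_translations lists below are hypothetical placeholders for the actual training segments):

import numpy as np
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

# Hypothetical placeholders; replace with the real training segments
train_sources = ["Hello, how are you?", "This is a test sentence."]
train_translations = ["Bonjour, comment ça va?", "Ceci est une phrase de test."]

# Joint token length of each source/translation pair, without truncation
lengths = [
    len(tokenizer.encode(src, mt, add_special_tokens=True))
    for src, mt in zip(train_sources, train_translations)
]

print(f"Mean: {np.mean(lengths):.1f}")
print(f"STD:  {np.std(lengths):.1f}")
print(f"Max:  {np.max(lengths)}")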