Rouge score #157

Muennighoff · 2023-05-25T16:56:53Z

It says "Multi-lingual ROUGE is unsupported as general token splitting is absent from [rouge-score](https://github.com/google-research/google-research/tree/master/rouge). For multi-lingual tasks, please ignore rouge metrics until this is resolved. NOTE: English works as intended.", but it also works for e.g. Spanish and other languages that split on spaces like English, right?

cc @jon-tow

jon-tow · 2023-05-25T19:24:06Z

I probably would not recommend it for Spanish or any other "normal" space-delimited language in its current state. The default tokenizer used in `rouge_scorer` replaces every character that is not an (English) alphanumeric with a space, so, for example, the text "Cristóbal está ayudando a su Abuela" would be converted to something like "Crist bal est ayudando a su Abuela", dropping the ó and á. See the `tokenize` definition here:
https://github.com/google-research/google-research/blob/0aa035ff363066089612fb37e3e137a71cadb9c0/rouge/tokenize.py#L50-L61
Though, if you could loosen the `non_alpha_numeric` pattern so that it keeps accented letters etc., it should be fine.
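
To make the difference concrete, here is a minimal sketch that does not import `rouge_score` at all. It assumes the default behaviour is roughly "lowercase, replace anything matching `[^a-z0-9]+` with a space, split on whitespace"; the names `ASCII_NON_ALNUM`, `UNICODE_NON_ALNUM` and `simple_tokenize` are made up for illustration and are not part of the library. The loosened pattern is just one possible choice, relying on Python's Unicode-aware `\W`.

```python
import re
import unicodedata

# Assumption: mirrors the ASCII-only pattern used by the default tokenizer.
ASCII_NON_ALNUM = re.compile(r"[^a-z0-9]+")

# Hypothetical loosened pattern: for str inputs, \W is Unicode-aware in
# Python 3, so accented letters count as word characters and are kept.
# The underscore is added so it is still treated as a separator.
UNICODE_NON_ALNUM = re.compile(r"[\W_]+")


def simple_tokenize(text, pattern):
    """Lowercase, replace matches of `pattern` with spaces, split on whitespace."""
    # Normalize to NFC so that "o" + combining accent becomes a single letter.
    text = unicodedata.normalize("NFC", text).lower()
    return pattern.sub(" ", text).split()


text = "Cristóbal está ayudando a su Abuela"
ascii_tokens = simple_tokenize(text, ASCII_NON_ALNUM)
unicode_tokens = simple_tokenize(text, UNICODE_NON_ALNUM)

print(ascii_tokens)    # accented letters are dropped and words get split
print(unicode_tokens)  # accented letters survive
```

With the ASCII-only pattern, "Cristóbal" is broken into two tokens and "está" loses its last letter, which would distort n-gram overlap for Spanish. How a loosened pattern gets plugged into the scorer depends on the `rouge_score` version in use, so treat this only as an illustration of the tokenization difference.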
