BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another (Wikipedia). Since there can be multiple good translations of one input, the question remains which one is the best. BLEU provides a metric that scores a translation by whether the generated words appear in the references.
In general, the BLEU metric counts the words generated in the translation and compares them to their occurrences in the references.
- Generated words = machine output
- References = human labels
Precision:
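The formula for this heading seems to be missing from the notes; a standard definition (my reconstruction) counts how many candidate words appear in any reference:

```latex
P = \frac{\text{number of candidate words that also appear in a reference}}{\text{total number of words in the candidate}}
```

This naive precision is easy to game: a candidate that repeats a single matching word gets a perfect score.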
Modified precision:
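No formula survives under this heading either; the standard formulation clips each word's count by its maximum count in any single reference, so repeating a matching word cannot inflate the score:

```latex
P = \frac{\sum_{w \in \text{cand}} \min\bigl(\mathrm{Count}(w),\; \mathrm{Count}_{\max\text{-ref}}(w)\bigr)}{\sum_{w \in \text{cand}} \mathrm{Count}(w)}
```

For example, the candidate "the the the the" against the reference "the cat is on the mat" gets min(4, 2)/4 = 0.5 instead of 1.0.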
N-gram precision:
- if the machine translation exactly equals one reference -> P = 1.0
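The clipping above generalizes from single words to n-grams; writing Count_clip for the clipped count, the standard n-gram precision is:

```latex
p_n = \frac{\sum_{\text{n-gram} \in C} \mathrm{Count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in C} \mathrm{Count}(\text{n-gram})}
```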
BLEU Score:
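The score formula itself is missing here; the usual form combines the n-gram precisions with a geometric mean (weights w_n, typically uniform 1/N with N = 4):

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
```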
where
- BP: penalizes score for short sentences
- short sentences tend to have good scores, because of the number of word occurrences
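With c the candidate length and r the effective reference length, the brevity penalty from the bullets above is usually written as:

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Since the n-gram precisions have the candidate length in the denominator, a very short candidate that only emits safe words can score well; BP counteracts exactly this.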
The BLEU metric is useful for NMT (neural machine translation) and image captioning, but a poor fit for speech recognition, where there is mostly only one ground truth.
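The pieces above can be put together in a short sketch. This is a simplified implementation, not a reference one: it uses uniform weights, returns 0 as soon as any n-gram precision is 0 (instead of smoothing), and takes the shortest reference length as r rather than the closest-length reference used in the original paper.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped
    by the maximum count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """BLEU with uniform weights and the standard brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) would blow up; unsmoothed BLEU collapses to 0
    c = len(candidate)
    r = min(len(ref) for ref in references)  # simplification: shortest reference
    bp = 1.0 if c > r else exp(1 - r / c)   # brevity penalty
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

As the bullet above says, a candidate identical to one reference scores 1.0, e.g. `bleu("the cat is on the mat".split(), ["the cat is on the mat".split(), "there is a cat on the mat".split()])`.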