
BLEU

Idea

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. (Wikipedia) Since a single input can have multiple good translations, the question is how to score a candidate without a unique ground truth. BLEU provides a metric that scores whether the generated words occur in the references.

Improvement

Concept

In general, the BLEU metric counts the words in the generated translation and compares them to their occurrences in the references.

  • Generated words = machine output
  • References = human labels

Precision:

  • P = (number of generated words that also appear in the references) / (total number of generated words)

Modified precision:

  • plain precision can be gamed by repeating a single matching word, so each word's count is clipped at the maximum number of times that word occurs in any single reference
  • p = (sum of clipped counts) / (total number of generated words)
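The clipping step can be sketched in Python. This is a minimal illustration under my own naming (`modified_precision` and the example sentences are not from the source):

```python
from collections import Counter

def modified_precision(candidate, references):
    """Unigram modified precision: each candidate word count is clipped
    by the maximum count of that word in any single reference."""
    cand_counts = Counter(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / max(len(candidate), 1)

# Degenerate candidate: plain precision would be 7/7 = 1.0,
# clipping reduces it to 2/7 ("the" occurs at most twice in one reference).
cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(),
        "there is a cat on the mat".split()]
print(modified_precision(cand, refs))  # 2/7 ≈ 0.2857
```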

[Figure: BLEU example]

N-gram precision:

  • the same clipped precision, computed over n-grams (sequences of n consecutive words) instead of single words
  • if the machine translation exactly equals one of the references -> P = 1.0
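Generalizing the clipped count to n-grams can be sketched as follows (again an illustration; helper names are my own):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_ngram_precision(candidate, references, n):
    """Clipped n-gram precision: candidate n-gram counts are clipped
    by the maximum count in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    return sum(min(c, max_ref[g]) for g, c in cand_counts.items()) / total

# Candidate identical to the single reference -> precision 1.0
cand = "the cat is on the mat".split()
print(modified_ngram_precision(cand, [cand], 2))  # 1.0
```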

BLEU Score:

  • BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n )
  • p_n = modified n-gram precision
  • w_n = weight per n-gram order, typically uniform: w_n = 1/N with N = 4
    • compute p_n for n = 1, ..., N

where

  • BP: brevity penalty, penalizes the score for short sentences
    • short sentences tend to have good precision scores, because most of their few words occur in the references
  • BP = 1 if c > r, otherwise e^(1 − r/c), with c = candidate length and r = effective reference length
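Putting the pieces together, a sentence-level BLEU sketch might look like this (a self-contained illustration under the assumptions above, not a reference implementation; production code would add smoothing and tokenization, e.g. via NLTK or SacreBLEU):

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def _modified_precision(candidate, references, n):
    cand = Counter(_ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(_ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(c, max_ref[g]) for g, c in cand.items()) / total

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: brevity penalty times the geometric mean
    of the modified n-gram precisions for n = 1..max_n (weights 1/N)."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        p = _modified_precision(candidate, references, n)
        if p == 0.0:
            return 0.0  # one zero precision collapses the geometric mean
        log_sum += math.log(p) / max_n
    c = len(candidate)
    # effective reference length: the reference closest in length to c
    r = min((len(ref) for ref in references), key=lambda l: (abs(l - c), l))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_sum)

ref = "the cat is on the mat".split()
print(bleu(ref, [ref]))  # identical to the reference -> 1.0
# A very short candidate has perfect precisions but is penalized by BP:
print(bleu("the cat".split(), [ref], max_n=2))  # exp(1 - 6/2) ≈ 0.135
```

The brevity penalty is what prevents the two-word candidate from scoring 1.0 despite all its n-grams appearing in the reference.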

The BLEU metric is useful for NMT and image captioning, where many translations or captions can be equally valid, but it is poorly suited for speech recognition, where there is usually only one ground truth.

Evaluation

Production

References

  1. BLEU Wikipedia