Metric evaluator for Automatic Speech Recognition

Metric-Evaluator is an easy-to-use French 🇫🇷 toolkit for evaluating your own metric for Automatic Speech Recognition (ASR) using the HATS dataset.

🔍 Motivation

Traditional ASR metrics like Word Error Rate (WER) and Character Error Rate (CER) often face criticism within the speech community 👩🏽‍🔬 Recognizing the need for alternative evaluation methods, several researchers are exploring new ones. To genuinely validate and compare these metrics, it is essential to assess how well they correlate with human perception 🧠

🗃️ HATS Dataset

HATS (Human Assessed Transcription Side-by-Side) is a French 🇫🇷 dataset consisting of 1,000 triplets (reference, hypothesis A, hypothesis B) and 7,150 human choices annotated by 143 subjects 🫂 Their task was to select, given a textual reference, which of two erroneous hypotheses is better.
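
Conceptually, each HATS item can be pictured as follows (an illustrative sketch only; the field names and counts are hypothetical and do not reflect the actual format of hats.txt):

# One HATS triplet with its annotations, conceptually (hypothetical fields)
item = {
    "reference": "...",      # correct transcription
    "hypothesis_a": "...",   # erroneous ASR output A
    "hypothesis_b": "...",   # erroneous ASR output B
    "votes_a": 5,            # annotators who preferred A (illustrative count)
    "votes_b": 2,            # annotators who preferred B (illustrative count)
}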

🧑‍🏫 Metric-Evaluator

This toolkit computes the percentage of cases in which a metric agrees with human judgments. Since human judgments vary, some triplets have no clear consensus, and choices may be influenced by randomness 🎲 To keep only consensus cases, use the certitude argument: it is the minimum proportion of annotators who selected the same hypothesis (set it to 1 to keep triplets where 100% of subjects made the same choice, or 0.7 to keep those where at least 70% did). A minimal sketch of this filtering idea follows.
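
For intuition only, here is a minimal sketch of what certitude-based filtering amounts to, assuming each triplet carries vote counts for the two hypotheses (hypothetical field names; the real evaluator does this internally via its certitude argument):

def filter_by_certitude(dataset, certitude):
    # Keep only triplets where the majority hypothesis was chosen by
    # at least `certitude` of the annotators (illustrative sketch).
    kept = []
    for item in dataset:
        votes_a, votes_b = item["votes_a"], item["votes_b"]  # hypothetical fields
        total = votes_a + votes_b
        if total > 0 and max(votes_a, votes_b) / total >= certitude:
            kept.append(item)
    return kept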

🚀 Quickstart

  • Step 1: git clone https://github.com/thibault-roux/metric-evaluator.git
  • Step 2: Write your own custom metric, using the memory argument to avoid reloading a model if needed.
  • Step 3: Evaluate your metric with the certitude you want!
def custom_metric(ref, hyp, memory=0):
    # Compute a score given a textual reference and a hypothesis.
    # Write your own metric here (lower score = better hypothesis).
    return score

...

if __name__ == '__main__':
    print("Reading dataset...")
    dataset = read_dataset("hats.txt")

    cert_X = 1    # keep only triplets where 100% of annotators agree
    cert_Y = 0.7  # keep only triplets where at least 70% agree

    memory = 0  # optional: a preloaded model or resource passed to custom_metric

    # certitude filters out utterances where humans are unsure
    x_score = evaluator(custom_metric, dataset, memory=memory, certitude=cert_X)
    y_score = evaluator(custom_metric, dataset, memory=memory, certitude=cert_Y)

    print("Agreement with humans (100% certitude):", x_score)
    print("Agreement with humans (70% certitude):", y_score)
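
As an illustration, a custom metric can simply wrap an existing distance. The sketch below uses Character Error Rate from the jiwer package (an external dependency assumed here, not part of this toolkit); any lower-is-better score plugs in the same way:

import jiwer

def cer_metric(ref, hyp, memory=0):
    # Character Error Rate as a lower-is-better example metric.
    # The memory argument is unused here, but it could hold a
    # preloaded model for embedding-based metrics such as SemDist.
    return jiwer.cer(ref, hyp)

# then, in the quickstart above:
# score = evaluator(cer_metric, dataset, memory=0, certitude=0.7)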

📊 Results

Agreement with human judgments at different certitude thresholds:

Metric                              100% certitude   70% certitude   Full dataset
Word Error Rate                          63%              53%            49%
Character Error Rate                     77%              64%            60%
BERTScore CamemBERT-large                80%              68%            65%
SemDist CamemBERT-large                  80%              71%            67%
SemDist Sentence CamemBERT-large         90%              78%            73%
Phoneme Error Rate                       80%              69%            64%

To add the results of your metric, contact me at thibault [dot] roux [at] univ-nantes.fr ✉️

📜 Cite

Please cite the related paper if you use this toolkit or the HATS dataset.

@inproceedings{baneras2023hats,
  title={HATS: An Open Dataset Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics},
  author={Ba{\~n}eras-Roux, Thibault and Wottawa, Jane and Rouvier, Mickael and Merlin, Teva and Dufour, Richard},
  booktitle={Text, Speech and Dialogue 2023 - Interspeech Satellite},
  year={2023}
}
