Skip to content

Latest commit

 

History

History
240 lines (176 loc) · 8.42 KB

README.md

File metadata and controls

240 lines (176 loc) · 8.42 KB

Speaker Diarization Benchmark

Made in Vancouver, Canada by Picovoice

This repo is a minimalist and extensible framework for benchmarking different speaker diarization engines.

Table of Contents

Data

VoxConverse is a well-known dataset in the speaker diarization field, showcasing speakers conversing in multiple languages. In this benchmark, we utilize cloud-based Speech-to-Text engines equipped with speaker diarization capabilities. Hence, for benchmarking purposes, we specifically employ the English subset of the dataset's test section.

Setup

  1. Clone the VoxConverse repository. This repository contains only the labels in the form of .rttm files.
  2. Download the test set from the links provided in the README.md file of the cloned repository and extract the downloaded files.

Metrics

Diarization Error Rate (DER)

The Diarization Error Rate (DER) is the most common metric for evaluating speaker diarization systems. DER is calculated by summing the time duration of three distinct errors: speaker confusion, false alarms, and missed detections. This total duration is then divided by the overall time span.

Jaccard Error Rate (JER)

The Jaccard Error Rate (JER) is a newly developed metric for evaluating speaker diarization, specifically designed for DIHARD II. It is based on the Jaccard similarity index, which measures the similarity between two sets of segments. In short, JER assigns equal weight to each speaker's contribution, regardless of their speech duration. For a more in-depth understanding, refer to the second DIHARD's paper.

Total Memory Usage

This metric provides insight into the memory consumption of the diarization engine during its processing of audio files. It presents the total memory utilized, measured in gigabytes (GB).

Core-Hour

The Core-Hour metric is used to evaluate the computational efficiency of the diarization engine, indicating the number of hours required to process one hour of audio on a single CPU core.

Note

Total Memory Usage and Core-Hour metrics are not applicable to cloud-based engines.

Engines

Usage

This benchmark has been developed and tested on Ubuntu 20.04 using Python 3.8.

  1. Set up your dataset as described in the Data section.
  2. Install the requirements:
pip3 install -r requirements.txt
  1. In the commands that follow, replace ${DATASET} with a supported dataset, ${DATA_FOLDER} with the path to the dataset folder, and ${LABEL_FOLDER} with the path to the label folder. For further details, refer to the Data. Replace ${TYPE} with ACCURACY, CPU, or MEMORY for accuracy, CPU benchmark, and memory benchmark, respectively.
python3 benchmark.py \
--type ${TYPE} \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine ${ENGINE} \
...
  1. For the memory benchmark, you should also run mem_monitor.py in a separate terminal window. This script will monitor the memory usage of the diarization engine.
python3 mem_monitor.py --engine ${ENGINE}

when the benchmark is complete, press Ctrl + C to stop the memory monitor.

Additionally, specify the desired engine using the --engine flag. For instructions on each engine and the required flags, consult the section below.

Amazon Transcribe Instructions

Create an S3 bucket. Then, substitute ${AWS_PROFILE} with your AWS profile name and ${AWS_S3_BUCKET_NAME} with the created S3 bucket name.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine AWS_TRANSCRIBE \
--aws-profile ${AWS_PROFILE} \
--aws-s3-bucket-name ${AWS_S3_BUCKET_NAME}

Azure Speech-to-Text Instructions

A client library for the Speech to Text REST API should be generated, as outlined in the documentation.

Then, create an Azure storage account and container, and replace ${AZURE_STORAGE_ACCOUNT_NAME} with your Azure storage account name, ${AZURE_STORAGE_ACCOUNT_KEY} with your Azure storage account key, and ${AZURE_STORAGE_CONTAINER_NAME} with your Azure storage container name.

Finally, replace ${AZURE_SUBSCRIPTION_KEY} with your Azure subscription key and ${AZURE_REGION} with your Azure region.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine AZURE_SPEECH_TO_TEXT \
--azure-storage-account-name ${AZURE_STORAGE_ACCOUNT_NAME} \
--azure-storage-account-key ${AZURE_STORAGE_ACCOUNT_KEY} \
--azure-storage-container-name ${AZURE_STORAGE_CONTAINER_NAME} \
--azure-subscription-key ${AZURE_SUBSCRIPTION_KEY} \
--azure-region ${AZURE_REGION}

Google Speech-to-Text Instructions

Create a Google cloud storage bucket. Then, replace ${GCP_CREDENTIALS} with the path to your GCP credentials file (.json) and ${GCP_BUCKET_NAME} with your GCP bucket name.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine GOOGLE_SPEECH_TO_TEXT \
--gcp-credentials ${GCP_CREDENTIALS} \
--gcp-bucket-name ${GCP_BUCKET_NAME} \

To utilize the enhanced model, replace the GOOGLE_SPEECH_TO_TEXT engine with GOOGLE_SPEECH_TO_TEXT_ENHANCED.

Picovoice Falcon Instructions

Replace ${PICOVOICE_ACCESS_KEY} with AccessKey obtained from Picovoice Console.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine PICOVOICE_FALCON \
--picovoice-access-key ${PICOVOICE_ACCESS_KEY}

pyannote.audio Instructions

Obtain your authentication token to download pretrained models by visiting their Hugging Face page. Then replace ${PYANNOTE_AUTH_TOKEN} with the authentication token.

python3 benchmark.py \
--dataset ${DATASET} \
--data-folder ${DATA_FOLDER} \
--label-folder ${LABEL_FOLDER} \
--engine PYANNOTE \
--pyannote-auth-token ${PYANNOTE_AUTH_TOKEN}

Results

Measurement is carried on an Ubuntu 20.04 machine with AMD CPU (AMD Ryzen 7 5700X (16) @ 3.400G), 64 GB of RAM, and NVMe storage.

Diarization Error Rate (DER)

Engine VoxConverse (English)
Amazon 11.1%
Azure 15.7%
Google 50.2%
Google - Enhanced 24.0%
Picovoice Falcon 10.3%
pyannote.audio 9.0%

Jaccard Error Rate (JER)

Engine VoxConverse (English)
Amazon 29.8%
Azure 30.1%
Google 83.4%
Google - Enhanced 57.6%
Picovoice Falcon 19.9%
pyannote.audio 27.4%

Total Memory Usage

To obtain these results, we ran the benchmark across the entire VoxConverse dataset and recorded the maximum memory usage during that period. As conversations involve varying lengths and numbers of speakers, this method provides us with a reliable estimation of the memory usage of each engine.

Engine Memory Usage (GB)
pyannote.audio 1.5
Picovoice Falcon 0.1

Core-Hour

Engine Core-Hour
pyannote.audio 442
Picovoice Falcon 4