# feat: preparing for pyannote.audio 3.0.0 (#1470)
Showing 4 changed files with 67 additions and 78 deletions.
@@ -1,30 +1,37 @@
-> [!IMPORTANT]
-> I propose (paid) scientific [consulting services](https://herve.niderb.fr/consulting.html) to companies willing to make the most of their data and open-source speech processing toolkits (and `pyannote` in particular).
+Using `pyannote.audio` open-source toolkit in production?
+Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).
 
-# Speaker diarization with `pyannote.audio`
+# `pyannote.audio` speaker diarization toolkit
 
-`pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](https://pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
+`pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](https://pytorch.org) machine learning framework, it comes with state-of-the-art [pretrained models and pipelines](https://hf.co/pyannote) that can be further fine-tuned to your own data for even better performance.
 
 <p align="center">
  <a href="https://www.youtube.com/watch?v=37R_R82lfwA"><img src="https://img.youtube.com/vi/37R_R82lfwA/0.jpg"></a>
 </p>
 
-## TL;DR [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/intro.ipynb)
+## TL;DR
+
+1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
+2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
+3. Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
+4. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).
 
 ```python
-# 1. visit hf.co/pyannote/speaker-diarization and hf.co/pyannote/segmentation and accept user conditions (only if requested)
-# 2. visit hf.co/settings/tokens to create an access token (only if you had to go through 1.)
-# 3. instantiate pretrained speaker diarization pipeline
 from pyannote.audio import Pipeline
-pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
-                                    use_auth_token="ACCESS_TOKEN_GOES_HERE")
+pipeline = Pipeline.from_pretrained(
+    "pyannote/speaker-diarization-3.0",
+    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
+
+# send pipeline to GPU (when available)
+import torch
+pipeline.to(torch.device("cuda"))
 
-# 4. apply pretrained pipeline
+# apply pretrained pipeline
 diarization = pipeline("audio.wav")
 
-# 5. print the result
+# print the result
 for turn, _, speaker in diarization.itertracks(yield_label=True):
     print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
 # start=0.2s stop=1.5s speaker_0
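Beyond the TL;DR snippet, two follow-ups tend to come up immediately in practice: constraining the number of speakers and saving the result. The sketch below is based on the `pyannote/speaker-diarization-3.0` model card rather than on this diff; the `num_speakers`/`min_speakers`/`max_speakers` hints and the RTTM dump are assumed to work as documented there.

```python
from pyannote.audio import Pipeline

# same pretrained pipeline as in the TL;DR above
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# hint the expected number of speakers when it is known in advance...
diarization = pipeline("audio.wav", num_speakers=2)

# ...or only bound it
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)

# dump the result to disk in the standard RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```

Constraining the speaker count usually helps on recordings where it is known a priori (e.g. phone calls).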
@@ -39,16 +46,7 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
 - :exploding_head: state-of-the-art performance (see [Benchmark](#benchmark))
 - :snake: Python-first API
 - :zap: multi-GPU training with [pytorch-lightning](https://pytorchlightning.ai/)
-- :control_knobs: data augmentation with [torch-audiomentations](https://github.com/asteroid-team/torch-audiomentations)
-
-## Installation
-
-Only Python 3.8+ is supported.
-
-```bash
-# install from develop branch
-pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
-```
 
 ## Documentation
 
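The retained feature bullet advertises multi-GPU training with pytorch-lightning. The following fine-tuning sketch follows the pattern of the training tutorials linked in the Documentation section below, not this diff; the protocol name `MyDatabase.SpeakerDiarization.MyProtocol` is a placeholder, and the exact task arguments are assumptions.

```python
import pytorch_lightning as pl
from pyannote.audio import Model
from pyannote.audio.tasks import Segmentation
from pyannote.database import FileFinder, get_protocol

# hypothetical protocol name; replace with your own pyannote.database protocol
protocol = get_protocol(
    "MyDatabase.SpeakerDiarization.MyProtocol",
    preprocessors={"audio": FileFinder()})

# start from the pretrained segmentation model and fine-tune it on the protocol
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
model.task = Segmentation(protocol)

# pytorch-lightning handles (multi-)GPU training
trainer = pl.Trainer(accelerator="gpu", devices=2, max_epochs=10)
trainer.fit(model)
```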
@@ -72,53 +70,50 @@ pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/de
   - 2022-12-02 > ["How I reached 1st place at Ego4D 2022, 1st place at Albayzin 2022, and 6th place at VoxSRC 2022 speaker diarization challenges"](tutorials/adapting_pretrained_pipeline.ipynb)
   - 2022-10-23 > ["One speaker segmentation model to rule them all"](https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)
   - 2021-08-05 > ["Streaming voice activity detection with pyannote.audio"](https://herve.niderb.fr/fastpages/2021/08/05/Streaming-voice-activity-detection-with-pyannote.html)
-- Miscellaneous
-  - [Training with `pyannote-audio-train` command line tool](tutorials/training_with_cli.md)
-  - [Speaker verification](tutorials/speaker_verification.ipynb)
-  - Visualization and debugging
 - Videos
   - [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
   - [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
   - [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min
 
 ## Benchmark
 
-Out of the box, `pyannote.audio` default speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization) is expected to be much better (and faster) in v2.x than in v1.1. Those numbers are diarization error rates (in %):
-
-| Dataset \ Version      | v1.1 | v2.0 | v2.1.1 (finetuned) |
-| ---------------------- | ---- | ---- | ------------------ |
-| AISHELL-4              | -    | 14.6 | 14.1 (14.5)        |
-| AliMeeting (channel 1) | -    | -    | 27.4 (23.8)        |
-| AMI (IHM)              | 29.7 | 18.2 | 18.9 (18.5)        |
-| AMI (SDM)              | -    | 29.0 | 27.1 (22.2)        |
-| CALLHOME (part2)       | -    | 30.2 | 32.4 (29.3)        |
-| DIHARD 3 (full)        | 29.2 | 21.0 | 26.9 (21.9)        |
-| VoxConverse (v0.3)     | 21.5 | 12.6 | 11.2 (10.7)        |
-| REPERE (phase2)        | -    | 12.6 | 8.2 (8.3)          |
-| This American Life     | -    | -    | 20.8 (15.2)        |
+Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.0) v3.0 is expected to be much better (and faster) than v2.x.
+Those numbers are diarization error rates (in %):
+
+| Dataset \ Version      | v1.1 | v2.0 | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.0](https://hf.co/pyannote/speaker-diarization-3.0) | <a href="mailto:herve-at-niderb-dot-fr?subject=Premium pyannote.audio pipeline&body=Looks like I got your attention! Drop me an email for more details. Hervé.">Premium</a> |
+| ---------------------- | ---- | ---- | ------ | ------ | --------- |
+| AISHELL-4              | -    | 14.6 | 14.1   | 12.3   | 12.3      |
+| AliMeeting (channel 1) | -    | -    | 27.4   | 24.3   | 19.4      |
+| AMI (IHM)              | 29.7 | 18.2 | 18.9   | 19.0   | 16.7      |
+| AMI (SDM)              | -    | 29.0 | 27.1   | 22.2   | 20.1      |
+| AVA-AVD                | -    | -    | -      | 49.1   | 42.7      |
+| DIHARD 3 (full)        | 29.2 | 21.0 | 26.9   | 21.7   | 17.0      |
+| MSDWild                | -    | -    | -      | 24.6   | 20.4      |
+| REPERE (phase2)        | -    | 12.6 | 8.2    | 7.8    | 7.8       |
+| VoxConverse (v0.3)     | 21.5 | 12.6 | 11.2   | 11.3   | 9.5       |
 
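The diarization error rates above combine false alarm, missed detection, and speaker confusion. A minimal sketch of computing DER with `pyannote.metrics` (assumed installed separately via `pip install pyannote.metrics`; the RTTM file names and the `"audio"` URI are hypothetical):

```python
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# hypothetical files: a ground-truth annotation and the pipeline output,
# both in RTTM format and describing a recording whose URI is "audio"
reference = load_rttm("reference.rttm")["audio"]   # pyannote.core.Annotation
hypothesis = load_rttm("audio.rttm")["audio"]

# DER = (false alarm + missed detection + speaker confusion) / total speech
metric = DiarizationErrorRate()
der = metric(reference, hypothesis)
print(f"DER = {100 * der:.1f}%")
```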
 ## Citations
 
 If you use `pyannote.audio`, please use the following citations:
 
 ```bibtex
-@inproceedings{Bredin2020,
-  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
-  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
-  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
-  Year = {2020},
+@inproceedings{Plaquet23,
+  author={Alexis Plaquet and Hervé Bredin},
+  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+  year=2023,
+  booktitle={Proc. INTERSPEECH 2023},
 }
 ```
 
 ```bibtex
-@inproceedings{Bredin2021,
-  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
-  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
-  Booktitle = {Proc. Interspeech 2021},
-  Year = {2021},
+@inproceedings{Bredin23,
+  author={Hervé Bredin},
+  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+  year=2023,
+  booktitle={Proc. INTERSPEECH 2023},
 }
 ```
 
 ## Support
 
 For commercial enquiries and scientific consulting, please contact [me](mailto:[email protected]).
 
 ## Development
 
 The commands below will set up pre-commit hooks and packages needed for developing the `pyannote.audio` library.
@@ -1 +1 @@
-2.1.1
+3.0.0
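The final hunk bumps the package version from 2.1.1 to 3.0.0. A quick, stdlib-only way to check which version ends up installed (nothing here is specific to this commit):

```python
from importlib.metadata import version

# prints the installed pyannote.audio version, e.g. "3.0.0" once this release ships
print(version("pyannote.audio"))
```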