Skip to content

Commit

Permalink
add detail performance
Browse files Browse the repository at this point in the history
  • Loading branch information
UltraEval committed Jan 14, 2025
1 parent 528ab5a commit 73a1065
Show file tree
Hide file tree
Showing 3 changed files with 67 additions and 6 deletions.
10 changes: 7 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,8 @@ UltraEval-Audio——全球首个同时支持语音理解和语音生成评估
</div>
</div>

> 详细模型指标见[leaderboard.md](assets/leaderboard.md)

<table>
<tr>
Expand Down Expand Up @@ -280,7 +282,9 @@ python audio_evals/main.py --dataset <dataset_name> --model <model_name>

评测你自己的模型 [docs/how eval your model.md](docs%2Fhow%20eval%20your%20model.md)

# Contact us
如果你有任何建议或疑问可以提issue或者加入discord群组: https://discord.gg/PHGy66QP
# 致谢

# Citation
我们参考了[evals](https://github.com/openai/evals)`registry`代码

# 联系我们
如果你有任何建议或疑问可以提issue或者加入discord群组: https://discord.gg/PHGy66QP
8 changes: 5 additions & 3 deletions README_en.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@ UltraEval-Audio -- the world's first open-source framework that simultaneously s
</div>
</div>

> For detailed performance metrics of audio LLMs, please refer to [leaderboard.md](assets/leaderboard.md)
<table>
<tr>
Expand Down Expand Up @@ -278,8 +279,9 @@ The `--model` parameter allows you to specify which model to use for evaluation.
eval your model: [docs/how eval your model.md](docs%2Fhow%20eval%20your%20model.md)

# Contact us
If you have any questions, suggestions, or feature requests related to AudioEvals, we encourage you to submit GitHub Issues to help us collaboratively build an open and transparent UltraEval evaluation community. Alternatively, you can join our Discord group: https://discord.gg/PHGy66QP.
# Acknowledgement

We refer to `registry` code in [evals](https://github.com/openai/evals)

# Citation
# Contact us
If you have any questions, suggestions, or feature requests related to AudioEvals, we encourage you to submit GitHub Issues to help us collaboratively build an open and transparent UltraEval evaluation community. Alternatively, you can join our Discord group: https://discord.gg/PHGy66QP.
55 changes: 55 additions & 0 deletions assets/leaderboard.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@

# Benchmarks in Leaderboard


> [AudioArena](https://huggingface.co/spaces/openbmb/AudioArena) an open platform that enables users
> to compare the performance of speech large language models through blind testing and voting, providing a fair
> and transparent leaderboard for model
| Dataset | Name | Task | Domain | metric |
|----------------------------|----------------------------|-----------------------------------|---------------|-----------|
| speech-chatbot-alpaca-eval | speech-chatbot-alpaca-eval | Speech Semantic | speech2speech | GPT-score |
| llama-questions | llama-questions | Speech Semantic | speech2speech | acc |
| speech-web-questions | speech-web-questions | Speech Semantic | speech2speech | acc |
| speech-triviaqa | speech-triviaqa | Speech Semantic | speech2speech | acc |
| tedlium-1 | tedlium | ASR(Automatic Speech Recognition) | speech | wer |
| librispeech-test-clean | librispeech | ASR | speech | wer |
| librispeech-test-other | librispeech | ASR | speech | wer |
| librispeech-dev-clean | librispeech | ASR | speech | wer |
| librispeech-dev-other | librispeech | ASR | speech | wer |
| fleurs-zh | FLEURS | ASR | speech | cer |
| aisheel1 | AISHELL-1 | ASR | speech | cer |
| WenetSpeech-test-net | WenetSpeech | ASR | speech | cer |
| gigaspeech | gigaspeech | ASR | speech | wer |
| covost2-zh2en | covost2 | STT(Speech Text Translation) | speech | BLEU |
| covost2-en2zh | covost2 | STT(Speech Text Translation) | speech | BLEU |
| AudioArena | AudioArena | SpeechQA | speech2speech | elo score |
| AudioArena UTMOS | AudioArena UTMOS | Speech Acoustic | speech2speech | UTMOS |


# Audio Understanding Model Performance
| Metric | Dataset-Split | GPT-4o-Realtime | Gemini-1.5-Pro | Gemini-1.5-Flash | Qwen2-Audio-Instruction | Qwen-Audio-Chat | MiniCPM-o 2.6 |
|:-------|:-----------------------|----------------:|---------------:|-----------------:|------------------------:|----------------:|--------------:|
| CER↓ | AIshell-1 | 7.3 | 4.5 | 9 | 2.6 | 227.6 | 1.6 |
| CER↓ | Fleurs-zh | 5.4 | 5.9 | 85.9 | 6.9 | 80.2 | 4.4 |
| CER↓ | WenetSpeech-test-net) | 28.9 | 14.3 | 279.9 | 10.3 | 227.84 | 6.9 |
| WER↓ | librispeech-test-clean | 2.6 | 2.9 | 21.9 | 3.1 | 54 | 1.7 |
| WER↓ | librispeech-test-other | 5.5 | 4.9 | 16.3 | 5.7 | 62.3 | 4.4 |
| WER↓ | librispeech-dev-clean | 2.3 | 2.6 | 5.9 | 2.9 | 53.9 | 1.6 |
| WER↓ | librispeech-dev-other | 5.6 | 4.4 | 7.2 | 5.5 | 61.9 | 3.4 |
| WER↓ | Gigaspeech | 12.9 | 10.6 | 24.7 | 9.7 | 62 | 8.7 |
| WER↓ | Tedlium | 4.8 | 3 | 6.9 | 5.9 | 40.5 | 3 |
| BLEU↑ | covost2-en2zh | 37.1 | 47.3 | 33.4 | 39.5 | 15.7 | 48.2 |
| BLEU↑ | covost2-zh2en | 15.7 | 22.6 | 8.2 | 22.9 | 10 | 27.2 |


# Speech Generation Model Performance

| Metric | Dataset | GPT-4o-Realtime | GLM-4-Voice | Mini-Omni | Llama-Omni | Moshi | MiniCPM-o 2.6 |
|:------------------|:---------------------|------------------:|--------------:|------------:|-------------:|--------:|----------------:|
| ACC↑ | LlamaQuestion | 71.7 | 50 | 22 | 45.3 | 43.7 | 61 |
| ACC↑ | Speech Web Questions | 51.6 | 32 | 12.8 | 22.9 | 23.8 | 40 |
| ACC↑ | Speech TriviaQA | 69.7 | 36.4 | 6.9 | 10.7 | 16.7 | 40.2 |
| G-Eval(10 point)↑ | Speech AlpacaEval | 74 | 51 | 25 | 39 | 24 | 51 |
| UTMOS↑ | AudioArena UTMOS | 4.2 | 4.1 | 3.2 | 2.8 | 3.4 | 4.2 |
| ELO score↑ | AudioArena | 1200 | 1035 | 897 | 875 | 865 | 1131 |

0 comments on commit 73a1065

Please sign in to comment.