diff --git a/README.md b/README.md
index f284aa8..63445bd 100644
--- a/README.md
+++ b/README.md
@@ -58,6 +58,8 @@ UltraEval-Audio -- the world's first framework that supports both speech understanding and speech generation evaluation
+> For detailed model metrics, see [leaderboard.md](assets/leaderboard.md)
+
@@ -280,7 +282,9 @@ python audio_evals/main.py --dataset --model
Evaluate your own model: [docs/how eval your model.md](docs%2Fhow%20eval%20your%20model.md)
-# Contact us
-If you have any suggestions or questions, you can open an issue or join our Discord group: https://discord.gg/PHGy66QP
+# Acknowledgement
-# Citation
+We referenced the `registry` code in [evals](https://github.com/openai/evals).
+
+# Contact us
+If you have any suggestions or questions, feel free to open an issue or join our Discord group: https://discord.gg/PHGy66QP
diff --git a/README_en.md b/README_en.md
index dff38b6..766b9a1 100644
--- a/README_en.md
+++ b/README_en.md
@@ -60,6 +60,7 @@ UltraEval-Audio -- the world's first open-source framework that simultaneously s
+> For detailed performance metrics of audio LLMs, please refer to [leaderboard.md](assets/leaderboard.md)
@@ -278,8 +279,9 @@ The `--model` parameter allows you to specify which model to use for evaluation.
Evaluate your model: [docs/how eval your model.md](docs%2Fhow%20eval%20your%20model.md)
-# Contact us
-If you have any questions, suggestions, or feature requests related to AudioEvals, we encourage you to submit GitHub Issues to help us collaboratively build an open and transparent UltraEval evaluation community. Alternatively, you can join our Discord group: https://discord.gg/PHGy66QP.
+# Acknowledgement
+We referenced the `registry` code in [evals](https://github.com/openai/evals).
-# Citation
+# Contact us
+If you have any questions, suggestions, or feature requests related to AudioEvals, we encourage you to submit GitHub Issues to help us collaboratively build an open and transparent UltraEval evaluation community. Alternatively, you can join our Discord group: https://discord.gg/PHGy66QP.
diff --git a/assets/leaderboard.md b/assets/leaderboard.md
new file mode 100644
index 0000000..d55c807
--- /dev/null
+++ b/assets/leaderboard.md
@@ -0,0 +1,55 @@
+
+# Benchmarks on the Leaderboard
+
+
+> [AudioArena](https://huggingface.co/spaces/openbmb/AudioArena) is an open platform that lets users
+> compare speech large language models through blind testing and voting, providing a fair
+> and transparent leaderboard for models.
+
+| Dataset                    | Name                       | Task                               | Domain        | Metric    |
+|----------------------------|----------------------------|------------------------------------|---------------|-----------|
+| speech-chatbot-alpaca-eval | speech-chatbot-alpaca-eval | Speech Semantic                    | speech2speech | GPT-score |
+| llama-questions            | llama-questions            | Speech Semantic                    | speech2speech | ACC       |
+| speech-web-questions       | speech-web-questions       | Speech Semantic                    | speech2speech | ACC       |
+| speech-triviaqa            | speech-triviaqa            | Speech Semantic                    | speech2speech | ACC       |
+| tedlium-1                  | tedlium                    | ASR (Automatic Speech Recognition) | speech        | WER       |
+| librispeech-test-clean     | librispeech                | ASR                                | speech        | WER       |
+| librispeech-test-other     | librispeech                | ASR                                | speech        | WER       |
+| librispeech-dev-clean      | librispeech                | ASR                                | speech        | WER       |
+| librispeech-dev-other      | librispeech                | ASR                                | speech        | WER       |
+| fleurs-zh                  | FLEURS                     | ASR                                | speech        | CER       |
+| aisheel1                   | AISHELL-1                  | ASR                                | speech        | CER       |
+| WenetSpeech-test-net       | WenetSpeech                | ASR                                | speech        | CER       |
+| gigaspeech                 | gigaspeech                 | ASR                                | speech        | WER       |
+| covost2-zh2en              | covost2                    | STT (Speech-to-Text Translation)   | speech        | BLEU      |
+| covost2-en2zh              | covost2                    | STT (Speech-to-Text Translation)   | speech        | BLEU      |
+| AudioArena                 | AudioArena                 | SpeechQA                           | speech2speech | Elo score |
+| AudioArena UTMOS           | AudioArena UTMOS           | Speech Acoustic                    | speech2speech | UTMOS     |
+
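+WER and CER are edit-distance error rates between the reference transcript and the model output. As a minimal sketch of how a WER/CER number can be reproduced offline, assuming the `jiwer` package (an illustration choice, not necessarily the scorer UltraEval-Audio wires in through its registry):
+
+```python
+# pip install jiwer
+# jiwer is used here for illustration; the framework's registry may bind a different scorer
+import jiwer
+
+reference = "the quick brown fox"
+hypothesis = "the quick brown box"
+
+# WER: word-level edit distance divided by the number of reference words
+print(jiwer.wer(reference, hypothesis))  # 0.25 -> 1 of 4 words substituted
+
+# CER: character-level edit distance divided by the number of reference
+# characters; used for the Chinese sets (AISHELL-1, Fleurs-zh, WenetSpeech)
+print(jiwer.cer(reference, hypothesis))  # ~0.05
+```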
+
+# Audio Understanding Model Performance
+| Metric | Dataset-Split | GPT-4o-Realtime | Gemini-1.5-Pro | Gemini-1.5-Flash | Qwen2-Audio-Instruction | Qwen-Audio-Chat | MiniCPM-o 2.6 |
+|:-------|:-----------------------|----------------:|---------------:|-----------------:|------------------------:|----------------:|--------------:|
+| CER↓   | AISHELL-1              |             7.3 |            4.5 |              9.0 |                     2.6 |           227.6 |           1.6 |
+| CER↓   | Fleurs-zh              |             5.4 |            5.9 |             85.9 |                     6.9 |            80.2 |           4.4 |
+| CER↓   | WenetSpeech-test-net   |            28.9 |           14.3 |            279.9 |                    10.3 |           227.8 |           6.9 |
+| WER↓   | librispeech-test-clean |             2.6 |            2.9 |             21.9 |                     3.1 |            54.0 |           1.7 |
+| WER↓   | librispeech-test-other |             5.5 |            4.9 |             16.3 |                     5.7 |            62.3 |           4.4 |
+| WER↓   | librispeech-dev-clean  |             2.3 |            2.6 |              5.9 |                     2.9 |            53.9 |           1.6 |
+| WER↓   | librispeech-dev-other  |             5.6 |            4.4 |              7.2 |                     5.5 |            61.9 |           3.4 |
+| WER↓   | Gigaspeech             |            12.9 |           10.6 |             24.7 |                     9.7 |            62.0 |           8.7 |
+| WER↓   | Tedlium                |             4.8 |            3.0 |              6.9 |                     5.9 |            40.5 |           3.0 |
+| BLEU↑  | covost2-en2zh          |            37.1 |           47.3 |             33.4 |                    39.5 |            15.7 |          48.2 |
+| BLEU↑  | covost2-zh2en          |            15.7 |           22.6 |              8.2 |                    22.9 |            10.0 |          27.2 |
+
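+The BLEU rows above are corpus-level scores on a 0-100 scale. A hedged sketch using `sacrebleu` for illustration (the framework's own BLEU implementation may differ):
+
+```python
+# pip install sacrebleu
+import sacrebleu
+
+hyps = ["the cat sat on the mat"]
+refs = [["the cat is sitting on the mat"]]  # refs[0] holds one reference per hypothesis
+
+# corpus-level BLEU; for en2zh output, pass tokenize="zh" so Chinese is
+# scored per character rather than by whitespace tokens
+bleu = sacrebleu.corpus_bleu(hyps, refs)
+print(round(bleu.score, 1))  # 0-100 scale, as reported in the tables
+```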
+
+# Speech Generation Model Performance
+
+| Metric | Dataset | GPT-4o-Realtime | GLM-4-Voice | Mini-Omni | Llama-Omni | Moshi | MiniCPM-o 2.6 |
+|:------------------|:---------------------|------------------:|--------------:|------------:|-------------:|--------:|----------------:|
+| ACC↑               | Llama Questions      |              71.7 |          50.0 |        22.0 |         45.3 |    43.7 |            61.0 |
+| ACC↑               | Speech Web Questions |              51.6 |          32.0 |        12.8 |         22.9 |    23.8 |            40.0 |
+| ACC↑               | Speech TriviaQA      |              69.7 |          36.4 |         6.9 |         10.7 |    16.7 |            40.2 |
+| G-Eval (10-point)↑ | Speech AlpacaEval    |                74 |            51 |          25 |           39 |      24 |              51 |
+| UTMOS↑             | AudioArena UTMOS     |               4.2 |           4.1 |         3.2 |          2.8 |     3.4 |             4.2 |
+| Elo score↑         | AudioArena           |              1200 |          1035 |         897 |          875 |     865 |            1131 |
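+
+The AudioArena Elo score is derived from pairwise blind votes rather than a fixed test set. A standard Elo update rule, sketched here with an assumed K-factor of 32 (AudioArena's exact rating parameters are not documented in this repo):
+
+```python
+def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
+    """Update two ratings after one head-to-head vote.
+    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie.
+    k = 32 is an assumed K-factor, not AudioArena's documented setting."""
+    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
+    r_a += k * (score_a - expected_a)
+    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
+    return r_a, r_b
+
+# example: a 1200-rated model beats an 1131-rated one and gains ~13 points
+print(elo_update(1200.0, 1131.0, 1.0))
+```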