
optimize audio generation latency for real-time WebRTC applications #858

biraj-outspeed opened this issue Feb 25, 2025 · 7 comments
@biraj-outspeed

i went through this issue but couldn't find a clear answer.

are there ways to bring down the latency when generate_audio=True? i'm building a real-time speech-to-speech app with webrtc, and the 0.6-1.0 s generation latency with generate_audio=True is too slow for my needs. every response contains roughly 12800 audio samples (533.33 ms at 24 kHz), so if the generation latency is longer than that, playback jitters.

any tips to make it faster? maybe a different tts model or some parameter tweaks? or are there bottlenecks in the implementation i should know about?

really need to get this working with lower latency for my use case.
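
for context, here's roughly the constraint i mean as code (chunk size and sample rate from above; the generation time is just an example in the 0.6-1.0 s range i observe, not a precise measurement):

```python
# real-time constraint: each audio chunk must be generated faster than it plays back
SAMPLE_RATE_HZ = 24_000      # model output sample rate
CHUNK_SAMPLES = 12_800       # samples per response chunk i'm seeing

chunk_duration_s = CHUNK_SAMPLES / SAMPLE_RATE_HZ   # ~0.533 s of audio per chunk
generation_latency_s = 0.8                          # example value from the 0.6-1.0 s i observe

if generation_latency_s > chunk_duration_s:
    # the playback buffer drains faster than it refills -> audible jitter
    print(f"{generation_latency_s:.2f}s to generate {chunk_duration_s:.2f}s of audio -> jitter")
```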

@bokesyo
Collaborator

bokesyo commented Feb 26, 2025

Hello biraj! Thank you for your feedback, which is very important to us.

Audio generation latency involves two separate concepts:

  • Initial latency: the delay between the end of the user's question and the first audio output chunk. It is around 1.5 s.
  • Realtime factor: once audio begins to generate, how long it takes to generate 1 s of output audio. It is around 0.6 s on an A100 or 4090 (equivalently, about 0.3 s to generate 0.5 s of output audio). As long as generating 1 s of audio takes less than 1 s, audio playback never has to pause.

So, do you mean that every audio output chunk (533.33 ms of audio) takes 0.6-1.0 s to generate? On a 4090 or A100 it should only take about 0.3 s to generate 0.5 s of output audio. Could you tell me which device you are using, so I can investigate what is happening? Thank you!
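
If it helps, you could measure both numbers with a small timing harness like the sketch below. Note that stream_audio_chunks is only a placeholder for however you iterate the streamed audio in your application, not a real API of ours:

```python
import time

def measure_latency(stream_audio_chunks, sample_rate_hz=24_000):
    # `stream_audio_chunks`: any iterable yielding arrays of audio samples as the
    # model streams them -- substitute your actual streaming call here.
    start = time.perf_counter()
    first_chunk_latency = None
    audio_seconds = 0.0

    for chunk in stream_audio_chunks:
        if first_chunk_latency is None:
            first_chunk_latency = time.perf_counter() - start  # initial latency
        audio_seconds += len(chunk) / sample_rate_hz            # audio generated so far

    wall_seconds = time.perf_counter() - start
    if audio_seconds > 0:
        # realtime factor: wall-clock seconds spent per second of generated audio
        print(f"initial latency: {first_chunk_latency:.2f}s, "
              f"realtime factor: {wall_seconds / audio_seconds:.2f}")
```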

@tc-mb
Collaborator

tc-mb commented Feb 26, 2025

Can I ask how you are using it?
Our understanding is that the output of omni only needs to be generated faster than it is spoken to meet real-time needs: as long as generating one second of speech takes less than one second (including overheads such as the web layer), audio playback stays continuous.
Please share more data or details so that we can help you solve the problem.

@janak2

janak2 commented Feb 26, 2025

@tc-mb @bokesyo Yes, we see it takes 0.3s to generate 0.5s of audio on H100 (not A100).

Any ideas on how we can reduce the latency between the end of user input and the first audio output?

@tc-mb
Collaborator

tc-mb commented Feb 26, 2025

The latency you are asking about can be reduced through acceleration, but our current latency is about 2.5-3 s, which should be similar to other models.
Why do you need a faster response, and how fast does it need to be?

@janak2

janak2 commented Feb 26, 2025

@tc-mb GPT-4o voice has a TTFB of 300ms and Moshi has a TTFB of 600ms. 2-3s for TTFB is too high for a natural conversation.
Can you please explain what you mean by acceleration?

@tc-mb
Collaborator

tc-mb commented Feb 27, 2025

We have actually compared similar products, and they are mostly in the 2-3 second range. Are the numbers you quote from your own tests of those products?
In our settings, the window used to decide that the user has finished speaking is 0.8-1 second; anything much shorter would easily be confused with a normal pause in the user's speech.
Our open-source release is meant to stay close to the product, so that users can deploy and use it directly without relying on additional packaging.
If you want to combine individual modules with other models yourself, you can refer to our code and take it apart as needed.

@bokesyo
Collaborator

bokesyo commented Feb 27, 2025

> @tc-mb @bokesyo Yes, we see it takes 0.3s to generate 0.5s of audio on H100 (not A100).
>
> Any ideas on how we can reduce the latency between the end of user input and the first audio output?

@janak2 Yes, here are some suggestions:

  1. Check the datatype: running the TTS in bf16 may be faster. Did you call tts.float()? You can wrap the model with torch.autocast to see if it speeds things up (see the sketch after this list).
  2. Check this issue: 延迟分析相关问题 ("Latency analysis questions") #845 — it is in Chinese, you can use a translator. The author there made an improvement that halves the first-response time: because we implemented a merge between two audio chunks, the first audio chunk is not returned until the second chunk has finished, which is not good practice; they changed the logic so the initial response time is cut in half.
  3. We use a VAD module with a 500 ms threshold. You can reduce the VAD threshold to 200 ms to lower the first-response time further; the trade-off is that the model may sometimes respond before the user has finished asking.
  4. Have the TTS decode fewer audio tokens for the first audio chunk. Currently we use 25 tokens (~500 ms of audio), but you can reduce this to 12; the first chunk may have somewhat lower quality but arrives faster.
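
For point 1, here is a minimal sketch of the bf16 autocast idea. The small nn.Sequential is only a dummy stand-in for the TTS model (not our real module); substitute the actual model object and call from your pipeline:

```python
import torch
import torch.nn as nn

# Dummy stand-in for the TTS model -- replace with the real model object.
tts = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256)).cuda()
tokens = torch.randn(1, 25, 256, device="cuda")  # stand-in for ~25 audio tokens

with torch.inference_mode():
    # autocast runs the matmuls in bfloat16, which is usually faster than fp32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        audio = tts(tokens)

print(audio.dtype)  # torch.bfloat16 for outputs of ops run under autocast
```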
