coqui engine is unusable #237
"Coqui engine is unusable" sounds a bit harsh. Your hardware should be more than enough to synthesize a sentence in a few seconds. My guess? You've installed CUDA but didn’t configure PyTorch to actually use it. Check the instructions here: Run this and let me know what it says: import torch
print("CUDA is available!" if torch.cuda.is_available() else "CUDA is not available.") If CUDA is installed properly, try enabling DeepSpeed for a speed boost (almost 2x faster): pip install torch==2.1.2+cu121 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
    pip install https://github.com/daswer123/deepspeed-windows-wheels/releases/download/11.2/deepspeed-0.11.2+cuda121-cp310-cp310-win_amd64.whl

Note that this DeepSpeed wheel is built for Python 3.10 (cp310) on 64-bit Windows with CUDA 12.1; pip will refuse to install it into any other Python version.

Here's a quick test script with extended logging:

    if __name__ == "__main__":
        # The main guard matters on Windows, where child processes re-import this module.
        import logging
        import time

        from RealtimeTTS import TextToAudioStream, CoquiEngine

        def dummy_generator():
            yield "Hey guys! These here are realtime spoken sentences based on local text synthesis. "
            yield "With a local, neuronal, cloned voice. So every spoken sentence sounds unique."

        # Print INFO-level log messages from RealtimeTTS and the engine.
        logging.basicConfig(level=logging.INFO)
        engine = CoquiEngine(level=logging.INFO, use_deepspeed=True)

        # muted=True: synthesize without playing the audio through the speakers.
        stream = TextToAudioStream(engine, muted=True)

        print("Starting to play stream")
        start_time = time.time()
        # Synthesize both sentences and write the result to <engine_name>_output.wav.
        stream.feed(dummy_generator()).play(
            log_synthesized_text=True,
            muted=True,
            output_wavfile=stream.engine.engine_name + "_output.wav",
        )
        end_time = time.time()
        print(f"Time taken for play command: {end_time - start_time:.2f} seconds")

        engine.shutdown()

You should see something like this in the output:
For comparison, on my 4090, I get:
That's for a 16-second generated audio file, translating to a real-time factor of 0.22625 (synthesis time divided by audio length, so about 3.6 seconds of compute for 16 seconds of speech). Your RTX 3060 should easily manage a real-time factor below 1. So yeah, the engine is definitely not "unusable." A project like OpenInterpreter 01, which has 5,000+ GitHub stars, wouldn't rely on it if that were the case. Let's figure this out. 😊
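To compare your numbers with the ones above, the real-time factor can be computed from the time the test script prints and the length of the WAV file it writes. The sketch below is a minimal, stdlib-only helper; the function names `wav_duration_seconds` and `real_time_factor` are hypothetical (not part of RealtimeTTS), and it assumes the output file is a standard PCM WAV that Python's `wave` module can open. The timing printed by the script covers the whole `play()` call, so treat the result as an approximation of pure synthesis speed.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the playback length of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / length of the audio produced.

    Values below 1.0 mean synthesis runs faster than real-time playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # The RTX 4090 figures quoted above: 16 s of audio at an RTF of 0.22625
    # correspond to about 3.62 s of synthesis time.
    print(f"{real_time_factor(3.62, 16.0):.5f}")  # prints 0.22625
```

In practice you would call `real_time_factor(elapsed, wav_duration_seconds(path))`, where `elapsed` is the "Time taken for play command" value and `path` is the file named by `stream.engine.engine_name + "_output.wav"`. An RTF well above 1 on an RTX 3060 would point to synthesis running on the CPU.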
I've been trying to solve a problem for two weeks now and can't get it to work.
The Coqui engine is basically unusable: synthesis takes more than 30 seconds per sentence.
I've got all the dependencies installed, as well as CUDA.
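Having the CUDA toolkit installed does not guarantee that the installed PyTorch build can use it, which is exactly the guess in the reply above. A slightly more detailed check than a single `is_available()` call is sketched below; the helper name `cuda_report` is hypothetical, and the script only uses standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_device_name()`). A `cuda_build` of `None` means PyTorch was built without CUDA support, so synthesis would fall back to the CPU.

```python
import importlib.util


def cuda_report() -> dict:
    """Summarize whether PyTorch is installed and whether it can see a CUDA GPU."""
    report = {
        "torch_installed": False,
        "cuda_available": False,
        "torch_version": None,
        "cuda_build": None,
        "device_name": None,
    }
    # Avoid an ImportError if PyTorch is missing entirely.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # CUDA version this PyTorch build was compiled against; None for CPU-only builds.
    report["cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # Name of the first visible GPU, e.g. the RTX 3060 in this setup.
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```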
    engine = CoquiEngine(
        device="cuda",
        language="de",
        level=logging.INFO,
        local_models_path=r"C:\Users\Fuat\Desktop\Realtime SST\cacheCustom",
    )
    engine.set_voice("Damien Black")
I've also tried switching to a different model, but the engine throws an error as soon as I set model_name or specific_model to anything other than xtts2...
My PC specs are:
CPU: AMD Ryzen 5 5600X 6-Core Processor 3.70 GHz
GPU: RTX 3060 12GB
RAM: 32 GB
I'm pretty lost on this, so any help would be appreciated!