coqui engine is unusable #237

Open
FuatW opened this issue Dec 24, 2024 · 1 comment

Comments


FuatW commented Dec 24, 2024

I've been trying to figure out a problem for two weeks now and can't get it to work.

The Coqui engine is basically unusable for me: synthesis takes more than 30 seconds per sentence.
I have all the dependencies installed, as well as CUDA.

engine = CoquiEngine(
    device="cuda",
    language="de",
    level=logging.INFO,
    local_models_path=r"C:\Users\Fuat\Desktop\Realtime SST\cacheCustom"
)
engine.set_voice("Damien Black")

Also, I've tried switching to a different model, but the engine throws an error as soon as model_name or specific_model is set to anything other than xtts2...

My PC specs are:

CPU: AMD Ryzen 5 5600X 6-Core Processor 3.70 GHz
GPU: RTX 3060 12GB
RAM: 32 GB

I'm pretty lost on this, so any help would be appreciated!

Owner

KoljaB commented Dec 24, 2024

"Coqui engine is unusable" sounds a bit harsh. Your hardware should be more than enough to synthesize a sentence in a few seconds. My guess? You've installed CUDA but didn’t configure PyTorch to actually use it. Check the instructions here:
https://github.com/KoljaB/RealtimeTTS?tab=readme-ov-file#cuda-installation

Run this and let me know what it says:

import torch
print("CUDA is available!" if torch.cuda.is_available() else "CUDA is not available.")
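If you want a bit more detail in one go, here is a slightly extended diagnostic (a sketch; `cuda_status` is just a hypothetical helper name, and it assumes PyTorch may or may not be installed):

```python
import importlib.util


def cuda_status() -> str:
    """Return a one-line summary of the local PyTorch/CUDA situation."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed."
    import torch
    if not torch.cuda.is_available():
        # Most common cause: a CPU-only PyTorch wheel was installed,
        # even though the CUDA toolkit itself is present on the system.
        return "CUDA is not available (likely a CPU-only PyTorch build)."
    return f"CUDA is available: {torch.cuda.get_device_name(0)}"


print(cuda_status())
```

If this reports a CPU-only build, reinstalling PyTorch from the cu121 index (as in the pip commands below) is usually the fix.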

If CUDA is installed properly, try enabling DeepSpeed for a speed boost (almost 2x faster):

pip install torch==2.1.2+cu121 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install https://github.com/daswer123/deepspeed-windows-wheels/releases/download/11.2/deepspeed-0.11.2+cuda121-cp310-cp310-win_amd64.whl

Here’s a quick test script with extended logging:

if __name__ == "__main__":
    from RealtimeTTS import TextToAudioStream, CoquiEngine
    import time

    def dummy_generator():
        yield "Hey guys! These here are realtime spoken sentences based on local text synthesis. "
        yield "With a local, neuronal, cloned voice. So every spoken sentence sounds unique."

    import logging
    logging.basicConfig(level=logging.INFO)
    engine = CoquiEngine(level=logging.INFO, use_deepspeed=True)

    stream = TextToAudioStream(engine, muted=True)

    print("Starting to play stream")

    start_time = time.time()
    stream.feed(dummy_generator()).play(log_synthesized_text=True, muted=True, output_wavfile=stream.engine.engine_name + "_output.wav")
    end_time = time.time()

    print(f"Time taken for play command: {end_time - start_time:.2f} seconds")

    engine.shutdown()

You should see something like this in the output:

[INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

For comparison, on my 4090, I get:

Time taken for play command: 3.62 seconds

That’s for a 16-second generated audio file, translating to a real-time factor of 0.22625. Your RTX 3060 should easily manage a real-time factor below 1.
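The real-time factor above is just synthesis time divided by the duration of the generated audio; anything below 1.0 means synthesis is faster than playback:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent synthesizing divided by the
    duration of the audio produced. Below 1.0 = faster than real time."""
    return synthesis_seconds / audio_seconds


# 3.62 s of synthesis for a 16 s audio file:
print(real_time_factor(3.62, 16.0))  # 0.22625
```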

So yeah, the engine is definitely not "unusable." A project like OpenInterpreter 01, which has 5,000+ GitHub stars, wouldn’t rely on it if that were the case.

Let’s figure this out. 😊
