coqui engine is unusable #237

FuatW · 2024-12-24T09:45:12Z

I've got a problem I've been trying to figure out for 2 weeks now and cannot get it to work.

The Coqui Engine is basically unusable. Synthesis takes more than 30 seconds per sentence.
I've got all the dependencies installed, as well as CUDA.

engine = CoquiEngine(
    device="cuda",
    language="de",
    level=logging.INFO,
    local_models_path=r"C:\Users\Fuat\Desktop\Realtime SST\cacheCustom",
)
engine.set_voice("Damien Black")

Also, I've tried switching the model to a different one, but the engine outputs an error once you set the model_name or specific_model to anything other than xtts2...

My PC specs are:

CPU: AMD Ryzen 5 5600X 6-Core Processor 3.70 GHz
GPU: RTX 3060 12GB
RAM: 32 GB

I'm pretty lost on this, so any help would be appreciated!

KoljaB · 2024-12-24T11:20:42Z

"Coqui engine is unusable" sounds a bit harsh. Your hardware should be more than enough to synthesize a sentence in a few seconds. My guess? You've installed CUDA but didn’t configure PyTorch to actually use it. Check the instructions here:
https://github.com/KoljaB/RealtimeTTS?tab=readme-ov-file#cuda-installation

Run this and let me know what it says:

import torch
print("CUDA is available!" if torch.cuda.is_available() else "CUDA is not available.")
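
If the one-liner reports that CUDA is not available, a slightly longer check can help narrow down why. The following is only a sketch: the helper name `describe_cuda_setup` is made up for illustration, and it assumes nothing beyond standard PyTorch attributes (`torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()`, `torch.cuda.get_device_name()`). A `built_with_cuda` value of `None` would suggest a CPU-only PyTorch wheel is installed.

```python
def describe_cuda_setup() -> dict:
    """Collect basic facts about the installed PyTorch build and its CUDA support."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"torch_installed": False}

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means the wheel was built without CUDA (CPU-only build).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Name of the first visible GPU.
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_setup().items():
        print(f"{key}: {value}")
```

Posting the printed lines along with the log output makes it easier to tell a CPU-only PyTorch install apart from a driver problem.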

If CUDA is installed properly, try enabling DeepSpeed for a speed boost (almost 2x faster):

pip install torch==2.1.2+cu121 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install https://github.com/daswer123/deepspeed-windows-wheels/releases/download/11.2/deepspeed-0.11.2+cuda121-cp310-cp310-win_amd64.whl

Here’s a quick test script with extended logging:

if __name__ == "__main__":
    from RealtimeTTS import TextToAudioStream, CoquiEngine
    import time

    def dummy_generator():
        yield "Hey guys! These here are realtime spoken sentences based on local text synthesis. "
        yield "With a local, neuronal, cloned voice. So every spoken sentence sounds unique."

    import logging
    logging.basicConfig(level=logging.INFO)
    engine = CoquiEngine(level=logging.INFO, use_deepspeed=True)

    stream = TextToAudioStream(engine, muted=True)

    print("Starting to play stream")

    start_time = time.time()
    stream.feed(dummy_generator()).play(log_synthesized_text=True, muted=True, output_wavfile=stream.engine.engine_name + "_output.wav")
    end_time = time.time()

    print(f"Time taken for play command: {end_time - start_time:.2f} seconds")

    engine.shutdown()

You should see something like this in the output:

[INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

For comparison, on my 4090, I get:

Time taken for play command: 3.62 seconds

That’s for a 16-second generated audio file, translating to a real-time factor of 0.22625. Your RTX 3060 should easily manage a real-time factor below 1.
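
The real-time factor here is simply synthesis time divided by the duration of the generated audio. As a rough sketch for reproducing that number from your own run, the helpers below (hypothetical names `wav_duration_seconds` and `real_time_factor`, not part of RealtimeTTS) read the duration with Python's standard `wave` module, assuming the output file is a plain PCM WAV, and compute the ratio:

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # Numbers from the 4090 run quoted above: 3.62 s for 16 s of audio.
    print(f"Real-time factor: {real_time_factor(3.62, 16.0):.5f}")
```

With the test script above, `audio_seconds` could come from `wav_duration_seconds` applied to the file passed as `output_wavfile`, and `synthesis_seconds` from the printed "Time taken for play command" value.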

So yeah, the engine is definitely not "unusable." A project like OpenInterpreter 01, which has 5,000+ GitHub stars, wouldn’t rely on it if that were the case.

Let’s figure this out. 😊
