Replies: 8 comments 9 replies
-
You could try StyleTTS2, which is faster, but it only works for English: https://github.com/sidharthrajaram/StyleTTS2. And you still only get CPU-speed inference with no Metal speedup.
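A minimal sketch of what that repo's pip package looks like in use, assuming the `styletts2` package and its `StyleTTS2().inference()` API as shown in that README (the text and output filename here are placeholders):

```python
# pip install styletts2   (CPU-only works on a Mac, just slowly)
from styletts2 import tts

# With no checkpoint/config paths given, the default model is downloaded and cached.
my_tts = tts.StyleTTS2()

# English-only synthesis; writes the result to a wav file.
my_tts.inference(
    "Hello from StyleTTS2 running on CPU.",
    output_wav_file="styletts2_test.wav",
)
```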
-
I mean, XTTS should be able to run on your computer, though. Get Miniconda, install it, and it should just work. Yes, it will be slow. At the moment, none of the text-to-speech models that run locally have Metal speedup support, so if you're running on something like an M1 or M2 you only get CPU inference. It's going to be slow, but it'll work, I guess.
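For reference, a rough sketch of running XTTS v2 on CPU through the Coqui TTS Python API (the file paths are placeholders, and the model weights are downloaded on first run):

```python
# In a fresh conda env: pip install TTS
from TTS.api import TTS

# Loads XTTS v2; on a Mac this runs CPU-only, so expect it to be slow.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# XTTS clones the voice from a short reference clip (speaker_wav is a placeholder path).
tts.tts_to_file(
    text="This is a test of XTTS running on CPU.",
    speaker_wav="reference_speaker.wav",
    language="en",
    file_path="xtts_output.wav",
)
```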
-
XTTS only needs about 4 GB of RAM to run, though. I can verify that, given that I've run it on virtual machines with only 4 GB of CPU RAM just fine.
-
But yeah, welcome to the world of text-to-speech on Mac. Anything that works will be super slow, and some models don't work at all unless you run them in Docker. Take piper-tts, for instance: it's crazy fast for multilingual, Siri-like voices with no voice cloning, BUT Piper does not run natively on M1; you have to use an x86 Docker environment or something similar to run it.
-
I suppose you could run a crappy text-to-speech without voice cloning and then run Coqui's voice conversion on all the generated output files. That would technically give you fast voice cloning, like this Hugging Face space does: https://huggingface.co/spaces/drewThomasson/Voice-Conversion
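A rough sketch of that idea, assuming the Coqui TTS FreeVC voice-conversion model (all file names are placeholders):

```python
# pip install TTS
from TTS.api import TTS

# FreeVC re-voices source_wav to sound like the speaker in target_wav; CPU-only on a Mac.
vc = TTS("voice_conversion_models/multilingual/vctk/freevc24")

vc.voice_conversion_to_file(
    source_wav="cheap_tts_output.wav",    # placeholder: output of any fast, non-cloning TTS
    target_wav="original_speaker.wav",    # placeholder: reference clip of the voice to clone
    file_path="cloned_output.wav",
)
```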
-
Or just make it run in a Google Colab for free GPU :/
-
So here's my project, Voxlingua. The user uploads a YouTube video link and selects a target language, and gets back the video with translated audio in the original speaker's voice (voice cloning). This is something you've done, I'm sure. I've divided it into six parts (Python files): video_processor, speech_recognition, text_translator, text_to_speech, voice_cloning.py, and audio_video_sync.py, and the voice cloning part is where I'm stuck. I already have the translated audio (generated with gTTS) and the translated transcription (with MarianMT); now I need to clone the voice from the original audio and produce a voice-cloned translated audio. For this I've tried Coqui XTTS, OpenVoice, and also F5-TTS (which was released recently and is great, but it only supports English and Chinese). It's very hard for me to run these locally on my Mac. Can you please help me out?
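For context, the translation and plain TTS parts I already have look roughly like this (a sketch with example model and language names, not my real code), and the gap is the voice-cloning step at the end:

```python
# pip install transformers sentencepiece gTTS
from transformers import MarianMTModel, MarianTokenizer
from gtts import gTTS

# text_translator step: source -> target language with MarianMT
# (the model name is an example; pick the Helsinki-NLP pair for your language combination)
model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["This is the original transcription."], return_tensors="pt", padding=True)
translated_ids = model.generate(**batch)
translated_text = tokenizer.decode(translated_ids[0], skip_special_tokens=True)

# text_to_speech step: plain (uncloned) translated audio with gTTS
gTTS(translated_text, lang="es").save("translated_gtts.mp3")

# voice_cloning step (the missing piece): this audio plus a clip of the original speaker
# would then go through a voice-conversion model, e.g. the Coqui FreeVC call sketched above
# (the mp3 may need converting to wav first).
```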
-
Also, I have a doubt (I'm new to the ML/TTS space, don't judge): since multiple users upload videos, Coqui (or whatever model I use for voice cloning) has to run inference on the backend for every request, right? So I basically need ongoing inference power too; it's not like I train this pretrained model once on a GPU and it just keeps working without one. What can I do in this particular situation, given that I'm GPU poor?
-
Hey drew, I'm currently working on a project that includes voice cloning, just like yours. I'm finding it very difficult to run Coqui XTTS on my device since I'm GPU poor (M2 MacBook Air, 8 GB RAM :/). I was wondering what alternatives I have, or whether there are any small fine-tuned versions of Coqui built on MLX that I could run. I also have a few doubts regarding the TTS model and thought it best to ask you, since you've been working on a similar project (found you on HF). Please let me know where I can message you so you could help me with this. Thanks in advance! :)