I basically want to feed a text generator (an LLM streaming response) into a speech model and get a streamed audio output. You seem to support streaming input, but how do you do output streaming? Can you do it for Coqui TTS?
Answered by KoljaB, Nov 9, 2023
Output streaming works like this:
I receive chunks from the LLM streaming response until a full sentence (or a sentence fragment ending on a comma) is detected. I then use Coqui streaming inference to synthesize that sentence fragment with the lowest possible latency, convert the resulting tensor chunks to wav, and stream-play them with pyAudio. So the RealtimeTTS library does support output streaming, but currently only for the XTTS model (Coqui does not support streaming inference for all of its models).
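For illustration, here is a minimal sketch of that pipeline, not the library's actual internals. `synthesize_stream` is a hypothetical stand-in for Coqui's streaming inference and is assumed to yield 16-bit PCM byte chunks; the pyAudio playback pattern is standard.

```python
import pyaudio

def sentence_fragments(llm_chunks):
    """Collect LLM text chunks until a sentence end (or a comma fragment) is seen."""
    buffer = ""
    for chunk in llm_chunks:
        buffer += chunk
        # naive boundary detection for the sketch; stream2sentence does this robustly
        if buffer.rstrip().endswith((".", "!", "?", ",")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()

def play_llm_stream(llm_chunks, synthesize_stream, sample_rate=24000):
    """Synthesize each fragment and play the resulting PCM chunks as they arrive."""
    pa = pyaudio.PyAudio()
    out = pa.open(format=pyaudio.paInt16, channels=1, rate=sample_rate, output=True)
    try:
        for fragment in sentence_fragments(llm_chunks):
            # synthesize_stream: hypothetical placeholder for Coqui streaming
            # inference, assumed to yield int16 PCM byte chunks for `fragment`
            for pcm_chunk in synthesize_stream(fragment):
                out.write(pcm_chunk)
    finally:
        out.stop_stream()
        out.close()
        pa.terminate()
```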
Hope that helps...
I see what you mean. RealtimeTTS is based on the stream2sentence library, and I don't think there are enough use cases to extend that library beyond what it was meant for.
So my suggestion: fork this library and exchange the stream2sentence generator used in RealtimeTTS with your own.
Here is how you can do it:
Find this line in text_to_stream.py:
and exchange …
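As a rough illustration, a replacement generator could look something like the sketch below; the function name and the splitting rule are assumptions for the example, not the library's actual interface. It only needs to accept the incoming character/token iterator and yield the text fragments that RealtimeTTS should synthesize one by one.

```python
def my_fragment_generator(char_iterator):
    """Hypothetical drop-in replacement for the stream2sentence generator."""
    buffer = ""
    for char in char_iterator:
        buffer += char
        if char in ".!?\n":          # split wherever your use case needs
            fragment = buffer.strip()
            if fragment:
                yield fragment
            buffer = ""
    if buffer.strip():
        yield buffer.strip()
```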