
tts : add OuteTTS support #10784

Open · wants to merge 37 commits into base: gg/server-embeddings-all

Conversation

@ggerganov (Owner) commented Dec 11, 2024

Closes #10173

Depends on #10853, #10861

Overview

This PR adds inference support for the OuteTTS vocoder (i.e. WavTokenizer) directly into libllama. This enables full text-to-speech generation using llama.cpp.

# generate output.wav
llama-tts \
    -m  ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75/ggml-model-f16.gguf \
    -p "Running text to speech locally on your computer."

# play the generated audio
ffplay output.wav
output.mp4

TTS requires two models: an LLM and a vocoder (voice decoder). The first generates audio codes (tokens) from the provided input text, based on some voice settings. The second converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with an inverse FFT.
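To make the flow concrete, here is a minimal sketch of the three stages. The two model calls are hypothetical stubs (not the actual libllama API), and the frame size, hop and spectrum width are illustrative assumptions:

import numpy as np

# Hypothetical stand-ins for the two models; shapes and sizes are illustrative only.
def llm_text_to_codes(text):
    # stage 1: the LLM maps the input text (+ voice settings) to audio codes (tokens)
    return [0, 1, 2, 3]  # dummy codes

def vocoder_codes_to_spectrogram(codes):
    # stage 2: the vocoder (WavTokenizer) maps the audio codes to per-frame complex spectra
    return np.zeros((len(codes), 641), dtype=np.complex64)  # dummy frames, 641 = n_fft // 2 + 1

def spectrogram_to_audio(spec, n_fft=1280, hop=320):
    # stage 3: inverse FFT per frame, Hann window, overlap-add -> waveform
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    for i, frame in enumerate(spec):
        out[i * hop : i * hop + n_fft] += np.fft.irfft(frame, n=n_fft) * win
    return out

audio = spectrogram_to_audio(vocoder_codes_to_spectrogram(llm_text_to_codes("Hello world")))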

Usage

# this will produce F16 LLM model (~1 GB)
mkdir models/outetts-0.2-0.5B-llm
python convert_hf_to_gguf.py OuteAI/OuteTTS-0.2-500M/ --outfile models/outetts-0.2-0.5B-llm/ggml-model-f16.gguf --outtype f16

# this will produce Q8_0 LLM model (~500 MB)
llama-quantize models/outetts-0.2-0.5B-llm/ggml-model-f16.gguf models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf q8_0

# convert PT -> HF
python examples/tts/convert_pt_to_hf.py ./WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt

# convert HF -> GGUF (~250 MB)
mkdir models/wavtokenizer-large-75
python convert_hf_to_gguf.py WavTokenizer-large-speech-75token/ --outfile models/wavtokenizer-large-75/ggml-model-f16.gguf --outtype f16

Generate speech from text using the llama-tts example:
llama-tts \
    -m  ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75/ggml-model-f16.gguf \
    -p "Hello world"

Note that the sampling settings of the LLM might need some adjustments.

TODO:

  • Clean-up implementation
  • Fix conv tensor shapes
  • Remove hardcoded constants
  • Better conversion script
  • Server support
  • Optimize the spectrum operations
  • Read and use other voices
  • Rename outetts-voc arch to wav-tokenizer

@github-actions bot added the examples, python (python script changes) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 11, 2024
@mirek190 commented Dec 11, 2024

wow ...nice ;)

and an implementation of multimodal models like vision next, and we're done ;-D

@ggerganov (Owner, Author) replied:

wow ...nice ;)

and an implementation of multimodal models like vision next, and we're done ;-D

and-we-are-done.mp4

@edwko commented Dec 11, 2024

Awesome! Really excited to see it running natively 😊

@ggerganov (Owner, Author) replied:

Awesome! Really excited to see it running natively

natively.mp4

@ggerganov (Owner, Author) commented:

Here is a longer generation:

TTS requires 2 models to be provided: an LLM and a Vocoder(?). The first one generates audio codes (tokens) from the provided input text, based on some voice settings. The second one converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with inverse FFT.

longer.mp4

Not sure how to pass punctuation yet. Or even if this model supports it.

punctuation.mp4

@jadams777 commented:

This is great. Would love to see a video tutorial on how to set up Ollama with this.

@ggerganov (Owner, Author) replied:

This is great. Would love to see a video tutorial on how to set up Ollama with this.

ollama.mp4

@ngxson (Collaborator) commented Dec 11, 2024

Out of curiosity, does it make sense to combine both llm+voc into one gguf? I'm thinking about the idea of having llama-voice-to-voice -m llama-3.1.gguf -mtts oute-tts.gguf -masr whisper.gguf, but maybe it's too early to think about that?

@ggerganov (Owner, Author) replied:

Maybe we can add support to pack multiple models in a single GGUF.

@edwko commented Dec 11, 2024

Not sure how to pass punctuation yet. Or even if this model supports it.
punctuation.mp4

The current models don't support special characters yet. I plan to add support for this in the next release. For now, the interface strips them.

@ggerganov (Owner, Author) replied:

Great, looking forward to this. And many thanks and admiration for this work 👍

@edwko commented on the diff, Dec 12, 2024, on these lines:

#include <vector>
#include <fstream>
#include <thread>

Here's a suggestion for the text preprocessing implementation, based on how it's currently done in the library.

#include <string>
#include <vector>
#include <regex>
#include <stdexcept>
#include <sstream>
#include <map>
#include <iostream>
#include <algorithm> // std::transform
#include <cctype>    // ::tolower

const std::map<int, std::string> ones = {
    {0, "zero"}, {1, "one"}, {2, "two"}, {3, "three"}, {4, "four"},
    {5, "five"}, {6, "six"}, {7, "seven"}, {8, "eight"}, {9, "nine"},
    {10, "ten"}, {11, "eleven"}, {12, "twelve"}, {13, "thirteen"}, {14, "fourteen"},
    {15, "fifteen"}, {16, "sixteen"}, {17, "seventeen"}, {18, "eighteen"}, {19, "nineteen"}
};

const std::map<int, std::string> tens = {
    {2, "twenty"}, {3, "thirty"}, {4, "forty"}, {5, "fifty"},
    {6, "sixty"}, {7, "seventy"}, {8, "eighty"}, {9, "ninety"}
};

// Convert a number less than 1000 to words
std::string convert_less_than_thousand(int num) {
    std::string result;
    
    if (num >= 100) {
        result += ones.at(num / 100) + " hundred ";
        num %= 100;
    }
    
    if (num >= 20) {
        result += tens.at(num / 10);
        if (num % 10 > 0) {
            result += "-" + ones.at(num % 10);
        }
    } else if (num > 0) {
        result += ones.at(num);
    }
    
    return result;
}

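// Spell out a number string (integer part and optional decimal digits) as English words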
std::string number_to_words(const std::string& number_str) {
    try {
        size_t decimal_pos = number_str.find('.');
        std::string integer_part = number_str.substr(0, decimal_pos);
        
        int int_number = std::stoi(integer_part);
        std::string result;
        
        if (int_number == 0) {
            result = "zero";
        } else {
            if (int_number >= 1000000000) {
                int billions = int_number / 1000000000;
                result += convert_less_than_thousand(billions) + " billion ";
                int_number %= 1000000000;
            }
            
            if (int_number >= 1000000) {
                int millions = int_number / 1000000;
                result += convert_less_than_thousand(millions) + " million ";
                int_number %= 1000000;
            }
            
            if (int_number >= 1000) {
                int thousands = int_number / 1000;
                result += convert_less_than_thousand(thousands) + " thousand ";
                int_number %= 1000;
            }
            
            if (int_number > 0) {
                result += convert_less_than_thousand(int_number);
            }
        }
        
        // Handle decimal part
        if (decimal_pos != std::string::npos) {
            result += " point";
            std::string decimal_part = number_str.substr(decimal_pos + 1);
            for (char digit : decimal_part) {
                result += " " + ones.at(digit - '0');
            }
        }
        
        return result;
    } catch (const std::exception& e) {
        // Skip if fails
        return " "; 
    }
}

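// Replace every number in the input text with its spelled-out English form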
std::string replace_numbers_with_words(const std::string& input_text) {
    std::regex number_pattern(R"(\d+(\.\d+)?)");
    std::string result;
    auto it = std::sregex_iterator(input_text.begin(), input_text.end(), number_pattern);
    auto end = std::sregex_iterator();

    size_t last_pos = 0;
    for (std::sregex_iterator i = it; i != end; ++i) {
        const std::smatch& match = *i;
        result.append(input_text, last_pos, match.position() - last_pos);
        result.append(number_to_words(match.str()));
        last_pos = match.position() + match.length();
    }
    result.append(input_text, last_pos);
    
    return result;
}

// Based on: https://github.com/edwko/OuteTTS/blob/a613e79c489d8256dd657ea9168d78de75895d82/outetts/version/v1/prompt_processor.py#L39
std::string process_text(const std::string& text) {
    
    // For now I skipped text romanization as I am unsure how to handle
    // uroman and MeCab implementations in C++
    // maybe something like https://github.com/anyascii/anyascii/ could work.
    // currently only English would be supported in this function

    std::string processed_text = replace_numbers_with_words(text);

    std::transform(processed_text.begin(), processed_text.end(), 
                  processed_text.begin(), ::tolower);

    std::regex special_chars(R"([-_/,\.\\])");
    processed_text = std::regex_replace(processed_text, special_chars, " ");
    
    std::regex non_alpha(R"([^a-z\s])");
    processed_text = std::regex_replace(processed_text, non_alpha, "");
    
    std::regex multiple_spaces(R"(\s+)");
    processed_text = std::regex_replace(processed_text, multiple_spaces, " ");
    
    processed_text = std::regex_replace(processed_text, std::regex(R"(^\s+|\s+$)"), "");

    /*
        Replace spaces with the separator token same as in line 365

        for (auto & c : prompt_user) {
        if (c == ' ') {
            prompt_clean += "<|text_sep|>";
    */
    processed_text = std::regex_replace(processed_text, std::regex(R"(\s)"), "<|text_sep|>");

    return processed_text;
}

@ggerganov mentioned this pull request on Dec 13, 2024
@edwko commented Dec 14, 2024

I've consolidated WavTokenizer into a model.py file and split the base model (1.75 GB) into two components:

https://huggingface.co/OuteAI/wavtokenizer-large-75token-interface/tree/main
  • encoder (82 MB)
  • decoder (248 MB)

Might help with the convert_pt_to_hf.py script.

Here's the splitting code:

# model.py code...
# (WavEncoder, WavDecoder and the loaded model come from the model.py code elided above)

import os
import torch

def split_wav_tokenizer(model, save_directory):
    """Split WavTokenizer model and save components"""
    encoder_dir = os.path.join(save_directory, "encoder")
    decoder_dir = os.path.join(save_directory, "decoder")
    
    encoder = WavEncoder(model.feature_extractor)
    encoder.save_pretrained(encoder_dir)
    
    codebook_weights = torch.cat(
        [vq.codebook for vq in model.feature_extractor.encodec.quantizer.vq.layers],
        dim=0
    )
    decoder = WavDecoder(model.backbone, model.head, codebook_weights)
    decoder.save_pretrained(decoder_dir)

@ggerganov (Owner, Author) commented Dec 17, 2024

Initial server support is now available using the examples/tts/tts-outetts.py script. It requires starting two servers: one with the LLM and one with WavTokenizer:

# llm server
./build/bin/llama-server -m ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf -fa --port 8020

# wavtokenizer server
./build/bin/llama-server -m ./models/wavtokenizer-large-75/ggml-model-f16.gguf -fa --port 8021 --embeddings

# generate audio
python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"

The Python script is currently missing the spectrogram -> audio conversion. I don't know what the best way to implement this is, and importing PyTorch just for that seems like overkill. So I'll leave it like this for now and hope we get some ideas later on.
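
A possible PyTorch-free direction, purely as a sketch: if the per-frame spectrogram can be obtained as magnitude and phase arrays, NumPy's irfft plus a windowed overlap-add and the standard wave module are enough to write a WAV file. The array shapes, the n_fft/hop values and the 24 kHz sample rate below are assumptions, not the actual model parameters:

import wave
import numpy as np

def spec_to_wav(mag, phase, path, n_fft=1280, hop=320, sr=24000):
    # mag, phase: arrays of shape (n_frames, n_fft // 2 + 1)
    spec = mag * np.exp(1j * phase)                  # complex STFT frames
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        out[i * hop : i * hop + n_fft] += np.fft.irfft(frame, n=n_fft) * win  # inverse FFT + window
        norm[i * hop : i * hop + n_fft] += win ** 2                           # overlap-add weights
    out /= np.maximum(norm, 1e-8)                                             # window normalization
    pcm = (np.clip(out, -1.0, 1.0) * 32767.0).astype(np.int16)                # float -> 16-bit PCM
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())

If a SciPy dependency is acceptable, scipy.signal.istft could replace the hand-rolled overlap-add.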

This is still WIP as we'll refactor the endpoints to improve support for this, before merging.

@ggerganov changed the base branch from master to gg/server-embeddings-all on December 17, 2024
Labels: examples · ggml (changes relating to the ggml tensor library for machine learning) · python (python script changes) · server
Projects: None yet
Successfully merging this pull request may close these issues: tts : add basic example for text-to-speech
5 participants