
tts : add OuteTTS support #10784

Open · wants to merge 37 commits into base: gg/server-embeddings-all

Conversation

@ggerganov (Owner) commented Dec 11, 2024

Closes #10173

Depends on #10853, #10861

Overview

This PR adds inference support for the OuteTTS vocoder (i.e. WavTokenizer) directly into libllama. This enables full text-to-speech generation using llama.cpp.

# generate output.wav
llama-tts \
    -m  ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75/ggml-model-f16.gguf \
    -p "Running text to speech locally on your computer."

# play the generated audio
ffplay output.wav
output.mp4

TTS requires two models: an LLM and a vocoder (voice decoder). The first generates audio codes (tokens) from the provided input text, based on some voice settings. The second converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with an inverse FFT.
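To make the flow concrete, here is a minimal sketch of the three stages. The two model calls are hypothetical stubs (not the actual libllama API), and the frame size, hop and spectrum width are illustrative assumptions:

import numpy as np

# Hypothetical stand-ins for the two models; shapes and sizes are illustrative only.
def llm_text_to_codes(text):
    # stage 1: the LLM maps the input text (+ voice settings) to audio codes (tokens)
    return [0, 1, 2, 3]  # dummy codes

def vocoder_codes_to_spectrogram(codes):
    # stage 2: the vocoder (WavTokenizer) maps the audio codes to per-frame complex spectra
    return np.zeros((len(codes), 641), dtype=np.complex64)  # dummy frames, 641 = n_fft // 2 + 1

def spectrogram_to_audio(spec, n_fft=1280, hop=320):
    # stage 3: inverse FFT per frame, Hann window, overlap-add -> waveform
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    for i, frame in enumerate(spec):
        out[i * hop : i * hop + n_fft] += np.fft.irfft(frame, n=n_fft) * win
    return out

audio = spectrogram_to_audio(vocoder_codes_to_spectrogram(llm_text_to_codes("Hello world")))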

Usage

# this will produce F16 LLM model (~1 GB)
mkdir models/outetts-0.2-0.5B-llm
python convert_hf_to_gguf.py OuteAI/OuteTTS-0.2-500M/ --outfile models/outetts-0.2-0.5B-llm/ggml-model-f16.gguf --outtype f16

# this will produce Q8_0 LLM model (~500 MB)
llama-quantize models/outetts-0.2-0.5B-llm/ggml-model-f16.gguf models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf q8_0

# convert PT -> HF
python examples/tts/convert_pt_to_hf.py ./WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt

# convert HF -> GGUF (~250 MB)
mkdir models/wavtokenizer-large-75
python convert_hf_to_gguf.py WavTokenizer-large-speech-75token/ --outfile models/wavtokenizer-large-75/ggml-model-f16.gguf --outtype f16

Generate speech from text using the llama-tts example:
llama-tts \
    -m  ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75/ggml-model-f16.gguf \
    -p "Hello world"

Note that the sampling settings of the LLM might need some adjustments.

TODO:

  • Clean-up implementation
  • Fix conv tensor shapes
  • Remove hardcoded constants
  • Better conversion script
  • Server support
  • Optimize the spectrum operations
  • Read and use other voices
  • Rename outetts-voc arch to wav-tokenizer

@github-actions bot added the examples, python (python script changes) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 11, 2024
@mirek190 commented Dec 11, 2024

wow ...nice ;)

and an implementation of multimodal models like vision next, and we're done ;-D

@ggerganov (Owner, Author) replied:

wow ...nice ;)

and an implementation of multimodal models like vision next, and we're done ;-D

and-we-are-done.mp4

@edwko commented Dec 11, 2024

Awesome! Really excited to see it running natively 😊

@ggerganov (Owner, Author) replied:

Awesome! Really excited to see it running natively

natively.mp4

@ggerganov (Owner, Author) commented:

Here is a longer generation:

TTS requires 2 models to be provided: an LLM and a Vocoder(?). The first one generates audio codes (tokens) from the provided input text, based on some voice settings. The second one converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with inverse FFT.

longer.mp4

Not sure how to pass punctuation yet. Or even if this model supports it.

punctuation.mp4

@jadams777 commented:

This is great. Would love to see a video tutorial on how to set up Ollama with this.

@ggerganov (Owner, Author) replied:

This is great. Would love to see a video tutorial on how to set up Ollama with this.

ollama.mp4

@ngxson (Collaborator) commented Dec 11, 2024

Out of curiosity, does it make sense to combine both llm+voc into one gguf? I'm thinking about the idea of having llama-voice-to-voice -m llama-3.1.gguf -mtts oute-tts.gguf -masr whisper.gguf, but maybe it's too early to think about that?

@ggerganov (Owner, Author) replied:

Maybe we can add support to pack multiple models in a single GGUF.

@edwko commented Dec 11, 2024

Not sure how to pass punctuation yet. Or even if this model supports it.
punctuation.mp4

The current models don't support special characters yet. I plan to add support for this in the next release. For now, the interface strips them.

@ggerganov (Owner, Author) replied:

Great, looking forward to this. And many thanks and admiration for this work 👍

@edwko commented on the diff, Dec 12, 2024, on these lines:

#include <vector>
#include <fstream>
#include <thread>

Here's a suggestion for the text preprocessing implementation, based on how it's currently done in the library.

#include <string>
#include <vector>
#include <regex>
#include <stdexcept>
#include <sstream>
#include <map>
#include <iostream>
#include <algorithm> // std::transform
#include <cctype>    // ::tolower

const std::map<int, std::string> ones = {
    {0, "zero"}, {1, "one"}, {2, "two"}, {3, "three"}, {4, "four"},
    {5, "five"}, {6, "six"}, {7, "seven"}, {8, "eight"}, {9, "nine"},
    {10, "ten"}, {11, "eleven"}, {12, "twelve"}, {13, "thirteen"}, {14, "fourteen"},
    {15, "fifteen"}, {16, "sixteen"}, {17, "seventeen"}, {18, "eighteen"}, {19, "nineteen"}
};

const std::map<int, std::string> tens = {
    {2, "twenty"}, {3, "thirty"}, {4, "forty"}, {5, "fifty"},
    {6, "sixty"}, {7, "seventy"}, {8, "eighty"}, {9, "ninety"}
};

// Convert a number less than 1000 to words
std::string convert_less_than_thousand(int num) {
    std::string result;
    
    if (num >= 100) {
        result += ones.at(num / 100) + " hundred ";
        num %= 100;
    }
    
    if (num >= 20) {
        result += tens.at(num / 10);
        if (num % 10 > 0) {
            result += "-" + ones.at(num % 10);
        }
    } else if (num > 0) {
        result += ones.at(num);
    }
    
    return result;
}

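// Spell out a number string (integer part and optional decimal digits) as English words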
std::string number_to_words(const std::string& number_str) {
    try {
        size_t decimal_pos = number_str.find('.');
        std::string integer_part = number_str.substr(0, decimal_pos);
        
        int int_number = std::stoi(integer_part);
        std::string result;
        
        if (int_number == 0) {
            result = "zero";
        } else {
            if (int_number >= 1000000000) {
                int billions = int_number / 1000000000;
                result += convert_less_than_thousand(billions) + " billion ";
                int_number %= 1000000000;
            }
            
            if (int_number >= 1000000) {
                int millions = int_number / 1000000;
                result += convert_less_than_thousand(millions) + " million ";
                int_number %= 1000000;
            }
            
            if (int_number >= 1000) {
                int thousands = int_number / 1000;
                result += convert_less_than_thousand(thousands) + " thousand ";
                int_number %= 1000;
            }
            
            if (int_number > 0) {
                result += convert_less_than_thousand(int_number);
            }
        }
        
        // Handle decimal part
        if (decimal_pos != std::string::npos) {
            result += " point";
            std::string decimal_part = number_str.substr(decimal_pos + 1);
            for (char digit : decimal_part) {
                result += " " + ones.at(digit - '0');
            }
        }
        
        return result;
    } catch (const std::exception& e) {
        // Skip if fails
        return " "; 
    }
}

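// Replace every number in the input text with its spelled-out English form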
std::string replace_numbers_with_words(const std::string& input_text) {
    std::regex number_pattern(R"(\d+(\.\d+)?)");
    std::string result;
    auto it = std::sregex_iterator(input_text.begin(), input_text.end(), number_pattern);
    auto end = std::sregex_iterator();

    size_t last_pos = 0;
    for (std::sregex_iterator i = it; i != end; ++i) {
        const std::smatch& match = *i;
        result.append(input_text, last_pos, match.position() - last_pos);
        result.append(number_to_words(match.str()));
        last_pos = match.position() + match.length();
    }
    result.append(input_text, last_pos);
    
    return result;
}

// Based on: https://github.com/edwko/OuteTTS/blob/a613e79c489d8256dd657ea9168d78de75895d82/outetts/version/v1/prompt_processor.py#L39
std::string process_text(const std::string& text) {
    
    // For now I skipped text romanization as I am unsure how to handle
    // uroman and MeCab implementations in C++
    // maybe something like https://github.com/anyascii/anyascii/ could work.
    // currently only English would be supported in this function

    std::string processed_text = replace_numbers_with_words(text);

    std::transform(processed_text.begin(), processed_text.end(), 
                  processed_text.begin(), ::tolower);

    std::regex special_chars(R"([-_/,\.\\])");
    processed_text = std::regex_replace(processed_text, special_chars, " ");
    
    std::regex non_alpha(R"([^a-z\s])");
    processed_text = std::regex_replace(processed_text, non_alpha, "");
    
    std::regex multiple_spaces(R"(\s+)");
    processed_text = std::regex_replace(processed_text, multiple_spaces, " ");
    
    processed_text = std::regex_replace(processed_text, std::regex(R"(^\s+|\s+$)"), "");

    /*
        Replace spaces with the separator token same as in line 365

        for (auto & c : prompt_user) {
        if (c == ' ') {
            prompt_clean += "<|text_sep|>";
    */
    processed_text = std::regex_replace(processed_text, std::regex(R"(\s)"), "<|text_sep|>");

    return processed_text;
}

@ggerganov mentioned this pull request on Dec 13, 2024
@edwko commented Dec 14, 2024

I've consolidated WavTokenizer into a model.py file and split the base model (1.75 GB) into two components:

https://huggingface.co/OuteAI/wavtokenizer-large-75token-interface/tree/main
  • encoder (82 MB)
  • decoder (248 MB)

Might help with the convert_pt_to_hf.py script.

Here's the splitting code:

# model.py code...
# (WavEncoder, WavDecoder and the loaded model come from the model.py code elided above)

import os
import torch

def split_wav_tokenizer(model, save_directory):
    """Split WavTokenizer model and save components"""
    encoder_dir = os.path.join(save_directory, "encoder")
    decoder_dir = os.path.join(save_directory, "decoder")
    
    encoder = WavEncoder(model.feature_extractor)
    encoder.save_pretrained(encoder_dir)
    
    codebook_weights = torch.cat(
        [vq.codebook for vq in model.feature_extractor.encodec.quantizer.vq.layers],
        dim=0
    )
    decoder = WavDecoder(model.backbone, model.head, codebook_weights)
    decoder.save_pretrained(decoder_dir)

@ggerganov (Owner, Author) commented Dec 17, 2024

Initial server support is now available using the examples/tts/tts-outetts.py script. It requires starting two servers: one with the LLM and one with WavTokenizer:

# llm server
./build/bin/llama-server -m ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf -fa --port 8020

# wavtokenizer server
./build/bin/llama-server -m ./models/wavtokenizer-large-75/ggml-model-f16.gguf -fa --port 8021 --embeddings

# generate audio
python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"

The Python script is currently missing the spectrogram -> audio conversion. I don't know what the best way to implement this is, and importing PyTorch just for that seems like overkill. So I'll leave it like this for now and hope we get some ideas later on.
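
A possible PyTorch-free direction, purely as a sketch: if the per-frame spectrogram can be obtained as magnitude and phase arrays, NumPy's irfft plus a windowed overlap-add and the standard wave module are enough to write a WAV file. The array shapes, the n_fft/hop values and the 24 kHz sample rate below are assumptions, not the actual model parameters:

import wave
import numpy as np

def spec_to_wav(mag, phase, path, n_fft=1280, hop=320, sr=24000):
    # mag, phase: arrays of shape (n_frames, n_fft // 2 + 1)
    spec = mag * np.exp(1j * phase)                  # complex STFT frames
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        out[i * hop : i * hop + n_fft] += np.fft.irfft(frame, n=n_fft) * win  # inverse FFT + window
        norm[i * hop : i * hop + n_fft] += win ** 2                           # overlap-add weights
    out /= np.maximum(norm, 1e-8)                                             # window normalization
    pcm = (np.clip(out, -1.0, 1.0) * 32767.0).astype(np.int16)                # float -> 16-bit PCM
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())

If a SciPy dependency is acceptable, scipy.signal.istft could replace the hand-rolled overlap-add.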

This is still WIP as we'll refactor the endpoints to improve support for this, before merging.

@ggerganov changed the base branch from master to gg/server-embeddings-all on December 17, 2024
Labels: examples · ggml (changes relating to the ggml tensor library for machine learning) · python (python script changes) · server
Projects: None yet
Successfully merging this pull request may close these issues: tts : add basic example for text-to-speech
5 participants