tts : add OuteTTS support #10784
base: gg/server-embeddings-all
Conversation
Wow... nice ;) Now an implementation of multimodal models like vision, and we are done ;-D
and-we-are-done.mp4
Awesome! Really excited to see it running natively 😊
natively.mp4
Here is a longer generation:
longer.mp4
Not sure how to pass punctuation yet. Or even if this model supports it.
punctuation.mp4
This is great. Would love to see a video tutorial on how to set up Ollama with this.
ollama.mp4
Out of curiosity, does it make sense to combine both the LLM and the vocoder into one GGUF? I'm thinking about the idea of having ...
Maybe we can add support to pack multiple models in a single GGUF. |
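For illustration only, here is a rough sketch of what packing two models into a single GGUF file could look like using gguf-py's GGUFWriter. The architecture string, the sub_model_count KV key, the llm./voc. tensor-name prefixes, and the placeholder tensors below are all made up for the example; nothing here is an agreed design.

import numpy as np
from gguf import GGUFWriter

# Hypothetical layout: prefix each sub-model's tensor names so they can coexist
# in one file, and record how many sub-models are packed in a KV entry.
writer = GGUFWriter("outetts-combined.gguf", "outetts")   # arch name is hypothetical
writer.add_uint32("outetts.sub_model_count", 2)           # hypothetical KV key
writer.add_tensor("llm.token_embd.weight", np.zeros((8, 8), dtype=np.float32))  # placeholder tensor
writer.add_tensor("voc.posnet.0.weight",   np.zeros((8, 8), dtype=np.float32))  # placeholder tensor
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()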
The current models don't support special characters yet. I plan to add support for this in the next release. For now, the interface clears them.
Great, looking forward to this. And many thanks and admiration for this work 👍
#include <vector>
#include <fstream>
#include <thread>
Here's a suggestion for the text preprocessing implementation, based on how it's currently done in the OuteTTS library.
#include <string>
#include <vector>
#include <regex>
#include <stdexcept>
#include <sstream>
#include <map>
#include <iostream>
#include <algorithm> // for std::transform
#include <cctype>    // for ::tolower
const std::map<int, std::string> ones = {
{0, "zero"}, {1, "one"}, {2, "two"}, {3, "three"}, {4, "four"},
{5, "five"}, {6, "six"}, {7, "seven"}, {8, "eight"}, {9, "nine"},
{10, "ten"}, {11, "eleven"}, {12, "twelve"}, {13, "thirteen"}, {14, "fourteen"},
{15, "fifteen"}, {16, "sixteen"}, {17, "seventeen"}, {18, "eighteen"}, {19, "nineteen"}
};
const std::map<int, std::string> tens = {
{2, "twenty"}, {3, "thirty"}, {4, "forty"}, {5, "fifty"},
{6, "sixty"}, {7, "seventy"}, {8, "eighty"}, {9, "ninety"}
};
// Convert a number less than 1000 to words
std::string convert_less_than_thousand(int num) {
std::string result;
if (num >= 100) {
result += ones.at(num / 100) + " hundred ";
num %= 100;
}
if (num >= 20) {
result += tens.at(num / 10);
if (num % 10 > 0) {
result += "-" + ones.at(num % 10);
}
} else if (num > 0) {
result += ones.at(num);
}
return result;
}
std::string number_to_words(const std::string& number_str) {
try {
size_t decimal_pos = number_str.find('.');
std::string integer_part = number_str.substr(0, decimal_pos);
int int_number = std::stoi(integer_part);
std::string result;
if (int_number == 0) {
result = "zero";
} else {
if (int_number >= 1000000000) {
int billions = int_number / 1000000000;
result += convert_less_than_thousand(billions) + " billion ";
int_number %= 1000000000;
}
if (int_number >= 1000000) {
int millions = int_number / 1000000;
result += convert_less_than_thousand(millions) + " million ";
int_number %= 1000000;
}
if (int_number >= 1000) {
int thousands = int_number / 1000;
result += convert_less_than_thousand(thousands) + " thousand ";
int_number %= 1000;
}
if (int_number > 0) {
result += convert_less_than_thousand(int_number);
}
}
// Handle decimal part
if (decimal_pos != std::string::npos) {
result += " point";
std::string decimal_part = number_str.substr(decimal_pos + 1);
for (char digit : decimal_part) {
result += " " + ones.at(digit - '0');
}
}
return result;
} catch (const std::exception& e) {
// Skip if fails
return " ";
}
}
std::string replace_numbers_with_words(const std::string& input_text) {
std::regex number_pattern(R"(\d+(\.\d+)?)");
std::string result;
auto it = std::sregex_iterator(input_text.begin(), input_text.end(), number_pattern);
auto end = std::sregex_iterator();
size_t last_pos = 0;
for (std::sregex_iterator i = it; i != end; ++i) {
const std::smatch& match = *i;
result.append(input_text, last_pos, match.position() - last_pos);
result.append(number_to_words(match.str()));
last_pos = match.position() + match.length();
}
result.append(input_text, last_pos);
return result;
}
// Based on: https://github.com/edwko/OuteTTS/blob/a613e79c489d8256dd657ea9168d78de75895d82/outetts/version/v1/prompt_processor.py#L39
std::string process_text(const std::string& text) {
// For now I skipped text romanization as I am unsure how to handle
// uroman and MeCab implementations in C++
// maybe something like https://github.com/anyascii/anyascii/ could work.
// currently only English would be supported in this function
std::string processed_text = replace_numbers_with_words(text);
std::transform(processed_text.begin(), processed_text.end(),
processed_text.begin(), ::tolower);
std::regex special_chars(R"([-_/,\.\\])");
processed_text = std::regex_replace(processed_text, special_chars, " ");
std::regex non_alpha(R"([^a-z\s])");
processed_text = std::regex_replace(processed_text, non_alpha, "");
std::regex multiple_spaces(R"(\s+)");
processed_text = std::regex_replace(processed_text, multiple_spaces, " ");
processed_text = std::regex_replace(processed_text, std::regex(R"(^\s+|\s+$)"), "");
/*
Replace spaces with the separator token same as in line 365
for (auto & c : prompt_user) {
if (c == ' ') {
prompt_clean += "<|text_sep|>";
*/
processed_text = std::regex_replace(processed_text, std::regex(R"(\s)"), "<|text_sep|>");
return processed_text;
}
I've consolidated WavTokenizer into the model.py file and split the base model (1.75 GB) into two components: https://huggingface.co/OuteAI/wavtokenizer-large-75token-interface/tree/main
This might help with the convert_pt_to_hf.py script. Here's the splitting code:
import os
import torch

# model.py code (WavEncoder / WavDecoder wrapper classes) ...

def split_wav_tokenizer(model, save_directory):
    """Split WavTokenizer model and save components"""
    encoder_dir = os.path.join(save_directory, "encoder")
    decoder_dir = os.path.join(save_directory, "decoder")

    # save the feature extractor (encoder) on its own
    encoder = WavEncoder(model.feature_extractor)
    encoder.save_pretrained(encoder_dir)

    # concatenate the quantizer codebooks and bundle them with the decoder
    codebook_weights = torch.cat(
        [vq.codebook for vq in model.feature_extractor.encodec.quantizer.vq.layers],
        dim=0
    )
    decoder = WavDecoder(model.backbone, model.head, codebook_weights)
    decoder.save_pretrained(decoder_dir)
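A possible invocation, assuming model is an already loaded WavTokenizer checkpoint (the loading lives in the elided model.py code above) and the target directory is just an example path:

# hypothetical usage: writes the encoder/ and decoder/ components under the given directory
split_wav_tokenizer(model, "./wavtokenizer-large-75-split")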
Initial server support is now available:
# llm server
./build/bin/llama-server -m ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf -fa --port 8020
# wavtokenizer server
./build/bin/llama-server -m ./models/wavtokenizer-large-75/ggml-model-f16.gguf -fa --port 8021 --embeddings
# generate audio
python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "Hello world"

The python script is currently missing the spectrogram -> audio conversion. I don't know what the best way to implement this is, and importing PyTorch just for that seems like overkill. So I'll leave it like this for now and hope we get some ideas later on. This is still WIP as we'll refactor the endpoints to improve support for this before merging.
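One possible way to do the spectrogram -> audio step without PyTorch is a plain inverse STFT with overlap-add in numpy. This is only a sketch: it assumes the vocoder output can be split into per-frame magnitude and phase halves, and the n_fft and hop values are placeholders that would have to match the WavTokenizer head.

import numpy as np

def spectrogram_to_audio(mag, phase, n_fft=1280, hop=320):
    # mag, phase: (n_frames, n_fft // 2 + 1) arrays taken from the vocoder output
    spec = mag * np.exp(1j * phase)                 # per-frame complex half-spectrum
    frames = np.fft.irfft(spec, n=n_fft, axis=1)    # per-frame time-domain signal
    window = np.hanning(n_fft)
    n_samples = hop * (len(frames) - 1) + n_fft
    audio = np.zeros(n_samples)
    norm = np.zeros(n_samples)
    for i, frame in enumerate(frames):
        # windowed overlap-add, with the window energy accumulated for normalization
        start = i * hop
        audio[start:start + n_fft] += frame * window
        norm[start:start + n_fft] += window ** 2
    return audio / np.maximum(norm, 1e-8)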
close #10173
depends on #10853, #10861
Overview
This PR adds inference support for the OuteTTS vocoder (i.e. WavTokenizer) directly into libllama. This enables full text-to-speech generation using llama.cpp.
output.mp4
TTS requires 2 models to be provided: an LLM and a voice decoder. The first one generates audio codes (tokens) from the provided input text, based on some voice settings. The second one converts the audio codes into a spectrogram. The spectrogram is then converted back to audio with inverse FFT.
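To make the two-stage flow concrete, here is a rough client-side sketch against the two llama-server instances shown earlier. The endpoint paths, payload fields, and the handling of the returned audio codes are assumptions mirroring what the WIP tts-outetts.py script would have to do, not the final API.

import requests

def synthesize(text, llm_url, voc_url):
    # Stage 1: the LLM turns the input text into audio codes
    # (endpoint and fields are assumptions).
    r = requests.post(llm_url + "/completion", json={"prompt": text, "n_predict": 1024})
    codes = r.json()["content"]  # audio-code tokens, still in text form

    # Stage 2: the vocoder server (started with --embeddings) maps the codes to
    # spectrogram frames, one embedding vector per code (again an assumption).
    r = requests.post(voc_url + "/embeddings", json={"content": codes})
    frames = r.json()

    # Stage 3: an inverse FFT with overlap-add (e.g. a helper like the
    # spectrogram_to_audio() sketch above) would turn the frames into samples.
    return frames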
Usage
llama-tts example:
llama-tts \
    -m ./models/outetts-0.2-0.5B-llm/ggml-model-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75/ggml-model-f16.gguf \
    -p "Hello world"
Note that the sampling settings of the LLM might need some adjustments.
TODO:
- rename the outetts-voc arch to wav-tokenizer