WIP: Feat/experimental vits #146

lumpidu · 2024-01-12T16:40:15Z

This is a placeholder pull request. I will clean it up and amend it appropriately once we have the new voice is-steinn-xs.onnx, which is based on our own phonemization.

Implement a preliminary runtime for VITS voice `is-steinn-medium.onnx`

The voice `is-steinn-medium.onnx` uses phonemization based on the ancient eSpeak IPA dialect and was trained purely on eSpeak phonemization.
As we are not using eSpeak inside Símarómur, we take a naive approach: emulate the eSpeak IPA dialect and adapt the model inputs with the appropriate phoneme conversions, such as padding every symbol with 0 and adding BOS, EOS, etc.

The resulting voice is quite acceptable for demo purposes and also shows promising runtime performance.
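To make the input adaptation described above more concrete, here is a minimal sketch of turning a list of phoneme symbols into model input ids: BOS first, then each symbol followed by the padding id 0, then EOS. The class name `PhonemeIdEncoder`, the symbols chosen for PAD/BOS/EOS, the tiny id table, and the choice to skip unknown symbols are all assumptions for illustration; the real mapping would come from the model configuration, and the actual implementation in this PR may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Sketch only: names, symbols and ids below are hypothetical.
final class PhonemeIdEncoder {
    static final String PAD = "_"; // assumed padding symbol (id 0)
    static final String BOS = "^"; // assumed begin-of-sequence symbol
    static final String EOS = "$"; // assumed end-of-sequence symbol

    /** Returns BOS, then every known symbol followed by PAD, then EOS. */
    static long[] encode(List<String> symbols, Map<String, Integer> idMap) {
        List<Integer> ids = new ArrayList<>();
        ids.add(idMap.get(BOS));
        for (String symbol : symbols) {
            Integer id = idMap.get(symbol);
            if (id == null) {
                continue; // assumption: silently skip symbols the model does not know
            }
            ids.add(id);
            ids.add(idMap.get(PAD)); // pad after every symbol
        }
        ids.add(idMap.get(EOS));

        // Copy into a long[]; 64-bit integer inputs are assumed here.
        long[] out = new long[ids.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = ids.get(i);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> idMap = Map.of(PAD, 0, BOS, 1, EOS, 2, "h", 20, "a", 14);
        // prints [1, 20, 0, 14, 0, 2]
        System.out.println(Arrays.toString(encode(List.of("h", "a"), idMap)));
    }
}
```

The resulting array is what would be wrapped into an input tensor for the ONNX session, together with its length.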

- Add onnxruntime for the voice model and a new TTSEngineOnnx class, which handles all ONNX model loading and inference.
- Add pronunciation for VITS via the class PronunciationVits, and add corresponding classes for the other pronunciation formats in use: PronuncationFP2 and PronunciationFlite.
- Add a pronunciation dictionary mapping word -> IPA symbols. These symbols use a compressed format without any padding or spaces in between. Therefore, we need to retokenize each IPA pronunciation to split the dictionary entry into single symbols before these can be converted to input ids for the VITS model.
- Add a new POJO class VitsConfig that interprets the VITS model configuration file so that the phonetic alphabet -> phoneme id mapping can be read.
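The retokenization of compressed dictionary entries could be done with a greedy longest-match scan over the known symbol set. The sketch below assumes that approach; the PR does not state which strategy is actually used, and the class name `IpaTokenizer` and the small alphabet are hypothetical. In practice the alphabet would be the key set of the phoneme id mapping read from the model configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: greedy longest-match splitting of an unspaced IPA string.
final class IpaTokenizer {
    /** Splits {@code compressed} into symbols from {@code alphabet}, longest match first. */
    static List<String> tokenize(String compressed, Set<String> alphabet) {
        int maxLen = 0;
        for (String symbol : alphabet) {
            maxLen = Math.max(maxLen, symbol.length());
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < compressed.length()) {
            String match = null;
            // Try the longest candidate first so that multi-character symbols win.
            for (int len = Math.min(maxLen, compressed.length() - pos); len > 0; len--) {
                String candidate = compressed.substring(pos, pos + len);
                if (alphabet.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                throw new IllegalArgumentException(
                        "Unknown symbol at index " + pos + " in '" + compressed + "'");
            }
            tokens.add(match);
            pos += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // \u02D0 is the IPA length mark; "u\u02D0" is treated as one symbol here.
        Set<String> alphabet = Set.of("h", "u", "u\u02D0", "s");
        System.out.println(tokenize("hu\u02D0s", alphabet).size()); // prints 3
    }
}
```

Greedy matching can mis-split when one symbol is a prefix of another, which is presumably one reason a dictionary with explicit separators between phonemes is easier to handle.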

The voice 'is-steinn-medium.onnx' uses phonemization based on the ancient eSpeak IPA dialect and was trained purely on eSpeak phonemization. As we are not using eSpeak inside Símarómur, try a naive approach of emulating the eSpeak IPA dialect and adapt the model inputs with the appropriate phoneme conversions, such as padding every symbol with 0, adding BOS, EOS, etc. The resulting voice is quite acceptable for demo purposes and also shows promising runtime performance.

- add onnxruntime for the voice model and add a new TTSEngineOnnx class
- add pronunciation for VITS via the class PronunciationVits and also add corresponding classes for the other pronunciation formats in use via the classes PronuncationFP2, PronunciationFlite
- add pronunciation dictionary with Word -> IPA symbols. These symbols use a compressed format without any padding or spaces in between. Therefore, we need to retokenize each IPA pronunciation to split the dictionary entry into single symbols before these can be converted to input ids for the VITS model
- add new POJO class VitsConfig that interprets the VITS model configuration file so that the phonetic alphabet -> phoneme id mapping can be read

Signed-off-by: Daniel Schnell <[email protected]>

- Add new phoneme dictionary with space separation between phonemes; adapt the handling of those phonemes
- Add translation of syllable stresses into sampa_ipa_single_flite.tsv
- Move any normalization into the appropriate normalization classes, away from places such as the class AppRepository
- Add correct handling of punctuation/non-word characters for the VITS voice:
  - only phonemize the space between words; all punctuation/non-word characters need to be placed directly after/before the next character
- Normalization:
  - swap `."` => `".` and `,"` => `",`
- Fix some unit tests by increasing Java heap space and reducing the number of concurrent test runners
- Fix some digit normalization tests and add some new ones; some of them are failing

Signed-off-by: Daniel Schnell <[email protected]>
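The quote/punctuation swap listed under normalization can be illustrated with two plain string replacements. This is a minimal sketch, assuming only ASCII double quotes are affected; the class name `QuotePunctuationSwap` is hypothetical and the actual normalization classes in the repository may handle more cases (for example typographic quotes).

```java
// Sketch only: moves a period or comma from inside a closing quote to outside it.
final class QuotePunctuationSwap {
    /** Rewrites ." as ". and ," as ", (ASCII double quotes only). */
    static String swap(String text) {
        return text
                .replace(".\"", "\".")
                .replace(",\"", "\",");
    }

    public static void main(String[] args) {
        // prints: "yes", he said.
        System.out.println(swap("\"yes,\" he said."));
    }
}
```

With the punctuation outside the quote, it ends up directly adjacent to the word boundary, which fits the rule above that only the space between words is phonemized.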

…ilable()

Add handling of the return value of the TTS audioAvailable() callback function. Previously, we simply ignored the return value of the callback, which caused no problems because the calls were immediately discarded anyway. If the user skips an utterance, the callback returns an error. In that case, we now return immediately.

Signed-off-by: Daniel Schnell <[email protected]>
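The early return on a failed callback can be sketched as a chunked write loop that stops as soon as the sink reports an error. To keep the example runnable outside Android, `AudioSink` below is a hypothetical stand-in with the same method shape as Android's `SynthesisCallback.audioAvailable(byte[], int, int)`, and the `SUCCESS`/`ERROR` constants mirror `TextToSpeech.SUCCESS` and `TextToSpeech.ERROR`. The class name `ChunkedAudioWriter` and the chunking logic are assumptions, not the code from this commit.

```java
// Sketch only: stop feeding audio once the callback signals an error.
final class ChunkedAudioWriter {
    static final int SUCCESS = 0; // mirrors TextToSpeech.SUCCESS
    static final int ERROR = -1;  // mirrors TextToSpeech.ERROR

    /** Hypothetical stand-in for the audioAvailable() part of SynthesisCallback. */
    interface AudioSink {
        int audioAvailable(byte[] buffer, int offset, int length);
    }

    /** Writes audio in chunks; returns the number of bytes accepted by the sink. */
    static int write(byte[] audio, int chunkSize, AudioSink sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int offset = 0;
        while (offset < audio.length) {
            int length = Math.min(chunkSize, audio.length - offset);
            if (sink.audioAvailable(audio, offset, length) != SUCCESS) {
                return offset; // e.g. the user skipped the utterance: return immediately
            }
            offset += length;
        }
        return offset;
    }

    public static void main(String[] args) {
        int delivered = write(new byte[10], 4, (buffer, offset, length) -> SUCCESS);
        System.out.println(delivered); // prints 10
    }
}
```

On Android, the chunk size would typically be bounded by the callback's `getMaxBufferSize()`.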

AppRepository#getAssetConfigValueFor():
- really return an empty String if the given key is not found

TTSService#setSpeechMarksToBeginning():
- if the given callback hasn't started, log a warning and return

Signed-off-by: Daniel Schnell <[email protected]>

Increase the RealtimeFactor (RTF) limit that decides whether an audio item is cached to 50.0. We observe that it still makes sense to cache audio even for RTF == 25; therefore, increase the limit to 50.

Additionally, set the latency to VERY_LOW for all voices except ONNX, where we set the latency to NORMAL (~50ms).

Signed-off-by: Daniel Schnell <[email protected]>

lumpidu · 2024-01-29T17:00:32Z

Superseded via #151

lumpidu self-assigned this Jan 12, 2024

lumpidu changed the base branch from master to v1.3.x January 12, 2024 17:01

lumpidu had a problem deploying to CI January 25, 2024 16:50 — with GitHub Actions Failure

lumpidu had a problem deploying to CI January 25, 2024 18:01 — with GitHub Actions Failure

lumpidu had a problem deploying to CI January 27, 2024 12:49 — with GitHub Actions Failure

lumpidu and others added 8 commits January 29, 2024 16:34

fix large digits bug

358259b

fix digit norm

5530726

Add the voice assets for Steinn xs

83df1ae

More error handling

c4dd09b

lumpidu force-pushed the feat/experimental-vits branch from 47a25a7 to edf2125 Compare January 29, 2024 16:36

lumpidu had a problem deploying to CI January 29, 2024 16:36 — with GitHub Actions Failure

lumpidu changed the base branch from v1.3.x to master January 29, 2024 16:43

lumpidu had a problem deploying to CI January 29, 2024 16:45 — with GitHub Actions Failure

lumpidu closed this Jan 29, 2024

lumpidu deleted the feat/experimental-vits branch February 7, 2024 17:05
