WIP: Feat/experimental vits #146

Closed
wants to merge 8 commits into from

lumpidu (Collaborator) commented Jan 12, 2024

This is a placeholder pull request. I will clean it up and amend it appropriately once we have the new voice is-steinn-xs.onnx, which is based on our own phonemization.

Implement a preliminary runtime for VITS voice is-steinn-medium.onnx

The voice 'is-steinn-medium.onnx' uses phonemization based on the ancient eSpeak IPA dialect and was trained purely on eSpeak phonemization.
As we are not using eSpeak inside Símarómur, we take a naive approach: emulate the eSpeak IPA dialect and adapt the model inputs with the appropriate phoneme conversions, such as padding every symbol with 0, adding BOS and EOS, etc.
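The input adaptation described above (pad every symbol with 0, add BOS and EOS) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `VitsInputSketch`, the id values for BOS and EOS, the choice to pad after BOS as well, and the decision to skip unknown symbols are all assumptions. The real ids come from the voice's configuration file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turns a list of phoneme symbols into model input ids.
class VitsInputSketch {
    // Assumed ids; the actual values are defined by the model configuration.
    static final int PAD_ID = 0;
    static final int BOS_ID = 1;
    static final int EOS_ID = 2;

    static List<Integer> toInputIds(List<String> symbols, Map<String, Integer> phonemeIds) {
        List<Integer> ids = new ArrayList<>();
        ids.add(BOS_ID);
        ids.add(PAD_ID); // assumption: BOS is followed by padding like any other symbol
        for (String symbol : symbols) {
            Integer id = phonemeIds.get(symbol);
            if (id == null) {
                continue; // assumption: symbols unknown to the model are dropped
            }
            ids.add(id);
            ids.add(PAD_ID); // pad every symbol with 0
        }
        ids.add(EOS_ID);
        return ids;
    }
}
```

With a mapping of `a -> 10` and `b -> 11`, the symbols `a b` would become `1 0 10 0 11 0 2` under these assumptions.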

The resulting voice performance is quite acceptable for demo purposes and also shows promising runtime performance.

  • add onnxruntime for the voice model and add a new TTSEngineOnnx class, which does all onnx model loading and inference handling
  • add Pronunciation for VITS via the class PronunciationVits and also add appropriate classes for the other used pronunciation formats via classes PronuncationFP2, PronunciationFlite
  • add pronunciation dictionary with Word -> IPA symbols. These symbols use a compressed format without any padding or spaces in between. Therefore, we need to retokenize each IPA pronunciation to split the dictionary entry into single symbols before we can convert these to the input ids for the VITS model
  • add new POJO class VitsConfig that interprets the VITS model configuration file so that we can read the phonetic alphabet -> phoneme id mapping
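One way to retokenize a compressed dictionary entry into single symbols is a greedy longest-match scan against the model's phonetic alphabet. The sketch below assumes that strategy; the PR does not state which algorithm is actually used, and the class name `IpaRetokenizerSketch` is hypothetical. The alphabet would be the key set of the phoneme id mapping read from the model configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: splits an unspaced IPA string into known symbols.
class IpaRetokenizerSketch {
    static List<String> retokenize(String ipa, Set<String> alphabet) {
        // Longest symbol length bounds how far we need to look ahead.
        int maxLen = 0;
        for (String s : alphabet) {
            maxLen = Math.max(maxLen, s.length());
        }
        List<String> symbols = new ArrayList<>();
        int i = 0;
        while (i < ipa.length()) {
            int matched = 0;
            // Try the longest candidate first, then shrink.
            for (int len = Math.min(maxLen, ipa.length() - i); len > 0; len--) {
                if (alphabet.contains(ipa.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                throw new IllegalArgumentException("Unknown symbol at index " + i + " in: " + ipa);
            }
            symbols.add(ipa.substring(i, i + matched));
            i += matched;
        }
        return symbols;
    }
}
```

Greedy matching keeps multi-character symbols together when a shorter prefix is also in the alphabet, but it can pick the wrong split for ambiguous entries, which is one reason a space-separated dictionary is easier to handle.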

lumpidu and others added 8 commits January 29, 2024 16:34
The voice 'is-steinn-medium.onnx' uses phonemization based on the ancient
eSpeak IPA dialect and was purely trained on eSpeak phonemization.
As we are not using eSpeak inside Símarómur, try a naive approach in
emulating the eSpeak IPA dialect and adapt the model inputs with the
appropriate phoneme conversions, like padding every symbol with 0,
adding BOS, EOS, etc.

The resulting voice performance is quite acceptable for demo purposes and
also shows promising runtime performance.

- add onnxruntime for the voice model and add a new TTSEngineOnnx class
- add Pronunciation for Vits via the class PronunciationVits and also add
  appropriate classes for the other used pronunciation formats via
  classes PronuncationFP2, PronunciationFlite
- add pronunciation dictionary with Word -> IPA symbols. These symbols use
  a compressed format without any padding or spaces in between. Therefore,
  we need to retokenize each IPA pronunciation again to split the dictionary
  entry into single symbols to be able to convert these to the input ids
  for the Vits model
- Add new pojo class VitsConfig that provides interpretation of the Vits
  model configuration file to be able to read the phonetic alphabet ->
  phoneme id mapping

Signed-off-by: Daniel Schnell <[email protected]>
- Added new phoneme dictionary with space separation between phonemes.
  Adapt the handling of those phonemes
- Add translation of syllable stresses into sampa_ipa_single_flite.tsv
- move any normalization into the appropriate normalization classes away
  from places like e.g. the class AppRepository
- add correct handling of VITS voice for punctuation/non-word characters:
  - only phonemize space between words, all punctuation/non-word characters
    need to be placed directly after/before the next character
- normalization:
  - swap ." => ". and ," => ",
- fix some unit tests via increasing Java heap space and reducing number of
  concurrent test runners
- fix some digit normalization tests and add some new ones, some of which
  are still failing
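
The quote swap from the normalization item above (`."` => `".` and `,"` => `",`) can be illustrated with plain string replacement. This is only a sketch: the class name `QuoteSwapSketch` is hypothetical, and it handles only the ASCII double quote, whereas the real normalizer may also cover typographic quotes.

```java
// Hypothetical sketch: moves a trailing period or comma outside a closing quote.
class QuoteSwapSketch {
    static String swapQuotePunctuation(String text) {
        return text
                .replace(".\"", "\".")  // ." => ".
                .replace(",\"", "\","); // ," => ",
    }
}
```

For example, `He said "yes." Then "no," ok` becomes `He said "yes". Then "no", ok`.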

Signed-off-by: Daniel Schnell <[email protected]>
…ilable()

Add handling of TTS audioAvailable() callback function return value.
Before, we simply ignored the return value of the callback, which did no
harm, as the calls were immediately discarded anyway.

In case the user skips an utterance, the callback returns an error. In that
case, we return immediately.
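
A rough sketch of this early return is shown below. To keep it runnable outside Android, it uses a stand-in interface `AudioSink` instead of the real `SynthesisCallback`; the constants mirror `TextToSpeech.SUCCESS` and `TextToSpeech.ERROR`. The class name, the method name `feed`, and the chunking scheme are assumptions for illustration, not the code of this commit.

```java
// Hypothetical sketch: stop feeding audio as soon as the callback reports an error.
class AudioFeedSketch {
    static final int SUCCESS = 0; // same value as TextToSpeech.SUCCESS
    static final int ERROR = -1;  // same value as TextToSpeech.ERROR

    // Stand-in for the audioAvailable() method of Android's SynthesisCallback.
    interface AudioSink {
        int audioAvailable(byte[] buffer, int offset, int length);
    }

    /** Returns the number of bytes accepted before the sink stopped or the audio ended. */
    static int feed(byte[] audio, int chunkSize, AudioSink sink) {
        int offset = 0;
        while (offset < audio.length) {
            int length = Math.min(chunkSize, audio.length - offset);
            if (sink.audioAvailable(audio, offset, length) != SUCCESS) {
                return offset; // e.g. the user skipped the utterance: return immediately
            }
            offset += length;
        }
        return offset;
    }
}
```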

Signed-off-by: Daniel Schnell <[email protected]>
AppRepository#getAssetConfigValueFor():
- really return empty String in case the given key is not found

TTSService#setSpeechMarksToBeginning():
- if given callback hasn't started, log a warning and return

Signed-off-by: Daniel Schnell <[email protected]>
Increase the limit for caching an audio item if the RealtimeFactor of the
utterance exceeds 50.0.

We observe that it still makes sense to cache audio even for RTF == 25.
Therefore, increase the limit to 50.

Additionally, set the Latency to VERY_LOW for all voices but ONNX, where
we set the latency to NORMAL (~50ms)

Signed-off-by: Daniel Schnell <[email protected]>
lumpidu (Collaborator, Author) commented Jan 29, 2024

Superseded via #151

@lumpidu lumpidu closed this Jan 29, 2024
@lumpidu lumpidu deleted the feat/experimental-vits branch February 7, 2024 17:05