WIP: Feat/experimental vits #146
Closed
The voice 'is-steinn-medium.onnx' uses phonemization based on the ancient eSpeak IPA dialect and was trained purely on eSpeak phonemization. As we are not using eSpeak inside Símarómur, we try a naive approach: emulate the eSpeak IPA dialect and adapt the model inputs with the appropriate phoneme conversions, like padding every symbol with 0, adding BOS, EOS, etc. The resulting voice quality is quite acceptable for demo purposes and also shows promising runtime performance.
- add onnxruntime for the voice model and add a new TTSEngineOnnx class
- add pronunciation for VITS via the class PronunciationVits and also add corresponding classes for the other pronunciation formats in use: PronuncationFP2, PronunciationFlite
- add a pronunciation dictionary with Word -> IPA symbols. These symbols use a compressed format without any padding or spaces in between. Therefore, we need to retokenize each IPA pronunciation to split the dictionary entry into single symbols, so that these can be converted to the input ids for the VITS model
- add a new POJO class VitsConfig that interprets the VITS model configuration file so that the phonetic alphabet -> phoneme id mapping can be read
Signed-off-by: Daniel Schnell <[email protected]>
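The input conversion described in this commit (pad every symbol with 0, add BOS and EOS) could look roughly like the sketch below. The class name `VitsInputSketch`, the BOS/EOS ids 1 and 2, the exact placement of the pad id, and the tiny id map are assumptions made for illustration; the actual mapping is read from the VITS model configuration file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the PR's actual PronunciationVits code.
final class VitsInputSketch {
    static final long PAD_ID = 0; // the commit message pads every symbol with 0
    static final long BOS_ID = 1; // assumed id for the begin-of-sequence marker
    static final long EOS_ID = 2; // assumed id for the end-of-sequence marker

    /** Builds BOS, PAD, then each known symbol id followed by PAD, then EOS. */
    static long[] toInputIds(List<String> symbols, Map<String, Long> idMap) {
        List<Long> ids = new ArrayList<>();
        ids.add(BOS_ID);
        ids.add(PAD_ID);
        for (String symbol : symbols) {
            Long id = idMap.get(symbol);
            if (id == null) {
                continue; // assumption: symbols unknown to the model are skipped
            }
            ids.add(id);
            ids.add(PAD_ID);
        }
        ids.add(EOS_ID);
        return ids.stream().mapToLong(Long::longValue).toArray();
    }

    public static void main(String[] args) {
        // Made-up ids; a real map would come from the model configuration.
        Map<String, Long> idMap = Map.of("h", 20L, "a", 14L);
        long[] ids = toInputIds(List.of("h", "a"), idMap);
        System.out.println(java.util.Arrays.toString(ids));
    }
}
```

The resulting array is what would be handed to the ONNX session as the phoneme id input tensor, together with its length.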
- Add a new phoneme dictionary with space separation between phonemes, and adapt the handling of those phonemes
- Add translation of syllable stresses into sampa_ipa_single_flite.tsv
- Move all normalization into the appropriate normalization classes, away from places such as the class AppRepository
- Add correct handling of punctuation/non-word characters for the VITS voice:
  - only the space between words is phonemized; all punctuation/non-word characters need to be placed directly after/before the next character
- Normalization:
  - swap ." => ". and ," => ",
- Fix some unit tests by increasing the Java heap space and reducing the number of concurrent test runners
- Fix some digit normalization tests and add some new ones; some of them are still failing
Signed-off-by: Daniel Schnell <[email protected]>
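The quote normalization listed in this commit (swap `."` => `".` and `,"` => `",`) can be sketched as two literal replacements. The class and method names are hypothetical, and the sketch assumes plain ASCII double quotes only; other quotation marks would need their own rules.

```java
// Hypothetical sketch of the quote/punctuation swap, not the PR's normalizer.
final class QuotePunctuationSketch {
    /** Moves a period or comma from inside a closing ASCII quote to outside it. */
    static String swapQuotePunctuation(String text) {
        return text
                .replace(".\"", "\".")  // ."  =>  ".
                .replace(",\"", "\","); // ,"  =>  ",
    }

    public static void main(String[] args) {
        System.out.println(swapQuotePunctuation("He said \"yes.\""));
        System.out.println(swapQuotePunctuation("\"no,\" she said"));
    }
}
```

With the punctuation moved outside the quote, it ends up next to the following space or token rather than being enclosed by the quote character.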
…ilable()
Add handling of the return value of the TTS audioAvailable() callback function. Previously, we simply ignored the return value of the callback, which did no harm because the calls were immediately discarded anyway. If the user skips an utterance, the callback returns an error. In that case, we now return immediately.
Signed-off-by: Daniel Schnell <[email protected]>
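A minimal sketch of the behaviour described here: feed audio in chunks and stop as soon as the callback reports anything other than success. Android classes cannot run off-device, so `AudioSink` and the `SUCCESS` constant below are stand-ins modelled on the shape of `SynthesisCallback.audioAvailable(byte[], int, int)`; the class name, method name, and chunking are assumptions, not the PR's code.

```java
// Hypothetical stand-in for the Android TTS callback handling.
final class AudioFeedSketch {
    static final int SUCCESS = 0; // stand-in mirroring TextToSpeech.SUCCESS

    /** Stand-in with the same shape as SynthesisCallback.audioAvailable(). */
    interface AudioSink {
        int audioAvailable(byte[] buffer, int offset, int length);
    }

    /** Feeds audio in chunks; returns the number of bytes the sink accepted. */
    static int feed(byte[] audio, int chunkSize, AudioSink sink) {
        int offset = 0;
        while (offset < audio.length) {
            int length = Math.min(chunkSize, audio.length - offset);
            if (sink.audioAvailable(audio, offset, length) != SUCCESS) {
                return offset; // e.g. the user skipped the utterance: stop at once
            }
            offset += length;
        }
        return offset;
    }

    public static void main(String[] args) {
        int accepted = feed(new byte[10], 4, (buffer, offset, length) -> SUCCESS);
        System.out.println("accepted bytes: " + accepted);
    }
}
```

Returning early avoids synthesizing and pushing further chunks for an utterance that is no longer wanted.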
AppRepository#getAssetConfigValueFor():
- really return an empty String if the given key is not found
TTSService#setSpeechMarksToBeginning():
- if the given callback hasn't started, log a warning and return
Signed-off-by: Daniel Schnell <[email protected]>
Increase the RealtimeFactor (RTF) limit for caching an audio item to 50.0. We observe that it still makes sense to cache audio even for RTF == 25, so the limit is raised to 50. Additionally, set the latency to VERY_LOW for all voices except ONNX, where we set the latency to NORMAL (~50ms).
Signed-off-by: Daniel Schnell <[email protected]>
lumpidu force-pushed the feat/experimental-vits branch from 47a25a7 to edf2125 on January 29, 2024 16:36
Superseded via #151
This is a placeholder Pull Request. I will clean this up and amend it appropriately once we have the new voice `is-steinn-xs.onnx`, which is based on our own phonemization.

Implement a preliminary runtime for VITS voice `is-steinn-medium.onnx`

The voice 'is-steinn-medium.onnx' uses phonemization based on the ancient eSpeak IPA dialect and was trained purely on eSpeak phonemization.
As we are not using eSpeak inside Símarómur, we try a naive approach: emulate the eSpeak IPA dialect and adapt the model inputs with the appropriate phoneme conversions, like padding every symbol with `0`, adding `BOS`, `EOS`, etc.

The resulting voice quality is quite acceptable for demo purposes and also shows promising runtime performance.

- add `onnxruntime` for the voice model and add a new `TTSEngineOnnx` class, which does all ONNX model loading and inference handling
- add pronunciation for VITS via the class `PronunciationVits` and also add corresponding classes for the other pronunciation formats in use: `PronuncationFP2`, `PronunciationFlite`
- add a pronunciation dictionary with `Word -> IPA symbols`. These symbols use a compressed format without any padding or spaces in between. Therefore, we need to retokenize each IPA pronunciation to split the dictionary entry into single symbols, so that these can be converted to the input ids for the VITS model
- add a new POJO class `VitsConfig` that interprets the VITS model configuration file so that the `phonetic alphabet -> phoneme id` mapping can be read
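One way to retokenize a compressed dictionary entry into single symbols is a greedy longest-match scan against the model's symbol inventory. The sketch below assumes that approach; the PR does not state which strategy it uses, and the class name, method name, and the small inventory are hypothetical. Non-ASCII IPA characters are written as Unicode escapes to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical greedy longest-match retokenizer, not the PR's implementation.
final class IpaRetokenizerSketch {
    /** Splits a compact IPA string into symbols, preferring the longest match. */
    static List<String> tokenize(String compact, Set<String> inventory) {
        int maxLen = 1;
        for (String symbol : inventory) {
            maxLen = Math.max(maxLen, symbol.length());
        }
        List<String> symbols = new ArrayList<>();
        int i = 0;
        while (i < compact.length()) {
            int matched = 0;
            // Try the longest candidate first, then shrink (UTF-16 code units).
            for (int len = Math.min(maxLen, compact.length() - i); len >= 1; len--) {
                if (inventory.contains(compact.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            if (matched == 0) {
                throw new IllegalArgumentException(
                        "No symbol matches at index " + i + " of: " + compact);
            }
            symbols.add(compact.substring(i, i + matched));
            i += matched;
        }
        return symbols;
    }

    public static void main(String[] args) {
        // "t\u02B0" is t followed by MODIFIER LETTER SMALL H (aspirated t).
        Set<String> inventory = Set.of("a", "u", "au", "s", "t", "t\u02B0");
        System.out.println(tokenize("aust\u02B0", inventory));
    }
}
```

Greedy matching is simple but can pick the wrong split when the inventory is ambiguous, which is one reason a dictionary with space-separated phonemes is easier to handle.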