unique_text_tokens.k2symbols for non-english languages #13

paulovasconcellos-hotmart · 2024-02-09T16:41:57Z

Hello everyone,
I've noticed that unknown tokens are removed throughout the pipeline, and that unique_text_tokens.k2symbols doesn't contain all the phonemes needed for non-English languages, such as those carrying accents and other diacritics.

I'm trying to train pheme in Portuguese, and I was wondering what I should do so the model can understand the accents of my language. Any tips on how to do it?

P.S.: I've also changed the phonemizer backend, so it could generate phonemes in PT-BR. espeak is available in PT-BR, so it was a no-brainer.


taras-sereda · 2024-02-21T17:54:48Z

Hi @paulovasconcellos-hotmart

The symbol table is essentially the set of all unique phones in your dataset.
You can see how to create a unique_text_tokens.k2symbols file in my WIP branch for training the Text 2 Semantic model for the Ukrainian language.
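
To illustrate the idea of a symbol table as the set of unique phones, here is a minimal sketch. It is not the code from the WIP branch. It assumes each utterance has already been phonemized and split into a list of phone tokens, and that the file uses the plain "symbol id" one-per-line layout with `<eps>` at ID 0; please check both assumptions against the existing unique_text_tokens.k2symbols file. The helper names (`build_symbol_table`, `format_symbol_table`, `find_missing`) are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Set


def build_symbol_table(
    utterances: Iterable[Sequence[str]],
    specials: Sequence[str] = ("<eps>",),  # assumption: <eps> is reserved at ID 0
) -> Dict[str, int]:
    """Collect every unique phone token and assign it an integer ID."""
    phones: Set[str] = set()
    for utterance in utterances:
        phones.update(utterance)
    # Special symbols come first; the remaining phones are sorted for stable IDs.
    symbols = list(specials) + sorted(phones - set(specials))
    return {symbol: index for index, symbol in enumerate(symbols)}


def format_symbol_table(table: Dict[str, int]) -> str:
    """Render the table as 'symbol id' lines (assumed file layout)."""
    ordered = sorted(table.items(), key=lambda item: item[1])
    return "\n".join(f"{symbol} {index}" for symbol, index in ordered) + "\n"


def find_missing(
    utterances: Iterable[Sequence[str]], table: Dict[str, int]
) -> Set[str]:
    """Return phones that occur in the data but are absent from the table."""
    seen: Set[str] = set()
    for utterance in utterances:
        seen.update(utterance)
    return seen - set(table)


if __name__ == "__main__":
    # Toy phonemized data; nasal vowels are kept as single tokens on purpose.
    data = [["o", "l", "a"], ["p", "ɐ̃", "w̃"]]
    table = build_symbol_table(data)
    with open("unique_text_tokens.k2symbols", "w", encoding="utf-8") as handle:
        handle.write(format_symbol_table(table))
    print(find_missing([["ʎ", "a"]], table))
```

Running `find_missing` against an existing table before training shows which phones would otherwise be dropped as unknown tokens. Keeping a base character and its combining diacritic together as one token is a design choice here; splitting them per character would produce a different table.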

Good luck with your experiments!
