unique_text_tokens.k2symbols for non-english languages #13

Open
paulovasconcellos-hotmart opened this issue Feb 9, 2024 · 1 comment

Comments

@paulovasconcellos-hotmart

Hello everyone,
I've noticed that unknown tokens are removed throughout the pipeline, and that unique_text_tokens.k2symbols doesn't contain all the phonemes needed for non-English languages, such as those with accents and other diacritics.

I'm trying to train Pheme in Portuguese, and I was wondering what I should do so the model can handle the accents of my language. Any tips on how to do it?

P.S.: I've also changed the phonemizer backend so that it generates phonemes in PT-BR. espeak supports PT-BR, so it was a no-brainer.

@taras-sereda
Contributor

Hi @paulovasconcellos-hotmart

The symbol table is essentially the set of all unique phones in your dataset.
You can see how to create a unique_text_tokens.k2symbols file in my WIP branch for training a Text 2 Semantic model for the Ukrainian language.

Good luck with your experiments!
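
For readers who want a starting point, here is a minimal sketch of the idea described above: collect every unique phone in the phonemized dataset and assign each one an integer id. It is not the code from the WIP branch. The function names, the `<eps>` reserved symbol, and the whitespace-separated `symbol id` line layout are all assumptions made for illustration, so compare the output against the unique_text_tokens.k2symbols file shipped with the repo before training.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set

# Assumption: reserved symbols that come first; Pheme's actual set may differ.
SPECIAL_SYMBOLS: List[str] = ["<eps>"]


def build_symbol_table(phone_sequences: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Map every unique phone to an integer id, reserved symbols first."""
    phones = {p for seq in phone_sequences for p in seq}
    ordered = SPECIAL_SYMBOLS + sorted(phones - set(SPECIAL_SYMBOLS))
    return {sym: idx for idx, sym in enumerate(ordered)}


def write_symbol_table(table: Dict[str, int], path: str) -> None:
    """Write one 'symbol id' pair per line (assumed text layout)."""
    lines = [f"{sym} {idx}" for sym, idx in table.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def missing_phones(phone_sequences: Iterable[Iterable[str]],
                   table: Dict[str, int]) -> Set[str]:
    """Return phones that occur in the data but not in the table."""
    seen = {p for seq in phone_sequences for p in seq}
    return seen - set(table)


if __name__ == "__main__":
    # Made-up phone sequences, not real espeak output.
    dataset = [["k", "o", "ɾ", "a", "s", "ɐ̃"], ["f", "i", "ʎ", "u"]]
    table = build_symbol_table(dataset)
    write_symbol_table(table, "unique_text_tokens.k2symbols")
    # Phones absent from the table are the ones the pipeline would drop.
    print(missing_phones([["ɲ", "a"]], table))
```

Running `missing_phones` against an existing symbol table before training is a quick way to see which Portuguese phones would otherwise be silently removed as unknown tokens.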
