Support hebrew characters #1353

Closed
uriva opened this issue Oct 16, 2023 · 3 comments

@uriva
Contributor

uriva commented Oct 16, 2023

🚀 The feature

Support Hebrew characters

Motivation, pitch

Increase user base

Alternatives

No response

Additional context

No response

@uriva uriva added the type: enhancement Improvement label Oct 16, 2023
@felixT2K
Contributor

Hi @uriva 👋🏼,

Thanks for the feature request.
Feel free to open a PR and add the vocab in:
https://github.com/mindee/doctr/blob/main/doctr/datasets/vocabs.py
and update the corresponding documentation in:
https://github.com/mindee/doctr/blob/main/docs/source/modules/datasets.rst#supported-vocabs

Afterwards you can train your own model; see:
https://mindee.github.io/doctr/using_doctr/custom_models_training.html
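As a rough illustration of what such a vocab entry could look like, here is a minimal sketch. The `VOCABS` dict below is a local stand-in, not the real one from `doctr/datasets/vocabs.py`, and the key name `"hebrew"` as well as the choice of extra symbols (digits, ASCII punctuation, the shekel sign) are assumptions made for the example.

```python
import string

# Local stand-in for the dict of named vocabs (assumption: the real file
# maps a vocab name to a plain string of supported characters).
VOCABS = {
    "digits": string.digits,
    "punctuation": string.punctuation,
}

# The Hebrew letters occupy U+05D0 (alef) through U+05EA (tav) in Unicode,
# including the five final forms.
HEBREW_LETTERS = "".join(chr(cp) for cp in range(0x05D0, 0x05EA + 1))

# Hypothetical entry: letters plus digits, punctuation and the shekel sign.
VOCABS["hebrew"] = (
    HEBREW_LETTERS + VOCABS["digits"] + VOCABS["punctuation"] + "\u20aa"
)

# A vocab should not contain duplicate characters.
assert len(set(VOCABS["hebrew"])) == len(VOCABS["hebrew"])
```

Niqqud (vowel points) and cantillation marks are left out here; whether to include them depends on the documents you want to read.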

@uriva
Contributor Author

uriva commented Oct 17, 2023

#1355

How many examples are required to train a new language?

@felixT2K
Contributor

Hi @uriva 👋🏼,

There is no clear-cut answer to this question.
Even though Hebrew uses a different script than the vocab the existing models were trained on, you can fine-tune our already trained models and do not have to train from scratch.

As a rule of thumb, it is highly recommended to validate on real data, even if you only have a small amount of it.
For training you can also try to generate synthetic data.
For example with: https://github.com/clovaai/synthtiger
Or if possible label real data with AWS Textract or Azure Document AI.

~50K examples should be a good starting point.

For reference, the current models were trained from scratch on a Mindee internal dataset (French vocab, ~11M word crop images).
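To make the synthetic-train / real-validation split above more concrete, here is a minimal sketch in plain Python. It only produces text labels: random strings drawn from the Hebrew letters as input for a renderer such as synthtiger, and a filter that keeps real labels whose characters are all in the vocab. The function names, the file names and the filename-to-text mapping are hypothetical, and random strings are a crude stand-in for a real Hebrew word list.

```python
import random

# Hebrew letters, U+05D0 (alef) through U+05EA (tav).
HEBREW_LETTERS = "".join(chr(cp) for cp in range(0x05D0, 0x05EA + 1))


def sample_words(vocab, num_samples, min_chars=1, max_chars=12, seed=0):
    """Draw random strings from `vocab` to use as synthetic text labels."""
    rng = random.Random(seed)  # seeded so the corpus is reproducible
    return [
        "".join(rng.choice(vocab) for _ in range(rng.randint(min_chars, max_chars)))
        for _ in range(num_samples)
    ]


def keep_in_vocab(labels, vocab):
    """Keep only non-empty labels made entirely of characters in `vocab`."""
    allowed = set(vocab)
    return {
        name: text
        for name, text in labels.items()
        if text and set(text) <= allowed
    }


# ~50K synthetic labels for training, as suggested above.
train_words = sample_words(HEBREW_LETTERS, num_samples=50_000)

# Hypothetical real labels (e.g. produced by an external OCR service);
# the second one contains characters outside the vocab and is dropped.
real_labels = {"crop_0.png": "שלום", "crop_1.png": "hello"}
val_labels = keep_in_vocab(real_labels, HEBREW_LETTERS)
```

Filtering matters mostly for automatically labeled real data, where the labeling tool may emit characters the model cannot predict.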
