Support hebrew characters #1353

uriva · 2023-10-16T22:33:21Z

🚀 The feature

Support Hebrew characters

Motivation, pitch

Increase user base

Alternatives

No response

Additional context

No response

felixT2K · 2023-10-17T07:09:43Z

Hi @uriva 👋🏼,

Thanks for the feature request.
Feel free to open a PR and add the vocab in:
https://github.com/mindee/doctr/blob/main/doctr/datasets/vocabs.py
and update the corresponding documentation in:
https://github.com/mindee/doctr/blob/main/docs/source/modules/datasets.rst#supported-vocabs

Afterwards you can train your own model, see:
https://mindee.github.io/doctr/using_doctr/custom_models_training.html
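
As a rough illustration of what such a vocab entry could look like, here is a minimal, self-contained sketch. It assumes the vocabs are plain strings stored in a dict keyed by name; the `VOCABS` dict below is a local stand-in, not the real module, and the key names `hebrew_letters` and `hebrew` as well as the helper `out_of_vocab` are hypothetical. Which digits, punctuation, or diacritics belong in the final vocab is a decision for the actual PR.

```python
import string

# Hebrew letters alef..tav, including the five final forms (U+05D0..U+05EA).
hebrew_letters = "".join(chr(cp) for cp in range(0x05D0, 0x05EA + 1))

# Local stand-in for a name -> characters mapping (assumption: not the real module).
VOCABS = {
    "digits": string.digits,
    "punctuation": string.punctuation,
    "hebrew_letters": hebrew_letters,  # hypothetical key name
}
# Hypothetical combined vocab: letters plus digits and ASCII punctuation.
VOCABS["hebrew"] = VOCABS["hebrew_letters"] + VOCABS["digits"] + VOCABS["punctuation"]


def out_of_vocab(text: str, vocab: str) -> list[str]:
    """Return the sorted characters of `text` that the vocab does not cover."""
    return sorted(set(text) - set(vocab))
```

Running `out_of_vocab` over a sample of your labels before training is a cheap way to check that the vocab covers every character the model is expected to predict.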

uriva · 2023-10-17T11:32:55Z

#1355

How many examples are required to train a new language?

felixT2K · 2023-10-18T06:38:26Z

Hi @uriva 👋🏼,

There is no clear-cut answer to this question.
Since Hebrew is based on the English corpus, you can fine-tune our already trained models and do not have to train from scratch.

As a rule of thumb, it is highly recommended to validate on real data (even if you do not have much of it).
For training, you can also try to generate synthetic data.
For example with: https://github.com/clovaai/synthtiger
Or, if possible, label real data with AWS Textract or Azure Document AI.

~50K examples should be a good starting point.
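
Any synthetic generator needs a source of text to render. The sketch below is one simple way to produce random word labels from a character set; it is an illustration only, not part of any tool mentioned above. The function name `random_words` and its parameters are hypothetical, and real Hebrew words taken from a text corpus would usually be a better source than uniformly random strings.

```python
import random

# Hebrew letters alef..tav, including final forms (U+05D0..U+05EA).
HEBREW_LETTERS = "".join(chr(cp) for cp in range(0x05D0, 0x05EA + 1))


def random_words(vocab: str, n: int, min_len: int = 2, max_len: int = 10, seed: int = 0) -> list[str]:
    """Draw `n` random strings over `vocab`, each between min_len and max_len characters."""
    rng = random.Random(seed)  # seeded so the label set is reproducible
    return [
        "".join(rng.choice(vocab) for _ in range(rng.randint(min_len, max_len)))
        for _ in range(n)
    ]


# A small sample; scale `n` up toward the ~50K suggested above for training.
labels = random_words(HEBREW_LETTERS, n=1000)
```

The resulting labels can then be fed to whichever rendering tool you use, while the validation set stays real data as recommended above.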

The current models were trained from scratch on a Mindee internal dataset (French vocab, ~11M word crop images).

uriva added the type: enhancement Improvement label Oct 16, 2023

felixdittrich92 closed this as completed Oct 18, 2023
