Onboarding: project character and word statistics #277

mmartin9684-sil · 2024-01-18T15:28:06Z

During onboarding, when the user supplies a translation or back translation, it would be helpful to capture some statistics about the text of the project, such as:

words and word frequencies
characters and character frequencies
distribution plots of this word and character information
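
As a rough illustration of the counts listed above, here is a minimal sketch using only the Python standard library. The function name `text_statistics` is hypothetical, and the `\w+` word pattern is an assumption: it is a crude stand-in for whatever tokenization suits a given project's script, and it will split words at combining marks that NFC normalization cannot compose.

```python
import re
import unicodedata
from collections import Counter


def text_statistics(lines):
    """Return (word_counts, char_counts) for an iterable of text lines.

    Hypothetical helper, not an existing API in this project.
    """
    word_counts = Counter()
    char_counts = Counter()
    for line in lines:
        # Normalize so that composed and decomposed forms count as one character.
        line = unicodedata.normalize("NFC", line)
        # Assumption: a "word" is a run of Unicode word characters.
        word_counts.update(re.findall(r"\w+", line))
        # Count every non-whitespace character, punctuation included.
        char_counts.update(ch for ch in line if not ch.isspace())
    return word_counts, char_counts
```

`word_counts.most_common(n)` and `char_counts.most_common(n)` then give the ranked frequencies that a distribution plot could be drawn from.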

In addition, after a preprocess/train/test run, it would be helpful to capture some token and word statistics indicating which tokens and words were part of the train, validation, and/or test sets for both the source and target texts. In particular, it would be helpful to flag any inconsistencies, that is, tokens or words in the source or target validation or test sets that were not part of the training set.
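
The inconsistency check described here could look something like the following sketch. The names `unseen_in_training` and `split_report`, and the `"train"`/`"val"`/`"test"` keys, are hypothetical; it assumes each split is already available as a flat list of tokens (or words) for one side, source or target.

```python
from collections import Counter


def unseen_in_training(train_tokens, heldout_tokens):
    """Count held-out tokens that never occur in the training tokens."""
    train_vocab = set(train_tokens)
    return Counter(tok for tok in heldout_tokens if tok not in train_vocab)


def split_report(splits):
    """Map each non-train split name to its Counter of unseen tokens.

    Assumption: `splits` is a dict with a "train" key plus any held-out splits.
    """
    return {
        name: unseen_in_training(splits["train"], tokens)
        for name, tokens in splits.items()
        if name != "train"
    }
```

Running `split_report` once for the source side and once for the target side, at both the token and the word level, would cover the four combinations mentioned above.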

mmartin9684-sil added the enhancement label Jan 18, 2024

mmartin9684-sil assigned Enkidu93 Jan 18, 2024
