All chars assumption #6

omrishsu · 2018-03-03T07:51:21Z

Hi,
The train_lstm step writes an “all chars” text file, which assumes that training encounters every character in the corpus. This is not necessarily true: training uses limited data, so it may miss rare characters that show up in the correction step.
Is this OK, or is it something that needs to be addressed?

Thanks!
Omri

jvdzwaan · 2018-03-05T19:57:38Z

Actually, the chars are extracted from all text (train set, test set, and val set).

Whether this is correct (fair) is open for discussion. It is probably more correct to use only the characters in the train set (and maybe the validation set) and add an 'unknown' character for everything else. Presumably the 'unknown' character would then appear only in the input text and not in the output text; otherwise incorrect text will be produced.
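
For illustration, here is a rough Python sketch of the train-only vocabulary with an 'unknown' slot. It is not code from this repository: the names `UNKNOWN`, `build_char_vocab`, and `encode` are hypothetical, and the choice of `"\x00"` as the placeholder assumes that character never occurs in the real text.

```python
# Hypothetical placeholder for characters not seen during training.
# Assumption: "\x00" never occurs in the actual corpus.
UNKNOWN = "\x00"


def build_char_vocab(train_texts, unknown=UNKNOWN):
    """Map each character seen in the training texts to an index.

    Index 0 is reserved for the 'unknown' placeholder.
    """
    chars = sorted(set("".join(train_texts)))
    if unknown in chars:
        raise ValueError("placeholder character occurs in the training data")
    return {c: i for i, c in enumerate([unknown] + chars)}


def encode(text, char_to_index, unknown=UNKNOWN):
    """Encode input text, sending unseen characters to the 'unknown' index."""
    unknown_index = char_to_index[unknown]
    return [char_to_index.get(c, unknown_index) for c in text]


# Usage: "z" was never seen in training, so it maps to the reserved index.
vocab = build_char_vocab(["abc", "cab"])
encoded = encode("abz", vocab)
```

In this sketch only the input side is encoded through `encode`; whether the output side should ever be allowed to emit the placeholder is the open question raised above.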

omrishsu · 2018-03-09T07:39:55Z

I've solved this issue by adding another param with chars to include.

BTW, do you want me to contribute these changes? I feel like it is very specific to my needs, but if you like...
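
A minimal sketch of what such a parameter could look like, assuming the character set is gathered from the texts and then extended with a caller-supplied string. The function `collect_chars` and its `extra_chars` argument are hypothetical names for illustration, not the actual change described above.

```python
def collect_chars(texts, extra_chars=""):
    """Return the sorted set of characters found in the texts,
    plus any characters the caller explicitly asks to include.

    extra_chars is a hypothetical parameter: a string of rare characters
    that may be missing from the training data but are expected later.
    """
    seen = set("".join(texts))
    return sorted(seen | set(extra_chars))


# Usage: "é" does not occur in the texts but is forced into the character set.
chars = collect_chars(["ab", "ba"], extra_chars="é")
```

Duplicates are harmless here because the characters are merged through a set, so passing a character that already occurs in the texts has no effect.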
