All chars assumption #6

Open
omrishsu opened this issue Mar 3, 2018 · 2 comments

omrishsu commented Mar 3, 2018

Hi,
The train_lstm step writes an “all chars” text file, which assumes that training encounters every character in the corpus. That is not necessarily true: training uses limited data, so it may miss rare characters that only show up in the correction step.
Is this OK, or is it something that needs to be addressed?

Thanks!
Omri

Collaborator

jvdzwaan commented Mar 5, 2018

Actually, the chars are extracted from all text (train set, test set, and val set).

Whether this is correct (fair) is open for discussion. It is probably more correct to use only the characters in the train set (and maybe the validation set) and add an 'unknown' character. The 'unknown' character should then only appear in the input text, and not in the output text; otherwise incorrect text will be produced.
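
For illustration, here is a minimal sketch of the 'unknown' character idea. It is not the project's actual code: `UNKNOWN`, `build_char_index`, and `encode` are hypothetical names, and the choice of placeholder symbol is an assumption.

```python
# Hypothetical placeholder symbol; any character that cannot occur in the
# training text would do.
UNKNOWN = "\x00"


def build_char_index(train_texts, unknown=UNKNOWN):
    """Map each character seen in the training texts to an integer index.

    Index 0 is reserved for the 'unknown' character.
    """
    chars = sorted(set("".join(train_texts)))
    if unknown in chars:
        raise ValueError("the unknown placeholder occurs in the training text")
    return {c: i for i, c in enumerate([unknown] + chars)}


def encode(text, char_index, unknown=UNKNOWN):
    """Encode input text; characters not seen in training map to 'unknown'."""
    unk = char_index[unknown]
    return [char_index.get(c, unk) for c in text]


char_index = build_char_index(["abc", "cab"])
encoded = encode("abzc", char_index)  # 'z' was never seen during training
print(encoded)
```

With this setup, only input encoding needs the fallback. Whether the model can also be kept from emitting the 'unknown' index on the output side depends on how the targets are built, which this sketch does not cover.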

Author

omrishsu commented Mar 9, 2018

I've solved this issue by adding another param with chars to include.

BTW, do you want me to contribute these changes? I feel like they are very specific to my needs, but if you like...
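
The change itself is not shown in this thread, so the following is only a rough sketch of what a "chars to include" parameter could look like. `collect_chars` and `extra_chars` are hypothetical names, not the project's API.

```python
def collect_chars(texts, extra_chars=""):
    """Return the sorted character set of the texts plus any extra characters.

    extra_chars lets the caller add rare characters that are expected in the
    correction step but may be missing from the limited training data.
    """
    return sorted(set("".join(texts)) | set(extra_chars))


chars = collect_chars(["abc", "cab"], extra_chars="é")
print(chars)
```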
