Using with an additional corpus of spelling mistakes #39
Comments
Modifying the unigrams and bigrams is the best way I can think of. You’ll have to account for every typo variation of every word though. There may be a way to modify the algorithm instead but I’m not sure. Certainly AI models can do it but I don’t know about scope and scale.
Yeah. GPT can do it, and multilingually. But it feels like a huge hammer to crack a nut. Thanks
If anyone is interested, I've got a complete modified unigrams JSON in this repo, and code to read in spelling mistakes here. I dare say there is some madness in my logic: I am using the weight of the correctly spelled word for each misspelling, which may be a bad idea. NB: Can someone clarify something for me? I've updated the unigrams JSON. Should I be updating the bigrams JSON with the misspellings too, e.g. for "</s/> alcohol": 541645.0, also add "</s/> alchol": 541645.0, etc.?
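The approach described above (giving each misspelling the weight of its correctly spelled word, in both the unigrams and the bigrams) can be sketched roughly as follows. This is a minimal sketch, not the code from the linked repo: the function names, the `discount` parameter, and the assumption that bigram keys are two tokens separated by a single space are all illustrative.

```python
from itertools import product


def add_misspelled_unigrams(unigrams, misspellings, discount=1.0):
    """Return a copy of `unigrams` with typos added.

    Each typo borrows the weight of its correct word, scaled by `discount`
    (hypothetical knob; 1.0 reproduces "same weight as the correct word").
    """
    merged = dict(unigrams)
    for correct, typos in misspellings.items():
        base = unigrams.get(correct)
        if base is None:
            continue  # no weight to borrow from, so skip this word
        for typo in typos:
            # setdefault never overwrites a real word that matches a typo
            merged.setdefault(typo, base * discount)
    return merged


def add_misspelled_bigrams(bigrams, misspellings, discount=1.0):
    """Return a copy of `bigrams` with every typo variant of each pair added.

    Assumes keys look like "first second" (an assumption, not a known format).
    """
    merged = dict(bigrams)
    for key, count in bigrams.items():
        words = key.split(" ")
        if len(words) != 2:
            continue  # skip keys that do not match the assumed format
        options = [[w] + list(misspellings.get(w, [])) for w in words]
        for first, second in product(*options):
            # the original pair is already present, so it is left untouched
            merged.setdefault(f"{first} {second}", count * discount)
    return merged


misspellings = {"alcohol": ["alchol"]}
unigrams = add_misspelled_unigrams({"alcohol": 541645.0}, misspellings)
bigrams = add_misspelled_bigrams({"drink alcohol": 120.0}, misspellings)
```

A `discount` below 1.0 would make the segmenter prefer the correct spelling when both fit. The segmenter would still output the typo itself, so a reverse lookup from typo to correct word would probably be needed afterwards.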
I’m considering using this as a service behind an app for the disabled people we support, who would use it to communicate. We see a lot of users who communicate by tapping on letters but often never enter a space. The snag is that they do make spelling errors. (See https://youtu.be/SDkE-aO3tOQ?si=0GAUyTKDh-q_sAxm and a quick iOS app we made, https://github.com/AceCentre/DragToSpeak. We are now contemplating a REST API built largely on word segment.)
So I was wondering about adding to the standard corpus with something like https://www.dcs.bbk.ac.uk/~ROGER/corpora.html
I read this https://stackoverflow.com/a/32364566/1123094
It looks like I can create a file of bigrams or unigrams with weights and add it to the standard corpus. Is that right, or is there a better way?
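As a rough illustration of the first step, here is a sketch of reading a misspelling corpus into a mapping from correct word to typos. It assumes a layout where each correct word sits on its own line prefixed with `$` and its misspellings follow one per line; check the downloaded file against this before relying on it. The function name is hypothetical.

```python
def parse_dollar_format(lines):
    """Map each correct word to its list of misspellings.

    Assumed layout (verify against the real file):
        $correct
        typo1
        typo2
    """
    misspellings = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("$"):
            current = line[1:].lower()
            misspellings.setdefault(current, [])
        elif current is not None:
            # keep only the first token in case a count follows the typo
            typo = line.split()[0].lower()
            if typo != current and typo not in misspellings[current]:
                misspellings[current].append(typo)
    return misspellings


sample = ["$alcohol", "alchol", "alcahol", "", "$necessary", "neccessary 3"]
parsed = parse_dollar_format(sample)
```

The resulting dictionary can then be used to decide which extra unigram or bigram entries to write out with their weights.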