Using with Additional corpus of spelling mistakes. #39

Open
willwade opened this issue Jan 18, 2024 · 3 comments

Comments

@willwade

I'm considering using this as a service behind an app for the disabled people we support, who would use it to communicate. Many of these users tap out letters but often never enter a space. The snag is that they also make spelling errors. (See https://youtu.be/SDkE-aO3tOQ?si=0GAUyTKDh-q_sAxm and a quick iOS app we made, https://github.com/AceCentre/DragToSpeak; we are now contemplating a REST API built largely on wordsegment.)

So I was wondering about extending the standard corpus with something like https://www.dcs.bbk.ac.uk/~ROGER/corpora.html

I read this https://stackoverflow.com/a/32364566/1123094

It looks like I can create a file of bigrams or unigrams with weights and add it to the standard corpus. Right? Or is there a better way?

@grantjenks
Owner

Modifying the unigrams and bigrams is the best way I can think of. You’ll have to account for every typo variation of every word though. There may be a way to modify the algorithm instead but I’m not sure.
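A minimal sketch of that approach, assuming the unigrams are held in a plain dict of word to count. The function name `add_misspellings` and the sample counts are made up for illustration and are not part of wordsegment. It also builds a typo-to-word map, so segmented output can be mapped back to the correct spelling.

```python
def add_misspellings(unigrams, misspellings):
    """Return (merged_unigrams, corrections).

    unigrams: dict mapping word -> count.
    misspellings: dict mapping a correct word -> list of known typos.
    """
    merged = dict(unigrams)
    corrections = {}
    for correct, typos in misspellings.items():
        if correct not in unigrams:
            continue  # no count to borrow, so skip this word
        for typo in typos:
            if typo in unigrams:
                continue  # the "typo" is itself a real word; leave it alone
            merged[typo] = unigrams[correct]
            corrections[typo] = correct
    return merged, corrections


# Illustrative counts only; real counts come from the corpus files.
unigrams = {"alcohol": 1000.0, "free": 5000.0}
misspellings = {"alcohol": ["alchol", "alcahol"], "free": ["fre"]}

merged, corrections = add_misspellings(unigrams, misspellings)
segmented = ["alchol", "free"]  # stand-in for the output of a segmenter
fixed = [corrections.get(word, word) for word in segmented]
print(fixed)  # ['alcohol', 'free']
```

With wordsegment itself, the merged dict could presumably be applied after `load()` with something like `wordsegment.UNIGRAMS.update(merged)`, but that is an assumption worth checking against the installed version.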

Certainly AI models can do it but I don’t know about scope and scale.

@willwade
Author

Yeah. GPT can do it - and multilingually. But it feels like a huge hammer to crack a nut. Thanks

@willwade
Author

willwade commented Jan 18, 2024

If anyone is interested, I've got a complete modified unigrams JSON in this repo, and code to read in the spelling mistakes here:

https://github.com/AceCentre/Correct-A-Sentence/blob/main/helper-scripts/create_unigrams_spellingerrors.py

I dare say there is some madness in my logic: for each misspelling I am using the weight of the correctly spelled word, which may be a bad idea.
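One hedged alternative to copying the weight unchanged is to scale it by a factor below 1, so the correct spelling still outranks its typos. The factor 0.01 below is an arbitrary assumption, not a tuned value, and the counts are illustrative. If the error corpus records how often each typo occurs, those frequencies would be a better guide.

```python
import math


def typo_weight(correct_count, discount=0.01):
    """Weight for a misspelling: a fraction of the correct word's count."""
    return correct_count * discount


def log_prob(count, total):
    """Unigram log10 probability, the kind of score a segmenter sums."""
    return math.log10(count / total)


total = 1_000_000.0  # illustrative corpus size
alcohol = 1_000.0    # illustrative count for the correct word
alchol = typo_weight(alcohol)

# Difference in log10 score between the correct word and its typo.
gap = log_prob(alcohol, total) - log_prob(alchol, total)
print(round(gap, 6))  # 2.0, i.e. the typo scores 100 times less likely
```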

NB: Can someone clarify something for me? I've updated the unigrams JSON. Should I be updating the bigrams JSON too, with the misspellings? E.g. where it has "</s/> alcohol": 541645.0, also add "</s/> alchol": 541645.0, etc.
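For what it's worth, here is a sketch of how such bigram entries could be generated mechanically, assuming the bigrams are a dict keyed by two space-separated tokens as in the example above. The function name and the "alcohol free" count are illustrative. As far as I understand wordsegment's scoring, a missing bigram falls back to the unigram score, so these entries would only add context, but that is worth verifying against the source.

```python
def add_bigram_misspellings(bigrams, corrections):
    """Copy each bigram once per typo, swapping the typo in for the word.

    bigrams: dict mapping "first second" -> count.
    corrections: dict mapping typo -> correct word.
    """
    # Invert the map: correct word -> list of its typos.
    typos_for = {}
    for typo, correct in corrections.items():
        typos_for.setdefault(correct, []).append(typo)

    merged = dict(bigrams)
    for key, count in bigrams.items():
        first, second = key.split()
        # Only one side is replaced at a time, to keep the growth bounded.
        for typo in typos_for.get(first, []):
            merged.setdefault(f"{typo} {second}", count)
        for typo in typos_for.get(second, []):
            merged.setdefault(f"{first} {typo}", count)
    return merged


# Start token written as quoted in this thread; the second count is made up.
bigrams = {"</s/> alcohol": 541645.0, "alcohol free": 1000.0}
corrections = {"alchol": "alcohol"}

merged = add_bigram_misspellings(bigrams, corrections)
print(sorted(merged))
```

Existing keys are never overwritten (`setdefault`), so a typo that already appears in a real bigram keeps its original count.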

willwade changed the title from "Using with Addotional corpus of spelling mistakes." to "Using with Additional corpus of spelling mistakes." on Jan 18, 2024