Do research on automatically detecting and correcting Bangla spelling mistakes #40

Open · dg1223 opened this issue Sep 9, 2024 · 20 comments

dg1223 (Collaborator) commented Sep 9, 2024

We need to figure out if any work and/or research has been done on automatically detecting and correcting Bangla spelling mistakes. Technically speaking, are there papers, notebooks, or packages that do this job? After we finish this review, we can determine whether we need to devise our own method to automatically clean Bangla text.

An example is the Avro online keyboard. It autocorrects as you write Bangla. I believe it uses some kind of dictionary and an autocorrection package as a backend. I don't know if anyone has open-sourced it or something similar. Finding something like this that we can use for our project would cut down our work by a lot. A dictionary-plus-edit-distance lookup is the usual way such autocorrectors work; a minimal sketch is below.
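A minimal sketch of that idea, assuming a hypothetical Bangla word-frequency dictionary (Avro's actual backend is unknown to us, so this is only an illustration of the technique):

```python
# Minimal sketch of dictionary-based autocorrection of the kind an Avro-like
# tool might use (an assumption; Avro's actual backend is not documented here).
# `bangla_words` is a hypothetical word -> frequency dictionary.
from difflib import get_close_matches

bangla_words = {"বাংলা": 120, "ভাষা": 90, "বানান": 40}  # hypothetical entries

def correct(word: str) -> str:
    """Return the closest dictionary word, or the input if nothing is close."""
    if word in bangla_words:
        return word  # already a known word
    candidates = get_close_matches(word, list(bangla_words), n=3, cutoff=0.7)
    # Among near matches, prefer the most frequent word
    return max(candidates, key=bangla_words.get) if candidates else word
```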

Our main focus is fixing spelling mistakes, but finding resources on fixing grammatical mistakes is also welcome.

We need to invest some time and effort into this R&D work because manually cleaning up extracted text from 200+ books, and maybe newspaper articles in the future, doesn't look feasible. It took 9 volunteers more than 3 weeks to clean up a single book. We could crowdsource it, but that requires funding and additional effort that are not within the scope of our project at this moment.

dg1223 added the data (data related task) and research (Any R&D work) labels Sep 9, 2024
dg1223 added this to the Data extraction and preparation milestone Sep 9, 2024
dg1223 (Collaborator, Author) commented Sep 9, 2024

@mir-abir-hossain As discussed, @mahdiislam79 is interested in working on this issue too. So, both of you can work together on it.

@mahdiislam79 This was originally Abir's idea.

mahdiislam79 (Collaborator):

Okay. I am currently searching for potential research papers. @mir-abir-hossain, let me know if I need to do anything else.

mir-abir-hossain (Owner):

Sounds good. I will open a new tab in the Excel sheet to keep track of the research work. We will soon arrange a meeting about it. Till then, you can keep researching and figuring things out by yourself. Thanks a lot @dg1223 @mahdiislam79 for the initiative.

mahdiislam79 (Collaborator) commented Sep 9, 2024 via email

dg1223 (Collaborator, Author) commented Sep 13, 2024

Check out shothik.ai

I tried it by copying a few paragraphs from chunk 1. It gives a lot of false positives. I thought we could purchase their premium subscription for a month and clean up all the books, but if we get a lot of false positives, it'd be more work for us. Their service isn't even fully working right now.

Reasat commented Nov 10, 2024

Can you share some examples of errors you saw when OCRing (I believe you used pytesseract)?

Where is the location of the digitized sample?

dg1223 (Collaborator, Author) commented Nov 10, 2024

> Can you share some examples of errors you saw when OCRing (I believe you used pytesseract)?
>
> Where is the location of the digitized sample?

Yes, we used pytesseract.
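
For reference, the extraction was done with a call along these lines (a minimal sketch; the exact options may have differed, and the filename is hypothetical):

```python
# Minimal sketch of OCR extraction with pytesseract, assuming Tesseract's
# Bengali language pack is installed (lang="ben"); the filename is hypothetical.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("page_001.png"), lang="ben")
print(text)
```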

Digitized samples: The files without the suffix '_clean'.

You can find examples of errors with line numbers in the pull requests linked to the issues below:

https://github.com/mir-abir-hossain/real-history-of-Bangladesh/issues?q=is%3Aissue+is%3Aclosed+label%3A%22good+first+issue%22

Reasat commented Nov 10, 2024

Seems like the original OCR is noisy due to the quality of the scans. Also, pytesseract is quite old and not robust to noise. Have you considered models that focus specifically on Bangla OCR? For example:
https://github.com/BengaliAI/bbocr
(Also, let me know if the software does not work.)

And if you are using an OCR that does not have a language model built into the decoder, the output text has to go through a language model for post-processing to fix words. For example:
https://www.kaggle.com/code/umongsain/n-gram-language-model-with-kenlm-tranformers
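
Roughly, the idea is to generate candidate corrections for each noisy line and let the LM pick the most fluent one. A minimal sketch, assuming a pretrained Bangla KenLM model (the model path is a placeholder, and the candidate generator is assumed to exist elsewhere):

```python
# Minimal sketch of n-gram LM rescoring with KenLM; "bangla.arpa" is a
# placeholder for a pretrained Bangla model, and candidate generation
# (dictionary/edit-distance based) is assumed to happen elsewhere.
import kenlm

model = kenlm.Model("bangla.arpa")

def best_correction(candidates: list[str]) -> str:
    """Pick the candidate sentence the LM considers most fluent."""
    # score() returns a log10 probability; higher means more fluent
    return max(candidates, key=lambda s: model.score(s, bos=True, eos=True))

# Usage: rescore a noisy OCR line against its proposed corrections
noisy_line = "..."          # an OCR output line
candidates = [noisy_line]   # plus spelling variants from a candidate generator
print(best_correction(candidates))
```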

Are you doing any post-processing before manually fixing it?

dg1223 (Collaborator, Author) commented Nov 11, 2024

Thanks for the suggestions. Much appreciated!

Yes, the quality of the scans is not good. We are yet to try out the alternatives. We'll start with bbOCR.

No, we did not do any post-processing. It is probably the most important step in this project to ensure data quality. So, we'll try out your suggestion.

Our goal was to provide our RAG team with the first batch of data as quickly as possible so that they could start their own experimentation while the Data team does its own OCR research in parallel. I can't say for sure if we saved time by fixing everything manually (post-processing) for the first book; it took 2-3 weeks. Maybe we could've invested in doing the research first, but the manual exercise gave everyone involved some necessary domain knowledge for evaluating RAG outcomes.

Reasat commented Nov 11, 2024

The above link refers to an old repo. The correct one is:
https://github.com/BengaliAI/bbocr-v2

dg1223 (Collaborator, Author) commented Nov 11, 2024

> The above link refers to an old repo. The correct one is: https://github.com/BengaliAI/bbocr-v2

Looks like the link is broken. I can only find bbocr in the BengaliAI account.

Reasat commented Nov 11, 2024

Sorry, the repo was private. I've made it public; it should work now.

dg1223 (Collaborator, Author) commented Nov 11, 2024

Awesome, we'll try it out first. Thank you for your time!

Reasat commented Nov 11, 2024

Also, in the manual check stage, it seems like you are looking at the JSON and the book without setting up an annotation platform. This way, it is very difficult for the annotator to fix the issues. I can help with setting up an annotation platform, and we can think about crowdsourcing the process.

mir-abir-hossain (Owner) commented Nov 11, 2024

Yes, that is right. We have not yet tried any annotation platform; what we did was figure out common mistakes and list them, though this was very manual. Any suggestions are appreciated.

dg1223 (Collaborator, Author) commented Nov 11, 2024

> Also, in the manual check stage, it seems like you are looking at the JSON and the book without setting up an annotation platform. This way, it is very difficult for the annotator to fix the issues. I can help with setting up an annotation platform, and we can think about crowdsourcing the process.

Yes, an annotation platform would be great.

Reasat commented Nov 14, 2024

Sorry, I got a bit busy for a few days. I'll write down some options in 2 days. In the meantime, can you let me know how well the OCR is working? That's the main bottleneck.

dg1223 (Collaborator, Author) commented Nov 20, 2024

> Sorry, I got a bit busy for a few days. I'll write down some options in 2 days. In the meantime, can you let me know how well the OCR is working? That's the main bottleneck.

Sorry for the delay. The OCR package (pytesseract) is producing errors at roughly a 1% error rate.

For a book with 550 pages, if we assume 10 words per line, there are around 200,000 (2 lakh) words in total. So, at a 1% error rate, we have to deal with roughly 2,000 misspelled words, plus around 50 missing lines.

If we want to parse 200 books, that's around 400,000 (4 lakh) misspelled words and 10,000 missing lines.
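
As a sanity check on those numbers (the ~36 lines per page is an assumption inferred from the stated totals, not a measured figure):

```python
# Back-of-the-envelope check of the estimates above. The lines-per-page
# figure (~36) is an assumption inferred from the stated totals.
pages, lines_per_page, words_per_line = 550, 36, 10
words_per_book = pages * lines_per_page * words_per_line      # ~198,000 words
misspelled_per_book = round(words_per_book * 0.01)            # ~1,980 words
missing_lines_per_book = 50
books = 200
print(misspelled_per_book * books)      # ~396,000 (~4 lakh) misspelled words
print(missing_lines_per_book * books)   # 10,000 missing lines
```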

Nirajmahi:

Interested in this issue.
