Do research on automatically detecting and correcting Bangla spelling mistakes #40
Comments
@mir-abir-hossain As discussed, @mahdiislam79 is interested in working on this issue too. So, both of you can work together on it. @mahdiislam79 This was originally Abir's idea. |
Okay. I am currently searching for potential research papers. @mir-abir-hossain let me know if I need to do anything else. |
Sounds good. I will open a new tab in the excel sheet for keeping track of the research work. We will soon arrange a meeting about it. Till then you can keep researching and figuring out things by yourself. Thanks a lot @dg1223 @mahdiislam79 for the initiative. |
Check out shothik.ai. I tried it by copying a few paragraphs from chunk 1, and it gives a lot of false positives. I thought we could purchase their premium subscription for a month and clean up all the books, but if we get a lot of false positives, it'd be more work for us. Their service isn't even fully working right now. |
Can you share some examples of errors you saw when OCRing (I believe you used pytesseract)? Where is the digitized sample located? |
Yes, we used pytesseract. Digitized samples: The files without the suffix '_clean'. You can find examples of errors with line numbers in the pull requests linked to the issues below: |
It seems the original OCR output is noisy due to the quality of the scans. Also, pytesseract is very old and not robust to noise. Have you considered models that focus on Bangla OCR? For example, this. And if you are using an OCR that does not have a language model built into the decoder, the output text has to go through a language model for post-processing to fix words. Are you doing any post-processing before manually fixing it? |
Thanks for the suggestions. Much appreciated! Yes, the quality of the scans is not good. We are yet to try out the alternatives. We'll start with bbOCR. No, we did not do any post processing. It is probably the most important step in this project to ensure data quality. So, we'll try out your suggestion. Our goal was to provide our RAG team with the first batch of data as quickly as possible so that they could start their own experimentation while the Data team does its own OCR research in parallel. I can't say for sure if we saved time by fixing everything manually (post-processing) for the first book. It took 2-3 weeks. Maybe we could've invested in doing the research first but the manual exercise gave everyone involved some necessary domain knowledge to evaluate RAG outcome. |
the above link refers to an old repo |
Looks like the link is broken. I can only find bbocr in the BengaliAI account. |
Sorry, the repo was private. Made it public, should work now. |
Awesome, we'll try it out first. Thank you for your time! |
Also, in the manual check stage, it seems like you are looking at the json and the book without setting up an annotation platform. This way is very difficult for the annotator to fix the issues. I can help with setting up an annotation platform and we can think about crowdsourcing the process. |
Yes, that is right. We have not yet tried any annotation platform. What we did was figure out common mistakes and list them, though this was very manual. Any suggestions are appreciated. |
Yes, an annotation platform would be great. |
Sorry, got a bit busy for a few days. I'll write down some options after 2 days. In the meantime, can you let me know how well the OCR is working? That's the main bottleneck. |
Sorry for the delay. The OCR package (pytesseract) has a word error rate of roughly 1%. For a book with 550 pages, if we consider 10 words per line, there are around 200,000 (2 lakh) words. So, we have to deal with roughly 2,000 misspelled words and around 50 missing lines per book. If we want to parse 200 books, that's around 400,000 (4 lakh) misspelled words and 10,000 missing lines. |
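The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and parameter names are made up for illustration, and the inputs are the rough figures quoted in the comment (200,000 words per book, 1% error rate, 50 missing lines per book, 200 books).

```python
def estimate_cleanup_load(words_per_book, word_error_rate,
                          missing_lines_per_book, num_books):
    """Rough estimate of manual cleanup work (hypothetical helper)."""
    # Misspelled words in one book at the given error rate.
    misspelled_per_book = round(words_per_book * word_error_rate)
    return {
        "misspelled_per_book": misspelled_per_book,
        "misspelled_total": misspelled_per_book * num_books,
        "missing_lines_total": missing_lines_per_book * num_books,
    }


# Figures quoted in the thread, all approximate.
estimate = estimate_cleanup_load(
    words_per_book=200_000,
    word_error_rate=0.01,
    missing_lines_per_book=50,
    num_books=200,
)
print(estimate)
```

Changing `word_error_rate` shows how much a better OCR engine would reduce the manual workload before any post-processing is applied.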
Interested in this issue. |
We need to figure out if any work and/or research has been done on automatically detecting and correcting Bangla spelling mistakes. Technically speaking, are there papers, notebooks, or packages that do this job? After we finish this review, we can determine if we need to figure out our own method to automatically clean Bangla text.
An example is the Avro online keyboard. It autocorrects as you write Bangla. I believe it is using some kind of dictionary and an autocorrection package as a backend. I don't know if anyone has open sourced it or something similar. Finding something like this that we can use for our project would cut down our work by a lot.
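To make the "dictionary plus autocorrection" idea concrete, here is a minimal sketch using only Python's standard library (`difflib.get_close_matches`). It is not how Avro works, which we have not verified; it is just one simple baseline. The vocabulary, the function name `correct_text`, and the `cutoff` value are all assumptions for illustration. A real word list would have to come from a Bangla dictionary or corpus.

```python
import difflib

# Hypothetical toy vocabulary; a real one would come from a Bangla word list.
VOCABULARY = ["বাংলা", "ভাষা", "বই"]


def correct_text(text, vocabulary, cutoff=0.8):
    """Replace each out-of-vocabulary token with its closest vocabulary word.

    Tokens are split on whitespace. A token is left unchanged when it is
    already in the vocabulary or when no candidate reaches the cutoff.
    """
    known = set(vocabulary)
    corrected = []
    for token in text.split():
        if token in known:
            corrected.append(token)
            continue
        # Closest match by difflib's similarity ratio, if any.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


# "বাংল" is missing the final vowel sign of "বাংলা".
print(correct_text("বাংল ভাষা", VOCABULARY))
```

A lower `cutoff` catches more errors but also produces more false corrections, which is the same trade-off we saw with false positives in existing tools. Punctuation attached to words is not handled here and would need a proper tokenizer.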
Our main focus is fixing spelling mistakes but finding resources on fixing grammatical mistakes is also welcome.
We need to invest some time and effort into this R&D work because manually cleaning up extracted texts from 200+ books, and maybe newspaper articles in the future, doesn't look feasible. It took 9 volunteers more than 3 weeks to clean up a single book. We can crowdsource it, but that requires funding and additional effort, which are not within the scope of our project at this moment.