Do research on automatically detecting and correcting Bangla spelling mistakes #40

dg1223 · 2024-09-09T10:54:12Z

We need to figure out if any work and/or research has been done on automatically detecting and correcting Bangla spelling mistakes. Technically speaking, are there papers, notebooks, or packages that does this job? After we finish this review, we can determine if we need to figure out our own method to automatically clean Bangla text.

An example is the Avro online keyboard. It autocorrects as you write Bangla. I believe it is using some kind of dictionary and an autocorrection package as a backend. I don't know if anyone has open sourced it or something similar. Finding something like this that we can use for our project would cut down our work by a lot.

Our main focus is fixing spelling mistakes but finding resources on fixing grammatical mistakes is also welcome.

We need to invest some time and effort into this R&D work because manually cleaning up extracted texts 200+ books and maybe newspaper articles in future doesn't look feasible. It took 9 volunteers more than 3 weeks to clean up a single book. We can croudsource it but it requires funding and additional efforts which are not within the scope of our project at this moment.

dg1223 · 2024-09-09T10:56:55Z

@mir-abir-hossain As discussed, @mahdiislam79 is interested in working on this issue too. So, both of you can work together on it.

@mahdiislam79 This was originally Abir's idea.

mahdiislam79 · 2024-09-09T11:14:02Z

Okay. I am currently searching for potential research papers. @mir-abir-hossain let me know if I need to do anything else.

mir-abir-hossain · 2024-09-09T15:43:14Z

Sounds good. I will open a new tab in the excel sheet for keeping track of the research work. We will soon arrange a meeting about it. Till then you can keep researching and figuring out things by yourself. Thanks a lot @dg1223 @mahdiislam79 for the initiative.

dg1223 · 2024-09-09T18:14:42Z

@mir-abir-hossain @mahdiislam79

Some open source projects that might be helpful:

https://github.com/Botbang/avro-bangla-autocorrect-dictionary-enriched
(Botbang reference) mugli/Avro-Keyboard#23
https://github.com/OpenBangla/OpenBangla-Keyboard
https://github.com/Tawkat/Bengali-Spell-Checker-and-Auto-Correction-Suggestion-for-MS-Word
https://github.com/filyp/autocorrect
https://github.com/sagorbrur/bengali_pyspellchecker
https://github.com/pritomsaha/Context-sensitive-Bangla-spell-checker
https://github.com/mehedihasanbijoy/DPCSpell
https://github.com/MahirMahbub/Contextual-Spell-Checker-For-Bangla
https://github.com/Hasiburshanto/Bangla-Spell-Checker
(Bangla NLP) https://github.com/topics/bangla-nlp?o=desc&s=stars

Research papers, articles:
https://medium.com/analytics-vidhya/bengali-word-spelling-correction-using-pre-trained-word2vec-1f9938f48b09
https://www.researchgate.net/publication/346206262_Survey_of_Automatic_Spelling_Correction
https://www.researchgate.net/publication/346679400_Word_Completion_and_Sequence_Prediction_in_Bangla_Language_Using_Trie_and_a_Hybrid_Approach_of_Sequential_LSTM_and_N-gram

mahdiislam79 · 2024-09-09T19:04:41Z

Thanks vai. Will go through it.

…

dg1223 · 2024-09-13T07:13:59Z

Check out shothik.ai

I tried it by copying a few paragraphs from chunk 1. They give a lot of false positives. I thought we could purchase their premium subscription for a month and cleanup all the books but if we get a lot of false positives, it'd be more work for us. Their service isn't even fully working right now.

Reasat · 2024-11-10T20:26:12Z

Can you share some examples of errors you saw when OCRing (I believe you used pytesseract?)

Where is the location of the digitized sample?

dg1223 · 2024-11-10T20:37:48Z

Can you share some examples of errors you saw when OCRing (I believe you used pytesseract?)

Where is the location of the digitized sample?

Yes, we used pytesseract.

Digitized samples: The files without the suffix '_clean'.

You can find examples of errors with line numbers in the pull requests linked to the issues below:

https://github.com/mir-abir-hossain/real-history-of-Bangladesh/issues?q=is%3Aissue+is%3Aclosed+label%3A%22good+first+issue%22

Reasat · 2024-11-10T23:35:07Z

Seems like, the original OCR is noisy due to quality of the scans. Also, pytesseract is very old and non-robust to noise. Have you guys considered some models that have focused on bangla OCR? For example, this
https://github.com/BengaliAI/bbocr
(Also, let me know if the software does not work).

And if you are using an OCR that does not have a language model built in the decoder, the output text has to go through a language model for post processing to fix words.
For example
https://www.kaggle.com/code/umongsain/n-gram-language-model-with-kenlm-tranformers

Are you doing any post processing before manually fixing it?

dg1223 · 2024-11-11T00:10:08Z

Thanks for the suggestions. Much appreciated!

Yes, the quality of the scans is not good. We are yet to try out the alternatives. We'll start with bbOCR.

No, we did not do any post processing. It is probably the most important step in this project to ensure data quality. So, we'll try out your suggestion.

Our goal was to provide our RAG team with the first batch of data as quickly as possible so that they could start their own experimentation while the Data team does its own OCR research in parallel. I can't say for sure if we saved time by fixing everything manually (post-processing) for the first book. It took 2-3 weeks. Maybe we could've invested in doing the research first but the manual exercise gave everyone involved some necessary domain knowledge to evaluate RAG outcome.

Reasat · 2024-11-11T00:38:50Z

the above link refers to an old repo
correct one.
https://github.com/BengaliAI/bbocr-v2

dg1223 · 2024-11-11T00:58:57Z

the above link refers to an old repo correct one. https://github.com/BengaliAI/bbocr-v2

Looks like the link is broken. I can only find bbocr in the BengaliAI account.

Reasat · 2024-11-11T01:01:58Z

Sorry, the repo was private. Made it public, should work now.

dg1223 · 2024-11-11T01:09:17Z

Awesome, we'll try it out first. Thank you for your time!

Reasat · 2024-11-11T02:44:32Z

Also, in the manual check stage, it seems like you are looking at the json and the book without setting up an annotation platform. This way is very difficult for the annotator to fix the issues. I can help with setting up an annotation platform and we can think about crowdsourcing the process.

mir-abir-hossain · 2024-11-11T03:15:02Z

Yes that is right. We have not yet tried using any annotator platform, but what we did was figuring out common mistakes and listing them down, though this was very manual. Any suggestions are appreciated.

dg1223 · 2024-11-11T17:43:42Z

Also, in the manual check stage, it seems like you are looking at the json and the book without setting up an annotation platform. This way is very difficult for the annotator to fix the issues. I can help with setting up an annotation platform and we can think about crowdsourcing the process.

Yes, an annotation platform would be great.

Reasat · 2024-11-14T06:19:49Z

Sorry, got a bit busy for a few days. I'll write down some options after 2 days. In the meantime, can you guys let me know how good is the OCR working? That's the main bottleneck.

dg1223 · 2024-11-20T02:07:58Z

Sorry, got a bit busy for a few days. I'll write down some options after 2 days. In the meantime, can you guys let me know how good is the OCR working? That's the main bottleneck.

Sorry for the delay. The OCR pakcage (pytesseract) is producing errors at a 1% error rate.

For a book with 550 pages, if we consider 10 words per line, there are around 200,000 (2 lakh) words. So, we have to deal with roughly 2000 misspelled words and around 50 missing lines.

If we want to parse 200 books, that's around 400,000 (4 lakh) misspelled words and 10,000 missing lines.

Nirajmahi · 2024-11-30T05:41:21Z

Interested in this issue.

dg1223 added data data related task research Any R&D work labels Sep 9, 2024

dg1223 added this to the Data extraction and preparation milestone Sep 9, 2024

dg1223 assigned mir-abir-hossain and mahdiislam79 Sep 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do research on automatically detecting and correcting Bangla spelling mistakes #40

Do research on automatically detecting and correcting Bangla spelling mistakes #40

dg1223 commented Sep 9, 2024

dg1223 commented Sep 9, 2024

mahdiislam79 commented Sep 9, 2024

mir-abir-hossain commented Sep 9, 2024

dg1223 commented Sep 9, 2024

mahdiislam79 commented Sep 9, 2024 via email

dg1223 commented Sep 13, 2024

Reasat commented Nov 10, 2024

dg1223 commented Nov 10, 2024

Reasat commented Nov 10, 2024

dg1223 commented Nov 11, 2024

Reasat commented Nov 11, 2024

dg1223 commented Nov 11, 2024

Reasat commented Nov 11, 2024

dg1223 commented Nov 11, 2024

Reasat commented Nov 11, 2024

mir-abir-hossain commented Nov 11, 2024 •

edited

Loading

dg1223 commented Nov 11, 2024

Reasat commented Nov 14, 2024

dg1223 commented Nov 20, 2024 •

edited

Loading

Nirajmahi commented Nov 30, 2024

Do research on automatically detecting and correcting Bangla spelling mistakes #40

Do research on automatically detecting and correcting Bangla spelling mistakes #40

Comments

dg1223 commented Sep 9, 2024

dg1223 commented Sep 9, 2024

mahdiislam79 commented Sep 9, 2024

mir-abir-hossain commented Sep 9, 2024

dg1223 commented Sep 9, 2024

mahdiislam79 commented Sep 9, 2024 via email

dg1223 commented Sep 13, 2024

Reasat commented Nov 10, 2024

dg1223 commented Nov 10, 2024

Reasat commented Nov 10, 2024

dg1223 commented Nov 11, 2024

Reasat commented Nov 11, 2024

dg1223 commented Nov 11, 2024

Reasat commented Nov 11, 2024

dg1223 commented Nov 11, 2024

Reasat commented Nov 11, 2024

mir-abir-hossain commented Nov 11, 2024 • edited Loading

dg1223 commented Nov 11, 2024

Reasat commented Nov 14, 2024

dg1223 commented Nov 20, 2024 • edited Loading

Nirajmahi commented Nov 30, 2024

mir-abir-hossain commented Nov 11, 2024 •

edited

Loading

dg1223 commented Nov 20, 2024 •

edited

Loading