Detect language and correct accordingly #15
Your README mixes English and French, which makes it extra hard to detect the language. You are probably not alone in mixing languages, as this can happen for a number of reasons. orthographic-pedant could use a number of methods to fix this one:
Thanks @uiteoi. Indeed the readme mixes both languages, but at least the script could stop its work in that case instead of proposing a wrong correction... I certainly don't want to add a configuration file for this bot, nor add settings in md files. :) |
Proper detection is certainly the best way going forward. Considering the complexity of implementation I was considering other options. In your case, blacklisting would be the most appropriate short-term solution. |
Exactly! |
What I've found is that explicit whitelisting is the way to go. I made a few early mistakes correcting Ceasar to Caesar and had half of Latin America mad at me. For this particular case I'm going to remove this word from the correction list. I've only done the A's so far; you can see what corrections will be attempted here: https://github.com/thoppe/orthographic-pedant/blob/master/wordlists/parsed_wikipedia_list.txt A poor man's check for a possible foreign language would be to test whether the entire README can be converted to ASCII without loss. Obviously this is a bit heavy-handed, but I'm not sure how this problem is solved in the real world. |
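The ASCII round-trip check described above could be sketched roughly as follows. The function names are hypothetical and not taken from the orthographic-pedant code base; this only illustrates the idea, assuming the README has already been decoded to a Python string.

```python
def is_lossless_ascii(text: str) -> bool:
    """Return True if every character in `text` can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def looks_possibly_foreign(readme_text: str) -> bool:
    """Heuristic: any non-ASCII character flags the README as possibly non-English."""
    return not is_lossless_ascii(readme_text)
```

Note that this heuristic errs in both directions: an English README containing curly quotes or emoji is flagged, while an unaccented French phrase such as "ils attendent" passes as ASCII.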
@thoppe, you are going to have this same problem with countless other words. French and English in particular share many words with slightly different spellings, e.g. example / exemple, appartement / apartment, ... So I would suggest that you start looking for some form of detection and make it easier to blacklist repos. Good luck with your project. |
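One way to handle such shared words, short of full language detection, is to drop any correction whose "misspelling" is a valid word in another language. The sketch below assumes corrections are held as a misspelling-to-correction dict; that structure and the function name are illustrative guesses, not the bot's actual format.

```python
def filter_corrections(corrections, foreign_valid_words):
    """Remove corrections whose 'misspelled' form is a real word elsewhere.

    corrections: dict mapping misspelling -> proposed correction (assumed shape).
    foreign_valid_words: iterable of words known to be valid in another language.
    """
    blocked = {word.lower() for word in foreign_valid_words}
    return {
        bad: good
        for bad, good in corrections.items()
        if bad.lower() not in blocked
    }
```

For instance, passing `{"attendent": "attendant", "recieve": "receive"}` together with the French word list `["attendent", "exemple"]` would leave only the `recieve` entry.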
Another suggestion: if a repo owner has rejected a pull request once, you may want to blacklist that repo automatically to avoid submitting further fixes. |
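A minimal sketch of that automatic blacklisting, assuming the bot keeps a record of each pull request it opened with its final state. The `PullRequestOutcome` record and its fields are hypothetical; how the outcomes are actually fetched (for example from the GitHub API) is left out.

```python
from typing import Iterable, NamedTuple, Set


class PullRequestOutcome(NamedTuple):
    repo: str     # "owner/name" of the target repository
    closed: bool  # True once the PR is no longer open
    merged: bool  # True if the PR was accepted


def repos_to_blacklist(outcomes: Iterable[PullRequestOutcome]) -> Set[str]:
    """Collect repos where a PR was closed without being merged."""
    return {o.repo for o in outcomes if o.closed and not o.merged}
```

Open PRs are deliberately ignored, so a repo is only blacklisted after an explicit rejection.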
Good suggestions @uiteoi. Since I don't speak French, is there a list of "homophonic cognates" somewhere that you can vouch for as a good starting point? Natural language is deceptively hard to get right, especially when I have to cross the phase boundary between two of them! As a side note, many happy users reject a PR by accident since they are unfamiliar with GitHub's PR system. As a rough estimate, this amounts to about 5%. Very few people vehemently dislike the bot (but that number is not zero). |
Here's a Wikipedia article showing a list of common spelling mistakes in French; it is used by the WPCleaner bot to detect spelling mistakes. https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Liste_de_fautes_d%27orthographe_courantes You should expect most major languages to have similar lists for the WPCleaner bot to use. I see you are using Python, which has a number of NL libraries, NLTK among them. Here's an example I found by googling "python natural language detection": For people who reject a PR by accident, they should be able to submit a PR on your repo to get removed from the blacklist. I personally think this is a great project and I encourage you to further develop it. |
I'm going to reopen this issue since it turns out this is a really good idea. It shouldn't be too hard to detect if the language is not English and skip the repo outright. This should help with the words that are correct in French and English at least. |
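As a rough illustration of skipping non-English repos, the sketch below guesses the language by counting common function words. The word sets, the 0.25 threshold, and the function names are all assumptions made for this example; a real implementation would more likely rely on a dedicated language-detection library.

```python
import re

# Tiny hand-picked samples of function words (illustrative, not exhaustive).
ENGLISH_STOPWORDS = {"the", "and", "is", "of", "to", "in", "that", "it", "for", "with", "this", "are"}
FRENCH_STOPWORDS = {"le", "la", "les", "et", "est", "des", "un", "une", "pour", "dans", "que", "ils"}


def guess_language(text: str) -> str:
    """Return 'en', 'fr', 'mixed', or 'unknown' based on stopword counts."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    en = sum(w in ENGLISH_STOPWORDS for w in words)
    fr = sum(w in FRENCH_STOPWORDS for w in words)
    total = en + fr
    if total == 0:
        return "unknown"
    fr_share = fr / total
    if fr_share <= 0.25:
        return "en"
    if fr_share >= 0.75:
        return "fr"
    return "mixed"


def should_skip_repo(readme_text: str) -> bool:
    """Be conservative: only propose corrections when the README looks English."""
    return guess_language(readme_text) != "en"
```

Treating "mixed" and "unknown" as reasons to skip keeps the bot quiet on READMEs like the one that prompted this issue, at the cost of missing some legitimate corrections.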
Great 👍 |
I received a pull-request on rymai/elevator-simulator#1 where "attendent" was mistaken for "attendant". The problem is that the README is in French, and in that case, "ils attendent" means "they are waiting".