Detect language and correct accordingly #15

rymai · 2015-10-01T06:06:50Z

I received a pull request on rymai/elevator-simulator#1 in which "attendent" was treated as a misspelling of "attendant". The problem is that the README is in French, where "ils attendent" means "they are waiting".

uiteoi · 2015-10-01T11:39:47Z

Your README mixes English and French, which makes it extra hard to detect the language.

You are probably not alone in mixing languages, as this can happen for a number of reasons.

orthographic-pedant could use a number of methods to fix this one:

blacklisting
language settings in orthographic-pedant configuration files
language settings embedded in md files using language tags

rymai · 2015-10-01T12:47:22Z

Thanks @uiteoi. Indeed, the README mixes both languages, but in that case the script could at least stop instead of proposing a wrong correction...

I certainly don't want to add a configuration file for this bot, nor add settings in md files. :)

uiteoi · 2015-10-01T14:01:53Z

Proper detection is certainly the best way forward. Given the complexity of implementing it, I was considering other options. In your case, blacklisting would be the most appropriate short-term solution.

rymai · 2015-10-01T14:08:42Z

Exactly!

thoppe · 2015-10-01T14:35:10Z

What I've found is that explicit white-listing is the way to go. I made a few early mistakes correcting Ceasar to Caesar and had half of Latin America mad at me. For this particular case, I'm going to remove this word from the correction list. I've only done the A's so far; you can see what corrections will be attempted here:

https://github.com/thoppe/orthographic-pedant/blob/master/wordlists/parsed_wikipedia_list.txt

A poor man's check for a possible foreign language would be to test whether the entire README can be converted to ASCII without loss. Obviously this is a bit heavy-handed, but I'm not sure how this problem is solved in the real world.
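A minimal sketch of that ASCII check in Python. The function name `is_pure_ascii` is hypothetical and this is not taken from the bot's code; it only illustrates the idea.

```python
def is_pure_ascii(text: str) -> bool:
    """Return True if text can be encoded as ASCII without loss."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character (an accented letter, an emoji, a curly
        # quote, ...) falls outside the ASCII range.
        return False
    return True


# Accented French text is flagged as "possibly not English".
print(is_pure_ascii("Les passagers sont arrêtés"))
# Unaccented French text is not, which is the weak spot of this check.
print(is_pure_ascii("ils attendent"))
```

The check errs in both directions: "ils attendent" contains no accented characters and passes as ASCII, while an English README that contains an emoji or typographic quotes would be skipped.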

uiteoi · 2015-10-01T15:00:05Z

@thoppe, you are going to have this same problem with many other words. French and English in particular share countless words with slightly different spellings, e.g. example / exemple, apartment / appartement, ...

So I would suggest that you start looking into some form of detection and make it easier to blacklist repos.

Good luck with your project.

uiteoi · 2015-10-01T15:02:26Z

Another possible suggestion: if a repo owner has rejected a pull request once, you may want to blacklist that repo automatically to avoid submitting further suggested fixes.
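A rough sketch of how such an automatic blacklist could look in Python. Everything here is hypothetical (the `RepoBlacklist` class, its method names, and the JSON format), and it assumes the caller already knows whether a pull request was merged or closed without merging.

```python
import json


class RepoBlacklist:
    """Hypothetical in-memory blacklist of repos that rejected a PR."""

    def __init__(self, repos=()):
        # Repo names are compared case-insensitively.
        self._repos = {name.lower() for name in repos}

    def record_pr_outcome(self, repo: str, merged: bool) -> None:
        """Blacklist the repo if its PR was closed without being merged."""
        if not merged:
            self._repos.add(repo.lower())

    def remove(self, repo: str) -> None:
        """Undo a blacklisting, e.g. on request from the repo owner."""
        self._repos.discard(repo.lower())

    def allows(self, repo: str) -> bool:
        """Return True if the bot may still open PRs on this repo."""
        return repo.lower() not in self._repos

    def to_json(self) -> str:
        """Serialize the blacklist so it can be stored between runs."""
        return json.dumps(sorted(self._repos))

    @classmethod
    def from_json(cls, payload: str) -> "RepoBlacklist":
        return cls(json.loads(payload))


blacklist = RepoBlacklist()
blacklist.record_pr_outcome("rymai/elevator-simulator", merged=False)
print(blacklist.allows("rymai/elevator-simulator"))
```

Keeping a `remove` method means a blacklisting can be reversed without editing the stored list by hand.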

thoppe · 2015-10-01T15:07:49Z

Good suggestions @uiteoi. Since I don't speak French, is there a list of "homophonic cognates" somewhere that you can vouch for as a good starting point?

Natural language is deceptively hard to get right, especially when I have to cross the phase boundary between two of them!

As a side note, many happy users reject a PR by accident because they are unfamiliar with GitHub's PR system. By my ad hoc estimate, this amounts to about 5%. Very few people vehemently dislike the bot (but that number is not zero).

uiteoi · 2015-10-01T15:17:56Z

Here's a Wikipedia article listing common spelling mistakes in French. The WPCleaner bot uses it to detect spelling mistakes.

https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Liste_de_fautes_d%27orthographe_courantes

You should expect most major languages to have similar lists for the WPCleaner bot to use.

I see you are using Python, which has a number of natural-language libraries, including ones built on NLTK. Here's an example I found by googling "python natural language detection":
https://pypi.python.org/pypi/guess-language

People who reject a PR by accident should be able to submit a PR on your repo to be removed from the blacklist.

I personally think this is a great project and I encourage you to further develop it.

thoppe · 2015-10-09T18:31:15Z

I'm going to reopen this issue since it turns out this is a really good idea. It shouldn't be too hard to detect if the language is not English and skip the repo outright. This should help with the words that are correct in French and English at least.
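To make the "skip if not English" idea concrete, here is a minimal sketch using a simple stopword-counting heuristic in plain Python. This is a stand-in technique, not the approach of any detection library and not necessarily what the bot ends up doing. The word lists, the `looks_english` name, and the 0.8 threshold are all assumptions chosen for illustration, and only English and French are considered.

```python
import re

# Small, hand-picked lists of very common function words (assumptions,
# not exhaustive). No word appears in both sets.
ENGLISH_STOPWORDS = frozenset({
    "the", "and", "of", "to", "is", "in", "that", "for",
    "with", "this", "it", "are", "on", "as", "be",
})
FRENCH_STOPWORDS = frozenset({
    "le", "la", "les", "des", "et", "est", "un", "une", "du",
    "pour", "dans", "que", "qui", "sur", "pas", "ils", "nous",
})


def stopword_counts(text: str) -> tuple[int, int]:
    """Count English and French stopwords among the words of text."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    english = sum(word in ENGLISH_STOPWORDS for word in words)
    french = sum(word in FRENCH_STOPWORDS for word in words)
    return english, french


def looks_english(text: str, min_ratio: float = 0.8) -> bool:
    """Return True only if the stopword evidence is mostly English."""
    english, french = stopword_counts(text)
    total = english + french
    if total == 0:
        # No evidence either way: treat as "not English" so the repo
        # is skipped rather than corrected wrongly.
        return False
    return english / total >= min_ratio


readme = "The elevator is ready. Les passagers attendent dans la cabine."
if not looks_english(readme):
    print("skipping repo: README does not look like plain English")
```

With a threshold like this, a README that mixes English and French is skipped as well, so the bot errs on the side of proposing no correction.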

uiteoi · 2015-10-10T05:49:14Z

Great 👍

thoppe closed this as completed in 3f53216 Oct 1, 2015

thoppe reopened this Oct 9, 2015

thoppe added the enhancement label Oct 9, 2015
