-
Notifications
You must be signed in to change notification settings - Fork 53
SpellCorrect Exercise
Read the blog post on using Unix for text processing, and do the commands discussed there yourself. Keep the dictionary file that you extracted from MASC -- you'll use it for later steps.
Update your applied-nlp repository and then modify the main
method of the SpellingCorrector object in the file src/main/scala/appliednlp/spell/Spelling.scala
so that it takes two arguments, the first of which is a sentence that will be checked for spelling errors and the second of which is the location of a dictionary file. For every word that is not in the dictionary, it should output "ERROR: ".
Do your development by editing the file in your editor of choice and running commands in SBT, primarily by using ~compile
to get continuous compilation that runs every time you save a program file.
You should also run it using SBT. Here's an example of how your program should run.
> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: This
ERROR: Facebook
ERROR: app
ERROR: shows
ERROR: acress
ERROR: tonw
[success] Total time: 13 s, completed Feb 4, 2013 11:56:23 AM
Make it possible for your program to accept a third argument that is the MASC dictionary that you derived in step one. Your program should now run like the following.
> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
ERROR: tonw
[success] Total time: 13 s, completed Feb 4, 2013 12:37:31 PM
Build an inverted index using the vocabulary. This should be a Map[String,Set[String]]
in which each key is a character trigram and each value is the set of all the words that have that trigram as a substring. You should pad the words with a hashmark on the front and back, e.g. #apple#
. The index should then have entries like:
#ap -> { #apple#, #app#, #application, ... }
ple -> { #please#, #apple#, #aplenty#, ...}
You'll probably want to write a helper function that extracts character trigrams when given a word.
In the part of the code that detects errors, compute the set of candidate words by looking up all the trigrams for each word in the inverted index. Output five of these. Your output should look like the following.
> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
Candidates: professed acoin apocrenic adherescence accouple acronym crumblingness gleesomeness unvarnishedness trest
ERROR: tonw
Candidates: Teutondom belton metapeptone tokened atoningly histon toploftical Eciton toxolysis toxicosis
For each spelling error, compute the cosine between it and all the candidates returned from the inverted index. Then output the top 20 candidates based on the values obtained.
Example output:
> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
Candidates: professed acoin apocrenic adherescence accouple acronym crumblingness gleesomeness unvarnishedness trest
Top 20 (cress,0.7302967433402214) (acres,0.7302967433402214) (Paracress,0.6804138174397718) (actress,0.6172133998483676) (acre,0.6123724356957946) (acupress,0.5773502691896258) (Press,0.5477225575051661) (tress,0.5477225575051661) (Dress,0.5477225575051661) (press,0.5477225575051661) (dress,0.5477225575051661) (acred,0.5477225575051661) (crestless,0.5443310539518174) (creatress,0.5443310539518174) (acridness,0.5443310539518174) (acrestaff,0.5443310539518174) (acceptress,0.5163977794943223) (restress,0.5163977794943223) (sacredness,0.5163977794943223) (pennycress,0.5163977794943223)
ERROR: tonw
Candidates: Teutondom belton metapeptone tokened atoningly histon toploftical Eciton toxolysis toxicosis
Top 20 (ton,0.5773502691896258) (tonk,0.5) (tone,0.5) (tons,0.5) (tony,0.5) (tong,0.5) (tongs,0.4472135954999579) (toned,0.4472135954999579) (tonal,0.4472135954999579) (tonga,0.4472135954999579) (tonic,0.4472135954999579) (tones,0.4472135954999579) (tonus,0.4472135954999579) (toner,0.4472135954999579) (tonsor,0.4082482904638631) (tonish,0.4082482904638631) (tonkin,0.4082482904638631) (tongue,0.4082482904638631) (tonjon,0.4082482904638631) (tonous,0.4082482904638631)
[success] Total time: 14 s, completed Feb 4, 2013 4:20:29 PM