Skip to content

SpellCorrect Exercise

jasonbaldridge edited this page Feb 4, 2013 · 9 revisions

Spelling Correction Exercise

Step one: deriving a dictionary from data.

Read the blog post on using Unix for text processing, and do the commands discussed there yourself. Keep the dictionary file that you extracted from MASC -- you'll use it for later steps.

Step two: detecting errors

Update your applied-nlp repository and then modify the main method of the SpellingCorrector object in the file src/main/scala/appliednlp/spell/Spelling.scala so that it takes two arguments, the first of which is a sentence that will be checked for spelling errors and the second of which is the location of a dictionary file. For every word that is not in the dictionary, it should output "ERROR: ".

Do your development by editing the file in your editor of choice and running commands in SBT, primarily by using ~compile to get continuous compilation that runs every time you save a program file.

You should also run it using SBT. Here's an example of how your program should run.

> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words 
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: This
ERROR: Facebook
ERROR: app
ERROR: shows
ERROR: acress
ERROR: tonw
[success] Total time: 13 s, completed Feb 4, 2013 11:56:23 AM

Step three: add another dictionary

Make it possible for your program to accept a third argument that is the MASC dictionary that you derived in step one. Your program should now run like the following.

> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
ERROR: tonw
[success] Total time: 13 s, completed Feb 4, 2013 12:37:31 PM

Step four: build an inverted index

Build an inverted index using the vocabulary. This should be a Map[String,Set[String]] in which each key is a character trigram and each value is the set of all the words that have that trigram as a substring. You should pad the words with a hashmark on the front and back, e.g. #apple#. The index should then have entries like:

  • #ap -> { #apple#, #app#, #application, ... }
  • ple -> { #please#, #apple#, #aplenty#, ...}

You'll probably want to write a helper function that extracts character trigrams when given a word.

Step five: identifying all candidates in the inverted index

In the part of the code that detects errors, compute the set of candidate words by looking up all the trigrams for each word in the inverted index. Output five of these. Your output should look like the following.

> run-main appliednlp.spell.SpellingCorrector "This Facebook app shows that she is there favorite acress in tonw" /usr/share/dict/words /tmp/masc_vocab.txt
[info] Running appliednlp.spell.SpellingCorrector This Facebook app shows that she is there favorite acress in tonw /usr/share/dict/words /tmp/masc_vocab.txt
Detecting spelling errors in: This Facebook app shows that she is there favorite acress in tonw
ERROR: acress
  Candidates: professed acoin apocrenic adherescence accouple acronym crumblingness gleesomeness unvarnishedness trest
ERROR: tonw
  Candidates: Teutondom belton metapeptone tokened atoningly histon toploftical Eciton toxolysis toxicosis

Step six: rank candidates using cosine similarity

For each spelling error, compute the cosine between it and all the candidates returned from the inverted index. Then output the top 20 candidates based on the values obtained.

Clone this wiki locally