1. Background

Lecture 1

Which problems count as NLP, what motivates the field, why language is hard, why humans are good at it, what structure and mathematical properties language has, why rule-based approaches don't scale, and what is solved, unsolved, or unsolvable…

Slides

Lecture 2

Non-English, non-Latin script and multilingual problems. What is different about NLP for English?

The language of language: anaphora, BLEU, canonicalisation, grammars, lemmatisation, n-grams, parallel corpora, segmentation, tokenisation, Zipf's law...

Slides
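A few of the terms above (tokenisation, n-grams, Zipf's law) can be tried out in a handful of lines. This is only a minimal sketch: the sample sentence is made up, the function names are illustrative, and the tokeniser is the same crude `\w+` split used by `words()` in Norvig's spell.py rather than a proper one.

```python
import re
from collections import Counter

def tokenise(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r'\w+', text.lower())

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Made-up sample text; a real experiment would read a large corpus instead.
text = "The cat sat on the mat. The dog sat on the log."
tokens = tokenise(text)

print(ngrams(tokens, 2)[:3])
for rank, word, count in rank_frequency(tokens)[:3]:
    # Zipf's law predicts that rank * count stays roughly constant.
    print(rank, word, count, rank * count)
```

A sample this small will not show Zipf's law convincingly. On a large corpus, the product of rank and count should stay within the same order of magnitude across the most frequent words.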

Lab

Increase the accuracy of Peter Norvig's classic spelling corrector in half a page of code without hurting performance too much.

Here's the code.

Some ideas: generate more or better candidates, add a cost function, use context, use the subword level, preprocess, add more data...

Peter Norvig explains many potential improvements in depth on his page.
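As one illustration of the "add a cost function" idea, the sketch below scores each candidate by its log probability minus a penalty per edit, instead of always preferring the smallest edit distance. It is not Norvig's method and not a reference solution: the toy counts, the `edit_penalty` parameter and the function names `scored_candidates` and `correction` are assumptions made for the example, and only `edits1` follows Norvig's spell.py.

```python
import math
from collections import Counter

# Toy counts standing in for the counts built from big.txt (made-up numbers).
WORDS = Counter({'the': 500, 'they': 80, 'then': 60, 'ten': 10, 'he': 120})
TOTAL = sum(WORDS.values())
LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    """All strings one edit away from word, as in Norvig's spell.py."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in LETTERS]
    inserts = [L + c + R for L, R in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def scored_candidates(word, edit_penalty=4.0):
    """Map each known word within two edits to log P(word) - penalty * distance.

    edit_penalty is a hypothetical knob; it would need tuning on the
    development set.
    """
    one_edit = edits1(word)
    two_edits = {e2 for e1 in one_edit for e2 in edits1(e1)}
    scores = {}
    # Groups are visited in order of distance, so each candidate keeps
    # the score for its smallest edit distance.
    for distance, group in enumerate(({word}, one_edit, two_edits)):
        for cand in group:
            if cand in WORDS and cand not in scores:
                scores[cand] = (math.log(WORDS[cand] / TOTAL)
                                - edit_penalty * distance)
    return scores

def correction(word, edit_penalty=4.0):
    """Return the best-scoring candidate, or the word itself if none is known."""
    scores = scored_candidates(word, edit_penalty)
    return max(scores, key=scores.get) if scores else word

print(correction('teh'))
```

With these toy counts, `correction('teh')` returns `'the'`. Unlike the baseline, a low `edit_penalty` lets a frequent word override a rare known word, which helps in some cases and hurts in others, so it is worth measuring on the development set before relying on it.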

Submission Instructions

Submit your Python code as a Kaggle Kernel on the Spelling dataset.

Example: kaggle.com/bittlingmayer/spell-py

1. Add a new Kernel

Open kaggle.com/bittlingmayer/spelling
Click New Kernel
Choose Script or Notebook according to your preference
Title it spell.py

2. Edit the script

Delete what is there and paste in your spell.py and the evaluation code.

See the example with the baseline, which is Norvig's spell.py plus the evaluation code, with the following changes to work in a Kaggle Kernel:

big.txt is already in the environment, at ../input/big.txt.

So 'big.txt' must be changed to '../input/big.txt', for example:

WORDS = Counter(words(open('../input/big.txt').read()))

If you need to change some unit tests, that is fine. In fact, the tests in Norvig's original code break on the current version of big.txt.

If you need to do pre-processing of the data, note that you can write out files to the current directory too.
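One way to use this is to cache a pre-processing result in the current directory so that later runs can reload it. The sketch below is an assumption about how that might look, not part of the baseline: the helper `load_counts`, the cache file name `word_counts.json` and the UTF-8 encoding are all illustrative choices.

```python
import json
import os
import re
from collections import Counter

def load_counts(source_path, cache_path='word_counts.json'):
    """Return word counts for source_path, cached as JSON at cache_path."""
    # Reuse the cached counts if an earlier run already wrote them.
    if os.path.exists(cache_path):
        with open(cache_path, encoding='utf-8') as f:
            return Counter(json.load(f))
    # Assumes the corpus is UTF-8; undecodable bytes are replaced, not fatal.
    with open(source_path, encoding='utf-8', errors='replace') as f:
        counts = Counter(re.findall(r'\w+', f.read().lower()))
    # Write the result to the current (writable) directory.
    with open(cache_path, 'w', encoding='utf-8') as f:
        json.dump(counts, f)
    return counts
```

In the Kernel this could replace the WORDS line, for example `WORDS = load_counts('../input/big.txt')`.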

You must change the code at the end to test and print the result:

Remove:

spelltest(Testset(open('spell-testset1.txt'))) # Development set
spelltest(Testset(open('spell-testset2.txt'))) # Final test set

Add:

def test_corpus(filename):
    print("Testing " + filename)
    spelltest(Testset(open('../input/' + filename)))

test_corpus('spell-testset1.txt') # Development set
test_corpus('spell-testset2.txt') # Final test set

# Supplementary sets
test_corpus('wikipedia.txt')
test_corpus('aspell.txt')

3. Test

Click Run to save and run
Open the Options tab and click Hide Script if you do not want others to see or find your code
Open the Log tab
You should see something similar to this:

unit_tests pass
Testing spell-testset1.txt
75% of 270 correct (6% unknown) at 30 words per second
Testing spell-testset2.txt
68% of 400 correct (11% unknown) at 25 words per second
Testing wikipedia.txt
61% of 2455 correct (24% unknown) at 18 words per second
Testing aspell.txt
43% of 531 correct (23% unknown) at 13 words per second 

4. Explain your approach

Make sure the script or Notebook is professionally commented and formatted.

In the Comments tab of the Kernel, explain your approach:

Did you do preprocessing?

What approaches did you try that failed?

What potential improvements could you make?

5. Submit

Send an email to [email protected] with the subject Spelling, the link to your Kernel, and your name:

kaggle.com/jmustermann/spell-py
Johanna Mustermann

More

Read double articulation

Watch the first part of Lecture 1 from Stanford's Natural Language Processing with Deep Learning

Understand the title of each chapter of Foundations of Statistical Natural Language Processing

Try to understand how str works in Python 3

Play with the displaCy parsing visualisation

Read Norvig vs Chomsky with a good drink
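For the item on str above, a starting point might be the small experiment below. It is a sketch with an arbitrarily chosen example word and a hypothetical helper named `describe`. It shows that a Python 3 str is a sequence of Unicode code points, that the byte length depends on the encoding, and that two strings which look identical can still compare unequal until they are normalised.

```python
import unicodedata

# 'naive' with a precomposed i-diaeresis (U+00EF), written as an escape
# so the code point is unambiguous.
composed = 'na\u00efve'
# NFD splits it into a plain 'i' followed by a combining diaeresis (U+0308).
decomposed = unicodedata.normalize('NFD', composed)

def describe(s):
    """Return the length of s in code points and in UTF-8 bytes."""
    return {'code_points': len(s), 'utf8_bytes': len(s.encode('utf-8'))}

print(describe(composed))
print(describe(decomposed))
print(composed == decomposed)
print(unicodedata.normalize('NFC', decomposed) == composed)
```

This matters for the lab because edit operations such as those in the spelling corrector work on code points, so normalising the input first keeps one visible character from counting as two edits.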