Chalk command line tutorial
This page covers some of what you need to do to start training models. To follow the instructions here, you must have successfully compiled and installed Chalk as described in the README. You must also be working from a clone of the repository (the latest code, not a release).
The Open American National Corpus has provided a set of open, unencumbered annotations for multiple domains (yay!) in the Manually Annotated Sub-Corpus (MASC). We'll use MASC v3.0.0 here.
Note: You may find there are things you wish were different about the MASC annotations (choices about tokenization, etc). They love to get feedback, so be sure to let them know by writing to [email protected].
The MASC annotations are provided in multiple XML files. Chalk provides a conversion utility that transforms the XML into the input formats needed for training sentence detection, tokenizer, and named-entity recognition models (for both Chalk and OpenNLP).
$ cd /tmp/
$ mkdir masc
$ cd masc
$ wget http://www.anc.org/MASC/download/MASC-3.0.0.tgz
$ tar xzf MASC-3.0.0.tgz
$ chalk run chalk.corpora.MascTransform data/written /tmp/chalk-masc-data
Creating train
Success: data/written/ficlets,1401
Success: data/written/ficlets,1403
Success: data/written/ficlets,1402
Failure: data/written/non-fiction,CUP1
Success: data/written/non-fiction,rybczynski-ch3
<...more status output...>
$ cd /tmp/chalk-masc-data
$ ls
dev test train
The three directories contain data splits for training models (train), evaluating their performance while tweaking them (dev), and evaluating them blindly on a held-out set (test). Each directory contains files for sentence detection, tokenization, and named-entity recognition.
$ ls train/
train-ner.txt train-sent.txt train-tok.txt
Check that you've got the right output by running the following command and comparing your output to this.
$ tail -3 train/train-tok.txt
He held me in his arms until I stopped flailing<SPLIT>.
It passed<SPLIT>.
This is my Hotel California<SPLIT>.
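In this format, whitespace separates tokens and `<SPLIT>` marks a token boundary that has no whitespace in the original text (here, between the last word and the period). As a minimal sketch for eyeballing the tokenization the data encodes, the helper below replaces each marker with a space. The name `show_tokens` is hypothetical and not part of Chalk.

```shell
# Hypothetical helper (not part of Chalk): expand <SPLIT> markers into spaces
# so that every whitespace-separated field is one token.
show_tokens() {
  sed 's/<SPLIT>/ /g'
}

printf '%s\n' 'It passed<SPLIT>.' | show_tokens   # prints: It passed .
```

To view the end of the training file this way, you could run:
$ tail -3 train/train-tok.txt | show_tokens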
Assuming things went smoothly, you are ready to train models. All of the following instructions assume you are in the chalk-masc-data directory.
We need an example text, so let's use one about Aravind Joshi's ACL lifetime achievement award. (Note: I've made a few modifications and edits to make it a better example.)
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof. Aravind Joshi of the University of Pennsylvania. Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950. He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960. Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001. Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics.
Joshi rocks.
Run the following commands to get things set up with this text.
$ cd /tmp/chalk-masc-data
$ echo "The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof. Aravind Joshi of the University of Pennsylvania. Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950. He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960. Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001. Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics." > joshi.txt
Do the following to train a sentence detector.
$ chalk cli SentenceDetectorTrainer -encoding UTF-8 -lang en -data train/train-sent.txt -model eng-masc-sent-tmp.bin
Indexing events using cutoff of 5
Computing event counts... done. 19168 events
Indexing... done.
Sorting and merging events... done. Reduced 19168 events to 14302.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 14302
Number of Outcomes: 2
Number of Predicates: 1667
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-13286.245156975032 0.7741026711185309
2: ... loglikelihood=-7936.714729168212 0.8232470784641068
3: ... loglikelihood=-6605.629415117238 0.8643050918196995
<more iterations>
98: ... loglikelihood=-3055.191153644887 0.9465254590984975
99: ... loglikelihood=-3050.6109732525756 0.9467341402337228
100: ... loglikelihood=-3046.0972508791187 0.9467341402337228
Writing sentence detector model ... done (0.073s)
Wrote sentence detector model to
path: /tmp/chalk-masc-data/eng-masc-sent-tmp.bin
Now run it on the example text.
$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt
Loading Sentence Detector model ... done (0.042s)
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof.
Aravind Joshi of the University of Pennsylvania.
Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950.
He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960.
Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001.
Dr. Joshi has supervised thirty-six Ph.D.
theses to-date, on topics including information and coding theory, and also pure linguistics.
Average: 39.5 sent/s
Total: 7 sent
Runtime: 0.177s
Overall, things look fine except for the splits after 'Prof.' and 'Ph.D.'. For 'Prof.', there is only one training example in train-sent.txt and a feature cutoff of 5 is used, so the model has no evidence to go on: it treats the period as a sentence-ending period rather than as part of an abbreviation.
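To see how much evidence the training data offers for a given abbreviation, you can count the lines that contain it. The sketch below wraps plain grep in a hypothetical helper, `count_lines_with` (not a Chalk command). It counts matching lines rather than individual occurrences, which is usually close enough for this kind of check.

```shell
# Hypothetical helper: count the lines of file $2 that contain the string $1.
# -F treats the pattern literally, so the period in "Prof." is not a wildcard.
count_lines_with() {
  grep -c -F -- "$1" "$2"
}
```

For example, from the chalk-masc-data directory:
$ count_lines_with 'Prof.' train/train-sent.txt
A count below the cutoff suggests that features tied to that token are unlikely to survive into the model.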
Evaluate the model.
$ chalk cli SentenceDetectorEvaluator -model eng-masc-sent-tmp.bin -data dev/dev-sent.txt -lang en
Loading Sentence Detector model ... done (0.034s)
Evaluating ... done
Precision: 0.8203161320316132
Recall: 0.7937471884840306
F-Measure: 0.8068129858253316
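The reported F-measure is consistent with the usual balanced F-score, the harmonic mean of precision and recall: F = 2PR / (P + R). As a quick sanity check, here is a sketch using awk. The helper name `f_measure` is hypothetical, and it assumes precision and recall are not both zero.

```shell
# Hypothetical helper: balanced F-measure (harmonic mean) of
# precision $1 and recall $2, printed to four decimal places.
f_measure() {
  awk -v p="$1" -v r="$2" 'BEGIN { printf "%.4f\n", 2 * p * r / (p + r) }'
}

f_measure 0.8203161320316132 0.7937471884840306   # prints: 0.8068
```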
This performance is lower than we'd like. Looking at the data, there are probably some changes that need to be made to the MASC conversion. For example, dev/dev-sent.txt includes lines like these:
Wisdom from the Top
"It's not me who can't keep a secret it's the people I tell that can't."
- Lincoln
"I have found the best way to give advice to your children is to find out what they want and then advise them to do that."
- Truman
"There's one thing about being a president - nobody can tell you when to sit down."
- Eisenhower
"Things are more like they are now than they have ever been before."
- Eisenhower
Dead Horse
The tribal wisdom of the North American Indians, passed on from one generation to the next, says that when you discover that you are riding a dead horse, the best strategy is to dismount.
So there are lines that don't correspond to our usual notion of a sentence. We might want to handle these differently, since the sentence detector really expects paragraphs of sentences that need splitting, not general formatting such as titles, signatures, and quotations.
Also, there are apparent errors like the following:
They have concluded that 5000 years ago, their ancestors were already using mobile phones"
.
That could be a problem with the annotations themselves, or with the conversion, and needs looking into.
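One way to gauge how widespread this kind of error is: list the lines that consist only of punctuation, like the stranded period above. This is a minimal sketch with a hypothetical helper, `punct_only_lines`. It only catches this one symptom and says nothing about where the problem originates.

```shell
# Hypothetical helper: print, with line numbers, the lines of file $1 that
# contain nothing but punctuation (optionally surrounded by whitespace).
punct_only_lines() {
  grep -n -E '^[[:space:]]*[[:punct:]]+[[:space:]]*$' "$1"
}
```

For example:
$ punct_only_lines dev/dev-sent.txt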
Do the following to train a tokenizer. (I'll suppress the output from here on.)
$ chalk cli TokenizerTrainer -encoding UTF-8 -lang en -data train/train-tok.txt -model eng-masc-token-tmp.bin
To test the tokenizer on the example text, we need to pass it through the sentence detector first and then on to the tokenizer.
$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt | chalk cli TokenizerME eng-masc-token-tmp.bin
Loading Sentence Detector model ... Loading Tokenizer model ... done (0.061s)
done (0.257s)
Average: 31.5 sent/s
Total: 7 sent
Runtime: 0.222s
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof .
Aravind Joshi of the University of Pennsylvania .
Aravind Joshi was born in 1929 in Pune , India , where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering , the latter in 1950 .
He worked as a research assistant in Linguistics at Penn from 1958 - 60 , while completing his Ph.D . in Electrical Engineering , in 1960 .
Joshi 's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science , which Aravind Joshi co-directed until 2001 .
Dr. Joshi has supervised thirty-six Ph.D .
theses to-date , on topics including information and coding theory , and also pure linguistics .
Average: 34.8 sent/s
Total: 8 sent
Runtime: 0.23s
Almost all of this looks good. The errors again center on "Prof." and "Ph.D.", which should not be split from their final periods. Again, this seems to happen because of the lack of training instances involving "Prof." (and we aren't supplying an abbreviation dictionary that would help with this).
You can evaluate the performance of the trained tokenizer against the development data as follows.
$ chalk cli TokenizerMEEvaluator -model eng-masc-token-tmp.bin -data dev/dev-tok.txt -lang en
Loading Tokenizer model ... done (0.205s)
Evaluating ... done
Precision: 0.9853392295127684
Recall: 0.9788269748298243
F-Measure: 0.9820723063789235
The MASC conversion utility in Chalk produces CoNLL 2003 formatted annotations, e.g.:
Isabella NNP NNP B-PER
Shae NNP NNP I-PER
, , , O
a DT DT O
girl NN NN O
from IN IN O
Mebane NNP NNP B-LOC
North NNP NNP B-LOC
Carolina NNP NNP I-LOC
. . . O
However, Chalk (currently) needs NER training data in OpenNLP format, e.g.:
<START:person> Isabella Shae <END> , a girl from <START:location> Mebane <END> <START:location> North Carolina <END> .
To convert the data, use the format converter.
$ chalk cli TokenNameFinderConverter conll03 -lang en -encoding UTF-8 -types person,location,organization -data train/train-ner.txt > train/train-ner-opennlp.txt
$ chalk cli TokenNameFinderConverter conll03 -lang en -encoding UTF-8 -types person,location,organization -data dev/dev-ner.txt > dev/dev-ner-opennlp.txt
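To make the relationship between the two formats concrete, here is a small awk sketch of the mapping. It is an illustration only, not Chalk's converter. It assumes the token is in the first column, the entity tag is in the last column, sentences are separated by blank lines, and PER, LOC, and ORG map to person, location, and organization. The name `conll_to_opennlp` is hypothetical.

```shell
# Illustrative sketch only (not the Chalk converter): turn CoNLL-style lines
# into OpenNLP-style name finder lines, one output line per sentence.
conll_to_opennlp() {
  awk '
    BEGIN { name["PER"] = "person"; name["LOC"] = "location"; name["ORG"] = "organization" }

    # Close the entity span that is currently open, if any.
    function close_span() { if (in_span) { out = out " <END>"; in_span = 0 } }

    # Print the sentence built so far (minus its leading space) and reset.
    function emit_sentence() { close_span(); if (out != "") print substr(out, 2); out = "" }

    # A blank line marks the end of a sentence.
    NF == 0 { emit_sentence(); next }

    {
      tag = $NF
      if (tag ~ /^B-/ || (tag ~ /^I-/ && !in_span)) {
        # B- always starts a new entity; so does an I- with no open span.
        close_span()
        type = substr(tag, 3)
        label = (type in name) ? name[type] : tolower(type)
        out = out " <START:" label ">"
        in_span = 1
      } else if (tag !~ /^I-/) {
        # Any non-entity tag (such as O) ends the current entity.
        close_span()
      }
      out = out " " $1
    }

    END { emit_sentence() }
  '
}

printf '%s\n' 'Mebane NNP NNP B-LOC' 'North NNP NNP B-LOC' 'Carolina NNP NNP I-LOC' '. . . O' | conll_to_opennlp
# prints: <START:location> Mebane <END> <START:location> North Carolina <END> .
```

Note how the second B-LOC closes the first span and opens a new one, which is why Mebane and North Carolina come out as separate locations.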
Now train the model.
$ chalk cli TokenNameFinderTrainer -lang en -encoding UTF-8 -data train/train-ner-opennlp.txt -model eng-masc-ner-tmp.bin
Run it on the example text. (I have removed the timing information from the output given here.)
$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt | chalk cli TokenizerME eng-masc-token-tmp.bin | chalk cli TokenNameFinder eng-masc-ner-tmp.bin
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof .
Aravind Joshi of the <START:organization> University of Pennsylvania <END> .
Aravind Joshi was born in 1929 in <START:location> Pune <END> , <START:location> India <END> , where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering , the latter in 1950 .
He worked as a research assistant in Linguistics at Penn from 1958 - 60 , while completing his Ph.D. in Electrical Engineering , in 1960 .
Joshi 's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a <START:organization> National Science Foundation Science <END> and <START:organization> Technology Center <END> for Research in Cognitive Science , which Aravind Joshi co-directed until 2001 .
Dr. Joshi has supervised thirty-six Ph.D. theses to-date , on topics including information and coding theory , and also pure linguistics .
Clearly some things can be improved! This will require some changes to the annotations and perhaps some modifications to the features, etc.
Evaluate the model.
$ chalk cli TokenNameFinderEvaluator -lang en -encoding UTF-8 -model eng-masc-ner-tmp.bin -data dev/dev-ner-opennlp.txt
Precision: 0.6302131603336423
Recall: 0.3530633437175493
F-Measure: 0.4525790349417637
This confirms that there is more work to do (which could include checking and debugging the MASC transformation code to make sure it isn't messing up).
To facilitate checking that things are working as expected while changes are made to core components, Chalk includes a main method that trains models on the MASC data and evaluates them. It assumes that you have run the above commands to transform MASC into suitable formats (including generating the OpenNLP formats for NER).
$ chalk run chalk.corpora.MascEval /tmp/chalk-masc-data/
<much output for training models>
Precision: 0.8203161320316132
Recall: 0.7937471884840306
F-Measure: 0.8068129858253316
Tokenization
Precision: 0.9853392295127684
Recall: 0.9788269748298243
F-Measure: 0.9820723063789235
NER
Precision: 0.6302131603336423
Recall: 0.3530633437175493
F-Measure: 0.4525790349417637
Once you are satisfied with the development cycle, you probably want to train models on all the available data for use in applications. We'll make this easier in the future, but here's a straightforward way to do it, e.g. for the tokenizer:
$ cat */*-tok.txt > all-tok.txt
$ chalk cli TokenizerTrainer -encoding UTF-8 -lang en -data all-tok.txt -model eng-masc-token.bin
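The same pattern should carry over to the other plain-text data files. As a sketch, the hypothetical helper `combine_splits` below generalizes the cat step: given a kind such as sent or tok, it concatenates that file from every split directory into all-<kind>.txt in the current directory.

```shell
# Hypothetical helper: concatenate one kind of data file (sent, tok, ...)
# from every split directory (train, dev, test) into all-<kind>.txt.
# Run it from the chalk-masc-data directory.
combine_splits() {
  cat ./*/*-"$1".txt > "all-$1.txt"
}
```

For the sentence detector, that would look like the following, where the model name eng-masc-sent.bin is an assumption that mirrors the tokenizer naming above:
$ combine_splits sent
$ chalk cli SentenceDetectorTrainer -encoding UTF-8 -lang en -data all-sent.txt -model eng-masc-sent.bin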