Getting Started
Open a terminal. Move to the directory you want to contain the Fieldspring directory, then clone the repository:
git clone https://github.com/utcompling/fieldspring.git
Set the environment variable FIELDSPRING_DIR to point to Fieldspring's directory, and add $FIELDSPRING_DIR/bin to your PATH.
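For a Bourne-style shell, the two settings above might look like the following sketch. The clone location $HOME/fieldspring is only an assumption for illustration; substitute the directory you actually cloned into.

```shell
# Assumption: the repository was cloned into $HOME/fieldspring; change this
# path to wherever your clone actually lives.
FIELDSPRING_DIR="$HOME/fieldspring"
export FIELDSPRING_DIR

# Prepend the bin directory so the fieldspring launcher and helper scripts
# are found before anything else on the PATH.
PATH="$FIELDSPRING_DIR/bin:$PATH"
export PATH
```

These settings last only for the current shell session; adding the same lines to your shell startup file (e.g. ~/.bashrc for bash) makes them persistent.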
Compile Fieldspring like this:
./build update compile
Move to this directory:
cd $FIELDSPRING_DIR/data/models
Then run:
./getOpenNLPModels.sh
This should download the files en-ner-location.bin, en-token.bin, and en-sent.bin.
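To confirm the download succeeded, a small check along these lines can help. It is only a sketch: check_models is a hypothetical helper written for this guide, not a script shipped with Fieldspring.

```shell
# Report any of the three expected OpenNLP model files missing from a
# directory. Prints one line per missing file and returns non-zero if any
# file is absent. (check_models is a hypothetical helper, not part of Fieldspring.)
check_models() {
    dir=$1
    missing=0
    for f in en-ner-location.bin en-token.bin en-sent.bin; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_models "$FIELDSPRING_DIR/data/models" prints nothing and succeeds when all three files are present.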
Run the script called download-geonames.sh (in $FIELDSPRING_DIR/bin). This will put the correct version of the GeoNames gazetteer (a file called allCountries.zip) into $FIELDSPRING_DIR/data/gazetteers. It is important that you use this method to get GeoNames, as even slightly different versions will cause results to change.
Once you've obtained the correct allCountries.zip, import the gazetteer for use with Fieldspring by running this from FIELDSPRING_DIR:
fieldspring --memory 8g import-gazetteer -i data/gazetteers/allCountries.zip -o geonames-1dpc.ser.gz -dkm
You should have a directory (we'll call it /path/to/trconll/xml/) containing the TR-CoNLL corpus in XML format, with the subdirectories dev/ and test/ for each split. To import the test portion to be used with Fieldspring, run this from FIELDSPRING_DIR:
fieldspring --memory 8g import-corpus -i /path/to/trconll/xml/test/ -cf tr -gt -sg geonames-1dpc.ser.gz -sco trftest-gt-g1dpc.ser.gz
You should see output that includes this:
Number of word tokens: 67572
Number of word types: 11241
Number of toponym tokens: 1903
Number of toponym types: 440
Average ambiguity (locations per toponym): 13.68891224382554
Maximum ambiguity (locations per toponym): 857
Serializing corpus to trftest-gt-g1dpc.ser.gz ...done.
This gives you the version of the corpus with gold toponym identifications. Run this to get the version with NER-identified toponyms:
fieldspring --memory 8g import-corpus -i /path/to/trconll/xml/test/ -cf tr -sg geonames-1dpc.ser.gz -sco trftest-ner-g1dpc.ser.gz
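Because a different gazetteer version changes the results, it can be worth checking the statistics printed by the gold-toponym import against the counts listed above. The sketch below assumes you saved that command's output to a log file (for example by piping it through tee); check_import_stats and the log file name are hypothetical. The average-ambiguity line is left out to avoid comparing floating-point text.

```shell
# Compare a saved import log against the integer counts this guide lists for
# the gold-toponym TR-CoNLL test split. Prints each expected line that is
# absent and returns non-zero if any is missing.
# (check_import_stats is a hypothetical helper, not part of Fieldspring.)
check_import_stats() {
    log=$1
    status=0
    while IFS= read -r expected; do
        if ! grep -qF -- "$expected" "$log"; then
            echo "not found: $expected"
            status=1
        fi
    done <<'EOF'
Number of word tokens: 67572
Number of word types: 11241
Number of toponym tokens: 1903
Number of toponym types: 440
Maximum ambiguity (locations per toponym): 857
EOF
    return "$status"
}
```

With the import output saved as, say, import-gt.log, running check_import_stats import-gt.log prints nothing when every count matches.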
Throughout this guide, it is important that you use the same filenames as those shown (e.g. trftest-gt-g1dpc.ser.gz) so that the scripts that run the experiments can find them.
Download and unpack the original Perseus 19th Century American corpus found here: http://www.perseus.tufts.edu/hopper/opensource/downloads/texts/hopper-texts-AmericanHistory.tar.gz
Download the KML file containing the locations annotations in this dataset here: http://dsl.richmond.edu/emancipation/data-download/
Run the following script (from FIELDSPRING_DIR/bin) to combine the original corpus with the annotations:
prepare-cwar.sh /path/to/original/cwar/xml/ /path/to/reviseddyer20120320.kml $FIELDSPRING_DIR/geonames-1dpc.ser.gz /path/to/cwar/xml
Once you have the CWar corpus in the correct format in a directory (we'll call it /path/to/cwar/xml/) with subdirectories dev/ and test/ for each split, import the test portion by running this from FIELDSPRING_DIR:
fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test -cf tr -gt -sg geonames-1dpc.ser.gz -sco cwartest-gt-g1dpc-20spd.ser.gz -spd 20
This gives you the version of the corpus with gold toponym identifications. Run this to get the version with NER-identified toponyms:
fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test -cf tr -sg geonames-1dpc.ser.gz -sco cwartest-ner-g1dpc-20spd.ser.gz -spd 20
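Since the experiment scripts depend on the exact filenames used above, a quick inventory of the four serialized test corpora can catch a typo early. This is a sketch under the assumption that the import commands were run from FIELDSPRING_DIR, so the files were written there; check_serialized_corpora is a hypothetical helper.

```shell
# List which of the serialized test corpora named in this guide are present
# in a directory (default: the current directory). Returns non-zero if any
# is missing. (check_serialized_corpora is a hypothetical helper.)
check_serialized_corpora() {
    dir=${1:-.}
    missing=0
    for f in trftest-gt-g1dpc.ser.gz trftest-ner-g1dpc.ser.gz \
             cwartest-gt-g1dpc-20spd.ser.gz cwartest-ner-g1dpc-20spd.ser.gz; do
        if [ -f "$dir/$f" ]; then
            echo "ok      $f"
        else
            echo "missing $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as check_serialized_corpora "$FIELDSPRING_DIR" after finishing the imports you need; if you only work with one corpus, the other corpus's files will be reported as missing, which is expected.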
To get the right Wikipedia data for the experiments below, run the download-wiki-data.sh script from FIELDSPRING_DIR.
(This section can be skipped if one simply uses the classifiers already included in the download above.)
For the WISTR training instances relevant to the test split of TR-CoNLL, run the following from FIELDSPRING_DIR:
fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRFeatureExtractor -w /path/to/filtered-geo-text-training.txt -c /path/to/enwiki-20130102-permuted-training-unigram-counts.txt.bz2 -i /path/to/trconll/xml/test/ -g geonames-1dpc.ser.gz -s src/main/resources/data/eng/stopwords.txt -d /path/to/suptr-models-trtest/
Here, /path/to/suptr-models-trtest/ is the path to the directory where the training instances will be written.
To train the models given the training instances, run this from FIELDSPRING_DIR:
fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRMaxentModelTrainer /path/to/suptr-models-trtest/
To run the experiments, run the following script from FIELDSPRING_DIR:
runexps.sh tr test gt /path/to/trconll/xml
The script takes four arguments:
- Which corpus to evaluate on (either "tr" or "cwar")
- Which split to evaluate on (either "dev" or "test")
- Which toponym identification method to use (either "gt" for gold toponyms or "ner" for toponyms detected by a named entity recognizer)
- The path to the directory containing the prepared corpus in XML format, which contains dev/ and test/ subdirectories.
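The four arguments can be sanity-checked before a long run with a small wrapper like the one sketched here. validate_runexps_args is a hypothetical helper, not part of Fieldspring; it only checks the values listed above and that the chosen split's subdirectory exists.

```shell
# Validate the four runexps.sh arguments described above: corpus, split,
# toponym identification method, and corpus directory.
# (validate_runexps_args is a hypothetical helper, not part of Fieldspring.)
validate_runexps_args() {
    if [ "$#" -ne 4 ]; then
        echo "expected 4 arguments, got $#" >&2
        return 1
    fi
    case $1 in
        tr|cwar) ;;
        *) echo "corpus must be tr or cwar" >&2; return 1 ;;
    esac
    case $2 in
        dev|test) ;;
        *) echo "split must be dev or test" >&2; return 1 ;;
    esac
    case $3 in
        gt|ner) ;;
        *) echo "toponym identification must be gt or ner" >&2; return 1 ;;
    esac
    # The corpus directory must contain a subdirectory for the chosen split.
    if [ ! -d "$4/$2" ]; then
        echo "no $2/ subdirectory under $4" >&2
        return 1
    fi
    return 0
}
```

A typical use is validate_runexps_args tr test gt /path/to/trconll/xml && runexps.sh tr test gt /path/to/trconll/xml, so that the experiments start only when the arguments look right.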
If all of the files from the previous section are in place, this should output something like this:
\oracle & 104.57995807879772 & 19.828158539411007 & 1.0
\rand & 3914.634055985425 & 1412.4048552451488 & 0.3348197696023783
\population & 216.14728454090616 & 23.103466226857382 & 0.8099219620958752
\spider & 2689.7013998421176 & 982.4361524584441 & 0.49182460052025273
\tripdl & 1494.1413395381906 & 29.258599245838536 & 0.6198439241917503
\wistr & 279.05523246633146 & 22.579446357728344 & 0.8232998885172799
\wistr+\spider & 430.17546527897343 & 23.103466226857382 & 0.8182831661092531
\trawl & 235.41656899283578 & 22.579446357728344 & 0.81438127090301
\trawl+\spider & 297.1435368979808 & 23.103466226857382 & 0.806577480490524
The columns shown are mean error in kilometers, median error in kilometers, and accuracy. If "ner" is used as the toponym identification method, the three columns are instead precision, recall, and F-score. The format is meant to be pasted into a LaTeX file with minimal additional markup (e.g. "\\" at the end of each line if the results table you are building has no other columns).
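Adding the LaTeX row terminator by hand is easy to get wrong, so a filter such as the following sketch may be convenient. to_latex_rows is a hypothetical helper; it assumes result rows are exactly the lines containing " & " and drops every other line it reads.

```shell
# Read experiment output on standard input, keep only lines that look like
# result rows (they contain " & "), and append the LaTeX row terminator
# (a space followed by two backslashes) to each.
# (to_latex_rows is a hypothetical helper, not part of Fieldspring.)
to_latex_rows() {
    sed -n '/ & / s/$/ \\\\/p'
}
```

For instance, runexps.sh tr test gt /path/to/trconll/xml | to_latex_rows > results.tex would write rows ready to paste inside a tabular environment; results.tex is just an example file name.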
To work with your own data (either a single .txt file or a directory of them), run the following:
fieldspring --memory 2g import-corpus -i /path/to/your/data -cf p -sg geonames-1dpc.ser.gz -sco corpus-name.ser.gz
This will take some time, since it uses a named entity recognizer to identify toponyms. You may also need to increase the amount of memory if your corpus is large.
You can now run various toponym resolvers on the serialized corpus (corpus-name.ser.gz) and output the result in a few different ways.
To run the population baseline, execute this:
fieldspring --memory 2g resolve -sci corpus-name.ser.gz -r population -o corpus-name-pop-resolved.xml -ok corpus-name-pop-resolved.kml -sco corpus-name-pop-resolved.ser.gz
This will resolve your corpus with the population baseline and output the result in TR-CoNLL format (XML) to wherever the -o flag points, a Google Earth readable file (KML) to wherever the -ok flag points, and a serialized resolved corpus to wherever the -sco flag points.
You can visualize the resolved serialized file with Fieldspring's own visualizer by executing this:
fieldspring --memory 8g viz corpus-name-pop-resolved.ser.gz
To run other resolvers, change what the -r flag is set to. Investigating runexps.sh in a text editor should help give you an idea of the resolvers currently supported. Many of them require additional data via additional flags.
Here are some of the values you can put after -r in order to run various resolvers (in many cases abbreviations like 'pop' for population also work):
random
population
bmd (BasicMinDistance)
wmd (WeightedMinDistance, aka SPIDER)
maxent (Maximum Entropy, aka WISTR)
prob (Probabilistic, aka TRAWL)
constructiontpp (ConLAC)
acotpp (TRACO)
Finding where these are invoked in runexps.sh will give you an idea of what you'll need to run them yourself.
Adding the -oracle flag to any resolve command (I recommend random for speed) will use the oracle resolver.
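When trying several resolvers, it helps to keep the three output files named consistently. The sketch below only prints a resolve command, following the naming pattern of the population example; it does not run anything. resolve_cmd, its tag argument, and the "oracle" tag are hypothetical, and placing -oracle directly after the resolver name is an assumption about flag order.

```shell
# Print (without running) a resolve command. Arguments: the corpus base name,
# a short tag used in the output file names, then any resolver flags to pass
# through unchanged (e.g. -r population, or -r random -oracle).
# (resolve_cmd is a hypothetical helper, not part of Fieldspring.)
resolve_cmd() {
    corpus=$1
    tag=$2
    shift 2
    out="$corpus-$tag-resolved"
    printf '%s\n' "fieldspring --memory 2g resolve -sci $corpus.ser.gz $* -o $out.xml -ok $out.kml -sco $out.ser.gz"
}
```

Calling resolve_cmd corpus-name pop -r population reproduces the population command shown earlier, and resolve_cmd corpus-name oracle -r random -oracle prints the oracle variant. Resolvers that need additional data still require the extra flags found in runexps.sh.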