speriosu edited this page Apr 26, 2013 · 48 revisions

Getting and compiling the code

Open a terminal. Move to the directory you want to contain the Fieldspring directory, then clone the repository:

git clone https://github.com/utcompling/fieldspring.git

Set the environment variable FIELDSPRING_DIR to point to Fieldspring's directory, and add $FIELDSPRING_DIR/bin to your PATH.
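As a minimal sketch, in a Bourne-style shell (bash, zsh, sh) the two settings could look like the following. The location /path/to/fieldspring is a placeholder for wherever you cloned the repository, not a real path.

```shell
# Placeholder: replace /path/to/fieldspring with the directory created by git clone.
export FIELDSPRING_DIR=/path/to/fieldspring

# Put Fieldspring's scripts (e.g. the fieldspring launcher) on the PATH.
export PATH="$FIELDSPRING_DIR/bin:$PATH"
```

To make the settings permanent, add the same two lines to your shell startup file (for example ~/.bashrc if you use bash).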

Compile Fieldspring by running this from FIELDSPRING_DIR:

./build update compile

Downloading OpenNLP models

Move to this directory:

cd $FIELDSPRING_DIR/data/models

Then run:

./getOpenNLPModels.sh

This should download the files en-ner-location.bin, en-token.bin, and en-sent.bin.
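If you want to confirm the download before moving on, a small check along these lines may help. The function name check_models is hypothetical and not part of Fieldspring; it only tests that the three files named above exist and are non-empty.

```shell
# check_models is a hypothetical helper, not a Fieldspring script.
# It returns 0 if all three OpenNLP model files are present and non-empty
# in the given directory, and 1 otherwise.
check_models() {
  dir="$1"
  missing=0
  for f in en-ner-location.bin en-token.bin en-sent.bin; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

You would call it as check_models "$FIELDSPRING_DIR/data/models" and re-run getOpenNLPModels.sh if it reports anything missing.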

Getting and preparing the GeoNames gazetteer

Run the script called download-geonames.sh (in $FIELDSPRING_DIR/bin). This will put the correct version of the GeoNames gazetteer (a file called allCountries.zip) into $FIELDSPRING_DIR/data/gazetteers. It is important that you use this method to get GeoNames, as even slightly different versions will cause results to change.

Once you've obtained the correct allCountries.zip, import the gazetteer for use with Fieldspring by running this from FIELDSPRING_DIR:

fieldspring --memory 8g import-gazetteer -i data/gazetteers/allCountries.zip -o geonames-1dpc.ser.gz -dkm

Importing the TR-CoNLL corpus

ADD INSTRUCTIONS FOR GENERATING THE TR-CoNLL CORPUS USING THE ORIGINAL REUTERS DATA AND A SCRIPT.

You should have a directory (we'll call it /path/to/trconll/xml/) containing the TR-CoNLL corpus in XML format, with the subdirectories dev/ and test/ for each split. To import the test portion to be used with Fieldspring, run this from FIELDSPRING_DIR:

fieldspring --memory 8g import-corpus -i /path/to/trconll/xml/test/ -cf tr -gt -sg geonames-1dpc.ser.gz -sco trftest-gt-g1dpc.ser.gz

You should see output that includes this:

Number of word tokens: 67572
Number of word types: 11241
Number of toponym tokens: 1903
Number of toponym types: 440
Average ambiguity (locations per toponym): 13.68891224382554
Maximum ambiguity (locations per toponym): 857

Serializing corpus to trftest-gt-g1dpc.ser.gz ...done.

This gives you the version of the corpus with gold toponym identifications. To get the version with NER-identified toponyms, run the same command without -gt and with a different output filename:

fieldspring --memory 8g import-corpus -i /path/to/trconll/xml/test/ -cf tr -sg geonames-1dpc.ser.gz -sco trftest-ner-g1dpc.ser.gz

Throughout this guide, use exactly the filenames shown (e.g. trftest-gt-g1dpc.ser.gz); the scripts that run the experiments expect these names.
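The two TR-CoNLL imports differ only in the -gt flag and the output filename, so they can be wrapped in one function to reduce the chance of a mistyped name. This is a sketch, not a script shipped with Fieldspring: import_trconll_test and the FIELDSPRING_CMD variable are both assumptions introduced here. FIELDSPRING_CMD exists only so the commands can be previewed without running them.

```shell
# Hypothetical wrapper around the two import-corpus commands shown above.
# FIELDSPRING_CMD is an assumption of this sketch: it defaults to fieldspring,
# and setting it to echo prints the arguments instead of running the import.
import_trconll_test() {
  xml_dir="$1"   # e.g. /path/to/trconll/xml/test/
  cmd="${FIELDSPRING_CMD:-fieldspring}"

  # Gold toponym identifications: pass -gt.
  "$cmd" --memory 8g import-corpus -i "$xml_dir" -cf tr -gt \
    -sg geonames-1dpc.ser.gz -sco trftest-gt-g1dpc.ser.gz || return 1

  # NER-identified toponyms: same command without -gt.
  "$cmd" --memory 8g import-corpus -i "$xml_dir" -cf tr \
    -sg geonames-1dpc.ser.gz -sco trftest-ner-g1dpc.ser.gz
}
```

Run it from FIELDSPRING_DIR as import_trconll_test /path/to/trconll/xml/test/. To preview the commands first, run FIELDSPRING_CMD=echo import_trconll_test /path/to/trconll/xml/test/.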

Getting and importing the CWar corpus

Download and unpack the original Perseus 19th Century American corpus found here: http://www.perseus.tufts.edu/hopper/opensource/downloads/texts/hopper-texts-AmericanHistory.tar.gz

Download the KML file containing the location annotations for this dataset here: http://dsl.richmond.edu/emancipation/data-download/

Run the following script (from $FIELDSPRING_DIR/bin) to combine the original corpus with the annotations:

prepare-cwar.sh /path/to/original/cwar/xml/ /path/to/reviseddyer20120320.kml $FIELDSPRING_DIR/geonames-1dpc.ser.gz /path/to/cwar/xml

Once you have the CWar corpus in the correct format in a directory (we'll call it /path/to/cwar/xml/) with subdirectories dev/ and test/ for each split, import the test portion by running this from FIELDSPRING_DIR:

fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test/ -cf tr -gt -sg geonames-1dpc.ser.gz -sco cwartest-gt-g1dpc-20spd.ser.gz -spd 20

This gives you the version of the corpus with gold toponym identifications. To get the version with NER-identified toponyms, run the same command without -gt and with a different output filename:

fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test/ -cf tr -sg geonames-1dpc.ser.gz -sco cwartest-ner-g1dpc-20spd.ser.gz -spd 20

Getting and preprocessing the Wikipedia dump

SAY WHERE TO DOWNLOAD enwiki-20130102-pages-articles.xml.bz2

SAY HOW TO RUN BEN'S PREPROC SCRIPT ON IT

SAY HOW TO RUN FilterGeotaggedWiki

Extracting the WISTR training instances and training the classifiers

For the WISTR training instances relevant to the test split of TR-CoNLL, run the following from FIELDSPRING_DIR:

fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRFeatureExtractor -w /path/to/filtered-geo-text-training.txt -c /path/to/enwiki-20130102-permuted-training-unigram-counts.txt.bz2 -i /path/to/trconll/xml/test/ -g geonames-1dpc.ser.gz -s src/main/resources/data/eng/stopwords.txt -d /path/to/suptr-models-trtest/

Here, /path/to/suptr-models-trtest/ is the directory where the training instances will be written.
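The feature extraction command reads several input files and asks for 30g of memory, so it may be worth checking that every input exists before launching it. The helper below is a hypothetical sketch, not part of Fieldspring; it only reports which of the given paths are missing.

```shell
# require_paths is a hypothetical helper, not a Fieldspring script.
# It returns 0 if every path given as an argument exists, and 1 otherwise,
# printing each missing path to standard error.
require_paths() {
  status=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "not found: $p" >&2
      status=1
    fi
  done
  return "$status"
}
```

From FIELDSPRING_DIR, you could call it with the inputs of the command above, for example require_paths /path/to/filtered-geo-text-training.txt /path/to/trconll/xml/test/ geonames-1dpc.ser.gz src/main/resources/data/eng/stopwords.txt, and run the extractor only if it succeeds.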

To train the models given the training instances, run this from FIELDSPRING_DIR:

fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRMaxentModelTrainer /path/to/suptr-models-trtest/

Running the document geolocation module

SAY HOW TO SET UP THE DIRECTORY FOR geolocate-document

SAY HOW TO RUN geolocate-document

Running the experiments to get the results

SAY HOW TO RUN THE SHELL FILE, AND HOW TO READ IT
