Getting Started
Open a terminal. Move to the directory you want to contain the Fieldspring directory, then clone the repository:
git clone https://github.com/utcompling/fieldspring.git
Set the environment variable FIELDSPRING_DIR to point to Fieldspring's directory, and add FIELDSPRING_DIR/bin to your PATH.
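In bash, that setup looks like this (the clone location ~/fieldspring is just an example; use wherever you cloned the repository):

```shell
# Example only: adjust to the directory where you cloned Fieldspring
export FIELDSPRING_DIR="$HOME/fieldspring"
# Put the fieldspring launcher script on your PATH
export PATH="$FIELDSPRING_DIR/bin:$PATH"
```

Add these lines to your shell startup file (e.g. ~/.bashrc) so they persist across sessions.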
Compile Fieldspring like this:
./build update compile
Download the version of the GeoNames gazetteer we used from the following location:
ADD URL HERE
Once you've obtained the correct allCountries.zip, import the gazetteer for use with Fieldspring by running this from FIELDSPRING_DIR:
fieldspring --memory 8g import-gazetteer -i data/gazetteers/allCountries.zip -o geonames-1dpc.ser.gz -dkm
Download the TR-CoNLL corpus from the following location:
ADD URL HERE
ADD INSTRUCTIONS FOR SPLITTING INTO DEV AND TEST
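However you perform the split, the resulting layout should match the following sketch (the /tmp path here is purely illustrative; use your own corpus location):

```shell
# Illustrative only: a corpus root with one subdirectory per split
CORPUS=/tmp/trconll/xml
mkdir -p "$CORPUS/dev" "$CORPUS/test"
ls "$CORPUS"
```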
Now you should have a directory (we'll call it /path/to/trconll/xml/) containing the TR-CoNLL corpus in XML format, with the subdirectories dev/ and test/ for each split. To import the test portion to be used with Fieldspring, run this from FIELDSPRING_DIR:
fieldspring --memory 8g import-corpus -i /path/to/trconll/xml/test/ -cf tr -gt -sg geonames-1dpc.ser.gz -sco trftest-gt-g1dpc.ser.gz
You should see output that includes this:
Number of word tokens: 67572
Number of word types: 11241
Number of toponym tokens: 1903
Number of toponym types: 440
Average ambiguity (locations per toponym): 13.68891224382554
Maximum ambiguity (locations per toponym): 857
Serializing corpus to trftest-gt-g1dpc.ser.gz ...done.
This will give you the version of the corpus with gold toponym identifications. Run this to get the version with NER-identified toponyms:
fieldspring --memory 8g import-corpus -i /path/to/trconll/xml/test/ -cf tr -sg geonames-1dpc.ser.gz -sco trftest-ner-g1dpc.ser.gz
Download and unpack the original Perseus 19th Century American corpus found here: http://www.perseus.tufts.edu/hopper/opensource/downloads/texts/hopper-texts-AmericanHistory.tar.gz
ADD HOW TO CONVERT IT TO THE RIGHT XML FORMAT, GIVEN THE KML FILE, HERE
Once you have the CWar corpus in the correct format in a directory (we'll call it /path/to/cwar/xml/) with subdirectories dev/ and test/ for each split, import the test portion by running this from FIELDSPRING_DIR:
fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test/ -cf tr -gt -sg geonames-1dpc.ser.gz -sco cwartest-gt-g1dpc-20spd.ser.gz -spd 20
This will give you the version of the corpus with gold toponym identifications. Run this to get the version with NER-identified toponyms:
fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test/ -cf tr -sg geonames-1dpc.ser.gz -sco cwartest-ner-g1dpc-20spd.ser.gz -spd 20
SAY WHERE TO DOWNLOAD enwiki-20130102-pages-articles.xml.bz2
SAY HOW TO RUN BEN'S PREPROC SCRIPT ON IT
SAY HOW TO RUN FilterGeotaggedWiki
For the WISTR training instances relevant to the test split of TR-CoNLL, run the following from FIELDSPRING_DIR:
fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRFeatureExtractor -w /path/to/filtered-geo-text-training.txt -c /path/to/enwiki-20130102-permuted-training-unigram-counts.txt.bz2 -i /path/to/trconll/xml/test/ -g geonames-1dpc.ser.gz -s src/main/resources/data/eng/stopwords.txt -d /path/to/suptr-models-trtest/
Here, /path/to/suptr-models-trtest/ is the directory where the training instances will be written.
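Whether the extractor creates the output directory itself is not documented here, so it is safest to create it before running the command (the path below is illustrative; in practice use the same path you pass with -d):

```shell
# Illustrative location for the training-instance output directory
mkdir -p /tmp/suptr-models-trtest
```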
To train the models given the training instances, run this from FIELDSPRING_DIR:
fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRMaxentModelTrainer /path/to/suptr-models-trtest/
SAY HOW TO RUN geolocate-document
SAY HOW TO RUN THE SHELL FILE, AND HOW TO READ IT