initial commit

UW-xDD · Dec 17, 2016 · 06c4508 · 06c4508
commit 06c4508
Showing 24 changed files with 2,729 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,4 @@
+.DS_Store
+
+credentials
+*.swp
diff --git a/README.md b/README.md
@@ -0,0 +1,91 @@
+# GeoDeepDive Application Template
+A template for building applications for [GeoDeepDive](https://geodeepdive.org)
+
+## Getting started
+Dependencies:
+  + [GNU Make](https://www.gnu.org/software/make/)
+  + [git](https://git-scm.com/)
+  + [pip](https://pypi.python.org/pypi/pip)
+  + [PostgreSQL](http://www.postgresql.org/)
+
+### OS X
+OS X ships with GNU Make, `git`, and Python, but you will need to install `pip` and PostgreSQL.
+
+To install `pip`:
+````
+sudo easy_install pip
+````
+
+To install PostgreSQL, we recommend [Postgres.app](http://postgresapp.com/). Download
+the most recent version and follow [the instructions](http://postgresapp.com/documentation/cli-tools.html)
+for setting up the command line tools, in particular adding the following line to your `~/.bash_profile`:
+
+````
+export PATH=$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin
+````
+
+
+### Setting up the project
+First, clone this repository and run the setup script:
+
+````
+git clone https://github.com/UW-DeepDiveInfrastructure/app-template
+cd app-template
+make
+````
+
+Edit `credentials` with the connection credentials for your local Postgres database.
+
+To create a database with the data included in `/setup/usgs_example`:
+
+````
+make local_setup
+````
+
+To run an example, run `python run.py`.
+
+## Running on GeoDeepDive Infrastructure
+All applications are required to have the same structure as this repository, namely an empty folder named `output`, a valid
+`config` file, an updated `requirements.txt` describing any Python dependencies, and `run.py` which runs the application
+and outputs results. The `credentials` file will be ignored and substituted with a unique version at run time.
+
+The GeoDeepDive infrastructure will have the following software available:
+  + Python 2.7+ (Python 3.x not supported at this time)
+  + PostgreSQL 9.4+, including command line tools and PostGIS
+
+#### Submitting a config file
+The `config` file lists the terms OR dictionaries that you want to use to subset the corpus. Once you have
+updated this file, a private repository will be created for you under the UW-DeepDiveInfrastructure GitHub group, and you
+can push the code from this repository to it. Your `config` file will be used to generate a custom testing subset of documents that
+you can use to develop your application.
+
+#### Running the application
+Once you have developed your application and tested it against the corpus subset, simply push your application to the
+private repository created in the previous step. The application will then be run according to the parameters set in the
+`config` file.
+
+#### Getting results
+After the application is run, the contents of the `output` folder will be gzipped and made available to download. If
+your application did not run successfully, any errors thrown will be logged to the file
+`errors.txt`, which is included in the gzipped results package.
+
+## File Summary
+
+#### config
+A YAML file that contains project settings.
+
+
+#### credentials
+A YAML file that contains local postgres credentials for testing and generating examples.
+
+
+#### requirements.txt
+List of Python dependencies to be installed by `pip`.
+
+
+#### run.py
+Python script that runs the entire application, including any setup tasks and exporting of results to the folder `/output`.
+
+
+## License
+CC-BY 4.0 International
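The layout the README requires (an empty `output` folder, a `config` file, `requirements.txt`, and `run.py`) can be checked locally before pushing. A minimal sketch, assuming the README's list is complete: `check_layout` is a hypothetical helper, not part of the template, and it treats the tracked `output/.gitignore` as allowed inside the otherwise empty folder.

```python
import os

# Files the README says every application must provide.
REQUIRED_FILES = ("config", "requirements.txt", "run.py")


def check_layout(root):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(root, name)):
            problems.append("missing file: " + name)

    output_dir = os.path.join(root, "output")
    if not os.path.isdir(output_dir):
        problems.append("missing folder: output")
    else:
        # The template tracks output/.gitignore, so it is ignored here.
        leftovers = sorted(f for f in os.listdir(output_dir) if f != ".gitignore")
        if leftovers:
            problems.append("output is not empty: " + ", ".join(leftovers))
    return problems


if __name__ == "__main__":
    for problem in check_layout("."):
        print(problem)
```

Returning a list of problems instead of raising on the first one lets a single run report everything that needs fixing.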
diff --git a/config b/config
@@ -0,0 +1,17 @@
+# The name of the application (no spaces)
+app_name: strom
+
+# First and last name of the user
+user: Jon Husson
+
+# The NLP product to run the application against
+product: NLP352
+
+# How often the application should be run
+frequency: monthly
+
+# A list of terms used to subset the corpus
+terms: [stromatolite, stromatolitic, thrombolite, thrombolitic]
+
+# Stored dictionary of terms, to be set by GDD infrastructure admins
+dictionary: strom
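The keys in this file can be sanity-checked once the YAML has been loaded into a dictionary. A minimal sketch: the key names come from the file above, but the rules themselves (no spaces in `app_name`, at least one of `terms` or `dictionary`) are inferred from the comments and the README and are not confirmed GeoDeepDive validation logic. `validate_config` is a hypothetical name.

```python
REQUIRED_KEYS = ("app_name", "user", "product", "frequency")


def validate_config(config):
    """Return a list of problems found in an already-parsed config dict."""
    problems = ["missing key: " + key for key in REQUIRED_KEYS if not config.get(key)]

    # The file's own comment says the application name has no spaces.
    if " " in str(config.get("app_name", "")):
        problems.append("app_name must not contain spaces")

    # The README says the corpus subset comes from terms OR a dictionary.
    if not config.get("terms") and not config.get("dictionary"):
        problems.append("provide terms or a dictionary")
    return problems


# Values copied from the config file in this commit.
example = {
    "app_name": "strom",
    "user": "Jon Husson",
    "product": "NLP352",
    "frequency": "monthly",
    "terms": ["stromatolite", "stromatolitic", "thrombolite", "thrombolitic"],
    "dictionary": "strom",
}
print(validate_config(example))
```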
diff --git a/credentials.example b/credentials.example
@@ -0,0 +1,6 @@
+postgres:
+    user: postgres_username
+    port: 5432
+    host: localhost
+    database: deepdive_app
+    password: password123
diff --git a/extractions/SQL.txt b/extractions/SQL.txt
@@ -0,0 +1,31 @@
+#==============================================================================
+# PG DUMP FOR RESULTS
+#==============================================================================
+
+pg_dump -t results -t strat_target -t strat_target_distant -t age_check -t bib -t target_adjectives DBNAME > ./output/output.sql
+
+#==============================================================================
+# CREATE (ALREADY PRESENT) DATABASE FROM DUMP
+#==============================================================================
+
+psql -d DBNAME -f ../output/output.sql
+
+#==============================================================================
+# USEFUL SQL QUERIES FOR SUMMARY RESULTS
+#==============================================================================
+
+COPY(SELECT strat_phrase_root,strat_name_id, COUNT(strat_name_id)
+	FROM results 
+	WHERE (strat_name_id<>'0' AND target_word ILIKE '%stromato%') 
+	GROUP BY strat_phrase_root, strat_name_id)
+	TO '/Users/jhusson/Box Sync/postdoc/deepdive/stroms/V2/test.csv' DELIMITER ',' CSV HEADER;
+
+#==============================================================================
+# INTERESTING STROMATOLITE ADJECTIVES
+#==============================================================================
+
+SELECT * FROM target_adjectives WHERE target_adjective ILIKE 'domal' OR
+target_adjective ILIKE 'columnar' OR
+target_adjective ILIKE 'conical' OR
+target_adjective ILIKE 'domical' OR
+target_adjective ILIKE 'domed';
diff --git a/input/url.txt b/input/url.txt
@@ -0,0 +1 @@
+deepdivesubmit.chtc.wisc.edu/static/strom_nlp_27Jan2016.zip
diff --git a/makefile b/makefile
@@ -0,0 +1,8 @@
+all:
+	cp credentials.example credentials;
+	pip install -r requirements.txt;
+
+
+
+local_setup:
+	./setup/setup.sh
diff --git a/output/.gitignore b/output/.gitignore
@@ -0,0 +1,2 @@
+*
+!.gitignore
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,6 @@
+psycopg2>=2.6.1
+pyyaml>=3.11
+tqdm>=1.0
+stop-words>=2015.2.23.1
+docopt>=0.6.1
+numpy>=1.9.2
diff --git a/run.py b/run.py
@@ -0,0 +1,78 @@
+#==============================================================================
+#RUN ALL  - STROMATOLITES
+#==============================================================================
+
+#path: /Users/jhusson/local/bin/deepdive-0.7.1/deepdive-apps/stromatolites
+
+#==============================================================================
+
+import os, time, subprocess, yaml
+
+#tic
+start_time = time.time()
+
+#load configuration file
+with open('./config', 'r') as config_yaml:
+    config = yaml.safe_load(config_yaml)
+
+#load credentials file
+with open('./credentials', 'r') as credential_yaml:
+    credentials = yaml.safe_load(credential_yaml)
+
+
+#ensure working directory is proper
+#os.chdir("/Users/jhusson/local/bin/deepdive-0.7.1/deepdive-apps/stromatolites")
+
+#INITIALIZE THE POSTGRES TABLES
+print 'Step 1: Initialize the PSQL tables ...'
+subprocess.call('./setup/setup.sh', shell=True)
+os.system('python ./udf/initdb.py')
+
+#BUILD THE BIBLIOGRAPHY
+print 'Step 2: Build the bibliography ...'
+os.system('python ./udf/buildbib.py')
+
+#FIND TARGET INSTANCES
+print 'Step 3: Find stromatolite instances ...'
+os.system('python ./udf/ext_target.py')
+
+#FIND STRATIGRAPHIC ENTITIES
+print 'Step 4: Find stratigraphic entities ...'
+os.system('python ./udf/ext_strat_phrases.py')
+
+#FIND STRATIGRAPHIC MENTIONS
+print 'Step 5: Find stratigraphic mentions ...'
+os.system('python ./udf/ext_strat_mentions.py')
+
+#CHECK AGE - UNIT MATCH AGREEMENT
+print 'Step 6: Check age - unit match agreement ...'
+os.system('python ./udf/ext_age_check.py')
+
+#DEFINE RELATIONSHIPS BETWEEN TARGET AND STRATIGRAPHIC NAMES
+print 'Step 7: Define the relationships between stromatolite phrases and stratigraphic entities/mentions ...'
+os.system('python ./udf/ext_strat_target.py')
+
+#DEFINE RELATIONSHIPS BETWEEN TARGET AND DISTANT STRATIGRAPHIC NAMES
+print 'Step 8: Define the relationships between stromatolite phrases and distant stratigraphic entities/mentions ...'
+os.system('python ./udf/ext_strat_target_distant.py')
+
+#DELINEATE REFERENCE SECTION FROM MAIN BODY EXTRACTIONS
+print 'Step 9: Delineate reference section from main body extractions ...'
+os.system('python ./udf/ext_references.py')
+
+#BUILD A BEST RESULTS TABLE OF STROM-STRAT_NAME TUPLES
+print 'Step 10: Build a best results table of strom-strat_name tuples ...'
+os.system('python ./udf/ext_results.py')
+
+#FIND ADJECTIVES DESCRIBING STROM
+print 'Step 11: Find adjectives describing strom target words ...'
+os.system('python ./udf/ext_target_adjective.py')
+
+#POSTGRES DUMP
+print 'Step 12: Dump select results from PSQL ...'
+output = 'pg_dump -U '+ credentials['postgres']['user'] + ' -t results -t strat_target -t strat_target_distant -t age_check -t refs_location -t bib -t target_adjectives -d ' + credentials['postgres']['database'] + ' > ./output/output.sql'
+subprocess.call(output, shell=True)
+
+#summary of performance time
+elapsed_time = time.time() - start_time
+print '\n ###########\n\n elapsed time: %d seconds\n\n ###########\n\n' %(elapsed_time)
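Step 12 builds its `pg_dump` command by string concatenation and runs it through a shell. The sketch below expresses the same dump as an argument list, which avoids quoting problems if a user or database name contains spaces or shell metacharacters. `build_pg_dump_args` is a hypothetical helper; `-U`, `-h`, `-p`, `-t`, `-d`, and `-f` are standard `pg_dump` options, and passing the host and port is an addition that the original command does not make.

```python
import subprocess

# Tables exported by step 12 of run.py.
TABLES = ["results", "strat_target", "strat_target_distant", "age_check",
          "refs_location", "bib", "target_adjectives"]


def build_pg_dump_args(pg, tables, outfile="./output/output.sql"):
    """Build a pg_dump argument list from the 'postgres' credentials block."""
    args = ["pg_dump", "-U", str(pg["user"]),
            "-h", str(pg["host"]), "-p", str(pg["port"])]
    for table in tables:
        args += ["-t", table]
    args += ["-d", str(pg["database"]), "-f", outfile]
    return args


if __name__ == "__main__":
    # Values mirror credentials.example; replace with the loaded credentials.
    pg = {"user": "postgres_username", "host": "localhost",
          "port": 5432, "database": "deepdive_app"}
    print(" ".join(build_pg_dump_args(pg, TABLES)))
    # To run the dump and fail loudly on a non-zero exit code:
    # subprocess.check_call(build_pg_dump_args(pg, TABLES))
```

Using `subprocess.check_call` rather than `os.system` also surfaces failures, since `os.system` only returns the exit status and the script above never inspects it.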
diff --git a/setup/setup.sh b/setup/setup.sh
@@ -0,0 +1,53 @@
+#!/bin/bash
+
+# via http://stackoverflow.com/a/21189044/1956065
+function parse_yaml {
+   local prefix=$2
+   local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
+   sed -ne "s|^\($s\):|\1|" \
+        -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
+        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p"  $1 |
+   awk -F$fs '{
+      indent = length($1)/2;
+      vname[indent] = $2;
+      for (i in vname) {if (i > indent) {delete vname[i]}}
+      if (length($3) > 0) {
+         vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
+         printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3);
+      }
+   }'
+}
+
+eval $(parse_yaml credentials)
+eval $(parse_yaml config)
+
+export PGPASSWORD=$postgres__password
+
+pwd=$(pwd)
+
+# Create the database - if it exists an error will be thrown which can be ignored
+createdb $postgres__database -h $postgres__host -U $postgres__user -p $postgres__port
+
+# Vanilla NLP
+echo "DROP TABLE IF EXISTS ${app_name}_sentences_nlp; CREATE TABLE ${app_name}_sentences_nlp (docid text, sentid integer, wordidx integer[], words text[], poses text[], ners text[], lemmas text[], dep_paths text[], dep_parents integer[], font text[], layout text[]);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+
+echo "CREATE INDEX ON ${app_name}_sentences_nlp (docid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+echo "CREATE INDEX ON ${app_name}_sentences_nlp (sentid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+
+echo "COPY ${app_name}_sentences_nlp FROM '$pwd/input/sentences_nlp'" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+
+# NLP352
+echo "DROP TABLE IF EXISTS ${app_name}_sentences_nlp352; CREATE TABLE ${app_name}_sentences_nlp352 (docid text, sentid integer, wordidx integer[], words text[], poses text[], ners text[], lemmas text[], dep_paths text[], dep_parents integer[]);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+
+echo "CREATE INDEX ON ${app_name}_sentences_nlp352 (docid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+echo "CREATE INDEX ON ${app_name}_sentences_nlp352 (sentid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+
+echo "COPY ${app_name}_sentences_nlp352 FROM '$pwd/input/sentences_nlp352'" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+
+# NLP352 Bazaar
+echo "DROP TABLE IF EXISTS ${app_name}_sentences_nlp352_bazaar; CREATE TABLE ${app_name}_sentences_nlp352_bazaar (docid text, sentid integer, sentence text, words text[], lemmas text[], poses text[], ners text[], character_position integer[], dep_paths text[], dep_parents integer[]);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+
+echo "CREATE INDEX ON ${app_name}_sentences_nlp352_bazaar (docid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+echo "CREATE INDEX ON ${app_name}_sentences_nlp352_bazaar (sentid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
+
+echo "COPY ${app_name}_sentences_nlp352_bazaar FROM '$pwd/input/sentences_nlp352_bazaar'" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
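The double underscore in names such as `$postgres__user` follows from how `parse_yaml` computes nesting depth: it divides the indentation width by two, so the four-space indent in `credentials.example` counts as depth 2 and leaves an empty name at depth 1. The Python sketch below mimics that flattening for simple `key: value` lines. It illustrates the naming rule only and is not a port of the `sed`/`awk` pipeline; `flatten` is a hypothetical name.

```python
def flatten(lines):
    """Mimic parse_yaml's naming: join ancestor keys with '_' by depth."""
    names = {}   # depth -> key most recently seen at that depth
    result = {}
    for line in lines:
        if ":" not in line or line.lstrip().startswith("#"):
            continue
        key, value = line.split(":", 1)
        # Depth is the indentation width divided by two, as in the awk script.
        depth = (len(key) - len(key.lstrip())) // 2
        key, value = key.strip(), value.strip().strip("\"'")
        names[depth] = key
        # Forget keys from deeper levels, as the awk script does.
        for d in [d for d in names if d > depth]:
            del names[d]
        if value:
            prefix = "".join(names.get(d, "") + "_" for d in range(depth))
            result[prefix + key] = value
    return result


credentials = [
    "postgres:",
    "    user: postgres_username",
    "    port: 5432",
]
print(flatten(credentials))
```

With two-space indentation the same file would produce single-underscore names such as `postgres_user`, so the variable names used in `setup.sh` depend on keeping the four-space indent in `credentials`.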