
Commit

initial commit
jonhusson committed Dec 17, 2016
0 parents commit 06c4508
Showing 24 changed files with 2,729 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
.DS_Store

credentials
*.swp
91 changes: 91 additions & 0 deletions README.md
@@ -0,0 +1,91 @@
# GeoDeepDive Application Template
A template for building applications for [GeoDeepDive](https://geodeepdive.org)

## Getting started
Dependencies:
+ [GNU Make](https://www.gnu.org/software/make/)
+ [git](https://git-scm.com/)
+ [pip](https://pypi.python.org/pypi/pip)
+ [PostgreSQL](http://www.postgresql.org/)

### OS X
OS X ships with GNU Make, `git`, and Python, but you will need to install `pip` and PostgreSQL.

To install `pip`:
````
sudo easy_install pip
````

To install PostgreSQL, [Postgres.app](http://postgresapp.com/) is recommended. Download
the most recent version, and be sure to follow [the instructions](http://postgresapp.com/documentation/cli-tools.html)
for setting up the command line tools, which primarily means adding the following line to your `~/.bash_profile`:

````
export PATH=$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin
````
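
Then reload your shell so the change takes effect:

````
source ~/.bash_profile
````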


### Setting up the project
First, clone this repository and run the setup script:

````
git clone https://github.com/UW-DeepDiveInfrastructure/app-template
cd app-template
make
````

Edit `credentials` with the connection credentials for your local Postgres database.
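
The file follows the shape of `credentials.example`; a local setup might look like (values illustrative):

````
postgres:
    user: your_username
    port: 5432
    host: localhost
    database: deepdive_app
    password: your_password
````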

To create a database with the data included in `/setup/usgs_example`:

````
make local_setup
````

To run an example, run `python run.py`.

## Running on GeoDeepDive Infrastructure
All applications are required to have the same structure as this repository, namely: an empty folder named `output`, a valid
`config` file, an updated `requirements.txt` listing any Python dependencies, and a `run.py` that runs the application
and writes results, as sketched below. The `credentials` file will be ignored and substituted with a unique version at run time.
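
An illustrative minimal layout:

````
app-template/
├── config            # project settings (YAML)
├── credentials       # replaced at run time
├── requirements.txt  # Python dependencies
├── run.py            # entry point; writes results to output/
└── output/           # empty folder for results
````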

The GeoDeepDive infrastructure will have the following software available:
+ Python 2.7+ (Python 3.x not supported at this time)
+ PostgreSQL 9.4+, including command line tools and PostGIS

#### Submitting a config file
The `config` file outlines a list of terms OR dictionaries that you are interested in extracting from the corpus. Once you have
updated this file, a private repository will be set up for you under the UW-DeepDiveInfrastructure GitHub group, to which you can
push the code from this repository. Your `config` file will be used to generate a custom testing subset of documents that
you can use to develop your application.

#### Running the application
Once you have developed your application and tested it against the corpus subset, simply push your application to the
private repository created in the previous step. The application will then be run according to the parameters set in the
`config` file.

#### Getting results
After the application is run, the contents of the `output` folder will be gzipped and made available for download. If
your application did not run successfully, any errors thrown will be logged to the file
`errors.txt`, which is included in the gzipped results package.
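
For example, if the downloaded package is a gzipped tarball (file name illustrative):

````
tar -xzf results.tar.gz
cat errors.txt
````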

## File Summary

#### config
A YAML file that contains project settings.


#### credentials
A YAML file that contains local postgres credentials for testing and generating examples.


#### requirements.txt
List of Python dependencies to be installed by `pip`.


#### run.py
Python script that runs the entire application, including any setup tasks and exporting of results to the folder `/output`.
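
To illustrate how an application script reads the local setup, here is a minimal sketch (assumptions: a database created by `make local_setup`, and the `strom_sentences_nlp352` table defined in `setup/setup.sh`):

````
import psycopg2, yaml

#connect with the same credentials file run.py reads
with open('./credentials', 'r') as credential_yaml:
    credentials = yaml.load(credential_yaml)

conn = psycopg2.connect(
    dbname=credentials['postgres']['database'],
    user=credentials['postgres']['user'],
    password=credentials['postgres']['password'],
    host=credentials['postgres']['host'],
    port=credentials['postgres']['port'])

#peek at one NLP-processed sentence
cur = conn.cursor()
cur.execute("SELECT docid, sentid, words FROM strom_sentences_nlp352 LIMIT 1")
print cur.fetchone()
conn.close()
````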


## License
CC-BY 4.0 International
17 changes: 17 additions & 0 deletions config
@@ -0,0 +1,17 @@
# The name of the application (no spaces)
app_name: strom

# First and last name of the user
user: Jon Husson

# The NLP product to run the application against
product: NLP352

# How often the application should be run
frequency: monthly

# A list of terms used to subset the corpus
terms: [stromatolite, stromatolitic, thrombolite, thrombolitic]

# Stored dictionary of terms, to be set by GDD infrastructure admins
dictionary: strom
6 changes: 6 additions & 0 deletions credentials.example
@@ -0,0 +1,6 @@
postgres:
    user: postgres_username
    port: 5432
    host: localhost
    database: deepdive_app
    password: password123
31 changes: 31 additions & 0 deletions extractions/SQL.txt
@@ -0,0 +1,31 @@
#==============================================================================
# PG DUMP FOR RESULTS
#==============================================================================

pg_dump -t results -t strat_target -t strat_target_distant -t age_check -t bib -t target_adjectives DBNAME > ./output/output.sql

#==============================================================================
# LOAD DUMP INTO AN (ALREADY PRESENT) DATABASE
#==============================================================================

psql -d DBNAME -f ../output/output.sql

#==============================================================================
# USEFUL SQL QUERIES FOR SUMMARY RESULTS
#==============================================================================

COPY(SELECT strat_phrase_root,strat_name_id, COUNT(strat_name_id)
FROM results
WHERE (strat_name_id<>'0' AND target_word ILIKE '%stromato%')
GROUP BY strat_phrase_root, strat_name_id)
TO '/Users/jhusson/Box Sync/postdoc/deepdive/stroms/V2/test.csv' DELIMITER ',' CSV HEADER;
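
# NOTE: COPY ... TO writes the output file server-side, with the postgres
# server's permissions; from psql, a client-side alternative is \copy
# (one line; output path illustrative):

\copy (SELECT strat_phrase_root, strat_name_id, COUNT(strat_name_id) FROM results WHERE strat_name_id<>'0' AND target_word ILIKE '%stromato%' GROUP BY strat_phrase_root, strat_name_id) TO 'test.csv' CSV HEADER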

#==============================================================================
# INTERESTING STROMATOLITE ADJECTIVES
#==============================================================================

SELECT * FROM target_adjectives
WHERE target_adjective ILIKE 'domal'
   OR target_adjective ILIKE 'columnar'
   OR target_adjective ILIKE 'conical'
   OR target_adjective ILIKE 'domical'
   OR target_adjective ILIKE 'domed';
1 change: 1 addition & 0 deletions input/url.txt
@@ -0,0 +1 @@
deepdivesubmit.chtc.wisc.edu/static/strom_nlp_27Jan2016.zip
8 changes: 8 additions & 0 deletions makefile
@@ -0,0 +1,8 @@
all:
	cp credentials.example credentials;
	pip install -r requirements.txt;

local_setup:
	./setup/setup.sh
2 changes: 2 additions & 0 deletions output/.gitignore
@@ -0,0 +1,2 @@
*
!.gitignore
6 changes: 6 additions & 0 deletions requirements.txt
@@ -0,0 +1,6 @@
psycopg2>=2.6.1
pyyaml>=3.11
tqdm>=1.0
stop-words>=2015.2.23.1
docopt>=0.6.1
numpy>=1.9.2
78 changes: 78 additions & 0 deletions run.py
@@ -0,0 +1,78 @@
#==============================================================================
#RUN ALL - STROMATOLITES
#==============================================================================

#path: /Users/jhusson/local/bin/deepdive-0.7.1/deepdive-apps/stromatolites

#==============================================================================

import os, time, subprocess, yaml

#tic
start_time = time.time()

#load configuration file
with open('./config', 'r') as config_yaml:
    config = yaml.load(config_yaml)

#load credentials file
with open('./credentials', 'r') as credential_yaml:
    credentials = yaml.load(credential_yaml)


#ensure working directory is proper
#os.chdir("/Users/jhusson/local/bin/deepdive-0.7.1/deepdive-apps/stromatolites")

#INITIALIZE THE POSTGRES TABLES
print 'Step 1: Initialize the PSQL tables ...'
subprocess.call('./setup/setup.sh', shell=True)
os.system('python ./udf/initdb.py')

#BUILD THE BIBLIOGRAPHY
print 'Step 2: Build the bibliography ...'
os.system('python ./udf/buildbib.py')

#FIND TARGET INSTANCES
print 'Step 3: Find stromatolite instances ...'
os.system('python ./udf/ext_target.py')

#FIND STRATIGRAPHIC ENTITIES
print 'Step 4: Find stratigraphic entities ...'
os.system('python ./udf/ext_strat_phrases.py')

#FIND STRATIGRAPHIC MENTIONS
print 'Step 5: Find stratigraphic mentions ...'
os.system('python ./udf/ext_strat_mentions.py')

#CHECK AGE - UNIT MATCH AGREEMENT
print 'Step 6: Check age - unit match agreement ...'
os.system('python ./udf/ext_age_check.py')

#DEFINE RELATIONSHIPS BETWEEN TARGET AND STRATIGRAPHIC NAMES
print 'Step 7: Define the relationships between stromatolite phrases and stratigraphic entities/mentions ...'
os.system('python ./udf/ext_strat_target.py')

#DEFINE RELATIONSHIPS BETWEEN TARGET AND DISTANT STRATIGRAPHIC NAMES
print 'Step 8: Define the relationships between stromatolite phrases and distant stratigraphic entities/mentions ...'
os.system('python ./udf/ext_strat_target_distant.py')

#DELINEATE REFERENCE SECTION FROM MAIN BODY EXTRACTIONS
print 'Step 9: Delineate reference section from main body extractions ...'
os.system('python ./udf/ext_references.py')

#BUILD A BEST RESULTS TABLE OF STROM-STRAT_NAME TUPLES
print 'Step 10: Build a best results table of strom-strat_name tuples ...'
os.system('python ./udf/ext_results.py')

#FIND ADJECTIVES DESCRIBING STROM
print 'Step 11: Find adjectives describing strom target words ...'
os.system('python ./udf/ext_target_adjective.py')

#POSTGRES DUMP
print 'Step 12: Dump select results from PSQL ...'
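#note: pg_dump reads the password from the PGPASSWORD environment variable or
#~/.pgpass; set one of these if your local postgres requires a password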
output = 'pg_dump -U '+ credentials['postgres']['user'] + ' -t results -t strat_target -t strat_target_distant -t age_check -t refs_location -t bib -t target_adjectives -d ' + credentials['postgres']['database'] + ' > ./output/output.sql'
subprocess.call(output, shell=True)

#summary of performance time
elapsed_time = time.time() - start_time
print '\n ###########\n\n elapsed time: %d seconds\n\n ###########\n\n' %(elapsed_time)
53 changes: 53 additions & 0 deletions setup/setup.sh
@@ -0,0 +1,53 @@
#!/bin/bash

# via http://stackoverflow.com/a/21189044/1956065
function parse_yaml {
    local prefix=$2
    local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
    sed -ne "s|^\($s\):|\1|" \
        -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $1 |
    awk -F$fs '{
        indent = length($1)/2;
        vname[indent] = $2;
        for (i in vname) {if (i > indent) {delete vname[i]}}
        if (length($3) > 0) {
            vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
            printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3);
        }
    }'
}

eval $(parse_yaml credentials)
eval $(parse_yaml config)
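
# parse_yaml flattens nested YAML keys into shell variables; with the four-space
# indentation used in the credentials file, postgres -> user becomes
# $postgres__user (hence the double underscores below)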

export PGPASSWORD=$postgres__password

pwd=$(pwd)

# Create the database - if it exists an error will be thrown which can be ignored
createdb $postgres__database -h $postgres__host -U $postgres__user -p $postgres__port

# Vanilla NLP
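# one row per sentence: docid and sentid identify it; words, poses, ners, lemmas
# and the dep_* columns are token-aligned arrays from the NLP output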
echo "DROP TABLE IF EXISTS ${app_name}_sentences_nlp; CREATE TABLE ${app_name}_sentences_nlp (docid text, sentid integer, wordidx integer[], words text[], poses text[], ners text[], lemmas text[], dep_paths text[], dep_parents integer[], font text[], layout text[]);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database

echo "CREATE INDEX ON ${app_name}_sentences_nlp (docid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
echo "CREATE INDEX ON ${app_name}_sentences_nlp (sentid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database

echo "COPY ${app_name}_sentences_nlp FROM '$pwd/input/sentences_nlp'" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database

# NLP352
echo "DROP TABLE IF EXISTS ${app_name}_sentences_nlp352; CREATE TABLE ${app_name}_sentences_nlp352 (docid text, sentid integer, wordidx integer[], words text[], poses text[], ners text[], lemmas text[], dep_paths text[], dep_parents integer[]);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database

echo "CREATE INDEX ON ${app_name}_sentences_nlp352 (docid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
echo "CREATE INDEX ON ${app_name}_sentences_nlp352 (sentid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database

echo "COPY ${app_name}_sentences_nlp352 FROM '$pwd/input/sentences_nlp352'" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database

# NLP352 Bazaar
echo "DROP TABLE IF EXISTS ${app_name}_sentences_nlp352_bazaar; CREATE TABLE ${app_name}_sentences_nlp352_bazaar (docid text, sentid integer, sentence text, words text[], lemmas text[], poses text[], ners text[], character_position integer[], dep_paths text[], dep_parents integer[]);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database

echo "CREATE INDEX ON ${app_name}_sentences_nlp352_bazaar (docid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
echo "CREATE INDEX ON ${app_name}_sentences_nlp352_bazaar (sentid);" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database

echo "COPY ${app_name}_sentences_nlp352_bazaar FROM '$pwd/input/sentences_nlp352_bazaar'" | psql -U $postgres__user -h $postgres__host -p $postgres__port $postgres__database
