Merge pull request #92 from CBIIT/dev-1.2.0
update the INS-ETL README.MD to describe the data pre-processing pipe…
David-YuWei authored Apr 26, 2023
2 parents 648ae0e + 2cd607c commit dd1d493
Showing 2 changed files with 48 additions and 1 deletion.
29 changes: 28 additions & 1 deletion README.md
@@ -52,10 +52,37 @@ prop_file: The file containing the properties for the specified schema

dataset: The directory containing the data to be loaded, a temporary directory if loading from an S3 bucket

### Run the Pre-processing Pipeline
The INS project has a data pre-processing pipeline that consists of several Python scripts. These scripts format the data and, in some cases, generate reports about it.

Running these scripts is essential for the raw gathered data to work in the INS web application.

Run these scripts from the root directory of the INS-ETL project, in the order listed below; they act on data in the 'data' directory. A combined run is sketched after the list:

1) <code>python date_restriction_for_outputs.py</code> Filters the data by date.

2) <code>python project_abstract_formatter.py</code> Removes '\n' characters from project abstracts.

3) <code>python extra_whitespace_formatter.py</code> Collapses any run of whitespace into a single space.

4) <code>python calculate_award_amount_ranges.py</code> Formats a column of the project data for use in the UI.

5) <code>python tag_representative_project.py</code> Formats a column of the project data that is internal to the application but required for it to work properly.

6) <code>python output_count_report.py</code> Generates a report on the data, intended for data validation.
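
For convenience, the six commands above can be run back to back from the project root; the snippet below simply collects them in order:

```bash
# Run the full pre-processing pipeline in order; stop on the first failure
set -e
python date_restriction_for_outputs.py
python project_abstract_formatter.py
python extra_whitespace_formatter.py
python calculate_award_amount_ranges.py
python tag_representative_project.py
python output_count_report.py
```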

These scripts make the following assumptions about the input files (a quick convention check is sketched after the list):
1) All files have either a '.txt' or '.tsv' extension. Our convention is that manually curated data ends in '.txt' while automatically gathered data ends in '.tsv'.
2) Every filename starts with the type of data in the file, case sensitive. For example, type 'patent' has files 'patent_application.tsv' and 'patent_grant.tsv', not 'granted_patent.tsv' or 'Patent_application.tsv'. Any additional filename-level annotation comes after this leading type prefix.
3) All files are tab-delimited.
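
A minimal sketch for checking the first assumption, assuming the data lives in the 'data' directory at the project root:

```bash
# List any files under data/ that do not use the expected .txt or .tsv extensions
find data -type f ! -name '*.txt' ! -name '*.tsv'
```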

NOTE0: For manually curated data, if copy/pasting was involved, some characters may not display properly; they may look like '�'. These need to be fixed by hand, so take care when preparing manually curated data in general.
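
One way to locate such characters, assuming the files are UTF-8 encoded, is to search for the Unicode replacement character (U+FFFD):

```bash
# List files under data/ that contain the replacement character '�' (UTF-8 bytes EF BF BD)
grep -rl $'\xef\xbf\xbd' data/
```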

NOTE1: Automatically gathered data is not always perfect; there are occasionally sparse or ill-formatted rows, which can usually be removed safely. From experience, these are very rare.
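
A rough sketch for flagging such rows, assuming each file's header row has the correct number of tab-separated columns:

```bash
# Print file and line number for rows whose column count differs from that file's header row
awk -F'\t' 'FNR==1{n=NF} NF!=n{print FILENAME": line "FNR}' data/*.tsv
```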

### Load Data into Neo4j
Run the following command (in the data_loading folder) to load data into the Neo4j database:

```bash
python loader.py config/config.yml -p <neo4j password> -s model-desc/ins_model_file.yaml -s model-desc/ins_model_properties.yaml --prop-file model-desc/props-ins.yml --no-backup --dataset data
```
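
As an optional sanity check after loading (assuming cypher-shell is available and the placeholder password is replaced with a real one), the loaded nodes can be counted per label:

```bash
# Hypothetical sanity check, not part of the documented workflow: count loaded nodes by label
cypher-shell -u neo4j -p <neo4j password> "MATCH (n) RETURN labels(n) AS label, count(n) AS nodes ORDER BY nodes DESC;"
```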

20 changes: 20 additions & 0 deletions requirements.txt
@@ -0,0 +1,20 @@
boto3==1.26.72
botocore==1.29.72
certifi==2022.12.7
charset-normalizer==3.0.1
elasticsearch==7.13.4
idna==3.4
jmespath==1.0.1
neo4j==4.2.1
neo4j-driver==5.5.0
numpy==1.24.2
pandas==1.5.3
python-dateutil==2.8.2
pytz==2022.7.1
PyYAML==6.0
requests==2.28.2
requests-aws4auth==1.2.2
s3transfer==0.6.0
six==1.16.0
tqdm==4.64.1
urllib3==1.26.14
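
These pinned dependencies can be installed with pip, typically inside a virtual environment:

```bash
pip install -r requirements.txt
```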
