Merge pull request #92 from CBIIT/dev-1.2.0
update the INS-ETL README.MD to describe the data pre-processing pipe…
David-YuWei authored Apr 26, 2023
2 parents 648ae0e + 2cd607c commit dd1d493
Showing 2 changed files with 48 additions and 1 deletion.
29 changes: 28 additions & 1 deletion README.md
@@ -52,10 +52,37 @@ prop_file: The file containing the properties for the specified schema

dataset: The directory containing the data to be loaded, a temporary directory if loading from an S3 bucket

### Run the Pre-processing Pipeline
The INS project has a data pre-processing pipeline that consists of several Python scripts. These scripts format the data and, in some cases, generate reports about it.

Running these scripts is essential for the raw gathered data to work in the INS web application.

Run these scripts from the root directory of the INS-ETL project, in the order listed below; they act on data in the 'data' directory. A combined run is sketched after the list:

1) <code>python date_restriction_for_outputs.py</code> Filters the data by date.

2) <code>python project_abstract_formatter.py</code> Removes '\n' characters from project abstracts.

3) <code>python extra_whitespace_formatter.py</code> Collapses any run of whitespace into a single space.

4) <code>python calculate_award_amount_ranges.py</code> Formats a column of the project data for use in the UI.

5) <code>python tag_representative_project.py</code> Formats a column of the project data that is internal to the application but required for it to work properly.

6) <code>python output_count_report.py</code> Generates a report on the data, intended for data validation.
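
For convenience, the six commands above can be run back to back from the project root; the snippet below simply collects them in order:

```bash
# Run the full pre-processing pipeline in order; stop on the first failure
set -e
python date_restriction_for_outputs.py
python project_abstract_formatter.py
python extra_whitespace_formatter.py
python calculate_award_amount_ranges.py
python tag_representative_project.py
python output_count_report.py
```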

These scripts make the following assumptions about the input files (a quick convention check is sketched after the list):
1) All files have either a '.txt' or '.tsv' extension. Our convention is that manually curated data ends in '.txt' while automatically gathered data ends in '.tsv'.
2) Every filename starts with the type of data in the file, case sensitive. For example, type 'patent' has files 'patent_application.tsv' and 'patent_grant.tsv', not 'granted_patent.tsv' or 'Patent_application.tsv'. Any additional filename-level annotation comes after this leading type prefix.
3) All files are tab-delimited.
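
A minimal sketch for checking the first assumption, assuming the data lives in the 'data' directory at the project root:

```bash
# List any files under data/ that do not use the expected .txt or .tsv extensions
find data -type f ! -name '*.txt' ! -name '*.tsv'
```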

NOTE0: For manually curated data, if copy/pasting was involved, some characters may not display properly; they may look like '�'. These need to be fixed by hand, so take care when preparing manually curated data in general.
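
One way to locate such characters, assuming the files are UTF-8 encoded, is to search for the Unicode replacement character (U+FFFD):

```bash
# List files under data/ that contain the replacement character '�' (UTF-8 bytes EF BF BD)
grep -rl $'\xef\xbf\xbd' data/
```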

NOTE1: Automatically gathered data is not always perfect; there are occasionally sparse or ill-formatted rows, which can usually be removed safely. From experience, these are very rare.
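
A rough sketch for flagging such rows, assuming each file's header row has the correct number of tab-separated columns:

```bash
# Print file and line number for rows whose column count differs from that file's header row
awk -F'\t' 'FNR==1{n=NF} NF!=n{print FILENAME": line "FNR}' data/*.tsv
```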

### Load Data into Neo4j
Run the following command (in the data_loading folder) to load data into the Neo4j database:

```bash
python loader.py config/config.yml -p <neo4j password> -s model-desc/ins_model_file.yaml -s model-desc/ins_model_properties.yaml --prop-file model-desc/props-ins.yml --no-backup --dataset data
```
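
As an optional sanity check after loading (assuming cypher-shell is available and the placeholder password is replaced with a real one), the loaded nodes can be counted per label:

```bash
# Hypothetical sanity check, not part of the documented workflow: count loaded nodes by label
cypher-shell -u neo4j -p <neo4j password> "MATCH (n) RETURN labels(n) AS label, count(n) AS nodes ORDER BY nodes DESC;"
```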

20 changes: 20 additions & 0 deletions requirements.txt
@@ -0,0 +1,20 @@
boto3==1.26.72
botocore==1.29.72
certifi==2022.12.7
charset-normalizer==3.0.1
elasticsearch==7.13.4
idna==3.4
jmespath==1.0.1
neo4j==4.2.1
neo4j-driver==5.5.0
numpy==1.24.2
pandas==1.5.3
python-dateutil==2.8.2
pytz==2022.7.1
PyYAML==6.0
requests==2.28.2
requests-aws4auth==1.2.2
s3transfer==0.6.0
six==1.16.0
tqdm==4.64.1
urllib3==1.26.14
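
These pinned dependencies can be installed with pip, typically inside a virtual environment:

```bash
pip install -r requirements.txt
```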
