To mine data for the INS project, stay in the root folder and run the following command to start the mining process:
node index.js
Move to the data_loading folder for the data loading process. The following steps show how to load the data into the database.
You will need:
- Python 3.6 or newer
- An initialized and running Neo4j database
Run the following command to install dependencies:
pip3 install -r requirements.txt
Or, if you are using a virtualenv:
pip install -r requirements.txt
The dependencies included in requirements.txt are listed below:
- pyyaml
- neo4j (version 1.7.6)
- boto3
- requests
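For reference, a requirements.txt matching this list would contain entries along these lines (the exact pinning style for packages other than neo4j is an assumption):

pyyaml
neo4j==1.7.6
boto3
requests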
All the inputs of the Data Loader can be set in a YAML-formatted configuration file using the fields defined below. Using a configuration file can make your Data Loader command significantly shorter.
An example configuration file can be found in config/config.yml
- neo4j:uri: Address of the target Neo4j endpoint
- neo4j:user: Username to be used for the Neo4j database
- neo4j:password: Password to be used for the Neo4j database
- schema: The file path(s) of the YAML-formatted schema file(s)
- prop_file: The file containing the properties for the specified schema
- dataset: The directory containing the data to be loaded; a temporary directory if loading from an S3 bucket
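Putting these fields together, a configuration file might look like the sketch below. The URI, username, and password values are placeholders, the file paths are taken from the loader command at the end of this section, and the exact nesting should be checked against config/config.yml:

neo4j:
  uri: bolt://localhost:7687
  user: neo4j
  password: your_password
schema:
  - model-desc/ins_model_file.yaml
  - model-desc/ins_model_properties.yaml
prop_file: model-desc/props-ins.yml
dataset: data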
The INS project has a data pre-processing pipeline consisting of several Python scripts. These scripts format the data and, in some cases, generate a report about it. Running these scripts is essential for the raw gathered data to work in the INS web application. Run them from the root directory of the INS-ETL project in the following order; they act on data in the '/data' directory:
- python date_restriction_for_outputs.py
  This filters data based upon dates.
- python project_abstract_formatter.py
  This removes '\n' characters from project abstracts.
- python extra_whitespace_formatter.py
  This makes sure any whitespace is just a single space (illustrated in the sketch after this list).
- python calculate_award_amount_ranges.py
  This formats a column of the project data to be used on the UI.
- python tag_representative_project.py
  This formats a column of the project data that is internal to the application but is required for the application to work properly.
- python output_count_report.py
  This generates a report for the data, intended for data validation purposes.
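To illustrate the kind of normalization these formatters perform, here is a hedged sketch that collapses whitespace runs (newlines included) inside the fields of a tab-delimited file, roughly combining what project_abstract_formatter.py and extra_whitespace_formatter.py are described as doing. It is an approximation, not the actual scripts:

import csv
import re
import sys

def collapse_whitespace(path_in, path_out):
    # Rewrite a tab-delimited file so every run of whitespace inside
    # a field, newlines included, becomes a single space.
    with open(path_in, newline='', encoding='utf-8') as fin, \
         open(path_out, 'w', newline='', encoding='utf-8') as fout:
        reader = csv.reader(fin, delimiter='\t')
        writer = csv.writer(fout, delimiter='\t')
        for row in reader:
            writer.writerow([re.sub(r'\s+', ' ', field).strip() for field in row])

if __name__ == '__main__':
    collapse_whitespace(sys.argv[1], sys.argv[2])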
These scripts make the following assumptions about the input data:
- All files have either the '.txt' or '.tsv' file extension. Our convention is that manually curated data ends in '.txt' while automatically gathered data ends in '.tsv'.
- All files start with the type of data in the file, case sensitive. For example, type 'patent' has files 'patent_application.tsv' and 'patent_grant.tsv', not 'granted_patent.tsv' or 'Patent_application.tsv'. Any filename-level annotation must come after this leading type prefix.
- All files are tab-delimited.
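These conventions can be checked mechanically before running the pipeline. The sketch below is only a helper under the stated assumptions, and EXPECTED_TYPES is a hypothetical list; extend it to cover the actual data types in your '/data' directory:

import os
import sys

# Hypothetical type prefixes; 'patent' comes from the example above.
EXPECTED_TYPES = ('patent', 'project')

def check_conventions(data_dir):
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(('.txt', '.tsv')):
            print(f'{name}: extension is not .txt or .tsv')
        elif not name.startswith(EXPECTED_TYPES):
            print(f'{name}: does not begin with a known data type (case sensitive)')

if __name__ == '__main__':
    check_conventions(sys.argv[1] if len(sys.argv) > 1 else 'data')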
NOTE0: For manually curated data, if copy/pasting was involved, there may be characters that don't display properly; they may look like this: '�'. These need to be fixed by hand, so take care when preparing manually curated data in general.
NOTE1: Automatically gathered data isn't always perfect either; there may be sparse or ill-formatted rows, which can usually be removed safely. In our experience these are very rare.
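Both issues can be spotted mechanically. The sketch below flags lines containing the U+FFFD replacement character ('�') and lines whose tab-separated column count differs from the first row; treating the first row as a header is an assumption:

import sys

def scan(path):
    # Flag replacement characters and rows whose column count does
    # not match the header row of a tab-delimited file.
    with open(path, encoding='utf-8') as f:
        header = f.readline().rstrip('\n').split('\t')
        for lineno, line in enumerate(f, start=2):
            if '\ufffd' in line:
                print(f'{path}:{lineno}: replacement character found')
            fields = line.rstrip('\n').split('\t')
            if len(fields) != len(header):
                print(f'{path}:{lineno}: expected {len(header)} columns, got {len(fields)}')

if __name__ == '__main__':
    for p in sys.argv[1:]:
        scan(p)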
Run the following command (from the data_loading folder) to load the data into the Neo4j database:
python loader.py config/config.yml -p <neo4j password> -s model-desc/ins_model_file.yaml -s model-desc/ins_model_properties.yaml --prop-file model-desc/props-ins.yml --no-backup --dataset data
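If the schema, prop_file, and dataset values are already set in config/config.yml as sketched earlier, the command can likely be shortened to something like the following; whether a given setting must still be passed on the command line depends on loader.py:

python loader.py config/config.yml -p <neo4j password> --no-backup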