GPS Database Processor is designed for validation, update and publication of the database of the Global Pneumococcal Sequencing (GPS) Project.
This tool takes the path(s) of the GPS1 or GPS2 database as input:
-1, --gps1 - path to directory of GPS1 data
-2, --gps2 - path to directory of GPS2 data
It takes only one of the above in default or --check mode, and requires both in --monocle mode.
It carries out several operations in the following order:
- Validation of columns and values of the specified directory
  - The terminal output displays any unexpected or erroneous values
  - For columns that should only contain UPPERCASE strings, any lowercase value will be converted, and the updated table will be saved in-place (unless --check mode is used)
  - If there are any critical errors, the tool will terminate and will not carry out the subsequent operations
- If run in --check mode, the operation stops here, and any case conversion will not be saved
- Generate table4 using data inferred from table1, table3 and the reference tables in the data directory
  - If there is a location that does not exist in data/coordinates.csv (one of the reference tables), it will stop the process
    - Except if it is running in --location mode, in which case it will attempt to fetch the information via the Mapbox API
      - The first time this is triggered, it will ask for your Mapbox API key (access token) and save it locally at config/api_keys.confg for future use
      - For more information on the Mapbox API key (access token), please visit their documentation
- If not running in --monocle mode, the operation stops here
- Generate table_monocle.csv for Monocle
- Generate data.json for GPS Database Overview
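The case-conversion part of the validation step can be sketched as below. This is a minimal illustration, not the tool's actual code: the helper name `normalise_uppercase` is hypothetical, and treating `In_silico_serotype` as an uppercase-only column is an assumption made for the example.

```python
def normalise_uppercase(rows, uppercase_columns, check_only=False):
    """Upper-case string values in the given columns of a list of dict rows.

    Hypothetical helper. Returns (rows_to_save, changed). In check-only
    mode the original rows are returned, so no conversion is saved.
    """
    converted = []
    changed = False
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are never mutated
        for col in uppercase_columns:
            value = new_row.get(col)
            if isinstance(value, str) and value != value.upper():
                new_row[col] = value.upper()
                changed = True
        converted.append(new_row)
    return (rows if check_only else converted), changed


# Made-up example row; the column choice is an assumption.
rows = [{"In_silico_serotype": "19a"}]
saved, changed = normalise_uppercase(rows, ["In_silico_serotype"])
checked, _ = normalise_uppercase(rows, ["In_silico_serotype"], check_only=True)
```

Returning a `changed` flag lets the caller report conversions in the terminal even when `check_only` prevents them from being written back.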
- Conda
- Git
- Clone this repository to your machine
git clone https://github.com/sanger-bentley-group/gps-database-processor.git
- Go into the local copy of the repository
cd gps-database-processor
- Setup Conda Environment
conda env create -f environment.yml
- Pull any updates from remote repository
git pull
- Activate the Conda Environment
conda activate gps-db-processor
- Put the GPS database's three .csv source files into the directory containing the cloned repository
- Run the following command to validate your input files and generate the output files:
./processor.py
- If you or the tool have updated any of the reference tables (i.e. any file in the data directory), create a PR (Pull Request) on this repository
- -1, --gps1: path to directory of GPS1 data (should contain table1.csv, table2.csv, and table3.csv of GPS1)
- -2, --gps2: path to directory of GPS2 data (should contain table1.csv, table2.csv, and table3.csv of GPS2)
- -c, --check: perform validation only
- -m, --monocle: generate Monocle table and GPS Database Overview data payload from both GPS1 and GPS2
- -l, --location: get coordinates for locations that do not yet exist in data/coordinates.csv via the Mapbox API
Example commands:
./processor.py --gps1 gps-data --check
./processor.py -1 gps-data -2 gps2-data -m
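The option rules described above (exactly one database path in default or --check mode, both paths in --monocle mode) could be enforced with `argparse` roughly as follows. This is a sketch under those stated rules, not the actual contents of processor.py; the function name `parse_args` and the error messages are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the documented options and enforce the path-count rules."""
    parser = argparse.ArgumentParser(prog="processor.py")
    parser.add_argument("-1", "--gps1", help="path to directory of GPS1 data")
    parser.add_argument("-2", "--gps2", help="path to directory of GPS2 data")
    parser.add_argument("-c", "--check", action="store_true",
                        help="perform validation only")
    parser.add_argument("-m", "--monocle", action="store_true",
                        help="generate Monocle table and overview payload")
    parser.add_argument("-l", "--location", action="store_true",
                        help="fetch coordinates for unknown locations")
    args = parser.parse_args(argv)

    provided = [p for p in (args.gps1, args.gps2) if p is not None]
    if args.monocle and len(provided) != 2:
        parser.error("--monocle mode requires both --gps1 and --gps2")
    if not args.monocle and len(provided) != 1:
        parser.error("default and --check modes take exactly one of --gps1 or --gps2")
    return args


# Mirrors the second example command above.
args = parse_args(["-1", "gps-data", "-2", "gps2-data", "-m"])
```

`parser.error` prints the usage message and exits with status 2, so an invalid combination stops the run before any validation starts.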
alpha2_country.csv
- Map Country in table1 to ISO 3166-1 alpha-2 code for data.json generation
- 2 columns: alpha2, country - representing ISO 3166-1 alpha-2 code, country name respectively
- Modify the value in the country column if it does not match the value used in Country in table1
coordinates.csv
- Map the combination of Country, Region, City in table1 to Latitude, Longitude for table4 generation
- 3 columns: Country-Region-City, Latitude, Longitude - representing Country-Region-City combination, latitude, longitude respectively
- If a new Country-Region-City combination is found, this tool will attempt to auto-assign Latitude, Longitude to it using geopy and add them to this file
- Modify values in the Latitude, Longitude columns if incorrect values are assigned to them by geopy
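The lookup against coordinates.csv described above can be sketched as follows. The helper names are hypothetical, the sample rows are made-up placeholders, and joining Country, Region and City with a hyphen is an assumption based only on the column name Country-Region-City; the real key format may differ.

```python
import csv
import io


def load_coordinates(handle):
    """Read the 3-column reference table into {key: (latitude, longitude)}."""
    return {
        row["Country-Region-City"]: (float(row["Latitude"]), float(row["Longitude"]))
        for row in csv.DictReader(handle)
    }


def missing_locations(sample_rows, coordinates, sep="-"):
    """List location keys in table1-style rows that have no coordinates yet."""
    missing = []
    for row in sample_rows:
        # Assumed key format: Country, Region and City joined by `sep`.
        key = sep.join([row["Country"], row["Region"], row["City"]])
        if key not in coordinates and key not in missing:
            missing.append(key)
    return missing


# Placeholder data standing in for data/coordinates.csv and table1.
reference = io.StringIO(
    "Country-Region-City,Latitude,Longitude\n"
    "COUNTRY_A-REGION_A-CITY_A,1.5,2.5\n"
)
coordinates = load_coordinates(reference)
samples = [
    {"Country": "COUNTRY_A", "Region": "REGION_A", "City": "CITY_A"},
    {"Country": "COUNTRY_B", "Region": "REGION_B", "City": "CITY_B"},
]
unknown = missing_locations(samples, coordinates)
```

A non-empty `unknown` list corresponds to the case where the tool either stops or, in --location mode, tries to resolve the coordinates.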
manifestations.csv
- Map the combination of Clinical_manifestation and Source in table1 to Manifestation for table4 generation
- 3 columns: Clinical_manifestation, Source, Manifestation - representing clinical manifestation, sample source, resulting manifestation respectively
- Add new Clinical_manifestation and Source combinations and their resulting Manifestation to this file
non_standard_ages.csv
- Map non-numeric values in Age_years in table1 to less_than_5years_old for table4 generation
- 2 columns: value, less_than_5years_old - representing non-numeric age value, whether the value is less than 5 years old respectively
- Add new non-numeric Age_years values and whether they are less_than_5years_old (Y or N) to this file
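One way to combine numeric ages with the non_standard_ages.csv lookup is sketched below. The function name and the example rows are hypothetical, and deriving Y or N for numeric ages by comparing against 5 is an assumption inferred from the column name, not confirmed behaviour of the tool.

```python
import csv
import io
import re

# Plain non-negative numbers such as "3" or "0.5".
_NUMERIC = re.compile(r"\d+(\.\d+)?")


def less_than_5_years_old(age_years, non_standard_ages):
    """Return "Y" or "N" for an Age_years value.

    Numeric values are compared against 5 (assumed rule); anything else
    must be listed in the reference table, otherwise KeyError is raised.
    """
    value = str(age_years).strip()
    if _NUMERIC.fullmatch(value):
        return "Y" if float(value) < 5 else "N"
    return non_standard_ages[value]


# Made-up rows standing in for data/non_standard_ages.csv.
reference = io.StringIO("value,less_than_5years_old\n<2,Y\nADULT,N\n")
non_standard_ages = {
    row["value"]: row["less_than_5years_old"] for row in csv.DictReader(reference)
}
flags = [less_than_5_years_old(a, non_standard_ages) for a in ["3", "5", "<2", "ADULT"]]
```

Raising `KeyError` for an unlisted value mirrors the instruction above to add new non-numeric values to the file.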
pcv_introduction_year.csv
- Map Country, Year in table1 to Vaccine_period, Introduction_year, PCV_type for table4 generation
- 3 columns: Country, PCV_type, Introduction_year - representing country name, PCV type, introduction year of the PCV respectively
- Add new Country, PCV_type, Introduction_year to this file when that country introduces PCV or changes the PCV type in its National Immunisation/Vaccination Programme
pcv_valency.csv
- Map In_silico_serotype in table3 to all PCV types under PCV_type of this file for table4 generation
- 2 columns: PCV_type, Serotypes - representing PCV type, list of serotypes covered by that PCV type respectively
- Add new PCV_type, Serotypes to this file when there is a new PCV type
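The serotype-to-PCV mapping described above can be illustrated with a short sketch. The helper names are hypothetical, and storing Serotypes as a comma-separated list inside one quoted field is an assumption about the file layout. The example rows use the standard PCV7 and PCV13 serotype compositions.

```python
import csv
import io


def load_valency(handle, sep=","):
    """Read the 2-column table into {PCV_type: set of covered serotypes}."""
    return {
        row["PCV_type"]: {s.strip() for s in row["Serotypes"].split(sep)}
        for row in csv.DictReader(handle)
    }


def covering_pcv_types(serotype, valency):
    """Return every PCV type whose serotype list contains `serotype`."""
    return [pcv for pcv, serotypes in valency.items() if serotype in serotypes]


# Example rows; the delimiter inside Serotypes is an assumption.
reference = io.StringIO(
    "PCV_type,Serotypes\n"
    'PCV7,"4,6B,9V,14,18C,19F,23F"\n'
    'PCV13,"1,3,4,5,6A,6B,7F,9V,14,18C,19A,19F,23F"\n'
)
valency = load_valency(reference)
covered_19a = covering_pcv_types("19A", valency)
covered_14 = covering_pcv_types("14", valency)
```

Keeping each serotype list as a set makes the membership test exact, so "6A" does not accidentally match "6B".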
published_public_names.txt
- Map Public_name in table1 and table3 to Published for table4 generation
- A list of Public_name of all samples that have been published
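A minimal sketch of the Published lookup follows. It assumes one Public_name per line in published_public_names.txt and a Y or N output value; both are assumptions, and the helper names and sample names are hypothetical.

```python
import io


def load_published(handle):
    """Read one Public_name per line (assumed layout), skipping blank lines."""
    return {line.strip() for line in handle if line.strip()}


def published_flag(public_name, published):
    """Return "Y" if the sample is listed as published, else "N" (assumed values)."""
    return "Y" if public_name in published else "N"


# Made-up names standing in for data/published_public_names.txt.
published = load_published(io.StringIO("SAMPLE_A\nSAMPLE_B\n\n"))
flag_a = published_flag("SAMPLE_A", published)
flag_c = published_flag("SAMPLE_C", published)
```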
GPS Database requirement:
- GPS1 v5.0+
- GPS2 v5.0+
Tested on:
This project uses Open Source components. You can find the source code of their open source projects along with license information below. I acknowledge and am grateful to these developers for their contributions to open source.
- Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team. All rights reserved.
- Copyright (c) 2011-2022, Open source contributors.
- License (BSD-3-Clause): https://github.com/pandas-dev/pandas/blob/main/LICENSE
- © geopy contributors 2006-2018 under the MIT License.
- License (MIT): https://github.com/geopy/geopy/blob/master/LICENSE
- draw.io is owned and developed by JGraph Ltd, a UK based software company.
- License (Apache License 2.0): https://github.com/jgraph/drawio/blob/dev/LICENSE