A pipeline to investigate population structures.
The Juno-population pipeline automates popPUNK. It is primarily used to categorize Streptococcus pneumoniae into Global Pneumococcal Sequence Clusters, though the pipeline can also support other species with popPUNK databases.
- Linux environment
- (mini)conda
- Python 3.8 (Python is the scripting language used to create the pipeline)
- Clone the repository.
git clone https://github.com/RIVM-bioinformatics/juno-population.git
- Go to Juno directory.
cd juno-population
- Create & activate mamba environment.
conda env update -f envs/mamba.yaml
conda activate mamba
- Create & activate juno environment.
mamba env update -f envs/population_master.yaml
conda activate juno_population
- Example of run:
python population.py -i [input] -o [output] --db [popPUNK_database]
-h, --help
Shows the help of the pipeline
-i, --input
Path to a directory with fasta files or path to the output directory of the Juno-Assembly pipeline. It is important to link to the directory and not the files.
One of the following:
-b, --db
The name of (or path to) the popPUNK database, no trailing '/' when specifying a path. It overrides information provided with the --species argument.
-s, --species
Full scientific name of the species. Use all lowercase and underscores between the parts of a name, not spaces (e.g. streptococcus_pneumoniae). This is a convenience option to find and set the popPUNK database. Only Streptococcus pneumoniae is currently supported within the RIVM bioinformatics environment. Extra species can be supported by including them in database_locations.py.
-o, --output
Path to the directory that is used for the output. If none is given, the default is an output directory in the juno-population folder.
--external-clustering
Add if external cluster definitions should be used to name the clusters (see popPUNK and GPSC documentation). A {db_name}_external_clusters.csv file should be present in the popPUNK database directory when using this flag.
-n, --dryrun
Use either of these flags to perform a dry run.
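The precedence between --db and --species described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `DATABASE_LOCATIONS` dictionary, its placeholder path, and the `resolve_database` function are hypothetical stand-ins for whatever database_locations.py really contains.

```python
from typing import Optional

# Hypothetical mapping from species name to popPUNK database path.
# The real contents of database_locations.py are not shown in this README.
DATABASE_LOCATIONS = {
    "streptococcus_pneumoniae": "/path/to/GPS_v6",  # placeholder path
}


def resolve_database(db: Optional[str] = None, species: Optional[str] = None) -> str:
    """Pick the popPUNK database; an explicit --db overrides --species."""
    if db:
        # The README asks for no trailing '/', so strip it defensively.
        return db.rstrip("/")
    if species:
        if species not in DATABASE_LOCATIONS:
            raise ValueError(f"No popPUNK database registered for '{species}'")
        return DATABASE_LOCATIONS[species]
    raise ValueError("Provide either --db or --species")
```

Under these assumptions, passing both arguments returns the value given to --db, and an unregistered species raises an error instead of silently falling back to a default.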
python population.py -i [path/to/fasta_files/] --db [path/to/poppunk_db]
When you want to provide a popPUNK database:
python population.py -i path/to/fasta_files/ -o output/ --db path/to/GPS_v6
When analyzing a supported species and the popPUNK database contains a cluster definition file that should be used:
python population.py -i path/to/fasta_files/ -o output/ -s streptococcus_pneumoniae --external-clustering
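Before running with --external-clustering, it can help to confirm that the cluster definition file is where the pipeline expects it. The sketch below is an assumption-laden check, not part of the pipeline: it assumes that {db_name} equals the name of the database directory, and both function names are hypothetical.

```python
from pathlib import Path


def external_clusters_file(db_path: str) -> Path:
    """Build the expected path of the external cluster definitions file."""
    db_dir = Path(db_path)
    # Assumption: {db_name} is the name of the database directory itself.
    return db_dir / f"{db_dir.name}_external_clusters.csv"


def check_external_clusters(db_path: str) -> Path:
    """Return the file path, or raise if the file is missing."""
    csv_path = external_clusters_file(db_path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"--external-clustering needs {csv_path}, which was not found"
        )
    return csv_path
```

For a database at path/to/GPS_v6, this looks for GPS_v6_external_clusters.csv inside that directory.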
- log: Logs with the output and error files from the cluster and for each Snakemake rule/step that is performed.
- results_per_sample: Directory with output produced by popPUNK for each sample.
- q_files: Directory containing the input for poppunk_assign. Subsequent analysis by other popPUNK modules (e.g. poppunk_visualise or building an MST on large datasets) may require these files.
- poppunk_clusters.csv: Summary file with cluster definitions for each sample within the results_per_sample folder.
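As a quick follow-up on the summary file, the number of samples per cluster can be tallied with the standard library. This is only a sketch: the column layout of poppunk_clusters.csv is not documented here, so the cluster column name is passed in by the caller, and `count_samples_per_cluster` is a hypothetical helper.

```python
import csv
from collections import Counter
from typing import Iterable


def count_samples_per_cluster(lines: Iterable[str], cluster_column: str) -> Counter:
    """Count rows per cluster in CSV text that has a header row.

    `lines` can be an open file handle or any iterable of CSV lines.
    `cluster_column` must match a header in the file; check the header
    of your own poppunk_clusters.csv, since the name is assumed here.
    """
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or cluster_column not in reader.fieldnames:
        raise KeyError(f"Column '{cluster_column}' not found in header")
    return Counter(row[cluster_column] for row in reader)
```

To use it on real output, open the file with `open(path, newline="")` and pass the handle together with the column name found in the file's header.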
This pipeline is licensed under an AGPL3 license. Detailed information can be found in the 'LICENSE' file in this repository.
- Contact person: Roxanne Wolthuis & Karim Hajji
- Email [email protected] / [email protected]