Skip to content

RIVM-bioinformatics/juno-population

Repository files navigation

Juno-Population

A pipeline to investigate population structures.

About this project

The Juno-population pipeline automates popPUNK. It is primarily used to categorize Streptococcus pneumoniae into Global Pneumococcal Sequence Clusters, though the pipeline can also support other species with popPUNK databases.

Prerequisites

  • Linux environment
  • (mini)conda
  • Python3.8 Python is the scripting language used to create the pipeline

Installation

  1. Clone the repository.
git clone https://github.com/RIVM-bioinformatics/juno-population.git
  1. Go to Juno directory.
cd juno-population
  1. Create & activate mamba environment.
conda env update -f envs/mamba.yaml
conda activate mamba
  1. Create & activate juno environment.
mamba env update -f envs/population_master.yaml
conda activate juno_population
  1. Example of run:
python population.py -i [input] -o [output] --db [popPUNK_database]

Parameters & Usage

Command for help

  • -h, --help Shows the help of the pipeline

Required parameters

  • -i, --input Path to a directory with fasta files or path to the output directory of the Juno-Assembly pipeline. It is important to link to the directory and not the files.

One of the following

  • -b --db The name of (or path to) the popPUNK database, no trailing '/' when specifying a path. It overrides information provided with the --species argument.
  • -s --species Full scientific name of the species. Use all lowercase and underscores between the parts of a name, not spaces (e.g. streptococcus_pneumoniae). This is a convenience function to find and set the popPUNK database. Only Streptococcus pneumoniae is currently supported within the RIVM bio-informatics environment. Extra species can be supported by including them in database_locations.py.

Optional parameters

  • -o, --output Path to the directory that is used for the output. If none is given the default will be an output directory in the juno-population folder.
  • --external-clustering Add if external cluster definitions should be used to name the clusters (see popPUNK and GPSC documentation). A {db_name}_external_clusters.csv file should be present in the popPUNK database directory when using this flag.
  • -n --dryrun If you want to run a dry run use one of these parameters

The base command to run this pipeline

python population.py -i [path/to/fasta_files/] --db [path/to/poppunk_db]

Two examples of running this pipeline

When you want to provide a popPUNK database:

python population.py -i path/to/fasta_files/ -o output/ --db path/to/GPS_v6

When analyzing a supported species and the popPUNK database contains a cluster definition file that should be used:

python population.py -i path/to/fasta_files/ -o output/ -s streptococcus_pneumoniae --external-clustering

Explanation of the output

  • log: Log with output and error file from the cluster and for each Snakemake rule/step that is performed.
  • results_per_sample: Directory with output produced by popPUNK for each sample.
  • q_files: Directory containing the input for poppunk_assign. Subsequent analysis by other popPUNK modules (e.g. poppunk_visualise or building a MST on large datasets) may require these files.
  • poppunk_clusters.csv: Summary file with cluster definitions for each sample within the results_per_sample folder.

License

This pipeline is licensed with a AGPL3 license. Detailed information can be found inside the 'LICENSE' file in this repository.

Contact