Skip to content

Latest commit

 

History

History
210 lines (123 loc) · 8.46 KB

README.md

File metadata and controls

210 lines (123 loc) · 8.46 KB

assemblycomparator2

Assemblycomparator is a genomes-to-report pipeline. It is a bit like nullarbor, but it takes in genomes (assemblies) instead of reads.

It works by calling an alias that invokes the activation of a conda environment and subsequently calls a snakemake pipeline on the fasta-files in the current working directory of your terminal.

Assemblycomparator performs a palette of analyses on your genomes, and compares them. The main results from these analyses are summarized in a html-report that can be easily distributed.

Usage

Make a directory with the assembly-files you want to investigate with assemblycomparator2. Go into that directory in the terminal, and run the command assemblycomparator2_slurm or assemblycomparator2_local. assemblycomparator2 will then create a sub-directory containing a plethora of analyses.

Some useful commands

  • Execute a 'dry run'. That is, show the jobs which will run, without triggering the computation:

    assemblycomparator2 -n

  • Simply, run assemblycomparator on the genomes in the current directory:

    assemblycomparator2

  • If you're not sure your internet connection to the cluster will last for the full assemblycomparator2 run, put a & in the end.

    assemblycomparator2 &

A bit more advanced controls
  • Execute all jobs up until (inclusive of) a specific job in the job graph:

    assemblycomparator2 --until mlst

  • Select a specific MLST-scheme to use on all of the samples: (defaults to automatic)

    assemblycomparator2 --config mlst_scheme=hpylori

  • Select a specific roary blastp-identity: (defaults to 95)

    assemblycomparator2 --config roary_blastp_identity=90

  • Rerun a specific rule, (might be necessary if some parts of the report is missing):

    assemblycomparator2 -R report

What analyses does it do?

For each assembly

For each group

  • roary (pan and core genome)
  • snp-dists (core genome pairwise snp-distances)
  • FastTree (phylogenetic tree of the core genome)
  • Mashtree (super fast distance measurement)
  • A nice report easy to share with your friends (demo)

Below is a snakemake exported directed graph of the rules involved: dag

Installation

Assemblycomparator2 needs Snakemake and the dependencies which can be needed for running on your specific setup. I.e. DRMAA for Slurm-mananged HPC's. You can either follow the official Snakemake instructions or use our guide below.

1. Preliminary Setup

  • We recommend that you use mamba instead of conda:

    conda install -n base -c conda-forge mamba
    
  • Set the base directory for assemblycomparator2. You can change it to anything you'd like.

    ASSCOM2_BASE=~/assemblycomparator2
    
    mkdir -p $ASSCOM2_BASE
     
    # And save it into your .bashrc
    echo "export ASSCOM2_BASE=$ASSCOM2_BASE" >> ~/.bashrc 
     
    
  • Clone the assemblycomparator2 GitHub-repository into that base

    git clone https://github.com/cmkobel/assemblycomparator2.git $ASSCOM2_BASE
    
    # Optionally use the git protocol:
    # git clone [email protected]:cmkobel/assemblycomparator2.git $ASSCOM2_BASE
    
    # Setup a asscom2 base environment which is used to call snakemake
    cd $ASSCOM2_BASE && mamba env create -f environment.yaml
    
    
    
  • Set an alias that makes it easy to run assemblycomparator2 from anywhere in your filesystem

  • You have to decide whether you want to use Singularity (recommended if possible) or Conda for package management.

2. Install the alias

Select A or B depending on whether you want to install on a slurm-enabled HPC or a local system without slurm.

Option A) For HPCs with Slurm using Conda

# Main alias for running assemblycomparator2
echo "alias assemblycomparator2='conda run --live-stream --name assemblycomparator2 \
    snakemake --snakefile ${ASSCOM2_BASE}/snakefile \
        --profile ${ASSCOM2_BASE}/profile/slurm/ \
        --configfile ${ASSCOM2_BASE}/config.yaml'" >> ~/.bashrc

# Set the SNAKEMAKE_CONDA_PREFIX-variable, so the package installations can be reused between runs.
echo "export SNAKEMAKE_CONDA_PREFIX=${ASSCOM2_BASE}/conda_base" >> ~/.bashrc 
 

Option B) For local setups using Conda

# Main alias for running assemblycomparator2
echo "alias assemblycomparator2='conda run --live-stream --name assemblycomparator2 \
    snakemake --snakefile ${ASSCOM2_BASE}/snakefile \
        --profile ${ASSCOM2_BASE}/profile/local/ \
        --configfile ${ASSCOM2_BASE}/config.yaml'" >> ~/.bashrc

# Set the SNAKEMAKE_CONDA_PREFIX-variable, so the package installations can be reused between runs.
echo "export SNAKEMAKE_CONDA_PREFIX=${ASSCOM2_BASE}/conda_base" >> ~/.bashrc 
 

Setup of dasabases

  • Kraken2: If you already have a local copy of a kraken2 database, you can set the ASSCOM2_KRAKEN_DB system variable to its path.
  • GTDB-tk: Download the GTDB-tk database and set the GTDBTK_DATA_PATH variable to point to its directory.

Testing installation (optional)

assemblycomparator2 comes with a handful of E. faecium assemblies (illumina/skesa) which can be used to check that everything works as expected. In order to run this test, simply go into the location of these assemblies, and run the assemblycomparator2-command

cd ${ASSCOM2_BASE}/tests/E._faecium_plasmids
assemblycomparator2
 

If you encounter problems testing your installation, please refer to the issues tab of this repository.

Updating an existing installation (optional)

If you should -later down the line- wish to update the installation, run this command and you should be all set:

cd $ASSCOM2_BASE && git pull

# You might also want to update snakemake
conda env update --name assemblycomparator2 --file environment.yaml

# If you wish to update the job-environments, you can simply delete the contents of $SNAKEMAKE_CONDA_PREFIX
rm -r $SNAKEMAKE_CONDA_PREFIX/* 
# .. The environments will then be reinstalled from scratch next time you run assemblycomparator2

Note: If new databases have been added to kraken or mashscreen, you can rerun the above-mentioned set_up_*.sh-scripts.

Future functionality

In the future we might add some of the following pieces of software into assemblycomparator2.

Sample basis

  • Oriloc (Identify possible replication origins, and thereby help identify chromids)
  • RFplasmid (Identify plasmids using the pentamer-random-forest method)
  • Kaptive (surface polysaccharide loci for Klebsiella and Acinetobacter baumannii)
  • mash screen (recognition of plasmids-of-interest)

Batch basis

  • IQ-tree (phylogenetic tree of core genome with bootstrapping)
  • GC3-profiling ("fingerprinting" of the distribution of GC-content)
  • Identification of horizontally transferred genes?
  • panito (average nucleotide identity)
  • GenAPI (alternative to roary)

Development will continue.