Phylign is executed via GNU Make, which handles all parameters and passes them to Snakemake.
Here's a list of all implemented commands (to be executed as make {command}):
####################
# General commands #
####################
all Run everything (the default rule)
help Print help messages
clean Clean intermediate search files
cleanall Clean all generated and downloaded files
##################
# Pipeline steps #
##################
conda Create the conda environments
match Match queries using COBS (queries -> candidates)
map Map candidates to assemblies (candidates -> alignments)
#############
# Reporting #
#############
config Print configuration without comments
report Generate Snakemake report
###########
# Cluster #
###########
cluster_slurm Submit to a SLURM cluster
cluster_lsf Submit to LSF cluster
##################
# For developers #
##################
format Reformat Python and Snakemake files
checkformat Check source code format
Note: make format requires YAPF and Snakefmt, which can be installed by:
conda install -c conda-forge -c bioconda yapf snakefmt
- asms/, cobs/: Downloaded assemblies and COBS indexes
- input/: Queries, to be provided within one or more FASTA/FASTQ files, possibly gzipped (.fa)
- intermediate/: Intermediate files
  - 00_queries_preprocessed/: Preprocessed queries
  - 01_queries_merged/: Merged queries
  - 02_cobs_decompressed/: Decompressed COBS indexes (temporary; used only in disk mode)
  - 03_match/: COBS matches
  - 04_filter/: Filtered candidates
  - 05_map/: Minimap2 alignments
- logs/: Logs and benchmarks
- output/: The resulting files (in a headerless SAM format)
Input files: FASTA or FASTQ files, possibly compressed with gzip. The files are looked up in the input/ directory and are recognized by the following suffixes: .fa, .fasta, .fq, .fastq (possibly with .gz at the end).
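The suffix rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the pipeline's own discovery code: the function names `is_query_file` and `find_query_files` are hypothetical, and the sketch assumes that matching is case-sensitive and that only the top level of input/ is scanned.

```python
from pathlib import Path

# Suffixes listed in the text; each may additionally be followed by ".gz".
BASE_SUFFIXES = (".fa", ".fasta", ".fq", ".fastq")


def is_query_file(filename: str) -> bool:
    """Return True if the file name ends with a recognized suffix (hypothetical helper)."""
    # Strip one optional trailing ".gz" before checking the base suffix.
    name = filename[:-3] if filename.endswith(".gz") else filename
    return name.endswith(BASE_SUFFIXES)


def find_query_files(input_dir: str = "input") -> list[Path]:
    """List candidate query files directly inside input_dir, sorted by name."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and is_query_file(p.name)
    )
```

For example, `is_query_file("reads.fq.gz")` is true, whereas `is_query_file("reads.txt")` and `is_query_file("archive.gz")` are false.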
Output files:
- output/{name}.sam_summary.gz: output alignments in a headerless SAM format
- output/{name}.sam_summary.stats: statistics about your computed alignments in TSV
SAM headers are omitted because search experiments generate hits across large numbers of assemblies (many of them spurious), so the headers would dominate the outputs. Nevertheless, we note that, in principle, the SAM headers can always be recreated from the FASTA files in asms/, although this functionality is not currently implemented.
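As a rough illustration of how such a reconstruction could work, the sketch below derives SAM `@SQ` header lines (sequence name and length) from FASTA-formatted text. It is not part of Phylign: the function names are hypothetical, the input is assumed to be already decompressed FASTA lines (how assemblies are packaged inside asms/ is not covered here), and reference names are assumed to be the first whitespace-delimited token of each FASTA header.

```python
from typing import Iterable, Iterator, List, Tuple


def fasta_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (sequence_name, length) for each record in FASTA-formatted lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Assumption: the reference name is the first token of the header.
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def sq_header_lines(lines: Iterable[str]) -> List[str]:
    """Build one SAM @SQ header line per FASTA record (hypothetical helper)."""
    return [f"@SQ\tSN:{name}\tLN:{length}" for name, length in fasta_lengths(lines)]
```

An open text file handle can be passed directly as `lines`. The resulting lines would have to be restricted to the assemblies that actually appear in a given output file to keep the header small.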
- Swapping if the number of queries is too high. If the number of queries is too high (e.g., 10M Illumina reads), the auxiliary Python scripts start to use too much memory, which may result in swapping. Try to keep the number of queries moderate and, ideally, their names short.
- No support for ambiguous characters in queries. Queries are expected to be over the ACGT alphabet. All non-ACGT characters in queries are first converted to A.
- Too many reported hits. When a query has many equally good hits in the database, even a low threshold on the maximum number of hits (for instance, 10) does not bound the output: the program takes the top 10 plus ties, which can be a huge number (especially for short sequences).
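The last two limitations can be made concrete with a small Python sketch. It only mimics the behavior described above and is not the pipeline's code: `sanitize_query` and `top_hits_with_ties` are hypothetical names, uppercasing before substitution is an assumption, and hits are modeled as (reference, score) pairs where a higher score is better.

```python
import re
from typing import List, Tuple

_NON_ACGT = re.compile(r"[^ACGT]")


def sanitize_query(seq: str) -> str:
    """Replace every character outside the ACGT alphabet with 'A'."""
    # Assumption: lowercase bases are uppercased first rather than replaced.
    return _NON_ACGT.sub("A", seq.upper())


def top_hits_with_ties(hits: List[Tuple[str, int]], max_hits: int) -> List[Tuple[str, int]]:
    """Keep the max_hits best hits plus all hits tied with the last kept one."""
    if max_hits <= 0 or not hits:
        return []
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    if len(ranked) <= max_hits:
        return ranked
    # Score of the last hit inside the threshold; everything tied with it stays.
    cutoff = ranked[max_hits - 1][1]
    return [h for h in ranked if h[1] >= cutoff]
```

With `max_hits=2` and hits scored 5, 5, 5, 4, three hits are kept because the third one ties with the second. With thousands of equally good hits, which is plausible for a short query, all of them would be retained regardless of the threshold.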