BirgitRijvers/NanoCLUST_toolkit

NanoCLUST toolkit

Programs that enhance NanoCLUST V1.0dev usage and output.

Scripts included

  • NC_runner.py
  • NC_summarizer.py
  • NC_gridsearch.py

To be included

  • summary_to_phyloseq.py (create OTU and TAX files for phyloseq based on CSV summary)
  • phyloseqize.py (create OTU and TAX files for phyloseq directly from NanoCLUST output, no summary needed; may be combined with summary_to_phyloseq.py)
  • NC_cluster_concat.py (concatenate cluster consensus sequences from multiple NanoCLUST runs into a single .fasta file)
  • concat_fastqgz.py (recursively concatenate the .fastq.gz files in a directory into separate .fastq files, which is needed to run NanoCLUST; may be combined with NC_runner.py)

NC_runner.py

The NC_runner.py script streamlines the execution of NanoCLUST on multiple files within a directory using just one command.

This script is useful when you have multiple FASTQ files and selecting them with wildcards in a single NanoCLUST command does not work.

Usage

  1. Download the NC_runner.py script to a convenient location, preferably your home directory to minimize path-related errors.
  2. (Optional) In the script's argparse default section, provide the absolute paths for your database and tax-database to reduce command length and prevent path-related errors.
  3. (Optional) Include the path to the main.nf file from NanoCLUST in the argparse default section.
  4. (Optional) Modify the default output directory location.
  5. (Optional) Adjust the default suffix in the argparse section to a suffix you commonly use.
  6. Execute the script!

If no input directory is specified or the input directory doesn't exist, the script will exit.

If the specified output directory doesn't exist, the script will notify you and create it.
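The directory checks described above could be implemented roughly as follows. This is a minimal sketch and not the actual NC_runner.py source: the function names, the long option names (--output, --suffix) and the default values are assumptions made for illustration; only the short flags -o and -s appear in the example commands below.

```python
import argparse
import sys
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; short flags follow the README examples."""
    parser = argparse.ArgumentParser(
        description="Run NanoCLUST on every matching file in a directory"
    )
    # A missing positional argument makes argparse exit with a usage message.
    parser.add_argument("input_dir", help="directory containing the read files")
    # Long option names and defaults are assumptions, not taken from the script.
    parser.add_argument("-o", "--output", default=".", help="output directory")
    parser.add_argument("-s", "--suffix", default=".fastq", help="input file suffix")
    return parser.parse_args(argv)


def prepare_directories(input_dir, output_dir):
    """Exit if the input directory is missing; create the output directory if needed."""
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    if not input_path.is_dir():
        sys.exit(f"Input directory '{input_path}' does not exist, exiting.")
    if not output_path.is_dir():
        print(f"Output directory '{output_path}' does not exist, creating it.")
        output_path.mkdir(parents=True)
    return input_path, output_path
```

Editing the default= values in parse_args is what the optional steps above refer to when they mention the argparse default section.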

Output

NanoCLUST's outputs are organized in the specified output directory or, by default, in your current working directory. For each NanoCLUST run, a separate folder is created, named after the corresponding sample. Each folder contains the three output directories generated by NanoCLUST: classification data, FastQC results, and pipeline info.
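To illustrate how one run per input file with one output folder per sample might be set up, here is a hedged sketch. It is not the script's actual code: build_commands is a hypothetical helper, the docker profile is an assumption, and the Nextflow parameters (--reads, --db, --tax, --outdir) are assumed to match those of your NanoCLUST version.

```python
from pathlib import Path


def build_commands(input_dir, output_dir, suffix, main_nf, db, tax):
    """Return one Nextflow command (as an argument list) per matching input file."""
    commands = []
    for reads in sorted(Path(input_dir).iterdir()):
        if not reads.name.endswith(suffix):
            continue  # skip files that do not carry the requested suffix
        # Sample name = file name without the suffix.
        sample = reads.name[: len(reads.name) - len(suffix)]
        commands.append([
            "nextflow", "run", str(main_nf),
            "-profile", "docker",  # assumed execution profile
            "--reads", str(reads),
            "--db", str(db),
            "--tax", str(tax),
            "--outdir", str(Path(output_dir) / sample),  # one folder per sample
        ])
    return commands
```

Each returned list can be passed to subprocess.run(command, check=True) to start the runs one after another.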

Example commands

Basic command, only input directory specified (default settings):

python NC_runner.py sequencedata 

Input and output directory, file suffix, main.nf path, and database paths specified:

python NC_runner.py sequencedata -o NanoCLUST_out -s .fastq.gz -n project1/programs/NanoCLUST/main.nf -d project1/db/16S_ribosomal_RNA -t project1/db/taxdb

Getting help

python NC_runner.py -h

NC_summarizer.py

The NC_summarizer.py script facilitates the concatenation of taxonomic classification results from multiple NanoCLUST runs into a single CSV file. Users can specify the taxonomic level from which the results should be concatenated.

This script is useful when comparing taxonomic classification results across multiple NanoCLUST runs. Barplots containing relative abundances across samples can be easily created based on the generated CSV.

Usage

  1. Download the NC_summarizer.py script to a convenient location, preferably your home directory to minimize path-related errors.
  2. (Optional) Modify the default output directory location in the argparse section of the script.
  3. (Optional) Modify the default taxonomic level in the argparse section of the script.
  4. Execute the script!

Output

The program generates a single CSV file with three columns: "runname", "taxid", and "rel_abundance". Each row describes one taxonomic ID detected in one NanoCLUST run: "runname" holds the name of the run, "taxid" holds the taxonomic ID, and "rel_abundance" holds the relative abundance of that taxid in that run.
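As an example of working with this file, the sketch below reshapes the long-format CSV into a run-by-taxid table, which is the shape most plotting tools expect for a stacked barplot. It assumes the CSV has a header row containing exactly the three column names above; the function names are hypothetical and not part of NC_summarizer.py.

```python
import csv
from collections import defaultdict


def load_abundance_table(csv_path):
    """Read the summary CSV into {runname: {taxid: rel_abundance}}."""
    table = defaultdict(dict)
    with open(csv_path, newline="") as handle:
        # Assumes a header row with runname, taxid and rel_abundance.
        for row in csv.DictReader(handle):
            table[row["runname"]][row["taxid"]] = float(row["rel_abundance"])
    return dict(table)


def to_matrix(table):
    """Return (runs, taxids, rows); taxa absent from a run are filled with 0.0."""
    runs = sorted(table)
    taxids = sorted({taxid for abundances in table.values() for taxid in abundances})
    rows = [[table[run].get(taxid, 0.0) for taxid in taxids] for run in runs]
    return runs, taxids, rows
```

Calling to_matrix(load_abundance_table("NanoCLUST_out/NCsummary.csv")) gives one row per run and one column per taxid.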

Example commands

Basic command, only input directory specified (default settings):

python NC_summarizer.py NanoCLUST_out

Input directory, output file location, and taxonomic level specified:

python NC_summarizer.py NanoCLUST_out -o NanoCLUST_out/NCsummary.csv -l species

Getting help

python NC_summarizer.py -h

NC_gridsearch.py

The NC_gridsearch.py script performs NanoCLUST runs with a set of combinations of UMAP set size and minimum cluster size values.

This script is useful when determining the optimal UMAP set size and minimum cluster size parameters for your data.

For more information on UMAP set size and minimum cluster size, refer to the NanoCLUST GitHub page.

Usage

  1. Download the NC_gridsearch.py script to a convenient location, preferably your home directory to minimize path-related errors.
  2. (Optional) Modify the default set of parameters to be tested in the argparse section of the script.
  3. (Optional) Modify the default output directory location in the argparse section of the script to reduce command length and prevent path-related errors.
  4. (Optional) In the script's argparse default section, provide the absolute paths for your database and tax-database.
  5. (Optional) Include the path to the main.nf file from NanoCLUST in the argparse default section.
  6. Execute the script!

Output

The outputs of all NanoCLUST runs are organized in the specified output directory or, by default, in your current working directory. For each NanoCLUST run, a separate folder is created, named after the UMAP set size and minimum cluster size used. Each folder contains the three output directories generated by NanoCLUST: classification data, FastQC results, and pipeline info.
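A grid search of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the script's real logic: it pairs every UMAP set size with every minimum cluster size, which may differ from how NC_gridsearch.py interprets its -p values. The helper name, the folder naming scheme, the docker profile and the default paths are all hypothetical, and --umap_set_size and --min_cluster_size are assumed to be the parameter names of your NanoCLUST version.

```python
from itertools import product
from pathlib import Path


def gridsearch_commands(reads, output_dir, umap_set_sizes, min_cluster_sizes,
                        main_nf="main.nf", db="db/16S_ribosomal_RNA", tax="db/taxdb"):
    """Return (run_folder, command) pairs, one per parameter combination."""
    runs = []
    for umap_size, cluster_size in product(umap_set_sizes, min_cluster_sizes):
        # Hypothetical folder naming scheme encoding both parameter values.
        run_dir = Path(output_dir) / f"umap{umap_size}_mcs{cluster_size}"
        command = [
            "nextflow", "run", str(main_nf),
            "-profile", "docker",  # assumed execution profile
            "--reads", str(reads),
            "--db", str(db),
            "--tax", str(tax),
            "--umap_set_size", str(umap_size),
            "--min_cluster_size", str(cluster_size),
            "--outdir", str(run_dir),
        ]
        runs.append((run_dir, command))
    return runs
```

Because the number of runs is the product of the two list lengths, keeping both lists short limits the total runtime.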

Example commands

Basic command, only input reads specified (default settings):

python NC_gridsearch.py reads.fastq

Input reads, output directory, parameters, database, tax-database, and main.nf path specified:

python NC_gridsearch.py reads.fastq -o gridsearch -p 100 120 140 160 180 200 -d databases/NanoCLUST/db -t databases/NanoCLUST/taxdb -n NanoCLUST/main.nf

Getting help

python NC_gridsearch.py -h
