Programs that enhance NanoCLUST V1.0dev usage and output.
- NC_runner.py
- NC_summarizer.py
- NC_gridsearch.py
To be included:
- summary_to_phyloseq.py (create OTU and TAX files for phyloseq based on CSV summary)
- phyloseqize.py (create OTU and TAX files for phyloseq directly from NanoCLUST output, no summary needed; possibly combine with summary_to_phyloseq.py)
- NC_cluster_concat.py (concatenate cluster consensus sequences from multiple NanoCLUST runs into .fasta file)
- concat_fastqgz.py (recursively concatenate the .fastq.gz files in a directory into separate .fastq files, which is needed to run NanoCLUST; possibly combine with NC_runner.py)
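Since concat_fastqgz.py is not included yet, the following is only a minimal sketch of the idea, not the planned implementation. It assumes that "separate .fastq files" means one output file per folder that holds .fastq.gz files (for example one per barcode folder); the function name `concat_fastqgz` and the output naming are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def concat_fastqgz(root: Path, out_dir: Path) -> list[Path]:
    """Write one decompressed .fastq per folder under root that holds .fastq.gz files."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the compressed files by the folder that contains them.
    groups: dict[Path, list[Path]] = {}
    for gz_path in sorted(root.rglob("*.fastq.gz")):
        groups.setdefault(gz_path.parent, []).append(gz_path)

    written = []
    for folder, gz_files in groups.items():
        # Assumption: folder names are unique, so they can serve as file names.
        target = out_dir / f"{folder.name}.fastq"
        with open(target, "wb") as out:
            for gz_path in gz_files:
                with gzip.open(gz_path, "rb") as src:
                    shutil.copyfileobj(src, out)  # stream, do not load into memory
        written.append(target)
    return written
```

Folders with the same name at different depths would overwrite each other in this sketch, so a real script would need a more careful naming scheme.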
![image](https://private-user-images.githubusercontent.com/126883391/275911222-acb76a00-2832-4ebd-98f6-c0f55e605051.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkwNzUwMDgsIm5iZiI6MTczOTA3NDcwOCwicGF0aCI6Ii8xMjY4ODMzOTEvMjc1OTExMjIyLWFjYjc2YTAwLTI4MzItNGViZC05OGY2LWMwZjU1ZTYwNTA1MS5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwOVQwNDE4MjhaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04YWRkMjI5Mzk3ZTJkYjhkZDI2YjM4M2FkNmUzZDc5MzYyZGFkMTNjODdlMjViNjI3Y2U2M2NlYjYwYThkMTA5JlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9._2UpADmEY6Gwje7cOnfOPCibEvjDx8gRDany9xrY7Dg)
The NC_runner.py script streamlines the execution of NanoCLUST on multiple files within a directory using just one command.
This script is useful when you have multiple FASTQ files and selecting them with wildcards does not work.
- Download the NC_runner.py script to a convenient location, preferably your home directory to minimize path-related errors.
- (Optional) In the script's argparse default section, provide the absolute paths for your database and tax-database to reduce command length and prevent path-related errors.
- (Optional) Include the path to the main.nf file from NanoCLUST in the argparse default section.
- (Optional) Modify the default output directory location.
- (Optional) Adjust the default suffix in the argparse section to a suffix you commonly use.
- Execute the script!
If no input directory is specified or the input directory doesn't exist, the script will exit.
If the specified output directory doesn't exist, the script will notify you and create it.
NanoCLUST's outputs are written to the specified output directory, or to your current working directory by default. For each NanoCLUST run, a separate folder named after the sample is created. Each folder contains the three output directories generated by NanoCLUST (classification data, FastQC results, and pipeline info).
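The behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in NC_runner.py: the function name `plan_runs` is hypothetical, and the Nextflow options (`-profile docker`, `--reads`, `--db`, `--tax`, `--outdir`) are assumed from NanoCLUST's documented usage and may differ from what the script actually passes.

```python
import sys
from pathlib import Path


def plan_runs(in_dir, out_dir, suffix, main_nf, db, tax):
    """Return one (sample_name, command) pair per read file in in_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)

    # Exit when the input directory is missing.
    if not in_dir.is_dir():
        sys.exit(f"Input directory not found: {in_dir}")

    # Notify the user and create the output directory if needed.
    if not out_dir.exists():
        print(f"Output directory {out_dir} does not exist, creating it.")
        out_dir.mkdir(parents=True)

    plans = []
    for reads in sorted(in_dir.glob(f"*{suffix}")):
        sample = reads.name.removesuffix(suffix)  # sample name = file name without suffix
        cmd = [
            "nextflow", "run", str(main_nf),
            "-profile", "docker",                # assumed profile
            "--reads", str(reads),
            "--db", str(db),
            "--tax", str(tax),
            "--outdir", str(out_dir / sample),   # one folder per sample
        ]
        plans.append((sample, cmd))
    return plans
```

Each command could then be executed in turn with `subprocess.run(cmd, check=True)`. Building the commands separately from running them makes it easy to print them first as a dry run.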
Basic command, only input directory specified (default settings):
python NC_runner.py sequencedata
Input and output directory, file suffix, main.nf path, and database paths specified:
python NC_runner.py sequencedata -o NanoCLUST_out -s .fastq.gz -n project1/programs/NanoCLUST/main.nf -d project1/db/16S_ribosomal_RNA -t project1/db/taxdb
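For readers who want to adjust the defaults mentioned in the optional steps, an argparse section for these options might look roughly like the sketch below. Only the short flags (`-o`, `-s`, `-n`, `-d`, `-t`) come from the command above; the long option names, help texts, and default values are hypothetical placeholders, not the script's actual ones.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Run NanoCLUST on every read file in a directory."
    )
    # Required positional argument: argparse exits with an error if it is missing.
    parser.add_argument("input_dir", help="directory containing the read files")
    # Long names and defaults below are placeholders; edit them to match your setup.
    parser.add_argument("-o", "--output", default=".", help="output directory")
    parser.add_argument("-s", "--suffix", default=".fastq", help="read file suffix")
    parser.add_argument("-n", "--nanoclust", default="NanoCLUST/main.nf",
                        help="path to NanoCLUST's main.nf")
    parser.add_argument("-d", "--database", default="db/16S_ribosomal_RNA",
                        help="path to the BLAST database")
    parser.add_argument("-t", "--taxdb", default="db/taxdb",
                        help="path to the taxonomy database")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Setting absolute paths as the `default=` values is what keeps the everyday command short.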
Getting help:
python NC_runner.py -h
![image](https://private-user-images.githubusercontent.com/126883391/298276308-3751d3d0-adee-44f9-ba33-f7212d9a5383.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkwNzUwMDgsIm5iZiI6MTczOTA3NDcwOCwicGF0aCI6Ii8xMjY4ODMzOTEvMjk4Mjc2MzA4LTM3NTFkM2QwLWFkZWUtNDRmOS1iYTMzLWY3MjEyZDlhNTM4My5wbmc_WC1BbXotQWxnb3JpdGhtPUFXUzQtSE1BQy1TSEEyNTYmWC1BbXotQ3JlZGVudGlhbD1BS0lBVkNPRFlMU0E1M1BRSzRaQSUyRjIwMjUwMjA5JTJGdXMtZWFzdC0xJTJGczMlMkZhd3M0X3JlcXVlc3QmWC1BbXotRGF0ZT0yMDI1MDIwOVQwNDE4MjhaJlgtQW16LUV4cGlyZXM9MzAwJlgtQW16LVNpZ25hdHVyZT04ZTM0OGM1OGU2NzdjOTBmYTI4NGU3ZDljZGQ2ZWFmZTljNWNmZWZhZmExZDU1ZmJmNzI5MmQ1NmI1YTc2YTAzJlgtQW16LVNpZ25lZEhlYWRlcnM9aG9zdCJ9.PV2oGLmQfkOQU_ccjODOqo8oX3JkDiUeKUSjYheyDCc)
The NC_summarizer.py script concatenates the taxonomic classification results from multiple NanoCLUST runs into a single CSV file. Users can specify the taxonomic level at which the results are collected.
This script is useful when comparing taxonomic classification results across multiple NanoCLUST runs. Barplots containing relative abundances across samples can be easily created based on the generated CSV.
- Download the NC_summarizer.py script to a convenient location, preferably your home directory to minimize path-related errors.
- (Optional) Modify the default output directory location in the argparse section of the script.
- (Optional) Modify the default taxonomic level in the argparse section of the script.
- Execute the script!
The program generates a single CSV file with three columns: "runname", "taxid", and "rel_abundance". Each row pairs a NanoCLUST run name ("runname") with one taxonomic ID detected in that run ("taxid") and the relative abundance of that taxid in that run ("rel_abundance").
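As a rough illustration of how this long-format CSV can be reshaped for plotting, the sketch below pivots it into one entry per taxid with one value per run. It assumes the file has a header row with exactly the three column names above; the function name `pivot_summary` and the example rows (run names, taxids, abundances) are made up for demonstration.

```python
import csv
import io
from collections import defaultdict


def pivot_summary(handle):
    """Turn long-format rows into {taxid: {runname: rel_abundance}}."""
    table = defaultdict(dict)
    for row in csv.DictReader(handle):
        table[row["taxid"]][row["runname"]] = float(row["rel_abundance"])
    return dict(table)


# Made-up example data in the described three-column layout.
example = io.StringIO(
    "runname,taxid,rel_abundance\n"
    "sampleA,562,0.75\n"
    "sampleA,1280,0.25\n"
    "sampleB,562,0.5\n"
)
wide = pivot_summary(example)
for taxid, per_run in wide.items():
    print(taxid, per_run)
```

With a real summary, replace the `io.StringIO` object with `open("NCsummary.csv", newline="")`. A taxid missing from a run simply has no entry for that run, which a plotting step can treat as zero.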
Basic command, only input directory specified (default settings):
python NC_summarizer.py NanoCLUST_out
Input directory, output file location, and taxonomic level specified:
python NC_summarizer.py NanoCLUST_out -o NanoCLUST_out/NCsummary.csv -l species
Getting help:
python NC_summarizer.py -h
The NC_gridsearch.py script runs NanoCLUST once for each combination of UMAP set size and minimum cluster size taken from a set of values.
This script is useful when determining the optimal UMAP set size and minimum cluster size parameters for your data.
For more information on UMAP set size and minimum cluster size, refer to the NanoCLUST GitHub page.
- Download the NC_gridsearch.py script to a convenient location, preferably your home directory to minimize path-related errors.
- (Optional) Modify the default set of parameters to be tested in the argparse section of the script.
- (Optional) Modify the default output directory location in the argparse section of the script to reduce command length and prevent path-related errors.
- (Optional) In the script's argparse default section, provide the absolute paths for your database and tax-database.
- (Optional) Include the path to the main.nf file from NanoCLUST in the argparse default section.
- Execute the script!
The outputs of all NanoCLUST runs are written to the specified output directory, or to your current working directory by default. For each NanoCLUST run, a separate folder is created, named after the UMAP set size and minimum cluster size used. Each folder contains the three output directories generated by NanoCLUST (classification data, FastQC results, and pipeline info).
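The grid search idea can be sketched as below; this is not the code in NC_gridsearch.py. It assumes the two value lists are given separately (how the `-p` option maps onto the two parameters is not described here), that NanoCLUST accepts `--umap_set_size`, `--min_cluster_size`, and `--outdir`, and that Docker is the profile in use. The function name `plan_grid` and the folder naming pattern are hypothetical.

```python
from itertools import product
from pathlib import Path


def plan_grid(reads, out_dir, umap_sizes, cluster_sizes, main_nf, db, tax):
    """Return one (run_folder, command) pair per parameter combination."""
    plans = []
    # product() yields every (UMAP set size, minimum cluster size) pair.
    for umap_size, cluster_size in product(umap_sizes, cluster_sizes):
        # Hypothetical naming: the folder records both parameter values.
        run_dir = Path(out_dir) / f"umap{umap_size}_mcs{cluster_size}"
        cmd = [
            "nextflow", "run", str(main_nf),
            "-profile", "docker",  # assumed profile
            "--reads", str(reads),
            "--db", str(db),
            "--tax", str(tax),
            "--umap_set_size", str(umap_size),
            "--min_cluster_size", str(cluster_size),
            "--outdir", str(run_dir),
        ]
        plans.append((run_dir, cmd))
    return plans
```

The number of runs grows as the product of the two list lengths, so six values for each parameter already means 36 NanoCLUST runs.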
Basic command, only input reads specified (default settings):
python NC_gridsearch.py reads.fastq
Input reads, output directory, parameters, database, tax-database, and main.nf path specified:
python NC_gridsearch.py reads.fastq -o gridsearch -p 100 120 140 160 180 200 -d databases/NanoCLUST/db -t databases/NanoCLUST/taxdb -n NanoCLUST/main.nf
Getting help:
python NC_gridsearch.py -h