This pipeline processes single-cell RNA-seq data, starting with quality control using FastQC and proceeding to barcode-aware alignment and quantification using STARsolo. The pipeline is written in Nextflow to ensure reproducibility, scalability, and efficient handling of high-throughput sequencing data.
- FastQC: Performs quality control on raw FASTQ files.
- STARsolo: Aligns reads to a reference genome and performs cell barcode and UMI-based quantification.
- Nextflow: Workflow orchestrator.
- Ensure Nextflow is installed on your system:
curl -s https://get.nextflow.io | bash
This pipeline takes a set of required and optional parameters. To list every input with its description and file type, run the following on the command line:
$ nextflow run help.nf
- CSV file: A CSV file listing the input FASTQ files, one sample per row.
- Reference Genome: A directory containing STAR genome indices generated by STAR.
- Barcode Whitelist File: A text file containing valid cell barcodes for STARsolo.
- Sample CSV File Containing List of FASTQ Files
sample_id,read_1,read_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
sample3,/path/to/sample3_R1.fastq.gz,/path/to/sample3_R2.fastq.gz
- sample_id: A unique identifier for each sample.
- read_1: The path to the forward (R1) read FASTQ file.
- read_2: The path to the reverse (R2) read FASTQ file (if paired-end; otherwise, this can be empty for single-end).
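Before launching a run, it can help to sanity-check the CSV against the column rules above. The following is a minimal sketch, not part of the pipeline: the function name `check_readsfile` is hypothetical, and it only checks the header, unique `sample_id` values, and a non-empty `read_1`.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_id", "read_1", "read_2"]

def check_readsfile(handle):
    """Return a list of problems found in a readsfile CSV (empty list = OK)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample_id = (row["sample_id"] or "").strip()
        if not sample_id:
            problems.append(f"line {line_no}: empty sample_id")
        elif sample_id in seen:
            problems.append(f"line {line_no}: duplicate sample_id '{sample_id}'")
        else:
            seen.add(sample_id)
        if not (row["read_1"] or "").strip():
            problems.append(f"line {line_no}: read_1 is empty")
        # read_2 may be empty for single-end data, so it is not checked here
    return problems

# Small in-memory demo; for a real file use open("readsfile.csv", newline="")
example = io.StringIO(
    "sample_id,read_1,read_2\n"
    "sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz\n"
)
print(check_readsfile(example))
```

An empty list means no problems were found; each string in a non-empty list names the offending line.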
You can manually create a CSV file in any text editor (e.g., Notepad, VS Code, Sublime Text), or you can use the Python script create_csv_readsfile.py located in the src/ folder.
To run the script, use the following command:
python create_csv_readsfile.py --fastq_dir /path/to/fastq_files --output_csv readsfile.csv
- fastq_dir: The path for the directory where the FASTQ files are located.
- output_csv (optional): The name of the output CSV file. If not provided, it defaults to readsfile.csv.
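For readers curious how such a CSV can be generated, here is an illustrative sketch. It is not the code of create_csv_readsfile.py. It assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, and the function names `collect_samples` and `write_readsfile` are hypothetical.

```python
import csv
import re
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq.gz / <sample>_R2.fastq.gz
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")

def collect_samples(fastq_dir):
    """Group FASTQ files in fastq_dir by sample name and read number."""
    samples = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = READ_PATTERN.match(path.name)
        if match:
            reads = samples.setdefault(match["sample"], {})
            reads[match["read"]] = str(path.resolve())
    return samples

def write_readsfile(fastq_dir, output_csv="readsfile.csv"):
    """Write sample_id,read_1,read_2 rows; a missing read is left empty."""
    samples = collect_samples(fastq_dir)
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_id", "read_1", "read_2"])
        for sample_id, reads in sorted(samples.items()):
            writer.writerow([sample_id, reads.get("1", ""), reads.get("2", "")])
    return len(samples)  # number of samples written
```

Files that do not match the assumed pattern are ignored, so a different naming scheme would need a different regular expression.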
- Reference Genome: Pre-built STAR index. An example of input data can be found in the folder test_data/star/STARindex.
- Barcode Whitelist: An example of the whitelist file can be found in test_data/star/10x_V2_barcode_whitelist.txt.
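A custom whitelist can be checked quickly before it is passed to the pipeline. The sketch below assumes the usual layout of one barcode per line, containing only A, C, G, and T, with all barcodes the same length. The function name `check_whitelist` is hypothetical and the check is not part of the pipeline.

```python
def check_whitelist(lines):
    """Return (barcode_count, problems) for an iterable of whitelist lines."""
    problems = []
    lengths = set()
    seen = set()
    for line_no, line in enumerate(lines, start=1):
        barcode = line.strip()
        if not barcode:
            continue  # ignore blank lines
        if set(barcode) - set("ACGT"):
            problems.append(f"line {line_no}: unexpected characters in '{barcode}'")
        if barcode in seen:
            problems.append(f"line {line_no}: duplicate barcode '{barcode}'")
        seen.add(barcode)
        lengths.add(len(barcode))
    if len(lengths) > 1:
        problems.append(f"barcodes have mixed lengths: {sorted(lengths)}")
    return len(seen), problems
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `check_whitelist(open("/path/to/whitelist.txt"))`.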
- output_dir - Path to the output folder [Default: results]
- aligner - Aligner to use (kallisto, cellranger, etc.) [Default: star]
- skip_fastqc - Whether to skip FastQC [Default: null]
- Clone the repository containing the pipeline:
git clone https://github.com/JPejovicApis/GeneXOmics.git
cd GeneXOmics
- Run the pipeline with Nextflow:
nextflow run main.nf --readsfile /path/to/csvfile/readsfile.csv \
--starindex /path/to/STAR/genome/indices \
--whitelist /path/to/whitelist.txt \
--output_dir results/
If you don't specify --output_dir, Nextflow creates a results directory in your current directory and places your results there.
To place your results somewhere else, pass a different path with the --output_dir parameter in your Nextflow command line.
If a run fails, fix the underlying issue and rerun the same command with the -resume option. Nextflow then reuses the cached results of tasks that already completed and continues from the point of failure.
- FastQC Output Files:
  - _fastqc.html: The main HTML report providing a detailed visual summary of the quality control analysis for the sequencing data. You can open this file in a web browser to view metrics such as per base sequence quality, GC content, adapter content, and more.
  - _fastqc.zip: A compressed archive containing all output files, including the fastqc.html report, raw data, summary, and log files. Extract the archive to inspect individual files if needed.
  - Inside the fastqc.zip archive:
    - fastqc_data.txt: Contains raw data and metrics in tabular form, including sequence quality scores, GC content, and sequence length distribution.
    - summary.txt: A summary of the quality control checks, including pass, warning, or fail status for each metric.
    - fastqc_report.html: The HTML report included in the ZIP file.
    - Images/: PNG images for each plot shown in the HTML report:
      - per_base_quality.png: Per base quality score plot.
      - per_base_sequence_content.png: Per base sequence content plot.
      - per_sequence_gc_content.png: GC content distribution.
      - sequence_length_distribution.png: Sequence length distribution plot.
      - adapter_content.png: Adapter content plot.
      - Other quality control metric-related images.
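When many samples are processed, opening every HTML report by hand becomes tedious. As a minimal sketch, the statuses in summary.txt can be read straight from the ZIP archive. This assumes each line of summary.txt is tab-separated as status, module name, file name; the function names `read_fastqc_summary` and `flagged_modules` are hypothetical.

```python
import zipfile

def read_fastqc_summary(zip_path):
    """Return {module name: status} parsed from summary.txt in a FastQC zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # The files sit in a folder inside the archive, so search by suffix.
        name = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(name).decode("utf-8")
    statuses = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            status, module = fields[0], fields[1]
            statuses[module] = status
    return statuses

def flagged_modules(zip_path):
    """List modules whose status is WARN or FAIL, in file order."""
    summary = read_fastqc_summary(zip_path)
    return [module for module, status in summary.items() if status in ("WARN", "FAIL")]
```

Calling `flagged_modules` on each `_fastqc.zip` file gives a short list of the checks worth inspecting in the full report.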
- STARsolo Output Files:
  - Solo.out/Gene/: Folder containing the gene expression matrix output from scRNA-seq quantification.
    - barcodes.tsv: List of detected cell barcodes.
    - features.tsv: A list of gene features (such as gene names or gene IDs) quantified in the run. Each line corresponds to a gene, typically annotated with a gene symbol or identifier.
    - matrix.mtx: The core output file in Matrix Market (MTX) format, which stores the sparse matrix of gene expression counts. This file contains three columns: the row (gene), the column (cell barcode), and the count (expression level). This matrix can be loaded into downstream tools such as Seurat, Scanpy, or other single-cell analysis software for further processing.
    - summary.csv: A CSV file that summarizes the read processing statistics for each cell barcode, such as the number of reads, unique molecular identifiers (UMIs), and detected genes.
  - Solo.out/Log.final.out: A log file containing the summary of the STARsolo alignment and quantification process. It includes information on the number of reads, alignments, and barcodes processed, as well as performance metrics.
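To make the three-column layout of matrix.mtx concrete, here is a standard-library sketch that reads the (gene, barcode, count) triplets and sums the counts per barcode. In practice a dedicated reader such as the ones shipped with Scanpy or Seurat is the better choice. The function names `read_mtx_triplets` and `counts_per_barcode` are hypothetical.

```python
from collections import Counter

def read_mtx_triplets(handle):
    """Yield (gene_row, barcode_col, count) from a coordinate-format MTX stream.

    Row and column indices in the file are 1-based.
    """
    size_line_seen = False
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        if not size_line_seen:
            size_line_seen = True  # first data line is: n_rows n_cols n_entries
            continue
        row, col, value = line.split()[:3]
        yield int(row), int(col), float(value)

def counts_per_barcode(handle):
    """Sum counts over all genes for each barcode column index."""
    totals = Counter()
    for _row, col, value in read_mtx_triplets(handle):
        totals[col] += value
    return dict(totals)
```

The column index `n` returned here corresponds to line `n` of barcodes.tsv, and the row index to the matching line of features.tsv.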
- FastQC Reports: Generated for each input FASTQ file, available in the results/fastqc/ directory.
- STARsolo Outputs: Barcode-aware alignments, cell-UMI matrices, and gene quantifications, available in the results/STARsolo/test_rnaseq_Solo.out/ directory.
nextflow run main.nf -profile test_aws -plugins nf-quilt --outdir 'quilt+s3://GeneXOmics/results'
To run the pipeline locally, download the test_data folder and run the following command:
nextflow run main.nf -profile test_local