JPejovicApis/GeneXOmics

Single-cell RNA-seq Pipeline: FastQC and STARsolo

This pipeline processes single-cell RNA-seq data, starting with quality control using FastQC and proceeding to barcode-aware alignment and quantification using STARsolo. The pipeline is written in Nextflow to ensure reproducibility, scalability, and efficient handling of high-throughput sequencing data.

Pipeline Overview

  • FastQC: Performs quality control on raw FASTQ files.

  • STARsolo: Aligns reads to a reference genome and performs cell barcode and UMI-based quantification.

Requirements

  • Software Dependencies

    • Nextflow: Workflow orchestrator.
  • Required Tools and Installation Instructions

    Ensure Nextflow is installed on your system:

    Nextflow: Installation Guide

curl -s https://get.nextflow.io | bash

Input Files

This pipeline takes a set of required and optional parameters. To list every input with its description and file type, run the following on the command line:

$ nextflow run help.nf
  • CSV file: CSV file that contains a list of FASTQ files.
  • Reference Genome: A directory containing STAR genome indices generated by STAR.
  • Barcode Whitelist File: A text file containing valid cell barcodes for STARsolo.

Example Inputs

  1. Sample CSV File Containing List of FASTQ Files
sample_id,read_1,read_2
sample1,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
sample2,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
sample3,/path/to/sample3_R1.fastq.gz,/path/to/sample3_R2.fastq.gz
  • sample_id: A unique identifier for each sample.
  • read_1: The path to the forward (R1) read FASTQ file.
  • read_2: The path to the reverse (R2) read FASTQ file (if paired-end; otherwise, this can be empty for single-end).

You can manually create a CSV file in any text editor (e.g., Notepad, VS Code, Sublime Text), or you can use the Python script create_csv_readsfile.py located in the src/ folder.

To run the script, use the following command:

python create_csv_readsfile.py --fastq_dir /path/to/fastq_files --output_csv readsfile.csv
  • fastq_dir: The path to the directory containing the FASTQ files.
  • output_csv (optional): The name of the output CSV file. If not provided, it defaults to readsfile.csv.
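
The pairing step that such a script performs can be sketched as follows. This is a minimal illustration and not the contents of create_csv_readsfile.py: the function names and the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming convention are assumptions taken from the example CSV above.

```python
import csv
import re
from pathlib import Path

# Assumed naming convention: <sample>_R1.fastq(.gz) and <sample>_R2.fastq(.gz)
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq(\.gz)?$")


def collect_pairs(fastq_dir):
    """Group the FASTQ files in fastq_dir by sample ID and read number."""
    pairs = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = READ_PATTERN.match(path.name)
        if match:
            reads = pairs.setdefault(match["sample"], {})
            reads[match["read"]] = str(path.resolve())
    return pairs


def write_readsfile(fastq_dir, output_csv="readsfile.csv"):
    """Write a sample_id,read_1,read_2 CSV and return the number of samples written."""
    pairs = collect_pairs(fastq_dir)
    written = 0
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_id", "read_1", "read_2"])
        for sample_id, reads in pairs.items():
            if "1" not in reads:
                continue  # a sample needs at least an R1 file
            # read_2 stays empty for single-end samples
            writer.writerow([sample_id, reads["1"], reads.get("2", "")])
            written += 1
    return written
```

Files that do not match the assumed pattern are ignored, so adjust the regular expression if your FASTQ files follow a different naming scheme.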
  2. Reference Genome: Pre-built STAR index. An example of input data can be found in the folder test_data/star/STARindex.

  3. Barcode Whitelist: An example of the whitelist file can be found in test_data/star/10x_V2_barcode_whitelist.txt.
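
Before launching a run, it can help to sanity-check the inputs described above. The sketch below is not part of the pipeline; the function names are hypothetical, and the whitelist check assumes one nucleotide barcode per line, which is how 10x-style whitelists are usually laid out.

```python
import csv
import re
from pathlib import Path

EXPECTED_COLUMNS = ["sample_id", "read_1", "read_2"]
BARCODE_PATTERN = re.compile(r"^[ACGTN]+$")


def check_readsfile(csv_path):
    """Return a list of problems found in the reads CSV (empty if none)."""
    problems = []
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != EXPECTED_COLUMNS:
            return [f"unexpected header: {reader.fieldnames}"]
        seen = set()
        for line_no, row in enumerate(reader, start=2):
            sample_id = row["sample_id"]
            if sample_id in seen:
                problems.append(f"line {line_no}: duplicate sample_id {sample_id!r}")
            seen.add(sample_id)
            if not row["read_1"]:
                problems.append(f"line {line_no}: read_1 is empty")
            for column in ("read_1", "read_2"):
                # read_2 may legitimately be empty for single-end data
                if row[column] and not Path(row[column]).is_file():
                    problems.append(f"line {line_no}: {column} not found: {row[column]}")
    return problems


def check_whitelist(whitelist_path):
    """Return problems found in a one-barcode-per-line whitelist (assumed layout)."""
    problems = []
    lengths = set()
    with open(whitelist_path) as handle:
        for line_no, line in enumerate(handle, start=1):
            barcode = line.strip()
            if not barcode:
                continue
            if not BARCODE_PATTERN.match(barcode):
                problems.append(f"line {line_no}: not a nucleotide sequence: {barcode!r}")
            lengths.add(len(barcode))
    if len(lengths) > 1:
        problems.append(f"barcodes have mixed lengths: {sorted(lengths)}")
    if not lengths:
        problems.append("whitelist is empty")
    return problems
```

Both functions return an empty list when nothing looks wrong, so a wrapper can print the problems and stop before any compute time is spent.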

List of parameters

- output_dir        - Path to the output folder [Default: results]
- aligner           - Aligner to use (kallisto, cellranger, etc.) [Default: star]
- skip_fastqc       - Whether to skip the FastQC step [Default: null]

Running the Pipeline

  • Clone the repository containing the pipeline:
git clone https://github.com/JPejovicApis/GeneXOmics.git
cd GeneXOmics
  • Run the pipeline with Nextflow:
nextflow run main.nf --readsfile /path/to/csvfile/readsfile.csv \
                     --starindex /path/to/STAR/genome/indices \
                     --whitelist /path/to/whitelist.txt \
                     --output_dir results/

If you do not specify --output_dir, Nextflow creates a results directory in your current working directory and writes the output there. To write the results somewhere else, pass a different path with the --output_dir parameter.

If a run fails, fix the underlying issue and rerun the same command with the -resume option. Nextflow then reuses the cached results of completed steps and continues from the point of failure.
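
If you launch the pipeline from a script rather than by hand, the command above can be assembled programmatically. The following is a hedged sketch: `build_command` and `run_pipeline` are hypothetical helper names, and only the parameters shown in this README (--readsfile, --starindex, --whitelist, --output_dir, -resume) are used.

```python
import shlex
import subprocess


def build_command(readsfile, starindex, whitelist, output_dir=None, resume=False):
    """Assemble the nextflow argument list from the parameters described above."""
    cmd = [
        "nextflow", "run", "main.nf",
        "--readsfile", readsfile,
        "--starindex", starindex,
        "--whitelist", whitelist,
    ]
    if output_dir:
        # Omit the flag to fall back to the pipeline default (results)
        cmd += ["--output_dir", output_dir]
    if resume:
        # Continue a previous run instead of starting over
        cmd.append("-resume")
    return cmd


def run_pipeline(cmd):
    """Print the command, then run it; raises CalledProcessError on failure."""
    print(shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

Calling `build_command(...)` a second time with `resume=True` after a failed run gives the same command with -resume appended.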

Output Files

  1. FastQC Output Files:
  • _fastqc.html: The main HTML report providing a detailed visual summary of the quality control analysis for the sequencing data. You can open this file in a web browser to view metrics such as per base sequence quality, GC content, adapter content, and more.

  • _fastqc.zip: A compressed archive containing all output files, including the HTML report, raw data, summary, and log files. Extract the archive to inspect individual files if needed.

  • Inside the _fastqc.zip archive:

    • fastqc_data.txt: Contains raw data and metrics in tabular form, including sequence quality scores, GC content, and sequence length distribution.

    • summary.txt: A summary of the quality control checks, including pass, warning, or fail status for each metric.

    • fastqc_report.html: The HTML report included in the ZIP file.

    • Images/: PNG images for each plot shown in the HTML report:

      • per_base_quality.png: Per base quality score plot.

      • per_base_sequence_content.png: Per base sequence content plot.

      • per_sequence_gc_content.png: GC content distribution.

      • sequence_length_distribution.png: Sequence length distribution plot.

      • adapter_content.png: Adapter content plot.

      • Other quality control metric-related images.
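
The pass, warning, or fail statuses in summary.txt can be read without extracting the archive. This is a minimal sketch with hypothetical function names; it assumes the usual FastQC layout, where summary.txt is tab-separated (status, module name, file name) and sits inside a top-level folder in the zip.

```python
import zipfile


def read_fastqc_summary(zip_path):
    """Return {module name: status} parsed from summary.txt inside a FastQC zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # summary.txt is assumed to sit inside a top-level <name>_fastqc/ folder
        member = next(n for n in archive.namelist() if n.endswith("summary.txt"))
        text = archive.read(member).decode("utf-8")
    statuses = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line: STATUS <tab> module name <tab> FASTQ file name
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def failed_modules(zip_path):
    """List the modules that FastQC marked as FAIL, in alphabetical order."""
    summary = read_fastqc_summary(zip_path)
    return sorted(module for module, status in summary.items() if status == "FAIL")
```

Running `failed_modules` over every archive in the FastQC output directory gives a quick overview of which samples need a closer look in the HTML report.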

  2. STARsolo Output Files:
  • Solo.out/Gene/: Folder containing the gene expression matrix output from scRNA-seq quantification.

    • barcodes.tsv: List of detected cell barcodes.

    • features.tsv: A list of gene features (such as gene names or gene IDs) quantified in the run. Each line corresponds to a gene, typically annotated with a gene symbol or identifier.

    • matrix.mtx: The core output file in Matrix Market (MTX) format, which stores the sparse matrix of gene expression counts. After the header lines, each entry has three columns: the row index (gene), the column index (cell barcode), and the count (expression level). Row and column indices refer to the lines of features.tsv and barcodes.tsv. This matrix can be loaded into downstream tools such as Seurat or Scanpy for further processing.

    • summary.csv: A CSV file that summarizes the read processing statistics for each cell barcode, such as the number of reads, unique molecular identifiers (UMIs), and detected genes.

  • Solo.out/Log.final.out:

A log file containing the summary of the STARsolo alignment and quantification process. It includes information on the number of reads, alignments, and barcodes processed, as well as performance metrics.
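
To illustrate how matrix.mtx and barcodes.tsv fit together, the sketch below sums the counts per cell barcode using only the Python standard library. The function name is hypothetical, and the code assumes the three-column coordinate layout described above with 1-based indices. For real analyses a dedicated reader such as scipy.io.mmread is likely the better choice.

```python
from collections import Counter


def umi_totals_per_barcode(matrix_path, barcodes_path):
    """Sum counts per cell barcode from a Matrix Market coordinate file."""
    with open(barcodes_path) as handle:
        barcodes = [line.strip() for line in handle if line.strip()]

    totals = Counter()
    with open(matrix_path) as handle:
        # Skip the %%MatrixMarket banner and any % comment lines
        data_lines = (line for line in handle if not line.startswith("%"))
        # First data line holds the dimensions: genes, cells, non-zero entries
        n_genes, n_cells, n_entries = map(int, next(data_lines).split()[:3])
        if n_cells != len(barcodes):
            raise ValueError("matrix columns do not match barcodes.tsv")
        for line in data_lines:
            fields = line.split()
            if not fields:
                continue
            column = int(fields[1])  # 1-based index into barcodes.tsv
            totals[barcodes[column - 1]] += float(fields[2])
    return totals
```

`umi_totals_per_barcode(...).most_common(10)` then lists the ten barcodes with the highest total counts.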

Example Outputs

  • FastQC Reports: Generated for each input FASTQ file, available in the results/fastqc/ directory.

  • STARsolo Outputs: Barcode-aware alignments, cell-UMI matrices, and gene quantifications, available in the results/STARsolo/test_rnaseq_Solo.out/ directory.

Running pipeline on AWS

nextflow run main.nf -profile test_aws -plugins nf-quilt --outdir 'quilt+s3://GeneXOmics/results' 

Running with dummy data

To run the pipeline locally, download the test_data folder and run the following command:

nextflow run main.nf -profile test_local 
