Repository for the processing and analysis of CROP-seq data. This repository contains code to do the following steps for CROP-seq analysis:
- Reference preparation
- Guide quantification from scRNA-seq data
- Guide quantificationf from enriched RNAseq data
The reference requires the following information:
- Primary assembly of the organism (FASTA and GTF)
- Spreadsheet with gRNA sequence and target genes
- Spreadsheet with CROP-seq vector elements
- Use the following script: buildReference.py to build the guide reference.
python buildReference.py -i example_data/GWAS_sgRNAs.csv -c example_data/CROP\-seq.csv -o Test
- Concatenate the generated files to your primary assembly files, to create a new set of reference files.
> cat Homo_sapiens.GRCh38.dna.primary_assembly.fa CROPseq.fa > Homo_sapiens.CROPseq.fa
> cat Homo_sapiens.GRCh38.98.chr.gtf CROPseq.gtf > Homo_sapiens.CROPseq.gtf
- Convert the files to Cell Ranger reference files using cellranger mkref.
# Path to input files
GENOME_NAME=Homo_sapiens.CROPseq
INPUT_FASTA=Homo_sapiens.CROPseq.fa
INPUT_GTF=Homo_sapiens.CROPseq.gtf
INPUT_FILTERED_GTF=Homo_sapiens.CROPseq.Filtered.gtf
# Filter GTF
cellranger mkgtf ${INPUT_GTF} ${INPUT_FILTERED_GTF} --attribute=gene_biotype:protein_coding --attribute=gene_biotype:lincRNA --attribute=gene_biotype:antisense || exit 1
# Create STAR reference
cellranger mkref --genome=${GENOME_NAME} --fasta=${INPUT_FASTA} --genes=${INPUT_FILTERED_GTF} || exit 1
This step is usually done by the service provider. Samples need to be mapped to the reference prepared in the previous step.
Data for all arrays need to be combined via the cellranger aggr function, which not only concatenates the count matrices, but also performs a read depth equalization between all samples.
If the reference has worked, you will receive a count table with the guide and control RNA as additional genes in the expression matrix. These can be retrieved using the following steps.
This uses merged paired-end reads as generated by the Galaxy pipeline.
- Label reads with combined cell barcode and UMI sequence, located at the first 28 nucleotides of the merged sequences. This can be done with fastp.
fastp -i Galaxy112-[ARRAY1_R1_R2_merge].fastqsanger -o ARRAY1.fastq -U --umi_loc read1 --umi_len 28 -w 4
- Strip polyT tail from the reads using cutadapt.
cutadapt -j 4 -g "T{200}" -o ARRAY1_Processed.fastq ARRAY1.fastq
This can be run for all samples using the preprocessFastq.sh script.
Map guides to the reference generated in the first section using STAR aligner.
STAR --genomeDir $REF_DIR --readFilesCommand cat \
--readFilesIn $INPUT_FILE \
--outSAMtype BAM Unsorted \
--quantMode GeneCounts \
--runThreadN 16 \
--outFileNamePrefix $SAMPLE_NAME \
--outReadsUnmapped Fastx \
--outFilterScoreMinOverLread 0 \
--outFilterMatchNminOverLread 0 \
--outFilterMatchNmin 0 \
--outFilterMismatchNmax 2
This will output a BAM file and a count table that we will use to trace back to the original cells.
Run quantifyEnrichment.py. This will process the BAM files (eg. ARRAY1_Aligned.sortedByCoord.out.bam) generated by STAR. These files will need to be indexed and position-sorted to work.
The script can be invoked as follows:
python quantifyEnrichment.py -i ARRAY1_Aligned.sortedByCoord.out.bam -o ARRAY1_EnrichmentReads.tsv
This will generate a spreadsheet that contains the number of reads mapped to each UMI-barcoded molecule containing a gRNA, and the cell it came from.
gRNA | cell_barcode | umi | reads |
---|---|---|---|
GUIDES_sg001_ABCA1 | AAACCCATCACTGTTT | GTCCAGGGGGT | 1 |
GUIDES_sg001_ABCA1 | AAACGAAGTCCAGCGT | TATTGAGCATT | 1 |
GUIDES_sg001_ABCA1 | AAAGAACTCGGCGATC | TTATTTGCGTC | 1 |
GUIDES_sg001_ABCA1 | AAAGGGCGTCCCGTGA | GGCCGCCCTGC | 2 |
GUIDES_sg001_ABCA1 | AAATGGAAGCAGCCAG | ACATCAATCAG | 1 |
GUIDES_sg001_ABCA1 | AAATGGAGTAGACAGC | TACTTATAATG | 1 |
GUIDES_sg001_ABCA1 | AACAACCCAATTTCGG | AGTTGCTGCGG | 1 |
GUIDES_sg001_ABCA1 | AACAACCGTTTATGCG | TATATTATCAG | 3 |
GUIDES_sg001_ABCA1 | AACAAGAGTATGCTAC | GGCGAGATCAT | 1 |
The output must then be processed via the R script [quantifyEnrichment.R][quantifyEnrichment.R], which will parse the output and identify the guides present in each cell based on the MiSeq enrichment data.
Script is invoked as follows:
Rscript quantifyTranscriptome.R ARRAY1_EnrichmentReads.tsv ARRAY1
Guides are assigned to a cell if:
- They are the only guides found in a cell
- They have more reads than 3x the sum other reads assigned to other guides in the cell AND the guides target a different gene
- The guides for a single gene have 3x the sum of reads assigned to other guides in the cell
cell_barcode | gRNA | type | target |
---|---|---|---|
AAAAAAAAGGAGGGGG | sg097 | GUIDES | TIMP3 |
AAAAAAAAGGGTACGT | sg103 | GUIDES | TRIOBP |
AAAAAAGGACGGGGGA | sg020 | GUIDES | ANTXR1 |
AAAAACTGTCCTTAAG | sg045 | GUIDES | EMID1 |
AAAAAGATCCAAGGAA | NA | NA | NA |
AAAAAGTGAAAGAGAA | sg112 | GUIDES | TEX41 |
Run quantifyTranscriptome.R to retrieve gRNA-mapped UMI counts from the Chromium data. Input should be the filtered_feature_bc_matrix
found in the Cell Ranger output folder.
You will need to invoke the script as follows:
Rscript quantifyTranscriptome.R AH_array-1/filtered_feature_bc_matrix ARRAY1
This will generate a file with the following information:
cell_barcode | gRNA |
---|---|
AAACCCAAGAGTGTGC | NA |
AAACCCAAGCGCTGAA | NA |
AAACCCAAGGCACAAC | NA |
AAACCCAAGTCGTTAC | NA |
AAACCCAAGTCTGGAG | NA |
AAACCCACAGAGACTG | NA |
AAACCCACATAGGAGC | NA |
AAACCCAGTACTCGTA | NA |
AAACCCAGTATAGCTC | NA |
AAACCCAGTTGTTTGG | NA |
AAACCCATCACTGTTT | GUIDES-sg097-TIMP3-gene |
The processTranscriptome.R script performs two functions:
- Integration of assignments from read and transcriptome information
- Filter cells based on QC quality and adds gRNA assignments to gene-count information
It is recommended you run this script manually as you will need to review the QC reports to make decisions on which cells to retain.
The aggregated scRNA-seq data was loaded into R via Seurat, and final guide assignments were added to the object metadata with the Prepare_SingleCellData.R script. This step also normalises the dataset using SCTransform.
The script CombinedLRT.R loads the Seurat object generated in the previous step and runs differential expression analysis on it. User will need to run analysis on each gRNA target to review transcriptome-wide effects on test group vs controls.