This is a computational workflow for bioinformatics analysis of single-cell DNA and DNA methylation sequencing data, based on a custom molecular biology modification of 10x Genomics Chromium Single Cell Multiome ATAC + Gene Expression assay.
The pipeline assumes the reads are in a directory arranged in the following format:
/input_directory/sample_R1_001.fastq.gz # Read1 of DNA fragment
/input_directory/sample_R2_001.fastq.gz # Index (barcode) read
/input_directory/sample_R3_001.fastq.gz # Read2 of DNA fragment
- The first step is to do some preprocessing on the input FASTQ files.
The pipeline currently is harcoded with the assumption that R2 is 24nt long and has 8nt of splint adapter CAGACGCG at the beginning and reverse compliment of 10x ATAC barcodes from 9-24. These options can be modified by modifying
extract_clean_fastq
function withinmethyl_utils.py
script
The pipeline does several steps including splitting, trimming, barcode transfer from index reads into cDNA reads, potentially quality filtering reads, and collecting raw barcodes for barcode matching.
The following script wraps the methyl_fastq_pipeline.py
into a SLURM job with three input variables required for input
input_directory
and sample
and a third variable which is twice the estimate of expected cells and is used for calling:
sbatch ~/methylation/scripts/SLURM_fastq_pipeline.sh input_directory sample 30000
or
cd input_directory
sbatch ~/methylation/scripts/SLURM_fastq_pipeline.sh . sample 30000
- Adapater trimming is done using fastp, multiple parts per task, controlled by
multiplier_per_task
variable(default=7)
sbatch --array=1-n ~/methylation/scripts/SLURM_trim_pipeline.sh . sample
- Alignments are done with Bismark and array jobs, one task for each part:
sbatch --array=1-n ~/methylation/scripts/SLURM_align_parts.sh . sample ../GRCm39_full
sbatch --array=1-n ~/methylation/scripts/SLURM_align_parts.sh . sample ../GRCh38_v44/
- Alignments for genomic data is done with minimap2, multiple parts per task, controlled by
multiplier_per_task
variable(default=6)
sbatch --array=1-n ~/methylation/scripts/SLURM_align_parts_minimap.sh . sample ../GRCh38_v44/GRCh38_v44_chrs.mmi
- There are partially converted reads which need to be removed, takes around 20-30min for a %80 aligned batch of 10m 150-150 reads
multiple parts per task, controlled by
multiplier_per_task
variable(default=2)
sbatch --array=1-n ~/methylation/scripts/SLURM_filter_non_conv.sh . sample
- For very large datasets we split the pipeline into pieces and submit them as array jobs:
This is an array job SLURM_ARRAY_TASK_ID
is embedded as an input argument to save_quad_batch.py
.
Each which subsequently splits the parts into batches of 12 and processess them in pools of two using 2 cores. TASK_ID
will determine which 12-part batch of parts the task will process. TASK_ID
1 will process parts 000
to 011
, TASK_ID
4 will process parts 036
to 047
and so on.
Each bam of roughly 8m paired-end reads with fragment average of 100nt will need 18GB RAM to produce jsons
sbatch --array=1-n ~/methylation/scripts/SLURM_save_quad_batch.sh . sample
Aggregation of batches into single files is done separately for CpG an Non_CpG contexts: 20 batches per task for CpG takes less than 10 minute, could run only 1 job too 2 batches per task for Non_CpG takes (~5min per 5GB json batch) need 24x memory of CpG, 20-40GB each batch memory is 4x the final json file size
sbatch --array=1-13 ~/methylation/scripts/SLURM_aggregate_quad_parts_CpG.sh . sample
sbatch --array=1-63 ~/methylation/scripts/SLURM_aggregate_quad_parts_Non_CpG.sh . sample
- To build allc files for each barcode in each batch, jsons are deduped and converted to tsv: multiplier for CpG is 20 and Non_CpG is 2, it takes ~2 hours per batch of Non_CpG to be converted to tsv
sbatch --array=1-n ~/methylation/scripts/SLURM_make_allc_Non_CpG.sh . sample
sbatch --array=1-n ~/methylation/scripts/SLURM_make_allc_CpG.sh . sample
and allc files are generated by running, multiplier is 4, each task is a 4 cores process
sbatch --array=1-n ~/methylation/scripts/SLURM_allc_process.sh . sample
- To build count matrices from batches, we first make matrix from each batch and then stack them
First we make count matrix from methylation calls for barcodes in each batch. For each context and window size three different matices are built: z-scored methylation levels counts of methylated bases counts of nonmethylated bases
A mouse example with GRCm39
reference
sbatch ~/methylation/scripts/SLURM_make_count_mtx_windows_CpG.sh . sample 100000 ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
sbatch ~/methylation/scripts/SLURM_make_count_mtx_windows_Non_CpG.sh . sample 100000 ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
A human example with GRCh38
reference
sbatch ~/methylation/scripts/SLURM_make_count_mtx_windows.sh . sample 100000 CpG_context ~/methylation/data/GRCh38_v44_chrs.fasta.fai
Then we stack all count matricies to make one set of final matrices for each of three in above, we also make coverage matrix, these final matrices are stored in AnnData format.
sbatch ~/methylation/scripts/SLURM_stack_mtx_windows.sh . sample 100000 CpG_context ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
- Another way to build count matrices is using gene intervals instead of windows, for this we have a preprocessed version of gencode gtf which is essenetially a bed file with intervals and ensembl gene id.
For mouse we can obtain Comprehensive gene annotation
from: https://www.gencodegenes.org/mouse/release_M35.html
For human we can obtain one from: https://www.gencodegenes.org/human/release_46.html
sbatch ~/methylation/scripts/SLURM_make_count_mtx_genes.sh . sample CpG_context ~/methylation/data/gencode.vM35.csv.gz
sbatch ~/methylation/scripts/SLURM_make_count_mtx_genes.sh . sample CpG_context ~/methylation/data/gencode.v46.csv.gz
Then stacking method also needs some modifications
sbatch ~/methylation/scripts/SLURM_stack_mtx_genes.sh . sample CpG_context ~/methylation/data/gencode.vM35.csv.gz
- To build final bam with duplications marked and barcodes add as tag
BC
We first add barcode tag into each part and filter out all reads and did not match to whitelist:
sbatch ~/methylation/scripts/SLURM_tag_bam_parts.sh . sample
After tagging all bam parts we can aggregate and mark duplicates using the following:
sbatch ~/methylation/scripts/bam_merge.sh . sample
Finally we can compute statistics from the entire bam. We do this in a per chromosome level and in parallel:
sbatch ~/methylation/scripts/SLURM_compute_bam.sh . sample ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
sbatch ~/methylation/scripts/SLURM_compute_bam.sh . sample ~/methylation/data/GRCh38_v44_chrs.fasta.fai