methylranger

This is a computational workflow for bioinformatics analysis of single-cell DNA and DNA methylation sequencing data, based on a custom molecular biology modification of the 10x Genomics Chromium Single Cell Multiome ATAC + Gene Expression assay.

The pipeline assumes the reads are in a directory arranged in the following format:

/input_directory/sample_R1_001.fastq.gz # Read1 of DNA fragment
/input_directory/sample_R2_001.fastq.gz # Index (barcode) read
/input_directory/sample_R3_001.fastq.gz # Read2 of DNA fragment

Splitting FASTQ, barcode aggregation, correction and cell calling

  • The first step is preprocessing the input FASTQ files. The pipeline is currently hardcoded with the assumption that R2 is 24 nt long, with an 8 nt splint adapter CAGACGCG at the beginning and the reverse complement of a 10x ATAC barcode at positions 9-24 (see the sketch below). These assumptions can be changed by editing the extract_clean_fastq function in the methyl_utils.py script.
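
As a rough illustration of that R2 layout, here is a minimal sketch of the barcode extraction (not the actual extract_clean_fastq code; the adapter sequence and positions are taken from the description above):

```python
# Minimal sketch of the R2 layout described above; the real logic lives in
# extract_clean_fastq in methyl_utils.py and may differ.
SPLINT = "CAGACGCG"  # 8 nt splint adapter at the start of R2

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def barcode_from_r2(r2_seq: str) -> str | None:
    """Return the 16 nt 10x ATAC barcode, or None if the splint does not match."""
    if len(r2_seq) < 24 or not r2_seq.startswith(SPLINT):
        return None
    # positions 9-24 (1-based) hold the reverse complement of the barcode
    return revcomp(r2_seq[8:24])
```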

This step performs splitting, trimming, transfer of barcodes from the index reads into the cDNA reads, optional quality filtering of reads, and collection of raw barcodes for barcode matching.

The following script wraps methyl_fastq_pipeline.py into a SLURM job. It takes three arguments: the input_directory, the sample name, and a third value that is twice the estimated number of expected cells and is used for cell calling (a sketch of one possible calling approach follows the commands below):

sbatch ~/methylation/scripts/SLURM_fastq_pipeline.sh input_directory sample 30000

or

cd input_directory
sbatch ~/methylation/scripts/SLURM_fastq_pipeline.sh . sample 30000
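
The cell-calling details live in methyl_fastq_pipeline.py; purely as a hedged sketch, one common approach is to rank barcodes by read count, keep up to the given cap (twice the expected cell count), and cut at the steepest drop in log counts:

```python
import math

# Hypothetical sketch of barcode-rank cell calling; the actual method in
# methyl_fastq_pipeline.py may differ. `counts` maps barcode -> read count.
def call_cells(counts: dict[str, int], max_candidates: int = 30000) -> list[str]:
    # rank barcodes by read count and keep at most `max_candidates`
    ranked = sorted(((bc, c) for bc, c in counts.items() if c > 0),
                    key=lambda kv: kv[1], reverse=True)[:max_candidates]
    logs = [math.log10(c) for _, c in ranked]
    # cut at the largest drop (the "knee") in the log-count curve
    drops = [logs[i] - logs[i + 1] for i in range(len(logs) - 1)]
    knee = max(range(len(drops)), key=drops.__getitem__)
    return [bc for bc, _ in ranked[: knee + 1]]
```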

Trimming, alignment and filtering

  • Adapter trimming is done using fastp; each task processes multiple parts, controlled by the multiplier_per_task variable (default = 7):
sbatch --array=1-n ~/methylation/scripts/SLURM_trim_pipeline.sh . sample
  • Alignments are done with Bismark using array jobs, one task per part:
sbatch --array=1-n ~/methylation/scripts/SLURM_align_parts.sh . sample ../GRCm39_full
sbatch --array=1-n ~/methylation/scripts/SLURM_align_parts.sh . sample ../GRCh38_v44/
  • Alignments for genomic data are done with minimap2; each task processes multiple parts, controlled by the multiplier_per_task variable (default = 6):
sbatch --array=1-n ~/methylation/scripts/SLURM_align_parts_minimap.sh . sample ../GRCh38_v44/GRCh38_v44_chrs.mmi
  • Partially converted reads need to be removed (see the sketch after this list); this takes around 20-30 min for an 80%-aligned batch of 10M 150-150 reads. Each task processes multiple parts, controlled by the multiplier_per_task variable (default = 2):
sbatch --array=1-n ~/methylation/scripts/SLURM_filter_non_conv.sh . sample
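
The exact filtering criterion is implemented in the pipeline scripts; as a hedged illustration, a read can be flagged as incompletely converted when its Bismark XM methylation-call string contains too many methylated non-CpG calls (uppercase X, H, or U):

```python
import pysam

# Sketch of non-conversion filtering on Bismark output; the pipeline's actual
# threshold and logic live in SLURM_filter_non_conv.sh and may differ.
MAX_METH_CH = 3  # assumed threshold for methylated non-CpG calls per read

def is_non_converted(read: pysam.AlignedSegment) -> bool:
    xm = read.get_tag("XM")  # Bismark per-base methylation-call string
    # uppercase X (CHG), H (CHH), U (unknown) = methylated non-CpG calls
    meth_ch = sum(xm.count(c) for c in "XHU")
    return meth_ch >= MAX_METH_CH

with pysam.AlignmentFile("sample_part.bam", "rb") as bam, \
     pysam.AlignmentFile("sample_part.filtered.bam", "wb", template=bam) as out:
    for read in bam:
        if not is_non_converted(read):
            out.write(read)
```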

CH and CG JSON file generation

  • For very large datasets we split the pipeline into pieces and submit them as array jobs:

This is an array job; SLURM_ARRAY_TASK_ID is passed as an input argument to save_quad_batch.py.

Each task splits the parts into batches of 12 and processes them in pools of two using 2 cores. The TASK_ID determines which 12-part batch the task processes: TASK_ID 1 processes parts 000 to 011, TASK_ID 4 processes parts 036 to 047, and so on (see the sketch after the command below). Each BAM of roughly 8M paired-end reads with an average fragment length of 100 nt needs about 18 GB of RAM to produce the JSONs.

sbatch --array=1-n ~/methylation/scripts/SLURM_save_quad_batch.sh . sample
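
The TASK_ID-to-parts mapping amounts to simple arithmetic; a minimal sketch:

```python
# Sketch of the TASK_ID -> parts mapping described above (batches of 12).
import os

BATCH_SIZE = 12
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])  # 1-based array index

start = (task_id - 1) * BATCH_SIZE
parts = [f"{i:03d}" for i in range(start, start + BATCH_SIZE)]
# TASK_ID 1 -> ['000', ..., '011']; TASK_ID 4 -> ['036', ..., '047']
```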

Aggregation of batches into single files is done separately for the CpG and Non_CpG contexts (a sketch follows the commands below). For CpG, 20 batches per task take less than 10 minutes, and this could also be run as a single job. For Non_CpG, 2 batches per task take about 5 minutes per 5 GB JSON batch and need roughly 24x the memory of the CpG jobs, i.e. 20-40 GB each; the memory per batch is about 4x the size of the final JSON file.

sbatch --array=1-13 ~/methylation/scripts/SLURM_aggregate_quad_parts_CpG.sh . sample
sbatch --array=1-63 ~/methylation/scripts/SLURM_aggregate_quad_parts_Non_CpG.sh . sample
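
The aggregation itself amounts to merging the per-batch dictionaries; a minimal sketch, assuming (hypothetically) that each batch JSON maps barcode -> site -> [methylated, covered] counts:

```python
import json
from collections import defaultdict

# Hedged sketch of batch aggregation; the real schema used by the
# SLURM_aggregate_quad_parts_* scripts may differ.
def aggregate(batch_paths: list[str]) -> dict:
    merged: dict = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for path in batch_paths:
        with open(path) as fh:
            for barcode, sites in json.load(fh).items():
                for site, (mc, cov) in sites.items():
                    merged[barcode][site][0] += mc
                    merged[barcode][site][1] += cov
    return merged
```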

ALLC and MCDS file generation

  • To build ALLC files for each barcode in each batch, the JSONs are deduplicated and converted to TSV (see the sketch after these commands). The multiplier is 20 for CpG and 2 for Non_CpG; a Non_CpG batch takes about 2 hours to convert to TSV:
sbatch --array=1-n ~/methylation/scripts/SLURM_make_allc_Non_CpG.sh . sample
sbatch --array=1-n ~/methylation/scripts/SLURM_make_allc_CpG.sh . sample
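
For orientation, ALLC is the tab-separated format used by methylpy/ALLCools: chromosome, position, strand, context, methylated count, total count, and a binary flag. A hedged sketch of writing deduplicated calls in that layout:

```python
import csv

# Hedged sketch of writing deduplicated calls as ALLC-style TSV; the real
# conversion is done by the SLURM_make_allc_* scripts and may differ.
def write_allc(calls, out_path: str) -> None:
    """calls: iterable of (chrom, pos, strand, context, mc, cov) tuples,
    already deduplicated and sorted by position."""
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for chrom, pos, strand, context, mc, cov in calls:
            writer.writerow([chrom, pos, strand, context, mc, cov, 1])
```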

The ALLC files are then generated by running the following; the multiplier is 4 and each task is a 4-core process:

sbatch --array=1-n ~/methylation/scripts/SLURM_allc_process.sh . sample

mtx file generation for windows and genes

  • To build count matrices from batches, we first make a matrix from each batch and then stack them.

First we make a count matrix from the methylation calls for the barcodes in each batch. For each context and window size, three different matrices are built (sketched below): z-scored methylation levels, counts of methylated bases, and counts of unmethylated bases.
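
A minimal sketch of the per-batch construction, assuming calls arrive as (barcode, chrom, pos, is_methylated) records and windows are fixed-size genomic bins (this is not the pipeline's actual code):

```python
from scipy.sparse import coo_matrix

# Hedged sketch of building per-batch methylated/unmethylated count matrices
# over fixed genomic windows; the pipeline's implementation may differ.
def window_counts(calls, barcodes, chrom_sizes, window=100_000):
    """calls: iterable of (barcode, chrom, pos, is_methylated) records."""
    bc_index = {bc: i for i, bc in enumerate(barcodes)}
    # lay chromosomes end to end so each (chrom, pos) maps to one window column
    offsets, total = {}, 0
    for chrom, size in chrom_sizes.items():
        offsets[chrom] = total
        total += -(-size // window)  # ceil division: windows per chromosome
    rows, cols, meth, unmeth = [], [], [], []
    for barcode, chrom, pos, is_meth in calls:
        rows.append(bc_index[barcode])
        cols.append(offsets[chrom] + pos // window)
        meth.append(1 if is_meth else 0)
        unmeth.append(0 if is_meth else 1)
    shape = (len(barcodes), total)
    mc = coo_matrix((meth, (rows, cols)), shape=shape).tocsr()    # methylated
    uc = coo_matrix((unmeth, (rows, cols)), shape=shape).tocsr()  # unmethylated
    return mc, uc  # z-scoring of mc / (mc + uc) would be a separate pass
```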

A mouse example with the GRCm39 reference:

sbatch ~/methylation/scripts/SLURM_make_count_mtx_windows_CpG.sh . sample 100000 ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
sbatch ~/methylation/scripts/SLURM_make_count_mtx_windows_Non_CpG.sh . sample 100000 ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai

A human example with the GRCh38 reference:

sbatch ~/methylation/scripts/SLURM_make_count_mtx_windows.sh . sample 100000 CpG_context ~/methylation/data/GRCh38_v44_chrs.fasta.fai 

Then we stack all count matrices to make one set of final matrices for each of the three types above; we also make a coverage matrix. These final matrices are stored in AnnData format (see the sketch after the command below).

sbatch ~/methylation/scripts/SLURM_stack_mtx_windows.sh . sample 100000 CpG_context ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
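
A hedged sketch of the stacking step, assuming per-batch CSR matrices that share the same window columns (the layer names here are hypothetical):

```python
import anndata as ad
import scipy.sparse as sp

# Hedged sketch of stacking per-batch matrices into one AnnData; the actual
# layers and naming in SLURM_stack_mtx_windows.sh may differ.
def stack_batches(batch_mats, batch_barcodes):
    """batch_mats: list of (mc, uc) CSR pairs; batch_barcodes: list of barcode lists."""
    mc = sp.vstack([m for m, _ in batch_mats]).tocsr()
    uc = sp.vstack([u for _, u in batch_mats]).tocsr()
    adata = ad.AnnData(X=mc, layers={"unmethylated": uc, "coverage": mc + uc})
    adata.obs_names = [bc for bcs in batch_barcodes for bc in bcs]
    return adata
```
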
  • Another way to build count matrices is to use gene intervals instead of windows. For this we use a preprocessed version of the GENCODE GTF, which is essentially a BED file with intervals and Ensembl gene IDs.

For mouse, the comprehensive gene annotation can be obtained from https://www.gencodegenes.org/mouse/release_M35.html, and for human from https://www.gencodegenes.org/human/release_46.html. A sketch of the preprocessing follows.
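
As a hedged sketch of that preprocessing (the exact columns of the pipeline's gencode.*.csv.gz files may differ), gene intervals can be pulled from the GTF like this:

```python
import csv
import gzip

# Hedged sketch of preprocessing a GENCODE GTF into a BED-like CSV of gene
# intervals with Ensembl gene IDs; output columns are illustrative.
with gzip.open("gencode.vM35.annotation.gtf.gz", "rt") as gtf, \
     gzip.open("gencode.vM35.csv.gz", "wt", newline="") as out:
    writer = csv.writer(out)
    for line in gtf:
        if line.startswith("#"):
            continue
        chrom, _, feature, start, end, _, strand, _, attrs = line.rstrip("\n").split("\t")
        if feature != "gene":
            continue
        # attributes look like: gene_id "ENSMUSG..."; gene_type "..."; ...
        gene_id = attrs.split('gene_id "')[1].split('"')[0]
        writer.writerow([chrom, start, end, strand, gene_id])
```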

sbatch ~/methylation/scripts/SLURM_make_count_mtx_genes.sh . sample CpG_context ~/methylation/data/gencode.vM35.csv.gz
sbatch ~/methylation/scripts/SLURM_make_count_mtx_genes.sh . sample CpG_context ~/methylation/data/gencode.v46.csv.gz 

The stacking method then also needs some modifications:

sbatch ~/methylation/scripts/SLURM_stack_mtx_genes.sh . sample CpG_context ~/methylation/data/gencode.vM35.csv.gz

Build final BAM

  • To build the final BAM with duplicates marked and barcodes added as the BC tag, we first add the barcode tag to each part and filter out all reads that did not match the whitelist (see the sketch after this command):
sbatch ~/methylation/scripts/SLURM_tag_bam_parts.sh . sample
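
As a hedged sketch of this step (the actual logic is in the pipeline scripts; the assumption that the corrected barcode is stored at the end of the read name is hypothetical):

```python
import pysam

# Hedged sketch of barcode tagging and whitelist filtering; assumes the
# corrected barcode was appended to the read name as "<name>_<barcode>".
whitelist = set(line.strip() for line in open("barcodes_whitelist.txt"))

with pysam.AlignmentFile("sample_part.bam", "rb") as bam, \
     pysam.AlignmentFile("sample_part.tagged.bam", "wb", template=bam) as out:
    for read in bam:
        name, _, barcode = read.query_name.rpartition("_")
        if barcode not in whitelist:
            continue  # drop reads whose barcode failed whitelist matching
        read.query_name = name
        read.set_tag("BC", barcode)
        out.write(read)
```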

After tagging all BAM parts, we can merge them and mark duplicates using the following:

sbatch ~/methylation/scripts/bam_merge.sh . sample

Finally, we can compute statistics from the entire BAM. We do this at the per-chromosome level and in parallel (see the sketch after these commands):

sbatch ~/methylation/scripts/SLURM_compute_bam.sh . sample ~/methylation/data/GRCm39_v34_allcontigs.fasta.fai
sbatch ~/methylation/scripts/SLURM_compute_bam.sh . sample ~/methylation/data/GRCh38_v44_chrs.fasta.fai
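
As a hedged sketch of the per-chromosome parallelism (the actual metrics computed by SLURM_compute_bam.sh may differ; requires an indexed BAM):

```python
import pysam
from multiprocessing import Pool

# Hedged sketch of per-chromosome parallel BAM statistics; metric choice
# (read and duplicate counts) is illustrative only.
def chrom_stats(args):
    bam_path, chrom = args
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        reads = dups = 0
        for read in bam.fetch(chrom):  # random access needs a .bai index
            reads += 1
            dups += read.is_duplicate
        return chrom, reads, dups

if __name__ == "__main__":
    bam_path = "sample.merged.bam"
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        chroms = list(bam.references)
    with Pool(8) as pool:
        for chrom, reads, dups in pool.map(chrom_stats, [(bam_path, c) for c in chroms]):
            print(chrom, reads, dups, sep="\t")
```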