Create readgroups from barcodes with random nucleotides incorporates into them for improving sequencing depth. The analysis for marking / removing duplicates does not handle random barcodes/UMIs (Unique Molecular Indentifiers) and should be modified to include random barcodes/UMIs in the analysis.
This is a git for noting down the steps using simple perl
tools.
The following steps are performed
- Basecall to fastq
- Integrate the barcode into reads
- Trim off landing probes
- Align
- Add read group info
- Expand read group info
- Mark duplicates
Requirements
- perl
- an aligner eg. bwa-7.12/hisat2-2.0.4
- picard-1.140 and up
- this repo
- pipeline-util
- bedtools
- samtools
Installing
Add the scr/
dir to your path.
notes
This is primairly used for the Nugene Ovation® Target Enrichment System. link
Use bcl to fastq from illumina. This is an example for questions and documentation see here. This is not a complete guide for more detailed install instructions visit the illumina website.
Install bcltofastq(2)
mkdir -p source/bclToFastq2/
cd source/bclToFastq/; wget ftp://webdata2:[email protected]/downloads/software/bcl2fastq/bcl2fastq2-v2.17.1.14.tar.zip
unzip bcl2fastq2-v2.17.1.14.tar.zip
tar -zxf bcl2fastq2-v2.17.1.14.tar.gz
mv bcl2fastq bcl2fastq-v2.17.1.14
cd bcl2fastq-v2.17.1.14
mkdir -p install/bclToFastq/bcl2fastq-v2.17.1.14
./src/configure --prefix=install/bclToFastq/bcl2fastq-v2.17.1.14
make
make install
Run bcl2fastq. This is ran on a paired end run starting from the main run folder. Note the --use-bases-mask y*,i8y*,y*
and the --minimum-trimmed-read-length 0
options.
--use-bases-mask y*,i8y*,y*
will output the N6 random/UMI barcodes a second read. For Single End run only specify--use-bases-mask y*,i8y*
.--minimum-trimmed-read-length 0
this will prevent the bcl2fastq software from truncating your N6 barcodes/UMI into sequences of NNNNNN because the software is set by default to mask the cycles with N these when they are less then 22nt long. Note: this might even be set to5
but did not test it.- In the Samplesheet.csv it's also advised to turn on the Illumina adapter trimming when running PE sequencing.
this code below is for running the analysis without adapter trimming
bcl2fastq \
--sample-sheet SampleSheet.csv \
-i Data/Intensities/BaseCalls/ \
-R ./ \
--intensities-dir Data/Intensities/ \
-o Data/Intensities/BaseCalls/run2 \
--use-bases-mask y*,i8y*,y* \
--barcode-mismatches 1 \
--minimum-trimmed-read-length 0 \
--create-fastq-for-index-reads
With adapter trimming you should confirm the settings part contains the following adapter sequences.
[Settings]
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead3,AGATCGGAAGAGCGTCGTGTAGGGAAAGA
Also you need to edit the bcl2fastq call to replace --minimum-trimmed-read-length 0
with --mask-short-adapter-reads 5
resulting into the command below.
bcl2fastq --sample-sheet SampleSheet.csv \
-i Data/Intensities/BaseCalls/ \
-R ./ \
--intensities-dir Data/Intensities/ \
-o Data/Intensities/BaseCalls/run8 \
--use-bases-mask y*,i8y*,y* \
--create-fastq-for-index-reads \
--barcode-mismatches 1 \
--mask-short-adapter-reads 5
install bcltofastq(1) depreciated
mkdir -p path/to/bclToFastq/
cd path/to/bclToFastq/; wget ftp://webdata:[email protected]/Downloads/Software/bcl2fastq/bcl2fastq-1.8.4.tar.bz2
tar -jxf bcl2fastq-1.8.4.tar.bz2
mv bcl2fastq bcl2fastq-1.8.4
cd bcl2fastq-1.8.4
src/configure --prefix=/full/path/to/bclToFastq/bcl2fastq-1.8.4
make
make install
run bcltofastq on data the i8y6 part says to treat the random part as read2 and the paired ends as read1/3
##make makefiles #note first run without commands or without --use-bases-mask y151,i8y6,y151 to infer experiment
perl /full/path/to/bclToFastq/bcl2fastq-1.8.4/bin/configureBclToFastq.pl --output-dir /full/path/to/fastq --input-dir /full/path/to/basecalldir/Data/Intensities/BaseCalls --force --no-eamss --use-bases-mask y151,i8y6,y151 --sample-sheet /full/path/to/SampleSheetBclToFastq.csv --mismatches 1 &
#run makefiles with 12 cores in background
nohup make -C path/to/fastq -j 12&
For the in-house generated data from nugene use:
$perl -wpe 'if($. % 4 == 1 ){chomp; my @t =split(" "); my @hd= split(":",$t[0]); $hd[2] = $hd[2] ."_" .$t[2]; $_=join(" ",(join(":", @hd), $t[1], $t[2]))."\n";}
When you have used bcltofastq. you can use these helper scripts from this github. Here are some examples.
perl NugeneMergeFastqFiles.pl -q 20 randombc_R2.fq.gz outdir reads_R1.fastq.gz reads_R3.fastq.gz
perl NugeneMergeFastqFiles.pl randombc_R2.fq.gz outdir reads_R1.fastq.gz
This will add the barcode to the flowcell ID(fcid) like "${fcid}_${barcode}"
trim of landing probes using the bbduk.sh programme from the bbmap package. This is installed by downloading/extracting the data from the sourceforge project page. This also needs a probe sequence file 'probes_ET1262F_2_182.fasta' you can generate this using the supplied 'probes_ET1262F_2_182.bed' file and bedtools getfasta (versus ucsc genome). This is the advised config:
bbduk.sh in=in.fq.gz out=out.fq.gz ref=probes_ET1262F_2_182.fasta hdist=1 ktrim=r rcomp=f k=31 mink=11 qtrim=r trimq=20 minlen=20
A better alternative is to use the probe mapping locations (probes.bed) with the read alignment to filter the landing probes. This step will need to have bedtools,samtools installed also see:pipeline-util and the trimByBed.pl script.
bwa mem ... reads_R1.fastq.gz > align.sam || hisat2 ... -U reads_R1.fastq.gz -S align.sam
trimByBed.pl -s align.sam -b probe.bed -o reads_R1.cleaned.fastq.gz
bbduk.sh in=in.fq.gz out=out.fq.gz qtrim=r trimq=20 minlen=20
Align with your favourite aligner storing the data in sam format for output data.
bwa-mem .... || hisat2 ...
Use AddOrReplaceReadGroups from the picard toolkit to add your readgroups.
java -Xmx6g -jar AddOrReplaceReadGroups.jar ...
Add read group info for each barcode and reindex. This wil restore your fcid tag + create a readgroup for each random barcode with the correct sample.
perl NugeneDigitalSplitter.pl in.bam out.bam
java -Xmx6g -jar picard.jar BuildBamIndex INPUT=out.bam
This should work. Although the duplicate filtering might not be completely perfect because It does't take into account the Alignment distances of the different UMIs.
java -Xmx6g -jar picard.jar MarkDuplicates...
and now you can run the rest of your variant calling pipeline!