Fastq support (#44)
* Telephone number for medical questions in the report (#32)

* updated version in configs

* Removed print statements

* Added telephone number in the report for medical information

* update version in the configs

* updated changelog.md

* Updated DPYD HAPB3 haplotype related rsids (#33)

* Update CHANGELOG.md

Signed-off-by: Ram Sai Nanduri <[email protected]>

* Added a new module for pharmcat (#35)

* reflects v1.1.0

Signed-off-by: Ram Sai Nanduri <[email protected]>

* Report update (#37)

* Update report text

* Updated changelog and added github workflow for changelog reminder

* Update README.md

Signed-off-by: Ram Sai Nanduri <[email protected]>

* Pharmcat update 2.12.0 (#38)

* Updated PharmCAT module
Updated the sub-workflow with respect to PharmCAT; haplotype filtration now takes place first, then on-target extraction, followed by annotation.

* updated annotation output file name

* Updated configs, modules and workflows with respect to pharmcat

* Added options for the report - extended report, match all the haplotypes not just the top ones, report title

* Updated ReadMe and ChangeLog

* Update config (#40)

* fixed PharmCAT memory bug

* Updated Changelog

* Fixed Zero division error when there is no allele depth for a variant (#43)

* Fixed Zero division error when there is no allele depth for a variant

* Updated Changelog #42 #43

* Added Fastq support
- refactored the structure of the modules, subworkflows and workflows
- the workflows are selected based on the input
- removed support for HG19
- draft version of the workflow

Things to do
- add more support for PharmCAT
- add support for CNVs

---------

Signed-off-by: Ram Sai Nanduri <[email protected]>
ramsainanduri authored Jan 22, 2025
1 parent 71a1d94 commit c5cb636
Showing 72 changed files with 4,488 additions and 469 deletions.
15 changes: 15 additions & 0 deletions .github/workflows/changelog-reminder.yml
@@ -0,0 +1,15 @@
name: "CHANGELOG Reminder"
on:
pull_request:
types: [opened, synchronize, reopened, ready_for_review, labeled, unlabeled]

jobs:
# Enforces the update of a changelog file on every pull request
changelog:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: dangoslen/changelog-enforcer@v3
with:
changeLogPath: 'CHANGELOG.md'
skipLabels: 'skip-changelog-update'
3 changes: 2 additions & 1 deletion .gitignore
@@ -33,4 +33,5 @@ create_pgx_samplesheet.sh
# Other files
bin/report_template.txt
bin/snakemake_report.py
bin/pdf.py
bin/pdf.py
subworkflows/local/pharmacoGenomics.nf.backup
26 changes: 25 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,28 @@
# V1.0.0
# v2.0.2
- Fixed Zero division error when there is no allele depth in the variant calls (#43)

# v2.0.1
- Updated Config
- Fixed PharmCAT memory bug

# v2.0.0
- Major change to the process flow
- Updated PharmCAT to v2.12.0
- Updated ReadMe.md

# v1.1.1
- Updated QC text in the report

# v1.1.0
- Added pharmCAT module to the pipeline

# v1.0.2
- Updated DPYD HapB3 haplotype related rsids (rs75017182 and rs56038477)
- Main target BED regions are padded with 20 bp upstream and downstream

# v1.0.1
- The report will now include a telephone number for medical information support related to its contents.

# v1.0.0
1. Added and updated various features:
- Added license and workflow image.
- Implemented nf-core style stubs.
50 changes: 28 additions & 22 deletions README.md
@@ -1,9 +1,9 @@
<hr>

[![Nextflow DSL2](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/docs/latest/dsl2.html) [![Singularity Version](https://img.shields.io/badge/Singularity-%E2%89%A53.8.0-orange)](https://sylabs.io/docs/) [![Run with Singularity](https://img.shields.io/badge/Run%20with-Singularity-orange)](https://sylabs.io/docs/)

[![PharmGKB](https://img.shields.io/badge/PharmGKB-Explore-blue)](https://www.pharmgkb.org/) [![CPIC](https://img.shields.io/badge/CPIC-Explore-green)](https://cpicpgx.org/) [![PharmVar](https://img.shields.io/badge/PharmVar-Explore-yellow)](https://www.pharmvar.org/)
[![Nextflow DSL2](https://img.shields.io/badge/NextFlow_DSL2-23.04.0-23aa62.svg)](https://www.nextflow.io/docs/latest/dsl2.html) [![Singularity Version](https://img.shields.io/badge/Singularity-%E2%89%A53.8.0-orange)](https://sylabs.io/docs/) [![PharmCAT Version](https://img.shields.io/badge/PharmCAT-2.12.0-green)](https://pharmcat.org/) [![Run with Singularity](https://img.shields.io/badge/Run%20with-Singularity-orange)](https://sylabs.io/docs/)

[![PharmGKB](https://img.shields.io/badge/PharmGKB-blue)](https://www.pharmgkb.org/) [![CPIC](https://img.shields.io/badge/CPIC-green)](https://cpicpgx.org/) [![PharmVar](https://img.shields.io/badge/PharmVar-yellow)](https://www.pharmvar.org/)
[![PharmCAT](https://img.shields.io/badge/Support_for-PharmCAT-orange)](https://pharmcat.org/)
<hr>

<!-- HTML-style heading -->
@@ -13,7 +13,8 @@

Welcome to PGxModule: Revolutionizing Genomic Medicine!

PGxModule is an advanced Nextflow DSL2 workflow, designed to seamlessly integrate into your genomics pipeline. It empowers you to generate sample-specific reports with clinical guidelines, leveraging state-of-the-art variant detection in Genomic Medicine Sweden sequencing panels. This workflow is inspired by JoelAAs.
PGxModule is an advanced Nextflow DSL2 workflow, designed to seamlessly integrate into your genomics pipeline. It empowers you to generate sample-specific reports with clinical guidelines, leveraging state-of-the-art variant detection in Genomic Medicine Sweden sequencing panels. This workflow is inspired by JoelAAs. In addition, the pipeline implements a [PharmCAT](https://pharmcat.org/) (Pharmacogenomics Clinical Annotation Tool) report, which provides recommendations for all detected haplotypes directly from CPIC.

### Key Features:

@@ -24,7 +25,7 @@ PGxModule is an advanced Nextflow DSL2 workflow, designed to seamlessly integrat

## Pipeline Summary

The pipeline focuses on 19 SNPs from TPMT, DPYD, and NUDT15 genes, with plans to incorporate additional genes in future updates. The target selection is meticulously curated from reputable databases such as [PharmGKB](https://www.pharmgkb.org/) and [PharmVar](https://www.pharmvar.org/), guided by [CPIC](https://cpicpgx.org/) recommendations. As the pipeline evolves, it aims to broaden its scope, providing a more comprehensive analysis of pharmacogenomic variations to enhance clinical insights.
This pipeline branches into two analyses. The first focuses on 19 SNPs from the TPMT, DPYD, and NUDT15 genes, with plans to incorporate additional genes in future updates. The second uses the external tool [PharmCAT](https://pharmcat.org/), developed by [PharmGKB](https://www.pharmgkb.org/), to detect as many haplotypes as possible without subsetting the original BAM; these haplotypes are then annotated and reported along with clinical recommendations. The target selection is meticulously curated from reputable databases such as [PharmGKB](https://www.pharmgkb.org/) and [PharmVar](https://www.pharmvar.org/), guided by [CPIC](https://cpicpgx.org/) recommendations. As the pipeline evolves, it aims to broaden its scope, providing a more comprehensive analysis of pharmacogenomic variations to enhance clinical insights.


## Pipeline Steps
@@ -33,30 +34,35 @@ The PGxModule pipeline was executed with Nextflow version 23.04.2. The pipeline

1. **CSV Validation**
The CSV Validation step ensures the correctness and integrity of the input CSV file. It checks for proper formatting, required fields, and data consistency, providing a foundation for accurate downstream processing in the PGxModule pipeline.
2. **Getting Ontarget Bam**
This step involves extracting the on-target BAM files from the analyzed samples. These BAM files specifically capture the sequencing data aligned to the regions of interest, enabling reduction in time and focused analysis on the genomic regions relevant to the pharmacogenomic study.
3. **Haplotype Calling**
2. **Haplotype Calling**
Haplotype Calling is a crucial stage where the pipeline identifies and assembles haplotypes from the sequencing data. This process is fundamental in characterizing the genetic variations present in the samples, laying the groundwork for subsequent analyses and variant interpretation.
4. **Haplotype Annotation**
Haplotypes which are called are annotated with dbSNP ids.
5. **Haplotype Filtration**
3. **Haplotype Filtration**
Haplotype Filtration focuses on refining the set of identified haplotypes, applying specific criteria to select variants of interest and discard noise. This process enhances the precision of the haplotype dataset, ensuring that downstream analyses are based on high-quality and clinically relevant variants.
6. **Coverage Analysis**
Coverage Analysis evaluates the sequencing depth across targeted regions, providing insights into the reliability of variant calls. By assessing coverage, this step identifies regions with insufficient data and informs the overall confidence in the accuracy of the genomic information obtained from the samples.
7. **Detection of variants**
4. **PharmCAT Preprocessing**
A script to preprocess VCF files for PharmCAT, ensuring compliance with VCF v4.2, stripping positions irrelevant to PGx, normalizing variants, and optionally filtering sample data.
5. **PharmCAT**
This step matches the VCF positions against PharmCAT's pharmacogenomic positions, runs phenotype calling, and finally generates the PharmCAT report with all the recommendations.
6. **Ontarget VCF**
This step involves extracting the on-target VCF positions from the analyzed samples.
7. **Haplotype Annotation**
Haplotypes which are called are annotated with dbSNP ids.
8. **Detection of variants**
The variants of interest are checked against the whole set of haplotypes and used for further analysis.
8. **Clinical Recommendations**
9. **Clinical Recommendations**
Identified haplotypes are annotated with haplotype IDs, clinical recommendations, and interaction guidelines based on CPIC.
9. **Report**
10. **Getting Ontarget Bam**
This step involves extracting the on-target BAM files from the analyzed samples. These BAM files specifically capture the sequencing data aligned to the regions of interest, enabling reduction in time and focused analysis on the genomic regions relevant to the pharmacogenomic study.
11. **Coverage Analysis**
Coverage Analysis evaluates the sequencing depth across targeted regions, providing insights into the reliability of variant calls. By assessing coverage, this step identifies regions with insufficient data and informs the overall confidence in the accuracy of the genomic information obtained from the samples.
12. **Report**
The Report step consolidates the findings from the preceding analyses into a comprehensive report. This report includes detailed information on detected variants, clinical guidelines, interaction assessments, and other relevant pharmacogenomic insights. It serves as a valuable resource for clinicians and researchers, aiding in informed decision-making based on the genomic characteristics of the analyzed samples.
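The zero-division fix referenced in this commit (#43) amounts to guarding the allele-frequency calculation when a variant has no allele depth. A minimal sketch, with hypothetical function and parameter names (the pipeline's actual implementation may differ):

```python
def allele_frequency(ref_depth: int, alt_depth: int) -> float:
    """Return the variant allele frequency, or 0.0 when the
    total allele depth is zero (avoids ZeroDivisionError)."""
    total = ref_depth + alt_depth
    if total == 0:
        # No reads support either allele: report 0.0 instead of crashing
        return 0.0
    return alt_depth / total
```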

## Example Input CSV

| clarity_sample_id | id | type | assay | group | bam | bai | purity |
|-------------------|---------|------|-----------|---------|---------------------------------------|---------------------------------------|--------|
| CMD123456 | Sample1 | T | solid-pgx | Sample1 | Sample1.T.bwa.umi.sort.bam | Sample1.T.bwa.umi.sort.bam.bai | 0.30 |
| CMD987654 | Sample2 | T | solid-pgx | Sample2 | Sample2.T.bwa.umi.sort.bam | Sample2.T.bwa.umi.sort.bam.bai | 0.30 |

| clarity_sample_id | id | type | assay | group | bam | bai | purity |
|-------------------|---------|------|-------------|---------|--------------------------------------|--------------------------------------|--------|
| XXX000001 | Sample1 | T | gmssolidpgx | Sample1 | Sample1.T.bwa.umi.sort.bam | Sample1.T.bwa.umi.sort.bam.bai | 0.30 |
| XXX000002 | Sample2 | T | gmssolidpgx | Sample2 | Sample2.T.bwa.umi.sort.bam | Sample2.T.bwa.umi.sort.bam.bai | 0.30 |
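The CSV Validation step described above checks rows like these. A minimal sketch of such a validation, assuming the column names shown in the table (the specific checks are illustrative, not the pipeline's actual rules):

```python
import csv

REQUIRED = ["clarity_sample_id", "id", "type", "assay",
            "group", "bam", "bai", "purity"]

def validate_samplesheet(path: str) -> list[dict]:
    """Read the samplesheet and fail fast on missing columns or bad values."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"missing required columns: {missing}")
        rows = []
        for lineno, row in enumerate(reader, start=2):
            # Purity must be a fraction between 0 and 1
            if not 0.0 <= float(row["purity"]) <= 1.0:
                raise ValueError(f"line {lineno}: purity must be in [0, 1]")
            # Alignment files must have the expected extensions
            if not row["bam"].endswith(".bam"):
                raise ValueError(f"line {lineno}: bam must be a .bam file")
            rows.append(row)
        return rows
```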


## Setup
@@ -111,5 +117,5 @@ nextflow run main.nf --csv /path/to/csv/input.csv -profile "panel,hg38,solid" --

## Workflow Image

<img src="resources/workflow_images/PGx.png" alt="Workflow Image" width="50%">
<img src="resources/workflow_images/PGXModule_Workflow_v2.0.0.png" alt="Workflow Image" width="50%">

18 changes: 18 additions & 0 deletions bin/overlapping_genes.pl
@@ -0,0 +1,18 @@
#!/usr/bin/perl -w
use strict;

# Annotate each region in a BED file with the gene(s) it overlaps,
# using `bedtools intersect` with a left outer join (-loj).
my $in_bed = $ARGV[0];
my $genes_bed = $ARGV[1];

# Write a header-free copy of the input BED to a temporary file
my $rnd = int rand 1000000000;
my $tmp_infile = "input.$rnd.bed";
system("grep -v '^\@' $in_bed | grep -v ^CONTIG | grep -v ^REF > $tmp_infile");
my @overlap = `bedtools intersect -a $tmp_infile -b $genes_bed -loj`;
unlink $tmp_infile;

# Print chrom, start, end, name plus the overlapping gene (last column)
foreach my $line (@overlap) {
    chomp $line;
    my @f = split /\t/, $line;
    print "$f[0]\t$f[1]\t$f[2]\t$f[3]\t$f[-1]\n";
}

47 changes: 47 additions & 0 deletions bin/panel_depth.pl
@@ -0,0 +1,47 @@
#!/usr/bin/perl -w
use strict;

my $cutoff = 500;
die "USAGE: panel_depth.pl BAM BED\n" unless @ARGV == 2;

my( $bam, $bed ) = ( $ARGV[0], $ARGV[1] );

die "No file $bam" unless -s $bam;
die "No file $bed" unless -s $bed;


open( DEPTH, "sambamba depth base $bam -L $bed |" );

my( $start_pos, $start_chr, $last_low_pos, $last_low_chr, $low_cov_sum );
while( <DEPTH> ) {
my @a = split /\t/;
my( $chr, $pos, $depth ) = ( $a[0], $a[1], $a[2] );

if( $depth < $cutoff ) {

# Prev low position was right before
if( $last_low_chr and $last_low_pos and $last_low_chr eq $chr and $last_low_pos == $pos-1 ) {
# Skip along low depth region
$low_cov_sum += $depth;
}

# Prev low postion was somewhere else
else {
if( $start_pos and $start_chr ) {
print $start_chr."\t".$start_pos."\t".$last_low_pos."\t".($low_cov_sum/($last_low_pos-$start_pos+1))."\n";
}
$start_chr = $chr;
$start_pos = $pos;
$low_cov_sum = $depth;
}
$last_low_chr = $chr;
$last_low_pos = $pos;
}
else {
if( $start_chr and $start_pos ) {
print $start_chr."\t".$start_pos."\t".$last_low_pos."\t".($low_cov_sum/($last_low_pos-$start_pos+1))."\n";
undef $start_chr;
undef $start_pos;
}
}
}

# Flush the final low-coverage region if the input ends inside one
if( $start_chr and $start_pos ) {
    print $start_chr."\t".$start_pos."\t".$last_low_pos."\t".($low_cov_sum/($last_low_pos-$start_pos+1))."\n";
}
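The merging logic in `panel_depth.pl` above (consecutive below-cutoff positions collapsed into regions reported with their mean depth) can be sketched in Python for clarity. This is an illustrative re-implementation, not part of the commit:

```python
def low_coverage_regions(rows, cutoff=500):
    """Collapse consecutive below-cutoff positions into
    (chrom, start, end, mean_depth) regions.

    rows: iterable of (chrom, pos, depth), sorted by chromosome and position.
    """
    regions = []
    chrom = start = last = None
    depth_sum = 0
    for c, pos, depth in rows:
        if depth < cutoff and chrom == c and last == pos - 1:
            depth_sum += depth  # extend the current low-coverage region
        elif depth < cutoff:
            if start is not None:  # close the previous region first
                regions.append((chrom, start, last,
                                depth_sum / (last - start + 1)))
            chrom, start, depth_sum = c, pos, depth
        else:
            if start is not None:  # high coverage ends any open region
                regions.append((chrom, start, last,
                                depth_sum / (last - start + 1)))
                start = None
            chrom = None
        if depth < cutoff:
            last = pos
    if start is not None:  # flush a region that runs to the end of input
        regions.append((chrom, start, last, depth_sum / (last - start + 1)))
    return regions
```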
