Thread for possible enhancements in future #120

Open
abhi18av opened this issue Aug 30, 2022 · 16 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed wish list

Comments

@abhi18av
Member

abhi18av commented Aug 30, 2022

On this thread we can keep track of the enhancements which would be nice to have.

@TimHHH
Collaborator

TimHHH commented Sep 1, 2022

Here are a couple of suggestions:

  • Adapters are currently not identified because this is not required for assembling a quality genome and would only add run time. However, it may be useful for troubleshooting failed libraries. If required, I would suggest MarkIlluminaAdapters (Picard, bundled with GATK) for marking adapters; these would then automatically be reported in the PCT_EXC_ADAPTER stats.
  • A simple check to make sure the fastq.gz files are not corrupted: gzip -t [file]
  • Convert the clusters identified by Cluster Picker to an iTOL annotation file.
  • Give a warning for samples with a DosR deletion, which tends to be huge and affects downstream analyses.
  • NTM presence could be determined with https://github.com/jodyphelan/NTM-Profiler
  • TransPhylo could be implemented, see e.g. https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000361
  • Rapid annotation with Nirvana: https://github.com/Illumina/Nirvana
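
For the iTOL suggestion above, a minimal sketch of what the conversion could look like, assuming the cluster assignments have already been parsed into a sample-to-cluster mapping. Parsing the Cluster Picker output is not shown, and the function name, the `sample_to_cluster` argument and the colour palette are all hypothetical. The output follows iTOL's colour strip dataset layout; leaf names must match the tip labels in the tree.

```python
from itertools import cycle

# Hypothetical palette; any hex colours will do.
PALETTE = ["#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e", "#e6ab02"]


def clusters_to_itol_colorstrip(sample_to_cluster, label="clusters"):
    """Return the text of an iTOL colour strip annotation file.

    sample_to_cluster: dict mapping tree leaf name -> cluster name
    (assumed to be parsed from the clustering output beforehand).
    """
    # Give every cluster a colour, reusing the palette if it runs out.
    palette = cycle(PALETTE)
    colours = {c: next(palette) for c in sorted(set(sample_to_cluster.values()))}

    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        "COLOR\t#000000",
        "DATA",
    ]
    # One row per leaf: leaf name, strip colour, label shown on hover.
    for sample in sorted(sample_to_cluster):
        cluster = sample_to_cluster[sample]
        lines.append(f"{sample}\t{colours[cluster]}\t{cluster}")
    return "\n".join(lines) + "\n"
```

The resulting text can be written to a file and dragged onto the tree in iTOL. Unclustered samples are simply left out of the mapping.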

@abhi18av abhi18av added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Sep 1, 2022
@TimHHH
Collaborator

TimHHH commented Sep 22, 2022

At the moment we identify the 5/12 SNP clusters based on a percentage, because Cluster Picker does not accept SNP cut-offs. The conversion from SNP cut-off to percentage may cause minor deviations due to rounding error.
One solution would be to identify the clusters from the *.snp_dists.tsv with a Python script. Conor has such a script.
A second solution is to move away from SNP cut-offs completely, as they are nonsense; Conor mentioned a Bayesian method, for example.
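
To make the first option concrete, here is a rough sketch of clustering directly on a SNP cut-off. It is not Conor's script. It assumes the *.snp_dists.tsv is a square, tab-separated matrix with a header row (the other possible layout, one pair per line, would need a different reader), and it assumes single-linkage clusters, i.e. a sample joins a cluster if it is within the cut-off of any member. The function names are made up for illustration.

```python
import csv
import io


def read_distance_matrix(handle):
    """Parse a square tab-separated distance matrix with a header row."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # first header cell is a corner label
    distances = {}
    for row in reader:
        if not row:
            continue
        for other, value in zip(samples, row[1:]):
            distances[(row[0], other)] = int(value)
    return samples, distances


def snp_clusters(samples, distances, cutoff):
    """Single-linkage clusters of samples with pairwise distance <= cutoff."""
    parent = {s: s for s in samples}

    def find(s):
        # Union-find root lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for (a, b), dist in distances.items():
        if a != b and dist <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    # Singletons are not reported as clusters.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Toy matrix, not real data.
demo = (
    "snp-dists\tA\tB\tC\tD\n"
    "A\t0\t3\t20\t40\n"
    "B\t3\t0\t10\t40\n"
    "C\t20\t10\t0\t40\n"
    "D\t40\t40\t40\t0\n"
)
names, dists = read_distance_matrix(io.StringIO(demo))
print(snp_clusters(names, dists, 5))   # [['A', 'B']]
print(snp_clusters(names, dists, 12))  # [['A', 'B', 'C']]
```

Because the cut-off is applied to integer SNP counts, there is no percentage conversion and therefore no rounding issue. Note that with single linkage A and C end up together at 12 SNPs even though they are 20 SNPs apart, which is one reason fixed cut-offs are debatable.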

@abhi18av
Member Author

abhi18av commented Oct 1, 2022

For environments in which GPUs are available, we can selectively rely on GPU-accelerated versions of bwa, gatk, etc.: https://docs.nvidia.com/clara/parabricks/4.0.0/index.html

@abhi18av
Member Author

abhi18av commented Oct 7, 2022

Explore how the xbs-nf and https://github.com/kachelo/NGS-analysis-from-TB-sputum could be used together.

@TimHHH
Collaborator

TimHHH commented Oct 10, 2022

Explore how the xbs-nf and https://github.com/kachelo/NGS-analysis-from-TB-sputum could be used together.

This pipeline is not too different from ours, main differences:

  • They use hard filtering, which I do not recommend; it is way too lax. VQSR is preferred.
  • They use a metagenomic approach to clean up the data before mapping, which takes up a lot of time. XBS was designed to remove this step with the much faster k-parameter and VQSR.
  • They don't have the post-calling analyses such as phylogeny/transmission/DR.
  • I don't recommend BQSR.

Excellent on them for using the joint calling!
Important: their paper uses in vitro enrichment (Agilent’s SureSelect XT) and in silico enrichment. Both work to clean up the eventual data, but analysing the sputum WGS data directly is going to be cheaper and faster.

@TimHHH
Collaborator

TimHHH commented Oct 11, 2022

NTM presence could be determined with https://github.com/jodyphelan/NTM-Profiler

On this subject: we currently use a single SNP to determine the presence of NTMs. This works nicely, with one exception: Mycobacterium lentiflavum does not display this SNP and is hence missed. So far we have only observed this for one sample (SMARTT-TM-046), and it makes sense that this is very rare. NTM-Profiler might be a solution for this though.

@TimHHH
Collaborator

TimHHH commented Oct 27, 2022

We occasionally encounter issues with FastQ files; some are truncated, some are empty, some have unequal numbers of reads in R1 and R2. These tend to result in crashes that stop the entire run unfortunately. It would be good to run some kind of FastQ validation before FastQC (https://github.com/TORCH-Consortium/xbs-nf/blob/master/workflows/quality_check_wf.nf). Some quick searches reveal:

https://genome.sph.umich.edu/wiki/FastQValidator
https://biopet.github.io/validatefastq/0.1.1/index.html
https://github.com/nunofonseca/fastq_utils

@vrennie @'ing you after our discussion to implement this asap.
There are also Bash options like gzip -t $file and zcat $file | wc -l.

Tracked here #131
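
As a starting point for the validation step, here is a small sketch that covers the three failure modes mentioned above (truncated, empty, unequal R1/R2). It is only an illustration, not what #131 implements, and the function names are hypothetical. It reads each file to the end, which is what gzip -t and zcat | wc -l do as well.

```python
import gzip
import zlib


def count_fastq_records(path):
    """Count records in a gzipped FASTQ, reading to the end of the file."""
    lines = 0
    with gzip.open(path, "rt") as handle:
        for lines, _ in enumerate(handle, start=1):
            pass
    if lines % 4:
        raise ValueError(f"{lines} lines is not a multiple of 4")
    return lines // 4


def check_pair(r1, r2):
    """Return a list of problems found in a pair of FASTQ files."""
    problems = []
    counts = []
    for path in (r1, r2):
        try:
            n = count_fastq_records(path)
        except (OSError, EOFError, zlib.error, ValueError) as err:
            # Truncated or corrupted gzip streams and partial records land here.
            problems.append(f"{path}: {err}")
            continue
        if n == 0:
            problems.append(f"{path}: no reads")
        counts.append(n)
    if len(counts) == 2 and counts[0] != counts[1]:
        problems.append(f"read count mismatch: {counts[0]} vs {counts[1]}")
    return problems
```

An empty list means the pair passed. A sample with problems could then be dropped with a warning instead of crashing the entire run; how that is wired into the workflow is left open here.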

@TimHHH
Collaborator

TimHHH commented Oct 27, 2022

@adippenaar provided me with sequencing data from a sample that evolved an Rv0678 structural variant + phenotypic BDQ resistance. Anzaan suspects an IS6110-mediated disruption. At the moment this SV is not picked up by MAGMA.
It may be worth exploring further options to pick this SV up:
https://github.com/SemiQuant/svTyper
or playing around with the settings on Delly (e.g. CNV option)

@abhi18av
Member Author

abhi18av commented Apr 8, 2023

Concerning the Structural Variants:

  1. There is some ongoing work to accommodate an alternative structural variants subworkflow (in addition to the Delly one): "Add another round of TBProfiler for detection of structural variants" (#146).

  2. As we rely heavily on GATK4 for the core of the pipeline, would https://github.com/broadinstitute/gatk-sv be worth exploring?

@TimHHH
Collaborator

TimHHH commented Apr 17, 2023

2. As we rely heavily on GATK4 for the core of the pipeline, would https://github.com/broadinstitute/gatk-sv be worth exploring?

Yes, this is very interesting, but it will take some serious hands-on time to implement and test, so it is not something for the near future. For now the TBProfiler approach is easier 👍

@abhi18av
Member Author

Gotcha - thanks for the suggestions Tim 😊

  • Adding another thought here regarding the gatk package - one of the core strengths of the package is its integration with spark which can speed up the overall analysis.

Recently, while working on an analysis (on CERI MAC), I kept running into a memory failure at the gatk MarkDuplicates step, even when I was using the Docker images. On a side note, my suspicion is that it has something to do with docker-desktop's HyperKit configuration (see reference here), which I circumvented by relying on the conda integration.

However, this led me to investigate a little bit about the gatk4-spark packages - I'd like to explore that later to see if it could help us optimize the pipeline for a setting where we have smaller-yet-multiple compute nodes, especially the case with cloud.

@abhi18av
Member Author

abhi18av commented Aug 6, 2023

@abhi18av
Member Author

abhi18av commented Nov 1, 2023

We could also rely on genozip (https://www.genozip.com) to compress the outputs of the pipeline, for the extreme cases in which people would like to use this pipeline for further exploration of VCF/BAM files.

And there's always the direction of long-reads which we can explore.


2 participants