Thread for possible enhancements in future #120

abhi18av · 2022-08-30T07:41:01Z

On this thread we can keep track of the enhancements which would be nice to have.

TimHHH · 2022-09-01T09:09:12Z

Here are a couple of suggestions:

Adapters are currently not identified because this is not required for assembling a quality genome and would only cost time. However, marking adapters may be useful for troubleshooting failed libraries. If required, I would suggest GATK MarkIlluminaAdapters (Picard) for marking adapters; this would then automatically be reported in the PCT_EXC_ADAPTER stats.
A simple check to make sure the fastq.gz files are not corrupted: gzip -t [file]
Convert the clusters identified by clusterpicker to an iTOL annotation file.
Give a warning for samples with a DosR deletion, which tends to be huge and affects downstream analyses.
NTM presence could be determined with https://github.com/jodyphelan/NTM-Profiler
TransPhylo could be implemented, see e.g. https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000361
Rapid annotation with Nirvana: https://github.com/Illumina/Nirvana
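
For the iTOL suggestion above, a minimal sketch of what the conversion could look like. It assumes the cluster assignments have already been parsed into a sample-to-cluster mapping (the clusterpicker output format is not handled here), and it writes an iTOL colour strip dataset. The function name, palette and labels are hypothetical, and the header fields should be checked against the current iTOL dataset templates.

```python
def itol_colorstrip(sample_to_cluster, label="SNP clusters"):
    """Build the text of an iTOL DATASET_COLORSTRIP annotation file.

    sample_to_cluster: dict mapping tree leaf names to a cluster name
    (assumed to be parsed from the clustering output beforehand).
    """
    # Hypothetical palette; colours are reused if there are more clusters.
    palette = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00"]
    clusters = sorted(set(sample_to_cluster.values()))
    colour = {c: palette[i % len(palette)] for i, c in enumerate(clusters)}

    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        "COLOR\t#000000",
        "DATA",
    ]
    # One line per leaf: leaf name, strip colour, optional label.
    for sample in sorted(sample_to_cluster):
        cluster = sample_to_cluster[sample]
        lines.append(f"{sample}\t{colour[cluster]}\t{cluster}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = {"S1": "cluster_1", "S2": "cluster_1", "S3": "cluster_2"}
    print(itol_colorstrip(demo), end="")
```

The resulting text can be saved to a file and dragged onto the tree in iTOL; samples outside any cluster are simply left out of the mapping.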

TimHHH · 2022-09-22T10:49:30Z

At the moment we identify the 5/12 SNP clusters based on a percentage, since cluster picker does not accept SNP cut-offs. The conversion from SNP cut-off to percentage may cause some minor deviation due to rounding error.
One solution would be to identify the clusters from the *.snp_dists.tsv with a Python script. Conor has such a script.
Solution two is to move away from SNP cut-offs completely, as they are nonsense; e.g. Conor mentioned some Bayesian method.
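
To make the first solution concrete, here is a rough sketch of clustering directly on SNP distances. This is not Conor's script (which is not shown in this thread). It assumes the *.snp_dists.tsv file is a square, tab-separated matrix with sample names in the header row and first column, and it uses single-linkage grouping (two samples are linked if their distance is at or below the cut-off), which is an assumption and may not match cluster picker's clade-based definition. Function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_snp_matrix(handle):
    """Parse a square, tab-separated distance matrix with a header row."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    names = rows[0][1:]  # first header cell is a label, not a sample
    dist = {}
    for row in rows[1:]:
        for other, value in zip(names, row[1:]):
            dist[(row[0], other)] = int(value)
    return names, dist


def snp_clusters(names, dist, cutoff):
    """Single-linkage clusters: link samples whose distance is <= cutoff."""
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), d in dist.items():
        if a != b and d <= cutoff:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in names:
        groups[find(name)].append(name)
    # Only groups with at least two samples are reported as clusters.
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


if __name__ == "__main__":
    demo = (
        "snp-dists\tS1\tS2\tS3\tS4\n"
        "S1\t0\t3\t10\t40\n"
        "S2\t3\t0\t8\t41\n"
        "S3\t10\t8\t0\t39\n"
        "S4\t40\t41\t39\t0\n"
    )
    names, dist = read_snp_matrix(io.StringIO(demo))
    for cutoff in (5, 12):
        print(cutoff, snp_clusters(names, dist, cutoff))
```

Because the cut-off is applied to integer SNP counts, no conversion to a percentage is needed, which removes the rounding issue described above.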

abhi18av · 2022-10-01T15:42:56Z

For environments in which GPUs are available, we can selectively rely upon enhanced versions of bwa/gatk etc https://docs.nvidia.com/clara/parabricks/4.0.0/index.html

abhi18av · 2022-10-07T06:13:38Z

Explore how the xbs-nf and https://github.com/kachelo/NGS-analysis-from-TB-sputum could be used together.

TimHHH · 2022-10-10T09:52:10Z

Explore how the xbs-nf and https://github.com/kachelo/NGS-analysis-from-TB-sputum could be used together.

This pipeline is not too different from ours, main differences:

They use hard filtering, which I do not recommend; it is way too lax. VQSR is preferred.
They use a metagenomic approach to clean up the data before mapping, which takes up a lot of time. XBS was designed to remove this step with the much faster k-parameter and VQSR.
They don't have the post-calling analyses such as phylogeny/transmission/DR.
I don't recommend BQSR.

Excellent on them for using the joint calling!
Important: their paper uses in vitro enrichment (Agilent’s Sure Select XT) and in silico enrichment, both work to clean up the eventual data, but analysing the sputum WGS data directly is going to be cheaper and faster.

TimHHH · 2022-10-11T10:50:04Z

NTM presence could be determined with https://github.com/jodyphelan/NTM-Profiler

On this subject: we currently use a single SNP to determine the presence of NTMs. This works nicely, with one exception: Mycobacterium lentiflavum does not display this SNP and is hence missed. So far we have only observed this for one sample (SMARTT-TM-046), and it makes sense that this is very rare. NTM-Profiler might be a solution for this, though.

TimHHH · 2022-10-27T08:10:59Z

We occasionally encounter issues with FastQ files; some are truncated, some are empty, some have unequal numbers of reads in R1 and R2. These tend to result in crashes that stop the entire run unfortunately. It would be good to run some kind of FastQ validation before FastQC (https://github.com/TORCH-Consortium/xbs-nf/blob/master/workflows/quality_check_wf.nf). Some quick searches reveal:

https://genome.sph.umich.edu/wiki/FastQValidator
https://biopet.github.io/validatefastq/0.1.1/index.html
https://github.com/nunofonseca/fastq_utils

@vrennie @'ing you after discussion to implement this asap
There are also BASH options like gzip -t $file and zcat $file | wc -l.
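
As an illustration of the kind of check discussed above, here is a small Python sketch that covers the three failure modes mentioned (truncated, empty, and unequal R1/R2 read counts). It is not one of the validators linked above and not necessarily what ends up in the pipeline; the function names are hypothetical. Reading the gzip stream to the end exercises the same integrity check as `gzip -t`.

```python
import gzip
import sys
import zlib


def count_fastq_reads(path):
    """Return the number of reads in a gzipped FastQ file.

    Raises ValueError if the file is corrupt, truncated, empty, or does
    not consist of well-formed four-line records.
    """
    n_lines = 0
    try:
        with gzip.open(path, "rt") as handle:
            for n_lines, line in enumerate(handle, start=1):
                position = n_lines % 4
                if position == 1 and not line.startswith("@"):
                    raise ValueError(f"{path}: line {n_lines} should start with '@'")
                if position == 3 and not line.startswith("+"):
                    raise ValueError(f"{path}: line {n_lines} should start with '+'")
    except (OSError, EOFError, zlib.error) as err:
        # Bad gzip header, CRC mismatch, or stream cut off before its end.
        raise ValueError(f"{path}: corrupt or truncated gzip stream ({err})") from err
    if n_lines == 0:
        raise ValueError(f"{path}: file contains no reads")
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def validate_pair(r1_path, r2_path):
    """Check both mates and require equal read counts; return the count."""
    n1 = count_fastq_reads(r1_path)
    n2 = count_fastq_reads(r2_path)
    if n1 != n2:
        raise ValueError(f"read count mismatch: R1 has {n1}, R2 has {n2}")
    return n1


if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: validate_fastq.py R1.fastq.gz R2.fastq.gz")
    try:
        print(validate_pair(sys.argv[1], sys.argv[2]), "read pairs OK")
    except ValueError as err:
        sys.exit(f"FAILED: {err}")
```

Run per sample before FastQC, a non-zero exit status could be used to drop the sample from the run instead of crashing the whole pipeline.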

Tracked here #131

TimHHH · 2022-10-27T08:47:26Z

@adippenaar provided me with sequencing data from a sample that evolved an Rv0678 structural variant + phenotypic BDQ resistance. Anzaan suspects an IS6110-mediated disruption. At the moment this SV is not picked up by MAGMA.
It may be worth exploring further options to pick this SV up:
https://github.com/SemiQuant/svTyper
or playing around with the settings on Delly (e.g. CNV option)

abhi18av · 2023-04-08T09:13:36Z

Concerning the Structural Variants:

There is some ongoing work to accommodate an alternative structural variants subworkflow (in addition to the delly one): Add another round of TBProfiler for detection of structural variants #146
As we rely heavily on GATK4 for the core of the pipeline, would https://github.com/broadinstitute/gatk-sv be worth exploring?

TimHHH · 2023-04-17T09:27:45Z

2. As we rely heavily on GATK4 for the core of the pipeline, would https://github.com/broadinstitute/gatk-sv be worth exploring?

Yes, this is very interesting, but it will take some serious hands-on time to implement and test, so it is not something for the near future. For now the TBprofiler approach is easier 👍

abhi18av · 2023-04-17T16:55:53Z

Gotcha - thanks for the suggestions Tim 😊

Adding another thought here regarding the gatk package - one of the core strengths of the package is its integration with spark which can speed up the overall analysis.

Recently, while working on an analysis (on CERI MAC), I kept running into a memory failure at the gatk MarkDuplicates step, even when I was using the docker images. On a side note, my suspicion is that it has something to do with docker-desktop's Hyperkit configuration (see reference here), which I circumvented by relying on conda integration.

However, this led me to investigate the gatk4-spark packages a little. I'd like to explore that later to see if it could help us optimize the pipeline for a setting where we have smaller-yet-multiple compute nodes, which is especially the case in the cloud.

abhi18av · 2023-08-06T17:28:00Z

Consider adding spoligotyping either via the original https://anaconda.org/bioconda/spotyping package or the experimental support in tb-profiler https://github.com/jodyphelan/TBProfiler#spoligotyping

abhi18av · 2023-11-01T17:16:29Z

We could also rely upon genozip https://www.genozip.com to compress the resultant outputs of the pipeline, in the extreme cases that people would like to use this pipeline for further exploration of VCF/BAM files.

And there's always the direction of long-reads which we can explore.

abhi18av added the enhancement, good first issue, and help wanted labels Sep 1, 2022


TimHHH added the wish list label Oct 27, 2022

abhi18av mentioned this issue Nov 1, 2022

Add FastQ file checkpoint #131

Closed
