Try using breseq instead #9

hgscott · 2024-12-10T18:58:15Z

Chris Marx suggested using breseq for our pipeline. It goes from reads (assumed to be already trimmed) and a GenBank reference file to a VCF.

hgscott · 2024-12-16T19:40:09Z

Need to install breseq on the SCC.

I made a conda environment for it with: conda create -p ./envs/breseq bioconda::breseq

I can activate the environment with conda activate /projectnb/hfsp/SNP23/envs/breseq and everything seems to be installed correctly, because I get the help entry.

hgscott · 2024-12-16T19:44:24Z

From the documentation, it looks like I can:

Use FASTA files for the reference (that might mean I don't get the gene names)
Use multiple references? The documentation says "For multiple files", but I'm not really sure what that means.
Use multiple reads. Would this be for the forward and reverse or for different samples?

hgscott · 2024-12-16T22:17:40Z

I tested all of the following combinations in breseq_test/run_breseq.sh (all for HOT1A3 against its reference genome):

Run just read 1 (trimmed) against the GenBank genome
Run just read 2 (trimmed) against the GenBank genome
Run both reads (trimmed) against the GenBank genome (only supplied once)
Run both reads (trimmed) against the FASTA-format reference genome

They all ran without throwing any errors and generated output files.

hgscott · 2024-12-16T22:28:35Z

To compare those VCF files, I made Venn diagrams comparing the variants in:

Read 1 vs Read 2
The union of read 1 and read 2 vs both reads (GenBank)
- Not sure why this isn't a perfect overlap. I would guess that the number of reads makes some variants pass the filter differently
Both reads against the GenBank genome vs both reads against the FASTA genome
- They use different chromosome names, so there were no variants in common
- But when I fixed the chromosome names, there was a perfect overlap
- But the report from the FASTA run doesn't have any gene information

hgscott · 2024-12-16T22:33:39Z

I compared the results from the FASTA run to the results from my old pipeline:

My unfiltered results have so many more variants that they swamp the comparison
My filtered results have some unique/missed variants, but a large overlap
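The set comparisons behind these Venn diagrams can be sketched in a few lines of Python. This is a minimal sketch, not the script actually used: it assumes each variant is identified by its (CHROM, POS, REF, ALT) fields, and the chromosome names `chr_gbk` and `chr_fna` and all records are made up for illustration. The optional `rename` mapping stands in for the chromosome-name fix described above.

```python
def variant_keys(vcf_lines, rename=None):
    """Return the set of (chrom, pos, ref, alt) keys from VCF text lines.

    `rename` is an optional dict mapping chromosome names in this file to
    the names used in the file it will be compared against.
    """
    rename = rename or {}
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((rename.get(chrom, chrom), int(pos), ref, alt))
    return keys


def venn_counts(a, b):
    """Sizes of the three Venn regions: only in a, shared, only in b."""
    return len(a - b), len(a & b), len(b - a)


# Made-up records for illustration only (not real results).
genbank_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr_gbk\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chr_gbk\t250\t.\tC\tT\t.\tPASS\t.\n",
]
fasta_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr_fna\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chr_fna\t250\t.\tC\tT\t.\tPASS\t.\n",
]

gbk = variant_keys(genbank_vcf)
before = venn_counts(gbk, variant_keys(fasta_vcf))
after = venn_counts(gbk, variant_keys(fasta_vcf, rename={"chr_fna": "chr_gbk"}))
print(before)  # (2, 0, 2): no overlap while the names differ
print(after)   # (0, 2, 0): perfect overlap once the names match
```

For a real file, the lines could come from `open(path)` (or `gzip.open(path, "rt")` for a compressed VCF), and the two resulting sets can be handed to matplotlib-venn's `venn2` as a list of two sets.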

hgscott · 2024-12-16T23:14:45Z

I tried running the HOT1A3 negative control with:
breseq -r /projectnb/hfsp/IAMM_genomes_from_Luca/2687453488.fna -o breseq_test/04-HOT1A3_fast_both_reads results/raw_files/D20-160028-4500T/trimmed_D20-160028_1_sequence.fastq.gz results/raw_files/D20-160028-4500T/trimmed_D20-160028_2_sequence.fastq.gz

But that gives me an error.

ChatGPT says that it's because of my bowtie version, but I don't know why that would only be a problem for this sample.

hgscott · 2024-12-16T23:18:01Z

When I go back and re-run one of the other examples (not the negative control), I'm getting the same error now.

I did install matplotlib and matplotlib-venn into the conda environment to make the Venn diagrams. Maybe that installation changed the bowtie version and broke everything.

hgscott · 2024-12-16T23:25:48Z

So I deleted that conda environment, and made a fresh one with the code above.

And that worked.

hgscott · 2024-12-17T00:31:08Z

I ran it on HOT1A3's negative control, and it generated the VCF, but the process got killed while creating coverage plots.

From the VCF file, I can see that it has 100 SNPs identified.

hgscott · 2024-12-17T01:15:44Z

I ran it as a batch script (it took about 30 min), and it went all the way through the pipeline (it generated the index.html file).

hgscott · 2024-12-17T15:14:10Z

I updated my batch file to also run the negative controls for Luca5 (D20-160033), Citrea (D20-160039), and DSS-3 (D20-160042) and will let it run overnight.
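One way to keep a multi-sample batch consistent is to build every breseq call from the same template. The following is only a sketch of that idea, not the actual batch file: `breseq_command` and the `jobs` list are hypothetical names, and the only flags used are the `-r` and `-o` options from the HOT1A3 command earlier in this thread, whose paths are reused for the single filled-in entry.

```python
import shlex


def breseq_command(reference, out_dir, reads):
    """Build a breseq argument list: -r reference, -o output dir, then reads."""
    if not reads:
        raise ValueError("at least one read file is required")
    return ["breseq", "-r", str(reference), "-o", str(out_dir), *map(str, reads)]


jobs = [
    {
        "sample": "HOT1A3 (D20-160028)",
        "reference": "/projectnb/hfsp/IAMM_genomes_from_Luca/2687453488.fna",
        "out_dir": "breseq_test/04-HOT1A3_fast_both_reads",
        "reads": [
            "results/raw_files/D20-160028-4500T/trimmed_D20-160028_1_sequence.fastq.gz",
            "results/raw_files/D20-160028-4500T/trimmed_D20-160028_2_sequence.fastq.gz",
        ],
    },
    # Entries for D20-160033, D20-160039 and D20-160042 would go here;
    # their reference and read paths are not given in this thread.
]

commands = [breseq_command(j["reference"], j["out_dir"], j["reads"]) for j in jobs]
for cmd in commands:
    # Print a shell-safe command line; nothing is executed here.
    print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` inside the batch job, or the printed lines could be pasted into the existing shell script.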

hgscott · 2024-12-17T15:16:08Z

The batch script finished, and everything appears to have worked without throwing any errors.

To get just the total number of SNPs in the VCF file, I can run grep -v '^#' <VCF FILE> | wc -l and I got the following number of SNPs:

HOT1A3: 100 (old pipeline had 153)
Luca5: 81 (old=345)
Citrea: 49 (old=178)
DSS-3: 124 (old=151)
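Note that grep -v '^#' | wc -l counts every non-header record, so any non-SNP records in the VCF are included in the totals. A small Python sketch that separates the two is shown below. It assumes a SNP is a record whose REF and ALT alleles are all single bases; the example records are made up, and only the four count pairs at the end come from the list above.

```python
BASES = set("ACGTN")


def count_records(vcf_lines):
    """Count non-header VCF records, split into SNPs and everything else."""
    counts = {"total": 0, "snp": 0, "other": 0}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # same filter as grep -v '^#', plus blank lines
        ref, alt = line.rstrip("\n").split("\t")[3:5]
        # A SNP here means a single-base REF and only single-base ALT alleles.
        is_snp = ref in BASES and all(a in BASES for a in alt.split(","))
        counts["total"] += 1
        counts["snp" if is_snp else "other"] += 1
    return counts


# Made-up records for illustration only (not real results).
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chr1\t250\t.\tC\tT\t.\tPASS\t.\n",
    "chr1\t400\t.\tGA\tG\t.\tPASS\t.\n",
]
print(count_records(example))  # {'total': 3, 'snp': 2, 'other': 1}

# Record counts reported above: (breseq, old pipeline).
reported = {"HOT1A3": (100, 153), "Luca5": (81, 345), "Citrea": (49, 178), "DSS-3": (124, 151)}
percent_of_old = {name: round(100 * new / old) for name, (new, old) in reported.items()}
print(percent_of_old)  # {'HOT1A3': 65, 'Luca5': 23, 'Citrea': 28, 'DSS-3': 82}
```

The percentages only compare totals. They say nothing about whether the two pipelines call the same variants.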

hgscott · 2024-12-17T15:56:11Z

I sent those stats and a link to the downloaded results in a shared Google Drive folder to Daniel Sher and Segrè today around 11 AM, before we meet at 12.
