Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try using breseq instead #9

Open
hgscott opened this issue Dec 10, 2024 · 13 comments
Open

Try using breseq instead #9

hgscott opened this issue Dec 10, 2024 · 13 comments

Comments

@hgscott
Copy link
Member

hgscott commented Dec 10, 2024

Chris Marx suggested using breseq for our pipeline. It goes from reads (assuming trimmed) and a GenBank reference file, to a VCF.

@hgscott
Copy link
Member Author

hgscott commented Dec 16, 2024

Need to install breseq on the SCC.

I made a conda environment for it with: conda create -p ./envs/breseq bioconda::breseq

I can activate the environment with conda activate /projectnb/hfsp/SNP23/envs/breseq and everything seems to be installed correctly, because I get the help entry:
image

@hgscott
Copy link
Member Author

hgscott commented Dec 16, 2024

From the documentation, it looks like I can:

  • Use FASTA files for the reference (that might mean I don't get the gene names)
  • Use multiple references? "For multiple files"- not really sure what that means.
  • Use multiple reads. Would this be for the forward and reverse or for different samples?

@hgscott
Copy link
Member Author

hgscott commented Dec 16, 2024

I tested out all of the following combinations in breseq_test/run_breseq.sh (all for HOT1A3 against it's reference genome):

  • Run just read 1 (trimmed) against the genbank genome
  • Run just read 2 (trimmed) against the genbank genome
  • Run both reads (trimmed) against the genbank genome (only supplied once)
  • Run both reads (trimmed) against the fasta format reference genome

They all ran without throwing any errors- and generated output files.

@hgscott
Copy link
Member Author

hgscott commented Dec 16, 2024

To compare all of those of VCF files I made Venn diagrams comparing the variants in:

  • Read 1 vs Read 2
    image

  • The union of read 1 and read 2 vs both reads (gen bank)

    • Not sure why this isn't a perfect overlap- I would guess something about the number of reads makes some pass the filter differently
      image
  • Both reads against the genbank genome vs both reads against the fasta genome

    • They use different chromosome names, so there were none in common
      image

    • But when I fixed the chromosome names, there was a perfect overlap*
      image

    • But the fasta file report doesn't have any gene information:
      image

@hgscott
Copy link
Member Author

hgscott commented Dec 16, 2024

I compared the results from the fasta comparison to the results from my old pipeline:

  • My unfiltered results just have so many more results it is overpowering
    image

  • My filtered results, have some unique/missed variants, but a large overlap
    image

@hgscott
Copy link
Member Author

hgscott commented Dec 16, 2024

I tried running the HOT1A3 negative control with:
breseq -r /projectnb/hfsp/IAMM_genomes_from_Luca/2687453488.fna -o breseq_test/04-HOT1A3_fast_both_reads results/raw_files/D20-160028-4500T/trimmed_D20-160028_1_sequence.fastq.gz results/raw_files/D20-160028-4500T/trimmed_D20-160028_2_sequence.fastq.gz

But that gives me an error:
image

ChatGPT says that it's because of my bowtie version, but I don't know why that would only be a problem for this sample.

@hgscott
Copy link
Member Author

hgscott commented Dec 16, 2024

When I go back and re-run one of the other examples (not the negative control), I'm getting the same error now.

I did install matplotlib and matplotlib-venn to the conda environment to make the venn diagrams. Maybe that installation changed the bowtie version and broke everything.

@hgscott
Copy link
Member Author

hgscott commented Dec 16, 2024

So I deleted that conda environment, and made a fresh one with the code above.

And that worked.

@hgscott
Copy link
Member Author

hgscott commented Dec 17, 2024

I ran it on HOT1A3's negative control, and it generated the VCF, but the process got killed while creating coverage plots:
image

From the VCF file, I can see that it has 100 SNPs identified.

@hgscott
Copy link
Member Author

hgscott commented Dec 17, 2024

I ran it as a batch script (it took about about 30 min), and it went all the way through the pipeline (it generated the index.html file).

@hgscott
Copy link
Member Author

hgscott commented Dec 17, 2024

I updated my batch file to also run the negative controls for Luca5 (D20-160033), Citrea (D20-160039), and DSS-3 (D20-160042) and will let it run over night.

@hgscott
Copy link
Member Author

hgscott commented Dec 17, 2024

The batch script finished, and everything appears to have worked without throwing any errors.

To get just the total number of SNPs in the VCF file, I can run grep -v '^#' <VCF FILE> | wc -l and I got the following number of SNPs:

  • HOT1A3: 100 (old pipeline with had 153)
  • Luca5: 81 (old=345)
  • Citrea: 49 (old=178)
  • DSS-3: 124 (old=151)

@hgscott
Copy link
Member Author

hgscott commented Dec 17, 2024

I sent those stats and the a link to the downloaded results in a shared google drive folder to Daniel Sher and Segrè today ~11AM, before we meet at 12.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant