Documentation for VCF as input #394

Open
mbhall88 opened this issue Sep 9, 2024 · 3 comments

mbhall88 commented Sep 9, 2024

Hi Jody,

I have just been running tbprofiler with some samples using VCF as the input (it is ONT data I have variant-called with Clair3). Forgive me if I have missed it somewhere but there doesn't seem to be any documentation about what is expected of the VCF?

For future me (and maybe others): the VCF needs to be indexable, i.e., a BGZIP-compressed VCF (.vcf.gz) or BCF. The other thing, which I found a little more sinister, is that the CHROM name must be Chromosome. I had it as NC_000962.3 and tbprofiler ran without any errors, but I got essentially no resistance predictions. When I changed the CHROM name in the VCF, I got the expected predictions.

My hacky/fast way of making this change was

bcftools view in.vcf.gz | sed 's/NC_000962\.3/Chromosome/g' | bcftools view -o out.bcf

and then run tbprofiler with -v out.bcf.

I guess a more robust solution would be to use BCFtools

echo -e 'NC_000962.3\tChromosome' | bcftools annotate --rename-chrs -  -o out.bcf in.vcf.gz
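
Since a mismatched CHROM name produced no error here, it may be worth listing the names a VCF actually uses, before or after renaming. The following is a minimal sketch, not part of tbprofiler or bcftools: the function names `vcf_chroms` and `check_chroms` are hypothetical, and it assumes a plain-text or gzip/BGZIP-compressed VCF (a BCF would need to go through bcftools first).

```shell
# List the distinct CHROM names used by the records of a VCF.
# Assumes plain-text or gzip/BGZIP-compressed VCF input, not BCF.
vcf_chroms() {
    # gzip -dcf decompresses .vcf.gz and passes plain text through unchanged
    gzip -dcf "$1" | awk -F'\t' '!/^#/ { print $1 }' | sort -u
}

# Hypothetical helper: fail unless the VCF uses exactly one CHROM name
# and it equals the expected one (a VCF with no records also fails).
check_chroms() {
    expected="$2"
    found=$(vcf_chroms "$1")
    if [ "$found" != "$expected" ]; then
        echo "unexpected CHROM names: $found" >&2
        return 1
    fi
}
```

For example, `check_chroms out.vcf.gz Chromosome` returns non-zero and prints the names it found if the rename did not take effect.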

Anyway, maybe some of these examples could be added to the docs? I know I would find it useful, so maybe others would too?

@jodyphelan
Owner

Hi Michael

Apologies for the awful documentation, I really need to invest some time into improving it! I will try to put together a section on what the tool expects in a VCF.

Yes, the default database uses 'Chromosome' as the chromosome name. If you would like to use your VCFs with a different chromosome name, then I would recommend passing --match_ref </path/to/your/reference.fasta> to update_db or create_db, which will use whatever name is in your own fasta file. Again, as you pointed out, this isn't very clear, so I'll try to make a little decision-tree figure on data inputs and recommended settings.
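
If going the --match_ref route, the names in the FASTA and in the VCF still have to agree. As a rough sketch of how that could be checked beforehand, the hypothetical helpers below (not part of tbprofiler) print any CHROM name in the VCF that has no matching sequence name in the FASTA. It assumes the sequence name is the header text after `>` up to the first whitespace, and a plain-text or gzip/BGZIP-compressed VCF.

```shell
# Print the sequence names in a FASTA file
# (header text after '>' up to the first whitespace).
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Print each CHROM name used in the VCF that is absent from the FASTA.
# Assumes plain-text or gzip/BGZIP-compressed VCF input, not BCF.
chroms_missing_from_fasta() {
    vcf="$1"
    fasta="$2"
    names=$(fasta_names "$fasta")
    gzip -dcf "$vcf" | awk -F'\t' '!/^#/ { print $1 }' | sort -u |
    while IFS= read -r chrom; do
        # -x: whole-line match, -F: fixed string (so '.' is literal)
        printf '%s\n' "$names" | grep -qxF -- "$chrom" || printf '%s\n' "$chrom"
    done
}
```

Empty output from `chroms_missing_from_fasta in.vcf.gz reference.fasta` means every CHROM name in the VCF also appears in the FASTA.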

The fact it doesn't complain when you feed it a VCF with different chrom names is pretty critical! I'll put in a fix for that and make a new release asap!

And I didn't know about the --rename-chrs option in bcftools. I'm using my own hacky script internally, but this is far more elegant!

@mbhall88
Author

No worries. It's hard to keep docs updated as a tool evolves.

Personally, I think just renaming the chrom in the VCF as I outlined above is probably an easier route than updating the DB. It's also totally fine to expect users to do this, and I guess I kind of created this issue to show an example of how I achieved it. Selfishly for future me, but hopefully others find it useful. Also, feel free to use it in the docs if you think it is helpful.

Thanks again for keeping TBProfiler updated and evolving.

jodyphelan added a commit to jodyphelan/pathogen-profiler that referenced this issue Sep 30, 2024
@jodyphelan
Owner

Thanks! I've now built in some checks for VCF and BAM formats. I will get those released soon.
