HDR regions not included in coresyn calls, but still present in filtered VCF #15

Open
mparker2 opened this issue Sep 10, 2024 · 3 comments

Comments

@mparker2

Hi @lrauschning,

when I run msyd call in --core mode on some potato haplotypes, I get a nice PFF file of the coresyn regions and a merged VCF for the same coresyn. The SNPs and indels which do not overlap coresyn regions are nicely filtered out, which is exactly what I want.

Msyd breaks the coresyn regions on HDRs. I don't know if @mnshgl0110 would agree with this, since HDRs are considered as part of a syntenic region by syri, but it works quite nicely for me. I want to get rid of them for my current analysis. However, they are not filtered out in the VCF. This doesn't seem quite correct... imo either the PFF should include the HDR regions as coresyn (and the VCF include them), or they should be filtered out of the VCF.

What is your opinion?

@lrauschning
Collaborator

Hiya,
nice to see the VCF filtering (mostly) works!
The reason HDR regions break coresyn is that they do not have corresponding SYNAL annotations, which is what (the current iteration of) msyd works on so that we can have exact alignments with basepair precision.
I think the HDRs might be retained because they start right before/after a coresyn region, and we fetch all variants intersecting a multisyn (end-inclusive) for reporting in the merged VCF.
If you just want the SNPs, not passing -x/--complex should filter out any VCF records with symbolic alleles, incl. HDRs.
Other than that, grep -v HDR would be a quick workaround. Might be worth adding a CLI option to restrict merging to records strictly within a multisyn, though.
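
To illustrate the difference between the two behaviours, here is a minimal sketch contrasting "intersects a region (end-inclusive)" with "strictly contained in a region". This is not msyd's actual code: the `Interval` type, the function names, and the coordinates are all hypothetical, and 1-based inclusive coordinates are assumed.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    # Hypothetical helper type, not part of msyd.
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def intersects(rec: Interval, region: Interval) -> bool:
    """True if rec shares at least one base with region (end-inclusive)."""
    return (rec.chrom == region.chrom
            and rec.start <= region.end
            and rec.end >= region.start)


def strictly_within(rec: Interval, region: Interval) -> bool:
    """True only if rec lies entirely inside region."""
    return (rec.chrom == region.chrom
            and region.start <= rec.start
            and rec.end <= region.end)


# Made-up coordinates: an HDR that starts on the last base of a coresyn region.
coresyn = Interval("chr01", 1000, 2000)
snp = Interval("chr01", 1500, 1500)
hdr = Interval("chr01", 2000, 2450)

for name, rec in [("snp", snp), ("hdr", hdr)]:
    print(name, intersects(rec, coresyn), strictly_within(rec, coresyn))
```

With the intersection test the HDR is kept because it touches the region boundary; with the containment test only the SNP survives.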

@mparker2
Author

That makes sense. Using --complex is not a foolproof solution, though, because syri can write VCFs with either symbolic or full-sequence alleles (the latter using --hdrseq).
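
A rough sketch of why that is: a filter keyed on symbolic alleles only looks at whether ALT has the `<...>` form defined by the VCF spec, so an HDR written out with its full sequence passes straight through. The two records below are made up for illustration (the IDs, positions, and sequences are assumptions, not real syri output), and `drop_symbolic` is a hypothetical helper, not an msyd function.

```python
def is_symbolic(alt: str) -> bool:
    """True for VCF symbolic alleles such as <HDR> or <DEL>."""
    return alt.startswith("<") and alt.endswith(">")


def drop_symbolic(vcf_lines):
    """Yield header lines and records with no symbolic ALT allele."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        alts = line.split("\t")[4].split(",")  # column 5 is ALT
        if not any(is_symbolic(a) for a in alts):
            yield line


# Hypothetical records: the same kind of variant, encoded two ways.
records = [
    "chr01\t2000\tHDR1\tN\t<HDR>\t.\tPASS\tEND=2450",
    "chr01\t5000\tHDR2\tACGT\tAGGTTC\t.\tPASS\t.",
]
kept = list(drop_symbolic(records))
print(len(kept))
```

Only the symbolic record is removed; the full-sequence one is kept, which is the case a symbolic-allele filter cannot catch.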

@lrauschning
Collaborator

Ah, true.
Then I'll look into adding a CLI option for strictly contained records.
I think filtering for specific types of records is probably best left to the user, though.
