Updated output documentation for yak triobin #1

Open
williamrowell opened this issue Mar 24, 2020 · 10 comments

@williamrowell

williamrowell commented Mar 24, 2020

yak/triobin.c

Line 176 in 6de3aff

// fprintf(stderr, "Output: ctg err strongMixed sPat sMat weakMixed wPat1 wMat1 wPat2 wMat2\n");

Do you have any updated documentation for the output of yak triobin? I'm looking at the output of version r43, which has 13 columns as opposed to the 10 columns documented in the help text. I'm especially trying to understand column 2, which has values m, p, a, and 0.

@lh3
Owner

lh3 commented Mar 25, 2020

  • m=mother
  • p=father
  • a=ambiguous, 0=noCount/ambiguous
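
As a rough illustration of how these labels might be tallied per contig, here is a minimal sketch. It assumes whitespace-separated lines with the label in column 2, as described in this thread. The function name `count_labels` and the sample rows are made up for illustration and are not part of yak.

```python
from collections import Counter

# Meaning of the column-2 labels, as given in the reply above.
LABELS = {"m": "mother", "p": "father", "a": "ambiguous", "0": "noCount/ambiguous"}


def count_labels(lines):
    """Tally the column-2 label of each line (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] += 1
    return counts


# Made-up rows for illustration only; real r43 output has 13 columns.
sample = [
    "ctg1 p 120 1",
    "ctg2 m 0 98",
    "ctg3 a 40 42",
    "ctg4 0 0 0",
    "ctg5 p 77 2",
]

for label, n in sorted(count_labels(sample).items()):
    print(label, LABELS.get(label, "unknown"), n)
```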

@williamrowell
Author

Thanks for the quick answer! That's what I guessed, but wanted to make sure before proceeding. Thanks for the tool!

@lh3
Owner

lh3 commented Mar 25, 2020

Forgot to say that you can ignore most of the other columns. Those are mostly for debugging purposes.

@zeeev

zeeev commented Dec 2, 2020

Dear @lh3,

We are testing out trio binning and it looks like our binned assemblies are more fragmented than the non-binned assemblies. Both haplotypes have good coverage. Is there a way to adjust the trio binning step to be more specific, i.e., to require more p/m markers?

What is the meaning of these options?:

  -c INT     min occurrence [2]
  -d INT     mid occurrence [5]

Do you have any suggestions for improving binning at the counting stage?

@lh3
Owner

lh3 commented Dec 2, 2020

By default, if a k-mer occurs 5 times or more in the mother but twice or less in the father, it is considered a mother-specific k-mer. The label in the 2nd column is determined from the rest of the columns under complex rules coded in the function tb_classify(). You can't tune these rules on the command line.
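
The k-mer rule described above can be sketched as follows. This is an illustration of the wording in this comment, not yak's actual code. The function name `classify_kmer`, the parameter names `min_occ` and `mid_occ` (assumed to correspond to the `-c` and `-d` defaults of 2 and 5), and the symmetric father-specific case are all assumptions. The boundary comparisons follow the phrases "5 times or more" and "twice or less" and may differ from the implementation.

```python
def classify_kmer(mother_count, father_count, min_occ=2, mid_occ=5):
    """Label a k-mer from its parental counts (hypothetical helper).

    Returns "m" for mother-specific, "p" for father-specific,
    or None when the k-mer is not parent-specific under this rule.
    """
    # Frequent in the mother, rare or absent in the father.
    if mother_count >= mid_occ and father_count <= min_occ:
        return "m"
    # Assumed mirror image of the rule for the father.
    if father_count >= mid_occ and mother_count <= min_occ:
        return "p"
    return None


print(classify_kmer(7, 1))  # mother-specific under this sketch
print(classify_kmer(0, 9))  # father-specific under this sketch
print(classify_kmer(4, 0))  # too rare in the mother to count as specific
```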

It is hard to get perfect trio binning. Hifiasm effectively uses the HiFi assembly graph to fix binning errors. Without doing that, hifiasm would only get ~10Mb N50, comparable to trio HiCanu.

@lh3
Owner

lh3 commented Dec 2, 2020

For a simple way to increase specificity:

awk '$3>=21&&$4<=2&&$2=="p"' triobin.txt > paternal.txt
awk '$4>=21&&$3<=2&&$2=="m"' triobin.txt > maternal.txt
# the rest are ambiguous
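
For anyone who prefers to do the same filtering in a script, here is a hedged Python sketch of the two awk commands above. It also collects the remaining rows as ambiguous. It assumes whitespace-separated fields with integer values in columns 3 and 4; the function name `split_bins` and the sample rows are made up for illustration.

```python
def split_bins(lines, strong=21, weak=2):
    """Split triobin rows into paternal, maternal and ambiguous lists.

    Mirrors the awk filters above: column 2 is the label and
    columns 3 and 4 are the counts compared against 21 and 2.
    """
    paternal, maternal, ambiguous = [], [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        label, col3, col4 = fields[1], int(fields[2]), int(fields[3])
        if label == "p" and col3 >= strong and col4 <= weak:
            paternal.append(line)
        elif label == "m" and col4 >= strong and col3 <= weak:
            maternal.append(line)
        else:
            ambiguous.append(line)  # the rest are ambiguous
    return paternal, maternal, ambiguous


# Made-up rows for illustration only.
sample = [
    "ctg1 p 120 1",
    "ctg2 m 0 98",
    "ctg3 p 15 1",
    "ctg4 m 5 60",
    "ctg5 a 40 42",
]

pat, mat, amb = split_bins(sample)
print(len(pat), len(mat), len(amb))
```

The thresholds are keyword arguments so they can be raised or lowered without editing the function body.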

@zeeev

zeeev commented Dec 2, 2020

Hi @lh3,

Thank you for sharing these ideas. Just confirming: you think trio binning isn't as effective as just assembling and phasing in a single genome? That has been my experience, at least using yak and hifiasm/IPA.

@lh3
Owner

lh3 commented Dec 2, 2020

Yes, when HiFi phasing and trio phasing are inconsistent, HiFi phasing is often the correct one.

@lh3
Owner

lh3 commented Dec 2, 2020

In the early days, we tried HiCanu trio binning. I manually inspected many differences between HiCanu and yak binning. I think yak is generally more accurate. Nonetheless, the assembly with HiCanu binning is similar to the assembly with yak binning.

@lh3
Owner

lh3 commented Dec 2, 2020

Also, hifiasm applies trio binning to error-corrected reads. This noticeably improves the binning accuracy: there are far fewer inconsistencies between trio phasing and HiFi read phasing.
