Residue numbering start specification #74
Comments
I am wondering whether this should be something that AlphaFold does -- I feel this goes against the UNIX philosophy of doing one thing and doing it well. This is clearly a post-processing step that should be done on the produced mmCIF file. If we were to introduce it in AlphaFold, it would require modifying the input format (to specify the starting residue ID for each chain), which seems too invasive. That being said, I think I will add a utility method in the [...]. I will leave the issue open until I implement it, then I will comment here with a Python snippet to achieve what you are asking for.
@lucajovine residue numbering is handled in our AlphaPulldown interface to AlphaFold2 (https://github.com/KosinskiLab/AlphaPulldown). You calculate input features for full-length proteins and then you run predictions for any subsets of residues, preserving the original full-length residue numbers. When/if we add an AlphaFold3 backend, the same functionality will be supported.
@jkosinski thank you for mentioning this, but frankly this was more of a general comment than something specifically aimed at my lab (where we already have our own post-processing scripts for doing this).

@Augustin-Zidek may I respectfully disagree? I get the UNIX philosophy standpoint, but I do not see why using the correct numbering would go against that; in fact, rather the opposite. Just consider all the secreted proteins that have an N-terminal signal peptide: for any biologically meaningful prediction, we normally do not include those residues (which, in real life, are essentially never "seen" by the rest of the protein), so all resulting mature protein predictions end up being misnumbered (compared to the numbering that one finds in UniProt, for example). And the same of course happens if one wants to predict the structure of an engineered construct, which may have tags or the like. In all these cases, enforcing numbering from 1 is biologically meaningless.

Moreover, I do not quite see how the introduction of the option that I was envisaging would be so intrusive, considering that it would just add one (optional) line per sequence in the input JSON. One may of course argue that if one is able to install and run AlphaFold on their own machine, surely they can easily work out a way to renumber residues if needed. And this is indeed the case (as I mentioned above). But for a lot of the biologists who make predictions using the server this is simply not trivial (I often get asked about this issue), so my main reason for bringing this up was just to try to make everyone's life easier...
This is not a bug but rather a request that, however, I am sure everybody will stand by: please add a way to specify, in the input JSON file, the number of the starting residue of each chain.
As users, we almost always have to do this "by hand" a posteriori, since AlphaFold numbering defaults to starting at 1 and, in the majority of cases, does not match the numbering we actually work with. This is a small change that could make things much easier for everybody, so I hope you'll take the suggestion into consideration. Thank you!
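For readers looking for the post-processing route mentioned in this thread, here is a minimal sketch of renumbering an output mmCIF file after prediction. It is not the utility the maintainers refer to and is not part of AlphaFold. The function name `renumber_auth_seq_id` and the `start_by_chain` mapping are made up for illustration. The sketch assumes the file has an `_atom_site` loop with `auth_asym_id` and `auth_seq_id` columns, that rows are single-line and whitespace-separated with no spaces inside values, and that the existing numbering starts at 1. It only shifts `auth_seq_id` (the author numbering) and leaves `label_seq_id` untouched, since the latter indexes into the entity sequence.

```python
def renumber_auth_seq_id(cif_text, start_by_chain):
    """Shift _atom_site.auth_seq_id so each listed chain starts at a chosen number.

    cif_text       -- contents of an mmCIF file as a string
    start_by_chain -- hypothetical mapping, e.g. {"A": 23}: chain A's first
                      residue (currently numbered 1) becomes residue 23
    Chains not listed in start_by_chain are left unchanged.
    """
    out = []
    columns = []       # column names of the _atom_site loop being read
    state = "outside"  # "outside" -> "header" -> "rows"
    chain_col = seq_col = None
    for line in cif_text.splitlines():
        token = line.strip()
        if token.startswith("_atom_site."):
            if state != "header":
                columns, state = [], "header"
            columns.append(token.split()[0])
        elif state == "header":
            # First line after the column names: data rows begin here.
            # Raises ValueError if an expected column is missing.
            state = "rows"
            chain_col = columns.index("_atom_site.auth_asym_id")
            seq_col = columns.index("_atom_site.auth_seq_id")
        if state == "rows":
            fields = token.split()
            if len(fields) != len(columns):
                # "#", a blank line, or the next category ends the loop.
                state = "outside"
            else:
                offset = start_by_chain.get(fields[chain_col], 1) - 1
                # Skip placeholders such as "." or "?" that are not integers.
                if offset and fields[seq_col].lstrip("-").isdigit():
                    fields[seq_col] = str(int(fields[seq_col]) + offset)
                    # Rows are re-joined with single spaces: still valid
                    # mmCIF, but the original column alignment is lost.
                    line = " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

A typical use would be to read the predicted model into a string, call for example `renumber_auth_seq_id(text, {"A": 23})` for a mature protein whose first residue is residue 23 of the full-length sequence, and write the result to a new file. For anything beyond a simple per-chain shift (insertion codes, multi-line values, other categories that repeat the author numbering), a proper mmCIF parser would be the safer choice.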