-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Implementing gemmi
-based mmcif reader (with easy extension to PDB/PDBx and mmJSON)
#4712
base: develop
Are you sure you want to change the base?
Conversation
Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-10-25 11:17:29 UTC |
Linter Bot Results:Hi @marinegor! Thanks for making this PR. We linted your code and found the following: Some issues were found with the formatting of your code.
Please have a look at the Please note: The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far, will require a small test file to check reader/parser halves.
pass | ||
|
||
|
||
class MMCIFWriter(base.WriterBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't include this at this stage, Writer is optional
from .base import TopologyReaderBase | ||
|
||
|
||
def _into_idx(arr: list[int]) -> list[int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document what this does, ideally with an example
record_types, # res.het_flag | ||
tempfactors, # at.b_iso | ||
residx, # _into_idx(res.seqid.num) | ||
) = map( # this is probably not pretty, but it's efficient -- one loop over the mmcif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all the fields here guaranteed in a valid pdbx? One benefit to working column by column is that you can do optional columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example of a PDBx in mind, or like a test set for them? I've never actually worked with the format, since in RCSB afaik we have only pdb or mmcif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PDBx is mmcif. The download links here will give you an example file: https://www.rcsb.org/structure/4ake we use 4ake elsewhere in the testsuite. In my experience, sometimes the PDB / mmcif versions of the same entry aren't completely identical, so I wouldn't worry about trying to align the PDB & PDBx tests.
np.array, | ||
list( | ||
zip( | ||
*[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to follow the logic here, a comment breaking down what this double nested loop iteration into a zip is doing would be nice
@@ -78,6 +78,7 @@ extra_formats = [ | |||
"pytng>=0.2.3", | |||
"gsd>3.0.0", | |||
"rdkit>=2020.03.1", | |||
"gemmi", # for mmcif format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will probably be optional, so other imports will have to respect that too
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #4712 +/- ##
===========================================
- Coverage 93.65% 93.59% -0.06%
===========================================
Files 175 189 +14
Lines 21564 22715 +1151
Branches 3023 3028 +5
===========================================
+ Hits 20195 21261 +1066
- Misses 925 1005 +80
- Partials 444 449 +5 ☔ View full report in Codecov by Sentry. |
Hi @richardjgowers , I think the PR is ready now, and I also fixed (I believe) your comments. |
Fixes #2367 and also extends #4303
Changes made in this Pull Request:
gemmi
library (link) to parse mmcif filesclass MMCIFReader(base.SingleFrameReaderBase)
andclass MMCIFParser(TopologyReaderBase)
classes for thatAs a bonus, this implementation would potentially allow to read any of the gemmi-supported formats (source):
Also, this (with slight modifications) also would allow reading mmcif with multiple models sharing the same topology, as well as more feature-rich parsing of PDBs (the same code without changes can be used for parsing altlocs, charges, etc, from all of these formats).
However, I'm slightly lost on what's to be done next for this PR to be merged, so I'm asking if someone could help me navigate here (tagging @richardjgowers here as author of original PDBx implementation 4303).
PR Checklist
Developers certificate of origin
📚 Documentation preview 📚: https://mdanalysis--4712.org.readthedocs.build/en/4712/