Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow template fetching #204

Closed
wants to merge 15 commits into from
Closed

Conversation

rwxayheee
Copy link
Contributor

@rwxayheee rwxayheee commented Oct 17, 2024

This PR includes the utilities for residue template generation. **Currently it works for noncovalent cofactors only. **

Here's a summary of the changes:

  • Insert a check for unknown residues in the __init__ of LinkedRDKitChorizo.
    If raw_input_mols or set_templates contain residues not in the current loaded residue template, a new residue template will be made.

  • Fetch from rcsb
    The input_resname or specified resname will be used to fetch a definition CIF file from RCSB (https://files.rcsb.org/ligands/download/). The CIF files will be downloaded as temporary files.

  • Build process
    The chemical component will be processed into the canonical protonation state (after deprotonating a default set of acidic protons). No further embedding, extension, capping, linker guessing will be attempted. The logging level of the build process is controlled by ChemicalComponent_LoggingControler. Other associated functions are imported from chemtempgen.py.

  • Added dependencies
    gemmi, copy, urllib.request (and network connectivity), time, tempfile

  • Example output

(mk_dev) amyhe@Amys-MBP Downloads % mk_prepare_receptor.py -i 1d3g_input/1d3g.pdb -o 1d3g_output -j --default_altloc A
/Users/amyhe/micromamba/envs/mk_dev/bin/mk_prepare_receptor.py:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').require('meeko==0.6.0a3')
@> 3144 atoms and 1 coordinate set(s) were parsed in 0.03s.
DEBUG:.prody:3144 atoms and 1 coordinate set(s) were parsed in 0.03s.
Input residue names {'DDQ', 'BRE', 'ORO', 'ACT', 'FMN', 'SO4'} not in residue_templates

WARNING:root:Molecule doesn't contain matching  with acidic_proton_loc {'[H][O][CX3](=O)': 0, '[H][O][SX4](=O)': 0, '[H][O][SX3](=O)': 0, '[H][O][PX4](=O)': 0, '[H][O][PX3](=O)': 0, '[H][O][CX3](=S)': 0, '[H][O][SX4](=S)': 0, '[H][SX2][a]': 0} -> deprotonate returning original mol... 
******************** New Template Built ********************
{
    "ambiguous": {},
    "residue_templates": {
        "DDQ": {
            "smiles": "[H]C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])C([H])([H])[N+]([O-])(C([H])([H])[H])C([H])([H])[H]",
            "atom_name": ["H101", "C10", "H102", "H103", "C9", "H91", "H92", "C8", "H81", "H82", "C7", "H71", "H72", "C6", "H61", "H62", "C5", "H51", "H52", "C4", "H41", "H42", "C3", "H31", "H32", "C2", "H21", "H22", "C1", "H11", "H12", "N1", "O1", "CM1", "HM11", "HM12", "HM13", "CM2", "HM21", "HM22", "HM23"],
            "link_labels": {}
        }
    }
}
************************************************************
WARNING:root:Molecule doesn't contain matching  with acidic_proton_loc {'[H][O][CX3](=O)': 0, '[H][O][SX4](=O)': 0, '[H][O][SX3](=O)': 0, '[H][O][PX4](=O)': 0, '[H][O][PX3](=O)': 0, '[H][O][CX3](=S)': 0, '[H][O][SX4](=S)': 0, '[H][SX2][a]': 0} -> deprotonate returning original mol... 
******************** New Template Built ********************
{
    "ambiguous": {},
    "residue_templates": {
        "ACT": {
            "smiles": "[H]C([H])([H])C(=O)[O-]",
            "atom_name": ["H1", "CH3", "H2", "H3", "C", "O", "OXT"],
            "link_labels": {}
        }
    }
}
************************************************************
******************** New Template Built ********************
{
    "ambiguous": {},
    "residue_templates": {
        "BRE": {
            "smiles": "[H]C1=C([H])C([H])=C(C2=C([H])C([H])=C(C3=NC4=C(C([H])=C(F)C([H])=C4[H])C(C(=O)[O-])=C3C([H])([H])[H])C([H])=C2[H])C([H])=C1[H]",
            "atom_name": ["H211", "C21", "C20", "H201", "C18", "H181", "C17", "C16", "C12", "H121", "C8", "H81", "C23", "C3", "N1", "C7", "C6", "C10", "H101", "C14", "F1", "C15", "H151", "C11", "H111", "C4", "C5", "O1", "O2", "C2", "C1", "H11", "H12", "H13", "C9", "H91", "C13", "H131", "C19", "H191", "C22", "H221"],
            "link_labels": {}
        }
    }
}
************************************************************
******************** New Template Built ********************
{
    "ambiguous": {},
    "residue_templates": {
        "ORO": {
            "smiles": "[H]C1=C(C(=O)[O-])N([H])C(=O)N([H])C1=O",
            "atom_name": ["H5", "C5", "C6", "C7", "O71", "O72", "N1", "HN1", "C2", "O2", "N3", "HN3", "C4", "O4"],
            "link_labels": {}
        }
    }
}
************************************************************
******************** New Template Built ********************
{
    "ambiguous": {},
    "residue_templates": {
        "FMN": {
            "smiles": "[H]OC([H])(C([H])([H])OP(=O)([O-])[O-])C([H])(O[H])C([H])(O[H])C([H])([H])N1C2=NC(=O)N([H])C(=O)C2=NC2=C([H])C(C([H])([H])[H])=C(C([H])([H])[H])C([H])=C21",
            "atom_name": ["HO4'", "O4'", "C4'", "H4'", "C5'", "H5'1", "H5'2", "O5'", "P", "O1P", "O2P", "O3P", "C3'", "H3'", "O3'", "HO3'", "C2'", "H2'", "O2'", "HO2'", "C1'", "H1'1", "H1'2", "N10", "C10", "N1", "C2", "O2", "N3", "HN3", "C4", "O4", "C4A", "N5", "C5A", "C6", "H6", "C7", "C7M", "HM71", "HM72", "HM73", "C8", "C8M", "HM81", "HM82", "HM83", "C9", "H9", "C9A"],
            "link_labels": {}
        }
    }
}
************************************************************
WARNING:root:Molecule doesn't contain matching  with acidic_proton_loc {'[H][O][CX3](=O)': 0, '[H][O][SX4](=O)': 0, '[H][O][SX3](=O)': 0, '[H][O][PX4](=O)': 0, '[H][O][PX3](=O)': 0, '[H][O][CX3](=S)': 0, '[H][O][SX4](=S)': 0, '[H][SX2][a]': 0} -> deprotonate returning original mol... 
******************** New Template Built ********************
{
    "ambiguous": {},
    "residue_templates": {
        "SO4": {
            "smiles": "O=S(=O)([O-])[O-]",
            "atom_name": ["O1", "S", "O2", "O3", "O4"],
            "link_labels": {}
        }
    }
}
************************************************************

Files written:
1d3g_output.json <-- parameterized receptor

@diogomart Could you please take a quick look, and let me know if you like the template generation to be part of chorizo creating? If so, I have some more improvements to make and will continue working on this. If not, I will open another PR with just the check only. I can write some examples of how to make and edit chemical templates with the functions.
Thank you for your time and kind advice in advance

@rwxayheee rwxayheee requested a review from diogomart October 17, 2024 23:15
@diogomart
Copy link
Contributor

This idea and implementation look great! I think it belongs with the chorizo. Thanks a lot!

@rwxayheee
Copy link
Contributor Author

Awesome, thanks so much @diogomart!
I will continue working on this. Converting to draft for now

@rwxayheee rwxayheee marked this pull request as draft October 18, 2024 00:26
@rwxayheee rwxayheee marked this pull request as ready for review October 18, 2024 23:53
@rwxayheee rwxayheee marked this pull request as draft October 18, 2024 23:57
@rwxayheee
Copy link
Contributor Author

rwxayheee commented Oct 19, 2024

Here's a summary of the recent changes:

  • For ChorizoCreationError
    Previously, I forgot to format the same error for the rdkit parser. I fixed it and I did some small changes.
    Now, the main error messages and the recommendations (optional) are separate, so we can write different recommendations for different main errors. The format is set by the string representation of the class.
    Here's an example output (when there's no Internet connection):
(mk_dev) amyhe@Amys-MBP Downloads % mk_prepare_receptor.py -i 3mxh.pdb -o 3mxh_output -j -p -a --default_altloc A --delete_residues R:8
/Users/amyhe/micromamba/envs/mk_dev/bin/mk_prepare_receptor.py:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').require('meeko==0.6.0a3')
@> 2916 atoms and 1 coordinate set(s) were parsed in 0.02s.
DEBUG:.prody:2916 atoms and 1 coordinate set(s) were parsed in 0.02s.
Input residues {'R:1': 'C2E'} not in residue_templates

Trying to resolve unknown residues by building chemical templates... 

Error: Creation of data structure for receptor failed.

Details:
Max retries reached. Could not download CIF file for C2E. Error: <urlopen error [Errno 8] nodename nor servname provided, or not known>
  • For template fetching
    Now, the __init__ block of LinkedRDKitChorizo always tries to generate templates for unbound residues. But it never forgives residues missing templates, if they're linking fragments (have bonds with other residues).
    Here's an example output:
(mk_dev) amyhe@Amys-MBP Downloads % mk_prepare_receptor.py -i 3mxh.pdb -o 3mxh_output -j -p -a --default_altloc A    
/Users/amyhe/micromamba/envs/mk_dev/bin/mk_prepare_receptor.py:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').require('meeko==0.6.0a3')
@> 2916 atoms and 1 coordinate set(s) were parsed in 0.02s.
DEBUG:.prody:2916 atoms and 1 coordinate set(s) were parsed in 0.02s.
Input residues {'R:8': 'GTP', 'R:1': 'C2E'} not in residue_templates

Trying to resolve unknown residues by building chemical templates... 

Error: Creation of data structure for receptor failed.

Details:
Unknown residues: {'R:8': 'GTP'} appear to be linking fragments. 
Guessing chemical templates with linker_labels are not currently supported. 

Recommendations:
1. (to parameterize the residues) Use --add_templates to pass the additional templates with valid linker_labels, 
2. (to skip the residues) Use --delete_residues to ignore them. Residues will be deleted from the prepared receptor.

The linking residues' templates could have been generated, if we allow guessing certain linker types:
(1) N-term, C-term of amino acid residues
(2) 5', 3'-linking of nucleotides

@diogomart Do you want guessing the linkers to be part of chorizo creation?

If so, I can add the functions. If not, we will leave them to users. The template preparation is not too difficult to do with a standalone script. Users can do some basic editing and assign linkers by patterns or atom names. And when needed, they can define new linking reactions.

This is not urgent to merge. Thanks again for looking at this!

Edit:

  • After b93ed7a, all new templates will have a non-empty ambiguous:
(mk_dev) amyhe@Amys-MBP Downloads % mk_prepare_receptor.py -i 1d3g_input/1d3g.pdb -o 1d3g_output -j --default_altloc A
/Users/amyhe/micromamba/envs/mk_dev/bin/mk_prepare_receptor.py:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').require('meeko==0.6.0a3')
@> 3144 atoms and 1 coordinate set(s) were parsed in 0.02s.
DEBUG:.prody:3144 atoms and 1 coordinate set(s) were parsed in 0.02s.
Input residues {'A:400': 'SO4', 'A:401': 'ACT', 'A:397': 'BRE', 'A:398': 'FMN', 'A:399': 'ORO', 'A:700': 'DDQ'} not in residue_templates

Trying to resolve unknown residues by building chemical templates... 
WARNING:root:Molecule doesn't contain matching atoms for acidic_proton_loc:{'[H][O][CX3](=O)': 0, '[H][O][SX4](=O)': 0, '[H][O][SX3](=O)': 0, '[H][O][PX4](=O)': 0, '[H][O][PX3](=O)': 0, '[H][O][CX3](=S)': 0, '[H][O][SX4](=S)': 0, '[H][SX2][a]': 0}-> deprotonate returning original mol... 
******************** New Template Built ********************
{
    "ambiguous": {
        "SO4": ["SO4"]
    },
    "residue_templates": {
        "SO4": {
            "smiles": "O=S(=O)([O-])[O-]",
            "atom_name": ["O1", "S", "O2", "O3", "O4"],
            "link_labels": {}
        }
    }
}
************************************************************
  • 6d1ae4f adds a protection to KeyError, when an ambiguous template_key is passed by set_templates.

  • guess linker
    rwxayheee/Meeko@guess_linker is a working version that allows linker guessing for modified AA and NA. It does not forgive modified backbones, which might require modified padders. Example output:

(mk_dev) amyhe@Amys-MBP Downloads % mk_prepare_receptor.py -i 3mxh.pdb -o 3mxh_output -j -p --default_altloc A  
/Users/amyhe/micromamba/envs/mk_dev/bin/mk_prepare_receptor.py:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  __import__('pkg_resources').require('meeko==0.6.0a3')
@> 2916 atoms and 1 coordinate set(s) were parsed in 0.02s.
DEBUG:.prody:2916 atoms and 1 coordinate set(s) were parsed in 0.02s.
Input residues {'R:8': 'GTP', 'R:1': 'C2E'} not in residue_templates

Trying to resolve unknown residues by building chemical templates... 
******************** New Template Built ********************
{
    "ambiguous": {
        "C2E": ["C2E"]
    },
    "residue_templates": {
        "C2E": {
            "smiles": "[H]OC1([H])C([H])(N2C([H])=NC3=C2N=C(N([H])[H])N([H])C3=O)OC2([H])C([H])([H])OP(=O)([O-])OC3([H])C([H])(OC([H])(N4C([H])=NC5=C4N=C(N([H])[H])N([H])C5=O)C3([H])O[H])C([H])([H])OP(=O)([O-])OC21[H]",
            "atom_name": ["HO2'", "O2'", "C2'", "H2'", "C1'", "H1'", "N9", "C8", "H8", "N7", "C5", "C4", "N3", "C2", "N2", "HN21", "HN22", "N1", "HN1", "C6", "O6", "O4'", "C4'", "H4'", "C5'", "H5'1", "H5'2", "O5'", "P1", "O1P", "O2P", "O3A", "C3A", "H3A", "C4A", "H4A", "O4A", "C1A", "H1A", "N91", "C81", "H81", "N71", "C51", "C41", "N31", "C21", "N21", "HN24", "HN23", "N11", "HN11", "C61", "O61", "C2A", "H2A", "O2A", "HO2A", "C5A", "H511", "H512", "O5A", "P11", "O11", "O21", "O3'", "C3'", "H3'"],
            "link_labels": {}
        }
    }
}
************************************************************
WARNING:root:Molecule breaks into fragments during the deleterious editing of GTP_ -> skipping the vaiant... 
WARNING:root:Molecule breaks into fragments during the deleterious editing of GTP_3 -> skipping the vaiant... 
WARNING:root:Molecule doesn't contain wanted_smarts: [PX4h1] -> continue with next pattern... 
WARNING:root:Molecule doesn't contain pattern: [PX4h1] -> linker label for 5-prime will not be made. 
WARNING:root:Molecule breaks into fragments during the deleterious editing of GTP_5 -> skipping the vaiant... 
******************** New Template Built ********************
{
    "ambiguous": {
        "GTP": ["GTP_5p"]
    },
    "residue_templates": {
        "GTP_5p": {
            "smiles": "[H]OC1([H])C([H])(N2C([H])=NC3=C2N=C(N([H])[H])N([H])C3=O)OC([H])(C([H])([H])OP(=O)([O-])OP(=O)([O-])OP(=O)([O-])[O-])C1([H])O",
            "atom_name": ["HO2'", "O2'", "C2'", "H2'", "C1'", "H1'", "N9", "C8", "H8", "N7", "C5", "C4", "N3", "C2", "N2", "HN21", "HN22", "N1", "HN1", "C6", "O6", "O4'", "C4'", "H4'", "C5'", "H5'", "H5''", "O5'", "PA", "O1A", "O2A", "O3A", "PB", "O1B", "O2B", "O3B", "PG", "O1G", "O2G", "O3G", "C3'", "H3'", "O3'"],
            "link_labels": {"42": "3-prime"}
        }
    }
}
************************************************************

Files written:
 3mxh_output.json <-- parameterized receptor
3mxh_output.pdbqt <-- static (i.e., rigid) receptor input file

@rwxayheee rwxayheee marked this pull request as ready for review October 19, 2024 01:34
cc.link_labels vs atom_idx
@rwxayheee rwxayheee marked this pull request as draft October 19, 2024 02:28
@rwxayheee rwxayheee marked this pull request as ready for review October 19, 2024 04:57
@diogomart
Copy link
Contributor

@diogomart Do you want guessing the linkers to be part of chorizo creation?

Yes! That would enable non-standard amino-acids, which is important. But no rush, we can release it v0.7 to have as much time as needed to implement it.

I started testing this PR, I'll post here soon. Thanks for all this work!

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Oct 20, 2024

Hi @diogomart
I have a branch called guess_linker that does the guessing with restrictions (no backbone modifications), would you like me to create a PR? Thanks so much for testing it. And again no rush, this isn’t a planned feature for 0.6.0. I just wanted to write it down before I forget how to

@rwxayheee
Copy link
Contributor Author

rwxayheee commented Oct 21, 2024

(just commenting for record) There are two types of structures guess linker can't handle:

  1. residue with modified backbone
  2. modified sidechain and the modified residue consists of multiple linking fragments, like glycosylation

Both might require pre-registration or a guess for a new padder. This feels a bit conflicting with the current mapping and padding process, where we explicitly restrict the type and position of padding in the templates. I'm not very sure what to do.......

If we want to enable guessing padders - I can rework Padder and template matching in a new branch. The default padding will contain the 1-4 covalent (or as many) atoms in the adjacent residue. But this could be too forgiving as it trusts input structure unconditionally and reads whatever bonded component from the adjacent residue. There will be no more adjacent residue check and the guess will supercede the fallback reaction logics.

Again no rush on this at all, and if there're more important things we should do at the moment, then this can wait (indefinitely until someone needs it)

@diogomart
Copy link
Contributor

@rwxayheee yes please create a PR for guess_linker

@diogomart
Copy link
Contributor

I noticed that heme (HEM) does not build as an rdkit molecule because RDKit complains about a nitrogen having 4 bonds without a +1 formal charge. This happened with rwmol.UpdatePropertyCache() inside the try statement in function from_cif(). Interestingly, writing a MOL block and reading it again fixed the issue (but atom properties were lost, causing an issue further down the line). Pretty sure this has to do with dative bonds (or lack thereof from the CIF file) as Stefano had encoutered this issue before and fixed it by setting the bonds between N and Fe to dative.

@rwxayheee rwxayheee mentioned this pull request Oct 21, 2024
@rwxayheee
Copy link
Contributor Author

rwxayheee commented Oct 21, 2024

I noticed that heme (HEM) does not build as an rdkit molecule because RDKit complains about a nitrogen having 4 bonds without a +1 formal charge.

Yes, anything containing dative bonds of metals may not work. It happens with the Pt-nucleotide complexes too, but they were gated at an earlier point because the element isn't supported.

Would you want us to fix it, or leave it as an exception?

In the definition file of HEM, the charge column is all 0. This is incorrect because before deprotonation, the charge of the molecule should be +2.. This is failing like a few other ligands, when definition is incorrect.

Maybe we can put HEM in the default template, so the fetching never needs to happen

@diogomart
Copy link
Contributor

Maybe we can put HEM in the default template

That makes a lot of sense

@rwxayheee
Copy link
Contributor Author

Closing as it is superseded by #206

@rwxayheee rwxayheee closed this Oct 22, 2024
@rwxayheee rwxayheee deleted the fetch_temp branch October 22, 2024 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants