Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quick Shepherd question #3

Open
Nick-Mul opened this issue Dec 10, 2024 · 3 comments
Open

quick Shepherd question #3

Nick-Mul opened this issue Dec 10, 2024 · 3 comments

Comments

@Nick-Mul
Copy link

Hello,
This looks really useful and everything seems to be working :)

Really quick question, could you give me a pointer on how to extract a molecule from the pickle files?

Thanks,
Nick

e.g.

 with open(f"samples/NP_analogues/shepherd_samples_NP_{k}.pickle", "wb") as f:
        pickle.dump(result, f)
@keiradams
Copy link
Collaborator

This is performed in example scripts in the shepherd-score repository, see https://github.com/coleygroup/shepherd-score/tree/main/scripts. Also tagging @kentoabeywardane in case there is a more explicit demonstration.

For a simpler (but less robust) process, you can follow the last parts of RUNME_conditional_generation_MOSESaq.ipynb after reading in the pickle file -- paper_experiments/run_inference_natural_products.py indicates the structure of the pickle file. The dictionary containing the ShEPhERD-generated samples are in result[-1].

@keiradams
Copy link
Collaborator

@Nick-Mul to be explicitly clear, here's code you can follow. I didn't explicitly test this just now, so let me know if there's a problem

 with open(f"samples/NP_analogues/shepherd_samples_NP_{k}.pickle", "rb") as f:
        result = pickle.load(f)

generated_samples = result[-1]

# quick visualization of generated samples
# full analyses, including extensive validity checks, can be performed by following https://github.com/coleygroup/shepherd-score

for b,sample_dict in enumerate(generated_samples):
    
    xyz = '' 
    
    x_ = sample_dict['x1']['atoms']
    pos_ = sample_dict['x1']['positions']
    
    xyz += f'{len(x_)}\n{b+1}\n'
    for a in range(len(x_)):
        atomic_number_ = int(x_[a])
        position_ = pos_[a]
        
        xyz+= f'{rdkit.Chem.Atom(atomic_number_).GetSymbol()} {str(position_[0].round(3))} {str(position_[1].round(3))} {str(position_[2].round(3))}\n'
    xyz+= '\n'
    
    try:
        mol_ = rdkit.Chem.MolFromXYZBlock(xyz)
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue
    
    try:
        for c in [0, 1, -1, 2, -2]:
            mol__ = deepcopy(mol_)
            try:
                rdkit.Chem.rdDetermineBonds.DetermineBonds(mol__, charge = c, embedChiral = True)
            except:
                continue
            if mol__ is not None:
                print(c)
                break 
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue
    
    mol_ = mol__
    try:
        assert sum([a.GetNumRadicalElectrons() for a in mol_.GetAtoms()]) == 0, 'has radical electrons'
        mol_.UpdatePropertyCache()
        rdkit.Chem.GetSymmSSSR(mol_)
        
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue

    display(rdkit.Chem.MolFromSmiles(rdkit.Chem.MolToSmiles(mol_)))
    
    continue

@kentoabeywardane
Copy link
Collaborator

To extract an RDKit mol object you can use get_mol_from_atom_pos() found in shepherd-score.evaluation.utils.convert_data. The evaluation scripts that Keir linked show how this function is used in conjuction with the rest of the evaluation pipelines.

Here's an example of how to get an RDKit Mol of the generated molecule at sample_idx:

samples = pickle.load('samples.pickle')
mol, charge, xyz_block = get_mol_from_atom_pos(
    samples[-1][sample_idx]['x1']['atoms'],
    samples[-1][sample_idx]['x1']['positions']
)

To be explicit, *samples*.pickle files are lists. In the conditional setting, the structure is:

[
    ref_molec_molblock, # str
    ref_partial_charges, # np.array
    ref_surface_points, # np.array
    ref_electrostatics, # np.array
    ref_pharmacophore_types, # np.array
    ref_pharmacophore_positions, # np.array
    ref_pharmacophore_vectors, # np.array
    [ # list of dictionaries of generated samples.
        {'x1': {
            'atoms': np.array (N,),
            'positions': np.array (N,3)
               }
         'x2': {
            'positions': np.array (S,3)
               }
         'x3': {
            'positions': np.array (S,3),
            'charges': np.array (S,)
               }
         'x4': {
            'types': np.array (P,),
            'positions': np.array (P,3),
            'directions': np.array (P,3)
               }
         }
    ...
    ]
]

In the unconditional setting, there are no reference molecules so it is just the inner list containing the generated samples.

Note that depending on the conditioning at generation, some of the representations may be nonsense. For example, when sampling from P(x1 | x3, x4), the `positions' in the 'x2' key should be ignored since it will just be noise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants