quick Shepherd question #3

Nick-Mul · 2024-12-10T17:57:13Z

Hello,
This looks really useful and everything seems to be working :)

Really quick question, could you give me a pointer on how to extract a molecule from the pickle files?

Thanks,
Nick

e.g.

 with open(f"samples/NP_analogues/shepherd_samples_NP_{k}.pickle", "wb") as f:
        pickle.dump(result, f)

keiradams · 2024-12-10T19:16:52Z

This is performed in example scripts in the shepherd-score repository, see https://github.com/coleygroup/shepherd-score/tree/main/scripts. Also tagging @kentoabeywardane in case there is a more explicit demonstration.

For a simpler (but less robust) process, you can follow the last parts of RUNME_conditional_generation_MOSESaq.ipynb after reading in the pickle file -- paper_experiments/run_inference_natural_products.py indicates the structure of the pickle file. The dictionary containing the ShEPhERD-generated samples are in result[-1].

keiradams · 2024-12-10T19:20:32Z

@Nick-Mul to be explicitly clear, here's code you can follow. I didn't explicitly test this just now, so let me know if there's a problem

 with open(f"samples/NP_analogues/shepherd_samples_NP_{k}.pickle", "rb") as f:
        result = pickle.load(f)

generated_samples = result[-1]

# quick visualization of generated samples
# full analyses, including extensive validity checks, can be performed by following https://github.com/coleygroup/shepherd-score

for b,sample_dict in enumerate(generated_samples):
    
    xyz = '' 
    
    x_ = sample_dict['x1']['atoms']
    pos_ = sample_dict['x1']['positions']
    
    xyz += f'{len(x_)}\n{b+1}\n'
    for a in range(len(x_)):
        atomic_number_ = int(x_[a])
        position_ = pos_[a]
        
        xyz+= f'{rdkit.Chem.Atom(atomic_number_).GetSymbol()} {str(position_[0].round(3))} {str(position_[1].round(3))} {str(position_[2].round(3))}\n'
    xyz+= '\n'
    
    try:
        mol_ = rdkit.Chem.MolFromXYZBlock(xyz)
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue
    
    try:
        for c in [0, 1, -1, 2, -2]:
            mol__ = deepcopy(mol_)
            try:
                rdkit.Chem.rdDetermineBonds.DetermineBonds(mol__, charge = c, embedChiral = True)
            except:
                continue
            if mol__ is not None:
                print(c)
                break 
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue
    
    mol_ = mol__
    try:
        assert sum([a.GetNumRadicalElectrons() for a in mol_.GetAtoms()]) == 0, 'has radical electrons'
        mol_.UpdatePropertyCache()
        rdkit.Chem.GetSymmSSSR(mol_)
        
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue

    display(rdkit.Chem.MolFromSmiles(rdkit.Chem.MolToSmiles(mol_)))
    
    continue

kentoabeywardane · 2024-12-10T19:49:40Z

To extract an RDKit mol object you can use get_mol_from_atom_pos() found in shepherd-score.evaluation.utils.convert_data. The evaluation scripts that Keir linked show how this function is used in conjuction with the rest of the evaluation pipelines.

Here's an example of how to get an RDKit Mol of the generated molecule at sample_idx:

samples = pickle.load('samples.pickle')
mol, charge, xyz_block = get_mol_from_atom_pos(
    samples[-1][sample_idx]['x1']['atoms'],
    samples[-1][sample_idx]['x1']['positions']
)

To be explicit, *samples*.pickle files are lists. In the conditional setting, the structure is:

[
    ref_molec_molblock, # str
    ref_partial_charges, # np.array
    ref_surface_points, # np.array
    ref_electrostatics, # np.array
    ref_pharmacophore_types, # np.array
    ref_pharmacophore_positions, # np.array
    ref_pharmacophore_vectors, # np.array
    [ # list of dictionaries of generated samples.
        {'x1': {
            'atoms': np.array (N,),
            'positions': np.array (N,3)
               }
         'x2': {
            'positions': np.array (S,3)
               }
         'x3': {
            'positions': np.array (S,3),
            'charges': np.array (S,)
               }
         'x4': {
            'types': np.array (P,),
            'positions': np.array (P,3),
            'directions': np.array (P,3)
               }
         }
    ...
    ]
]

In the unconditional setting, there are no reference molecules so it is just the inner list containing the generated samples.

Note that depending on the conditioning at generation, some of the representations may be nonsense. For example, when sampling from P(x1 | x3, x4), the `positions' in the 'x2' key should be ignored since it will just be noise.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

quick Shepherd question #3

quick Shepherd question #3

Nick-Mul commented Dec 10, 2024

keiradams commented Dec 10, 2024

keiradams commented Dec 10, 2024

kentoabeywardane commented Dec 10, 2024

quick Shepherd question #3

quick Shepherd question #3

Comments

Nick-Mul commented Dec 10, 2024

keiradams commented Dec 10, 2024

keiradams commented Dec 10, 2024

kentoabeywardane commented Dec 10, 2024