Skip to content

Latest commit

 

History

History
542 lines (437 loc) · 24.1 KB

README.md

File metadata and controls

542 lines (437 loc) · 24.1 KB

emle-engine

GitHub Actions License: GPL v2

Emily Engine

(Mascot courtesy Nictrain123 CC BY 3.0.)

A simple interface to allow electrostatic embedding of machine learning potentials using an ORCA-like interface. Based on code by Kirill Zinovjev. An example sander implementation is provided. This works by reusing the existing interface between sander and ORCA, meaning that no modifications to sander are needed. An OpenMM interface is also available via Sire. The embedding model currently supports the HCNOS elements. We plan to add support for further elements in the near future.

Further details can be found in our paper, available here. Please cite this work if you use emle-engine in your research. Supplementary information and data can be found here. For the original theory behind EMLE, please refer to this publication.

We thank EPSRC for funding (grant code EP/V011421/1).

Installation

First create a conda environment with all of the required dependencies:

conda env create -f environment.yaml
conda activate emle

If this fails, try using mamba as a replacement for conda.

For GPU functionality, you will need to install appropriate CUDA drivers on your host system along with NVCC, the CUDA compiler driver. (This doesn't come with cudatoolkit from conda-forge.)

(Depending on your CUDA setup, you might need to prefix the environment creation command above with something like CONDA_OVERRIDE_CUDA="11.2" to resolve an environment that is compatible with your CUDA driver.)

Finally, install emle-engine:

pip install .

If you are developing and want an editable install, use:

pip install -e .

Usage

To start an EMLE calculation server:

emle-server

For usage information, run:

emle-server --help

(By default, an emle_settings.yaml file will be written to the working directory containing the settings used to configure the server. This can be used to re-run an existing simulation using the --config option or EMLE_CONFIG environment variable. Additional emle_pid.txt and emle_port.txt files contain the process ID and port of the server.)

To launch a client to send a job to the server:

orca orca_input

Where orca_input is the path to a fully specified ORCA input file. In the examples given here, the orca executable will be called by sander when performing QM/MM, i.e. we are using a fake ORCA as the QM backend.

(Alternatively, just running orca orca_input will try to connect to an existing server and start one for you if a connection is not found.)

The server and client should both connect to the same hostname and port. This can be specified in a script using the environment variables EMLE_HOST and EMLE_PORT. If not specified, then the same default values will be used for both the client and server.

To stop the server:

emle-stop

NNPOps

The EMLE model uses Atomic Environment Vectors (AEVs) for the calculation of the electrostatic embedding energy. For performance, it is desirable to use the optimised symmetry functions provided by the NNPOps package. This requires a static compute graph, so needs to know the atomic numbers for the atoms in the QM region in advance. These can be specified using the EMLE_ATOMIC_NUMBERS environment variable, or the --atomic-numbers command-line argument when launching the server. This option should only be used if the QM region is fixed, i.e. the atoms in the QM region do not change each time the calculator is called.

Backends

The embedding method relies on in vacuo energies and gradients, to which corrections are added based on the predictions of the model. At present we support the use of TorchANI, DeePMD-kit, ORCA, SQM, XTB, or PySander for the backend, providing reference MM or QM with EMLE embedding, and pure EMLE implementations. To specify a backend, use the --backend argument when launching emle-server, e.g:

emle-server --backend torchani

(The default backend is torchani.)

When using the orca backend, you will also need to specify the path to the real orca exectubale using the --orca-path command-line argument, or the EMLE_ORCA_PATH environment variable. (To check that EMLE is running, look for a log or settings file in the working directory.) The input for orca will be taken from the &orc section of the sander configuration file, so use this to specify the method, etc.

When using deepmd as the backend you will also need to specify a model file to use. This can be passed with the --deepmd-model command-line argument, or using the EMLE_DEEPMD_MODEL environment variable. This can be a single file, or a set of model files specified using wildcards, or as a comma-separated list. When multiple files are specified, energies and gradients will be averaged over the models. The model files need to be visible to the emle-server, so we recommend the use of absolute paths.

When using pysander or sqm as the backend you will also need to specify the path to an AMBER parm7 topology file for the QM region. This can be specified using the --parm7 command-line argument, or via the EMLE_PARM7 environment variable.

We also provide a flexible way of supporting external backends via a callback function that can be specified via:

emle-server --external-backend module.function

The function should take a single argument, an ase.Atoms object for the QM region, and return the energy in Hartree as a float along with the gradients in Hartree/Bohr as a numpy.ndarray. The external backend can also be supplied using the EMLE_EXTERNAL_BACKEND environment variable. When set, the external backend will take precendence over any other backend. If the callback is a function within a local module, then make sure that the directory containing the module is in sys.path, or is visible to the emle-server, e.g. the server is launched from the same directory as the module. Alternatively, use --plugin-path to specify the path to a directory containing the module. This can also be specified using the EMLE_PLUGIN_PATH environment variable. Make sure that this is an absolute path so that it is visible to the server regardless of where it is launched.

Delta-learning corrections

We also support the use Rascal for the calculation of delta-learning corrections to the in vacuo energies and gradients. To use, you will first need to create an environment with the additional dependencies:

conda env create -f environment_rascal.yaml
conda activate emle-rascal

(These are not included in the default environment as they limit the supported Python versions.)

Then, specify a model file using the --rascal-model command-line argument, or via the EMLE_RASCAL_MODEL environment variable.

Note that the chosen backend must match the one used to train the model. At present this is left to the user to specify. In future we aim to encode the backend in the model file so that it can be selected automatically.

Device

We currently support CPU and CUDA as the device for PyTorch. This can be configured using the EMLE_DEVICE environment variable, or by using the --device argument when launching emle-server, e.g.:

emle-server --device cuda

When no device is specified, we will preferentially try to use CUDA if it is available. By default, the first CUDA device index will be used. If you want to specify the index, e.g. when running on a multi-GPU setup, then you can use the following syntax:

emle-server --device cuda:1

This would tell PyTorch that we want to use device index 1. The same formatting works for the environment varialbe, e.g. EMLE_DEVICE=cuda:1.

Embedding method

We support electrostatic, mechanical, non-polarisable, and MM embedding. Here non-polarisable emedding uses the EMLE model to predict charges for the QM region, but ignores the induced component of the potential. MM embedding allows the user to specify fixed MM charges for the QM atoms, with induction once again disabled. Obviously we are advocating our electrostatic embedding scheme, but the use of different embedding schemes provides a useful reference for determining the benefit of using electrostatic embedding for a given system. The embedding method can be specified using the EMLE_METHOD environment variable, or when launching the server, e.g.:

emle-server --method mechanical

The default option is (unsurprisingly) electrostatic. When using MM embedding you will also need to specify MM charges for the atoms within the QM region. This can be done using the --mm-charges option, or via the EMLE_MM_CHARGES environment variable. The charges can be specified as a list of floats (space separated from the command-line, comma-separated when using the environment variable) or a path to a file. When using a file, this should be formatted as a single column, with one line per QM atom. The units are electron charge.

Alpha mode

We support two methods for the calculation of atomic polarisabilities. The default, species, uses a single volume scaling factor for each species. Alternatively, reference, calculates the scaling factors using Gaussian Process Regression (GPR) using the values learned for each reference environment. The alpha mode can be specified using the --alpha-mode command-line argument, or via the EMLE_ALPHA_MODE environment variable.

Logging

Energies can be written to a file using the --energy-file command-line argument or the EMLE_ENERGY_FILE environment variable. The frequency of logging can be specified using --energy-frequency or EMLE_ENERGY_FREQUENCY. This should be an integer specifying the frequency, in integration steps, at which energies are written. (The default is 0, which means that energies aren't logged.) The output will look something like the following, where the columns specify the current step, the in vacuo energy and the total energy.

General log messages are written to the file specified by the --log-file or EMLE_LOG_FILE options. (When using the Python API, by default, no log file is used and diagnostic messages are written to sys.stderr. When using emle-server, logs are by default redirected to emle_log.txt.) The log level can be adjusted by using the --log-level or EMLE_LOG_LEVEL options. For performance, the default log level is set to ERROR.

#     Step            E_vac (Eh)            E_tot (Eh)
         0     -495.724193647246     -495.720214843750
         1     -495.724193662147     -495.720214843750
         2     -495.722049429755     -495.718475341797
         3     -495.717705026011     -495.714660644531
         4     -495.714381769041     -495.711761474609
         5     -495.712389051656     -495.710021972656
         6     -495.710483833889     -495.707977294922
         7     -495.708991110067     -495.706909179688
         8     -495.708890005688     -495.707183837891
         9     -495.711066677908     -495.709045410156
        10     -495.714580371718     -495.712799072266

The xyz coordinates of the QM (ML) and MM regions can be logged by providing the --qm-xyz-frequency command-line argument or by setting the EMLE_QM_XYZ_FREQUENCY environment variable (default is 0, indicating no logging). This generates a qm.xyz file (can be changed by --qm-xyz-file argument or the EMLE_QM_XYZ_FILE environment variable) as an XYZ trajectory for the QM region, and a pc.xyz file (controlled by --pc-xyz-file argument or the EMLE_PC_XYZ_FILE environment variable) with the following format:

<number of point charges in frame1>
charge_1 x y z
charge_2 x y z
...
charge_n x y z
<number of point charges in frame2>
charge_1 x y z
charge_2 x y z
...

The qm.xyz and pc.xyz files can be used for error analysis as described below.

Why do we need an EMLE server?

The EMLE implementation uses several ML frameworks to predict energies and gradients. DeePMD-kit or TorchANI can be used for the in vacuo predictions and custom PyTorch code is used to predict corrections to the in vacuo values in the presence of point charges. The frameworks make heavy use of just-in-time compilation. This compilation is performed during to the first EMLE call, hence subsequent calculatons are much faster. By using a long-lived server process to handle EMLE calls from sander we can get the performance gain of JIT compilation.

Demo

A demo showing how to run EMLE on a solvated alanine dipeptide system can be found in the demo directory. To run:

cd demo
./demo.sh

Output will be written to the demo/output directory.

End-state correction

It is possible to use emle-engine to perform end-state correction (ESC) for alchemical free-energy calculations. Here a λ value is used to interpolate between the full MM (λ = 0) and EMLE (λ = 1) modified potential. To use this feature specify the λ value from the command-line, e.g.:

emle-server --lambda-interpolate 0.5

or via the ESC_LAMBDA_INTERPOLATE environment variable. When performing interpolation it is also necessary to specifiy the path to a topology file for the QM region. This can be specified using the --parm7 command-line argument, or via the EMLE_PARM7 environment variables You will also need to specify the (zero-based) indices of the atoms within the QM region. To do so, use the --qm-indices command-line argument, or the EMLE_QM_INDICES environment variable. Finally, you will need specify MM charges for the QM atoms using the --mm-charges command-line argument or the EMLE_MM_CHARGES environment variable. These are used to calculate the electrostatic interactions between point charges on the QM and MM regions.

It is possible to pass one or two values for λ. If a single value is used, then the calculator will always use that value for interpolation, unless it is updated externally using the --set-lambda-interpolate command line option, e.g.:

emle-server --set-lambda-interpolate 1

Alternatively, if two values are passed then these will be used as initial and final values of λ, with the additional --interpolate-steps option specifying the number of steps (calls to the server) over which λ will be linearly interpolated. (This can also be specified using the EMLE_INTERPOLATE_STEPS environment variable.) In this case the energy file (if written) will contain output similar to that shown below. The columns specify the current step, the current λ value, the energy at the current λ value, and the pure MM and EMLE energies.

#     Step                     λ             E(λ) (Eh)           E(λ=0) (Eh)           E(λ=1) (Eh)
         0        0.000000000000       -0.031915396452       -0.031915396452     -495.735900878906
         5        0.100000000000      -49.588279724121       -0.017992891371     -495.720855712891
        10        0.200000000000      -99.163040161133       -0.023267691955     -495.722106933594
        15        0.300000000000     -148.726318359375       -0.015972195193     -495.717071533203
        20        0.400000000000     -198.299896240234       -0.020024012774     -495.719726562500
        25        0.500000000000     -247.870407104492       -0.019878614694     -495.720947265625
        30        0.600000000000     -297.434417724609       -0.013046705164     -495.715332031250
        35        0.700000000000     -347.003417968750       -0.008571878076     -495.715515136719
        40        0.800000000000     -396.570098876953       -0.006970465649     -495.710876464844
        45        0.900000000000     -446.150207519531       -0.019694851711     -495.720275878906
        50        1.000000000000     -495.725952148438       -0.020683377981     -495.725952148438

OpenMM integration

We provide an interface between emle-engine and OpenMM via the Sire molecular simulation framework. This allows QM/MM simulations to be run with OpenMM using EMLE for the embedding model. This provides greatly improved performance and flexibility in comparison to the sander interface.

To use, first create an emle-sire conda environment:

conda env create -f environment_sire.yaml
conda activate emle-sire

Next install emle-engine into the environment:

pip install .

For instructions on how to use the emle-sire interface, see the tutorial documentation here.

When performing end-state correction simulations using the emle-sire interface there is no need to specify the lambda_interpolate keyword when creating an EMLECalculator instance. Instead, interpolation can be enabled when creating a Sire dynamics object via the same keyword. (See the tutorial for details.)

Torch models

The emle.models module provides a number of torch models. The base EMLE model can be used to compute the EMLE energy in isolation. The combined ANI2xEMLE and MACEEMLE models allow the computation of in vacuo and embedding energies in one go, using the ANI2x and MACE models respectively. Creating additional models is straightforward. For details of how to use the torch models, see the tutorial documentation here.

Error analysis

emle-engine provides a CLI tool emle-analyze that facilitates analysis of the performance of EMLE-based simulations. It requires a set of single point reference calculations for a trajectory generated with emle-server (currently only ORCA is supported). It also requires MBIS decomposition of the in vacuo electronic density of the QM region with horton. Usage:

emle-analyze --qm-xyz qm.xyz \
             --pc.xyz pc.xyz \
             --emle-model model.mat \
             --orca-tarball orca.tar \
             --backend [deepmd, ani2x]
             --alpha
             result.mat

qm.xyz and pc.xyz are the QM and MM XYZ trajectories written out by emle-server (see above in the "Logging" section).

model.mat is the EMLE model used.

orca.tar is a tarball containing single point ORCA calculations and corresponding horton outputs. All files should be named as index.* where index is a numeric value identifying the snapshot (does not have to be consecutive) and the extensions are:

  • .vac.orca: ORCA output for gas phase calculation. When --alpha argument is provided, must also include molecular dipolar polarizability (%elprop Polar)
  • .h5: horton output for gas phase calculation
  • .pc.orca: ORCA output for calculation with point charges
  • .pc: charges and positions of the point charges (the ones used for .pc.orca calculation)
  • .vpot: output of orca_vpot, electrostatic potential of gas phase system at the positions of the point charges

Optional --backend argument allows to also extract the energies with the in vacuo backend. Currently, only deepmd and ani2x backends are supported by emle-analyze. When deepmd backend is used, the DeepMD model must be provided with --deepmd-model.

Training of custom EMLE models

Training of custom EMLE models can be performed with emle-train tool. It requires a tarball with the reference QM calculations with the same naming convention as the one for emle-analyze script, with the difference that only gas phase calculations are required and dipolar polarizabilies must be present. Simple usage:

emle-train --orca-tarball orca.tar model.mat

The resulting model.mat file can be directly used as --emle-model argument for emle-server. A full list of argument and their default values can be printed with emle-train -h:

usage: emle-train [-h] --orca-tarball name.tar [--train-mask] [--sigma] [--ivm-thr] [--epochs]
                  [--lr-qeq] [--lr-thole] [--lr-sqrtk] [--print-every] [--computer-n-species]
                  [--computer-zid-map] [--plot-data name.mat]
                  output

EMLE training script

positional arguments:
  output                Output model file

options:
  -h, --help            show this help message and exit
  --orca-tarball name.tar
                        ORCA tarball (default: None)
  --train-mask          Mask for training set (default: None)
  --sigma               Sigma value for GPR (default: 0.001)
  --ivm-thr             IVM threshold (default: 0.05)
  --epochs              Number of training epochs (default: 100)
  --lr-qeq              Learning rate for QEq params (a_QEq, chi_ref) (default: 0.05)
  --lr-thole            Learning rate for Thole model params (a_Thole, k_Z) (default: 0.05)
  --lr-sqrtk            Learning rate for polarizability scaling factors (sqrtk_ref) (default: 0.05)
  --print-every         How often to print training progress (default: 10)
  --computer-n-species
                        Number of species supported by AEV computer (default: None)
  --computer-zid-map    Map between EMLE and AEV computer zid values (default: None)
  --plot-data name.mat  Data for plotting (default: None)

train-mask if a file containing 0's and 1's that defines the subset of the full training set (provided as --orca-tarball) that is used for training. Note that the values written to --plot-data are for the full training set, which allows to do prediction plots for train/test sets.

The --computer-n-species and --computer-zid-map arguments are only needed when using a common AEV computer for both gas phase backend and EMLE model.

Issues

The DeePMD-kit conda package pulls in a version of MPI which may cause problems if using ORCA as the in vacuo backend, particularly when running on HPC resources that might enforce a specific MPI setup. (ORCA will internally call mpirun to parallelise work.) Since we don't need any of the MPI functionality from DeePMD-kit, the problematic packages can be safely removed from the environment with:

conda remove --force mpi mpich

Alternatively, if performance isn't an issue, simply set the number of threads to 1 in the sander input file, e.g.:

&orc
  method='XTB2',
  num_threads=1
/

When running on an HPC resource it can often take a while for the emle-server to start. As such, the client will try reconnecting on failure a specified number of times before raising an exception. (Sleeping 2 seconds between retries.) By default, the client tries will try to connect 100 times. If this is unsuitable for your setup, then the number of attempts can be configured using the EMLE_RETRIES environment variable.

When performing interpolation it is currently not possible to use AMBER force fields with CMAP terms due to a memory deallocation bug in pysander.