Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
asa_output.py	asa_output.py
config_dict.py	config_dict.py
contacts.py	contacts.py
contacts_dataset.py	contacts_dataset.py
contacts_experiment.py	contacts_experiment.py
contacts_network.py	contacts_network.py
distogram_io.py	distogram_io.py
ensemble_contact_maps.py	ensemble_contact_maps.py
parsers.py	parsers.py
paste_contact_maps.py	paste_contact_maps.py
requirements.txt	requirements.txt
run_eval.sh	run_eval.sh
secstruct.py	secstruct.py
test_domains.txt	test_domains.txt
train_domains.txt	train_domains.txt
two_dim_convnet.py	two_dim_convnet.py
two_dim_resnet.py	two_dim_resnet.py

AlphaFold

This package provides an implementation of the contact prediction network, associated model weights and CASP13 dataset as published in Nature.

Any publication that discloses findings arising from using this source code must cite Improved protein structure prediction using potentials from deep learning by Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W. R. Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T. Jones, David Silver, Koray Kavukcuoglu, Demis Hassabis.

The paper abstract can be found on Nature's site 10.1038/s41586-019-1923-7 and the full text can be accessed directly at https://rdcu.be/b0mtx.

Setup

Dependencies

Python 3.6+.
Abseil 0.8.0
Numpy 1.16
Six 1.12
Setuptools 41.0.0
Sonnet 1.35
TensorFlow 1.14. Not compatible with TensorFlow 2.0+.
TensorFlow Probability 0.7.0

You can set up Python virtual environment (you might need to install the python3-venv package first) with all needed dependencies inside the forked deepmind_research repository using:

python3 -m venv alphafold_venv
source alphafold_venv/bin/activate
pip install wheel
pip install -r alphafold_casp13/requirements.txt

Alternatively, you can just use the run_eval.sh script provided which will run these commands for you. See the section on running the system below for more details.

Data

While the code is licensed under the Apache 2.0 License, the AlphaFold weights and data are made available for non-commercial use only under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You can find details at: https://creativecommons.org/licenses/by-nc/4.0/legalcode

You can download the data from:

http://bit.ly/alphafold-casp13-data-license: The data license file.
http://bit.ly/alphafold-casp13-data: The dataset to reproduce AlphaFold's CASP13 results.
http://bit.ly/alphafold-casp13-weights: The model checkpoints.

Input data

The dataset to reproduce AlphaFold's CASP13 results can be downloaded from http://bit.ly/alphafold-casp13-data. The dataset is in a single zip file called casp13_data.zip which has about 43.5 GB.

The zip file contains 1 directory for each CASP13 target and a LICENSE.txt file. Each target directory contains the following files:

TARGET.tfrec file. This is a TFRecord file with serialized tf.train.Example protocol buffers that contain the features needed to run the model.
contacts/TARGET.pickle file(s) with the predicted distogram.
contacts/TARGET.rr file(s) with the contact map derived from the predicted distogram. The RR format is described on the CASP website.

Note that for T0999 the target was manually split based on hits in HHSearch into 5 sub-targets, hence there are 5 distograms (contacts/T0999s{1,2,3,4,5}.pickle) and 5 RR files (contacts/T0999s{1,2,3,4,5}.rr).

The contacts/ folder is not needed to run the model, these files are included only for convenience so that you don't need to run the inference for CASP13 targets to get the contact map.

Model checkpoints

The model checkpoints can be downloaded from http://bit.ly/alphafold-casp13-weights. The model checkpoints are in a zip file called alphafold_casp13_weights.zip which has about 210 MB.

The zip file contains:

A directory 873731. This contains the weights for the distogram model.
A directory 916425. This contains the weights for the background distogram model.
A directory 941521. This contains the weights for the torsion model.
LICENSE.txt. The model checkpoints have a non-commercial license which is defined in this file.

Each directory with model weights contains a number of different model configurations. Each model has a config file and associated weights. There is only one torsion model. Each model directory also contains a stats file that is used for feature normalization specific to that model.

Distogram prediction

Running the system

You can use the run_eval.sh script to run the entire Distogram prediction system. There are a few steps you need to start with:

Download the input data as described above. Unpack the data in the directory with the code.
Download the model checkpoints as described above. Unpack the data.
In run_eval.sh set the following:
- DISTOGRAM_MODEL to the path to the directory with the distogram model.
- BACKGROUND_MODEL to the path to the directory with the background model.
- TORSION_MODEL to the path to the directory with the torsion model.
- TARGET to the name of the target.
- TARGET_PATH to the path to the directory with the target input data.
- OUTPUT_DIR is by default set to a new directory with a timestamp within your home directory.

Then run alphafold_casp13/run_eval.sh from the deepmind_research parent directory (you will get errors if you try running run_eval.sh directly from the alphafold_casp13 directory).

The contact prediction works in the following way:

4 replicas (by replica we mean a configuration file describing the network architecture and a snapshot with the network weights), each with slightly different model configuration, are launched to predict the distogram.
4 replicas, each with slightly different model configuration are launched to predict the background distogram.
1 replica is launched to predict the torsions.
The predictions from the different replicas are averaged together using ensemble_contact_maps.py.
The predictions for the 64 × 64, 128 × 128 and 256 × 256 distogram crops are pasted together using paste_contact_maps.py.

When running run_eval.sh the output has the following directory structure:

distogram/: Contains 4 subfolders, one for each replica. Each of these contain the predicted ASA, secondary structure and a pickle file with the distogram for each crop. It also contains an ensemble directory with the ensembled distograms.
background_distogram/: Contains 4 subfolders, one for each replica. Each of these contain a pickle file with the background distogram for each crop. It also contains an ensemble directory with the ensembled background distograms.
torsion/: Contains 1 subfolder as there was only a single replica. This folder contains contains the predicted ASA, secondary structure, backbone torsions and a pickle file with the distogram for each crop. It also contains an ensemble directory, which contains a copy of the predicted output as there is only a single replica in this case.
pasted/: Contains distograms obtained from the ensembled distograms by pasting. An RR contact map file is computed from this pasted distogram. This is the final distogram that was used in the subsequent AlphaFold folding pipeline in CASP13.

Data splits

We used a version of PDB downloaded on 2018-03-15. The train/test split can be found in the train_domains.txt and test_domains.txt files in this repository. The split is based on the CATH 2018-03-16 database.

Features

There is currently no plan to open source the feature generation code as it is tightly coupled to our internal infrastructure as well as external tools which we cannot open source.

Some features are needed only as placeholders to construct the model. These can be set to all zeros when running the inference. Such features are marked in the table below as not needed and you can just fill them with zeros when running inference.

The table below provides an overview of the features we used to make it possible to reconstruct our feature generation code. Some features that require more thorough explanation are explained in the section below the table. Note that NR stands for number of residues, i.e. the length of the amino acid sequence:

Name	Needed	TF DType	Shape	Description
`aatype`	✔️	float32	`(NR, 21)`	One hot encoding of amino acid types. The mapping is `ARNDCQEGHILKMFPSTWYVX -> range(21)`. See below.
`alpha_mask`	❌	int64	`(NR, 1)`	Mask for `alpha_positions`.
`alpha_positions`	❌	float32	`(NR, 3)`	`(x, y, z)` Carbon Alpha coordinates.
`beta_mask`	❌	int64	`(NR, 1)`	Mask for `beta_positions`.
`beta_positions`	❌	float32	`(NR, 3)`	`(x, y, z)` Carbon Beta coordinates.
`between_segment_residues`	❌	int64	`(NR, 1)`	The number of between segment residues (BSR) at the next position. E.g. `ABCXXD` (`XX` is BSR) would be `[0,0,2,0]`.
`chain_name`	❌	string	`(1)`	The chain name. E.g. 'A', 'B', ...
`deletion_probability`	✔️	float32	`(NR, 1)`	The fraction of sequences that had a deletion at this position. See below.
`domain_name`	❌	string	`(1)`	The domain name.
`gap_matrix`	✔️	float32	`(NR, NR, 1)`	Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. See below.
`hhblits_profile`	❌	float32	`(NR, 22)`	A profile (probability distribution over amino acid types) computed using HHBlits MSA. Encoding: 20 amino acids + 'X' + '-'.
`hmm_profile`	✔️	float32	`(NR, 30)`	The HHBlits HHM profile (from the `-ohhm` HHBlits output file). Asterisks in the output are replaced by 0.0. See below.
`key`	❌	string	`(1)`	The unique id of the protein.
`mutual_information`	✔️	float32	`(NR, NR, 1)`	The average product corrected mutual information. See https://doi.org/10.1093/bioinformatics/btm604.
`non_gapped_profile`	✔️	float32	`(NR, 21)`	A profile from amino acids only (discounting gaps). See below.
`num_alignments`	✔️	int64	`(NR, 1)`	The number of HHBlits multiple sequence alignments. Has to be repeated `NR` times. See below.
`num_effective_alignments`	❌	float32	`(1)`	The number of effective alignments (neff at 62 % sequence similarity).
`phi_angles`	❌	float32	`(NR, 1)`	The phi angles.
`phi_mask`	❌	int64	`(NR, 1)`	Mask for `phi_angles`.
`profile`	❌	float32	`(NR, 21)`	A profile (probability distribution over amino acid types) computed using PSI-BLAST. Equivalent to the output of ChkParse.
`profile_with_prior`	✔️	float32	`(NR, 22)`	Profile which takes into account priors and Blosum matrix. See equation 5 in https://doi.org/10.1093/nar/25.17.3389.
`profile_with_prior_without_gaps`	✔️	float32	`(NR, 21)`	Same as `profile_with_prior` but without gaps included.
`pseudo_bias`	✔️	float32	`(NR, 22)`	The bias computed in the MSA pseudolikelihood computation.
`pseudo_frob`	✔️	float32	`(NR, NR, 1)`	Frobenius norm of `pseudolikelihood` (gaps not included). Similar to the output of CCMPred.
`pseudolikelihood`	✔️	float32	`(NR, NR, 484)`	The weights computed in the MSA pseudolikelihood computation.
`psi_angles`	❌	float32	`(NR, 1)`	The psi angles.
`psi_mask`	❌	int64	`(NR, 1)`	Mask for `psi_angles`.
`residue_index`	✔️	int64	`(NR, 1)`	Index of each residue giong from 0 to `NR - 1`. See below.
`resolution`	❌	float32	`(1)`	The protein structure resolution.
`reweighted_profile`	✔️	float32	`(NR, 22)`	Profile where sequences are reweighted to weight rarer sequences higher. See below.
`sec_structure`	❌	int64	`(NR, 8)`	Secondary structure generated by DSSP and one-hot encoded by the mapping `-HETSGBI -> range(8)`.
`sec_structure_mask`	❌	int64	`(NR, 1)`	Mask for `sec_structure_mask`.
`seq_length`	✔️	int64	`(NR, 1)`	The length of the amino acid sequence. Has to be repeated `NR` times. See below.
`sequence`	✔️	string	`(1)`	The amino acid sequence (1-letter amino acid encoding). See below.
`solv_surf`	❌	float32	`(NR, 1)`	Relative solvent accessible area computed using DSSP and then normalized by amino acid maximum accessibility.
`solv_surf_mask`	❌	int64	`(NR, 1)`	Mask for `solv_surf`.
`superfamily`	❌	string	`(1)`	The superfamily CATH code.

More details on needed features

`aatype`

One hot encoding of amino acid types. The following code converts an amino acid string into the one-hot encoding:

def sequence_to_onehot(sequence):
  """Maps the given sequence into a one-hot encoded matrix."""
  mapping = {aa: i for i, aa in enumerate('ARNDCQEGHILKMFPSTWYVX')}
  num_entries = max(mapping.values()) + 1
  one_hot_arr = np.zeros((len(sequence), num_entries), dtype=np.int32)

  for aa_index, aa_type in enumerate(sequence):
    aa_id = mapping[aa_type]
    one_hot_arr[aa_index, aa_id] = 1

  return one_hot_arr

`deletion_probability`

The fraction of sequences that had a deletion (denoteby by a lowercase letter in the A3M format) at this position. Example:

MSA = A c r k
      A C r k
      A C R k
deletion_probability = [[0], [1/3], [2/3], [1]]

`gap_matrix`

Covariation signal from the gapped states, this gives an indication of the variance induced due to gapped states. Example:

MSA = A A C D B D F J G B M A
      - - C D B D F J G B M A
      A A C D B - - J G B M A
gap_count = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
             [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]]
gap_matrix = np.matmul(gap_count.T, gap_count)

`hmm_profile`

The HHBlits HHM profile (from the -ohhm HHBlits output file). Asterisks in the output are replaced by 0.0. The following code parses the HHM file:

def extract_hmm_profile(hhm_file, sequence, asterisks_replace=0.0):
  """Extracts information from the hmm file and replaces asterisks."""
  profile_part = hhm_file.split('#')[-1]
  profile_part = profile_part.split('\n')
  whole_profile = [i.split() for i in profile_part]
  # This part strips away the header and the footer.
  whole_profile = whole_profile[5:-2]
  gap_profile = np.zeros((len(sequence), 10))
  aa_profile = np.zeros((len(sequence), 20))
  count_aa = 0
  count_gap = 0
  for line_values in whole_profile:
    if len(line_values) == 23:
      # The first and the last values in line_values are metadata, skip them.
      for j, t in enumerate(line_values[2:-1]):
        aa_profile[count_aa, j] = (
            2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
      count_aa += 1
    elif len(line_values) == 10:
      for j, t in enumerate(line_values):
        gap_profile[count_gap, j] = (
            2**(-float(t) / 1000.) if t != '*' else asterisks_replace)
      count_gap += 1
    elif not line_values:
      pass
    else:
      raise ValueError('Wrong length of line %s hhm file. Expected 0, 10 or 23'
                       'got %d'%(line_values, len(line_values)))
  hmm_profile = np.hstack([aa_profile, gap_profile])
  assert len(hmm_profile) == len(sequence)
  return hmm_profile

`non_gapped_profile`

A profile from amino acids only (discounting gaps).

def non_gapped_profile(amino_acids):
  """Computes a profile from only amino acids and discounting gaps."""
  profile = np.zeros(21)
  for aa in amino_acids:
    if aa != 21:  # Ignore gaps.
      profile[aa] += 1.
  return profile / np.sum(profile)

`num_alignments`

The number of HHBlits multiple sequence alignments. Has to be repeated NR times. For example, if there are 10 alignments for a sequence of length 8, then num_alignments = [[10], [10], [10], [10], [10], [10], [10], [10]].

`pseudo_frob`

This feature collapses the 484 channels of pseudolikelihood into one by taking the Frobenius norm of the 484 channels and then subtracting the Average Product Correction of the computed Frobenius norm. The Frobenius norm does not take into account the 22nd gap state.

`pseudolikelihood`

Parameters of a Potts Model coupling the amino acid types of particular residues estimated by pseudolikelihood. See https://doi.org/10.1103/PhysRevE.87.012707 for more details.

`residue_index`

Index of each residue giong from 0 to NR - 1. For example, the sequence AACR has residue_index = [[0], [1], [2], [3]].

`reweighted_profile`

Profile where sequences are reweighted to weight rarer sequences higher. The sequence weights are calculated like this:

def sequence_weights(sequence_matrix):
  """Compute sequence reweighting to weight rarer sequences higher."""
  num_rows, num_res = sequence_matrix.shape
  cutoff = 0.62 * num_res
  weights = np.ones(num_rows, dtype=np.float32)
  for i in range(num_rows):
    for j in range(i + 1, num_rows):
      similarity = (sequence_matrix[i] == sequence_matrix[j]).sum()
      if similarity > cutoff:
        weights[i] += 1
        weights[j] += 1
  return 1.0 / weights

`seq_length`

The length of the amino acid sequence. Has to be repeated NR times. For example, the sequence AACR would have seq_length = [[4], [4], [4], [4]].

`sequence`

The amino acid sequence (1-letter amino acid encoding). For example, a protein with Alanine, Lysine, Arginine has sequence = 'AKR'.

Disclaimer

This is not an official Google product.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

alphafold_casp13

alphafold_casp13

README.md

AlphaFold

Setup

Dependencies

Data

Input data

Model checkpoints

Distogram prediction

Running the system

Data splits

Features

More details on needed features

`aatype`

`deletion_probability`

`gap_matrix`

`hmm_profile`

`non_gapped_profile`

`num_alignments`

`pseudo_frob`

`pseudolikelihood`

`residue_index`

`reweighted_profile`

`seq_length`

`sequence`

Disclaimer

Files

alphafold_casp13

Directory actions

More options

Directory actions

More options

Latest commit

History

alphafold_casp13

Folders and files

parent directory

README.md

AlphaFold

Setup

Dependencies

Data

Input data

Model checkpoints

Distogram prediction

Running the system

Data splits

Features

More details on needed features

aatype

deletion_probability

gap_matrix

hmm_profile

non_gapped_profile

num_alignments

pseudo_frob

pseudolikelihood

residue_index

reweighted_profile

seq_length

sequence

Disclaimer

`aatype`

`deletion_probability`

`gap_matrix`

`hmm_profile`

`non_gapped_profile`

`num_alignments`

`pseudo_frob`

`pseudolikelihood`

`residue_index`

`reweighted_profile`

`seq_length`

`sequence`