Jupyter notebooks to download and analyze the Betalactamase database with protein language models

miangoar/Betalactamase-analysis-with-machine-learning

Large-scale analysis of the β-lactamase sequence space with protein language models

This repository contains the dataset description and Jupyter notebooks used to analyze sequences from the BetaLactamase DataBase (BLDB) using protein language models.

Dataset description

The main dataset and 2D representations computed with PCA, t-SNE, and UMAP are available through Zenodo.

The main dataset contains 29,445 rows and 82 columns. The rows represent all sequences retrieved from the BLDB as of January 8, 2024. The columns contain information processed from the BLDB, including taxonomy annotated against the Genome Taxonomy Database (GTDB RS207), per-protein embeddings derived from five protein language models (ESM-1b, ESM2-650, ESM2-3b, CARP-640M, ProtTrans-t5-xl-u50), functional annotations estimated with Biopython, sequence quality filters applied to select sequences for the analysis, annotations from the AlphaFold Database (AFDB) for the available structures, and secondary structure annotations generated with pyDSSP from the structures predicted by AlphaFold2.

Column name Description
BLDB info ———————————————————————————————
#name header information (e.g. gi | 30230644 | gb | AAP20891.1 | TEM-1 | class A broad-spectrum beta-lactamase TEM-1)
seq amino acid sequence
length sequence length
filename filename from BLDB (e.g. A-TEM-1-prot.fasta, C-PDC-1-prot.fasta, B3-GOB-1-prot.fasta)
bla_class molecular classification parsed from the filename (i.e. A, C, D and B1, B2, B3)
protein_name protein name parsed from the filename (e.g. TEM-1, PDC-1, GOB-1)
protein_family_filename protein family parsed from the filename. It is kept as a separate column because, apart from the recognized families, the BLDB assigns putative families labeled as AFAM{numerical_id}
superfamily SBL for serine beta-lactamases (i.e. A, C and D) and MBL for metallo-beta-lactamases (i.e. B1, B2 and B3)
protein_family protein family parsed from the header and validated against the list of recognized families by the BLDB. Since this information is validated, it was used as the default for the analysis
top_fam keeps the family name if the protein family is among the 20 most abundant families in its molecular class; otherwise, the label No top or Unknown was assigned
seq_id sequence ID for easy manipulation (e.g. bldb_000001 ... bldb_029445)
ambler_class, alternative_protein_name, subfamily, genpept_id, genbank_id, pubmed_id, seq_url, pdb_structures, mutants, phenotype, functional_info, source curated annotations by the BLDB parsed using the protein name. The BLDB only provides annotations for 7,926 sequences. The original column Natural (N) or Acquired (A) from the BLDB was renamed as source. The subfamily column is particularly important for class D beta-lactamases, as the OXA family is very abundant in the dataset
GTDB annotations ———————————————————————————————
bitscore bitscore value estimated by Diamond2 against the protein sequences in the GTDB
Domain, Phylum, Class, Order, Family, Genus, Species taxonomic annotations from the GTDB using the LCA algorithm implemented by GTDB2DIAMOND
Philippon phylogenetic groups ———————————————————————————————
phylo_group phylogenetic groups proposed by Philippon et al. 2019 for class A beta-lactamases. The groups were annotated using the representative enzyme families extracted from supplementary table S1.
phylo_group_genus, phylo_group_sp phylogenetic groups proposed for class A annotated using the set of putative genus and species names
bla_subclass subclass annotations for classes A2 and C2. A2 was annotated using phylogenetic group A, as suggested by Philippon et al. 2019. C2 was annotated using the three suggested genera for this class (Legionella, Bradyrhizobium, Parachlamydia), as proposed by Philippon et al. 2022
Embeddings ———————————————————————————————
esm1b, esm2_650m, esm2_3b, carp, t5xlu50 per-protein embeddings computed from the last layer of each protein language model using their respective provided scripts: prott5_embedder.py for t5xlu50, extract.py for the ESM family, and extract.py for CARP
Biopython annotations ———————————————————————————————
molecular_weight sum of the molecular mass (in Daltons) of each residue in the protein
aromaticity fraction of amino acids with aromatic properties (F, W, Y) in the sequence
instability protein instability estimated based on the frequency of dipeptides associated with low and high stability observed in stable proteins. Values >40 indicate that the protein is unstable (has a short half-life)
gravy estimated as the sum of the hydropathy of each residue divided by the sequence length
isoelectric_point estimation of the pH at which the protein has no net electric charge
entropy estimation of the diversity of sequence composition. Low values indicate lower diversity of amino acids in the composition. A sequence with minimum entropy consists of a single type of residue, while a sequence with maximum entropy contains all possible residues in equal proportions
helix, turn, sheet estimates based on the fraction of residues associated with helix regions (V, I, Y, F, W, L), beta sheets (E, M, A, L), and turns (N, P, G, S)
Sequence quality filters ———————————————————————————————
pass_the_filter sequences were labeled yes or no based on two filters: (1) the header does not indicate a partial sequence, and (2) to remove outliers, the sequence length falls within ±30% of the median length of its molecular class. In total, 801 sequences did not pass the filters
is_clust_rep_30, is_clust_rep_60, is_clust_rep_90 sequences were labeled yes or no depending on whether they are representative sequences of clusters built at 30%, 60%, and 90% sequence identity with the mmseqs easy-cluster pipeline (with parameters --min-seq-id 0.3, -c 0.8, --cov-mode 1). The clustering was performed separately for each molecular class, using only sequences that are neither partial nor length outliers
AFDB annotations ———————————————————————————————
has_af2_model sequences were labeled yes or no depending on whether they are identical to a sequence in the set of curated sequences from the UniProt DB for their respective superfamily, using Pfam clan IDs (CL0013 for the PBP-like and CL0381 for the metallo-hydrolase superfamily). A total of 12,688 sequences have an AF2 model
model, mean_plddt, resid_plddt, entryId, gene, uniprotAccession, uniprotId, uniprotDescription, taxId, organismScientificName, uniprotStart, uniprotEnd, modelCreatedDate, latestVersion, allVersions, isReviewed, isReferenceProteome, cifUrl, bcifUrl, pdbUrl, paeImageUrl, paeDocUrl, dup_entry structure annotations from the UniProt-AFDB for the set of sequences that have a model predicted by AF2
PyDSSP annotations ———————————————————————————————
secondary_structure secondary structure representation
frec_turn, frec_helix, frec_beta proportion of turn, helix, and beta-sheet regions for each protein structure
simple_secondary_structure simplified representation of the secondary structure annotation in which consecutive repeated symbols are collapsed (e.g. ---EE-----HHHHH--EE -> -E-H-E)
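
Several of the derived columns described above follow simple rules that can be sketched in plain Python. The snippet below is only an illustration under stated assumptions, not the code used in the notebooks: all function names are hypothetical, filenames are assumed to follow the `<class>-<protein name>-prot.fasta` pattern seen in the examples, the entropy is assumed to be Shannon entropy in bits (the log base is not stated in the table), and the partial-sequence check on the header is left out.

```python
import math
import statistics
from collections import Counter
from itertools import groupby

SBL_CLASSES = {"A", "C", "D"}      # serine beta-lactamases
MBL_CLASSES = {"B1", "B2", "B3"}   # metallo-beta-lactamases


def parse_bldb_filename(filename):
    """Return (bla_class, protein_name, superfamily) for a BLDB FASTA filename.

    Assumes the '<class>-<protein name>-prot.fasta' pattern from the examples.
    """
    suffix = "-prot.fasta"
    stem = filename[:-len(suffix)] if filename.endswith(suffix) else filename
    bla_class, protein_name = stem.split("-", 1)
    if bla_class in SBL_CLASSES:
        superfamily = "SBL"
    elif bla_class in MBL_CLASSES:
        superfamily = "MBL"
    else:
        superfamily = "Unknown"  # hypothetical fallback label
    return bla_class, protein_name, superfamily


def shannon_entropy(seq):
    """Shannon entropy of the residue composition (in bits, an assumption)."""
    total = len(seq)
    return sum((n / total) * math.log2(total / n) for n in Counter(seq).values())


def simplify_secondary_structure(ss):
    """Collapse runs of identical symbols into a single symbol."""
    return "".join(symbol for symbol, _ in groupby(ss))


def within_length_range(length, class_lengths, tolerance=0.30):
    """True if length lies within +/-30% of the median length of its class."""
    median = statistics.median(class_lengths)
    return (1 - tolerance) * median <= length <= (1 + tolerance) * median


print(parse_bldb_filename("B3-GOB-1-prot.fasta"))           # ('B3', 'GOB-1', 'MBL')
print(simplify_secondary_structure("---EE-----HHHHH--EE"))  # -E-H-E
print(shannon_entropy("AAAA"), shannon_entropy("ACDE"))     # 0.0 2.0
print(within_length_range(150, [280, 286, 290, 300, 150]))  # False (toy lengths)
```

The length list in the last call is made up for the example; in the dataset the median would be computed over all sequences of one molecular class.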

Notebooks

You can use nbviewer to render the notebooks if GitHub cannot. Alternatively, open the notebook on GitHub and replace the "github.com" domain in the URL with "nbsanity.com".
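
The domain swap can also be scripted. The sketch below only rewrites the host part of a URL; the helper name is hypothetical, and the notebook URL in the example (branch name and file location) is an assumption used for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def to_nbsanity(github_url):
    """Replace the github.com host of a notebook URL with nbsanity.com."""
    parts = urlsplit(github_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    return urlunsplit(parts._replace(netloc="nbsanity.com"))


# Hypothetical notebook location inside this repository.
url = ("https://github.com/miangoar/Betalactamase-analysis-with-machine-learning"
       "/blob/main/bldb_04_PCA.ipynb")
print(to_nbsanity(url))
```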

Name Description
bldb_01_create_dataset download, process, and clean the data from BLDB. Also, add the annotations from UniProt-AFDB, taxonomy from GTDB, Biopython functional estimates, phylogenetic groups, and per-protein embeddings
bldb_02_embeddings merge the embedding representations into a single dataset
bldb_03_af2 cross-reference of the AFDB with the BLDB by matching their sequences to retrieve structural annotations from AF2-predicted structures
bldb_04_PCA compute PCA representations
bldb_04_PCA_rep90 compute PCA representations only for representative sequences clustered at 90% sequence identity
bldb_tsne_esm2_650m_sbl example of how to compute the tSNE representations in Google Colab for the SBLs with the ESM2-650M model. The same procedure was applied to the MBLs and to all other protein language models
bldb_05_tSNE_merge_csv merge tSNE representations for each model into a single dataset
bldb_06_tSNE tSNE plots by molecular classification
bldb_tSNE_rep90 tSNE plots using only representative sequences clustered at 90% sequence identity
bldb_umap_create compute UMAP representations and merge them into a single dataset
bldb_umap_map UMAP plots by molecular classification
bldb_07_tax_panel map the taxonomical information in the tSNE representations
bldb_08_fams_plots map the enzyme family information in the tSNE representations
bldb_09_class_a_foldseek map the class A phylo groups and perform the sequence and structure analysis with foldseek
bldb_10_class_c_foldseek_analysis map the subclasses of class C betalactamases and perform the sequence and structure analysis with foldseek
bldb_11_unsupervised distance-based analysis (cosine and euclidean) and clustering analysis (k-means and hierarchical clustering)
bldb_12_biochem map the biochemical information in the tSNE representations
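
As a rough illustration of the distance-based analysis listed for bldb_11_unsupervised, the sketch below computes cosine and Euclidean distances between per-protein embeddings. It is not taken from the notebook: the function names are hypothetical and the three-dimensional vectors are toy values, whereas the real embeddings have many more dimensions.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy embeddings keyed by seq_id-style labels (values are made up).
embeddings = {
    "bldb_000001": [0.12, -0.40, 0.88],
    "bldb_000002": [0.10, -0.35, 0.90],
    "bldb_000003": [-0.70, 0.55, 0.05],
}

for id_a, id_b in combinations(embeddings, 2):
    u, v = embeddings[id_a], embeddings[id_b]
    print(id_a, id_b,
          round(cosine_distance(u, v), 4),
          round(euclidean_distance(u, v), 4))
```

Cosine distance ignores vector length and compares direction only, while Euclidean distance is sensitive to both, so the two measures can rank the same pairs differently.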
