Large-scale analysis of the β-lactamase sequence space with protein language models

This repository contains the dataset description and the Jupyter notebooks used to analyze sequences from the BetaLactamase DataBase (BLDB) with protein language models.

Content list

Dataset description

The main dataset and 2D representations computed with PCA, t-SNE, and UMAP are available through Zenodo:

https://zenodo.org/records/14743325

The main dataset contains 29,445 rows and 82 columns. The rows represent all sequences retrieved from the BLDB as of January 8, 2024. The columns contain information processed from the BLDB, including taxonomy annotated against the Genome Taxonomy Database (GTDB RS207), per-protein embeddings derived from five protein language models (ESM-1b, ESM2-650M, ESM2-3B, CARP-640M, ProtTrans-t5-xl-u50), functional annotations estimated with Biopython, sequence quality filters applied to select sequences for the analysis, annotations from the AlphaFold Database (AFDB) for the available structures, and secondary structure annotations generated with pyDSSP from the structures predicted by AlphaFold2.
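
The length-based part of the quality filters can be illustrated with a short sketch. It is not the code from the notebooks: the function name `label_pass_the_filter`, the toy records, the substring test for `partial` in the header, and the choice to compute the median over all sequences of a class are assumptions made for illustration. Only the ±30% window around the per-class median and the `yes`/`no` labels come from the description of the `pass_the_filter` column.

```python
from collections import defaultdict
from statistics import median


def label_pass_the_filter(records, tolerance=0.30):
    """Return a 'yes'/'no' label per record (hypothetical helper).

    Assumptions: each record is a dict with '#name', 'bla_class' and 'seq';
    a header containing the word 'partial' marks a partial sequence; the
    median length is computed over all sequences of the same class.
    """
    # Collect sequence lengths per molecular class.
    lengths_by_class = defaultdict(list)
    for rec in records:
        lengths_by_class[rec["bla_class"]].append(len(rec["seq"]))
    medians = {cls: median(vals) for cls, vals in lengths_by_class.items()}

    labels = []
    for rec in records:
        med = medians[rec["bla_class"]]
        length = len(rec["seq"])
        # Keep lengths within +/- tolerance of the class median.
        in_range = (1 - tolerance) * med <= length <= (1 + tolerance) * med
        is_partial = "partial" in rec["#name"].lower()
        labels.append("yes" if in_range and not is_partial else "no")
    return labels


# Toy records (not real BLDB entries).
toy_records = [
    {"#name": "toy full-length enzyme", "bla_class": "A", "seq": "M" * 286},
    {"#name": "toy enzyme, partial", "bla_class": "A", "seq": "M" * 280},
    {"#name": "toy short outlier", "bla_class": "A", "seq": "M" * 100},
    {"#name": "toy full-length variant", "bla_class": "A", "seq": "M" * 290},
]
labels = label_pass_the_filter(toy_records)
print(labels)
```

In this toy set the class median is 283 residues, so the 100-residue sequence falls outside the window and the record flagged as partial is rejected regardless of its length.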

Column name	Description
BLDB info	———————————————————————————————
#name	header information (e.g. `gi \| 30230644 \| gb \| AAP20891.1 \| TEM-1 \| class A broad-spectrum beta-lactamase TEM-1`)
seq	amino acid sequence
length	sequence length
filename	file name in the BLDB (e.g. `A-TEM-1-prot.fasta`, `C-PDC-1-prot.fasta`, `B3-GOB-1-prot.fasta`)
bla_class	molecular class parsed from the filename (i.e. A, C, D, B1, B2 or B3)
protein_name	protein name parsed from the filename (e.g. TEM-1, PDC-1, GOB-1)
protein_family_filename	protein family parsed from the filename. It is kept as a separate column because, apart from the recognized families, the BLDB assigns putative families labeled `AFAM{numerical_id}`
superfamily	SBL for serine β-lactamases (i.e. classes A, C and D) and MBL for metallo-β-lactamases (i.e. classes B1, B2 and B3)
protein_family	protein family parsed from the header and validated against the list of recognized families by the BLDB. Since this information is validated, it was used as the default for the analysis
top_fam	family name if the protein family is among the 20 most abundant families of its molecular class; otherwise, the labels `No top` and `Unknown` were assigned
seq_id	sequence ID for easy manipulation (e.g. `bldb_000001` ... `bldb_029445`)
ambler_class alternative_protein_name subfamily genpept_id genbank_id pubmed_id seq_url pdb_structures mutants phenotype functional_info source	annotations curated by the BLDB, matched using the protein name. The BLDB provides annotations for only 7,926 sequences. The original BLDB column `Natural (N) or Acquired (A)` was renamed `source`. The `subfamily` column is particularly important for class D β-lactamases, as the OXA family is very abundant in the dataset
GTDB annotations	———————————————————————————————
bitscore	bitscore value estimated by Diamond2 against the protein sequences in the GTDB
Domain Phylum Class Order Family Genus Species	taxonomic annotations from the GTDB using the LCA algorithm implemented by GTDB2DIAMOND
Philippon phylogenetic groups	———————————————————————————————
phylo_group	phylogenetic groups proposed by Philippon et al. 2019 for class A β-lactamases. The groups were annotated using the representative enzyme families extracted from supplementary table S1
phylo_group_genus phylo_group_sp	phylogenetic groups proposed for class A, annotated using the set of putative genus and species names
bla_subclass	subclass annotations for subclasses A2 and C2. A2 was annotated using phylogenetic group A, as suggested by Philippon et al. 2019. C2 was annotated using the three genera suggested for this subclass (Legionella, Bradyrhizobium, Parachlamydia), as proposed by Philippon et al. 2022
Embeddings	———————————————————————————————
esm1b esm2_650m esm2_3b carp t5xlu50	per-protein embeddings computed from the last layer of each protein language model using the scripts provided with each model: prott5_embedder.py for t5xlu50, extract.py for the ESM family, and extract.py for CARP
Biopython annotations	———————————————————————————————
molecular_weight	sum of the molecular mass (in Daltons) of each residue in the protein
aromaticity	fraction of amino acids with aromatic properties (F, W, Y) in the sequence
instability	protein instability estimated based on the frequency of dipeptides associated with low and high stability observed in stable proteins. Values >40 indicate that the protein is unstable (has a short half-life)
gravy	estimated as the sum of the hydropathy of each residue divided by the sequence length
isoelectric_point	estimation of the pH at which the protein has no net electric charge
entropy	estimation of the diversity of sequence composition. Low values indicate lower diversity of amino acids in the composition. A sequence with minimum entropy consists of a single type of residue, while a sequence with maximum entropy contains all possible residues in equal proportions
helix, turn, sheet	estimates based on the fraction of residues associated with helix regions (V, I, Y, F, W, L), beta sheets (E, M, A, L), and turns (N, P, G, S)
Sequence quality filters	———————————————————————————————
pass_the_filter	sequences were labeled `yes` if they pass both of the following filters, and `no` otherwise: (1) their header does not indicate a partial sequence, and (2) their length falls within ±30% of the median length of their molecular class, which removes length outliers. In total, 801 sequences did not pass the filters
is_clust_rep_30 is_clust_rep_60 is_clust_rep_90	sequences were labeled `yes` if they are cluster representatives at 30%, 60%, or 90% sequence identity, and `no` otherwise. Clustering used the mmseqs easy-cluster pipeline (with parameters `--min-seq-id 0.3`, `-c 0.8`, `--cov-mode 1`) and was performed separately for each molecular class, using only the sequences that are neither partial nor length outliers
AFDB annotations	———————————————————————————————
has_af2_model	sequences were labeled `yes` if they are identical to a sequence in the set of curated UniProt sequences for their respective superfamily, selected using PFAM clan IDs (CL0013 for the PBP-like superfamily and CL0381 for the metallo-hydrolase superfamily), and `no` otherwise. A total of 12,688 sequences have an AF2 model
model mean_plddt resid_plddt entryId gene uniprotAccession uniprotId uniprotDescription taxId organismScientificName uniprotStart uniprotEnd modelCreatedDate latestVersion allVersions isReviewed isReferenceProteome cifUrl bcifUrl pdbUrl paeImageUrl paeDocUrl dup_entry	structure annotations from the UniProt-AFDB for the set of sequences that have a model predicted by AF2
PyDSSP annotations	———————————————————————————————
secondary_structure	secondary structure representation
frec_turn frec_helix frec_beta	proportion of turn, helix and beta-sheet regions in each protein structure
simple_secondary_structure	simplified, non-redundant representation of the secondary structure annotation (e.g. `---EE-----HHHHH--EE` -> `-E-H-E`)
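
The simplification shown for `simple_secondary_structure` amounts to collapsing each run of identical symbols into a single symbol. The sketch below reproduces the example from the table and also computes per-symbol proportions in the spirit of the `frec_*` columns. The function names are hypothetical, and how each symbol maps onto `frec_turn`, `frec_helix` and `frec_beta` in the dataset is not stated here, so the code reports proportions by symbol only.

```python
from itertools import groupby


def simplify_secondary_structure(ss):
    """Collapse consecutive repeats, e.g. '--EE-' becomes '-E-'."""
    # groupby yields one group per run of identical characters.
    return "".join(symbol for symbol, _run in groupby(ss))


def symbol_proportions(ss):
    """Proportion of each symbol in a secondary structure string."""
    if not ss:
        return {}
    return {symbol: ss.count(symbol) / len(ss) for symbol in sorted(set(ss))}


example = "---EE-----HHHHH--EE"
simplified = simplify_secondary_structure(example)
proportions = symbol_proportions(example)
print(simplified)  # -E-H-E
print(proportions)
```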

Notebooks

You can use nbviewer to render the notebooks if GitHub cannot display them. Alternatively, open the notebook on GitHub and replace the "github.com" domain in the URL with "nbsanity.com".
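
The domain swap can also be scripted, for instance to build links for several notebooks at once. This is a minimal sketch: the helper name `to_nbsanity` is hypothetical, and the `blob/main` branch segment and the `.ipynb` extension in the example URL are assumptions about the repository layout.

```python
from urllib.parse import urlsplit, urlunsplit


def to_nbsanity(github_url):
    """Swap the github.com domain for nbsanity.com (hypothetical helper)."""
    parts = urlsplit(github_url)
    if parts.netloc != "github.com":
        raise ValueError(f"not a github.com URL: {github_url}")
    # Only the domain changes; path, query and fragment are kept as-is.
    return urlunsplit(parts._replace(netloc="nbsanity.com"))


# Example URL; the branch name and file extension are assumed.
url = (
    "https://github.com/miangoar/Betalactamase-analysis-with-machine-learning"
    "/blob/main/notebooks/bldb_04_PCA.ipynb"
)
rendered = to_nbsanity(url)
print(rendered)
```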

Name	Description
bldb_01_create_dataset	download, process, and clean the data from BLDB. Also, add the annotations from UniProt-AFDB, taxonomy from GTDB, Biopython functional estimates, phylogenetic groups, and per-protein embeddings
bldb_02_embeddings	merge the embedding representations into a single dataset
bldb_03_af2	cross-reference of the AFDB with the BLDB by matching their sequences to retrieve structural annotations from AF2-predicted structures
bldb_04_PCA	compute PCA representations
bldb_04_PCA_rep90	compute PCA representations only for representative sequences clustered at 90% sequence identity
bldb_tsne_esm2_650m_sbl	example of how to compute the tSNE representations in Google Colab for the SBLs with the ESM2-650M model. The same procedure was applied to the MBLs and to all other protein language models
bldb_05_tSNE_merge_csv	merge tSNE representations for each model into a single dataset
bldb_06_tSNE	tSNE plots by molecular classification
bldb_tSNE_rep90	tSNE plots using only representative sequences clustered at 90% sequence identity
bldb_umap_create	compute UMAP representations and merge them into a single dataset
bldb_umap_map	UMAP plots by molecular classification
bldb_07_tax_panel	map the taxonomic information onto the tSNE representations
bldb_08_fams_plots	map the enzyme family information onto the tSNE representations
bldb_09_class_a_foldseek	map the class A phylo groups and perform the sequence and structure analysis with foldseek
bldb_10_class_c_foldseek_analysis	map the subclasses of class C betalactamases and perform the sequence and structure analysis with foldseek
bldb_11_unsupervised	distance-based analysis (cosine and euclidean) and clustering analysis (k-means and hierarchical clustering)
bldb_12_biochem	map the biochemical information onto the tSNE representations
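
For reference, the two distance measures named for `bldb_11_unsupervised` can be written in a few lines. This is a plain-Python sketch for illustration, not the notebook's implementation (which may rely on a numerical library); the function names and the toy vectors are hypothetical stand-ins for per-protein embeddings.

```python
import math


def _check_same_length(u, v):
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")


def euclidean_distance(u, v):
    """Straight-line distance between two embedding vectors."""
    _check_same_length(u, v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; depends on direction, not on vector norm."""
    _check_same_length(u, v)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy 2-dimensional "embeddings" (real embeddings have many more dimensions).
emb_a = [1.0, 2.0]
emb_b = [2.0, 4.0]  # same direction as emb_a, twice the norm
emb_c = [-2.0, 1.0]  # orthogonal to emb_a

print(euclidean_distance(emb_a, emb_b))
print(cosine_distance(emb_a, emb_b))
print(cosine_distance(emb_a, emb_c))
```

The toy vectors show why the two measures can disagree: `emb_a` and `emb_b` point in the same direction, so their cosine distance is essentially zero even though their Euclidean distance is not.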

miangoar/Betalactamase-analysis-with-machine-learning

Folders and files

Latest commit

History

Repository files navigation

Large-scale analysis of the β-lactamase sequence space with protein language models

Content list

Dataset description

Notebooks

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages