The main dataset and 2D representations computed with PCA, t-SNE, and UMAP are available through Zenodo:
The main dataset contains 29,445 rows and 82 columns. The rows represent all sequences retrieved from the BLDB as of January 8, 2024. The columns contain information processed from the BLDB, together with the taxonomy of each sequence annotated against the Genome Taxonomy Database (GTDB RS207), per-protein embeddings derived from five protein language models (ESM-1b, ESM2-650M, ESM2-3b, CARP-640M, ProtTrans-t5-xl-u50), functional annotations estimated with Biopython, the sequence quality filters applied to select sequences for the analysis, annotations from the AlphaFold Database (AFDB) for the available structures, and secondary structure annotations generated with pyDSSP from the structures predicted by AlphaFold2.
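
As a rough illustration of the sequence quality filters mentioned above, the sketch below labels toy records in the way the pass_the_filter column is described: a sequence passes if its header does not mark it as partial and its length lies within ±30% of the median length of its molecular class. The records, the `header` key, the `label_pass_the_filter` helper, and the rule of spotting partial sequences by the word "partial" in the header are assumptions made for this example and are not taken from the project's notebooks. Computing the median over all sequences of a class is also an assumption.

```python
from statistics import median

# Toy records standing in for dataset rows (values are made up for illustration).
records = [
    {"seq_id": "toy_1", "bla_class": "A", "header": "TEM-like", "length": 286},
    {"seq_id": "toy_2", "bla_class": "A", "header": "TEM-like", "length": 290},
    {"seq_id": "toy_3", "bla_class": "A", "header": "TEM-like (partial)", "length": 120},
    {"seq_id": "toy_4", "bla_class": "A", "header": "fusion protein", "length": 600},
    {"seq_id": "toy_5", "bla_class": "C", "header": "PDC-like", "length": 397},
    {"seq_id": "toy_6", "bla_class": "C", "header": "PDC-like", "length": 381},
]


def label_pass_the_filter(rows, tolerance=0.30):
    """Return {seq_id: "yes" | "no"} using a partial-header check and a length window.

    Hypothetical helper: the length window is median * (1 - tolerance) to
    median * (1 + tolerance), with the median taken per molecular class.
    """
    # Collect sequence lengths per molecular class.
    lengths_by_class = {}
    for row in rows:
        lengths_by_class.setdefault(row["bla_class"], []).append(row["length"])
    medians = {cls: median(values) for cls, values in lengths_by_class.items()}

    labels = {}
    for row in rows:
        med = medians[row["bla_class"]]
        in_range = (1 - tolerance) * med <= row["length"] <= (1 + tolerance) * med
        # Assumption: partial sequences are flagged by the word "partial" in the header.
        is_partial = "partial" in row["header"].lower()
        labels[row["seq_id"]] = "yes" if in_range and not is_partial else "no"
    return labels


labels = label_pass_the_filter(records)
for seq_id, label in labels.items():
    print(seq_id, label)
```

In this toy set, the partial sequence and the 600-residue outlier in class A are labeled no, while the remaining sequences are labeled yes.
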
Column name | Description |
---|---|
BLDB info | ——————————————————————————————— |
#name | header information (e.g. gi\|30230644\|gb\|AAP20891.1\|TEM-1\|class A broad-spectrum beta-lactamase TEM-1) |
seq | amino acid sequence |
length | sequence length |
filename | filename from the BLDB (e.g. A-TEM-1-prot.fasta, C-PDC-1-prot.fasta, B3-GOB-1-prot.fasta) |
bla_class | molecular classification parsed from the filename (i.e. A, C, D, B1, B2, or B3) |
protein_name | protein name parsed from the filename (e.g. TEM-1, PDC-1, GOB-1) |
protein_family_filename | protein family parsed from the filename. This column is kept because, in addition to the recognized families, the BLDB assigns putative families labeled AFAM{numerical_id} |
superfamily | SBL for serine beta-lactamases (i.e. classes A, C, and D) and MBL for metallo-beta-lactamases (i.e. subclasses B1, B2, and B3) |
protein_family | protein family parsed from the header and validated against the list of recognized families by the BLDB. Since this information is validated, it was used as the default for the analysis |
top_fam | name of the protein family if it is among the 20 most abundant families in its molecular class; otherwise, the label No top or Unknown is assigned |
seq_id | sequence ID for easy manipulation (e.g. bldb_000001 ... bldb_029445) |
ambler_class, alternative_protein_name, subfamily, genpept_id, genbank_id, pubmed_id, seq_url, pdb_structures, mutants, phenotype, functional_info, source | curated annotations from the BLDB, parsed using the protein name. The BLDB provides annotations for only 7,926 sequences. The original BLDB column Natural (N) or Acquired (A) was renamed source. The subfamily column is particularly important for class D beta-lactamases, as the OXA family is very abundant in the dataset |
GTDB annotations | ——————————————————————————————— |
bitscore | bitscore value estimated by Diamond2 against the protein sequences in the GTDB |
Domain, Phylum, Class, Order, Family, Genus, Species | taxonomic annotations from the GTDB using the LCA algorithm implemented by GTDB2DIAMOND |
Philippon phylogenetic groups | ——————————————————————————————— |
phylo_group | phylogenetic groups proposed by Philippon et al. 2019 for class A beta-lactamases. The groups were annotated using the representative enzyme families extracted from supplementary table S1 |
phylo_group_genus, phylo_group_sp | phylogenetic groups proposed for class A, annotated using the set of putative genus and species names |
bla_subclass | subclass annotations for subclasses A2 and C2. A2 was annotated using phylogenetic group A, as suggested by Philippon et al. 2019. C2 was annotated using the three genera suggested for this subclass (Legionella, Bradyrhizobium, Parachlamydia), as proposed by Philippon et al. 2022 |
Embeddings | ——————————————————————————————— |
esm1b, esm2_650m, esm2_3b, carp, t5xlu50 | per-protein embeddings computed from the last layer of each protein language model using the scripts provided with each model: prott5_embedder.py for t5xlu50, extract.py for the ESM family, and extract.py for CARP |
Biopython annotations | ——————————————————————————————— |
molecular_weight | sum of the molecular mass (in Daltons) of each residue in the protein |
aromaticity | fraction of amino acids with aromatic properties (F, W, Y) in the sequence |
instability | protein instability estimated based on the frequency of dipeptides associated with low and high stability observed in stable proteins. Values >40 indicate that the protein is unstable (has a short half-life) |
gravy | estimated as the sum of the hydropathy of each residue divided by the sequence length |
isoelectric_point | estimation of the pH at which the protein has no net electric charge |
entropy | estimation of the diversity of sequence composition. Low values indicate lower diversity of amino acids in the composition. A sequence with minimum entropy consists of a single type of residue, while a sequence with maximum entropy contains all possible residues in equal proportions |
helix, turn, sheet | estimates based on the fraction of residues associated with helix regions (V, I, Y, F, W, L), beta sheets (E, M, A, L), and turns (N, P, G, S) |
Sequence quality filters | ——————————————————————————————— |
pass_the_filter | sequences were labeled yes or no based on two filters: (1) the header does not indicate a partial sequence, and (2) the sequence length falls within ±30% of the median length of its molecular class, which removes outliers. In total, 801 sequences did not pass the filters |
is_clust_rep_30, is_clust_rep_60, is_clust_rep_90 | sequences were labeled yes or no depending on whether they are representative sequences of clusters built at 30%, 60%, and 90% sequence identity with the mmseqs easy-cluster pipeline (with parameters --min-seq-id 0.3, -c 0.8, --cov-mode 1 for the 30% threshold). The clustering was performed for each molecular class, using only the sequences that are neither partial nor outliers |
AFDB annotations | ——————————————————————————————— |
has_af2_model | sequences were labeled yes or no depending on whether they are identical to a sequence in the set of curated UniProt sequences retrieved for their respective superfamilies using Pfam clan IDs (CL0013 for the PBP-like superfamily and CL0381 for the metallo-hydrolase superfamily). A total of 12,688 sequences have an AF2 model |
model, mean_plddt, resid_plddt, entryId, gene, uniprotAccession, uniprotId, uniprotDescription, taxId, organismScientificName, uniprotStart, uniprotEnd, modelCreatedDate, latestVersion, allVersions, isReviewed, isReferenceProteome, cifUrl, bcifUrl, pdbUrl, paeImageUrl, paeDocUrl, dup_entry | structure annotations from the UniProt-AFDB for the set of sequences that have a model predicted by AF2 |
PyDSSP annotations | ——————————————————————————————— |
secondary_structure | secondary structure representation |
frec_turn, frec_helix, frec_beta | proportion of turn, helix, and beta-sheet regions for each protein structure |
simple_secondary_structure | simplified representation of the secondary structure annotation in which consecutive repeats are collapsed (e.g. ---EE-----HHHHH--EE -> -E-H-E) |
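
The PyDSSP-derived columns above can be pictured with a minimal sketch. It collapses runs of identical symbols to obtain the simplified representation and computes per-state proportions from a secondary structure string. The function names are hypothetical, and the mapping of the `-` symbol to frec_turn is an assumption based on the three-symbol example in the table (`-`, `H`, `E`); the project's notebooks may compute these columns differently.

```python
from itertools import groupby


def simplify_secondary_structure(ss):
    """Collapse each run of identical symbols into a single symbol."""
    return "".join(symbol for symbol, _run in groupby(ss))


def structure_fractions(ss):
    """Return the proportion of positions assigned to each state.

    Assumption: '-' counts towards frec_turn, 'H' towards frec_helix,
    and 'E' towards frec_beta.
    """
    if not ss:
        raise ValueError("secondary structure string is empty")
    n = len(ss)
    return {
        "frec_turn": ss.count("-") / n,
        "frec_helix": ss.count("H") / n,
        "frec_beta": ss.count("E") / n,
    }


example = "---EE-----HHHHH--EE"  # example string from the table above
print(simplify_secondary_structure(example))
print(structure_fractions(example))
```

For the example string in the table, the first call reproduces the simplified form `-E-H-E`.
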
If GitHub cannot render a notebook, you can view it with nbviewer. Alternatively, open the notebook on GitHub and replace the "github.com" domain in the URL with "nbsanity.com".
Name | Description |
---|---|
bldb_01_create_dataset | download, process, and clean the data from BLDB. Also, add the annotations from UniProt-AFDB, taxonomy from GTDB, Biopython functional estimates, phylogenetic groups, and per-protein embeddings |
bldb_02_embeddings | merge the embedding representations into a single dataset |
bldb_03_af2 | cross-reference of the AFDB with the BLDB by matching their sequences to retrieve structural annotations from AF2-predicted structures |
bldb_04_PCA | compute PCA representations |
bldb_04_PCA_rep90 | compute PCA representations only for representative sequences clustered at 90% sequence identity |
bldb_tsne_esm2_650m_sbl | example of how to compute the tSNE representations in Google Colab for the SBLs with the ESM2-650M model. The same procedure was applied to the MBLs and to all other protein language models |
bldb_05_tSNE_merge_csv | merge tSNE representations for each model into a single dataset |
bldb_06_tSNE | tSNE plots by molecular classification |
bldb_tSNE_rep90 | tSNE plots using only representative sequences clustered at 90% sequence identity |
bldb_umap_create | compute UMAP representations and merge them into a single dataset |
bldb_umap_map | UMAP plots by molecular classification |
bldb_07_tax_panel | map the taxonomical information in the tSNE representations |
bldb_08_fams_plots | map the enzyme family information in the tSNE representations |
bldb_09_class_a_foldseek | map the class A phylo groups and perform the sequence and structure analysis with foldseek |
bldb_10_class_c_foldseek_analysis | map the subclasses of class C beta-lactamases and perform the sequence and structure analysis with foldseek |
bldb_11_unsupervised | distance-based analysis (cosine and euclidean) and clustering analysis (k-means and hierarchical clustering) |
bldb_12_biochem | map the biochemical information in the tSNE representations |
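
To give a feel for the distance-based analysis listed for bldb_11_unsupervised, here is a minimal sketch of cosine and Euclidean distances between per-protein embeddings. The three-dimensional toy vectors and the function names are assumptions for illustration only. The real embeddings are much larger, and the notebook may rely on library routines instead of these hand-written functions.

```python
import math
from itertools import combinations


def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine similarity; ignores vector magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy embeddings (made up); keys are hypothetical sequence IDs.
embeddings = {
    "toy_a": [1.0, 0.0, 0.0],
    "toy_b": [2.0, 0.0, 0.0],  # same direction as toy_a, larger magnitude
    "toy_c": [0.0, 3.0, 0.0],  # orthogonal to toy_a
}

# Pairwise distances for every unordered pair of sequences.
pairwise = {}
for (id_1, vec_1), (id_2, vec_2) in combinations(embeddings.items(), 2):
    pairwise[(id_1, id_2)] = {
        "cosine": cosine_distance(vec_1, vec_2),
        "euclidean": euclidean_distance(vec_1, vec_2),
    }

for pair, distances in pairwise.items():
    print(pair, distances)
```

The toy pair toy_a and toy_b shows the difference between the two measures: their cosine distance is zero because they point in the same direction, while their Euclidean distance is not.
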