sky1ove/katlas

KATLAS

KATLAS is a repository containing Python tools to predict kinases given a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.

References: Please cite the appropriate papers if KATLAS is helpful to your research.

Reproduce datasets & figures

Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw

This requires installing the package with the dev extras: pip install 'python-katlas[dev]' -U

Web applications

Users can now run the analysis directly on the web without needing to code.

Check out our latest web platform: kinase-atlas.com

Tutorials on Colab

Install

pip install python-katlas -U

To use modules other than core, install with the dev extras: pip install 'python-katlas[dev]' -U

Import

from katlas.core import *

Quick start

We provide two methods to score a substrate sequence:

  • Computational Data-Driven Method (CDDM)
  • Positional Scanning Peptide Array (PSPA)

Input can be given in two formats:

  • a single input string (phosphorylation site)
  • a csv/dataframe that contains a column of phosphorylation sites

Input sequences can follow one of two case conventions:

  • all capital
  • contains lowercase letters indicating phosphorylation status
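The choice between the two case conventions can be sketched as follows. The parameter-set names `param_CDDM` and `param_CDDM_upper` come from the examples in this README; the helper functions are hypothetical and only illustrate the decision, assuming that lowercase s, t, or y marks a phosphorylated residue.

```python
def has_lowercase_phospho(seq: str) -> bool:
    """Return True if the sequence contains lowercase s, t, or y.

    Assumption: lowercase s/t/y marks a phosphorylated residue.
    """
    return any(ch in "sty" for ch in seq)


def choose_cddm_params(seq: str) -> str:
    """Return the NAME of the CDDM parameter set matching the case convention.

    Only the name is returned, so this sketch runs without katlas installed.
    """
    return "param_CDDM" if has_lowercase_phospho(seq) else "param_CDDM_upper"


for seq in ["AAAAAAASGGAGSDN", "AAAAAAAsGGAGsDN"]:
    print(seq, "->", choose_cddm_params(seq))
```

In practice you would pass the matching parameter set itself to predict_kinase, as in the examples below.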

Single sequence as input

CDDM, all capital

predict_kinase('AAAAAAASGGAGSDN',**param_CDDM_upper)
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']

kinase
PAK6     2.032
ULK3     2.032
PRKX     2.012
ATR      1.991
PRKD1    1.988
         ...  
DDR2     0.928
EPHA4    0.928
TEK      0.921
KIT      0.915
FGFR3    0.910
Length: 289, dtype: float64
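The "considering string" line labels each residue with its position relative to the central phosphorylation site. A minimal sketch of that labeling, assuming position 0 sits at index len(seq) // 2, is shown below. The function `position_tokens` is a hypothetical helper for illustration, not the code katlas uses internally.

```python
def position_tokens(seq: str) -> list[str]:
    """Label each residue with its offset from the central site.

    Assumption: the phosphorylation site is at index len(seq) // 2,
    which gives -7..+7 for 15-mers and -5..+4 for 10-mers.
    """
    center = len(seq) // 2
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


print(position_tokens("AAAAAAASGGAGSDN"))
print(position_tokens("AEEKEyHsEGG"))
```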

CDDM, with lower case indicating phosphorylation status

predict_kinase('AAAAAAAsGGAGsDN',**param_CDDM)
considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']

kinase
ULK3     1.987
PAK6     1.981
PRKD1    1.946
PIM3     1.944
PRKX     1.939
         ...  
EPHA4    0.905
EGFR     0.900
TEK      0.898
FGFR3    0.894
KIT      0.882
Length: 289, dtype: float64

PSPA, with lower case indicating phosphorylation status

predict_kinase('AEEKEyHsEGG',**param_PSPA).head()
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']

kinase
EGFR     4.013
FGFR4    3.568
ZAP70    3.412
CSK      3.241
SYK      3.209
dtype: float64

To replicate the results from The Kinase Library (PSPA)

On The Kinase Library website, rank kinases by log2(score); the ranking matches the output below, with slight differences due to rounding.

predict_kinase('AEEKEyHSEGG',**param_PSPA).head(10)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']

kinase
EGFR         3.181
FGFR4        2.390
CSK          2.308
ZAP70        2.068
SYK          1.998
PDHK1_TYR    1.922
RET          1.732
MATK         1.688
FLT1         1.627
BMPR2_TYR    1.456
dtype: float64
  • So far, The Kinase Library treats all tyrosine sequences as all capital, regardless of whether they contain lowercase letters. This is a small bug and should be fixed soon.
  • A kinase with the "_TYR" suffix is a dual-specificity kinase tested in the PSPA tyrosine setting; these kinases have not been included in kinase-library yet.
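To compare against a source that lacks the "_TYR" entries, the dual-specificity kinases can be separated by suffix. The sketch below rebuilds the ten scores printed above as a plain pandas Series (an assumption made so the example runs without katlas) and splits it on the index name.

```python
import pandas as pd

# Scores copied from the predict_kinase output above.
scores = pd.Series({
    "EGFR": 3.181, "FGFR4": 2.390, "CSK": 2.308, "ZAP70": 2.068,
    "SYK": 1.998, "PDHK1_TYR": 1.922, "RET": 1.732, "MATK": 1.688,
    "FLT1": 1.627, "BMPR2_TYR": 1.456,
})

# Boolean mask: True where the kinase name carries the "_TYR" suffix.
is_dual = scores.index.str.endswith("_TYR")

dual_specificity = scores[is_dual]   # kinases tested in the tyrosine setting
without_suffix = scores[~is_dual]    # remaining kinases, order preserved

print(dual_specificity)
print(without_suffix)
```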

We can also calculate the percentile score using a reference score sheet.

# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()

get_pct('AEEKEyHSEGG',**param_PSPA_y, pct_ref = y_pct)
considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']
       log2(score)  percentile
EGFR         3.181   96.787423
FGFR4        2.390   94.012303
CSK          2.308   95.201640
ZAP70        2.068   88.380041
SYK          1.998   85.522898
...            ...         ...
EPHA1       -3.501   12.139440
FES         -3.699   21.216678
TNK1        -4.269    5.481887
TNK2        -4.577    2.050581
DDR2        -4.920   10.403281

93 rows × 2 columns
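Conceptually, a percentile places one score within a distribution of reference scores for the same kinase. The sketch below shows one common definition: the share of reference scores less than or equal to the query. The function name and the reference values are hypothetical, and get_pct may use a different convention (for example, a strict inequality).

```python
import numpy as np


def percentile_in_reference(score: float, reference) -> float:
    """Percent of reference scores that are <= the given score."""
    reference = np.asarray(reference, dtype=float)
    return float((reference <= score).mean() * 100)


# Hypothetical reference distribution for a single kinase (not real data).
hypothetical_reference = [-4.0, -2.5, -1.0, 0.0, 0.5, 1.2, 2.0, 2.9, 3.5, 4.1]

print(percentile_in_reference(3.181, hypothetical_reference))
```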

High-throughput substrate scoring on a dataframe

Load your csv

# df = pd.read_csv('your_file.csv')
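Before scoring your own file, it can help to check that the sequence column exists and that the sequences have a consistent length. The sketch below reads CSV text from memory so that it runs on its own; with a real file you would pass its path to pd.read_csv. The column name `site_seq` and the rows are taken from the demo dataframe in this README, and the checks are suggestions rather than requirements of katlas.

```python
import io

import pandas as pd

# Inline CSV standing in for 'your_file.csv'.
csv_text = """site_seq,gene_site
VDDEKGDSNDDYDSA,A0A075B6Q4_S24
YDSAGLLSDEDCMSV,A0A075B6Q4_S35
IADHLFWSEETKSRF,A0A075B6Q4_S57
KSRFTEYSMTSSVMR,A0A075B6Q4_S68
FTEYSMTSSVMRRNE,A0A075B6Q4_S71
"""
df = pd.read_csv(io.StringIO(csv_text))

seq_col = "site_seq"  # name of the column holding phosphorylation sites
if seq_col not in df.columns:
    raise KeyError(f"column {seq_col!r} not found in {list(df.columns)}")

# Drop missing sequences and strip stray whitespace.
df = df.dropna(subset=[seq_col])
df[seq_col] = df[seq_col].str.strip()

# Report how many sequences there are of each length.
lengths = df[seq_col].str.len()
print(lengths.value_counts())
```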

Load a demo df

# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]
          site_seq       gene_site
0  VDDEKGDSNDDYDSA  A0A075B6Q4_S24
1  YDSAGLLSDEDCMSV  A0A075B6Q4_S35
2  IADHLFWSEETKSRF  A0A075B6Q4_S57
3  KSRFTEYSMTSSVMR  A0A075B6Q4_S68
4  FTEYSMTSSVMRRNE  A0A075B6Q4_S71

Set the column name and param to calculate

Here we choose param_CDDM_upper, as the sequences in the demo df are all in capital. You can also choose other params.

results = predict_kinase_df(df,'site_seq',**param_CDDM_upper)
results
input dataframe has a length 5
Preprocessing
Finish preprocessing
Calculating position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]

100%|██████████| 289/289 [00:05<00:00, 56.64it/s]
kinase SRC EPHA3 FES NTRK3 ALK EPHA8 ABL1 FLT3 EPHB2 FYN ... MEK5 PKN2 MAP2K7 MRCKB HIPK3 CDK8 BUB1 MEKK3 MAP2K3 GRK1
0 0.991760 1.093712 1.051750 1.067134 1.013682 1.097519 0.966379 0.982464 1.054986 1.055910 ... 1.314859 1.635470 1.652251 1.622672 1.362973 1.797155 1.305198 1.423618 1.504941 1.872020
1 0.910262 0.953743 0.942327 0.950601 0.872694 0.932586 0.846899 0.826662 0.915020 0.942713 ... 1.175454 1.402006 1.430392 1.215826 1.569373 1.716455 1.270999 1.195081 1.223082 1.793290
2 0.849866 0.899910 0.848895 0.879652 0.874959 0.899414 0.839200 0.836523 0.858040 0.867269 ... 1.408003 1.813739 1.454786 1.084522 1.352556 1.524663 1.377839 1.173830 1.305691 1.811849
3 0.803826 0.836527 0.800759 0.894570 0.839905 0.781001 0.847847 0.807040 0.805877 0.801402 ... 1.110307 1.703637 1.795092 1.469653 1.549936 1.491344 1.446922 1.055452 1.534895 1.741090
4 0.822793 0.796532 0.792343 0.839882 0.810122 0.781420 0.805251 0.795022 0.790380 0.864538 ... 1.062617 1.357689 1.485945 1.249266 1.456078 1.422782 1.376471 1.089629 1.121309 1.697524

5 rows × 289 columns
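The result has one row per site and one column per kinase, so a common next step is to list the highest-scoring kinases for each site. The sketch below does this on a small dataframe built from four of the columns and the first two rows printed above. `demo_scores` and `top_kinases` are hypothetical names, and the same function should apply to the full results dataframe.

```python
import pandas as pd

# Subset of the scores shown above (rows 0 and 1, four kinases).
demo_scores = pd.DataFrame({
    "SRC": [0.991760, 0.910262],
    "EPHA3": [1.093712, 0.953743],
    "CDK8": [1.797155, 1.716455],
    "GRK1": [1.872020, 1.793290],
})


def top_kinases(scores: pd.DataFrame, k: int = 3) -> pd.Series:
    """For each row (site), return the k highest-scoring kinase names."""
    return scores.apply(lambda row: row.nlargest(k).index.tolist(), axis=1)


top = top_kinases(demo_scores, k=2)
print(top)
```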

Phosphorylation sites

Besides calculating sequence scores, we also provide several datasets of phosphorylation sites.

CPTAC pan-cancer phosphoproteomics

df = Data.get_cptac_ensembl_site()
df.head(3)
                 gene   site         site_seq            protein gene_name    gene_site           protein_site
0   ENSG00000003056.8   S267  DDQLGEESEERDDHL  ENSP00000000412.3      M6PR    M6PR_S267   ENSP00000000412_S267
1   ENSG00000003056.8   S267  DDQLGEESEERDDHL  ENSP00000440488.2      M6PR    M6PR_S267   ENSP00000440488_S267
2  ENSG00000048028.11  S1053  PPTIRPNSPYDLCSR  ENSP00000003302.4     USP28  USP28_S1053  ENSP00000003302_S1053

df = Data.get_ochoa_site()
df.head(3)
uniprot position residue is_disopred disopred_score log10_hotspot_pval_min isHotspot uniprot_position functional_score current_uniprot name gene Sequence is_valid site_seq gene_site
0 A0A075B6Q4 24 S True 0.91 6.839384 True A0A075B6Q4_24 0.149257 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True VDDEKGDSNDDYDSA A0A075B6Q4_S24
1 A0A075B6Q4 35 S True 0.87 9.192622 False A0A075B6Q4_35 0.136966 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True YDSAGLLSDEDCMSV A0A075B6Q4_S35
2 A0A075B6Q4 57 S False 0.28 0.818834 False A0A075B6Q4_57 0.125364 A0A075B6Q4 A0A075B6Q4_HUMAN None MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT... True IADHLFWSEETKSRF A0A075B6Q4_S57
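In the table above, site_seq is the 15-residue window around the 1-based position in Sequence. A minimal sketch of that relationship follows, using only the visible (truncated) prefix of the protein sequence. The function `extract_window` is hypothetical, and padding with "_" at the protein ends is an assumption based on the PhosphoSitePlus sequences shown in this README.

```python
def extract_window(protein_seq: str, position: int, flank: int = 7, pad: str = "_") -> str:
    """Return the residues from position-flank to position+flank.

    `position` is 1-based. Offsets that fall outside the protein are
    filled with `pad` (an assumption, see the lead-in).
    """
    center = position - 1  # convert to a 0-based index
    return "".join(
        protein_seq[i] if 0 <= i < len(protein_seq) else pad
        for i in range(center - flank, center + flank + 1)
    )


# Visible prefix of the A0A075B6Q4 sequence; the rest is truncated above.
prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"

print(extract_window(prefix, 24))
print(extract_window(prefix, 35))
print(extract_window(prefix, 6))
```

Positions 24 and 35 reproduce the site_seq values in rows 0 and 1; position 57 lies beyond the visible prefix and cannot be checked here.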

PhosphoSitePlus human phosphorylation site

df = Data.get_psp_human_site()
df.head(3)
    gene      protein uniprot site  gene_site  SITE_GRP_ID species         site_seq  LT_LIT  MS_LIT  MS_CST CST_CAT#  Ambiguous_Site
0  YWHAB  14-3-3 beta  P31946   T2   YWHAB_T2     15718712   human  ______MtMDksELV     NaN     3.0     1.0     None               0
1  YWHAB  14-3-3 beta  P31946   S6   YWHAB_S6     15718709   human  __MtMDksELVQkAk     NaN     8.0     NaN     None               0
2  YWHAB  14-3-3 beta  P31946  Y21  YWHAB_Y21      3426383   human  LAEQAERyDDMAAAM     NaN     NaN     4.0     None               0

Unique sites of combined Ochoa & PhosphoSitePlus

df = Data.get_combine_site_psp_ochoa()
df.head(3)
          site_seq   gene_site   gene source  num_site acceptor -7 -6 -5 -4  ... -2 -1  0  1  2  3  4  5  6  7
0  AAAAAAASGGAGSDN   PBX1_S136   PBX1  ochoa         1        S  A  A  A  A  ...  A  A  S  G  G  A  G  S  D  N
1  AAAAAAASGGGVSPD   PBX2_S146   PBX2  ochoa         1        S  A  A  A  A  ...  A  A  S  G  G  G  V  S  P  D
2  AAAAAAASGVTTGKP  CLASR_S349  CLASR  ochoa         1        S  A  A  A  A  ...  A  A  S  G  V  T  T  G  K  P

3 rows × 21 columns
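The per-position columns (-7 to 7) can be derived from site_seq by splitting each sequence into single residues. The sketch below shows one way to do this with pandas on the three sequences above. It assumes all sequences are 15 residues long, and it is an illustration rather than the code used to build the dataset.

```python
import pandas as pd

sites = pd.DataFrame({
    "site_seq": ["AAAAAAASGGAGSDN", "AAAAAAASGGGVSPD", "AAAAAAASGVTTGKP"],
})

flank = 7
positions = list(range(-flank, flank + 1))  # -7, -6, ..., 7

# Split each sequence into residues; one column per position.
per_position = pd.DataFrame(
    sites["site_seq"].apply(list).tolist(),
    columns=positions,
    index=sites.index,
)

wide = pd.concat([sites, per_position], axis=1)
print(wide)
```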

Phosphorylation site sequence example

All capital - 15 length (-7 to +7)

  • QSEEEKLSPSPTTED
  • TLQHVPDYRQNVYIP
  • TMGLSARYGPQFTLQ

All capital - 10 length (-5 to +4)

  • SRDPHYQDPH
  • LDNPDYQQDF
  • AAAAASGGAG

With lowercase - (-7 to +7)

  • QsEEEKLsPsPTTED
  • TLQHVPDyRQNVYIP
  • TMGLsARyGPQFTLQ

With lowercase - (-5 to +4)

  • sRDPHyQDPH
  • LDNPDyQQDF
  • AAAAAsGGAG
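A quick check that a sequence fits these formats might look like the sketch below. The rules are assumptions drawn from the examples: residues are the 20 standard amino acids, lowercase is limited to s, t, and y, "_" is allowed as padding, and the central residue (index len(seq) // 2) is S, T, or Y in either case. The function `is_valid_site_seq` is hypothetical. PhosphoSitePlus sequences such as __MtMDksELVQkAk contain other lowercase letters, which this sketch would reject.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
PHOSPHO_LOWER = set("sty")  # assumed markers of phosphorylated residues
PADDING = "_"               # assumed filler at protein ends


def is_valid_site_seq(seq: str) -> bool:
    """Check the characters and the central phospho-acceptor of a site sequence."""
    if not seq:
        return False
    allowed = AMINO_ACIDS | PHOSPHO_LOWER | {PADDING}
    if any(ch not in allowed for ch in seq):
        return False
    center = seq[len(seq) // 2]
    return center.upper() in {"S", "T", "Y"}


for seq in ["QSEEEKLSPSPTTED", "TMGLsARyGPQFTLQ", "sRDPHyQDPH", "AAAAAAAAGGAGSDN"]:
    print(seq, is_valid_site_seq(seq))
```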