KATLAS

KATLAS is a repository of Python tools for predicting kinases from a substrate sequence. It also contains datasets of kinase substrate specificities and human phosphoproteomics.

References: Please cite the appropriate papers if KATLAS is helpful to your research.

KATLAS is described in the paper "Computational Decoding of Human Kinome Substrate Specificities and Functions".
The positional scanning peptide array (PSPA) data are from the papers "An atlas of substrate specificities for the human serine/threonine kinome" and "The intrinsic substrate specificity of the human tyrosine kinome".
The kinase substrate datasets used for generating PSSMs are derived from PhosphoSitePlus and the paper "Large-scale Discovery of Substrates of the Human Kinome".
Phosphorylation sites are taken from PhosphoSitePlus, the paper "The functional landscape of the human phosphoproteome", and CPTAC / LinkedOmics.

Reproduce datasets & figures

Follow the instructions in katlas_raw: https://github.com/sky1ove/katlas_raw

This requires installing the package with the dev extras: pip install 'python-katlas[dev]' -U

Web applications

Users can now run the analysis directly on the web without needing to code.

Check out our latest web platform: kinase-atlas.com

Tutorials on Colab

Install

pip install python-katlas -U

To use modules beyond the core, install the dev extras: pip install 'python-katlas[dev]' -U

Import

from katlas.core import *

Quick start

We provide two methods to score a substrate sequence:

Computational Data-Driven Method (CDDM)
Positional Scanning Peptide Array (PSPA)

Input can be given in two formats:

a single string (one phosphorylation site)
a CSV file or dataframe that contains a column of phosphorylation sites

Input sequences can be written in two ways:

all capital letters
with lowercase letters indicating phosphorylation status

Single sequence as input

CDDM, all capital

predict_kinase('AAAAAAASGGAGSDN',**param_CDDM_upper)

considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0S', '1G', '2G', '3A', '4G', '5S', '6D', '7N']

kinase
PAK6     2.032
ULK3     2.032
PRKX     2.012
ATR      1.991
PRKD1    1.988
         ...  
DDR2     0.928
EPHA4    0.928
TEK      0.921
KIT      0.915
FGFR3    0.910
Length: 289, dtype: float64
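
The "considering string" line printed above labels each residue with its offset from the phosphorylation site. A minimal sketch of that labelling, assuming the site sits at index len(seq) // 2 (which holds for the examples in this README); label_positions is a hypothetical helper for illustration, not a function from katlas:

```python
def label_positions(seq):
    """Label each residue with its offset from the central phosphosite."""
    center = len(seq) // 2  # assumption: the phosphosite is the middle residue
    return [f"{i - center}{aa}" for i, aa in enumerate(seq)]


print(label_positions("AAAAAAASGGAGSDN"))
```

Lowercase letters pass through unchanged, so a sequence such as 'AEEKEyHsEGG' keeps its phosphorylation marks in the labels.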

CDDM, with lower case indicating phosphorylation status

predict_kinase('AAAAAAAsGGAGsDN',**param_CDDM)

considering string: ['-7A', '-6A', '-5A', '-4A', '-3A', '-2A', '-1A', '0s', '1G', '2G', '3A', '4G', '5s', '6D', '7N']

kinase
ULK3     1.987
PAK6     1.981
PRKD1    1.946
PIM3     1.944
PRKX     1.939
         ...  
EPHA4    0.905
EGFR     0.900
TEK      0.898
FGFR3    0.894
KIT      0.882
Length: 289, dtype: float64

PSPA, with lower case indicating phosphorylation status

predict_kinase('AEEKEyHsEGG',**param_PSPA).head()

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2s', '3E', '4G', '5G']

kinase
EGFR     4.013
FGFR4    3.568
ZAP70    3.412
CSK      3.241
SYK      3.209
dtype: float64

To replicate the results from The Kinase Library (PSPA)

On The Kinase Library website, rank the kinases by log2(score); the ranking matches the results below, with slight differences due to rounding.

predict_kinase('AEEKEyHSEGG',**param_PSPA).head(10)

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0y', '1H', '2S', '3E', '4G', '5G']

kinase
EGFR         3.181
FGFR4        2.390
CSK          2.308
ZAP70        2.068
SYK          1.998
PDHK1_TYR    1.922
RET          1.732
MATK         1.688
FLT1         1.627
BMPR2_TYR    1.456
dtype: float64

So far, The Kinase Library treats all tyrosine sequences as all capital, whether or not they contain lowercase letters. This is a small bug and should be fixed soon.
A kinase with the “_TYR” suffix is a dual-specificity kinase tested in the PSPA tyrosine setting; these are not yet included in kinase-library.

We can also calculate a percentile score using a reference score sheet.

# Percentile reference sheet
y_pct = Data.get_pspa_tyr_pct()

get_pct('AEEKEyHSEGG',**param_PSPA_y, pct_ref = y_pct)

considering string: ['-5A', '-4E', '-3E', '-2K', '-1E', '0Y', '1H', '2S', '3E', '4G', '5G']

	log2(score)	percentile
EGFR	3.181	96.787423
FGFR4	2.390	94.012303
CSK	2.308	95.201640
ZAP70	2.068	88.380041
SYK	1.998	85.522898
...	...	...
EPHA1	-3.501	12.139440
FES	-3.699	21.216678
TNK1	-4.269	5.481887
TNK2	-4.577	2.050581
DDR2	-4.920	10.403281

93 rows × 2 columns
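
A percentile places a sequence's score within a reference distribution of scores for the same kinase. A minimal sketch of that idea, assuming the percentile is the share of reference scores at or below the query score; the reference values are made up for illustration, and percentile_of is a hypothetical helper that does not claim to reproduce how get_pct is implemented:

```python
from bisect import bisect_right


def percentile_of(score, reference):
    """Percent of reference scores that are less than or equal to score."""
    ranked = sorted(reference)
    return 100.0 * bisect_right(ranked, score) / len(ranked)


# Made-up reference log2(score) values for one kinase (illustration only)
reference_scores = [-4.0, -2.5, -1.0, 0.0, 0.5, 1.2, 2.0, 2.8, 3.5, 4.1]

print(percentile_of(3.181, reference_scores))
```

Because each kinase has its own reference distribution, the same log2(score) can map to different percentiles for different kinases, as the table above shows for FGFR4 and CSK.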

High-throughput substrate scoring on a dataframe

Load your csv

# df = pd.read_csv('your_file.csv')

Load a demo df

# Load a demo df with phosphorylation sites
df = Data.get_ochoa_site().head()
df.iloc[:,-2:]

	site_seq	gene_site
0	VDDEKGDSNDDYDSA	A0A075B6Q4_S24
1	YDSAGLLSDEDCMSV	A0A075B6Q4_S35
2	IADHLFWSEETKSRF	A0A075B6Q4_S57
3	KSRFTEYSMTSSVMR	A0A075B6Q4_S68
4	FTEYSMTSSVMRRNE	A0A075B6Q4_S71

Set the column name and param to calculate

Here we choose param_CDDM_upper, because the sequences in the demo df are all capital. You can also choose other params.

results = predict_kinase_df(df,'site_seq',**param_CDDM_upper)
results

input dataframe has a length 5
Preprocessing
Finish preprocessing
Calculating position: [-7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7]

100%|██████████| 289/289 [00:05<00:00, 56.64it/s]

kinase	SRC	EPHA3	FES	NTRK3	ALK	EPHA8	ABL1	FLT3	EPHB2	FYN	...	MEK5	PKN2	MAP2K7	MRCKB	HIPK3	CDK8	BUB1	MEKK3	MAP2K3	GRK1
0	0.991760	1.093712	1.051750	1.067134	1.013682	1.097519	0.966379	0.982464	1.054986	1.055910	...	1.314859	1.635470	1.652251	1.622672	1.362973	1.797155	1.305198	1.423618	1.504941	1.872020
1	0.910262	0.953743	0.942327	0.950601	0.872694	0.932586	0.846899	0.826662	0.915020	0.942713	...	1.175454	1.402006	1.430392	1.215826	1.569373	1.716455	1.270999	1.195081	1.223082	1.793290
2	0.849866	0.899910	0.848895	0.879652	0.874959	0.899414	0.839200	0.836523	0.858040	0.867269	...	1.408003	1.813739	1.454786	1.084522	1.352556	1.524663	1.377839	1.173830	1.305691	1.811849
3	0.803826	0.836527	0.800759	0.894570	0.839905	0.781001	0.847847	0.807040	0.805877	0.801402	...	1.110307	1.703637	1.795092	1.469653	1.549936	1.491344	1.446922	1.055452	1.534895	1.741090
4	0.822793	0.796532	0.792343	0.839882	0.810122	0.781420	0.805251	0.795022	0.790380	0.864538	...	1.062617	1.357689	1.485945	1.249266	1.456078	1.422782	1.376471	1.089629	1.121309	1.697524

5 rows × 289 columns
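
The result has one row per input site and one column per kinase, so the best-scoring kinases for a site can be read off row-wise. A minimal sketch using a few of the row 0 values displayed above (the ranking therefore covers only those columns, not all 289); top_kinases is a hypothetical helper:

```python
import heapq

# Subset of row 0 copied from the output above
row0 = {
    "SRC": 0.991760,
    "EPHA3": 1.093712,
    "PKN2": 1.635470,
    "MAP2K7": 1.652251,
    "CDK8": 1.797155,
    "GRK1": 1.872020,
}


def top_kinases(scores, n=3):
    """Return the n highest-scoring (kinase, score) pairs, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


print(top_kinases(row0))
```

On the full dataframe, pandas offers the same operation per row, for example results.iloc[0].nlargest(3).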

Phosphorylation sites

Besides calculating sequence scores, KATLAS also provides multiple datasets of phosphorylation sites.

CPTAC pan-cancer phosphoproteomics

df = Data.get_cptac_ensembl_site()
df.head(3)

	gene	site	site_seq	protein	gene_name	gene_site	protein_site
0	ENSG00000003056.8	S267	DDQLGEESEERDDHL	ENSP00000000412.3	M6PR	M6PR_S267	ENSP00000000412_S267
1	ENSG00000003056.8	S267	DDQLGEESEERDDHL	ENSP00000440488.2	M6PR	M6PR_S267	ENSP00000440488_S267
2	ENSG00000048028.11	S1053	PPTIRPNSPYDLCSR	ENSP00000003302.4	USP28	USP28_S1053	ENSP00000003302_S1053

Ochoa et al. human phosphoproteome

df = Data.get_ochoa_site()
df.head(3)

	uniprot	position	residue	is_disopred	disopred_score	log10_hotspot_pval_min	isHotspot	uniprot_position	functional_score	current_uniprot	name	gene	Sequence	is_valid	site_seq	gene_site
0	A0A075B6Q4	24	S	True	0.91	6.839384	True	A0A075B6Q4_24	0.149257	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	VDDEKGDSNDDYDSA	A0A075B6Q4_S24
1	A0A075B6Q4	35	S	True	0.87	9.192622	False	A0A075B6Q4_35	0.136966	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	YDSAGLLSDEDCMSV	A0A075B6Q4_S35
2	A0A075B6Q4	57	S	False	0.28	0.818834	False	A0A075B6Q4_57	0.125364	A0A075B6Q4	A0A075B6Q4_HUMAN	None	MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT...	True	IADHLFWSEETKSRF	A0A075B6Q4_S57
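
In the rows above, site_seq is the 15-residue window (-7 to +7) around position in Sequence, and gene_site joins the identifier, residue, and position. A minimal sketch of that relationship, using only the visible prefix of the truncated Sequence value; site_window is a hypothetical helper, and padding with "_" at the protein termini is an assumption based on the PhosphoSitePlus site_seq values shown in this README:

```python
def site_window(sequence, position, flank=7, pad="_"):
    """Return the residues from position-flank to position+flank (1-based position)."""
    idx = position - 1  # convert the 1-based protein position to a 0-based index
    padded = pad * flank + sequence + pad * flank
    # After left-padding by `flank`, the window starts at idx in the padded string
    return padded[idx : idx + 2 * flank + 1]


# Visible prefix of the truncated Sequence column for A0A075B6Q4
prefix = "MDIQKSENEDDSEWEDVDDEKGDSNDDYDSAGLLSDEDCMSVPGKT"

uniprot, position = "A0A075B6Q4", 24
residue = prefix[position - 1]

print(site_window(prefix, position))    # window around the site
print(f"{uniprot}_{residue}{position}")  # gene_site-style label
```

Since the prefix is truncated, windows near its right end would be padded where the real protein continues; the sketch is only meaningful for sites well inside the visible part.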

PhosphoSitePlus human phosphorylation sites

df = Data.get_psp_human_site()
df.head(3)

	gene	protein	uniprot	site	gene_site	SITE_GRP_ID	species	site_seq	LT_LIT	MS_LIT	MS_CST	CST_CAT#
0	YWHAB	14-3-3 beta	P31946	T2	YWHAB_T2	15718712	human	______MtMDksELV	NaN	3.0	1.0	None
1	YWHAB	14-3-3 beta	P31946	S6	YWHAB_S6	15718709	human	__MtMDksELVQkAk	NaN	8.0	NaN	None
2	YWHAB	14-3-3 beta	P31946	Y21	YWHAB_Y21	3426383	human	LAEQAERyDDMAAAM	NaN	NaN	4.0	None

Unique sites of combined Ochoa & PhosphoSitePlus

df = Data.get_combine_site_psp_ochoa()
df.head(3)

	site_seq	gene_site	gene	source	num_site	acceptor	-7	-6	-5	-4	...	-2	-1	0	1	2	3	4	5	6	7
0	AAAAAAASGGAGSDN	PBX1_S136	PBX1	ochoa	1	S	A	A	A	A	...	A	A	S	G	G	A	G	S	D	N
1	AAAAAAASGGGVSPD	PBX2_S146	PBX2	ochoa	1	S	A	A	A	A	...	A	A	S	G	G	G	V	S	P	D
2	AAAAAAASGVTTGKP	CLASR_S349	CLASR	ochoa	1	S	A	A	A	A	...	A	A	S	G	V	T	T	G	K	P

3 rows × 21 columns

Phosphorylation site sequence example

All capital - 15 length (-7 to +7)

QSEEEKLSPSPTTED
TLQHVPDYRQNVYIP
TMGLSARYGPQFTLQ

All capital - 10 length (-5 to +4)

SRDPHYQDPH
LDNPDYQQDF
AAAAASGGAG

With lowercase - (-7 to +7)

QsEEEKLsPsPTTED
TLQHVPDyRQNVYIP
TMGLsARyGPQFTLQ

With lowercase - (-5 to +4)

sRDPHyQDPH
LDNPDyQQDF
AAAAAsGGAG
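
The examples above follow two conventions: the phospho-acceptor sits at the center of the window, and a lowercase s, t, or y marks a phosphorylated residue. A minimal sketch that checks both, assuming the center is at index len(seq) // 2 (true for the -7 to +7 and -5 to +4 windows shown); describe_site is a hypothetical helper, not part of katlas:

```python
def describe_site(seq):
    """Return the central acceptor and whether lowercase phospho marks are present."""
    center = seq[len(seq) // 2]
    if center.upper() not in ("S", "T", "Y"):
        raise ValueError(f"center residue {center!r} is not S, T or Y")
    has_marks = any(aa in "sty" for aa in seq)
    return {"acceptor": center.upper(), "has_lowercase_marks": has_marks}


for seq in ["QSEEEKLSPSPTTED", "TLQHVPDyRQNVYIP", "sRDPHyQDPH"]:
    print(seq, describe_site(seq))
```

In the Quick start, all-capital sequences are scored with param_CDDM_upper, while sequences carrying lowercase marks are scored with param_CDDM or param_PSPA.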
