Inspect the single-lineage model run on the prostate data (008)

Model attributes:

sgRNA | gene varying intercept
RNA and CN varying effects per gene
correlation between gene varying effects modeled using the multivariate normal and Cholesky decomposition (non-centered parameterization)
target gene mutation variable and cancer gene comutation variable.
varying intercept and slope for copy number for chromosome nested with in cell line

%load_ext autoreload
%autoreload 2

from math import ceil
from time import time

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Markdown, display

from speclet.analysis.arviz_analysis import extract_coords_param_names
from speclet.bayesian_models.lineage_hierarchical_nb import LineageHierNegBinomModel
from speclet.data_processing.common import head_tail
from speclet.io import modeling_data_dir, models_dir
from speclet.managers.data_managers import CrisprScreenDataManager, broad_only
from speclet.plot import set_speclet_theme
from speclet.project_configuration import arviz_config

# Notebook execution timer.
notebook_tic = time()

# Plotting setup.
set_speclet_theme()
%config InlineBackend.figure_format = "retina"
arviz_config()

Data

saved_model_dir = models_dir() / "hnb-single-lineage-prostate-008_PYMC_NUMPYRO"

with open(saved_model_dir / "description.txt") as f:
    model_description = "".join(list(f))

print(model_description)

config. name: 'hnb-single-lineage-prostate-008'
model name: 'LineageHierNegBinomModel'
model version: '0.0.3'
model description: A hierarchical negative binomial generalized linear model fora single lineage.fit method: 'PYMC_NUMPYRO'

--------------------------------------------------------------------------------

CONFIGURATION

{
    "name": "hnb-single-lineage-prostate-008",
    "description": " Single lineage hierarchical negative binomial model for prostate data from the Broad.\nVarying effects for each chromosome of each cell line has been nested under the existing varying effects for cell line. ",
    "active": true,
    "model": "LINEAGE_HIERARCHICAL_NB",
    "data_file": "modeling_data/lineage-modeling-data/depmap-modeling-data_prostate.csv",
    "model_kwargs": {
        "lineage": "prostate"
    },
    "sampling_kwargs": {
        "pymc_mcmc": null,
        "pymc_advi": null,
        "pymc_numpyro": {
            "draws": 1000,
            "tune": 1500,
            "chains": 4,
            "target_accept": 0.98,
            "progress_bar": true,
            "chain_method": "parallel",
            "postprocessing_backend": "cpu",
            "idata_kwargs": {
                "log_likelihood": false
            },
            "nuts_kwargs": {
                "step_size": 0.01,
                "max_tree_depth": 11
            }
        }
    }
}

--------------------------------------------------------------------------------

POSTERIOR

<xarray.Dataset>
Dimensions:                    (chain: 4, draw: 1000, delta_genes_dim_0: 5,
                                delta_genes_dim_1: 18119, sgrna: 71062,
                                delta_cells_dim_0: 2, delta_cells_dim_1: 5,
                                cell_chrom: 115, genes_chol_cov_dim_0: 15,
                                cells_chol_cov_dim_0: 3,
                                genes_chol_cov_corr_dim_0: 5,
                                genes_chol_cov_corr_dim_1: 5,
                                genes_chol_cov_stds_dim_0: 5, cancer_gene: 1,
                                gene: 18119, cells_chol_cov_corr_dim_0: 2,
                                cells_chol_cov_corr_dim_1: 2,
                                cells_chol_cov_stds_dim_0: 2, cell_line: 5)
Coordinates: (12/19)
  * chain                      (chain) int64 0 1 2 3
  * draw                       (draw) int64 0 1 2 3 4 5 ... 995 996 997 998 999
  * delta_genes_dim_0          (delta_genes_dim_0) int64 0 1 2 3 4
  * delta_genes_dim_1          (delta_genes_dim_1) int64 0 1 2 ... 18117 18118
  * sgrna                      (sgrna) object 'AAAAAAATCCAGCAATGCAG' ... 'TTT...
  * delta_cells_dim_0          (delta_cells_dim_0) int64 0 1
    ...                         ...
  * cancer_gene                (cancer_gene) object 'ZFHX3'
  * gene                       (gene) object 'A1BG' 'A1CF' ... 'ZZEF1' 'ZZZ3'
  * cells_chol_cov_corr_dim_0  (cells_chol_cov_corr_dim_0) int64 0 1
  * cells_chol_cov_corr_dim_1  (cells_chol_cov_corr_dim_1) int64 0 1
  * cells_chol_cov_stds_dim_0  (cells_chol_cov_stds_dim_0) int64 0 1
  * cell_line                  (cell_line) object 'ACH-000115' ... 'ACH-001648'
Data variables: (12/35)
    mu_mu_a                    (chain, draw) float64 ...
    mu_b                       (chain, draw) float64 ...
    delta_genes                (chain, draw, delta_genes_dim_0, delta_genes_dim_1) float64 ...
    delta_a                    (chain, draw, sgrna) float64 ...
    mu_mu_m                    (chain, draw) float64 ...
    delta_cells                (chain, draw, delta_cells_dim_0, delta_cells_dim_1) float64 ...
    ...                         ...
    sigma_mu_k                 (chain, draw) float64 ...
    sigma_mu_m                 (chain, draw) float64 ...
    mu_k                       (chain, draw, cell_line) float64 ...
    mu_m                       (chain, draw, cell_line) float64 ...
    k                          (chain, draw, cell_chrom) float64 ...
    m                          (chain, draw, cell_chrom) float64 ...
Attributes:
    created_at:           2022-08-07 22:52:41.742053
    arviz_version:        0.12.1
    model_name:           LineageHierNegBinomModel
    model_version:        0.0.3
    model_doc:            A hierarchical negative binomial generalized linear...
    previous_created_at:  ['2022-08-07 22:52:41.742053', '2022-08-08T02:36:24...

--------------------------------------------------------------------------------

SAMPLE STATS

<xarray.Dataset>
Dimensions:          (chain: 4, draw: 1000)
Coordinates:
  * chain            (chain) int64 0 1 2 3
  * draw             (draw) int64 0 1 2 3 4 5 6 ... 993 994 995 996 997 998 999
Data variables:
    acceptance_rate  (chain, draw) float64 ...
    step_size        (chain, draw) float64 ...
    diverging        (chain, draw) bool ...
    energy           (chain, draw) float64 ...
    n_steps          (chain, draw) int64 ...
    tree_depth       (chain, draw) int64 ...
    lp               (chain, draw) float64 ...
Attributes:
    created_at:           2022-08-07 22:52:41.742053
    arviz_version:        0.12.1
    previous_created_at:  ['2022-08-07 22:52:41.742053', '2022-08-08T02:36:24...

--------------------------------------------------------------------------------

MCMC DESCRIPTION

date created: 2022-08-07 22:52
sampled 4 chains with (unknown) tuning steps and 1,000 draws
num. divergences: 0, 0, 0, 0
percent divergences: 0.0, 0.0, 0.0, 0.0
BFMI: 2.001, 1.995, 0.588, 1.9
avg. step size: 0.0, 0.0, 0.003, 0.0
avg. accept prob.: 0.981, 0.987, 0.987, 0.974
avg. tree depth: 11.0, 11.0, 11.0, 11.0

Load posterior summary

prostate_post_summary = pd.read_csv(saved_model_dir / "posterior-summary.csv").assign(
    var_name=lambda d: [x.split("[")[0] for x in d["parameter"]]
)
prostate_post_summary.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	parameter	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat	var_name
0	mu_mu_a	0.102	0.023	0.066	0.119	0.009	0.007	6.0	34.0	2.55	mu_mu_a
1	mu_b	0.006	0.006	0.001	0.016	0.003	0.002	5.0	11.0	3.15	mu_b
2	mu_mu_m	-0.275	0.084	-0.326	-0.145	0.037	0.030	5.0	12.0	3.26	mu_mu_m
3	sigma_a	0.053	0.092	0.000	0.212	0.046	0.035	4.0	4.0	3.55	sigma_a
4	sigma_k	0.009	0.014	0.000	0.033	0.007	0.005	4.0	4.0	3.60	sigma_k

Load trace object

trace_file = saved_model_dir / "posterior.netcdf"
assert trace_file.exists()
trace = az.from_netcdf(trace_file)

Prostate data

prostate_dm = CrisprScreenDataManager(
    modeling_data_dir() / "lineage-modeling-data" / "depmap-modeling-data_prostate.csv",
    transformations=[broad_only],
)

prostate_data = prostate_dm.get_data(read_kwargs={"low_memory": False})
prostate_data.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	sgrna	replicate_id	lfc	p_dna_batch	genome_alignment	hugo_symbol	screen	multiple_hits_on_gene	sgrna_target_chr	sgrna_target_pos	...	any_deleterious	any_tcga_hotspot	any_cosmic_hotspot	is_mutated	copy_number	lineage	lineage_subtype	primary_or_metastasis	is_male	age
0	AAAGCCCAGGAGTATGGGAG	Vcap-304Cas9_RepA_p4_batch3	0.246450	3	chr2_130522105_-	CFC1B	broad	True	2	130522105	...	NaN	NaN	NaN	False	0.999455	prostate	prostate_adenocarcinoma	metastasis	True	59.0
1	AAATCAGAGAAACCTGAACG	Vcap-304Cas9_RepA_p4_batch3	0.626518	3	chr11_89916950_-	TRIM49D1	broad	True	11	89916950	...	NaN	NaN	NaN	False	1.281907	prostate	prostate_adenocarcinoma	metastasis	True	59.0
2	AACGTCTTTGAAGAAAGCTG	Vcap-304Cas9_RepA_p4_batch3	0.165114	3	chr5_71055421_-	GTF2H2	broad	True	5	71055421	...	NaN	NaN	NaN	False	0.616885	prostate	prostate_adenocarcinoma	metastasis	True	59.0
3	AACGTCTTTGAAGGAAGCTG	Vcap-304Cas9_RepA_p4_batch3	-0.094688	3	chr5_69572480_+	GTF2H2C	broad	True	5	69572480	...	NaN	NaN	NaN	False	0.616885	prostate	prostate_adenocarcinoma	metastasis	True	59.0
4	AAGAGGTTCCAGACTACTTA	Vcap-304Cas9_RepA_p4_batch3	0.294496	3	chrX_155898173_+	VAMP7	broad	True	X	155898173	...	NaN	NaN	NaN	False	0.615935	prostate	prostate_adenocarcinoma	metastasis	True	59.0

5 rows × 25 columns

Single lineage model

prostate_model = LineageHierNegBinomModel(lineage="prostate")

valid_prostate_data = prostate_model.data_processing_pipeline(prostate_data.copy())
prostate_mdl_data = prostate_model.make_data_structure(valid_prostate_data)

[INFO] 2022-08-08 09:02:37 [(lineage_hierarchical_nb.py:data_processing_pipeline:317] Processing data for modeling.
[INFO] 2022-08-08 09:02:37 [(lineage_hierarchical_nb.py:data_processing_pipeline:318] LFC limits: (-5.0, 5.0)
[WARNING] 2022-08-08 09:03:57 [(lineage_hierarchical_nb.py:data_processing_pipeline:376] number of data points dropped: 2
[INFO] 2022-08-08 09:03:57 [(lineage_hierarchical_nb.py:target_gene_is_mutated_vector:587] number of genes mutated in all cells lines: 0
[DEBUG] 2022-08-08 09:03:57 [(lineage_hierarchical_nb.py:target_gene_is_mutated_vector:590] Genes always mutated:
[DEBUG] 2022-08-08 09:03:57 [(cancer_gene_mutation_matrix.py:_trim_cancer_genes:68] all_mut: {}
[INFO] 2022-08-08 09:03:57 [(cancer_gene_mutation_matrix.py:_trim_cancer_genes:77] Dropping 8 cancer genes.
[DEBUG] 2022-08-08 09:03:57 [(cancer_gene_mutation_matrix.py:_trim_cancer_genes:79] Dropped cancer genes: ['AR', 'AXIN1', 'FOXA1', 'KLF6', 'NCOR2', 'PTEN', 'SALL4', 'SPOP']

Analysis

sns.histplot(x=prostate_post_summary["r_hat"], binwidth=0.01, stat="proportion");

fig, ax = plt.subplots(figsize=(8, 5))
sns.boxplot(data=prostate_post_summary, x="var_name", y="r_hat", ax=ax)
ax.tick_params(rotation=90)
plt.show()

az.plot_energy(trace);

energy = trace.sample_stats.energy.values
marginal_e = pd.DataFrame((energy - energy.mean(axis=1)[:, None]).T).assign(
    energy="marginal"
)
transition_e = pd.DataFrame((energy[:, :-1] - energy[:, 1:]).T).assign(
    energy="transition"
)
energy_df = pd.concat([marginal_e, transition_e]).reset_index(drop=True)
bfmi = az.bfmi(trace)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for i, ax in enumerate(axes.flatten()):
    sns.kdeplot(data=energy_df, x=i, hue="energy", ax=ax)
    ax.set_title(f"chain {i} – BFMI: {bfmi[i]:0.2f}")
    ax.set_xlabel(None)
    xmin, _ = ax.get_xlim()
    _, ymax = ax.get_ylim()
    ax.get_legend().set_frame_on(False)

fig.tight_layout()
plt.show()

hmc_energy = (
    trace.sample_stats.get("energy")
    .to_dataframe()
    .reset_index(drop=False)
    .astype({"chain": "category"})
)
ax = sns.lineplot(data=hmc_energy, x="draw", y="energy", hue="chain", palette="Set1")
ax.set_xlim(0, hmc_energy["draw"].max())
plt.show()

stats = ["step_size", "n_steps", "tree_depth", "acceptance_rate", "energy"]
trace.sample_stats.get(stats).to_dataframe().groupby("chain").mean()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	step_size	n_steps	tree_depth	acceptance_rate	energy
chain
0	1.599252e-10	2047.0	11.0	0.980880	2.555008e+06
1	2.494715e-14	2047.0	11.0	0.987095	2.550493e+06
2	2.656016e-03	2047.0	11.0	0.987374	2.454301e+06
3	1.294592e-08	2047.0	11.0	0.974076	2.548882e+06

HAPPY_CHAINS = [2]

az.plot_trace(trace, var_names=["mu_mu_a", "mu_b", "mu_m"], compact=False)
plt.tight_layout()
plt.show()

az.plot_trace(trace, var_names=["^sigma_*"], filter_vars="regex", compact=False)
plt.tight_layout()

/home/jc604/.conda/envs/speclet/lib/python3.10/site-packages/arviz/stats/density_utils.py:491: UserWarning: Your data appears to have a single value or no finite values
  warnings.warn("Your data appears to have a single value or no finite values")
/home/jc604/.conda/envs/speclet/lib/python3.10/site-packages/arviz/stats/density_utils.py:491: UserWarning: Your data appears to have a single value or no finite values
  warnings.warn("Your data appears to have a single value or no finite values")
/home/jc604/.conda/envs/speclet/lib/python3.10/site-packages/arviz/stats/density_utils.py:491: UserWarning: Your data appears to have a single value or no finite values
  warnings.warn("Your data appears to have a single value or no finite values")
/home/jc604/.conda/envs/speclet/lib/python3.10/site-packages/arviz/stats/density_utils.py:491: UserWarning: Your data appears to have a single value or no finite values
  warnings.warn("Your data appears to have a single value or no finite values")
/home/jc604/.conda/envs/speclet/lib/python3.10/site-packages/arviz/stats/density_utils.py:491: UserWarning: Your data appears to have a single value or no finite values
  warnings.warn("Your data appears to have a single value or no finite values")

sigmas = ["sigma_mu_a", "sigma_b", "sigma_d", "sigma_f", "sigma_k"]
trace.posterior.get(sigmas).mean(dim="draw").to_dataframe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	sigma_mu_a	sigma_b	sigma_d	sigma_f	sigma_k
chain
0	3.350969e-14	1.602743e-12	0.033105	0.570832	1.047392e-05
1	1.621624e-12	2.806242e-16	0.030183	0.078072	2.914971e-11
2	2.303786e-01	5.700978e-02	0.321644	0.129522	3.312310e-02
3	7.609938e-09	1.138677e-10	0.408969	0.004811	1.384195e-03

az.plot_trace(trace, var_names=["alpha"], compact=False)
plt.tight_layout()

az.plot_forest(
    trace, var_names=["^sigma_*"], filter_vars="regex", combined=False, figsize=(5, 5)
)
plt.tight_layout()

trace.posterior = trace.posterior.sel(chain=HAPPY_CHAINS)

az.plot_trace(
    trace, var_names=["mu_mu_a", "mu_b", "mu_k", "mu_mu_m", "mu_m"], compact=True
)
plt.tight_layout()
plt.show()

az.plot_trace(trace, var_names=["^sigma_*"], filter_vars="regex", compact=True)
plt.tight_layout()
plt.show()

az.plot_forest(
    trace,
    var_names=["sigma_mu_a", "sigma_b", "sigma_d", "sigma_f", "sigma_h"],
    figsize=(5, 3),
    combined=True,
);

az.plot_forest(
    trace,
    var_names=["sigma_mu_k", "mu_k", "sigma_k", "mu_mu_m", "sigma_mu_m", "mu_m"],
    figsize=(5, 6),
    combined=True,
);

var_names = ["a", "mu_a", "b", "d", "f", "h"]
_, axes = plt.subplots(2, 3, figsize=(8, 6), sharex=True)
for ax, var_name in zip(axes.flatten(), var_names):
    x = prostate_post_summary.query(f"var_name == '{var_name}'")["mean"]
    sns.kdeplot(x=x, ax=ax)
    ax.set_title(var_name)
    ax.set_xlim(-2, 1)

plt.tight_layout()
plt.show()

sgrna_to_gene_map = (
    prostate_data.copy()[["hugo_symbol", "sgrna"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

for v in ["mu_a", "b", "d", "f", "h", "k", "m"]:
    display(Markdown(f"variable: **{v}**"))
    top = (
        prostate_post_summary.query(f"var_name == '{v}'")
        .sort_values("mean")
        .pipe(head_tail, 5)
    )
    display(top)

variable: mu_a

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	parameter	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat	var_name
7786	mu_a[KIF11]	-0.218	0.573	-1.221	0.119	0.284	0.218	5.0	17.0	3.14	mu_a
6988	mu_a[HSPE1]	-0.173	0.496	-1.044	0.119	0.246	0.188	5.0	17.0	3.15	mu_a
14745	mu_a[SPC24]	-0.169	0.490	-1.025	0.119	0.243	0.186	5.0	17.0	3.15	mu_a
4440	mu_a[DUX4]	-0.165	0.481	-1.005	0.119	0.239	0.182	5.0	17.0	3.14	mu_a
12654	mu_a[RAN]	-0.165	0.481	-1.016	0.119	0.238	0.182	5.0	17.0	3.14	mu_a
941	mu_a[ARHGAP44]	0.185	0.137	0.100	0.420	0.064	0.049	5.0	23.0	3.16	mu_a
12058	mu_a[PRAMEF4]	0.186	0.138	0.100	0.420	0.065	0.049	5.0	23.0	2.89	mu_a
6763	mu_a[HLA-DQB1]	0.187	0.141	0.100	0.426	0.065	0.049	5.0	23.0	2.90	mu_a
16215	mu_a[TP53]	0.200	0.160	0.100	0.480	0.077	0.058	5.0	23.0	3.16	mu_a
2456	mu_a[CCNF]	0.202	0.166	0.100	0.485	0.078	0.059	5.0	23.0	2.90	mu_a

variable: b

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	parameter	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat	var_name
22903	b[EP300]	-0.079	0.150	-0.341	0.016	0.074	0.056	5.0	29.0	3.32	b
34345	b[TP63]	-0.041	0.086	-0.192	0.016	0.042	0.032	5.0	29.0	3.32	b
33384	b[TADA1]	-0.040	0.084	-0.187	0.016	0.041	0.031	5.0	29.0	3.32	b
33099	b[STAG2]	-0.039	0.081	-0.181	0.016	0.039	0.030	5.0	31.0	3.33	b
24977	b[HOXB13]	-0.037	0.079	-0.174	0.016	0.038	0.029	5.0	27.0	3.32	b
27995	b[NDUFB11]	0.062	0.097	0.001	0.232	0.047	0.036	4.0	11.0	3.35	b
19387	b[ATP6V1F]	0.062	0.098	0.001	0.234	0.048	0.036	4.0	11.0	3.35	b
24317	b[GPI]	0.065	0.104	0.001	0.246	0.050	0.038	4.0	11.0	3.35	b
27885	b[NARS2]	0.067	0.106	0.001	0.252	0.052	0.039	4.0	11.0	3.35	b
18592	b[AIFM1]	0.073	0.117	0.001	0.278	0.057	0.044	4.0	11.0	3.35	b

variable: d

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	parameter	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat	var_name
52956	d[UBE2N]	-0.323	0.427	-1.046	-0.024	0.204	0.155	5.0	28.0	3.41	d
40724	d[EARS2]	-0.285	0.389	-0.934	0.010	0.182	0.138	4.0	11.0	3.48	d
44643	d[LONP1]	-0.285	0.367	-0.871	0.010	0.169	0.128	4.0	15.0	3.28	d
44099	d[KLF5]	-0.276	0.368	-0.885	0.017	0.174	0.132	5.0	21.0	3.25	d
51054	d[SPPL3]	-0.270	0.287	-0.591	-0.020	0.128	0.097	6.0	25.0	2.38	d
47602	d[PFDN5]	0.260	0.286	0.002	0.559	0.127	0.096	6.0	20.0	2.19	d
42314	d[GMPS]	0.271	0.326	-0.031	0.683	0.147	0.111	5.0	15.0	2.58	d
38953	d[CDT1]	0.274	0.331	-0.013	0.786	0.154	0.117	4.0	13.0	3.21	d
41624	d[FDFT1]	0.285	0.305	0.017	0.660	0.136	0.103	5.0	14.0	2.53	d
45609	d[MRPL39]	0.290	0.366	-0.028	0.855	0.171	0.130	5.0	28.0	2.97	d

variable: f

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	parameter	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat	var_name
63360	f[MEGF9]	-0.261	0.344	-0.845	-0.000	0.169	0.131	6.0	28.0	2.47	f
70149	f[TMBIM6]	-0.260	0.412	-0.968	0.002	0.204	0.157	6.0	22.0	2.72	f
61266	f[HRASLS2]	-0.259	0.396	-0.938	0.005	0.196	0.151	6.0	13.0	2.73	f
61235	f[HOXD11]	-0.258	0.375	-0.899	-0.002	0.185	0.143	6.0	29.0	2.49	f
56342	f[C7orf57]	-0.254	0.381	-0.905	-0.004	0.188	0.145	6.0	14.0	2.85	f
61786	f[ISCU]	0.262	0.318	0.002	0.779	0.157	0.121	5.0	13.0	3.19	f
68912	f[SNRNP200]	0.264	0.287	0.001	0.728	0.140	0.108	5.0	18.0	2.81	f
63725	f[MRPL36]	0.269	0.330	0.003	0.807	0.162	0.125	5.0	14.0	3.30	f
67712	f[RSL24D1]	0.271	0.371	-0.007	0.889	0.183	0.141	5.0	22.0	3.82	f
58986	f[EIF3CL]	0.304	0.411	-0.004	0.990	0.203	0.157	5.0	23.0	3.13	f

variable: h

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	parameter	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat	var_name
156838	h[RPS4X, ZFHX3]	-0.241	0.305	-0.754	0.001	0.151	0.115	4.0	12.0	3.37	h
151324	h[KIF11, ZFHX3]	-0.215	0.334	-0.799	0.000	0.165	0.126	5.0	17.0	3.10	h
158283	h[SPC24, ZFHX3]	-0.210	0.287	-0.701	0.000	0.141	0.108	4.0	17.0	3.30	h
159963	h[TRNT1, ZFHX3]	-0.209	0.310	-0.747	-0.000	0.153	0.117	5.0	22.0	3.02	h
148218	h[ELL, ZFHX3]	-0.206	0.327	-0.781	-0.000	0.162	0.124	4.0	11.0	3.34	h
149500	h[GIMAP4, ZFHX3]	0.100	0.103	-0.000	0.223	0.048	0.037	5.0	13.0	2.47	h
152541	h[MEGF9, ZFHX3]	0.102	0.116	-0.009	0.249	0.055	0.042	5.0	15.0	2.53	h
150416	h[HOXD11, ZFHX3]	0.104	0.133	-0.008	0.309	0.064	0.049	5.0	19.0	2.99	h
150212	h[HIP1R, ZFHX3]	0.105	0.121	-0.000	0.277	0.057	0.043	5.0	24.0	2.79	h
148322	h[EP300, ZFHX3]	0.153	0.223	-0.002	0.541	0.110	0.084	5.0	28.0	3.22	h

variable: k

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	parameter	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat	var_name
161738	k[ACH-001453__12]	-0.030	0.053	-0.121	-0.000	0.025	0.019	4.0	16.0	3.53	k
161765	k[ACH-001627__16]	-0.026	0.046	-0.104	0.000	0.022	0.017	5.0	25.0	3.13	k
161719	k[ACH-000977__16]	-0.023	0.044	-0.097	0.001	0.021	0.016	5.0	16.0	3.09	k
161722	k[ACH-000977__19]	-0.020	0.039	-0.085	0.001	0.018	0.014	5.0	16.0	3.39	k
161766	k[ACH-001627__17]	-0.019	0.037	-0.079	0.000	0.017	0.013	5.0	21.0	3.19	k
161691	k[ACH-000115__11]	0.016	0.031	-0.001	0.069	0.014	0.010	5.0	15.0	2.97	k
161793	k[ACH-001648__21]	0.016	0.031	-0.001	0.070	0.014	0.010	5.0	17.0	3.06	k
161684	k[ACH-000115__4]	0.016	0.032	-0.001	0.072	0.015	0.011	5.0	27.0	3.05	k
161776	k[ACH-001648__4]	0.018	0.035	-0.001	0.079	0.016	0.012	5.0	16.0	3.21	k
161703	k[ACH-000115__X]	0.019	0.036	-0.000	0.081	0.016	0.012	5.0	14.0	3.14	k

variable: m

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	parameter	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat	var_name
161825	m[ACH-000977__7]	-0.478	0.180	-0.730	-0.293	0.077	0.058	5.0	17.0	2.36	m
161822	m[ACH-000977__4]	-0.465	0.168	-0.696	-0.294	0.074	0.056	5.0	19.0	2.30	m
161837	m[ACH-000977__19]	-0.461	0.111	-0.590	-0.335	0.054	0.041	5.0	16.0	2.48	m
161831	m[ACH-000977__13]	-0.454	0.133	-0.641	-0.336	0.059	0.045	5.0	29.0	2.54	m
161824	m[ACH-000977__6]	-0.447	0.126	-0.575	-0.280	0.059	0.045	5.0	17.0	2.16	m
161890	m[ACH-001648__3]	-0.191	0.054	-0.271	-0.131	0.025	0.020	5.0	15.0	2.32	m
161852	m[ACH-001453__11]	-0.187	0.146	-0.318	0.030	0.065	0.052	5.0	17.0	2.79	m
161861	m[ACH-001453__20]	-0.185	0.151	-0.335	0.052	0.071	0.056	5.0	15.0	3.23	m
161906	m[ACH-001648__19]	-0.178	0.086	-0.311	-0.081	0.042	0.033	4.0	15.0	3.20	m
161893	m[ACH-001648__6]	-0.169	0.106	-0.268	-0.012	0.041	0.031	6.0	28.0	2.22	m

example_genes = ["KIF11", "AR", "NF2"]
az.plot_trace(
    trace,
    var_names=["mu_a", "b", "d", "f", "h"],
    coords={"gene": example_genes},
    compact=True,
)
plt.tight_layout()
plt.show()

sgrnas_sample = trace.posterior.coords["sgrna"].values[:5]

az.plot_trace(trace, var_names="a", coords={"sgrna": sgrnas_sample}, compact=False)
plt.tight_layout()
plt.show()

az.plot_forest(trace, var_names=["mu_k", "mu_m"], combined=True, figsize=(5, 4))
plt.tight_layout()
plt.show()

cell_chromosome_map = (
    valid_prostate_data[["depmap_id", "sgrna_target_chr", "cell_chrom"]]
    .drop_duplicates()
    .sort_values("cell_chrom")
    .reset_index(drop=True)
)

chromosome_effect_post = (
    az.summary(trace, var_names=["k", "m"], kind="stats")
    .pipe(extract_coords_param_names, names="cell_chrom")
    .assign(var_name=lambda d: [p[0] for p in d.index.values])
    .merge(cell_chromosome_map, on="cell_chrom")
)

cell_effect_post = (
    az.summary(trace, var_names=["mu_k", "mu_m"], kind="stats")
    .pipe(extract_coords_param_names, names="depmap_id")
    .assign(var_name=lambda d: [p[:4] for p in d.index.values])
    .reset_index(drop=True)
    .pivot_wider(
        index="depmap_id",
        names_from="var_name",
        values_from=["mean", "hdi_5.5%", "hdi_94.5%"],
    )
)

ncells = chromosome_effect_post["depmap_id"].nunique()
ncols = 3
nrows = ceil(ncells / 3)
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(9, nrows * 3))
for ax, (cell, data_c) in zip(
    axes.flatten(), chromosome_effect_post.groupby("depmap_id")
):
    plot_df = data_c.pivot_wider(
        index="sgrna_target_chr",
        names_from="var_name",
        values_from=["mean", "hdi_5.5%", "hdi_94.5%"],
    )
    cell_eff = (
        cell_effect_post.copy().query(f"depmap_id == '{cell}'").reset_index(drop=True)
    )

    ax.axhline(0, color="k", zorder=1)
    ax.axvline(0, color="k", zorder=1)

    ax.vlines(
        x=plot_df["mean_k"],
        ymin=plot_df["hdi_5.5%_m"],
        ymax=plot_df["hdi_94.5%_m"],
        alpha=0.5,
        zorder=5,
    )
    ax.hlines(
        y=plot_df["mean_m"],
        xmin=plot_df["hdi_5.5%_k"],
        xmax=plot_df["hdi_94.5%_k"],
        alpha=0.5,
        zorder=5,
    )
    sns.scatterplot(data=plot_df, x="mean_k", y="mean_m", ax=ax, zorder=10)

    ax.vlines(
        x=cell_eff["mean_mu_k"],
        ymin=cell_eff["hdi_5.5%_mu_m"],
        ymax=cell_eff["hdi_94.5%_mu_m"],
        alpha=0.8,
        zorder=15,
        color="red",
    )
    ax.hlines(
        y=cell_eff["mean_mu_m"],
        xmin=cell_eff["hdi_5.5%_mu_k"],
        xmax=cell_eff["hdi_94.5%_mu_k"],
        alpha=0.8,
        zorder=15,
        color="red",
    )
    sns.scatterplot(
        data=cell_eff, x="mean_mu_k", y="mean_mu_m", ax=ax, zorder=20, color="red"
    )
    ax.set_title(cell)
    ax.set_xlabel(None)
    ax.set_ylabel(None)

for ax in axes[-1, :]:
    ax.set_xlabel("$k$")
for ax in axes[:, 0]:
    ax.set_ylabel("$m$")
for ax in axes.flatten()[ncells:]:
    ax.axis("off")

fig.tight_layout()
plt.show()

for v, data_v in chromosome_effect_post.groupby("var_name"):
    df = (
        data_v.copy()
        .reset_index(drop=True)
        .pivot_wider(
            index="depmap_id", names_from="sgrna_target_chr", values_from="mean"
        )
        .set_index("depmap_id")
    )
    cg = sns.clustermap(df, figsize=(6, 4))
    cg.ax_col_dendrogram.set_title(f"variable: ${v}$")
    plt.show()

genes_var_names = ["mu_a", "b", "d", "f"]
genes_var_names += [f"h[{g}]" for g in trace.posterior.coords["cancer_gene"].values]
gene_corr_post = (
    az.summary(trace, "genes_chol_cov_corr", kind="stats")
    .pipe(extract_coords_param_names, names=["d1", "d2"])
    .astype({"d1": int, "d2": int})
    .assign(
        p1=lambda d: [genes_var_names[i] for i in d["d1"]],
        p2=lambda d: [genes_var_names[i] for i in d["d2"]],
    )
    .assign(
        p1=lambda d: pd.Categorical(d["p1"], categories=d["p1"].unique(), ordered=True)
    )
    .assign(
        p2=lambda d: pd.Categorical(
            d["p2"], categories=d["p1"].cat.categories, ordered=True
        )
    )
)

plot_df = gene_corr_post.pivot_wider("p1", "p2", "mean").set_index("p1")
sns.heatmap(plot_df, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()

cancer_genes = trace.posterior.coords["cancer_gene"].values.tolist()
cancer_gene_mutants = (
    valid_prostate_data.filter_column_isin("hugo_symbol", cancer_genes)[
        ["hugo_symbol", "depmap_id", "is_mutated"]
    ]
    .drop_duplicates()
    .assign(is_mutated=lambda d: d["is_mutated"].map({True: "X", False: ""}))
    .pivot_wider("depmap_id", names_from="hugo_symbol", values_from="is_mutated")
    .set_index("depmap_id")
)
cancer_gene_mutants

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	ZFHX3
depmap_id
ACH-000115
ACH-000977	X
ACH-001453
ACH-001627	X
ACH-001648

az.plot_trace(trace, var_names=["sigma_h"], compact=False)
plt.tight_layout()
plt.show()

h_post_summary = (
    prostate_post_summary.query("var_name == 'h'")
    .reset_index(drop=True)
    .pipe(
        extract_coords_param_names,
        names=["hugo_symbol", "cancer_gene"],
        col="parameter",
    )
)

ax = sns.kdeplot(data=h_post_summary, x="mean", hue="cancer_gene")
ax.set_xlabel(r"$\bar{h}_g$ posterior")
ax.set_ylabel("density")
ax.get_legend().set_title("cancer gene\ncomut.")
plt.show()

fig, axes = plt.subplots(
    len(cancer_genes), 1, squeeze=False, figsize=(8, len(cancer_genes) * 5)
)
for ax, cg in zip(axes.flatten(), cancer_genes):
    h_hits = (
        h_post_summary.filter_column_isin("cancer_gene", [cg])
        .sort_values("mean")
        .pipe(head_tail, n=5)["hugo_symbol"]
        .tolist()
    )

    h_hits_data = (
        valid_prostate_data.filter_column_isin("hugo_symbol", h_hits)
        .merge(cancer_gene_mutants.reset_index(), on="depmap_id")
        .reset_index()
        .astype({"hugo_symbol": str})
        .assign(
            hugo_symbol=lambda d: pd.Categorical(d["hugo_symbol"], categories=h_hits),
            _cg_mut=lambda d: d[cg].map({"X": "mut.", "": "WT"}),
        )
    )
    mut_pal = {"mut.": "tab:red", "WT": "gray"}
    sns.boxplot(
        data=h_hits_data,
        x="hugo_symbol",
        y="lfc",
        hue="_cg_mut",
        palette=mut_pal,
        ax=ax,
        showfliers=False,
        boxprops={"alpha": 0.5},
    )
    sns.stripplot(
        data=h_hits_data,
        x="hugo_symbol",
        y="lfc",
        hue="_cg_mut",
        dodge=True,
        palette=mut_pal,
        ax=ax,
    )
    sns.move_legend(ax, loc="upper left", bbox_to_anchor=(1, 1), title=cg)

plt.tight_layout()
plt.show()

Session info

notebook_toc = time()
print(f"execution time: {(notebook_toc - notebook_tic) / 60:.2f} minutes")

execution time: 2.41 minutes

%load_ext watermark
%watermark -d -u -v -iv -b -h -m

Last updated: 2022-08-08

Python implementation: CPython
Python version       : 3.10.5
IPython version      : 8.4.0

Compiler    : GCC 10.3.0
OS          : Linux
Release     : 3.10.0-1160.45.1.el7.x86_64
Machine     : x86_64
Processor   : x86_64
CPU cores   : 32
Architecture: 64bit

Hostname: compute-a-16-170.o2.rc.hms.harvard.edu

Git branch: varying-chromosome

seaborn   : 0.11.2
arviz     : 0.12.1
matplotlib: 3.5.2
numpy     : 1.23.1
pandas    : 1.4.3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

033_single-lineage-prostate-inspection_008.md

033_single-lineage-prostate-inspection_008.md

Inspect the single-lineage model run on the prostate data (008)

Data

Load posterior summary

Load trace object

Prostate data

Single lineage model

Analysis

Session info

Files

033_single-lineage-prostate-inspection_008.md

Latest commit

History

033_single-lineage-prostate-inspection_008.md

File metadata and controls

Inspect the single-lineage model run on the prostate data (008)

Data

Load posterior summary

Load trace object

Prostate data

Single lineage model

Analysis

Session info