Notebooks

Notebooks were used for experimentation and analysis, not for model fitting; model fitting was performed in the model-fitting pipeline. Each subdirectory contains notebooks for a set of analyses.

Many of these notebooks can no longer be run from start to finish because the shared code has been altered and the notebooks have not been updated. This is not an issue for notebooks that are for experimentation or reference. The expected behavior is indicated for each group of notebooks. Note that the provided links are to the Markdown files generated from each notebook.

reproducibility-full

Notebooks for general exploratory data analysis.

  1. Basic data statistics and plots.
  2. Exploration of molecular variates.
  3. Cell line lineages and lineage subtypes. Understanding the lineage and lineage subtype relationships.
  4. Exploration of data by cell line lineage. Testing how well the cell lines can be clustered by the raw log-fold change data.

reproducibility-limited

These notebooks are for experimenting with various model designs, each one using different covariates and model structures. They primarily use synthetic data and test whether the known values can be recovered.

These were early notebooks and are retained for reference only.

  1. Model 1. Standard linear model of one gene using RNA expression as a predictor.
  2. Model 2. A hierarchical linear model of multiple genes with a varying intercept and slope on RNA expression.
  3. Model 3. (Failed) A hierarchical model of multiple genes and cell lines with a varying intercept for each. The gene-level model consisted of an intercept and a slope for RNA expression.
  4. Model 4. A hierarchical model of multiple genes and cell lines with a varying intercept for each and a slope for each gene on RNA expression.
  5. Model 5. A multi-level hierarchical model with the main level consisting of two varying intercepts, one for the sgRNA and one for the cell line. The sgRNA varying intercept had an additional level where each guide came from a distribution for each gene.
  6. Model 6. A hierarchical model with varying effects for the intercept and slope pooling across sgRNA and gene. The slope is on synthetic copy number data.
  7. Model 7. A model with a single 2D varying intercept with one dimension for sgRNA and one for cell line, followed by an attempt to have the sgRNA dimension vary by gene. A model with two varying intercepts was also successfully fit in this notebook.
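
The synthetic-data check described above (simulate from known values, fit, and see whether the values are recovered) can be sketched for the simplest case, Model 1. This is a minimal illustration, not the notebooks' code: it swaps in closed-form ordinary least squares for whatever fitting method the notebooks used, and the function names, true parameter values, and noise level are all assumptions chosen for the example.

```python
import random


def simulate_gene(n, intercept, slope, noise_sd, seed=0):
    """Simulate log-fold change values for one gene from a known linear model."""
    rng = random.Random(seed)
    # Hypothetical standardized RNA expression values (the predictor).
    rna = [rng.gauss(0.0, 1.0) for _ in range(n)]
    lfc = [intercept + slope * x + rng.gauss(0.0, noise_sd) for x in rna]
    return rna, lfc


def fit_ols(x, y):
    """Closed-form least-squares estimates of the intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Known values (assumed for illustration) that the fit should recover.
TRUE_INTERCEPT, TRUE_SLOPE = -0.5, -1.2
rna, lfc = simulate_gene(2000, TRUE_INTERCEPT, TRUE_SLOPE, noise_sd=0.3, seed=1)
est_intercept, est_slope = fit_ols(rna, lfc)
print(f"intercept: true={TRUE_INTERCEPT}, estimated={est_intercept:.3f}")
print(f"slope:     true={TRUE_SLOPE}, estimated={est_slope:.3f}")
```

The same pattern extends to the hierarchical models: draw the group-level parameters first, simulate each gene or cell line from them, and compare the fitted values against the draws.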

reproducibility-limited

A series of notebooks that build the final model based on the earlier experimentation. The model was constructed one piece at a time, experimenting with the effect of the latest addition on the model fit and other variables.

  1. NB distributions. Some initial experiments with NB distributions.
  2. NB likelihoods. Basic practice with generalized linear models with NB likelihoods.
  3. NB GLMs on CRISPR screen data. Generalized linear models with a NB likelihood fit with example CRISPR screen data.
  4. Exposure in CRISPR screen models. Experimentation with different measures of "exposure" for a NB model of CRISPR screen data.
  5. Initial CRISPR screen models. Some simpler GLMs with a NB likelihood and some interesting covariates for modeling CRISPR data.
  6. Comparing LM to GLM. A comparison of similar GLMs except with either an identity or exponential link function and Gaussian or NB likelihood.
  7. Add CN covariate. Introducing the copy number of the target location as a covariate to the model.
  8. Add screen source covariate. Experimenting with adding the source of the screen data as a batch effect. This was later removed and only a single screen was used (from the Broad).
  9. Cancer gene comutation matrix example. Experimenting with how to build the comutation matrix and integrate it into the model.
  10. Cancer gene comutation matrix into the current model. Example implementation of the cancer gene comutation covariate.
  11. Cancer gene comutation with CGC list. Building the cancer gene comutation covariate from the CGC cancer gene list.
  12. Experimentation with current model structure. Some general exploration of the current model and how it behaves with real data.
  13. Experimenting with a simpler model. The model was simplified in order to move the project along.
  14. Introduce a chromosome-varying effect. Introduction of another hierarchical layer in the cell line-effect variables to account for differential sensitivity of each chromosome of each cell line.
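
Several of the notebooks above revolve around a negative binomial (NB) likelihood with an exponential link and an "exposure" term. The sketch below shows one common way these pieces fit together, assuming the mean/dispersion parameterization in which the variance is `mu + mu**2 / alpha`. The coefficient names (`beta0`, `beta_cn`) and all numeric values are hypothetical and are not taken from the notebooks or the final model.

```python
import math


def nb_logpmf(y, mu, alpha):
    """Log-probability of count y under a NB with mean mu and dispersion alpha.

    The variance is mu + mu**2 / alpha, so larger alpha means less overdispersion.
    """
    return (
        math.lgamma(y + alpha) - math.lgamma(alpha) - math.lgamma(y + 1)
        + alpha * math.log(alpha / (alpha + mu))
        + y * math.log(mu / (alpha + mu))
    )


def expected_count(exposure, beta0, beta_cn, copy_number):
    """Exponential inverse link: mu = exposure * exp(linear predictor)."""
    return exposure * math.exp(beta0 + beta_cn * copy_number)


# Hypothetical values: an exposure of 100 counts and a small copy number effect.
mu = expected_count(exposure=100.0, beta0=-0.3, beta_cn=-0.1, copy_number=2.0)
alpha = 5.0

# Sanity checks: the probabilities sum to one and the mean equals mu.
probs = [math.exp(nb_logpmf(y, mu, alpha)) for y in range(2000)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
print(f"mu={mu:.3f}, sum of pmf={total:.6f}, mean of pmf={mean:.3f}")
```

Because the exposure multiplies the exponentiated linear predictor, the coefficients describe a relative change in the expected count rather than an absolute shift, which is the main difference from a Gaussian model with an identity link.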

The following notebooks analyze runs of the model on the prostate, liver, and colorectal lineages. These lineages were chosen because they represent datasets of different sizes and complexities (e.g. the number of cancer genes in the comutation matrix). The model or MCMC parameters were altered slightly from version to version. (The posterior data for the models no longer exists, so these notebooks are retained primarily for reference.)

  1. Experimental run with prostate (model v001).
  2. Experimental run with prostate (model v002).
  3. Experimental run with prostate (model v003).
  4. Experimental run with prostate (model v004).
  5. Experimental run with multiple lineages (model v004).
  6. Experimental run with colorectal (model v004).
  7. Experimental run with colorectal (model v005).
  8. Experimental run with prostate (model v006).
  9. Experimental run with prostate (model v007).
  10. Experimental run with liver (model v007).
  11. Experimental run with colorectal (model v007).
  12. Experimental run with prostate (model v008).
  13. Experimental run with prostate (model v009).
  14. Experimental run with colorectal (model v009).

reproducibility-full

Final analyses of fitting the final model to all of the cell line lineages. Many of the figures were generated using the results of these analyses.

  1. First look at fit models. A preliminary look at the fit models for all of the lineages.
  2. Model diagnostics. Collect MCMC and model-fit diagnostics.
  3. Simple descriptions of models. Summary statistics on the dynamic features of the models (e.g. number of cancer genes included).
  4. Gene essentiality. Analyze base "essentiality" of genes by lineage.
  5. Molecular and cellular covariates. Analysis of the covariates for molecular data on the gene and cell line levels.
  6. Gene mutation effects. Effects of the mutation of the target gene and discovery of putative driver genes.
  7. Cancer gene comutation analysis. Analysis of the cancer gene comutation variables to discover possible synthetic lethal interactions.

Some other notebooks are in this directory, but they are from earlier analyses and can be ignored.

  1. Preliminary analysis with a few lineages.
  2. Small analysis with more lineages.
  3. Analyzing the molecular covariates with the preliminary models.

reproducibility-limited

These are small, experimental notebooks that exist to quickly test ideas for other uses. They may not be fully reproducible and exist primarily for reference.

  1. Multiple varying intercepts example. An example model for fitting multiple varying intercepts.
  2. Saving and loading PyMC3 models and samples. Testing various methods for wrapping pm.sample() for automatic caching and re-loading.
  3. Fitting splines. How to fit splines with PyMC3. I have yet to get a working multi-level model.
  4. Simple SBC example. A quick proof-of-concept for simulation-based calibration workflow.
  5. Combining MCMC chains. How to combine MCMC chains run separately into a single ArviZ InferenceData object.
  6. Scaling copy number data. Effects of different transformations on copy number data.
  7. Scaling RNA expression data. Effects of different transformations on RNA expression data.
  8. Mixing centered and non-centered parameterizations.
  9. PyMC vs. Stan. Simple comparison of MCMC speed and performance between PyMC and Stan building the same models.

Miscellaneous notebooks for testing ideas not related to model design.

  1. PyMC custom callback.
  2. Different PyMC backends.
  3. Different PyMC backends on O2.