Analysis of cross-population prediction capacity using a simulation of two ancestral populations and one admixed population from HapMap phase 3 data.
The analysis requires:
- Python 2.7 (note: scripts here are incompatible with Python 3)
- R version 3.4.3 or higher
- a Linux computing environment
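As a rough sketch, the prerequisites above could be checked from the shell before starting. The helpers `check_cmd` and `version_ge` are hypothetical and not part of this repository. The interpreter name `python2.7` is an assumption about how Python 2.7 is installed, and `sort -V` assumes GNU coreutils.

```shell
# Hypothetical helper: succeed if a command is available on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Hypothetical helper: succeed if version $1 >= version $2 (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Count required commands that are not on PATH (names are assumptions).
missing=0
for cmd in python2.7 Rscript; do
    check_cmd "$cmd" || missing=$((missing + 1))
done

# Compare the installed R version against the 3.4.3 minimum.
if command -v Rscript >/dev/null 2>&1; then
    r_version=$(Rscript -e 'cat(as.character(getRversion()))')
    version_ge "$r_version" "3.4.3" || echo "R $r_version is older than 3.4.3"
fi

[ "$(uname -s)" = "Linux" ] || echo "warning: not a Linux environment"
echo "missing commands: $missing"
```

A nonzero count in the last line means at least one prerequisite still needs to be installed.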
All other requisite binaries and external data are downloaded and configured automatically.
This analysis makes extensive use of Rscript.
It assumes a correct installation and configuration of the default R environment.
The R packages used here are
- tidyverse, particularly ggplot2 and dplyr
- knitr
- data.table
- optparse
- doParallel
- glmnet
- assertthat
All of these packages are available on CRAN and can be installed with any standard approach. From within R, a one-shot approach is to run
install.packages(c("tidyverse", "knitr", "data.table", "optparse", "doParallel", "glmnet", "assertthat"))
assuming that compilers, library paths, the CRAN repository, and write permissions are all configured correctly.
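To confirm that the CRAN packages are in place before running anything, one option is to ask R which of them are missing. The following is only a sketch: `r_vector` is a hypothetical helper, not part of this repository, and the check relies on the base R functions `installed.packages()` and `setdiff()`.

```shell
# CRAN packages named in the list above.
PKGS="tidyverse knitr data.table optparse doParallel glmnet assertthat"

# Hypothetical helper: format its arguments as an R character vector.
r_vector() {
    out=""
    for p in "$@"; do
        out="${out:+$out, }\"$p\""
    done
    printf 'c(%s)' "$out"
}

# R expression that reports packages not yet installed.
# $PKGS is left unquoted on purpose so that it splits into one word per package.
r_expr="pkgs <- $(r_vector $PKGS); miss <- setdiff(pkgs, rownames(installed.packages())); if (length(miss) > 0) cat('missing:', miss, '\n') else cat('all packages installed\n')"

if command -v Rscript >/dev/null 2>&1; then
    Rscript -e "$r_expr"
else
    printf '%s\n' "Rscript not found; would have run: $r_expr"
fi
```

Any package reported as missing can then be installed from within R as described above.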
Additionally, the following Bioconductor packages are required:
Install these packages using the Bioconductor protocol.
The analysis is divided into four parts, each with its own documentation:
- Simulating haplotypes (docs)
- Estimating and evaluating prediction models (docs)
- Compiling and plotting results (docs)
- Simulating a TWAS (docs)
The analysis will generate a directory ./src/analysis with all output files.
Each analysis step has its own Bash script. Run the scripts in sequential order. A demonstration script is provided for this purpose:
./src/./run_analysis.sh
Note that step (2) uses a parallel computing framework that sends jobs into the shell background. The user must wait for all background jobs to complete before proceeding to step (3).
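The waiting requirement can be illustrated with the shell builtin `wait`. This is a minimal sketch, not the repository's own scheduling code: the three `sleep` jobs and the temporary directory are stand-ins for the real step (2) jobs.

```shell
# Stand-in for step (2): three dummy jobs sent to the background.
outdir=$(mktemp -d)
for i in 1 2 3; do
    ( sleep 1; echo "job $i finished" > "$outdir/job_$i.log" ) &
done

# Block until every background job started by this shell has exited.
wait

# Only now is it safe to move on to the next step.
n_done=$(ls "$outdir" | wc -l | tr -d ' ')
echo "completed jobs: $n_done"
```

Note that `wait` only sees jobs that are children of the current shell. If the step (2) script returns while its jobs are still running, check for the remaining processes by other means (for example with `ps`) before starting step (3).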
Simulations from step (1) are not reproducible. Simulated data used in the full analysis can be found here.
The parameters for testing different prediction scenarios in step (2) are set for demonstration purposes only.
The full analysis tests
- 4 model sizes
- 100 random seeds
- 98 genes
- 11 shared eQTL proportions

for a total of 4 × 100 × 98 × 11 = 431,200 jobs. Consequently, the full analysis generates a HUGE amount of data. DO NOT RUN a full analysis without at least 1.5 TB of disk space!
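The job count and the disk-space requirement can both be checked from the shell before launching a full analysis. The sketch below makes two assumptions: 1.5 TB is read as 1.5 × 10^12 bytes, and the output lands on the filesystem that holds the current directory.

```shell
# Job count: model sizes x random seeds x genes x shared eQTL proportions.
n_jobs=$((4 * 100 * 98 * 11))
echo "total jobs: $n_jobs"

# 1.5 TB read as 1.5e12 bytes (an assumption), expressed in 1024-byte blocks.
required_kb=$((1500000000000 / 1024))

# Free space, in 1024-byte blocks, on the filesystem holding the current directory.
available_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')

if [ "$available_kb" -ge "$required_kb" ]; then
    echo "enough free space for a full analysis"
else
    echo "not enough free space: ${available_kb} KB free, ${required_kb} KB needed"
fi
```

The first `echo` prints `total jobs: 431200`, matching the total stated above.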