Crosspopulation prediction analysis using HapMap3 data

Analysis of crosspopulation prediction capacity, using haplotypes simulated from HapMap phase 3 data for two ancestral populations and one admixed population.

Prerequisites

  • Python 2.7 (note: scripts here are incompatible with Python 3)
  • R version 3.4.3 or higher
  • a Linux computing environment

All other requisite binaries and external data are downloaded and configured automatically.

This analysis makes extensive use of Rscript. It assumes a correct installation and configuration of the default R environment.

The R packages used here are

  • tidyverse, particularly ggplot2 and dplyr
  • knitr
  • data.table
  • optparse
  • doParallel
  • glmnet
  • assertthat

All of these packages are registered on CRAN. Install these packages with any standard approach. From within R, a one-shot approach to installing packages is to type

install.packages(c("tidyverse", "knitr", "data.table", "optparse", "doParallel", "glmnet", "assertthat"))

assuming compilers, library paths, CRAN repository, and write permissions are all configured correctly.
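Before running anything, it can help to confirm that the prerequisites are visible from the shell. The following is a minimal sketch, not part of the repository: the executable name `python2.7` is an assumption (some systems expose Python 2.7 as `python2` or `python`), and the package check simply asks R whether each CRAN package listed above can be loaded.

```shell
#!/usr/bin/env bash
# Sketch of a prerequisite check. The tool names below are assumptions;
# adjust them to match how Python 2.7 and R are installed on your system.
checked=0
missing=0
for tool in Rscript python2.7; do
    checked=$((checked + 1))
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found: $tool"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done

# If Rscript is available, report any CRAN packages from the list above
# that R cannot load. requireNamespace() returns FALSE for absent packages.
if command -v Rscript >/dev/null 2>&1; then
    Rscript -e '
        pkgs <- c("tidyverse", "knitr", "data.table", "optparse",
                  "doParallel", "glmnet", "assertthat")
        ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
        if (all(ok)) {
            cat("all listed R packages found\n")
        } else {
            cat("missing R packages:", pkgs[!ok], "\n")
        }
    '
fi

echo "tools checked: $checked, tools missing: $missing"
```

The script only reports what it finds; it does not install anything or check the R version.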

Additionally, the following Bioconductor packages are required:

Install these packages using the Bioconductor protocol.

Running

The analysis is divided into four parts, each with its own documentation:

  1. Simulating haplotypes (docs)
  2. Estimating and evaluating prediction models (docs)
  3. Compiling and plotting results (docs)
  4. Simulating a TWAS (docs)

The analysis will generate a directory ./src/analysis with all output files.

Each analysis step has its own BASH script. Run the scripts in sequential order. A demonstration script is provided for this purpose:

./src/run_analysis.sh

Note that step (2) uses a parallel computing framework that sends jobs into the shell background. The user must wait for all background jobs to complete before proceeding to step (3).
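One generic way to block until background jobs finish is the shell builtin `wait`. The sketch below illustrates the pattern only; `run_job` is a hypothetical stand-in, not a function from this repository. Note the assumption it relies on: `wait` only sees jobs started by the current shell. If the step (2) script is run as a separate process and backgrounds its jobs internally, a `wait` in the calling shell will not cover them, and the jobs must be monitored another way (for example, by watching the process list).

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for one prediction job from step (2).
run_job() {
    sleep 1
    echo "job $1 finished"
}

# Launch several jobs in the background of the current shell.
for i in 1 2 3; do
    run_job "$i" &
done

# Block until every background child of this shell has exited.
wait
all_jobs_done=1
echo "all background jobs complete; safe to start step (3)"
```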

Notes

Simulations from step (1) are not reproducible. Simulated data used in the full analysis can be found here.

The parameters for testing different prediction scenarios in step (2) are set for demonstration purposes only.

The full analysis tests

  • 4 model sizes
  • 100 random seeds
  • 98 genes
  • 11 shared eQTL proportions

for a total of 4 × 100 × 98 × 11 = 431,200 jobs. Consequently, the full analysis generates a HUGE amount of data. DO NOT RUN a full analysis without at least 1.5TB of disk space!
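As a sanity check before launching the full analysis, the job count and the free disk space can be computed in the shell. This is a rough sketch under stated assumptions: the variable names are illustrative, the check looks at the filesystem holding the current directory (which may not be where the output is written), and 1.5TB is approximated as 1.5 × 10^9 one-kilobyte blocks.

```shell
#!/usr/bin/env bash
# Full-analysis dimensions, taken from the list above.
model_sizes=4
seeds=100
genes=98
eqtl_props=11

total_jobs=$(( model_sizes * seeds * genes * eqtl_props ))
echo "total jobs: ${total_jobs}"   # 431200

# Approximate 1.5TB expressed in 1K blocks (assumption: decimal terabytes).
required_kb=$(( 1500 * 1000 * 1000 ))

# Free space, in 1K blocks, on the filesystem holding the current directory.
# With -P, df prints one header line and one data line; column 4 is "Available".
available_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')

if [ "${available_kb:-0}" -lt "$required_kb" ]; then
    echo "warning: less than ~1.5TB free here; do not run the full analysis"
else
    echo "disk space looks sufficient for the full analysis"
fi
```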