Crosspopulation prediction analysis using HapMap3 data

Analysis of crosspopulation prediction capacity using a simulation of two ancestral populations and an admixed one from HapMap3 phase 3 data.

Prerequisites

Python 2.7 (note: scripts here are incompatible with Python 3)
R version 3.4.3 or higher
a Linux computing environment

All other requisite binaries and external data are downloaded and configured automatically.

This analysis makes extensive use of Rscript. It assumes a correct installation and configuration of the default R environment.
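Before running anything, it can help to confirm that the interpreters listed above are visible on the PATH. The following is a minimal sketch, not part of the repository: the function names are hypothetical, the interpreter name `python` is an assumption (some systems expose Python 2.7 under a different name), and the version comparison relies on `sort -V` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch of a prerequisite check. Function names and the interpreter
# name `python` are assumptions, not part of the repository.

# Succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (GNU coreutils) to order version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

check_prereqs() {
    local r_version
    if have_cmd Rscript; then
        # Ask R itself for its version string, e.g. "3.4.3".
        r_version=$(Rscript -e 'cat(as.character(getRversion()))')
        if version_ge "$r_version" "3.4.3"; then
            echo "R ${r_version}: ok"
        else
            echo "R ${r_version}: older than 3.4.3"
        fi
    else
        echo "Rscript: not found on PATH"
    fi

    if have_cmd python; then
        # Python 2.7 prints its version to stderr, hence 2>&1.
        case "$(python --version 2>&1)" in
            "Python 2.7"*) echo "python: 2.7 ok" ;;
            *)             echo "python: not 2.7" ;;
        esac
    else
        echo "python: not found on PATH"
    fi
}

check_prereqs
```

The check only reports what it finds; it does not install or configure anything.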

The R packages used here are

tidyverse, particularly ggplot2 and dplyr
knitr
data.table
optparse
doParallel
glmnet
assertthat

All of these packages are registered on CRAN. Install these packages with any standard approach. From within R, a one-shot approach to installing packages is to type

install.packages(c("tidyverse", "knitr", "data.table", "optparse", "doParallel", "glmnet", "assertthat"))

assuming compilers, library paths, CRAN repository, and write permissions are all configured correctly.

Additionally, the following Bioconductor packages are required:

Install these packages using the Bioconductor protocol.

Running

The analysis is divided into the following parts, each with its own documentation:

Simulating haplotypes (docs)
Estimating and evaluating prediction models (docs)
Compiling and plotting results (docs)
Simulating a TWAS (docs)

The analysis will generate a directory ./src/analysis with all output files.

Each analysis step has its own BASH script. Run the scripts in sequential order. A demonstration script is provided for this purpose:

./src/run_analysis.sh

Note that step (2) uses a parallel computing framework that sends jobs into the shell background. The user must wait for all background jobs to complete before proceeding to step (3).
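One way to block until background jobs finish is the shell builtin `wait`. The sketch below is illustrative only: `run_job` and `demo_wait` are hypothetical stand-ins, not functions from this repository. It also assumes the jobs are children of the shell that calls `wait`; if the step (2) script launches its jobs and then exits, they are no longer children of the calling shell, and the user would instead need to monitor them by other means (for example with `ps`).

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a single step (2) job.
run_job() {
    sleep 0.1
    echo "job $1 done"
}

# Launch a few jobs in the background, then block until all have exited.
demo_wait() {
    local i
    for i in 1 2 3; do
        run_job "$i" &
    done
    # With no arguments, `wait` returns only after every background
    # child of the current shell has finished.
    wait
    echo "all background jobs finished"
}

demo_wait
```

The three job lines may print in any order, but the final message always appears last, because it is printed only after `wait` returns.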

Notes

Simulations from step (1) are not reproducible. Simulated data used in the full analysis can be found here.

The parameters for testing different prediction scenarios in step (2) are set for demonstration purposes only.

The full analysis tests

4 model sizes
100 random seeds
98 genes
11 shared eQTL proportions

for a total of 431,200 jobs. Consequently, the full analysis generates a HUGE amount of data. DO NOT RUN a full analysis without at least 1.5 TB of disk space!
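The job count is the product of the four parameter counts listed above. The sketch below reproduces that arithmetic and adds a rough free-space check before launching a full run. The variable names are hypothetical, the check inspects the current directory by default rather than any path fixed by the repository, and 1.5 TB is taken to mean 1.5 × 1024³ KiB, which is an assumption about units.

```shell
#!/usr/bin/env bash
# Parameter counts quoted in the notes.
model_sizes=4
seeds=100
genes=98
eqtl_proportions=11

total_jobs=$(( model_sizes * seeds * genes * eqtl_proportions ))
echo "total jobs: ${total_jobs}"   # prints "total jobs: 431200"

# 1.5 TB expressed in 1 KiB blocks (assumes 1 TB = 1024^3 KiB).
required_kb=$(( 3 * 1024 * 1024 * 1024 / 2 ))

# Directory to check; defaults to the current directory (an assumption).
target_dir="${1:-.}"

# `df -Pk` prints POSIX-format output in 1 KiB blocks;
# the fourth column of the second line is the available space.
available_kb=$(df -Pk "$target_dir" | awk 'NR == 2 { print $4 }')

if [ -n "$available_kb" ] && [ "$available_kb" -lt "$required_kb" ]; then
    echo "insufficient disk space: ${available_kb} KiB available, ${required_kb} KiB recommended"
else
    echo "disk space check passed or could not be determined"
fi
```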
