SUMmary statistics-based Randomized Haseman-Elston regression
sumrhe
is an efficient tool for accurately estimating:
- heritability of phenotypes from summary statistics
- genome-wide LD scores from biobank data
You may set up sumrhe
using Conda (Anaconda3 or Miniconda) or virtual environment and pip (Miniconda or Python3)
- Clone the repository
git clone https://github.com/bronsonj98/SUMRHE.git
cd SUMRHE
- Create the conda environment
conda env create -f environment.yml
- Activate the conda environment
conda activate sumrhe
- Clone the repository
git clone https://github.com/bronsonj98/SUMRHE.git
cd SUMRHE
- Create a virtual environment
python -m venv sumrhe
- Activate the virtual environment
# on Windows
sumrhe\Scripts\activate
# on macOS/Linux
source sumrhe/bin/activate
- Install dependencies
pip install -r requirements.txt
sumrhe
can accurately (i.e., comparable to methods that use individual-level data) estimate heritability from summary-level data.
To estimate (partitioned) heritability, you need the following:
- LD scores (either genome-wide or fixed-window) OR trace summaries calculated from in-sample or reference population genotype.
- GWAS summary statistics for the trait of interest
Here is an example run command (h2_ldscore.sh
in the example directory):
python3 ../src/sumrhe.py --pheno ./sim_50k_h2_0.25_p_0.01.sumstat \
--ldscores ./double_uniform_10k_stoc_k100.gw.ldscore.gz \
--annot ./double_uniform_0.2.txt \
--verbose \
--njack 1000
Running the script should result in a total heritability estimate h^2
of 0.27439 and SE
of 0.01578. Here, the partitioned genome-wide LD scores (gw.ldscore.gz
) estimated using k=100
random vectors and N=10,000
reference sample have been used.
Currently, to estimate partitioned heritability, the "thin" annotation file used for calculating the LD scores must also be provided separately (to be updated).
Similarly, you may estimate heritability using the trace summary statistics (.tr), which is mathematically equivalent to genome-wide sum of the LD scores (h2_tr.sh
in the example directory):
python3 ../src/sumrhe.py --pheno ./sim_50k_h2_0.25_p_0.01.sumstat \
--trace ./double_uniform_50k_k100 \
--annot ./double_uniform_0.2.txt \
--verbose
You should get an heritability estimate of 0.26851 and SE
of 0.01512. Note, the h^2
estimates are different, as the trace summaries (.tr
) are estimated from the N=50,000
in-sample genotype, compared to the stochastic genome-wide LD scores, calculated from the N=10,000
reference genotype. Still, both estimates overlap well within their error ranges.
If you'd like to create your own trace summaries, please refer to the PyRHE
program from our lab: https://github.com/sriramlab/PyRHE. You may also use the C++ version of GENIE
.
You would need your own individual-level genotype for this (running with -tr
option will save the trace summaries). While less flexible than using SNP-level LD scores, the trace summaries are directly estimated with GENIE
or pyRHE
, and the files are a lot smaller in size (less than 0.1 MB).
Other methods use sliding fixed-sized windows (typically < 2Mb) to estimate the LD scores of the SNPs. This results in under-estimation of LD scores, as the long-range LD (> 2Mb) is not captured. Often times, this results in upward-biased estimates of heritability. One of the advantages of Randomized Haseman-Elston regression is that it can efficiently estimate genome-wide correlations between SNPs through random projection. In the original RHE papers, this is used to estimate the trace of squared kinship matrix. sumrhe
extends this idea further by estimating the genome-wide (partitioned) LD scores.
Here is an example run command (estimate_gwldscore.sh
in the example directory):
python3 ../src/sumrhe.py --geno ./small \
--annot small.double_annot.txt \
--out double_stoc \
--nvecs 100 \
--nworkers 8
This script should run within a few seconds and create a gzip file named double_stoc.gw.ldscore.gz
and double_stoc.log
. The file format of double_stoc.gw.ldscore.gz
is identical to the traditional LDSC LD scores, where the first three columns are metadata ('CHR', 'SNP', 'BP'), and the remaining columns the (partitioned) LD scores.
--trace : File path for trace summary statistics (.tr) and corresponding metadata (.MN). If the path is a directory, all trace summaires (ending with .tr) will be used by aggregating them.
--save-trace : File path for saving (aggregated) trace summaries (.tr) and corresponding metadata (.MN)
--pheno : File path for phenotype-specific summary statistics (.sumstat). If the path is a directory, all summary statistics (ending with .sumstat) will be used.
--bim : File path for the reference .bim file used for trace calculation (optional)
--out : Output file path to save the analysis log and result (.log) or the genome-wide LD scores (.gw.ldscore.gz)
--max-chisq : Filter out SNPs with chi-sq statistic above the threshold.
--filter-both-sides : When filtering SNPs, remove their effects on both trace and yKy.
--ldscore : File path for LD scores of the reference SNPs. You may use either the traditional (truncated) LD scores (.l2.ldscore.gz) or genome-wide stochastic LD scores (.gw.ldscore.gz)
--all-snps : Use all the SNPs in the phenotype sumamry statistics. Make sure this is safe to do so.
--verbose : Verbose mode: print out the normal equations
--suppress : Suppress mode: do not print out the outputs to stdout (log file only)
--njack : Number of jackknife blocks (only if using LD scores as input)
--annot : Path of the annotation file (if using partitioned heritability)
--geno : Path of the genotype file to calculate the genome-wide LD scores. Calculates partitioned scores if --annot is also specified.
--nworkers : Number of workers for multiprocessing to calculate stochastic genome-wide LD scores. Default is 4.
--nvecs : Number of random vectors to use for estimating stochastic genome-wide LD scores. Default is 10.
--step_size : Number of SNPs to process in each step of estimating stochastic genome-wide LD scores. Default is 1000.
--seed : Seed for estimating stochastic genome-wide LD scores. If not specified, the default numpy (pseudo) random number generator will be used.
✅ partitioned heritability
✅ stochastic genome-wide LD scores
✅ better SE estimates with LD scores
✅ easier input file formatting
☑️ both-side filtering of outlier SNPs
☑️ genetic correlation
SUM-RHE
is now published on Genome Research as an open access article. For references, please cite
Jeong, M., Pazokitoroudi, A., Liu, Z., & Sankararaman, S. (2024). Scalable summary statistics-based heritability estimation method with individual genotype level accuracy. Genome research, gr.279207.124. Advance online publication. https://doi.org/10.1101/gr.279207.124