SCOPE is a method for performing scalable population structure inference on biobank-scale genomic data. SCOPE uses a likelihood-free framework with two stages. It first estimates the individual allele frequency (IAF) matrix through latent subspace estimation (LSE), a modified version of principal component analysis (PCA). It then applies alternating least squares (ALS) to factor the estimated IAF matrix into ancestral allele frequencies and admixture proportions. Two major optimizations enable scalable inference. First, SCOPE uses randomized eigendecomposition to efficiently estimate the latent subspace. Second, SCOPE uses the Mailman algorithm for fast matrix-vector multiplication involving the genotype matrix.
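To make the ALS step concrete, here is a minimal NumPy sketch that factors an IAF-like matrix `F` (SNPs x individuals) into `P` (SNPs x k) and `Q` (k x individuals). It is an illustration only, not SCOPE's implementation: the function name `als_factorize`, the fixed iteration count, and the simple clip-and-renormalize projection are all assumptions made for this example, and it works on a small dense matrix rather than on biobank-scale genotypes.

```python
import numpy as np

def als_factorize(F, k, iters=200, seed=0):
    """Toy ALS: approximate F (m x n) as P (m x k) times Q (k x n).

    Constraints are enforced by a crude projection after each solve:
    entries of P are clipped to [0, 1], and each column of Q is made
    non-negative and rescaled to sum to 1.
    """
    rng = np.random.default_rng(seed)
    m, n = F.shape
    P = rng.uniform(0.05, 0.95, size=(m, k))  # random starting frequencies
    Q = np.full((k, n), 1.0 / k)              # uniform starting proportions
    for _ in range(iters):
        # Fix P, solve the least-squares problem P @ Q ~= F for Q.
        Q = np.linalg.lstsq(P, F, rcond=None)[0]
        Q = np.clip(Q, 1e-9, None)
        Q /= Q.sum(axis=0, keepdims=True)
        # Fix Q, solve Q.T @ P.T ~= F.T for P.
        P = np.linalg.lstsq(Q.T, F.T, rcond=None)[0].T
        P = np.clip(P, 0.0, 1.0)
    return P, Q

if __name__ == "__main__":
    # Synthetic example: 50 SNPs, 20 individuals, 3 latent populations.
    rng = np.random.default_rng(1)
    P_true = rng.uniform(0.1, 0.9, size=(50, 3))
    Q_true = rng.dirichlet(np.ones(3), size=20).T
    P_hat, Q_hat = als_factorize(P_true @ Q_true, k=3)
    print("reconstruction error:", np.linalg.norm(P_true @ Q_true - P_hat @ Q_hat))
```

The recovered populations may come out in a different order than the ones used to simulate the data, since the factorization is only identifiable up to a relabeling of the columns of `P` and rows of `Q`.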
This project is licensed under the MIT License - see the LICENSE file for details.
This code is based on contributions from the following sources:
- Eigen - C++ template library for linear algebra
- Spectra - C++ Library For Large Scale Eigenvalue Problems
- ProPCA - Scalable probabilistic PCA for large-scale genetic variation data
The following packages are required on a Linux machine to compile and use SCOPE.
g++ (>=4.5)
cmake (>=2.8.12)
make (>=3.81)
SCOPE has been tested on CentOS 6.10 and 7, g++ 4.8.5 and 4.9.3, make 3.81 and 3.82, and cmake 2.8.12.2 and 3.7.2.
To install SCOPE, run the following commands:
git clone https://github.com/sriramlab/SCOPE.git
cd SCOPE
mkdir build
cd build
cmake ..
make
SCOPE should finish compiling within a few minutes. An example script can be found in the `examples` subdirectory to test SCOPE. We have additionally included several scripts that can regenerate the simulations and real datasets we used in our manuscript. Please see the subdirectories in `misc` for more detail.
SCOPE can be run from the command line using the following options. At minimum, SCOPE requires the path to the PLINK binary prefix.
* genotype (-g) : Path to PLINK binary prefix
* frequencies (-freq) : Path to PLINK frequency file for supervision (default: none)
* num_evec (-k) : Number of latent populations (default: 5)
* max_iterations (-m) : Maximum number of iterations for ALS (default: 1000)
* convergence_limit (-cl) : Convergence threshold for LSE and ALS (default: 0.00001)
* output_path (-o) : Output prefix (default: scope_)
* nthreads (-nt) : Number of threads to use (default: 1)
* seed (-seed) : Seed to use (default: system time)
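The options above can be assembled into a command line from a script. The following Python sketch maps the option names in the list to their flags. The helper `build_scope_command`, the binary path `./scope`, and the dataset prefix `data/mydata` are hypothetical, and the sketch assumes each flag is followed by its value as a separate argument.

```python
import shlex

# Option names and flags as listed above; -g is handled separately
# because it is the only required option.
FLAGS = {
    "frequencies": "-freq",
    "num_evec": "-k",
    "max_iterations": "-m",
    "convergence_limit": "-cl",
    "output_path": "-o",
    "nthreads": "-nt",
    "seed": "-seed",
}

def build_scope_command(binary, genotype, **options):
    """Return an argument list for running SCOPE (hypothetical helper).

    Options that are not passed are omitted, so SCOPE's defaults apply.
    """
    cmd = [binary, "-g", genotype]
    for name, value in options.items():
        if name not in FLAGS:
            raise ValueError(f"unknown SCOPE option: {name}")
        cmd += [FLAGS[name], str(value)]
    return cmd

if __name__ == "__main__":
    # "./scope" and "data/mydata" are placeholder paths for this example.
    cmd = build_scope_command("./scope", "data/mydata",
                              num_evec=6, nthreads=8, seed=12345,
                              output_path="run1_")
    print(shlex.join(cmd))
    # To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Setting the seed explicitly, as in this example, makes a run repeatable, since the default seed is the system time.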
To perform supervised population structure inference, provide the `-freq` parameter. The file needed for this parameter can be generated using `plink --freq --within`. If no frequency file is provided, SCOPE will perform unsupervised population structure inference. When using the supervised mode, make sure that the ordering of the SNPs matches between the frequency file and the target dataset. Alleles must also be coded consistently between the two. One can flip alleles using the `--flip` and `--flip-subset` commands in PLINK.
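Before a supervised run, it can help to verify the SNP ordering and allele coding programmatically. The sketch below compares a PLINK `.bim` file against a stratified frequency file. It assumes the frequency file is whitespace-delimited with a header row containing `SNP`, `A1`, and `A2` columns, with one row per SNP and cluster, and that the `.bim` file has the usual six columns. The function names are hypothetical.

```python
def read_bim(path):
    """Return [(snp_id, a1, a2), ...] in file order from a PLINK .bim file."""
    variants = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if fields:
                # .bim columns: chromosome, SNP ID, cM, bp, allele 1, allele 2
                variants.append((fields[1], fields[4], fields[5]))
    return variants

def read_frq_strat(path):
    """Return [(snp_id, a1, a2), ...] by first appearance in the frequency file."""
    variants, seen = [], set()
    with open(path) as handle:
        header = handle.readline().split()
        snp_i, a1_i, a2_i = (header.index(c) for c in ("SNP", "A1", "A2"))
        for line in handle:
            fields = line.split()
            if not fields:
                continue
            snp = fields[snp_i]
            if snp not in seen:  # each SNP repeats once per cluster
                seen.add(snp)
                variants.append((snp, fields[a1_i], fields[a2_i]))
    return variants

def find_mismatches(bim_variants, frq_variants):
    """List (index, kind, bim_entry, frq_entry) for every disagreement."""
    problems = []
    if len(bim_variants) != len(frq_variants):
        problems.append((None, "count", len(bim_variants), len(frq_variants)))
    for i, (b, f) in enumerate(zip(bim_variants, frq_variants)):
        if b[0] != f[0]:
            problems.append((i, "order", b, f))
        elif b[1:] != f[1:]:
            problems.append((i, "alleles", b, f))
    return problems
```

An empty list from `find_mismatches(read_bim(...), read_frq_strat(...))` means the two files agree on SNP order and allele coding. An `"alleles"` entry flags a SNP whose alleles are coded differently in the two files.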
SCOPE will output the following files:
* scope_V.txt : the estimated latent subspace from LSE
* scope_Phat.txt : the estimated allele frequencies for the latent populations
* scope_Qhat.txt : the estimated admixture proportions for each individual
Each column of `Phat.txt` corresponds to a row of `Qhat.txt`. If `Qhat.txt` is transposed, its columns will correspond to the columns of `Phat.txt`. If running SCOPE in supervised mode, the order of the columns in `Phat.txt` corresponds to the order displayed in the PLINK frequency file.
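As a quick sanity check on the outputs, the following Python sketch loads the two matrices, confirms that the number of columns in `Phat` equals the number of rows in `Qhat`, and reports the population with the largest admixture proportion for each individual. It assumes the files are whitespace-delimited numeric text without a header row. The function names are hypothetical.

```python
def load_matrix(path):
    """Read a whitespace-delimited numeric text file into a list of rows."""
    with open(path) as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]

def check_shapes(phat, qhat):
    """Raise if the columns of Phat do not line up with the rows of Qhat."""
    if len(phat[0]) != len(qhat):
        raise ValueError(
            f"Phat has {len(phat[0])} columns but Qhat has {len(qhat)} rows"
        )

def dominant_population(qhat):
    """For each individual (a column of Qhat), return the index of the
    population (a row of Qhat) with the largest admixture proportion."""
    k, n = len(qhat), len(qhat[0])
    return [max(range(k), key=lambda p: qhat[p][i]) for i in range(n)]
```

With the default output prefix, one would call `load_matrix("scope_Phat.txt")` and `load_matrix("scope_Qhat.txt")`, pass both results to `check_shapes`, and then call `dominant_population` on the `Qhat` matrix. In supervised mode, the returned indices follow the population order in the PLINK frequency file.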