Input format
For RapidoPGS, PRS-CS and PRSice2, provide the GWAS summary statistics containing the genotype-phenotype associations as the base file. For PLINK2, provide the file containing the posterior variant effect sizes (or weights) as the base file. The base file should preferably be `.gz` compressed, but plain text should also work. The file should be tab-delimited and have a single header line. The table below describes the configuration file parameters required for each method. `R` and `O` indicate that a parameter is 'required' or 'optional', respectively.

Parameter | Description | RapidoPGS | PRS-CS | PRSice | PLINK |
---|---|---|---|---|---|
`BASEDATA` | Path to the file containing the base data | R | R | R | R |
`BF_BUILD` | Build of the base file, e.g. "hg19" or "hg38" | R | | | |
`BF_ID_COL` | Name of the SNP ID column in the base file | R | R | R | R |
`BF_CHR_COL` | Name of the chromosome column in the base file | R | R | | |
`BF_POS_COL` | Name of the position column in the base file | R | R | | |
`BF_EFFECT_COL` | Name of the effect allele column in the base file | R | R | R | R |
`BF_NON_EFFECT_COL` | Name of the non-effect allele column in the base file | R | R | R | |
`BF_STAT` | Type of measure in `BF_STAT_COL`, either "beta" or "or" | \* | R | R | |
`BF_STAT_COL` | Name of the beta/OR/effect size column in the base file | R | R | R | R |
`BF_FRQ_COL` | Name of the effect allele frequency column in the base file | R/O\*\* | | | |
`BF_SE_COL` | Name of the column containing the standard error of the beta/OR value | R | | | |
`BF_PVALUE_COL` | Name of the column containing the P-values of the association test | R | R | R | |
`BF_SBJ_COL` | Name of the column containing the sample size for each variant | R/O\*\*\* | | | |
`BF_SAMPLE_SIZE` | Sample size of the GWAS | R/O\*\*\* | R | | |
`BF_TARGET_TYPE` | "cc" for a case-control trait, "quant" for a quantitative trait | R | | | |


\* It has not been verified whether RapidoPGS supports odds ratios.

\*\* Required for RapidoPGS for quantitative traits only.

\*\*\* For quantitative traits using RapidoPGS, provide either `BF_SBJ_COL` or `BF_SAMPLE_SIZE`.

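The requirements in the table can be checked before a run is started. The sketch below is only an illustration: it assumes the configuration has already been parsed into a Python dictionary, and the helper name `missing_parameters` and the lowercase method keys are hypothetical, not part of the pipeline. The sets encode the `R` cells as read from the table, plus the footnote rules for quantitative traits with RapidoPGS.

```python
# Required base-file parameters per method, transcribed from the table above.
# Illustrative only: this is not the pipeline's own validation code.
COMMON = {"BASEDATA", "BF_ID_COL", "BF_EFFECT_COL", "BF_STAT_COL"}

REQUIRED = {
    "rapidopgs": COMMON | {"BF_BUILD", "BF_CHR_COL", "BF_POS_COL",
                           "BF_NON_EFFECT_COL", "BF_SE_COL",
                           "BF_PVALUE_COL", "BF_TARGET_TYPE"},
    "prs-cs": COMMON | {"BF_CHR_COL", "BF_POS_COL", "BF_NON_EFFECT_COL",
                        "BF_STAT", "BF_PVALUE_COL", "BF_SAMPLE_SIZE"},
    "prsice2": COMMON | {"BF_NON_EFFECT_COL", "BF_STAT", "BF_PVALUE_COL"},
    "plink2": set(COMMON),
}


def missing_parameters(method, config):
    """Return the sorted list of required parameters absent from config."""
    method = method.lower()
    # A parameter counts as present only if it has a non-empty value.
    present = {key for key, value in config.items() if value not in (None, "")}
    missing = set(REQUIRED[method]) - present
    if method == "rapidopgs" and config.get("BF_TARGET_TYPE") == "quant":
        # Footnote **: allele frequency is required for quantitative traits.
        if "BF_FRQ_COL" not in present:
            missing.add("BF_FRQ_COL")
        # Footnote ***: one of the two sample size parameters is required.
        if not ({"BF_SBJ_COL", "BF_SAMPLE_SIZE"} & present):
            missing.add("BF_SBJ_COL or BF_SAMPLE_SIZE")
    return sorted(missing)


# Hypothetical configuration with only the four common parameters set.
config = {"BASEDATA": "gwas.txt.gz", "BF_ID_COL": "SNP",
          "BF_EFFECT_COL": "A1", "BF_STAT_COL": "BETA"}
print(missing_parameters("prsice2", config))
```

A check like this fails early, with a readable list of parameter names, rather than partway through a run.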
The target data contains the genotypes of individuals within a population. We currently only support target data in BGEN format v1.2. BGEN v1.1 and v1.3+ have not been tested; they might work, but this is not guaranteed. The following configuration file parameters related to the target data are required for all methods:
- `VALIDATIONDATA`: path to the directory containing the validation data, e.g. `/hpc/data/_ae_originals`.
- `VALIDATIONPREFIX`: prefix of the validation data excluding the chr-number and extension, e.g. `aegs_combo_1kGp3GoNL5_RAW_chr`.
- `VAL_REF_POS`: position of the reference allele in the BGEN files relative to the alternative allele: `ref-first`, `ref-last` or `ref-unknown`.

Note: the SNP IDs in `BF_ID_COL` must use the same format as the SNP IDs in the target data (a.k.a. 'validation' data).
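To illustrate how `VALIDATIONDATA` and `VALIDATIONPREFIX` combine, the sketch below builds one file path per chromosome. It assumes the files are named `<prefix><chromosome>.bgen` and that chromosomes 1 to 22 are used. Both the `.bgen` extension and the chromosome range are assumptions, and `bgen_paths` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import PurePosixPath


def bgen_paths(validation_data, validation_prefix, chromosomes=range(1, 23)):
    """Map each chromosome number to its assumed BGEN file path."""
    directory = PurePosixPath(validation_data)
    return {
        # Assumption: file name = prefix + chromosome number + ".bgen".
        chrom: str(directory / f"{validation_prefix}{chrom}.bgen")
        for chrom in chromosomes
    }


# Example values taken from the parameter descriptions above.
paths = bgen_paths("/hpc/data/_ae_originals", "aegs_combo_1kGp3GoNL5_RAW_chr")
print(paths[1])
# prints /hpc/data/_ae_originals/aegs_combo_1kGp3GoNL5_RAW_chr1.bgen
```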
Please provide a sample file in the SNPTEST sample file format. The identifiers in the ID column must match the identifiers of the target population. Phenotypes and covariates are not used in the polygenic score computations by PRS-CS, RapidoPGS and PLINK2. PRSice2, however, requires a single phenotype, which it uses to find the best-fitting set of polygenic scores across multiple P-value thresholds. This phenotype is supplied using the `PRSICE_PHENOTYPE` and `PRSICE_PHENOTYPE_BINARY` parameters. For PRSice2, the samples must also occur in the same order as in the BGEN files; otherwise PRSice2 will return an error.
- `SAMPLE_FILE`: path to the sample file.
- `PRSICE_PHENOTYPE`: phenotype that PRSice2 will use to find the best-fitting set of polygenic scores. This phenotype must be present in the sample file.
- `PRSICE_PHENOTYPE_BINARY`: [TRUE/FALSE] indicating whether `PRSICE_PHENOTYPE` contains a binary phenotype.

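The sample file requirements for PRSice2 can be sanity-checked with a few lines of Python. This is a minimal sketch, assuming the usual SNPTEST layout: a header line, then a line of column type codes (`0` for ID columns, `D`/`C` for covariates, `P` for continuous and `B` for binary phenotypes), then one line per individual. The function names, the column names in the example and the list of BGEN sample IDs are hypothetical; reading the IDs from the BGEN files themselves is not shown.

```python
def read_sample_file(lines):
    """Split SNPTEST-style sample file lines into header, type codes and rows."""
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) < 2:
        raise ValueError("expected a header line and a column-type line")
    return rows[0], rows[1], rows[2:]


def phenotype_type_matches(header, types, phenotype, binary):
    """Check that the phenotype column exists and has the expected type code."""
    if phenotype not in header:
        raise ValueError(f"{phenotype!r} is not a column in the sample file")
    declared = types[header.index(phenotype)]
    return declared == ("B" if binary else "P")


def same_order(sample_ids, bgen_ids):
    """True if the sample file lists individuals in the same order as the BGEN data."""
    return list(sample_ids) == list(bgen_ids)


# Made-up sample file content with a binary phenotype named CAD.
example = """\
ID_1 ID_2 missing sex CAD
0 0 0 D B
s1 s1 0 1 0
s2 s2 0 2 1
"""
header, types, data = read_sample_file(example.splitlines())
sample_ids = [row[0] for row in data]
print(phenotype_type_matches(header, types, "CAD", binary=True))  # prints True
print(same_order(sample_ids, ["s1", "s2"]))                       # prints True
```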
PRS-CS and PRSice2 can use an external linkage disequilibrium (LD) reference panel to improve the LD estimation.
- `LDDATA`: path to the LD reference panel. Note that PRS-CS and PRSice2 expect different formats.

PRS-CS requires an external LD panel. The PRS-CS developer recommends using one of the panels supplied on the PRS-CS GitHub page. These panels were constructed using either 1000 Genomes Project phase 3 samples or UK Biobank data. For PRS-CS, please supply the path to the folder containing the extracted map and `.hdf5` files, e.g. `/data/ldblk_1kg_eur`.
For PRSice2, reference data is optional and must be in `.bed` format or in BGEN format. If no reference data is provided, PRSice2 will use the target genotypes for LD estimation. Please supply the path and prefix of the reference files, e.g. `/data/ld_ref/1000Gp3v5.20130502.EUR.chr`.
The stats file is optional and can be used to perform quality control. If the `QC` parameter is active, base file variants that do not meet the imputation score and minor allele frequency thresholds will be removed. Such a file can, for example, be generated using SNPTEST. The file must be whitespace-delimited, `.gz` compressed and have a single header line.
- `STATS_FILE`: path to the stats file.
- `STATS_ID_COL`: name of the stats file column containing the SNP IDs. These IDs must match the IDs that occur in the base file.
- `STATS_MAF_COL`: name of the stats file column containing the minor allele frequency.
- `STATS_INFO_COL`: name of the stats file column containing the imputation score.
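The filtering that the stats file enables can be sketched as follows. This is not the pipeline's implementation: the function name `passing_variant_ids` is hypothetical, and the default thresholds (MAF of at least 0.01, imputation score of at least 0.8) are placeholder assumptions, since the actual threshold values are not stated here. The sketch reads a whitespace-delimited, `.gz` compressed file with a single header line and returns the IDs of the variants that pass both thresholds; base file variants whose IDs are not in that set would then be dropped.

```python
import gzip


def passing_variant_ids(stats_path, id_col, maf_col, info_col,
                        min_maf=0.01, min_info=0.8):
    """Return the IDs of variants that meet both thresholds.

    min_maf and min_info are placeholder values, not the pipeline's defaults.
    """
    keep = set()
    with gzip.open(stats_path, "rt") as handle:
        # Single header line; columns are located by name.
        header = handle.readline().split()
        idx = {name: header.index(name) for name in (id_col, maf_col, info_col)}
        for line in handle:
            fields = line.split()
            if not fields:
                continue
            try:
                maf = float(fields[idx[maf_col]])
                info = float(fields[idx[info_col]])
            except ValueError:
                # Non-numeric entries such as "NA" are treated as failing QC.
                continue
            if maf >= min_maf and info >= min_info:
                keep.add(fields[idx[id_col]])
    return keep
```

Returning a set keeps the later membership test against the base file's `BF_ID_COL` values cheap, which matters for stats files with millions of variants.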