This toolkit contains the required scripts to transform sequencing files into RsQTL input files and run the MatrixEQTL R package to identify significant variation-splicing relationships.
These instructions will get you a copy of the scripts up and running on your machine for development and testing purposes. See Running the scripts for notes on how to use the project on a live system. We have provided sample data that can be used to test the pipeline. It also serves as an example of the data format the pipeline expects.
This package was developed on R version 3.6.1 on macOS High Sierra.
-
The following R packages installed on your machine:
tidyverse MatrixEQTL data.table GenomicRanges factoextra ggpubr
You may install the packages using the following commands:
install.packages(c("tidyverse", "MatrixEQTL", "data.table", "factoextra", "ggpubr", "BiocManager")) BiocManager::install("GenomicRanges")
-
Each of the following scripts copied to a working directory on your machine
build_junction_input_RsQTL.R build_VAF_matrix_RsQTL.R harmonize_matrices_RsQTL.R build_pca_covariate_matrix_RsQTL.R run_matrix_RsQTL.R annotate_cis_trans_RsQTL.R plot_RsQTL.R
You can obtain the full toolkit here.
-
Output .csv files from our ReadCounts tool (https://github.com/HorvathLab/NGS/tree/master/readCounts) containing the read counts extracted per SNV for each sample
-
A .counts matrix from LeafCutter
The scripts are designed to be run from the Unix command line (Terminal on macOS) from the root directory (by default RsQTL-master if the toolkit was downloaded from the link above). Make sure to cd to this directory before beginning.
Transforms the raw LeafCutter output into a matrix suitable for RsQTL analysis
- The path to a raw LeafCutter output matrix (-c)
- The desired prefix of the output junction matrices (-o)
- OPTIONAL: the minimum number of reads spanning an intron junction for inclusion (--read-thresh, default = 30)
- OPTIONAL: maximum fraction of samples allowed to have a cluster read ratio of NA for a given intron (--na-thresh, default = 0.8)
- OPTIONAL: maximum fraction of samples allowed to have a cluster read ratio of 1 for a given intron (--one-thresh, default = 0.8)
- OPTIONAL: maximum fraction of samples allowed to have a cluster read ratio of 0 for a given intron (--zero-thresh, default = 0.8)
- OPTIONAL: logical specifying whether to remove introns above the X%ile of NA values (--excl-na, default = F)
- OPTIONAL: %ile threshold above which to exclude introns (--perc-thresh, default = 0.9)
- OPTIONAL: logical specifying whether to exclude introns which represent the only member of a cluster (--excl-sing, default = F)
- One file (in the output directory) with the cluster read ratios for each intron in each sample
- One file (in the output directory) with the genomic locations for each intron
Rscript build_splicing_matrix_RsQTL.R -c data/sample_junction_matrix.counts -o RsQTL_test
Transforms the read counts into a variant fraction matrix with information from all provided samples
- A directory containing the .csv files from the output of Readcounts (-r)
- The desired prefix of the output SNV matrix and SNV location files (-o)
- One file (in the script’s directory) with the SNV locations for MatrixEQTL
- One file (in the script’s directory) with the SNV variant allele fraction matrix for MatrixEQTL
Rscript build_VAF_matrix_RsQTL.R -r data/ -o RsQTL_test
Creates a covariate matrix of principal components from the VAF and junction matrices. Additional covariates can be included (see sample data for expected format).
- The path to the VAF matrix created by build_VAF_matrix_RsQTL.R (-r)
- The path to the splicing matrix created by build_junction_matrix_RsQTL.R (-s)
- OPTIONAL: The path to any additional covariates (-c)
- OPTIONAL: The number of PCs to include as covariates (-n, default = 10)
- NOTE: You may want to run this script once with the default value then adjust the number of PCs used based on the variance plots. Read more about selecting an appropriate number of principal components here.
- The desired prefix of the output covariate matrix (-o)
- A covariate matrix containing the top n VAF and cluster read ratio principal components as well as any additional supplied covariates
- Scree plots showing the percentage of variance explained by principal components of the VAF and splicing matrices
Rscript build_pca_covariate_matrix_RsQTL.R -r output/RsQTL_test_VAF_matrix.txt -s output/RsQTL_test_splicing_matrix.txt -c data/additional_covariates_matrix.txt -n 10 -o RsQTL_test
Harmonizes matrices so that all inputs for run_matrix_RsQTL.R contain the same samples (optional, but helps to avoid errors)
- The path to the VAF matrix created by build_VAF_matrix_RsQTL.R
- The path to the splicing matrix created by build_splicing_matrix_RsQTL.R
- The path to the covariate matrix created by build_pca_covariate_matrix_RsQTL.R (if no covariate information is used, you will need to modify this script to remove the references to the covariate matrix)
- Three matrices (in the output directory) corresponding to the three input files, but including only samples that were present in all three input files
Rscript harmonize_matrices_RsQTL.R -r output/RsQTL_test_VAF_matrix.txt -s output/RsQTL_test_splicing_matrix.txt -c output/RsQTL_test_covariate_pca_matrix.txt
This script is based off of the sample code from Shabalin, et al (2012)
Runs the RsQTL analysis using MatrixEQTL
- Names of the SNV matrix (-s), SNV location (-sl), splicing intron matrix (-i), and splicing intron location (-il) files
- OPTIONAL: Name of the covariates matrix (-c); we include an example in the "data" folder
- Logical (T or F, -ct) specifying whether to split the output into cis and trans
- Name of the output file for the qq plot (-qq)
- Prefix of the output file(s) (-o)
- P-value thresholds for the cis (-pcis) and trans (-ptr) output files or the unified output file (-p) depending on the logical specified above
- One file (in the output directory) with the cis RsQTLs
- One file (in the output directory) with the trans RsQTLs OR
- One file (in the output directory) with all of the unified RsQTLs depending on the logical specified above
Splitting cis and trans
Rscript run_matrix_RsQTL.R -s output/RsQTL_test_VAF_matrix_harmonized.txt -sl output/RsQTL_test_VAF-loc_matrix.txt -i output/RsQTL_test_splicing_matrix_harmonized.txt -il output/RsQTL_test_splicing-loc_matrix.txt -c output/RsQTL_test_pca_covariate_matrix_harmonized.txt -ct T -qq RsQTL_test_qqplot -pcis 0.1 -ptr 0.1 -o RsQTL_test
Unified cis and trans
Rscript run_matrix_RsQTL.R -s output/RsQTL_test_VAF_matrix_harmonized.txt -sl output/RsQTL_test_VAF-loc_matrix.txt -i output/RsQTL_test_splicing_matrix_harmonized.txt -il output/RsQTL_test_splicing-loc_matrix.txt -c output/RsQTL_test_pca_covariate_matrix_harmonized.txt -ct F -qq RsQTL_test_qqplot -p 0.1 -o RsQTL_test
Annotates the output of RsQTL as cis/trans based on whether the SNV and paired intron lie within the same gene
- The path to a RsQTL results file (-r)
- The path to a list of gene locations (-g, see sample data for a suggested file)
- The desired prefix of the output annotated results file (-o)
- One file (in the output directory) with the RsQTLs annotated as cis or trans
Rscript annotate_cis_trans_RsQTL.R -r output/RsQTL_test_all_RsQTLs.txt -g data/gene_locations_hg38.txt -o RsQTL_test
Plots either the top n most significant RsQTLs in the input file or a specific SNV-intron pair
- The path to the input VAF matrix (-r)
- The path to the input splicing matrix (-s)
- Plotting mode (-m, "bulk" or "single")
- (If plotting in bulk mode) the path to a results file (-res)
- (If plotting in bulk mode) the number of correlations to plot (-n)
- (If plotting in single mode) the SNV to plot (-snv)
- (If plotting in single mode) the intron to plot (-intron)
- The desired prefix of the output annotated results file (-o)
- One file (in the output directory) containing either the bulk or individual plot
Bulk mode
Rscript plot_RsQTL.R -r output/RsQTL_test_VAF_matrix_harmonized.txt -s output/RsQTL_test_splicing_matrix_harmonized.txt -res output/RsQTL_test_RsQTLs_cistrans_ann.txt -m bulk -n 200 -o RsQTL_test
Single mode
Rscript plot_RsQTL.R -r output/RsQTL_test_VAF_matrix_harmonized.txt -s output/RsQTL_test_splicing_matrix_harmonized.txt -res output/RsQTL_test_RsQTLs_cistrans_ann.txt -m single -snv "1:952657_T>C" -intron "chr12:56160320_56161387" -o RsQTL_test
Justin Sein and Liam F. Spurr, Pavlos Bousounis, Nawaf Alomran, Dacian Reece-Stremtan, Prashant N M, Hongyu Liu, and Anelia Horvath
We would like to thank the Matrix EQTL team (Shabalin, et al. 2012) for their sample code and R package upon which run_matrix_RsQTL.R is based.