This repository documents the analysis performed for The impact of sex on alternative splicing; note that a manuscript with a modified version of the analysis has been submitted. To reproduce the analysis, users will need to go through several steps.
- Get access to the Genotype-Tissue Expression (GTEx) RNA-seq data (this requires an application to dbGaP for access to dataset phs000424.v8.p2).
- Align each RNA-seq sample with hisat2 and use rMATS to create a matrix of counts for each of several splicing types. rMATS was run as a nextflow script. The script may be modified to run on any platform; the results for this study were generated on the cloudOS/lifebit platform.
- Run the Jupyter notebooks from this repository to perform the individual analyses.
This repository documents the interactive analysis for the results of running the rmats-nf pipeline.
The RNA-seq samples analyzed in this project are restricted access (dbGaP phs000424.v8.p2). See the database of Genotypes and Phenotypes (dbGaP) for details.
See the manuscript for methods details. In brief, we ran the nextflow script at https://github.com/lifebit-ai/rmats-nf to align the RNA-seq samples with hisat2 and to characterize splicing events with rMATS. Results from individual samples are summarized in 'matrix' files. To run the Jupyter scripts in the next section, you will need to place these files in a results bucket (if you are using the cloudos system) or in some other defined location.
Each of the results described in the manuscript was generated by one or more Jupyter notebooks in this repository. A number of R packages need to be installed before running the notebooks. This process is described for the cloudos environment in this document. If you are running the notebooks in another environment, simply run the setup scripts.
Most of the notebooks require that the raw rMATS files are first processed to generate summary files. This is done by the notebook countGenesAndEvents.ipynb. Additionally, two notebooks perform the differential gene expression (DGE) and differential alternative splicing (DAS) analyses. These three notebooks should be run first.
- differentialGeneExpressionAnalysis.ipynb. Perform differential gene expression analysis with voom.
- differentialSplicingJunctionAnalysis.ipynb. Regression analysis to characterize sex-biased alternative splicing events.
- countGenesAndEvents.ipynb. Set up the overall analysis. Write various files to the data subdirectory that will be used by other scripts.
The remaining notebooks can be run in any order. Most of the notebooks generate a Figure or a Table or a result that is described in the manuscript.
- expressionHeatplot.ipynb. Generate a heatplot representing expression across tissues.
- totalDGEByTissue.ipynb. Generate a plot representing counts of expression events across tissues.
- alternativeSplicingHeatplot.ipynb. Generate a heatplot representing alternative splicing across tissues.
- totalAlternativeSplicingByTissue.ipynb. Generate a plot representing counts of alternative splicing across tissues.
- XchromosomalEscape.ipynb. Investigate the overlap of alternative splicing and genes on the X chromosome that escape inactivation.
- splicingIndex.ipynb. Calculate the splicing index for each chromosome.
- spliceTypeByChromosome.ipynb. Calculate the distribution of the 5 types of alternative splicing event analyzed in this manuscript for each chromosome.
- altSplicing_events_per_gene.ipynb. Create a plot showing genes that display alternative splicing in many tissues.
- tissue_piechart.ipynb. Create a piechart showing distribution of genes according to number of tissues showing differential alternative splicing.
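The run order described above (the three prerequisite notebooks first, then the rest in any order) can be sketched as a small shell helper. This is an illustration only and not part of the repository: the function name ordered_notebooks is hypothetical, the notebooks are assumed to sit in a single directory, and the relative order of the three prerequisites is an assumption, since the text only says they should run before the others.

```shell
# Notebooks the text says must run first; their relative order here is an assumption.
PREREQUISITES="countGenesAndEvents.ipynb differentialGeneExpressionAnalysis.ipynb differentialSplicingJunctionAnalysis.ipynb"

# Print the notebooks found in a directory, prerequisites first, then the rest.
# (ordered_notebooks is a hypothetical helper, not something the repository provides.)
ordered_notebooks() {
  dir="${1:-.}"
  for pre in $PREREQUISITES; do
    if [ -f "$dir/$pre" ]; then
      printf '%s\n' "$dir/$pre"
    fi
  done
  for nb in "$dir"/*.ipynb; do
    [ -f "$nb" ] || continue
    case " $PREREQUISITES " in
      *" $(basename "$nb") "*) ;;          # prerequisite, already printed above
      *) printf '%s\n' "$nb" ;;
    esac
  done
}
```

Calling ordered_notebooks with the repository directory prints one notebook path per line, which can then be fed to whatever notebook runner you prefer.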
To facilitate reproducing the results from the secondary analysis that generates all the plots and tables of the publication, we have created a helper bash script that can be run to perform the following:
- Prepare the environment by installing dependencies
- Retrieve the data that we have made available via Zenodo (DOI 10.5281/zenodo.5524975)
- Execute all Jupyter notebooks programmatically using the papermill library
You can find the file at ./reproduce.sh.
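For readers unfamiliar with papermill, the programmatic execution step might look roughly like the sketch below. It does not reproduce the contents of reproduce.sh: the function name run_notebook, the DRY_RUN switch, and the default output directory executed are all hypothetical. It relies only on the basic papermill command-line form, papermill INPUT OUTPUT, and assumes papermill and a suitable notebook kernel are already installed.

```shell
# Execute one notebook with the papermill CLI and write the executed copy to an
# output directory. With DRY_RUN=1 the command is printed instead of being run.
# (run_notebook, DRY_RUN and the "executed" directory are hypothetical names.)
run_notebook() {
  nb="$1"
  out_dir="${2:-executed}"
  out="$out_dir/$(basename "$nb")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'papermill %s %s\n' "$nb" "$out"
  else
    mkdir -p "$out_dir"
    papermill "$nb" "$out"
  fi
}

# Example (dry run, nothing is executed):
#   DRY_RUN=1 run_notebook countGenesAndEvents.ipynb
```

Writing the executed notebook to a separate directory keeps the original notebooks unchanged, which makes it easier to compare a fresh run against the rendered versions.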
Instructions for environments with conda available
The only prerequisite in this case is a machine with conda installed.
IMPORTANT NOTE: Before executing the bash script, make sure your terminal is initialised for use with conda. You can do so by running one of the following commands, depending on your default shell:
i) for zsh
## Initialise the terminal for use of conda
conda init zsh && exec -l zsh
ii) for bash
## Initialise the terminal for use of conda
conda init bash && exec -l bash
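If you are unsure which of the two commands applies, the choice can be derived from the SHELL environment variable. The helper below is a minimal sketch under that assumption; conda_shell_name is a hypothetical name, and only the two shells covered above are handled.

```shell
# Map a shell path (by default the login shell recorded in $SHELL) to the
# shell name used in the conda init commands above.
# (conda_shell_name is a hypothetical helper, not part of conda.)
conda_shell_name() {
  shell_path="${1:-$SHELL}"
  case "$(basename "$shell_path")" in
    zsh)  echo zsh ;;
    bash) echo bash ;;
    *)    echo "unsupported shell: $shell_path" >&2; return 1 ;;
  esac
}

# Example:
#   sh_name="$(conda_shell_name)" && conda init "$sh_name" && exec -l "$sh_name"
```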
Copy the following commands into your terminal to reproduce the Jupyter notebook analysis:
git clone https://github.com/TheJacksonLaboratory/sbas.git
cd sbas
git checkout adds-rendered-notebooks
conda init zsh && exec -l zsh
After this has finished, run the bash script reproduce.sh:
time bash ./reproduce.sh
Instructions for environments with docker but not conda available
The only prerequisite in this case is a machine with docker installed.
You can use a docker image that includes conda, for example continuumio/miniconda3.
Copy the following commands into your terminal to start the container:
## Start the container, mounting the current directory so that input and output data are available in PWD
docker run -v "$PWD:$PWD" -w "$PWD" -it continuumio/miniconda3
Continue running the commands below (inside the docker container):
## Initialise the terminal for use of conda (bash is used here because the continuumio/miniconda3 image does not include zsh)
conda init bash && exec -l bash
Then run the following commands inside the container to reproduce the Jupyter notebook analysis:
git clone https://github.com/TheJacksonLaboratory/sbas.git
cd sbas
git checkout adds-rendered-notebooks
conda init bash && exec -l bash
After this has finished, run the bash script reproduce.sh:
time bash ./reproduce.sh