This is a tool to detect enhancer hijacking events in a cohort of at least 10 samples profiled with WGS and RNA-seq. It does not require matched normals and can detect enhancer hijacking events occurring in only a single sample. Briefly, it looks for outlier high and monoallelic expression of a gene in a sample which has a breakpoint close to the gene.
In an environment with python>=3.7:
pip install pyjacker
pyjacker config.yaml
The config file specifies all the parameters, including paths to the input files (see config_AML.yaml as an example). Alternatively, we provide a Nextflow workflow that generates pyjacker's inputs from BAM files and then runs pyjacker: https://github.com/CompEpigen/wf_WGS.
Alternatively, pyjacker can be run using a docker image.
docker run -t -w `pwd` -v `pwd`:`pwd` esollier/pyjacker:latest pyjacker config.yaml
Gene expression matrix: rows are genes (Ensembl IDs) and columns are samples. The expression data must be provided in TPM. See data/TPM_ckAML.tsv for an example.
Breakpoints: a tsv file with columns sample, chr1, pos1, chr2, pos2. The fields chr2 and pos2 are optional (for example, if you only have copy number data). See data/breakpoints.tsv for an example.
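As a quick sanity check before running pyjacker, a breakpoints table can be parsed with the Python standard library. The sketch below is not part of pyjacker: the function name `read_breakpoints` and the example rows are hypothetical, and it assumes that a missing chr2/pos2 is written as an empty field, which should be checked against data/breakpoints.tsv.

```python
import csv
import io

REQUIRED = ("sample", "chr1", "pos1")  # chr2 and pos2 are optional


def read_breakpoints(handle):
    """Parse a tab-separated breakpoints table into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for row in reader:
        record = {
            "sample": row["sample"],
            "chr1": row["chr1"],
            "pos1": int(row["pos1"]),
        }
        # Assumption: an absent second end is an empty (or missing) field.
        if row.get("chr2") and row.get("pos2"):
            record["chr2"] = row["chr2"]
            record["pos2"] = int(row["pos2"])
        rows.append(record)
    return rows


# Hypothetical example content: one two-ended breakpoint, one single-ended.
example = (
    "sample\tchr1\tpos1\tchr2\tpos2\n"
    "S1\t7\t156800000\t12\t11900000\n"
    "S2\t3\t128000000\t\t\n"
)
breakpoints = read_breakpoints(io.StringIO(example))
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.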
A gtf file containing gene coordinates for your reference genome (see data/Homo_sapiens.GRCh37.75.gtf.gz or data/Homo_sapiens.GRCh38.113.gtf.gz), and a tsv file containing cytobands (see data/cytobands_hg19 or data/cytobands_hg38).
A bed file of topologically associating domains (TADs) can be used, in which case only the breakpoints in the same TAD as a gene are considered in the search for enhancer hijacking events. See data/TADs_Dixon_IMR90_hg19.bed or data/TADs_Dixon_IMR90_hg38.bed for TADs derived from the data of Dixon et al., or data/TADs_HSPC_hg19.bed for TADs derived from HSPCs. If no TAD file is provided, pyjacker will instead look for breakpoints within a fixed distance of the gene (1.5 Mb by default).
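The fixed-distance fallback can be illustrated with a minimal sketch. This is not pyjacker's code: the function names and coordinates are hypothetical, and it assumes that distance is measured from the breakpoint to the nearest edge of the gene interval and that either end of a breakpoint may qualify. Pyjacker's actual definition may differ.

```python
MAX_DISTANCE = 1_500_000  # 1.5 Mb, the default distance stated above


def distance_to_gene(pos, gene_start, gene_end):
    """Distance in bp from a position to the gene interval (0 if inside)."""
    if pos < gene_start:
        return gene_start - pos
    if pos > gene_end:
        return pos - gene_end
    return 0


def breakpoints_near_gene(breakpoints, gene_chr, gene_start, gene_end,
                          max_distance=MAX_DISTANCE):
    """Keep breakpoints with at least one end within max_distance of the gene."""
    selected = []
    for bp in breakpoints:
        ends = [(bp["chr1"], bp["pos1"])]
        if "chr2" in bp:  # the second end is optional
            ends.append((bp["chr2"], bp["pos2"]))
        if any(chrom == gene_chr
               and distance_to_gene(pos, gene_start, gene_end) <= max_distance
               for chrom, pos in ends):
            selected.append(bp)
    return selected


# Hypothetical gene on chromosome 7 spanning 1,000,000-1,050,000.
candidates = [
    {"sample": "A", "chr1": "7", "pos1": 2_400_000},  # 1.35 Mb away: kept
    {"sample": "B", "chr1": "7", "pos1": 3_000_000},  # 1.95 Mb away: dropped
    {"sample": "C", "chr1": "3", "pos1": 500, "chr2": "7", "pos2": 900_000},  # second end 0.1 Mb away: kept
]
nearby = breakpoints_near_gene(candidates, "7", 1_000_000, 1_050_000)
```

With a TAD file, the distance test would be replaced by a check that the breakpoint and the gene fall in the same TAD interval.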
Allele-specific expression (ASE) data is used to detect monoallelic expression. It requires files generated by fast_ase or GATK ASEReadCounter. See data/ASE_ckAML for example files.
Copy number alterations: a tsv file with the following columns: sample, chr, start, end, cn. See data/CNAs_ckAML.tsv for an example. If provided, this will be used to:
- correct gene expression based on copy number (so high expression because of amplification will not be reported)
- filter out SNPs within deletions from the monoallelic expression detection
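The second point can be sketched as follows. This is an illustration only, not pyjacker's implementation: the function names are hypothetical, SNPs are represented as (chr, pos) tuples for simplicity, and a deletion is assumed to be any segment with cn below 2 (a diploid autosome). The segments use the chr, start, end and cn columns listed above, restricted to a single sample.

```python
def in_deletion(chrom, pos, segments, normal_cn=2):
    """True if the position lies in a segment with copy number below normal_cn."""
    return any(
        seg["chr"] == chrom
        and seg["start"] <= pos <= seg["end"]
        and seg["cn"] < normal_cn
        for seg in segments
    )


def filter_snps(snps, segments):
    """Drop SNPs inside deletions: they look monoallelic for a trivial reason."""
    return [(chrom, pos) for chrom, pos in snps
            if not in_deletion(chrom, pos, segments)]


# Hypothetical copy number segments for one sample.
segments = [
    {"chr": "7", "start": 100_000_000, "end": 159_000_000, "cn": 1},  # deletion
    {"chr": "8", "start": 1, "end": 146_000_000, "cn": 3},            # gain
]
snps = [("7", 156_800_000), ("7", 50_000_000), ("8", 128_000_000)]
kept = filter_snps(snps, segments)
```

A SNP inside a heterozygous deletion has only one allele left in the DNA, so its expression is monoallelic regardless of any regulatory change, which is why such SNPs are excluded.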
A file of scored enhancers, generated by ROSE. See data/enhancers_myeloid_hg19.tsv for an example.
Fusion transcripts can also lead to aberrantly high and monoallelic expression of a gene. If a list of fusion transcripts detected from RNA-seq is provided, it will be used to annotate candidate enhancer hijacking events that are actually due to a fusion. See data/fusions_ckAML.tsv for an example file.
Pyjacker takes approximately 5h to run on the ckAML dataset (39 samples) with default settings and 6 cores. The runtime is essentially proportional to the number of samples in the dataset and to the number of iterations used when estimating the null distribution of scores (used to compute the false discovery rate). This number of iterations is 50 by default, which ensures that accurate p-values are computed, but this can easily be reduced to 5-10 to reduce the runtime, without drastically altering the results.
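For planning purposes, the scaling described above can be turned into a rough estimate. The sketch below is a back-of-the-envelope calculation, not a benchmark: the function name is hypothetical, and it assumes the runtime scales with the product of the number of samples and the number of iterations, with the same 6 cores and no fixed overhead.

```python
# Reference point from the ckAML dataset: about 5 h for 39 samples,
# 50 iterations and 6 cores.
REF_HOURS = 5.0
REF_SAMPLES = 39
REF_ITERATIONS = 50


def estimate_runtime_hours(n_samples, n_iterations=REF_ITERATIONS):
    """Rough runtime estimate, assuming linear scaling in samples and iterations."""
    return REF_HOURS * (n_samples / REF_SAMPLES) * (n_iterations / REF_ITERATIONS)


# Same cohort size, 10 iterations instead of 50: about one fifth of the time.
fewer_iterations = estimate_runtime_hours(39, n_iterations=10)

# Twice as many samples with default settings: about twice the time.
larger_cohort = estimate_runtime_hours(78)
```

Under these assumptions, reducing the iterations from 50 to 10 on the ckAML dataset would bring the runtime to about 1 h.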
If you use pyjacker in your research, please consider citing:
Sollier E, Riedel A, Toprak UH, Wierzbinska JA, Weichenhan D, Schmid JP, Hakobyan M, Touzart A, Jahn E, Vick B, Brown-Burke F. Pyjacker identifies enhancer hijacking events in acute myeloid leukemia including MNX1 activation via deletion 7q. bioRxiv. 2024:2024-09. https://doi.org/10.1101/2024.09.11.611224