---
title: "De novo discovery of traits co-occurring with chronic obstructive pulmonary disease"
author: "Evgeniia Golovina"
date: "26/05/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message=FALSE)

# install.packages("pacman")
# load libraries
# note: mean_sdl(), used in the violin plots below, requires the Hmisc package to be installed
pacman::p_load(ggplot2, pander, dplyr, tidyr, ggrepel, tidyverse, reshape2, RColorBrewer,
               gprofiler2, clusterProfiler, circlize, ComplexHeatmap, igraph, viridis, plotly,
               venneuler, eulerr, Rcpp, colorspace, EnsDb.Hsapiens.v86, scales, ggpubr,
               vroom, ggforce, cowplot, pheatmap, readxl, httr)

# colors: lung - #009444
```

This is a reproducibility report for the study "De novo discovery of traits co-occurring with chronic obstructive pulmonary disease".

Python (version 3.6.9), R (version 4.0.2) and RStudio (version 1.2.5033) were used for data processing, analysis and visualisation.

1. Hi-C datasets for primary lung cells are available on [GEO](https://www.ncbi.nlm.nih.gov/geo/) (accession numbers: GSM2322544, GSM2322545).
2. Total RNA-seq and WGS datasets across GTEx v8 lung tissue are available via the [dbGaP](https://www.ncbi.nlm.nih.gov/gap/) (accession: phs000424.v8.p2).  
3. Human genome build hg38 release 75 (GRCh38) (Homo_sapiens_assembly38_noALT_noHLA_noDecoy.fasta) was downloaded from [gs://gtex-resources](https://console.cloud.google.com/storage/browser/gtex-resources/references).
4. SNP genomic positions for genome GRCh38p7 build 151 were obtained from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7.  
5. Gene annotation for GENCODE v26 (gencode.v26.GRCh38.genes.gtf) was downloaded from [gs://gtex-resources](https://console.cloud.google.com/storage/browser/gtex-resources/references).  
6. SNPs associated with COPD were downloaded from the [GWAS Catalog](https://www.ebi.ac.uk/gwas/) on 09/06/2021.
7. SNPs associated with CAD were downloaded from the [GWAS Catalog](https://www.ebi.ac.uk/gwas/) on 11/04/2022.
8. SNPs associated with UD were downloaded from the [GWAS Catalog](https://www.ebi.ac.uk/gwas/) on 11/04/2022.
9. [GWAS Catalog](https://www.ebi.ac.uk/gwas/docs/file-downloads) was queried for SNP-trait associations on 22/11/2021.  
10. [STRING PPI database](https://string-db.org/) (version 11.5) was queried on 22/11/2021.
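
Before knitting, it may help to confirm that the input files referenced in the chunks below are in place. The following is a minimal sketch rather than part of the original analysis: the helper name `check_inputs` is hypothetical, and the paths are a subset of those used later in this report. It is written as a plain code block, so it is not evaluated when the report is knitted.

```r
# Hypothetical helper (not part of the original analysis):
# report which of the given input files are present.
check_inputs <- function(paths) {
  data.frame(path = paths, exists = file.exists(paths), stringsAsFactors = FALSE)
}

# A subset of the paths read by the chunks in this report.
inputs <- c("data/lung_grn_eqtls.txt",
            "data/gene_reference.bed.gz",
            "results/codes3d/lung/significant_eqtls.txt.gz")
status <- check_inputs(inputs)
status[!status$exists, ]  # rows listed here are missing files
```

An empty result from the last line means that all listed files were found relative to the working directory.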


### 1. Identification of significant spatial eQTL SNP-gene interactions using CoDeS3D.

First, we ran the CoDeS3D pipeline to identify COPD-associated spatial regulatory interactions in lung tissue (using GTEx lung eQTL data and Hi-C libraries for primary lung cells): `python codes3d/codes3d.py -s data/snps/263_copd_snps_5E-8.txt -o results/codes3d/lung -n Lung_cells_Schmitt2016 -t Lung`.

## 1. Construction of the entire lung-specific gene regulatory map (GRN).

### 1.1. Percentage of spatial lung-specific eQTL SNPs and non-eQTL SNPs among all variants (genotyped from lung samples obtained from GTEx v8). 

The lung-specific GRN files can be downloaded from figshare:

1. Lung-specific GRN: https://auckland.figshare.com/account/projects/152496/articles/21498774
2. eQTLs from the lung-specific GRN: https://auckland.figshare.com/account/projects/152496/articles/21498780
3. Genes from the lung-specific GRN: https://auckland.figshare.com/account/projects/152496/articles/21498786

```{r lung_map_eqtls_vs_noneqtls}
### Getting number of SNPs vs eQTLs in lung across the entire genome
l_grn_eqtls <- read.table("data/lung_grn_eqtls.txt", header = TRUE, sep = "\t")
gtex_snps <- vroom(file="./data/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.lookup_table_short_version.txt.gz", delim ="\t") #43174097

lung.map.genome <- data.frame(
  snps = rep(c("eQTL","non-eQTL")),
  number = c(nrow(l_grn_eqtls), (nrow(gtex_snps)-nrow(l_grn_eqtls))),
  percentage = c(round((nrow(l_grn_eqtls)/nrow(gtex_snps))*100, 2),
                 round(((nrow(gtex_snps)-nrow(l_grn_eqtls))/nrow(gtex_snps))*100, 2)))

#pdf("figures/lung_eqtl_map/lung_map_eqtls_vs_non-eqtls_genomewide.pdf", width = 9, height = 9)
ggplot(lung.map.genome, aes(x = "", y = percentage, fill = snps)) +
  geom_bar(width = 1, stat = "identity", fill = c("#009444", "darkgrey"), color = "white") +
  coord_polar("y", start = 0)+
  #geom_text(aes(y = percentage, label = snps), color = "white")+
  #scale_fill_manual(values = mycols) +
  theme_void()
#dev.off()
```
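
The same count-and-percentage table is built for SNPs (section 1.1), genes (section 1.2) and COPD-associated SNPs (section 2.1). As a sketch of that arithmetic, the helper below takes the size of a subset and of the full set and returns both groups with their percentages. The function name `share_table` and the toy counts are assumptions made for illustration, not values from the study.

```r
# Hypothetical helper: counts and percentages of a subset and its complement.
share_table <- function(n_subset, n_total, labels = c("eQTL", "non-eQTL")) {
  stopifnot(n_total > 0, n_subset >= 0, n_subset <= n_total)
  number <- c(n_subset, n_total - n_subset)
  data.frame(group = labels,
             number = number,
             percentage = round(number / n_total * 100, 2))
}

# Toy counts (not the study's values): 250 eQTLs among 1000 variants.
toy_share <- share_table(250, 1000)
toy_share
```

With real data, `n_subset` and `n_total` would be the `nrow()` values of the eQTL table and the GTEx lookup table, respectively.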

### 1.2. Percentage of genes spatially regulated in the lung among all genes from GTEx v8.

```{r lung_map_egenes}
### Getting number of eGenes
# reading lung-specific gene regulatory network (GRN) data
lung_GRN_genes <- read.table(gzfile("results/lung_specific_GRN/lung_specific_GRN_genes.txt.gz"))
colnames(lung_GRN_genes) <- c("gene_id", "gene_name")
lung_GRN_genes <- unique(lung_GRN_genes) # 15855
gtex_genes <- read.table(gzfile("data/gene_reference.bed.gz"), header = FALSE, sep = "\t") #56200

lung.map.genes <- data.frame(
  genes = rep(c("eGenes","non-eGenes")),
  number = c(nrow(lung_GRN_genes), (nrow(gtex_genes)-nrow(lung_GRN_genes))),
  percentage = c(round((nrow(lung_GRN_genes)/nrow(gtex_genes))*100, 2),
                 round(((nrow(gtex_genes)-nrow(lung_GRN_genes))/nrow(gtex_genes))*100, 2)))

#pdf("figures/lung_eqtl_map/lung_map_egeness_vs_non-egenes_genomewide.pdf", width = 9, height = 9)
ggplot(lung.map.genes, aes(x = "", y = percentage, fill = genes)) +
  geom_bar(width = 1, stat = "identity", fill = c("#009444", "darkgrey"), color = "white") +
  coord_polar("y", start = 0)+
  #geom_text(aes(y = percentage, label = snps), color = "white")+
  #scale_fill_manual(values = mycols) +
  theme_void()
#dev.off()
```

### 1.3. Mean, median, and standard deviation of number of eQTLs per gene grouped by interaction type.

```{r lung_eqtls_per_gene_by_interaction_type}
### reading lung-specific GRN
lung_map <- vroom(file="./results/lung_specific_GRN/lung_grn_significant_eqtls.txt.gz",
                  delim="\t")

lung_map.df <- distinct(lung_map[,c("snp", "gene", "interaction_type")])
lung_map_cis <- subset(lung_map.df, interaction_type=='Cis') # 840,170 
#length(unique(lung_map_cis$snp)) # 711,397 eQTL SNPs
#length(unique(lung_map_cis$gene)) # 14,700 eGenes
lung_map_intra <- subset(lung_map.df, interaction_type=='Trans-intrachromosomal') # 16,213
#length(unique(lung_map_intra$snp)) # 16,175 eQTL SNPs
#length(unique(lung_map_intra$gene)) # 2,702 eGenes
lung_map_inter <- subset(lung_map.df, interaction_type=='Trans-interchromosomal') # 6,540
#length(unique(lung_map_inter$snp)) # 6,385 eQTL SNPs
#length(unique(lung_map_inter$gene)) # 2,003 eGenes

lung_map_cis.freq <- count(distinct(lung_map_cis[,c("snp", "gene")]), gene) %>%
    mutate("interaction_type"="Cis")
lung_map_intra.freq <- count(distinct(lung_map_intra[,c("snp", "gene")]), gene) %>%
    mutate("interaction_type"="Trans-intrachromosomal")
lung_map_inter.freq <- count(distinct(lung_map_inter[,c("snp", "gene")]), gene) %>%
    mutate("interaction_type"="Trans-interchromosomal")

lung_map_combined <- rbind(lung_map_cis.freq, lung_map_intra.freq, lung_map_inter.freq)
lung_map_combined <- lung_map_combined %>% 
    mutate(interaction_type=factor(interaction_type, 
                                   levels=c("Cis",
                                            "Trans-intrachromosomal",
                                            "Trans-interchromosomal")))

# computing mean, median and SD of the number of eQTLs per gene (raw and log10 scale)
lung_map_combined %>% group_by(interaction_type) %>%
    summarise(mean = mean(n), median = median(n), sd = sd(n),
              mean_log10 = mean(log10(n)), median_log10 = median(log10(n)),
              sd_log10 = sd(log10(n)))
# comparing number of eQTLs per gene grouped by interaction type
eqtls_per_gene <- list(c("Cis", "Trans-intrachromosomal"),
                       c("Cis", "Trans-interchromosomal"),
                       c("Trans-interchromosomal", "Trans-intrachromosomal"))

#pdf("figures/lung_eqtl_map/violin_eqtls_per_gene.pdf", width = 12, height = 8)
violin_eqtls_per_gene <- ggplot(lung_map_combined,
                                aes(x=interaction_type, y=log10(n), fill=interaction_type)) +
    geom_violin(trim=T) + 
    stat_summary(fun = "median", geom = "point", shape = 3, size = 4, color = "white") +
    stat_summary(fun.data=mean_sdl, fun.args = list(mult = 1), 
                 geom="pointrange", color="red", shape=3, size=0.85) +
    scale_x_discrete("Interaction type") +
    scale_y_continuous("log10(Number of eQTL SNPs)", expand = c(0,0)) +
    #scale_fill_manual(values=c("lightgrey", "grey", "darkgrey")) +
    scale_fill_manual(values=c("#009444", "#009444", "#009444")) +
    theme_classic() +
    theme(plot.title = element_blank(),
          axis.title.x = element_blank(),
          legend.text=element_text(size=20),
          legend.title=element_blank(),
          legend.position = "none",
          #legend.direction = "horizontal",
          axis.text=element_text(size=22, colour = "black"),
          axis.title=element_text(size=24, colour = "black"),
          strip.text.x = element_text(size = 19, colour = "black")) + 
    #theme(axis.text.x= element_text(size=12)) +
    stat_compare_means(comparisons = eqtls_per_gene, method = "t.test", label = "p.signif")
violin_eqtls_per_gene
#dev.off()
```
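
The significance stars on the violin plot come from pairwise t-tests on the plotted (log10) values. To see the numbers behind one such comparison, the test can be run directly, as in the sketch below. It uses made-up counts and assumes that the comparison corresponds to R's default `t.test()` (Welch's unequal-variance test); this assumption is worth checking against the ggpubr documentation.

```r
# Toy per-gene eQTL counts (made-up values, not the study's data).
toy_counts <- data.frame(
  interaction_type = rep(c("Cis", "Trans-intrachromosomal"), each = 6),
  n = c(120, 45, 300, 80, 15, 60, 1, 2, 1, 3, 1, 1)
)

# Welch two-sample t-test on the log10 scale (the scale used in the plot).
tt <- t.test(log10(n) ~ interaction_type, data = toy_counts)
tt$estimate  # group means of log10(n)
tt$p.value   # p-value for the difference in means
```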

### 1.4. Mean, median, and standard deviation of eQTLs per gene expression grouped by interaction type.

```{r lung_eqtls_per_gene_expression_by_interaction_type}
### reading GTEx lung v8
lung_tpm <- vroom(file="./data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz", delim = "\t", skip=2, col_select = c("Description", "Lung", "gencode_id"= Name))

lung_map <- vroom(file="./results/lung_specific_GRN/lung_grn_significant_eqtls.txt.gz",
                  delim="\t")
lung_map.df <- distinct(lung_map[,c("snp", "gencode_id", "interaction_type")])
lung_map_cis <- subset(lung_map.df, interaction_type=='Cis')
lung_map_intra <- subset(lung_map.df, interaction_type=='Trans-intrachromosomal')
lung_map_inter <- subset(lung_map.df, interaction_type=='Trans-interchromosomal')

lung_map_cis.freq <- count(distinct(lung_map_cis[,c("snp", "gencode_id")]), gencode_id) %>%
    mutate("interaction_type"="Cis")
lung_map_intra.freq <- count(distinct(lung_map_intra[,c("snp", "gencode_id")]), gencode_id) %>%
    mutate("interaction_type"="Trans-intrachromosomal")
lung_map_inter.freq <- count(distinct(lung_map_inter[,c("snp", "gencode_id")]), gencode_id) %>%
    mutate("interaction_type"="Trans-interchromosomal")

lung_map_combined <- rbind(lung_map_cis.freq, lung_map_intra.freq, lung_map_inter.freq)
lung_map_combined <- lung_map_combined %>% 
    mutate(interaction_type=factor(interaction_type, 
                                   levels=c("Cis",
                                            "Trans-intrachromosomal",
                                            "Trans-interchromosomal")))

lung_map_combined_tpm <- merge(lung_map_combined, lung_tpm, by= "gencode_id")
lung_map_combined_tpm.df <- lung_map_combined_tpm %>% 
    relocate("gene"=Description, gencode_id, "tpm"=Lung, interaction_type) %>%
    mutate(interaction_type=factor(interaction_type, 
                                   levels=c("Cis",
                                            "Trans-intrachromosomal",
                                            "Trans-interchromosomal")))
# computing mean, median and SD of gene expression (raw TPM and log10(TPM+1))
lung_map_combined_tpm.df %>% group_by(interaction_type) %>%
    summarise(mean = mean(tpm), median = median(tpm), sd = sd(tpm),
              mean_log10 = mean(log10(tpm+1)), median_log10 = median(log10(tpm+1)),
              sd_log10 = sd(log10(tpm+1)))
# comparing gene expression (TPM) grouped by interaction type
tpm_expression <- list(c("Cis", "Trans-intrachromosomal"),
                       c("Cis", "Trans-interchromosomal"),
                       c("Trans-interchromosomal", "Trans-intrachromosomal"))

#pdf("figures/lung_eqtl_map/violin_expression.pdf", width = 12, height = 8)
violin_expression <- ggplot(lung_map_combined_tpm.df,
                            aes(x=interaction_type, y=log10(tpm+1), fill=interaction_type)) +
    geom_violin(trim=T) + 
    stat_summary(fun = "median", geom = "point", shape = 3, size = 4, color = "white") +
    stat_summary(fun.data=mean_sdl, fun.args = list(mult = 1), 
                 geom="pointrange", color="red", shape=3, size=0.85) +
    scale_x_discrete("Interaction type") +
    scale_y_continuous("log10(TPM+1)", expand = c(0,0)) +
    scale_fill_manual(values=c("#009444", "#009444", "#009444")) +
    theme_classic() +
    theme(plot.title = element_blank(),
          axis.title.x = element_blank(),
          legend.text=element_text(size=20),
          legend.title=element_blank(),
          legend.position = "none",
          #legend.direction = "horizontal",
          axis.text=element_text(size=22, colour = "black"),
          axis.title=element_text(size=24, colour = "black"),
          strip.text.x = element_text(size = 19, colour = "black")) + 
    #theme(axis.text.x= element_text(size=12)) +
    stat_compare_means(comparisons = tpm_expression, method = "t.test", label = "p.signif")
violin_expression
#dev.off()
```

### 1.5. Correlation analysis.

We correlated the number of cis-, trans-intrachromosomal and trans-interchromosomal eQTL-eGene interactions in the lung-specific regulatory map with the number of all variants (genotyped from lung samples obtained from GTEx v8) on each chromosome. The file "GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.lookup_table_short_version.txt.gz" is available on request only.

```{r lung_map_correlation_analysis}
# getting number of SNPs per chromosome
gtex_snps <- vroom(file="./data/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.lookup_table_short_version.txt.gz", delim ="\t")

# reading the significant eQTLs for each chromosome (chr1-22 and chrX) and stacking them
chromosomes <- c(1:22, "X")
lung_map <- do.call(rbind, lapply(chromosomes, function(chr) {
  vroom(file = paste0("./results/lung_specific_GRN/significant_eqtls_chr", chr, ".txt.gz"),
        delim = "\t", col_select = c(snp, snp_chr))
}))

# counting the number of SNPs per chromosome (GTEx and lung-specific GRN)
gtex_snps_count <- gtex_snps %>% count(chr)
lung_map_eqtls_count <- lung_map %>% count(snp_chr)

# calculating gene density per Mb for each chromosome
genomeSum <- read.csv("./data/genomeSummary_grch38p13.txt", sep="\t", skip=1)
genomeSum <- genomeSum %>% 
  mutate(geneDensity = genomeSum$Gene/genomeSum$Size..Mb.) %>% #compute gene density
  mutate(Chromosome = tolower(Type)) %>% dplyr::select(geneDensity, Chromosome)
genomeSum[23,2] <- "chrX"; genomeSum[24,2] <- "chrY"; genomeSum[25,2] <- "chrMT"

# merging the two count tables (GTEx and lung-specific GRN)
lung_map_count_merged <- merge(lung_map_eqtls_count, gtex_snps_count, by=1, all=TRUE) %>%
  dplyr::rename("Chromosome"=snp_chr, "eQTL_eGene_interactions"=n.x, "snps_number"=n.y) %>%
  merge(genomeSum, by="Chromosome")

#pdf("figures/lung_eqtl_map/cor_lung_map_cis_per_chromosome.pdf", width = 13, height = 6)
options(scipen = 6)
corr_gg <- ggplot(lung_map_count_merged, aes(x = snps_number, y = eQTL_eGene_interactions)) +
  geom_smooth(fullrange = TRUE, method = lm, alpha = 0.2, level = 0.95, color = "red", 
              fill = "#009444") +
  geom_count(aes(size = geneDensity)) +
  geom_text_repel(aes(label = Chromosome),
                  min.segment.length = 0,
                  box.padding   = 0.35,
                  point.padding = 0.5,
                  segment.color = 'grey50',
                  max.overlaps = 30) +
  stat_cor(method = "pearson", label.x = 1000, label.y = 90000, size = 7) +
  scale_x_continuous("Number of SNPs", expand = c(0, 0), limits = c(0, NA)) +
  # the lower y limit is set to -1000000 so that the CI band is not cut off by the scale;
  # together with coord_cartesian() below, the band then extends to the edges of the plot
  scale_y_continuous("Number of eQTL-eGene interactions", expand = c(0, 0),
                     limits = c(-1000000, 100000)) +
  #labs(title = "Correlation between the number of eQTLs from lung-specific GRN and the number of SNPs from GTEx", hjust = 1) +
  coord_cartesian(xlim = c(0, NA), ylim = c(0, NA)) + 
  guides(size = guide_legend("Genes/Mb")) +
  scale_size_continuous(limits = c(10, 60)) +
  theme_classic() +
  theme(plot.title = element_blank(),
        #axis.title.x = element_blank(),
        legend.text=element_text(size=16),
        legend.title=element_text(size=18),
        #legend.position = "none",
        #legend.direction = "horizontal",
        axis.text=element_text(size=16, colour = "black"),
        axis.title=element_text(size=18, colour = "black"))
corr_gg
#dev.off()
```
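
The coefficient printed on the plot can also be obtained outside ggplot2. The sketch below runs a Pearson correlation test and fits the corresponding linear model on made-up per-chromosome counts. The column names follow `lung_map_count_merged`, but the values are illustrative only.

```r
# Toy per-chromosome counts (made-up values, not the study's data).
toy_chr <- data.frame(
  Chromosome = c("chr1", "chr2", "chr3", "chr4", "chr5"),
  snps_number = c(3400000, 3600000, 3000000, 3100000, 2700000),
  eQTL_eGene_interactions = c(80000, 62000, 51000, 40000, 43000)
)

# Pearson correlation between variants and eQTL-eGene interactions per chromosome.
ct <- cor.test(toy_chr$snps_number, toy_chr$eQTL_eGene_interactions,
               method = "pearson")
ct$estimate  # correlation coefficient R
ct$p.value   # two-sided p-value

# Linear fit of the kind drawn by geom_smooth(method = lm).
fit <- lm(eQTL_eGene_interactions ~ snps_number, data = toy_chr)
coef(fit)
```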

## 2. Construction of COPD-associated lung-specific gene regulatory map (GRN).

```{r copd_codes3d_results}
### Getting number of COPD-associated GWAS snps
snps <- readLines("data/snps/263_COPD_snps_5E-8.txt") # 263

### Reading significant spatial interactions in lung
sp_lung <- read.table(gzfile("results/codes3d/lung/significant_eqtls.txt.gz"), 
                      header = TRUE, sep = "\t") # 151

### Getting spatial eQTLs in lung
sp_lung_eqtls <- unique(sp_lung$snp) # 103
#writeLines(sp_lung_eqtls, con = "results/codes3d/lung/copd_lung_eqtl_snps.txt", sep = "\n")

### Getting spatially regulated genes in lung
sp_lung_egenes <- unique(sp_lung$gene) # 107
#writeLines(sp_lung_egenes, con = "results/codes3d/lung/copd_lung_egenes.txt", sep = "\n")
```

### 2.1. Percentage of spatial eQTL SNPs and non-eQTL SNPs across lung.

```{r spatial_eqtl_vs_non-eQTL, fig.width=5, fig.height=7}
# getting eqtls vs non-eqtls in lung
copd.snp.lung <- data.frame(
    snps = c("eQTL","non-eQTL"),
    number = c(length(sp_lung_eqtls), length(snps)-length(sp_lung_eqtls)),
    percentage = c(round(length(sp_lung_eqtls)/length(snps)*100, 2),
                   round((length(snps)-length(sp_lung_eqtls))/length(snps)*100, 2)))

#pdf("figures/functional_annotation/lung_eqtls_vs_non-eqtls.pdf", width = 8, height = 9)
ggplot(copd.snp.lung, aes(x = factor(snps, levels=c("non-eQTL","eQTL")), y = percentage,
                          fill = snps)) +
    geom_bar(stat="identity", position = "dodge") + 
    scale_fill_manual(values=c("#ED1C24", "grey")) +
    scale_y_continuous("Percentage", expand = c(0,0), limits = c(0,65)) +
    theme_classic() +
    theme(plot.title = element_blank(),
          axis.title.x = element_blank(),
          legend.text=element_text(size=20),
          legend.title=element_blank(),
          legend.position = "none",
          #legend.direction = "horizontal",
          axis.text=element_text(size=20, colour = "black"),
          axis.title=element_text(size=28, colour = "black"),
          strip.text.x = element_text(size = 19, colour = "black")) +
    geom_text(aes(y = percentage, label = paste0("n=", number)),
              position=position_dodge(width=0.9), vjust=-0.25, color = "black", 
              size = 8, fontface = 'italic')
#dev.off()
```

### 2.2. Functional annotation of eQTLs associated with COPD.

We used the [wANNOVAR tool](http://wannovar.wglab.org/) to annotate the loci tagged by the eQTLs (accessed 9 June 2021). First, we extracted the functional annotation for COPD-associated eQTLs: `cut -f1,2,3,6,132 results/wannovar/query.output.genome_summary.txt | sort -u > results/wannovar/copd_fun_snp_ann.txt`

```{r wANNOVAR, fig.width=7, fig.height=5}
# calculating the number and percentage of SNPs in each functional annotation category
cal_ann <- function(phe, total) {
    # annotation categories from the Func.refGene column of the wANNOVAR output
    categories <- c("downstream", "exonic", "intergenic", "intronic", "ncRNA_exonic",
                    "ncRNA_intronic", "upstream", "UTR3")
    # counting SNPs per category
    number <- vapply(categories, function(a) sum(phe$Func.refGene == a, na.rm = TRUE),
                     integer(1), USE.NAMES = FALSE)
    data.frame(annotation = categories, number = number, percent = number / total * 100)
}

copd_ann <- read.table("results/wannovar/copd_fun_snp_ann.txt", sep = "\t", header=FALSE)
copd_ann <- copd_ann[order(copd_ann$V1, decreasing = TRUE),]
colnames(copd_ann) <- copd_ann[1, ]; copd_ann <- copd_ann[-1, ]
copd_ann_lung <- subset(copd_ann, copd_ann$Otherinfo %in% sp_lung_eqtls) # 103

copd_fun_ann_lung <- cal_ann(copd_ann_lung, length(sp_lung_eqtls))

# lung eQTLs
l_number <- c(copd_fun_ann_lung[copd_fun_ann_lung$annotation=="downstream",][,2],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="exonic",][,2],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="intergenic",][,2],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="intronic",][,2],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="ncRNA_intronic",][,2],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="ncRNA_exonic",][,2],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="UTR3",][,2])
l_percent <- c(copd_fun_ann_lung[copd_fun_ann_lung$annotation=="downstream",][,3],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="exonic",][,3],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="intergenic",][,3],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="intronic",][,3],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="ncRNA_intronic",][,3],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="ncRNA_exonic",][,3],
              copd_fun_ann_lung[copd_fun_ann_lung$annotation=="UTR3",][,3])

lung_eqtls.df <- data.frame(annotation = c("downstream", "exonic", "intergenic", "intronic",
                                           "ncRNA_intronic", "ncRNA_exonic", "UTR3"),
                            number = l_number,
                            percent = l_percent)

#pdf("figures/functional_annotation/lung_copd_fun_ann.pdf", width = 8, height = 5)
ggplot(lung_eqtls.df, aes(x = annotation, y = percent, fill = "#009444")) +
  geom_bar(stat="identity", position = "dodge") + 
  theme_classic() +
  theme(plot.title = element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none",
        axis.text=element_text(size=14, colour = "black"),
        axis.text.x=element_text(vjust = 0.5, hjust=0.75),
        axis.title=element_text(size=14, colour = "black")) +
  scale_fill_manual(values = "#009444") +
  scale_y_continuous("Percentage", expand = c(0,0), limits = c(0,75)) + 
  geom_text(aes(y = percent, label = paste0(round(percent, digits=2), "%")),
            position=position_dodge(width=0.9), vjust=0.75, hjust=-0.15, 
            color = "black", size = 3.5) +
  #scale_y_continuous(limits = c(0, 70)) +
  xlab("Functional annotation") +
  coord_flip()
#dev.off()

gwas <- read.table("results/wannovar/gwas.txt", header = TRUE, sep = "\t", quote="", encoding="UTF-8")
eqtl <- read.table("results/wannovar/eqtl.txt", header = TRUE, sep = "\t", quote="", encoding="UTF-8")
merged <- left_join(eqtl, gwas, by = "snp")
#write.table(merged, file = "results/wannovar/annotated_eqtl_snps.txt", sep = "\t", col.names = FALSE, row.names=FALSE)

copd_gwas <- read.table("results/codes3d/copd_gwas.txt", header = TRUE, sep = "\t", quote="", encoding="UTF-8")
copd_eqtl <- read.table("results/codes3d/copd_eqtl.txt", header = TRUE, sep = "\t", quote="", encoding="UTF-8")
copd_merged <- left_join(copd_gwas, copd_eqtl, by = "snp")
#write.table(copd_merged, file = "results/codes3d/copd_gwas_vs_eqtl_snps.txt", sep = "\t", col.names = FALSE, row.names=FALSE)
```
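
The per-category counts can also be tabulated with base R. The sketch below is illustrative only: the data frame is made up and the category list is shortened, while the column name `Func.refGene` follows the wANNOVAR output used above. Passing a factor with fixed levels to `table()` keeps categories with zero SNPs in the result.

```r
# Toy annotation table (made-up rows mimicking the wANNOVAR columns used above).
toy_ann <- data.frame(
  Func.refGene = c("intronic", "intronic", "intergenic", "exonic", "intronic", "UTR3"),
  Otherinfo = paste0("rs", 1:6),
  stringsAsFactors = FALSE
)

# Fixed category levels so that empty categories are reported as 0.
categories <- c("downstream", "exonic", "intergenic", "intronic", "UTR3")
counts <- table(factor(toy_ann$Func.refGene, levels = categories))

toy_ann_summary <- data.frame(
  annotation = names(counts),
  number = as.integer(counts),
  percent = as.integer(counts) / nrow(toy_ann) * 100
)
toy_ann_summary
```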

### 2.3. Number of eGenes per eQTL SNP.

```{r genes_per_eqtl}
# getting the number of eGenes per eQTL SNP
copd.snps.x <- distinct(sp_lung[,c("snp", "gene")])
copd.snps.freq <- count(distinct(sp_lung[,c("snp", "gene")]), snp) # 103 eQTL SNPs
copd.snps.freq <- data.frame(table(copd.snps.freq$n))
colnames(copd.snps.freq) <- c("genes", "snps")
#copd.snps.freq$genes <- as.numeric(as.character(copd.snps.freq$genes))

#pdf("figures/functional_annotation/lung_copd_genes_per_eqtl.pdf", width = 5, height = 9)
ggplot(copd.snps.freq, aes(x = factor(snps, level=c("2","8", "26", "67")), y = genes)) + 
  geom_bar(stat="identity", position = "dodge", fill = "#009444") + 
  scale_fill_manual(values=c("#009444")) +
  scale_x_discrete("Number of eQTL SNPs") +
  scale_y_discrete("Number of eGenes", expand = c(0,0)) +
  #scale_y_continuous(limits=c(0, 7), expand=c(0,0)) +
  theme_classic() +
  theme(plot.title = element_blank(),
        #axis.title.x = element_blank(),
        legend.text=element_text(size=20),
        legend.title=element_blank(),
        legend.position = "none",
        #legend.direction = "horizontal",
        axis.text=element_text(size=20, colour = "black"),
        axis.title=element_text(size=24, colour = "black"),
        strip.text.x = element_text(size = 19, colour = "black")) + 
  geom_text(aes(y = genes, label = paste0("n=", genes)),
            position=position_dodge(width=0.9), vjust=-0.25, color = "black", 
            size = 8, fontface = 'italic')
  #labs(y = "Number of eGenes")
#dev.off()
```

### 2.4. Percentage of spatial cis-, trans-intrachromosomal and trans-interchromosomal eQTL-eGene interactions.

```{r cis_vs_trans, fig.width=5, fig.height=7}
# reading significant spatial interactions in lung
sp_lung <- read.table(gzfile("results/codes3d/lung/significant_eqtls.txt.gz"), 
                      header = TRUE, sep = "\t") #151

# number and percentage of cis, trans-intrachromosomal and trans-interchromosomal interactions in lung
copd.snp.lung <- data.frame(
    interactions = c("Cis","Trans-intra", "Trans-inter"),
    number = c(148, 1, 2),
    percentage = c(round(148/151*100, 2),
                   round(1/151*100, 2),
                   round(2/151*100, 2)))

#pdf("figures/functional_annotation/lung_copd_cis_vs_trans.pdf", width = 7, height = 9)
ggplot(copd.snp.lung, aes(x = factor(interactions, 
                                     level=c("Cis","Trans-intra", "Trans-inter")), 
                          y = percentage)) +
    geom_bar(stat="identity", position="dodge", fill="#009444") + 
    scale_fill_manual(values=c("#009444")) +
    scale_x_discrete("Interaction type") +
    scale_y_continuous("Percentage", expand = c(0,0), limits = c(0, 105)) +
    theme_classic() +
    theme(plot.title = element_blank(),
          #axis.title.x = element_blank(),
          legend.text=element_text(size=20),
          legend.title=element_blank(),
          legend.position = "none",
          #legend.direction = "horizontal",
          axis.text=element_text(size=20, colour = "black"),
          axis.title=element_text(size=24, colour = "black"),
          strip.text.x = element_text(size = 15, colour = "black")) +
    geom_text(aes(y = percentage, label = paste0("n=", number)),
              position=position_dodge(width=0.9), vjust=-0.25, color = "black", 
              size = 8, fontface = 'italic')
#dev.off()
```
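
The counts in the chunk above are entered by hand. As a sketch of how they could instead be derived from the interaction table, the example below tabulates interaction types on made-up rows. It assumes that the CoDeS3D output has an `interaction_type` column with the same labels as the lung-specific GRN table, which should be verified against the actual file.

```r
# Toy interaction table (made-up rows, not the study's data).
toy_sp <- data.frame(
  snp  = c("rs1", "rs1", "rs2", "rs3", "rs4"),
  gene = c("A", "B", "C", "D", "E"),
  interaction_type = c("Cis", "Cis", "Cis",
                       "Trans-intrachromosomal", "Trans-interchromosomal"),
  stringsAsFactors = FALSE
)

types <- c("Cis", "Trans-intrachromosomal", "Trans-interchromosomal")

# Counting each SNP-gene pair once, then tabulating by interaction type.
toy_pairs <- unique(toy_sp[, c("snp", "gene", "interaction_type")])
n_by_type <- table(factor(toy_pairs$interaction_type, levels = types))

toy_interactions <- data.frame(
  interactions = names(n_by_type),
  number = as.integer(n_by_type),
  percentage = round(as.integer(n_by_type) / nrow(toy_pairs) * 100, 2)
)
toy_interactions
```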

### 2.5. Percentage of protein-coding genes, non-coding RNA genes and pseudogenes.

```{r gene_types, fig.width=5, fig.height=7}
# getting spatially regulated genes in lung
sp_lung_egenes <- unique(sp_lung$gene) # 107

# number and percentage of protein-coding genes, non-coding RNA genes and pseudogenes in lung
copd.gene.lung <- data.frame(
    gene_types = c("Protein-coding","Non-coding RNA", "Pseudogene"),
    number = c(84, 22, 1),
    percentage = c(round(84/107*100, 2),
                   round(22/107*100, 2),
                   round(1/107*100, 2)))

#pdf("figures/functional_annotation/lung_copd_gene_types.pdf", width = 8, height = 9)
ggplot(copd.gene.lung, aes(x = factor(gene_types, 
                                     levels=c("Protein-coding","Non-coding RNA", "Pseudogene")),
                           y = percentage)) +
    geom_bar(stat="identity", position="dodge", fill="#009444") + 
    scale_fill_manual(values=c("#009444")) +
    scale_y_continuous("Percentage", expand = c(0,0), limits = c(0, 85)) +
    scale_x_discrete("Gene type") +
    theme_classic() +
    theme(plot.title = element_blank(),
          #axis.title.x = element_blank(),
          legend.text=element_text(size=20),
          legend.title=element_blank(),
          legend.position = "none",
          #legend.direction = "horizontal",
          axis.text=element_text(size=20, colour = "black"),
          axis.title=element_text(size=24, colour = "black"),
          strip.text.x = element_text(size = 15, colour = "black")) +
    geom_text(aes(y = percentage, label = paste0("n=", number)),
              position=position_dodge(width=0.9), vjust=-0.25, color = "black", 
              size = 8, fontface = 'italic')
#dev.off()
```

## 3. Construction of lung-specific protein-protein interaction network (LSPPIN).

The STRING PPI network (version 11.5, protein.links.full.v11.5.txt.gz) was downloaded from the [STRING website](https://string-db.org/cgi/download?sessionId=bBkmARm05QYT). We extracted only human PPIs with a combined score ≥ 700:
`zgrep ^"9606\." protein.links.full.v11.5.txt.gz | awk '($16 >= 700) { print $1, $2, $16 }' > human_PPIs.full.v11.5_700.txt`  
Made the file tab separated: `tr ' ' '\t' < human_PPIs.full.v11.5_700.txt > tab_human_PPIs.full.v11.5_700.txt`  
And compressed it: `gzip tab_human_PPIs.full.v11.5_700.txt`
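
For readers who prefer to stay in R, the filtering step can be sketched on a toy table. The rows and the three-column layout below are made up for illustration; in the full STRING file the combined score is the 16th column, as in the `awk` command above.

```r
# Toy STRING-like links (hypothetical rows and IDs; the real file has 16 columns).
toy_links <- data.frame(
    protein1 = c("9606.ENSP00000000001", "9606.ENSP00000000001", "10090.ENSMUSP00000000001"),
    protein2 = c("9606.ENSP00000000002", "9606.ENSP00000000003", "10090.ENSMUSP00000000002"),
    combined_score = c(900, 490, 950),
    stringsAsFactors = FALSE
)

# Keep human (taxon 9606) links with a combined score of at least 700.
is_human <- grepl("^9606\\.", toy_links$protein1)
toy_700 <- toy_links[is_human & toy_links$combined_score >= 700, ]

# Drop the taxon prefix so that only Ensembl protein IDs remain.
toy_700$protein1 <- sub("^9606\\.", "", toy_700$protein1)
toy_700$protein2 <- sub("^9606\\.", "", toy_700$protein2)
toy_700
```

Only the first toy row survives: the second fails the score threshold and the third is not human.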

```{r PPI_cleaning}
# loading human PPI network
PPI_700 <- read.table(gzfile("results/string/tab_human_PPIs.full.v11.5_700.txt.gz"),
                      header = FALSE, sep="\t") # 505,968 PPIs
# removing all characters before and up to "."
PPI_700$V1 <- gsub(".*\\.", "", PPI_700$V1); PPI_700$V2 <- gsub(".*\\.", "", PPI_700$V2)
PPI_700 <- distinct(PPI_700) # 505,968 PPIs
#write.table(PPI_700, file = "results/string/PPI_700.txt", sep = "\t", col.names = FALSE, row.names=FALSE)
# extracting unique proteins
unique_proteins <- unique(c(PPI_700$V1, PPI_700$V2)) # 16,814
```

Overall, we extracted `r pander(nrow(PPI_700))` PPIs between a total of `r pander(length(unique_proteins))` unique human proteins.

Ensembl gene identifiers from the lung-specific GRN were mapped to Ensembl protein identifiers.

```{r protein_IDs}
# reading lung-specific gene regulatory network (GRN) data
lung_GRN_genes <- read.table(gzfile("results/lung_specific_GRN/lung_specific_GRN_genes.txt.gz")) # 15,855 genes
colnames(lung_GRN_genes) <- c("gene_id", "gene_name")
lung_GRN_genes <- unique(lung_GRN_genes)

# removing version suffixes from Ensembl gene IDs
lung_GRN_genes$gene_id <- gsub("\\..*","", lung_GRN_genes$gene_id)
# querying protein features
edb <- EnsDb.Hsapiens.v86
#hasProteinData(edb) # TRUE
#listTables(edb)
# getting protein information for gene IDs
lung_GRN_proteins <- genes(edb, filter = ~ gene_id %in% lung_GRN_genes$gene_id,
                           columns = c("protein_id", "gene_name"))
l_p_df <- bind_rows(lung_GRN_proteins@elementMetadata@listData) # 79,005
#write.table(l_p_df, file = "results/string/lung_GRN_proteins.txt", sep = "\t", col.names = TRUE, row.names=FALSE)
# keeping only protein IDs and gene names, and removing protein-gene pairs containing NAs
l_p_df <- l_p_df[, c("protein_id", "gene_name")]
l_p_df <- l_p_df[complete.cases(l_p_df), ] # 65,742 proteins
#write.table(l_p_df, file = "results/string/lung_GRN_gene_pairs.txt", sep = "\t", col.names = TRUE, row.names=FALSE)
l_p_df <- read.table(gzfile("results/string/lung_GRN_gene_pairs.txt.gz"), header = TRUE, 
                     sep="\t", colClasses = "character")
```
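
The version-stripping step in the chunk above can be checked on a few made-up identifiers. This only illustrates the regular expression; the IDs are examples, not entries from the GRN file.

```r
# Example gene IDs in the versioned GENCODE style (illustrative values).
versioned_ids <- c("ENSG00000000001.12", "ENSG00000000002.5", "ENSG00000000003")

# Remove everything from the first dot onwards; IDs without a dot are unchanged.
unversioned_ids <- gsub("\\..*", "", versioned_ids)
unversioned_ids
```

Stripping the suffix is what allows the IDs to be matched against `EnsDb.Hsapiens.v86`, which stores gene IDs without versions.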

Lung-specific PPIN (LSPPIN) was constructed by combining the STRING PPI network with lung-specific GRN.

```{r LSPPIN}
# subsetting PPIs that are present in lung tissue
l_PPI_700 <- subset(PPI_700, (PPI_700$V1 %in% l_p_df$protein_id & PPI_700$V2 %in% l_p_df$protein_id))
#write.table(l_PPI_700, file = "results/string/LSPPIN.txt", sep = "\t", col.names = FALSE, row.names=FALSE) # 210,192 PPIs
l_unique_proteins <- unique(c(l_PPI_700$V1, l_PPI_700$V2)) # 10,188
```

The resulting LSPPIN contained `r pander(nrow(l_PPI_700))` PPIs between `r pander(length(l_unique_proteins))` unique proteins in the lung tissue.
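
The subsetting rule keeps a PPI only when both interacting proteins are encoded by genes in the lung-specific GRN. A toy example with made-up protein labels shows its effect:

```r
# Toy PPI table in the same V1/V2/V3 layout as PPI_700 (hypothetical values).
toy_ppi <- data.frame(
    V1 = c("P1", "P1", "P2", "P4"),
    V2 = c("P2", "P3", "P4", "P5"),
    V3 = c(900, 800, 750, 999),
    stringsAsFactors = FALSE
)

# Proteins assumed to be present in the tissue of interest.
lung_proteins <- c("P1", "P2", "P3")

# Keep an interaction only if both partners are in the tissue set.
toy_lsppin <- subset(toy_ppi, V1 %in% lung_proteins & V2 %in% lung_proteins)
toy_lsppin_proteins <- unique(c(toy_lsppin$V1, toy_lsppin$V2))
toy_lsppin
```

The `P2`-`P4` link is dropped even though `P2` is in the set, because one partner alone is not enough.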

Ensembl protein identifiers in the LSPPIN were then mapped to gene names.

```{r protein_ID_mapping, cache=TRUE}
l_dict <- l_p_df$gene_name # 65,742 proteins
names(l_dict) <- l_p_df$protein_id
# This function maps protein IDs to gene names. It takes two arguments: ppi is a data frame containing IDs for interacting proteins (first two columns: V1 and V2) and the combined score for their interaction (third column: V3); dict is a named vector of gene names whose names are protein IDs.
create_lsppi <- function(ppi, dict){
    df <- data.frame()
    for(i in 1:nrow(ppi)){
        cat("Analysing PPI: ", i, "\n")
        tryCatch({
            p1 <- ppi[i,][1]; p2 <- ppi[i,][2]; comb <- ppi[i,][3]
            t <- c(dict[p1$V1], dict[p2$V2], comb$V3)
            #cat("PPI is: ", t, "\n")
            df <- rbind(df, t)
        }, error=function(e){
            cat("ERROR: ", conditionMessage(e), "\n")
        })
    }
    df <- distinct(df)
    colnames(df) <- c("p1", "p2", "combined_score")
    return(df)
}
# Testing
#test_subset <- subset(l_PPI_700, l_PPI_700$V3>998) # 8,410 PPIs
#test_lsppi.df <- create_lsppi(test_subset, l_dict)

# Uncomment the two lines below to rebuild and save the LSPPIN (slow: one iteration per PPI)
#lsppin.df <- create_lsppi(l_PPI_700, l_dict) # 210,192 PPIs
#write.table(lsppin.df, file = "results/string/LSPPIN.txt", sep = "\t", col.names = TRUE, row.names=FALSE)
lsppin.df <- read.table(gzfile("results/string/LSPPIN.txt.gz"), header = TRUE, sep = "\t")
```
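
The row-by-row loop in `create_lsppi` is slow for roughly 210,000 PPIs. A possible vectorised alternative, shown here as a sketch on toy data rather than as the method used to produce `LSPPIN.txt`, indexes the named vector directly. The helper name `map_ppi_to_genes` and all values are hypothetical.

```r
# Named vector: names are protein IDs, values are gene names (toy values).
toy_dict <- c(P1 = "GENEA", P2 = "GENEB", P3 = "GENEC")

# Toy PPI table; the third row duplicates the first.
toy_ppi <- data.frame(
    V1 = c("P1", "P2", "P1"),
    V2 = c("P2", "P3", "P2"),
    V3 = c(900, 850, 900),
    stringsAsFactors = FALSE
)

# Look up both partners in one step instead of looping over rows.
map_ppi_to_genes <- function(ppi, dict) {
    df <- data.frame(
        p1 = unname(dict[ppi$V1]),
        p2 = unname(dict[ppi$V2]),
        combined_score = ppi$V3,
        stringsAsFactors = FALSE
    )
    unique(df)  # drop duplicated rows, as distinct() does in create_lsppi
}

toy_lsppin.df <- map_ppi_to_genes(toy_ppi, toy_dict)
toy_lsppin.df
```

Protein IDs missing from the dictionary come back as `NA` in this sketch, so the output is worth checking with `anyNA()` before it is written to disk.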

### 3.1. Number of human protein-protein interactions (PPI) in the STRING PPI network and lung-specific PPI network.

```{r ppis}
# getting the number of human PPIs in STRING and LSPPIN (combined score ≥ 700).
PPI_700 <- read.table(gzfile("results/string/tab_human_PPIs.full.v11.5_700.txt.gz"), 
                      header = FALSE, sep="\t") # 505,968 PPIs
# removing all characters before and up to "."
PPI_700$V1 <- gsub(".*\\.", "", PPI_700$V1); PPI_700$V2 <- gsub(".*\\.", "", PPI_700$V2)
string_700 <- distinct(PPI_700) # 505,968 PPIs
lsppin_700 <- subset(string_700, (string_700$V1 %in% l_p_df$protein_id & 
                                    string_700$V2 %in% l_p_df$protein_id)) # 210,192 PPIs
nrow(lsppin_700)

lung.map.ppis <- data.frame(
  ppis = rep(c("LSPPIN","STRING")),
  number = c(nrow(lsppin_700), (nrow(string_700)-nrow(lsppin_700))),
  percentage = c(round((nrow(lsppin_700)/nrow(string_700))*100, 2),
                 round(((nrow(string_700)-nrow(lsppin_700))/nrow(string_700))*100, 2)))

#pdf("figures/lung_eqtl_map/lung_map_ppis.pdf", width = 9, height = 9)
ggplot(lung.map.ppis, aes(x = "", y = percentage, fill = ppis)) +
  geom_bar(width = 1, stat = "identity", fill = c("#009444", "darkgrey"), color = "white") +
  coord_polar("y", start = 0)+
  #geom_text(aes(y = percentage, label = snps), color = "white")+
  #scale_fill_manual(values = mycols) +
  theme_void()
#dev.off()
```

### 3.2. Number of unique human proteins in the STRING PPI network and the lung-specific PPI network.

```{r proteins}
# getting the number of unique proteins in STRING and LSPPIN
string_proteins <- unique(c(string_700$V1, string_700$V2)) # 16,814
lsspin_proteins <- unique(c(lsppin_700$V1, lsppin_700$V2)) # 10,188

lung.map.p <- data.frame(
  ppis = rep(c("LSPPIN","STRING")),
  number = c(length(lsspin_proteins), (length(string_proteins)-length(lsspin_proteins))),
  percentage = c(round((length(lsspin_proteins)/length(string_proteins))*100, 2),
                 round(((length(string_proteins)-length(lsspin_proteins))/length(string_proteins))*100, 2)))

#pdf("figures/lung_eqtl_map/lung_map_proteins.pdf", width = 9, height = 9)
ggplot(lung.map.p, aes(x = "", y = percentage, fill = ppis)) +
  geom_bar(width = 1, stat = "identity", fill = c("#009444", "darkgrey"), color = "white") +
  coord_polar("y", start = 0) +
  #geom_text(aes(y = percentage, label = snps), color = "white")+
  #scale_fill_manual(values = mycols) +
  theme_void()
#dev.off()
```

## 4. Construction of COPD-associated LSPPIN.

To build the COPD-associated LSPPIN, only interactions between COPD-associated genes were extracted from the LSPPIN.

```{r COPD_LSPPIN}
# subsetting COPD-associated PPIs that are present in lung tissue
copd_lsppin.df <- subset(lsppin.df, (lsppin.df$p1 %in% sp_lung_egenes 
                                     & lsppin.df$p2 %in% sp_lung_egenes))
#write.table(copd_lsppin.df, file = "results/string/COPD_LSPPIN_links.txt", sep = "\t", col.names = FALSE, row.names=FALSE) # 28 PPIs
copd_l_unique_proteins <- unique(c(copd_lsppin.df$p1, copd_lsppin.df$p2)) # 23 proteins
```

The resulting COPD-associated LSPPIN contained `r pander(nrow(copd_lsppin.df))` PPIs between `r pander(length(copd_l_unique_proteins))` unique proteins in the lung.
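
One simple way to inspect such a subnetwork is to count how many interactions each protein takes part in. The sketch below uses base R and made-up gene labels; it is not part of the original analysis.

```r
# Toy edge list in the same p1/p2/combined_score layout (hypothetical values).
toy_copd_lsppin <- data.frame(
    p1 = c("G1", "G1", "G1"),
    p2 = c("G2", "G3", "G4"),
    combined_score = c(950, 720, 810),
    stringsAsFactors = FALSE
)

# Degree of each protein: number of rows in which it appears as either partner.
toy_degree <- sort(table(c(toy_copd_lsppin$p1, toy_copd_lsppin$p2)),
                   decreasing = TRUE)
toy_degree
```

If each pair is listed in both orientations in the real edge list, as is typical for STRING link files, these counts would need to be halved or the edge list de-duplicated first.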

### 4.1. Pathway analysis of the COPD-associated genes.

Pathway analysis was performed using the g:GOSt module of the g:Profiler tool. The significance level was determined using the Benjamini-Hochberg algorithm (FDR < 0.05) (7 April 2022). The chunk below queries g:GOSt directly; uncomment the `write.table` line to save the results.

```{r pathway_analysis}
# This function queries the g:GOSt module of the g:Profiler tool for KEGG pathways. It takes a vector of genes and returns the full g:GOSt output (a list whose "result" element is the data frame with the query results).
# with "known" domain scope - run on 6 October 2022.
query_kegg <- function(genes){
    tryCatch({
        t <- gost(query = genes, organism = "hsapiens", ordered_query = TRUE,
                  multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
                  measure_underrepresentation = FALSE, evcodes = TRUE,
                  user_threshold = 0.05, correction_method = "fdr",
                  domain_scope = "known", custom_bg = NULL,
                  numeric_ns = "", sources = c("KEGG"), as_short_link = FALSE)
        return(t)
        }, error=function(e){
            cat("ERROR: ", conditionMessage(e), "\n")
        })
}

# Pathway analysis of all 107 COPD-associated genes
copd_path <- query_kegg(sp_lung_egenes)
copd_paths <- copd_path$result
copd_paths.df <- apply(copd_paths, 2, as.character)
#write.table(copd_paths.df, file = "results/kegg/copd_egenes_kegg_fdr_known.txt", sep = "\t", col.names = TRUE, row.names=FALSE)
```
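
g:GOSt applies the multiple-testing correction on the server, so nothing needs to be computed locally. To illustrate what the Benjamini-Hochberg adjustment does, the sketch below applies base R's `p.adjust` to made-up p-values; these numbers are not results from this study.

```r
# Made-up raw p-values for five hypothetical pathways.
toy_p <- c(0.001, 0.008, 0.02, 0.2, 0.6)

# Benjamini-Hochberg adjustment: each p-value is scaled by n / rank,
# then made monotone from the largest p-value downwards.
toy_fdr <- p.adjust(toy_p, method = "BH")

# Terms passing the FDR < 0.05 threshold used in this report.
significant <- toy_fdr < 0.05
data.frame(p = toy_p, fdr = toy_fdr, significant = significant)
```

In this toy example the three smallest p-values remain significant after adjustment.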

### 4.2. Gene Ontology (GO) enrichment analysis of the COPD-associated genes.

GO analysis was performed using the g:GOSt module of the g:Profiler tool. The significance level was determined using the Benjamini-Hochberg algorithm (FDR < 0.05) (7 April 2022).

```{r go_analysis, fig.width=12, fig.height=5}
# This function queries the g:GOSt module of the g:Profiler tool for GO terms. It takes a vector of genes and returns the data frame with the query results.
query_go <- function(genes){
    tryCatch({
        t <- gost(query = genes, organism = "hsapiens", ordered_query = TRUE,
                  multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
                  measure_underrepresentation = FALSE, evcodes = TRUE,
                  user_threshold = 0.05, correction_method = "fdr",
                  domain_scope = "annotated", custom_bg = NULL,
                  numeric_ns = "", sources = "GO", as_short_link = FALSE)
        return(t[["result"]])
        }, error=function(e){
            cat("ERROR: ", conditionMessage(e), "\n")
        })
}

# COPD-associated eGenes
copd_lung_go <- query_go(sp_lung_egenes)
copd_lung_gos.df <- apply(copd_lung_go, 2, as.character)
#write.table(copd_lung_gos.df, file = "results/go/copd_lung_all_gos_fdr.txt", sep = "\t", col.names = TRUE, row.names=FALSE)
copd_lung_gos.df <- read.table(gzfile("results/go/copd_lung_all_gos_fdr.txt.gz"), 
                               header = TRUE, sep="\t")

# Extracting the top 10 terms for biological process
copd_lung_go_bp <- copd_lung_gos.df[grep('GO:BP', copd_lung_gos.df[, "source"]), ]
copd_lung_go_bp_top10 <- as.data.frame(copd_lung_go_bp[1:10,])

#pdf("figures/go/copd_lung_bp_gos_fdr_top10.pdf", width = 9, height = 4)
ggplot(copd_lung_go_bp_top10, aes(x=factor(term_name, levels = rev(levels(factor(term_name)))),
                       y=-log10(as.numeric(p_value)), fill="#009444")) +
    geom_bar(stat="identity") +
    theme_classic() +
    theme(plot.title = element_blank(),
          axis.title.x = element_text(size=16, colour = "black"),
          axis.text=element_text(size=16, colour = "black"),
          axis.title.y = element_blank(),
          legend.position = "none") +
    scale_fill_manual(values=c("#009444")) +
    labs(y = "-log10(p)") +
    geom_hline(aes(yintercept=-log10(as.numeric(0.05))), colour = "red", size = 1) +
    coord_flip()
#dev.off()

# Extracting the top 10 terms for molecular function
copd_lung_go_mf <- copd_lung_gos.df[grep('GO:MF', copd_lung_gos.df[, "source"]), ]
copd_lung_go_mf_top10 <- as.data.frame(copd_lung_go_mf[1:10,])

#pdf("figures/go/copd_lung_mf_gos_fdr_top10.pdf", width = 13, height = 4)
ggplot(copd_lung_go_mf_top10, aes(x=factor(term_name, levels = rev(levels(factor(term_name)))),
                       y=-log10(as.numeric(p_value)), fill="#009444")) +
    geom_bar(stat="identity") +
    theme_classic() +
    theme(plot.title = element_blank(),
          axis.title.x = element_text(size=16, colour = "black"),
          axis.text=element_text(size=16, colour = "black"),
          axis.title.y = element_blank(),
          legend.position = "none") +
    scale_fill_manual(values=c("#009444")) +
    labs(y = "-log10(p)") +
    geom_hline(aes(yintercept=-log10(as.numeric(0.05))), colour = "red", size = 1) +
    coord_flip()
#dev.off()

# Extracting the top 10 terms for cellular component
copd_lung_go_cc <- copd_lung_gos.df[grep('GO:CC', copd_lung_gos.df[, "source"]), ]
copd_lung_go_cc_top10 <- as.data.frame(copd_lung_go_cc[1:10,])

#pdf("figures/go/copd_lung_cc_gos_fdr_top10.pdf", width = 7, height = 4)
ggplot(copd_lung_go_cc_top10, aes(x=factor(term_name, levels = rev(levels(factor(term_name)))),
                       y=-log10(as.numeric(p_value)), fill="#009444")) +
    geom_bar(stat="identity") +
    theme_classic() +
    theme(plot.title = element_blank(),
          axis.title.x = element_text(size=16, colour = "black"),
          axis.text=element_text(size=16, colour = "black"),
          axis.title.y = element_blank(),
          legend.position = "none") +
    scale_fill_manual(values=c("#009444")) +
    labs(y = "-log10(p)") +
    geom_hline(aes(yintercept=-log10(as.numeric(0.05))), colour = "red", size = 1) +
    coord_flip()
#dev.off()
```
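
The `[1:10, ]` indexing above relies on the terms already being sorted by p-value, and with fewer than ten terms it would pad the result with `NA` rows. A more defensive selection, sketched here on a made-up table with a hypothetical helper `top_terms`, sorts explicitly and uses `head()`:

```r
# Made-up enrichment table with the columns used in the plots above.
toy_go <- data.frame(
    source = c("GO:BP", "GO:BP", "GO:MF", "GO:BP"),
    term_name = c("term a", "term b", "term c", "term d"),
    p_value = c(0.03, 0.001, 0.02, 0.01),
    stringsAsFactors = FALSE
)

# Return at most n terms from one GO source, most significant first.
top_terms <- function(go_df, go_source, n = 10) {
    sub_df <- go_df[go_df$source == go_source, ]
    sub_df <- sub_df[order(as.numeric(sub_df$p_value)), ]
    head(sub_df, n)
}

toy_bp_top <- top_terms(toy_go, "GO:BP", n = 2)
toy_bp_top
```

Matching the source with `==` rather than `grep` also avoids picking up any other source whose name happens to contain the same string.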

## 5. Identification of co-occurring conditions using multimorbid3D.

### 5.1. Identification of potential co-occurring conditions for COPD.

```{r copd_multimorbidity, fig.width=15, fig.height=12}
copd_lung_string_multm <- read.table(gzfile("results/multimorbid3d/copd/significant_enrichment_bootstrap.txt.gz"), header = TRUE, sep = "\t", quote="", encoding="UTF-8")

#pdf("figures/multimorbidity/copd_lung_string_multimorbidiry.pdf", width = 9, height = 11)
s_gg <- ggplot(copd_lung_string_multm, aes(x = level, y = trait)) +
    geom_point(aes(size = trait_eqtls, fill = adj_pval), alpha = 0.75, shape = 21) +
    labs( x = "", y = "", size = "Number of eQTLs", fill = "Adj p value") +
    theme(legend.key=element_blank(), 
          axis.text.x = element_text(colour = "black", size = 8, angle = 90, 
                                     vjust = 0.3, hjust = 1), 
          axis.text.y = element_text(colour = "black", size = 8), 
          legend.text = element_text(size = 8, colour ="black"), 
          legend.title = element_text(size = 9, face = "bold"), 
          panel.background = element_blank(),
          panel.border = element_rect(colour = "black", fill = NA, size = 1),
          legend.position = "right")
s_gg
#dev.off()
```

### 5.2. Identification of potential co-occurring conditions for CAD and UD.

```{r cad_and_ud_multimorbidity, fig.width=15, fig.height=12}
cad_lung_string_multm <- read.table(gzfile("results/multimorbid3d/cad/significant_enrichment_bootstrap.txt.gz"), header = TRUE, sep = "\t", quote="", encoding="UTF-8")
ud_lung_string_multm <- read.table(gzfile("results/multimorbid3d/ud/significant_enrichment_bootstrap.txt.gz"), header = TRUE, sep = "\t", quote="", encoding="UTF-8")

#pdf("figures/multimorbidity/cad_lung_string_multimorbidiry.pdf", width = 13, height = 11)
cad_gg <- ggplot(cad_lung_string_multm, aes(x = level, y = trait)) +
  geom_point(aes(size = trait_eqtls, fill = adj_pval), alpha = 0.75, shape = 21) +
  labs( x = "", y = "", size = "Number of eQTLs", fill = "Adj p value") +
  theme(legend.key=element_blank(), 
        axis.text.x = element_text(colour = "black", size = 8, angle = 90, 
                                   vjust = 0.3, hjust = 1),
        axis.text.y = element_text(colour = "black", size = 8), 
        legend.text = element_text(size = 8, colour ="black"), 
        legend.title = element_text(size = 9, face = "bold"), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill = NA, size = 1),
        legend.position = "right")
cad_gg
#dev.off()

#pdf("figures/multimorbidity/ud_lung_string_multimorbidiry.pdf", width = 13, height = 11)
ud_gg <- ggplot(ud_lung_string_multm, aes(x = level, y = trait)) +
  geom_point(aes(size = trait_eqtls, fill = adj_pval), alpha = 0.75, shape = 21) +
  labs( x = "", y = "", size = "Number of eQTLs", fill = "Adj p value") +
  theme(legend.key=element_blank(), 
        axis.text.x = element_text(colour = "black", size = 8, angle = 90, 
                                   vjust = 0.3, hjust = 1),
        axis.text.y = element_text(colour = "black", size = 8), 
        legend.text = element_text(size = 8, colour ="black"), 
        legend.title = element_text(size = 9, face = "bold"), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "black", fill = NA, size = 1),
        legend.position = "right")
ud_gg
#dev.off()
```

## 6. Examples of PPI clusters interacting with proteins on levels 0 and 1.

```{r multimorbidity_examples, fig.width=10, fig.height=7}
lsppin.df <- read.table(gzfile("results/string/LSPPIN.txt.gz"), header = TRUE, sep = "\t")
### cluster 1 - sulfur relay system and folate biosynthesis
cl1_l1_1 <- lsppin.df[grep(c('NUPR1'), lsppin.df[, c("p1")]), ] # 4
cl1_l1_2 <- lsppin.df[grep(c('MSL1'), lsppin.df[, c("p1")]), ] # 24
cl1_l1_3 <- lsppin.df[grep(c('SGF29'), lsppin.df[, c("p1")]), ] # 39
cl1_l1_4 <- lsppin.df[grep(c('MOCS2'), lsppin.df[, c("p1")]), ] # 13
cl1_l2_1 <- lsppin.df[grep(c('KAT8'), lsppin.df[, c("p1")]), ] # 41
cl1_l2_2 <- lsppin.df[grep(c('RUVBL1'), lsppin.df[, c("p1")]), ] # 124
cl1_l2_3 <- lsppin.df[grep(c('SUOX'), lsppin.df[, c("p1")]), ] # 40

bin1 <- lsppin.df[grep(c('BIN1'), lsppin.df[, c("p1")]), ] # 

# heatmaps for levels 0 and 1
colors = colorRamp2(c(-1.5,0,1.5), c("blue", "#E8E8E8","red"), space = "RGB")

p1_l0 <- read.table("results/heatmaps/path1_level0.txt", sep = "\t", header=TRUE, row.names = 1)
row_od = c("MSL1", "MOCS2", "NUPR1", "SGF29")
p1_l0.mat <- as.matrix(p1_l0)
ha = rowAnnotation(foo = anno_empty(border = TRUE))
ht = HeatmapAnnotation(top = anno_empty(border = FALSE))
#pdf("figures/heatmaps/p1_l0.pdf", width = 14, height = 10)
Heatmap(p1_l0.mat, name = "log2(aFC)", #column_title = "eQTL SNPs",
        col = colors, na_col = "white", #row_title = "shared eGenes",
        cluster_rows = FALSE, cluster_columns = FALSE, row_names_side = "left",
        column_names_side = "top", row_names_gp = gpar(fontface = "italic"),
        row_order = row_od,
        width = unit(10, "cm"), height = unit(10, "cm"),
        right_annotation = ha, top_annotation = ht,
        border = TRUE, rect_gp = gpar(col = "grey"))
decorate_annotation("foo", slice = 1, {
  grid.rect(x = 0, width = unit(10, "mm"), gp = gpar(fill = NA, col = NA), just = "left")
  grid.text(paste("Level 0", collapse = "\n"), x = unit(5, "mm"), just = "centre", rot = 90,
            gp = gpar(fontsize = 13, fontface = "bold"))})
decorate_annotation("top", slice = 1, {
  grid.rect(x = 0, width = unit(0.7, "mm"), gp = gpar(fill = NA, col = NA), just = "centre")})
#dev.off()

p1_l1 <- read.table("results/heatmaps/path1_level1.txt", sep = "\t", header=TRUE, row.names = 1)
row_od = c("KAT8", "RUVBL1", "SUOX")
p1_l1.mat <- as.matrix(p1_l1)
ha = rowAnnotation(foo = anno_empty(border = TRUE))
ht = HeatmapAnnotation(top = anno_empty(border = FALSE))
#pdf("figures/heatmaps/p1_l1.pdf", width = 14, height = 10)
Heatmap(p1_l1.mat, name = "log2(aFC)", #column_title = "eQTL SNPs",
        col = colors, na_col = "white", #row_title = "shared eGenes",
        cluster_rows = FALSE, cluster_columns = FALSE, row_names_side = "left",
        column_names_side = "top", row_names_gp = gpar(fontface = "italic"),
        row_order = row_od,
        width = unit(18, "cm"), height = unit(7, "cm"),
        right_annotation = ha, top_annotation = ht,
        border = TRUE, rect_gp = gpar(col = "grey"))
decorate_annotation("foo", slice = 1, {
  grid.rect(x = 0, width = unit(10, "mm"), gp = gpar(fill = NA, col = NA), just = "left")
  grid.text(paste("Level 1", collapse = "\n"), x = unit(5, "mm"), just = "centre", rot = 90,
            gp = gpar(fontsize = 13, fontface = "bold"))})
decorate_annotation("top", slice = 1, {
  grid.rect(x = 0, width = unit(0.7, "mm"), gp = gpar(fill = NA, col = NA), just = "centre")})
#dev.off()

### cluster 2 - axon guidance and regulation of actin cytoskeleton
cl2_l1_1 <- lsppin.df[grep(c('TESK2'), lsppin.df[, c("p1")]), ] # 1
cl2_l1_2 <- lsppin.df[grep(c('SSH2'), lsppin.df[, c("p1")]), ] # 8

cl2_l2_1 <- lsppin.df[grep(c('PPP3CA'), lsppin.df[, c("p1")]), ] # 70
cl2_l2_2 <- lsppin.df[grep(c('TESK2'), lsppin.df[, c("p1")]), ] # 1
cl2_l2_3 <- lsppin.df[grep(c('LIMK1'), lsppin.df[, c("p1")]), ] # 27
cl2_l2_4 <- lsppin.df[grep(c('CFL1'), lsppin.df[, c("p1")]), ] # 39
cl2_l2_5 <- lsppin.df[grep(c('SSH1'), lsppin.df[, c("p1")]), ] # 16
cl2_l2_6 <- lsppin.df[grep(c('SSH3'), lsppin.df[, c("p1")]), ] # 8
cl2_l2_7 <- lsppin.df[grep(c('PPP3CC'), lsppin.df[, c("p1")]), ] # 47
cl2_l2_8 <- lsppin.df[grep(c('CFL2'), lsppin.df[, c("p1")]), ] # 11

# heatmaps for levels 0 and 1
colors = colorRamp2(c(-1.5,0,1.5), c("blue", "#E8E8E8","red"), space = "RGB")

p2_l0 <- read.table("results/heatmaps/path2_level0.txt", sep = "\t", header=TRUE, row.names = 1)
row_od = c("SSH2", "TESK2")
p2_l0.mat <- as.matrix(p2_l0)
ha = rowAnnotation(foo = anno_empty(border = TRUE))
ht = HeatmapAnnotation(top = anno_empty(border = FALSE))
#pdf("figures/p2_l0.pdf", width = 14, height = 10)
Heatmap(p2_l0.mat, name = "log2(aFC)", #column_title = "eQTL SNPs",
        col = colors, na_col = "white", #row_title = "shared eGenes",
        cluster_rows = FALSE, cluster_columns = FALSE, row_names_side = "left",
        column_names_side = "top", row_names_gp = gpar(fontface = "italic"),
        row_order = row_od,
        width = unit(26, "cm"), height = unit(5, "cm"),
        right_annotation = ha, top_annotation = ht,
        border = TRUE, rect_gp = gpar(col = "grey"))
decorate_annotation("foo", slice = 1, {
  grid.rect(x = 0, width = unit(12, "mm"), gp = gpar(fill = NA, col = NA), just = "left")
  grid.text(paste("Level 0", collapse = "\n"), x = unit(5, "mm"), just = "centre", rot = 90,
            gp = gpar(fontsize = 13, fontface = "bold"))})
decorate_annotation("top", slice = 1, {
  grid.rect(x = 0, width = unit(0.7, "mm"), gp = gpar(fill = NA, col = NA), just = "centre")})
#dev.off()

# hhip <- lsppin.df[grep(c('HHIP'), lsppin.df[, c("p1")]), ]
# fam13a <- lsppin.df[grep(c('FAM13A'), lsppin.df[, c("p1")]), ]
# chrna3 <- lsppin.df[grep(c('CHRNA3'), lsppin.df[, c("p1")]), ]
# chrna5 <- lsppin.df[grep(c('CHRNA5'), lsppin.df[, c("p1")]), ]
# ireb2 <- lsppin.df[grep(c('IREB2'), lsppin.df[, c("p1")]), ]
# hykk <- lsppin.df[grep(c('HYKK'), lsppin.df[, c("p1")]), ]

# Testing
#lsppin.df[grep(c('CTSS'), lsppin.df[, c("p1")]), ] # score = 928
#lsppin.df[grep(c('HLA-DQA2'), lsppin.df[, c("p1")]), ] # score = 928
```
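
`grep()` matches substrings, so a pattern such as `'SSH1'` would also match any longer gene name that contains it. Where exact matches are wanted, a comparison with `==` or `%in%` is a possible alternative. The sketch below uses made-up gene labels and a hypothetical helper `get_partners`; it does not reproduce the counts noted in the comments above.

```r
# Toy LSPPIN with one gene name ("ABC10") that contains another ("ABC1").
toy_lsppin <- data.frame(
    p1 = c("ABC1", "ABC1", "ABC10", "XYZ2"),
    p2 = c("XYZ2", "DEF3", "DEF3", "ABC1"),
    combined_score = c(900, 750, 800, 900),
    stringsAsFactors = FALSE
)

# Exact-match lookup of all interactions in which `gene` is the first partner.
get_partners <- function(lsppin, gene) {
    lsppin[lsppin$p1 %in% gene, ]
}

# Substring matching also picks up ABC10; exact matching does not.
n_grep <- nrow(toy_lsppin[grep("ABC1", toy_lsppin$p1), ])
n_exact <- nrow(get_partners(toy_lsppin, "ABC1"))
c(grep = n_grep, exact = n_exact)
```

Because `%in%` accepts a vector, the same helper can pull the partners of a whole cluster in one call, for example `get_partners(toy_lsppin, c("ABC1", "XYZ2"))`.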