R/summarize_epitopes_by_antigen.Rmd

---
title: "Summarize epitopes by antigen"
author: "Jason Greenbaum"
date: "11/7/2022"
params:
  pubmed_id_file: pubmed_ids.txt
output: 
  html_document:
    code_folding: show
    df_print: paged
    toc: true
    toc_depth: 4
    theme: united
    highlight: tango
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

# Summarize epitopes by antigen

Given a list of pubmed IDs, use the IQ-API to pull in all antigen and epitope information
and summarize as follows:

* antigen_id, name, etc.
* number of reference that report positive epitopes derived from that antigen
* total num_peptides_positive derived from each antigen (each peptide is counted only once even if it appears in multiple references)

## Setup

First, some housekeeping...

```{r}
library(httr)
library(jsonlite)
library(tidyr)
library(readr)
library(dplyr)
library(DT)

create_dt <- function(x){
  DT::datatable(x,
                extensions = 'Buttons',
                options = list(dom = 'Blfrtip',
                               buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
                               lengthMenu = list(c(10,25,50,-1),
                                                 c(10,25,50,"All"))))
}

```
We define a function to query the API.  Since I know there will be more than 10K records, we will need to fetch 1 page at a time
and continue to append them onto our tibble.  **NOTE:** When paging through results by using an 'offset', it is critical to add an 'order' parameter to ensure that the pages are consistent between queries.

To be good citizens, we also add a short 'sleep' in between calls to the API.
```{r}

iq_query <- function(endpoint, 
                     query_params, 
                     base_uri='https://query-api.iedb.org/',
                     page_size = 10000) {

  # initialize 'get_text' so we can page through the results
  get_text <- 'NA'
  final_tbl <- tibble()
  url <- paste(base_uri, endpoint,sep='')
  
  # set the offset to 0
  offset <- 0
  
  # we must be careful not to use scientific notation with the offset parameter
  query_params[['offset']] <- format(offset, scientific = F)

  while(get_text != '[]') {
    print(paste0("fetching offset: ", query_params[['offset']]))
    #TODO: wrap this in a try block
    get_1 = GET(url, query=query_params)
    get_text = content(get_1,'text')
    resp_tbl <- tibble(fromJSON(get_text))
    final_tbl <- rbind(final_tbl, resp_tbl)
    offset <- offset + page_size
    query_params[['offset']] = format(offset, scientific = F)
    # sleep for 1 second between calls so as not to overload the server
    Sys.sleep(1)
  }

  # return the final_tbl
  final_tbl

}


```


Read in the pubmed IDs from a file:
```{r}
pubmed_ids <- read_lines(params$pubmed_id_file)
```


### Map the pubmed IDs to reference IDs

Let's set the query string parameters

```{r}
pubmed_id_string<-paste(pubmed_ids, collapse=',')
query_params <- list()
query_params[['pubmed_id']] = paste0('in.(', pubmed_id_string,')')
query_params[['select']] = 'pubmed_id,reference_id'
query_params[['order']] = 'pubmed_id'
```


And pull the data..
```{r}
ref2pmid <- iq_query('reference_search',query_params)
```

Create a reference ID string to incorporate into downstream queries:
```{r}
ref_id_string <- paste(ref2pmid$reference_id, collapse=',')
```

## Fetch the positive T & B cell epitopes

First, set the query parameters:

```{r}
query_params <- list()
query_params[['reference_id']] = paste0('in.(', ref_id_string,')')
query_params[['qualitative_measure']] = 'neq.Negative'
query_params[['select']] = 'reference_id,parent_source_antigen_iri,structure_id,structure_description,curated_source_antigen'
query_params[['order']] = 'reference_id'
```

Now run the query against tcell_search
```{r}
tcell_epitopes <- iq_query('tcell_search',query_params)
```

And then B cell search:
```{r}
bcell_epitopes <- iq_query('bcell_search',query_params)
```

**NOTE** There are certain cases when the source antigen is unknown/null.  We'll quantify
those here, but will ignore those epitopes for the remainder of this analysis.

### Initial summary

Here we summarize the number of T & B cell epitopes retrieved and the number where the source antigen (parent protein) is not null.  We move forward by combining these data.

```{r}
bcell_epitopes %>% nrow()
bcell_epitopes %>%
  filter(!is.na(parent_source_antigen_iri)) %>%
  nrow()
```

```{r}
tcell_epitopes %>% nrow()
tcell_epitopes %>%
  filter(!is.na(parent_source_antigen_iri)) %>%
  nrow()
```

In both cases, more so in B cell, there are many records that are not mapped to a parent protein.  We'll investigate down below.

Let's combine all epitopes into one tibble.
```{r}
all_epitopes <- bind_rows(bcell_epitopes,tcell_epitopes)
```

### References per antigen

Let's get the number of references for each parent antigen first.

```{r}
refs_per_parent_antigen <- all_epitopes %>%
  distinct(reference_id, parent_source_antigen_iri) %>%
  group_by(parent_source_antigen_iri) %>%
  summarize(num_references=n())

refs_per_parent_antigen %>%
  create_dt()
```


### Peptides per parent antigen

Now the peptides (including discontinuous sequences) per parent antigen

```{r}
peps_per_parent_antigen <- all_epitopes %>%
  distinct(parent_source_antigen_iri, structure_id, structure_description) %>%
  group_by(parent_source_antigen_iri) %>%
  summarize(num_peptides=n())

peps_per_parent_antigen %>%
  create_dt()
```


### Source antigen (parent protein) data

Before we put everything together, we need to pull all of the source antigen information.  Since the parent_source_antigen_iri is not yet in the parent_proteins table, we first pull ALL of the parent proteins
and create the parent_source_antigen_uri later.

```{r}
query_params <- list()
query_params[['select']] = 'iri,database,accession,name,title,proteome_label'
query_params[['order']] = 'accession'
```

Now run the query against parent_proteins to pull everything out
```{r}
parent_proteins <- iq_query('parent_proteins',query_params)
```


### Final summary by parent protein

Here, we combine the data into the table of interest

```{r}
final_summary_by_parent <-
  refs_per_parent_antigen %>%
  left_join(peps_per_parent_antigen,
            by='parent_source_antigen_iri') %>%
  left_join(parent_proteins,
            by=c('parent_source_antigen_iri'='iri'))

final_summary_by_parent %>%
  select(antigen_id=parent_source_antigen_iri,
         antigen_title=title,
         num_references,
         num_peptides) %>%
  arrange(-num_references, -num_peptides) %>%
  create_dt()

```

As we can see by the row with an empty antigen_id & antigen_title, there are a good number of proteins that are not mapped to a parent protein.  So let's have a closer look at those, specifically:

```{r}
# first pull out the curated antigen info into separate columns
all_epitopes <- all_epitopes %>%
  mutate(curated_antigen_accession=curated_source_antigen$accession,
         curated_antigen_name=curated_source_antigen$name)

all_epitopes %>%
  filter(is.na(parent_source_antigen_iri)) %>%
  select(reference_id, structure_id, structure_description,
         curated_antigen_accession, curated_antigen_name) %>%
  distinct() %>%
  create_dt()

```

Interesting!  None of them have curated source antigens, so we should investigate why and ignore them if there is a good explanation.