Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update overlap.R #88

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from
Open

Update overlap.R #88

wants to merge 4 commits into from

Conversation

abrown435
Copy link

repOverlap chao_jaccard_abundance_index

EugeneRumynskiy and others added 3 commits May 13, 2020 13:03
repOverlap chao_jaccard_abundance_index
@vadimnazarov
Copy link
Contributor

Thank you a lot, Alex!

For the clarification: the algorithm is here https://pubmed.ncbi.nlm.nih.gov/23094376/

Let's first figure out:

  • the algorithm correctness

  • the naming

Do you have doubts that the R implementation of the algorithm is correct? Any risks or open questions? We can figure them out here.

I'm afraid the name is too large, despite that it being very clear. How about just "jaccard_chao"?

It would be great to have a doc and the vignette update later as well, but for now let's focus on the implementation part.

@abrown435
Copy link
Author

  • jaccard_chao works for me.

  • the algorithm is not correct as it stands. Something like the following should work, but I'm not sure how its implementation would change in this function and the specific column names are more place holders because Im not directly looking at the immunarch dataframe:

library(tidyr)
.x_reads <- .x %>% uncount(counts)
.y_reads <- .y %>% uncount(counts)
intersection_reads <- nrow(dplyr::intersect(.x_reads, .y_reads))
proportion_of_x_in_y_counting_all_seqs <- intersection_reads / nrow(.y_reads)
proportion_of_y_in_x_counting_all_seqs <- intersection_reads / nrow(.x_reads)

  • Also can someone clarify the use of:
    intersection <- nrow(dplyr::intersect(.x, .y))
    This is used a lot. It looks like the call is saying that to find an intersect you require exact match of rows in the two dataframes .x and .y? Note the above function operates the same way, how I have it written. But if my understanding of this code is right this is not the appropriate use case for the Jaccard index, or really any of the overlap comparisons. Because you would require that the CDR3 DNA sequence to be the same for an intersection to be present. But for clonal overlap you don't care about this as you only evaluate these kinds of comparisons on an AA level and disregard the CDR3_DNA. Meaning that clones which arise independently to have the same AA sequence in the TCR but differ in the DNA sequence are effectively collapsed. The way we have always done this in a merge column of the V, J, C, aaSeqCDR3 and then collapse the identical ones and have the larger ones absorb the the smaller count numbers then you would use this column as what you compare the intersection of the two dataframes with. In the above case you would then have to uncount(counts) and then count the nrows of the merge intersects: nrow(dplyr::intersect(.x_reads$Merge, .y_reads$merge))

  • so the merge column is made like so:
    dataframe_merged <- within(dataframe, Merge <- paste(allVHitsWithScore, allDHitsWithScore, allJHitsWithScore, allCHitsWithScore, aaSeqCDR3, sep = "_"))

Hope this makes sense... And please correct me if I am wrong on anything.

@vadimnazarov
Copy link
Contributor

vadimnazarov commented Jul 21, 2020

Wonderful, thank you a lot for the depth and details, I appreciate it! Let me dig into this a little and I will get back to you ASAP with thoughts and ideas for the next steps.

Regarding the intersection part: because all such functions used via repOverlap, they get specific columns as their inputs. So nrow(dplyr::intersect(.x, .y)) works on any columns you want.

Would you mind telling me about your usage of MiXCR data frames instead of immunarch data frames? We would be very glad to see if we can improve the package to make sure you have a comfortable and seamless work with immunarch's data format!

Ideas for the future:

  • update the documentation for repOverlap to add a string ID for the new function + a link to the jaccard-chao publication
  • update the overlap vignette
  • test on data frames via repOverlap(immdata$data, "jaccard_chao")
  • test on data tables via repOverlap(immdata$data, "jaccard_chao")

@vadimnazarov vadimnazarov added status:In progress ⏳ Work in progress tag:Methods Analysis and visualisation methods type:Enhancement New feature or request labels Jul 21, 2020
@abrown435
Copy link
Author

Ok that sounds great. I'll look into this and see what I can do.

Using the immune arch dataframes should be fine on my end. The MiXCR data frames were just the quickest thing I had to go off at the moment. I think how immunarch is set up to take in MiXCR is straigtforward enough so no problems there. There are some broader issues I have had to wrangle with MiXCR and how it does a poor job with correcting errors in IonTorrent, but we have a program to post process and correct the aspects of MiXCR we don't like.

@vadimnazarov
Copy link
Contributor

I see, very interesting! Is this program open source? It would be great to mention this program in our tutorial(s) on MiXCR so people can use the best practices.

@abrown435
Copy link
Author

Yeah I'm happy to make it open source. Certain aspects of it would likely be broadly helpful to others. It doesn't exist online at the moment, because I just finished it over the weekend and I don't think its ready for other people to use. Also I spend the majority of my work time at the bench so coding is not really what i do all day.
The jist of it is it is an R markdown file which does the following:

  1. deletes all intermediate file which were created from previous run of the program and leaves the import folder and export folder alone.
  2. takes in fastq files which are in a folder called import
  3. sources a .sh script in R to run a specific mixcer program on TCRalpha chains for every file in the import folder and tells you the run progress.
  4. moves the output .txt file of clonotypes into a new temp folder. and reads in each file in the folder into R and modifys two columns so that it can go into the next step:
  5. R then sources a pyhton script to correct for CDR3 errors (things that look out of frame but are actually not) asign a new CDR3 AA and CDR3 DNA sequence if there was a detected premature stop codon, or frameshift. I should add that we know this correction process to be accurate for IonTorrent, but it would have to be modified for other platforms. run this python CDR3 correction for all the files you loaded in and give a run status for each file. The program also automatically parses each clone so that the V.Region, V.Palindromic.NTs, n.Region, J.Palindromic.NTs, J.Region are all displayed in separate columns. I think you need this for AIRR format but MiXCR doesn't display this by default. We use this information for other things so its also included.
  6. The program effectively 'rescues' about 10-15% of our reads which were otherwise going to be thrown out because we don't keep out of frame clones. Then the program goes back into R and these new rescued clones are determined if they are identical to clones already present in the data. Larger clones absorb the clone count and clone fraction of the smaller clones if they are identical.
  7. then all the files are outputed into a .csv format and can be loaded into immunarch or whatever you like.

The whole process has been simplified to pressing GO and take 30-60 secs to convert a fastq files of 1,000,000 reads to a CSV file and sets up a loading que for the files so you can load a bunch of files at once.

@vadimnazarov
Copy link
Contributor

Great work! Better AND user-friendly MiXCR, this is like a dream for many people. Did you have a chance to somehow benchmark it against regular MiXCR runs? See, e.g.,:

@abrown435
Copy link
Author

I don't really have a lot of analytics to compare the results to. But I'm meeting with the Swiss group from the first link you have there tomorrow morning so they will probably have some suggestions. So yeah I don't know the false negative and false positive rate. In my particular case all the samples I am analyzing with this program are mice which are hemizygous for TCRa. So ANY functional T cell we see in the periphery should only have alpha chains in frame so the CDR3 correction program doesn't have to worry about false positives as much. There are still lots of sequences which appear to be out of frame or have premature stop codons and we cannot recover them so we drop them. The primary analysis program I am comparing the results to is regular MiXCR and a program our lab wrote some time back which appears to be more accurate than IMGT/HighV-QUEST and is similar to IgBlast. the benefit is that it is fast and accurate. Major downside is it only runs out of an MS DOS window and emulating it out of DOS box on a modern machine makes it run like a 🥔. Hence why I've focused on getting something to run using MiXCR and the PyIR wrapper.

@abrown435
Copy link
Author

abrown435 commented Jul 22, 2020

So I re-wrote the function and it looks like it works if I test it (pairwise) on the default test data. But when I generate the Jaccard and chao.jaccard plots they look identical. Moreover the numbers in either plot don't correspond with the returned values I get when I run the function commands on the data manually.

## load test data:
library(immunarch)
data(immdata)
df <- immdata$data
imm_ov1 <- repOverlap(immdata$data, .method = "jaccard", .verbose = T)
p1 <- vis(imm_ov1)
imm_ov2 <- repOverlap(immdata$data, .method = "chao.jaccard", .verbose = T)
p2 <- vis(imm_ov2)
p1 + p2
## View test data
immdata$data$`A2-i129`
immdata$data$`A2-i131`
## Jaccard index test
.x <- immdata$data$`A2-i129`
.y <- immdata$data$`A2-i131`
intersection <- nrow(dplyr::intersect(.x, .y))
A2.i129_A2.i131_jaccard <- intersection / (nrow(.x) + nrow(.y) - intersection)
A2.i129_A2.i131_jaccard # **Graphed number is 0.0048. Returned number for A2.i129_A2.i131_jaccard is: 0.000305787**
## chao. Jaccard index test
.x_merged_test <- within(immdata$data$`A2-i129`, Merge <- paste(V.name, D.name, J.name, CDR3.aa, sep = "_"))
.y_merged_test <- within(immdata$data$`A2-i131`, Merge <- paste(V.name, D.name, J.name, CDR3.aa, sep = "_"))
.x_reads <- .x_merged_test %>% tidyr::uncount(Clones)
.y_reads <- .y_merged_test %>% tidyr::uncount(Clones)
.x_reads_M <- as_tibble(.x_reads$Merge)
.y_reads_M <- as_tibble(.y_reads$Merge)
intersection_reads <- nrow(dplyr::intersect(.x_reads_M, .y_reads_M))
proportion_of_x_in_y_counting_all_seqs <- intersection_reads / nrow(.y_reads_M)
proportion_of_y_in_x_counting_all_seqs <- intersection_reads / nrow(.x_reads_M)
A2.i129_A2.i131_JaccChao <-  (proportion_of_x_in_y_counting_all_seqs * proportion_of_y_in_x_counting_all_seqs) / (proportion_of_x_in_y_counting_all_seqs + proportion_of_y_in_x_counting_all_seqs - (proportion_of_x_in_y_counting_all_seqs * proportion_of_y_in_x_counting_all_seqs))
A2.i129_A2.i131_JaccChao # **Graphed number is 0.001354774. Returned number for A2.i129_A2.i131_JaccChao is: 0.0048.**

Something doesn't add up... plotted numbers are not what I would expect. No difference between jaccard and the chao.jaccard in the output plots.

@abrown435
Copy link
Author

Also I think I screwed up a merge. Really sorry about that. i think its probably safer if I just have my own private repo test it locally and upload files which I are corrected on dropbox or something because I don't want to screw up the branches because I don't really understand Git.

@abrown435
Copy link
Author

abrown435 commented Jul 28, 2020

@vadimnazarov

So I haven't made much progress with this and I could use some help.

But the Jaccard function should also be corrected to be something like the following. It works if you just run the below code on the the test data. I think the way the Jaccard you have written in ImmunArch, it compares the entire row of one clone set to another. Like I mentioned, I think this is a problem because you actually only want to compare a string of the V.name, D.name, J.name and CDR3.aa.

The following compares the overlap of A2-i129 and A2-i131
When calling the repOverlap function using Jaccard the heat map returns: 0.0048 for this comparison.
However the below functions return a different value which I believe are correct, as they reflect how we used to compute this in excel:

# Jaccard
x_merged_test <- within(immdata$data$`A2-i129`, Merge <- paste(V.name, D.name, J.name, CDR3.aa, sep = "_"))
y_merged_test <- within(immdata$data$`A2-i131`, Merge <- paste(V.name, D.name, J.name, CDR3.aa, sep = "_"))
x_reads_M <- as_tibble(x_merged_test$Merge) #store the AA merge clone column
y_reads_M <- as_tibble(y_merged_test$Merge)
intersection_J <- nrow(dplyr::intersect(x_reads_M, y_reads_M))
A2.i129_A2.i131_Jaccard_index <- intersection_J / (nrow(x_reads_M) + nrow(y_reads_M) - intersection_J)
A2.i129_A2.i131_Jaccard_index

#This Jaccard returns: 0.001760833
# chao. Jaccard index test
.x_merged_test <- within(immdata$data$`A2-i129`, Merge <- paste(V.name, D.name, J.name, CDR3.aa, sep = "_"))
.y_merged_test <- within(immdata$data$`A2-i131`, Merge <- paste(V.name, D.name, J.name, CDR3.aa, sep = "_"))
.x_reads <- .x_merged_test %>% tidyr::uncount(Clones)
.y_reads <- .y_merged_test %>% tidyr::uncount(Clones)
.x_reads_M <- as_tibble(.x_reads$Merge)
.y_reads_M <- as_tibble(.y_reads$Merge)
intersection_reads <- nrow(dplyr::intersect(.x_reads_M, .y_reads_M))
proportion_of_x_in_y_counting_all_seqs <- intersection_reads / nrow(.y_reads_M)
proportion_of_y_in_x_counting_all_seqs <- intersection_reads / nrow(.x_reads_M)
A2.i129_A2.i131_JaccChao <-  (proportion_of_x_in_y_counting_all_seqs * proportion_of_y_in_x_counting_all_seqs) / (proportion_of_x_in_y_counting_all_seqs + proportion_of_y_in_x_counting_all_seqs - (proportion_of_x_in_y_counting_all_seqs * proportion_of_y_in_x_counting_all_seqs))
A2.i129_A2.i131_JaccChao

#This Jaccard Chao returns: 0.001354774

These two modified functions return what I want but its unclear to me how to pass them properly into the functions you have written to generate the repOverlap heat-map looking plots.

@vadimnazarov
Copy link
Contributor

Thank you so much for the detailed experiments, Alex! Let me see what is the issue with the heatmap, and how we can add this to immunarch.

@abrown435
Copy link
Author

Oh I just realized something which looks incorrect in the Chao.Jaccard I wrote.
I just realized that the intersection_J and intersection_reads in Jaccard and Chao.Jaccard respectively, was returning the same values. The Jaccard one still seems to be correct. but the Chao.Jaccard intersection value here should be much larger and the dplyr::intersect is just accounting for unique intersections.
I think you actually want to source the "Clones" number from what is the numerator and append this to the number of reads each time there is an "intersection". Doing this will give you a unique X intersection and unique Y intersection. which you feed into the next two lines to create a proportion.
The appropriate way to do this would be (for Chao.Jaccard)

  • make the merge column before the uncount function.
  • find merge column (identical AA) which are the same in the df and collapse them. Larger clones absorb the proportions and clone count of the smaller clones
  • Skip the uncount function
  • then find #seqs in X present in Y data, #seq in Y present in Y data
  • this would be a for loop (or maybe you can use the pipe operator, i forget...) which says for every time there is a intersection found (say X in Y) take the clone count value for that X sequence and add it to a counter.
  • then you do the same thing for Y
  • this way the numerator of proportion of X in Y counting all sequences is different from the proportion of Y in X

I think it would be easy for me to write this. If you can get the Jaccard function I have written to work as a function I think it would be easy for me to go off of that to get the Chao.Jaccard index test function working.

@abrown435
Copy link
Author

Ok I think the Chao.Jaccard is also fixed. But yeah I don't know how to make the function call this.

#######################
# Chao Jaccard Index  #
#######################

# Generate a column which merges the values of V, D, J, and CDR3 AA for the comparator and comparatee (X and Y)
x_merged_test_cj <- within(immdata$data$`A2-i129`, Merge <- paste(V.name, D.name, J.name, CDR3.aa, sep = "_"))
y_merged_test_cj <- within(immdata$data$`A2-i131`, Merge <- paste(V.name, D.name, J.name, CDR3.aa, sep = "_"))

# total number of reads in X and Y
reads_in_x_cj <- nrow(x_merged_test_cj %>% tidyr::uncount(Clones))
reads_in_y_cj <- nrow(y_merged_test_cj %>% tidyr::uncount(Clones))

# ======= Find the #X reads which can be also be found in the total #Y reads, & vice versa ======= 
# This looks for identical Merge rows from X in Y and then sources the common X and references the Clone number.
# Clone number is number of times this sequence appears and if a clone has identical AA as another but differs in
# DNA sequence it still finds these. Then it takes this list of the clone counts of X in Y and adds this up.
x_in_y_reads <- sum(x_merged_test_cj[which(x_merged_test_cj$Merge %in% y_merged_test_cj$Merge),]$Clones)
y_in_x_reads <- sum(y_merged_test_cj[which(y_merged_test_cj$Merge %in% x_merged_test_cj$Merge),]$Clones)

# Represents the fraction of X sequences in Y over all the unique DNA sequences in Y
proportion_of_x_in_y_counting_all_seqs <- x_in_y_reads/reads_in_y_cj
proportion_of_y_in_x_counting_all_seqs <- y_in_x_reads/reads_in_x_cj

#  Calculate the Chao Jaccard Index: calculates a Jaccard based overlap accounting for sequence abundance in the datasets.
A2.i129_A2.i131_JaccChao <- (proportion_of_x_in_y_counting_all_seqs * proportion_of_y_in_x_counting_all_seqs) / (proportion_of_x_in_y_counting_all_seqs + proportion_of_y_in_x_counting_all_seqs - (proportion_of_x_in_y_counting_all_seqs * proportion_of_y_in_x_counting_all_seqs))
A2.i129_A2.i131_JaccChao #returns:  0.002029687

The other Overlap functions might deserve a second look at too. Maybe I can look at those too.

Other ideas for improving the overlap function:

We have also made these other graphs (EX: figure 4a) from overlap data. And I am going to have to make them again. They are a little confusing to look at but basically they just using the data from the overlap heat map looking graph and representing it differently. It is useful if you have sample replicates which have also been sequenced. For instance in Figure 4a we did this: compare b1 group vs b2, b2 vs b3, b1 vs b3. Then take the standard error of the mean (SEM) of this. this is the blue. then for the b comparator vs f comparatee it is the SEM of all 9 pair wise comparisons of b and f replicates:

  • b1 vs f1, b1 vs f2, b1 vs f3
  • b2 vs f1, b2 vs f2, b2 vs f3
  • b3 vs f1, b3 vs f2, b3 vs f3

This allows us to give evaluate the statistical significance of the overlap functions using ANOVA. Its quite useful if you have replicates. I am going to have to make a function in R to do this with my current data, or I'll just do it semi-manually and then make the graphs in prism... But it would make a great addition to ImmunArch! Seems like if you call the meta data correctly it could be done with a function automatically.

@fabio-t
Copy link

fabio-t commented Jul 23, 2021

I'm not sure if this question is too OT here, but: does immunarch use the classic Morisita Overlap or the one with Horn's correction?

In my experience/readings, Morisita-horn tends to be preferred (about equally to Bray-Curtis). My understanding is that Morisita-Horn is more sample-size-independent than the classic Morisita Overlap Index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:In progress ⏳ Work in progress tag:Methods Analysis and visualisation methods type:Enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suggestions: Include a Chao-Jaccard abundance index function for repOverlap
5 participants