Update overlap.R #88
base: dev
0.6.5 CRAN release
repOverlap chao_jaccard_abundance_index
Thank you a lot, Alex! For clarification, the algorithm is described here: https://pubmed.ncbi.nlm.nih.gov/23094376/ Let's first figure out:
- Do you have doubts that the R implementation of the algorithm is correct? Any risks or open questions? We can figure them out here.
- I'm afraid the name is too long, even though it is very clear. How about just "jaccard_chao"?
- It would be great to have a doc and vignette update later as well, but for now let's focus on the implementation part.
Hope this makes sense... And please correct me if I am wrong on anything.
Wonderful, thank you a lot for the depth and details, I appreciate it! Let me dig into this a little and I will get back to you ASAP with thoughts and ideas for the next steps. Regarding the intersection part: because all such functions are used via repOverlap, they get specific columns as their inputs. Would you mind telling me about your usage of MiXCR data frames instead of immunarch data frames? We would be very glad to see if we can improve the package to make sure your work with immunarch's data format is comfortable and seamless! Ideas for the future:
Ok, that sounds great. I'll look into this and see what I can do. Using the immunarch data frames should be fine on my end. The MiXCR data frames were just the quickest thing I had to go off at the moment. I think the way immunarch is set up to take in MiXCR output is straightforward enough, so no problems there. There are some broader issues I have had to wrangle with in MiXCR, such as how it does a poor job of correcting errors in IonTorrent data, but we have a program to post-process and correct the aspects of MiXCR we don't like.
I see, very interesting! Is this program open source? It would be great to mention this program in our tutorial(s) on MiXCR so people can use the best practices.
Yeah, I'm happy to make it open source. Certain aspects of it would likely be broadly helpful to others. It doesn't exist online at the moment, because I just finished it over the weekend and I don't think it's ready for other people to use. Also, I spend the majority of my work time at the bench, so coding is not really what I do all day.
The whole process has been simplified to pressing GO: it takes 30-60 seconds to convert a fastq file of 1,000,000 reads to a CSV file, and it sets up a loading queue so you can load a bunch of files at once.
Great work! Better AND user-friendly MiXCR, this is like a dream for many people. Did you have a chance to somehow benchmark it against regular MiXCR runs? See, e.g.:
I don't really have a lot of analytics to compare the results against. But I'm meeting with the Swiss group from the first link you have there tomorrow morning, so they will probably have some suggestions. So yeah, I don't know the false negative and false positive rates. In my particular case, all the samples I am analyzing with this program are from mice which are hemizygous for TCRa. So ANY functional T cell we see in the periphery should only have in-frame alpha chains, which means the CDR3 correction program doesn't have to worry about false positives as much. There are still lots of sequences which appear to be out of frame or have premature stop codons; we cannot recover them, so we drop them. The primary analysis programs I am comparing the results to are regular MiXCR and a program our lab wrote some time back, which appears to be more accurate than IMGT/HighV-QUEST and is similar to IgBlast. The benefit is that it is fast and accurate. The major downside is that it only runs in an MS-DOS window, and emulating it in DOSBox on a modern machine makes it run like a 🥔. Hence why I've focused on getting something to run using MiXCR and the PyIR wrapper.
So I re-wrote the function and it looks like it works if I test it (pairwise) on the default test data. But when I generate the Jaccard and chao.jaccard plots they look identical. Moreover the numbers in either plot don't correspond with the returned values I get when I run the function commands on the data manually.
Something doesn't add up... The plotted numbers are not what I would expect, and there is no difference between the Jaccard and chao.jaccard output plots.
Also, I think I screwed up a merge. Really sorry about that. I think it's probably safer if I just keep my own private repo, test locally, and upload the corrected files to Dropbox or something, because I don't want to screw up the branches and I don't really understand Git.
So I haven't made much progress with this and I could use some help. But the Jaccard function should also be corrected to be something like the following. It works if you just run the code below on the test data. I think the way Jaccard is written in immunarch, it compares the entire row of one clone set to another. Like I mentioned, I think this is a problem because you actually only want to compare a string made of V.name, D.name, J.name and CDR3.aa. The following compares the overlap of A2-i129 and A2-i131
These two modified functions return what I want, but it's unclear to me how to pass them properly into the functions you have written to generate the repOverlap heatmap-style plots.
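For readers following the thread, the key-based comparison described above can be sketched roughly as follows. This is a minimal illustration in Python rather than R, and it is not the code from this PR. The column names `V.name`, `D.name`, `J.name` and `CDR3.aa` come from the comment above; the helper names `clonotype_keys` and `jaccard`, and the toy rows, are hypothetical.

```python
# Assumption: each repertoire is a list of rows (dicts); real immunarch data are R data frames.
KEY_COLUMNS = ("V.name", "D.name", "J.name", "CDR3.aa")


def clonotype_keys(rows, columns=KEY_COLUMNS):
    """Return the set of clonotype keys, one tuple of the chosen columns per row."""
    return {tuple(row[c] for c in columns) for row in rows}


def jaccard(rows_a, rows_b, columns=KEY_COLUMNS):
    """Jaccard index |A & B| / |A | B| computed on the chosen columns only."""
    a = clonotype_keys(rows_a, columns)
    b = clonotype_keys(rows_b, columns)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data (made up): one clonotype is shared, but with different clone counts.
sample_a = [
    {"V.name": "TRAV1", "D.name": "NA", "J.name": "TRAJ10", "CDR3.aa": "CAVRDF", "Clones": 12},
    {"V.name": "TRAV2", "D.name": "NA", "J.name": "TRAJ11", "CDR3.aa": "CAASGY", "Clones": 3},
]
sample_b = [
    {"V.name": "TRAV1", "D.name": "NA", "J.name": "TRAJ10", "CDR3.aa": "CAVRDF", "Clones": 40},
    {"V.name": "TRAV3", "D.name": "NA", "J.name": "TRAJ12", "CDR3.aa": "CALSEN", "Clones": 7},
]

key_based = jaccard(sample_a, sample_b)
whole_row = jaccard(sample_a, sample_b, columns=KEY_COLUMNS + ("Clones",))
print(key_based, whole_row)
```

In this toy case the key-based index counts the shared clonotype, while including the `Clones` column makes the two rows look different, which is the whole-row problem described above.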
Thank you so much for the detailed experiments, Alex! Let me see what the issue with the heatmap is, and how we can add this to immunarch.
Oh I just realized something which looks incorrect in the Chao.Jaccard I wrote.
I think it would be easy for me to write this. If you can get the Jaccard function I have written to work inside repOverlap, I think it would be easy for me to build on that to get the Chao.Jaccard index test function working.
Ok I think the Chao.Jaccard is also fixed. But yeah I don't know how to make the function call this.
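As a point of reference, the basic abundance-based Jaccard index of Chao et al. (2005) is J_abd = UV / (U + V - UV), where U and V are the total relative abundances, in each sample, of the clonotypes shared by both samples. The sketch below is in Python rather than R, is not the function from this PR, and assumes the simple form without the correction for unseen shared clonotypes. The function name and the toy counts are hypothetical.

```python
def chao_jaccard_abundance(counts_a, counts_b):
    """Abundance-based Jaccard, uncorrected form: UV / (U + V - UV).

    counts_a, counts_b: dicts mapping a clonotype key to its clone count.
    Assumption: no adjustment for unseen shared clonotypes is applied.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    shared = counts_a.keys() & counts_b.keys()
    if not shared or total_a == 0 or total_b == 0:
        return 0.0
    # U and V: fraction of each sample's clones that belong to shared clonotypes.
    u = sum(counts_a[k] for k in shared) / total_a
    v = sum(counts_b[k] for k in shared) / total_b
    return (u * v) / (u + v - u * v)


# Toy counts (made up): clonotype "A" is shared and dominates sample_a.
sample_a = {"A": 8, "B": 2}
sample_b = {"A": 5, "C": 5}
result = chao_jaccard_abundance(sample_a, sample_b)
print(result)
```

Here U = 0.8 and V = 0.5, so the abundance-based value is higher than the incidence-based Jaccard of 1/3 for the same two samples. If the two plots come out identical, checking a pair like this by hand is one way to see which index is actually being computed.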
The other overlap functions might deserve a second look too. Maybe I can look at those as well. Other ideas for improving the overlap function: we have also made these other graphs (e.g., Figure 4a) from overlap data, and I am going to have to make them again. They are a little confusing to look at, but basically they just use the data from the overlap heatmap-style graph and represent it differently. This is useful if you have sample replicates which have also been sequenced. For instance, in Figure 4a we did this: compare b1 vs b2, b2 vs b3, and b1 vs b3, then take the standard error of the mean (SEM) of these values. This is shown in blue. Then, for the b comparator vs the f comparatee, it is the SEM of all 9 pairwise comparisons of the b and f replicates:
This allows us to evaluate the statistical significance of the overlap functions using ANOVA. It's quite useful if you have replicates. I am going to have to make a function in R to do this with my current data, or I'll just do it semi-manually and then make the graphs in Prism... But it would make a great addition to immunarch! It seems like if you call the metadata correctly, it could be done automatically with a function.
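The replicate summary described above could be sketched along these lines. This is an illustration in Python rather than R, under the assumption that pairwise overlap values are already available. The sample names b1..b3 and f1..f3 follow the comment, while the overlap values and the helper names are made up.

```python
from itertools import combinations, product
from math import sqrt
from statistics import mean, stdev


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def within_group(overlap, group):
    """Overlap values for every unordered pair of replicates inside one group."""
    return [overlap[frozenset(pair)] for pair in combinations(group, 2)]


def between_groups(overlap, group_x, group_y):
    """Overlap values for every replicate of group_x against every replicate of group_y."""
    return [overlap[frozenset(pair)] for pair in product(group_x, group_y)]


b_group = ["b1", "b2", "b3"]
f_group = ["f1", "f2", "f3"]

# Hypothetical overlap values keyed by unordered sample pair.
overlap = {
    frozenset(("b1", "b2")): 0.5,
    frozenset(("b2", "b3")): 0.6,
    frozenset(("b1", "b3")): 0.7,
}
for i, pair in enumerate(product(b_group, f_group)):
    overlap[frozenset(pair)] = 0.10 + 0.01 * i

b_vs_b = within_group(overlap, b_group)            # 3 comparisons
b_vs_f = between_groups(overlap, b_group, f_group)  # 9 comparisons
print(mean(b_vs_b), sem(b_vs_b))
print(mean(b_vs_f), sem(b_vs_f))
```

In a package setting, the group membership would presumably come from the metadata table instead of hard-coded lists.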
I'm not sure if this question is too OT here, but: does immunarch use the classic Morisita Overlap or the one with Horn's correction? In my experience/readings, Morisita-Horn tends to be preferred (about equally to Bray-Curtis). My understanding is that Morisita-Horn is more sample-size-independent than the classic Morisita Overlap Index.
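To make the distinction in this question concrete, the two indices differ only in how the per-sample dominance term is computed. The sketch below is in Python rather than R and says nothing about which variant immunarch implements. The function names and toy counts are hypothetical.

```python
def _cross_and_totals(counts_a, counts_b):
    """Sum of x_i * y_i over shared keys, plus the total counts X and Y."""
    shared = counts_a.keys() & counts_b.keys()
    cross = sum(counts_a[k] * counts_b[k] for k in shared)
    return cross, sum(counts_a.values()), sum(counts_b.values())


def morisita(counts_a, counts_b):
    """Classic Morisita overlap: dominance uses x_i * (x_i - 1) / (X * (X - 1))."""
    cross, x, y = _cross_and_totals(counts_a, counts_b)
    d_a = sum(n * (n - 1) for n in counts_a.values()) / (x * (x - 1))
    d_b = sum(n * (n - 1) for n in counts_b.values()) / (y * (y - 1))
    return 2 * cross / ((d_a + d_b) * x * y)


def morisita_horn(counts_a, counts_b):
    """Morisita-Horn overlap: dominance uses x_i ** 2 / X ** 2."""
    cross, x, y = _cross_and_totals(counts_a, counts_b)
    d_a = sum(n * n for n in counts_a.values()) / x ** 2
    d_b = sum(n * n for n in counts_b.values()) / y ** 2
    return 2 * cross / ((d_a + d_b) * x * y)


# Toy counts (made up).
sample_a = {"A": 8, "B": 2}
sample_b = {"A": 5, "C": 5}
mh = morisita_horn(sample_a, sample_b)
classic = morisita(sample_a, sample_b)
print(classic, mh)
```

Because the Horn form uses squared proportions, it also works on relative abundances, whereas the classic form needs integer counts with totals greater than one.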