MetaProViz::Preprocessing - CoRE=TRUE: Media blank samples - Enable removal of single metabolite values? #6
Another thought is that we could again use the Hotelling's T2 test here |
ok, thanks for checking! |
After today's meeting we decided to do:
|
To add for 3.: With 2 or 3 we can do variance testing, to understand if there are features (metabolites) that appear to be really variable. If so we can report those. Otherwise, I completely agree we cannot do any outlier testing (as in whether a media sample is an outlier). |
There is an issue.
It can happen (and it does happen in the toy data) that we keep a metabolite that has NA values for all the blank samples because of 1. Then in 2 we get Inf and things don't work. What do we do here? I think we should use the half minimum of the whole dataset to fill these NAs in the blank samples. |
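A minimal sketch of that fix (plain Python for illustration, not the MetaProViz R code; the helper name `fill_blank_nas` is made up): NAs in the blank samples are filled with half the minimum observed value across the whole dataset, so a metabolite that is NA in every blank still gets a finite value and the downstream blank handling never produces Inf.

```python
def fill_blank_nas(data, blank_cols):
    """data: metabolite -> list of values (None = NA); blank_cols: blank column indices."""
    # half minimum over all non-missing values in the entire dataset
    all_vals = [v for vals in data.values() for v in vals if v is not None]
    half_min = min(all_vals) / 2.0
    filled = {}
    for met, vals in data.items():
        vals = list(vals)
        for i in blank_cols:
            if vals[i] is None:
                vals[i] = half_min          # fill blank-sample NAs only
        filled[met] = vals
    return filled

# toy example: columns 2 and 3 are the blank samples
data = {"glucose": [5.0, 6.0, None, None],
        "lactate": [2.0, 3.0, 4.0, None]}
filled = fill_blank_nas(data, blank_cols=[2, 3])
# dataset minimum is 2.0, so every blank NA becomes 1.0
```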
Just to be sure, blank samples as in the media control samples for the CoRe: I agree that's something we should change. The problem is that the user can of course choose not to do any feature filtering or MVI for many reasons. I think here we should just treat the blank samples in the best way possible, irrespective of what the user chooses to do for the rest of the matrix. Meaning:
Would this make sense to you? |
Yes, blank samples are the media control in CoRe. I haven't changed the name yet. The "blank" samples will be "Media_Control" in the Experimental_design, which will also be renamed to InputSettingsFile as stated in issue 47. If I understood correctly:
If this is correct then yes I totally agree. |
Yes, this is exactly what I had in mind, thanks for writing the bullet points, that's much clearer than my text :) |
This is done except the "Outlier testing". ANOVA or pairwise tests of the samples don't work here because they look for a difference in the mean and the data are normalized. |
Yes, that's correct. |
We do the same checks as in pool samples, meaning we check for highly variable metabolites among samples. But here an update I did there is still missing, which I plan to also add: in pool samples I also added the quartile coefficient of variation https://www.nature.com/articles/s41598-023-31711-8. I am reading some papers for this and I will make an update. Since with the above tests we measure which metabolites have high variance, I am thinking to somehow identify which samples are responsible for this and then make a test (I guess a categorical test) to conclude whether these differences are enough to say that the samples are different. |
Ok, that sounds sensible. Let me know what you decide once you have finished your reading :) |
This will be a somewhat long post discussing 2 things: which methods to use to measure the dispersion of samples, and then how to find the "outlier samples" responsible for the high variation observed.
A) About the use of the standard error as a dispersion measure. Based on this paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3487226/) I would suggest not to use the standard error in this case, as it is not a direct measure of dispersion. The SEM quantifies the uncertainty in the estimate of the mean, i.e. how close we are to the real mean (the SD of the mean) given the sample mean we have. B) I was reading some papers (this one in particular: https://www.frontiersin.org/articles/10.3389/fams.2019.00043/full#B31) when I realized I was making a mistake. I was searching for a robust measure of dispersion across different distributions, so as to choose one measure when the data are normally distributed and another if not. CV works best for normally distributed data, while other measures using quartiles, for example, are better for other distributions. The issue here is that yes, there are measures better than CV at expressing the true dispersion of the data. But we are not trying to find the best approximation of the true/real dispersion. What we want to find are the possible outliers within a specific set of samples which are supposed to be the same (of very low variation between them). Also, the more robust measures are less sensitive to outliers; we use CV exactly because it is sensitive to outliers. Thus, I propose to use only the coefficient of variation as a measure of dispersion.
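The point about sensitivity can be demonstrated with a small sketch (illustrative Python, toy numbers): a single spiked blank value inflates the CV far more than it inflates the quartile coefficient of dispersion (QCD), which is exactly the behaviour we want when hunting outliers rather than estimating true dispersion.

```python
import statistics

def cv(values):
    """Coefficient of variation: sd / mean."""
    return statistics.stdev(values) / statistics.mean(values)

def qcd(values):
    """Quartile coefficient of dispersion: (Q3 - Q1) / (Q3 + Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return (q3 - q1) / (q3 + q1)

clean = [10.0, 10.1, 9.9, 10.2, 10.0, 9.8, 10.1, 10.0]
spiked = clean[:-1] + [30.0]   # one outlier blank value

# the CV ratio between spiked and clean data is much larger than the
# QCD ratio: the robust measure partly "absorbs" the outlier
```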
(Table: outlier counts per sample — Sample A, Sample B, Sample C, Sample D, Sum.)
(Instead of the z-score exclusion we could do leave-one-out cross-validation, but I think that goes too far.) |
Thank you very much for the detailed summary, you really did a great job dissecting this issue. A) This makes sense, especially with regard to the issue we discussed before, that often there are no more than 3 blank samples (it is even more likely that people only process 2). The option you describe, leaving out a sample and remeasuring, would make sense if we always had a higher number of blank samples, but given that this will most likely not be the case, I would not proceed down this route. Yet, this is a very valuable stream of discussion, which can be helpful for us to explain our choices. B) If I remember correctly, the CV is the ratio between the standard deviation and the mean. So it is still connected to the mean, but this is fine as we really want to find outliers (the mean is more sensitive to outliers than the median). Here I have some questions:
One point: as you would do this for each metabolite, if I understand correctly, what happens if for different metabolites different blanks are suggested to be outliers? |
Yes, CV is sd/mean. You can multiply that by 100 to get the relative standard deviation (RSD), which is the same as a percentage. |
Do you have 5 min to meet on Zoom, to briefly discuss the points? |
Amazing, well done! |
Here is a 'short' summary on how we process the CoRe media samples. I am not sure what to put in the vignette from all of this because it seems a lot, thus I am putting it here to have it all in one place. Preprocessing of CoRe Media_samples: At least 1 Media_sample has to be present in the data; otherwise we return an error.
|
Some discussion:
Media_samples: We do a CoRe experiment in order to find out which metabolites are absorbed by or excreted from the cells, and how much. When we run LC-MS and measure our samples, we actually measure the initial abundance of a metabolite in the environment plus the change in its abundance that occurred because of the cells. To find out what our cells have actually done we must remove this initial abundance of the metabolites. To do this we must also measure the Media_Samples, which are samples without any cells, and we subtract these values from the samples.
CoRe_norm_factor: As a CoRe experiment is underway the number of cells tends to change. This has an effect on the abundance of the absorbed/excreted metabolites. To account for that we use the CoRe_norm_factor, a numerical value accounting for the growth rate, obtained from a growth curve or as a growth factor using the ratio of cell count/protein quantification at the start point to cell count/protein quantification at the end point.
Feature filtering: Media_samples are ignored in the feature filtering step. Cells synthesise and excrete metabolites that are different from the ones they need to survive (metabolites included in the growth media). So, we expect that many metabolites would not be present in the Media_Samples (their values would be NA), which would result in removing them. This would lead to losing valuable information.
MVI: We always impute missing values in the Media_Samples, as we need values to subtract from the actual samples. For a metabolite that has at least 1 value we impute using HM per feature, as we expect that the missing values are due to low abundance, but the metabolite is there since it was picked up by one Media_sample. If all values are NA we assume the metabolite is not present, so we replace the NAs with 0.
CV: The accurate measurement of the initial metabolite abundance is crucial as it affects the whole experiment.
Media samples, since they consist of only the growth media, are expected to be identical. This translates into the variation of a metabolite being very small. To check whether this is indeed the case, for each metabolite we measure the coefficient of variation. A metabolite with almost identical values will have a very small CV, while a metabolite with very different abundances will have a high CV. Also, because of the nature of LC-MS it can happen that there are outlier values, and CV is more sensitive to outliers than other, more robust variation measures. This enables us to identify whether a metabolite is unstable and which sample is responsible for it, and to remove it prior to the Media_sample mean calculation. |
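The MVI and CV rules described above can be sketched as follows (illustrative Python, hypothetical helper names, not the actual MetaProViz API): half-minimum imputation per feature when at least one blank value exists, 0 when all blanks are NA, then flagging metabolites with CV above the threshold of 1.

```python
import statistics

def preprocess_blanks(blank_values):
    """blank_values: metabolite -> list of blank measurements (None = NA)."""
    out = {}
    for met, vals in blank_values.items():
        present = [v for v in vals if v is not None]
        if not present:
            out[met] = [0.0] * len(vals)          # metabolite absent from the medium
        else:
            hm = min(present) / 2.0               # half minimum per feature
            out[met] = [hm if v is None else v for v in vals]
    return out

def flag_high_cv(filled, threshold=1.0):
    """Return metabolite -> True if CV (sd/mean) exceeds the threshold."""
    flags = {}
    for met, vals in filled.items():
        mean = statistics.mean(vals)
        cv = statistics.stdev(vals) / mean if mean else 0.0
        flags[met] = cv > threshold
    return flags

blanks = {"glutamine": [8.0, 7.5, None],          # one NA -> half-minimum imputation
          "serine": [None, None, None],           # all NA -> zeros
          "unstable": [0.1, 0.1, 20.0]}           # high-variance metabolite
filled = preprocess_blanks(blanks)
flags = flag_high_cv(filled)
```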
There is 1 issue which I would like to discuss because I am not sure. Maybe you can discuss it with the people there because it's important for the correctness of the outlier procedure. It's about Fisher's test. We use Fisher's exact test as we have 2x2 categorical data, generally have a low number of samples, and expect very sparse data, so Fisher's exact test seems appropriate. However, Fisher's test does not work if we have structural zeros. Check structural zeros here: https://www.quality-control-plan.com/StatGuide/conting_anal_ass_viol.htm#Structural%20zeroes. In this guide they give as an example of structural zeros males and ovarian cancer, which is something that cannot happen. Fisher's test cannot handle these cases. In our case we expect that all the metabolites will have low CV, thus having a zero in the high_var row, but it's not "mandatory" for it to be zero, as it can happen that they are not. My question is: are our zeros structural zeros or not? If they are not, all is fine. If they are, then we cannot use Fisher's exact test and we have to find another way to find the "outlier samples". I thought of another possible way, which is to do something like GSEA: we start at zero, check for each sample each metabolite one at a time, give +1 for each high_var or -1 for each low_var, and then use the Kolmogorov–Smirnov test. But first we have to find out if we can use Fisher's test or not. |
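For reference, a two-sided Fisher's exact test on a 2x2 table can be sketched from the hypergeometric distribution with the standard library alone (`scipy.stats.fisher_exact` computes the same quantity). The table values below are made up for illustration; note the zero in the table is a sampling zero — a metabolite with high CV in the other samples *can* occur, we just did not observe one — so the test remains applicable, unlike with true structural zeros.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided p-value for the table [[a, b], [c, d]]."""
    n, r1, c1 = a + b + c + d, a + b, a + c
    total = comb(n, r1)
    def p_table(x):                       # hypergeometric probability of a = x
        return comb(c1, x) * comb(n - c1, r1 - x) / total
    p_obs = p_table(a)
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    # sum probabilities of all tables at most as likely as the observed one
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# hypothetical: sample A blamed for 8/10 high-CV metabolites, others for 0
p = fisher_exact_2x2(8, 2, 0, 30)
```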
Thanks for the great summary, I really like it. I can move some of the points into the vignette this week as I will try to finish this up. About the last comment on Fisher's exact test: I will discuss this today and add my response afterwards. One question here: you refer to the outlier detection of the media blank control?! |
Great, thanks! Yeah, about Fisher's exact test, I indeed mean the media control outlier detection! |
After today's meeting we should proceed with the following:
|
I had a discussion on this with Aurelien and there is some feedback here:
|
Ah nice! When you say blank you mean the media_control samples, right? Because then I would also add the media samples in a PCA along with all the other samples in one QC PCA. I say this because I have tested this with the pool samples and it only makes sense using all the samples. I think the same goes for the media samples, even though I would expect (not totally sure here) that they would seem like an outlier group compared to the samples. Which means that they might drive the separation in one axis, meaning the interesting variation might be masked because we added the media_samples to the PCA, which makes me think we might indeed have to add another PCA plot. Let me check this and how things look and I'll come back to it.
|
Yes exactly, the media control samples :) And yes, the PCA plot with the added media_control samples (=blank samples) - we can name it QC blank samples and colour-code the blank samples in red and all the other samples in black (or something like this). Also yes, the media samples usually seem like an outlier group, whilst the pool samples would be expected to be in the middle of all samples (as they are the pool of all samples). About your last point: yes, I agree Hotelling's is not a solution, and the PCA in itself at least gives us a visual representation in case something went really wrong, so it's a good control in any case. For the automation we still need a solution though. One more thing: shall we report the CV in percent? (Or is this already converted? Sorry, haven't checked.) |
No, the CV is a dataframe with metabolites as rows and 2 columns: the numerical CVs and High_var = TRUE/FALSE depending on the threshold (which is 1). We print how many metabolites were found with high CV, but a percentage here would be better. |
Ok, so maybe we can multiply the numerical CVs by 100 and export CV [%] instead. |
Ok, but then we will export 1 value. Wouldn't it make sense to also export the table, at least of the highly variable metabolites, or their names, so that the user can do something with it? If it's just the percentage of high-CV metabolites, meaning this one value, I think printing it in a message would be more appropriate than saving it in a file. On the other hand, saving it keeps it there forever, but there is still not much you can do with it. |
I think we are talking about different things. I mean that if we have a CV of 0.1 this is 10% of variation, but I think I mixed this up with SD. So never mind. |
So here we had to do 5 things.
To my eyes it works as I would expect it to work.
So about the relationship of MVI/NAs and CV: I think with more NAs the CV increases, but at some point it stabilizes. This is because: from 0 towards more NAs the CV increases, as the imputed values have a high sd. However, at some point the overall mean moves closer to the imputed values and they get a lower sd. I guess there could be a tipping point around a 50-50 split of missing and recorded values, which would cause the mean to sit in the middle of both, making the sd high while also decreasing the mean, and thus making the CV high. But I am not sure. I think I could try to actually model what would happen, but it might take a while.
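The hypothesised NA-vs-CV relationship can be probed with a quick toy simulation (pure Python, purely illustrative; the numbers are made up, not from the toy dataset): values are drawn around a common mean, an increasing share is replaced by the half-minimum value, and the CV is recomputed. Over the 0–60% range discussed above, the CV rises with the NA fraction.

```python
import random
import statistics

def cv_after_imputation(n_missing, n_total=20, seed=0):
    """CV of a sample after replacing n_missing values by the dataset half-minimum."""
    rng = random.Random(seed)
    values = [rng.gauss(100.0, 5.0) for _ in range(n_total)]
    hm = min(values) / 2.0                    # half-minimum imputed value
    imputed = [hm] * n_missing + values[n_missing:]
    return statistics.stdev(imputed) / statistics.mean(imputed)

# CV at 0%, 20%, 40% and 60% missing values
curve = [cv_after_imputation(k) for k in range(0, 13, 4)]
```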
This, as I see it, is more of a theoretical argument and I went a bit more general. Our goal here is to be as precise as possible and as close to the 'real' mean, which we don't know. So we measure a number of samples and go for reproducibility/consistency of the values. I would say that we remove a sample completely when it is found to be an outlier by Fisher's test (due to a high number of high CVs caused by that sample), because we have a reason to do it. For example, one could argue that even though this sample is found to have only 5 or 10 different values for metabolites compared to the rest, there might still be deviations in other metabolites of smaller magnitude that we are not able to detect. For example, the metabolite abundances of this sample might always be higher than in the other samples while the total CV stays <1, so all the means might be a little higher. My point here is that if Fisher's test found this sample as not related to the rest, we have enough evidence to remove it as not being consistent with the rest of the samples. Now, about removing a specific value from the mean calculation because of high CV: again I would vote for removing it. (Here, I am not 100% sure of the first step in my reasoning.) Because of the nature of metabolomics we can have random effects that might change the value of a metabolite, thus also resulting in a different abundance when compared to the group, and thus a high CV. If Fisher's test did not find the sample to be an outlier, then the sample is considered fine and this difference in abundance is considered a random effect. So, we remove this value from the mean calculation but keep the sample.
|
Thanks for doing the tests (point 1). This tells us some important things:
Towards point 3:
Towards point 4: Yes, that's perfect :) Towards points 2 and 5: So, action points: Let me know what you think about this. Also, is there anything from the testing you have done that you think we should add into the manuscript methods section, like a benchmarking? |
Ok, based on your comment I would say we do: MVI, PCA, CV, contingency tables, Fisher's, and then removing the samples identified as outliers by Fisher's, and just that. The reasoning is that we remove the samples that were found as outliers by Fisher's test on the tables of CVs. With this we remove the samples found to have a disproportionate percentage of "being responsible for a metabolite's high CV" when compared to the rest of the Media_control samples, and we have evidence for it from Fisher's test. We don't do anything else, as it could fall in the category of "overfitting". Also, about some benchmarking, I could maybe plot the MVI fractions against the resulting CVs. As for point 3b: we could add a parameter called Remove_Media_test = "fishers" or "CV". "fishers" would be the default, with the above reasoning, and "CV" would remove all sample values causing high CV, with the reasoning of consistency against random effects, but indeed this needs a lot more investigation to see whether it is a substantial argument. But for now I would say leave this for later. |
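One plausible construction of the per-sample contingency-table step (an assumption for illustration — the thread does not fix the exact table layout, and the MetaProViz implementation may differ): each metabolite with CV > 1 is "blamed" on the blank sample deviating most from the metabolite mean, and each sample then gets a 2x2 table (high/low CV by this sample/rest) that would be passed to Fisher's exact test.

```python
import statistics

def contingency_tables(blank_matrix, cv_threshold=1.0):
    """blank_matrix: metabolite -> list of blank values (no NAs).
    Returns one 2x2 table [[high_this, high_rest], [low_this, low_rest]] per sample."""
    n_samples = len(next(iter(blank_matrix.values())))
    responsible = [0] * n_samples             # high-CV counts attributed per sample
    n_high = 0
    for vals in blank_matrix.values():
        mean = statistics.mean(vals)
        if mean and statistics.stdev(vals) / mean > cv_threshold:
            n_high += 1
            # blame the sample deviating most from the metabolite mean
            worst = max(range(n_samples), key=lambda i: abs(vals[i] - mean))
            responsible[worst] += 1
    n_low = len(blank_matrix) - n_high
    return [[[responsible[i], n_high - responsible[i]],
             [n_low, (n_samples - 1) * n_low]] for i in range(n_samples)]

# toy data: blank sample 3 (index 2) causes high CV in m1, m2 and m5
blanks = {"m1": [10.0, 10.0, 100.0], "m2": [5.0, 5.0, 60.0],
          "m3": [1.0, 1.0, 1.0], "m4": [2.0, 2.1, 1.9],
          "m5": [8.0, 8.0, 90.0]}
tables = contingency_tables(blanks)
```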
Sounds good to me. So I would suggest you do the points above: "MVI, PCA, CV, contingency tables, Fisher's, and then removing the samples identified as outliers by Fisher's, and just that." For the benchmarking, we could start a benchmarking .rmd file in the vignette folder. You can move your tests regarding the pre-processing in there and we can extend this with other benchmarking-like things we did or do for other functions. For the last point (3b): I agree, let's leave this for now, and once the above is done we can rename the issue and just add this as the remaining point for the future. |
So the "MVI, PCA, CV, contingency tables, Fisher's" part is done. I am adding the test of NA and CV in a Benchmark.Rmd and I will put it in the vignette. |
As an upgrade: we should decide whether to remove single metabolite values that cause a high CV when the sample responsible is not identified as an outlier by Fisher's test. |
Thanks! |
To do's:
|
Something that I noticed by looking at this plot: this is the benchmark plot for the effect of the percentage of NAs on the CV. On the x-axis we see the different percentages of added NAs, and on the y-axis the resulting CV. Here we see that the max NA percentage is 50-60%, and for 60% the max CV is ~90%. Assuming that our data are good, I would say that CV = 1 as a cut-off seems good, because we see that the effect of NAs on CV isn't enough to make a metabolite have a high CV. So for a metabolite to have CV > 1, something else besides having NAs has to have happened. |
We measure the mean and sd of the blank samples and return a message with the percentage of highly variable metabolites in the blank samples.
I am not sure what would be a decent cut-off for SD to take action, meaning giving an error message asking the user to check certain blanks versus a warning message. You are right that there should not really be any variance.
You said: "for now I used coefficient of variation CV>1 to give a warning of high variance based on this. I will come back to this." |