-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
conditionally and persisitently rare taxa #98
base: devel
Are you sure you want to change the base?
Conversation
Signed-off-by: Daena Rys <[email protected]>
R/addMetrics.R
Outdated
# Calculate metrics for each split subject/group | ||
results_list <- lapply(split_data, function(sub_data) { | ||
# Initialize a list to hold results for each metric | ||
metric_results <- list() | ||
|
||
# Loop through each metric and calculate it | ||
for (metric in metrics) { | ||
# Check if the metric is supported and calculate accordingly | ||
if (metric == "CRT") { | ||
crt_results <- .calculateCRT(sub_data, thresholds[["CRT"]]) | ||
metric_results[["CRT"]] <- crt_results | ||
} else { | ||
stop(paste("Metric", metric, "is not supported yet."), | ||
call. = FALSE) | ||
} | ||
} | ||
|
||
# Add the calculated metrics to the metadata (colData or rowData) | ||
for (metric in names(metric_results)) { | ||
colData(sub_data)[[metric]] <- metric_results[[metric]] | ||
} | ||
|
||
return(sub_data) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more efficient to:
- Get matrix
- Get the list which has vectors as elements. Each vector represent members of each group
- Loop through this list
- Select the group members from the matrix
- Calculate values
- Combine all the results and sort values based on sample names
- Add to colData (in add* method)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not convinced the right location should be colData as the interest is taxa related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I followed the code and did not think about it. As it these features are global (for whole dataset), colData is not correct place indeed
I think the output could be similar to getPrevalent and getRare, i.e., vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. yes. If we have a single subject/group with time series, then this is a vector over features (or rowData) - one measure of persistence or rarity for each feature and rowData would be a suitable location.
If we have multiple subjects/groups with time series, then this would be features x groups matrix.
At simplest it could be implemented for just a single-group case, and rest left for the user.
If we support multi-group (multi-subject) analyses, then that is a more general design question and linked to #30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one vector could be calculated for whole dataset for now. As you mentioned, the multi-group thing is more general question, and is affecting also the mia functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - note that this can only be calculated in the case where the whole data set has some continuous variable that extends all samples (like is the case when there is time dimension for a single subject). If we have multiple groups then this is not defined for the whole data set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have missed something, but I cannot see direct linkage of time series to this concept of conditionally/persistent rare/abundant taxa. It seems that this is more general concept?
Based on the following article, I summarized the concept
Sizhong Yang, Matthias Winkel, Dirk Wagner, Susanne Liebner, Community structure of rare methanogenic archaea: insight from a single functional group, FEMS Microbiology Ecology, Volume 93, Issue 11, November 2017, fix126, https://doi.org/10.1093/femsec/fix126
Rare taxa:
- Average relative abundance below 1% accross all samples
Abundant taxa:
- Average relative abundance above 1% accross all samples
Conditionally rare taxa:
microbial taxa that are typically in low abundance in one locality but occasionally become prevalent in at least one other sample
- CRT was assigned if an OTU exhibits a maximum relative abundance at least 100 times higher than its minimum value (max:min > 100)
Permanently rare taxa:
- If the (max/min) ratio is less than 5, that is the abundance rarely fluctuates among samples, it will be grouped into the permanently rare taxa (PER)
Rare and abundant taxa can be already calculated by mia functions:
getRare(tse, assay.type = "relabundance", detection = 1/100)
getPrevalent(tse, assay.type = "relabundance", detection = 1/100)
I see that these conditionally rare and permanently rare taxa are very close to getRare()
and getPrevalent()
. As they do not necessarily deal with time series, should these methods be in mia instead: getConditionallyRare()
& getPermanentlyRare()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I mixed it up, no time series needed.
CRT seems like a bit strange concept in the end. I am not sure if there should be much difference whether the minimum count is 1, 2, or 3 reads but this can make a huge difference in the max:min calculation. But yes, it might be relevant to recognize taxa that have vast variation. Variance would not capture all these cases, if the blooms cover only few samples. Shouldn't this be called conditionally abundant, rather than conditionally rare..?
PER: is the rarity also evaluated, or just max/min ratio? some taxa might be abundant and prevalent, while having a low max/min ratio. At least in principle. Then it would be misleading to call them PER. The interesting PER are taxa that might be prevalent but always stay under a certain threshold and never bloom.
We could add the methods as suggested and it could go to mia. If this could be made into a more generic index function (like addAlpha for samples), that could help to simplify maintenance and adding possible other measures later. Would that be feasible..?
On the other if this seem time-consuming to finalize, it could be left for later.
Signed-off-by: Daena Rys <[email protected]>
Signed-off-by: Daena Rys <[email protected]>
I read this article that Leo shared in the issue: https://journals.asm.org/doi/10.1128/mbio.01371-14 In this article, conditionally rare taxa is determined based on coefficient of bimodality. They determined CRT for each system from where multiple time points were collected. How should we proceed with these? CRT is determined with two different ways. Proposal: In mia, introduce new funtions:
These functions would return a vector of taxa (only those ones that are classified as CRT or PRT). Similar to These would follow the implementation introduced in this article Sizhong Yang, Matthias Winkel, Dirk Wagner, Susanne Liebner, Community structure of rare methanogenic archaea: insight from a single functional group, FEMS Microbiology Ecology, Volume 93, Issue 11, November 2017, fix126, https://doi.org/10.1093/femsec/fix126 Conditionally rare: (max:min > 100), where max and min are maximum and minimum abundances Permantently rare: (max:min < 5) In mia, implement In miaTime, implement The method would return either a vector length nrow or data.frame with dimensions nrow*N_system This method would follow this implementation: Shade A, Jones SE, Caporaso JGHandelsman J, Knight RFierer N, Gilbert JA2014.Conditionally Rare Taxa Disproportionately Contribute to Temporal Changes in Microbial Diversity. mBio5:10.1128/mbio.01371-14.https://doi.org/10.1128/mbio.01371-14 |
There are different multi/bimodality indices available, here one implementation: |
Interesting, I had not paid attention to the multiple definitions. Definitions in (1-2) are somehow arbitrary, and relatively easy calculate manually in whatever desirable way. I am hesitating if we need full methods for these in the end, before there is a clear need / more use cases. On the other hand this PR is already half-way.. Bimodalities are interesting, too. I'm good with that if it is quick to add. It could make sense to check what uni/bi/multimodality indices are available in R. There are a couple of different methods. |
1-2) I think this PR is not suitable for implementing these since I think mia is the correct place. However, these should be quite easy to implement.
|
Seems good. Only question I have is whether the added benefit from having these will be higher than the maintenance burden as these are relatively specialized functions. I could have thought about it earlier. But if these are fast to add, then it would be good to just PR them to mia and close the case. |
Re: (3) I would do that when someone has need in a real project. |
1-2) That is true. This is especially problematic if there are more than these two ways to define conditionally rare and persistently rare taxa. I don't know if it is common to determine these CRT and PRT, and are these the most common definitions. However to point out, in the issue, there was one external person interested in this activity. If these are the most common definitions for CRT & PRT, it would be nice to add features that community needs.
|
ping #83