Phenotype page: Display co-occurring phenotypes. #1538
Comments
@jmcmurry is there a specific item in the R24 for this, or is it more towards a general aim? I've done some work computing similarity coefficients and p-values, and mocked up some methods to include frequency classes from the HPO in the analysis. Some of this is in the notebook above and some is offline. I'm not sure if this is useful or overkill for what we'll get from this analysis. |
Hannah advises: Look at "Market basket analysis" |
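For reference, a minimal sketch of the three standard market basket measures (support, confidence, lift), treating each disease as a "basket" of phenotype terms. The function name and the toy term labels are hypothetical and not taken from the notebook; this only illustrates the idea.

```python
def basket_measures(a, b, baskets):
    """Return (support, confidence, lift) for the rule a -> b.

    baskets: list of sets of terms, one set per disease (assumed layout).
    """
    n = len(baskets)
    n_a = sum(1 for terms in baskets if a in terms)
    n_b = sum(1 for terms in baskets if b in terms)
    n_ab = sum(1 for terms in baskets if a in terms and b in terms)
    if n == 0 or n_a == 0 or n_b == 0:
        return 0.0, 0.0, 0.0
    support = n_ab / n               # fraction of diseases with both terms
    confidence = n_ab / n_a          # fraction of diseases with a that also have b
    lift = (n_ab * n) / (n_a * n_b)  # > 1 means more co-occurrence than expected by chance
    return support, confidence, lift


# Toy data: four diseases annotated with illustrative term labels.
baskets = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}]
support, confidence, lift = basket_measures("A", "B", baskets)
```

A lift near 1 suggests the two terms co-occur about as often as chance would predict, which is one way to normalize raw counts without reporting a p-value.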
here is an update (rename as .html) |
The feedback I'm interested in:
|
I would say that co-occurrence is more of an FYI than a statistical test, and I am not sure that users would know what to do with the information about p-values. A more important question is how to deal with the implicit annotations. I would find it useful to know what overall categories are frequently shared, but at least on the website I am less convinced people would want to see long lists of shared terms. |
thanks this is helpful! the p-value is more for establishing a cut-off in the case that it's not obvious how to interpret some normalization of the data, compared to a correlation coefficient where you might set the cutoff at abs(.7). But I see your point and thought this might be excessive. EDIT: disregard original comment on implicit phenotype co-counts, I went about this incorrectly. |
I've reworked the code for generating implicit co-occurrence data. Code: It takes ~20 minutes and requires ~85 GB of memory. It would be interesting to see how this would perform in another language (e.g. Julia). The top co-occurring count is greater than the number of diseases with phenotype annotations. The alternative would be to convert all phenotypes and their closures into a single set per disease, but then we're missing co-occurrence on all implicit classes when two explicit terms share a common ancestor. Next step would be to account for terms in the same lineage, or alternatively only consider terms with the same distance from the root class. |
hmm, I've managed to do this using ontobio on a laptop before |
filtering redundant phenotypes as early as possible is key. note I have no permission to push to that repo but
and you will be under 10G and 2 min processing |
It seems odd that this requires such heavy computational resources. I had a prototype solution in Java that also arranged things according to category (but did not calculate p-values) that was pretty fast. I need to refactor it after having refactored everything else to use phenol, but 85 GB and 20 minutes are certainly excessive. It would be good to collaborate more on code like this. Why don't you take a look at HPO Workbench and see if that starts to fulfil the requirements? |
@TomConlin I think it depends what we want out of the analysis. If a disease is annotated to 'abnormal optic nerve' and 'abnormal neuron', would I want to capture that 'abnormality of the nervous system' co-occurs with itself once in this disease? If we convert the implicit classes to a set we miss this. This is why the top count in the tsv is much higher than the total count of diseases. @pnrobinson the code here looks at co-occurrence of every explicit and implicit class all the way up to HP:0000001 (which is unnecessary). If we were to look at just a subset of categorical phenotypes it would be far less resource hungry. |
To capture what a disease is annotated to, we would have to distinguish the terms from all their included ancestors. Converting to a set means HP:0000118 shows up once per disease instead of ~25 times per disease. That is, you still get your disease associated with 'abnormality of the nervous system', but only once. |
HP:0000118 isn't a great example because it doesn't make sense to capture, but say I have a disease annotated to 25 phenotypes that are all subclasses of 'nervous system abnormality', how many times does 'nervous sys abnormality' co-occur within that disease? |
I am content with once. |
With the inherited annotations, you need to count them only once per disease. That is, if a patient has abn of the brain and abn of the spinal cord, this would naively result in two inferred annotations for abn of the nervous system, but this is wrong, because according to the HPO model the annotation needs to be counted only once. |
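To make the once-per-disease counting concrete, here is a minimal sketch in plain Python. The toy ontology, the disease names, and the helper names (`closure`, `implicit_set`, `cooccurrence_counts`) are all hypothetical; a real run would take the ancestor closure from the HPO via a library such as ontobio.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy ontology: child term -> list of parent terms.
PARENTS = {
    "abn_brain": ["abn_nervous_system"],
    "abn_spinal_cord": ["abn_nervous_system"],
    "abn_nervous_system": ["phenotypic_abnormality"],
    "abn_skin": ["phenotypic_abnormality"],
    "phenotypic_abnormality": [],
}


def closure(term):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def implicit_set(explicit_terms):
    """Union of closures, so each inferred term appears once per disease."""
    terms = set()
    for term in explicit_terms:
        terms |= closure(term)
    return terms


def cooccurrence_counts(disease_to_terms):
    """For each unordered term pair, count the diseases that contain both."""
    counts = Counter()
    for explicit in disease_to_terms.values():
        for pair in combinations(sorted(implicit_set(explicit)), 2):
            counts[pair] += 1
    return counts


DISEASES = {
    "disease_A": ["abn_brain", "abn_spinal_cord"],
    "disease_B": ["abn_brain", "abn_skin"],
}
counts = cooccurrence_counts(DISEASES)
```

Because each disease contributes a set, no pair count can exceed the number of diseases, and 'abn_nervous_system' is counted once for disease_A even though two explicit terms imply it.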
Okay I was going about this wrong then! |
It could be interesting to look at from the phenotype ancestor "score card" point of view. |
If it's interesting I will leave it as an option and compute it both ways. I understand everyone’s point that you can only be in one of two states of abnormality at the system level (present/absent). But say a patient presents with a mole on their arm, and an abscess on their thigh, would we not say they have two skin abnormalities occurring together? |
imho that is not the context of this approach: we are not talking about what is happening in an individual patient, we are talking about whether any two diseases share an abnormality. I think it would just be confusing to double count in this way. |
It sounds like the way I'm calculating this is fundamentally wrong, as I'm looking at phenotypes occurring within the same disease. |
I see. I would say not wrong, but a different calculation. I was thinking that we take all diseases that have HPO:X and then ask what the most common co-occurring terms are. Possibly both calculations are interesting... |
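A minimal sketch of the query-centred calculation described above: take every disease annotated to a query term, then rank the other terms by how many of those diseases carry them. The function name and the placeholder term and disease labels are hypothetical, and the input is assumed to already be one set of terms per disease.

```python
from collections import Counter


def cooccurring_terms(query, disease_to_terms, top_n=10):
    """Rank terms by the number of diseases they share with `query`.

    disease_to_terms: dict mapping a disease ID to a set of terms
    (assumed to hold each term at most once per disease).
    Returns a list of (term, disease_count) pairs, most frequent first.
    """
    counts = Counter()
    for terms in disease_to_terms.values():
        if query in terms:
            counts.update(t for t in terms if t != query)
    return counts.most_common(top_n)


# Toy data with placeholder labels.
disease_to_terms = {
    "D1": {"PX", "P2", "P3"},
    "D2": {"PX", "P2"},
    "D3": {"P2", "P3"},
    "D4": {"PX", "P4"},
}
ranked = cooccurring_terms("PX", disease_to_terms)
```

The second element of each pair is the disease count proposed as a table column in the issue description. If the per-disease sets include ancestor closures, ancestors of the query term will trivially top the list, so they would probably need to be filtered out first.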
Here is what I think should be done. This should be done as a standard enrichment test between two gene sets, i.e. a Fisher exact test for genes in P1 vs genes in P2, with appropriate correction for multiple tests. Skip tests if P1 and P2 are mutual ancestors/descendants. Note this will give you a lot of significant matches between siblings and grandsiblings etc., so the appropriate background for the test is the set of all genes in the MRCAs of P1 and P2. The goal is to find latent connections not already in the ontology. As far as implementation, I would avoid any direct computation in Solr. Just load everything into main memory and do the calculations there with any necessary optimizations. The language is largely irrelevant, but note that ontobio has all the necessary calls to load into an association object any set of annotations in Monarch, so the same analysis could be repeated for human PxP with genes, PxP with diseases, mouse PxP, PxP with orthologous genes (i.e. phenologs), PxGO, DxGO etc. |
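A minimal sketch of that enrichment test in plain Python, under several assumptions: the helper names and toy gene IDs are hypothetical, the test is one-sided (over-representation only), Bonferroni stands in for "appropriate correction", and the caller supplies the background set (for example the genes annotated to the MRCA of P1 and P2). A real analysis would more likely use a statistics library.

```python
from math import comb


def contingency(p1_items, p2_items, background):
    """Build the 2x2 table (a, b, c, d) restricted to the background set.

    a: in P1 and P2    b: in P1 only
    c: in P2 only      d: in neither
    """
    bg = set(background)
    s1 = set(p1_items) & bg
    s2 = set(p2_items) & bg
    a = len(s1 & s2)
    b = len(s1 - s2)
    c = len(s2 - s1)
    d = len(bg) - a - b - c
    return a, b, c, d


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher exact p-value: P(overlap >= a) with margins fixed."""
    n = a + b + c + d
    row1 = a + b  # size of P1 within the background
    col1 = a + c  # size of P2 within the background
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)


def is_lineage_pair(p1, p2, ancestors):
    """True if one term is an ancestor of the other (such pairs are skipped)."""
    return p1 in ancestors.get(p2, set()) or p2 in ancestors.get(p1, set())


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Toy example: 8 background genes, 4 annotated to each phenotype, 3 shared.
background = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
p1_genes = {"g1", "g2", "g3", "g4"}
p2_genes = {"g1", "g2", "g3", "g5"}
table = contingency(p1_genes, p2_genes, background)
p_value = fisher_right_tail(*table)
```

In this layout the items shared by P1 and P2 go in the top-left cell of the table, and the same functions apply unchanged if diseases are used in place of genes.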
@cmungall can you look at the notebook here https://github.com/monarch-initiative/monarch-app/issues/1538#issuecomment-377278609 and comment on whether I'm setting up the Fisher exact test correctly? I think we're on the same page but I'm not certain. In your example does the intersection of diseases annotated to P1|P2 go in the 2x2 table? |
Not super high priority, but for discussion... I could have sworn there was already a ticket for this but ...
for phenotype pages, I think it would be interesting to have a tab with other phenotypes that frequently co-occur. The table would also contain a column for the number of diseases in which they co-occur.