miRNA/isomiR naming #11
With the realization that miRNAs, by the nature of their genomic loci duplications and mutations, their messy cutting by Dicer, and a host of other variable factors, are too complex for any naming convention to be 100% accurate, I will take a stab.

Naming goals

1. Yes, I think a hairpin and the mature miRNAs should have perfect correspondence. Exceptions to this rule may have to exist.

2. Definition of the canonical sequence: should define and name the canonical sequence & point out if it is a constitutive canonical (same sequence in all known tissues) or regulated canonical (depends on the tissue)

At least for humans, I would not peg a miRNA to a tissue. miRNAs should be pegged to specific cell types. Tissues are too complicated. In smaller animals where individual cells are not obtained or obtainable (e.g. C. elegans), one could peg a miRNA to the tissue. After that caveat, the most abundant sequence should be the canonical sequence (which is not currently the case for much of miRBase).

3. Guide and passenger strand: If a clear distinction between guide and passenger strand can be made at a functional level, this must be reflected in the naming (with the good old ‘*’ for example)

It would be helpful for people to inherently know which of the miRNAs is the “passenger” strand. I do like the 5p/3p nomenclature, so I think this mark, “*” or other, should be in addition to the 5p/3p.

4. Evolutionary information and family naming (I): The naming should include information about paralogues and homologues (like in MirGeneDB): to achieve this, a (evolutionary family) seed definition is needed.

This would be a good idea, but the devil is in the details. Is the family named just on the seed? A certain percentage of the mature miRNA? The pre-miRNA? What if the mature miRNA is conserved between species but the seed is altered? Certainly some families (let-7 etc.) we could all agree on, but there will be a slippery slope of evolutionary changes.

5. Evolutionary information and family naming (II): If the seed region changes --> the function of the microRNA changes: should the microRNAs that are homologous but have different functions (regulate different genes because they have different seeds) receive the same name?

I agree this will be challenging. I don’t have any suggestions here.

Other technical issues that could be improved:
This speaks to the perplexities of naming miRNAs due to their complex evolutionary history. My bias, and it is a bias, is that the mature miRNA is by far the most important part of this. So as long as the mature miRNAs are properly named, I would be happy. Beyond that, linking a miRNA to its genomic location(s) is useful. In general, I think the miRBase nomenclature is pretty good. I don’t have specific suggestions.

In case of novel ‘mirna’ detected by sequencing, how to name if:

I am sure Bastian will have much to say about this. I don’t know the answer here. I thought, at least for humans, that we were near the end of the novel miRNA discovery era. But the Rigoutsos lab’s PNAS paper suggested that may not be true. I think the novel miRNA detection algorithms of miRDeep2 and miRAnalyzer (the only ones I have used) overcall miRNAs, so any novel miRNA detections made by these methods carry a high false positive rate. I do think NGS + structure is important (particularly the distribution of isomiRs relative to known miRNA patterns). I do not think conservation is as essential in finding new miRNAs, as I think we’ve figured out the major families, but there are certainly lineage-specific miRNAs. Ago loading is helpful, but Ago-Seq data is dirty. Perhaps better Ago-Seq data would help. You can probably get function by overexpressing any segment of RNA, so I don’t think that’s the answer either.

isomiR
I don’t think it’s possible to annotate every isomiR from each miRNA, as very deep RNA-seq could yield 100+ isomiRs for abundant miRNAs. I think it is important to identify all 5’ changes (which affect the seed sequence), templated length variants, and common nontemplated additions on the 3’ end (+A, +T, +G, +C in the +1 position). I don’t know if any other type of isomiR is particularly functional, and I have some data suggesting that nontemplated additions are somewhat affected by the sequencing kit. So that introduces some complexity.

I think we should cover 5’ changes and identify the nontemplated additions on the 3’ end, such that +1 from the canonical sequence should be defined as +1A vs +1C vs +1T vs +1G, and the genomic template nucleotide at this position should be known.

I think most internal changes are sequencing/PCR errors, with the exception of A to I changes. So 1 and 5 are generally an inconsequential percent of the miRNA. I’m not even sure if most programs would identify InDels.

I think it is important to document 5’ additions (as they change the seed) and any 3’ additions, where again a distinction could be made between a templated and a nontemplated addition.

I could be easily swayed on many of these points by convincing data that suggests a better approach.
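To make the proposal above concrete, here is a minimal Python sketch of labelling a read by its 5’ shift, its templated 3’ length change, and its nontemplated 3’ tail (e.g. +1A). Everything in it is an assumption for illustration: the function names, the label syntax (`5p-1`, `3p+1`, `+1A`), the toy hairpin sequence, and the simplification that a read has no internal mismatches, so the first mismatch against the genomic template marks the start of the nontemplated tail.

```python
from dataclasses import dataclass


@dataclass
class IsomiRLabel:
    five_prime_shift: int   # read start minus canonical start; nonzero shifts the seed
    templated_3p: int       # templated 3' length change (negative = trimmed)
    nontemplated_3p: str    # nontemplated 3' tail, e.g. "A" for +1A


def label_isomir(hairpin: str, canon_start: int, canon_end: int,
                 read: str, read_start: int) -> IsomiRLabel:
    """Compare a read to the genomic template (hairpin).

    Coordinates are 0-based and end-exclusive. Assumes no internal
    mismatches: everything after the first mismatch is treated as a
    nontemplated 3' addition.
    """
    matched = 0
    while (matched < len(read)
           and read_start + matched < len(hairpin)
           and read[matched] == hairpin[read_start + matched]):
        matched += 1
    return IsomiRLabel(
        five_prime_shift=read_start - canon_start,
        templated_3p=(read_start + matched) - canon_end,
        nontemplated_3p=read[matched:],
    )


def format_label(label: IsomiRLabel) -> str:
    """Render a hypothetical compact label; 'canonical' if nothing differs."""
    parts = []
    if label.five_prime_shift:
        parts.append(f"5p{label.five_prime_shift:+d}")
    if label.templated_3p:
        parts.append(f"3p{label.templated_3p:+d}")
    if label.nontemplated_3p:
        parts.append(f"+{len(label.nontemplated_3p)}{label.nontemplated_3p}")
    return ",".join(parts) or "canonical"


# Toy template (not a real miRNA); canonical mature = HAIRPIN[4:16].
# The genomic template nucleotide at the +1 position is HAIRPIN[16] == "T".
HAIRPIN = "AACCGGTTACGTACGTTTGGCCAA"
CANON_START, CANON_END = 4, 16
```

With this toy template, a read ending in an extra A is reported as a nontemplated `+1A`, while a read ending in an extra T is reported as a templated `3p+1`, because T is the genomic nucleotide at that position. A nontemplated addition that happens to match the template cannot be told apart from a templated extension by sequence alone, which is one reason the template nucleotide at +1 needs to be recorded.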
Hi @mhalushka, I'll give my two cents.

Say if you agree or not:
1. I agree

Say if you agree or not:
A. I agree; a longer precursor, just to help bioinformatics analysis, would be ideal

miRNA
How to name if:
The precursor of the miRNA, and the 3p miRNA

In case of novel ‘mirna’ detected by sequencing, how to name if:
i) NGS + structure (‘looks like’ a microRNA due to 5’ homogeneity and Drosha/Dicer pattern),

isomiR
Say if you agree or not, or offer other option:
Should all isomiRs be annotated?

Imagine the previous miRNA, how to annotate:
InDels XXXXXX-XXXXXXXX: As I said, I am more convinced that maybe we need a file format just to show what the tool has detected; this file could then even be used to flag isomiRs as PASS/FAIL. Again, all of this is explained further in issue 10.
Hi all, @lpantano > Say if you agree or not:
I think MirGeneDB (@BastianFromm) has a good miRNA naming system. IsomiR questions:
I would advise strongly against adding the cell type to the name. "A new cell type identifier should be included in the miRNA name if the mature miRNA sequence will be pegged to specific cell types" (as suggested by @mhalushka).
Hi all,
Thomas
Hi all, I'm really troubled by the canonical sequence. It might prove quite challenging for a few (or more?) miRNAs as more data are aggregated and more samples sequenced. What if the canonical form of the miRNA (which is actually a convention, since miRNAs can have different canonical forms depending on cell type, condition, state, etc., or even on how the canonical sequence is defined) is not adopted as an approach? Do we have alternatives? I'm not sure that they are viable, but I will state some examples as reference.

Example 1: A form similar (actually reverse) to genes and isoforms could be followed (as e.g. in Ensembl): this list of isomiRs has been found to be derived from this (or these) pre-miRNA(s). Their order (as in Ensembl) might signify abundance in different tissues or any other loosely or strictly defined metric, while other metadata (as we have now for transcript quality) could be added to help filtering. For instance, evidence level, a flag for an especially common mature (for the most common ones), etc. could be added as extra information. Since the hairpin(s) of origin for each mature sequence are known (and can also be included in the naming convention, as suggested by colleagues above), the interested researcher has all the necessary information to make informed decisions. What is interesting with this approach is that isomiR naming can be decided without having to be compared to an archetypical sequence (e.g. it could be based on the pre-miRNA of origin, genomic location, and so on).

Example 2: A rigid canonical form could be uniformly defined with the same length for all matures. If this is the case, then isomiR naming could be based on comparisons with this sequence. With this approach, a commonly expressed mature (which is now considered the canonical form) could actually be represented as an isomiR (if it differs from the default). With this approach, depending on the criteria for the uniform canonical form, it would be easy(-ier) to identify a mature sequence just from the naming.

Example 3: [placeholder for another example. Please add]

These approaches have a common aim: that canonical sequences don't change as we progress, and that they act as a constant reminder to the community that searching for (or using) a single mature sequence in all in silico/in vitro investigations is an oversimplification of the actual biology. The analogy is looking for a single gene isoform in all experiments. For some samples this might work out just fine, but there are definitely caveats. The second aim is to make those of us who don't like naming systems and conventions changing very often more comfortable.

My greatest fear of basing the isomiR naming on a canonical mature sequence is that when this sequence changes (and in many cases it will, depending on the definition of the canonical), all the isomiRs that rely on it will have to change naming as well. This might create a lot of problems for specialists and non-specialists alike.
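The renaming concern can be illustrated with a small Python sketch. It contrasts an identifier built only from the pre-miRNA of origin and the position on it (in the spirit of Example 1) with a label expressed relative to a canonical sequence. The class, field names, identifier syntax, read counts, and the `hsa-mir-X` placeholder are all hypothetical; nothing here is an agreed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsomiRRecord:
    hairpin_id: str          # pre-miRNA of origin (placeholder name below)
    start: int               # 0-based start on the hairpin
    end: int                 # end on the hairpin, exclusive
    read_count: int = 0      # abundance, used here only for ordering
    evidence_level: int = 0  # hypothetical quality metadata for filtering

    @property
    def stable_id(self) -> str:
        """Identifier that depends only on the hairpin and the position."""
        return f"{self.hairpin_id}:{self.start}-{self.end}"


def relative_label(rec: IsomiRRecord, canon_start: int, canon_end: int) -> str:
    """Label relative to a chosen canonical sequence on the same hairpin."""
    return f"5p{rec.start - canon_start:+d}/3p{rec.end - canon_end:+d}"


# Invented example values, ordered by abundance as in an Ensembl-like listing.
records = [
    IsomiRRecord("hsa-mir-X", 10, 32, read_count=500, evidence_level=2),
    IsomiRRecord("hsa-mir-X", 10, 33, read_count=900, evidence_level=2),
]
ranked = sorted(records, key=lambda r: r.read_count, reverse=True)
```

If the canonical is redefined from positions 10-32 to 10-33, `relative_label` for the first record changes from `5p+0/3p+0` to `5p+0/3p-1`, while its `stable_id` stays `hsa-mir-X:10-32`. The trade-off is that a position-based identifier says nothing about abundance or function on its own, so that information would have to travel as metadata.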
I wanted to address the canonical form. We have data from primary and cancer cells that looked into the most abundant miRNA form for all miRNAs. Below is some of the data. C represents the miRBase canonical sequence, and C+1, C-1, etc. represent the location of the most abundant miRNA by 3' length relative to the canonical sequence. Two points: 1) miRBase is not always accurate (we all know that), and 2) the most abundant length can vary by cell type (as noted by non-red colors and by sample size). For example, the most abundant "canonical" form of miR-454-3p can be C, C+1, or C+2 depending on the cell type. We think this is biological (cancer cells are more variable), but also technical. I don't think the potentially large role that technical factors (different library prep methods, different miRNA alignment programs, etc.) may be playing in both the canonical form and the isomiRs has been part of our conversation, but that should be addressed as well. This figure is from our manuscript on bioRxiv: http://biorxiv.org/content/early/2017/03/24/120394.
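The C, C+1, C-1 notation can be computed from per-cell-type read counts. The following is a rough Python sketch, not the method used in the manuscript: the function name, the input layout, the cell type names, and all counts are invented for illustration, and it assumes every read shares the canonical 5' end so that read length alone determines the 3' offset.

```python
from collections import Counter, defaultdict


def dominant_3p_offset(counts, canonical_len):
    """Return the most abundant 3' form per cell type as C, C+1, C-1, ...

    counts: iterable of (cell_type, read_length, n_reads) tuples for one
    miRNA, assuming all reads start at the canonical 5' position.
    """
    per_cell = defaultdict(Counter)
    for cell_type, length, n_reads in counts:
        per_cell[cell_type][length - canonical_len] += n_reads

    result = {}
    for cell_type, offsets in per_cell.items():
        # most_common breaks ties by first-encountered order
        offset, _ = offsets.most_common(1)[0]
        result[cell_type] = "C" if offset == 0 else f"C{offset:+d}"
    return result


# Invented counts for a 22-nt canonical sequence in three made-up cell types.
toy_counts = [
    ("cell_A", 22, 100), ("cell_A", 23, 40),
    ("cell_B", 22, 30), ("cell_B", 23, 90),
    ("cell_C", 24, 50), ("cell_C", 22, 10),
]
dominant = dominant_3p_offset(toy_counts, canonical_len=22)
```

On these toy counts the dominant form is `C` in one cell type, `C+1` in another, and `C+2` in the third, which mirrors the kind of variation described for miR-454-3p. Because library prep and alignment choices change the counts themselves, the same calculation on differently processed data could give a different answer.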
I completely agree. Based on my experience, alignment can definitely affect isomiR abundance, in some cases significantly. Mapping bias is a real issue, with some isomiRs being unmappable or heavily handicapped vs. others. This occurs in most approaches that we have checked (genome or miRNA alignment-based). Certainly, as @mhalushka mentioned above, library preps and other technical factors can also affect quantification results.
Hi all
Hi all again,
cc: @ThomasDesvignes @mhalushka @mlhack @keilbeck @BastianFromm @ivlachos @TJU-CMC
Finally discussing naming proposals. Feel free to adapt the issue to ask more questions or clarify them.
Deadline: May 16.
Naming goals
Say if you agree or not:
Other technical issues that could be improved:
Say if you agree or not:
A) Length of the hairpin sequences
Right now, in miRBase each hairpin sequence is pre-microRNA + X nt flanking sequences: X can be anything and is not defined by miRBase. This number needs to be fixed (5 nt, 10 nt, 15 nt, whatever).
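As a small illustration of what fixing X would mean in practice, the Python sketch below derives hairpin coordinates from pre-miRNA coordinates plus a fixed flank. The function name, the 0-based end-exclusive coordinate convention, and the example values are assumptions; the flank size itself is exactly the number this point asks the group to agree on.

```python
def hairpin_with_flank(pre_start: int, pre_end: int,
                       flank: int, chrom_len: int) -> tuple[int, int]:
    """Extend pre-miRNA coordinates by a fixed flank on both sides.

    Coordinates are 0-based and end-exclusive; the result is clipped
    to the chromosome (or contig) boundaries.
    """
    start = max(0, pre_start - flank)
    end = min(chrom_len, pre_end + flank)
    return start, end
```

For example, a pre-miRNA at 100-170 with a 10 nt flank gives a hairpin at 90-180. Near a contig edge the flank is clipped, so such hairpins would end up shorter than pre-miRNA + 2X.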
miRNA
miRNA defined in miRBase: XXXXXXXXXXXXXXXX hsa-miR-X-5p
How to name if:
In case of novel ‘miRNA’ detected by sequencing, how to name if:
i) NGS + structure (‘looks like’ a microRNA due to 5’ homogeneity and Drosha/Dicer pattern),
ii) phylogenetic footprinting (miRNA is conserved),
iii) Ago loading,
iv) impact of Drosha/Dicer knockout, and
v) positive functional assay.
isomiR
Say if you agree or not, or offer other option:
Imagine the previous miRNA, how to annotate: