Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LOVD endpoint: Variants crossing gene boundaries generate "porcessing_error". #173

Closed
ifokkema opened this issue Apr 28, 2020 · 44 comments
Closed

Comments

@ifokkema
Copy link
Collaborator

On the LOVD endpoint, variants crossing gene boundaries, such as the BRCA1 promotor deletion variant NC_000017.10:g.41271863_41308933del, generate the flag porcessing_error (also note the typo) and a transcript_variant_error value of "start or end or both are beyond the bounds of transcript record". The mapping on NM_007294.3 was provided by VV itself, so I'd argue it should then be able to handle the mapping. Mutalyzer claims this variant should map to NC_000017.10(NM_007294.3):c.-31665_81-4067del, although I understand that these mapped positions cannot be represented in this NM record.

I'm not sure if you'd want to support this notation; the HGVS nomenclature website clearly states:

(...) it is not allowed to describe variants in nucleotides beyond the boundaries of a reference sequence. Consequently, deletions extending 5’ of a transcript (...) can only be described using genomic coordinates. The HGVS nomenclature committee (SVD-WG) is discussing whether a c. based format should be proposed.

Not sure if there was an outcome to this, but obviously, LOVD is massively breaking these regulations because some kind of mapping needs to be given to indicate this genomic variant has an effect on the expression of the gene. Any thoughts?

@Peter-J-Freeman
Copy link
Collaborator

Hi @ifokkema

I have corrected the typo. Will push up shortly.

I need to try and figure out why VV identifies the transcript. I suspect that normalised in one direction the deletion can be pushed entirely into the transcript reference sequence. Want to understand what's going on first.

As for the coordinates outside of transcript reference sequences, the Working Group cannot reach a conclusion as to whether to allow it or the nomenclature standard to support it.

Perhaps a position relative to the c.1 could be provided (although I'm not sure whether this is possible) which you can then choose to format into whatever you like???? :)

@ifokkema
Copy link
Collaborator Author

I have corrected the typo. Will push up shortly.

Thank you!

As for the coordinates outside of transcript reference sequences, the Working Group cannot reach a conclusion as to whether to allow it or the nomenclature standard to support it.

Perhaps a position relative to the c.1 could be provided (although I'm not sure whether this is possible) which you can then choose to format into whatever you like???? :)

Sounds good! I don't mind breaking HGVS rules by myself, as long as the working group cannot come up with a solution. It seems obvious to me that in some way we should be able to indicate the variant affects the gene, and I don't mind constructing the non-HGVS format that Mutalyzer currently uses to create some support. I could parse the positions out of an error message if I have to, otherwise, separate fields would be awesome.

@Peter-J-Freeman
Copy link
Collaborator

I'd go for additional JSON fields. Adding fields generally doesn't break stuff for others and is a way of creating non-standard data in a meaningful way.

Gonna prioratise a few other issues first though???

@ifokkema
Copy link
Collaborator Author

ifokkema commented May 1, 2020

Great, thank you!

And no worries, there's plenty left for me to work on 😬

@ifokkema
Copy link
Collaborator Author

@leicray and me also discussed this today over Skype;

  • I understand the point that positions can not be described outside of their reference sequences.
  • I understand that intronic variants are a bit of an exception in this case as NC-based descriptions are supported (i.e., NC_000015.9(NM_002225.3):c.466-2A>G).
  • So in principle, positions upstream of the transcription start, and downstream of the transcription end, could be supported, but then the question becomes - "how far from the gene should we support NC-based positions?".

These are the downsides that I see of simply not supporting variants partially falling outside of a gene:

  • There is no way to annotate whole-gene deletions on the gene level.
  • There is no way to annotate partial gene deletions on the gene level.
  • VV doesn't produce liftovers for variants like these.
  • It's exceptionally awkward that increasing the size of a deletion with a valid HGVS nomenclature, suddenly makes it not have valid HGVS nomenclature on the gene level at all.
  • Databases like LOVD require a transcript-based description to be able to store the effect on the gene-level.
  • Humans don't see what these genomic descriptions mean; is the whole gene gone? Or only half of it? Or perhaps only a part of 3' UTR, and the gene functions still?
  • Actually, databases won't know either, as VV doesn't help them out in this case.

I argue the downsides of having no support outweigh not having an answer to the question "how far from the gene should we support NC-based positions?". Therefore, after discussing this with @leicray, I suggest using the uncertain positions notation for this; for instance, as seen in c.(?_-126)_(*1406_?)del (by the way, Raymond, that's how you use a DBID link 😝).

Pros:

  • You don't need to think about how far from the gene to support NC-based positions.
  • It doesn't use positions outside of the transcript, so it doesn't break the rule.
  • It nicely groups all gene-based descriptions of whole-gene deletions, making it easy to find all associated phenotypes.
  • Humans can read it.
  • Databases can read it.
  • It matches existing HGVS rules, making it easier for people to remember how to produce the HGVS themselves, should they have to.

Perhaps I should shoot this into the HGVS committee, but I'm not part of those email conversations.

ifokkema added a commit to LOVDnl/LOVD3 that referenced this issue May 29, 2020
…to a transcript.

- If variants partially map outside of a transcript, VV currently does not return a transcript mapping, but also no liftover. As such, we can do nothing.
- See openvar/variantValidator#173.
- Silently skipping these variants for now.
@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 3, 2020

Although the original example no longer throws an error, NC_000001.10:g.1573181C>G still does (porcessing_error, with typo, and all mappings return a transcript_variant_error).

@Peter-J-Freeman
Copy link
Collaborator

That's odd. Wonder if I forgot to upload a version/merge some changes. I'll take a look ASAP

Sorry for the delays @ifokkema . Having to do some of the boring stuff currently!!!!!

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 3, 2020

Thanks and, no worries! I'm running my VV verification script now on the GV shared LOVD, so I might run into more weird things...

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 3, 2020

Another example, but strangely enough, Mutalyzer indicates NC_000001.10:g.16891340T>A is in the CDS of NM_017940.4. Yet, VV generates an error as if it's not inside of the transcript.

@leicray
Copy link
Contributor

leicray commented Jun 3, 2020

I suspect that this might be a gene containing an alignment gap because Alamut Visual does not show any alignments against RefSeq transcripts. What's more, the genome and transcript position numbers do not align. Perhaps easier to SHOW you this than to try an explain. How about a skype chat tomorrow?

@leicray
Copy link
Contributor

leicray commented Jun 3, 2020

I have just also noticed that the error message reads:

"flag": "porcessing_error"

instead of

"flag": "processing_error"

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 3, 2020

I suspect that this might be a gene containing an alignment gap because Alamut Visual does not show any alignments against RefSeq transcripts.

Normally, VV would throw a warning for this. Now, the gap_statement and gapped_alignment_warning fields are empty.

Perhaps this transcript has a very specific problem, but I think in general this is caused by the basic mapping mechanism (hgvs python module?) allowing for some distance between the variant and the transcript (like Mutalyzer allows 5000 bp upstream and 2000 bp downstream), but then VV not allowing any position outside of the transcript.

What's more, the genome and transcript position numbers do not align. Perhaps easier to SHOW you this than to try an explain. How about a skype chat tomorrow?

Hmm... yeah, this one I don't understand. I'm free tomorrow at any time, from 8:30 UK time. Just let me know ~30 minutes in advance, if possible.

I have just also noticed that the error message reads:

"flag": "porcessing_error"

instead of

"flag": "processing_error"

Yes, indeed; Pete already fixed it, but it somehow this change didn't find its way onto the server.

@Peter-J-Freeman
Copy link
Collaborator

Normally, VV would throw a warning for this. Now, the gap_statement and gapped_alignment_warning fields are empty.

Only if the position queried hits the gap.

Yes, indeed; Pete already fixed it, but it somehow this change didn't find its way onto the server.

Yep, pretty sure it's fixed. Can't replicate locally :)

@Peter-J-Freeman
Copy link
Collaborator

Perhaps this transcript has a very specific problem, but I think in general this is caused by the basic mapping mechanism (hgvs python module?) allowing for some distance between the variant and the transcript (like Mutalyzer allows 5000 bp upstream and 2000 bp downstream), but then VV not allowing any position outside of the transcript.

HGVS does not allow for any positions outside of the transcript. Mutalyzer is, as usual, non compliant with this rule.

That being said, it is possible our alignment for NM_017940.4 might be inaccurate. The update UTA will fix this. I'll check

@Peter-J-Freeman
Copy link
Collaborator

Peter-J-Freeman commented Jun 3, 2020

For NC_000001.10:g.1573181C>G I get locally (so def needs pushing up)

{
    "NC_000001.10:g.1573181C>G": {
        "errors": [],
        **"flag": "processing_error",**
        "NC_000001.10:g.1573181C>G": {
            "p_vcf": "1:1573181:C:G",
            "g_hgvs": "NC_000001.10:g.1573181C>G",
            "selected_build": "GRCh37",
            "genomic_variant_error": null,
            "hgvs_t_and_p": {
                "NM_033488.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033489.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033486.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033493.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033492.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033487.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                }
            }
        }
    },
    "metadata": {
        "variantvalidator_version": "1.0.4.dev11+g97aec97",
        "variantvalidator_hgvs_version": "1.2.5.vv1",
        "uta_schema": "uta_20180821",
        "seqrepo_db": "/Users/Shared/seqrepo_dumps/2018-08-21",
        "variantformatter_version": "1.0.2.dev7+gbdcbbe9.d20200316"
    }
}

CDK11B is indeed listed as a gapped alignment

The alignment is a total mess. According to UCSC

The RefSeq transcript has 3 substitutions, 2 non-frameshifting indels and aligns at 90% coverage compared to this genomic sequence
http://genome.ucsc.edu/cgi-bin/hgc?hgsid=842211639_EOu1QaRxab8C2lRBGfhouSqNEXvG&c=chr1&l=1573169&r=1573192&o=1570602&t=1590465&g=ncbiRefSeqCurated&i=NM_033489.3

Therefore the alignment provided by Mutalyzer will be absolute pants!!! :)

That being said, our alignment needs looking at but only once we have re-created UTA
Also, I don't quite understand how we are identifying transcripts here so warrants a look. I'll get on to that

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 3, 2020

Perhaps this transcript has a very specific problem, but I think in general this is caused by the basic mapping mechanism (hgvs python module?) allowing for some distance between the variant and the transcript (like Mutalyzer allows 5000 bp upstream and 2000 bp downstream), but then VV not allowing any position outside of the transcript.

HGVS does not allow for any positions outside of the transcript. Mutalyzer is, as usual, non compliant with this rule.

Hmm... So perhaps then the method that checks which transcripts should be aligned to, thinks the variant aligns, but then the actual alignment algorithm finds they don't align... Something like that?

The RefSeq transcript has 3 substitutions, 2 non-frameshifting indels and aligns at 90% coverage compared to this genomic sequence

Therefore the alignment provided by Mutalyzer will be absolute pants!!! :)

Uff... haha, good to know 😉

@Peter-J-Freeman
Copy link
Collaborator

For NC_000001.10:g.16891340T>A

VV gives

{
  "flag": "intergenic",
  "intergenic_variant_1": {
    "alt_genomic_loci": [],
    "gene_ids": {},
    "gene_symbol": "",
    "genome_context_intronic_sequence": "",
    "hgvs_lrg_transcript_variant": "",
    "hgvs_lrg_variant": "",
    "hgvs_predicted_protein_consequence": {
      "slr": "",
      "tlr": ""
    },
    "hgvs_refseqgene_variant": "",
    "hgvs_transcript_variant": "",
    "primary_assembly_loci": {
      "grch37": {
        "hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
        "vcf": {
          "alt": "A",
          "chr": "1",
          "pos": "16891340",
          "ref": "T"
        }
      },
      "grch38": {
        "hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
        "vcf": {
          "alt": "A",
          "chr": "1",
          "pos": "16564845",
          "ref": "T"
        }
      },
      "hg19": {
        "hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
        "vcf": {
          "alt": "A",
          "chr": "chr1",
          "pos": "16891340",
          "ref": "T"
        }
      },
      "hg38": {
        "hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
        "vcf": {
          "alt": "A",
          "chr": "chr1",
          "pos": "16564845",
          "ref": "T"
        }
      }
    },
    "reference_sequence_records": "",
    "refseqgene_context_intronic_sequence": "",
    "selected_assembly": "GRCh37",
    "submitted_variant": "NC_000001.10:g.16891340T>A",
    "transcript_description": "",
    "validation_warnings": [
      "No transcripts found that fully overlap the described variation in the genomic sequence"
    ]
  },
  "metadata": {
    "seqrepo_db": "2018-08-21",
    "uta_schema": "uta_20180821",
    "variantvalidator_hgvs_version": "1.2.5.vv1",
    "variantvalidator_version": "1.0.4.dev42+gbb2b0b7"
  }
}

Again, there seems to be something odd with the alignment of NBPF1. We do not have it down as a gap gene, but the UCSC status is

Other notes: The RefSeq transcript aligns at 95% coverage compared to this genomic sequence
However, this is for NM_017940.6 and not .4

LOVD EP produces

{
    "NC_000001.10:g.16891340T>A": {
        "errors": [],
        "flag": "processing_error",
        "NC_000001.10:g.16891340T>A": {
            "p_vcf": "1:16891340:T:A",
            "g_hgvs": "NC_000001.10:g.16891340T>A",
            "selected_build": "GRCh37",
            "genomic_variant_error": null,
            "hgvs_t_and_p": {
                "NM_017940.4": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "16891340",
                                    "ref": "T",
                                    "alt": "A"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "16891340",
                                    "ref": "T",
                                    "alt": "A"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "16564845",
                                    "ref": "T",
                                    "alt": "A"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "16564845",
                                    "ref": "T",
                                    "alt": "A"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                }
            }
        }
    },
    "metadata": {
        "variantvalidator_version": "1.0.4.dev11+g97aec97",
        "variantvalidator_hgvs_version": "1.2.5.vv1",
        "uta_schema": "uta_20180821",
        "seqrepo_db": "/Users/Shared/seqrepo_dumps/2018-08-21",
        "variantformatter_version": "1.0.2.dev7+gbdcbbe9.d20200316"
    }
}

So again the transcript is identified. I'll do some additional digging. I am still thinking out alignment may need an update though

@Peter-J-Freeman
Copy link
Collaborator

Peter-J-Freeman commented Jun 4, 2020

OK, so here is our alignment for [['NM_017940.4', 'NC_000001.10', -1, 'splign', 16888921, 16940100]]

It's a total mess of a gene. I would not trust UCSC or Mjutalyzer. I suspect our alignment may also need an update, but it goes to show how bad this gene is. I'll look at the actual positions later and see why the genomic position entered may be being returned as out-of-bounds currently then check the RefSeq alignment

['gene', 'tx', 'chr', 'aln', 'ori', 'exon', 'tx_st', 'tx_end', 'chr_st', 'chr_end', 'cigar']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 24, 3672, 3845, 16893674, 16893846, '96=1X75=1D']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 23, 3620, 3672, 16894473, 16894525, '52=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 22, 3456, 3620, 16895567, 16895731, '97=1X66=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 21, 3404, 3456, 16899636, 16899688, '52=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 20, 3198, 3404, 16900981, 16901187, '8=1X197=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 19, 3125, 3198, 16901651, 16901724, '10=1X3=1X58=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 18, 2910, 3125, 16902761, 16902976, '33=1X48=1X109=1X3=1X18=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 17, 2807, 2910, 16903811, 16903914, '2=1X29=1X70=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 16, 2597, 2807, 16905687, 16905897, '94=1X2=1X10=2X100=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 15, 2385, 2597, 16907239, 16907451, '123=1X88=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 14, 2312, 2385, 16907914, 16907987, '73=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 13, 2097, 2312, 16909038, 16909253, '124=1X67=1X3=1X4=1X13=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 12, 1994, 2097, 16910088, 16910191, '14=1X88=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 11, 1784, 1994, 16911983, 16912193, '94=1X2=1X10=2X69=1X30=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 10, 1572, 1784, 16913544, 16913756, '79=1X43=1X88=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 9, 1499, 1572, 16914219, 16914292, '73=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 8, 1284, 1499, 16915343, 16915558, '33=1X86=1X3=1X67=1X3=1X4=1X13=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 7, 1181, 1284, 16916393, 16916496, '103=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 6, 971, 1181, 16918341, 16918551, '49=1X160=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 5, 816, 971, 16918653, 16918808, '115=1X39=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 4, 689, 816, 16919935, 16920062, '127=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 3, 619, 689, 16921086, 16921156, '70=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 2, 540, 619, 16921425, 16921504, '79=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 1, 268, 540, 16935002, 16935274, '272=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 0, 0, 268, 16939832, 16940100, '268=']

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 4, 2020

OK, so here is our alignment for [['NM_017940.4', 'NC_000001.10', -1, 'splign', 16888921, 16940100]]

OK, so this is odd, right? 16891340 doesn't overlap with any exon there, but it does with your genomic range above. That range is bigger by 4753 bases downstream of the gene (5' in the chromosome). I don't know what data comes from where, but shouldn't the numbers you mention there simply be based on the smallest and largest positions in the list of exon alignments?

@Peter-J-Freeman
Copy link
Collaborator

No, that's genome annotation, we are providing alignments hence the CIGAR MUST be considered which tools like UCSC, Mutalyzer and others fail to do. I need to look more closely, but the alignment is a mess. Will look ASAP

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 5, 2020

Of course, you should consider the CIGAR, but that's not my point - you have the strange situation where VV reports a transcript to overlap a position, and then it says it doesn't. Well, your transcript lengths are conflicting right there, so doesn't that explain the strange behavior in VV?

Will look ASAP

No stress, take your time on this. I'm running VV on our entire LOVD instance, I'll be busy for a long time with this...

@Peter-J-Freeman
Copy link
Collaborator

Absolutely, agree with Well, your transcript lengths are conflicting right there, so doesn't that explain the strange behavior in VV?

Need a puzzle to solve :)

@Peter-J-Freeman
Copy link
Collaborator

Sorry, I see what you mean now. This is really unusual.

I'm going to look into how the hgvs library fetches that range.

@Peter-J-Freeman
Copy link
Collaborator

So the method used is in here
def get_tx_for_region

def get_tx_for_region(self, alt_ac, alt_aln_method, start_i, end_i):
        """
        return transcripts that overlap given region
        :param str alt_ac: reference sequence (e.g., NC_000007.13)
        :param str alt_aln_method: alignment method (e.g., splign)
        :param int start_i: 5' bound of region
        :param int end_i: 3' bound of region
        """
        return self._fetchall(self._queries['tx_for_region'], [alt_ac, alt_aln_method, start_i, end_i])

So the range is as you suspected the genomic coordinate bounds for the transcript. So the question is why is there no exon either side of this position???? The query position is

vfo.hdp.get_tx_for_region(hgvs_genomic.ac, 'splign', hgvs_genomic.posedit.pos.start.base,
                                            hgvs_genomic.posedit.pos.end.base - 1)

where hgvs_genomic is NC_000001.10:g.16891340T>A

@Peter-J-Freeman
Copy link
Collaborator

So I suspect the errors are with UTA. Will email this link to Jon to look at in the new build.

Jon, note we are looking at variants NC_000001.10:g.16891340T>A and NC_000001.10:g.1573181C>G

@Peter-J-Freeman
Copy link
Collaborator

Peter-J-Freeman commented Jun 5, 2020

Alignments for second variant

['gene', 'tx', 'chr', 'aln', 'ori', 'exon', 'tx_st', 'tx_end', 'chr_st', 'chr_end', 'cigar']
['CDK11B', 'NM_033488.1', 'NC_000001.10', 'splign', -1, 3, 269, 390, 1586818, 1586938, '116=2X2=1D']
['CDK11B', 'NM_033488.1', 'NC_000001.10', 'splign', -1, 2, 222, 269, 1588705, 1588752, '47=']
['CDK11B', 'NM_033488.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033488.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033489.1', 'NC_000001.10', 'splign', -1, 3, 269, 390, 1586818, 1586938, '116=2X2=1D']
['CDK11B', 'NM_033489.1', 'NC_000001.10', 'splign', -1, 2, 222, 269, 1588705, 1588752, '47=']
['CDK11B', 'NM_033489.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033489.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033486.1', 'NC_000001.10', 'splign', -1, 2, 222, 343, 1586818, 1586938, '116=2X2=1D']
['CDK11B', 'NM_033486.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033486.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033493.1', 'NC_000001.10', 'splign', -1, 2, 222, 343, 1586818, 1586938, '57=1X58=2X2=1D']
['CDK11B', 'NM_033493.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033493.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033492.1', 'NC_000001.10', 'splign', -1, 2, 222, 343, 1586818, 1586938, '57=1X58=2X2=1D']
['CDK11B', 'NM_033492.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033492.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033487.1', 'NC_000001.10', 'splign', -1, 2, 222, 343, 1586817, 1586938, '116=2X3=']
['CDK11B', 'NM_033487.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033487.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']

And

[['NM_033488.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033489.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033486.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033493.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033492.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033487.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473]]

Again it looks like the range provided by hdp.get_tx_for_region doesn't match the range provided in the alignments.

we also need to cross reference against RefSeq

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 5, 2020

Alignments for second variant
(...)
Again it looks like the range provided by hdp.get_tx_for_region doesn't match the range provided in the alignments.

Here, the difference is even a staggering 15719 basepairs. I'm really curious where this comes from...

@Peter-J-Freeman
Copy link
Collaborator

We think it is a balls up when the transcript record was updated. We think the alignment data was updated but the range data returned by vfo.hdp.get_tx_for_region were not.

This will be fixed with our UTA build. John is already looking into it.

FYI, I think NM_017940.4 has been deprecated.

@ifokkema
Copy link
Collaborator Author

Just to link; related to #152.

@ifokkema
Copy link
Collaborator Author

Just to update; I have just sent out a survey to all our curators in the GV shared LOVD; they're asked to vote on the proposed description on cDNA level for variants (partially) outside of the transcript's reference sequence.
Options that they can vote for, given NC_000016.9:g.2106894_2161281del as an example:
A. "-" (an empty description)
B. "c.3887_*32834del"
C. "c.3887_*1017del"
D. "c.3887_(*1017_?)del"
E. "c.3887_*1017[0]"
F. "c.3887_*1017{0}"

I'll keep you updated on the results 😉

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jun 22, 2020

Another variant that voluntarily maps to transcripts (10, in this case), and then throws a transcript_variant_error for each of them. NC_000007.13:g.50468071G>A. Variant should map to the coding region of multiple transcripts (according to UCSC genome browser).

@leicray
Copy link
Contributor

leicray commented Jun 22, 2020

That's very strange. If I validate NC_000007.13:g.50468071G>A using the interactive validator, it returns the warning:
"No transcripts found that fully overlap the described variation in the genomic sequence".
What's the evidence that there are 10 transcripts that span the variant?

@ifokkema
Copy link
Collaborator Author

What's the evidence that there are 10 transcripts that span the variant?

That is what the API says, I always use the APIs now. See the link that I posted (the variant holds the link). I bet the interface discards all ten transcripts before reporting that there is no overlap.

@Peter-J-Freeman
Copy link
Collaborator

The LOVD endpoint and the VV endpoint have slightly different processing. I still believe this is an alignment issue which Johns work will corrcect

@ifokkema
Copy link
Collaborator Author

OK, cool! Would you like more examples when I run into them? Or perhaps that's just overkill... I really think that once John is done with rebuilding the database, I should just run all verifications again 😅 It's quite slow at the moment though (meaning, we have a lot of data); about 2% of the database contents is checked per day.

@ifokkema
Copy link
Collaborator Author

In case you're interested, here's another one, but special because it also shows that UTA misses a new transcript version there. NC_000008.10:g.144940290C>G voluntarily maps to NM_031308.2 and then throws an error. Mutalyzer claims it should map to NM_031308.3:c.7129G>C (note the newer version!).

The version 2 of the transcript does in principle work, because NC_000008.10:g.144940327G>A maps fine to NM_031308.2:c.7095C>T (mutalyzer agrees on the position). There's no hg38 mapping, but that might also be because this is an older version of the transcript.

@Peter-J-Freeman
Copy link
Collaborator

OK, cool! Would you like more examples when I run into them? Or perhaps that's just overkill... I really think that once John is done with rebuilding the database, I should just run all verifications again 😅 It's quite slow at the moment though (meaning, we have a lot of data); about 2% of the database contents is checked per day.

If you can be bothered to post them then yes please because I add them to the PyTests

@ifokkema
Copy link
Collaborator Author

ifokkema commented Jul 7, 2020

Just to update; I have just sent out a survey to all our curators in the GV shared LOVD; they're asked to vote on the proposed description on cDNA level for variants (partially) outside of the transcript's reference sequence.
Options that they can vote for, given NC_000016.9:g.2106894_2161281del as an example:
A. "-" (an empty description)
B. "c.3887_*32834del"
C. "c.3887_*1017del"
D. "c.3887_(*1017_?)del"
E. "c.3887_*1017[0]"
F. "c.3887_*1017{0}"

Update: These are the results of the survey:

survey_results_2020-07-06

A majority of curators wants to see a variant description that acknowledges the size of the variant. I was reminded by one of the curators of the previously suggested nomenclature: "c.3887_*1017+d31817del". The HGVS earlier rejected this proposal, but well, having a description at all for this variant has been dismissed by the HGVS. This option, together with option B above, would also allow us to map back to the genome, as long as we store which reference sequence was used to describe the variant in the first place (like "NC_000016.9(NM_000296.3):c.3887_*32834del"). In that sense, it'll be just like an intronic variant.

@ifokkema
Copy link
Collaborator Author

Although this issue started with reporting the typo in the error, we also discussed handling variants that (partially or wholly) fall outside of genes. What do you think we should do with that? Or do you want me to submit a new report?

E.g.,
NC_000016.9:g.2106894_2161281del

Currently, VV returns no mapping at all, even though it partially overlaps a gene (see above for some possible descriptions that we can use).

@Peter-J-Freeman
Copy link
Collaborator

I think open a new report @ifokkema.
Perhaps add the survey and expand a little more on what it means and what you want to see. Add as feature request

@Peter-J-Freeman
Copy link
Collaborator

p.s. does c.3887_*32834del go beyond the length of the transcript. If so, I. will not accept this as it is an illegal descriptions because you cannot just infer bases in a reference sequence that do not exist.

@ifokkema
Copy link
Collaborator Author

Sure, I'll open a new request!
Yes, c.3887_*32834del exceeds the length of the 3' UTR, but note that the full suggested description is NC_000016.9(NM_001009944.2):c.3887_*32834del which doesn't infer bases but instead relies on the mapping with the NC. That way, it's no different from NC_000016.9:g.2145157_2163814del, which maps to NC_000016.9(NM_001009944.2):c.2853+357_10499+1992del. Neither the start nor the end of that deletion can be found in the NM reference sequence.

@Peter-J-Freeman
Copy link
Collaborator

Open the new issue and I will comment there

@ifokkema
Copy link
Collaborator Author

Excellent! Opened #333.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants