LOVD endpoint: Variants crossing gene boundaries generate "porcessing_error". #173

ifokkema · 2020-04-28T11:42:34Z

On the LOVD endpoint, variants crossing gene boundaries, such as the BRCA1 promotor deletion variant NC_000017.10:g.41271863_41308933del, generate the flag porcessing_error (also note the typo) and a transcript_variant_error value of "start or end or both are beyond the bounds of transcript record". The mapping on NM_007294.3 was provided by VV itself, so I'd argue it should then be able to handle the mapping. Mutalyzer claims this variant should map to NC_000017.10(NM_007294.3):c.-31665_81-4067del, although I understand that these mapped positions cannot be represented in this NM record.

I'm not sure if you'd want to support this notation; the HGVS nomenclature website clearly states:

(...) it is not allowed to describe variants in nucleotides beyond the boundaries of a reference sequence. Consequently, deletions extending 5’ of a transcript (...) can only be described using genomic coordinates. The HGVS nomenclature committee (SVD-WG) is discussing whether a c. based format should be proposed.

Not sure if there was an outcome to this, but obviously, LOVD is massively breaking these regulations because some kind of mapping needs to be given to indicate this genomic variant has an effect on the expression of the gene. Any thoughts?

The text was updated successfully, but these errors were encountered:

Peter-J-Freeman · 2020-04-28T14:02:56Z

Hi @ifokkema

I have corrected the typo. Will push up shortly.

I need to try and figure out why VV identifies the transcript. I suspect that normalised in one direction the deletion can be pushed entirely into the transcript reference sequence. Want to understand what's going on first.

As for the coordinates outside of transcript reference sequences, the Working Group cannot reach a conclusion as to whether to allow it or the nomenclature standard to support it.

Perhaps a position relative to the c.1 could be provided (although I'm not sure whether this is possible) which you can then choose to format into whatever you like???? :)

ifokkema · 2020-04-28T17:16:48Z

I have corrected the typo. Will push up shortly.

Thank you!

As for the coordinates outside of transcript reference sequences, the Working Group cannot reach a conclusion as to whether to allow it or the nomenclature standard to support it.

Perhaps a position relative to the c.1 could be provided (although I'm not sure whether this is possible) which you can then choose to format into whatever you like???? :)

Sounds good! I don't mind breaking HGVS rules by myself, as long as the working group cannot come up with a solution. It seems obvious to me that in some way we should be able to indicate the variant affects the gene, and I don't mind constructing the non-HGVS format that Mutalyzer currently uses to create some support. I could parse the positions out of an error message if I have to, otherwise, separate fields would be awesome.

Peter-J-Freeman · 2020-05-01T14:33:11Z

I'd go for additional JSON fields. Adding fields generally doesn't break stuff for others and is a way of creating non-standard data in a meaningful way.

Gonna prioratise a few other issues first though???

ifokkema · 2020-05-01T14:44:02Z

Great, thank you!

And no worries, there's plenty left for me to work on 😬

ifokkema · 2020-05-28T16:43:57Z

@leicray and me also discussed this today over Skype;

I understand the point that positions can not be described outside of their reference sequences.
I understand that intronic variants are a bit of an exception in this case as NC-based descriptions are supported (i.e., NC_000015.9(NM_002225.3):c.466-2A>G).
So in principle, positions upstream of the transcription start, and downstream of the transcription end, could be supported, but then the question becomes - "how far from the gene should we support NC-based positions?".

These are the downsides that I see of simply not supporting variants partially falling outside of a gene:

There is no way to annotate whole-gene deletions on the gene level.
There is no way to annotate partial gene deletions on the gene level.
VV doesn't produce liftovers for variants like these.
It's exceptionally awkward that increasing the size of a deletion with a valid HGVS nomenclature, suddenly makes it not have valid HGVS nomenclature on the gene level at all.
Databases like LOVD require a transcript-based description to be able to store the effect on the gene-level.
Humans don't see what these genomic descriptions mean; is the whole gene gone? Or only half of it? Or perhaps only a part of 3' UTR, and the gene functions still?
Actually, databases won't know either, as VV doesn't help them out in this case.

I argue the downsides of having no support outweigh not having an answer to the question "how far from the gene should we support NC-based positions?". Therefore, after discussing this with @leicray, I suggest using the uncertain positions notation for this; for instance, as seen in c.(?_-126)_(*1406_?)del (by the way, Raymond, that's how you use a DBID link 😝).

Pros:

You don't need to think about how far from the gene to support NC-based positions.
It doesn't use positions outside of the transcript, so it doesn't break the rule.
It nicely groups all gene-based descriptions of whole-gene deletions, making it easy to find all associated phenotypes.
Humans can read it.
Databases can read it.
It matches existing HGVS rules, making it easier for people to remember how to produce the HGVS themselves, should they have to.

Perhaps I should shoot this into the HGVS committee, but I'm not part of those email conversations.

…to a transcript. - If variants partially map outside of a transcript, VV currently does not return a transcript mapping, but also no liftover. As such, we can do nothing. - See openvar/variantValidator#173. - Silently skipping these variants for now.

ifokkema · 2020-06-03T12:00:32Z

Although the original example no longer throws an error, NC_000001.10:g.1573181C>G still does (porcessing_error, with typo, and all mappings return a transcript_variant_error).

Peter-J-Freeman · 2020-06-03T12:02:48Z

That's odd. Wonder if I forgot to upload a version/merge some changes. I'll take a look ASAP

Sorry for the delays @ifokkema . Having to do some of the boring stuff currently!!!!!

ifokkema · 2020-06-03T12:57:27Z

Thanks and, no worries! I'm running my VV verification script now on the GV shared LOVD, so I might run into more weird things...

ifokkema · 2020-06-03T14:11:23Z

Another example, but strangely enough, Mutalyzer indicates NC_000001.10:g.16891340T>A is in the CDS of NM_017940.4. Yet, VV generates an error as if it's not inside of the transcript.

leicray · 2020-06-03T15:03:11Z

I suspect that this might be a gene containing an alignment gap because Alamut Visual does not show any alignments against RefSeq transcripts. What's more, the genome and transcript position numbers do not align. Perhaps easier to SHOW you this than to try an explain. How about a skype chat tomorrow?

leicray · 2020-06-03T15:08:13Z

I have just also noticed that the error message reads:

"flag": "porcessing_error"

instead of

"flag": "processing_error"

ifokkema · 2020-06-03T17:13:11Z

I suspect that this might be a gene containing an alignment gap because Alamut Visual does not show any alignments against RefSeq transcripts.

Normally, VV would throw a warning for this. Now, the gap_statement and gapped_alignment_warning fields are empty.

Perhaps this transcript has a very specific problem, but I think in general this is caused by the basic mapping mechanism (hgvs python module?) allowing for some distance between the variant and the transcript (like Mutalyzer allows 5000 bp upstream and 2000 bp downstream), but then VV not allowing any position outside of the transcript.

What's more, the genome and transcript position numbers do not align. Perhaps easier to SHOW you this than to try an explain. How about a skype chat tomorrow?

Hmm... yeah, this one I don't understand. I'm free tomorrow at any time, from 8:30 UK time. Just let me know ~30 minutes in advance, if possible.

I have just also noticed that the error message reads:

"flag": "porcessing_error"

instead of

"flag": "processing_error"

Yes, indeed; Pete already fixed it, but it somehow this change didn't find its way onto the server.

Peter-J-Freeman · 2020-06-03T17:15:12Z

Normally, VV would throw a warning for this. Now, the gap_statement and gapped_alignment_warning fields are empty.

Only if the position queried hits the gap.

Yes, indeed; Pete already fixed it, but it somehow this change didn't find its way onto the server.

Yep, pretty sure it's fixed. Can't replicate locally :)

Peter-J-Freeman · 2020-06-03T17:17:32Z

Perhaps this transcript has a very specific problem, but I think in general this is caused by the basic mapping mechanism (hgvs python module?) allowing for some distance between the variant and the transcript (like Mutalyzer allows 5000 bp upstream and 2000 bp downstream), but then VV not allowing any position outside of the transcript.

HGVS does not allow for any positions outside of the transcript. Mutalyzer is, as usual, non compliant with this rule.

That being said, it is possible our alignment for NM_017940.4 might be inaccurate. The update UTA will fix this. I'll check

Peter-J-Freeman · 2020-06-03T17:25:40Z

For NC_000001.10:g.1573181C>G I get locally (so def needs pushing up)

{
    "NC_000001.10:g.1573181C>G": {
        "errors": [],
        **"flag": "processing_error",**
        "NC_000001.10:g.1573181C>G": {
            "p_vcf": "1:1573181:C:G",
            "g_hgvs": "NC_000001.10:g.1573181C>G",
            "selected_build": "GRCh37",
            "genomic_variant_error": null,
            "hgvs_t_and_p": {
                "NM_033488.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033489.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033486.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033493.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033492.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                },
                "NM_033487.1": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.1573181C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1573181",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.1637819C>G",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "1637819",
                                    "ref": "C",
                                    "alt": "G"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                }
            }
        }
    },
    "metadata": {
        "variantvalidator_version": "1.0.4.dev11+g97aec97",
        "variantvalidator_hgvs_version": "1.2.5.vv1",
        "uta_schema": "uta_20180821",
        "seqrepo_db": "/Users/Shared/seqrepo_dumps/2018-08-21",
        "variantformatter_version": "1.0.2.dev7+gbdcbbe9.d20200316"
    }
}

CDK11B is indeed listed as a gapped alignment

The alignment is a total mess. According to UCSC

The RefSeq transcript has 3 substitutions, 2 non-frameshifting indels and aligns at 90% coverage compared to this genomic sequence
http://genome.ucsc.edu/cgi-bin/hgc?hgsid=842211639_EOu1QaRxab8C2lRBGfhouSqNEXvG&c=chr1&l=1573169&r=1573192&o=1570602&t=1590465&g=ncbiRefSeqCurated&i=NM_033489.3

Therefore the alignment provided by Mutalyzer will be absolute pants!!! :)

That being said, our alignment needs looking at but only once we have re-created UTA
Also, I don't quite understand how we are identifying transcripts here so warrants a look. I'll get on to that

ifokkema · 2020-06-03T17:31:34Z

Perhaps this transcript has a very specific problem, but I think in general this is caused by the basic mapping mechanism (hgvs python module?) allowing for some distance between the variant and the transcript (like Mutalyzer allows 5000 bp upstream and 2000 bp downstream), but then VV not allowing any position outside of the transcript.

HGVS does not allow for any positions outside of the transcript. Mutalyzer is, as usual, non compliant with this rule.

Hmm... So perhaps then the method that checks which transcripts should be aligned to, thinks the variant aligns, but then the actual alignment algorithm finds they don't align... Something like that?

The RefSeq transcript has 3 substitutions, 2 non-frameshifting indels and aligns at 90% coverage compared to this genomic sequence

Therefore the alignment provided by Mutalyzer will be absolute pants!!! :)

Uff... haha, good to know 😉

Peter-J-Freeman · 2020-06-03T17:42:31Z

For NC_000001.10:g.16891340T>A

VV gives

{
  "flag": "intergenic",
  "intergenic_variant_1": {
    "alt_genomic_loci": [],
    "gene_ids": {},
    "gene_symbol": "",
    "genome_context_intronic_sequence": "",
    "hgvs_lrg_transcript_variant": "",
    "hgvs_lrg_variant": "",
    "hgvs_predicted_protein_consequence": {
      "slr": "",
      "tlr": ""
    },
    "hgvs_refseqgene_variant": "",
    "hgvs_transcript_variant": "",
    "primary_assembly_loci": {
      "grch37": {
        "hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
        "vcf": {
          "alt": "A",
          "chr": "1",
          "pos": "16891340",
          "ref": "T"
        }
      },
      "grch38": {
        "hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
        "vcf": {
          "alt": "A",
          "chr": "1",
          "pos": "16564845",
          "ref": "T"
        }
      },
      "hg19": {
        "hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
        "vcf": {
          "alt": "A",
          "chr": "chr1",
          "pos": "16891340",
          "ref": "T"
        }
      },
      "hg38": {
        "hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
        "vcf": {
          "alt": "A",
          "chr": "chr1",
          "pos": "16564845",
          "ref": "T"
        }
      }
    },
    "reference_sequence_records": "",
    "refseqgene_context_intronic_sequence": "",
    "selected_assembly": "GRCh37",
    "submitted_variant": "NC_000001.10:g.16891340T>A",
    "transcript_description": "",
    "validation_warnings": [
      "No transcripts found that fully overlap the described variation in the genomic sequence"
    ]
  },
  "metadata": {
    "seqrepo_db": "2018-08-21",
    "uta_schema": "uta_20180821",
    "variantvalidator_hgvs_version": "1.2.5.vv1",
    "variantvalidator_version": "1.0.4.dev42+gbb2b0b7"
  }
}

Again, there seems to be something odd with the alignment of NBPF1. We do not have it down as a gap gene, but the UCSC status is

Other notes: The RefSeq transcript aligns at 95% coverage compared to this genomic sequence
However, this is for NM_017940.6 and not .4

LOVD EP produces

{
    "NC_000001.10:g.16891340T>A": {
        "errors": [],
        "flag": "processing_error",
        "NC_000001.10:g.16891340T>A": {
            "p_vcf": "1:16891340:T:A",
            "g_hgvs": "NC_000001.10:g.16891340T>A",
            "selected_build": "GRCh37",
            "genomic_variant_error": null,
            "hgvs_t_and_p": {
                "NM_017940.4": {
                    "t_hgvs": null,
                    "p_hgvs_tlc": null,
                    "p_hgvs_slc": null,
                    "gapped_alignment_warning": null,
                    "gap_statement": null,
                    "transcript_variant_error": "start or end or both are beyond the bounds of transcript record",
                    "primary_assembly_loci": {
                        "grch37": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "16891340",
                                    "ref": "T",
                                    "alt": "A"
                                }
                            }
                        },
                        "hg19": {
                            "NC_000001.10": {
                                "hgvs_genomic_description": "NC_000001.10:g.16891340T>A",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "16891340",
                                    "ref": "T",
                                    "alt": "A"
                                }
                            }
                        },
                        "grch38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
                                "vcf": {
                                    "chr": "1",
                                    "pos": "16564845",
                                    "ref": "T",
                                    "alt": "A"
                                }
                            }
                        },
                        "hg38": {
                            "NC_000001.11": {
                                "hgvs_genomic_description": "NC_000001.11:g.16564845T>A",
                                "vcf": {
                                    "chr": "chr1",
                                    "pos": "16564845",
                                    "ref": "T",
                                    "alt": "A"
                                }
                            }
                        }
                    },
                    "alt_genomic_loci": {
                        "grch37": {},
                        "hg19": {},
                        "grch38": {},
                        "hg38": {}
                    }
                }
            }
        }
    },
    "metadata": {
        "variantvalidator_version": "1.0.4.dev11+g97aec97",
        "variantvalidator_hgvs_version": "1.2.5.vv1",
        "uta_schema": "uta_20180821",
        "seqrepo_db": "/Users/Shared/seqrepo_dumps/2018-08-21",
        "variantformatter_version": "1.0.2.dev7+gbdcbbe9.d20200316"
    }
}

So again the transcript is identified. I'll do some additional digging. I am still thinking out alignment may need an update though

Peter-J-Freeman · 2020-06-04T12:25:28Z

OK, so here is our alignment for [['NM_017940.4', 'NC_000001.10', -1, 'splign', 16888921, 16940100]]

It's a total mess of a gene. I would not trust UCSC or Mjutalyzer. I suspect our alignment may also need an update, but it goes to show how bad this gene is. I'll look at the actual positions later and see why the genomic position entered may be being returned as out-of-bounds currently then check the RefSeq alignment

['gene', 'tx', 'chr', 'aln', 'ori', 'exon', 'tx_st', 'tx_end', 'chr_st', 'chr_end', 'cigar']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 24, 3672, 3845, 16893674, 16893846, '96=1X75=1D']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 23, 3620, 3672, 16894473, 16894525, '52=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 22, 3456, 3620, 16895567, 16895731, '97=1X66=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 21, 3404, 3456, 16899636, 16899688, '52=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 20, 3198, 3404, 16900981, 16901187, '8=1X197=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 19, 3125, 3198, 16901651, 16901724, '10=1X3=1X58=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 18, 2910, 3125, 16902761, 16902976, '33=1X48=1X109=1X3=1X18=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 17, 2807, 2910, 16903811, 16903914, '2=1X29=1X70=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 16, 2597, 2807, 16905687, 16905897, '94=1X2=1X10=2X100=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 15, 2385, 2597, 16907239, 16907451, '123=1X88=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 14, 2312, 2385, 16907914, 16907987, '73=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 13, 2097, 2312, 16909038, 16909253, '124=1X67=1X3=1X4=1X13=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 12, 1994, 2097, 16910088, 16910191, '14=1X88=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 11, 1784, 1994, 16911983, 16912193, '94=1X2=1X10=2X69=1X30=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 10, 1572, 1784, 16913544, 16913756, '79=1X43=1X88=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 9, 1499, 1572, 16914219, 16914292, '73=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 8, 1284, 1499, 16915343, 16915558, '33=1X86=1X3=1X67=1X3=1X4=1X13=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 7, 1181, 1284, 16916393, 16916496, '103=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 6, 971, 1181, 16918341, 16918551, '49=1X160=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 5, 816, 971, 16918653, 16918808, '115=1X39=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 4, 689, 816, 16919935, 16920062, '127=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 3, 619, 689, 16921086, 16921156, '70=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 2, 540, 619, 16921425, 16921504, '79=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 1, 268, 540, 16935002, 16935274, '272=']
['NBPF1', 'NM_017940.4', 'NC_000001.10', 'splign', -1, 0, 0, 268, 16939832, 16940100, '268=']

ifokkema · 2020-06-04T13:29:05Z

OK, so here is our alignment for [['NM_017940.4', 'NC_000001.10', -1, 'splign', 16888921, 16940100]]

OK, so this is odd, right? 16891340 doesn't overlap with any exon there, but it does with your genomic range above. That range is bigger by 4753 bases downstream of the gene (5' in the chromosome). I don't know what data comes from where, but shouldn't the numbers you mention there simply be based on the smallest and largest positions in the list of exon alignments?

Peter-J-Freeman · 2020-06-04T18:47:14Z

No, that's genome annotation, we are providing alignments hence the CIGAR MUST be considered which tools like UCSC, Mutalyzer and others fail to do. I need to look more closely, but the alignment is a mess. Will look ASAP

ifokkema · 2020-06-05T07:18:34Z

Of course, you should consider the CIGAR, but that's not my point - you have the strange situation where VV reports a transcript to overlap a position, and then it says it doesn't. Well, your transcript lengths are conflicting right there, so doesn't that explain the strange behavior in VV?

Will look ASAP

No stress, take your time on this. I'm running VV on our entire LOVD instance, I'll be busy for a long time with this...

Peter-J-Freeman · 2020-06-05T07:39:07Z

Absolutely, agree with Well, your transcript lengths are conflicting right there, so doesn't that explain the strange behavior in VV?

Need a puzzle to solve :)

Peter-J-Freeman · 2020-06-05T10:11:33Z

Sorry, I see what you mean now. This is really unusual.

I'm going to look into how the hgvs library fetches that range.

Peter-J-Freeman · 2020-06-05T10:22:04Z

So the method used is in here
def get_tx_for_region

def get_tx_for_region(self, alt_ac, alt_aln_method, start_i, end_i):
        """
        return transcripts that overlap given region
        :param str alt_ac: reference sequence (e.g., NC_000007.13)
        :param str alt_aln_method: alignment method (e.g., splign)
        :param int start_i: 5' bound of region
        :param int end_i: 3' bound of region
        """
        return self._fetchall(self._queries['tx_for_region'], [alt_ac, alt_aln_method, start_i, end_i])

So the range is as you suspected the genomic coordinate bounds for the transcript. So the question is why is there no exon either side of this position???? The query position is

vfo.hdp.get_tx_for_region(hgvs_genomic.ac, 'splign', hgvs_genomic.posedit.pos.start.base,
                                            hgvs_genomic.posedit.pos.end.base - 1)

where hgvs_genomic is NC_000001.10:g.16891340T>A

Peter-J-Freeman · 2020-06-05T10:32:55Z

So I suspect the errors are with UTA. Will email this link to Jon to look at in the new build.

Jon, note we are looking at variants NC_000001.10:g.16891340T>A and NC_000001.10:g.1573181C>G

Peter-J-Freeman · 2020-06-05T10:38:40Z

Alignments for second variant

['gene', 'tx', 'chr', 'aln', 'ori', 'exon', 'tx_st', 'tx_end', 'chr_st', 'chr_end', 'cigar']
['CDK11B', 'NM_033488.1', 'NC_000001.10', 'splign', -1, 3, 269, 390, 1586818, 1586938, '116=2X2=1D']
['CDK11B', 'NM_033488.1', 'NC_000001.10', 'splign', -1, 2, 222, 269, 1588705, 1588752, '47=']
['CDK11B', 'NM_033488.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033488.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033489.1', 'NC_000001.10', 'splign', -1, 3, 269, 390, 1586818, 1586938, '116=2X2=1D']
['CDK11B', 'NM_033489.1', 'NC_000001.10', 'splign', -1, 2, 222, 269, 1588705, 1588752, '47=']
['CDK11B', 'NM_033489.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033489.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033486.1', 'NC_000001.10', 'splign', -1, 2, 222, 343, 1586818, 1586938, '116=2X2=1D']
['CDK11B', 'NM_033486.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033486.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033493.1', 'NC_000001.10', 'splign', -1, 2, 222, 343, 1586818, 1586938, '57=1X58=2X2=1D']
['CDK11B', 'NM_033493.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033493.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033492.1', 'NC_000001.10', 'splign', -1, 2, 222, 343, 1586818, 1586938, '57=1X58=2X2=1D']
['CDK11B', 'NM_033492.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033492.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']
['CDK11B', 'NM_033487.1', 'NC_000001.10', 'splign', -1, 2, 222, 343, 1586817, 1586938, '116=2X3=']
['CDK11B', 'NM_033487.1', 'NC_000001.10', 'splign', -1, 1, 98, 222, 1588824, 1588948, '124=']
['CDK11B', 'NM_033487.1', 'NC_000001.10', 'splign', -1, 0, 0, 98, 1590374, 1590473, '31=1I50=2X15=']

And

[['NM_033488.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033489.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033486.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033493.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033492.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473], ['NM_033487.1', 'NC_000001.10', -1, 'splign', 1571099, 1590473]]

Again it looks like the range provided by hdp.get_tx_for_region doesn't match the range provided in the alignments.

we also need to cross reference against RefSeq

ifokkema · 2020-06-05T16:16:53Z

Alignments for second variant
(...)
Again it looks like the range provided by hdp.get_tx_for_region doesn't match the range provided in the alignments.

Here, the difference is even a staggering 15719 basepairs. I'm really curious where this comes from...

Peter-J-Freeman · 2020-06-05T16:19:17Z

We think it is a balls up when the transcript record was updated. We think the alignment data was updated but the range data returned by vfo.hdp.get_tx_for_region were not.

This will be fixed with our UTA build. John is already looking into it.

FYI, I think NM_017940.4 has been deprecated.

ifokkema · 2020-06-10T11:10:59Z

Just to link; related to #152.

ifokkema · 2020-06-22T10:41:05Z

Just to update; I have just sent out a survey to all our curators in the GV shared LOVD; they're asked to vote on the proposed description on cDNA level for variants (partially) outside of the transcript's reference sequence.
Options that they can vote for, given NC_000016.9:g.2106894_2161281del as an example:
A. "-" (an empty description)
B. "c.3887_*32834del"
C. "c.3887_*1017del"
D. "c.3887_(*1017_?)del"
E. "c.3887_*1017[0]"
F. "c.3887_*1017{0}"

I'll keep you updated on the results 😉

ifokkema · 2020-06-22T15:15:29Z

Another variant that voluntarily maps to transcripts (10, in this case), and then throws a transcript_variant_error for each of them. NC_000007.13:g.50468071G>A. Variant should map to the coding region of multiple transcripts (according to UCSC genome browser).

leicray · 2020-06-22T16:32:36Z

That's very strange. If I validate NC_000007.13:g.50468071G>A using the interactive validator, it returns the warning:
"No transcripts found that fully overlap the described variation in the genomic sequence".
What's the evidence that there are 10 transcripts that span the variant?

ifokkema · 2020-06-22T17:00:10Z

What's the evidence that there are 10 transcripts that span the variant?

That is what the API says, I always use the APIs now. See the link that I posted (the variant holds the link). I bet the interface discards all ten transcripts before reporting that there is no overlap.

Peter-J-Freeman · 2020-06-22T17:11:56Z

The LOVD endpoint and the VV endpoint have slightly different processing. I still believe this is an alignment issue which Johns work will corrcect

ifokkema · 2020-06-23T12:37:30Z

OK, cool! Would you like more examples when I run into them? Or perhaps that's just overkill... I really think that once John is done with rebuilding the database, I should just run all verifications again 😅 It's quite slow at the moment though (meaning, we have a lot of data); about 2% of the database contents is checked per day.

ifokkema · 2020-06-24T15:53:19Z

In case you're interested, here's another one, but special because it also shows that UTA misses a new transcript version there. NC_000008.10:g.144940290C>G voluntarily maps to NM_031308.2 and then throws an error. Mutalyzer claims it should map to NM_031308.3:c.7129G>C (note the newer version!).

The version 2 of the transcript does in principle work, because NC_000008.10:g.144940327G>A maps fine to NM_031308.2:c.7095C>T (mutalyzer agrees on the position). There's no hg38 mapping, but that might also be because this is an older version of the transcript.

Peter-J-Freeman · 2020-06-24T16:36:29Z

OK, cool! Would you like more examples when I run into them? Or perhaps that's just overkill... I really think that once John is done with rebuilding the database, I should just run all verifications again 😅 It's quite slow at the moment though (meaning, we have a lot of data); about 2% of the database contents is checked per day.

If you can be bothered to post them then yes please because I add them to the PyTests

ifokkema · 2020-07-07T09:56:49Z

Just to update; I have just sent out a survey to all our curators in the GV shared LOVD; they're asked to vote on the proposed description on cDNA level for variants (partially) outside of the transcript's reference sequence.
Options that they can vote for, given NC_000016.9:g.2106894_2161281del as an example:
A. "-" (an empty description)
B. "c.3887_*32834del"
C. "c.3887_*1017del"
D. "c.3887_(*1017_?)del"
E. "c.3887_*1017[0]"
F. "c.3887_*1017{0}"

Update: These are the results of the survey:

A majority of curators wants to see a variant description that acknowledges the size of the variant. I was reminded by one of the curators of the previously suggested nomenclature: "c.3887_*1017+d31817del". The HGVS earlier rejected this proposal, but well, having a description at all for this variant has been dismissed by the HGVS. This option, together with option B above, would also allow us to map back to the genome, as long as we store which reference sequence was used to describe the variant in the first place (like "NC_000016.9(NM_000296.3):c.3887_*32834del"). In that sense, it'll be just like an intronic variant.

ifokkema · 2022-01-20T14:28:53Z

Although this issue started with reporting the typo in the error, we also discussed handling variants that (partially or wholly) fall outside of genes. What do you think we should do with that? Or do you want me to submit a new report?

E.g.,
NC_000016.9:g.2106894_2161281del

Currently, VV returns no mapping at all, even though it partially overlaps a gene (see above for some possible descriptions that we can use).

Peter-J-Freeman · 2022-01-20T14:43:23Z

I think open a new report @ifokkema.
Perhaps add the survey and expand a little more on what it means and what you want to see. Add as feature request

Peter-J-Freeman · 2022-01-20T14:46:43Z

p.s. does c.3887_*32834del go beyond the length of the transcript. If so, I. will not accept this as it is an illegal descriptions because you cannot just infer bases in a reference sequence that do not exist.

ifokkema · 2022-01-20T15:08:19Z

Sure, I'll open a new request!
Yes, c.3887_*32834del exceeds the length of the 3' UTR, but note that the full suggested description is NC_000016.9(NM_001009944.2):c.3887_*32834del which doesn't infer bases but instead relies on the mapping with the NC. That way, it's no different from NC_000016.9:g.2145157_2163814del, which maps to NC_000016.9(NM_001009944.2):c.2853+357_10499+1992del. Neither the start nor the end of that deletion can be found in the NM reference sequence.

Peter-J-Freeman · 2022-01-20T15:10:26Z

Open the new issue and I will comment there

ifokkema · 2022-01-20T17:45:54Z

Excellent! Opened #333.

ifokkema mentioned this issue May 6, 2020

Large deletions without hg38 liftover, shorter deletions do work #186

Closed

ifokkema mentioned this issue Jun 10, 2020

Variants that don't return a hg38 mapping. #201

Open

ifokkema mentioned this issue Aug 10, 2020

NC(NM) notation doesn't work for mismatches between NC and NM #218

Closed

Peter-J-Freeman closed this as completed Jan 20, 2022

LOVD endpoint: Variants crossing gene boundaries generate "porcessing_error". #173

LOVD endpoint: Variants crossing gene boundaries generate "porcessing_error". #173

Comments

ifokkema commented Apr 28, 2020

Peter-J-Freeman commented Apr 28, 2020

ifokkema commented Apr 28, 2020

Peter-J-Freeman commented May 1, 2020

ifokkema commented May 1, 2020

ifokkema commented May 28, 2020

ifokkema commented Jun 3, 2020

Peter-J-Freeman commented Jun 3, 2020

ifokkema commented Jun 3, 2020

ifokkema commented Jun 3, 2020

leicray commented Jun 3, 2020

leicray commented Jun 3, 2020

ifokkema commented Jun 3, 2020

Peter-J-Freeman commented Jun 3, 2020

Peter-J-Freeman commented Jun 3, 2020

Peter-J-Freeman commented Jun 3, 2020 • edited Loading

ifokkema commented Jun 3, 2020

Peter-J-Freeman commented Jun 3, 2020

Peter-J-Freeman commented Jun 4, 2020 • edited Loading

ifokkema commented Jun 4, 2020

Peter-J-Freeman commented Jun 4, 2020

ifokkema commented Jun 5, 2020

Peter-J-Freeman commented Jun 5, 2020

Peter-J-Freeman commented Jun 5, 2020

Peter-J-Freeman commented Jun 5, 2020

Peter-J-Freeman commented Jun 5, 2020

Peter-J-Freeman commented Jun 5, 2020 • edited Loading

ifokkema commented Jun 5, 2020

Peter-J-Freeman commented Jun 5, 2020

ifokkema commented Jun 10, 2020

ifokkema commented Jun 22, 2020

ifokkema commented Jun 22, 2020 • edited Loading

leicray commented Jun 22, 2020

ifokkema commented Jun 22, 2020

Peter-J-Freeman commented Jun 22, 2020

ifokkema commented Jun 23, 2020

ifokkema commented Jun 24, 2020

Peter-J-Freeman commented Jun 24, 2020

ifokkema commented Jul 7, 2020

ifokkema commented Jan 20, 2022

Peter-J-Freeman commented Jan 20, 2022

Peter-J-Freeman commented Jan 20, 2022

ifokkema commented Jan 20, 2022

Peter-J-Freeman commented Jan 20, 2022

ifokkema commented Jan 20, 2022

Peter-J-Freeman commented Jun 3, 2020 •

edited

Loading

Peter-J-Freeman commented Jun 4, 2020 •

edited

Loading

Peter-J-Freeman commented Jun 5, 2020 •

edited

Loading

ifokkema commented Jun 22, 2020 •

edited

Loading