Erroneous behaviour searching for database using some HGNC: IDs #186

Open
Mani-varma1 opened this issue Dec 2, 2024 · 8 comments
@Mani-varma1

Endpoint: /VariantValidator/tools/gene2transcripts_v2/{gene_query}/{limit_transcripts}/{transcript_set}/{genome_build}
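
A minimal sketch of how a request URL for this endpoint can be assembled, so that the colon in an HGNC ID is percent-encoded the same way as in the example URLs in this report. The helper name `build_gene2transcripts_url` is hypothetical; the host, path template, and `content-type` query parameter are taken from the URLs quoted in this issue.

```python
from urllib.parse import quote, urlencode

# Base path copied from the example URLs in this issue.
BASE = "https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2"


def build_gene2transcripts_url(gene_query, limit_transcripts, transcript_set, genome_build):
    """Assemble the request URL, percent-encoding every path segment.

    Hypothetical helper: it only mirrors the path template
    {gene_query}/{limit_transcripts}/{transcript_set}/{genome_build}.
    """
    segments = [gene_query, limit_transcripts, transcript_set, genome_build]
    # safe="" makes quote() encode ":" as %3A (and would encode "/" too).
    path = "/".join(quote(str(segment), safe="") for segment in segments)
    query = urlencode({"content-type": "application/json"})
    return f"{BASE}/{path}?{query}"


url = build_gene2transcripts_url("HGNC:32700", "mane_select", "all", "GRCh38")
print(url)
```

Encoding each segment separately keeps a gene query such as `HGNC:32700` intact as a single path component.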

Not all genes return a correct response when querying the database. About 73 HGNC IDs do not give the correct output when data is requested. There are two ways the request can fail.

1)

E.g., searching for HGNC:32700 returns the following response:
URL: https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A32700/mane_select/all/GRCh38?content-type=application%2Fjson

Response Body:
[
  {
    "error": "Unable to recognise gene symbol DNAAF19",
    "requested_symbol": "DNAAF19"
  }
]

This might be because the record matching this query has been updated to the new gene symbol DNAAF19 in one table, while the original symbol for this record is CCDC103, so the lookup does not match. Searching with the old gene symbol gives an accurate output:

Response Body:
[
  {
    "current_name": "coiled-coil domain containing 103",
    "current_symbol": "CCDC103",
    "hgnc": "HGNC:32700",
    "previous_symbol": "",
    "requested_symbol": "CCDC103",
    "transcripts": [

Also note that searching with "DNAAF19" directly gives the same error: "Unable to recognise gene symbol DNAAF19".

Discovery: Itebbs22/SoftwareDevelopmentVIMMO#19

2)

Some gene symbols throw an internal server error, potentially because there is no record for them or for some other unknown reason.

E.g., searching for the gene HGNC:12029 throws a server error instead of a "no match found" message.
URL: https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/TRAC/False/all/GRCh38?content-type=application%2Fjson

Response Body:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at 
 [no address given] to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
<hr>
<address>Apache Server at rest.variantvalidator.org Port 80</address>
</body></html>

For a list of problem URLs, please check: https://github.com/Itebbs22/SoftwareDevelopmentVIMMO/blob/BED-HGNC_FIX/problem_url

For a list of problem HGNC IDs, please check: https://github.com/Itebbs22/SoftwareDevelopmentVIMMO/blob/BED-HGNC_FIX/prob_gene.txt

However, this list also includes genes that had a match but returned an empty transcripts value. I added them because my initial assumption was that the endpoint shouldn't return empty values.

E.g., when specifically looking for a gene (HGNC:4713) in mane_select transcripts in GRCh38:
URL: https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A4713/mane_select/all/GRCh38?content-type=application%2Fjson

Response Body:
[
  {
    "current_name": "H19 imprinted maternally expressed transcript",
    "current_symbol": "H19",
    "hgnc": "HGNC:4713",
    "previous_symbol": "",
    "requested_symbol": "H19",
    "transcripts": []
  }
]

However, if I don't limit the query to mane_select, some records are returned with an accurate output. My code will later be updated to include only genes with mismatching queries, as described in the first section, and URLs that throw an internal server error, so the list will eventually contain only the problematic ones.
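
The three outcomes described above (a JSON error payload, an HTML 500 page, and a match with an empty transcripts list) can be told apart with a small classifier, which is one way to keep empty-but-valid results out of the problem list. This is only a sketch: the function name and category labels are hypothetical, and the sample bodies are shortened versions of the responses quoted in this report.

```python
import json


def classify_response(status_code, body):
    """Sort a response into 'server_error', 'unparseable', 'error_payload',
    'empty_transcripts' or 'ok'. Labels are illustrative, not part of the API."""
    if status_code >= 500:
        return "server_error"
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "unparseable"
    # The endpoint returns a list of records; tolerate a bare object as well.
    records = payload if isinstance(payload, list) else [payload]
    if any(isinstance(record, dict) and "error" in record for record in records):
        return "error_payload"
    if all(isinstance(record, dict) and not record.get("transcripts") for record in records):
        return "empty_transcripts"
    return "ok"


error_body = '[{"error": "Unable to recognise gene symbol DNAAF19", "requested_symbol": "DNAAF19"}]'
empty_body = '[{"current_symbol": "H19", "hgnc": "HGNC:4713", "transcripts": []}]'
html_body = "<title>500 Internal Server Error</title>"

print(classify_response(200, error_body))
print(classify_response(200, empty_body))
print(classify_response(500, html_body))
```

With labels like these, the problem list could be restricted to `error_payload` and `server_error`, leaving `empty_transcripts` as a separate, non-failing category.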

Also note that this was only tested on a very small subset of the endpoint, using only the genes in PanelApp, i.e., mane_select on GRCh38 for each gene. I did not test genes that are not part of PanelApp, or GRCh37; I can update this at a later date.

Potential Issue

Another issue: setting limit_transcripts to False seems to fail with an internal server error for some genes, possibly due to a timeout on bigger genes, as the request loads for a bit and then fails. This is not universal behaviour, as some genes return a good response.
E.g., HGNC:1100 (BRCA1)
URL: https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A1100/False/all/GRCh38?content-type=application%2Fjson
I am not sure why this happens and have not checked other genes for this behaviour.

@Peter-J-Freeman
Contributor

Thanks @Mani-varma1. Will look at this now. I happen to be working on this functionality at the moment anyway. Thanks for the report. Will try to push a patch soon. Check back in here for info on the fixes so you can approve.

@Peter-J-Freeman
Contributor

Peter-J-Freeman commented Dec 4, 2024

E.g, searching for HGNC:32700 returns the following response:

OK, not a perfect fix since we need a database update. Looks like half our data is updated, but I have added the following fallback:

import json
import VariantValidator
vval = VariantValidator.Validator()
gene = 'HGNC:32700'
g_and_t = vval.gene2transcripts(gene, vval)
print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))

will now return

{
    "current_name": "coiled-coil domain containing 103",
    "current_symbol": "DNAAF19",
    "hgnc": "HGNC:32700",
    "previous_symbol": "CCDC103",
    "requested_symbol": "DNAAF19",
    "transcripts": [
        {
            "annotations": {

....

So the current name will be incorrect until the database updates, but the symbols and IDs will be correct

This might be because the record matching this query in a specific table is updated to new gene symbol DNAAF19, however the original symbol for this record is CCDC103, so doesn't match it. When trying to search using the old gene symbol gives you an accurate output

Yes, this was the case. This is now also updated and will accept the updated gene symbol as well.

@Peter-J-Freeman
Contributor

Peter-J-Freeman commented Dec 4, 2024

E.g, searching for the gene HGNC:12029 throws server error instead of no match found message.

This is a weird case. There are no transcripts for this gene, so we do not have a record for it.

See https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:12029 and https://www.ncbi.nlm.nih.gov/gene/28755

This means we have no way of getting from the HGNC ID to the gene symbol. You can provide the gene symbol, but there are no transcripts.

This will now return

{
    "error": "'Unable to recognise HGNC ID. PLease provide a gene symbol",
    "requested_id": "HGNC:12029"
}

Peter-J-Freeman added a commit to openvar/variantValidator that referenced this issue Dec 4, 2024
…ing and HGNC genes with no transcript info openvar/rest_variantValidator#186 and also handle the longer deletions in #651
@Peter-J-Freeman
Contributor

@Mani-varma1 a patch is pushed up ready for testing

@ifokkema

ifokkema commented Dec 5, 2024

This means we have no way of getting from the HGNC ID to the gene symbol.

Downloads from the HGNC could be used for that; we use those downloads for HGNC ID <-> gene symbol conversions. It would require an additional resource to handle and perhaps some conflict resolution, but it'll give you a bit more freedom, e.g., weekly downloads would help stay up to date with the latest symbols while not having to rebuild your entire database.
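
As a rough sketch of what such an additional resource could look like, the snippet below builds HGNC ID <-> gene symbol lookups from a tab-separated download. The column names (`hgnc_id`, `symbol`, `prev_symbol`) and the pipe separator for previous symbols are assumptions about the download format and should be checked against the actual file. The sample rows are illustrative: only the ID and symbol pairs come from this thread, and the blank `prev_symbol` for TRAC is a placeholder.

```python
import csv
import io

# Illustrative stand-in for a downloaded HGNC table (assumed column names).
SAMPLE_TSV = (
    "hgnc_id\tsymbol\tprev_symbol\n"
    "HGNC:32700\tDNAAF19\tCCDC103\n"
    "HGNC:12029\tTRAC\t\n"
)


def load_hgnc_table(handle):
    """Return (id_to_symbol, symbol_to_id) built from a tab-separated file."""
    id_to_symbol = {}
    symbol_to_id = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        hgnc_id, symbol = row["hgnc_id"], row["symbol"]
        id_to_symbol[hgnc_id] = symbol
        # A current symbol always wins over a previous symbol of another gene.
        symbol_to_id[symbol] = hgnc_id
        for previous in filter(None, row["prev_symbol"].split("|")):
            # setdefault: never overwrite an entry that is already present.
            symbol_to_id.setdefault(previous, hgnc_id)
    return id_to_symbol, symbol_to_id


id_to_symbol, symbol_to_id = load_hgnc_table(io.StringIO(SAMPLE_TSV))
print(id_to_symbol["HGNC:12029"])
print(symbol_to_id["CCDC103"])
```

A table like this would resolve an HGNC ID to a symbol even when no transcripts exist for the gene, and letting current symbols take precedence is one simple form of the conflict resolution mentioned above.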

@Peter-J-Freeman
Contributor

We do get data from HGNC downloads already. We just don't yet store an independent HGNC ID to gene symbol table when there are no transcripts. Will keep this open for when I have time to look at it, but I do not want to have to rebuild the database at this stage since we are planning a large-scale overhaul.

@Peter-J-Freeman
Contributor

@John-F-Wagstaff , one to be aware of when we start planning the new VVTA Validator merge

@Mani-varma1
Author

Mani-varma1 commented Dec 18, 2024

Sorry for the delay; here is a quick update. I believe most of the issues with chromosomal records are fixed, but some records are still failing. This could be because some records are withdrawn, but that is not always the case.

As an example,
HGNC ID: 12029 / gene symbol: TRAC is still causing an issue, but I can't figure out why. Searching with either the HGNC ID or the gene symbol did not work.
URL for HGNC ID: https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A12029/select/all/GRCh38
URL for gene symbol: https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/TRAC/select/all/GRCh38

Response body
[
  {
    "error": "Unable to retrieve data from the VVTA, please contact admin",
    "requested_symbol": "TRAC"
  }
]

A few examples of genes that have an HGNC record but return nothing, even when the gene symbol is used:

  • HGNC:12348 / TRU-TCA1-1, old: TRSP, TRNAU1
  • HGNC:32925 / ATXN8
  • HGNC:5716 / IGKC

Also, a lot of issues still persist with some of the mitochondrial records.
HGNC:7414 fails, but works if we use the old/new gene symbol (MT-ATP6).

Response body
[
  {
    "error": "Unable to recognise gene symbol MT",
    "requested_symbol": "MT"
  }
]

I think this might be because the new gene symbols contain a hyphen (MT-xxxx), so only the initial part is parsed and looked up.

Here are some other examples

  • HGNC:7415
  • HGNC:7419
  • HGNC:7421
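
The hyphen hypothesis above can be illustrated with a tiny sketch. Both functions are hypothetical and are not taken from the VariantValidator code base; the first only shows how treating "-" as a delimiter would turn MT-ATP6 into the "MT" seen in the error response, and the second shows the symbol being kept whole.

```python
def truncating_parse(symbol):
    """Hypothetical buggy behaviour: treat '-' as a delimiter and keep the first part."""
    return symbol.split("-")[0]


def whole_symbol_parse(symbol):
    """Hypothetical alternative: keep the full symbol, only trimming whitespace."""
    return symbol.strip()


for symbol in ["MT-ATP6", "CCDC103"]:
    print(symbol, "->", truncating_parse(symbol), "vs", whole_symbol_parse(symbol))
```

Symbols without a hyphen are unaffected by either function, which would be consistent with the failures clustering in the hyphenated MT-xxxx symbols.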


3 participants