Erroneous behaviour searching for database using some HGNC: IDs #186
Comments
Thanks @Mani-varma1. Will look at this now. I happen to be working on this functionality at the moment anyway. Thanks for the report. Will try to push a patch soon. Check back in here for info on the fixes so you can approve.
OK, not a perfect fix since we need a database update. Looks like half our data is updated, but I have added the following fallback:
import json
import VariantValidator
vval = VariantValidator.Validator()
gene = 'HGNC:32700'
g_and_t = vval.gene2transcripts(gene, vval)
print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))
will now return
{
"current_name": "coiled-coil domain containing 103",
"current_symbol": "DNAAF19",
"hgnc": "HGNC:32700",
"previous_symbol": "CCDC103",
"requested_symbol": "DNAAF19",
"transcripts": [
{
"annotations": {
....
So the current name will be incorrect until the database updates, but the symbols and IDs will be correct.
Yes, this was the case. This is now also updated and will accept the updated gene symbol as well.
This is a weird case. There are no transcripts for this gene, so we do not have a record for it. See https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:12029 and https://www.ncbi.nlm.nih.gov/gene/28755. This means we have no way of getting from the HGNC ID to the gene symbol. You can provide the gene symbol, but there are no transcripts. This will now return
{
"error": "'Unable to recognise HGNC ID. PLease provide a gene symbol",
"requested_id": "HGNC:12029"
}
…ing and HGNC genes with no transcript info openvar/rest_variantValidator#186 and also handle the longer deletions in #651
@Mani-varma1 a patch is pushed up, ready for testing.
Downloads from the HGNC could be used for that; we use those downloads for HGNC ID <-> gene symbol conversions. It would require an additional resource to handle and perhaps some conflict resolution, but it'll give you a bit more freedom, e.g., weekly downloads would help stay up to date with the latest symbols while not having to rebuild your entire database.
We do get data from HGNC downloads already. We just don't yet store an independent HGNC ID to gene symbol table when there are no transcripts. Will keep this open for when I have time to look at it, but I do not want to have to rebuild the database at this stage since we are planning a large-scale overhaul.
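To make the idea of an independent HGNC ID to gene symbol table concrete, here is a minimal sketch. It is not how VariantValidator stores its data. The column names (`hgnc_id`, `symbol`, `prev_symbol`) and the pipe-separated previous symbols are assumptions about the HGNC download layout, and the two sample rows are built only from the identifiers mentioned in this thread. `build_lookup` and `resolve` are hypothetical helper names.

```python
import csv
import io

# Hypothetical excerpt in the style of an HGNC download (assumed column names).
# The IDs and symbols are the ones discussed in this thread.
SAMPLE_TSV = (
    "hgnc_id\tsymbol\tprev_symbol\n"
    "HGNC:32700\tDNAAF19\tCCDC103\n"
    "HGNC:12029\tTRAC\t\n"
)


def build_lookup(tsv_text):
    """Build ID -> current symbol and symbol -> ID tables from TSV text."""
    id_to_symbol = {}
    symbol_to_id = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        hgnc_id, symbol = row["hgnc_id"], row["symbol"]
        id_to_symbol[hgnc_id] = symbol
        # A current symbol always wins over a previous symbol of another gene.
        symbol_to_id[symbol] = hgnc_id
        for prev in filter(None, row["prev_symbol"].split("|")):
            # Previous symbols only fill gaps; they never overwrite an entry.
            symbol_to_id.setdefault(prev, hgnc_id)
    return id_to_symbol, symbol_to_id


def resolve(query, id_to_symbol, symbol_to_id):
    """Return (hgnc_id, current_symbol) for an HGNC ID or symbol, else None."""
    query = query.strip()
    hgnc_id = query if query.startswith("HGNC:") else symbol_to_id.get(query)
    if hgnc_id in id_to_symbol:
        return hgnc_id, id_to_symbol[hgnc_id]
    return None


ID_TO_SYMBOL, SYMBOL_TO_ID = build_lookup(SAMPLE_TSV)
print(resolve("HGNC:12029", ID_TO_SYMBOL, SYMBOL_TO_ID))
print(resolve("CCDC103", ID_TO_SYMBOL, SYMBOL_TO_ID))
```

Because the table is keyed on the HGNC ID rather than on transcript records, a gene with no transcripts can still be resolved to its symbol, and an old symbol can be mapped to the current one without a database rebuild.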
@John-F-Wagstaff , one to be aware of when we start planning the new VVTA Validator merge |
Sorry for the delay; a quick update. I believe most of the issues with chromosomal records are fixed, but some records still fail. This could be because some records are withdrawn, but that is not always the case. As an example:
A few examples of genes that have an HGNC record but return nothing, even when queried by gene symbol:
A lot of issues also persist with some of the mitochondrial records.
I think this might be because the new gene symbols contain a hyphen (MT-xxxx), so only the initial part is parsed and looked up. Here are some other examples:
Endpoint: /VariantValidator/tools/gene2transcripts_v2/{gene_query}/{limit_transcripts}/{transcript_set}/{genome_build}
Not all genes return a correct response when querying the database. There is a list of about 73 HGNC IDs that do not produce the correct output when requesting data. There are 2 ways it can fail.
1)
E.g., searching for HGNC:32700 returns the following response:
URL: https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A32700/mane_select/all/GRCh38?content-type=application%2Fjson
This might be because the record matching this query in a specific table has been updated to the new gene symbol DNAAF19, whereas the original symbol for this record is CCDC103, so the two do not match. Searching with the old gene symbol gives an accurate output.
Also note that searching with "DNAAF19" gives the same "Unable to recognise DNAAF19 symbol" error.
Discovery: Itebbs22/SoftwareDevelopmentVIMMO#19
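For anyone reproducing these requests, the URL above can be assembled from the endpoint template as in this small sketch. The helper name `build_gene2transcripts_url` is hypothetical; only the base URL, the path template and the query string are taken from this report.

```python
from urllib.parse import quote

BASE_URL = "https://rest.variantvalidator.org"
ENDPOINT = "/VariantValidator/tools/gene2transcripts_v2"


def build_gene2transcripts_url(gene_query, limit_transcripts, transcript_set, genome_build):
    """Fill the endpoint template, percent-encoding each path segment."""
    segments = [gene_query, limit_transcripts, transcript_set, genome_build]
    # safe="" makes quote() encode ':' too, so "HGNC:32700" becomes "HGNC%3A32700".
    path = "/".join(quote(str(segment), safe="") for segment in segments)
    return f"{BASE_URL}{ENDPOINT}/{path}?content-type=application%2Fjson"


url = build_gene2transcripts_url("HGNC:32700", "mane_select", "all", "GRCh38")
print(url)
```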
2)
Some gene symbols throw an internal server error, potentially because there is no record for them or for some unknown reason.
E.g., searching for the gene HGNC:12029 throws a server error instead of a "no match found" message.
URL:https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/TRAC/False/all/GRCh38?content-type=application%2Fjson
For a list of problem URLs please check: https://github.com/Itebbs22/SoftwareDevelopmentVIMMO/blob/BED-HGNC_FIX/problem_url
For a list of problem HGNC IDs please check: https://github.com/Itebbs22/SoftwareDevelopmentVIMMO/blob/BED-HGNC_FIX/prob_gene.txt
However, I'd like to point out that this list also includes cases where a gene had a match but an empty transcript value. I added these to the list because my initial assumption was that the endpoint shouldn't return empty values,
e.g., when specifically looking for a gene (HGNC:4713) in mane_select transcripts in GRCh38.
URL:https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A4713/mane_select/all/GRCh38?content-type=application%2Fjson
However, if I don't limit it to just mane_select, some records are returned with an accurate output. My code will later be updated to include only genes with mismatching queries, as described in the 1st section, and URLs that throw an internal error, so the list will eventually contain only the problematic ones.
Also note that this was only done on a very small subset of the endpoint, covering only the genes used in PanelApp, i.e., mane_select on GRCh38 for each gene. I did not test genes that are not part of PanelApp, nor GRCh37; I can add these at a later date.
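The outcomes described above (unrecognised ID or symbol, internal server error, match with empty transcripts) could be separated with something like the sketch below. The response shape is an assumption: an `error` key for unrecognised queries and a `transcripts` list otherwise, returned either as a single object or as a list of objects. `classify_response` and its labels are hypothetical and are not taken from my checking script.

```python
def classify_response(status_code, payload):
    """Label one endpoint result; `payload` is the decoded JSON body or None."""
    if status_code >= 500:
        return "server_error"
    # Assumption: the body is either one record or a list of records.
    records = payload if isinstance(payload, list) else [payload]
    records = [record for record in records if isinstance(record, dict)]
    if not records:
        return "malformed"
    if any("error" in record for record in records):
        return "unrecognised"
    # A match was found, but no record carries any transcripts.
    if all(not record.get("transcripts") for record in records):
        return "empty_transcripts"
    return "ok"


print(classify_response(500, None))
print(classify_response(200, {"error": "Unable to recognise DNAAF19 symbol"}))
print(classify_response(200, [{"current_symbol": "DNAAF19", "transcripts": []}]))
```

Keeping "empty_transcripts" as its own label makes it easy to drop those genes from the problem list later while keeping the other two failure modes.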
Potential Issue
Another issue: setting limit_transcripts to False seems to fail with an internal server error, possibly due to a timeout on bigger genes, as the request loads for a while and then fails. However, this is not universal behaviour, as some genes return a good response.
e.g., HGNC:1100(BRCA1)
URL :https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A1100/False/all/GRCh38?content-type=application%2Fjson
Not sure why; I have not checked other genes for this behaviour.