example Disease Annotation in Uniprot in README.md not working #1

adeslatt · 2024-10-26T12:58:39Z

Hello -- just learning and may not know the syntax -- but the example

up:Disease_Annotation {
  a [ up:Disease_Annotation ] ;
  up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
  rdfs:comment xsd:string ;
  up:disease IRI
}

results in a malformed query when you try it on the SPARQL endpoint for UniProt.

I set up a jupyter lab notebook - and this worked very nicely

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

# Set up the UniProt SPARQL endpoint
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# Define a query to fetch available Disease Annotation data
query_disease_annotations_simple = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?disease_annotation ?comment ?disease
WHERE {
  ?disease_annotation a up:Disease_Annotation ;
                      rdfs:comment ?comment ;
                      up:disease ?disease .
}
LIMIT 10
"""

# Execute the query and format the output in a DataFrame
sparql.setQuery(query_disease_annotations_simple)
sparql.setReturnFormat(JSON)

try:
    # Execute query and retrieve results
    results_disease_simple = sparql.query().convert()
    
    # Parse the results
    disease_data_simple = [
        {
            "Disease Annotation": result["disease_annotation"]["value"],
            "Comment": result["comment"]["value"],
            "Disease": result["disease"]["value"]
        }
        for result in results_disease_simple["results"]["bindings"]
    ]
    
    # Create a DataFrame to display the results
    df_disease_simple = pd.DataFrame(disease_data_simple)
    
    # Wrap text for 'Comment' column in Jupyter display
    df_disease_simple_styled = df_disease_simple.style.set_properties(
        **{'white-space': 'pre-wrap', 'text-align': 'left'}
    )
    
    display(df_disease_simple_styled)
except Exception as e:
    print(f"Error occurred: {e}")

Returns this (organized with pandas)

Disease Annotation	Comment	Disease
0	http://purl.uniprot.org/uniprot/Q9UDR5#SIP17D85FE178BE13B6	The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations.	http://purl.uniprot.org/diseases/1773
1	http://purl.uniprot.org/uniprot/Q9UDR5#SIP77BA87EDDA8559D2	The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.	http://purl.uniprot.org/diseases/4240
2	http://purl.uniprot.org/uniprot/Q9UGJ0#SIP473418E25D4D3A3B	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/1676
3	http://purl.uniprot.org/uniprot/Q9UGJ0#SIPBA4A3C214C09B2B7	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/245
4	http://purl.uniprot.org/uniprot/Q9UGJ0#SIPF5992DDE995A022F	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/1150
5	http://purl.uniprot.org/uniprot/P00519#SIP961ECAA35D2F0134	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/5064
6	http://purl.uniprot.org/uniprot/P00519#SIPDFB66D0B5174D549	The gene represented in this entry is involved in disease pathogenesis.	http://purl.uniprot.org/diseases/3735
7	http://purl.uniprot.org/uniprot/Q13085#SIPE73D1EB0068562AA	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/1164
8	http://purl.uniprot.org/uniprot/Q6UWZ7#SIP86B515DA1B7AD8CF	Disease susceptibility is associated with variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/2602
9	http://purl.uniprot.org/uniprot/A8K2U0#SIPCE73AF232236B8B1	Disease susceptibility is associated with variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/5294

I then processed the results a bit further using the regular expression package (re):

import pandas as pd
import re

# Assuming df_disease_simple from Step 1 already exists

# Define regex patterns for variants and genes
variant_pattern = r"\bvariant\s\w+\b|\bmutation\b|\bpolymorphism\b"  # Adjust patterns as needed
gene_pattern = r"\b[A-Z0-9]{2,}\b"  # Basic pattern for gene identifiers, e.g., BRCA1, TP53

# Extract details for each disease annotation
extracted_info = []
for _, row in df_disease_simple.iterrows():
    disease_id = row["Disease"]
    comment = row["Comment"]
    
    # Find all variants and gene mentions
    variants = re.findall(variant_pattern, comment, flags=re.IGNORECASE)
    genes = re.findall(gene_pattern, comment)
    
    # Store results in a structured format
    extracted_info.append({
        "Disease": disease_id,
        "Comment": comment,
        "Variants": variants,
        "Genes": genes
    })

# Convert to DataFrame
df_extracted_info = pd.DataFrame(extracted_info)

# Apply wrapping style to comment for readability
df_extracted_info_styled = df_extracted_info.style.set_properties(
    **{'white-space': 'pre-wrap', 'text-align': 'left'}
)

# Display the wrapped DataFrame in Jupyter
display(df_extracted_info_styled)

Disease	Comment	Variants	Genes
http://purl.uniprot.org/diseases/1773	The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations.	[]	['AASS']
http://purl.uniprot.org/diseases/4240	The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.	[]	['NADP', 'NADK2', 'NADPH', 'DECR1', 'AASS']
http://purl.uniprot.org/diseases/1676	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/245	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/1150	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/5064	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/3735	The gene represented in this entry is involved in disease pathogenesis.	[]	[]
http://purl.uniprot.org/diseases/1164	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/2602	Disease susceptibility is associated with variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/5294	Disease susceptibility is associated with variants affecting the gene represented in this entry.	[]	[]
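
A note on the empty Variants column above: the comments use the plurals "variants" and "mutations", while the patterns only match "variant" followed by whitespace and another word, or the exact singular "mutation" and "polymorphism". A minimal sketch of a plural-tolerant pattern follows; `variant_pattern_plural` and `find_variant_terms` are hypothetical names introduced here for illustration, and the two sample comments are copied from the table above.

```python
import re

# Hypothetical adjusted pattern: the optional "s" lets plural forms match.
variant_pattern_plural = r"\b(?:variant|mutation|polymorphism)s?\b"


def find_variant_terms(comment: str) -> list[str]:
    """Return every variant-like term found in a comment, case-insensitively."""
    return re.findall(variant_pattern_plural, comment, flags=re.IGNORECASE)


comments = [
    "The disease is caused by variants affecting the gene represented in this entry.",
    "A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations "
    "causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.",
]

for comment in comments:
    print(find_variant_terms(comment))
```

This only flags the generic words; it does not extract specific variant identifiers, which the ten comments shown do not contain.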


Hope this helps.

vemonet · 2024-10-28T10:08:09Z

Hi @adeslatt, sorry, I am not sure I understood your issue :)

Are you trying to run this as a SPARQL query?

up:Disease_Annotation {
  a [ up:Disease_Annotation ] ;
  up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
  rdfs:comment xsd:string ;
  up:disease IRI
}

If so, it is expected that it does not work as is: it is not a SPARQL query but a ShEx "Shape Expression", basically a schema for RDF data. See https://shex.io for more details.

We are using it to pass the endpoint schema to the LLM.
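
To illustrate how the shape relates to a query (a rough sketch, not a description of how the LLM-based tool works): each predicate listed in the shape can become a triple pattern in a SPARQL SELECT. The helper `shape_to_select` and the dictionary layout are hypothetical, introduced only for this example; the prefixes are the ones used in the query earlier in this thread.

```python
# Prefixes taken from the query earlier in this thread.
PREFIXES = {
    "up": "http://purl.uniprot.org/core/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def shape_to_select(class_curie: str, predicates: dict[str, str], limit: int = 10) -> str:
    """Build a SELECT query with one triple pattern per predicate of the shape.

    `predicates` maps a prefixed predicate (e.g. "up:disease") to the
    SPARQL variable name that should receive its value.
    """
    prefix_lines = [f"PREFIX {prefix}: <{iri}>" for prefix, iri in PREFIXES.items()]
    variables = " ".join(f"?{var}" for var in predicates.values())
    patterns = [f"  ?s a {class_curie} ."]
    patterns += [f"  ?s {pred} ?{var} ." for pred, var in predicates.items()]
    return "\n".join(
        prefix_lines
        + [f"SELECT ?s {variables}", "WHERE {"]
        + patterns
        + ["}", f"LIMIT {limit}"]
    )


query = shape_to_select(
    "up:Disease_Annotation",
    {"rdfs:comment": "comment", "up:disease": "disease"},
)
print(query)
```

The printed string is a regular SPARQL query, which is what the endpoint accepts, whereas the shape itself only describes which predicates a `up:Disease_Annotation` can carry.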

adeslatt · 2024-10-28T14:25:42Z

Hi @vemonet ,
Thank you so much! Yes, I thought it was a SPARQL query. I was not familiar with Shape Expressions, thank you for the reference. Can this work on the RDF as a file itself? I just exported from a database and made Turtle files; we are working with a non-SPARQL graph database (it is an ArangoDB instance).

vemonet · 2024-10-28T16:01:23Z

ShEx expressions are usually used to describe the schema of RDF data and to validate RDF data against it (here we just use them to communicate the schema of the different classes in our knowledge graph to the LLM, so it knows which predicates can be used with the different classes). In my opinion ShEx is a bit harder to use than SPARQL (because the libraries are less mature), so I would just load the RDF you have into a store and run SPARQL queries.

If you just want to run queries on RDF data, you could load it with RDFLib and then run queries: https://rdflib.readthedocs.io/en/stable/
