example Disease Annotation in Uniprot in README.md not working #1

Open
adeslatt opened this issue Oct 26, 2024 · 3 comments

Comments

@adeslatt

Hello -- I am just learning and may not know the syntax -- but the example

up:Disease_Annotation {
  a [ up:Disease_Annotation ] ;
  up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
  rdfs:comment xsd:string ;
  up:disease IRI
}

results in a malformed query error when you try it on the SPARQL endpoint for UniProt.

I set up a JupyterLab notebook, and this worked very nicely:

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

# Set up the UniProt SPARQL endpoint
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# Define a query to fetch available Disease Annotation data
query_disease_annotations_simple = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?disease_annotation ?comment ?disease
WHERE {
  ?disease_annotation a up:Disease_Annotation ;
                      rdfs:comment ?comment ;
                      up:disease ?disease .
}
LIMIT 10
"""

# Execute the query and format the output in a DataFrame
sparql.setQuery(query_disease_annotations_simple)
sparql.setReturnFormat(JSON)

try:
    # Execute query and retrieve results
    results_disease_simple = sparql.query().convert()
    
    # Parse the results
    disease_data_simple = [
        {
            "Disease Annotation": result["disease_annotation"]["value"],
            "Comment": result["comment"]["value"],
            "Disease": result["disease"]["value"]
        }
        for result in results_disease_simple["results"]["bindings"]
    ]
    
    # Create a DataFrame to display the results
    df_disease_simple = pd.DataFrame(disease_data_simple)
    
    # Wrap text for 'Comment' column in Jupyter display
    df_disease_simple_styled = df_disease_simple.style.set_properties(
        **{'white-space': 'pre-wrap', 'text-align': 'left'}
    )
    
    display(df_disease_simple_styled)
except Exception as e:
    print(f"Error occurred: {e}")

This returns the following (organized with pandas):

Disease Annotation	Comment	Disease
0	http://purl.uniprot.org/uniprot/Q9UDR5#SIP17D85FE178BE13B6	The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations.	http://purl.uniprot.org/diseases/1773
1	http://purl.uniprot.org/uniprot/Q9UDR5#SIP77BA87EDDA8559D2	The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.	http://purl.uniprot.org/diseases/4240
2	http://purl.uniprot.org/uniprot/Q9UGJ0#SIP473418E25D4D3A3B	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/1676
3	http://purl.uniprot.org/uniprot/Q9UGJ0#SIPBA4A3C214C09B2B7	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/245
4	http://purl.uniprot.org/uniprot/Q9UGJ0#SIPF5992DDE995A022F	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/1150
5	http://purl.uniprot.org/uniprot/P00519#SIP961ECAA35D2F0134	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/5064
6	http://purl.uniprot.org/uniprot/P00519#SIPDFB66D0B5174D549	The gene represented in this entry is involved in disease pathogenesis.	http://purl.uniprot.org/diseases/3735
7	http://purl.uniprot.org/uniprot/Q13085#SIPE73D1EB0068562AA	The disease is caused by variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/1164
8	http://purl.uniprot.org/uniprot/Q6UWZ7#SIP86B515DA1B7AD8CF	Disease susceptibility is associated with variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/2602
9	http://purl.uniprot.org/uniprot/A8K2U0#SIPCE73AF232236B8B1	Disease susceptibility is associated with variants affecting the gene represented in this entry.	http://purl.uniprot.org/diseases/5294

I then post-processed the results with the regular expression package (re), using this:

import pandas as pd
import re

# Assuming df_disease_simple from Step 1 already exists

# Define regex patterns for variants and genes
variant_pattern = r"\bvariant\s\w+\b|\bmutation\b|\bpolymorphism\b"  # Adjust patterns as needed
gene_pattern = r"\b[A-Z0-9]{2,}\b"  # Basic pattern for gene identifiers, e.g., BRCA1, TP53

# Extract details for each disease annotation
extracted_info = []
for _, row in df_disease_simple.iterrows():
    disease_id = row["Disease"]
    comment = row["Comment"]
    
    # Find all variants and gene mentions
    variants = re.findall(variant_pattern, comment, flags=re.IGNORECASE)
    genes = re.findall(gene_pattern, comment)
    
    # Store results in a structured format
    extracted_info.append({
        "Disease": disease_id,
        "Comment": comment,
        "Variants": variants,
        "Genes": genes
    })

# Convert to DataFrame
df_extracted_info = pd.DataFrame(extracted_info)

# Apply wrapping style to comment for readability
df_extracted_info_styled = df_extracted_info.style.set_properties(
    **{'white-space': 'pre-wrap', 'text-align': 'left'}
)

# Display the wrapped DataFrame in Jupyter
display(df_extracted_info_styled)

Disease	Comment	Variants	Genes
http://purl.uniprot.org/diseases/1773	The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations.	[]	['AASS']
http://purl.uniprot.org/diseases/4240	The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.	[]	['NADP', 'NADK2', 'NADPH', 'DECR1', 'AASS']
http://purl.uniprot.org/diseases/1676	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/245	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/1150	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/5064	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/3735	The gene represented in this entry is involved in disease pathogenesis.	[]	[]
http://purl.uniprot.org/diseases/1164	The disease is caused by variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/2602	Disease susceptibility is associated with variants affecting the gene represented in this entry.	[]	[]
http://purl.uniprot.org/diseases/5294	Disease susceptibility is associated with variants affecting the gene represented in this entry.	[]	[]
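
The Variants column is empty for every row because the variant pattern in the snippet expects the singular forms ("variant" followed by whitespace, "mutation", "polymorphism"), while the comments use plurals such as "variants" and "mutations". A minimal sketch of a plural-aware pattern follows; the names VARIANT_PATTERN and find_variant_terms are made up for this illustration. Note also that the gene pattern matches any run of capitals and digits, so cofactors such as NADP and NADPH are picked up alongside real gene symbols.

```python
import re

# Hypothetical replacement for variant_pattern: an optional trailing "s"
# lets the pattern match both singular and plural forms.
VARIANT_PATTERN = re.compile(
    r"\b(?:variant|mutation|polymorphism)s?\b", re.IGNORECASE
)


def find_variant_terms(comment):
    """Return every variant-related keyword found in a comment string."""
    return VARIANT_PATTERN.findall(comment)


if __name__ == "__main__":
    # Comment text taken from the query results above.
    text = (
        "A selective decrease in mitochondrial NADP(H) levels due to "
        "NADK2 mutations causes a deficiency of NADPH-dependent "
        "mitochondrial enzymes, such as DECR1 and AASS."
    )
    print(find_variant_terms(text))
```

This only detects that a comment mentions variants; it does not extract specific variant identifiers, which these free-text comments generally do not contain.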

Hope this helps.
@vemonet
Member

vemonet commented Oct 28, 2024

Hi @adeslatt, sorry, I am not sure I understood your issue :)

Are you trying to run this as a SPARQL query?

up:Disease_Annotation {
  a [ up:Disease_Annotation ] ;
  up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
  rdfs:comment xsd:string ;
  up:disease IRI
}

If yes, then it is expected that it does not work as is, because it is not a SPARQL query. It is a ShEx "Shape Expression", basically a schema for RDF data. See https://shex.io for more details.

We are using it to pass the endpoint schema to the LLM.
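
To illustrate how such a shape relates to a query: each predicate listed in the shape can become one triple pattern in a SPARQL SELECT. The following is only a rough sketch under that assumption; the helper name shape_to_select is hypothetical, it handles only flat predicate lists, and it does not parse ShEx itself.

```python
def shape_to_select(class_curie, predicate_curies, limit=10):
    """Build a SELECT query with one triple pattern per predicate.

    Hypothetical helper for illustration; it assumes only the up: and
    rdfs: prefixes are needed.
    """
    # Name each variable after the local part of the predicate,
    # e.g. up:disease -> ?disease
    variables = [curie.split(":", 1)[1] for curie in predicate_curies]
    select_vars = " ".join(f"?{var}" for var in variables)
    patterns = " ;\n".join(
        f"    {curie} ?{var}"
        for curie, var in zip(predicate_curies, variables)
    )
    return (
        "PREFIX up: <http://purl.uniprot.org/core/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"SELECT ?subject {select_vars}\n"
        "WHERE {\n"
        f"  ?subject a {class_curie} ;\n"
        f"{patterns} .\n"
        "}\n"
        f"LIMIT {limit}\n"
    )


# Two of the predicates from the Disease_Annotation shape above
query = shape_to_select("up:Disease_Annotation", ["rdfs:comment", "up:disease"])
print(query)
```

The generated text has the same structure as the working query earlier in this thread, which used the rdfs:comment and up:disease predicates from the shape.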

@adeslatt
Author

Hi @vemonet,
Thank you so much! Yes, I thought it was a SPARQL query. I was not familiar with Shape Expressions, so thank you for the reference. Can this work on the RDF as a file itself? I just exported from a database and made Turtle files; we are working with a non-SPARQL graph database (it is an ArangoDB instance).

@vemonet
Member

vemonet commented Oct 28, 2024

ShEx expressions are usually used to describe the schema of RDF data and to validate RDF data against it. Here we just use them to communicate the schema of the different classes in our knowledge graph to the LLM, so it knows which predicates can be used with which classes. In my opinion ShEx is a bit harder to use than SPARQL (because the libraries are less mature), so I would just load the RDF you have into a store and run SPARQL queries.

If you just want to run queries on RDF data you could load it with RDFLib then run queries: https://rdflib.readthedocs.io/en/stable/
