The example results in a malformed query when you try it on the SPARQL endpoint for UniProt.
I set up a JupyterLab notebook, and this worked very nicely:
```python
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

# Set up the UniProt SPARQL endpoint
sparql = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# Define a query to fetch available Disease Annotation data
query_disease_annotations_simple = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?disease_annotation ?comment ?disease
WHERE {
    ?disease_annotation a up:Disease_Annotation ;
                        rdfs:comment ?comment ;
                        up:disease ?disease .
}
LIMIT 10
"""

# Execute the query and format the output in a DataFrame
sparql.setQuery(query_disease_annotations_simple)
sparql.setReturnFormat(JSON)

try:
    # Execute query and retrieve results
    results_disease_simple = sparql.query().convert()

    # Parse the results
    disease_data_simple = [
        {
            "Disease Annotation": result["disease_annotation"]["value"],
            "Comment": result["comment"]["value"],
            "Disease": result["disease"]["value"]
        }
        for result in results_disease_simple["results"]["bindings"]
    ]

    # Create a DataFrame to display the results
    df_disease_simple = pd.DataFrame(disease_data_simple)

    # Wrap text for 'Comment' column in Jupyter display
    df_disease_simple_styled = df_disease_simple.style.set_properties(
        **{'white-space': 'pre-wrap', 'text-align': 'left'}
    )
    display(df_disease_simple_styled)
except Exception as e:
    print(f"Error occurred: {e}")
```
Returns this (organized with pandas)
| | Disease Annotation | Comment | Disease |
|---|---|---|---|
| 0 | http://purl.uniprot.org/uniprot/Q9UDR5#SIP17D85FE178BE13B6 | The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations. | http://purl.uniprot.org/diseases/1773 |
| 1 | http://purl.uniprot.org/uniprot/Q9UDR5#SIP77BA87EDDA8559D2 | The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS. | http://purl.uniprot.org/diseases/4240 |
| 2 | http://purl.uniprot.org/uniprot/Q9UGJ0#SIP473418E25D4D3A3B | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/1676 |
| 3 | http://purl.uniprot.org/uniprot/Q9UGJ0#SIPBA4A3C214C09B2B7 | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/245 |
| 4 | http://purl.uniprot.org/uniprot/Q9UGJ0#SIPF5992DDE995A022F | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/1150 |
| 5 | http://purl.uniprot.org/uniprot/P00519#SIP961ECAA35D2F0134 | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/5064 |
| 6 | http://purl.uniprot.org/uniprot/P00519#SIPDFB66D0B5174D549 | The gene represented in this entry is involved in disease pathogenesis. | http://purl.uniprot.org/diseases/3735 |
| 7 | http://purl.uniprot.org/uniprot/Q13085#SIPE73D1EB0068562AA | The disease is caused by variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/1164 |
| 8 | http://purl.uniprot.org/uniprot/Q6UWZ7#SIP86B515DA1B7AD8CF | Disease susceptibility is associated with variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/2602 |
| 9 | http://purl.uniprot.org/uniprot/A8K2U0#SIPCE73AF232236B8B1 | Disease susceptibility is associated with variants affecting the gene represented in this entry. | http://purl.uniprot.org/diseases/5294 |
I then extended this a bit further, using the regular expression package (re) to pull variant and gene mentions out of the comments:
```python
import pandas as pd
import re

# Assuming df_disease_simple from Step 1 already exists

# Define regex patterns for variants and genes
variant_pattern = r"\bvariant\s\w+\b|\bmutation\b|\bpolymorphism\b"  # Adjust patterns as needed
gene_pattern = r"\b[A-Z0-9]{2,}\b"  # Basic pattern for gene identifiers, e.g., BRCA1, TP53

# Extract details for each disease annotation
extracted_info = []
for _, row in df_disease_simple.iterrows():
    disease_id = row["Disease"]
    comment = row["Comment"]

    # Find all variants and gene mentions
    variants = re.findall(variant_pattern, comment, flags=re.IGNORECASE)
    genes = re.findall(gene_pattern, comment)

    # Store results in a structured format
    extracted_info.append({
        "Disease": disease_id,
        "Comment": comment,
        "Variants": variants,
        "Genes": genes
    })

# Convert to DataFrame
df_extracted_info = pd.DataFrame(extracted_info)

# Apply wrapping style to comment for readability
df_extracted_info_styled = df_extracted_info.style.set_properties(
    **{'white-space': 'pre-wrap', 'text-align': 'left'}
)

# Display the wrapped DataFrame in Jupyter
display(df_extracted_info_styled)
```
The disease is caused by variants affecting the gene represented in this entry. In hyperlysinemia 1, both enzymatic functions of AASS are defective and patients have increased serum lysine and possibly increased saccharopine. Some individuals, however, retain significant amounts of lysine-ketoglutarate reductase and present with saccharopinuria, a metabolic condition with few, if any, clinical manifestations.
The protein represented in this entry is involved in disease pathogenesis. A selective decrease in mitochondrial NADP(H) levels due to NADK2 mutations causes a deficiency of NADPH-dependent mitochondrial enzymes, such as DECR1 and AASS.
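A quick check of the two patterns against the second comment above may help when adjusting them. This is only a sketch: `variant_pattern_plural` is a hypothetical relaxed pattern, not part of the original notebook. As written, the variant pattern needs a word boundary right after "mutation" and whitespace right after "variant", so plural forms such as "mutations" and "variants" are not picked up, while the gene pattern also catches non-gene tokens such as NADP and NADPH.

```python
import re

# Patterns copied from the snippet above.
variant_pattern = r"\bvariant\s\w+\b|\bmutation\b|\bpolymorphism\b"
gene_pattern = r"\b[A-Z0-9]{2,}\b"

# Hypothetical relaxed pattern (an assumption, not from the original post)
# that also accepts the plural forms.
variant_pattern_plural = r"\b(?:variant|mutation|polymorphism)s?\b"

comment = (
    "A selective decrease in mitochondrial NADP(H) levels due to NADK2 "
    "mutations causes a deficiency of NADPH-dependent mitochondrial "
    "enzymes, such as DECR1 and AASS."
)

original_hits = re.findall(variant_pattern, comment, flags=re.IGNORECASE)
relaxed_hits = re.findall(variant_pattern_plural, comment, flags=re.IGNORECASE)
gene_hits = re.findall(gene_pattern, comment)

print("original variant pattern:", original_hits)
print("relaxed variant pattern: ", relaxed_hits)
print("gene pattern:            ", gene_hits)
```

Since the gene pattern matches any run of two or more capitals or digits, filtering its hits against a known list of gene symbols would probably be needed before relying on the "Genes" column.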
Hi @adeslatt, sorry, I am not sure I understood your issue :)
Are you trying to run this as a SPARQL query?
```
up:Disease_Annotation {
  a [ up:Disease_Annotation ] ;
  up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
  rdfs:comment xsd:string ;
  up:disease IRI
}
```
If yes, then it is expected that it does not work as is, because it is not a SPARQL query. It is a ShEx "Shape Expression", basically a schema for RDF data. See here for more details: https://shex.io
We are using it to pass the endpoint schema to the LLM.
Hi @vemonet,
Thank you so much! Yes, I thought it was a SPARQL query. I was not familiar with Shape Expressions, so thank you for the reference. Can this work on the RDF as a file itself? I just exported from a database and made Turtle files; we are working with a non-SPARQL graph database (it is an ArangoDB instance).
ShEx expressions are usually used to describe the schema of RDF data and to validate RDF data (here we just use them to communicate the schema of the different classes in our knowledge graph to the LLM, so it knows which predicates can be used with the different classes). In my opinion ShEx is a bit harder to use than SPARQL (because the libraries are less mature), so I would just load the RDF you have into a store and run SPARQL queries.
Hello, I am just learning and may not know the syntax, but the example results in a malformed query when you try it on the SPARQL endpoint for UniProt.