
Commit

feat: Enabling Natural Language Graph Queries in GraphScope with chatGPT alibaba#2648 (alibaba#3271)

## What do these changes do?

[TL;DR] We add LLM support on top of the basic graph queries, allowing users to interact with graphs in natural language.

> Specifically, the workflow is as follows:
1. Use LangChain to send the natural-language question to an LLM (ChatGPT, ChatGLM, and so on) together with manually designed prompts.
2. Add the graph schema to the prompt so the LLM produces better-targeted results.
3. Extract the Cypher statement from the LLM's response.
4. Run the Cypher statement on the graph through the Cypher-enabled interactive backend.
5. Return the results from the backend.

> Our contributions can be summarized as:

1. A carefully designed LangChain pipeline and prompt templates (`langchain_cypher.py`).
2. An exposed query interface (`query.py::query_to_cypher`); see the usage sketch below.
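
A minimal usage sketch of the exposed interface (the `graph` object and the endpoint/API-key values are placeholders; see the new docs for a full walkthrough):

```python
from graphscope.langchain_prompt.query import query_to_cypher

# `graph` is a loaded graphscope.Graph; the endpoint and key are placeholders.
cypher = query_to_cypher(graph, "贾宝玉是谁的儿子?", endpoint="https://xxx", api_key="xxx")
```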

## Related issue number

Related to issue alibaba#2648.

Fixes

---------

Co-authored-by: Longbin Lai <[email protected]>
Co-authored-by: Yufan Yang <[email protected]>
3 people authored Oct 27, 2023
1 parent 6547182 commit 23809be
Showing 6 changed files with 482 additions and 0 deletions.
Binary file added docs/images/llm+knowledge_base.png
Binary file added docs/images/llm_hot_code_knowledge.png
139 changes: 139 additions & 0 deletions docs/interactive_engine/neo4j/llm_assistant.md
@@ -0,0 +1,139 @@
# Using LLM as an Assistant in GIE

Nowadays, people often turn to LLMs instead of traditional search engines to answer their questions, thanks to the convenience of LLMs. However, relying solely on an LLM for question answering has its shortcomings. Most LLMs, GPT models among them, have knowledge limited to their training material, which is usually up to two years old, and they lack the ability to access the Internet to retrieve the latest information. It is therefore quite common for LLMs to give misleading answers in less well-known areas, where they have also received less training.

:::{figure-md}

<img src="../../images/llm_hot_code_knowledge.png" alt="llm_hot_code_question" style="zoom:33%;" />

Figure 1. LLM's different responses to hot and cold knowledge
:::

In fact, knowledge in less well-known areas can be organized in a knowledge base, such as an RDBMS or a graph. An LLM can then serve as an assistant, efficiently helping users retrieve the required information from the knowledge base by translating the user's question directly into an executable query. This approach significantly reduces the misleading answers an LLM would otherwise generate in such areas.

:::{figure-md}

<img src="../../images/llm+knowledge_base.png" alt="llm+knowledge_base" style="zoom:33%;" />

Figure 2. Using LLM as an assistant to help retrieve information from Knowledge Base
:::

Following this pattern, we integrate GPT into GIE as an assistant through OpenAI's API. Now, even if you are a complete novice when it comes to graphs, with the assistance of an LLM you can conveniently retrieve the information you need from a graph. This document uses the graph of *Dream of the Red Chamber* as an example to guide you through using the LLM assistant in GIE.

## 0. Environment

The integration of LLMs is available for GraphScope versions 0.25 and later, and it uses `langchain` for prompts. Therefore, to begin, please make sure you have the following environment:

```bash
python>=3.8
graphscope>=0.25.0
pandas==2.0.3
langchain>=0.0.316
```

We strongly recommend creating a clean Python virtual environment for GraphScope and those dependencies. If you are unsure how to do this, you can follow these [instructions](https://docs.python.org/3/library/venv.html).
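
For example, one possible setup (a sketch assuming `venv` and `pip` are available):

```bash
# Create and activate a clean virtual environment.
python3 -m venv gs-llm-env
source gs-llm-env/bin/activate
# Install the dependencies listed above.
pip install "graphscope>=0.25.0" "pandas==2.0.3" "langchain>=0.0.316"
```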

## 1. Download Datasets

In this document, we take the graph of *Dream of the Red Chamber* as the example. You can download the dataset by cloning its repository directly:

```bash
git clone https://github.com/Suchun-sv/The-Dream-of-the-Red-Chamber.git
```

Or you can visit the [Git Repository](https://github.com/Suchun-sv/The-Dream-of-the-Red-Chamber), click the `Download ZIP` button to download the dataset, and then unzip it.

Finally, move the dataset to the directory where you run your Python script:

```bash
# unzip The-Dream-of-the-Red-Chamber.zip # if you download the zip file
# move the dataset to the directory where you run your python file
mv /path/to/The-Dream-of-the-Red-Chamber/data ./data
```

## 2. Load the Graph

After preparing the dataset, use the following Python code to have GIE load the data and build the graph.

```python
import graphscope as gs
import pandas as pd

gs.set_option()
sess = gs.session(cluster_type='hosts')
graph = sess.g()

# Load the vertices: every node is a Person identified by `id`.
nodes_sets = pd.read_csv("./data/stone_story_nodes_relation.csv", sep=",")
graph = graph.add_vertices(nodes_sets, label="Person", vid_field="id")

# Load the edges, adding one edge label per distinct relationship type.
edges_sets = pd.read_csv("./data/stone_story_edges.csv")
for edge_label in edges_sets['label'].unique():
    edges_sets_ = edges_sets[edges_sets['label'] == edge_label]
    graph = graph.add_edges(edges_sets_, src_field="head", dst_field="tail", label=edge_label)

print(graph.schema)
```

If you see output like the following in your terminal or console, the dataset has been successfully loaded into GIE:

```bash
Properties: Property(0, eid, LONG, False, ), Property(1, label, STRING, False, )
Comment: Relations: [Relation(source='Person', destination='Person')]
type: EDGE
Label: daughter_in_law_of_grandson_of
Properties: Property(0, eid, LONG, False, ), Property(1, label, STRING, False, )
Comment: Relations: [Relation(source='Person', destination='Person')]
type: EDGE
Label: wife_of
Properties: Property(0, eid, LONG, False, ), Property(1, label, STRING, False, )
Comment: Relations: [Relation(source='Person', destination='Person')]
type: EDGE
...
```

## 3. Set Endpoint and API Key

Since GIE's LLM assistant module uses OpenAI's API, you should set your endpoint and API key before using it:

```python
endpoint = "https://xxx"  # replace with your own endpoint
api_key = "xxx"  # replace with your own API key
```
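
As a sketch, you can also read these values from environment variables rather than hard-coding them; the variable names below are a common convention, not something GIE requires:

```python
import os

# Hypothetical variable names; export them in your shell before running.
endpoint = os.environ.get("OPENAI_API_BASE", "https://xxx")
api_key = os.environ.get("OPENAI_API_KEY", "xxx")
```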

## 4. Generate Graph Query Sentence from Questions

In GIE's LLM assistant module, the `query_to_cypher` function generates a corresponding query statement from a provided question.

```python
from graphscope.langchain_prompt.query import query_to_cypher
```

Simply define your question and pass it to `query_to_cypher`. It generates the corresponding Cypher query based on the question and the schema of the loaded graph. Here is an example of the LLM assistant generating a Cypher query for the question "Whose son is Baoyu Jia?" (贾宝玉是谁的儿子?):

```python
from graphscope.langchain_prompt.query import query_to_cypher
question = "贾宝玉是谁的儿子?"
cypher_sentence = query_to_cypher(graph, question, endpoint=endpoint, api_key=api_key)
print(cypher_sentence)
```

The generated query statement will look like this:

```cypher
MATCH (p:Person)-[:son_of]->(q:Person)
WHERE p.name = '贾宝玉'
RETURN q.name
```

Please note that the generated query may not be 100% correct; you can edit it as needed.
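
For instance, one lightweight sanity check (illustrative only, not part of the GIE API) is to confirm that every relationship label referenced in the generated query exists among the edge labels loaded in step 2:

```python
import re

# Edge labels that were loaded into the graph in step 2.
known_labels = set(edges_sets['label'].unique())
# Relationship labels referenced by the generated Cypher, e.g. [:son_of].
used_labels = set(re.findall(r"\[:(\w+)", cypher_sentence))
unknown = used_labels - known_labels
if unknown:
    print(f"Warning: labels not found in the graph schema: {unknown}")
```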

## 5. Execute Generated Query Sentence with GIE

Lastly, you can execute the generated Cypher query in GIE's built-in interactive session. Here is an example:

```python
# Start the GIE interactive session
g = gs.interactive(graph, params={'neo4j.bolt.server.disabled': 'false', 'neo4j.bolt.server.port': 7687})
# Submit the query sentence
q1 = g.execute(cypher_sentence, lang="cypher")
# Check the query results
print(q1.records)
```

Here the output would be "贾政" (Jia Zheng), which is accurate according to the story of *Dream of the Red Chamber*.
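
When you are done, you can release the resources held by GraphScope by closing the session:

```python
sess.close()
```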
16 changes: 16 additions & 0 deletions python/graphscope/langchain_prompt/__init__.py
@@ -0,0 +1,16 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Copyright 2020 Alibaba Group Holding Limited. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
216 changes: 216 additions & 0 deletions python/graphscope/langchain_prompt/langchain_cypher.py
@@ -0,0 +1,216 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Copyright 2020 Alibaba Group Holding Limited. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from __future__ import annotations

import re
from typing import Any
from typing import Dict
from typing import List
from typing import Optional

from langchain.callbacks.manager import CallbackManagerForChainRun
from langchain.chains.base import Chain
from langchain.chains.graph_qa.prompts import CYPHER_QA_PROMPT
from langchain.chains.llm import LLMChain
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import BasePromptTemplate
from langchain.schema.language_model import BaseLanguageModel

import graphscope

# The patterns to replace in the generated Cypher query
PATTERN_TRANSFER = [("-[*]-", "-[]-")]

Cases = """Right Cases:
query: 列举出鲁迅的一个别名可以吗? answer: match (:ENTITY{name:'鲁迅'})<--(h)-[:Relationship{name:'别名'}]->(q) return distinct q.name limit 1
query: 我们常用的301SH不锈钢带的硬度公差是多少,你知道吗? answer: match (p:ENTITY{name:'301SH不锈钢带'})-[:Relationship{name:'硬度公差'}]->(q) return q.name
Wrong Cases:
query: 12344加油这首歌真好听,你知道歌曲原唱是谁吗? answer: MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) WHERE m.name = '12345加油' RETURN a.name
query: 七宗梦是什么时候上映的? answer: MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) WHERE m.name = '七宗梦' RETURN a.name LIMIT 30"""


INTERMEDIATE_STEPS_KEY = "intermediate_steps"

CYPHER_GENERATION_TEMPLATE = """Task: Generate a Cypher statement to query a graph database.
Cases:
{cases}
Schema:
{schema}
Instructions:
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
You must use the relationships or properties shown in the schema!!! Do not use other keys!!!
You must use the relationships or properties shown in the schema!!! Do not use other keys!!!
You must use the relationships or properties shown in the schema!!! Do not use other keys!!!
你必须使用Schema中出现的关键词!!!
The question is:
{question}
You must use the relationships or properties shown in the schema!!! Do not use other keys!!!"""
CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question", "cases"], template=CYPHER_GENERATION_TEMPLATE
)


CHECK_SCHEMA_TEMPLATE = """Task: Check the schema.
{query}
Schema:
{schema}
Check the properties and relationships in the query, and replace all keywords that do not appear in the schema!!!
Check the properties and relationships in the query, and replace all keywords that do not appear in the schema!!!
Check the properties and relationships in the query, and replace all keywords that do not appear in the schema!!!
If correct, return the original query!!!
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
"""
CHECK_SCHEMA_PROMPT = PromptTemplate(
    input_variables=["query", "schema"], template=CHECK_SCHEMA_TEMPLATE
)


def extract_cypher(text: str) -> str:
    """Extract Cypher code from a text.

    Args:
        text: Text to extract Cypher code from.

    Returns:
        Cypher code extracted from the text.
    """
    # The pattern to find Cypher code enclosed in triple backticks
    pattern = r"```(.*?)```"

    # Find all matches in the input text
    matches = re.findall(pattern, text, re.DOTALL)

    cypher_query = matches[0] if matches else text

    # Replace any patterns that are not supported by the graph database
    for pattern, replacement in PATTERN_TRANSFER:
        cypher_query = cypher_query.replace(pattern, replacement)
    return cypher_query


class GraphCypherQAChain(Chain):
    """Chain for question-answering against a graph by generating Cypher statements."""

    graph: graphscope.Graph
    cypher_generation_chain: LLMChain
    check_schema_chain: LLMChain
    qa_chain: LLMChain
    input_key: str = "query"  #: :meta private:
    output_key: str = "result"  #: :meta private:
    top_k: int = 10
    """Number of results to return from the query"""
    return_intermediate_steps: bool = False
    """Whether or not to return the intermediate steps along with the final answer."""
    return_direct: bool = False
    """Whether or not to return the result of querying the graph directly."""

    @property
    def input_keys(self) -> List[str]:
        """Return the input keys.

        :meta private:
        """
        return [self.input_key]

    @property
    def output_keys(self) -> List[str]:
        """Return the output keys.

        :meta private:
        """
        _output_keys = [self.output_key]
        return _output_keys

    @property
    def _chain_type(self) -> str:
        return "graph_cypher_chain"

    @classmethod
    def from_llm(
        cls,
        llm: BaseLanguageModel,
        *,
        qa_prompt: BasePromptTemplate = CYPHER_QA_PROMPT,
        cypher_prompt: BasePromptTemplate = CYPHER_GENERATION_PROMPT,
        check_prompt: BasePromptTemplate = CHECK_SCHEMA_PROMPT,
        **kwargs: Any,
    ) -> GraphCypherQAChain:
        """Initialize from LLM."""
        qa_chain = LLMChain(llm=llm, prompt=qa_prompt)
        cypher_generation_chain = LLMChain(llm=llm, prompt=cypher_prompt)
        check_schema_chain = LLMChain(llm=llm, prompt=check_prompt)

        return cls(
            qa_chain=qa_chain,
            cypher_generation_chain=cypher_generation_chain,
            check_schema_chain=check_schema_chain,
            **kwargs,
        )

    def _call(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, Any]:
        """Generate a Cypher statement, use it to query the graph, and answer the question."""
        _run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
        callbacks = _run_manager.get_child()
        question = inputs[self.input_key]

        intermediate_steps: List = []

        # Generate a candidate Cypher statement from the question, the graph
        # schema, and the few-shot cases.
        generated_cypher = self.cypher_generation_chain.run(
            {"question": question, "schema": self.graph.schema, "cases": Cases},
            callbacks=callbacks,
        )

        # Extract the Cypher code from the generated text
        generated_cypher = extract_cypher(generated_cypher)
        # Ask the LLM to double-check the statement against the schema.
        generated_cypher = self.check_schema_chain.run(
            {"query": generated_cypher, "schema": self.graph.schema},
            callbacks=callbacks,
        )
        generated_cypher = extract_cypher(generated_cypher)

        _run_manager.on_text("Generated Cypher:", end="\n", verbose=self.verbose)
        _run_manager.on_text(
            generated_cypher, color="green", end="\n", verbose=self.verbose
        )

        intermediate_steps.append({"query": generated_cypher})

        # context = graph_interface.execute(generated_cypher, lang="cypher")
        # intermediate_steps.append({"context": context})

        # final_result = context

        chain_result: Dict[str, Any] = {self.output_key: generated_cypher}
        if self.return_intermediate_steps:
            chain_result[INTERMEDIATE_STEPS_KEY] = intermediate_steps

        return chain_result