
Commit

feat: Enabling Natural Language Graph Queries in GraphScope with chatGPT alibaba#2648 (alibaba#3271)

## What do these changes do?

[TL;DR] We add LLM support on top of the basic graph queries, allowing users to interact with graphs in natural language.

> Specifically, the workflow is as follows:
1. Use LangChain to send the natural-language question to an LLM (ChatGPT, ChatGLM, and so on) together with manually designed prompts.
2. Add the graph schema to the prompt so the LLM produces better-targeted results.
3. Extract the Cypher statement from the LLM's response.
4. Run the Cypher statement on the graph through the Cypher-enabled interactive backend.
5. Return the results from the backend.

> Our contributions can be summarized as:

1. A carefully designed LangChain pipeline and prompt templates (`langchain_cypher.py`).
2. An exposed query interface (`query.py::query_to_cypher`); see the usage sketch below.
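
A minimal usage sketch of the exposed interface (the `graph` object and the endpoint/API-key values are placeholders; see the new docs for a full walkthrough):

```python
from graphscope.langchain_prompt.query import query_to_cypher

# `graph` is a loaded graphscope.Graph; the endpoint and key are placeholders.
cypher = query_to_cypher(graph, "贾宝玉是谁的儿子?", endpoint="https://xxx", api_key="xxx")
```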

## Related issue number

Related to issue alibaba#2648.

Fixes

---------

Co-authored-by: Longbin Lai <[email protected]>
Co-authored-by: Yufan Yang <[email protected]>
3 people authored Oct 27, 2023
1 parent 6547182 commit 23809be
Showing 6 changed files with 482 additions and 0 deletions.
Binary file added docs/images/llm+knowledge_base.png
Binary file added docs/images/llm_hot_code_knowledge.png
139 changes: 139 additions & 0 deletions docs/interactive_engine/neo4j/llm_assistant.md
@@ -0,0 +1,139 @@
# Using LLM as an Assistant in GIE

Nowadays, people often turn to LLMs instead of traditional search engines to answer their questions, thanks to the convenience of LLMs. However, relying solely on an LLM for question answering has its shortcomings. Most LLMs, GPT models among them, have knowledge limited to their training material, which is usually up to two years old, and they lack the ability to access the Internet to retrieve the latest information. It is therefore quite common for LLMs to give misleading answers in less well-known areas, where they have also received less training.

:::{figure-md}

<img src="../../images/llm_hot_code_knowledge.png" alt="llm_hot_code_question" style="zoom:33%;" />

Figure 1. LLM's different responses to hot and cold knowledge
:::

In fact, knowledge in less well-known areas can be organized in a knowledge base, such as an RDBMS or a graph. An LLM can then serve as an assistant, efficiently helping users retrieve the required information from the knowledge base by translating the user's question directly into an executable query. This approach significantly reduces the misleading answers an LLM would otherwise generate in such areas.

:::{figure-md}

<img src="../../images/llm+knowledge_base.png" alt="llm+knowledge_base" style="zoom:33%;" />

Figure 2. Using LLM as an assistant to help retrieve information from Knowledge Base
:::

Following this pattern, we integrate GPT into GIE as an assistant through OpenAI's API. Now, even if you are a complete novice when it comes to graphs, with the assistance of an LLM you can conveniently retrieve the information you need from a graph. This document uses the graph of *Dream of the Red Chamber* as an example to guide you through using the LLM assistant in GIE.

## 0. Environment

The integration of LLMs is available for GraphScope versions 0.25 and later, and it uses `langchain` for prompts. Therefore, to begin, please make sure you have the following environment:

```bash
python>=3.8
graphscope>=0.25.0
pandas==2.0.3
langchain>=0.0.316
```

We strongly recommend creating a clean Python virtual environment for GraphScope and those dependencies. If you are unsure how to do this, you can follow these [instructions](https://docs.python.org/3/library/venv.html).
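
For example, one possible setup (a sketch assuming `venv` and `pip` are available):

```bash
# Create and activate a clean virtual environment.
python3 -m venv gs-llm-env
source gs-llm-env/bin/activate
# Install the dependencies listed above.
pip install "graphscope>=0.25.0" "pandas==2.0.3" "langchain>=0.0.316"
```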

## 1. Download Datasets

In this document, we take the graph of *Dream of the Red Chamber* as the example. You can download the dataset by cloning its repository directly:

```bash
git clone https://github.com/Suchun-sv/The-Dream-of-the-Red-Chamber.git
```

Or you can visit the [Git Repository](https://github.com/Suchun-sv/The-Dream-of-the-Red-Chamber), click the `Download ZIP` button to download the dataset, and then unzip it.

Finally, move the dataset to the directory where you run your Python script:

```bash
# unzip The-Dream-of-the-Red-Chamber.zip # if you download the zip file
# move the dataset to the directory where you run your python file
mv /path/to/The-Dream-of-the-Red-Chamber/data ./data
```

## 2. Load the Graph

After preparing the dataset, use the following Python code to have GIE load the data and build the graph.

```python
import graphscope as gs
import pandas as pd

gs.set_option()
sess = gs.session(cluster_type='hosts')
graph = sess.g()

# Load the vertices: every node is a Person identified by `id`.
nodes_sets = pd.read_csv("./data/stone_story_nodes_relation.csv", sep=",")
graph = graph.add_vertices(nodes_sets, label="Person", vid_field="id")

# Load the edges, adding one edge label per distinct relationship type.
edges_sets = pd.read_csv("./data/stone_story_edges.csv")
for edge_label in edges_sets['label'].unique():
    edges_sets_ = edges_sets[edges_sets['label'] == edge_label]
    graph = graph.add_edges(edges_sets_, src_field="head", dst_field="tail", label=edge_label)

print(graph.schema)
```

If you see output like the following in your terminal or console, the dataset has been successfully loaded into GIE:

```bash
Properties: Property(0, eid, LONG, False, ), Property(1, label, STRING, False, )
Comment: Relations: [Relation(source='Person', destination='Person')]
type: EDGE
Label: daughter_in_law_of_grandson_of
Properties: Property(0, eid, LONG, False, ), Property(1, label, STRING, False, )
Comment: Relations: [Relation(source='Person', destination='Person')]
type: EDGE
Label: wife_of
Properties: Property(0, eid, LONG, False, ), Property(1, label, STRING, False, )
Comment: Relations: [Relation(source='Person', destination='Person')]
type: EDGE
...
```

## 3. Set Endpoint and API Key

Since GIE's LLM assistant module uses OpenAI's API, you should set your endpoint and API key before using it:

```python
endpoint = "https://xxx"  # replace with your own endpoint
api_key = "xxx"  # replace with your own API key
```
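
As a sketch, you can also read these values from environment variables rather than hard-coding them; the variable names below are a common convention, not something GIE requires:

```python
import os

# Hypothetical variable names; export them in your shell before running.
endpoint = os.environ.get("OPENAI_API_BASE", "https://xxx")
api_key = os.environ.get("OPENAI_API_KEY", "xxx")
```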

## 4. Generate Graph Query Sentence from Questions

In GIE's LLM assistant module, the `query_to_cypher` function generates a corresponding query statement from a provided question.

```python
from graphscope.langchain_prompt.query import query_to_cypher
```

Simply define your question and pass it to `query_to_cypher`. It generates the corresponding Cypher query based on the question and the schema of the loaded graph. Here is an example of the LLM assistant generating a Cypher query for the question "Whose son is Baoyu Jia?" (贾宝玉是谁的儿子?):

```python
from graphscope.langchain_prompt.query import query_to_cypher
question = "贾宝玉是谁的儿子?"
cypher_sentence = query_to_cypher(graph, question, endpoint=endpoint, api_key=api_key)
print(cypher_sentence)
```

The generated query statement will look like this:

```cypher
MATCH (p:Person)-[:son_of]->(q:Person)
WHERE p.name = '贾宝玉'
RETURN q.name
```

Please note that the generated query may not be 100% correct; you can edit it as needed.
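
For instance, one lightweight sanity check (illustrative only, not part of the GIE API) is to confirm that every relationship label referenced in the generated query exists among the edge labels loaded in step 2:

```python
import re

# Edge labels that were loaded into the graph in step 2.
known_labels = set(edges_sets['label'].unique())
# Relationship labels referenced by the generated Cypher, e.g. [:son_of].
used_labels = set(re.findall(r"\[:(\w+)", cypher_sentence))
unknown = used_labels - known_labels
if unknown:
    print(f"Warning: labels not found in the graph schema: {unknown}")
```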

## 5. Execute Generated Query Sentence with GIE

Lastly, you can execute the generated Cypher query in GIE's built-in interactive session. Here is an example:

```python
# Start the GIE interactive session
g = gs.interactive(graph, params={'neo4j.bolt.server.disabled': 'false', 'neo4j.bolt.server.port': 7687})
# Submit the query sentence
q1 = g.execute(cypher_sentence, lang="cypher")
# Check the query results
print(q1.records)
```

Here the output would be "贾政" (Jia Zheng), which is accurate according to the story of *Dream of the Red Chamber*.
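
When you are done, you can release the resources held by GraphScope by closing the session:

```python
sess.close()
```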
16 changes: 16 additions & 0 deletions python/graphscope/langchain_prompt/__init__.py
@@ -0,0 +1,16 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Copyright 2020 Alibaba Group Holding Limited. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
216 changes: 216 additions & 0 deletions python/graphscope/langchain_prompt/langchain_cypher.py
@@ -0,0 +1,216 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# Copyright 2020 Alibaba Group Holding Limited. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

from __future__ import annotations

import re
from typing import Any
from typing import Dict
from typing import List
from typing import Optional

from langchain.callbacks.manager import CallbackManagerForChainRun
from langchain.chains.base import Chain
from langchain.chains.graph_qa.prompts import CYPHER_QA_PROMPT
from langchain.chains.llm import LLMChain
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import BasePromptTemplate
from langchain.schema.language_model import BaseLanguageModel

import graphscope

# The patterns to replace in the generated Cypher query
PATTERN_TRANSFER = [("-[*]-", "-[]-")]

Cases = """Right Cases:
query: 列举出鲁迅的一个别名可以吗? answer: match (:ENTITY{name:'鲁迅'})<--(h)-[:Relationship{name:'别名'}]->(q) return distinct q.name limit 1
query: 我们常用的301SH不锈钢带的硬度公差是多少,你知道吗? answer: match (p:ENTITY{name:'301SH不锈钢带'})-[:Relationship{name:'硬度公差'}]->(q) return q.name
Wrong Cases:
query: 12344加油这首歌真好听,你知道歌曲原唱是谁吗? answer: MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) WHERE m.name = '12345加油' RETURN a.name
query: 七宗梦是什么时候上映的? answer: MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) WHERE m.name = '七宗梦' RETURN a.name LIMIT 30"""


INTERMEDIATE_STEPS_KEY = "intermediate_steps"

CYPHER_GENERATION_TEMPLATE = """Task: Generate a Cypher statement to query a graph database.
Cases:
{cases}
Schema:
{schema}
Instructions:
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
You must use the relationships or properties shown in the schema!!! Do not use other keys!!!
You must use the relationships or properties shown in the schema!!! Do not use other keys!!!
You must use the relationships or properties shown in the schema!!! Do not use other keys!!!
你必须使用Schema中出现的关键词!!!
The question is:
{question}
You must use the relationships or properties shown in the schema!!! Do not use other keys!!!"""
CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question", "cases"], template=CYPHER_GENERATION_TEMPLATE
)


CHECK_SCHEMA_TEMPLATE = """Task: Check the schema.
{query}
Schema:
{schema}
Check the properties and relationships in the query, and replace all keywords that do not appear in the schema!!!
Check the properties and relationships in the query, and replace all keywords that do not appear in the schema!!!
Check the properties and relationships in the query, and replace all keywords that do not appear in the schema!!!
If correct, return the original query!!!
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
"""
CHECK_SCHEMA_PROMPT = PromptTemplate(
    input_variables=["query", "schema"], template=CHECK_SCHEMA_TEMPLATE
)


def extract_cypher(text: str) -> str:
    """Extract Cypher code from a text.

    Args:
        text: Text to extract Cypher code from.

    Returns:
        Cypher code extracted from the text.
    """
    # The pattern to find Cypher code enclosed in triple backticks
    pattern = r"```(.*?)```"

    # Find all matches in the input text
    matches = re.findall(pattern, text, re.DOTALL)

    cypher_query = matches[0] if matches else text

    # Replace any patterns that are not supported by the graph database
    for pattern, replacement in PATTERN_TRANSFER:
        cypher_query = cypher_query.replace(pattern, replacement)
    return cypher_query


class GraphCypherQAChain(Chain):
    """Chain for question-answering against a graph by generating Cypher statements."""

    graph: graphscope.Graph
    cypher_generation_chain: LLMChain
    check_schema_chain: LLMChain
    qa_chain: LLMChain
    input_key: str = "query"  #: :meta private:
    output_key: str = "result"  #: :meta private:
    top_k: int = 10
    """Number of results to return from the query"""
    return_intermediate_steps: bool = False
    """Whether or not to return the intermediate steps along with the final answer."""
    return_direct: bool = False
    """Whether or not to return the result of querying the graph directly."""

    @property
    def input_keys(self) -> List[str]:
        """Return the input keys.

        :meta private:
        """
        return [self.input_key]

    @property
    def output_keys(self) -> List[str]:
        """Return the output keys.

        :meta private:
        """
        _output_keys = [self.output_key]
        return _output_keys

    @property
    def _chain_type(self) -> str:
        return "graph_cypher_chain"

    @classmethod
    def from_llm(
        cls,
        llm: BaseLanguageModel,
        *,
        qa_prompt: BasePromptTemplate = CYPHER_QA_PROMPT,
        cypher_prompt: BasePromptTemplate = CYPHER_GENERATION_PROMPT,
        check_prompt: BasePromptTemplate = CHECK_SCHEMA_PROMPT,
        **kwargs: Any,
    ) -> GraphCypherQAChain:
        """Initialize from LLM."""
        qa_chain = LLMChain(llm=llm, prompt=qa_prompt)
        cypher_generation_chain = LLMChain(llm=llm, prompt=cypher_prompt)
        check_schema_chain = LLMChain(llm=llm, prompt=check_prompt)

        return cls(
            qa_chain=qa_chain,
            cypher_generation_chain=cypher_generation_chain,
            check_schema_chain=check_schema_chain,
            **kwargs,
        )

    def _call(
        self,
        inputs: Dict[str, Any],
        run_manager: Optional[CallbackManagerForChainRun] = None,
    ) -> Dict[str, Any]:
        """Generate a Cypher statement, use it to query the graph, and answer the question."""
        _run_manager = run_manager or CallbackManagerForChainRun.get_noop_manager()
        callbacks = _run_manager.get_child()
        question = inputs[self.input_key]

        intermediate_steps: List = []

        # Generate a candidate Cypher statement from the question, the graph
        # schema, and the few-shot cases.
        generated_cypher = self.cypher_generation_chain.run(
            {"question": question, "schema": self.graph.schema, "cases": Cases},
            callbacks=callbacks,
        )

        # Extract the Cypher code from the generated text
        generated_cypher = extract_cypher(generated_cypher)
        # Ask the LLM to double-check the statement against the schema.
        generated_cypher = self.check_schema_chain.run(
            {"query": generated_cypher, "schema": self.graph.schema},
            callbacks=callbacks,
        )
        generated_cypher = extract_cypher(generated_cypher)

        _run_manager.on_text("Generated Cypher:", end="\n", verbose=self.verbose)
        _run_manager.on_text(
            generated_cypher, color="green", end="\n", verbose=self.verbose
        )

        intermediate_steps.append({"query": generated_cypher})

        # context = graph_interface.execute(generated_cypher, lang="cypher")
        # intermediate_steps.append({"context": context})

        # final_result = context

        chain_result: Dict[str, Any] = {self.output_key: generated_cypher}
        if self.return_intermediate_steps:
            chain_result[INTERMEDIATE_STEPS_KEY] = intermediate_steps

        return chain_result