
feat: Qdrant as a supported knowledge base (#244)
* feat: QdrantKnowledgeBase

* feat: Async QdrantKnowledgeBase

* test: updated async tests

* test: async tests, refactor

* chore: linting

* docs: Added QdrantKnowledgeBase docstrings

* chore: added QdrantKnowledgeBase.from_config()

* docs: fix typos

* chore: Bumped qdrant_client 1.7.2

* chore: resolve typings, default pytest-dotenv

* chore: optional import qdrant_client

* chore: Use distance TitleCase as docs

* docs: Qdrant reference library.md

* chore: Bump qdrant_client pyproject.toml
Anush008 authored Mar 27, 2024
1 parent 73338bb commit b835eb2
Showing 14 changed files with 1,757 additions and 27 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -33,7 +33,7 @@ Canopy has two flows: knowledge base creation and chat. In the knowledge base cr
1. **Canopy Core Library** - The library has 3 main classes that are responsible for different parts of the RAG workflow:
* **ChatEngine** - Exposes a chat interface to interact with your data. Given the history of chat messages, the `ChatEngine` formulates relevant queries to the `ContextEngine`, then uses the LLM to generate a knowledgeable response.
* **ContextEngine** - Performs the “retrieval” part of RAG. The `ContextEngine` utilizes the underlying `KnowledgeBase` to retrieve the most relevant documents, then formulates a coherent textual context to be used as a prompt for the LLM.
* **KnowledgeBase** - Manages your data for the RAG workflow. It automatically chunks and transforms your text data into text embeddings, storing them in a Pinecone vector database. Given a text query - the `KnowledgeBase` will retrieve the most relevant document chunks from the database.
* **KnowledgeBase** - Manages your data for the RAG workflow. It automatically chunks and transforms your text data into text embeddings, storing them in a Pinecone (default) or Qdrant vector database. Given a text query, the knowledge base will retrieve the most relevant document chunks from the database.


> More information about the Core Library usage can be found in the [Library Documentation](docs/library.md)
@@ -67,11 +67,12 @@ pip install canopy-sdk
### Extras

| Name | Description |
|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `grpc` | To unlock some performance improvements by working with the GRPC version of the [Pinecone Client](https://github.com/pinecone-io/pinecone-python-client) |
| `torch` | To enable embeddings provided by [sentence-transformers](https://www.sbert.net/) |
| `transformers` | If you are using Anyscale LLMs, it's recommended to use the `LLamaTokenizer` tokenizer, which requires `transformers` as a dependency |
| `cohere`       | To use the Cohere reranker and/or the Cohere LLM |
| `qdrant` | To use [Qdrant](http://qdrant.tech/) as an alternate knowledge base |
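
For example, multiple extras can be installed together; quoting the package spec keeps the shell from expanding the square brackets. The extras names below are taken from the table above:

```bash
# Install the gRPC, Cohere and Qdrant extras in one go
pip install "canopy-sdk[grpc,cohere,qdrant]"
```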

</details>

80 changes: 56 additions & 24 deletions docs/library.md
@@ -1,21 +1,21 @@
# Canopy Library

For most common use cases, users can simply deploy the fully-configurable [Canopy service](../README.md), which provides a REST API backend for your own RAG-infused Chatbot.
For most common use cases, users can simply deploy the fully configurable [Canopy service](../README.md), which provides a REST API backend for their own RAG-infused Chatbot.

For advanced users, this page describes how to use `canopy` core library directly to implement their own custom applications.
For advanced users, this page describes how to use the `canopy` core library directly to implement their custom applications.

> **_💡 NOTE:_** You can also follow the quickstart Jupyter [notebook](../examples/canopy-lib-quickstart.ipynb)
The idea behind Canopy library is to provide a framework to build AI applications on top of Pinecone as a long memory storage for you own data. Canopy library designed with the following principles in mind:
The idea behind Canopy is to provide a framework to build AI applications on top of Pinecone as a long-memory storage for your own data. Canopy is designed with the following principles in mind:

- **Easy to use**: Canopy is designed to be easy to use. It is well packaged and can be installed with a single command.
- **Modularity**: Canopy is built as a collection of modules that can be used together or separately. For example, you can use the `chat_engine` module to build a chatbot on top of your data, or you can use the `knowledge_base` module to directly store and search your data.
- **Extensibility**: Canopy is designed to be extensible. You can easily add your own components and extend the functionality.
- **Production ready**: Canopy designed to be production ready, tested, well documented, maintained and supported.
- **Open source**: Canopy is open source and free to use. It built in partnership with the community and for the community.
- **Easy to use**: Canopy is designed to be easy. It is well-packaged and can be installed with a single command.
- **Modularity**: Canopy is built as a collection of modules that can be used together or separately. For example, you can use the `chat_engine` module to build a chatbot on top of your data or the `knowledge_base` module to store and search your data directly.
- **Extensibility**: Canopy is designed to be extensible. You can easily add your components and extend the functionality.
- **Production-ready**: Canopy is designed to be production-ready, tested, well-documented, maintained and supported.
- **Open-source**: Canopy is open-source and free to use. It is built in partnership with the community and for the community.


## High level architecture
## High-level architecture

![class architecture](../.readme-content/class_architecture.png)

@@ -59,9 +59,9 @@ os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"

### Step 1: Initialize global Tokenizer

The `Tokenizer` object is used for converting text into tokens, which is the basic data represntation that is used for processing.
The `Tokenizer` object is used for converting text into tokens, which is the basic data representation that is used for processing.

Since manny different classes rely on a tokenizer, Canopy uses a singleton `Tokenizer` object which needs to be initialized once.
Since many different classes rely on a tokenizer, Canopy uses a singleton `Tokenizer` object which needs to be initialized once.

Before instantiating any other canopy core objects, please initialize the `Tokenizer` singleton:
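
A minimal sketch of that initialization, assuming the singleton exposes an `initialize()` class method and that calling `Tokenizer()` afterwards returns the shared instance:

```python
from canopy.tokenizer import Tokenizer

# One-time global initialization (assumed API); later Tokenizer() calls
# are expected to reuse the same underlying tokenizer instance.
Tokenizer.initialize()

tokenizer = Tokenizer()
print(tokenizer.tokenize("Hello world!"))
```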

@@ -84,13 +84,13 @@ tokenizer.tokenize("Hello world!")

This is because the `tokenizer` object created here is the same instance that you initialized at the beginning of this subsection.

By default, the global tokenizer is initialized with `OpenAITokenizer` that is based on OpenAI's tiktoken library and aligned with GPT 3 and 4 models tokenization.
By default, the global tokenizer is initialized with `OpenAITokenizer`, which is based on OpenAI's Tiktoken library and aligned with the tokenization of the GPT-3 and GPT-4 models.

<details>
<summary>👉 Click here to understand how you can configure and customize the tokenizer</summary>
The `Tokenizer` singleton holds an inner `Tokenizer` object that implements `BaseTokenizer`.

You can create your own customized tokenizer by implementing a new class that derives from `BaseTokenizer`, then passing this class to the `Tokenizer` singleton during initialization. Example:
You can create your own customized tokenizer by implementing a new class that derives from `BaseTokenizer`, and then passing this class to the `Tokenizer` singleton during initialization. Example:
```python
from canopy.tokenizer import Tokenizer, BaseTokenizer

@@ -114,7 +114,7 @@ Will initialize the global tokenizer with `OpenAITokenizer` and will pass the `m


### Step 2: Create a knowledge base
Knowledge base is an object that is responsible for storing and query your data. It holds a connection to a single Pinecone index and provides a simple API to insert, delete and search textual documents.
The knowledge base is an object that is responsible for storing and querying your data. It holds a connection to a single Pinecone index and provides a simple API to insert, delete and search textual documents.

To create a knowledge base, you can use the following command:
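
A minimal sketch of what that looks like, assuming the default `KnowledgeBase` constructor takes an `index_name` argument (the name is illustrative):

```python
from canopy.knowledge_base import KnowledgeBase

# Assumed constructor parameter; use the name of your own Canopy index.
kb = KnowledgeBase(index_name="my-index")
```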

@@ -140,7 +140,7 @@ To create a new Pinecone index and connect it to the knowledge base, you can use
kb.create_canopy_index()
```

Then, you will be able to mange the index in Pinecone [console](https://app.pinecone.io/).
Then, you will be able to manage the index in Pinecone [console](https://app.pinecone.io/).

If you already created a Pinecone index, you can connect it to the knowledge base with the `connect` method:

@@ -154,7 +154,39 @@ You can always verify the connection to the Pinecone index with the `verify_inde
kb.verify_index_connection()
```

To learn more about customizing the KnowledgeBase and its inner components, see [understanding knowledgebase workings section](#understanding-knowledgebase-workings).
#### Using Qdrant as a knowledge base

Canopy supports [Qdrant](https://qdrant.tech) as an alternative knowledge base. To use Qdrant with Canopy, install the `qdrant` extra.

```bash
pip install canopy-sdk[qdrant]
```

The Qdrant knowledge base is accessible via the `QdrantKnowledgeBase` class.

```python
from canopy.knowledge_base import QdrantKnowledgeBase

kb = QdrantKnowledgeBase(collection_name="<YOUR_COLLECTION_NAME>")
```

The constructor accepts additional [options](https://github.com/qdrant/qdrant-client/blob/eda201a1dbf1bbc67415f8437a5619f6f83e8ac6/qdrant_client/qdrant_client.py#L36-L61) to customize your connection to Qdrant.
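
For example, a hedged sketch of connecting to a remote Qdrant instance, assuming that connection options such as `url` and `api_key` are forwarded to the underlying `QdrantClient`:

```python
# Sketch only: `url` and `api_key` are QdrantClient options assumed to be
# passed through by the QdrantKnowledgeBase constructor.
kb = QdrantKnowledgeBase(
    collection_name="<YOUR_COLLECTION_NAME>",
    url="https://<YOUR_QDRANT_HOST>:6333",
    api_key="<QDRANT_API_KEY>",
)
```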

To create a new Qdrant collection and connect it to the knowledge base, use the `create_canopy_collection` method:

```python
kb.create_canopy_collection()
```

The method accepts additional [options](https://github.com/qdrant/qdrant-client/blob/c63c62e6df9763591622d1921b3dfcc486666481/qdrant_client/qdrant_remote.py#L2137-L2150) to configure the collection to be created.
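
For instance, a sketch under the assumption that collection-level settings such as `on_disk_payload` (a Qdrant `create_collection` option) are forwarded by this method:

```python
# Assumed pass-through of Qdrant collection options; adjust to your deployment.
kb.create_canopy_collection(on_disk_payload=True)
```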

You can always verify the connection to the collection with the `verify_index_connection` method:

```python
kb.verify_index_connection()
```

To learn more about customizing the KnowledgeBase and its inner components, see the [understanding KnowledgeBase workings section](#understanding-knowledgebase-workings).

### Step 3: Upsert and query data

@@ -190,9 +222,9 @@ print(f"score - {results[0].documents[0].score:.4f}")
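
For reference, a minimal upsert-and-query sketch, assuming the `Document` and `Query` data models from `canopy.models.data_models` and that `upsert` and `query` accept lists of them (all field values are illustrative):

```python
from canopy.models.data_models import Document, Query

# Assumed data models; `source` and `metadata` are optional descriptive fields.
docs = [
    Document(
        id="doc-1",
        text="Canopy chunks and embeds documents before storing them.",
        source="https://example.com/canopy-notes",  # hypothetical source URL
        metadata={"topic": "rag"},
    )
]
kb.upsert(docs)

# `top_k` is assumed to be a supported Query field.
results = kb.query([Query(text="How does Canopy store documents?", top_k=3)])
print(results[0].documents[0].text)
print(f"score - {results[0].documents[0].score:.4f}")
```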

### Step 4: Create a context engine

Context engine is an object that responsible to retrieve the most relevant context for a given query and token budget.
The context engine is an object that is responsible for retrieving the most relevant context for a given query and token budget.
The context engine first uses the knowledge base to retrieve the most relevant documents. Then, it formalizes the textual context that will be presented to the LLM. This textual context might be structured or unstructured, depending on the use case and configuration.
The output of the context engine is designed to provide the LLM the most relevant context for a given query.
The output of the context engine is designed to provide the LLM with the most relevant context for a given query.


To create a context engine using a knowledge base, you can use the following command:
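
A minimal sketch, assuming `ContextEngine` is importable from `canopy.context_engine`, takes the knowledge base as its first argument, and exposes a `query()` method with a token budget:

```python
from canopy.context_engine import ContextEngine
from canopy.models.data_models import Query  # assumed query model

# `kb` is the knowledge base created in Step 2.
context_engine = ContextEngine(kb)

# Assumed signature: a list of queries plus a token budget for the returned context.
context = context_engine.query([Query(text="What is Canopy?")], max_context_tokens=512)
print(context.to_text())
```
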
@@ -243,7 +275,7 @@ TBD

### Step 5: Create a chat engine

Chat engine is an object that implements end to end chat API with [RAG](https://www.pinecone.io/learn/retrieval-augmented-generation/).
The chat engine is an object that implements an end-to-end chat API with [RAG](https://www.pinecone.io/learn/retrieval-augmented-generation/).
Given chat history, the chat engine orchestrates its underlying context engine and LLM to run the following steps:

1. Generate search queries from the chat history
@@ -270,8 +302,8 @@ print(response.choices[0].message.content)
```
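
For completeness, a hedged end-to-end sketch, assuming a `ChatEngine` built on top of the context engine from Step 4 and a `UserMessage` model in `canopy.models.data_models`:

```python
from canopy.chat_engine import ChatEngine
from canopy.models.data_models import UserMessage

# Assumed constructor and chat() signature; the question is illustrative.
chat_engine = ChatEngine(context_engine)

response = chat_engine.chat(
    messages=[UserMessage(content="What is Canopy used for?")],
    stream=False,
)
print(response.choices[0].message.content)
```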


Canopy designed to be production ready and handle any conversation length and context length. Therefore, the chat engine uses internal components to handle long conversations and long contexts.
By default, long chat history is truncated to the latest messages that fits the token budget. It orchestrates the context engine to retrieve context that fits the token budget and then use the LLM to generate the next response.
Canopy is designed to be production-ready and to handle any conversation length and context length. Therefore, the chat engine uses internal components to handle long conversations and long contexts.
By default, long chat history is truncated to the latest messages that fit the token budget. It orchestrates the context engine to retrieve context that fits the token budget and then uses the LLM to generate the next response.


<details>
@@ -282,10 +314,10 @@ TBD

## Understanding KnowledgeBase workings

The knowledge base is an object that is responsible for storing and query your data. It holds a connection to a single Pinecone index and provides a simple API to insert, delete and search textual documents.
The knowledge base is an object that is responsible for storing and querying your data. It holds a connection to a single Pinecone index and provides a simple API to insert, delete and search textual documents.

### Upsert workflow
The `upsert` method is used to insert of update textual documents of any size into the knowledge base. For each document, the following steps are performed:
The `upsert` method is used to insert or update textual documents of any size into the knowledge base. For each document, the following steps are performed:

1. The document is chunked into smaller pieces of text; each piece is called a `Chunk`.
2. Each chunk is encoded into a vector representation.
@@ -308,7 +340,7 @@ The knowledge base is composed of the following components:
- **Chunker**: A `Chunker` object that is used to chunk the documents into smaller pieces of text.
- **Encoder**: A `RecordEncoder` object that is used to encode the chunks and queries into vector representations.

By default the knowledge base is initialized with `OpenAIRecordEncoder` which uses OpenAI embedding API to encode the text into vector representations, and `MarkdownChunker` which is based on a cloned version of Langchain's `MarkdownTextSplitter` [chunker](https://github.com/langchain-ai/langchain/blob/95a1b598fefbdb4c28db53e493d5f3242129a5f2/libs/langchain/langchain/text_splitter.py#L1374C7-L1374C27).
By default, the knowledge base is initialized with `OpenAIRecordEncoder` which uses OpenAI embedding API to encode the text into vector representations, and `MarkdownChunker` which is based on a cloned version of Langchain's `MarkdownTextSplitter` [chunker](https://github.com/langchain-ai/langchain/blob/95a1b598fefbdb4c28db53e493d5f3242129a5f2/libs/langchain/langchain/text_splitter.py#L1374C7-L1374C27).


You can customize each component by passing any instance of `Chunker` or `RecordEncoder` to the `KnowledgeBase` constructor.
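
A hedged example, assuming the built-in components live under `canopy.knowledge_base` submodules and the constructor parameters are named `chunker` and `record_encoder`:

```python
from canopy.knowledge_base import KnowledgeBase
from canopy.knowledge_base.chunker import MarkdownChunker  # assumed module path
from canopy.knowledge_base.record_encoder import OpenAIRecordEncoder  # assumed module path

# Swap in any Chunker / RecordEncoder implementation, including your own.
kb = KnowledgeBase(
    index_name="my-index",
    chunker=MarkdownChunker(),
    record_encoder=OpenAIRecordEncoder(),
)
```
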
6 changes: 5 additions & 1 deletion pyproject.toml
@@ -32,6 +32,7 @@ transformers = {version = "^4.35.2", optional = true}
sentencepiece = "^0.1.99"
pandas = "2.0.0"
pyarrow = "^14.0.1"
qdrant-client = {version = "^1.8.0", optional = true}
cohere = { version = "^4.37", optional = true }


@@ -60,6 +61,7 @@ cohere = ["cohere"]
torch = ["torch", "sentence-transformers"]
transformers = ["transformers"]
grpc = ["grpcio", "grpc-gateway-protoc-gen-openapiv2", "googleapis-common-protos", "lz4", "protobuf"]
qdrant = ["qdrant-client"]


[tool.poetry.group.dev.dependencies]
@@ -96,7 +98,9 @@ module = [
'tokenizers.*',
'cohere.*',
'pinecone.grpc',
'huggingface_hub.utils'
'huggingface_hub.utils',
'qdrant_client.*',
'grpc.*'
]
ignore_missing_imports = true

1 change: 1 addition & 0 deletions src/canopy/knowledge_base/__init__.py
@@ -1,2 +1,3 @@
from .knowledge_base import list_canopy_indexes
from .knowledge_base import KnowledgeBase
from .qdrant.qdrant_knowledge_base import QdrantKnowledgeBase
7 changes: 7 additions & 0 deletions src/canopy/knowledge_base/qdrant/constants.py
@@ -0,0 +1,7 @@
from canopy.knowledge_base.knowledge_base import INDEX_NAME_PREFIX

COLLECTION_NAME_PREFIX = INDEX_NAME_PREFIX
DENSE_VECTOR_NAME = "dense"
RESERVED_METADATA_KEYS = {"document_id", "text", "source", "chunk_id"}
SPARSE_VECTOR_NAME = "sparse"
UUID_NAMESPACE = "867603e3-ba69-447d-a8ef-263dff19bda7"
102 changes: 102 additions & 0 deletions src/canopy/knowledge_base/qdrant/converter.py
@@ -0,0 +1,102 @@
from copy import deepcopy
from typing import Dict, List, Any, Union
import uuid
from canopy.knowledge_base.models import (
KBDocChunkWithScore,
KBEncodedDocChunk,
KBQuery,
VectorValues,
)
from pinecone_text.sparse import SparseVector

try:
from qdrant_client import models
except ImportError:
pass

from canopy.knowledge_base.qdrant.constants import (
DENSE_VECTOR_NAME,
SPARSE_VECTOR_NAME,
UUID_NAMESPACE,
)


class QdrantConverter:
@staticmethod
def convert_id(_id: str) -> str:
"""
Converts any string into a UUID string based on a seed.
Qdrant accepts UUID strings and unsigned integers as point ID.
We use a seed to convert each string into a UUID string deterministically.
This allows us to overwrite the same point with the original ID.
"""
return str(uuid.uuid5(uuid.UUID(UUID_NAMESPACE), _id))

@staticmethod
def encoded_docs_to_points(
encoded_docs: List[KBEncodedDocChunk],
) -> "List[models.PointStruct]":
points = []
for doc in encoded_docs:
record = doc.to_db_record()
_id: str = record.pop("id")
dense_vector: VectorValues = record.pop("values", None)
sparse_vector: SparseVector = record.pop("sparse_values", None)

vector: Dict[str, models.Vector] = {}

if dense_vector:
vector[DENSE_VECTOR_NAME] = dense_vector

if sparse_vector:
vector[SPARSE_VECTOR_NAME] = models.SparseVector(
indices=sparse_vector["indices"],
values=sparse_vector["values"],
)

points.append(
models.PointStruct(
id=QdrantConverter.convert_id(_id),
vector=vector,
payload={**record["metadata"], "chunk_id": _id},
)
)
return points

@staticmethod
def scored_point_to_scored_doc(
scored_point,
) -> "KBDocChunkWithScore":
metadata: Dict[str, Any] = deepcopy(scored_point.payload or {})
_id = metadata.pop("chunk_id")
text = metadata.pop("text", "")
document_id = metadata.pop("document_id")
return KBDocChunkWithScore(
id=_id,
text=text,
document_id=document_id,
score=scored_point.score,
source=metadata.pop("source", ""),
metadata=metadata,
)

@staticmethod
def kb_query_to_search_vector(
query: KBQuery,
) -> "Union[models.NamedVector, models.NamedSparseVector]":
# Use dense vector if available, otherwise use sparse vector
query_vector: Union[models.NamedVector, models.NamedSparseVector]
if query.values:
query_vector = models.NamedVector(name=DENSE_VECTOR_NAME, vector=query.values) # noqa: E501
elif query.sparse_values:
query_vector = models.NamedSparseVector(
name=SPARSE_VECTOR_NAME,
vector=models.SparseVector(
indices=query.sparse_values["indices"],
values=query.sparse_values["values"],
),
)
else:
raise ValueError("Query should have either dense or sparse vector.")
return query_vector
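
A small usage sketch of the deterministic ID mapping implemented above; because `uuid.uuid5` is seeded with a fixed namespace, converting the same chunk ID always yields the same Qdrant point ID:

```python
from canopy.knowledge_base.qdrant.converter import QdrantConverter

# The same input ID maps to the same UUID string, so re-upserting a chunk
# overwrites its existing point instead of creating a duplicate.
first = QdrantConverter.convert_id("doc-1_chunk-0")
second = QdrantConverter.convert_id("doc-1_chunk-0")
assert first == second
print(first)
```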