initial commit
Nick committed Aug 1, 2024
1 parent 6fdddd9 commit 9b1eb8b
Showing 36 changed files with 4,697 additions and 2 deletions.
65 changes: 65 additions & 0 deletions .github/workflows/matrix-test.yml
@@ -0,0 +1,65 @@
name: Python Test Workflow

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

jobs:
  unit-tests:
    name: Run Unit Tests
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # NOTE: macOS runners use 10x the GitHub-hosted CI minutes, so be careful with them
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ['3.9', '3.10', '3.11']

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v2

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run Unit Tests
        run: |
          python -m unittest discover -s tests/unit

  integration-tests:
    name: Run Integration Tests
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        # NOTE: macOS runners use 10x the GitHub-hosted CI minutes, so be careful with them
        os: [ubuntu-latest] #, macos-latest, windows-latest]
        python-version: ['3.9'] #, '3.10', '3.11']

    steps:
      - name: Checkout Repository
        uses: actions/checkout@v2

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run Integration Tests
        run: |
          python -m unittest discover -s tests/integration
3 changes: 3 additions & 0 deletions .gitignore
@@ -160,3 +160,6 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# DS Store files
.DS_Store
53 changes: 51 additions & 2 deletions README.md
@@ -1,2 +1,51 @@
# minDB - an extremely memory-efficient vector database
Existing open-source vector databases are built on HNSW indexes that must be held entirely in memory to be used. Holding the full index in RAM consumes an extremely large amount of memory, which severely limits the size of vector DBs that can be used locally and creates very high costs for cloud deployments.

It’s possible to build a vector database with extremely low memory requirements that still has high recall and low latency. The key is to use a highly compressed search index, combined with reranking from disk, as demonstrated in the [Zoom](https://arxiv.org/abs/1809.04067) paper. This project implements the core technique introduced in that paper. We also implement a novel adaptation of Faiss's two-level k-means clustering algorithm that only requires a small subset of vectors to be held in memory at any given time.
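
To give a sense of the memory-bounded clustering idea, here is a minimal sketch, not minDB's actual implementation: train coarse centroids on a small in-memory subsample, then refine within each coarse cell. The function name, parameters, and subsampling strategy are illustrative assumptions.

```python
import numpy as np
import faiss

def two_level_kmeans(sample: np.ndarray, n_coarse: int, n_fine: int) -> np.ndarray:
    """Cluster a small random subsample in two levels. Only `sample` (not the
    full dataset) ever needs to be held in memory."""
    sample = np.ascontiguousarray(sample, dtype="float32")
    d = sample.shape[1]

    # Level 1: coarse clustering on the in-memory subsample
    coarse = faiss.Kmeans(d, n_coarse, niter=20)
    coarse.train(sample)

    # Assign each sampled vector to its nearest coarse centroid
    _, assignments = coarse.index.search(sample, 1)
    assignments = assignments.ravel()

    # Level 2: refine within each coarse cell
    all_centroids = []
    for c in range(n_coarse):
        members = sample[assignments == c]
        if len(members) <= n_fine:
            all_centroids.append(members)  # cell too small to subdivide
            continue
        fine = faiss.Kmeans(d, n_fine, niter=10)
        fine.train(members)
        all_centroids.append(fine.centroids)
    return np.concatenate(all_centroids)
```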

With minDB, you can index and query 100M 768d vectors with peak memory usage of around 3GB. With an in-memory vector DB, you would need ~340GB of RAM. This means you could easily index and query all of Wikipedia on an average MacBook.
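
As a quick back-of-the-envelope check on those numbers:

```python
# Raw float32 storage for 100M 768-dimensional vectors
n_vectors, dims, bytes_per_float32 = 100_000_000, 768, 4
raw_gb = n_vectors * dims * bytes_per_float32 / 1e9
print(f"~{raw_gb:.0f} GB")  # ~307 GB raw, hence ~340GB once index overhead is included
```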

**Note:** This is currently in a beta phase and has not been fully tested in a production environment. There may be bugs, untested edge cases, or limitations not listed here, and there may be breaking changes in the future.

## Architecture overview
minDB uses a two-step process to perform approximate nearest neighbors search. First, a highly compressed Faiss index is searched to find the `preliminary_top_k` (set to 500 by default) results. Then the full uncompressed vectors for these results are retrieved from a key-value store on disk, and a k-nearest neighbors search is performed on these vectors to arrive at the `final_top_k` results.
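
Here is a minimal sketch of that two-step search. This is a simplified illustration, not minDB's internals: `compressed_index` is assumed to be a trained Faiss index, and `vector_store` a hypothetical on-disk key-value store mapping id to full vector.

```python
import numpy as np

def two_step_search(compressed_index, vector_store, query_vector,
                    preliminary_top_k=500, final_top_k=10):
    query = np.asarray(query_vector, dtype="float32").reshape(1, -1)

    # Step 1: approximate search over the highly compressed index
    _, ids = compressed_index.search(query, preliminary_top_k)
    candidate_ids = [i for i in ids[0] if i != -1]  # -1 pads missing results

    # Step 2: fetch full uncompressed vectors from disk and rerank exactly
    full_vectors = np.stack([vector_store[i] for i in candidate_ids])
    distances = np.linalg.norm(full_vectors - query, axis=1)
    order = np.argsort(distances)[:final_top_k]
    return [(candidate_ids[j], float(distances[j])) for j in order]
```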

## Basic usage guide

Clone the repo and run `pip install -r requirements.txt` to install all of the necessary packages.

For a quickstart guide, check out our getting started example [here](https://github.com/SuperpoweredAI/minDB/blob/main/examples/getting_started.ipynb).

By default, all minDB databases are saved to the `~/.spdb` directory. This directory is created automatically when you initialize a minDB object if it doesn’t already exist. You can override this path by specifying a `save_path` when you create your minDB object.

## Adding and removing items
To add vectors to your database, call the `/db/{db_name}/add` endpoint, or use the `db.add()` method. This takes a list of `(vector, metadata)` tuples, where each vector is itself a list, and each metadata item is a dictionary with keys of your choosing.

To remove items, pass in the ids of the vectors you want removed: a list of ids when using FastAPI, or an array of ids when using minDB directly. A sketch of both operations follows.
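
Here is a hedged sketch of adding and removing items through the Python API. The `minDB(...)` constructor arguments and the `remove` method name are assumptions; the `(vector, metadata)` tuple format for `add` is described above.

```python
from mindb.mindb import minDB

db = minDB(name="my_db")  # constructor arguments are an assumption

# Each item is a (vector, metadata) tuple; vectors are plain lists
items = [
    ([0.1, 0.2, 0.3], {"text": "first document"}),
    ([0.4, 0.5, 0.6], {"text": "second document"}),
]
db.add(items)

# Removal takes the ids of the vectors to delete (method name is an assumption)
db.remove([0])
```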

## How the index is trained
The search index is trained automatically once the number of vectors exceeds 25,000 (this parameter is configurable in the params.py file). Subsequent training operations occur once the index coverage ratio drops below 0.5 (also configurable). You can also train a database at any point by calling the `/db/{db_name}/train` endpoint, or the `db.train()` method if you're not running FastAPI. If there are fewer than 5,000 vectors in a database, the training operation will be skipped and a flat index will be used instead.
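
Calling training directly might look like the following. The keyword names mirror the REST parameters shown in the FastAPI example notebook; passing them to `db.train()` is an assumption, and the values are illustrative.

```python
db.train(
    use_two_level_clustering=True,  # illustrative; the example notebook uses False
    pca_dimension=256,              # reduce vectors to 256 dimensions before quantization
    compressed_vector_bytes=32,     # size of each compressed vector code
    omit_opq=True,                  # skip the OPQ rotation to speed up training
)
```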

We don't recommend setting the training threshold much higher than 25,000 vectors, due to increased memory usage and query latency. Untrained indexes use a flat Faiss index, which means the full uncompressed vectors are held in memory. Searches over a flat index are done via brute force, which gets substantially slower as the number of vectors increases.

## Metadata
You can add metadata to each vector by including a metadata dictionary. You can include whatever metadata fields you want, but the keys and values should all be serializable.

Metadata filtering is the next major feature that will be added. This will allow you to use SQL-like statements to control which items get searched over.

## FastAPI server deployment
To deploy your database as a server with a REST API, you can make use of the `fastapi.py` file in the `api` directory. To start the server, open a terminal and run the following command (you must be in the main minDB directory):
`uvicorn api.fastapi:app --host 0.0.0.0 --port 8000`
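
Once the server is running, you can exercise the REST API with any HTTP client, for example (endpoint and request body as used in the FastAPI example notebook):

```python
import requests

# Create a new database on the running server
response = requests.post("http://localhost:8000/db/create", json={"name": "my_db"})
print(response.text)
```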

For more detail, you can check out our FastAPI tutorial [here](https://github.com/D-Star-AI/minDB/tree/main/examples/fastapi_example.ipynb).
You can also learn more about FastAPI [here](https://fastapi.tiangolo.com).

## Limitations
- One of the main dependencies, Faiss, doesn't play nice with Apple M1/M2 chips. You may be able to get it to work by building it from source, but we haven't successfully done so yet.
- We haven't tested it on datasets larger than 35M vectors yet. It should still work well up to 100-200M vectors, but beyond that performance may start to deteriorate.

## Additional documentation
- [Tunable parameters](https://github.com/D-Star-AI/minDB/wiki/Tunable-parameters)
- [Contributing](https://github.com/D-Star-AI/minDB/wiki/Contributing)
- [Examples](https://github.com/D-Star-AI/minDB/tree/main/examples)
192 changes: 192 additions & 0 deletions examples/fastapi_example.ipynb
@@ -0,0 +1,192 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## FastAPI Tutorial\n",
"\n",
"This requires uvicorn and fastapi to be installed by running\n",
"\n",
"`pip install fastapi uvicorn`\n",
"\n",
"In order to start the FastAPI, open up a terminal and run the following command (This must be done from the root directory of this project):\n",
"\n",
"`uvicorn api.fastapi:app --host 0.0.0.0 --port 8000`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup the environment\n",
"\n",
"Load in the necessary packages and append the paths needed"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"import pickle\n",
"import sys\n",
"import os\n",
"\n",
"# Load in minDB from the local directory\n",
"current_dir = os.getcwd()\n",
"sys.path.append(current_dir + \"/../\")\n",
"sys.path.append(current_dir + \"/../tests/integration/\")\n",
"\n",
"from mindb.mindb import minDB\n",
"from tests.data import helpers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load in the Fiqa test data\n",
"vectors, text, queries, _ = helpers.fiqa_test_data()\n",
"with open(current_dir + \"/../tests/data/fiqa_queries_text.pickle\", \"rb\") as f:\n",
" query_text = pickle.load(f)\n",
"# Vectors needs to be a list when using FastAPI\n",
"vectors = vectors.tolist()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create the minDB object"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a new minDB\n",
"\n",
"db_name = \"fast_api_test\"\n",
"url = \"http://0.0.0.0:8000/db/create\"\n",
"response = requests.post(url, json={\"name\": db_name})\n",
"print (response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Add data to the minDB object\n",
"\n",
"Adding data to the minDB object using FastAPI must be done in batches. We recommend using a batch size of ~100. Pushing this number too high will result in a failure\n",
"\n",
"The data must also be a list. Numpy arrays are not a valid data type for FastAPI"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Add the data to the minDB in batches of 100\n",
"batch_size = 1000\n",
"data = [(vectors[i], {\"text\": text[i]}) for i in range(len(vectors))]\n",
"\n",
"url = f\"http://0.0.0.0:8000/db/{db_name}/add\"\n",
"\n",
"for i in range(0, 10000, batch_size):\n",
" print (i)\n",
" add_data = data[i:i+batch_size]\n",
" response = requests.post(url, json={\"add_data\": add_data})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train the minDB object\n",
"\n",
"For this example, we are using PCA 256, compressed vector bytes of 32, and omitting OPQ\n",
"\n",
"For more information on these parameters, you can visit the Github Wiki [here](https://github.com/D-Star-AI/minDB/wiki/Tunable-parameters)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Train the minDB\n",
"\n",
"url = f\"http://0.0.0.0:8000/db/{db_name}/train\"\n",
"response = requests.post(url, json={\n",
" \"use_two_level_clustering\": False,\n",
" \"pca_dimension\": 256,\n",
" \"compressed_vector_bytes\": 32,\n",
" \"omit_opq\": True\n",
"})\n",
"print (response.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Query the trained index\n",
"\n",
"Make a test query using the `query` endpoint. The query vector must be converted to a list"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = f\"http://0.0.0.0:8000/db/{db_name}/query\"\n",
"query_vector = queries[0].tolist()\n",
"response = requests.post(url, json={\"query_vector\": query_vector})\n",
"\n",
"print (\"Query text:\", query_text[0])\n",
"print (\"\")\n",
"print (response.json()[\"metadata\"][0][\"text\"])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.9.12 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "6493f77fe247bf2d19f0ba28dd5345ab8e8eb3b6587168c5c28be0d535e3568d"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}