In this project, we'll go through the process of setting up ChromaDB, loading data from a JSON file, and performing simple queries to get similar products. You can expand upon this foundation to build more advanced data management systems for your own applications.
ChromaDB is a vector database for storing and querying embeddings, such as the text embeddings of product descriptions. It is a good choice for applications that need fast similarity search over that kind of data.
To get started, you need to install ChromaDB using pip:
pip install chromadb
This project consists of the following files:
app.py: The main Python script that demonstrates how to use ChromaDB to store and query data.
data.json: A sample JSON file containing data for the project. You can replace it with your own data.
README.md: This README file, which provides instructions and information about the project.
import chromadb
import json
We need to import the chromadb library to interact with ChromaDB and the json library to read and write JSON data.
Next, we need to create a ChromaDB client:
client = chromadb.PersistentClient(path="./chroma")
The PersistentClient class creates a persistent ChromaDB database that is stored on disk. We specify the path to the database directory as the path argument.
Now, we need to create a collection to store the product data:
collection = client.create_collection(name="product_data")
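A collection is an object in ChromaDB that groups documents together with their IDs, embeddings, and metadata. It is similar to a table in a relational database. Note that create_collection() raises an error if a collection with that name already exists, so rerunning the script against the same ./chroma directory fails at this step; get_or_create_collection() avoids that.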
Next, we need to load the product data from the JSON file:
with open("data.json", "r") as f:
    json_data = json.load(f)
The open() function opens the JSON file for reading. The json.load() function reads the JSON data from the file and loads it into a Python object.
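The loading step assumes that data.json holds a JSON array of product objects, each with an id field. As a rough sketch of that shape, the snippet below writes two records (copied from the sample output in this README) to a temporary file and loads them back the same way. The temporary path is only there so the sketch does not touch the project's real data.json.

```python
import json
import os
import tempfile

# Two records taken from the sample output in this README. A real data.json
# is assumed to be a JSON array of objects shaped like these.
sample_products = [
    {
        "id": "12",
        "name": "Phone Case",
        "description": "A protective phone case designed to safeguard your smartphone from drops and impacts.",
        "price": "19.99",
    },
    {
        "id": "11",
        "name": "Screen Protector",
        "description": "A high-quality screen protector to keep your smartphone's display safe.",
        "price": "9.99",
    },
]

# Write to a temporary location (illustrative path, not part of the project).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "data.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample_products, f, indent=2)

# Load it back exactly as app.py does.
with open(sample_path, "r", encoding="utf-8") as f:
    json_data = json.load(f)

print(type(json_data).__name__, len(json_data))
```

If your own file has a different top-level structure (for example an object with a "products" key), the loop in the next step needs to be adjusted accordingly.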
Next, we need to extract the unique product IDs and JSON strings from the product data:
unique_ids = []
json_string_objects = []
for json_object in json_data:
    product_id = json_object["id"]
    unique_ids.append(product_id)
    json_string_object = json.dumps(json_object)
    json_string_objects.append(json_string_object)
We use a for loop to iterate over the product data and extract the product ID and JSON string for each product. We then add the product ID and JSON string to their respective lists.
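The loop above trusts that every id in the file is unique. A minimal sketch of a stricter variant follows; prepare_records is a hypothetical helper name, not part of ChromaDB. It converts IDs to strings (ChromaDB expects string IDs, as far as we know) and stops early if an ID appears twice.

```python
import json


def prepare_records(products):
    """Return (ids, documents) for a list of product dicts.

    Hypothetical helper: IDs are cast to str and checked for duplicates
    before anything is sent to the database.
    """
    ids = []
    documents = []
    seen = set()
    for product in products:
        product_id = str(product["id"])
        if product_id in seen:
            raise ValueError(f"duplicate product id: {product_id}")
        seen.add(product_id)
        ids.append(product_id)
        # Store the whole product as a JSON string, as in the loop above.
        documents.append(json.dumps(product))
    return ids, documents


# Illustrative input: one integer ID and one string ID.
products = [
    {"id": 11, "name": "Screen Protector"},
    {"id": "12", "name": "Phone Case"},
]
unique_ids, json_string_objects = prepare_records(products)
print(unique_ids)
```

Failing before the insert keeps a bad data file from leaving the collection half populated.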
Now, we need to save the product data to the ChromaDB collection:
saved_collection = client.get_collection(name="product_data")
saved_collection.add(documents=json_string_objects, ids=unique_ids)
First, we retrieve the collection we created from disk with the get_collection() method. The add() method then adds the product data to the collection: we pass the JSON strings as the documents argument and the product IDs as the ids argument. Each ID must be unique within the collection. Because we do not pass embeddings ourselves, ChromaDB computes them from the documents using the collection's embedding function.
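The add() method also accepts an optional metadatas argument, which this project leaves unused (hence the null metadata in the query output). As a hedged sketch, the hypothetical helper below builds one metadata dictionary per product from the name and price fields seen in the sample output. Converting price to a float is our assumption, made so the value could be compared numerically.

```python
def build_metadatas(products, fields=("name", "price")):
    """Build one flat metadata dict per product (hypothetical helper).

    Every product is assumed to contain all of the requested fields;
    a missing field raises KeyError rather than producing empty metadata.
    """
    metadatas = []
    for product in products:
        metadata = {}
        for field in fields:
            value = product[field]
            # Assumption: prices are stored as strings such as "19.99".
            metadata[field] = float(value) if field == "price" else value
        metadatas.append(metadata)
    return metadatas


sample = [{"id": "12", "name": "Phone Case", "price": "19.99"}]
print(build_metadatas(sample))
```

The resulting list could be passed as metadatas=build_metadatas(json_data) alongside documents and ids; it must have the same length and order as those two lists.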
Now, we can query the product data:
results = saved_collection.query(query_texts=["smartphone case"], n_results=2)
The query() method performs a similarity search on the collection: it embeds each query text and returns the stored documents whose embeddings are closest to it. We pass the query text in the query_texts argument and the number of results to return in the n_results argument.
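To make the idea of "closest" concrete, here is a toy nearest-neighbour search in plain Python. It is only an illustration: the two-dimensional vectors are made up, the search is exhaustive, and the squared Euclidean distance is what we believe ChromaDB uses by default. ChromaDB itself works with much larger embeddings and an approximate index.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vector, vectors_by_id, n_results=2):
    """Return the IDs and distances of the n_results closest vectors."""
    scored = [
        (squared_l2(query_vector, vector), doc_id)
        for doc_id, vector in vectors_by_id.items()
    ]
    scored.sort()  # smallest distance first
    top = scored[:n_results]
    return [doc_id for _, doc_id in top], [dist for dist, _ in top]


# Made-up 2-D "embeddings" keyed by product ID.
toy_vectors = {
    "11": [0.8, 0.3],
    "12": [0.9, 0.1],
    "13": [0.1, 0.9],
}
ids, distances = nearest([1.0, 0.0], toy_vectors, n_results=2)
print(ids, distances)
```

A smaller distance means a closer match, which is why results come back ordered from smallest to largest distance.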
Finally, we can print the results:
print(results)
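Here is the output, shown as formatted JSON. The query returns the two most similar products, with the closest match (smallest distance) first. The documents are shown in parsed form for readability; ChromaDB returns them as the JSON strings we stored.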
{
  "ids": [["12", "11"]],
  "distances": [[0.8864764754956473, 1.2108526876760022]],
  "metadatas": [[null, null]],
  "embeddings": null,
  "documents": [
    {
      "id": "12",
      "name": "Phone Case",
      "description": "A protective phone case designed to safeguard your smartphone from drops and impacts.",
      "price": "19.99"
    },
    {
      "id": "11",
      "name": "Screen Protector",
      "description": "A high-quality screen protector to keep your smartphone's display safe.",
      "price": "9.99"
    }
  ]
}
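Since each stored document is a JSON string, it usually needs to be parsed before use. The sketch below works on a hand-written stand-in for the query result rather than a live database. It assumes the raw result holds one inner list per query text, as the ids and distances fields above suggest, and it trims the products to a few fields for brevity.

```python
import json

# Hand-written stand-in for the dictionary returned by query(). Assumption:
# each field holds one inner list per query text, and documents are the
# JSON strings that were stored with add().
results = {
    "ids": [["12", "11"]],
    "distances": [[0.8864764754956473, 1.2108526876760022]],
    "documents": [[
        json.dumps({"id": "12", "name": "Phone Case", "price": "19.99"}),
        json.dumps({"id": "11", "name": "Screen Protector", "price": "9.99"}),
    ]],
}

# Index 0 selects the results for the first (and only) query text.
matches = []
for doc_id, distance, document in zip(
    results["ids"][0], results["distances"][0], results["documents"][0]
):
    product = json.loads(document)  # back to a Python dict
    matches.append({"id": doc_id, "name": product["name"], "distance": distance})

for match in matches:
    print(f"{match['id']}  {match['name']}  distance={match['distance']:.3f}")
```

With several query texts, the same loop would run once per index of the outer lists.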