feat: Updated collection basics
tazarov committed Jan 11, 2024
1 parent b8f4ab2 commit 20ffe4b
Showing 2 changed files with 102 additions and 15 deletions.
110 changes: 98 additions & 12 deletions docs/core/collections.md
@@ -1,5 +1,87 @@
# Collections

Collections are the grouping mechanism for embeddings, documents, and metadata.

## Collection Basics

### Collection Properties

Each collection is characterized by the following properties:

- `name`: The name of the collection
- `metadata`: A dictionary of key-value pairs associated with the collection.

Collection object:

```python
--8<-- "https://github.com/chroma-core/chroma/blob/main/chromadb/api/models/Collection.py#L57C1-L62C35"
```

### Creating a collection

```python

import chromadb

client = chromadb.PersistentClient(path="test") # or HttpClient()
col = client.create_collection("test")
```

Alternatively, you can use the `get_or_create_collection` method, which returns the existing collection or creates it if it doesn't exist yet.

```python
import chromadb

client = chromadb.PersistentClient(path="test") # or HttpClient()
col = client.get_or_create_collection("test")
```

### Deleting a collection

```python
import chromadb

client = chromadb.PersistentClient(path="test") # or HttpClient()
client.delete_collection("test")
```

### Listing all collections

```python

import chromadb

client = chromadb.PersistentClient(path="test") # or HttpClient()
collections = client.list_collections()
```

### Getting a collection

```python
import chromadb

client = chromadb.PersistentClient(path="test") # or HttpClient()
col = client.get_collection("test")
```

### Modifying a collection

Both collection properties (`name` and `metadata`) can be modified.

```python
import chromadb

client = chromadb.PersistentClient(path="test") # or HttpClient()
col = client.get_collection("test")
col.modify(name="test2", metadata={"key": "value"})
```

!!! note "Metadata"

    Metadata is always overwritten when modified. If you want to add a new key-value pair to the metadata, you must
    first get the existing metadata and then add the new key-value pair to it.
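
Because `modify` replaces the metadata wholesale, one way to add a key is to merge the new pairs into a copy of the existing dictionary first. The sketch below is an illustration and not part of this commit: `merged_metadata` is a hypothetical helper, and the commented usage assumes `col.metadata` returns the current metadata dictionary (or `None` when none is set).

```python
from typing import Any, Dict, Optional


def merged_metadata(existing: Optional[Dict[str, Any]], updates: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict holding the existing metadata plus the updates.

    Hypothetical helper for illustration; keys in `updates` win on conflict.
    """
    merged = dict(existing or {})  # copy, and tolerate a collection with no metadata
    merged.update(updates)
    return merged


# Assumed usage with a collection obtained as in the example above:
# col.modify(metadata=merged_metadata(col.metadata, {"key": "value"}))
```

The helper never mutates its first argument, so the dictionary read from the collection stays unchanged until `modify` is actually called.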


## Collection Utilities

### Cloning a collection
@@ -12,41 +94,45 @@ Here are some reasons why you might want to clone a collection:
```python
import chromadb

client = chromadb.PersistentClient(path="test") # or HttpClient()
col = client.get_or_create_collection("test") # create a new collection with L2 (default)
client = chromadb.PersistentClient(path="test") # or HttpClient()
col = client.get_or_create_collection("test") # create a new collection with L2 (default)

col.add(ids=[f"{i}" for i in range(1000)],documents=[f"document {i}" for i in range(1000)])
newCol = client.get_or_create_collection("test1",metadata={"hnsw:space":"cosine"}) # let's change the distance function to cosine
col.add(ids=[f"{i}" for i in range(1000)], documents=[f"document {i}" for i in range(1000)])
newCol = client.get_or_create_collection("test1", metadata={
    "hnsw:space": "cosine"}) # let's change the distance function to cosine

existing_count = col.count()
batch_size = 10
for i in range(0,existing_count,batch_size):
    batch = col.get(include = ["metadatas","documents","embeddings"], limit=batch_size, offset=i)
    newCol.add(ids=batch["ids"],documents=batch["documents"],metadatas=batch["metadatas"],embeddings=batch["embeddings"])
for i in range(0, existing_count, batch_size):
    batch = col.get(include=["metadatas", "documents", "embeddings"], limit=batch_size, offset=i)
    newCol.add(ids=batch["ids"], documents=batch["documents"], metadatas=batch["metadatas"],
               embeddings=batch["embeddings"])

print(newCol.count())
print(newCol.get(offset=0, limit=10)) #get first 10 documents
print(newCol.get(offset=0, limit=10)) # get first 10 documents
```
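
The batching loop above can be wrapped in a reusable function. This is a sketch, not the commit's code: `clone_records` is a hypothetical name, and it relies only on the `count`, `get`, and `add` calls already used in the example, with the same keyword arguments.

```python
def clone_records(src, dst, batch_size: int = 10) -> int:
    """Copy all records from `src` to `dst` in batches; return the number copied.

    `src` and `dst` are assumed to expose count(), get(), and add() with the
    same keyword arguments used in the cloning example.
    """
    copied = 0
    total = src.count()
    for offset in range(0, total, batch_size):
        batch = src.get(
            include=["metadatas", "documents", "embeddings"],
            limit=batch_size,
            offset=offset,
        )
        if len(batch["ids"]) == 0:  # nothing left to copy
            break
        dst.add(
            ids=batch["ids"],
            documents=batch["documents"],
            metadatas=batch["metadatas"],
            embeddings=batch["embeddings"],
        )
        copied += len(batch["ids"])
    return copied
```

Returning the copied count makes it easy to compare against `src.count()` after the clone finishes.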

### Updating Document/Record Metadata

In this example we loop through all documents of a collection and strip all metadata fields of leading and trailing whitespace.
In this example we loop through all documents of a collection and strip all metadata fields of leading and trailing
whitespace.
Change the `update_metadata` function to suit your needs.

```python
from chromadb import Settings
import chromadb

client = chromadb.PersistentClient(path="test", settings=Settings(allow_reset=True))
client.reset() #reset the database so we can run this script multiple times
client.reset() # reset the database so we can run this script multiple times
col = client.get_or_create_collection("test")
count= col.count()
count = col.count()


def update_metadata(metadata: dict):
    return {k: v.strip() for k, v in metadata.items()}


for i in range(0, count, 10):
    batch = col.get(include = ["metadatas"], limit=10, offset=i)
    batch = col.get(include=["metadatas"], limit=10, offset=i)
    col.update(ids=batch["ids"], metadatas=[update_metadata(metadata) for metadata in batch["metadatas"]])
```
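
The `update_metadata` function above calls `strip()` on every value, which raises `AttributeError` for non-string values such as numbers or booleans. A more defensive variant, offered as a sketch and not as the commit's code, strips only strings and passes through records that have no metadata, assuming such records come back as `None`. The name `strip_string_values` is hypothetical.

```python
from typing import Any, Dict, Optional


def strip_string_values(metadata: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Strip leading/trailing whitespace from string values only.

    Non-string values are returned unchanged; a missing metadata dict
    (assumed to be represented as None) is passed through as None.
    """
    if metadata is None:
        return None
    return {k: v.strip() if isinstance(v, str) else v for k, v in metadata.items()}
```

Whether `col.update` accepts `None` entries in `metadatas` may depend on the Chroma version, so it may be necessary to filter those records out before updating.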
7 changes: 4 additions & 3 deletions mkdocs.yml
@@ -14,8 +14,8 @@ extra:
  analytics:
    provider: google
    property: G-FZFKK5FLEY
#extra_javascript:
#  - javascripts/gtag.js
extra_javascript:
  - javascripts/gtag.js
markdown_extensions:
  - abbr
  - admonition
@@ -30,7 +30,8 @@ markdown_extensions:
      line_spans: __span
      pygments_lang_class: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.snippets:
      url_download: true
  - pymdownx.superfences
  - pymdownx.tabbed:
      alternate_style: true