chore: Updated core concepts for Metadata and Vector Segments
tazarov committed Aug 5, 2024
1 parent 1185b80 commit cb10370
Showing 1 changed file with 20 additions and 11 deletions.
31 changes: 20 additions & 11 deletions docs/core/concepts.md
@@ -64,26 +64,35 @@ following distance functions:
- Euclidean (L2) - Useful for text similarity, more sensitive to noise than `cosine`
- Inner Product (IP) - Recommender systems
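
As a rough illustration of the distance functions listed above, here is a minimal pure-Python sketch using the textbook definitions. The function names are made up for this example, and the library's own conventions may differ (for instance, reporting squared L2 or a transformed inner product), so treat it as an aid to intuition rather than a description of Chroma's internals.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean (L2) distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Plain dot product; larger values mean more similar."""
    return sum(x * y for x, y in zip(a, b))
```

Note that scaling a vector changes its L2 distance and inner product to other vectors but leaves the cosine distance unchanged, which is one reason the choice of function matters.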

## Embedding Vector
## Embedding Model

A representation of a document in the embedding space in the form of a vector: a list of 32-bit floats (or ints).
## Embeddings

## Embedding Model
A representation of a document in the embedding model's latent space in the form of a vector: a list of 32-bit floats (or
ints).
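
To make "a list of 32-bit floats" concrete, the sketch below stores a made-up three-dimensional vector using Python's standard `array` module. The values are illustrative only and do not come from any real embedding model, and this is not necessarily how Chroma represents embeddings internally.

```python
from array import array

# Typecode "f" stores C floats, which are 4 bytes (32 bits) on common platforms.
# The values are invented for illustration.
embedding = array("f", [0.12, -0.48, 0.33])

print(len(embedding))       # number of dimensions
print(embedding.itemsize)   # bytes used per component
```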

## Metadata Segment

## Document and Metadata Index
The metadata segment holds both the documents and their respective metadata fields (if any). The metadata segment is
stored in sqlite3 under `<persistent_dir>/chroma.sqlite3`.
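
Because the metadata segment lives in an ordinary SQLite file, it can be inspected with Python's standard `sqlite3` module. The helper below is a hypothetical sketch (the name `list_tables` is ours) that opens a database read-only and lists its tables. It makes no assumption about which tables Chroma actually creates.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, opened read-only."""
    # mode=ro ensures that inspecting the file cannot modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

Calling `list_tables("<persistent_dir>/chroma.sqlite3")`, with the placeholder replaced by your own persistent directory, would show what is stored there. It is safest to do this while no Chroma process is writing to the file.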

The document and metadata index is stored in a SQLite database.
## Vector Segment

## Vector Index (HNSW Index)
!!! tip "Segment or Index?"

Under the hood (ca. v0.4.22) Chroma uses its
    In the paragraphs below, the terms "segment" and "index" are used interchangeably.

Under the hood Chroma uses its
own [fork](https://github.com/chroma-core/hnswlib) of [HNSW lib](https://github.com/nmslib/hnswlib) for indexing and
searching vectors.

In single-node mode, Chroma will create a single HNSW index for each collection. The index is stored in a subdir of
your persistent dir, named after the collection id (UUID-based).
In single-node mode, Chroma will create a single vector index for each collection. The index is stored in a UUID-named
subdir of your persistent dir, named after the vector segment of the collection.
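
To see which vector segments exist on disk, one option is to look for UUID-named subdirectories in the persistent dir. The helper below is a hypothetical sketch (the name `find_uuid_dirs` is ours) and assumes only what is stated above, namely that each index lives in a subdir whose name is a UUID.

```python
import uuid
from pathlib import Path


def find_uuid_dirs(persist_dir):
    """Return the sorted names of subdirectories that are canonical UUIDs."""
    names = []
    for child in Path(persist_dir).iterdir():
        if not child.is_dir():
            continue
        try:
            parsed = uuid.UUID(child.name)
        except ValueError:
            continue  # not a UUID, e.g. some other directory
        # Accept only the canonical hyphenated form.
        if str(parsed) == child.name.lower():
            names.append(child.name)
    return sorted(names)
```

Files such as `chroma.sqlite3` are skipped because they are not directories.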

The HNSW lib uses a [fast ANN](https://arxiv.org/abs/1603.09320) algorithm to search the vectors in the index.



In addition to the HNSW index, Chroma uses a Brute Force index to buffer embeddings in memory before they are added to the
HNSW index (see [`batch_size`](configuration.md#hnswbatch_size)). As the name suggests, a search in the Brute Force
index iterates over all the vectors in the index and compares each of them to the query using the
distance function. Brute Force index search is exhaustive and works well on small datasets.
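
The exhaustive search described above can be sketched in a few lines of Python. This is an illustration of the general technique, not Chroma's actual implementation. The names `brute_force_search` and `squared_l2` are hypothetical, and squared L2 is just one possible distance function.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query, vectors, distance_fn=squared_l2, k=3):
    """Compare the query against every vector and return the k nearest.

    Returns a list of (distance, index) pairs, closest first.
    """
    scored = ((distance_fn(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)
```

Every query touches every stored vector, so the cost grows linearly with the number of vectors. That is acceptable for a small in-memory buffer but not for a large index, which is where an approximate structure such as HNSW becomes worthwhile.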
