Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat(vector store): lancedb #33

Merged
merged 45 commits into from
Oct 7, 2024
Merged

Feat(vector store): lancedb #33

merged 45 commits into from
Oct 7, 2024

Conversation

marieaurore123
Copy link
Contributor

@marieaurore123 marieaurore123 commented Sep 23, 2024

  1. Adds lanceDB as a vector store in rig-lancedb crate
  2. Modifies VectorStoreIndex trait to include a generic type type SearchParams which can be passed to the vector store query to customize the query.

@marieaurore123 marieaurore123 changed the title Feat(vector store)/lancedb Feat(vector store): lancedb Sep 23, 2024
@cvauclair
Copy link
Contributor

cvauclair commented Sep 23, 2024

@0xMochan @garance-buricatu would like your opinion on something here. This PR adds an associated type for the VectorStoreIndex trait SearchParams and adds it as an argument to the trait method, e.g.:

// in rig-core/src/vector_store/mod.rs
pub trait VectorStoreIndex: Send + Sync {
    type SearchParams: for<'a> Deserialize<'a> + Send + Sync;    // <-------- This new type

    fn top_n_from_query(
        &self,
        query: &str,
        n: usize,
        search_params: Self::SearchParams,  // <--------- used here
    ) -> impl std::future::Future<Output = Result<Vec<(f64, DocumentEmbeddings)>, VectorStoreError>> + Send;
    
    // ...
}

However, a downstream effect is that you now need to pass an VectorStoreIndex::SearchParams argument when constructing a RAG agent, e.g.:

// in rig-core/examples/rag.rs
    let rag_agent = openai_client.agent("gpt-4")
        .preamble("
            You are a dictionary assistant here to assist the user in understanding the meaning of words.
            You will find additional non-standard word definitions that could be useful below.
        ")
        .dynamic_context(1, index, "".to_string())  // <---- new empty string argument
        .build();

This feels a little awkward, even though it "makes sense". However, another approach would be to push whatever search/index parameters into the struct that implements the VectorStoreIndex trait. So instead of passing those params as an argument to the search functions, they would be fixed for a given VectorStoreIndex. E.g. you'd have this instead:

pub struct LanceDbVectorStore<M: EmbeddingModel> {
    model: M,
    document_table: lancedb::Table,
    embedding_table: lancedb::Table,
    params: ....   // <------- params would go inside the index object
}

This also has the advantage of making the VectorStoreIndex object safe avoid the need for the serde "hack" to get around the object safety requirements.

@0xMochan
Copy link
Contributor

0xMochan commented Sep 23, 2024

Interesting, for context, can you provide an example for how the search params can be used?

@marieaurore123
Copy link
Contributor Author

marieaurore123 commented Sep 23, 2024

@cvauclair

Pro for new approach:

  • wouldn't force the SearchParams struct to be deserializable/serializable, this has been annoying.
  • improves readability like in the example you showed where the empty string was confusing

Con for new approach:

  • can't use different settings of SearchParams for different calls to top_n_from_query or top_n_from_embedding. (unless we can modify the entire LanceDbVectorStore struct)

@marieaurore123
Copy link
Contributor Author

marieaurore123 commented Sep 23, 2024

@0xMochan

in lanceDB, search params can be used to define the distance type (cosine, l2, dot, ...), search type (approximate nearest neighbor, exact nearest neighbor), filters, and more fine tuning options on the search index.

@cvauclair
Copy link
Contributor

cvauclair commented Sep 23, 2024

Con for new approach:

- can't use different settings of SearchParams for different calls to `top_n_from_query` or `top_n_from_embedding`. (unless we can modify the entire `LanceDbVectorStore` struct)

I think in the context of vector search, this is fine since these settings will most likely be static once the user has found the configuration that works for their RAG system. Plus, although these methods are public, users would usually interact with VectorStoreIndex types only insofar as they would create one and give it as a parameter to their Agent. Even in the current implementation, the SearchParams of an agent's dynamic context sources are static, so we'd just make it static across the board (if that makes sense).

@0xMochan
Copy link
Contributor

Con for new approach:

- can't use different settings of SearchParams for different calls to `top_n_from_query` or `top_n_from_embedding`. (unless we can modify the entire `LanceDbVectorStore` struct)

I think in the context of vector search, this is fine since these settings will most likely be static once the user has found the configuration that works for their RAG system. Plus, although these methods are public, users would usually interact with VectorStoreIndex types only insofar as they would create one and give it as a parameter to their Agent. Even in the current implementation, the SearchParams of an agent's dynamic context sources are static, so we'd just make it static across the board (if that makes sense).

I do think it might be useful to change it on the fly, or if the user of an application wants to adjust this based on a UI / TUI. But you can always reconstruct an agent so it may not be that big of a deal. There can always be an extra search command to use a specific method of querying that overrides the default (so you choose a default when you create the Agent) but when you query, there's the ability to override that by using a specific function top_n_from_query_using_search_param (bad naming but ya know). This can be an extra function defined by a trait for this feature (theoretically) and can always be added in the future if there's a demand for it (perhaps we leave a comment about it?)

@marieaurore123
Copy link
Contributor Author

@0xMochan @cvauclair thanks guys! I will implement both moving the search params into the vector store struct and also adding an extra override method on the vector search trait

.github/workflows/ci.yaml Show resolved Hide resolved
rig-core/src/providers/openai.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@cvauclair cvauclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't reviewed everything yet, but there are a lot of things to change so will make it in two parts

rig-core/src/providers/cohere.rs Show resolved Hide resolved
rig-core/src/providers/cohere.rs Show resolved Hide resolved
rig-core/src/providers/openai.rs Show resolved Hide resolved
rig-core/src/providers/openai.rs Outdated Show resolved Hide resolved
rig-core/src/vector_store/mod.rs Outdated Show resolved Hide resolved
rig-mongodb/src/lib.rs Outdated Show resolved Hide resolved
rig-lancedb/src/lib.rs Outdated Show resolved Hide resolved
rig-lancedb/src/lib.rs Outdated Show resolved Hide resolved
rig-lancedb/src/lib.rs Outdated Show resolved Hide resolved
rig-lancedb/src/lib.rs Outdated Show resolved Hide resolved
rig-mongodb/examples/vector_search_mongodb.rs Outdated Show resolved Hide resolved
rig-lancedb/src/table_schemas/document.rs Outdated Show resolved Hide resolved
rig-lancedb/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines +120 to +131
if let Some(SearchType::Flat) = search_type {
query = query.bypass_vector_index();
}

if let Some(SearchType::Approximate) = search_type {
if let Some(nprobes) = nprobes {
query = query.nprobes(nprobes);
}
if let Some(refine_factor) = refine_factor {
query = query.refine_factor(refine_factor);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this to a match statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't do match here because I don't need to handle None on any of these cases

Comment on lines +116 to +118
if let Some(distance_type) = distance_type {
query = query.distance_type(distance_type);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should a default value be set if distance_type is None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I updated the doc strings to answer questions about defaults.

rig-lancedb/src/utils/mod.rs Outdated Show resolved Hide resolved
rig-lancedb/src/lib.rs Outdated Show resolved Hide resolved
rig-lancedb/src/utils/deserializer.rs Outdated Show resolved Hide resolved
rig-lancedb/src/utils/deserializer.rs Outdated Show resolved Hide resolved
rig-lancedb/src/utils/deserializer.rs Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove AI generated data in favor of repeated hardcoded data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove AI generated data in favor of repeated hardcoded data

@cvauclair cvauclair merged commit 92834f4 into main Oct 7, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants