feat: embeddings overhaul #120

Merged
merged 106 commits into from
Nov 29, 2024

@marieaurore123 (Contributor) commented Nov 23, 2024

Complete overhaul of embeddings.

See Issue:
#52

The biggest changes can be found in:

  • rig-core/src/embeddings --> notice the tests and doctests written to validate functionality.
  • rig-core/rig-core-derive

Other significant changes touch the vector-search examples in all of the sibling crates. All of the examples have been recompiled and run.

marieaurore123 and others added 30 commits October 2, 2024 18:07
…emeory-vector-store

refactor: remove DocumentEmbeddings from in memory vector store
marieaurore123 and others added 9 commits October 24, 2024 09:49
…e-feature

docs(embeddings): finalize embeddings overhaul feature
* refactor: Big refactor

* refactor: refactor Embed trait, fix all imports, rename files, fix macro

* fix(embed trait): fix errors while testing

* fix(lancedb): examples

* docs: fix hyperlink

* fmt: cargo fmt

* PR; make requested changes

* fix: change visibility of struct field

* fix: failing tests

---------

Co-authored-by: Christophe <[email protected]>
@mateobelanger changed the title from "Feat/embeddings overhaul" to "feat: embeddings overhaul" on Nov 26, 2024

@0xMochan (Contributor) left a comment

This code is very well written, especially with how internal code comments clearly define everything and how elegant and simplified the internals have become. When I start to get to the point of nitpicking, I feel like that indicates the PR is in a great spot!

  1. I'm still not a fan of how rig-core-derive is a subcrate of rig-core. I understand that naming recommendations led to this decision, but I feel that, because we are a monorepo, having rig-core-derive inside the rig-core crate sorta hides the crate a bit. I don't think this should block the PR though, since it might be a larger, more annoying move to make at the last hour of the PR.

  2. A couple of the doc comments could use some [] so that they link directly to the source.

  • Very much a nitpick as it can be tedious to link every single object mention (like EmbedError -> [EmbedError]).
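
To illustrate the nitpick above, here is a minimal, self-contained sketch of rustdoc intra-doc links. The `EmbedError` enum and `collect_text` function below are hypothetical stand-ins written for this example, not the definitions from rig-core; only the bracket syntax in the doc comments is the point.

```rust
use std::fmt;

/// Error returned when a value yields no text to embed.
///
/// Hypothetical stand-in for illustration; not the rig-core definition.
#[derive(Debug, PartialEq)]
pub enum EmbedError {
    /// The input contained only whitespace.
    Empty,
}

impl fmt::Display for EmbedError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            EmbedError::Empty => write!(f, "nothing to embed"),
        }
    }
}

impl std::error::Error for EmbedError {}

/// Collects the text that should be embedded (hypothetical helper).
///
/// Returns [`EmbedError::Empty`] when `text` is blank. Because the name is
/// wrapped in brackets, rustdoc renders it as a link to the item's page;
/// a plain `EmbedError` in backticks would only be shown in code font.
pub fn collect_text(text: &str) -> Result<Vec<String>, EmbedError> {
    let trimmed = text.trim();
    if trimmed.is_empty() {
        return Err(EmbedError::Empty);
    }
    Ok(vec![trimmed.to_string()])
}

fn main() {
    println!("{:?}", collect_text("  hello  "));
    println!("{:?}", collect_text("   "));
}
```

Running `cargo doc` on a crate containing these items should turn both bracketed names into hyperlinks, which is the behavior the suggestion is after.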

Overall, I think this is fantastic. I think some docs on a) how one can migrate to the new embeddings setup, b) how the derive macro is leveraged, and c) general usage would be very helpful once we get our documentation up and running!

@tarrencev (Contributor) left a comment

It's a little confusing to refer to embedded content as both documents and texts

* docs: Make examples+docstrings a bit more realistic

* feat: Add Embed implementation for &impl Embed

* test: Reorganize tests

* misc: Add `derive` feature to `all` feature flag

* test: Fix dead code warning

* test: Improve embed macro tests

* test: Add additional embed macro test

* docs: Add logging output to rag example

* docs: Fix looging output in tools example

* feat: Improve token usage log messages

* test: Small changes to embedbing builder tests

* style: cargo fmt

* fix: Clippy + docstrings

* docs: Fix docstring

* test: Fix test
@cvauclair merged commit 630163d into main on Nov 29, 2024
4 checks passed
@github-actions bot mentioned this pull request on Nov 29, 2024
@mateobelanger linked an issue on Nov 29, 2024 that may be closed by this pull request
14 tasks
mateobelanger added a commit that referenced this pull request Dec 2, 2024
* fix: exclude embedding properties from top_n node query

* refactor: more ergonomic index creation

* docs(neo4j): update examples

* fix: unused import in example

* feat(provider): xAI (grok) integration (#106)

* feat(xai): initial xai (grok) implementation

* fix(xai): renamings + tests

* style(xai): Update rig-core/src/providers/xai/client.rs

Co-authored-by: Mathieu Bélanger <[email protected]>

* style(xai): adds various comments and README improvements

* fix(xai): add some print statements to the grok example

* docs(xai): fix readme

---------

Co-authored-by: Mathieu Bélanger <[email protected]>

* fix(rig-mongodb): remove embeddings from `top_n` lookup (#115)

* fix(mongodb): remove embeddings from `top_n` lookup

* fix(mongodb): filter embeddings within agg pipeline

* style(mongodb): clippy moment

* fix(mongodb): dynamically get embedded fields from mongodb

* fix(mongodb): apply fixes from comments

* style(mongodb): fmt

* docs(readme): add perplexity logo to integrations (#112)

* docs(readme): add perplexity logo to integrations
* fix: perplexity logo size

* fix(readme): perplexity logo size

* feat: embeddings API overhaul (#120)

* feat: setup derive macro

* test: test out writing embeddable macro

* test: continue testing custom macro implementation

* feat: macro generate trait bounds

* refactor: split up macro into multiple files

* refactor: move macro derive crate inside rig-core

* feat: replace embedding logic with new embeddable trait and macro

* refactor: refactor rag examples, delete document embedding struct

* feat: remove document embedding from in memory store

* refactor: remove DocumentEmbeddings from in memory vector store

* refactor(examples): combine vector store with vector store index

* docs: add and update docstrings

* fix (examples): fix bugs in examples

* style: cargo fmt

* revert: revert vector store to main

* docs: update emebddings builder docstrings

* refactor: derive macro

* tests: add unit tests on in memory store

* fic(ci): asterix on pull request sto accomodate for epic branches

* fix(ci): double asterix

* feat: add error type on embeddable trait

* refactor: move embeddings to its own module and seperate embeddable

* refactor: split up macro into more files, fix all imports

* fix: revert logging change

* feat: handle tools with embeddingsbuilder

* bug(macro): fix error when embed tags missing

* style: cargo fmt

* fix(tests): clippy

* docs&revert: revert embeddable trait error type, add docstrings

* style: cargo clippy

* clippy(lancedb): fix unused function error

* fix(test): remove useless assert false statement

* cleanup: split up branch into 2 branches for readability

* cleanup: revert certain changes during branch split

* docs: revert doc string

* fix: add embedding_docs to embeddable tool

* refactor: use OneOrMany in Embbedable trait, make derive macro crate feature flag

* tests: add some more tests

* clippy: cargo clippy

* docs: add docstring to oneormany

* fix(macro): update error handling

* refactor: reexport EmbeddingsBuilder in rig and update imports

* feat: implement IntoIterator and Iterator for OneOrMany

* refactor: rename from methods

* tests: fix failing tests

* refactor&fix: make PR review changes

* fix: fix tests failing

* test: add test on OneOrMany

* style: cargo fmt

* docs&fix: fix doc strings, implement iter_mut for OneOrMany

* fix: update borrow and owning of macro

* clippy: add back print statements

* fix: fix issues caused by merge of derive macro branch

* fix: fix cargo toml of lancedb and mongodb

* refactor: use thiserror for OneOtMany::EmptyListError

* feat: add OneOrMany to in memory vector store

* style: cargo fmt

* fix: update embeddingsbuilder import path

* tests: add tests for embeddingsbuilder

* clippy: add is empty method

* fix: add feature flag to examples in mongodb and lancedb crates

* fix: move lancedb fixtures into it's own file

* fix: add dummy main function in fextures.rs for compiler

* fix: revert fixture file, remove fixtures from cargo toml examples

* fix: update fixture import in lancedb examples

* refactor: rename D to T in embeddingsbuilder generics

* refactor: remove clone

* PR: update builder, docstrings, and std::markers tags

* style: replace add with push

* fix: fix mongodb example

* fix: update lancedb and mongodb doc example

* fix: typo

* docs: add and fix docstrings and examples

* docs: add more doc tests

* feat: rename Embeddable trait to ExtractEmbeddingFields

* feat: rename macro files, cargo fmt

* PR; update docstrings, update `add_documents_with_id` function

* doc: fix doc linting

* misc: fmt

* test: fix test

* refactor(embeddings): embed trait definition (#89)

* refactor: Big refactor

* refactor: refactor Embed trait, fix all imports, rename files, fix macro

* fix(embed trait): fix errors while testing

* fix(lancedb): examples

* docs: fix hyperlink

* fmt: cargo fmt

* PR; make requested changes

* fix: change visibility of struct field

* fix: failing tests

---------

Co-authored-by: Christophe <[email protected]>

* fix/docs: fix erros from merge, cleanup embeddings docstrings

* fix: cargo clippy in examples

* Feat: small improvements + fixes + tests (#128)

* docs: Make examples+docstrings a bit more realistic

* feat: Add Embed implementation for &impl Embed

* test: Reorganize tests

* misc: Add `derive` feature to `all` feature flag

* test: Fix dead code warning

* test: Improve embed macro tests

* test: Add additional embed macro test

* docs: Add logging output to rag example

* docs: Fix looging output in tools example

* feat: Improve token usage log messages

* test: Small changes to embedbing builder tests

* style: cargo fmt

* fix: Clippy + docstrings

* docs: Fix docstring

* test: Fix test

* style: Small renaming for consistency

* docs: Improve docstrings

* style: fmt

* fix: `TextEmbedder::embed` visibility

* docs: Simplified the `EmbeddingsBuilder` docstring example to focus on the builder

* style: cargo fmt

* docs: Small edit to lancedb examples

---------

Co-authored-by: cvauclair <[email protected]>

* misc: Add `rig-derive` missing manifest fields (#129)

* feat: Improve `InMemoryVectorStore` API (#130)

* feat: Improve `InMemoryVectorStore` API

* style: clippy+fmt

* test: fix test

* fix: remove unused module (#132)

* fix: exclude embedding properties from top_n node query

* refactor: more ergonomic index creation

* docs(neo4j): update examples

* fix: unused import in example

* fix(example): remove embedding field from Deserialization type

---------

Co-authored-by: Mochan <[email protected]>
Co-authored-by: Garance Buricatu <[email protected]>
Co-authored-by: cvauclair <[email protected]>
@github-actions bot mentioned this pull request on Dec 3, 2024
Successfully merging this pull request may close these issues.

feat: EmbeddingsBuilder redesign
4 participants