[FEAT]: Max. chunk size should be overridable #2633

Open
TeaAlc opened this issue Nov 14, 2024 · 0 comments
TeaAlc commented Nov 14, 2024

What would you like to see?

Feature: Rather than supporting only the one model that happened to be selected at development time, and rather than maintaining lists of limits for every embedding model, it would be great to have an option to override the hardcoded embeddingMaxChunkLength.

Explanation:
In Text splitting & Chunking Preferences, the max chunk size seems to depend on the embedding provider rather than on the embedding model. This leads to a maximum character length that does not fit every model.

E.g. the limit in this screenshot is set for the Azure OpenAI Embedding Provider:
[image]
The chunk size seems to come from
https://github.com/Mintplex-Labs/anything-llm/blob/da3d0283ffee9c592e5b81d2be6a848722df298f/server/utils/EmbeddingEngines/azureOpenAi/index.js#L22C10-L22C34
The model used as the basis seems to be text-embedding-ada-002, but there are already newer models such as text-embedding-3-large.
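To make the request concrete, here is a minimal sketch of how such an override could be resolved. It is not AnythingLLM code: the helper `resolveMaxChunkLength`, the variable name `MAX_CHUNK_LENGTH_OVERRIDE`, and the default value are all assumptions made up for illustration.

```javascript
// Hypothetical sketch, not part of AnythingLLM.
// Placeholder for the provider's hardcoded limit (value assumed for illustration).
const PROVIDER_DEFAULT_MAX_CHUNK_LENGTH = 2048;

// Returns the user-supplied override if it is a positive integer,
// otherwise falls back to the provider default.
function resolveMaxChunkLength(
  env = process.env,
  fallback = PROVIDER_DEFAULT_MAX_CHUNK_LENGTH
) {
  const raw = env.MAX_CHUNK_LENGTH_OVERRIDE; // hypothetical variable name
  if (raw === undefined || String(raw).trim() === "") return fallback;
  const parsed = Number(raw);
  // Reject NaN, fractions, zero and negative values.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}
```

An embedder could then set `this.embeddingMaxChunkLength = resolveMaxChunkLength()` in place of a fixed number, so the provider default still applies when no override is configured.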

Also, it seems that the AnythingLLM embedder counts characters rather than tokens, which reduces the amount of data per vector even further.
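A rough illustration of the character vs. token gap, assuming about 4 characters per token (a common rule of thumb for English text, not an exact figure; a real implementation would need the model's tokenizer). The function names are hypothetical and not taken from the AnythingLLM code base.

```javascript
// Rough heuristic only: real token counts depend on the tokenizer and the text.
const ASSUMED_CHARS_PER_TOKEN = 4;

// Estimate how many tokens a string would use under the assumed ratio.
function estimateTokens(text) {
  return Math.ceil(text.length / ASSUMED_CHARS_PER_TOKEN);
}

// Split text so that each chunk stays within an estimated token budget,
// instead of treating the limit as a character count.
function splitByEstimatedTokens(text, maxTokens) {
  const maxChars = maxTokens * ASSUMED_CHARS_PER_TOKEN;
  const chunks = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}
```

Under this assumption, a limit of N applied to characters fills only about a quarter of what the same N would allow if it were applied to tokens.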

@TeaAlc TeaAlc added enhancement New feature or request feature request labels Nov 14, 2024
@timothycarambat timothycarambat self-assigned this Nov 15, 2024