[FEAT]: Max. chunk size should be overridable #2633

TeaAlc · 2024-11-14T09:48:23Z

What would you like to see?

Feature: Add an option to override the hardcoded embeddingMaxChunkLength. That way AnythingLLM is not tied to the one model that was selected at development time, and nobody has to maintain a list of limits for every embedding model.

Explanation:
In Text splitting & Chunking Preferences, the max chunk size seems to depend on the embedding provider rather than on the embedding model. This results in a maximum character length that does not fit every model.

E.g., a fixed limit is set for the Azure OpenAI Embedding Provider.
The chunk size seems to come from
https://github.com/Mintplex-Labs/anything-llm/blob/da3d0283ffee9c592e5b81d2be6a848722df298f/server/utils/EmbeddingEngines/azureOpenAi/index.js#L22C10-L22C34
The model used as the baseline seems to be text-embedding-ada-002, but newer models such as text-embedding-3-large are already available.
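
To illustrate the request, here is a minimal sketch of how such an override could look. Only the property name embeddingMaxChunkLength comes from the linked file; the class name, the environment variable EMBEDDING_MAX_CHUNK_LENGTH_OVERRIDE, the helper function, and the default value are all hypothetical and not taken from the AnythingLLM codebase.

```javascript
// Hypothetical sketch of an overridable max chunk length.
// Assumption: the default below is an arbitrary placeholder, not the value
// actually hardcoded in AnythingLLM.
const PROVIDER_DEFAULT_MAX_CHUNK_LENGTH = 1000;

// Returns the override if it parses to a positive integer,
// otherwise falls back to the provider default.
function resolveMaxChunkLength(override, providerDefault) {
  const parsed = Number.parseInt(override, 10);
  if (Number.isNaN(parsed) || parsed <= 0) return providerDefault;
  return parsed;
}

// Hypothetical embedder class; the env object is injectable for testing.
class ExampleEmbedder {
  constructor(env = process.env) {
    this.embeddingMaxChunkLength = resolveMaxChunkLength(
      env.EMBEDDING_MAX_CHUNK_LENGTH_OVERRIDE, // hypothetical variable name
      PROVIDER_DEFAULT_MAX_CHUNK_LENGTH
    );
  }
}
```

With no override set, the embedder keeps the provider default; a user running a model with a larger input limit could raise the value without a code change.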

Also, the AnythingLLM embedder seems to count characters rather than tokens, which reduces the amount of data per vector even further.
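
The effect of the unit can be shown with a small sketch in which the chunker takes a pluggable measure function. This is not how AnythingLLM splits text; chunkWords, countChars, and countWords are hypothetical names, and countWords is only a crude whitespace-based stand-in for a real tokenizer.

```javascript
// Greedy word-based chunker with a pluggable length measure.
// Note: a single word longer than maxLength still becomes its own chunk.
function chunkWords(text, maxLength, measure) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  let current = "";
  for (const word of words) {
    const candidate = current ? `${current} ${word}` : word;
    if (current && measure(candidate) > maxLength) {
      // Adding this word would exceed the limit: close the current chunk.
      chunks.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Measures length in characters.
const countChars = (s) => s.length;
// Crude stand-in for a token count (assumption: one word is one token).
const countWords = (s) => s.split(/\s+/).filter(Boolean).length;
```

For the text "alpha beta gamma delta" and a limit of 10, measuring characters yields three chunks, while measuring words keeps everything in one chunk. The same numeric limit therefore admits far less text when it is applied to characters.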

TeaAlc added the enhancement and feature request labels Nov 14, 2024

timothycarambat self-assigned this Nov 15, 2024
