The general idea is that, because of merging, SDPM should produce fewer than or the same number of chunks as the Semantic Chunker. This is clearly visible when using standard Sentence Transformer models (mpnet, MiniLM, etc.).
However, I made a CustomEmbedding class for some recent models, and specifically for BGE-M3 I see that no matter what I do, the chunks from SDPM and Semantic remain the same. I tried printing the similarities, embeddings, etc. and I do see differences, but for some reason I do not get different chunks, even though I believe SDPM should merge some of them.
Setup: install FlagEmbedding (which provides BGEM3FlagModel):
pip install -U FlagEmbedding
Custom embedding class (please don't mind the quick and dirty implementation, I had to test fast):
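Since the original snippet didn't survive the copy, here is a minimal sketch of what such a wrapper can look like. It assumes Chonkie exposes a `chonkie.embeddings.BaseEmbeddings` base class with `embed` / `embed_batch` / `similarity` / `dimension` hooks; the class name `BGEM3Embeddings` and the exact method set are illustrative and may differ from the original implementation.

```python
# Sketch of a BGE-M3 wrapper for Chonkie. The BaseEmbeddings method names
# below are assumptions about Chonkie's interface, not the original code.
import numpy as np
from FlagEmbedding import BGEM3FlagModel
from chonkie.embeddings import BaseEmbeddings


class BGEM3Embeddings(BaseEmbeddings):
    def __init__(self, model_name: str = "BAAI/bge-m3"):
        super().__init__()
        self.model = BGEM3FlagModel(model_name, use_fp16=True)

    def embed(self, text: str) -> np.ndarray:
        # BGEM3FlagModel.encode returns a dict; dense vectors live under "dense_vecs"
        return np.asarray(self.model.encode([text])["dense_vecs"][0])

    def embed_batch(self, texts: list[str]) -> list[np.ndarray]:
        dense = self.model.encode(texts)["dense_vecs"]
        return [np.asarray(v) for v in dense]

    def similarity(self, u: np.ndarray, v: np.ndarray) -> float:
        # cosine similarity between two embedding vectors
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    @property
    def dimension(self) -> int:
        return 1024  # BGE-M3 dense embedding size

    def get_tokenizer_or_token_counter(self):
        # quick-and-dirty fallback: rough whitespace token count
        return lambda text: len(text.split())
```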
Code: you can use the Paul Graham essay as input text for chunking -> https://gist.githubusercontent.com/wey-gu/75d49362d011a0f0354d39e396404ba2/raw/0844351171751ebb1ce54ea62232bf5e59445bb7/paul_graham_essay.txt
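A rough reproduction sketch is below. The chunker names and the `chunk()` call follow Chonkie's public API, but the specific `threshold`, `chunk_size`, and `skip_window` values are placeholders rather than the exact parameters from my runs, and `BGEM3Embeddings` is the wrapper sketched above.

```python
# Compare chunk counts from SemanticChunker vs SDPMChunker on the same text.
# Parameter values here are illustrative, not the exact ones from the report.
import urllib.request
from chonkie import SemanticChunker, SDPMChunker

url = (
    "https://gist.githubusercontent.com/wey-gu/75d49362d011a0f0354d39e396404ba2/"
    "raw/0844351171751ebb1ce54ea62232bf5e59445bb7/paul_graham_essay.txt"
)
text = urllib.request.urlopen(url).read().decode("utf-8")

# BGEM3Embeddings is the wrapper class sketched in the previous block
embeddings = BGEM3Embeddings()

semantic = SemanticChunker(embedding_model=embeddings, threshold=0.5, chunk_size=512)
sdpm = SDPMChunker(embedding_model=embeddings, threshold=0.5, chunk_size=512, skip_window=1)

semantic_chunks = semantic.chunk(text)
sdpm_chunks = sdpm.chunk(text)

# Because SDPM merges skip-window neighbours, its count should be <= the semantic count
print("Semantic chunks:", len(semantic_chunks))
print("SDPM chunks:    ", len(sdpm_chunks))
```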
No matter what I use for `chunk_size` and `threshold`, the number of chunks is the same. For example, using mpnet with the parameters above we get 384 and 372 (as expected), but for BGE-M3 we get 92 each.