BGE-M3 custom embeddings always have the same number of chunks between Semantic and SDPM Chunker #94

Open
armsp opened this issue Dec 16, 2024 · 0 comments
Labels: bug (Something isn't working) · in progress (Actively looking into the issue)

armsp commented Dec 16, 2024

The general idea is that, because of merging, SDPM should produce fewer chunks than the Semantic Chunker, or at most the same number. This is clearly visible when using standard Sentence Transformer models (mpnet, MiniLM, etc.).

However, I wrote a CustomEmbeddings class for some more recent models, and specifically with BGE-M3 I see that, no matter what I do, the chunk counts from SDPM and Semantic stay identical. I tried printing the similarities, embeddings, etc., and I do see differences, but for some reason the chunks never differ, even though I believe SDPM should be merging some of them.
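
In other words, for the same text and parameters I would expect something like the following to hold (just a sketch to make the expectation concrete, using the same chunker parameters as in the repro below):

semantic_count = len(SemanticChunker(embedding_model=embeddings, threshold=0.75, chunk_size=1536).chunk(text))
sdpm_count = len(SDPMChunker(embedding_model=embeddings, threshold=0.75, chunk_size=1536).chunk(text))
assert sdpm_count <= semantic_count  # merging can only reduce (or keep) the chunk count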

Setup: install FlagEmbedding (which provides BGEM3FlagModel): pip install -U FlagEmbedding

Custom embedding class (please excuse the quick-and-dirty implementation; I had to test quickly):

class CustomEmbeddings(BaseEmbeddings):
    def __init__(self):
        self.model = BGEM3FlagModel("./bge-m3", use_fp16=True)
        self.task = "separation"
    
    @property
    def dimension(self) -> int:
        return 1024

    def embed(self, text: str) -> "np.ndarray":
        e = self.model.encode([text], return_dense=True, return_sparse=False, return_colbert_vecs=False)['dense_vecs'][0]
        # print(e)
        return e

    def embed_batch(self, texts: List[str]) -> List["np.ndarray"]:
        embeddings = self.model.encode(texts, return_dense=True, return_sparse=False, return_colbert_vecs=False)
        # print(embeddings['dense_vecs'])
        return embeddings['dense_vecs']

    def count_tokens(self, text: str) -> int:
        l = len(self.model.tokenizer.encode(text))
        # print(l)
        return l

    def count_tokens_batch(self, texts: List[str]) -> List[int]:
        encodings = self.model.tokenizer(texts)
        # print([len(enc) for enc in encodings["input_ids"]])
        return [len(enc) for enc in encodings["input_ids"]]

    def get_tokenizer_or_token_counter(self):
        return self.model.tokenizer
    
    def similarity(self, u: "np.ndarray", v: "np.ndarray") -> float:
        """Compute cosine similarity between two embeddings."""
        # BGE-M3 dense vectors are L2-normalized, so the dot product is the cosine similarity
        s = (u @ v.T)  # .item()
        # print(s)
        return s
    
    @classmethod
    def is_available(cls) -> bool:
        return True

    def __repr__(self) -> str:
        return "bgem3"

Code: you can use the Paul Graham essay as the input text for chunking -> https://gist.githubusercontent.com/wey-gu/75d49362d011a0f0354d39e396404ba2/raw/0844351171751ebb1ce54ea62232bf5e59445bb7/paul_graham_essay.txt

from chonkie import SemanticChunker
from chonkie import SDPMChunker
from typing import List
import numpy as np
from FlagEmbedding import BGEM3FlagModel
from chonkie.embeddings import BaseEmbeddings

# CustomEmbeddings class from above goes here...
embeddings = CustomEmbeddings()

with open('./pg_essay.txt', 'r') as file:
    text = file.read()

chunker = SemanticChunker(
    embedding_model=embeddings,
    threshold=0.75,
    chunk_size=1536
)

chunks = chunker.chunk(text)
print(f"Number of chunks: {len(chunks)}")
# for chunk in chunks:
#     print(f"Chunk text: {chunk.text}")
#     print(f"Token count: {chunk.token_count}")
#     print(f"Number of sentences: {len(chunk.sentences)}")

chunker = SDPMChunker(
    embedding_model=embeddings,
    threshold=0.75,
    chunk_size=1536
)

chunks = chunker.chunk(text)
print("\n~~~~~~~~~  SDPM ~~~~~~~~~~~~~")
print(f"Number of chunks: {len(chunks)}")

No matter what I use for chunk_size and threshold, the number of chunks is the same for both chunkers.

For example, with the parameters above, mpnet gives 384 and 372 chunks respectively (as expected), but BGE-M3 gives 92 for both.
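
To show it isn't just one unlucky parameter choice, a rough sweep along these lines (same embeddings and text as above) gives identical counts for BGE-M3 at every threshold, while mpnet behaves as expected:

for threshold in (0.5, 0.6, 0.7, 0.75, 0.8, 0.9):
    semantic = SemanticChunker(embedding_model=embeddings, threshold=threshold, chunk_size=1536)
    sdpm = SDPMChunker(embedding_model=embeddings, threshold=threshold, chunk_size=1536)
    # With mpnet the SDPM count drops below the semantic count; with BGE-M3 they are always equal.
    print(f"threshold={threshold}: semantic={len(semantic.chunk(text))}, sdpm={len(sdpm.chunk(text))}")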

@armsp added the bug (Something isn't working) label on Dec 16, 2024
@shreyashnigam added the in progress (Actively looking into the issue) label on Dec 16, 2024