BGE-M3 custom embeddings always have the same number of chunks between Semantic and SDPM Chunker #94

Open
armsp opened this issue Dec 16, 2024 · 0 comments
Labels: bug (Something isn't working) · in progress (Actively looking into the issue)

armsp commented Dec 16, 2024

The general idea is that, because of merging, SDPM should produce fewer chunks than the Semantic Chunker, or at most the same number. This is clearly visible when using standard Sentence Transformer models (mpnet, MiniLM, etc.).

However, I wrote a CustomEmbeddings class for some more recent models, and specifically with BGE-M3 I see that, no matter what I do, the chunk counts from SDPM and Semantic stay identical. I tried printing the similarities, embeddings, etc., and I do see differences, but for some reason the chunks never differ, even though I believe SDPM should be merging some of them.
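
In other words, for the same text and parameters I would expect something like the following to hold (just a sketch to make the expectation concrete, using the same chunker parameters as in the repro below):

semantic_count = len(SemanticChunker(embedding_model=embeddings, threshold=0.75, chunk_size=1536).chunk(text))
sdpm_count = len(SDPMChunker(embedding_model=embeddings, threshold=0.75, chunk_size=1536).chunk(text))
assert sdpm_count <= semantic_count  # merging can only reduce (or keep) the chunk count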

Setup: install FlagEmbedding (which provides BGEM3FlagModel): pip install -U FlagEmbedding

Custom embedding class (please excuse the quick-and-dirty implementation; I had to test quickly):

class CustomEmbeddings(BaseEmbeddings):
    def __init__(self):
        self.model = BGEM3FlagModel("./bge-m3", use_fp16=True)
        self.task = "separation"
    
    @property
    def dimension(self) -> int:
        return 1024

    def embed(self, text: str) -> "np.ndarray":
        e = self.model.encode([text], return_dense=True, return_sparse=False, return_colbert_vecs=False)['dense_vecs'][0]
        # print(e)
        return e

    def embed_batch(self, texts: List[str]) -> List["np.ndarray"]:
        embeddings = self.model.encode(texts, return_dense=True, return_sparse=False, return_colbert_vecs=False)
        # print(embeddings['dense_vecs'])
        return embeddings['dense_vecs']

    def count_tokens(self, text: str) -> int:
        l = len(self.model.tokenizer.encode(text))
        # print(l)
        return l

    def count_tokens_batch(self, texts: List[str]) -> List[int]:
        encodings = self.model.tokenizer(texts)
        # print([len(enc) for enc in encodings["input_ids"]])
        return [len(enc) for enc in encodings["input_ids"]]

    def get_tokenizer_or_token_counter(self):
        return self.model.tokenizer
    
    def similarity(self, u: "np.ndarray", v: "np.ndarray") -> float:
        """Compute cosine similarity between two embeddings."""
        # BGE-M3 dense vectors are L2-normalized, so the dot product is the cosine similarity
        s = (u @ v.T)  # .item()
        # print(s)
        return s
    
    @classmethod
    def is_available(cls) -> bool:
        return True

    def __repr__(self) -> str:
        return "bgem3"

Code: you can use the Paul Graham essay as the input text for chunking -> https://gist.githubusercontent.com/wey-gu/75d49362d011a0f0354d39e396404ba2/raw/0844351171751ebb1ce54ea62232bf5e59445bb7/paul_graham_essay.txt

from chonkie import SemanticChunker
from chonkie import SDPMChunker
from typing import List
import numpy as np
from FlagEmbedding import BGEM3FlagModel
from chonkie.embeddings import BaseEmbeddings

# CustomEmbeddings class from above goes here...
embeddings = CustomEmbeddings()

with open('./pg_essay.txt', 'r') as file:
    text = file.read()

chunker = SemanticChunker(
    embedding_model=embeddings,
    threshold=0.75,
    chunk_size=1536
)

chunks = chunker.chunk(text)
print(f"Number of chunks: {len(chunks)}")
# for chunk in chunks:
#     print(f"Chunk text: {chunk.text}")
#     print(f"Token count: {chunk.token_count}")
#     print(f"Number of sentences: {len(chunk.sentences)}")

chunker = SDPMChunker(
    embedding_model=embeddings,
    threshold=0.75,
    chunk_size=1536
)

chunks = chunker.chunk(text)
print("\n~~~~~~~~~  SDPM ~~~~~~~~~~~~~")
print(f"Number of chunks: {len(chunks)}")

No matter what I use for chunk_size and threshold, the number of chunks is the same for both chunkers.

For example, with the parameters above, mpnet gives 384 and 372 chunks respectively (as expected), but BGE-M3 gives 92 for both.
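
To show it isn't just one unlucky parameter choice, a rough sweep along these lines (same embeddings and text as above) gives identical counts for BGE-M3 at every threshold, while mpnet behaves as expected:

for threshold in (0.5, 0.6, 0.7, 0.75, 0.8, 0.9):
    semantic = SemanticChunker(embedding_model=embeddings, threshold=threshold, chunk_size=1536)
    sdpm = SDPMChunker(embedding_model=embeddings, threshold=threshold, chunk_size=1536)
    # With mpnet the SDPM count drops below the semantic count; with BGE-M3 they are always equal.
    print(f"threshold={threshold}: semantic={len(semantic.chunk(text))}, sdpm={len(sdpm.chunk(text))}")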

@armsp added the bug (Something isn't working) label on Dec 16, 2024
@shreyashnigam added the in progress (Actively looking into the issue) label on Dec 16, 2024