[Fix]: add initial_sentences param and fix custom tokenizer not working #86

Open
wants to merge 1 commit into main


@ljhssga ljhssga commented Dec 11, 2024

Feature Enhancement and Bug Fixes for Semantic Chunking

  1. SemanticChunker: Added initial_sentences Parameter

- Introduced a new initial_sentences parameter to the SemanticChunker class.
- It sets the number of sentences used as initial context when starting a new chunk in cumulative mode.
- The default value is 1.
- This gives finer control over semantic-based sentence grouping.
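
To illustrate the idea of seeding each new chunk with a fixed number of sentences, here is a minimal toy sketch. It is not the SemanticChunker implementation: the function names `jaccard` and `group_cumulative`, the `threshold` value, and the word-overlap similarity (used in place of embeddings) are all assumptions made for this example.

```python
def jaccard(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def group_cumulative(sentences, initial_sentences=1, threshold=0.2):
    """Group sentences; every new group is seeded with `initial_sentences` sentences.

    Hypothetical helper for illustration only, not part of any library.
    """
    if initial_sentences < 1:
        raise ValueError("initial_sentences must be >= 1")
    groups = []
    i = 0
    while i < len(sentences):
        # Seed the group with the initial context, without a similarity check.
        current = list(sentences[i:i + initial_sentences])
        i += len(current)
        # Keep adding sentences while they stay similar to the accumulated text.
        while i < len(sentences) and jaccard(" ".join(current), sentences[i]) >= threshold:
            current.append(sentences[i])
            i += 1
        groups.append(current)
    return groups


sentences = ["cats purr", "cats sleep", "dogs bark", "dogs run"]
print(group_cumulative(sentences, initial_sentences=1))
print(group_cumulative(sentences, initial_sentences=3))
```

A larger seed gives the first comparison more context, at the cost of forcing the first few sentences of a chunk together regardless of similarity.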

  2. Improved Tokenizer Validation in BaseChunker

- Enhanced input validation for the tokenizer_or_token_counter parameter.
- Added an inspect.ismethod() check so that method objects, such as a custom tokenizer's bound method, are handled.
- Makes tokenizer input processing more robust and supports a wider range of tokenization strategies.
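
The reason an inspect.ismethod() check matters can be shown with the standard library alone. The sketch below assumes the earlier validation relied on a check like inspect.isfunction(), which the PR text does not confirm; `WordTokenizer`, `plain_counter`, and `is_token_counter` are hypothetical names used only for illustration.

```python
import inspect


class WordTokenizer:
    """Hypothetical custom tokenizer that exposes a counting method."""

    def count_tokens(self, text: str) -> int:
        return len(text.split())


def plain_counter(text: str) -> int:
    """A plain function used as a token counter."""
    return len(text.split())


def is_token_counter(obj) -> bool:
    """Accept both plain functions and bound methods (assumed validation rule)."""
    return inspect.isfunction(obj) or inspect.ismethod(obj)


bound = WordTokenizer().count_tokens

# A bound method is not a plain function, so an isfunction-only check rejects it.
print(inspect.isfunction(bound))        # False
print(inspect.ismethod(bound))          # True
print(is_token_counter(bound))          # True
print(is_token_counter(plain_counter))  # True
```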

Fixes for Current Version (v0.2.2)

- Resolves the TypeError raised when setting initial_sentences
- Addresses the ValueError raised when using a custom tokenizer
- Supports a broader range of tokenization approaches

…dation

- Add initial_sentences parameter to SemanticChunker for controlling initial chunk size in cumulative mode
- Enhance tokenizer validation in BaseChunker to properly handle method objects
@ljhssga
Author

ljhssga commented Dec 11, 2024

Fixes #64

@bhavnicksm
Collaborator

Hey @ljhssga!

Thanks for opening a PR! 😊

It's actually quite valid that the argument initial_sentences doesn't work: it was removed in version v0.2.2, and the documentation needs to be updated. Sorry for the confusion 🥹

We now use the arguments similarity_window and min_sentences to control the behaviour that initial_sentences used to provide. Here, min_sentences is the number of sentences a chunk starts with in mode="cumulative", which was the default earlier.

Currently, mode="window" is the default, so the chunker uses similarity_window and min_sentences together to get the final result.

I will try to update the documentation ASAP, but I am pretty swamped with some features at the moment 🥲

Thank you for the patience 🙏
