[Fix]：add initial_sentences param and fix custom tokenizer does not work #86

ljhssga · 2024-12-11T13:42:09Z

Feature Enhancement and Bug Fixes for Semantic Chunking

SemanticChunker: Added initial_sentences Parameter

Introduced a new initial_sentences parameter to the SemanticChunker class
Allows users to specify the number of sentences used as initial context when starting a new chunk in cumulative mode
Default value set to 1
Provides increased flexibility in controlling semantic-based sentence grouping

Improved Tokenizer Validation in BaseChunker

Enhanced input validation for the tokenizer_or_token_counter parameter
Added support for bound methods via an inspect.ismethod() check
Increases robustness of tokenizer input processing
Enables more versatile tokenization strategies
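
To illustrate why the extra check matters, here is a minimal sketch. It assumes the earlier validation accepted only plain functions (for example via inspect.isfunction()); the WordTokenizer class and the is_token_counter helper are hypothetical names for illustration and are not the BaseChunker code itself.

```python
import inspect


class WordTokenizer:
    """Hypothetical custom tokenizer whose counting logic lives in a method."""

    def count_tokens(self, text: str) -> int:
        # Naive whitespace tokenization, used only for illustration.
        return len(text.split())


def is_token_counter(obj) -> bool:
    """Accept both plain functions and bound methods as token counters.

    Assumption: a stricter check that used only inspect.isfunction()
    would reject bound methods, since they are method objects.
    """
    return inspect.isfunction(obj) or inspect.ismethod(obj)


counter = WordTokenizer().count_tokens

# A bound method is not a plain function, so a function-only check fails.
function_only = inspect.isfunction(counter)

# The combined check accepts it.
accepted = is_token_counter(counter)

# The bound method is still an ordinary callable.
token_count = counter("semantic chunking splits text")
```

Lambdas and module-level functions still pass the combined check, so existing callers would not be affected under this assumption.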

Fixes for Current Version (v0.2.2)

Resolves TypeError when setting initial_sentences
Addresses ValueError related to custom tokenizer usage
Supports a broader range of tokenization approaches

ljhssga · 2024-12-11T13:45:49Z

Fixes #64

bhavnicksm · 2024-12-11T20:22:11Z

Hey @ljhssga!

Thanks for opening a PR! 😊

It's actually quite valid that the argument initial_sentences doesn't work: it was removed in version v0.2.2, and the documentation needs to be updated. Sorry for the confusion 🥹

We are now using the arguments similarity_window and min_sentences to control the behaviour that initial_sentences used to cover. min_sentences is the number of sentences a chunk starts with in mode="cumulative", which was the default earlier.

Currently, mode="window" is the default, and it uses similarity_window and min_sentences together to get the final result.
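
For anyone migrating older code, the mapping described above can be sketched as a small helper. This is only an illustration under assumptions: migrate_chunker_kwargs is a hypothetical name, not part of the library, and it assumes that initial_sentences corresponds to min_sentences with mode="cumulative" as described in this comment. It does not call SemanticChunker itself.

```python
def migrate_chunker_kwargs(kwargs: dict) -> dict:
    """Translate the removed initial_sentences kwarg into newer arguments.

    Hypothetical helper: assumes initial_sentences maps onto min_sentences
    in mode="cumulative", as described in the comment above.
    """
    migrated = dict(kwargs)  # copy so the caller's dict is left untouched
    if "initial_sentences" in migrated:
        initial = migrated.pop("initial_sentences")
        # Respect explicit values if the caller already set them.
        migrated.setdefault("min_sentences", initial)
        migrated.setdefault("mode", "cumulative")
    return migrated


old_kwargs = {"initial_sentences": 2, "similarity_window": 1}
new_kwargs = migrate_chunker_kwargs(old_kwargs)
```

The resulting dictionary could then be passed as keyword arguments alongside whatever other arguments the chunker requires in your installed version.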

I will try to update the documentation ASAP, but I am pretty swamped with some features at the moment 🥲

Thank you for the patience 🙏

feat(chunker): add initial_sentences param and improve tokenizer validation

5b33d48

Add initial_sentences parameter to SemanticChunker for controlling initial chunk size in cumulative mode
Enhance tokenizer validation in BaseChunker to properly handle method objects
