-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updated chunking_document. #65
Conversation
Tested this PR. Chunking seems to be working as expected. @PalmPalm7 as discussed can you add the handling of document if its not passed as list? |
Updated chunking method with this logic. Ready to merge @abhi1092 . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please squash your 3 commits into 1 and add all of the description you have on the PR into the commit message? Let me know if you need any assistance doing this!
Thank you Russell! I have squashed and amended my commits. Ready to merge @russellb . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good -- passing e2e CI, and I like that it falls back to the old method in case of an error.
Instead of a raw print()
I would prefer using a logger. Can you add a new parameter for logger
, update consumers to pass in a logger, and then use that for the messages you're printing? Thanks!
Done! @russellb I've added this parameter: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, nice work!
(you could drop "resolved merge conflict" from the PR title / commit message, but it's not a big deal, and not worth the CI reset probably)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments but LGTM overall, great work!
This new dependency is coming at a time when we are in a "soft dependency" freeze - i.e. we are avoiding adding new dependencies, unless there is some exceptional reason to justify it If we did not add this library now (i.e. waited until the next milestone release has been completed), I guess we would just use IMO changing from the current naive list of separators to |
That makes a ton of sense. Thanks for thinking this through, Mark. Markdown is the only knowledge doc format supported via the CLI right now, so it's only format we need to support at the moment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing review based on Mark's feedback. I think he's right about the dependency and what features are required right now vs. can come later.
@PalmPalm7 if you split the magika part into a separate PR, we could merge the rest as a super useful improvement to the markdown support 👍 |
+1 -- magika introduces a dependency on onnxruntime, which is another package that doesn't publish standard source code packages so we'll have to figure out how to build it to prepare a release downstream. It would be nice if we could wait to bring that in until this first major release is settled. |
Thank you for taking the time to review and foresee potential risks @dhellmann @markmc @russellb , I'll split it into to two PR. |
ad106f3
to
19bf54b
Compare
Updated logic according to Mark and Doug's suggestions above. Ready to merge @russellb . |
1. Applied document-specific test splitter from Langchain in replace of original naive version. 2. Made heuristics changes to markdown file, especially using regex to trim markdown tables in attempt to fit in the whole table with limited context window. 3. For updated chunk_document() function, see Chunking_Demo.ipynb on chunking with server_ctx_size=4096, chunk_word_count=1024). Granite 7b has 4k context windows. Signed-off-by: Andy Xie <[email protected]>
Thank you for being flexible! |
@russellb can you please check if the changes addressed your concern above? |
) | ||
|
||
# Determine file type for heuristics, default with markdown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this comment may be out of date, but not a big deal
Demonstrated new chunking methods in replace of RecursiveCharacterTextSplitter()
Updates:
Saved for later:
For further PR, an efficient library to detect file type could be added for document-specific chunking and heuristics.
No logic on larger chunk size updated this PR. "CharacterTextSplitter will only split on separator (which is '\n\n' by default). chunk_size is the maximum chunk size that will be split if splitting is possible." See this StackOverflow Post.
My Justification is there is I assume our most common use cases, as we discussed, is PDF into markdown formats, therefore default case should be markdown. Furthermore, by specifying the language param in RecursiveCharacterTextSplitter, it uses these following separators:
Instead of these separators.
Original PR, see: #45