
[BUG] WordChunker's chunk_batch function fails #73

Closed
kime541200 opened this issue Nov 26, 2024 · 2 comments
Labels: bug (Something isn't working), in progress (Actively looking into the issue)

Comments

@kime541200

When I call the chunk_batch function in WordChunker, it shows the following error message:

    batch_chunks: List[List[Chunk]] = chunker.chunk_batch(text=texts)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/chonkie/chunker/base.py", line 214, in chunk_batch
    return pool.map(self.chunk, text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/opt/conda/lib/python3.11/multiprocessing/pool.py", line 540, in _handle_tasks
    put(task)
  File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'BaseChunker._get_tokenizer_counter.<locals>.<lambda>'
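For context on why this fails: Pool.map has to pickle the bound method self.chunk, which drags the whole chunker instance (including the token-counting lambda it stores) through pickle, and the standard pickle module cannot serialize a function defined inside another function. A minimal sketch of the failure mode, using hypothetical names rather than chonkie's actual internals:

import pickle

def make_counter():
    # Functions defined inside another function have no importable
    # qualified name, so the standard pickle module rejects them.
    return lambda text: len(text.split())

class Chunker:
    def __init__(self):
        self._counter = make_counter()  # the lambda is stored on the instance

try:
    # multiprocessing does exactly this when sending tasks to worker processes.
    pickle.dumps(Chunker())
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_counter.<locals>.<lambda>'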

Here is the code that I ran:

from typing import List

from chonkie import WordChunker, Chunk
from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF")

chunk_size = 512      # example values; my real settings are defined elsewhere
chunk_overlap = 128

chunker = WordChunker(
    tokenizer=tokenizer,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

# `documents` is a list of article objects; each `document.content` is a str.
texts: List[str] = []
for document in documents:
    texts.append(document.content)

list_my_chunks: List[List[ChunkModel]] = []  # ChunkModel is my own model, defined elsewhere
batch_chunks: List[List[Chunk]] = chunker.chunk_batch(text=texts)
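[Editor's note: until a fixed release is available, one way to avoid the error is to skip the multiprocessing path entirely and chunk sequentially; a workaround sketch, not an official recommendation:]

# Workaround sketch: call chunker.chunk per text in the main process,
# which never pickles the chunker, at the cost of losing parallelism.
batch_chunks: List[List[Chunk]] = [chunker.chunk(text) for text in texts]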
@kime541200 added the bug label on Nov 26, 2024
@bhavnicksm (Collaborator)

Hey @kime541200!

Thanks for submitting an issue 😊

Just a bit swamped at the moment; I'll get back to you after trying to reproduce the issue!

Thanks for your patience!

@bhavnicksm (Collaborator)

Hey @kime541200!

This issue has been resolved by #96 in the source and will be available as of the next release~

Thanks! 😊
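[Editor's note: for anyone pinned to an older release, the usual fix for this class of error is to replace the instance-level lambda with a named, picklable method. A sketch of the pattern, not necessarily the exact change made in #96:]

class BaseChunker:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    # A bound method pickles cleanly as long as its class is importable,
    # unlike a lambda built inside a helper and stored on the instance.
    def _count_tokens(self, text: str) -> int:
        # assumes a tiktoken-style tokenizer exposing .encode()
        return len(self.tokenizer.encode(text))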
