
[BUG] WordChunker's chunk_batch function fails #73

Closed
kime541200 opened this issue Nov 26, 2024 · 2 comments
Labels: bug (Something isn't working), in progress (Actively looking into the issue)

Comments

@kime541200

When I call the chunk_batch function in WordChunker, it shows the following error message:

    batch_chunks: List[List[Chunk]] = chunker.chunk_batch(text=texts)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/chonkie/chunker/base.py", line 214, in chunk_batch
    return pool.map(self.chunk, text)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/pool.py", line 774, in get
    raise self._value
  File "/opt/conda/lib/python3.11/multiprocessing/pool.py", line 540, in _handle_tasks
    put(task)
  File "/opt/conda/lib/python3.11/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'BaseChunker._get_tokenizer_counter.<locals>.<lambda>'
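For context on why this fails: Pool.map has to pickle the bound method self.chunk, which drags the whole chunker instance (including the token-counting lambda it stores) through pickle, and the standard pickle module cannot serialize a function defined inside another function. A minimal sketch of the failure mode, using hypothetical names rather than chonkie's actual internals:

import pickle

def make_counter():
    # Functions defined inside another function have no importable
    # qualified name, so the standard pickle module rejects them.
    return lambda text: len(text.split())

class Chunker:
    def __init__(self):
        self._counter = make_counter()  # the lambda is stored on the instance

try:
    # multiprocessing does exactly this when sending tasks to worker processes.
    pickle.dumps(Chunker())
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_counter.<locals>.<lambda>'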

Here is the code that I ran:

from typing import List

from chonkie import WordChunker, Chunk
from autotiktokenizer import AutoTikTokenizer

tokenizer = AutoTikTokenizer.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF")

chunk_size = 512      # example values; my real settings are defined elsewhere
chunk_overlap = 128

chunker = WordChunker(
    tokenizer=tokenizer,
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

# `documents` is a list of article objects; each `document.content` is a str.
texts: List[str] = []
for document in documents:
    texts.append(document.content)

list_my_chunks: List[List[ChunkModel]] = []  # ChunkModel is my own model, defined elsewhere
batch_chunks: List[List[Chunk]] = chunker.chunk_batch(text=texts)
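[Editor's note: until a fixed release is available, one way to avoid the error is to skip the multiprocessing path entirely and chunk sequentially; a workaround sketch, not an official recommendation:]

# Workaround sketch: call chunker.chunk per text in the main process,
# which never pickles the chunker, at the cost of losing parallelism.
batch_chunks: List[List[Chunk]] = [chunker.chunk(text) for text in texts]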
@kime541200 added the bug label on Nov 26, 2024
@bhavnicksm (Collaborator)

Hey @kime541200!

Thanks for submitting an issue 😊

Just a bit swamped at the moment; I'll get back to you after trying to reproduce the issue!

Thanks for your patience!

@bhavnicksm (Collaborator)

Hey @kime541200!

This issue has been resolved by #96 in the source and will be available as of the next release~

Thanks! 😊
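[Editor's note: for anyone pinned to an older release, the usual fix for this class of error is to replace the instance-level lambda with a named, picklable method. A sketch of the pattern, not necessarily the exact change made in #96:]

class BaseChunker:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    # A bound method pickles cleanly as long as its class is importable,
    # unlike a lambda built inside a helper and stored on the instance.
    def _count_tokens(self, text: str) -> int:
        # assumes a tiktoken-style tokenizer exposing .encode()
        return len(self.tokenizer.encode(text))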
