Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Weaviate cannot retrieve _split_overlap field generated by the DocumentSplitter #1172

Open
davidsbatista opened this issue Nov 11, 2024 · 2 comments
Assignees
Labels
bug Something isn't working integration:weaviate

Comments

@davidsbatista
Copy link
Contributor

Describe the bug
Weavite cannot retrieve the _split_overlap field generated by the DocumentSplitter even if the field is specified in the schema.

To Reproduce
see more details at: deepset-ai/haystack#8511

Describe your environment (please complete the following information):

  • OS: macOS
  • Haystack version: 2.6.1
  • Integration version: 4.0.0
@davidsbatista davidsbatista added the bug Something isn't working label Nov 11, 2024
@anakin87
Copy link
Member

anakin87 commented Nov 11, 2024

Minimal reproducible code example:

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
from haystack_integrations.document_stores.weaviate.document_store import WeaviateDocumentStore

document_store = WeaviateDocumentStore(url="http://localhost:8080")

doc = Document(content = """This is a test. This is another test. This is a third test.
This is a fourth test. This is a fifth test. This is a sixth test.
This is a seventh test. This is an eighth test. This is a ninth test.
This is a tenth test. This is an eleventh test. This is a twelfth test.
This is a thirteenth test. This is a fourteenth test. This is a fifteenth test.""")

splitter = DocumentSplitter(split_length=3, split_overlap=2, split_by="word")

splitted_docs = splitter.run([doc])["documents"]

document_store.write_documents(splitted_docs)

print(document_store.filter_documents()[0])

Error:

Traceback (most recent call last):
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/collections/grpc/query.py", line 798, in __call
    res = await self._connection.grpc_stub.Search(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "creating primitive value for _split_overlap: proto: invalid type: []interface {}"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"creating primitive value for _split_overlap: proto: invalid type: []interface {}", grpc_status:2, created_time:"2024-11-11T16:17:57.2080374+01:00"}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/try.py", line 36, in <module>
    print(document_store.filter_documents()[0])
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/src/haystack_integrations/document_stores/weaviate/document_store.py", line 403, in filter_documents
    return [self._to_document(doc) for doc in result]
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/src/haystack_integrations/document_stores/weaviate/document_store.py", line 403, in <listcomp>
    return [self._to_document(doc) for doc in result]
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/collections/iterator.py", line 59, in __next__
    res = self.__query.fetch_objects(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/syncify.py", line 23, in sync_method
    return _EventLoopSingleton.get_instance().run_until_complete(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/event_loop.py", line 40, in run_until_complete
    return fut.result()
  File "/home/anakin87/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/anakin87/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/collections/queries/fetch_objects/query.py", line 65, in fetch_objects
    res = await self._query.get(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/collections/grpc/query.py", line 805, in __call
    raise WeaviateQueryError(str(e), "GRPC search")  # pyright: ignore
weaviate.exceptions.WeaviateQueryError: Query call with protocol GRPC search failed with message <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "creating primitive value for _split_overlap: proto: invalid type: []interface {}"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"creating primitive value for _split_overlap: proto: invalid type: []interface {}", grpc_status:2, created_time:"2024-11-11T16:17:57.2080374+01:00"}"
>.

@anakin87 anakin87 changed the title bug: Weavite cannot retrieve _split_overlap field generated by the DocumentSplitter bug: Weaviate cannot retrieve _split_overlap field generated by the DocumentSplitter Nov 11, 2024
@davidsbatista davidsbatista self-assigned this Nov 11, 2024
@anakin87
Copy link
Member

anakin87 commented Nov 11, 2024

Weaviate code to reproduce the bug

import weaviate

client = weaviate.WeaviateClient(
   connection_params=(weaviate.connect.base.ConnectionParams.from_url(url="http://localhost:8080", grpc_port=50051))
)

client.connect()

DOCUMENT_COLLECTION_PROPERTIES = [
   {"name": "_original_id", "dataType": ["text"]},
   {"name": "content", "dataType": ["text"]},
   {"name": "dataframe", "dataType": ["text"]},
   {"name": "blob_data", "dataType": ["blob"]},
   {"name": "blob_mime_type", "dataType": ["text"]},
   {"name": "score", "dataType": ["number"]},
   # the following properties can be present or not. Weaviate shows the same behavior:
   # documents are correctly written but not correctly returned
   # {
   #         'name': 'mylistofobjects', 'dataType': ['object[]'],
   #         'nestedProperties': [
   #             {'dataType': ['text'], 'name': 'doc_id'},
   #             {'dataType': ['number[]'], 'name': 'range'}],
   # }
]


collection_settings = {
   "class": "Default",
   "invertedIndexConfig": {"indexNullState": True},
   "properties": DOCUMENT_COLLECTION_PROPERTIES,
}


collection = client.collections.create_from_dict(collection_settings)

properties = {
   'content': 'This is a test document',
   'dataframe': None,
   'score': None,
   'mylistofobjects': [{'doc_id': '1', 'range': [1, 2]}],
   '_original_id': '3972bbfa2c09af05a7118ed4233124582a138dd83e3de1db3ff742f810df4c41',
}

collection.data.insert(
   properties=properties,
   vector=[0.1] * 300,
)

# this works and returns all properties except byte
# (in this case byte properties are not present, but they are not returned even if present)
it = collection.iterator(include_vector=True)
for i in it:
   print(i)

# this fails
it = collection.iterator(include_vector=True, return_properties=["content", "mylistofobjects"])
for i in it:
   print(i)

Error:

  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/collections/grpc/query.py", line 798, in __call
    res = await self._connection.grpc_stub.Search(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "creating primitive value for mylistofobjects: proto: invalid type: []interface {}"
        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-11-11T17:27:11.360212037+01:00", grpc_status:2, grpc_message:"creating primitive value for mylistofobjects: proto: invalid type: []interface {}"}"
>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/newtryweaviate.py", line 55, in <module>
    for i in it:
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/collections/iterator.py", line 59, in __next__
    res = self.__query.fetch_objects(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/syncify.py", line 23, in sync_method
    return _EventLoopSingleton.get_instance().run_until_complete(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/event_loop.py", line 40, in run_until_complete
    return fut.result()
  File "/home/anakin87/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/anakin87/.pyenv/versions/3.10.13/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/collections/queries/fetch_objects/query.py", line 65, in fetch_objects
    res = await self._query.get(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/weaviate/.hatch/weaviate-haystack/lib/python3.10/site-packages/weaviate/collections/grpc/query.py", line 805, in __call
    raise WeaviateQueryError(str(e), "GRPC search")  # pyright: ignore
weaviate.exceptions.WeaviateQueryError: Query call with protocol GRPC search failed with message <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "creating primitive value for mylistofobjects: proto: invalid type: []interface {}"
        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-11-11T17:27:11.360212037+01:00", grpc_status:2, grpc_message:"creating primitive value for mylistofobjects: proto: invalid type: []interface {}"}"
>.```

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working integration:weaviate
Projects
None yet
Development

No branches or pull requests

2 participants