bug: Weaviate cannot retrieve _split_overlap field generated by the DocumentSplitter #1172
Minimal reproducible code example:

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
from haystack_integrations.document_stores.weaviate.document_store import WeaviateDocumentStore
document_store = WeaviateDocumentStore(url="http://localhost:8080")
doc = Document(content = """This is a test. This is another test. This is a third test.
This is a fourth test. This is a fifth test. This is a sixth test.
This is a seventh test. This is an eighth test. This is a ninth test.
This is a tenth test. This is an eleventh test. This is a twelfth test.
This is a thirteenth test. This is a fourteenth test. This is a fifteenth test.""")
splitter = DocumentSplitter(split_length=3, split_overlap=2, split_by="word")
splitted_docs = splitter.run([doc])["documents"]
document_store.write_documents(splitted_docs)
print(document_store.filter_documents()[0])

Error:
anakin87 changed the title from "bug: Weavite cannot retrieve _split_overlap field generated by the DocumentSplitter" to "bug: Weaviate cannot retrieve _split_overlap field generated by the DocumentSplitter" on Nov 11, 2024
Weaviate code to reproduce the bug:

import weaviate

client = weaviate.WeaviateClient(
    connection_params=(weaviate.connect.base.ConnectionParams.from_url(url="http://localhost:8080", grpc_port=50051))
)
client.connect()

DOCUMENT_COLLECTION_PROPERTIES = [
    {"name": "_original_id", "dataType": ["text"]},
    {"name": "content", "dataType": ["text"]},
    {"name": "dataframe", "dataType": ["text"]},
    {"name": "blob_data", "dataType": ["blob"]},
    {"name": "blob_mime_type", "dataType": ["text"]},
    {"name": "score", "dataType": ["number"]},
    # the following properties can be present or not. Weaviate shows the same behavior:
    # documents are correctly written but not correctly returned
    # {
    #     'name': 'mylistofobjects', 'dataType': ['object[]'],
    #     'nestedProperties': [
    #         {'dataType': ['text'], 'name': 'doc_id'},
    #         {'dataType': ['number[]'], 'name': 'range'}],
    # }
]

collection_settings = {
    "class": "Default",
    "invertedIndexConfig": {"indexNullState": True},
    "properties": DOCUMENT_COLLECTION_PROPERTIES,
}
collection = client.collections.create_from_dict(collection_settings)

properties = {
    'content': 'This is a test document',
    'dataframe': None,
    'score': None,
    'mylistofobjects': [{'doc_id': '1', 'range': [1, 2]}],
    '_original_id': '3972bbfa2c09af05a7118ed4233124582a138dd83e3de1db3ff742f810df4c41',
}
collection.data.insert(
    properties=properties,
    vector=[0.1] * 300,
)

# this works and returns all properties except byte
# (in this case byte properties are not present, but they are not returned even if present)
it = collection.iterator(include_vector=True)
for i in it:
    print(i)

# this fails
it = collection.iterator(include_vector=True, return_properties=["content", "mylistofobjects"])
for i in it:
    print(i)

Error:
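
Possible workaround sketch (an assumption, not something confirmed in this issue or by the maintainers): because the failure in the reproduction above only appears when a list-of-objects property such as mylistofobjects is requested, one way to sidestep it could be to store such values as a JSON string in a plain text property and decode them after reading. The helper names encode_nested and decode_nested and the "_json" suffix are hypothetical and are not part of Haystack or the Weaviate client. The snippet only uses the standard library and does not talk to Weaviate.

```python
import json
from typing import Any, Dict

# Hypothetical suffix marking properties that hold JSON-encoded nested values.
NESTED_SUFFIX = "_json"


def encode_nested(properties: Dict[str, Any]) -> Dict[str, Any]:
    """Replace every non-empty list of dicts with a JSON string under '<name>_json'."""
    encoded: Dict[str, Any] = {}
    for name, value in properties.items():
        is_list_of_dicts = (
            isinstance(value, list)
            and len(value) > 0
            and all(isinstance(item, dict) for item in value)
        )
        if is_list_of_dicts:
            encoded[name + NESTED_SUFFIX] = json.dumps(value)
        else:
            encoded[name] = value
    return encoded


def decode_nested(properties: Dict[str, Any]) -> Dict[str, Any]:
    """Reverse encode_nested: parse '<name>_json' strings back into lists of dicts."""
    decoded: Dict[str, Any] = {}
    for name, value in properties.items():
        if name.endswith(NESTED_SUFFIX) and isinstance(value, str):
            decoded[name[: -len(NESTED_SUFFIX)]] = json.loads(value)
        else:
            decoded[name] = value
    return decoded


# Same shape as the nested property used in the reproduction above.
properties = {
    "content": "This is a test document",
    "mylistofobjects": [{"doc_id": "1", "range": [1, 2]}],
}

flat = encode_nested(properties)   # what would be passed to collection.data.insert
restored = decode_nested(flat)     # what would be rebuilt after reading back
print(flat)
print(restored)
```

With this approach the collection would only need a text property (here mylistofobjects_json) instead of an object[] property. The trade-off is that the nested values can no longer be filtered on inside Weaviate, and JSON turns tuples into lists on the way back.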
Describe the bug
Weaviate cannot retrieve the _split_overlap field generated by the DocumentSplitter, even if the field is specified in the schema.

To Reproduce
See more details at: deepset-ai/haystack#8511

Describe your environment (please complete the following information):