Add milvus bulk #190
base: main
Conversation
…/nv-ingest into minio-embedding-upload
            dtype=pymilvus.DataType.JSON,
            description="Content metadata",
        ).to_dict(),
        # pymilvus.FieldSchema(
Removed these fields because the JSON dtype is not compatible with parquet files. If we kept this data, ingestion would blow up at upload time: Milvus expects JSON data but actually receives structured data from Arrow.
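For illustration, a minimal sketch of the mismatch (assuming pyarrow; the column and file names are illustrative): a Python dict column comes out of Arrow as a struct, not a JSON string.

```python
# Minimal sketch, assuming pyarrow is installed.
# A Python dict column is inferred as an Arrow struct, not a JSON string.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"content_metadata": [{"page": 1, "type": "text"}]})
print(table.schema)  # content_metadata: struct<page: int64, type: string>

pq.write_table(table, "chunk.parquet")
# A Milvus field declared as DataType.JSON expects JSON-encoded values,
# so bulk-ingesting this file fails: the field receives an Arrow struct.
```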
@mpenn I'm not familiar enough with the lineage here, could you comment?
The lineage for this was to provide content in an intuitive structure to support the RAG workflow behind the multimodal-pdf-data extraction blueprint. If we move away from this structure, the blueprint will likely need some re-work; not impossible, but not free either.
Still need further guidance on how we want to handle this schema change for VDB entries.
Waiting on a decision from @randerzander on this; would definitely prefer a workaround if it is feasible, as this is almost certainly a breaking change for some workflows.
@mpenn, if you have time, could you take a look at this?
Some comments/concerns; let's discuss.
Co-authored-by: Devin Robison <[email protected]>
task_properties = {
    "method": self._store_method,
    "structured": self._structured,
    "images": self._images,
    "params": self._extra_params,
    "extra_params": self._extra_params,
This fixes the params issue where you can't connect to the private bucket.
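As a hedged sketch of how this plays out on the client side (the credential key names below are assumptions for illustration, not the confirmed nv-ingest API):

```python
# Sketch only: key names like "access_key"/"secret_key" are assumptions,
# used to illustrate how private-bucket credentials travel with the task.
extra_params = {
    "endpoint": "minio:9000",    # assumed key name
    "access_key": "minioadmin",  # assumed key name
    "secret_key": "minioadmin",  # assumed key name
}

task_properties = {
    "method": "minio",
    "structured": True,
    "images": True,
    "params": extra_params,        # legacy key
    "extra_params": extra_params,  # key the storage stage now reads
}
```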
@@ -135,7 +135,7 @@ def on_data(ctrl_msg: ControlMessage):
     if store_images:
         content_types[ContentTypeEnum.IMAGE] = store_images

-    params = task_props.get("params", {})
+    params = task_props.get("extra_params", {})
Part of the fix for the params not coming all the way through from the client.
/ok to test
Description
This PR adds bulk-ingest functionality that can be used in a chained set of tasks or as a standalone task after the fact. Some changes were made to the schema because JSON dtypes are not compatible with parquet files. Together with #179, this closes #164. A large portion of the code reflected here actually comes from #179, since this PR builds on and requires that implementation.
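For context, a minimal sketch of what Milvus bulk insert looks like at the pymilvus level (the collection name, file path, and connection details are assumptions; the parquet file must already live in the object store Milvus reads from):

```python
# Minimal pymilvus bulk-insert sketch; names are illustrative assumptions.
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

# Kick off an asynchronous bulk-insert task against an existing collection.
task_id = utility.do_bulk_insert(
    collection_name="nv_ingest_collection",  # assumed collection name
    files=["embeddings.parquet"],            # path inside the Milvus bucket
)

# Poll the task until it finishes; bulk insert does not block.
state = utility.get_bulk_insert_state(task_id=task_id)
print(state.state_name, state.row_count)
```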
Checklist