
Add milvus bulk #190

Open · wants to merge 37 commits into main

Conversation

jperez999 (Contributor) commented Oct 24, 2024

Description

This PR adds bulk ingest functionality that can be used in a chained set of tasks or as a standalone task after the fact. Some changes were made to the schema because JSON dtypes are not compatible with parquet files. Together with #179, this closes #164. A large portion of the code reflected here is actually from #179, since this PR builds on and requires that implementation.
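For context, a minimal sketch (not code from this PR) of what a bulk ingest of a parquet file looks like with pymilvus's bulk-insert utility; the collection name and file path are placeholders:

```python
from pymilvus import connections, utility

# Minimal sketch, not code from this PR: bulk-insert a parquet file into an
# existing Milvus collection. Collection name and file path are placeholders.
connections.connect(host="localhost", port="19530")

task_id = utility.do_bulk_insert(
    collection_name="nv_ingest_collection",  # placeholder collection name
    files=["embeddings.parquet"],            # parquet produced by the pipeline
)

# Poll the import task until Milvus reports a terminal state.
state = utility.get_bulk_insert_state(task_id=task_id)
print(state.state_name, state.row_count)
```

Milvus's bulk-insert path reads the files from object storage reachable by the server, which is part of why the schema's parquet compatibility matters for this PR.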

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

dtype=pymilvus.DataType.JSON,
description="Content metadata",
).to_dict(),
# pymilvus.FieldSchema(
jperez999 (Contributor, Author):

Removed these fields because JSON is not compatible with parquet files. If we kept this data, the ingestion would blow up at upload time because Milvus expects JSON data but actually receives structured data from Arrow.
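To illustrate the mismatch (the column name and values below are made up, not from this PR): a dict-valued pandas column is converted to an Arrow struct rather than a JSON-encoded value, so a `DataType.JSON` field cannot ingest it from parquet directly.

```python
import json

import pandas as pd
import pyarrow as pa

# Illustration only (column name/values are made up): a dict-valued column
# is converted to an Arrow struct, not a JSON-encoded value.
df = pd.DataFrame({"content_metadata": [{"page": 1, "source": "a.pdf"}]})
table = pa.Table.from_pandas(df)
print(table.schema)  # content_metadata: struct<page: int64, source: string>

# A parquet file written from this table would feed Arrow structs to a
# Milvus DataType.JSON field, which is the mismatch described above.
# One possible workaround is serializing the column to strings first:
df["content_metadata"] = df["content_metadata"].apply(json.dumps)
```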

Collaborator:

@mpenn I'm not familiar enough with the lineage here, could you comment?

Collaborator:

The lineage for this was to provide content in an intuitive structure to support the RAG workflow behind the multimodal-pdf-data extraction blueprint. If we decide to move away from this structure, it will likely require some blueprint rework; not impossible, but it would take some effort.

jperez999 (Contributor, Author):

Still need further guidance on how we want to handle this schema change for VDB entries.

Collaborator:

Waiting on a decision from @randerzander on this; I would definitely prefer a workaround if it is feasible, as this is almost certainly a breaking change for some workflows.

drobison00 requested a review from mpenn on October 24, 2024.

drobison00 (Collaborator):

@mpenn If you have time, could you take a look at this?

drobison00 (Collaborator) left a review:

Some comments/concerns; let's discuss.

Inline review comments were left on the following files (most now resolved or outdated):

  • client/src/nv_ingest_client/cli/util/click.py
  • client/src/nv_ingest_client/primitives/tasks/__init__.py
  • src/nv_ingest/modules/sinks/vdb_task_sink.py
  • src/nv_ingest/modules/storages/embedding_storage.py

task_properties = {
"method": self._store_method,
"structured": self._structured,
"images": self._images,
"params": self._extra_params,
"extra_params": self._extra_params,
jperez999 (Contributor, Author):

This fixes the params issue where you can't connect to the private bucket.
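For illustration, a hedged sketch of what the serialized store-task properties might carry once `extra_params` is forwarded; the key names and values below are hypothetical, not taken from this PR:

```python
# Hypothetical values only; the keys inside "params"/"extra_params" are
# assumptions for illustration, not values defined by this PR.
task_properties = {
    "method": "minio",  # assumed store method
    "structured": True,
    "images": True,
    "params": {
        "endpoint": "minio:9000",
        "access_key": "my-access-key",
        "secret_key": "my-secret-key",
    },
    "extra_params": {
        "endpoint": "minio:9000",
        "access_key": "my-access-key",
        "secret_key": "my-secret-key",
    },
}
```

Both keys mirror the snippet above, which populates each from `self._extra_params`.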

@@ -135,7 +135,7 @@ def on_data(ctrl_msg: ControlMessage):
     if store_images:
         content_types[ContentTypeEnum.IMAGE] = store_images

-    params = task_props.get("params", {})
+    params = task_props.get("extra_params", {})
jperez999 (Contributor, Author):

Part of the fix for the params not coming all the way through from the client.
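And a minimal sketch of how the sink side could consume the forwarded values to reach a private bucket, assuming MinIO-compatible object storage; the helper name and parameter keys are illustrative, not the module's actual code:

```python
from minio import Minio


def make_storage_client(task_props: dict) -> Minio:
    # Illustrative helper, not the module's actual code: read the
    # client-supplied connection parameters forwarded under "extra_params".
    params = task_props.get("extra_params", {})
    return Minio(
        params.get("endpoint", "localhost:9000"),
        access_key=params.get("access_key"),
        secret_key=params.get("secret_key"),
        secure=params.get("secure", False),
    )
```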

jperez999 mentioned this pull request on Oct 30, 2024.

copy-pr-bot (bot) commented Nov 18, 2024:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jdye64 (Collaborator) commented Nov 18, 2024:

/ok to test

Successfully merging this pull request may close these issues:

  • [FEA]: Add bulk ingest to vdb task