
Add milvus bulk #190

Open · wants to merge 37 commits into main

Conversation

jperez999 (Contributor) commented Oct 24, 2024

Description

This PR adds bulk ingest functionality that can be used in a chained set of tasks or as a standalone task after the fact. Some changes were made to the schema because JSON dtypes are not compatible with parquet files. Together with #179, this closes #164. A large portion of the code reflected here is actually from #179, since this PR builds on and requires that implementation.
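For context, a minimal sketch (not code from this PR) of what a bulk ingest of a parquet file looks like with pymilvus's bulk-insert utility; the collection name and file path are placeholders:

```python
from pymilvus import connections, utility

# Minimal sketch, not code from this PR: bulk-insert a parquet file into an
# existing Milvus collection. Collection name and file path are placeholders.
connections.connect(host="localhost", port="19530")

task_id = utility.do_bulk_insert(
    collection_name="nv_ingest_collection",  # placeholder collection name
    files=["embeddings.parquet"],            # parquet produced by the pipeline
)

# Poll the import task until Milvus reports a terminal state.
state = utility.get_bulk_insert_state(task_id=task_id)
print(state.state_name, state.row_count)
```

Milvus's bulk-insert path reads the files from object storage reachable by the server, which is part of why the schema's parquet compatibility matters for this PR.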

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

dtype=pymilvus.DataType.JSON,
description="Content metadata",
).to_dict(),
# pymilvus.FieldSchema(
jperez999 (Contributor, Author):

Removed these fields because JSON is not compatible with parquet files. If we kept this data, the ingestion would blow up at upload time because Milvus expects JSON data but actually receives structured data from Arrow.
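To illustrate the mismatch (the column name and values below are made up, not from this PR): a dict-valued pandas column is converted to an Arrow struct rather than a JSON-encoded value, so a `DataType.JSON` field cannot ingest it from parquet directly.

```python
import json

import pandas as pd
import pyarrow as pa

# Illustration only (column name/values are made up): a dict-valued column
# is converted to an Arrow struct, not a JSON-encoded value.
df = pd.DataFrame({"content_metadata": [{"page": 1, "source": "a.pdf"}]})
table = pa.Table.from_pandas(df)
print(table.schema)  # content_metadata: struct<page: int64, source: string>

# A parquet file written from this table would feed Arrow structs to a
# Milvus DataType.JSON field, which is the mismatch described above.
# One possible workaround is serializing the column to strings first:
df["content_metadata"] = df["content_metadata"].apply(json.dumps)
```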

Collaborator:

@mpenn I'm not familiar enough with the lineage here, could you comment?

Collaborator:

The lineage for this was to provide content in an intuitive structure to support the RAG workflow behind the multimodal-pdf-data extraction blueprint. If we decide to move away from this structure, it will likely require some blueprint rework; not impossible, but it would take some effort.

jperez999 (Contributor, Author):

Still need further guidance on how we want to handle this schema change for VDB entries.

Collaborator:

Waiting on a decision from @randerzander on this; I would definitely prefer a workaround if it is feasible, as this is almost certainly a breaking change for some workflows.

drobison00 requested a review from mpenn on October 24, 2024.

drobison00 (Collaborator):

@mpenn If you have time, could you take a look at this?

drobison00 (Collaborator) left a review:

Some comments/concerns; let's discuss.

Inline review comments were left on the following files (most now resolved or outdated):

  • client/src/nv_ingest_client/cli/util/click.py
  • client/src/nv_ingest_client/primitives/tasks/__init__.py
  • src/nv_ingest/modules/sinks/vdb_task_sink.py
  • src/nv_ingest/modules/storages/embedding_storage.py

task_properties = {
"method": self._store_method,
"structured": self._structured,
"images": self._images,
"params": self._extra_params,
"extra_params": self._extra_params,
jperez999 (Contributor, Author):

This fixes the params issue where you can't connect to the private bucket.
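For illustration, a hedged sketch of what the serialized store-task properties might carry once `extra_params` is forwarded; the key names and values below are hypothetical, not taken from this PR:

```python
# Hypothetical values only; the keys inside "params"/"extra_params" are
# assumptions for illustration, not values defined by this PR.
task_properties = {
    "method": "minio",  # assumed store method
    "structured": True,
    "images": True,
    "params": {
        "endpoint": "minio:9000",
        "access_key": "my-access-key",
        "secret_key": "my-secret-key",
    },
    "extra_params": {
        "endpoint": "minio:9000",
        "access_key": "my-access-key",
        "secret_key": "my-secret-key",
    },
}
```

Both keys mirror the snippet above, which populates each from `self._extra_params`.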

@@ -135,7 +135,7 @@ def on_data(ctrl_msg: ControlMessage):
     if store_images:
         content_types[ContentTypeEnum.IMAGE] = store_images

-    params = task_props.get("params", {})
+    params = task_props.get("extra_params", {})
jperez999 (Contributor, Author):

Part of the fix for the params not coming all the way through from the client.
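And a minimal sketch of how the sink side could consume the forwarded values to reach a private bucket, assuming MinIO-compatible object storage; the helper name and parameter keys are illustrative, not the module's actual code:

```python
from minio import Minio


def make_storage_client(task_props: dict) -> Minio:
    # Illustrative helper, not the module's actual code: read the
    # client-supplied connection parameters forwarded under "extra_params".
    params = task_props.get("extra_params", {})
    return Minio(
        params.get("endpoint", "localhost:9000"),
        access_key=params.get("access_key"),
        secret_key=params.get("secret_key"),
        secure=params.get("secure", False),
    )
```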

jperez999 mentioned this pull request on Oct 30, 2024.

copy-pr-bot (bot) commented Nov 18, 2024:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jdye64 (Collaborator) commented Nov 18, 2024:

/ok to test

Successfully merging this pull request may close these issues:

  • [FEA]: Add bulk ingest to vdb task