Add doughnut http endpoint #230

edknv · 2024-11-14T18:21:31Z

Description

Part of https://github.com/NVIDIA/nv-ingest-private/issues/52
Also closes https://github.com/NVIDIA/nv-ingest-private/issues/49.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

drobison00 · 2024-11-19T22:39:01Z

src/nv_ingest/stages/nim/chart_extraction.py

@@ -62,7 +64,8 @@ def _update_metadata(row: pd.Series, cached_client: Any, deplot_client: Any, tra
    # Only modify if content type is structured and subtype is 'chart' and chart_metadata exists
    if ((content_metadata.get("type") != "structured") or
            (content_metadata.get("subtype") != "chart") or
-            (chart_metadata is None)):
+            (chart_metadata is None) or
+            (chart_metadata.get("table_format") != TableFormatEnum.IMAGE)):


I don't think this is correct. Even if the table_format is image, we can still extract the content in the chart extractor. Am I thinking about this wrong?

I added this so that tables extracted by the Doughnut model skip the table/chart extraction stages. Doughnut tables already carry their text (as LaTeX), so they don't need those stages. YOLOX tables do need text extraction there, and they are tagged as IMAGE tables, so they still get processed. In short, I needed a way to skip table/chart extraction for tables identified by Doughnut, and TableFormat seemed like a useful way to distinguish YOLOX (== TableFormatEnum.IMAGE) from Doughnut (== TableFormatEnum.LATEX).
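To make the gating described above concrete, here is a minimal sketch of the check as a standalone predicate. The TableFormatEnum class below is a hypothetical stand-in that models only the two members mentioned in this thread, and should_extract is an illustrative helper name, not a function in the PR.

```python
from enum import Enum
from typing import Any, Dict, Optional


class TableFormatEnum(str, Enum):
    # Hypothetical stand-in for the project's enum; only the two members
    # discussed in this thread are modeled here.
    IMAGE = "image"
    LATEX = "latex"


def should_extract(
    content_metadata: Dict[str, Any],
    table_metadata: Optional[Dict[str, Any]],
    subtype: str,
) -> bool:
    """Return True only for structured items of `subtype` tagged as IMAGE.

    Items tagged LATEX (the Doughnut case in this thread) already carry
    text, so they are skipped.
    """
    if content_metadata.get("type") != "structured":
        return False
    if content_metadata.get("subtype") != subtype:
        return False
    if table_metadata is None:
        return False
    return table_metadata.get("table_format") == TableFormatEnum.IMAGE
```

With this shape, a YOLOX table (IMAGE) passes the check and a Doughnut table (LATEX) is skipped, which matches the intent of the added condition in the diff.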

drobison00 · 2024-11-19T22:39:20Z

src/nv_ingest/stages/nim/table_extraction.py

@@ -63,7 +67,8 @@ def _update_metadata(row: pd.Series, paddle_client: Any, paddle_version: Any, tr
    # Only modify if content type is structured and subtype is 'table' and table_metadata exists
    if ((content_metadata.get("type") != "structured") or
            (content_metadata.get("subtype") != "table") or
-            (table_metadata is None)):
+            (table_metadata is None) or
+            (table_metadata.get("table_format") != TableFormatEnum.IMAGE)):


Same question here.

drobison00 · 2024-11-19T22:43:00Z

src/nv_ingest/util/nim/helpers.py


+def _call_image_inference_http_client(client, model_name: str, image_data: np.ndarray):


I think we're getting a bit too specific in terms of data handling here. We probably want to standardize on a data format for each model and process it into that before we get to this function.

Small nitpick: please pick one naming scheme, either base64_img(s) or base64_image(s), rather than mixing base64_img and base64_images.

I cleaned it up a bit in c5b5d3e by grouping doughnut together with deplot and removing batching. Like deplot, doughnut does not currently support request-level batching, and the two share the same output format (for now). Note: the input/output format is not finalized; it will likely change to be closer to the yolox output format, so more PRs will follow.
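As a rough illustration of what "no request-level batching" implies for the caller, the sketch below sends one request per image. The payload shape ({"image": ...}), the send callable, and the function name are all assumptions made for illustration; the actual request format in the PR is not shown in this thread and, as noted above, is not finalized.

```python
import base64
from typing import Any, Callable, Dict, List


def infer_one_at_a_time(
    images: List[bytes],
    send: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Issue one request per image instead of a single batched request.

    `send` is a hypothetical callable that posts a payload to the model
    endpoint and returns the decoded response.
    """
    results = []
    for raw in images:
        # Encode each image separately; the endpoint is assumed to accept
        # a single base64 string per request.
        base64_img = base64.b64encode(raw).decode("utf-8")
        payload = {"image": base64_img}  # hypothetical payload shape
        results.append(send(payload))
    return results
```

Passing the transport in as a callable keeps the sketch independent of any particular HTTP client and makes it easy to exercise with a fake in unit tests.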

edknv · 2024-11-22T19:53:24Z

321052f adds support for preserving the text bounding boxes in the metadata (in the hierarchy field) and addresses https://github.com/NVIDIA/nv-ingest-private/issues/49. It also changes how text blocks are concatenated: they are now joined with \n\n, as requested by the research team.
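The concatenation change described above can be sketched as follows. The block shape (a dict with "text" and "bbox" keys) and the function name are assumptions for illustration only; the actual structure stored in the hierarchy field is not shown in this thread.

```python
from typing import Any, Dict, List, Tuple


def join_text_blocks(
    blocks: List[Dict[str, Any]],
) -> Tuple[str, List[Tuple[float, ...]]]:
    """Join text blocks with a blank line and keep their bounding boxes.

    Each block is assumed to look like {"text": str, "bbox": [x0, y0, x1, y1]}.
    The boxes are returned in the same order as the text blocks so they can
    be stored alongside the joined text.
    """
    texts = [block["text"] for block in blocks]
    bboxes = [tuple(block["bbox"]) for block in blocks]
    return "\n\n".join(texts), bboxes
```

Keeping the boxes in block order means the i-th box still corresponds to the i-th paragraph of the joined text after splitting on the blank-line separator.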

edknv and others added 9 commits November 14, 2024 10:17

Add doughnut http endpoint

3fbcf68

fix table and chart extraction

e5a281d

handle 202 reponses by repolling status

d933a33

add table format in unit tests

af9ca0e

Merge branch 'main' into feat/doughnut-http-endpoint-1

b5c992d

fix table and image max dimensions

6ad99c0

Merge branch 'feat/doughnut-http-endpoint-1' of github.com:edknv/nv-ingest into feat/doughnut-http-endpoint-1

e511dbf

add unit tests for the helper

b3c632b

add placeholder for url in docker compose

0deb5c9

edknv marked this pull request as ready for review November 19, 2024 19:01

add check for empty dataframe in table/chart extraction

2e7db6d

edknv requested review from drobison00 and jdye64 November 19, 2024 21:06

drobison00 requested changes Nov 19, 2024

edknv added 3 commits November 19, 2024 16:33

clean up doughnut specific confiditions in inference func

c5b5d3e

clean up doughnut specific confiditions in inference func

bcb0037

fix unit tests

187d13d

edknv requested a review from drobison00 November 20, 2024 03:48

edknv added 2 commits November 22, 2024 10:55

Merge branch 'main' into feat/doughnut-http-endpoint-1

51b5bca

Add support for text bounding boxes

321052f

also add table and image bounding boxes to metadata

a9a74b7
