
Support Multimodal inference with multiple images and PDF-s in NATIVE engine #1424

Merged
xrdaukar merged 20 commits into main from xrdaukar/mi-predict on Feb 12, 2025

Conversation

@xrdaukar (Collaborator) commented Feb 12, 2025

Description

-- Update the NATIVE inference engine to support prediction with multiple images.
-- Add utils to load PDFs as N images (one image per page).
-- Add a new optional dependency target [file_formats] to install the libraries required for PDF processing.
-- Update CLI infer to support PDF inputs.
-- Add a sample 4-page PDF to testdata.
-- Tested with Llama Vision only for now (other model types and other engines will be handled separately).
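The PDF-loading utility described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual implementation: the function names are hypothetical, and the use of the `pdf2image` library (as a stand-in for whatever the optional [file_formats] target actually installs) is an assumption.

```python
from io import BytesIO


def is_pdf(data: bytes) -> bool:
    """Cheap content sniff: PDF files start with the `%PDF-` magic bytes."""
    return data.startswith(b"%PDF-")


def load_pdf_as_image_bytes_list(pdf_bytes: bytes, dpi: int = 200) -> list[bytes]:
    """Render each page of a PDF to one PNG image (N pages -> N images)."""
    try:
        # Hypothetical choice of backend; the real optional [file_formats]
        # extra may install a different PDF rasterizer.
        from pdf2image import convert_from_bytes
    except ImportError as e:
        raise ImportError(
            "PDF support requires optional dependencies; "
            "install them with: pip install 'oumi[file_formats]'"
        ) from e

    pages = convert_from_bytes(pdf_bytes, dpi=dpi)  # list of PIL images
    result: list[bytes] = []
    for page in pages:
        buf = BytesIO()
        page.save(buf, format="PNG")
        result.append(buf.getvalue())
    return result
```

Guarding the third-party import inside the function keeps the base install working for users who never pass PDFs, which is one plausible reason for a separate [file_formats] target.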

Related issues

Towards OPE-994, OPE-355

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guide's Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

@xrdaukar xrdaukar requested a review from optas February 12, 2025 00:58
@xrdaukar xrdaukar requested review from taenin and oelachqar February 12, 2025 01:29
@xrdaukar xrdaukar marked this pull request as ready for review February 12, 2025 01:37
@xrdaukar xrdaukar requested a review from wizeng23 February 12, 2025 01:38
@xrdaukar xrdaukar changed the title Support Multimodal prediction with multiple images in native engine Support Multimodal inference with multiple images and PDF-s in NATIVE engine Feb 12, 2025
@@ -43,7 +43,7 @@ def _get_engine(config: InferenceConfig) -> BaseInferenceEngine:
 def infer_interactive(
     config: InferenceConfig,
     *,
-    input_image_bytes: Optional[bytes] = None,
+    input_image_bytes: Optional[list[bytes]] = None,
Collaborator:
More of a meta comment about this PR: This looks like a breaking change for VLM functionality. Should we consider a larger version bump for our next pypi package?

Collaborator (Author):
We may be lifting one arbitrary, undocumented constraint. The previous code might be broken or non-functional if multi-image inputs are passed to it (most datasets are single-image), so it's probably OK to do a regular version increment.
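As a hedged sketch of how the signature change could be softened for existing callers (this is not what the PR does; the helper name is hypothetical), a small normalizer could accept either the old single-`bytes` form or the new list form:

```python
from typing import Optional, Union


def normalize_image_bytes(
    value: Optional[Union[bytes, list[bytes]]],
) -> Optional[list[bytes]]:
    """Accept the pre-change single-`bytes` value or the new list form,
    always returning a list (or None), so older call sites keep working."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # Old-style single image: wrap it in a one-element list.
        return [value]
    return list(value)
```

A function like `infer_interactive` could call such a normalizer on `input_image_bytes` internally, making the API change additive rather than breaking; whether that compatibility is worth carrying is exactly the version-bump trade-off discussed above.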

src/oumi/utils/image_utils.py (review comment resolved; outdated)
@xrdaukar xrdaukar merged commit 4d3278a into main Feb 12, 2025
2 checks passed
@xrdaukar xrdaukar deleted the xrdaukar/mi-predict branch February 12, 2025 04:52