
Support Multimodal inference with multiple images and PDF-s in NATIVE engine #1424

Merged
xrdaukar merged 20 commits into main from xrdaukar/mi-predict on Feb 12, 2025

Conversation

@xrdaukar (Collaborator) commented Feb 12, 2025

Description

-- Update the NATIVE inference engine to support prediction with multiple images.
-- Add utils to load PDFs as N images (one image per page).
-- Add a new optional dependency target [file_formats] to install the libraries required for PDF processing.
-- Update CLI infer to support PDF inputs.
-- Add a sample 4-page PDF to testdata.
-- Tested with Llama Vision only for now (other model types and other engines will be handled separately).
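The PDF-loading utility described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual implementation: the function names are hypothetical, and the use of the `pdf2image` library (as a stand-in for whatever the optional [file_formats] target actually installs) is an assumption.

```python
from io import BytesIO


def is_pdf(data: bytes) -> bool:
    """Cheap content sniff: PDF files start with the `%PDF-` magic bytes."""
    return data.startswith(b"%PDF-")


def load_pdf_as_image_bytes_list(pdf_bytes: bytes, dpi: int = 200) -> list[bytes]:
    """Render each page of a PDF to one PNG image (N pages -> N images)."""
    try:
        # Hypothetical choice of backend; the real optional [file_formats]
        # extra may install a different PDF rasterizer.
        from pdf2image import convert_from_bytes
    except ImportError as e:
        raise ImportError(
            "PDF support requires optional dependencies; "
            "install them with: pip install 'oumi[file_formats]'"
        ) from e

    pages = convert_from_bytes(pdf_bytes, dpi=dpi)  # list of PIL images
    result: list[bytes] = []
    for page in pages:
        buf = BytesIO()
        page.save(buf, format="PNG")
        result.append(buf.getvalue())
    return result
```

Guarding the third-party import inside the function keeps the base install working for users who never pass PDFs, which is one plausible reason for a separate [file_formats] target.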

Related issues

Towards OPE-994, OPE-355

Before submitting

  • This PR only changes documentation. (You can ignore the following checks in that case)
  • Did you read the contributor guide's Pull Request guidelines?
  • Did you link the issue(s) related to this PR in the section above?
  • Did you add / update tests where needed?

Reviewers

At least one review from a member of oumi-ai/oumi-staff is required.

@xrdaukar xrdaukar requested a review from optas February 12, 2025 00:58
@xrdaukar xrdaukar requested review from taenin and oelachqar February 12, 2025 01:29
@xrdaukar xrdaukar marked this pull request as ready for review February 12, 2025 01:37
@xrdaukar xrdaukar requested a review from wizeng23 February 12, 2025 01:38
@xrdaukar xrdaukar changed the title Support Multimodal prediction with multiple images in native engine Support Multimodal inference with multiple images and PDF-s in NATIVE engine Feb 12, 2025
@@ -43,7 +43,7 @@ def _get_engine(config: InferenceConfig) -> BaseInferenceEngine:
 def infer_interactive(
     config: InferenceConfig,
     *,
-    input_image_bytes: Optional[bytes] = None,
+    input_image_bytes: Optional[list[bytes]] = None,
Collaborator:
More of a meta comment about this PR: This looks like a breaking change for VLM functionality. Should we consider a larger version bump for our next pypi package?

Collaborator (Author):
We may be lifting one arbitrary, undocumented constraint. The previous code might be broken or non-functional if multi-image inputs are passed to it (most datasets are single-image), so it's probably OK to do a regular version increment.
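As a hedged sketch of how the signature change could be softened for existing callers (this is not what the PR does; the helper name is hypothetical), a small normalizer could accept either the old single-`bytes` form or the new list form:

```python
from typing import Optional, Union


def normalize_image_bytes(
    value: Optional[Union[bytes, list[bytes]]],
) -> Optional[list[bytes]]:
    """Accept the pre-change single-`bytes` value or the new list form,
    always returning a list (or None), so older call sites keep working."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # Old-style single image: wrap it in a one-element list.
        return [value]
    return list(value)
```

A function like `infer_interactive` could call such a normalizer on `input_image_bytes` internally, making the API change additive rather than breaking; whether that compatibility is worth carrying is exactly the version-bump trade-off discussed above.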

src/oumi/utils/image_utils.py (review comment resolved; outdated)
@xrdaukar xrdaukar merged commit 4d3278a into main Feb 12, 2025
2 checks passed
@xrdaukar xrdaukar deleted the xrdaukar/mi-predict branch February 12, 2025 04:52