Implement first version of parsing evaluation #3544

Open
chloedia opened this issue Jan 2, 2025 — with Linear · 1 comment

chloedia commented Jan 2, 2025

For a list of potential datasets for parsing see CORE-335.

For details on OmniDocBench, see CORE-332 or Notion.

Evaluation steps for CI/CD

We use only a single EN subset, from which masked documents have been excluded. Each subset contains 57 documents of 1 page each. We will run the evaluation on both native and image pdfs.

  1. Load dataset --> CORE-355
    1. For each row in the subset, retrieve the native (original) pdf from the url https://huggingface.co/datasets/Quivr/OmniDocBench/blob/main/ori_pdfs/file_name where file_name is extracted from page_info.image_path
    2. For each row in the subset, retrieve the image pdf from the url https://huggingface.co/datasets/Quivr/OmniDocBench/blob/main/pdfs/file_name
  2. Run Megaparse on each document, on both the native-pdf and image-pdf versions, and store the results in a JSON file --> CORE-342
  3. Compute the parsing metrics --> CORE-331
  4. Compute the OCR metrics --> CORE-333
  5. Push the different results (Megaparse output of step 2, outputs of steps 3 and 4) as JSON files to the experiment tracker, along with --> CORE-343
  6. Alert if metrics are below a given threshold --> CORE-344
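
As a rough illustration of steps 1 and 6 above, here is a minimal Python sketch. It is not the Megaparse implementation. The helper names `pdf_urls` and `metrics_below_threshold` are hypothetical, the file name is assumed to be the last path component of `page_info.image_path` (the issue does not say how it is extracted, or whether the extension changes), and a metric is assumed to be "bad" when it falls strictly below its threshold. Note that Hugging Face `blob` URLs normally return an HTML page, so a direct download usually needs `resolve` in place of `blob`; the sketch exposes that as an option.

```python
from pathlib import PurePosixPath

# Base URLs exactly as written in the issue.
NATIVE_BASE = "https://huggingface.co/datasets/Quivr/OmniDocBench/blob/main/ori_pdfs"
IMAGE_BASE = "https://huggingface.co/datasets/Quivr/OmniDocBench/blob/main/pdfs"


def pdf_urls(row: dict, direct_download: bool = False) -> dict:
    """Build the native-pdf and image-pdf URLs for one dataset row (hypothetical helper)."""
    # Assumption: file_name is the last component of page_info.image_path, unchanged.
    file_name = PurePosixPath(row["page_info"]["image_path"]).name
    urls = {
        "native": f"{NATIVE_BASE}/{file_name}",
        "image": f"{IMAGE_BASE}/{file_name}",
    }
    if direct_download:
        # "resolve" serves the raw file, whereas "blob" serves the web page.
        urls = {kind: url.replace("/blob/", "/resolve/", 1) for kind, url in urls.items()}
    return urls


def metrics_below_threshold(metrics: dict, thresholds: dict) -> dict:
    """Return the metrics whose value is strictly below its threshold (hypothetical helper)."""
    # Assumption: higher is better for every metric that has a threshold.
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value < thresholds[name]
    }
```

A CI job could call `metrics_below_threshold` on the computed metrics and fail or notify when the returned dict is non-empty. Metrics where lower is better (for example, an edit distance) would need the comparison inverted.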

Evaluation steps for optimising Megaparse

We should also be able to manually run these evaluations using the full dataset, i.e. subsets 1 to 5, for the purpose of optimizing / improving our parsing service.
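
One way to share a single entry point between the CI run and the manual run is to select the subsets by mode. The sketch below is only an assumption about how that could look: `select_subsets` and the mode names `"ci"` and `"full"` are hypothetical, and because the issue does not say which subset is the EN one used in CI, the caller must pass it explicitly.

```python
from typing import List, Optional

# "subsets 1 to 5", as stated in the issue.
FULL_SUBSETS = [1, 2, 3, 4, 5]


def select_subsets(mode: str, ci_subset: Optional[int] = None) -> List[int]:
    """Pick the subsets to evaluate for a given run mode (hypothetical helper)."""
    if mode == "full":
        # Manual optimisation runs use the whole dataset.
        return list(FULL_SUBSETS)
    if mode == "ci":
        # CI/CD runs use a single subset, which the caller must name.
        if ci_subset not in FULL_SUBSETS:
            raise ValueError("ci mode requires ci_subset to be one of 1 to 5")
        return [ci_subset]
    raise ValueError(f"unknown mode: {mode!r}")
```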

linear bot commented Jan 2, 2025

jacopo-chevallard changed the title from "Implement a first draft of metrics" to "Implement a first draft of parsing metrics" on Jan 22, 2025
jacopo-chevallard changed the title from "Implement a first draft of parsing metrics" to "Implement first version of parsing metrics" on Jan 22, 2025
jacopo-chevallard changed the title from "Implement first version of parsing metrics" to "Implement first version of parsing evaluation" on Jan 28, 2025