Evaluation steps for CI/CD
We use only a single EN subset, from which masked documents have been excluded. Each subset contains 57 single-page documents. We will run the evaluation on both native and image PDFs.
For each row in the subset, retrieve the native (original) PDF from the URL https://huggingface.co/datasets/Quivr/OmniDocBench/blob/main/ori_pdfs/file_name, where file_name is extracted from page_info.image_path.
For each row in the subset, retrieve the image PDF from the URL https://huggingface.co/datasets/Quivr/OmniDocBench/blob/main/pdfs/file_name, using the same file_name.
Run Megaparse on each document, in both its native-PDF and image-PDF versions, and store the results in a JSON file --> CORE-342
Push the results (the Megaparse output of step 5 and the outputs of steps 6 and 7) as JSON files to the experiment tracker, along with --> CORE-343
Alert if metrics are below a given threshold --> CORE-344
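The retrieval, parsing, and alerting steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the helper names pdf_urls, run_parser, and failing_metrics are hypothetical, parse_fn is a stand-in for whatever call invokes Megaparse, and the metric names are invented for the example. The sketch also assumes that file_name is the base name of page_info.image_path with its extension replaced by .pdf, which the issue does not spell out. Note that Hugging Face blob URLs point to an HTML page, so downloading the raw file usually means replacing blob with resolve in the URL.

```python
import json
from pathlib import Path, PurePosixPath
from typing import Callable, Iterable

BASE = "https://huggingface.co/datasets/Quivr/OmniDocBench/blob/main"


def pdf_urls(image_path: str) -> dict[str, str]:
    """Build the native and image PDF URLs for one dataset row.

    Assumption: file_name is the base name of page_info.image_path
    with its extension replaced by .pdf.
    """
    file_name = PurePosixPath(image_path).with_suffix(".pdf").name
    return {
        "native": f"{BASE}/ori_pdfs/{file_name}",
        "image": f"{BASE}/pdfs/{file_name}",
    }


def run_parser(
    pdf_paths: Iterable[str],
    parse_fn: Callable[[str], str],
    out_file: str,
) -> dict[str, str]:
    """Apply parse_fn (hypothetical stand-in for Megaparse) to each PDF
    and store the outputs, keyed by path, in a JSON file."""
    results = {str(path): parse_fn(str(path)) for path in pdf_paths}
    Path(out_file).write_text(json.dumps(results, indent=2), encoding="utf-8")
    return results


def failing_metrics(
    metrics: dict[str, float],
    thresholds: dict[str, float],
) -> dict[str, float]:
    """Return the metrics whose value is below their threshold.

    Metrics without a configured threshold are ignored.
    """
    return {
        name: value
        for name, value in metrics.items()
        if name in thresholds and value < thresholds[name]
    }
```

A CI job could then fail or notify whenever failing_metrics returns a non-empty dictionary, for example failing_metrics({"text_similarity": 0.71}, {"text_similarity": 0.80}), where text_similarity is a made-up metric name.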
Evaluation steps for optimising Megaparse
We should also be able to run these evaluations manually on the full dataset, i.e. subsets 1 to 5, in order to optimise and improve our parsing service.
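One way to support both modes is a single entry point that defaults to the CI subset and accepts a flag for the full dataset. The sketch below is only an illustration: the --subsets and --full options are hypothetical, and the choice of subset 1 as the CI default is an assumption, since the issue does not say which EN subset CI uses.

```python
import argparse

ALL_SUBSETS = [1, 2, 3, 4, 5]  # the full dataset: subsets 1 to 5
CI_SUBSETS = [1]  # assumption: the issue does not name the CI subset


def parse_args(argv=None):
    """Parse command-line options for an evaluation run."""
    parser = argparse.ArgumentParser(description="Run the parsing evaluation")
    parser.add_argument(
        "--subsets",
        type=int,
        nargs="+",
        choices=ALL_SUBSETS,
        default=CI_SUBSETS,
        help="subsets to evaluate (default: the single CI subset)",
    )
    parser.add_argument(
        "--full",
        action="store_true",
        help="evaluate all subsets, for manual optimisation runs",
    )
    args = parser.parse_args(argv)
    if args.full:
        # --full overrides any explicit --subsets selection.
        args.subsets = list(ALL_SUBSETS)
    return args
```

Calling parse_args(["--full"]) selects all five subsets, while parse_args([]) keeps the smaller CI default.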
For a list of potential datasets for parsing see CORE-335.
For details on OmniDocBench, see CORE-332 or Notion.