Add PDF Loader to Document Loaders
This PR implements a PDF loader as part of the document loaders module in Rig. It allows users to easily load and process PDF documents for use in RAG systems and other document processing tasks.
Changes
- Added the PdfLoader struct in src/document_loaders/pdf.rs
- Added PdfLoader to the document_loaders module
- Implemented the DocumentLoader trait for PdfLoader
- Added the lopdf crate for PDF parsing
- Updated Cargo.toml with the lopdf dependency
- Added unit tests for PdfLoader
- Added PdfLoader usage examples
Implementation Details
The PdfLoader uses the lopdf crate to parse PDF files and extract text content. It handles potential errors such as a missing file or a parsing failure. The extracted text is converted into DocumentEmbeddings for further processing in Rig.
Testing
Unit tests have been added to ensure that PdfLoader correctly loads PDF files and handles various edge cases. The tests cover:
Documentation
Code files are commented, and some docstrings have been added.
Related Issue
Closes #24
Checklist
Additional Notes
This implementation focuses on text extraction from PDFs. Future enhancements could include handling PDFs with complex layouts or embedded images.
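For reviewers, the error handling described under Implementation Details (file not found versus parsing errors) can be illustrated with a minimal sketch. This is not the code in this PR: it uses only the standard library, the names PdfLoadError, check_pdf_header, and read_pdf_bytes are hypothetical, and a simple check for the %PDF- header stands in for the parsing that the PR delegates to lopdf.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical error type that keeps I/O failures and parse failures apart.
#[derive(Debug)]
pub enum PdfLoadError {
    /// The file could not be read (for example, it does not exist).
    Io(io::Error),
    /// The bytes were read but do not look like a PDF.
    Parse(String),
}

impl fmt::Display for PdfLoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfLoadError::Io(e) => write!(f, "I/O error: {}", e),
            PdfLoadError::Parse(msg) => write!(f, "parse error: {}", msg),
        }
    }
}

impl std::error::Error for PdfLoadError {}

impl From<io::Error> for PdfLoadError {
    fn from(e: io::Error) -> Self {
        PdfLoadError::Io(e)
    }
}

/// Stand-in for the parsing step: only checks that the data starts with
/// the `%PDF-` header. Real parsing and text extraction are out of scope here.
pub fn check_pdf_header(bytes: &[u8]) -> Result<(), PdfLoadError> {
    if bytes.starts_with(b"%PDF-") {
        Ok(())
    } else {
        Err(PdfLoadError::Parse("missing %PDF- header".to_string()))
    }
}

/// Reads a file and validates it, mapping each failure to its own variant.
pub fn read_pdf_bytes(path: &Path) -> Result<Vec<u8>, PdfLoadError> {
    // `?` converts io::Error into PdfLoadError::Io through the From impl.
    let bytes = fs::read(path)?;
    check_pdf_header(&bytes)?;
    Ok(bytes)
}
```

Keeping the two failure modes in separate variants lets callers react differently, for example by skipping a missing file but reporting a corrupt one. A caller might write `match read_pdf_bytes(Path::new("report.pdf")) { ... }` and branch on `PdfLoadError::Io` or `PdfLoadError::Parse`.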