Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(loaders): document loader pdf #26

Closed
wants to merge 2 commits into from

Conversation

Tachikoma000
Copy link
Contributor

Add PDF Loader to Document Loaders

This PR implements a PDF loader as part of the document loaders module in Rig. It allows users to easily load and process PDF documents for use in RAG systems and other document processing tasks.

Changes

  • Implemented PdfLoader struct in src/document_loaders/pdf.rs
  • Added PdfLoader to the document_loaders module
  • Implemented DocumentLoader trait for PdfLoader
  • Used the lopdf crate for PDF parsing
  • Added error handling for file operations and PDF parsing
  • Updated Cargo.toml with the lopdf dependency
  • Added unit tests for PdfLoader
  • Updated documentation with PdfLoader usage examples

Implementation Details

The PdfLoader uses the lopdf crate to parse PDF files and extract text content. It handles potential errors such as file not found or parsing errors. The extracted text is converted into DocumentEmbeddings for further processing in Rig.

Testing

Unit tests have been added to ensure the PdfLoader correctly loads PDF files and handles various edge cases. The tests cover:

  • Loading a valid PDF file
  • Handling a non-existent file
  • Processing a PDF with multiple pages
  • Dealing with empty PDF files

Documentation

Code files are commented and some docstrings added

Related Issue

Closes #24

Checklist

  • Code follows the project's coding style
  • Tests have been added and all tests pass
  • Documentation has been updated
  • Commit messages are clear and descriptive
  • Changes have been reviewed for potential performance impacts

Additional Notes

This implementation focuses on text extraction from PDFs. Future enhancements could include handling PDFs with complex layouts or embedded images.

- Implement PdfLoader struct in src/document_loaders/pdf.rs
- Add PdfLoader to document_loaders module
- Implement DocumentLoader trait for PdfLoader
- Use lopdf crate for PDF parsing
- Add error handling for file operations and PDF parsing
- Update Cargo.toml with lopdf dependency
- Add unit tests for PdfLoader
- Update documentation with PdfLoader usage examples
- Comments added to pdf.rs, mod.rs, document_loaders.rs
@0xMochan
Copy link
Contributor

0xMochan commented Sep 19, 2024

I haven't done a formal review but I noticed a couple of things when I looked over this!

  • This PR adds a lot of dependencies. We probably want to throw this as a cargo feature on the main rig library so that not all users download these deps if they install the library.
    • Also, tokio likely shouldn't be one of the dependencies since the library can work with multiple async backends.
    • Perhaps another route would be for a new crate rig-pdf (or rig-loaders) instead of keeping this in the main library would be more suitable.
  • It seems like there is also a CSV loader but I don't really see that documented in the PR.
  • There's some code in the mod.rs file which I think we are trying to avoid. It could be spun out into a document.rs file.
  • There shouldn't be any println code in our library to avoid erroneous output when people use our library. It should be using the tracing module so that logging can be turned off and on at will.
  • I think some of the Error typing could be simplified using thiserror types for the return values.

These are just some first thought comments, I think stopher will do a main review of this. Looking forward to trying this out locally!

@Tachikoma000 Tachikoma000 changed the title Feat/document loader pdf feat(loaders): document loader pdf Sep 19, 2024
@cvauclair
Copy link
Contributor

This is being (re-)implemented in #55

@cvauclair cvauclair closed this Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Add PDF Loader to Document Loaders in Rig
3 participants