feat: Add PDF loader to document loaders #25

Closed
wants to merge 1 commit into from

  • Implement PdfLoader struct in src/document_loaders/pdf.rs
  • Add PdfLoader to document_loaders module
  • Implement DocumentLoader trait for PdfLoader
  • Use lopdf crate for PDF parsing
  • Add error handling for file operations and PDF parsing
  • Update Cargo.toml with lopdf dependency
  • Add unit tests for PdfLoader
  • Update documentation with PdfLoader usage examples


Tachikoma000 commented Sep 18, 2024

Add PDF Loader to Document Loaders

This PR implements a PDF loader as part of the document loaders module in Rig. It allows users to easily load and process PDF documents for use in RAG systems and other NLP tasks.

Changes

  • Implemented PdfLoader struct in src/document_loaders/pdf.rs
  • Added PdfLoader to the document_loaders module
  • Implemented DocumentLoader trait for PdfLoader
  • Used the lopdf crate for PDF parsing
  • Added error handling for file operations and PDF parsing
  • Updated Cargo.toml with the lopdf dependency
  • Added unit tests for PdfLoader
  • Updated documentation with PdfLoader usage examples

Implementation Details

The PdfLoader uses the lopdf crate to parse PDF files and extract their text content. It handles failures such as a missing file or a PDF that cannot be parsed. The extracted text is converted into DocumentEmbeddings for further processing in Rig.
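
The error-handling shape described above could look roughly like the sketch below. It is not the code from this PR: the `PdfLoadError` enum, the `DocumentLoader` trait signature, and the `Output` associated type are all hypothetical, and the lopdf parsing step is replaced by a simple check for the `%PDF-` header so that the example depends only on the standard library. It only illustrates how a missing file and an unparseable file can be reported as separate error variants.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical error type: keeps file-system failures apart from parse failures.
#[derive(Debug)]
pub enum PdfLoadError {
    /// The file could not be read (for example, it does not exist).
    Io(io::Error),
    /// The file was read but does not look like a PDF.
    Parse(String),
}

impl fmt::Display for PdfLoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfLoadError::Io(e) => write!(f, "I/O error: {e}"),
            PdfLoadError::Parse(msg) => write!(f, "PDF parse error: {msg}"),
        }
    }
}

impl std::error::Error for PdfLoadError {}

// Lets `?` convert io::Error into PdfLoadError::Io automatically.
impl From<io::Error> for PdfLoadError {
    fn from(e: io::Error) -> Self {
        PdfLoadError::Io(e)
    }
}

/// Hypothetical stand-in for the DocumentLoader trait named in the PR;
/// the real trait's signature may differ.
pub trait DocumentLoader {
    type Output;
    fn load(&self) -> Result<Self::Output, PdfLoadError>;
}

pub struct PdfLoader {
    path: PathBuf,
}

impl PdfLoader {
    pub fn new(path: impl AsRef<Path>) -> Self {
        Self { path: path.as_ref().to_path_buf() }
    }
}

impl DocumentLoader for PdfLoader {
    // A real loader would return extracted text; raw bytes keep the sketch small.
    type Output = Vec<u8>;

    fn load(&self) -> Result<Vec<u8>, PdfLoadError> {
        // A missing or unreadable file surfaces as PdfLoadError::Io.
        let bytes = fs::read(&self.path)?;
        // Placeholder for the real parsing step: only the header is checked here.
        if !bytes.starts_with(b"%PDF-") {
            return Err(PdfLoadError::Parse("missing %PDF- header".to_string()));
        }
        Ok(bytes)
    }
}
```

With this shape, a caller can match on `PdfLoadError::Io` versus `PdfLoadError::Parse` to decide whether to retry, skip the file, or report it. An empty file falls into the `Parse` case because it has no header.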

Testing

Unit tests have been added to ensure the PdfLoader correctly loads PDF files and handles various edge cases. The tests cover:

  • Loading a valid PDF file
  • Handling a non-existent file
  • Processing a PDF with multiple pages
  • Dealing with empty PDF files

Documentation

The main documentation has been updated to include usage examples for the PdfLoader. This includes how to initialize the loader and integrate it with the EmbeddingsBuilder.

Related Issue

@Tachikoma000 Tachikoma000 deleted the PDF_Loader branch September 18, 2024 22:35