Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add ramalama rag command #501

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

rhatdan
Copy link
Member

@rhatdan rhatdan commented Dec 3, 2024

Summary by Sourcery

Add a new 'rag' command to the RamaLama CLI for generating RAG data from documents and converting it into an OCI Image, along with corresponding documentation updates.

New Features:

  • Introduce a new 'rag' command in the RamaLama CLI to generate Retrieval Augmented Generation (RAG) data from various document formats and convert it into an OCI Image.

Documentation:

  • Add documentation for the new 'rag' command, detailing its usage, options, and description.

Copy link
Contributor

sourcery-ai bot commented Dec 3, 2024

Reviewer's Guide by Sourcery

This PR adds a new 'rag' command to the RamaLama CLI that enables generating Retrieval Augmented Generation (RAG) data from various document formats and converting them into an OCI Image. The implementation includes a new command parser, core RAG functionality for document processing, and corresponding documentation.

Sequence diagram for the RAG command execution

sequenceDiagram
    actor User
    participant CLI as RamaLama CLI
    participant rag as rag
    participant Converter as DocumentConverter
    participant Result as ConversionResult

    User->>CLI: Execute 'ramalama rag PATH IMAGE'
    CLI->>rag: rag_cli(args)
    rag->>Converter: convert_all(targets)
    Converter->>Result: ConversionResult
    rag->>Result: export_documents(conv_results, output_dir)
    Result-->>rag: Success/Failure counts
    rag-->>CLI: Processed results
    CLI-->>User: Display results
Loading

Class diagram for the new RAG functionality

classDiagram
    class DocumentConverter {
        +convert_all(targets, raises_on_error)
    }

    class ConversionResult {
        +input: File
        +status: ConversionStatus
        +document: Document
        +errors: List<Error>
    }

    class ConversionStatus {
        <<enumeration>>
        SUCCESS
        PARTIAL_SUCCESS
        FAILURE
    }

    class Document {
        +export_to_dict()
        +export_to_document_tokens()
        +export_to_markdown(strict_text)
    }

    class rag {
        +generate(args)
        +walk(path)
        +export_documents(conv_results, output_dir)
    }

    DocumentConverter --> ConversionResult
    ConversionResult --> ConversionStatus
    ConversionResult --> Document
    rag --> DocumentConverter
    rag --> ConversionResult
    rag --> ConversionStatus
    rag --> Document
Loading

File-Level Changes

Change Details Files
Added new RAG command to CLI interface
  • Added rag_parser function to register the new command
  • Added rag_cli function as the command handler
  • Updated command list to include the new RAG command
  • Added command-line arguments for input paths and output image name
ramalama/cli.py
Implemented core RAG functionality for document processing
  • Created document walking function to recursively find target files
  • Implemented document conversion using DocumentConverter
  • Added export functionality for multiple formats (JSON, YAML, doctags, markdown, text)
  • Added error handling and conversion status reporting
ramalama/rag.py
Added documentation for the new RAG command
  • Created man page for ramalama-rag command
  • Updated main ramalama documentation to include RAG command
  • Fixed capitalization consistency in info command help text
docs/ramalama.1.md
docs/ramalama-rag.1.md
docs/ramalama-info.1.md

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rhatdan - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider improving error message clarity in rag.py - 'The example failed converting X on Y' is confusing. Consider something like 'Failed to convert X out of Y documents'.
  • Help text capitalization is inconsistent across commands (e.g. 'display' vs 'Display'). Consider standardizing on sentence case for all help messages.
Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

_log = logging.getLogger(__name__)


def walk(path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: The walk() function should consistently return a list in all cases

Currently returns None for empty directories but a list otherwise. This inconsistency could cause runtime errors. Consider returning an empty list for empty directories and removing the early return.

ramalama/rag.py Outdated

converter = DocumentConverter()
conv_results = converter.convert_all(targets, raises_on_error=False)
success_count, partial_success_count, failure_count = export_documents(conv_results, output_dir=Path("scratch"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Avoid hardcoding the scratch directory path

Consider making this configurable or using a constant defined at module level

    OUTPUT_DIR = Path("scratch")  # At module level, near top of file

    success_count, partial_success_count, failure_count = export_documents(conv_results, output_dir=OUTPUT_DIR)

return targets


def export_documents(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider using a format mapping dictionary to handle document exports

The export_documents function contains significant repetition that can be simplified without losing clarity. Consider restructuring using a format mapping:

EXPORT_FORMATS = {
    '.json': lambda doc: json.dumps(doc.export_to_dict()),
    '.yaml': lambda doc: yaml.safe_dump(doc.export_to_dict()),
    '.doctags.txt': lambda doc: doc.export_to_document_tokens(),
    '.md': lambda doc: doc.export_to_markdown(),
    '.txt': lambda doc: doc.export_to_markdown(strict_text=True)
}

def export_documents(conv_results: Iterable[ConversionResult], output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)
    stats = {'success': 0, 'partial': 0, 'failed': 0}

    for conv_res in conv_results:
        if conv_res.status == ConversionStatus.SUCCESS:
            stats['success'] += 1
            doc_filename = conv_res.input.file.stem

            for ext, export_fn in EXPORT_FORMATS.items():
                output_file = output_dir / f"{doc_filename}{ext}"
                with output_file.open('w') as fp:
                    fp.write(export_fn(conv_res.document))
        elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS:
            stats['partial'] += 1
            _log.info(f"Document {conv_res.input.file} was partially converted with the following errors:")
            for item in conv_res.errors:
                _log.info(f"\t{item.error_message}")
        else:
            stats['failed'] += 1
            _log.info(f"Document {conv_res.input.file} failed to convert.")

    _log.info(f"Processed {sum(stats.values())} docs, of which {stats['failed']} failed "
              f"and {stats['partial']} were partially converted.")
    return stats['success'], stats['partial'], stats['failed']

This approach:

  1. Separates format configuration from export logic
  2. Reduces code duplication
  3. Makes adding new export formats easier
  4. Maintains explicit error handling and logging

@rhatdan rhatdan force-pushed the rag branch 7 times, most recently from ff9f7fa to 97a74c1 Compare December 6, 2024 21:17
@rhatdan
Copy link
Member Author

rhatdan commented Dec 6, 2024

@ericcurtin any idea why docling is not being found on Mac? Seems to be found on my MAC?

Allow users to specify Docx, PDF, Markdown ... files on the command line
and then processes them with docling and rag finally putting the output
into the specified container image.

Signed-off-by: Daniel J Walsh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant