
feat(file-based): sync metadata records #260

Open
wants to merge 7 commits into main

Conversation

aldogonzalez8 (Contributor) commented Jan 23, 2025

Summary by CodeRabbit

  • New Features

    • Added option to synchronize file metadata instead of file data.
    • Introduced configuration to control metadata synchronization for file-based sources (see the config sketch after this list).
    • Enabled optional metadata record transfer during data ingestion.
  • Enhancements

    • Expanded configuration flexibility for file-based stream readers.
    • Added metadata schema and file metadata retrieval capabilities.
  • Technical Improvements

    • Updated stream processing logic to support metadata synchronization.
    • Enhanced file-based source configuration handling.
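
For context, a user-facing config that turns the new option on might look roughly like this (a minimal sketch: the delivery_method keys come from this PR, while the surrounding keys and values are illustrative assumptions, not the exact file-based spec):

    # Hypothetical source config enabling metadata sync; only the delivery_method
    # keys are grounded in this PR, the stream entry is a placeholder.
    source_config = {
        "streams": [{"name": "my_stream", "globs": ["*.csv"], "format": {"filetype": "csv"}}],
        "delivery_method": {
            "delivery_type": "use_records_transfer",
            "sync_metadata": True,
        },
    }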

@aldogonzalez8 aldogonzalez8 self-assigned this Jan 23, 2025
@github-actions github-actions bot added the enhancement New feature or request label Jan 23, 2025
aldogonzalez8 (Contributor, Author) commented:

/format-fix

aldogonzalez8 (Contributor, Author) commented Jan 23, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

coderabbitai bot (Contributor) commented Jan 23, 2025

📝 Walkthrough

This pull request introduces a new boolean field, sync_metadata, to the DeliverRecords class, allowing users to choose between synchronizing file metadata or file data. The changes are implemented across multiple files, including the addition of new methods in the FileBasedSource and AbstractFileBasedStreamReader classes, which enhance the handling of records transfer and metadata synchronization. The DefaultFileBasedStream class is also updated to integrate this new functionality into its processing logic.
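
For orientation, a minimal sketch of the new field on the DeliverRecords model might look like the following (pydantic v1-style syntax assumed; the description string is taken from the csv_scenarios diff further down, and the rest is illustrative rather than the PR's exact code):

    from typing import Literal

    from pydantic import BaseModel, Field


    class DeliverRecords(BaseModel):
        # Existing discriminator for the records-transfer delivery method.
        delivery_type: Literal["use_records_transfer"] = Field(
            "use_records_transfer", const=True
        )
        # New flag introduced by this PR; defaults to syncing file data, not metadata.
        sync_metadata: bool = Field(
            default=False,
            title="Sync Metadata",
            description="If enabled, streams will sync files metadata instead of files data.",
        )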

Changes

  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py: Added sync_metadata boolean field to DeliverRecords class
  • airbyte_cdk/sources/file_based/file_based_source.py: Added methods _use_records_transfer and _sync_metadata; updated _make_default_stream method signature
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py: Added methods use_records_transfer, sync_metadata, get_file_metadata, and get_metadata_schema
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py: Added sync_metadata handling in constructor, record reading, and schema methods
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py: Added sync_metadata property to delivery method configuration

Sequence Diagram

sequenceDiagram
    participant User
    participant FileBasedSource
    participant StreamReader
    participant Stream
    
    User->>FileBasedSource: Configure sync_metadata
    FileBasedSource->>StreamReader: Check sync_metadata
    StreamReader-->>FileBasedSource: Metadata sync status
    FileBasedSource->>Stream: Create stream with metadata sync
    Stream->>Stream: Process files
    alt sync_metadata is True
        Stream->>Stream: Extract file metadata
        Stream-->>User: Return metadata records
    else sync_metadata is False
        Stream-->>User: Return file data records
    end
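
A rough sketch of that branch in Python (only sync_metadata and get_file_metadata(file, logger) are named in this PR; the method and helper names below are hypothetical, not the actual DefaultFileBasedStream code):

    def _read_records_for_file(self, file, logger):
        if self.sync_metadata:
            # Metadata mode: emit a single metadata record for the file.
            yield self.stream_reader.get_file_metadata(file, logger=logger)
        else:
            # Default mode: parse the file and emit its data records.
            yield from self._parse_data_records(file, logger)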

Suggested reviewers

  • aaronsteers

How do this updated summary and the changes look to you? Any further modifications you'd like to consider?


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e6ca59 and edd6f69.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/file_based_source.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/file_based/file_based_source.py
⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: Build Python Package
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Analyze (python)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)

25-30: Consider enhancing the field description for clarity?

The description could be more specific about what metadata is being synced and how it differs from file data. Maybe something like: "If enabled, streams will sync file metadata (e.g., file path, permissions, timestamps) instead of the actual file contents.", wdyt?

airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

213-227: Consider adding validation for required metadata fields?

The metadata schema defines several fields but doesn't mark any as required. Should we ensure that critical fields like id and file_path are always present in the metadata records? We could add a required field to the schema, wdyt?

     return {
         "type": "object",
         "properties": {
             "id": {"type": "string"},
             "file_path": {"type": "string"},
             "allowed_identity_remote_ids": {"type": "array", "items": "string"},
             "is_public": {"type": "boolean"},
         },
+        "required": ["id", "file_path"]
     }
airbyte_cdk/sources/file_based/file_based_source.py (1)

391-397: Consider reusing the delivery type check logic?

The _use_records_transfer method has similar logic to _use_file_transfer. Maybe we could extract the common pattern into a helper method to reduce duplication, wdyt?

+    @staticmethod
+    def _has_delivery_type(parsed_config: AbstractFileBasedSpec, delivery_type: str) -> bool:
+        return (
+            hasattr(parsed_config.delivery_method, "delivery_type")
+            and parsed_config.delivery_method.delivery_type == delivery_type
+        )

     @staticmethod
     def _use_records_transfer(parsed_config: AbstractFileBasedSpec) -> bool:
-        use_records_transfer = (
-            hasattr(parsed_config.delivery_method, "delivery_type")
-            and parsed_config.delivery_method.delivery_type == "use_records_transfer"
-        )
-        return use_records_transfer
+        return FileBasedSource._has_delivery_type(parsed_config, "use_records_transfer")
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

512-518: The sync_metadata property looks good, but should we enhance its description?

The implementation looks correct, but we could make the description more specific about what "file metadata" includes. For example, we could mention the specific metadata fields that will be synced (like file size, creation date, etc.). wdyt?

-                                        "description": "If enabled, streams will sync files metadata instead of files data.",
+                                        "description": "If enabled, streams will sync file metadata (e.g., file size, creation date, last modified date) instead of file contents.",
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec7e961 and 79c5f40.

📒 Files selected for processing (5)
  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (3 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (3 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (6 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_stream_reader.py

[error] 206-206: Missing return statement. If the method is meant to be abstract, use @abc.abstractmethod

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

Line range hint 1-1024: Should we add test scenarios for the new sync_metadata functionality?

The file includes comprehensive test scenarios for various CSV configurations, but I notice we're missing specific test cases for the new sync_metadata feature. Consider adding these test scenarios:

  1. A test case where sync_metadata=True to verify file metadata syncing
  2. A test case with sync_metadata=False (default case) to verify file content syncing
  3. Edge cases like empty files, invalid metadata, etc.

Would you like me to help draft these test scenarios?
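
A couple of plain pytest cases along those lines could look like this (the import path and DeliverRecords model come from the PR's file list; that the model can be built directly with defaults is an assumption):

    from airbyte_cdk.sources.file_based.config.abstract_file_based_spec import DeliverRecords


    def test_sync_metadata_can_be_enabled():
        # Scenario 1: metadata sync explicitly turned on.
        delivery = DeliverRecords(sync_metadata=True)
        assert delivery.sync_metadata is True


    def test_sync_metadata_defaults_to_file_data():
        # Scenario 2: the default keeps the existing file-data behaviour.
        delivery = DeliverRecords()
        assert delivery.sync_metadata is False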

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (4)

138-145: Hey! What do you think about reducing code duplication here? 🤔

I notice this follows the same pattern as use_file_transfer. Maybe we could extract a helper method to handle the delivery type check? Something like this, wdyt?

+    def _check_delivery_type(self, expected_type: str) -> bool:
+        if self.config:
+            return (
+                hasattr(self.config.delivery_method, "delivery_type")
+                and self.config.delivery_method.delivery_type == expected_type
+            )
+        return False
+
     def use_file_transfer(self) -> bool:
-        if self.config:
-            use_file_transfer = (
-                hasattr(self.config.delivery_method, "delivery_type")
-                and self.config.delivery_method.delivery_type == "use_file_transfer"
-            )
-            return use_file_transfer
-        return False
+        return self._check_delivery_type("use_file_transfer")

     def use_records_transfer(self) -> bool:
-        if self.config:
-            use_records_transfer = (
-                hasattr(self.config.delivery_method, "delivery_type")
-                and self.config.delivery_method.delivery_type == "use_records_transfer"
-            )
-            return use_records_transfer
-        return False
+        return self._check_delivery_type("use_records_transfer")

158-166: Would you mind adding a docstring to clarify the behavior? 📝

The method has a dependency on use_records_transfer() that might not be immediately obvious to other developers. Maybe we could add a docstring explaining this? Here's a suggestion:

     def sync_metadata(self) -> bool:
+        """
+        Check if metadata synchronization is enabled in the config.
+        
+        Returns:
+            bool: True if metadata sync is enabled and records transfer is being used,
+                 False otherwise.
+        
+        Note:
+            This feature requires records transfer to be enabled
+            (delivery_type == "use_records_transfer").
+        """
         if (
             self.config
             and self.use_records_transfer()

214-219: Tiny typo in the docstring ✍️

There's an extra quote mark in the docstring. Would you mind fixing it?

-    def get_metadata_schema(self) -> Dict[str, Any]:
-        """ "
+    def get_metadata_schema(self) -> Dict[str, Any]:
+        """

220-228: How about adding type annotations to the schema? 🤓

The schema could be more specific about the types. Here's a suggestion that adds minItems for arrays and makes the types more explicit, wdyt?

         return {
             "type": "object",
+            "required": ["id", "file_path", "allowed_identity_remote_ids", "is_public"],
             "properties": {
                 "id": {"type": "string"},
                 "file_path": {"type": "string"},
-                "allowed_identity_remote_ids": {"type": "array", "items": "string"},
+                "allowed_identity_remote_ids": {
+                    "type": "array",
+                    "items": {"type": "string"},
+                    "minItems": 1
+                },
                 "is_public": {"type": "boolean"},
             },
         }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79c5f40 and 4638f89.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

206-212: The docstring could use more details about the return format 🎯

The previous review already caught the missing @abstractmethod decorator. Additionally, would you consider enhancing the docstring to describe the expected return format? Something like:

     @abstractmethod
     def get_file_metadata(self, file: RemoteFile, logger: logging.Logger) -> Dict[str, Any]:
         """
-        This is required for connectors that will support syncing
-        metadata from files.
+        Retrieve metadata for a given file.
+        
+        Args:
+            file: The remote file to get metadata for
+            logger: Logger for recording any issues
+        
+        Returns:
+            Dict containing file metadata with at least the following keys:
+            - id (str): Unique identifier for the file
+            - file_path (str): Path to the file
+            - allowed_identity_remote_ids (List[str]): Allowed remote IDs
+            - is_public (bool): Public access flag
         """
         ...

aldogonzalez8 (Contributor, Author) commented Jan 23, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

aaronsteers (Contributor) commented:

@aldogonzalez8 - I'm updating from main to resolve a CI issue on the "Test Connectors" workflow.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)

138-146: Consider unifying the naming convention for file transfer.

Currently, we have both use_file_transfer and use_records_transfer. Would it make sense to merge them or rename them for clarity if they serve related purposes, wdyt?


158-167: Consider consolidating conditions or clarifying control flow.

sync_metadata checks multiple conditions similar to use_records_transfer. Perhaps unify or centralize these checks to avoid duplication and simplify readability, wdyt?


213-227: Evaluate if a concrete default implementation is desired.

get_metadata_schema provides a default schema. If every connector must customize this, would marking it abstract be better to ensure an override, wdyt?
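
For comparison, the abstract variant under discussion would look roughly like this (an illustrative stub, not the real class body):

    from abc import ABC, abstractmethod
    from typing import Any, Dict


    class AbstractFileBasedStreamReader(ABC):  # illustrative stub only
        @abstractmethod
        def get_metadata_schema(self) -> Dict[str, Any]:
            """Return the JSON schema for metadata records emitted by this reader."""
            ...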

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bfb8c3 and 7e6ca59.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

206-212: Revisit making this method abstract.

There's a past comment suggesting @abstractmethod here, ensuring that all concrete implementations must override get_file_metadata. Do we want to enforce that strictly, wdyt?

Labels: enhancement (New feature or request)
Projects: None yet
2 participants