feat(file-based): changes for not mirroring paths #205

Open
wants to merge 8 commits into
base: main

Conversation

aldogonzalez8
Contributor

@aldogonzalez8 aldogonzalez8 commented Jan 7, 2025

Summary by CodeRabbit

  • New Features

    • Added configuration options for preserving subdirectories during file transfers.
    • Introduced error handling for duplicate files in file-based sources.
  • Bug Fixes

    • Enhanced file path handling and transfer mechanisms.
  • Improvements

    • Expanded configuration flexibility for file-based source sync processes.
    • Improved error reporting for file processing scenarios.
    • Enhanced clarity in error messages related to duplicate files.

@aldogonzalez8 aldogonzalez8 self-assigned this Jan 7, 2025
@github-actions github-actions bot added the enhancement New feature or request label Jan 7, 2025
Contributor

coderabbitai bot commented Jan 7, 2025

📝 Walkthrough

The pull request introduces a new boolean field, preserve_subdirectories_directories, to the DeliverRawFiles and AbstractFileBasedSpec classes, enhancing configuration options for file delivery. It also adds a new exception class, DuplicatedFilesError, and a method for formatting error messages related to duplicate files. Additionally, several methods are updated or added across various classes to support the new configuration and error handling, streamlining the management of subdirectory structures and duplicate files during file transfers.

Changes

File and change summary:

  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py
    • Added preserve_subdirectories_directories: bool to DeliverRawFiles class
  • airbyte_cdk/sources/file_based/exceptions.py
    • Added DuplicatedFilesError exception class
    • Added format_duplicate_files_error_message function
  • airbyte_cdk/sources/file_based/file_based_source.py
    • Added _preserve_subdirectories_directories method
    • Updated _make_default_stream method signature
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py
    • Added preserve_subdirectories_directories method
    • Changed _get_file_transfer_paths from static to instance method
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
    • Added constants PRESERVE_SUBDIRECTORIES_KW and FILES_KEY
    • Added preserve_subdirectories_directories property
    • Added _duplicated_files_names method to handle file duplicates
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py
    • Enhanced delivery_options configuration for testing

Possibly related PRs

Hey there! 👋 I noticed you've added some really cool configuration options for file delivery. Quick question: have you considered how this might impact existing connectors that don't explicitly set delivery_options? The default seems to be True for preserving subdirectories, but it might be worth double-checking backward compatibility. Wdyt? 🤔


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9272cd and 66c6b97.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (6)
airbyte_cdk/sources/file_based/exceptions.py (1)

132-149: Add type annotations to improve type safety, wdyt?

The message formatting looks great! Consider adding type annotations:

-def format_duplicate_files_error_message(stream_name: str, duplicated_files_names: List):
+def format_duplicate_files_error_message(stream_name: str, duplicated_files_names: List[Dict[str, List[str]]]) -> str:
🧰 Tools
🪛 GitHub Actions: Linters

[error] 132-132: Function is missing a return type annotation


[error] 132-132: Missing type parameters for generic type "List"

airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

138-146: Consider improving null safety in the config check, wdyt?

The logic looks good, but we could make the null checks more explicit:

     def preserve_subdirectories_directories(self) -> bool:
         # fall back to preserve subdirectories if config is not present or incomplete
         if (
             self.config
             and hasattr(self.config, "delivery_options")
-            and hasattr(self.config.delivery_options, "preserve_subdirectories_directories")
+            and self.config.delivery_options is not None
+            and hasattr(self.config.delivery_options, "preserve_subdirectories_directories")
         ):
             return self.config.delivery_options.preserve_subdirectories_directories
         return True
🧰 Tools
🪛 GitHub Actions: Linters

[error] 145-145: Item "None" of "DeliveryOptions | None" has no attribute "preserve_subdirectories_directories"
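
A minimal sketch of the null-safe lookup discussed in this comment, written against stand-in objects rather than the CDK's real config classes. The helper name resolve_preserve_flag and the SimpleNamespace configs are assumptions for illustration only. Using getattr with a default folds the "attribute missing" and "value is None" cases into one early return, which is also the shape a type checker can narrow:

```python
from types import SimpleNamespace
from typing import Any


def resolve_preserve_flag(config: Any) -> bool:
    """Return the configured flag, defaulting to True when anything is missing."""
    # getattr with a default covers a None config and a missing attribute alike.
    delivery_options = getattr(config, "delivery_options", None)
    if delivery_options is None:
        return True
    return bool(getattr(delivery_options, "preserve_subdirectories_directories", True))


# Hypothetical stand-ins for parsed configs.
no_options = SimpleNamespace(delivery_options=None)
flat = SimpleNamespace(
    delivery_options=SimpleNamespace(preserve_subdirectories_directories=False)
)
```

With these stand-ins, only the `flat` config turns the flag off; a missing or empty config keeps the default of preserving subdirectories.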

airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)

17-22: Consider enhancing the field description for clarity, wdyt?

The implementation looks good! Consider making the description more descriptive:

     preserve_subdirectories_directories: bool = Field(
         True,
-        description="Flag indicating we should preserve subdirectories directories",
+        description="When enabled, preserves the subdirectory structure of files during transfer. When disabled, all files are stored in the root directory.",
     )
airbyte_cdk/sources/file_based/file_based_source.py (1)

392-399: Add return type and improve null safety, wdyt?

The logic looks good, but we could improve type safety:

     @staticmethod
-    def _preserve_subdirectories_directories(parsed_config: AbstractFileBasedSpec):
+    def _preserve_subdirectories_directories(parsed_config: AbstractFileBasedSpec) -> bool:
         # fall back to preserve subdirectories if config is not present or incomplete
-        if hasattr(parsed_config, "delivery_options") and hasattr(
-            parsed_config.delivery_options, "preserve_subdirectories_directories"
-        ):
+        if (
+            hasattr(parsed_config, "delivery_options")
+            and parsed_config.delivery_options is not None
+            and hasattr(parsed_config.delivery_options, "preserve_subdirectories_directories")
+        ):
             return parsed_config.delivery_options.preserve_subdirectories_directories
         return True
🧰 Tools
🪛 GitHub Actions: Linters

[error] 393-393: Function is missing a return type annotation


[error] 398-398: Item "None" of "DeliveryOptions | None" has no attribute "preserve_subdirectories_directories"

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

50-51: Consider a more concise property name.

The property name preserve_subdirectories_directories seems redundant with "directories". What do you think about shortening it to just preserve_subdirectories? wdyt?

-    preserve_subdirectories_directories = True
+    preserve_subdirectories = True

Also applies to: 59-59


64-66: Update initialization if property name changes.

If you agree with shortening the property name, we should update the initialization too. wdyt?

-            self.preserve_subdirectories_directories = kwargs.pop(
-                self.PRESERVE_SUBDIRECTORIES_KW, True
-            )
+            self.preserve_subdirectories = kwargs.pop(self.PRESERVE_SUBDIRECTORIES_KW, True)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ee710d and c4150ee.

📒 Files selected for processing (6)
  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (3 hunks)
  • airbyte_cdk/sources/file_based/exceptions.py (2 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (6 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (2 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/exceptions.py

[error] 132-132: Function is missing a return type annotation


[error] 132-132: Missing type parameters for generic type "List"

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

[error] 5-44: Import block is un-sorted or un-formatted. This can be fixed automatically using the --fix option with ruff.

airbyte_cdk/sources/file_based/file_based_stream_reader.py

[error] 145-145: Item "None" of "DeliveryOptions | None" has no attribute "preserve_subdirectories_directories"

airbyte_cdk/sources/file_based/file_based_source.py

[error] 393-393: Function is missing a return type annotation


[error] 398-398: Item "None" of "DeliveryOptions | None" has no attribute "preserve_subdirectories_directories"

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (5)
airbyte_cdk/sources/file_based/exceptions.py (1)

114-116: LGTM! Clean and consistent error class implementation.

The error class follows the established pattern in the codebase.

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

111-126: LGTM! Well-structured implementation.

The _duplicated_files_names method is well-implemented:

  • Uses appropriate data structures (set for O(1) lookups, defaultdict for grouping)
  • Clear variable names
  • Efficient single-pass algorithm
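
For readers without the diff at hand, the single-pass approach described above can be sketched roughly as follows. This is not the PR's implementation: the function name find_duplicated_file_names is hypothetical, and it takes plain URI strings instead of RemoteFile slices.

```python
from collections import defaultdict
from os import path
from typing import Dict, List, Set


def find_duplicated_file_names(uris: List[str]) -> List[Dict[str, List[str]]]:
    """Group URIs by base name and report every name seen more than once."""
    seen_names: Set[str] = set()
    duplicated_names: Set[str] = set()
    uris_by_name: Dict[str, List[str]] = defaultdict(list)
    for uri in uris:
        name = path.basename(uri)
        if name in seen_names:  # O(1) membership test on the set
            duplicated_names.add(name)
        seen_names.add(name)
        uris_by_name[name].append(uri)
    # One single-key dict per duplicated name, mapping it to its full paths.
    return [{name: uris_by_name[name]} for name in sorted(duplicated_names)]
```

An empty result means no two files share a base name, so flattening the directory structure would not cause collisions.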

137-145: LGTM! Good error handling.

The duplicate file check in compute_slices is well-integrated:

  • Only checks when preserve_subdirectories_directories is False
  • Provides clear error messages with file details
  • Proper error propagation
unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)

520-538: LGTM! Well-structured delivery options.

The delivery options under "Copy Raw Files" are well-defined:

  • Clear property structure
  • Descriptive title and description
  • Appropriate default value

552-570: Consider consolidating delivery options.

I notice we have delivery options defined both under "Copy Raw Files" and at the root level. Is this intentional duplication or should we consolidate them? wdyt?

✅ Verification successful

The delivery options duplication is intentional and correct.

The delivery_options appear in two different contexts:

  • At the root level for global configuration (order=8)
  • Within the "Copy Raw Files" delivery method for specific file transfer settings (order=2)

This allows users to configure delivery options both globally and specifically for file transfers when needed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if both delivery options are used in the codebase
rg -A 5 "delivery_options" | grep -v "test"

Length of output: 2562


Script:

#!/bin/bash
# Search for class definitions containing delivery_options
ast-grep --pattern 'class $_ {
  $$$
  delivery_options: $_
  $$$
}'

Length of output: 70


Script:

#!/bin/bash
# Search for class definitions and their context
rg -B 2 -A 10 "class.*DeliveryOptions" 

Length of output: 1522


Script:

#!/bin/bash
# Search for the complete schema structure
rg -B 10 -A 20 "class.*FileBasedSpec" 

Length of output: 17481

@aldogonzalez8
Contributor Author

aldogonzalez8 commented Jan 7, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)

59-66: Consider simplifying the property name.

The property name preserve_subdirectories_directories seems to have a redundant "directories" suffix. Would preserve_subdirectories be clearer and more concise? The initialization logic looks good though! Wdyt?

-    preserve_subdirectories_directories = True
+    preserve_subdirectories = True

     def __init__(self, **kwargs: Any):
         if self.FILE_TRANSFER_KW in kwargs:
             self.use_file_transfer = kwargs.pop(self.FILE_TRANSFER_KW, False)
-            self.preserve_subdirectories_directories = kwargs.pop(
+            self.preserve_subdirectories = kwargs.pop(
                 self.PRESERVE_SUBDIRECTORIES_KW, True
             )

111-126: Add type annotations and consider renaming the method.

The duplicate detection logic looks solid! A few suggestions to make it even better:

  1. The method name could be more Pythonic. Maybe _get_duplicated_filenames?
  2. Let's add proper type annotations to fix the pipeline failures.
-    def _duplicated_files_names(self, slices: List) -> list[dict]:
+    def _get_duplicated_filenames(
+        self,
+        slices: List[Dict[str, List[RemoteFile]]]
+    ) -> List[Dict[str, List[str]]]:
🧰 Tools
🪛 GitHub Actions: Linters

[error] 111-111: Missing type parameters for generic type "List"


[error] 111-111: Missing type parameters for generic type "dict"


133-145: Consider extracting the duplicate check into a guard clause.

The logic looks correct, but we could make it more readable by extracting the duplicate check into a guard clause at the beginning of the method. Wdyt?

     def compute_slices(self) -> Iterable[Optional[Mapping[str, Any]]]:
         # Sort files by last_modified, uri and return them grouped by last_modified
         all_files = self.list_files()
         files_to_read = self._cursor.get_files_to_sync(all_files, self.logger)
         sorted_files_to_read = sorted(files_to_read, key=lambda f: (f.last_modified, f.uri))
         slices = [
             {self.FILES_KEY: list(group[1])}
             for group in itertools.groupby(sorted_files_to_read, lambda f: f.last_modified)
         ]
-        if slices and not self.preserve_subdirectories_directories:
-            duplicated_files_names = self._duplicated_files_names(slices)
-            if duplicated_files_names:
-                raise DuplicatedFilesError(
-                    format_duplicate_files_error_message(
-                        stream_name=self.name, duplicated_files_names=duplicated_files_names
-                    ),
-                    stream=self.name,
-                )
+        if slices and not self.preserve_subdirectories_directories:
+            self._check_for_duplicates(slices)
         return slices

+    def _check_for_duplicates(self, slices: List[Dict[str, List[RemoteFile]]]) -> None:
+        duplicated_files_names = self._get_duplicated_filenames(slices)
+        if duplicated_files_names:
+            raise DuplicatedFilesError(
+                format_duplicate_files_error_message(
+                    stream_name=self.name,
+                    duplicated_files_names=duplicated_files_names
+                ),
+                stream=self.name,
+            )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4150ee and 8ee5ac3.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

[error] 111-111: Missing type parameters for generic type "List"


[error] 111-111: Missing type parameters for generic type "dict"

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1)

8-25: LGTM! The new imports and constants are well-organized.

The new constants PRESERVE_SUBDIRECTORIES_KW and FILES_KEY follow the existing naming convention and are descriptive. The imports are logically grouped, though we could consider organizing them alphabetically within their groups. Wdyt?

Also applies to: 50-51

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_source.py (1)

392-401: Well-structured configuration handling.

The method follows good practices:

  • Safely handles missing or incomplete configuration
  • Preserves backward compatibility by defaulting to True
  • Clear and descriptive attribute checks

What do you think about adding a docstring to explain the default behavior? wdyt?

 @staticmethod
 def _preserve_subdirectories_directories(parsed_config: AbstractFileBasedSpec) -> bool:
+    """
+    Determine whether to preserve subdirectories based on the configuration.
+    
+    Returns True if:
+    - The configuration is missing
+    - The delivery_options is not set
+    - The preserve_subdirectories_directories flag is not set
+    """
     # fall back to preserve subdirectories if config is not present or incomplete
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

111-128: Well-implemented duplicate detection.

The implementation is efficient using sets and defaultdict. A few suggestions to make it even better:

  1. Would you consider adding type hints for the return value's inner types? wdyt?
  2. How about adding a docstring to explain the method's purpose and return format?
 def _duplicated_files_names(
     self, slices: List[dict[str, List[RemoteFile]]]
-) -> List[dict[str, List[str]]]:
+) -> List[dict[str, list[str]]]:
+    """
+    Identify duplicate file names across all slices.
+    
+    Args:
+        slices: List of slices containing RemoteFile objects
+        
+    Returns:
+        List of dictionaries mapping duplicate file names to their full paths
+    """

139-147: Clean integration of duplicate detection.

The duplicate check is well-integrated into the existing slice computation logic. However, consider extracting the duplicate check into a separate method for better readability. wdyt?

 def compute_slices(self) -> Iterable[Optional[Mapping[str, Any]]]:
+    def _check_for_duplicates(slices: List[dict[str, List[RemoteFile]]]) -> None:
+        if not self.preserve_subdirectories_directories:
+            duplicated_files_names = self._duplicated_files_names(slices)
+            if duplicated_files_names:
+                raise DuplicatedFilesError(
+                    format_duplicate_files_error_message(
+                        stream_name=self.name, duplicated_files_names=duplicated_files_names
+                    ),
+                    stream=self.name,
+                )
+
     all_files = self.list_files()
     files_to_read = self._cursor.get_files_to_sync(all_files, self.logger)
     sorted_files_to_read = sorted(files_to_read, key=lambda f: (f.last_modified, f.uri))
     slices = [
         {self.FILES_KEY: list(group[1])}
         for group in itertools.groupby(sorted_files_to_read, lambda f: f.last_modified)
     ]
-    if slices and not self.preserve_subdirectories_directories:
-        duplicated_files_names = self._duplicated_files_names(slices)
-        if duplicated_files_names:
-            raise DuplicatedFilesError(
-                format_duplicate_files_error_message(
-                    stream_name=self.name, duplicated_files_names=duplicated_files_names
-                ),
-                stream=self.name,
-            )
+    if slices:
+        _check_for_duplicates(slices)
     return slices
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ee5ac3 and 1fd48aa.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/file_based/exceptions.py (2 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (6 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (2 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (3)
airbyte_cdk/sources/file_based/exceptions.py (2)

114-115: LGTM! Clean exception class implementation.

The DuplicatedFilesError follows the established pattern of extending BaseFileBasedSourceError.


132-151: Great error message formatting!

The error message is clear, actionable and includes:

  • The specific stream where duplicates were found
  • The number of duplicates for each file
  • Clear guidance on how to resolve the issue
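
A rough sketch of a message with the three properties listed above. The exact wording produced by format_duplicate_files_error_message is not shown in this conversation, so the text and the function name format_duplicates_message here are assumptions; the input shape follows the list-of-dicts format mentioned in the review.

```python
from typing import Dict, List


def format_duplicates_message(
    stream_name: str, duplicated: List[Dict[str, List[str]]]
) -> str:
    """Build a multi-line message naming the stream, each duplicate, and a fix."""
    lines = [f"{len(duplicated)} duplicate file name(s) found in stream {stream_name}:"]
    for entry in duplicated:
        for name, uris in entry.items():
            # Report how many files share the name and where they live.
            lines.append(f"  {name}: {len(uris)} files ({', '.join(uris)})")
    lines.append(
        "Enable subdirectory preservation or rename the files to avoid collisions."
    )
    return "\n".join(lines)
```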
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1)

50-51: Good use of constants!

Using constants instead of magic strings improves maintainability.

@aldogonzalez8 aldogonzalez8 changed the title feat(file-based): adding initial changes for not mirroring paths ✨feat(file-based): changes for not mirroring paths Jan 7, 2025
@aldogonzalez8 aldogonzalez8 changed the title ✨feat(file-based): changes for not mirroring paths feat(file-based): changes for not mirroring paths Jan 7, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

Line range hint 172-182: The logic looks good! Would you consider some readability improvements?

The implementation correctly uses the new configuration option, but we could make it even clearer with:

  1. More descriptive variable names (e.g., relative_file_path instead of file_relative_path)
  2. A more detailed comment explaining the path transformation logic

What do you think about this diff?

-    # Remove left slashes from source path format to make relative path for writing locally
-    file_relative_path = file.uri.lstrip("/")
+    # Transform the source file path into a relative path for local writing:
+    # - If preserving directories: maintain the path structure but remove leading slashes
+    # - If not preserving: use only the filename
+    relative_file_path = file.uri.lstrip("/") if preserve_subdirectories_directories else path.basename(file.uri)
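
The path transformation in the suggestion above can be tried in isolation. This is a sketch only; the function name local_relative_path is hypothetical and it works on plain URI strings rather than RemoteFile objects.

```python
from os import path


def local_relative_path(uri: str, preserve_subdirectories: bool) -> str:
    """Turn a source URI into the relative path used when writing locally."""
    if preserve_subdirectories:
        # Keep the directory structure, dropping only the leading slashes.
        return uri.lstrip("/")
    # Flatten: keep just the file name.
    return path.basename(uri)
```

For example, "/data/2024/report.csv" becomes "data/2024/report.csv" when preserving subdirectories and "report.csv" otherwise, which is why two files with the same name in different directories collide in the flattened mode.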
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (2)

34-38: The field definition looks good! Should we enhance the description?

The implementation is solid, but the description could be more informative about what this setting actually does. What do you think about making it clearer for users?

     preserve_subdirectories_directories: bool = Field(
         title="Preserve Subdirectories Directories",
-        description="Flag indicating we should preserve subdirectories directories",
+        description="When enabled, maintains the original directory structure of files when copying them to the destination. When disabled, all files are copied to a flat structure in the destination directory.",
         default=True,
     )

74-78: Should we use the same enhanced description here for consistency?

     preserve_subdirectories_directories: bool = Field(
         title="Preserve Subdirectories Directories",
-        description="Flag indicating we should preserve subdirectories directories",
+        description="When enabled, maintains the original directory structure of files when copying them to the destination. When disabled, all files are copied to a flat structure in the destination directory.",
         default=True,
     )
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)

526-544: The test configuration looks good! Should we add more test coverage?

The implementation correctly includes the new field in the test scenarios. However, we might want to add test cases that specifically verify the behavior when:

  1. preserve_subdirectories_directories is set to False
  2. Files are in nested subdirectories
  3. Edge cases like empty directories or files with identical names in different subdirectories

Would you like me to help draft these additional test scenarios?
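
As a starting point, the scenarios listed above might look like the following. This sketch does not use the CDK's scenario builder; flatten is a hypothetical stand-in for the flattened-transfer behavior, and ValueError stands in for DuplicatedFilesError.

```python
import unittest
from collections import Counter
from os import path
from typing import List


def flatten(uris: List[str]) -> List[str]:
    """Hypothetical helper: map each URI to its base name, failing on collisions."""
    names = [path.basename(uri) for uri in uris]
    clashes = sorted(name for name, count in Counter(names).items() if count > 1)
    if clashes:
        raise ValueError(f"duplicate file names: {clashes}")
    return names


class FlattenScenarios(unittest.TestCase):
    def test_nested_files_are_flattened(self) -> None:
        self.assertEqual(flatten(["/a/b/one.csv", "/c/two.csv"]), ["one.csv", "two.csv"])

    def test_identical_names_in_different_directories_fail(self) -> None:
        with self.assertRaises(ValueError):
            flatten(["/a/data.csv", "/b/data.csv"])

    def test_no_files(self) -> None:
        self.assertEqual(flatten([]), [])
```

Saved as a module, these cases run with `python -m unittest`.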

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fd48aa and edf6816.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (2 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (6 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (2 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/file_based/file_based_source.py
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Analyze (python)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

138-146: LGTM! The implementation looks solid.

The method handles all edge cases gracefully and maintains backward compatibility with the default True value.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/sources/file_based/file_based_source.py (1)

393-393: What do you think about simplifying the method name? 🤔

The current name _preserve_subdirectories_directories seems a bit redundant with "directories" appearing twice. Maybe we could simplify it to _preserve_subdirectories? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between edf6816 and c9272cd.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_source.py (6 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (2 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Publish SDM to DockerHub
  • GitHub Check: Publish CDK version to PyPI
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/file_based/file_based_source.py (3)

245-245: Nice refactoring of the _make_default_stream method! 👍

The change to pass parsed_config instead of use_file_transfer is a good improvement. It consolidates configuration handling and makes the code more maintainable for future extensions.

Also applies to: 276-276, 288-288, 301-301, 313-316


392-401: Great defensive programming! 💪

The implementation safely handles all edge cases and provides a sensible default. I particularly like how it:

  • Validates file transfer usage first
  • Checks attribute existence
  • Has null-safety checks
  • Provides a safe default

392-401: Shall we verify the configuration usage? 🔍

Let's check if this new configuration option is properly documented and consistently used across the codebase.

Labels
enhancement New feature or request

1 participant