-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(file-based): sync metadarecords #260
base: main
Are you sure you want to change the base?
Conversation
/format-fix |
/autofix
|
📝 WalkthroughWalkthroughThis pull request introduces a new boolean field, Changes
Sequence DiagramsequenceDiagram
participant User
participant FileBasedSource
participant StreamReader
participant Stream
User->>FileBasedSource: Configure sync_metadata
FileBasedSource->>StreamReader: Check sync_metadata
StreamReader-->>FileBasedSource: Metadata sync status
FileBasedSource->>Stream: Create stream with metadata sync
Stream->>Stream: Process files
alt sync_metadata is True
Stream->>Stream: Extract file metadata
Stream-->>User: Return metadata records
else sync_metadata is False
Stream-->>User: Return file data records
end
Suggested reviewers
How does this updated summary and the changes look to you? Any further modifications you'd like to consider? 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (9)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py (1)
25-30
: Consider enhancing the field description for clarity?The description could be more specific about what metadata is being synced and how it differs from file data. Maybe something like: "If enabled, streams will sync file metadata (e.g., file path, permissions, timestamps) instead of the actual file contents.", wdyt?
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
213-227
: Consider adding validation for required metadata fields?The metadata schema defines several fields but doesn't mark any as required. Should we ensure that critical fields like
id
andfile_path
are always present in the metadata records? We could add arequired
field to the schema, wdyt?return { "type": "object", "properties": { "id": {"type": "string"}, "file_path": {"type": "string"}, "allowed_identity_remote_ids": {"type": "array", "items": "string"}, "is_public": {"type": "boolean"}, }, + "required": ["id", "file_path"] }
airbyte_cdk/sources/file_based/file_based_source.py (1)
391-397
: Consider reusing the delivery type check logic?The
_use_records_transfer
method has similar logic to_use_file_transfer
. Maybe we could extract the common pattern into a helper method to reduce duplication, wdyt?+ @staticmethod + def _has_delivery_type(parsed_config: AbstractFileBasedSpec, delivery_type: str) -> bool: + return ( + hasattr(parsed_config.delivery_method, "delivery_type") + and parsed_config.delivery_method.delivery_type == delivery_type + ) @staticmethod def _use_records_transfer(parsed_config: AbstractFileBasedSpec) -> bool: - use_records_transfer = ( - hasattr(parsed_config.delivery_method, "delivery_type") - and parsed_config.delivery_method.delivery_type == "use_records_transfer" - ) - return use_records_transfer + return FileBasedSource._has_delivery_type(parsed_config, "use_records_transfer")unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)
512-518
: The sync_metadata property looks good, but should we enhance its description?The implementation looks correct, but we could make the description more specific about what "file metadata" includes. For example, we could mention the specific metadata fields that will be synced (like file size, creation date, etc.). wdyt?
- "description": "If enabled, streams will sync files metadata instead of files data.", + "description": "If enabled, streams will sync file metadata (e.g., file size, creation date, last modified date) instead of file contents.",
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
airbyte_cdk/sources/file_based/config/abstract_file_based_spec.py
(1 hunks)airbyte_cdk/sources/file_based/file_based_source.py
(3 hunks)airbyte_cdk/sources/file_based/file_based_stream_reader.py
(3 hunks)airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
(6 hunks)unit_tests/sources/file_based/scenarios/csv_scenarios.py
(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_stream_reader.py
[error] 206-206: Missing return statement. If the method is meant to be abstract, use @abc.abstractmethod
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (1)
Line range hint
1-1024
: Should we add test scenarios for the new sync_metadata functionality?The file includes comprehensive test scenarios for various CSV configurations, but I notice we're missing specific test cases for the new
sync_metadata
feature. Consider adding these test scenarios:
- A test case where
sync_metadata=True
to verify file metadata syncing- A test case with
sync_metadata=False
(default case) to verify file content syncing- Edge cases like empty files, invalid metadata, etc.
Would you like me to help draft these test scenarios?
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
Outdated
Show resolved
Hide resolved
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (4)
138-145
: Hey! What do you think about reducing code duplication here? 🤔I notice this follows the same pattern as
use_file_transfer
. Maybe we could extract a helper method to handle the delivery type check? Something like this, wdyt?+ def _check_delivery_type(self, expected_type: str) -> bool: + if self.config: + return ( + hasattr(self.config.delivery_method, "delivery_type") + and self.config.delivery_method.delivery_type == expected_type + ) + return False + def use_file_transfer(self) -> bool: - if self.config: - use_file_transfer = ( - hasattr(self.config.delivery_method, "delivery_type") - and self.config.delivery_method.delivery_type == "use_file_transfer" - ) - return use_file_transfer - return False + return self._check_delivery_type("use_file_transfer") def use_records_transfer(self) -> bool: - if self.config: - use_records_transfer = ( - hasattr(self.config.delivery_method, "delivery_type") - and self.config.delivery_method.delivery_type == "use_records_transfer" - ) - return use_records_transfer - return False + return self._check_delivery_type("use_records_transfer")
158-166
: Would you mind adding a docstring to clarify the behavior? 📝The method has a dependency on
use_records_transfer()
that might not be immediately obvious to other developers. Maybe we could add a docstring explaining this? Here's a suggestion:def sync_metadata(self) -> bool: + """ + Check if metadata synchronization is enabled in the config. + + Returns: + bool: True if metadata sync is enabled and records transfer is being used, + False otherwise. + + Note: + This feature requires records transfer to be enabled + (delivery_type == "use_records_transfer"). + """ if ( self.config and self.use_records_transfer()
214-219
: Tiny typo in the docstring ✍️There's an extra quote mark in the docstring. Would you mind fixing it?
- def get_metadata_schema(self) -> Dict[str, Any]: - """ " + def get_metadata_schema(self) -> Dict[str, Any]: + """
220-228
: How about adding type annotations to the schema? 🤓The schema could be more specific about the types. Here's a suggestion that adds
minItems
for arrays and makes the types more explicit, wdyt?return { "type": "object", + "required": ["id", "file_path", "allowed_identity_remote_ids", "is_public"], "properties": { "id": {"type": "string"}, "file_path": {"type": "string"}, - "allowed_identity_remote_ids": {"type": "array", "items": "string"}, + "allowed_identity_remote_ids": { + "type": "array", + "items": {"type": "string"}, + "minItems": 1 + }, "is_public": {"type": "boolean"}, }, }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
206-212
: The docstring could use more details about the return format 🎯The previous review already caught the missing
@abstractmethod
decorator. Additionally, would you consider enhancing the docstring to describe the expected return format? Something like:@abstractmethod def get_file_metadata(self, file: RemoteFile, logger: logging.Logger) -> Dict[str, Any]: """ - This is required for connectors that will support syncing - metadata from files. + Retrieve metadata for a given file. + + Args: + file: The remote file to get metadata for + logger: Logger for recording any issues + + Returns: + Dict containing file metadata with at least the following keys: + - id (str): Unique identifier for the file + - file_path (str): Path to the file + - allowed_identity_remote_ids (List[str]): Allowed remote IDs + - is_public (bool): Public access flag """ ...
/autofix
|
@aldogonzalez8 - I'm updating from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (3)
138-146
: Consider unifying the naming convention for file transfer.Currently, we have both
use_file_transfer
anduse_records_transfer
. Would it make sense to merge them or rename them for clarity if they serve related purposes, wdyt?
158-167
: Consider consolidating conditions or clarifying control flow.
sync_metadata
checks multiple conditions similar touse_records_transfer
. Perhaps unify or centralize these checks to avoid duplication and simplify readability, wdyt?
213-227
: Evaluate if a concrete default implementation is desired.
get_metadata_schema
provides a default schema. If every connector must customize this, would marking it abstract be better to ensure an override, wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
206-212
: Revisit making this method abstract.There's a past comment suggesting
@abstractmethod
here, ensuring that all concrete implementations must overrideget_file_metadata
. Do we want to enforce that strictly, wdyt?
Summary by CodeRabbit
New Features
Enhancements
Technical Improvements