
feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor #111

Open · wants to merge 37 commits into base: main

Conversation


@tolik0 tolik0 commented Dec 4, 2024

Summary by CodeRabbit

  • New Features

    • Added support for concurrent cursor processing in declarative sources.
    • Enhanced incremental sync capabilities with per-partition and global cursor strategies.
    • Introduced new methods for creating concurrent cursors based on partitioned data streams.
    • Added a method to check if the timer is running in the Timer class.
  • Improvements

    • Streamlined record filtering and cursor handling.
    • Improved error handling for cursor and record management.
    • Updated request option retrieval in retrievers.
  • Technical Updates

    • Introduced new cursor classes for more flexible incremental synchronization.
    • Expanded test coverage for concurrent and incremental stream processing.
    • Updated tests to reflect changes in stream categorization and expected behaviors.
    • Added detailed unit tests for low-code connector streams and hierarchical substreams.
    • Enhanced the testing framework for ConcurrentDeclarativeSource and improved error handling in tests.

@github-actions github-actions bot added the enhancement New feature or request label Dec 4, 2024

coderabbitai bot commented Dec 4, 2024

📝 Walkthrough

Walkthrough

The pull request introduces significant enhancements to the Airbyte CDK's concurrent declarative source handling, focusing on improving incremental sync capabilities for low-code connectors. The changes primarily extend support for different cursor types, particularly PerPartitionWithGlobalCursor, and refine the way streams are processed concurrently. The modifications span multiple files, introducing new classes and methods and updating existing logic to support more complex incremental synchronization scenarios.
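To make the per-partition cursor idea concrete, here is a minimal, self-contained sketch. The class and field names below (SimpleCursor, PerPartitionCursor, observe) are illustrative assumptions, not the CDK's actual API; the sketch only shows the pattern of keeping one independent cursor per partition and emitting a combined `{"states": [...]}` payload:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class SimpleCursor:
    """Tracks the highest cursor value seen for one partition."""
    cursor_field: str
    value: Any = None

    def observe(self, record: Dict[str, Any]) -> None:
        new_value = record[self.cursor_field]
        if self.value is None or new_value > self.value:
            self.value = new_value


@dataclass
class PerPartitionCursor:
    """Keeps an independent cursor per partition, created on demand."""
    cursor_factory: Callable[[], SimpleCursor]
    cursors: Dict[str, SimpleCursor] = field(default_factory=dict)

    def observe(self, partition_key: str, record: Dict[str, Any]) -> None:
        self.cursors.setdefault(partition_key, self.cursor_factory()).observe(record)

    def state(self) -> Dict[str, Any]:
        # Mirrors the {"states": [{"partition": ..., "cursor": ...}]} shape
        # used by per-partition state messages.
        return {
            "states": [
                {"partition": key, "cursor": {c.cursor_field: c.value}}
                for key, c in self.cursors.items()
            ]
        }


cursor = PerPartitionCursor(cursor_factory=lambda: SimpleCursor("updated_at"))
cursor.observe("org_a", {"id": 1, "updated_at": "2024-01-02"})
cursor.observe("org_a", {"id": 2, "updated_at": "2024-01-05"})
cursor.observe("org_b", {"id": 3, "updated_at": "2024-01-03"})
```

The real ConcurrentPerPartitionCursor layers concurrency, state persistence, and partition eviction on top of this basic shape.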

Changes

  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py: Added an import for PerPartitionWithGlobalCursor and updated the _group_streams method to handle the new cursor type; introduced a new _get_retriever method.
  • airbyte_cdk/sources/declarative/incremental/__init__.py: Added exports for ConcurrentCursorFactory and ConcurrentPerPartitionCursor.
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py: Introduced new classes for managing concurrent cursors with advanced state tracking and several new methods for cursor management.
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py: Added methods for creating concurrent cursors from different cursor types; updated an existing method to include additional parameters.
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py: Updated request handling to use request_option_provider.
  • airbyte_cdk/sources/declarative/extractors/record_filter.py: Simplified cursor handling in ClientSideIncrementalRecordFilterDecorator.
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py: Added a should_be_synced method and enhanced record validation.
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py: Expanded the test suite for ConcurrentDeclarativeSource, adding tests for incremental sync scenarios.
  • unit_tests/sources/declarative/extractors/test_record_filter.py: Updated tests to reflect cursor parameter consolidation in ClientSideIncrementalRecordFilterDecorator.
  • unit_tests/sources/declarative/parsers/test_model_to_component_factory.py: Added tests for error handling in incremental sync configurations.
  • unit_tests/sources/declarative/test_concurrent_declarative_source.py: Adjusted tests for expected concurrent stream counts and classifications.
  • unit_tests/sources/declarative/retrievers/test_simple_retriever.py: Updated tests to reflect changes in request handling logic.
  • airbyte_cdk/sources/streams/concurrent/cursor.py: Modified state management in the ConcurrentCursor class for better encapsulation.
  • airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py: Added an is_running method to the Timer class.


Hey there! 👋 I noticed some interesting improvements in the concurrent declarative source handling. A few questions for you:

  1. The changes introduce PerPartitionWithGlobalCursor support - wdyt about how this might impact existing low-code connectors?

  2. I see you've consolidated some cursor handling logic. Do you think this makes the code more readable or potentially introduces any subtle complexity?

  3. The new ConcurrentPerPartitionCursor seems quite powerful. Any specific use cases that inspired its development?

Would love to hear your thoughts! 🚀



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

59-66: Consider simplifying the constructor parameters

The __init__ method of ConcurrentPerPartitionCursor has many parameters, which can make it harder to maintain and understand. Would it be beneficial to encapsulate related parameters into data classes or reduce the number of parameters if possible? Wdyt?
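One way such an encapsulation could look, as a hypothetical sketch (the dataclass and field names here are illustrative, not the actual ConcurrentPerPartitionCursor signature):

```python
from dataclasses import dataclass
from typing import Any, MutableMapping, Optional


@dataclass(frozen=True)
class StreamIdentity:
    """Groups the stream-identification arguments into one value object."""
    name: str
    namespace: Optional[str] = None


@dataclass
class CursorContext:
    """Bundles the state-related constructor arguments into a single parameter."""
    stream: StreamIdentity
    stream_state: MutableMapping[str, Any]


# A constructor can then take one context object instead of many positional arguments.
ctx = CursorContext(stream=StreamIdentity(name="orders"), stream_state={"states": []})
```

Frozen dataclasses for the identity portion also make the grouped arguments safely hashable and immutable.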

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)

23-25: Unused import statement

We import PerPartitionWithGlobalCursor, but it's not used elsewhere in the code. Should we remove this import to keep the code clean? Wdyt?


309-361: Refactor duplicated code in _group_streams method

There seems to be duplicated code in the conditional blocks handling DatetimeBasedCursorModel streams. Could we refactor these blocks into a helper function to reduce redundancy and improve readability? Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

915-970: Consider reducing code duplication

The methods create_concurrent_cursor_from_datetime_based_cursor and create_concurrent_cursor_from_perpartition_cursor share similar logic. Could we extract the common parts into a shared helper function to simplify maintenance? Wdyt?

airbyte_cdk/sources/streams/concurrent/cursor.py (1)

243-251: Consider extracting common logic between close methods?

The new close_partition_without_emit shares a lot of logic with close_partition. What do you think about extracting the common logic into a private method to reduce duplication? Something like this, wdyt?

+    def _close_partition_internal(self, partition: Partition) -> bool:
+        slice_count_before = len(self.state.get("slices", []))
+        self._add_slice_to_state(partition)
+        should_merge = slice_count_before < len(self.state["slices"])
+        if should_merge:
+            self._merge_partitions()
+        self._has_closed_at_least_one_slice = True
+        return should_merge
+
     def close_partition_without_emit(self, partition: Partition) -> None:
-        slice_count_before = len(self.state.get("slices", []))
-        self._add_slice_to_state(partition)
-        if slice_count_before < len(
-            self.state["slices"]
-        ):  # only emit if at least one slice has been processed
-            self._merge_partitions()
-        self._has_closed_at_least_one_slice = True
+        self._close_partition_internal(partition)

     def close_partition(self, partition: Partition) -> None:
-        slice_count_before = len(self.state.get("slices", []))
-        self._add_slice_to_state(partition)
-        if slice_count_before < len(
-            self.state["slices"]
-        ):  # only emit if at least one slice has been processed
-            self._merge_partitions()
+        if self._close_partition_internal(partition):
             self._emit_state_message()
-        self._has_closed_at_least_one_slice = True
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between acb6630 and a3304b9.

📒 Files selected for processing (7)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/__init__.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
  • airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1 hunks)
  • airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)

25-29: Question about returning only the first cursor

In the ConcurrentCursorFactory, the create method returns only the first cursor from _create_function. Since _create_function returns a tuple of cursors, should we ensure that we're not discarding any necessary cursors? Perhaps we should handle all cursors returned. Wdyt?


221-224: Ensure partition key serialization handles edge cases

In _to_partition_key and _to_dict, we use self._partition_serializer. Do we need to ensure that partition keys are properly sanitized or encoded to handle special characters and prevent serialization issues? Wdyt?

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

317-318: Type checking with isinstance

We use isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor) to check the type. Should we consider using duck typing or interface checks to allow for subclasses or alternative implementations that fulfill the same protocol? Wdyt?

airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1)

41-66: Clarify the purpose of DeclarativePartitionFactory1

The new class DeclarativePartitionFactory1 appears to be similar to DeclarativePartitionFactory but with a different approach to handling the retriever. Is this intended to replace the original factory or serve a different purpose? Providing more context or renaming the class for clarity might help. Wdyt?

airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)

181-181: LGTM!

The change to use request_option_provider instead of stream_slicer improves modularity by separating concerns.

Comment on lines 90 to 105
def state(self) -> MutableMapping[str, Any]:
states = []
for partition_tuple, cursor in self._cursor_per_partition.items():
cursor_state = cursor._connector_state_converter.convert_to_state_message(
cursor._cursor_field, cursor.state
)
if cursor_state:
states.append(
{
"partition": self._to_dict(partition_tuple),
"cursor": copy.deepcopy(cursor_state),
}
)
state: dict[str, Any] = {"states": states}
return state


🛠️ Refactor suggestion

Potential performance issue when generating state

The state property method iterates over self._cursor_per_partition.items() and performs deep copies of cursor states. This could become a performance bottleneck with a large number of partitions. Should we consider optimizing this by avoiding deep copies or processing states incrementally? Wdyt?
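The trade-off being raised can be sketched as follows (the data and function names are illustrative, not the actual implementation): deep copies make each state snapshot immune to later cursor mutation, at the cost of copying the full accumulated state on every call.

```python
import copy

# Live cursor state, keyed by partition (illustrative data).
cursor_states = {
    f"partition_{i}": {"updated_at": "2024-01-01", "meta": {"n": i}} for i in range(3)
}


def state_with_deepcopy():
    # Safe: callers may mutate the returned state freely, but every call
    # pays a full copy of the accumulated cursor state.
    return {
        "states": [
            {"partition": k, "cursor": copy.deepcopy(v)} for k, v in cursor_states.items()
        ]
    }


def state_without_copy():
    # Cheap, but the returned dicts alias the live cursor state, so
    # later cursor updates leak into previously emitted snapshots.
    return {
        "states": [{"partition": k, "cursor": v} for k, v in cursor_states.items()]
    }


snapshot = state_with_deepcopy()
cursor_states["partition_0"]["meta"]["n"] = 999  # simulate a later cursor update
# The deep-copied snapshot still holds the old value; an aliased one would not.
```

If cursor states were treated as immutable once converted, the copy could be dropped without the aliasing hazard.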

Contributor:

Is there a reason we're doing a deepcopy here? If so, I think we should document it

Comment on lines +152 to +164
def _ensure_partition_limit(self) -> None:
"""
Ensure the maximum number of partitions is not exceeded. If so, the oldest added partition will be dropped.
"""
while len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1:
self._over_limit += 1
oldest_partition = self._cursor_per_partition.popitem(last=False)[
0
] # Remove the oldest partition
logger.warning(
f"The maximum number of partitions has been reached. Dropping the oldest partition: {oldest_partition}. Over limit: {self._over_limit}."
)


⚠️ Potential issue

Clarification on partition limit logic

In _ensure_partition_limit, we increment _over_limit every time we remove a partition when the limit is reached. However, in limit_reached, we check if _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Is this the intended behavior? Should the condition be adjusted to properly reflect when the limit is truly exceeded? Wdyt?

Comment on lines +165 to +167
def limit_reached(self) -> bool:
return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER


⚠️ Potential issue

Possible off-by-one error in limit_reached method

The method limit_reached returns True when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Given how _over_limit is incremented, could this condition lead to an off-by-one error? Should we revisit this logic to ensure accurate limit checks? Wdyt?
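The lag between the first eviction and limit_reached flipping to True can be reproduced in isolation. This is a stripped-down sketch of the counter logic under discussion (class name and the small limit are illustrative assumptions):

```python
from collections import OrderedDict


class PartitionTracker:
    """Minimal reproduction of the eviction counter, with a small limit."""
    MAX_PARTITIONS = 3

    def __init__(self) -> None:
        self._cursor_per_partition: "OrderedDict[str, object]" = OrderedDict()
        self._over_limit = 0

    def add(self, key: str) -> None:
        self._cursor_per_partition[key] = object()
        # Evict the oldest partition whenever capacity is exceeded,
        # counting every eviction in _over_limit.
        while len(self._cursor_per_partition) > self.MAX_PARTITIONS - 1:
            self._over_limit += 1
            self._cursor_per_partition.popitem(last=False)

    def limit_reached(self) -> bool:
        # Mirrors the reviewed condition: True only once the number of
        # evictions exceeds MAX_PARTITIONS, i.e. well after the first
        # partition has already been dropped.
        return self._over_limit > self.MAX_PARTITIONS


tracker = PartitionTracker()
for i in range(6):
    tracker.add(f"p{i}")
# After 6 adds with a limit of 3, four partitions have been evicted,
# and limit_reached() has only just become True (4 > 3).
```

This illustrates the reviewer's point: limit_reached does not signal "a partition was dropped" but rather "more than MAX_PARTITIONS partitions have been dropped", which may or may not be the intended semantics.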

Comment on lines 333 to 348
)


partition_generator = StreamSlicerPartitionGenerator(
DeclarativePartitionFactory(
declarative_stream.name,
declarative_stream.get_json_schema(),
self._retriever_factory(
name_to_stream_mapping[declarative_stream.name],
config,
stream_state,
),
self.message_repository,
),
cursor,
)

🛠️ Refactor suggestion

Error handling for partition generator creation

In constructing the partition_generator, are there potential edge cases where dependencies might not be properly initialized? Should we add error handling or input validation to make the code more robust against such scenarios? Wdyt?

airbyte_cdk/sources/declarative/incremental/__init__.py (outdated comment, resolved)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

152-156: Could we simplify the partition limit condition in _ensure_partition_limit method?

Currently, the condition is len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1. Would it improve readability to use len(self._cursor_per_partition) >= self.DEFAULT_MAX_PARTITIONS_NUMBER instead? This change might make the maximum partition limit clearer. Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

910-965: Consider extracting common validation logic and fixing indentation.

A few suggestions to improve this new method:

  1. The component type validation logic (lines 922-926) is duplicated from create_concurrent_cursor_from_datetime_based_cursor. Should we extract it into a helper method to follow DRY?
  2. The indentation in the return statement (lines 955-964) seems inconsistent with the rest of the file. Should we align it with the standard indentation?
  3. Instead of using type ignore on line 961, we could properly type message_repository. Wdyt?
     def create_concurrent_cursor_from_perpartition_cursor(
         self,
         state_manager: ConnectorStateManager,
         model_type: Type[BaseModel],
         component_definition: ComponentDefinition,
         stream_name: str,
         stream_namespace: Optional[str],
         config: Config,
         stream_state: MutableMapping[str, Any],
         partition_router,
         **kwargs: Any,
     ) -> ConcurrentPerPartitionCursor:
-        component_type = component_definition.get("type")
-        if component_definition.get("type") != model_type.__name__:
-            raise ValueError(
-                f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
-            )
+        self._validate_component_type(component_definition, model_type)

         # ... rest of the method ...

         return ConcurrentPerPartitionCursor(
-                cursor_factory=cursor_factory,
-                partition_router=partition_router,
-                stream_name=stream_name,
-                stream_namespace=stream_namespace,
-                stream_state=stream_state,
-                message_repository=self._message_repository,  # type: ignore
-                connector_state_manager=state_manager,
-                cursor_field=cursor_field,
-            )
+            cursor_factory=cursor_factory,
+            partition_router=partition_router,
+            stream_name=stream_name,
+            stream_namespace=stream_namespace,
+            stream_state=stream_state,
+            message_repository=self._message_repository,
+            connector_state_manager=state_manager,
+            cursor_field=cursor_field,
+        )

Helper method to add:

def _validate_component_type(self, component_definition: ComponentDefinition, expected_type: Type[BaseModel]) -> None:
    component_type = component_definition.get("type")
    if component_definition.get("type") != expected_type.__name__:
        raise ValueError(
            f"Expected manifest component of type {expected_type.__name__}, but received {component_type} instead"
        )
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)

276-276: Should the test case name match the function name for consistency?

In the parameterization at line 276, the test name is "test_incremental_parent_state", but the test function is named test_incremental_parent_state_no_incremental_dependency. Aligning the test case name with the function name could enhance clarity. Wdyt?


483-495: Could we improve the docstring formatting for better readability?

The docstring contains a detailed explanation of the test, but it's indented inconsistently. Adjusting the indentation could enhance readability and maintain consistency with PEP 257 guidelines. Wdyt?


519-523: Is comparing only the last state sufficient to validate the final state?

In the assertion at lines 519-523, we compare final_state[-1] with expected_state. Should we consider comparing all elements of final_state to ensure that all state messages throughout the sync match the expectations? Wdyt?
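The suggestion to compare every emitted state message rather than only the last one could look like this hypothetical test helper (the data and helper name are illustrative, not from the PR):

```python
emitted_states = [
    {"states": [{"partition": {"id": 1}, "cursor": {"updated_at": "2024-01-01"}}]},
    {"states": [{"partition": {"id": 1}, "cursor": {"updated_at": "2024-01-05"}}]},
]
expected_states = list(emitted_states)  # in a real test, written out explicitly


def assert_all_states_match(actual, expected):
    # Fail fast on a count mismatch, then pinpoint the first differing message.
    assert len(actual) == len(expected), (
        f"expected {len(expected)} state messages, got {len(actual)}"
    )
    for i, (a, e) in enumerate(zip(actual, expected)):
        assert a == e, f"state message {i} differs: {a} != {e}"


assert_all_states_match(emitted_states, expected_states)
```

Checking all intermediate messages catches regressions where the final state is correct but the checkpointing sequence along the way is not.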

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4ddbb84 and dfcf17f.

📒 Files selected for processing (7)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_filter.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/extractors/record_filter.py (1)

78-81: Is it safe to use an empty stream_name when creating Record instances?

In the filter_records method, we're creating Record instances with an empty stream_name. Could this lead to issues if other parts of the codebase expect stream_name to be non-empty? Should we consider passing the actual stream name to prevent potential problems? Wdyt?

airbyte_cdk/sources/streams/concurrent/cursor.py (1)

243-251: Could we clarify the behavior of close_partition_without_emit method?

In the close_partition_without_emit method, we update _has_closed_at_least_one_slice to True, but we don't emit a state message. Is this intended? Should we ensure that the state is eventually emitted elsewhere to keep the state consistent? Wdyt?


tolik0 commented Dec 6, 2024

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

❌ Job failed.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)

Line range hint 341-432: Comprehensive test coverage for cursor types! 👍

The parameterized tests cover all important cursor scenarios:

  • Basic datetime cursor
  • Global substream cursor (with and without state)
  • Per-partition cursor with various state configurations

One thought: would it be valuable to add a test case for error handling when an invalid cursor type is passed? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between dfcf17f and b84e68a.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/extractors/test_record_filter.py (2 hunks)
🔇 Additional comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)

293-293: Nice API simplification! 🎉

I see you've consolidated the cursor parameters (date_time_based_cursor and substream_cursor) into a single cursor parameter. This makes the API cleaner and more flexible. The tests demonstrate it works well with all cursor types (DatetimeBasedCursor, GlobalSubstreamCursor, and PerPartitionWithGlobalCursor).

Also applies to: 431-431


tolik0 commented Dec 6, 2024

/autofix

PR auto-fix job started... Check job output.

✅ Changes applied successfully.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

305-316: Consider extracting condition to a descriptive method

The nested conditions make the code harder to read and maintain. Would you consider extracting these conditions into a separate method with a descriptive name? Wdyt?

Something like:

+def _supports_concurrent_perpartition_cursor(
+    self,
+    declarative_stream: DeclarativeStream,
+    incremental_sync_component_definition: Mapping[str, Any]
+) -> bool:
+    return (
+        incremental_sync_component_definition
+        and incremental_sync_component_definition.get("type", "") == DatetimeBasedCursorModel.__name__
+        and self._stream_supports_concurrent_partition_processing(declarative_stream=declarative_stream)
+        and hasattr(declarative_stream.retriever, "stream_slicer")
+        and isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor)
+    )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b84e68a and a6b5daa.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

84-85: ⚠️ Potential issue

Add missing comma between imports

There's a missing comma between ConcurrentCursorFactory and ConcurrentPerPartitionCursor in the import statement.

Apply this diff:

-    ConcurrentCursorFactory
-    ConcurrentPerPartitionCursor,
+    ConcurrentCursorFactory,
+    ConcurrentPerPartitionCursor,
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (2)

Line range hint 320-358: LGTM! Well-structured comparison logic

The implementation includes proper null checks and validation for partition matching. The error messages are clear and descriptive.


306-319: ⚠️ Potential issue

Add null check for record.associated_slice

The code accesses record.associated_slice.partition without verifying that record.associated_slice is not None. In Python this would raise an AttributeError at runtime.

Consider adding a null check like this:

def should_be_synced(self, record: Record) -> bool:
+   if not record.associated_slice:
+       raise ValueError("Record must have an associated slice")
    if (
        self._to_partition_key(record.associated_slice.partition)
        not in self._cursor_per_partition
    ):
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

317-331: Consider handling potential None stream_state

When retrieving stream_state, we might want to handle the case where it could be None to prevent unexpected errors during cursor creation. Wdyt?

Let's verify the current handling of stream_state:

#!/bin/bash
# Check how stream_state is handled in other parts of the codebase
rg -A 3 "get_stream_state" | grep -v "test"


tolik0 commented Dec 6, 2024


@maxi297 maxi297 left a comment


I'm very happy with the progress on this! I've added a couple of comments more in terms of code maintenance and structure but the functional part seemed fine. I want to check the tests eventually (tomorrow hopefully) but I can still leave a couple of comments here

@@ -299,6 +302,60 @@ def _group_streams(
cursor=final_state_cursor,
)
)
elif (

Contributor:

Should we filter only on list partition router until we support the global cursor part?

@@ -77,7 +75,7 @@ def filter_records(
records = (
record
for record in records
if (self._substream_cursor or self._date_time_based_cursor).should_be_synced(
if self._cursor.should_be_synced(

Contributor:

This change is so beautiful that I want to cry ❤️



class ConcurrentCursorFactory:
def __init__(self, create_function: Callable[..., Cursor]):

Contributor:

Should the input type for Callable be StreamState?

Contributor Author

These are keyword arguments, so we can’t directly enforce types in the Callable signature.
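For illustration, `Callable[..., Cursor]` erases argument types entirely, which is why keyword arguments can't be enforced there; a `typing.Protocol` with an explicit `__call__` signature is one way to type such a factory if stricter checking were wanted. The names below are stand-ins, not the CDK's:

```python
from typing import Any, Mapping, Protocol


class Cursor:
    """Stand-in for the CDK's Cursor interface."""


class CursorCreateFn(Protocol):
    # Unlike Callable[..., Cursor], this pins the keyword name and its type.
    def __call__(self, *, stream_state: Mapping[str, Any]) -> Cursor: ...


def make_cursor(*, stream_state: Mapping[str, Any]) -> Cursor:
    return Cursor()


create: CursorCreateFn = make_cursor  # accepted by a type checker
cursor = create(stream_state={"updated_at": "2024-12-01"})
```

The trade-off is that a Protocol fixes the call shape, which the factory here deliberately leaves open.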

self._set_initial_state(stream_state)

@property
def cursor_field(self) -> CursorField:
Contributor

Should we add this in the Cursor interface? If we put it there, it seems like the return type will need to be Optional because of the FinalStateCursor

Contributor Author

Why does FinalStateCursor need this method?

Contributor

Good point: this indeed seems weird. My question should probably have been: why do we expose that? I only see this used internally.

Contributor Author

We are using cursor_field outside of this class:

                    concurrent_streams.append(
                        DefaultStream(
                            partition_generator=partition_generator,
                            name=declarative_stream.name,
                            json_schema=declarative_stream.get_json_schema(),
                            availability_strategy=AlwaysAvailableAvailabilityStrategy(),
                            primary_key=get_primary_key_from_stream(declarative_stream.primary_key),
                            cursor_field=perpartition_cursor.cursor_field.cursor_field_key,
                            logger=self.logger,
                            cursor=perpartition_cursor,
                        )
                    )

https://github.com/airbytehq/airbyte-python-cdk/blob/cf5107f3d75db401dc1512a32be485558783684d/airbyte_cdk/sources/declarative/concurrent_declarative_source.py#L255

def state(self) -> MutableMapping[str, Any]:
states = []
for partition_tuple, cursor in self._cursor_per_partition.items():
cursor_state = cursor._connector_state_converter.convert_to_state_message(
Contributor

I have two confusions regarding this method:

Why is Cursor.state public?
I know it was already like that before your changes, but I'm trying to understand where it is used and why it is public. Before your changes, it didn't seem like we needed it to be public.

Is there a way we can access the information we need without accessing private parameters?
It feels like what we need to make public here is the representation of the data that is being set as an AirbyteMessage of type state. In other words, the one in ConcurrentCursor._emit_state_message, because when PerPartitionCursor wants to emit the state, it does not care about the implementation of the cursor but cares about setting it in a state message.

Based on the two points above, can we have the state method be more explicit (something like as_state_message) with a comment saying it is used for PerPartitionCursor and it is NOT used for other classes that are not cursors to emit states?
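A minimal sketch of what that rename could look like. The method body and docstring wording here are illustrative, not the CDK implementation:

```python
from typing import Any, Mapping


class ConcurrentCursor:
    def __init__(self, cursor_field: str, cursor_value: Any) -> None:
        self._cursor_field = cursor_field
        self._cursor_value = cursor_value

    def as_state_message(self) -> Mapping[str, Any]:
        """State representation as it is set on an AirbyteMessage of type state.

        Used by PerPartitionCursor when assembling its combined state; NOT
        intended for non-cursor classes that emit states.
        """
        return {self._cursor_field: self._cursor_value}
```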

Contributor Author

Fixed

@@ -240,6 +240,15 @@ def observe(self, record: Record) -> None:
def _extract_cursor_value(self, record: Record) -> Any:
return self._connector_state_converter.parse_value(self._cursor_field.extract_value(record))

def close_partition_without_emit(self, partition: Partition) -> None:
Contributor

It feels like if we add a public method here, it should also be in the public interface. Based on that, I'm wondering if it would be easier to have the ConcurrentCursorFactory set the message_repository as a NoopMessageRepository to prevent the cursors from emitting anything. This solution would work with any type of cursor as well (say we want to support custom cursors soon).
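The injection idea can be sketched like this, with simplified stand-ins for the CDK classes (the real signatures differ):

```python
from typing import Any, Callable, Mapping


class MessageRepository:
    def emit_message(self, message: Any) -> None:
        raise NotImplementedError


class NoopMessageRepository(MessageRepository):
    """Swallows every message, so cursors built with it never emit state."""

    def emit_message(self, message: Any) -> None:
        pass


class ChildCursor:
    def __init__(
        self, stream_state: Mapping[str, Any], message_repository: MessageRepository
    ) -> None:
        self.state = dict(stream_state)
        self._message_repository = message_repository

    def emit_state(self) -> None:
        self._message_repository.emit_message(self.state)


class ConcurrentCursorFactory:
    """Builds per-partition cursors that cannot emit state on their own."""

    def __init__(self, create_function: Callable[..., ChildCursor]) -> None:
        self._create_function = create_function

    def create(self, *, stream_state: Mapping[str, Any]) -> ChildCursor:
        # Injecting a no-op repository keeps the parent per-partition cursor
        # as the single place where state messages are emitted.
        return self._create_function(
            stream_state=stream_state, message_repository=NoopMessageRepository()
        )


factory = ConcurrentCursorFactory(ChildCursor)
child = factory.create(stream_state={"updated_at": "2024-12-01"})
child.emit_state()  # silently dropped by the no-op repository
```

The design keeps the cursor interface unchanged: children still "emit", but the sink decides whether anything leaves the process.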

Contributor Author

Fixed

Comment on lines 90 to 105
def state(self) -> MutableMapping[str, Any]:
states = []
for partition_tuple, cursor in self._cursor_per_partition.items():
cursor_state = cursor._connector_state_converter.convert_to_state_message(
cursor._cursor_field, cursor.state
)
if cursor_state:
states.append(
{
"partition": self._to_dict(partition_tuple),
"cursor": copy.deepcopy(cursor_state),
}
)
state: dict[str, Any] = {"states": states}
return state

Contributor

Is there a reason we're doing a deepcopy here? If so, I think we should document it
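For what the deepcopy would guard against: without it, later mutations of a live cursor's state would leak into the snapshot already collected in `states`. Illustrative values, not the CDK code:

```python
import copy

cursor_state = {"updated_at": "2024-12-01"}
states = [{"partition": {"id": 1}, "cursor": copy.deepcopy(cursor_state)}]

# The live cursor keeps advancing after the snapshot was taken...
cursor_state["updated_at"] = "2024-12-18"

# ...but the snapshot is unaffected because it holds an independent copy.
print(states[0]["cursor"]["updated_at"])  # → 2024-12-01
```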

@@ -178,7 +178,7 @@ def _request_headers(
stream_slice,
next_page_token,
self._paginator.get_request_headers,
self.stream_slicer.get_request_headers,
self.request_option_provider.get_request_headers,
Contributor

@brianjlai was there a reason we were still using the stream_slicer here instead of the request_option_provider?

@tolik0 tolik0 force-pushed the tolik0/concurrent-perpartitioncursor branch from a6b5daa to 79ffb77 on December 18, 2024 16:18
@tolik0
Contributor Author

tolik0 commented Dec 18, 2024

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1)

306-319: ⚠️ Potential issue

Should we add a null check for record.associated_slice?

The code accesses record.associated_slice.partition without verifying if record.associated_slice is not None. This could raise an AttributeError. Consider adding a null check, wdyt?

 def should_be_synced(self, record: Record) -> bool:
+    if not record.associated_slice:
+        raise ValueError("Record must have an associated slice")
     if (
         self._to_partition_key(record.associated_slice.partition)
         not in self._cursor_per_partition
     ):
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1)

244-268: ⚠️ Potential issue

Should we pass the state parameter to the read method?

The state parameter is passed to ConcurrentDeclarativeSource initialization but not to the read method (empty list is passed instead). This might affect state handling. Consider using the state parameter in both places, wdyt?

-    messages = list(source.read(logger=source.logger, config=config, catalog=catalog, state=[]))
+    messages = list(source.read(logger=source.logger, config=config, catalog=catalog, state=state))
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

962-1018: Consider adding error handling for cursor creation

The cursor creation logic could benefit from additional error handling similar to create_concurrent_cursor_from_datetime_based_cursor. For instance, what happens if the cursor factory creation fails? Wdyt?

    def create_concurrent_cursor_from_perpartition_cursor(
        self,
        state_manager: ConnectorStateManager,
        model_type: Type[BaseModel],
        component_definition: ComponentDefinition,
        stream_name: str,
        stream_namespace: Optional[str],
        config: Config,
        stream_state: MutableMapping[str, Any],
        partition_router,
        **kwargs: Any,
    ) -> ConcurrentPerPartitionCursor:
+       try:
            component_type = component_definition.get("type")
            if component_definition.get("type") != model_type.__name__:
                raise ValueError(
                    f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
                )

            # Rest of the implementation...

            return ConcurrentPerPartitionCursor(
                cursor_factory=cursor_factory,
                partition_router=partition_router,
                stream_name=stream_name,
                stream_namespace=stream_namespace,
                stream_state=stream_state,
                message_repository=self._message_repository,
                connector_state_manager=state_manager,
                cursor_field=cursor_field,
            )
+       except Exception as e:
+           raise ValueError(
+               f"Failed to create concurrent cursor for stream '{stream_name}': {str(e)}"
+           ) from e
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6b5daa and a36726b.

📒 Files selected for processing (6)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (7 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • airbyte_cdk/sources/streams/concurrent/cursor.py
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (5)
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)

482-528: LGTM! Well-structured test with clear documentation.

The test case is comprehensive and effectively validates the parent state behavior when incremental dependency is disabled.


530-630: LGTM! Excellent test helper implementation.

The helper function is well-documented and implements thorough validation steps for testing incremental parent state behavior.


1124-1556: LGTM! Comprehensive test coverage.

The test functions thoroughly cover different scenarios including state migration and empty record sets. The test cases are well-structured and documented.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)

Line range hint 814-961: LGTM! Well-structured implementation with proper error handling.

The implementation of create_concurrent_cursor_from_datetime_based_cursor is thorough and handles all edge cases appropriately.


84-85: ⚠️ Potential issue

Add missing comma between imports

There seems to be a missing comma between ConcurrentCursorFactory and ConcurrentPerPartitionCursor. Wdyt?

-    ConcurrentCursorFactory
-    ConcurrentPerPartitionCursor,
+    ConcurrentCursorFactory,
+    ConcurrentPerPartitionCursor,

Likely invalid or redundant comment.

Comment on lines +2087 to 2089
) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
# Many of the custom component implementations of DatetimeBasedCursor override get_request_params() (or other methods).
# Because we're decoupling RequestOptionsProvider from the Cursor, custom components will eventually need to reimplement
Contributor

🛠️ Refactor suggestion

Simplify the type checking condition

The current condition seems overly complex and might cause issues with valid subclasses:

not isinstance(stream_slicer, DatetimeBasedCursor) or type(stream_slicer) is not DatetimeBasedCursor

Should we simplify it to just use isinstance? The current combination of isinstance and type checks with OR could unexpectedly reject valid subclasses of DatetimeBasedCursor. Wdyt?

-        if (
-            not isinstance(stream_slicer, DatetimeBasedCursor)
-            or type(stream_slicer) is not DatetimeBasedCursor
-        ) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
+        if not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor)):
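As a quick illustration of why the two conditions differ (with hypothetical classes): `isinstance` accepts subclasses while `type(...) is ...` does not, so the original and simplified checks diverge exactly for subclasses of DatetimeBasedCursor:

```python
class DatetimeBasedCursor: ...
class PerPartitionWithGlobalCursor: ...
class CustomDatetimeCursor(DatetimeBasedCursor): ...


slicer = CustomDatetimeCursor()

# Original condition: the type(...) check makes any subclass pass.
original = (
    not isinstance(slicer, DatetimeBasedCursor)
    or type(slicer) is not DatetimeBasedCursor
) and not isinstance(slicer, PerPartitionWithGlobalCursor)

# Simplified condition: subclasses are treated like the base class.
simplified = not isinstance(slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor))

print(original, simplified)  # → True False
```

So the simplification changes behavior for custom subclasses; whether that is a fix or a regression depends on the intent behind the original check.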

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed687f5 and cfef872.

📒 Files selected for processing (4)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
  • unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/concurrent_declarative_source.py

[error] 327-327: Incompatible types in assignment: ConcurrentPerPartitionCursor cannot be assigned to ConcurrentCursor

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py

[error] 1031-1031: Function is missing type annotations for one or more arguments

airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py

[error] 89-91: Missing type annotations for variables: _state, _semaphore_per_partition, _finished_partitions


[error] 110-148: Multiple attribute errors: Cursor and Partition objects missing required attributes


[error] 167-169: Type incompatibility in assignments


[error] 266-269: Type incompatibility in assignments


[error] 309-321: Return type errors and missing attributes in Cursor class

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)

89-91: ⚠️ Potential issue

Add type annotations to class variables.

The pipeline failures indicate missing type annotations. Should we add them to improve type safety? Wdyt?

-self._state = {"states": []}
-self._semaphore_per_partition = OrderedDict()
-self._finished_partitions = set()
+self._state: Dict[str, Any] = {"states": []}
+self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
+self._finished_partitions: Set[str] = set()

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Actions: Linters

[error] 89-91: Missing type annotations for variables: _state, _semaphore_per_partition, _finished_partitions


167-169: ⚠️ Potential issue

Handle potential None values in assignments.

The pipeline failures indicate type incompatibility in assignments. Should we add type checks for None values? Wdyt?

-            self._global_cursor = self._new_global_cursor
-            self._lookback_window = self._timer.finish()
-            self._parent_state = self._partition_router.get_stream_state()
+            self._global_cursor = self._new_global_cursor
+            lookback_window = self._timer.finish()
+            if lookback_window is not None:
+                self._lookback_window = lookback_window
+            parent_state = self._partition_router.get_stream_state()
+            if parent_state is not None:
+                self._parent_state = parent_state

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Actions: Linters

[error] 167-169: Type incompatibility in assignments
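For context on the `_timer.finish()` call above: the lookback window comes from a timer that measures how long the sync ran, and this PR also adds an is_running check to the Timer class. A minimal version might look like the following sketch, under assumed semantics rather than the actual CDK Timer:

```python
import time
from typing import Optional


class Timer:
    """Measures how long a sync ran, to derive a lookback window in seconds."""

    def __init__(self) -> None:
        self._start: Optional[float] = None

    def start(self) -> None:
        self._start = time.perf_counter()

    def is_running(self) -> bool:
        return self._start is not None

    def finish(self) -> Optional[int]:
        if self._start is None:
            return None  # finish() called before start(): no window available
        elapsed = int(time.perf_counter() - self._start)
        self._start = None
        return elapsed
```

The Optional return type is exactly why the assignment to `_lookback_window` trips the type checker above: the None case has to be handled somewhere.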

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1031-1086: ⚠️ Potential issue

Add type annotations to method parameters.

The pipeline failures indicate missing type annotations. Should we add type hints to improve type safety? Wdyt?

    def create_concurrent_cursor_from_perpartition_cursor(
        self,
-       state_manager,
-       model_type,
-       component_definition,
-       stream_name,
-       stream_namespace,
-       config,
-       stream_state,
-       partition_router,
+       state_manager: ConnectorStateManager,
+       model_type: Type[BaseModel],
+       component_definition: ComponentDefinition,
+       stream_name: str,
+       stream_namespace: Optional[str],
+       config: Config,
+       stream_state: MutableMapping[str, Any],
+       partition_router: PartitionRouter,
        **kwargs: Any,
    ) -> ConcurrentPerPartitionCursor:

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Actions: Linters

[error] 1031-1031: Function is missing type annotations for one or more arguments

unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1)

493-531: LGTM! Comprehensive test coverage.

The test cases thoroughly validate the concurrent partition cursor functionality, including edge cases and error scenarios.

Comment on lines 110 to 148
cursor_state = cursor._connector_state_converter.convert_to_state_message(
self.cursor_field, cursor.state
)
if cursor_state:
states.append(
{
"partition": self._to_dict(partition_tuple),
"cursor": copy.deepcopy(cursor_state),
}
)
state: dict[str, Any] = {"states": states}

if self._global_cursor:
state["state"] = self._global_cursor
if self._lookback_window is not None:
state["lookback_window"] = self._lookback_window
if self._parent_state is not None:
state["parent_state"] = self._parent_state
return state

def close_partition(self, partition: Partition) -> None:
self._cursor_per_partition[
self._to_partition_key(partition._stream_slice.partition)
].close_partition(partition=partition)
with self._lock:
self._semaphore_per_partition[
self._to_partition_key(partition._stream_slice.partition)
].acquire()
cursor = self._cursor_per_partition[
self._to_partition_key(partition._stream_slice.partition)
]
cursor_state = cursor._connector_state_converter.convert_to_state_message(
cursor._cursor_field, cursor.state
)
if (
self._to_partition_key(partition._stream_slice.partition)
in self._finished_partitions
and self._semaphore_per_partition[
self._to_partition_key(partition._stream_slice.partition)
Contributor

🛠️ Refactor suggestion

Consider encapsulating cursor state conversion.

The code directly accesses private attributes like cursor._connector_state_converter and cursor._cursor_field. Should we add public methods to handle state conversion instead of accessing private attributes? Wdyt?

🧰 Tools
🪛 GitHub Actions: Linters

[error] 110-148: Multiple attribute errors: Cursor and Partition objects missing required attributes
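Stepping back, the semaphore-per-partition mechanism in the lines above can be sketched independently of the CDK types: each generated slice releases one permit, each close acquires one, and a partition's state is safe to emit only once slice generation is finished and every permit has been drained. This is a simplified sketch; the real code inspects the semaphore directly and tracks more state:

```python
import threading
from collections import OrderedDict
from typing import Dict, List, Set


class PartitionTracker:
    """Emit a partition's state only after all of its slices have closed."""

    def __init__(self) -> None:
        self._semaphores: "OrderedDict[str, threading.Semaphore]" = OrderedDict()
        self._pending: Dict[str, int] = {}
        self._finished: Set[str] = set()
        self._lock = threading.Lock()
        self.emitted: List[str] = []

    def register_slice(self, partition_key: str) -> None:
        # One semaphore permit per generated slice.
        with self._lock:
            sem = self._semaphores.setdefault(partition_key, threading.Semaphore(0))
            sem.release()
            self._pending[partition_key] = self._pending.get(partition_key, 0) + 1

    def finish_generation(self, partition_key: str) -> None:
        self._finished.add(partition_key)

    def close_slice(self, partition_key: str) -> None:
        with self._lock:
            self._semaphores[partition_key].acquire()
            self._pending[partition_key] -= 1
            # Safe to emit: no slice in flight and none left to generate.
            if partition_key in self._finished and self._pending[partition_key] == 0:
                self.emitted.append(partition_key)
```

Encapsulating this bookkeeping behind a small class is one way to address the private-attribute access flagged above.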

Comment on lines 327 to 336
cursor = self._constructor.create_concurrent_cursor_from_perpartition_cursor(
state_manager=state_manager,
model_type=DatetimeBasedCursorModel,
component_definition=incremental_sync_component_definition,
stream_name=declarative_stream.name,
stream_namespace=declarative_stream.namespace,
config=config or {},
stream_state=stream_state,
partition_router=partition_router,
)
Contributor

⚠️ Potential issue

Fix type incompatibility in cursor assignment.

The pipeline failures indicate that ConcurrentPerPartitionCursor cannot be assigned to ConcurrentCursor. Should we update the type hints or add type casting? Wdyt?

🧰 Tools
🪛 GitHub Actions: Linters

[error] 327-327: Incompatible types in assignment: ConcurrentPerPartitionCursor cannot be assigned to ConcurrentCursor

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
airbyte_cdk/sources/streams/concurrent/cursor.py (1)

199-201: Consider documenting the state conversion logic.

The state property now uses convert_to_state_message to transform the state. Should we add a docstring explaining this conversion process and its implications? Wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cfef872 and 4260415.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/cursor.py (7 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py

[error] 89-89: Need type annotation for '_state'


[error] 90-90: Need type annotation for '_semaphore_per_partition'


[error] 91-91: Need type annotation for '_finished_partitions'


[error] 129-142: Multiple instances of 'Partition' has no attribute '_stream_slice'


[error] 151-161: Multiple type incompatibility issues in assignments


[error] 303-303: Returning Any from function declared to return 'bool' and 'Cursor' has no attribute 'should_be_synced'


[error] 315-315: Returning Any from function declared to return 'bool' and 'Cursor' has no attribute 'is_greater_than_or_equal'

⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (6)
airbyte_cdk/sources/streams/concurrent/cursor.py (3)

246-249: LGTM! Improved state management.

The change to use _concurrent_state for slice counting provides better encapsulation and consistency with other state operations.


Line range hint 261-272: Validate state structure before accessing slices.

Good addition of the validation check! This helps prevent runtime errors when the state structure is invalid.


Line range hint 387-399: Consider adding error handling for boundary calculations.

The slice boundary calculations look correct, but should we add try-catch blocks to handle potential type mismatches or invalid values? Wdyt?

airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1)

306-320: LGTM! Robust validation for record comparisons.

The validation checks for associated_slice and partition matching are thorough and will help catch potential issues early.

airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)

89-91: ⚠️ Potential issue

Add type annotations to class attributes.

The linter indicates missing type annotations. Should we add them like this? Wdyt?

-self._state = {"states": []}
-self._semaphore_per_partition = OrderedDict()
-self._finished_partitions = set()
+self._state: MutableMapping[str, Any] = {"states": []}
+self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
+self._finished_partitions: set[str] = set()

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Actions: Linters

[error] 89-89: Need type annotation for '_state'


[error] 90-90: Need type annotation for '_semaphore_per_partition'


[error] 91-91: Need type annotation for '_finished_partitions'


151-161: ⚠️ Potential issue

Handle potential None values in state assignments.

The assignments might fail due to type incompatibility with None values. Should we add type checks? Wdyt?

-self._global_cursor = self._new_global_cursor
-self._lookback_window = self._timer.finish()
-self._parent_state = self._partition_router.get_stream_state()
+self._global_cursor = self._new_global_cursor if self._new_global_cursor is not None else {}
+lookback_window = self._timer.finish()
+if lookback_window is not None:
+    self._lookback_window = lookback_window
+parent_state = self._partition_router.get_stream_state()
+if parent_state is not None:
+    self._parent_state = parent_state

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Actions: Linters

[error] 151-161: Multiple type incompatibility issues in assignments

return cursor

def should_be_synced(self, record: Record) -> bool:
return self._get_cursor(record).should_be_synced(record)
Contributor

⚠️ Potential issue

Fix return type annotations in cursor methods.

The methods should_be_synced and is_greater_than_or_equal are returning Any instead of bool. Should we update the Cursor base class to include these methods in its interface? Wdyt?

Consider updating the Cursor base class to include these methods:

class Cursor(ABC):
    @abstractmethod
    def should_be_synced(self, record: Record) -> bool:
        pass

    @abstractmethod
    def is_greater_than_or_equal(self, first: Record, second: Record) -> bool:
        pass

Also applies to: 315-315

🧰 Tools
🪛 GitHub Actions: Linters

[error] 303-303: Returning Any from function declared to return 'bool' and 'Cursor' has no attribute 'should_be_synced'

Comment on lines 129 to 142
self._to_partition_key(partition._stream_slice.partition)
].close_partition(partition=partition)
with self._lock:
self._semaphore_per_partition[
self._to_partition_key(partition._stream_slice.partition)
].acquire()
cursor = self._cursor_per_partition[
self._to_partition_key(partition._stream_slice.partition)
]
if (
self._to_partition_key(partition._stream_slice.partition)
in self._finished_partitions
and self._semaphore_per_partition[
self._to_partition_key(partition._stream_slice.partition)
Contributor

⚠️ Potential issue

Fix attribute access in partition handling.

The code is trying to access _stream_slice which doesn't exist on the Partition class. Should we update the attribute access? Wdyt?

-self._to_partition_key(partition._stream_slice.partition)
+self._to_partition_key(partition.to_slice().partition)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-self._to_partition_key(partition._stream_slice.partition)
+self._to_partition_key(partition.to_slice().partition)
 ].close_partition(partition=partition)
 with self._lock:
 self._semaphore_per_partition[
-self._to_partition_key(partition._stream_slice.partition)
+self._to_partition_key(partition.to_slice().partition)
 ].acquire()
 cursor = self._cursor_per_partition[
-self._to_partition_key(partition._stream_slice.partition)
+self._to_partition_key(partition.to_slice().partition)
 ]
 if (
-self._to_partition_key(partition._stream_slice.partition)
+self._to_partition_key(partition.to_slice().partition)
 in self._finished_partitions
 and self._semaphore_per_partition[
-self._to_partition_key(partition._stream_slice.partition)
+self._to_partition_key(partition.to_slice().partition)
🧰 Tools
🪛 GitHub Actions: Linters

[error] 129-142: Multiple instances of 'Partition' has no attribute '_stream_slice'

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
unit_tests/sources/declarative/test_concurrent_declarative_source.py (3)

654-655: Consider adding more context to the comment.

The comment describes the stream counts but doesn't explain why they are grouped this way. Would you consider adding more context about the stream types and their characteristics, wdyt?

-    # 1 full refresh stream, 2 incremental streams, 1 substream w/o incremental, 1 list based substream w/o incremental
-    # 1 async job stream, 1 substream w/ incremental
+    # Stream grouping breakdown:
+    # - 1 full refresh stream (palaces)
+    # - 2 incremental streams (party_members, locations)
+    # - 1 substream without incremental (party_members_skills)
+    # - 1 list based substream without incremental (arcana_personas)
+    # - 1 async job stream (async_job_stream)
+    # - 1 substream with incremental (palace_enemies)

1459-1462: Consider improving the comment clarity.

The comment could be more explicit about which streams fall into each category. Would you consider adding more details, wdyt?

-    # 1 full refresh stream, 2 with parent stream without incremental dependency, 1 stream with async retriever, 1 incremental with parent stream (palace_enemies)
-    assert len(concurrent_streams) == 5
-    # 2 incremental stream with interpolation on state (locations and party_members)
-    assert len(synchronous_streams) == 2
+    # Concurrent streams (5):
+    # - 1 full refresh stream (palaces)
+    # - 2 streams with parent stream without incremental dependency (party_members_skills, arcana_personas)
+    # - 1 stream with async retriever (async_job_stream)
+    # - 1 incremental stream with parent stream (palace_enemies)
+    assert len(concurrent_streams) == 5
+    
+    # Synchronous streams (2):
+    # - 2 incremental streams with state interpolation (locations, party_members)
+    assert len(synchronous_streams) == 2

Line range hint 1465-1601: The new test case looks good but could benefit from some documentation.

The test case effectively verifies that streams with partition routing and incremental sync are processed concurrently. However, it would be helpful to add docstring documentation explaining the test's purpose and setup.

Consider adding a docstring like this:

 def test_given_partition_routing_and_incremental_sync_then_stream_is_concurrent():
+    """
+    Verifies that a stream with both partition routing and incremental sync capabilities
+    is processed concurrently. This test specifically:
+    1. Sets up a stream with ListPartitionRouter and DatetimeBasedCursor
+    2. Configures the stream for incremental sync mode
+    3. Verifies that the stream is grouped as concurrent rather than synchronous
+    """
     manifest = {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4260415 and 089137f.

📒 Files selected for processing (1)
  • unit_tests/sources/declarative/test_concurrent_declarative_source.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/declarative/test_concurrent_declarative_source.py (1)

663-663: The assertion sequence looks good.

The assertions correctly verify the number and types of concurrent streams after adding the new palace_enemies stream.

Also applies to: 676-678

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

2194-2194: Simplify type checking condition.

The type checking condition seems overly complex. Should we simplify it to just use isinstance? Wdyt?

-        if (
-            not isinstance(stream_slicer, DatetimeBasedCursor)
-            or type(stream_slicer) is not DatetimeBasedCursor
-        ) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
+        if not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor)):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 089137f and 3489c7a.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (8 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py

[error] 122-122: "Partition" has no attribute "_stream_slice"


[error] 270-270: Item "None" of "StreamSlice | None" has no attribute "partition"


[error] 288-288: Returning Any from function declared to return "bool"


[error] 288-288: "Cursor" has no attribute "should_be_synced"


[error] 300-300: Returning Any from function declared to return "bool"


[error] 300-300: "Cursor" has no attribute "is_greater_than_or_equal"


[error] 309-309: Argument 2 to "Record" has incompatible type "StreamSlice | None"; expected "str"

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (8)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (5)

122-123: ⚠️ Potential issue

Fix attribute access in partition handling.

The pipeline failures indicate that Partition has no attribute _stream_slice. Should we update this to use a public method instead? Wdyt?

-partition_key = self._to_partition_key(partition._stream_slice.partition)
+partition_key = self._to_partition_key(partition.to_slice().partition)

Likely invalid or redundant comment.

🧰 Tools
🪛 GitHub Actions: Linters

[error] 122-122: "Partition" has no attribute "_stream_slice"


146-148: ⚠️ Potential issue

Handle potential None values in state updates.

The assignments to self._lookback_window and self._parent_state don't handle potential None returns from self._timer.finish() and self._partition_router.get_stream_state(). Should we add type checks? Wdyt?

-            self._lookback_window = self._timer.finish()
-            self._parent_state = self._partition_router.get_stream_state()
+            lookback_window = self._timer.finish()
+            if lookback_window is not None:
+                self._lookback_window = lookback_window
+            parent_state = self._partition_router.get_stream_state()
+            if parent_state is not None:
+                self._parent_state = parent_state

Likely invalid or redundant comment.


84-86: 🛠️ Refactor suggestion

Add type annotations to instance variables.

Several instance variables are missing type annotations, which is causing linting errors. Should we add them? Wdyt?

-        self._cursor_per_partition: OrderedDict[str, Cursor] = OrderedDict()
-        self._semaphore_per_partition = OrderedDict()
-        self._finished_partitions = set()
+        self._cursor_per_partition: OrderedDict[str, Cursor] = OrderedDict()
+        self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
+        self._finished_partitions: set[str] = set()

Likely invalid or redundant comment.


192-204: ⚠️ Potential issue

Consider adjusting partition limit logic.

The _over_limit counter is incremented every time we remove a partition, but limit_reached() checks if it exceeds DEFAULT_MAX_PARTITIONS_NUMBER. This could lead to incorrect limit handling. Should we adjust this to count active partitions instead? Wdyt?

-        while len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1:
-            self._over_limit += 1
+        current_partition_count = len(self._cursor_per_partition)
+        if current_partition_count > self.DEFAULT_MAX_PARTITIONS_NUMBER:
+            self._over_limit = current_partition_count - self.DEFAULT_MAX_PARTITIONS_NUMBER

Likely invalid or redundant comment.


101-119: 🛠️ Refactor suggestion

Optimize state generation to avoid deep copies.

The state property performs deep copies of cursor states for each partition, which could be a performance bottleneck with many partitions. Should we consider a more efficient approach? Wdyt?

-                        "cursor": copy.deepcopy(cursor.state),
+                        "cursor": cursor.state.copy() if isinstance(cursor.state, dict) else cursor.state,

Likely invalid or redundant comment.

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)

322-324: 🛠️ Refactor suggestion

Handle potential None in stream_state retrieval.

The get_stream_state call doesn't handle potential None returns. Should we add a default value? Wdyt?

-                    stream_state = state_manager.get_stream_state(
-                        stream_name=declarative_stream.name, namespace=declarative_stream.namespace
-                    )
+                    stream_state = state_manager.get_stream_state(
+                        stream_name=declarative_stream.name, namespace=declarative_stream.namespace
+                    ) or {}

Likely invalid or redundant comment.


325-326: ⚠️ Potential issue

Add error handling for partition router access.

Direct access to _partition_router could fail if intermediate attributes don't exist. Should we add error handling? Wdyt?

-            partition_router = declarative_stream.retriever.stream_slicer._partition_router
+            try:
+                partition_router = declarative_stream.retriever.stream_slicer._partition_router
+            except AttributeError as e:
+                raise ValueError(
+                    f"Failed to access partition router for stream {declarative_stream.name}: {str(e)}"
+                ) from e

Likely invalid or redundant comment.

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)

1031-1086: 🛠️ Refactor suggestion

Add error handling for cursor creation.

The cursor creation logic could benefit from additional error handling. Should we wrap it in a try-catch block? Wdyt?

    def create_concurrent_cursor_from_perpartition_cursor(
        self,
        ...
    ) -> ConcurrentPerPartitionCursor:
+       try:
            # ... existing implementation ...
            return ConcurrentPerPartitionCursor(
                cursor_factory=cursor_factory,
                partition_router=partition_router,
                stream_name=stream_name,
                stream_namespace=stream_namespace,
                stream_state=stream_state,
                message_repository=self._message_repository,
                connector_state_manager=state_manager,
                cursor_field=cursor_field,
            )
+       except Exception as e:
+           raise ValueError(
+               f"Failed to create concurrent cursor for stream '{stream_name}': {str(e)}"
+           ) from e

Likely invalid or redundant comment.

Comment on lines +287 to +289
def should_be_synced(self, record: Record) -> bool:
return self._get_cursor(record).should_be_synced(record)

Contributor

⚠️ Potential issue

Add type hints to cursor methods.

The methods should_be_synced and is_greater_than_or_equal are causing type errors. Should we update the base Cursor class to include these methods in its interface? Wdyt?

class Cursor(ABC):
+    @abstractmethod
+    def should_be_synced(self, record: Record) -> bool:
+        pass
+
+    @abstractmethod
+    def is_greater_than_or_equal(self, first: Record, second: Record) -> bool:
+        pass

Also applies to: 300-303

🧰 Tools
🪛 GitHub Actions: Linters

[error] 288-288: Returning Any from function declared to return "bool"


[error] 288-288: "Cursor" has no attribute "should_be_synced"

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3489c7a and 301bd31.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py

[error] 124-124: Item "Mapping[str, Any]" of "Mapping[str, Any] | None" has no attribute "partition"


[error] 124-124: Item "None" of "Mapping[str, Any] | None" has no attribute "partition"


[error] 272-272: Item "None" of "StreamSlice | None" has no attribute "partition"

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

291-292: Ensure should_be_synced returns bool as expected

In the should_be_synced method, we're returning self._get_cursor(record).should_be_synced(record). Could we confirm that should_be_synced returns a bool value? Also, does the base class Cursor define this method to ensure consistency across implementations? Wdyt?


def close_partition(self, partition: Partition) -> None:
partition_key = self._to_partition_key(partition.to_slice().partition)
self._cursor_per_partition[partition_key].close_partition(partition=partition)
Contributor

⚠️ Potential issue

Is there a need to handle None when accessing partition.to_slice().partition?

In close_partition, we're accessing partition.to_slice().partition, but if partition.to_slice() or partition.to_slice().partition is None, it might lead to errors. Should we include a check to ensure these are not None before proceeding? Wdyt?
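For illustration, a fail-fast guard could look like the sketch below. This is not the PR's code; it models the slice as a plain mapping with a `"partition"` key to show the shape of the check, whereas the CDK types differ.

```python
from typing import Any, Mapping, Optional


def extract_partition(stream_slice: Optional[Mapping[str, Any]]) -> Mapping[str, Any]:
    # Fail fast with a clear message instead of letting a None slice
    # surface later as an opaque AttributeError.
    if stream_slice is None:
        raise ValueError("close_partition received a partition with no stream slice")
    partition = stream_slice.get("partition")
    if partition is None:
        raise ValueError("close_partition received a slice without partition data")
    return partition
```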

Comment on lines +145 to +147
if not any(
semaphore_item[1]._value for semaphore_item in self._semaphore_per_partition.items()
):
Contributor

🛠️ Refactor suggestion

Avoid accessing private attributes like _value of Semaphore

In the check if not any(semaphore_item[1]._value for semaphore_item in self._semaphore_per_partition.items()):, we're accessing the internal _value attribute of threading.Semaphore, which is intended to be private. Should we consider an alternative approach to track the semaphore state without accessing private attributes? Wdyt?
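One alternative is a thin wrapper that maintains its own counter under a lock, so callers never need to reach into `Semaphore._value`. A minimal sketch (the class name and API are illustrative, not from the PR):

```python
import threading


class CountingSemaphore:
    """Semaphore wrapper that exposes its current value without
    touching the private `_value` attribute of threading.Semaphore."""

    def __init__(self, value: int = 1) -> None:
        self._semaphore = threading.Semaphore(value)
        self._count = value
        self._lock = threading.Lock()

    def acquire(self) -> None:
        self._semaphore.acquire()
        with self._lock:
            self._count -= 1

    def release(self) -> None:
        self._semaphore.release()
        with self._lock:
            self._count += 1

    @property
    def available(self) -> int:
        # Thread-safe read of the tracked counter.
        with self._lock:
            return self._count
```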

@tolik0 tolik0 requested a review from maxi297 January 15, 2025 17:30
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)

31-32: Consider adding return type hint to create_function parameter.

The create_function parameter could benefit from a more explicit type hint. Wdyt about updating it like this?

-    def __init__(self, create_function: Callable[..., ConcurrentCursor]):
+    def __init__(self, create_function: Callable[[Mapping[str, Any], Any], ConcurrentCursor]):

201-213: Consider adding partition removal notification.

When removing old partitions, it might be helpful to notify the system about the removed partition. Would it make sense to add an event or callback mechanism? This could help with monitoring and debugging. Wdyt?
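One shape such a hook could take (purely illustrative; no such callback exists in the PR, and all names here are assumptions):

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, Optional


class PartitionStore:
    def __init__(
        self,
        max_partitions: int,
        on_evict: Optional[Callable[[str, Dict[str, Any]], None]] = None,
    ) -> None:
        self._partitions: "OrderedDict[str, Dict[str, Any]]" = OrderedDict()
        self._max = max_partitions
        self._on_evict = on_evict  # invoked with (key, state) on each removal

    def add(self, key: str, state: Dict[str, Any]) -> None:
        # Evict the oldest partitions first, notifying the observer so
        # removals can be logged or monitored.
        while len(self._partitions) >= self._max:
            evicted_key, evicted_state = self._partitions.popitem(last=False)
            if self._on_evict:
                self._on_evict(evicted_key, evicted_state)
        self._partitions[key] = state
```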


293-301: Add type hint for runtime_lookback_window parameter.

The runtime_lookback_window parameter could benefit from a more specific type hint than Any. Since it's converted to timedelta, should we use Optional[int]? Wdyt?

     def _create_cursor(
-        self, cursor_state: Any, runtime_lookback_window: Any = None
+        self, cursor_state: Any, runtime_lookback_window: Optional[int] = None
     ) -> ConcurrentCursor:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 301bd31 and 9574f8c.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)

278-284: 🛠️ Refactor suggestion

Add null check for associated_slice.partition.

The code assumes record.associated_slice.partition exists, but associated_slice could be None. Should we add a null check? Wdyt?

     def observe(self, record: Record) -> None:
         if not record.associated_slice:
             raise ValueError(
                 "Invalid state as stream slices that are emitted should refer to an existing cursor"
             )
+        if not hasattr(record.associated_slice, 'partition'):
+            raise ValueError("Stream slice must have a partition")
         self._cursor_per_partition[
             self._to_partition_key(record.associated_slice.partition)
         ].observe(record)

Likely invalid or redundant comment.


152-154: 🛠️ Refactor suggestion

Consider using a thread-safe way to check semaphore state.

We're accessing the private _value attribute of threading.Semaphore. Would it be better to implement a thread-safe method to check the semaphore state? For example:

-        if not any(
-            semaphore_item[1]._value for semaphore_item in self._semaphore_per_partition.items()
-        ):
+        if not any(
+            self._is_semaphore_acquired(sem) for _, sem in self._semaphore_per_partition.items()
+        ):

# Add this helper method
+    def _is_semaphore_acquired(self, semaphore: threading.Semaphore) -> bool:
+        # Try to acquire without blocking
+        acquired = semaphore.acquire(blocking=False)
+        if acquired:
+            semaphore.release()
+            return False
+        return True

Likely invalid or redundant comment.


214-216: Review partition limit logic.

The limit_reached method returns True when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER, but _over_limit is incremented each time we remove a partition. Should this condition be >= instead of >? Wdyt?

     def limit_reached(self) -> bool:
-        return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER
+        return self._over_limit >= self.DEFAULT_MAX_PARTITIONS_NUMBER
✅ Verification successful

The > operator is correct here; let me explain why.

The current implementation creates an intentional buffer zone:

  • Partitions start getting removed when reaching DEFAULT_MAX_PARTITIONS_NUMBER
  • Switch to global cursor happens only after removing DEFAULT_MAX_PARTITIONS_NUMBER + 1 partitions

This buffer helps prevent premature switching to global cursor due to temporary partition spikes, which aligns with the documented behavior of ensuring "switching is only done when a sustained high number of partitions is observed."
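The buffer behavior can be seen in a small sketch of the eviction counter (names mirror the diff, but this is a toy model, not the PR's code; the cap is shrunk to 3 for illustration):

```python
from collections import OrderedDict

DEFAULT_MAX_PARTITIONS_NUMBER = 3


class PartitionTracker:
    def __init__(self) -> None:
        self._cursor_per_partition: "OrderedDict[str, dict]" = OrderedDict()
        self._over_limit = 0  # number of partitions evicted so far

    def add(self, key: str) -> None:
        # Mirrors the diff: evict the oldest partitions once the cap is hit.
        while len(self._cursor_per_partition) > DEFAULT_MAX_PARTITIONS_NUMBER - 1:
            self._over_limit += 1
            self._cursor_per_partition.popitem(last=False)
        self._cursor_per_partition[key] = {}

    def limit_reached(self) -> bool:
        # `>` means the switch happens only after evicting
        # DEFAULT_MAX_PARTITIONS_NUMBER + 1 partitions, i.e. after a
        # sustained overflow rather than a single spike.
        return self._over_limit > DEFAULT_MAX_PARTITIONS_NUMBER
```

With a cap of 3, the limit is reached only on the 7th partition (4 evictions), not the 6th, which is the buffer described above.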

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Find all occurrences of _over_limit and DEFAULT_MAX_PARTITIONS_NUMBER
echo "=== Searching for _over_limit usage ==="
rg "_over_limit" -B 2 -A 2

echo -e "\n=== Searching for DEFAULT_MAX_PARTITIONS_NUMBER usage ==="
rg "DEFAULT_MAX_PARTITIONS_NUMBER" -B 2 -A 2

echo -e "\n=== Looking for test files related to concurrent partition cursor ==="
fd "test.*concurrent.*partition.*\.py$"

Length of output: 14889

Contributor

@maxi297 maxi297 left a comment

I don't think there is anything blocking but I added a couple of comments. I'll re-review whenever you want and approve once most of these are resolved

declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor
)
):
stream_state = state_manager.get_stream_state(
Contributor

The inside of the elif clause seems very similar to the one for if self._is_datetime_incremental_without_partition_routing(declarative_stream, incremental_sync_component_definition). Should we merge them?

Contributor Author

Added helper method

self._set_initial_state(stream_state)

@property
def cursor_field(self) -> CursorField:
Contributor

Good point: this indeed seems weird. My question should probably have been: why do we expose that? I only see this used internally.

does not have parent streams, this step will be skipped due to the default PartitionRouter implementation.

Args:
stream_state (StreamState): The state of the streams to be set. The format of the stream state should be:
Contributor

Can we add the global state/lookback window parts to the doc as well?

Contributor Author

Updated

for partition in slices:
yield from self.generate_slices_from_partition(partition)

def generate_slices_from_partition(self, partition: StreamSlice) -> Iterable[StreamSlice]:
Contributor

nit: can we prefix with _ to make it clear it is private?

Contributor Author

Fixed


def stream_slices(self) -> Iterable[StreamSlice]:
slices = self._partition_router.stream_slices()
self._timer.start()
Contributor

nit: It's probably fine because we control the CDK and we don't expect people to call stream_slices, but it would be weird if stream_slices was called more than once. Should we at least add a comment mentioning that the assumption is that it's only called once per sync?
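One way to enforce the call-once assumption is a simple guard that raises on a second invocation. A toy sketch (not the CDK code; the class and error message are illustrative):

```python
from typing import Dict, Iterable, List


class SliceGenerator:
    def __init__(self, partitions: List[Dict]) -> None:
        self._partitions = partitions
        self._started = False

    def stream_slices(self) -> Iterable[Dict]:
        # The cursor's bookkeeping (timer, semaphores) assumes a single
        # pass over the slices, so a second call is a programming error.
        if self._started:
            raise RuntimeError("stream_slices has already been called once")
        self._started = True
        yield from self._partitions
```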

Contributor

We agreed in a sync that we can raise an error here

Contributor Author

Fixed

state: dict[str, Any] = {"states": states}

if self._global_cursor:
state["state"] = self._global_cursor
Contributor

Would it make sense to have this as global_state instead of just state? Is this a backward compatibility thing? I'm wondering because just state is not so explicit given that there are per_partition_state, global_state, parent_state, etc...

Contributor Author

Yes, this follows the same format used in declarative streams. Would you like me to create an issue to refactor it later for better clarity?

Contributor

Ok, that makes sense. I think for now, we could just extract this in a constant called _GLOBAL_STATE_KEY with a comment as to why it is only "state" and it would be more than enough. Thanks for the context!

Contributor Author

Fixed, also added constant for PerPartition key
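The resulting layout could look roughly like this (a sketch; the constant names follow the thread, the `"state"`/`"states"` values come from the existing declarative state format, and `build_state` is a hypothetical helper):

```python
from typing import Any, Dict, List

# "state" is kept (rather than "global_state") for backward compatibility
# with the state format already emitted by declarative per-partition cursors.
_GLOBAL_STATE_KEY = "state"
_PERPARTITION_STATE_KEY = "states"


def build_state(
    states: List[Dict[str, Any]], global_cursor: Dict[str, Any]
) -> Dict[str, Any]:
    state: Dict[str, Any] = {_PERPARTITION_STATE_KEY: states}
    if global_cursor:
        state[_GLOBAL_STATE_KEY] = global_cursor
    return state
```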

) -> List[AirbyteMessage]:
catalog = ConfiguredAirbyteCatalog(
streams=[
ConfiguredAirbyteStream(
Contributor

nit: Using ConfiguredAirbyteStreamBuilder will make this more maintainable: if the format changes, we have a layer of protection to keep this from breaking

Contributor Author

Fixed

@@ -0,0 +1,2215 @@
# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
Contributor

Out of scope (just a comment to see if we can do things differently next time): I like how exhaustive the tests are. I also feel they might be a bit hard to maintain for someone who does not have prior knowledge of this code. I'm wondering if some of these things might have helped:

  • Re-using read and message parsing logic by using the EntrypointWrapper (it feels like all the orjson parsing does not need to be visible from the test's perspective)
  • Using variables to improve naming and make dependencies explicit. For example, it feels easier to relate to other parts of the test if I see this:
                        "comments": [
                            {"id": 9, "post_id": 1, "updated_at": "2023-01-01T00:00:00Z"},
                            {"id": 10, "post_id": 1, "updated_at": _MOST_RECENT_DATE_OF_PARTITION_X},
                            {"id": 11, "post_id": 1, "updated_at": "2024-01-24T00:00:00Z"},
                        ],

I can assume that the dates for id 9 and id 11 are not so important and I can understand why the date for id 10 is important.

I'm wondering if there is anything you would add to this list. Again, I'm not asking to change the tests but I feel it is crucial to eventually improve them because we have a good idea of what's going on here but for someone new to cursors, they will freak out.

Contributor Author

I agree, these tests are quite complex, and something could easily be missed during updates. I’ll add documentation to explain what is being filtered and why the arguments are structured as they are. Additionally, I’ll incorporate variables to simplify the tests and improve readability.

Contributor

Don't spend too much time. I think we can socialize the idea that anyone who needs to update those tests should try to improve them. This is not a blocker for me to merge

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (5)

31-32: Consider adding more specific type hints for create_function.

The create_function parameter could benefit from more specific type hints. Would you consider updating it to explicitly show the expected parameter types? Wdyt?

-    def __init__(self, create_function: Callable[..., ConcurrentCursor]):
+    def __init__(
+        self,
+        create_function: Callable[[Mapping[str, Any], Optional[Any]], ConcurrentCursor]
+    ):

42-54: Consider enhancing documentation with thread safety details.

The class uses locks and semaphores, but the documentation doesn't mention thread safety considerations. Would you like to add a section about thread safety and concurrent access patterns? Wdyt?


57-58: Consider using frozen dictionaries for constants.

The _NO_STATE and _NO_CURSOR_STATE are mutable mappings. To prevent accidental modifications, would you consider using MappingProxyType or frozendict? Wdyt?

-    _NO_STATE: Mapping[str, Any] = {}
-    _NO_CURSOR_STATE: Mapping[str, Any] = {}
+    _NO_STATE: Mapping[str, Any] = MappingProxyType({})
+    _NO_CURSOR_STATE: Mapping[str, Any] = MappingProxyType({})

102-121: Consider optimizing state copying strategy.

The state property performs deep copies of cursor states for each partition. For better performance with large states, would you consider:

  1. Using shallow copies where safe
  2. Implementing copy-on-write
  3. Caching the state if it hasn't changed

Wdyt?
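A dirty-flag cache is one way to realize option 3. An illustrative sketch (the names here are assumptions, not the PR's code):

```python
import copy
from typing import Any, Dict


class CachedStateMixin:
    def __init__(self) -> None:
        self._cursor_states: Dict[str, Dict[str, Any]] = {}
        self._state_cache: Dict[str, Any] = {}
        self._dirty = True  # recompute only after a mutation

    def observe(self, partition_key: str, cursor_state: Dict[str, Any]) -> None:
        self._cursor_states[partition_key] = cursor_state
        self._dirty = True

    @property
    def state(self) -> Dict[str, Any]:
        # Deep-copy only when something changed since the last read,
        # instead of on every access.
        if self._dirty:
            self._state_cache = {
                "states": [
                    {"partition_key": key, "cursor": copy.deepcopy(value)}
                    for key, value in self._cursor_states.items()
                ]
            }
            self._dirty = False
        return self._state_cache
```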


306-314: Enhance error messages for better debugging.

The error messages could be more specific about what caused the invalid state. Would you consider including more context in the error messages? Wdyt?

-                "Invalid state as stream slices that are emitted should refer to an existing cursor"
+                f"Invalid state: Partition key '{partition_key}' not found in cursor_per_partition. This might indicate a race condition or premature record emission."
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9574f8c and 36c4992.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)

216-218: ⚠️ Potential issue

Review partition limit logic.

The limit_reached method returns true when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER, but _over_limit is incremented each time we remove a partition. This means we'll only return true after exceeding the limit by the full amount. Should this be >= instead? Wdyt?

-        return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER
+        return self._over_limit >= self.DEFAULT_MAX_PARTITIONS_NUMBER

Likely invalid or redundant comment.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)

64-98: Consider adding type hints for class variables.

The initialization is well-structured, but adding type hints for class variables would improve code clarity and help catch potential type-related issues early. Wdyt about adding them like this?

-    _cursor_per_partition: OrderedDict[str, ConcurrentCursor] = OrderedDict()
-    _semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
-    _finished_partitions: set[str] = set()
-    _global_cursor: Optional[StreamState] = {}
-    _new_global_cursor: Optional[StreamState] = None
-    _lookback_window: int = 0
-    _parent_state: Optional[StreamState] = None
-    _over_limit: int = 0

104-123: Consider optimizing state generation for large number of partitions.

The state property performs deep copies of cursor states for each partition. This could become a performance bottleneck with many partitions. Would it make sense to optimize this by:

  1. Processing states incrementally?
  2. Using shallow copies where possible?
  3. Implementing a caching mechanism?

Wdyt?


224-291: Consider adding error handling for invalid state format.

The _set_initial_state method has good documentation but could benefit from explicit error handling for invalid state formats. For example:

     if not stream_state:
         return

+    if not isinstance(stream_state, dict):
+        raise ValueError(f"Invalid state format. Expected dict, got {type(stream_state)}")
+
     if self._PERPARTITION_STATE_KEY not in stream_state:
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

296-349: Consider extracting complex condition to a method.

The condition for checking if a stream supports concurrent partition processing with global cursor is quite complex. Would it make sense to extract it to a separate method for better readability? Something like:

+    def _supports_concurrent_partition_with_global_cursor(
+        self,
+        declarative_stream: DeclarativeStream,
+        incremental_sync_component_definition: Mapping[str, Any],
+    ) -> bool:
+        return (
+            incremental_sync_component_definition
+            and incremental_sync_component_definition.get("type", "")
+            == DatetimeBasedCursorModel.__name__
+            and self._stream_supports_concurrent_partition_processing(
+                declarative_stream=declarative_stream
+            )
+            and hasattr(declarative_stream.retriever, "stream_slicer")
+            and isinstance(
+                declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor
+            )
+        )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 36c4992 and daa6873.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (5 hunks)
  • airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
  • airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-the-guardian-api' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py (1)

67-69: LGTM! Simple and effective implementation.

The is_running method provides a clean way to check the timer's state. The implementation is thread-safe and follows the single responsibility principle.
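As a rough illustration of the pattern being praised here — a standalone sketch, not the CDK's actual `Timer` — the running state can be derived from whether a start timestamp has been recorded and not yet cleared, guarded by a lock:

```python
import threading
import time
from typing import Optional


class Timer:
    """Sketch of a timer whose running state is derived from its start mark."""

    def __init__(self) -> None:
        self._start: Optional[float] = None
        self._lock = threading.Lock()

    def start(self) -> None:
        with self._lock:
            self._start = time.perf_counter()

    def finish(self) -> float:
        with self._lock:
            elapsed = time.perf_counter() - self._start
            self._start = None
            return elapsed

    def is_running(self) -> bool:
        # Thread-safe, single-responsibility check of the timer's state.
        with self._lock:
            return self._start is not None
```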

airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)

30-39: LGTM! Clean factory implementation.

The factory pattern is well implemented with a clear separation of concerns. The create method signature is explicit about its parameters and return type.


208-220: Verify partition limit logic.

In _ensure_partition_limit, we increment _over_limit every time we remove a partition when the limit is reached. However, this could lead to edge cases where partitions are unnecessarily dropped. Should we consider:

  1. Resetting _over_limit when partitions are successfully processed?
  2. Using a different metric for tracking over-limit conditions?

Wdyt?
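To make the concern concrete, a hypothetical eviction loop (not the PR's code; names and the small limit are illustrative) could prefer dropping finished partitions, whose state is already safe, and only count an eviction toward `_over_limit` when an in-flight partition must be dropped:

```python
from collections import OrderedDict

DEFAULT_MAX_PARTITIONS_NUMBER = 3  # small value for illustration


class PartitionLimiter:
    """Sketch: evict oldest finished partitions first; only count an
    eviction toward _over_limit when an unfinished partition is dropped."""

    def __init__(self) -> None:
        self._cursor_per_partition: OrderedDict = OrderedDict()
        self._finished_partitions: set = set()
        self._over_limit = 0

    def add_partition(self, key: str, cursor: dict) -> None:
        self._ensure_partition_limit()
        self._cursor_per_partition[key] = cursor

    def _ensure_partition_limit(self) -> None:
        while len(self._cursor_per_partition) >= DEFAULT_MAX_PARTITIONS_NUMBER:
            # Prefer dropping a finished partition: its state is already
            # rolled into the global cursor, so nothing is lost.
            finished = next(
                (k for k in self._cursor_per_partition if k in self._finished_partitions),
                None,
            )
            if finished is not None:
                self._cursor_per_partition.pop(finished)
            else:
                # Dropping an in-flight partition loses state; count it.
                self._cursor_per_partition.popitem(last=False)
                self._over_limit += 1
```

Under this variant, `_over_limit` only grows when state is actually at risk, which may better reflect when a fallback to the global cursor is warranted.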


221-223: Potential off-by-one error in limit check.

The limit_reached method returns True when _over_limit > DEFAULT_MAX_PARTITIONS_NUMBER. Should this be >= instead? With the current strictly-greater comparison, the counter must exceed the threshold by one extra partition before the limit is considered reached. Wdyt?
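A two-line check makes the off-by-one visible (the constant's value here is illustrative, not the CDK's actual default) — with `>`, the guard still returns `False` when the counter sits exactly at the threshold:

```python
DEFAULT_MAX_PARTITIONS_NUMBER = 10000  # illustrative value


def limit_reached(over_limit: int) -> bool:
    # Behaviour under review: strictly-greater comparison.
    return over_limit > DEFAULT_MAX_PARTITIONS_NUMBER
```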

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)

440-459: LGTM! Well-documented optimization.

The _get_retriever method contains important optimizations for cursor and state management. The comments clearly explain the rationale behind the changes.

@tolik0
Copy link
Contributor Author

tolik0 commented Jan 17, 2025

Regression tests for Jira: GitHub Action

Failure: test_all_records_are_the_same_with_state

Issues:

  1. project_avatars: projectId is added by a transformation, so id does not uniquely identify the record.
  2. screen_tab_fields: Similar to project_avatars, screenId and tabId are added by a transformation, so id does not uniquely identify the record.

Regression tests for Chargebee: GitHub Action

Failure: The contact stream read test failed because the stream now runs concurrently. It no longer uses resumable full refresh state, which resulted in an increased number of records. Updating the connector to the latest CDK version reproduced the same error: GitHub Action.

Catalog Changes:
4 streams were updated: attached_item, contact, quote_line_group, and subscription_with_scheduled_changes. These are full refresh streams with parent streams.
