feat(Low-Code Concurrent CDK): Add ConcurrentPerPartitionCursor #111
base: main
Conversation
📝 Walkthrough

The pull request introduces significant enhancements to the Airbyte CDK's concurrent declarative source handling, focusing on improving incremental sync capabilities for low-code connectors. The changes primarily revolve around extending support for different cursor types, particularly the new `ConcurrentPerPartitionCursor`.

Changes
Possibly related PRs
Suggested reviewers

Hey there! 👋 I noticed some interesting improvements in the concurrent declarative source handling. A few questions for you:

Would love to hear your thoughts! 🚀
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
59-66: Consider simplifying the constructor parameters

The `__init__` method of `ConcurrentPerPartitionCursor` has many parameters, which can make it harder to maintain and understand. Would it be beneficial to encapsulate related parameters into data classes or reduce the number of parameters if possible? Wdyt?

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)
23-25: Unused import statement

We import `PerPartitionWithGlobalCursor`, but it's not used elsewhere in the code. Should we remove this import to keep the code clean? Wdyt?
309-361: Refactor duplicated code in `_group_streams` method

There seems to be duplicated code in the conditional blocks handling `DatetimeBasedCursorModel` streams. Could we refactor these blocks into a helper function to reduce redundancy and improve readability? Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
915-970: Consider reducing code duplication

The methods `create_concurrent_cursor_from_datetime_based_cursor` and `create_concurrent_cursor_from_perpartition_cursor` share similar logic. Could we extract the common parts into a shared helper function to simplify maintenance? Wdyt?

airbyte_cdk/sources/streams/concurrent/cursor.py (1)
243-251: Consider extracting common logic between close methods?

The new `close_partition_without_emit` shares a lot of logic with `close_partition`. What do you think about extracting the common logic into a private method to reduce duplication? Something like this, wdyt?

```diff
+ def _close_partition_internal(self, partition: Partition) -> bool:
+     slice_count_before = len(self.state.get("slices", []))
+     self._add_slice_to_state(partition)
+     should_merge = slice_count_before < len(self.state["slices"])
+     if should_merge:
+         self._merge_partitions()
+     self._has_closed_at_least_one_slice = True
+     return should_merge

  def close_partition_without_emit(self, partition: Partition) -> None:
-     slice_count_before = len(self.state.get("slices", []))
-     self._add_slice_to_state(partition)
-     if slice_count_before < len(
-         self.state["slices"]
-     ):  # only emit if at least one slice has been processed
-         self._merge_partitions()
-     self._has_closed_at_least_one_slice = True
+     self._close_partition_internal(partition)

  def close_partition(self, partition: Partition) -> None:
-     slice_count_before = len(self.state.get("slices", []))
-     self._add_slice_to_state(partition)
-     if slice_count_before < len(
-         self.state["slices"]
-     ):  # only emit if at least one slice has been processed
-         self._merge_partitions()
+     if self._close_partition_internal(partition):
          self._emit_state_message()
-     self._has_closed_at_least_one_slice = True
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)

- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/__init__.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3 hunks)
- airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1 hunks)
- airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1 hunks)
- airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)
25-29: Question about returning only the first cursor

In the `ConcurrentCursorFactory`, the `create` method returns only the first cursor from `_create_function`. Since `_create_function` returns a tuple of cursors, should we ensure that we're not discarding any necessary cursors? Perhaps we should handle all cursors returned. Wdyt?
221-224: Ensure partition key serialization handles edge cases

In `_to_partition_key` and `_to_dict`, we use `self._partition_serializer`. Do we need to ensure that partition keys are properly sanitized or encoded to handle special characters and prevent serialization issues? Wdyt?
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
317-318: Type checking with isinstance

We use `isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor)` to check the type. Should we consider using duck typing or interface checks to allow for subclasses or alternative implementations that fulfill the same protocol? Wdyt?
airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1)
41-66: Clarify the purpose of `DeclarativePartitionFactory1`

The new class `DeclarativePartitionFactory1` appears to be similar to `DeclarativePartitionFactory` but with a different approach to handling the retriever. Is this intended to replace the original factory or serve a different purpose? Providing more context or renaming the class for clarity might help. Wdyt?
airbyte_cdk/sources/declarative/retrievers/simple_retriever.py (1)
181-181: LGTM!

The change to use `request_option_provider` instead of `stream_slicer` improves modularity by separating concerns.
```python
def state(self) -> MutableMapping[str, Any]:
    states = []
    for partition_tuple, cursor in self._cursor_per_partition.items():
        cursor_state = cursor._connector_state_converter.convert_to_state_message(
            cursor._cursor_field, cursor.state
        )
        if cursor_state:
            states.append(
                {
                    "partition": self._to_dict(partition_tuple),
                    "cursor": copy.deepcopy(cursor_state),
                }
            )
    state: dict[str, Any] = {"states": states}
    return state
```
🛠️ Refactor suggestion
Potential performance issue when generating state
The `state` property method iterates over `self._cursor_per_partition.items()` and performs deep copies of cursor states. This could become a performance bottleneck with a large number of partitions. Should we consider optimizing this by avoiding deep copies or processing states incrementally? Wdyt?
Is there a reason we're doing a deepcopy here? If so, I think we should document it
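For context on what the deepcopy guards against: without it, the emitted state would alias the cursor's live, mutable dict and could change after being snapshotted. A minimal standalone illustration:

```python
import copy

# A stand-in for a cursor's live, mutable state dict.
cursor_state = {"slices": [{"start": 0, "end": 10}]}

snapshot_ref = cursor_state                  # aliases the live dict
snapshot_deep = copy.deepcopy(cursor_state)  # independent snapshot

# The cursor keeps advancing after the state was "captured".
cursor_state["slices"][0]["end"] = 99

print(snapshot_ref["slices"][0]["end"])   # 99 — the aliased snapshot mutated
print(snapshot_deep["slices"][0]["end"])  # 10 — the deep copy preserved it
```

If this aliasing is the reason for the deepcopy, a comment to that effect would answer the question above; if the per-cursor states are never mutated after collection, the copy could likely be dropped.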
```python
def _ensure_partition_limit(self) -> None:
    """
    Ensure the maximum number of partitions is not exceeded. If so, the oldest added partition will be dropped.
    """
    while len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1:
        self._over_limit += 1
        oldest_partition = self._cursor_per_partition.popitem(last=False)[
            0
        ]  # Remove the oldest partition
        logger.warning(
            f"The maximum number of partitions has been reached. Dropping the oldest partition: {oldest_partition}. Over limit: {self._over_limit}."
        )
```
Clarification on partition limit logic
In `_ensure_partition_limit`, we increment `_over_limit` every time we remove a partition when the limit is reached. However, in `limit_reached`, we check if `_over_limit > DEFAULT_MAX_PARTITIONS_NUMBER`. Is this the intended behavior? Should the condition be adjusted to properly reflect when the limit is truly exceeded? Wdyt?
```python
def limit_reached(self) -> bool:
    return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER
```
Possible off-by-one error in `limit_reached` method

The method `limit_reached` returns `True` when `_over_limit > DEFAULT_MAX_PARTITIONS_NUMBER`. Given how `_over_limit` is incremented, could this condition lead to an off-by-one error? Should we revisit this logic to ensure accurate limit checks? Wdyt?
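Tracing the two methods together makes the concern concrete. A minimal sketch (hypothetical `PartitionTracker` mirroring the quoted logic): with a max of 3, the tracker actually keeps at most `MAX - 1 = 2` partitions, and `limit_reached()` flips only after `MAX + 1 = 4` evictions — neither threshold lines up with the constant's name.

```python
from collections import OrderedDict

DEFAULT_MAX_PARTITIONS_NUMBER = 3


class PartitionTracker:
    """Hypothetical stand-in reproducing the eviction/limit logic under review."""

    def __init__(self):
        self._cursor_per_partition = OrderedDict()
        self._over_limit = 0

    def add(self, key):
        self._cursor_per_partition[key] = object()
        # Same condition as _ensure_partition_limit: evict while len > MAX - 1.
        while len(self._cursor_per_partition) > DEFAULT_MAX_PARTITIONS_NUMBER - 1:
            self._over_limit += 1
            self._cursor_per_partition.popitem(last=False)  # drop the oldest

    def limit_reached(self):
        # Same condition as limit_reached: compares evictions to MAX.
        return self._over_limit > DEFAULT_MAX_PARTITIONS_NUMBER


tracker = PartitionTracker()
for i in range(7):
    tracker.add(i)

# 7 adds with a cap of 2 kept partitions -> 5 evictions; limit_reached()
# becomes True only once evictions exceed 3, i.e. well after the cap was hit.
print(tracker._over_limit, tracker.limit_reached())  # 5 True
```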
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (outdated; resolved)
```python
partition_generator = StreamSlicerPartitionGenerator(
    DeclarativePartitionFactory(
        declarative_stream.name,
        declarative_stream.get_json_schema(),
        self._retriever_factory(
            name_to_stream_mapping[declarative_stream.name],
            config,
            stream_state,
        ),
        self.message_repository,
    ),
    cursor,
)
```
🛠️ Refactor suggestion
Error handling for partition generator creation
In constructing the `partition_generator`, are there potential edge cases where dependencies might not be properly initialized? Should we add error handling or input validation to make the code more robust against such scenarios? Wdyt?
airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (outdated; resolved)
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
152-156: Could we simplify the partition limit condition in `_ensure_partition_limit` method?

Currently, the condition is `len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1`. Would it improve readability to use `len(self._cursor_per_partition) >= self.DEFAULT_MAX_PARTITIONS_NUMBER` instead? This change might make the maximum partition limit clearer. Wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
910-965: Consider extracting common validation logic and fixing indentation

A few suggestions to improve this new method:

- The component type validation logic (lines 922-926) is duplicated from `create_concurrent_cursor_from_datetime_based_cursor`. Should we extract it into a helper method to follow DRY?
- The indentation in the return statement (lines 955-964) seems inconsistent with the rest of the file. Should we align it with the standard indentation?
- Instead of using `type: ignore` on line 961, we could properly type `message_repository`. Wdyt?

```diff
 def create_concurrent_cursor_from_perpartition_cursor(
     self,
     state_manager: ConnectorStateManager,
     model_type: Type[BaseModel],
     component_definition: ComponentDefinition,
     stream_name: str,
     stream_namespace: Optional[str],
     config: Config,
     stream_state: MutableMapping[str, Any],
     partition_router,
     **kwargs: Any,
 ) -> ConcurrentPerPartitionCursor:
-    component_type = component_definition.get("type")
-    if component_definition.get("type") != model_type.__name__:
-        raise ValueError(
-            f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
-        )
+    self._validate_component_type(component_definition, model_type)
     # ... rest of the method ...
     return ConcurrentPerPartitionCursor(
-        cursor_factory=cursor_factory,
-        partition_router=partition_router,
-        stream_name=stream_name,
-        stream_namespace=stream_namespace,
-        stream_state=stream_state,
-        message_repository=self._message_repository,  # type: ignore
-        connector_state_manager=state_manager,
-        cursor_field=cursor_field,
-    )
+        cursor_factory=cursor_factory,
+        partition_router=partition_router,
+        stream_name=stream_name,
+        stream_namespace=stream_namespace,
+        stream_state=stream_state,
+        message_repository=self._message_repository,
+        connector_state_manager=state_manager,
+        cursor_field=cursor_field,
+    )
```

Helper method to add:

```python
def _validate_component_type(
    self, component_definition: ComponentDefinition, expected_type: Type[BaseModel]
) -> None:
    component_type = component_definition.get("type")
    if component_definition.get("type") != expected_type.__name__:
        raise ValueError(
            f"Expected manifest component of type {expected_type.__name__}, but received {component_type} instead"
        )
```

unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)
276-276: Should the test case name match the function name for consistency?

In the parameterization at line 276, the test name is `"test_incremental_parent_state"`, but the test function is named `test_incremental_parent_state_no_incremental_dependency`. Aligning the test case name with the function name could enhance clarity. Wdyt?
483-495: Could we improve the docstring formatting for better readability?

The docstring contains a detailed explanation of the test, but it's indented inconsistently. Adjusting the indentation could enhance readability and maintain consistency with PEP 257 guidelines. Wdyt?
519-523: Is comparing only the last state sufficient to validate the final state?

In the assertion at lines 519-523, we compare `final_state[-1]` with `expected_state`. Should we consider comparing all elements of `final_state` to ensure that all state messages throughout the sync match the expectations? Wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)

- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
- airbyte_cdk/sources/declarative/extractors/record_filter.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
- airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
- unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py
🔇 Additional comments (2)
airbyte_cdk/sources/declarative/extractors/record_filter.py (1)
78-81: Is it safe to use an empty `stream_name` when creating `Record` instances?

In the `filter_records` method, we're creating `Record` instances with an empty `stream_name`. Could this lead to issues if other parts of the codebase expect `stream_name` to be non-empty? Should we consider passing the actual stream name to prevent potential problems? Wdyt?
airbyte_cdk/sources/streams/concurrent/cursor.py (1)
243-251: Could we clarify the behavior of `close_partition_without_emit` method?

In the `close_partition_without_emit` method, we update `_has_closed_at_least_one_slice` to `True`, but we don't emit a state message. Is this intended? Should we ensure that the state is eventually emitted elsewhere to keep the state consistent? Wdyt?
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (outdated; resolved)
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (outdated; resolved)
/autofix
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)
Line range hint 341-432: Comprehensive test coverage for cursor types! 👍

The parameterized tests cover all important cursor scenarios:

- Basic datetime cursor
- Global substream cursor (with and without state)
- Per-partition cursor with various state configurations

One thought: would it be valuable to add a test case for error handling when an invalid cursor type is passed? wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)

- unit_tests/sources/declarative/extractors/test_record_filter.py (2 hunks)
🔇 Additional comments (1)
unit_tests/sources/declarative/extractors/test_record_filter.py (1)
293-293: Nice API simplification! 🎉

I see you've consolidated the cursor parameters (`date_time_based_cursor` and `substream_cursor`) into a single `cursor` parameter. This makes the API cleaner and more flexible. The tests demonstrate it works well with all cursor types (`DatetimeBasedCursor`, `GlobalSubstreamCursor`, and `PerPartitionWithGlobalCursor`).

Also applies to: 431-431
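The consolidated shape can be sketched minimally (simplified stand-in classes, not the real CDK API): a single `cursor` exposing `should_be_synced` is all the filter needs, regardless of which cursor implementation is behind it.

```python
class ThresholdCursor:
    """Hypothetical cursor: syncs records at or past a threshold value."""

    def __init__(self, threshold):
        self._threshold = threshold

    def should_be_synced(self, record) -> bool:
        return record["updated_at"] >= self._threshold


def filter_records(records, cursor):
    # The filter depends only on the should_be_synced contract,
    # not on any concrete cursor class.
    return [record for record in records if cursor.should_be_synced(record)]


records = [{"updated_at": 1}, {"updated_at": 5}, {"updated_at": 9}]
print(filter_records(records, ThresholdCursor(5)))  # keeps 5 and 9
```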
/autofix
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (1)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
305-316: Consider extracting condition to a descriptive method

The nested conditions make the code harder to read and maintain. Would you consider extracting these conditions into a separate method with a descriptive name? Wdyt?

Something like:

```diff
+def _supports_concurrent_perpartition_cursor(
+    self,
+    declarative_stream: DeclarativeStream,
+    incremental_sync_component_definition: Mapping[str, Any],
+) -> bool:
+    return (
+        incremental_sync_component_definition
+        and incremental_sync_component_definition.get("type", "") == DatetimeBasedCursorModel.__name__
+        and self._stream_supports_concurrent_partition_processing(declarative_stream=declarative_stream)
+        and hasattr(declarative_stream.retriever, "stream_slicer")
+        and isinstance(declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor)
+    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by `!**/*.lock`
📒 Files selected for processing (5)

- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4 hunks)
- unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
84-85: Add missing comma between imports

There's a missing comma between `ConcurrentCursorFactory` and `ConcurrentPerPartitionCursor` in the import statement.

Apply this diff:

```diff
-    ConcurrentCursorFactory
-    ConcurrentPerPartitionCursor,
+    ConcurrentCursorFactory,
+    ConcurrentPerPartitionCursor,
```
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (2)
Line range hint 320-358: LGTM! Well-structured comparison logic

The implementation includes proper null checks and validation for partition matching. The error messages are clear and descriptive.
306-319: Add null check for `record.associated_slice`

The code accesses `record.associated_slice.partition` without verifying that `record.associated_slice` is not `None`. This could raise an `AttributeError`.

Consider adding a null check like this:

```diff
 def should_be_synced(self, record: Record) -> bool:
+    if not record.associated_slice:
+        raise ValueError("Record must have an associated slice")
     if (
         self._to_partition_key(record.associated_slice.partition)
         not in self._cursor_per_partition
     ):
```
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
317-331
: Consider handling potential None stream_state
When retrieving stream_state
, we might want to handle the case where it could be None to prevent unexpected errors during cursor creation. Wdyt?
Let's verify the current handling of stream_state:
#!/bin/bash
# Check how stream_state is handled in other parts of the codebase
rg -A 3 "get_stream_state" | grep -v "test"
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (outdated; resolved)
I'm very happy with the progress on this! I've added a couple of comments more in terms of code maintenance and structure but the functional part seemed fine. I want to check the tests eventually (tomorrow hopefully) but I can still leave a couple of comments here
```diff
@@ -299,6 +302,60 @@ def _group_streams(
                 cursor=final_state_cursor,
             )
         )
+        elif (
```
Should we filter only on list partition router until we support the global cursor part?
```diff
@@ -77,7 +75,7 @@ def filter_records(
         records = (
             record
             for record in records
-            if (self._substream_cursor or self._date_time_based_cursor).should_be_synced(
+            if self._cursor.should_be_synced(
```
This change is so beautiful that I want to cry ❤️
```python
class ConcurrentCursorFactory:
    def __init__(self, create_function: Callable[..., Cursor]):
```
Should the input type for `Callable` be `StreamState`?
These are keyword arguments, so we can’t directly enforce types in the Callable signature.
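For what it's worth, keyword arguments *can* be typed by spelling out `__call__` on a `Protocol` instead of `Callable[..., Cursor]`. A minimal sketch — the parameter names and the stand-in `Cursor` class are hypothetical, not the real CDK signatures:

```python
from typing import Any, Mapping, Optional, Protocol


class Cursor:
    """Stand-in for the real Cursor interface."""


class CursorCreateFn(Protocol):
    """Hypothetical callable protocol: keyword-only parameters get real types,
    which Callable[..., Cursor] cannot express."""

    def __call__(
        self,
        *,
        stream_state: Mapping[str, Any],
        runtime_parameters: Optional[Mapping[str, Any]] = None,
    ) -> Cursor: ...


def make_cursor(*, stream_state, runtime_parameters=None) -> Cursor:
    # A trivial create_function that a type checker would match against the protocol.
    return Cursor()


factory_fn: CursorCreateFn = make_cursor
cursor = factory_fn(stream_state={"updated_at": "2024-01-01"})
```

This is purely a static-typing aid; at runtime it behaves identically to the untyped `Callable`.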
```python
    self._set_initial_state(stream_state)

@property
def cursor_field(self) -> CursorField:
```
Should we add this to the `Cursor` interface? If we put it there, it seems like the return type will need to be `Optional` because of the `FinalStateCursor`.
Why does `FinalStateCursor` need this method?
Good point: this indeed seems weird. My question should probably have been: why do we expose that? I only see this used internally.
We are using `cursor_field` outside of this class:

```python
concurrent_streams.append(
    DefaultStream(
        partition_generator=partition_generator,
        name=declarative_stream.name,
        json_schema=declarative_stream.get_json_schema(),
        availability_strategy=AlwaysAvailableAvailabilityStrategy(),
        primary_key=get_primary_key_from_stream(declarative_stream.primary_key),
        cursor_field=perpartition_cursor.cursor_field.cursor_field_key,
        logger=self.logger,
        cursor=perpartition_cursor,
    )
)
```
```python
def state(self) -> MutableMapping[str, Any]:
    states = []
    for partition_tuple, cursor in self._cursor_per_partition.items():
        cursor_state = cursor._connector_state_converter.convert_to_state_message(
```
I have two confusions regarding this method:

1. Why is `Cursor.state` public? I know it was already like that before your changes, but I'm trying to understand where it is used and why it is public. Before your changes, it didn't seem like we needed it to be public.
2. Is there a way we can access the information we need without accessing private parameters? It feels like what we need to make public here is the representation of the data that is being set as an AirbyteMessage of type state — in other words, the one in `ConcurrentCursor._emit_state_message` — because when `PerPartitionCursor` wants to emit the state, it does not care about the implementation of the cursor but cares about setting it in a state message.

Based on the two points above, can we have the `state` method be more explicit (something like `as_state_message`) with a comment saying it is used for `PerPartitionCursor` and it is NOT used by other classes that are not cursors to emit states?
Fixed
```diff
@@ -240,6 +240,15 @@ def observe(self, record: Record) -> None:
     def _extract_cursor_value(self, record: Record) -> Any:
         return self._connector_state_converter.parse_value(self._cursor_field.extract_value(record))

+    def close_partition_without_emit(self, partition: Partition) -> None:
```
It feels like if we add a public method here, it should also be in the public interface. Based on that, I'm wondering if it would be easier to have the `ConcurrentCursorFactory` set the `message_repository` as a `NoopMessageRepository` to prevent the cursors from emitting anything. This solution would work with any type of cursor as well (let's say we want to support custom cursors soon).
Fixed
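The `NoopMessageRepository` approach described above can be sketched minimally. The classes here are simplified stand-ins (not the real CDK implementations): the factory injects a no-op repository, so child cursors can never emit state on their own and only the parent decides when to emit.

```python
class NoopMessageRepository:
    """Swallows every message — child cursors become silent."""

    def emit_message(self, message):
        pass


class RecordingMessageRepository:
    """The parent's real repository; collects whatever is emitted."""

    def __init__(self):
        self.messages = []

    def emit_message(self, message):
        self.messages.append(message)


class SimpleCursor:
    def __init__(self, message_repository):
        self._repo = message_repository

    def close_partition(self):
        # A real cursor would emit an AirbyteStateMessage here.
        self._repo.emit_message("STATE")


class ConcurrentCursorFactory:
    def __init__(self, create_function):
        self._create_function = create_function

    def create(self):
        # Force the no-op repository regardless of what the caller would pass.
        return self._create_function(message_repository=NoopMessageRepository())


parent_repo = RecordingMessageRepository()
factory = ConcurrentCursorFactory(lambda message_repository: SimpleCursor(message_repository))
child = factory.create()
child.close_partition()
print(parent_repo.messages)  # [] — nothing leaked through the child
```

The upside over `close_partition_without_emit` is that no extra public method is needed, and the same trick works for any cursor type, including future custom cursors.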
```diff
@@ -178,7 +178,7 @@ def _request_headers(
             stream_slice,
             next_page_token,
             self._paginator.get_request_headers,
-            self.stream_slicer.get_request_headers,
+            self.request_option_provider.get_request_headers,
```
@brianjlai was there a reason we were still using the `stream_slicer` here instead of the `request_option_provider`?
Force-pushed from a6b5daa to 79ffb77
/autofix
Actionable comments posted: 2
♻️ Duplicate comments (2)
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1)
306-319: ⚠️ Potential issue — Should we add a null check for `record.associated_slice`?

The code accesses `record.associated_slice.partition` without verifying that `record.associated_slice` is not `None`. This could raise an `AttributeError`. Consider adding a null check, wdyt?

```diff
 def should_be_synced(self, record: Record) -> bool:
+    if not record.associated_slice:
+        raise ValueError("Record must have an associated slice")
     if (
         self._to_partition_key(record.associated_slice.partition)
         not in self._cursor_per_partition
     ):
```

unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1)
244-268: ⚠️ Potential issue — Should we pass the state parameter to the read method?

The `state` parameter is passed to `ConcurrentDeclarativeSource` initialization but not to the `read` method (an empty list is passed instead). This might affect state handling. Consider using the state parameter in both places, wdyt?

```diff
- messages = list(source.read(logger=source.logger, config=config, catalog=catalog, state=[]))
+ messages = list(source.read(logger=source.logger, config=config, catalog=catalog, state=state))
```
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
962-1018: Consider adding error handling for cursor creation

The cursor creation logic could benefit from additional error handling similar to `create_concurrent_cursor_from_datetime_based_cursor`. For instance, what happens if the cursor factory creation fails? Wdyt?

```diff
 def create_concurrent_cursor_from_perpartition_cursor(
     self,
     state_manager: ConnectorStateManager,
     model_type: Type[BaseModel],
     component_definition: ComponentDefinition,
     stream_name: str,
     stream_namespace: Optional[str],
     config: Config,
     stream_state: MutableMapping[str, Any],
     partition_router,
     **kwargs: Any,
 ) -> ConcurrentPerPartitionCursor:
+    try:
         component_type = component_definition.get("type")
         if component_definition.get("type") != model_type.__name__:
             raise ValueError(
                 f"Expected manifest component of type {model_type.__name__}, but received {component_type} instead"
             )
         # Rest of the implementation...
         return ConcurrentPerPartitionCursor(
             cursor_factory=cursor_factory,
             partition_router=partition_router,
             stream_name=stream_name,
             stream_namespace=stream_namespace,
             stream_state=stream_state,
             message_repository=self._message_repository,
             connector_state_manager=state_manager,
             cursor_field=cursor_field,
         )
+    except Exception as e:
+        raise ValueError(
+            f"Failed to create concurrent cursor for stream '{stream_name}': {str(e)}"
+        ) from e
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)

- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (7 hunks)
- airbyte_cdk/sources/streams/concurrent/cursor.py (1 hunks)
- unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- airbyte_cdk/sources/streams/concurrent/cursor.py
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
🔇 Additional comments (5)
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)
482-528
: LGTM! Well-structured test with clear documentation.
The test case is comprehensive and effectively validates the parent state behavior when incremental dependency is disabled.
530-630
: LGTM! Excellent test helper implementation.
The helper function is well-documented and implements thorough validation steps for testing incremental parent state behavior.
1124-1556
: LGTM! Comprehensive test coverage.
The test functions thoroughly cover different scenarios including state migration and empty record sets. The test cases are well-structured and documented.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
Line range hint 814-961
: LGTM! Well-structured implementation with proper error handling.
The implementation of create_concurrent_cursor_from_datetime_based_cursor
is thorough and handles all edge cases appropriately.
84-85
:
Add missing comma between imports
There seems to be a missing comma between ConcurrentCursorFactory
and ConcurrentPerPartitionCursor
. Wdyt?
- ConcurrentCursorFactory
- ConcurrentPerPartitionCursor,
+ ConcurrentCursorFactory,
+ ConcurrentPerPartitionCursor,
Likely invalid or redundant comment.
airbyte_cdk/sources/declarative/concurrent_declarative_source.py
```python
    ) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
        # Many of the custom component implementations of DatetimeBasedCursor override get_request_params() (or other methods).
        # Because we're decoupling RequestOptionsProvider from the Cursor, custom components will eventually need to reimplement
```
🛠️ Refactor suggestion
Simplify the type checking condition
The current condition seems overly complex and might cause issues with valid subclasses:
not isinstance(stream_slicer, DatetimeBasedCursor) or type(stream_slicer) is not DatetimeBasedCursor
Should we simplify it to just use isinstance
? The current combination of isinstance
and type
checks with OR could unexpectedly reject valid subclasses of DatetimeBasedCursor
. Wdyt?
```diff
-        if (
-            not isinstance(stream_slicer, DatetimeBasedCursor)
-            or type(stream_slicer) is not DatetimeBasedCursor
-        ) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
+        if not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor)):
```
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py
(2 hunks)airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(8 hunks)unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py
(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/concurrent_declarative_source.py
[error] 327-327: Incompatible types in assignment: ConcurrentPerPartitionCursor cannot be assigned to ConcurrentCursor
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
[error] 1031-1031: Function is missing type annotations for one or more arguments
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
[error] 89-91: Missing type annotations for variables: _state, _semaphore_per_partition, _finished_partitions
[error] 110-148: Multiple attribute errors: Cursor and Partition objects missing required attributes
[error] 167-169: Type incompatibility in assignments
[error] 266-269: Type incompatibility in assignments
[error] 309-321: Return type errors and missing attributes in Cursor class
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)
89-91
:⚠️ Potential issueAdd type annotations to class variables.
The pipeline failures indicate missing type annotations. Should we add them to improve type safety? Wdyt?
```diff
-self._state = {"states": []}
-self._semaphore_per_partition = OrderedDict()
-self._finished_partitions = set()
+self._state: Dict[str, Any] = {"states": []}
+self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
+self._finished_partitions: Set[str] = set()
```

Likely invalid or redundant comment.
🧰 Tools
🪛 GitHub Actions: Linters
[error] 89-91: Missing type annotations for variables: _state, _semaphore_per_partition, _finished_partitions
167-169
:⚠️ Potential issueHandle potential None values in assignments.
The pipeline failures indicate type incompatibility in assignments. Should we add type checks for
None
values? Wdyt?

```diff
-        self._global_cursor = self._new_global_cursor
-        self._lookback_window = self._timer.finish()
-        self._parent_state = self._partition_router.get_stream_state()
+        self._global_cursor = self._new_global_cursor
+        lookback_window = self._timer.finish()
+        if lookback_window is not None:
+            self._lookback_window = lookback_window
+        parent_state = self._partition_router.get_stream_state()
+        if parent_state is not None:
+            self._parent_state = parent_state
```

Likely invalid or redundant comment.
🧰 Tools
🪛 GitHub Actions: Linters
[error] 167-169: Type incompatibility in assignments
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
1031-1086
:⚠️ Potential issueAdd type annotations to method parameters.
The pipeline failures indicate missing type annotations. Should we add type hints to improve type safety? Wdyt?
```diff
 def create_concurrent_cursor_from_perpartition_cursor(
     self,
-    state_manager,
-    model_type,
-    component_definition,
-    stream_name,
-    stream_namespace,
-    config,
-    stream_state,
-    partition_router,
+    state_manager: ConnectorStateManager,
+    model_type: Type[BaseModel],
+    component_definition: ComponentDefinition,
+    stream_name: str,
+    stream_namespace: Optional[str],
+    config: Config,
+    stream_state: MutableMapping[str, Any],
+    partition_router: PartitionRouter,
     **kwargs: Any,
 ) -> ConcurrentPerPartitionCursor:
```

Likely invalid or redundant comment.
🧰 Tools
🪛 GitHub Actions: Linters
[error] 1031-1031: Function is missing type annotations for one or more arguments
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (1)
493-531
: LGTM! Comprehensive test coverage.The test cases thoroughly validate the concurrent partition cursor functionality, including edge cases and error scenarios.
```python
        cursor_state = cursor._connector_state_converter.convert_to_state_message(
            self.cursor_field, cursor.state
        )
        if cursor_state:
            states.append(
                {
                    "partition": self._to_dict(partition_tuple),
                    "cursor": copy.deepcopy(cursor_state),
                }
            )
        state: dict[str, Any] = {"states": states}

        if self._global_cursor:
            state["state"] = self._global_cursor
        if self._lookback_window is not None:
            state["lookback_window"] = self._lookback_window
        if self._parent_state is not None:
            state["parent_state"] = self._parent_state
        return state

    def close_partition(self, partition: Partition) -> None:
        self._cursor_per_partition[
            self._to_partition_key(partition._stream_slice.partition)
        ].close_partition(partition=partition)
        with self._lock:
            self._semaphore_per_partition[
                self._to_partition_key(partition._stream_slice.partition)
            ].acquire()
            cursor = self._cursor_per_partition[
                self._to_partition_key(partition._stream_slice.partition)
            ]
            cursor_state = cursor._connector_state_converter.convert_to_state_message(
                cursor._cursor_field, cursor.state
            )
            if (
                self._to_partition_key(partition._stream_slice.partition)
                in self._finished_partitions
                and self._semaphore_per_partition[
                    self._to_partition_key(partition._stream_slice.partition)
```
🛠️ Refactor suggestion
Consider encapsulating cursor state conversion.
The code directly accesses private attributes like cursor._connector_state_converter
and cursor._cursor_field
. Should we add public methods to handle state conversion instead of accessing private attributes? Wdyt?
🧰 Tools
🪛 GitHub Actions: Linters
[error] 110-148: Multiple attribute errors: Cursor and Partition objects missing required attributes
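One way to act on this suggestion is to expose the converted state through a public property, so callers never reach into `_connector_state_converter` directly. A minimal runnable sketch; the `StateConverter` and `converted_state` names are illustrative stand-ins, not the actual CDK API:

```python
from typing import Any, Mapping


class StateConverter:
    """Stand-in for the CDK's connector state converter (assumed interface)."""

    def convert_to_state_message(
        self, cursor_field: str, state: Mapping[str, Any]
    ) -> Mapping[str, Any]:
        # Illustrative conversion: project the cursor field out of internal state.
        return {cursor_field: state.get(cursor_field)}


class CursorSketch:
    def __init__(self, cursor_field: str, state: Mapping[str, Any]) -> None:
        self._cursor_field = cursor_field
        self._state = dict(state)
        self._connector_state_converter = StateConverter()

    @property
    def converted_state(self) -> Mapping[str, Any]:
        """Public accessor so callers never touch the private converter."""
        return self._connector_state_converter.convert_to_state_message(
            self._cursor_field, self._state
        )


cursor = CursorSketch("updated_at", {"updated_at": "2024-01-01"})
print(cursor.converted_state)  # {'updated_at': '2024-01-01'}
```

With such a property, the `close_partition` logic above could read `cursor.converted_state` instead of chaining private attributes.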
```python
cursor = self._constructor.create_concurrent_cursor_from_perpartition_cursor(
    state_manager=state_manager,
    model_type=DatetimeBasedCursorModel,
    component_definition=incremental_sync_component_definition,
    stream_name=declarative_stream.name,
    stream_namespace=declarative_stream.namespace,
    config=config or {},
    stream_state=stream_state,
    partition_router=partition_router,
)
```
Fix type incompatibility in cursor assignment.
The pipeline failures indicate that ConcurrentPerPartitionCursor
cannot be assigned to ConcurrentCursor
. Should we update the type hints or add type casting? Wdyt?
🧰 Tools
🪛 GitHub Actions: Linters
[error] 327-327: Incompatible types in assignment: ConcurrentPerPartitionCursor cannot be assigned to ConcurrentCursor
Actionable comments posted: 2
🧹 Nitpick comments (1)
airbyte_cdk/sources/streams/concurrent/cursor.py (1)
199-201
: Consider documenting the state conversion logic.The state property now uses
convert_to_state_message
to transform the state. Should we add a docstring explaining this conversion process and its implications? Wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
(1 hunks)airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py
(1 hunks)airbyte_cdk/sources/streams/concurrent/cursor.py
(7 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
[error] 89-89: Need type annotation for '_state'
[error] 90-90: Need type annotation for '_semaphore_per_partition'
[error] 91-91: Need type annotation for '_finished_partitions'
[error] 129-142: Multiple instances of 'Partition' has no attribute '_stream_slice'
[error] 151-161: Multiple type incompatibility issues in assignments
[error] 303-303: Returning Any from function declared to return 'bool' and 'Cursor' has no attribute 'should_be_synced'
[error] 315-315: Returning Any from function declared to return 'bool' and 'Cursor' has no attribute 'is_greater_than_or_equal'
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (6)
airbyte_cdk/sources/streams/concurrent/cursor.py (3)
246-249
: LGTM! Improved state management.The change to use
_concurrent_state
for slice counting provides better encapsulation and consistency with other state operations.
Line range hint
261-272
: Validate state structure before accessing slices.Good addition of the validation check! This helps prevent runtime errors when the state structure is invalid.
Line range hint
387-399
: Consider adding error handling for boundary calculations.The slice boundary calculations look correct, but should we add try-catch blocks to handle potential type mismatches or invalid values? Wdyt?
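To make the suggestion concrete, here is a hedged sketch of defensive boundary extraction. The `start`/`end` key names and the flat-mapping slice shape are assumptions for illustration, not the CDK's real slice layout:

```python
from typing import Any, Mapping, Tuple


def extract_boundaries(stream_slice: Mapping[str, Any]) -> Tuple[int, int]:
    """Return (start, end) from a slice mapping, raising descriptive errors
    instead of letting KeyError/TypeError bubble up unannotated."""
    try:
        start = int(stream_slice["start"])
        end = int(stream_slice["end"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"Invalid slice boundaries in {stream_slice!r}") from exc
    if start > end:
        raise ValueError(f"Slice start {start} is after end {end}")
    return start, end


print(extract_boundaries({"start": "10", "end": "20"}))  # (10, 20)
```

The same wrap-and-reraise pattern would apply around the cursor's boundary lookups, so a malformed slice fails with a message naming the offending input.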
airbyte_cdk/sources/declarative/incremental/per_partition_cursor.py (1)
306-320
: LGTM! Robust validation for record comparisons.The validation checks for
associated_slice
and partition matching are thorough and will help catch potential issues early.airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (2)
89-91
:⚠️ Potential issueAdd type annotations to class attributes.
The linter indicates missing type annotations. Should we add them like this? Wdyt?
-self._state = {"states": []} -self._semaphore_per_partition = OrderedDict() -self._finished_partitions = set() +self._state: MutableMapping[str, Any] = {"states": []} +self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict() +self._finished_partitions: set[str] = set()Likely invalid or redundant comment.
🧰 Tools
🪛 GitHub Actions: Linters
[error] 89-89: Need type annotation for '_state'
[error] 90-90: Need type annotation for '_semaphore_per_partition'
[error] 91-91: Need type annotation for '_finished_partitions'
151-161
:⚠️ Potential issueHandle potential None values in state assignments.
The assignments might fail due to type incompatibility with None values. Should we add type checks? Wdyt?
-self._global_cursor = self._new_global_cursor -self._lookback_window = self._timer.finish() -self._parent_state = self._partition_router.get_stream_state() +self._global_cursor = self._new_global_cursor if self._new_global_cursor is not None else {} +lookback_window = self._timer.finish() +if lookback_window is not None: + self._lookback_window = lookback_window +parent_state = self._partition_router.get_stream_state() +if parent_state is not None: + self._parent_state = parent_stateLikely invalid or redundant comment.
🧰 Tools
🪛 GitHub Actions: Linters
[error] 151-161: Multiple type incompatibility issues in assignments
```python
        return cursor

    def should_be_synced(self, record: Record) -> bool:
        return self._get_cursor(record).should_be_synced(record)
```
Fix return type annotations in cursor methods.
The methods should_be_synced
and is_greater_than_or_equal
are returning Any instead of bool. Should we update the Cursor base class to include these methods in its interface? Wdyt?
Consider updating the Cursor base class to include these methods:
```python
class Cursor(ABC):
    @abstractmethod
    def should_be_synced(self, record: Record) -> bool:
        pass

    @abstractmethod
    def is_greater_than_or_equal(self, first: Record, second: Record) -> bool:
        pass
```
Also applies to: 315-315
🧰 Tools
🪛 GitHub Actions: Linters
[error] 303-303: Returning Any from function declared to return 'bool' and 'Cursor' has no attribute 'should_be_synced'
```python
            self._to_partition_key(partition._stream_slice.partition)
        ].close_partition(partition=partition)
        with self._lock:
            self._semaphore_per_partition[
                self._to_partition_key(partition._stream_slice.partition)
            ].acquire()
            cursor = self._cursor_per_partition[
                self._to_partition_key(partition._stream_slice.partition)
            ]
            if (
                self._to_partition_key(partition._stream_slice.partition)
                in self._finished_partitions
                and self._semaphore_per_partition[
                    self._to_partition_key(partition._stream_slice.partition)
```
Fix attribute access in partition handling.
The code is trying to access _stream_slice
which doesn't exist on the Partition
class. Should we update the attribute access? Wdyt?
```diff
-self._to_partition_key(partition._stream_slice.partition)
+self._to_partition_key(partition.to_slice().partition)
```
🧰 Tools
🪛 GitHub Actions: Linters
[error] 129-142: Multiple instances of 'Partition' has no attribute '_stream_slice'
Actionable comments posted: 0
🧹 Nitpick comments (3)
unit_tests/sources/declarative/test_concurrent_declarative_source.py (3)
654-655
: Consider adding more context to the comment.The comment describes the stream counts but doesn't explain why they are grouped this way. Would you consider adding more context about the stream types and their characteristics, wdyt?
```diff
-    # 1 full refresh stream, 2 incremental streams, 1 substream w/o incremental, 1 list based substream w/o incremental
-    # 1 async job stream, 1 substream w/ incremental
+    # Stream grouping breakdown:
+    # - 1 full refresh stream (palaces)
+    # - 2 incremental streams (party_members, locations)
+    # - 1 substream without incremental (party_members_skills)
+    # - 1 list based substream without incremental (arcana_personas)
+    # - 1 async job stream (async_job_stream)
+    # - 1 substream with incremental (palace_enemies)
```
1459-1462
: Consider improving the comment clarity.The comment could be more explicit about which streams fall into each category. Would you consider adding more details, wdyt?
```diff
-    # 1 full refresh stream, 2 with parent stream without incremental dependency, 1 stream with async retriever, 1 incremental with parent stream (palace_enemies)
-    assert len(concurrent_streams) == 5
-    # 2 incremental stream with interpolation on state (locations and party_members)
-    assert len(synchronous_streams) == 2
+    # Concurrent streams (5):
+    # - 1 full refresh stream (palaces)
+    # - 2 streams with parent stream without incremental dependency (party_members_skills, arcana_personas)
+    # - 1 stream with async retriever (async_job_stream)
+    # - 1 incremental stream with parent stream (palace_enemies)
+    assert len(concurrent_streams) == 5
+
+    # Synchronous streams (2):
+    # - 2 incremental streams with state interpolation (locations, party_members)
+    assert len(synchronous_streams) == 2
```
Line range hint
1465-1601
: The new test case looks good but could benefit from some documentation.The test case effectively verifies that streams with partition routing and incremental sync are processed concurrently. However, it would be helpful to add docstring documentation explaining the test's purpose and setup.
Consider adding a docstring like this:
```diff
 def test_given_partition_routing_and_incremental_sync_then_stream_is_concurrent():
+    """
+    Verifies that a stream with both partition routing and incremental sync capabilities
+    is processed concurrently. This test specifically:
+    1. Sets up a stream with ListPartitionRouter and DatetimeBasedCursor
+    2. Configures the stream for incremental sync mode
+    3. Verifies that the stream is grouped as concurrent rather than synchronous
+    """
     manifest = {
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
unit_tests/sources/declarative/test_concurrent_declarative_source.py
(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
unit_tests/sources/declarative/test_concurrent_declarative_source.py (1)
663-663
: The assertion sequence looks good.The assertions correctly verify the number and types of concurrent streams after adding the new
palace_enemies
stream.Also applies to: 676-678
Actionable comments posted: 1
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2194-2194
: Simplify type checking condition.The type checking condition seems overly complex. Should we simplify it to just use
isinstance
? Wdyt?

```diff
-        if (
-            not isinstance(stream_slicer, DatetimeBasedCursor)
-            or type(stream_slicer) is not DatetimeBasedCursor
-        ) and not isinstance(stream_slicer, PerPartitionWithGlobalCursor):
+        if not isinstance(stream_slicer, (DatetimeBasedCursor, PerPartitionWithGlobalCursor)):
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py
(2 hunks)airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
(1 hunks)airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
(8 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
[error] 122-122: "Partition" has no attribute "_stream_slice"
[error] 270-270: Item "None" of "StreamSlice | None" has no attribute "partition"
[error] 288-288: Returning Any from function declared to return "bool"
[error] 288-288: "Cursor" has no attribute "should_be_synced"
[error] 300-300: Returning Any from function declared to return "bool"
[error] 300-300: "Cursor" has no attribute "is_greater_than_or_equal"
[error] 309-309: Argument 2 to "Record" has incompatible type "StreamSlice | None"; expected "str"
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (8)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (5)
122-123
:⚠️ Potential issueFix attribute access in partition handling.
The pipeline failures indicate that
Partition
has no attribute `_stream_slice`. Should we update this to use a public method instead? Wdyt?

```diff
-partition_key = self._to_partition_key(partition._stream_slice.partition)
+partition_key = self._to_partition_key(partition.to_slice().partition)
```

Likely invalid or redundant comment.
🧰 Tools
🪛 GitHub Actions: Linters
[error] 122-122: "Partition" has no attribute "_stream_slice"
146-148
:⚠️ Potential issueHandle potential None values in state updates.
The assignments to `self._lookback_window` and `self._parent_state` don't handle potential `None` returns from `self._timer.finish()` and `self._partition_router.get_stream_state()`. Should we add type checks? Wdyt?

```diff
-        self._lookback_window = self._timer.finish()
-        self._parent_state = self._partition_router.get_stream_state()
+        lookback_window = self._timer.finish()
+        if lookback_window is not None:
+            self._lookback_window = lookback_window
+        parent_state = self._partition_router.get_stream_state()
+        if parent_state is not None:
+            self._parent_state = parent_state
```

Likely invalid or redundant comment.
84-86
: 🛠️ Refactor suggestionAdd type annotations to instance variables.
Several instance variables are missing type annotations, which is causing linting errors. Should we add them? Wdyt?
```diff
-    self._cursor_per_partition: OrderedDict[str, Cursor] = OrderedDict()
-    self._semaphore_per_partition = OrderedDict()
-    self._finished_partitions = set()
+    self._cursor_per_partition: OrderedDict[str, Cursor] = OrderedDict()
+    self._semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
+    self._finished_partitions: set[str] = set()
```

Likely invalid or redundant comment.
192-204
:⚠️ Potential issueConsider adjusting partition limit logic.
The `_over_limit` counter is incremented every time we remove a partition, but `limit_reached()` checks if it exceeds `DEFAULT_MAX_PARTITIONS_NUMBER`. This could lead to incorrect limit handling. Should we adjust this to count active partitions instead? Wdyt?

```diff
-        while len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1:
-            self._over_limit += 1
+        current_partition_count = len(self._cursor_per_partition)
+        if current_partition_count > self.DEFAULT_MAX_PARTITIONS_NUMBER:
+            self._over_limit = current_partition_count - self.DEFAULT_MAX_PARTITIONS_NUMBER
```

Likely invalid or redundant comment.
101-119
: 🛠️ Refactor suggestionOptimize state generation to avoid deep copies.
The `state` property performs deep copies of cursor states for each partition, which could be a performance bottleneck with many partitions. Should we consider a more efficient approach? Wdyt?

```diff
-                    "cursor": copy.deepcopy(cursor.state),
+                    "cursor": cursor.state.copy() if isinstance(cursor.state, dict) else cursor.state,
```

Likely invalid or redundant comment.
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (2)
322-324
: 🛠️ Refactor suggestionHandle potential None in stream_state retrieval.
The `get_stream_state` call doesn't handle potential `None` returns. Should we add a default value? Wdyt?

```diff
-        stream_state = state_manager.get_stream_state(
-            stream_name=declarative_stream.name, namespace=declarative_stream.namespace
-        )
+        stream_state = state_manager.get_stream_state(
+            stream_name=declarative_stream.name, namespace=declarative_stream.namespace
+        ) or {}
```

Likely invalid or redundant comment.
325-326
:⚠️ Potential issueAdd error handling for partition router access.
Direct access to `_partition_router` could fail if intermediate attributes don't exist. Should we add error handling? Wdyt?

```diff
-    partition_router = declarative_stream.retriever.stream_slicer._partition_router
+    try:
+        partition_router = declarative_stream.retriever.stream_slicer._partition_router
+    except AttributeError as e:
+        raise ValueError(
+            f"Failed to access partition router for stream {declarative_stream.name}: {str(e)}"
+        ) from e
```

Likely invalid or redundant comment.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
1031-1086
: 🛠️ Refactor suggestionAdd error handling for cursor creation.
The cursor creation logic could benefit from additional error handling. Should we wrap it in a try-catch block? Wdyt?
```diff
 def create_concurrent_cursor_from_perpartition_cursor(
     self,
     ...
 ) -> ConcurrentPerPartitionCursor:
+    try:
         # ... existing implementation ...
         return ConcurrentPerPartitionCursor(
             ...
         )
+    except Exception as e:
+        raise ValueError(
+            f"Failed to create concurrent cursor for stream '{stream_name}': {str(e)}"
+        ) from e
```

Likely invalid or redundant comment.
```python
    def should_be_synced(self, record: Record) -> bool:
        return self._get_cursor(record).should_be_synced(record)
```
Add type hints to cursor methods.
The methods should_be_synced
and is_greater_than_or_equal
are causing type errors. Should we update the base Cursor
class to include these methods in its interface? Wdyt?
```diff
 class Cursor(ABC):
+    @abstractmethod
+    def should_be_synced(self, record: Record) -> bool:
+        pass
+
+    @abstractmethod
+    def is_greater_than_or_equal(self, first: Record, second: Record) -> bool:
+        pass
```
Also applies to: 300-303
🧰 Tools
🪛 GitHub Actions: Linters
[error] 288-288: Returning Any from function declared to return "bool"
[error] 288-288: "Cursor" has no attribute "should_be_synced"
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
(1 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
[error] 124-124: Item "Mapping[str, Any]" of "Mapping[str, Any] | None" has no attribute "partition"
[error] 124-124: Item "None" of "Mapping[str, Any] | None" has no attribute "partition"
[error] 272-272: Item "None" of "StreamSlice | None" has no attribute "partition"
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
291-292
: Ensure `should_be_synced` returns `bool` as expected

In the `should_be_synced` method, we're returning `self._get_cursor(record).should_be_synced(record)`. Could we confirm that `should_be_synced` returns a `bool` value? Also, does the base class `Cursor` define this method to ensure consistency across implementations? Wdyt?
```python
    def close_partition(self, partition: Partition) -> None:
        partition_key = self._to_partition_key(partition.to_slice().partition)
        self._cursor_per_partition[partition_key].close_partition(partition=partition)
```
Is there a need to handle `None` when accessing `partition.to_slice().partition`?

In `close_partition`, we're accessing `partition.to_slice().partition`, but if `partition.to_slice()` or `partition.to_slice().partition` is `None`, it might lead to errors. Should we include a check to ensure these are not `None` before proceeding? Wdyt?
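To make the suggestion concrete, here is a minimal sketch of the kind of defensive check being proposed. The helper name and the JSON-based key format are illustrative assumptions, not the CDK's actual implementation:

```python
import json
from typing import Any, Mapping, Optional


def partition_key_from_slice(slice_partition: Optional[Mapping[str, Any]]) -> str:
    # Fail fast with a clear message instead of an AttributeError deeper in the stack.
    if slice_partition is None:
        raise ValueError("Expected a partitioned stream slice, but got None")
    # Serialize deterministically so the same partition always maps to the same key.
    return json.dumps(dict(slice_partition), sort_keys=True)
```

With a check like this, a `None` slice surfaces as an explicit `ValueError` at the boundary rather than an opaque attribute error inside the cursor bookkeeping.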
```python
        if not any(
            semaphore_item[1]._value for semaphore_item in self._semaphore_per_partition.items()
        ):
```
🛠️ Refactor suggestion

Avoid accessing private attributes like `_value` of `Semaphore`

In the check `if not any(semaphore_item[1]._value for semaphore_item in self._semaphore_per_partition.items()):`, we're accessing the internal `_value` attribute of `threading.Semaphore`, which is intended to be private. Should we consider an alternative approach to track the semaphore state without accessing private attributes? Wdyt?
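One possible alternative, sketched under the assumption that a momentary acquire/release probe is acceptable (the helper name is illustrative, and the probe is only a point-in-time snapshot under concurrency):

```python
import threading


def is_busy(semaphore: threading.Semaphore) -> bool:
    """Return True if the semaphore currently has no permits left.

    Probes with a non-blocking acquire instead of reading the private
    ``_value`` attribute. Note the probe briefly takes and returns a
    permit, so under heavy concurrency it is only a best-effort snapshot.
    """
    if semaphore.acquire(blocking=False):
        semaphore.release()
        return False
    return True
```

This keeps the code on the public `Semaphore` API (`acquire(blocking=False)` / `release`), at the cost of a transient permit grab during the probe.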
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
Actionable comments posted: 0
🧹 Nitpick comments (3)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)
31-32: Consider adding return type hint to create_function parameter.

The `create_function` parameter could benefit from a more explicit type hint. Wdyt about updating it like this?

```diff
-    def __init__(self, create_function: Callable[..., ConcurrentCursor]):
+    def __init__(self, create_function: Callable[[Mapping[str, Any], Any], ConcurrentCursor]):
```
201-213: Consider adding partition removal notification.

When removing old partitions, it might be helpful to notify the system about the removed partition. Would it make sense to add an event or callback mechanism? This could help with monitoring and debugging. Wdyt?
293-301: Add type hint for runtime_lookback_window parameter.

The `runtime_lookback_window` parameter could benefit from a more specific type hint than `Any`. Since it's converted to a timedelta, should we use `Optional[int]`? Wdyt?

```diff
 def _create_cursor(
-    self, cursor_state: Any, runtime_lookback_window: Any = None
+    self, cursor_state: Any, runtime_lookback_window: Optional[int] = None
 ) -> ConcurrentCursor:
```
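For illustration, a hedged sketch of the conversion this comment alludes to. The helper name and the interpretation of the window as a number of seconds are assumptions, not the CDK's actual behavior:

```python
from datetime import timedelta
from typing import Optional


def lookback_to_timedelta(runtime_lookback_window: Optional[int]) -> timedelta:
    # Treat the lookback window as a count of seconds; None means no lookback.
    if runtime_lookback_window is None:
        return timedelta(seconds=0)
    return timedelta(seconds=runtime_lookback_window)
```

Typing the parameter as `Optional[int]` makes the `None`-to-zero default explicit at the signature level instead of hiding it behind `Any`.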
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)
278-284: 🛠️ Refactor suggestion

Add null check for associated_slice.partition.

The code assumes `record.associated_slice.partition` exists, but `associated_slice` could be `None`. Should we add a null check? Wdyt?

```diff
 def observe(self, record: Record) -> None:
     if not record.associated_slice:
         raise ValueError(
             "Invalid state as stream slices that are emitted should refer to an existing cursor"
         )
+    if not hasattr(record.associated_slice, 'partition'):
+        raise ValueError("Stream slice must have a partition")
     self._cursor_per_partition[
         self._to_partition_key(record.associated_slice.partition)
     ].observe(record)
```

Likely invalid or redundant comment.
152-154: 🛠️ Refactor suggestion

Consider using a thread-safe way to check semaphore state.

We're accessing the private `_value` attribute of `threading.Semaphore`. Would it be better to implement a thread-safe method to check the semaphore state? For example:

```diff
-        if not any(
-            semaphore_item[1]._value for semaphore_item in self._semaphore_per_partition.items()
-        ):
+        if not any(
+            self._is_semaphore_acquired(sem) for _, sem in self._semaphore_per_partition.items()
+        ):

+    # Add this helper method
+    def _is_semaphore_acquired(self, semaphore: threading.Semaphore) -> bool:
+        # Try to acquire without blocking
+        acquired = semaphore.acquire(blocking=False)
+        if acquired:
+            semaphore.release()
+            return False
+        return True
```

Likely invalid or redundant comment.
214-216: Review partition limit logic.

The `limit_reached` method returns `True` when `_over_limit > DEFAULT_MAX_PARTITIONS_NUMBER`, but `_over_limit` is incremented each time we remove a partition. Should this condition be `>=` instead of `>`? Wdyt?

```diff
 def limit_reached(self) -> bool:
-    return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER
+    return self._over_limit >= self.DEFAULT_MAX_PARTITIONS_NUMBER
```

✅ Verification successful

The `>` operator is correct here; here is why. The current implementation creates an intentional buffer zone:

- Partitions start getting removed when reaching DEFAULT_MAX_PARTITIONS_NUMBER
- The switch to the global cursor happens only after removing DEFAULT_MAX_PARTITIONS_NUMBER + 1 partitions

This buffer helps prevent premature switching to the global cursor due to temporary partition spikes, which aligns with the documented behavior of ensuring "switching is only done when a sustained high number of partitions is observed."

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

Script:

```shell
#!/bin/bash
# Find all occurrences of _over_limit and DEFAULT_MAX_PARTITIONS_NUMBER
echo "=== Searching for _over_limit usage ==="
rg "_over_limit" -B 2 -A 2
echo -e "\n=== Searching for DEFAULT_MAX_PARTITIONS_NUMBER usage ==="
rg "DEFAULT_MAX_PARTITIONS_NUMBER" -B 2 -A 2
echo -e "\n=== Looking for test files related to concurrent partition cursor ==="
fd "test.*concurrent.*partition.*\.py$"
```

Length of output: 14889
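The buffer-zone behavior described above can be illustrated with a toy model. The class below is a simplified stand-in (names mirror the PR, but the cap value and methods are chosen for illustration only):

```python
class PartitionLimiter:
    """Simplified stand-in for the cursor's over-limit bookkeeping."""

    DEFAULT_MAX_PARTITIONS_NUMBER = 3  # small cap for illustration

    def __init__(self) -> None:
        self._over_limit = 0

    def drop_oldest_partition(self) -> None:
        # Called each time the partition cap forces an eviction.
        self._over_limit += 1

    def limit_reached(self) -> bool:
        # Strict '>' tolerates up to DEFAULT_MAX_PARTITIONS_NUMBER evictions
        # (a temporary spike) before signaling the switch to the global cursor.
        return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER


limiter = PartitionLimiter()
for _ in range(limiter.DEFAULT_MAX_PARTITIONS_NUMBER):
    limiter.drop_oldest_partition()
assert limiter.limit_reached() is False  # a spike alone does not trigger the switch

limiter.drop_oldest_partition()
assert limiter.limit_reached() is True  # sustained evictions do
```

Flipping the comparison to `>=` would shrink that tolerance by one eviction, which is exactly the trade-off the verification above weighs.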
I don't think there is anything blocking but I added a couple of comments. I'll re-review whenever you want and approve once most of these are resolved
```python
                declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor
            )
        ):
            stream_state = state_manager.get_stream_state(
```
The inside of the `elif` clause seems very similar to the one for `if self._is_datetime_incremental_without_partition_routing(declarative_stream, incremental_sync_component_definition)`. Should we merge them?
Added helper method
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
```python
        self._set_initial_state(stream_state)

    @property
    def cursor_field(self) -> CursorField:
```
Good point: this indeed seems weird. My question should probably have been: why do we expose that? I only see this used internally.
```python
        does not have parent streams, this step will be skipped due to the default PartitionRouter implementation.

        Args:
            stream_state (StreamState): The state of the streams to be set. The format of the stream state should be:
```
Can we add the global state/lookback window parts to the doc as well?
Updated
```python
        for partition in slices:
            yield from self.generate_slices_from_partition(partition)

    def generate_slices_from_partition(self, partition: StreamSlice) -> Iterable[StreamSlice]:
```
nit: can we prefix this with `_` to make it clear it is private?
Fixed
```python
    def stream_slices(self) -> Iterable[StreamSlice]:
        slices = self._partition_router.stream_slices()
        self._timer.start()
```
nit: It's probably fine because we control the CDK and we don't expect people to call `stream_slices`, but it would be weird if `stream_slices` were called more than once. Should we at least add a comment mentioning that the assumption is that it's only called once per sync?
We agreed in a sync discussion that we can raise an error here.
Fixed
```python
        state: dict[str, Any] = {"states": states}

        if self._global_cursor:
            state["state"] = self._global_cursor
```
Would it make sense to have this as `global_state` instead of just `state`? Is this a backward compatibility thing? I'm wondering because just `state` is not so explicit given that there are `per_partition_state`, `global_state`, `parent_state`, etc...
Yes, this follows the same format used in declarative streams. Would you like me to create an issue to refactor it later for better clarity?
Ok, that makes sense. I think for now, we could just extract this in a constant called `_GLOBAL_STATE_KEY` with a comment as to why it is only `"state"`, and it would be more than enough. Thanks for the context!
Fixed, also added constant for PerPartition key
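The constant-extraction agreed on above might look roughly like the following sketch. `_GLOBAL_STATE_KEY` comes from the thread; the per-partition key name and the helper function are assumptions for illustration:

```python
from typing import Any, Dict, List, Mapping, Optional

# Kept as "state"/"states" for backward compatibility with the declarative
# per-partition state format (see the discussion above).
_GLOBAL_STATE_KEY = "state"
_PERPARTITION_STATE_KEY = "states"  # assumed name, mirroring the thread


def build_stream_state(
    per_partition_states: List[Mapping[str, Any]],
    global_cursor: Optional[Mapping[str, Any]],
) -> Dict[str, Any]:
    state: Dict[str, Any] = {_PERPARTITION_STATE_KEY: per_partition_states}
    if global_cursor:
        # The named constant documents why the key is just "state".
        state[_GLOBAL_STATE_KEY] = dict(global_cursor)
    return state
```

Centralizing the keys in constants means a future rename (e.g. to `global_state`) touches one line instead of every read/write site.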
```python
    ) -> List[AirbyteMessage]:
        catalog = ConfiguredAirbyteCatalog(
            streams=[
                ConfiguredAirbyteStream(
```
nit: Using `ConfiguredAirbyteStreamBuilder` will make this more maintainable: if the format changes, we have a layer of protection so this does not break.
Fixed
```diff
@@ -0,0 +1,2215 @@
+# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
```
out-of-scope (just a comment to see if we can do differently next time): I like how exhaustive the tests are. I also feel they might be a bit hard to maintain for someone that does not have prior knowledge of this code. I'm wondering if some of those things might have helped:
- Re-using read and message parsing logic by using the EntrypointWrapper (it feels like all the orjson parsing does not need to be visible from the test's perspective)
- Using variables to improve naming and make dependencies explicit. For example, it feels easier to relate to other parts of the test if I see this:
```python
"comments": [
    {"id": 9, "post_id": 1, "updated_at": "2023-01-01T00:00:00Z"},
    {"id": 10, "post_id": 1, "updated_at": _MOST_RECENT_DATE_OF_PARTITION_X},
    {"id": 11, "post_id": 1, "updated_at": "2024-01-24T00:00:00Z"},
],
```
I can assume that the dates for id 9 and id 11 are not so important and I can understand why the date for id 10 is important.
I'm wondering if there is anything you would add to this list. Again, I'm not asking to change the tests but I feel it is crucial to eventually improve them because we have a good idea of what's going on here but for someone new to cursors, they will freak out.
I agree, these tests are quite complex, and something could easily be missed during updates. I’ll add documentation to explain what is being filtered and why the arguments are structured as they are. Additionally, I’ll incorporate variables to simplify the tests and improve readability.
Don't spend too much time. I think we can socialize the idea that anyone that needs to update those test should try to improve them. This is not a blocker for me to merge
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (5)
31-32: Consider adding more specific type hints for create_function.

The `create_function` parameter could benefit from more specific type hints. Would you consider updating it to explicitly show the expected parameter types? Wdyt?

```diff
-    def __init__(self, create_function: Callable[..., ConcurrentCursor]):
+    def __init__(
+        self,
+        create_function: Callable[[Mapping[str, Any], Optional[Any]], ConcurrentCursor]
+    ):
```
42-54: Consider enhancing documentation with thread safety details.

The class uses locks and semaphores, but the documentation doesn't mention thread safety considerations. Would you like to add a section about thread safety and concurrent access patterns? Wdyt?
57-58: Consider using frozen dictionaries for constants.

`_NO_STATE` and `_NO_CURSOR_STATE` are mutable mappings. To prevent accidental modifications, would you consider using `MappingProxyType` or `frozendict`? Wdyt?

```diff
-    _NO_STATE: Mapping[str, Any] = {}
-    _NO_CURSOR_STATE: Mapping[str, Any] = {}
+    _NO_STATE: Mapping[str, Any] = MappingProxyType({})
+    _NO_CURSOR_STATE: Mapping[str, Any] = MappingProxyType({})
```
102-121: Consider optimizing state copying strategy.

The `state` property performs deep copies of cursor states for each partition. For better performance with large states, would you consider:
- Using shallow copies where safe
- Implementing copy-on-write
- Caching the state if it hasn't changed
Wdyt?
306-314: Enhance error messages for better debugging.

The error messages could be more specific about what caused the invalid state. Would you consider including more context in the error messages? Wdyt?

```diff
-    "Invalid state as stream slices that are emitted should refer to an existing cursor"
+    f"Invalid state: Partition key '{partition_key}' not found in cursor_per_partition. This might indicate a race condition or premature record emission."
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
216-218: ⚠️ Potential issue

Review partition limit logic.

The `limit_reached` method returns true when `_over_limit > DEFAULT_MAX_PARTITIONS_NUMBER`, but `_over_limit` is incremented each time we remove a partition. This means we'll only return true after exceeding the limit by the full amount. Should this be `>=` instead? Wdyt?

```diff
-        return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER
+        return self._over_limit >= self.DEFAULT_MAX_PARTITIONS_NUMBER
```

Likely invalid or redundant comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)
64-98: Consider adding type hints for class variables.

The initialization is well-structured, but adding type hints for class variables would improve code clarity and help catch potential type-related issues early. Wdyt about adding them like this?

```diff
-    _cursor_per_partition: OrderedDict[str, ConcurrentCursor] = OrderedDict()
-    _semaphore_per_partition: OrderedDict[str, threading.Semaphore] = OrderedDict()
-    _finished_partitions: set[str] = set()
-    _global_cursor: Optional[StreamState] = {}
-    _new_global_cursor: Optional[StreamState] = None
-    _lookback_window: int = 0
-    _parent_state: Optional[StreamState] = None
-    _over_limit: int = 0
```
104-123: Consider optimizing state generation for large number of partitions.

The state property performs deep copies of cursor states for each partition. This could become a performance bottleneck with many partitions. Would it make sense to optimize this by:
- Processing states incrementally?
- Using shallow copies where possible?
- Implementing a caching mechanism?
Wdyt?
224-291: Consider adding error handling for invalid state format.

The `_set_initial_state` method has good documentation but could benefit from explicit error handling for invalid state formats. For example:

```diff
     if not stream_state:
         return
+    if not isinstance(stream_state, dict):
+        raise ValueError(f"Invalid state format. Expected dict, got {type(stream_state)}")
+
     if self._PERPARTITION_STATE_KEY not in stream_state:
```

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
296-349: Consider extracting complex condition to a method.

The condition for checking if a stream supports concurrent partition processing with a global cursor is quite complex. Would it make sense to extract it to a separate method for better readability? Something like:

```diff
+    def _supports_concurrent_partition_with_global_cursor(
+        self,
+        declarative_stream: DeclarativeStream,
+        incremental_sync_component_definition: Mapping[str, Any],
+    ) -> bool:
+        return (
+            incremental_sync_component_definition
+            and incremental_sync_component_definition.get("type", "")
+            == DatetimeBasedCursorModel.__name__
+            and self._stream_supports_concurrent_partition_processing(
+                declarative_stream=declarative_stream
+            )
+            and hasattr(declarative_stream.retriever, "stream_slicer")
+            and isinstance(
+                declarative_stream.retriever.stream_slicer, PerPartitionWithGlobalCursor
+            )
+        )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (5 hunks)
- airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1 hunks)
- airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/incremental/global_substream_cursor.py (1)
67-69: LGTM! Simple and effective implementation.

The `is_running` method provides a clean way to check the timer's state. The implementation is thread-safe and follows the single responsibility principle.

airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)
30-39: LGTM! Clean factory implementation.

The factory pattern is well implemented with a clear separation of concerns. The create method signature is explicit about its parameters and return type.
208-220: Verify partition limit logic.

In `_ensure_partition_limit`, we increment `_over_limit` every time we remove a partition when the limit is reached. However, this could lead to edge cases where partitions are unnecessarily dropped. Should we consider:

- Resetting `_over_limit` when partitions are successfully processed?
- Using a different metric for tracking over-limit conditions?

Wdyt?
221-223: Potential off-by-one error in limit check.

The `limit_reached` method returns `True` when `_over_limit > DEFAULT_MAX_PARTITIONS_NUMBER`. Should this be `>=` instead? The current logic means we need to exceed the limit by more than one partition before considering it reached. Wdyt?
440-459: LGTM! Well-documented optimization.

The `_get_retriever` method contains important optimizations for cursor and state management. The comments clearly explain the rationale behind the changes.
Regression tests for Jira: GitHub Action Failure: Issues:
Regression tests for Chargebee: GitHub Action Failure: The Catalog Changes:
Summary by CodeRabbit

- New Features
- Improvements
- Technical Updates
  - `ConcurrentDeclarativeSource` and improved error handling in tests.