-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(low-code concurrent): Add use_global_cursor
flag to ConcurrentPerPartitionCursor
#279
feat(low-code concurrent): Add use_global_cursor
flag to ConcurrentPerPartitionCursor
#279
Conversation
📝 WalkthroughWalkthroughThe pull request introduces modifications to the Changes
Sequence DiagramsequenceDiagram
participant Cursor as ConcurrentPerPartitionCursor
participant Partition as Partition
Cursor->>Cursor: Initialize _use_global_cursor
Cursor->>Partition: Observe partition
alt Partition limit reached
Cursor->>Cursor: Set _use_global_cursor = True
else Partition limit not reached
Cursor->>Cursor: Maintain per-partition state
end
Cursor->>Cursor: Emit state based on _use_global_cursor
Possibly related PRs
Suggested reviewers
Hey there! 👋 I noticed some interesting state management improvements. What do you think about the new ✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (15)
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (11)
724-724
: Consider a dedicated test scenario for this setting
Would you consider adding a separate test scenario for when"use_global_cursor"
isTrue
just to ensure coverage of that path? wdyt?
1125-1125
: Add a quick rationale
Could you leave a brief inline comment on why"use_global_cursor"
isFalse
here, so future readers quickly grasp the reasoning? wdyt?
1175-1227
: Reduce nesting for clarity
This dictionary is quite large. Would you consider factoring out nested chunks or using helper methods to make it easier to read and maintain? wdyt?
1228-1228
: Check necessity of deepcopy
Is it crucial to usecopy.deepcopy
here rather than a direct assignment? If so, could you add a short comment explaining why? wdyt?
1230-1230
: Clarify test purpose
This line forces"use_global_cursor"
toTrue
. Could you note which exact behavior this is meant to verify? wdyt?
1393-1425
: Cover empty-partition edge case
Might it be beneficial to include a test with"use_global_cursor": True
and zero partitions ininitial_state
to ensure the fallback logic works as expected? wdyt?
1538-1538
: Maintain consistency
We set"use_global_cursor": False
in multiple tests. Would you prefer a single constant or fixture to keep them in sync and reduce duplication? wdyt?
1713-1713
: Possible refactor
We keep repeating"use_global_cursor": False
. Shall we define a shared helper or fixture to avoid accidental typos or mismatch? wdyt?
2251-2251
: Centralize settings
Would it be simpler to store"use_global_cursor"
in a single place or constant to adjust it for every test if needed? wdyt?
2350-2350
: Consider environment-driven approach
Would you consider reading"use_global_cursor"
from an environment variable for more flexible local testing or automation? wdyt?
2878-2878
: Confirm default consistency
Is defaulting toFalse
for"use_global_cursor"
consistent across other parts of the codebase? Maybe unify to avoid confusion. wdyt?airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (4)
98-98
: Constructor default
Would it be clearer to accept"use_global_cursor"
as an explicit constructor parameter instead of always defaulting toFalse
? wdyt?
109-120
: Behavior whenuse_global_cursor
is True
Would it help to log or warn if we have non-empty partition states whileuse_global_cursor
is active, to highlight potentially conflicting usage? wdyt?
232-232
: Naming_over_limit
Could_over_limit
be renamed to something like_excess_partition_count
to clarify its purpose? wdyt?
305-306
: Type validation
What ifstream_state.get("use_global_cursor")
is not a bool? Maybe you could raise a warning or exception to avoid unexpected behavior? wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
(7 hunks)unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py
(11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (6)
unit_tests/sources/declarative/incremental/test_concurrent_perpartitioncursor.py (3)
Line range hint
1234-1391
: Expand coverage in parameterization
Could you consider adding a case where"use_global_cursor"
is set to a non-boolean to ensure validation or error handling? wdyt?
1706-1706
: Mid-sync toggles
Are there tests covering a mid-sync transition fromFalse
toTrue
? That scenario might reveal corner cases. wdyt?
1983-1983
: Check partial sync states
How does toggling"use_global_cursor"
after some partitions have synced affect the final state? Adding a targeted test might clarify. wdyt?airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (3)
153-154
: Potential missed states
Whenuse_global_cursor
isTrue
, we skip this emission. Do we risk losing intermediate states? Perhaps a comment clarifying would help. wdyt?
330-332
: Switching to global cursor
Switching touse_global_cursor
once the limit is reached might drop or consolidate partition states unexpectedly. Would you consider logging a clear message about this transition? wdyt?
372-373
: Comparison operator
Shouldn’t this be>=
to indicate the limit is reached exactly atDEFAULT_MAX_PARTITIONS_NUMBER
? wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (4)
109-120
: Consider extracting state building logic for better maintainability?The state property could be more maintainable by extracting the per-partition state building logic into a separate method. This would make the code easier to test and modify. Additionally, we could optimize the dictionary creation. WDYT about something like this?
@property def state(self) -> MutableMapping[str, Any]: - state: dict[str, Any] = {"use_global_cursor": self._use_global_cursor} + state = self._build_base_state() if not self._use_global_cursor: - states = [] - for partition_tuple, cursor in self._cursor_per_partition.items(): - if cursor.state: - states.append( - { - "partition": self._to_dict(partition_tuple), - "cursor": copy.deepcopy(cursor.state), - } - ) - state[self._PERPARTITION_STATE_KEY] = states + state[self._PERPARTITION_STATE_KEY] = self._build_partition_states() # ... rest of the method + def _build_base_state(self) -> dict[str, Any]: + return {"use_global_cursor": self._use_global_cursor} + + def _build_partition_states(self) -> list[dict[str, Any]]: + return [ + { + "partition": self._to_dict(partition_tuple), + "cursor": copy.deepcopy(cursor.state), + } + for partition_tuple, cursor in self._cursor_per_partition.items() + if cursor.state + ]🧰 Tools
🪛 GitHub Actions: Linters
[warning] File requires formatting. The code does not conform to the project's formatting standards.
153-154
: Add a comment explaining the conditional state emission?The condition
if not self._use_global_cursor
determines when to emit state messages, but it might not be immediately clear why. Would you consider adding a brief comment explaining this behavior? Something like:+ # Only emit state messages for per-partition state management + # When using global cursor, state emission is deferred until all partitions are processed if not self._use_global_cursor: self._emit_state_message()🧰 Tools
🪛 GitHub Actions: Linters
[warning] File requires formatting. The code does not conform to the project's formatting standards.
232-240
: Consider improving the over-limit tracking?A few suggestions to make this more robust:
The over-limit counter (
self._over_limit
) is an unbounded integer. For long-running syncs, should we consider using a bounded counter or resetting it periodically to prevent potential overflow?The log message includes the limit value directly. Would it be clearer to use a class constant or format string? Something like:
+ _PARTITION_LIMIT_MESSAGE = "The maximum number of partitions ({limit}) has been reached. Dropping the oldest {state} partition: {partition}. Over limit: {count}." + def _ensure_partition_limit(self) -> None: with self._lock: while len(self._cursor_per_partition) > self.DEFAULT_MAX_PARTITIONS_NUMBER - 1: self._over_limit += 1 for partition_key in list(self._cursor_per_partition.keys()): if partition_key in self._finished_partitions and self._semaphore_per_partition[partition_key]._value == 0: oldest_partition = self._cursor_per_partition.pop(partition_key) logger.warning( - f"The maximum number of partitions has been reached. Dropping the oldest finished partition: {oldest_partition}. Over limit: {self._over_limit}." + self._PARTITION_LIMIT_MESSAGE.format( + limit=self.DEFAULT_MAX_PARTITIONS_NUMBER, + state="finished", + partition=oldest_partition, + count=self._over_limit + ) )🧰 Tools
🪛 GitHub Actions: Linters
[warning] File requires formatting. The code does not conform to the project's formatting standards.
330-332
: Consider refining the global cursor transition logic?A few thoughts on the current implementation:
The transition to global cursor is one-way (irreversible). Should we consider conditions where we might want to switch back to per-partition state?
The
limit_reached
method uses a simple comparison. Would it be more accurate to consider both the current partition count and the over-limit counter? Something like:def limit_reached(self) -> bool: - return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER + return ( + len(self._cursor_per_partition) >= self.DEFAULT_MAX_PARTITIONS_NUMBER and + self._over_limit > 0 + )What are your thoughts on these suggestions?
Also applies to: 372-373
🧰 Tools
🪛 GitHub Actions: Linters
[warning] File requires formatting. The code does not conform to the project's formatting standards.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
(7 hunks)
🧰 Additional context used
🪛 GitHub Actions: Linters
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
[warning] File requires formatting. The code does not conform to the project's formatting standards.
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: Build Python Package
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
Line range hint
1-373
: Run the formatter to fix formatting issues?The pipeline indicates that the file requires formatting. Would you mind running the project's formatter to ensure the code conforms to the formatting standards?
🧰 Tools
🪛 GitHub Actions: Linters
[warning] File requires formatting. The code does not conform to the project's formatting standards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question regarding thread safety
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
Outdated
Show resolved
Hide resolved
/autofix
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (5)
109-120
: Consider documenting theuse_global_cursor
field in the state dictionary.The state dictionary now includes a new field that affects behavior, but its purpose isn't immediately clear from the code. Would you consider adding a docstring explaining when and why this flag is set? wdyt?
153-154
: Consider adding a comment explaining the state emission condition.The state emission is skipped when using global cursor. Would it be helpful to add a brief comment explaining why we don't emit state in this case? This could help future maintainers understand the design decision. wdyt?
232-243
: Consider making the warning message more actionable.The warning message now includes the over limit count, which is great for observability. Would it be helpful to include guidance on what users should do when they see this warning? For example, suggesting they adjust their partition strategy or increase the limit if needed? wdyt?
333-335
: Consider logging the switch to global cursor.When we switch to using the global cursor due to reaching the limit, it might be helpful to log this significant change in behavior. This could help with debugging and monitoring. wdyt?
if not self._use_global_cursor and self.limit_reached(): + logger.info("Switching to global cursor as partition limit has been reached") self._use_global_cursor = True
375-376
: Consider caching the limit check result.The
limit_reached
method is called frequently in the observe method. Would it make sense to cache the result once we exceed the limit since it can't go back down? This could avoid unnecessary comparisons. wdyt?+ _limit_reached: bool = False + def limit_reached(self) -> bool: - return self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER + if not self._limit_reached: + self._limit_reached = self._over_limit > self.DEFAULT_MAX_PARTITIONS_NUMBER + return self._limit_reached
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py
(7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (1)
airbyte_cdk/sources/declarative/incremental/concurrent_partition_cursor.py (1)
308-309
: Consider the implications of the default value.The
use_global_cursor
is initialized with a default ofFalse
. Have you considered if this is the safest default for all scenarios? For example, what happens if we're migrating from an older version that didn't have this flag? wdyt?
Summary by CodeRabbit
New Features
Tests