fix: add request count with eviction for not retriable requests to fix memory problem #92

maxi297 · 2024-11-27T13:24:41Z

what

Addresses high memory consumption when enabling concurrency in https://github.com/airbytehq/oncall/issues/6821.

The _request_attempt_count attribute keeps growing as multiple requests are made using the same http_client.
To mitigate this issue, we evict keys for completed requests once we confirm that no further retries are needed.
This helps manage memory usage more efficiently while maintaining the necessary logic for retry attempts.
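To make the mechanism above concrete, here is a minimal sketch, not the CDK's actual code. `AttemptTracker`, `record_attempt`, and `evict` are hypothetical names, and plain strings stand in for the prepared requests that the real client uses as keys.

```python
from typing import Dict, Hashable


class AttemptTracker:
    """Counts attempts per request and forgets a request once it is finished."""

    def __init__(self) -> None:
        self._attempt_count: Dict[Hashable, int] = {}

    def record_attempt(self, request: Hashable) -> int:
        # The first attempt creates the entry; each retry increments it.
        self._attempt_count[request] = self._attempt_count.get(request, 0) + 1
        return self._attempt_count[request]

    def evict(self, request: Hashable) -> None:
        # pop() with a default is a no-op when the key was never recorded.
        self._attempt_count.pop(request, None)

    def __len__(self) -> int:
        return len(self._attempt_count)


tracker = AttemptTracker()
for page in range(1000):
    request = f"GET /items?page={page}"  # stand-in for a prepared request
    tracker.record_attempt(request)      # first try
    tracker.record_attempt(request)      # one retry
    tracker.evict(request)               # no further retry is needed

print(len(tracker))
```

Without the `evict` call the store would end the loop holding 1000 entries; with it, the store holds only requests that are still in flight.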

Summary by CodeRabbit

New Features
- Enhanced error handling and memory management in the HTTP client, improving request retry logic and clarity of error messages.
- Introduced a mechanism to evict completed requests from memory based on retry conditions.
Bug Fixes
- Improved error message clarity and ensured sensitive information is filtered out during logging.
Tests
- Added new tests to validate request eviction logic and enhanced existing tests for better error handling and robustness.
Chores
- Added a new development dependency (pympler) for improved memory profiling.

airbyte_cdk/sources/streams/http/expiring_dictionary.py

maxi297

LGTM for this part. I'm mostly wondering about the other changes in maxi297/test_closing_requests_objects now.

airbyte_cdk/sources/streams/http/http_client.py

maxi297 · 2024-11-28T13:53:19Z

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

maxi297

There seems to be a test failing but apart from that, the change looks good to me. Thanks for hopping onto this one Aldo! (note that you will need to auto approve as I was the one opening this PR)

Following the release of this, can we release source-klaviyo with an updated CDK version?

aldogonzalez8 · 2024-11-28T13:57:38Z

> There seems to be a test failing but apart from that, the change looks good to me. Thanks for hopping onto this one Aldo! (note that you will need to auto approve as I was the one opening this PR)
>
> Following the release of this, can we release source-klaviyo with an updated CDK version?

Auto approve is my passion.

Yes, I will look at the test and later follow up with a Klaviyo release.

coderabbitai · 2024-11-28T14:01:10Z

Walkthrough

The changes in this pull request primarily enhance the HttpClient class in the airbyte_cdk library by introducing a new constant for retry actions, improving error handling, and implementing a method for evicting completed requests from memory. Additionally, the pyproject.toml file is updated to include a new development dependency, pympler, while maintaining the organization of existing dependencies. The test suite for HttpClient is also modified to include a new test for request eviction and to enhance existing tests for robustness.

Changes

| File | Change Summary |
| --- | --- |
| `airbyte_cdk/sources/streams/http/http_client.py` | Added constant `_ACTIONS_TO_RETRY_ON` for retry actions. Introduced method `_evict_key` for memory management. Updated error handling in `_handle_error_resolution` for clearer logging and sensitive information filtering. |
| `pyproject.toml` | Added `pympler` as a development dependency. Improved organization and readability of existing dependencies. |
| `unit_tests/sources/streams/http/test_http_client.py` | Added new test function `test_evicting_requests_for_request_count`. Enhanced existing tests with additional assertions and error handling. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HttpClient
    participant Server

    Client->>HttpClient: Send request
    HttpClient->>Server: Forward request
    Server-->>HttpClient: Response (success/error)
    HttpClient->>HttpClient: Handle response
    alt Error response
        HttpClient->>HttpClient: Check if retry is needed
        HttpClient->>HttpClient: Evict key if no retry
    end
    HttpClient-->>Client: Return response
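The flow in the diagram can be sketched roughly as follows. This is an illustration under assumptions, not the CDK implementation: `Action`, `ACTIONS_TO_RETRY_ON`, `send_with_retries`, and `max_attempts` are hypothetical stand-ins, a plain loop replaces whatever backoff machinery the real client uses, and this sketch evicts on success as well as on non-retriable errors.

```python
from enum import Enum
from typing import Callable, Dict


class Action(Enum):  # hypothetical stand-in for the client's response actions
    SUCCESS = "success"
    RETRY = "retry"
    RATE_LIMITED = "rate_limited"
    FAIL = "fail"


# Membership in this set is the only thing that keeps a request in the store.
ACTIONS_TO_RETRY_ON = {Action.RETRY, Action.RATE_LIMITED}


def send_with_retries(
    request: str,
    send: Callable[[str], Action],
    attempts: Dict[str, int],
    max_attempts: int = 5,
) -> Action:
    """Send until the action is not retriable or attempts run out, then evict."""
    while True:
        attempts[request] = attempts.get(request, 0) + 1
        action = send(request)
        should_retry = (
            action in ACTIONS_TO_RETRY_ON and attempts[request] < max_attempts
        )
        if not should_retry:
            # No further retry will happen, so the counter entry is dead weight.
            attempts.pop(request, None)
            return action


responses = iter([Action.RETRY, Action.RATE_LIMITED, Action.SUCCESS])
attempts: Dict[str, int] = {}
result = send_with_retries("GET /profiles", lambda _request: next(responses), attempts)
print(result, attempts)
```

The point of the sketch is the single exit path: every way out of the loop passes through the eviction, so the store cannot keep entries for finished requests.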

What do you think of these changes? Do they align with your expectations for improving the HttpClient functionality?

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a4fcd16 and ffe1093.

📒 Files selected for processing (1)

unit_tests/sources/streams/http/test_http_client.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

unit_tests/sources/streams/http/test_http_client.py


coderabbitai

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)

pyproject.toml (1)

97-97: Consider pinning the pympler version for reproducible builds

Hey! I see you've added pympler for memory profiling, which aligns perfectly with the memory optimization goals 🎯. However, using "*" might lead to unexpected behavior if a breaking change is introduced in future versions. Would you consider pinning to a specific version or version range (e.g., "^1.0.0") for more predictable builds? wdyt?

unit_tests/sources/streams/http/test_http_client.py (2)

416-417: Consider extracting magic numbers as constants for better maintainability.

The values 1000 and 250 could be more meaningful as named constants. What do you think about something like this?
+TOTAL_TEST_REQUESTS = 1000
+MAX_MEMORY_THRESHOLD_BYTES = 250  # Empirically determined threshold

-    requests_to_make = 1000
+    requests_to_make = TOTAL_TEST_REQUESTS
     for requests_count in range(requests_to_make):
         ...
-    assert size_of_request_count_store < 250
+    assert size_of_request_count_store < MAX_MEMORY_THRESHOLD_BYTES
Also applies to: 423-424

422-422: The memory comparison comment could be more descriptive.

The comment about 0.5 MB could be clearer about how this number was determined. Maybe something like:
-    # for the number of requests_to_make if we didn't evict the requests we could see increase the value ~0.5 MB
+    # Without request eviction, storing 1000 requests would consume ~0.5 MB (empirically measured)
wdyt?

airbyte_cdk/sources/streams/http/http_client.py (1)

356-358: Are we evicting keys in all necessary scenarios?

We currently call _evict_key when response_action is not in _ACTIONS_TO_RETRY_ON. Should we also consider evicting keys for other response actions to prevent potential memory leaks? What do you think?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5f0831e and 765c5ea.

⛔ Files ignored due to path filters (1)

poetry.lock is excluded by !**/*.lock

📒 Files selected for processing (3)

airbyte_cdk/sources/streams/http/http_client.py (5 hunks)
pyproject.toml (1 hunks)
unit_tests/sources/streams/http/test_http_client.py (2 hunks)

🔇 Additional comments (4)

pyproject.toml (1)

97-97: Verify pympler usage in test files

The addition of pympler in dev dependencies looks good! To ensure it's being used effectively for memory profiling, let's verify its implementation:

✅ Verification successful

Pympler is used appropriately for memory profiling, but could we expand its usage? 🤔

The dependency is correctly placed in dev dependencies and is being used to measure memory allocation in HTTP client tests. I see it's specifically checking that the request attempt count store stays under 250 bytes - neat!

Would you consider adding more memory checks in other critical areas of the codebase where memory usage is important? Just a thought since we have the tool available now! wdyt?

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for pympler imports and usage in test files
echo "Checking pympler usage in test files:"
rg -l "from pympler" unit_tests/
rg -l "import pympler" unit_tests/

Length of output: 208

Script:

#!/bin/bash
# Let's check the actual implementation to understand how pympler is being used
echo "Checking pympler usage in http_client test:"
rg -A 5 "pympler" unit_tests/sources/streams/http/test_http_client.py

# Let's also check if there are any other memory-related tests
echo -e "\nChecking for memory-related tests:"
rg -i "memory|mem_|size|allocation" unit_tests/sources/streams/http/test_http_client.py

Length of output: 645

unit_tests/sources/streams/http/test_http_client.py (2)

5-5: LGTM! Clean import addition.

The asizeof import from pympler is well-placed and essential for memory measurements.

378-425: Consider potential test flakiness due to system-dependent memory measurements.

The test relies on absolute memory measurements which could vary across different systems or CI environments. Would you consider either:

- Making the threshold relative (e.g., compare initial vs final memory usage)
- Adding a tolerance factor to the assertion
- Skipping the test in CI environments
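A relative threshold could look something like the sketch below. It is only an illustration: `store_size_after` is a hypothetical helper, and the standard library's `sys.getsizeof` is swapped in for pympler's `asizeof`, so it measures the dict container only and not the keys and values it references.

```python
import sys
from typing import Dict


def store_size_after(requests_to_make: int, evict: bool) -> int:
    """Return the size in bytes of the dict container after simulated requests."""
    store: Dict[str, int] = {}
    for i in range(requests_to_make):
        key = f"request-{i}"
        store[key] = 1          # attempt recorded
        if evict:
            del store[key]      # request finished, entry dropped
    return sys.getsizeof(store)


with_eviction = store_size_after(1000, evict=True)
without_eviction = store_size_after(1000, evict=False)

# Relative assertion: compares two measurements taken on the same interpreter
# instead of relying on a hard-coded byte count.
assert with_eviction < without_eviction
print(with_eviction, without_eviction)
```

Because both numbers come from the same run, the comparison should hold across platforms where the absolute sizes differ.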

Let's check if this test is already flaky in CI:

airbyte_cdk/sources/streams/http/http_client.py (1)

385-385: Can we ensure filter_secrets sanitizes all sensitive data?

While using filter_secrets on the error_message before logging, are we confident it effectively sanitizes all sensitive information, including any secrets in request.url and request.body? Should we verify this to prevent accidental leakage? Wdyt?
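As a rough illustration of why this question matters, here is a sketch of value-based masking. `redact` is a hypothetical helper written for this example and is not the CDK's `filter_secrets`; it assumes masking works by replacing known secret values, in which case a secret that was never registered passes through untouched.

```python
from typing import Iterable


def redact(message: str, secrets: Iterable[str], mask: str = "****") -> str:
    """Replace every occurrence of each known secret with a mask."""
    # Longest first, so a secret that contains another one is masked whole.
    for secret in sorted(set(secrets), key=len, reverse=True):
        if secret:  # an empty string would match everywhere
            message = message.replace(secret, mask)
    return message


error_message = (
    "GET https://example.com/api?api_key=abc123 failed; "
    "body={'token': 'abc123-long'}"
)
print(redact(error_message, ["abc123", "abc123-long"]))
print(redact(error_message, ["abc123-long"]))  # 'abc123' was never registered
```

The second call shows the failure mode: the URL's `api_key` value survives because only registered values are masked.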

airbyte_cdk/sources/streams/http/http_client.py

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

unit_tests/sources/streams/http/test_http_client.py (1)

379-426: Consider enhancing the test's maintainability and clarity

A few suggestions to make this test more robust and maintainable:

- The magic number 250 for size assertion could be a named constant at the module level, what do you think?
- The test documentation could better explain the memory management strategy being tested.
- We could parameterize this test to cover different scenarios (e.g., different request counts, size thresholds).
- The nonlocal variable usage could be simplified.

Here's a possible refactor, wdyt?

+# Maximum allowed size in bytes for the request count store
+MAX_REQUEST_COUNT_STORE_SIZE = 250

-@pytest.mark.usefixtures("mock_sleep")
-def test_evicting_requests_for_request_count():
+@pytest.mark.usefixtures("mock_sleep")
+@pytest.mark.parametrize("requests_to_make,max_size", [
+    (1000, MAX_REQUEST_COUNT_STORE_SIZE),
+    (500, MAX_REQUEST_COUNT_STORE_SIZE // 2),
+])
+def test_evicting_requests_for_request_count(requests_to_make, max_size):
+    """Test that the request count store's memory usage remains bounded.
+    
+    This test verifies that the HttpClient properly evicts completed requests
+    from its request count store to prevent unbounded memory growth.
+    """
     mocked_session = MagicMock(spec=requests.Session)
     valid_response = MagicMock(spec=requests.Response)
     valid_response.status_code = 200
     valid_response.ok = True
     valid_response.headers = {}
-    call_count = 3
+    retry_count = 3
 
     def update_response(*args, **kwargs):
-        nonlocal call_count
-        if http_client._session.send.call_count == call_count:
-            call_count += 3
+        if http_client._session.send.call_count % retry_count == 0:
             return valid_response
-        else:
-            retry_response = MagicMock(spec=requests.Response)
-            retry_response.ok = False
-            retry_response.status_code = 408
-            retry_response.headers = {}
-            return retry_response
+        retry_response = MagicMock(spec=requests.Response)
+        retry_response.ok = False
+        retry_response.status_code = 408
+        retry_response.headers = {}
+        return retry_response

     # ... rest of the test ...
-    assert size_of_request_count_store < 250
+    assert size_of_request_count_store < max_size
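The modulo-based mock from this suggestion can be tried in isolation. The sketch below is a stripped-down assumption of the pattern: it uses a bare `MagicMock` instead of a `requests.Session`, and `make_send_mock` and `RETRY_COUNT` are names invented for the example.

```python
from unittest.mock import MagicMock

RETRY_COUNT = 3  # every third call succeeds, the others ask for a retry


def make_send_mock() -> MagicMock:
    send = MagicMock()

    def respond(*args, **kwargs):
        # call_count already includes the call currently being handled.
        response = MagicMock()
        response.ok = send.call_count % RETRY_COUNT == 0
        response.status_code = 200 if response.ok else 408
        response.headers = {}
        return response

    send.side_effect = respond
    return send


send = make_send_mock()
status_codes = [send("request").status_code for _ in range(6)]
print(status_codes)  # [408, 408, 200, 408, 408, 200]
```

This yields the same success pattern as the `nonlocal` counter (calls 3, 6, 9, ...) without any mutable state outside the mock.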

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 765c5ea and a4fcd16.

📒 Files selected for processing (3)

airbyte_cdk/sources/streams/http/http_client.py (3 hunks)
pyproject.toml (1 hunks)
unit_tests/sources/streams/http/test_http_client.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

pyproject.toml

🔇 Additional comments (5)

airbyte_cdk/sources/streams/http/http_client.py (4)

79-79: Nice addition of the retry actions constant! 👍

The set-based approach for retry actions improves readability and maintainability. The constant effectively encapsulates the retry logic in one place.

382-384: Clean integration of the eviction logic! 🎯

The eviction is well-placed to occur when we're certain no more retries are needed. The code reads naturally and maintains the existing error handling flow.

363-373: 🛠️ Refactor suggestion

Consider enhancing thread safety for the eviction method

The implementation looks good and the docstring is very informative! However, in concurrent scenarios, the check-then-delete pattern might lead to race conditions. What do you think about using a thread-safe data structure for _request_attempt_count? For example:

- self._request_attempt_count: Dict[requests.PreparedRequest, int] = {}
+ from threading import Lock
+ from collections import defaultdict
+ self._request_attempt_count_lock = Lock()
+ self._request_attempt_count = defaultdict(int)

Then in _evict_key:

 def _evict_key(self, prepared_request: requests.PreparedRequest) -> None:
-    if prepared_request in self._request_attempt_count:
-        del self._request_attempt_count[prepared_request]
+    with self._request_attempt_count_lock:
+        if prepared_request in self._request_attempt_count:
+            del self._request_attempt_count[prepared_request]

This would ensure thread-safe operations. Wdyt? 🤔

Likely invalid or redundant comment.
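For reference, the lock-based variant discussed in this comment could be sketched as below. `LockedAttemptCounter` and `worker` are hypothetical names, string keys stand in for prepared requests, and whether the real client needs this at all is exactly what the comment leaves open.

```python
import threading
from typing import Dict


class LockedAttemptCounter:
    """Attempt counter whose read-modify-write steps are guarded by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts: Dict[str, int] = {}

    def increment(self, key: str) -> int:
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]

    def evict(self, key: str) -> None:
        with self._lock:
            # pop() with a default avoids the separate check-then-delete step.
            self._counts.pop(key, None)

    def size(self) -> int:
        with self._lock:
            return len(self._counts)


counter = LockedAttemptCounter()


def worker(worker_id: int) -> None:
    for i in range(200):
        key = f"worker-{worker_id}-request-{i}"
        counter.increment(key)
        counter.evict(key)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter.size())
```

Note that `dict.pop(key, None)` alone already removes the check-then-delete pattern the comment worries about; the lock mainly protects the increment.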

Line range hint 363-384: Verify memory optimization effectiveness

Let's confirm that the memory usage is indeed optimized with these changes.

✅ Verification successful

Memory optimization implementation looks solid! 🎯

The implementation is well-tested and effectively addresses the memory consumption issue:

- Found memory-specific test in unit_tests/sources/streams/http/test_http_client.py that verifies the memory footprint stays under 250 bytes
- The _request_attempt_count is properly managed with eviction when requests are completed
- Implementation is used consistently across concurrent scenarios (verified through test coverage)
- Test cases cover both regular retries and exponential backoff scenarios

The memory optimization appears to be working as intended, with proper test coverage and implementation. The size assertion in tests provides a concrete guarantee against memory bloat.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for memory-related tests and verify the implementation

# Look for memory-related tests
echo "Searching for memory-related tests..."
rg -l "test.*memory|test.*evict" --type python

# Check for concurrent testing scenarios
echo "Searching for concurrent testing patterns..."
rg -l "concurrent|threading|asyncio" --type python

# Look for other request count implementations that might need similar optimization
echo "Searching for other request counting patterns..."
ast-grep --pattern 'request_attempt_count = {$$$}'

Length of output: 507

Script:

#!/bin/bash
# Let's try again with correct file type specifications

# Look for memory-related tests
echo "Searching for memory-related tests..."
rg -l "test.*memory|test.*evict"

# Check for concurrent testing scenarios
echo "Searching for concurrent testing patterns..."
rg -l "concurrent|threading|asyncio"

# Look for the actual implementation of request counting
echo "Searching for request counting implementation..."
rg "_request_attempt_count" -A 2 -B 2

# Check for related test files
echo "Looking for related test files..."
fd "test.*http.*\.py$"

Length of output: 12692

unit_tests/sources/streams/http/test_http_client.py (1)

5-5: LGTM!

The import statement is well-placed and properly used for memory size measurement.

aldogonzalez8

APPROVED

aldogonzalez8 · 2024-11-29T18:25:27Z

I will release it on 2024-12-02.

maxi297 and others added 8 commits November 22, 2024 11:12

attempt to clean response and session

90df66b

Log response if failure

d630a9b

Remove HttpClient._request_attempt_count that was used for backoff strategy

2a0219e

source-klaviyo: make expiration dictionary

af77521

source-klaviyo: remove comments

ce100c0

source-klaviyo: add more testing

3aa4999

source-klaviyo: add more testing

38b4985

source-klaviyo: add more testing

a284303

maxi297 commented Nov 27, 2024

airbyte_cdk/sources/streams/http/expiring_dictionary.py Outdated

aldogonzalez8 added 6 commits November 27, 2024 08:20

source-klaviyo: add test dependencies

e56c83c

source-klaviyo: update time for request if is retried

5a4a623

source-klaviyo: evict request for path when retry on is not needed

0c7bc2d

airbyte-cdk: use simple dictionary to store and evist request when no retry is needed

6e19df5

airbyte-cdk: fix for ruff check results

6162198

airbyte-cdk: fix mypy check results

0d40f92

maxi297 commented Nov 27, 2024

airbyte_cdk/sources/streams/http/http_client.py Outdated

airbyte_cdk/sources/streams/http/http_client.py

aldogonzalez8 added 2 commits November 27, 2024 12:16

airbyte-cdk: add comments for evict method

466adde

airbyte-cdk: remove close logic

34ffadf

aldogonzalez8 temporarily deployed to PyPi November 27, 2024 18:30 — with GitHub Actions Inactive

aldogonzalez8 marked this pull request as ready for review November 28, 2024 01:29

aldogonzalez8 self-assigned this Nov 28, 2024

aldogonzalez8 changed the title ~~Aldogonzalez8/airbyte cdk/add request count with expiration~~ Feat: add request count with expiration Nov 28, 2024

aldogonzalez8 changed the title ~~Feat: add request count with expiration~~ bug: add request count with expiration to fix memory problem Nov 28, 2024

aldogonzalez8 changed the title ~~bug: add request count with expiration to fix memory problem~~ fix: add request count with expiration to fix memory problem Nov 28, 2024

aldogonzalez8 changed the title ~~fix: add request count with expiration to fix memory problem~~ fix: add request count with eviction for not retriable requests to fix memory problem Nov 28, 2024

aldogonzalez8 changed the base branch from maxi297/test_closing_requests_objects to main November 28, 2024 13:42

airbyte-cdk: fix problem with mypy

765c5ea

github-actions bot added the bug (Something isn't working) and security labels Nov 28, 2024

maxi297 mentioned this pull request Nov 28, 2024

Enable ddtrace in image and update dependency airbytehq/airbyte#48609

Closed

2 tasks

Merge branch 'main' into aldogonzalez8/airbyte-cdk/add_request_count_with_expiration

b241766

maxi297 commented Nov 28, 2024

coderabbitai bot requested changes Nov 28, 2024

airbyte_cdk/sources/streams/http/http_client.py

airbyte-cdk: merge from main

a4fcd16

coderabbitai bot reviewed Nov 29, 2024

airbyte-cdk: fix ruff yelling

ffe1093

coderabbitai bot approved these changes Nov 29, 2024

aldogonzalez8 approved these changes Nov 29, 2024

aldogonzalez8 temporarily deployed to PyPi November 29, 2024 16:43 — with GitHub Actions Inactive

aldogonzalez8 mentioned this pull request Nov 29, 2024

🐛bug(source-klaviyo): bump cdk to more recent release airbytehq/airbyte#48748

Merged

2 tasks

aldogonzalez8 merged commit f95652b into main Dec 2, 2024
25 of 26 checks passed

aldogonzalez8 deleted the aldogonzalez8/airbyte-cdk/add_request_count_with_expiration branch December 2, 2024 14:50

coderabbitai bot mentioned this pull request Dec 4, 2024

fix(concurrency): support failed on http cache write #115

Open

coderabbitai bot mentioned this pull request Dec 13, 2024

fix: (HttpClient) rate limit fix unlimited tries #171

Merged
