Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fallback to json if orjson cannot serialize value #210

Merged
merged 4 commits into from
Jan 9, 2025

Conversation

artem1205
Copy link
Contributor

@artem1205 artem1205 commented Jan 9, 2025

What

resolve https://github.com/airbytehq/oncall/issues/7226

too big ( >int64) values can not be serialized by orjson library: ijl/orjson#116

How

fallback to json if orjson cannot serialize value

Test

Prerequisites

  • source-hardcoded-records
  • stream: dummyFields
  • config: { "count": 10000000 }

Test results

Test Time
default (no changes) 3m15.55s real
with try except block 3m14.50s real
with {"field5": 111111111111111111111} in DummyFields 4m35.39s real

Conclusion

  • in best case scenario we'll see no difference in performance
  • fallback to json will slow down by ≈ 30% (might be higher for nested dicts with more key:value pairs)

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced error handling for message serialization to improve robustness.
    • Implemented a fallback mechanism for serialization when the primary method fails.
  • Tests

    • Added a new test case to verify the fallback serialization behavior during errors.

@artem1205 artem1205 self-assigned this Jan 9, 2025
@github-actions github-actions bot added bug Something isn't working security labels Jan 9, 2025
@artem1205 artem1205 marked this pull request as ready for review January 9, 2025 12:07
@artem1205 artem1205 requested a review from maxi297 January 9, 2025 12:07
Copy link
Contributor

coderabbitai bot commented Jan 9, 2025

📝 Walkthrough

Walkthrough

The changes focus on improving error handling in the airbyte_message_to_string method within the AirbyteEntrypoint class. The modification introduces a more robust serialization approach by first using AirbyteMessageSerializer.dump and then attempting to serialize with orjson.dumps. If the orjson serialization fails, the method falls back to using json.dumps, preventing potential application crashes due to serialization errors. A global variable is added to ensure that serialization error warnings are logged only once.

Changes

File Change Summary
airbyte_cdk/entrypoint.py - Updated airbyte_message_to_string method with enhanced serialization error handling
- Added json module import for fallback serialization
- Introduced global variable _HAS_LOGGED_FOR_SERIALIZATION_ERROR to limit logging
unit_tests/test_entrypoint.py - Added test function test_given_serialization_error_using_orjson_then_fallback_on_json to validate fallback serialization behavior

Sequence Diagram

sequenceDiagram
    participant Entrypoint as AirbyteEntrypoint
    participant Serializer as AirbyteMessageSerializer
    participant OrJSON as orjson
    participant StdJSON as json

    Entrypoint->>Serializer: dump(airbyte_message)
    Serializer-->>Entrypoint: serialized_message
    Entrypoint->>OrJSON: dumps(serialized_message)
    alt Serialization Successful
        OrJSON-->>Entrypoint: JSON string
    else Serialization Failed
        Entrypoint->>StdJSON: dumps(serialized_message)
        StdJSON-->>Entrypoint: JSON string
    end
Loading

What do you think about this summary? Is there anything else you’d like to add or modify?


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29e506f and 5c5bc21.

📒 Files selected for processing (2)
  • airbyte_cdk/entrypoint.py (3 hunks)
  • unit_tests/test_entrypoint.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • unit_tests/test_entrypoint.py
  • airbyte_cdk/entrypoint.py
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)

Finishing Touches

  • 📝 Generate Docstrings

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte_cdk/entrypoint.py (1)

295-299: How about adding some improvements to the error handling? wdyt?

The fallback logic works, but we could make it more robust:

  1. Consider catching specific exceptions instead of using a bare except
  2. Maybe add debug logging when falling back to help with troubleshooting?
  3. Would it be helpful to add a comment about the ~30% performance impact when falling back?

Here's a suggestion:

     serialized_message = AirbyteMessageSerializer.dump(airbyte_message)
     try:
         return orjson.dumps(serialized_message).decode()
-    except:
+    except (orjson.JSONEncodeError, ValueError) as e:
+        logger.debug(f"Failed to serialize with orjson, falling back to json: {str(e)}")
+        # Note: Fallback to json.dumps may result in ~30% slower serialization
         return json.dumps(serialized_message)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e78eaff and cbae6cb.

📒 Files selected for processing (1)
  • airbyte_cdk/entrypoint.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (2)
airbyte_cdk/entrypoint.py (2)

8-8: LGTM!

The json import is correctly placed with other standard library imports.


295-299: Let's verify the usage and impact of this change

The changes look good, but let's verify the impact across the codebase.

✅ Verification successful

The serialization change looks good! 👍

The change aligns well with the codebase's existing patterns where orjson is already widely used for serialization. The fallback to json.dumps ensures backward compatibility in case of any edge cases.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Find all usages of airbyte_message_to_string to understand the impact
echo "Searching for airbyte_message_to_string usage:"
rg "airbyte_message_to_string" -B 2 -A 2

# Look for any performance-related tests that might need updating
echo -e "\nSearching for performance tests:"
rg -i "performance|benchmark" --type python

Length of output: 3510


Script:

#!/bin/bash
# Check for orjson import and usage
echo "Checking orjson imports:"
rg "import.*orjson" -B 2 -A 2

# Look for serialization-related tests
echo -e "\nChecking for serialization tests:"
rg -l "test.*serial|serial.*test" --type-add 'python:*.py' --type python

# Check for other json.dumps usage
echo -e "\nChecking other json.dumps usage:"
rg "json\.dumps" --type-add 'python:*.py' --type python -B 2 -A 2

Length of output: 70089

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the analysis!

I've added a log and a test. Can you have a look at it? If this is fine for you, I'll release the CDK and klaviyo

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
unit_tests/test_entrypoint.py (1)

773-788: Enhance test assertions and documentation?

The test verifies the fallback mechanism when orjson fails to serialize large integers. Consider these improvements, wdyt?

  1. Add docstring explaining the test's purpose and the specific scenario being tested
  2. Verify the content of the serialized messages to ensure data integrity
  3. Add assertions to confirm that the fallback to json actually occurred

Here's a suggested implementation:

 def test_given_serialization_error_using_orjson_then_fallback_on_json(entrypoint: AirbyteEntrypoint, mocker, spec_mock, config_mock):
+    """
+    Test that when orjson fails to serialize a value (e.g., integers > int64),
+    the system falls back to the standard json library successfully.
+    """
     parsed_args = Namespace(
         command="read", config="config_path", state="statepath", catalog="catalogpath"
     )
     record = AirbyteMessage(record=AirbyteRecordMessage(stream="stream", data={"data": 7046723166326052303072}, emitted_at=1), type=Type.RECORD)
+    # Mock logger to verify fallback warning
+    mock_logger = mocker.patch("airbyte_cdk.entrypoint.logger")
     mocker.patch.object(MockSource, "read_state", return_value={})
     mocker.patch.object(MockSource, "read_catalog", return_value={})
     mocker.patch.object(
         MockSource, "read", return_value=[record, record]
     )

     messages = list(entrypoint.run(parsed_args))

     # There will be multiple messages here because the fixture `entrypoint` sets a control message. We only care about records here
     record_messages = list(filter(lambda message: "RECORD" in message, messages))
     assert len(record_messages) == 2
+    # Verify that the fallback warning was logged
+    mock_logger.warning.assert_called_once_with("Failed to serialize message with orjson, falling back to json")
+    # Verify the content of serialized messages
+    for message in record_messages:
+        deserialized = json.loads(message)
+        assert deserialized["record"]["data"]["data"] == 7046723166326052303072
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cbae6cb and 29e506f.

📒 Files selected for processing (2)
  • airbyte_cdk/entrypoint.py (3 hunks)
  • unit_tests/test_entrypoint.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/entrypoint.py
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Ruff Lint Check
  • GitHub Check: Build and Inspect Python Package
  • GitHub Check: Ruff Format Check
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: preview_docs
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: MyPy Check
  • GitHub Check: Pytest (Fast)
  • GitHub Check: update_release_draft
  • GitHub Check: Analyze (python)

Signed-off-by: Artem Inzhyyants <[email protected]>
@artem1205 artem1205 merged commit cc8ec47 into main Jan 9, 2025
16 of 20 checks passed
@artem1205 artem1205 deleted the artem1205/orjson-tests branch January 9, 2025 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working security
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants