chore(decoder): clean decoders and make csvdecoder available #326

maxi297 · 2025-02-07T23:33:18Z

What

https://github.com/airbytehq/airbyte-internal-issues/issues/11616

This is a breaking change, but it only affects components that are experimental or used solely by source-amplitude, so I'm fine shipping it as a minor change.

Summary by CodeRabbit

New Features
- Introduced a unified decoding framework supporting multiple data types (e.g., CSV, JSON, and compressed formats) for enhanced flexibility.
- Added new decoders for handling CSV and Gzip decoding.
Refactor
- Streamlined the data extraction process by consolidating redundant decoding components and improving handling for both streamed and non-streamed responses.
Tests
- Expanded test coverage to validate improved response processing and enhanced memory usage efficiency under various conditions.
- Added new test cases for the CompositeRawDecoder to ensure correct behavior with consumed and non-streamed responses.

maxi297 · 2025-02-07T23:35:56Z

airbyte_cdk/sources/declarative/decoders/json_decoder.py


    def decode(
        self, response: requests.Response
    ) -> Generator[MutableMapping[str, Any], None, None]:
        """
        Given the response is an empty string or an empty list, the function will return a generator with an empty mapping.
        """
+        has_yielded = False


The default behavior of this decoder was a bit weird, so I decided to keep it here rather than push the weird logic into JsonParser.
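
For readers following along, the fallback being kept here can be sketched roughly as below. This is a minimal illustration, not the CDK code: `decode_with_empty_fallback` is a hypothetical helper, it takes raw bytes instead of a `requests.Response`, and treating invalid JSON the same way as an empty body is an assumption.

```python
import json
from typing import Any, Generator, MutableMapping


def decode_with_empty_fallback(
    body: bytes,
) -> Generator[MutableMapping[str, Any], None, None]:
    """Yield every parsed record, or a single empty mapping if there is none."""
    has_yielded = False
    try:
        # An empty or whitespace-only body is treated like an empty list.
        parsed = json.loads(body) if body.strip() else []
    except json.JSONDecodeError:
        # Assumption for this sketch: invalid JSON falls back to "no records".
        parsed = []
    # A top-level object is treated as a single record.
    records = parsed if isinstance(parsed, list) else [parsed]
    for record in records:
        has_yielded = True
        yield record
    if not has_yielded:
        # The "weird" default: callers always receive at least one mapping.
        yield {}
```

Keeping this fallback in the decoder means a parser can stay a plain "bytes in, records out" component with no knowledge of the empty-mapping convention.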

maxi297 · 2025-02-07T23:36:20Z

airbyte_cdk/sources/declarative/decoders/json_decoder.py

-
-
-@dataclass
-class GzipJsonDecoder(JsonDecoder):


Removed because it wasn't used anyway. We can introduce this for source-amazon-seller-partner, though.

coderabbitai · 2025-02-07T23:36:55Z

Walkthrough

This PR refactors the decoding and parsing architecture. It removes several deprecated decoders and parsers (e.g., GzipJsonDecoder, JsonParser, JsonLineParser, CsvParser) and introduces a unified approach with a new GzipDecoder and renamed CsvDecoder. The CompositeRawDecoder now supports configurable streaming via a new stream_response flag, and the JsonDecoder has been restructured to delegate to it. Updates are applied in component schemas, the ModelToComponentFactory, and multiple test files to align with the new decoder interface.

Changes

File(s)	Change Summary
airbyte_cdk/.../declarative_component_schema.yaml, airbyte_cdk/.../declarative_component_schema.py	Removed obsolete decoder/parser components (GzipJsonDecoder, JsonParser, JsonLineParser, CsvParser) and introduced new ones (GzipDecoder, CsvDecoder). Updated ZipfileDecoder, SimpleRetriever, AsyncRetriever, and SessionTokenAuthenticator to reference the new decoder properties.
airbyte_cdk/.../decoders/{__init__.py, composite_raw_decoder.py, json_decoder.py}	Removed deprecated decoders from public API; added a `stream_response` flag to CompositeRawDecoder; restructured JsonDecoder by removing the @dataclass decorator, defining an explicit constructor, delegating streaming logic, and simplifying error handling.
airbyte_cdk/.../parsers/model_to_component_factory.py	Consolidated decoder creation methods: introduced `create_csv_decoder` and updated `create_json_decoder` and `create_zipfile_decoder` to use the new unified decoder interface.
unit_tests/.../(auth/test_token_provider.py, decoders/{test_composite_decoder.py, test_decoders_memory_usage.py, test_json_decoder.py}, extractors/test_dpath_extractor.py)	Updated tests to reflect the new decoding architecture: replaced deprecated decoders with CompositeRawDecoder where applicable, adjusted response handling (using `json.dumps()` for content), and removed tests for obsolete gzip decoding functionality.

Sequence Diagram(s)

sequenceDiagram
    participant C as Caller
    participant CRD as CompositeRawDecoder
    participant P as Parser

    C->>CRD: decode(response)
    alt stream_response is True
       CRD->>P: parse(response.raw)
    else stream_response is False
       CRD->>CRD: wrap response.content in BytesIO
       CRD->>P: parse(wrapped content)
    end
    P-->>CRD: return parsed data
    CRD-->>C: yield decoded data
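
The two branches in the diagram can be sketched in plain Python. This is an illustration under stated assumptions: `FakeResponse`, `LineJsonParser`, and `SketchDecoder` are hypothetical stand-ins rather than the CDK classes; only the `stream_response` flag and the use of `response.raw` versus `response.content` come from the PR.

```python
import io
import json
from dataclasses import dataclass
from typing import Any, BinaryIO, Generator, MutableMapping


class LineJsonParser:
    """Hypothetical parser: one JSON document per non-empty line."""

    def parse(self, data: BinaryIO) -> Generator[MutableMapping[str, Any], None, None]:
        for line in data:
            if line.strip():
                yield json.loads(line)


@dataclass
class FakeResponse:
    """Stand-in for requests.Response exposing only the two attributes used."""

    raw: BinaryIO
    content: bytes


@dataclass
class SketchDecoder:
    parser: LineJsonParser
    stream_response: bool = True

    def decode(self, response: FakeResponse) -> Generator[MutableMapping[str, Any], None, None]:
        if self.stream_response:
            # Streaming: read straight from the underlying stream (single pass).
            yield from self.parser.parse(response.raw)
        else:
            # Non-streaming: wrap the fully loaded body so it can be re-read.
            yield from self.parser.parse(io.BytesIO(response.content))
```

In this sketch a streamed body can only be read once (a second pass over the exhausted stream yields nothing), while the buffered branch can be decoded repeatedly. How a real consumed `requests` stream behaves may differ.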

Possibly related PRs

feat: add download_decoder + download_extractor #50: The changes in the main PR are related to the modifications of the AsyncRetriever and SimpleRetriever components, which also involve updates to the declarative_component_schema.yaml file, similar to the enhancements made in the retrieved PR regarding the download_decoder and download_extractor properties.
feat(Low-Code Concurrent CDK): Make SimpleRetriever thread-safe so that different partitions can share the same SimpleRetriever #185: The changes in the main PR, which involve significant modifications to the SimpleRetriever class and its handling of decoders, are related to the retrieved PR that also modifies the SimpleRetriever class, particularly in how it manages state and pagination logic.
feat: Adds ZipfileDecoder component #169: The changes in the main PR are related to the removal of the GzipParser and the introduction of the GzipDecoder, which aligns with the modifications in the retrieved PR that also involves the ZipfileDecoder utilizing various parsers, including GzipParser.

Suggested labels

enhancement

Suggested reviewers

artem1205
maxi297

Would this setup work for you? wdyt?

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (15)

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)

2035-2036: Consider leveraging the model parameters or removing the unused argument.

Right now, this method always returns a new JsonDecoder with empty parameters, ignoring the passed-in model. Would it make sense to incorporate model parameters or drop the unused argument to avoid confusion, wdyt?

2039-2040: Make stream_response configurable or confirm it’s always false.

Here, you set stream_response=False for CSV. Are you certain that no streaming scenario is needed for CSV data, or would making it configurable benefit some use cases, wdyt?

2061-2061: Check for ZipfileDecoder parameters.

Currently, the created ZipfileDecoder ignores additional parameters in model.decoder or model.parameters. Do you want to forward them to the parser, or is this intentional, wdyt?

2064-2077: Consider exposing parameter checks & fallback for decoders.

The _get_parser method doesn't incorporate model.parameters. If additional settings (like encoding) are required, you might unify that logic here.

There's a potential for infinitely nested GzipParser if user misconfigures the inner_decoder repeatedly. A recursion limit or check might help.
Wdyt about adding these safeguards?
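
One possible shape for such a safeguard is sketched below. It is only an assumption about how a check could look: `check_decoder_depth` and `MAX_DECODER_DEPTH` are hypothetical names, and the config is modeled as plain nested mappings with an `inner_decoder` key rather than the CDK's model classes.

```python
from typing import Any, Mapping

MAX_DECODER_DEPTH = 5  # hypothetical limit chosen for illustration


def check_decoder_depth(
    decoder_config: Mapping[str, Any], max_depth: int = MAX_DECODER_DEPTH
) -> int:
    """Return the nesting depth of a decoder config, raising if it is too deep."""
    depth = 1
    current = decoder_config
    # Walk down the chain of inner decoders without recursion.
    while isinstance(current.get("inner_decoder"), Mapping):
        depth += 1
        if depth > max_depth:
            raise ValueError(
                f"Decoder nesting exceeds the maximum depth of {max_depth}"
            )
        current = current["inner_decoder"]
    return depth
```

Running a check like this before building the parsers would turn a deeply nested misconfiguration into an explicit error instead of a long chain of wrapped parsers.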

airbyte_cdk/sources/declarative/decoders/json_decoder.py (2)

24-25: Consider making 'stream_response' a parameter.
It's currently hardcoded to False. Would you like to introduce a parameter to toggle streaming for future flexibility, wdyt?

36-41: Catching broad exceptions.
Catching Exception might mask unexpected errors. Would you like to handle a more specific exception type, wdyt?

unit_tests/sources/declarative/decoders/test_json_decoder.py (2)

11-13: Great alignment with the new composite decoders!
This import approach looks consistent. Would you consider adding more test coverage to verify interplay between CompositeRawDecoder and JsonDecoder, wdyt?

44-45: Testing partial streaming scenarios?
We now set stream=True. Would you like to add tests confirming that partial lines or chunked responses are handled gracefully, wdyt?

unit_tests/sources/declarative/auth/test_token_provider.py (1)

58-60: Testing updated token response.
This properly simulates a new token. Maybe we could also test invalid JSON scenarios to ensure robustness, wdyt?

unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)

24-24: Consider adding a comment explaining the stream_response flag.

The initialization looks good, but since this is a test file, it might be helpful to add a comment explaining why stream_response=True is needed here, wdyt?
-decoder_jsonl = CompositeRawDecoder(parser=JsonLineParser(), stream_response=True)
+# stream_response=True is required for JSONL parsing to handle streaming responses correctly
+decoder_jsonl = CompositeRawDecoder(parser=JsonLineParser(), stream_response=True)

airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)

142-145: Consider adding docstring for the decode method.

The implementation looks good, but since this is a significant change in behavior, would you consider adding a docstring explaining the difference between streaming and non-streaming modes, wdyt?
 def decode(
     self, response: requests.Response
 ) -> Generator[MutableMapping[str, Any], None, None]:
+    """Decode the response based on stream_response setting.
+    
+    When stream_response is True:
+      - Uses response.raw for streaming parsing
+      - Suitable for large responses or JSONL format
+    When stream_response is False:
+      - Uses response.content with BytesIO
+      - Suitable for responses that need to be parsed multiple times
+    """
     if self.is_stream_response():
         yield from self.parser.parse(data=response.raw)  # type: ignore[arg-type]
     else:
         yield from self.parser.parse(data=io.BytesIO(response.content))

134-134: Nice addition of streaming control! Consider adding docstring?

The new stream_response flag and its implementation look good. Would you consider adding a docstring to explain when to use each mode? For example:
 stream_response: bool = True
+    """
+    Controls how responses are processed:
+    - True: Streams response.raw directly (memory efficient for large responses)
+    - False: Loads response.content into memory (allows multiple iterations)
+    """
Also applies to: 136-137, 142-145

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)

1268-1272: Consider adding docstring for CsvDecoder.

The implementation looks good, but would you consider adding a docstring explaining the purpose and configuration options of the CSV decoder, wdyt?
 class CsvDecoder(BaseModel):
     type: Literal["CsvDecoder"]
+    """Decoder for CSV formatted data.
+    
+    Attributes:
+        encoding: The character encoding to use (default: utf-8)
+        delimiter: The character used to separate fields (default: comma)
+    """
     encoding: Optional[str] = "utf-8"
     delimiter: Optional[str] = ","

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

3012-3025: CsvDecoder – Making CSV decoding available
Introducing the CsvDecoder with clear defaults (utf-8 encoding and a comma delimiter) is a clean and welcome addition. It looks like it accomplishes the PR objective to make CSV decoding available while cleaning up the decoders. Would you be open to adding some tests for different CSV configurations to ensure robustness? wdyt?
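
As a rough idea of what tests for different CSV configurations could exercise, here is a standard-library sketch. `parse_csv` is a hypothetical helper built on `csv.DictReader`, not the CDK's parser; only the `encoding` and `delimiter` options and their defaults are taken from the PR.

```python
import csv
import io
from typing import Dict, Generator


def parse_csv(
    data: bytes, encoding: str = "utf-8", delimiter: str = ","
) -> Generator[Dict[str, str], None, None]:
    """Decode raw bytes and yield one dict per CSV row, keyed by the header."""
    # newline="" lets the csv module handle line endings itself.
    text = io.TextIOWrapper(io.BytesIO(data), encoding=encoding, newline="")
    yield from csv.DictReader(text, delimiter=delimiter)
```

Parametrized tests over the default settings, a tab delimiter, and a non-UTF-8 encoding such as latin-1 would cover the two configurable options.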
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)

203-213: Great test for stream consumption! Consider adding error message check?

The test for streamed response consumption looks good. Would you consider also asserting the specific error message to ensure the right error is being raised? Something like:
-    with pytest.raises(Exception):
+    with pytest.raises(Exception, match="Response body has already been consumed"):
         list(composite_raw_decoder.decode(response))

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6260248 and 6e79ecf.

📒 Files selected for processing (11)

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3 hunks)
airbyte_cdk/sources/declarative/decoders/__init__.py (0 hunks)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2 hunks)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (1 hunks)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (8 hunks)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (5 hunks)
unit_tests/sources/declarative/auth/test_token_provider.py (3 hunks)
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1 hunks)
unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py (0 hunks)
unit_tests/sources/declarative/decoders/test_json_decoder.py (2 hunks)
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1 hunks)

💤 Files with no reviewable changes (2)

airbyte_cdk/sources/declarative/decoders/__init__.py
unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py

🧰 Additional context used

🪛 GitHub Actions: Linters

unit_tests/sources/declarative/decoders/test_json_decoder.py