SNOW-1748140: Modify schema_expression to be structured type aware. #2659

Merged

sfc-gh-jrose
Contributor

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1748140

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

This pull request changes the schema_expression function so that it handles structured data types.

schema_expression has separate cases for nullable and non-nullable fields. This change modifies the nullable case to skip the semi-structured branches of the if statement and fall through to the default case, so that convert_sp_to_sf_type does the work of specifying the schema instead.

For non-nullable fields, it relies on recursive calls to schema_expression to get good default values for the various nested data types, and then casts a relevant data structure to the correct schema. Note that non-nullable columns do not have nullability respected in their child fields, which is a limitation of the way we infer schemas using the dummy query approach.
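
The two cases described above can be sketched in plain Python. This is a simplified illustration, not the code from this PR: StringT, LongT, ArrayT, MapT, to_sql_type and schema_expr are hypothetical stand-ins for the Snowpark type classes, convert_sp_to_sf_type and schema_expression, and the default literals are assumptions chosen for the example.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Snowpark's data type classes (not the real API).
@dataclass
class StringT:
    pass


@dataclass
class LongT:
    pass


@dataclass
class ArrayT:
    element_type: object
    structured: bool = False


@dataclass
class MapT:
    key_type: object
    value_type: object
    structured: bool = False


def to_sql_type(t) -> str:
    """Render a SQL type name; plays the role of convert_sp_to_sf_type."""
    if isinstance(t, StringT):
        return "STRING"
    if isinstance(t, LongT):
        return "BIGINT"
    if isinstance(t, ArrayT):
        return f"ARRAY({to_sql_type(t.element_type)})" if t.structured else "ARRAY"
    if isinstance(t, MapT):
        if t.structured:
            return f"MAP({to_sql_type(t.key_type)}, {to_sql_type(t.value_type)})"
        return "OBJECT"
    raise TypeError(f"unsupported type: {t!r}")


def schema_expr(t, is_nullable: bool) -> str:
    """Build a dummy SQL expression whose result has the type of t."""
    if is_nullable:
        # Nullable case: skip the per-type branches and let the type
        # rendering describe the full (possibly nested) schema.
        return f"NULL :: {to_sql_type(t)}"
    if isinstance(t, StringT):
        return "'a' :: STRING"  # assumed default literal
    if isinstance(t, LongT):
        return "0 :: BIGINT"  # assumed default literal
    if isinstance(t, ArrayT):
        if t.structured:
            # Recurse for a default element, then cast to the full type.
            element = schema_expr(t.element_type, is_nullable)
            return f"to_array({element}) :: {to_sql_type(t)}"
        return "to_array(0)"  # assumed semi-structured default
    if isinstance(t, MapT):
        if t.structured:
            key = schema_expr(t.key_type, is_nullable)
            value = schema_expr(t.value_type, is_nullable)
            return f"object_construct_keep_null({key}, {value}) :: {to_sql_type(t)}"
        return "to_object(parse_json('0'))"  # assumed semi-structured default
    raise TypeError(f"unsupported type: {t!r}")


nested = ArrayT(ArrayT(LongT(), structured=True), structured=True)
print(schema_expr(nested, is_nullable=True))
print(schema_expr(nested, is_nullable=False))
```

In this sketch the rendered type string carries no per-child nullability, so the non-nullable branch cannot express it for child fields either, which mirrors the limitation noted above.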

While writing test cases for this, I also found that the nullability of child fields was not set correctly when parsing metadata, so I fixed that in type_utils as well.

I've opened two new bugs as a result of this change:

SNOW-1819531 - Large query breakdown does not appear to work correctly with structured types.
SNOW-1819428 - When calling create_dataframe with a schema that contains a StructType column, the child fields do not have their nullability respected. I suspect this is due to a similar limitation in how we generate schema strings.

@sfc-gh-jrose added the NO-CHANGELOG-UPDATES label (This pull request does not need to update CHANGELOG.md) Nov 20, 2024
@sfc-gh-jrose requested a review from a team November 20, 2024 22:57
@sfc-gh-jrose requested review from a team as code owners November 20, 2024 22:57

if data_type.structured:
    key = schema_expression(data_type.key_type, is_nullable)
    value = schema_expression(data_type.value_type, is_nullable)
    return f"object_construct_keep_null({key}, {value}) :: {convert_sp_to_sf_type(data_type)}"

Collaborator

Should we decide whether to keep null values based on is_nullable?

Contributor Author

If we don't keep nulls and either the key or value gets evaluated to a NULL :: type statement then that field would be dropped from the schema altogether. For this reason I think we always want nulls.
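
The difference between the two constructors can be modeled roughly in plain Python. This is only an illustrative model under the assumption that a pair with a NULL value is omitted unless nulls are kept and that a pair with a NULL key is always omitted; object_construct_model is a hypothetical helper, not a Snowflake or Snowpark function.

```python
def object_construct_model(*args, keep_null=False):
    """Rough model of building an object from alternating keys and values.

    None stands in for SQL NULL. Pairs with a None key are always skipped;
    pairs with a None value are skipped unless keep_null is True.
    """
    if len(args) % 2 != 0:
        raise ValueError("expected alternating keys and values")
    result = {}
    for key, value in zip(args[0::2], args[1::2]):
        if key is None:
            continue
        if value is None and not keep_null:
            continue
        result[key] = value
    return result


# Without keeping nulls the field disappears, so nothing is left to
# describe its type; with keep_null the field survives.
print(object_construct_model("a", None))
print(object_construct_model("a", None, keep_null=True))
```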

assert table.union(table).schema == expected_schema
# Functions used in schema generation don't respect nested nullability so compare query string instead
non_null_union = non_null_table.union(non_null_table)
assert non_null_union._plan.schema_query == (
Collaborator

Can you also create a test case for nested arrays and objects, like to_array(... to_array(...))?

Contributor Author

Added

@sfc-gh-jrose merged commit 2423507 into main Nov 25, 2024
37 checks passed
@sfc-gh-jrose deleted the jrose_snow_1748140_schema_expression_for_struct_types branch November 25, 2024 20:26
@github-actions bot locked and limited conversation to collaborators Nov 25, 2024
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md
3 participants