Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Microbatch timestamps incorrectly offset for system timezone in Snowflake #10976

Open
2 tasks done
Tracked by #10624
bthomson22 opened this issue Nov 5, 2024 · 0 comments
Open
2 tasks done
Tracked by #10624
Assignees
Labels
bug Something isn't working microbatch Issues related to the microbatch incremental strategy
Milestone

Comments

@bthomson22
Copy link

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Currently, microbatch timestamps are being recognized as TIMESTAMP_NTZ in Snowflake, instead of TIMESTAMP_TZ. As a result, if user's system timezone is in anything but UTC, Snowflake will offset the timezone, resulting in incorrect and duplicative rows loaded.

For example, if you ran a microbatch model with the --event-end-time of today (2024-11-05) it would produce the following code:

and [timestamp_field] < TIMESTAMP '2024-11-05 00:00:00+00:00'

When you look at this same timestamp in Snowflake you can see it's stored as timestamp_ntz:
image

When you convert this to a timestamp_tz then you'll see the system offset will takeover and adjust accordingly:
Screenshot 2024-11-05 at 1 15 05 PM

According to the docs, microbatch expects input data to be in UTC. Therefore, if your source column is of type timestamp_tz and in UTC, the two timestamps would be off by 8 hours in the above example - causing duplicative rows being inserted.

Instead of using the invalid TIMESTAMP function (which I believe is the cause for data being stored as UTC) - such as the below:

TIMESTAMP '2024-11-05 00:00:00+00:00'

dbt Core should instead use the to_timestamp_tz function that correctly stores this information as a timestamp_tz with the correct UTC offset and data type:
Screenshot 2024-11-05 at 1 25 05 PM

Expected Behavior

Microbatch DDL statements should be operating off of timestamp_tz data types with a UTC offset of 0. Running a model with a microbatch materialization over and over again should not duplicate rows of data in the resulting dataset.

Insert/Delete statements should be operating off of the same datatype & UTC offset as source data.

Steps To Reproduce

  1. Connect dbt Cloud to a Snowflake project
  2. Ensure the Snowflake system timezone has a timezone offset broadly applied.
  3. Create a source model with event_time configured on a timestamp_tz column in UTC.
  4. Run the initial microbatch model from start to end (begins_on through now).
  5. Re-run the microbatch model using a lookback of 1
  6. Observe duplicate rows in the resulting data.

Relevant log output

No response

Environment

- Python: 3.8.12
- dbt: dbt Cloud CLI - 0.38.21

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@bthomson22 bthomson22 added bug Something isn't working triage labels Nov 5, 2024
@dbeatty10 dbeatty10 added the microbatch Issues related to the microbatch incremental strategy label Nov 5, 2024
@graciegoheen graciegoheen added this to the v1.9 milestone Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working microbatch Issues related to the microbatch incremental strategy
Projects
None yet
Development

No branches or pull requests

4 participants