[merged after dec 5th) adds new hard_deletes config #6558
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,8 +10,7 @@ | |
* [Snapshot properties](/reference/snapshot-properties) | ||
* [`snapshot` command](/reference/commands/snapshot) | ||
|
||
|
||
### What are snapshots? | ||
## What are snapshots? | ||
Analysts often need to "look back in time" at previous data states in their mutable tables. While some source data systems are built in a way that makes accessing historical data possible, this is not always the case. dbt provides a mechanism, **snapshots**, which records changes to a mutable <Term id="table" /> over time. | ||
|
||
Snapshots implement [type-2 Slowly Changing Dimensions](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row) over mutable source tables. These Slowly Changing Dimensions (or SCDs) identify how a row in a table changes over time. Imagine you have an `orders` table where the `status` field can be overwritten as the order is processed. | ||
|
@@ -63,9 +62,9 @@ | |
[unique_key](/reference/resource-configs/unique_key): column_name_or_expression | ||
[check_cols](/reference/resource-configs/check_cols): [column_name] | all | ||
[updated_at](/reference/resource-configs/updated_at): column_name | ||
[invalidate_hard_deletes](/reference/resource-configs/invalidate_hard_deletes): true | false | ||
[snapshot_meta_column_names](/reference/resource-configs/snapshot_meta_column_names): dictionary | ||
[dbt_valid_to_current](/reference/resource-configs/dbt_valid_to_current): string | ||
[hard_deletes](/reference/resource-configs/hard-deletes): string | ||
``` | ||
|
||
</File> | ||
|
@@ -81,9 +80,9 @@ | |
| [unique_key](/reference/resource-configs/unique_key) | A <Term id="primary-key" /> column(s) (string or array) or expression for the record | Yes | `id` or `[order_id, product_id]` | | ||
| [check_cols](/reference/resource-configs/check_cols) | If using the `check` strategy, then the columns to check | Only if using the `check` strategy | ["status"] | | ||
| [updated_at](/reference/resource-configs/updated_at) | If using the `timestamp` strategy, the timestamp column to compare | Only if using the `timestamp` strategy | updated_at | | ||
| [invalidate_hard_deletes](/reference/resource-configs/invalidate_hard_deletes) | Find hard deleted records in source and set `dbt_valid_to` to current time if the record no longer exists | No | True | | ||
| [dbt_valid_to_current](/reference/resource-configs/dbt_valid_to_current) | Set a custom indicator for the value of `dbt_valid_to` in current snapshot records (like a future date). By default, this value is `NULL`. When configured, dbt will use the specified value instead of `NULL` for `dbt_valid_to` for current records in the snapshot table.| No | string | | ||
| [snapshot_meta_column_names](/reference/resource-configs/snapshot_meta_column_names) | Customize the names of the snapshot meta fields | No | dictionary | | ||
| [hard_deletes](/reference/resource-configs/hard-deletes) | Specify how to handle deleted rows from the source. Supported options are `ignore` (default), `invalidate` (replaces the legacy `invalidate_hard_deletes=true`), and `new_record`.| No | string | | ||
|
||
|
||
|
||
- In versions prior to v1.9, the `target_schema` (required) and `target_database` (optional) configurations defined a single schema or database to build a snapshot across users and environments. This created problems when testing or developing a snapshot, as there was no clear separation between development and production environments. In v1.9, `target_schema` became optional, allowing snapshots to be environment-aware. By default, without `target_schema` or `target_database` defined, snapshots now use the `generate_schema_name` or `generate_database_name` macros to determine where to build. Developers can still set a custom location with [`schema`](/reference/resource-configs/schema) and [`database`](/reference/resource-configs/database) configs, consistent with other resource types. | |
|
@@ -215,10 +214,14 @@ | |
- The `dbt_valid_to` column will be updated for any existing records that have changed. | ||
- The updated record and any new records will be inserted into the snapshot table. These records will now have `dbt_valid_to = null` or the value configured in `dbt_valid_to_current` (available in Versionless and 1.9 and higher). | ||
|
||
<VersionBlock firstVersion="1.9"> | ||
|
||
#### Note | ||
- These column names can be customized to your team or organizational conventions using the [snapshot_meta_column_names](#snapshot-meta-fields) config. | ||
- Use the `dbt_valid_to_current` config to set a custom indicator for the value of `dbt_valid_to` in current snapshot records (like a future date such as `9999-12-31`). By default, this value is `NULL`. When set, dbt will use this specified value instead of `NULL` for `dbt_valid_to` for current records in the snapshot table. | ||
|
||
- Use the [`hard_deletes`](/reference/resource-configs/hard-deletes) config to control how rows deleted from the source are handled, for example by adding a new record when a row becomes "deleted" in the source. Supported options are `ignore`, `invalidate`, and `new_record`. | |
</VersionBlock> | ||
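
For illustration, here is a minimal sketch of `dbt_valid_to_current` in a snapshot YAML file. The snapshot name `orders_snapshot` and the `to_date('9999-12-31')` expression are assumptions; the exact date expression depends on your data platform's SQL dialect.

```yaml
snapshots:
  - name: orders_snapshot  # hypothetical snapshot name
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      # Assumed expression; adjust the date function to your warehouse.
      dbt_valid_to_current: "to_date('9999-12-31')"
```

With this set, current rows carry the configured value in `dbt_valid_to` instead of `NULL`, so downstream filters for current records would compare against that value rather than checking for `NULL`.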
|
||
Snapshots can be referenced in downstream models the same way as referencing models — by using the [ref](/reference/dbt-jinja-functions/ref) function. | ||
|
||
## Detecting row changes | ||
|
@@ -294,7 +297,7 @@ | |
|
||
::: | ||
|
||
**Example Usage** | ||
**Example usage** | ||
|
||
<VersionBlock lastVersion="1.8"> | ||
|
||
|
@@ -344,15 +347,64 @@ | |
|
||
### Hard deletes (opt-in) | ||
|
||
<VersionBlock firstVersion="1.9"> | ||
|
||
In dbt v1.9 and higher, the [`hard_deletes`](/reference/resource-configs/hard-deletes) config replaces the `invalidate_hard_deletes` config to give you more control over how to handle deleted rows from the source. The `hard_deletes` config is an additional opt-in feature that can be used with any snapshot strategy. | |
|
||
|
||
The `hard_deletes` config has three methods: | ||
| Methods | Description | | ||
| --------- | ----------- | | ||
| `ignore` (default) | No action for deleted records. | | ||
| `invalidate` | Behaves the same as the existing `invalidate_hard_deletes=true`, where deleted records are invalidated by setting `dbt_valid_to` to the current time. | | ||
| `new_record` | Tracks deleted records as new rows using the `dbt_is_deleted` [meta field](#snapshot-meta-fields) to indicate when records are in a deleted state.| | ||
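
As a hedged sketch of moving from the legacy flag to the `invalidate` method, the snippet below assumes a YAML-defined snapshot on the same `jaffle_shop` source used elsewhere on this page. The snapshot name is hypothetical.

```yaml
snapshots:
  - name: orders_snapshot_invalidate  # hypothetical snapshot name
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      # Legacy equivalent (v1.8 and earlier): invalidate_hard_deletes: true
      hard_deletes: invalidate
```

With `invalidate`, the expectation is that no extra row is inserted for a deleted record; its latest row has `dbt_valid_to` set to the current time, as described in the table above.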
|
||
import HardDeletes from '/snippets/_hard-deletes.md'; | ||
|
||
<HardDeletes /> | ||
|
||
#### Example usage | ||
|
||
<File name='snapshots/orders_snapshot.yml'> | ||
|
||
```yaml | ||
snapshots: | ||
- name: orders_snapshot_hard_delete | ||
relation: source('jaffle_shop', 'orders') | ||
config: | ||
schema: snapshots | ||
unique_key: id | ||
strategy: timestamp | ||
updated_at: updated_at | ||
hard_deletes: new_record # options are: 'ignore', 'invalidate', or 'new_record' | ||
``` | ||
|
||
</File> | ||
|
||
In this example, the `hard_deletes: new_record` config will add a new row for deleted records with the `dbt_is_deleted` column set to `True`. | ||
Any restored records are added as new rows with the `dbt_is_deleted` field set to `False`. | ||
|
||
|
||
The resulting table will look like this: | ||
|
||
| id | status | updated_at | dbt_valid_from | dbt_valid_to | dbt_is_deleted | | ||
|
||
| -- | ------ | ---------- | -------------- | ------------ | -------------- | | ||
| 1 | pending | 2024-01-01 10:47 | 2024-01-01 10:47 | 2024-01-01 11:05 | False | | ||
| 1 | shipped | 2024-01-01 11:05 | 2024-01-01 11:05 | 2024-01-01 11:20 | False | | ||
| 1 | shipped | 2024-01-01 11:20 | 2024-01-01 11:20 | | True | | ||
This row should have a |
||
| 1 | shipped | 2024-01-01 12:00 | 2024-01-01 12:00 | | False | | ||
|
||
</VersionBlock> | ||
|
||
<VersionBlock lastVersion="1.8"> | ||
|
||
Rows that are deleted from the source query are not invalidated by default. With the config option `invalidate_hard_deletes`, dbt can track rows that no longer exist. This is done by left joining the snapshot table with the source table, and filtering the rows that are still valid at that point, but no longer can be found in the source table. `dbt_valid_to` will be set to the current snapshot time. | ||
|
||
This configuration is not a different strategy as described above, but is an additional opt-in feature. It is not enabled by default since it alters the previous behavior. | ||
|
||
For this configuration to work with the `timestamp` strategy, the configured `updated_at` column must be of timestamp type. Otherwise, queries will fail due to mixing data types. | ||
|
||
**Example Usage** | ||
Note, in v1.9 and higher, setting the [`hard_deletes`](/reference/resource-configs/hard-deletes) config to `invalidate` replaces the `invalidate_hard_deletes` config for better control over how to handle deleted rows from the source. | ||
|
||
|
||
<VersionBlock lastVersion="1.8"> | ||
#### Example usage | ||
|
||
<File name='snapshots/orders_snapshot_hard_delete.sql'> | ||
|
||
|
@@ -378,40 +430,22 @@ | |
|
||
</VersionBlock> | ||
|
||
<VersionBlock firstVersion="1.9"> | ||
|
||
<File name='snapshots/orders_snapshot.yml'> | ||
|
||
```yaml | ||
snapshots: | ||
- name: orders_snapshot_hard_delete | ||
relation: source('jaffle_shop', 'orders') | ||
config: | ||
schema: snapshots | ||
unique_key: id | ||
strategy: timestamp | ||
updated_at: updated_at | ||
invalidate_hard_deletes: true | ||
``` | ||
|
||
</File> | ||
|
||
</VersionBlock> | ||
|
||
## Snapshot meta-fields | ||
|
||
Snapshot <Term id="table">tables</Term> will be created as a clone of your source dataset, plus some additional meta-fields*. | ||
|
||
Starting in 1.9 or with [dbt Cloud Versionless](/docs/dbt-versions/upgrade-dbt-version-in-cloud#versionless): | ||
- These column names can be customized to your team or organizational conventions using the [`snapshot_meta_column_names`](/reference/resource-configs/snapshot_meta_column_names) config. | ||
- Use the [`dbt_valid_to_current` config](/reference/resource-configs/dbt_valid_to_current) to set a custom indicator for the value of `dbt_valid_to` in current snapshot records (like a future date such as `9999-12-31`). By default, this value is `NULL`. When set, dbt will use this specified value instead of `NULL` for `dbt_valid_to` for current records in the snapshot table. | ||
- Use the [`hard_deletes`](/reference/resource-configs/hard-deletes) config and `new_record` method to track deleted records as new rows with the `dbt_is_deleted` meta field. | ||
|
||
| Field | Meaning | Usage | | ||
| -------------- | ------- | ----- | | ||
| -------------- | ------- | ----- | | ||
| dbt_valid_from | The timestamp when this snapshot row was first inserted | This column can be used to order the different "versions" of a record. | | ||
| dbt_valid_to | The timestamp when this row became invalidated. <br /> For current records, this is `NULL` by default <VersionBlock firstVersion="1.9"> or the value specified in `dbt_valid_to_current`.</VersionBlock> | The most recent snapshot record will have `dbt_valid_to` set to `NULL` <VersionBlock firstVersion="1.9"> or the specified value. </VersionBlock> | | ||
| dbt_scd_id | A unique key generated for each snapshotted record. | This is used internally by dbt | | ||
| dbt_updated_at | The updated_at timestamp of the source record when this snapshot row was inserted. | This is used internally by dbt | | ||
| dbt_is_deleted | A boolean value indicating if the record is in a deleted state. `True` if deleted, `False` otherwise. | Added when `hard_deletes='new_record'` is configured. | | ||
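
A minimal sketch of renaming the meta fields with `snapshot_meta_column_names`, combined with `hard_deletes: new_record`. The snapshot name and the custom column names on the right-hand side are hypothetical, and the `dbt_is_deleted` key is assumed to be accepted alongside the other meta fields.

```yaml
snapshots:
  - name: orders_snapshot_custom_names  # hypothetical snapshot name
    relation: source('jaffle_shop', 'orders')
    config:
      schema: snapshots
      unique_key: id
      strategy: timestamp
      updated_at: updated_at
      hard_deletes: new_record
      snapshot_meta_column_names:
        # Left: default meta field. Right: illustrative custom name.
        dbt_valid_from: valid_from
        dbt_valid_to: valid_to
        dbt_scd_id: scd_id
        dbt_updated_at: snapshot_updated_at
        dbt_is_deleted: is_deleted
```

Downstream models that reference this snapshot would then select the custom names (for example `valid_to`) rather than the default `dbt_`-prefixed columns.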
|
||
*The timestamps used for each column are subtly different depending on the strategy you use: | ||
|
||
|
@@ -445,6 +479,15 @@ | |
| 1 | pending | 2024-01-01 10:47 | 2024-01-01 10:47 | 2024-01-01 11:05 | 2024-01-01 10:47 | | ||
| 1 | shipped | 2024-01-01 11:05 | 2024-01-01 11:05 | | 2024-01-01 11:05 | | ||
|
||
Snapshot results with `hard_deletes='new_record'`: | ||
|
||
| id | status | updated_at | dbt_valid_from | dbt_valid_to | dbt_updated_at | dbt_is_deleted | | ||
|
||
|----|---------|------------------|------------------|------------------|------------------|----------------| | ||
| 1 | pending | 2024-01-01 10:47 | 2024-01-01 10:47 | 2024-01-01 11:05 | 2024-01-01 10:47 | False | | ||
| 1 | shipped | 2024-01-01 11:05 | 2024-01-01 11:05 | 2024-01-01 11:20 | 2024-01-01 11:05 | False | | ||
| 1 | shipped | 2024-01-01 11:20 | 2024-01-01 11:20 | | 2024-01-01 11:20 | True | | ||
|
||
|
||
</details> | ||
|
||
<br/> | ||
|
@@ -532,7 +575,7 @@ | |
| [unique_key](/reference/resource-configs/unique_key) | A <Term id="primary-key" /> column or expression for the record | Yes | id | | ||
| [check_cols](/reference/resource-configs/check_cols) | If using the `check` strategy, then the columns to check | Only if using the `check` strategy | ["status"] | | ||
| [updated_at](/reference/resource-configs/updated_at) | If using the `timestamp` strategy, the timestamp column to compare | Only if using the `timestamp` strategy | updated_at | | ||
| [invalidate_hard_deletes](/reference/resource-configs/invalidate_hard_deletes) | Find hard deleted records in source, and set `dbt_valid_to` current time if no longer exists | No | True | | ||
| [invalidate_hard_deletes](/reference/resource-configs/invalidate_hard_deletes) (legacy) | Find hard deleted records in source, and set `dbt_valid_to` to the current time if the record no longer exists. This is a legacy config replaced by [`hard_deletes`](/reference/resource-configs/hard-deletes) in dbt v1.9. | No | True | |
|
||
|
||
- A number of other configurations are also supported (e.g. `tags` and `post-hook`), check out the full list [here](/reference/snapshot-configs). | ||
- Snapshots can be configured from both your `dbt_project.yml` file and a `config` block, check out the [configuration docs](/reference/snapshot-configs) for more information. | ||
|
I wonder if, instead of saying "string" here, we should say something like `ignore | invalidate | new_record` so it's clear what your options are?