Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Recognize API Relation Table Used in SQL Header as Parent of Model #11038

Closed
2 tasks done
aboomer07 opened this issue Nov 22, 2024 · 3 comments
Closed
2 tasks done
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core

Comments

@aboomer07
Copy link

aboomer07 commented Nov 22, 2024

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

In one of my DBT models my_downstream_model, I am defining a table reference using api.relation.create like this:

{% set my_table = api.Relation.create(
    database=<my_database>,
    schema=<my_schema>,
    identifier=<my_identifier>
) -%}

This table is then used to leverage the declare variables functionality of BigQuery within a SQL header. I am doing this since the main table used in the model is a hive partitioned external table, so declaring the partitions as a variable greatly reduces the query cost compared to having the partitions be defined by the result of another query.

The SQL header call looks something like this:

{% call set_sql_header(config) %}
    DECLARE my_partition_list ARRAY<STRING> DEFAULT(ARRAY(select partition_field from {{ my_table }}));
{%- endcall %}

my_table is not used anywhere else in the main model query, and it is not recognized as a parent of my_downstream_model.

Is this the intended behavior? I also tried to set it as a dependency manually, doing -- depends_on: {{ my_table }} but that didn't get it to be recognized as a parent either.

Expected Behavior

My expectation was that my_table would be recognized as a parent of my_downstream_model, so that the DAG execution ordering would be correct.

Steps To Reproduce

Create a DBT SQL model with my_table defined as above, and use it in a SQL header call as I defined. I ran dbt ls --select=my_table+ and my_downstream_model was not in its list of dependencies

Relevant log output

No response

Environment

- OS: macOS 14.5 and Debian 11.4
- Python: 3.10.8
- dbt: 1.8.9

Which database adapter are you using with dbt?

bigquery

Additional Context

No response

@aboomer07 aboomer07 added bug Something isn't working triage labels Nov 22, 2024
@aboomer07 aboomer07 changed the title [Bug] <title> API Relation Table Used in SQL Header Not Recognized as Parent of Model [Bug] API Relation Table Used in SQL Header Not Recognized as Parent of Model Nov 22, 2024
@dbeatty10
Copy link
Contributor

Thanks for reaching out @aboomer07 !

What you described sounds like the expected behavior to me. This is because dbt will only recognize dependencies that come from a call to the ref or source functions. The dbt DAG does not support dynamic graph building -- all the dependencies must be specified up-front.

Is there a reason why you can't create a source and then utilize it like the following?

  1. First, create a source with your database, schema, and identifier:

models/_sources.yml

sources:
  - name: <my_source_name>  # often same as <my_schema>, but you can name the source whatever you want
    database: <my_database>  
    schema: <my_schema>  
    tables:
      - name: <my_identifier>
  1. Then, utilize it within a dbt model:

models/my_downstream_model.sql

-- ...

-- depends_on: {{ source("<my_source_name>", "<my_identifier>") }}

-- ...

I didn't try one way or another with placing the source reference within the sql_header, but the at the very least the -- depends_on: approach should work.

@aboomer07
Copy link
Author

aboomer07 commented Nov 25, 2024

Thanks for getting back!

I now see that wasn't quite enough information on my part, but in general that makes sense what you're saying about dynamic graph building.

So I have a couple more things going on here, I have both a DBT project (<my_project>) as well as a private external (to the project) package (<my_package>) I built that I am using in that project. <my_package> has some generalized DBT macros and models that can be used across multiple projects.

So <my_table> is a model within <my_project> and gets referenced with {{ ref }} in other parts of the project.

<my_downstream_table> is actually a generic model within <my_package>. The way the models within <my_package> are built is that the project using <my_package> gets to supply a variable in the project specifying the name of the tables required by <my_downstream_table>. The goal of this was to set up a package where the user would only need to supply the table names specific to their project (and containing the correct fields), and then the package would be able to generate results from those.

So following the DAG I have <my_table> which is a model defined within and created by <my_project>. <my_downstream_table> is a generic model defined within <my_package>. <my_project> uses <my_downstream_table> and creates a realization of it dependent on <my_table> (and a couple others).

<my_downstream_table> is actually dependent on a few other tables either created / with their sources defined within <my_project> that are referenced within the main query, and the dependency graphing of those works fine!

Then <my_project> has some other tables that are dependent on <my_downstream_table>.

Hopefully this makes sense? I found a workaround for now, and if this isn't a good pattern to pursue then I'm fine with closing this issue! I guess this is also sort of a BigQuery issue, since this kind of pattern is only needed for specifying the partition filters.

@dbeatty10
Copy link
Contributor

Thank you for that additional context @aboomer07 💡

<my_downstream_table> is actually a generic model within <my_package>. The way the models within <my_package> are built is that the project using <my_package> gets to supply a variable in the project specifying the name of the tables required by <my_downstream_table>. The goal of this was to set up a package where the user would only need to supply the table names specific to their project (and containing the correct fields), and then the package would be able to generate results from those.

This would be a feature request rather than a bug report, so I'm going to update the title accordingly. The ability to pass in a list of table dependencies is not something we'd be interested in adopting especially since you have a workaround so I'm going to close this as "not planned".

@dbeatty10 dbeatty10 closed this as not planned Won't fix, can't repro, duplicate, stale Nov 26, 2024
@dbeatty10 dbeatty10 added enhancement New feature or request wontfix Not a bug or out of scope for dbt-core and removed bug Something isn't working triage labels Nov 26, 2024
@dbeatty10 dbeatty10 changed the title [Bug] API Relation Table Used in SQL Header Not Recognized as Parent of Model [Feature] Recognize API Relation Table Used in SQL Header as Parent of Model Nov 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wontfix Not a bug or out of scope for dbt-core
Projects
None yet
Development

No branches or pull requests

2 participants