Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2059] [Feature] Syntax to restrict selection to the current package #6891

Closed
3 tasks done
gwenwindflower opened this issue Feb 7, 2023 · 26 comments · Fixed by #9270
Closed
3 tasks done

[CT-2059] [Feature] Syntax to restrict selection to the current package #6891

gwenwindflower opened this issue Feb 7, 2023 · 26 comments · Fixed by #9270
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors! node selection Functionality and syntax for selecting DAG nodes

Comments

@gwenwindflower
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

There are many use cases for not running the models in your packages.

  • Feeding a dbt ls of the models in the project to an external tool
  • Only checking the build of the current project in a mesh-like structure
  • Many others but I won't belabor it unless you disagree :)

Unfortunately there's no easy way that I'm aware of that you can filter to just the models in the project without hardcoding the selects or excludes to various projects. Particularly as the mesh grows, I can see this becoming more of an issue, and a desire to 'just run this project's models' becoming really helpful.

I would love to implement something like:
dbt build --select this.* or perhaps dbt ls --exclude external or dbt run --internal-only/dbt run -i -- i haven't thought of the ideal syntax, but hopefully you get the idea.

Describe alternatives you've considered

You can exclude all the packages by name in YAML selectors, or you can [project_name].* but this isn't very convenient, particularly when working programmatically on top of the CLI, such as with editor extensions that would need to parse the project YAML to get the package name.

Who will this benefit?

This will primarily benefit people using lots of Fivetran packages, a multi-project mesh structure, or some other configuration that involves lots of models contributed by packages that you don't necessarily want to constantly re-run when you're developing.

Are you interested in contributing this feature?

Yes! Very much. Dying to contribute to the CLI in some way.

Anything else?

No response

@gwenwindflower gwenwindflower added enhancement New feature or request triage labels Feb 7, 2023
@github-actions github-actions bot changed the title [Feature] More easily exclude models from packages [CT-2059] [Feature] More easily exclude models from packages Feb 7, 2023
@PedramNavid
Copy link

PedramNavid commented Feb 8, 2023

+1 to this.

A very concrete use case is a vim plugin I am writing where I’d like to provide the user with the ability to quickly jump to any model within their project, however dbt ls shows all models including those in packages.

The JSON keys don’t provide a great way for differentiating between packaged and internal models.

Other vendors may also benefit, for example, when building a dbt integration, being able to display the models for users can be simplified with this option.

@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! node selection Functionality and syntax for selecting DAG nodes labels Feb 8, 2023
@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 8, 2023

Love it :) and would love to have you contribute this @gwenwindflower!

Should this be a special value of the package: selection method?

$ dbt ls --select package:this
$ dbt ls --select package:root

That's assuming, of course, that no one names their project/package this, or root, or whatever we come up with

@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 8, 2023

The code for the PackageSelectorMethod is defined here:

class PackageSelectorMethod(SelectorMethod):
def search(self, included_nodes: Set[UniqueId], selector: str) -> Iterator[UniqueId]:
"""Yields nodes from included that have the specified package"""
for node, real_node in self.all_nodes(included_nodes):
if real_node.package_name == selector:
yield node

We don't have access to the full RuntimeConfig here, the closest thing we do have is self.manifest.metadata.project_id, which is the hashed version of the project_name. We could reasonably add the unhashed project_name to the ManifestMetadata, too.

@dbeatty10
Copy link
Contributor

Jumping on the brainstorming bandwagon, here's some ideas for names (some taken from this 😉 ):

  • this
  • self
  • me
  • root
  • home
  • super
  • mine
$ dbt ls --select package:this
$ dbt ls --select package:self
$ dbt ls --select package:me
$ dbt ls --select package:root
$ dbt ls --select package:home
$ dbt ls --select package:super
$ dbt ls --select package:mine

If needed/desired, we could add some special symbol like $ as a prefix or require special capitalization like This.

$ dbt ls --select package:This
$ dbt ls --select package:$self
$ dbt ls --select package:$Me

(I'm guessing we wouldn't actually use $ 'cause that's special to shells like zsh, bash, etc.)

Alternatively, we could introduce some other selection shorthand that means the same thing as dbt ls --select package:{project_name}.

$ dbt ls --select package:.

That's assuming, of course, that no one names their project/package this, or root, or whatever we come up with

It would be nice if it were somehow impossible to not collide with a project name! We could restrict the possible names to ones that wouldn't be valid project names anyways. Or we could make our choice a reserved keyword and disallow projects from taking on that name (which would be a breaking change).

@dbeatty10
Copy link
Contributor

Something to keep in mind during any implementation:

#6598 / #6599 (not reviewed/merged yet) adds the ability to use Unix-style glob patterns to selectors like:

metric:arr_*
file:stg_salesforce__*.sql
package:internal_pkg_*
source:salesforce.*
tag:*_pii
test_name:singular_test_suite_[1-9]

while still retaining these:

fqn:*
source:*
exposure:*

So we'd want the two features and PRs to be compatible with each other.

@gwenwindflower
Copy link
Author

at first blush i think this will be the most friendly as it's a concept we already have introduced in dbt templates.

dbt run --select package:this feels quite readable to me!

@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 9, 2023

I won't lie, this did pique my interest:

$ dbt ls --select package:.
$ dbt ls -s .

But I'm not convinced that shorter & cleverer is better here, versus something explicit & easily understandable.

@jtcohen6 jtcohen6 removed the triage label Feb 9, 2023
@PedramNavid
Copy link

I do wonder if maybe making a package boolean available as part of the JSON output might be a more appropriate solution?

Some type of “internal_package: true” key/value could then be parsed by the user through jq or a programming language as needed. Including the package name could also be very helpful so clients don’t have to also parse the dbt_project.yml file.

In an ideal world, dbt ls —output=json would have two additional keys:

  • internal_package: bool
  • package_name: string

This doesn’t prevent also the work described above for selecting internal packages but could be a useful addition for integrations and clients.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 9, 2023

(@PedramNavid What time is it where you are?)

There is already the ability to include the package_name in the JSON output of dbt list:

$ dbt ls --output json --output-keys name,resource_type,package_name
{"name": "my_model", "resource_type": "model", "package_name": "test"}
{"name": "my_seed", "resource_type": "seed", "package_name": "test"}

(The CLI syntax of --output-keys is wonky today and should be fixed: #6676)

So if you know your internal/root package's name, it's easy enough to filter on that already. I also see how it would be handy to have that as a (new) boolean flag.

@PedramNavid
Copy link

I’m in Singapore right now on an 11 hour layover, only 7pm! Been traveling through many time zones so time is in perpetual flux.

Ah, package name is handy, then either a way to get the package name from dbt might almost be enough to solve this (for me anyway). Maybe just the dbt project name as an output key, then I could easily filter for models where project name==package name?

@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 9, 2023

Bon voyage !

So, today, the options would be:

  • Pull the name out of dbt_project.yml (doesn't feel great)
  • We add root_project_name as an output key available from dbt list, thereby enabling an easy filter

Future (!!)

just for fun, while you're laying over

We're working on a Python API for dbt-core that mirrors its CLI functionality (#6356). The syntax I have below is not guaranteed to actually be The Thing a few months from now—lots of design conversation to keep having between now and then—but this does actually work on our feature branch, plus a few small tweaks (43e7e2b).

Does this excite you? Terrify you? Genuinely curious.

# my-dbt-project-dir/script.py

import json
import pprint
from dbt.cli.main import dbtRunner
from dbt.contracts.graph.manifest import Manifest

manifest: Manifest = None
success: bool = None
list_results: list[str]

# TODO: disable other stdout logging
dbt = dbtRunner()
manifest, success = dbt.invoke(["parse"])
root_project_name = manifest.metadata.project_name

# at this point, you could just loop over manifest.nodes
# but let's imagine you want to use 'dbt list'
list_results, success = dbt.invoke(["list", "--output", "json", "--output-keys", "name,resource_type,package_name"])
results_json = [json.loads(res) for res in list_results]
filtered = [res for res in results_json if res["package_name"] == root_project_name]

pp = pprint.PrettyPrinter()
pp.pprint(filtered)
$ python3 script.py
...
[{'name': 'my_model', 'package_name': 'test', 'resource_type': 'model'},
 {'name': 'my_seed', 'package_name': 'test', 'resource_type': 'seed'}]

@gwenwindflower
Copy link
Author

i have a feeling our boy P-Money is using lua for his nvim plug so a python api might not do the trick

@PedramNavid for present day constraints, might i suggest using yq instead (might already be on your radar but it was new to me last week) -- makes parsing YAML very jq like and it handles a bunch of formats!

@gwenwindflower
Copy link
Author

I won't lie, this did pique my interest:

$ dbt ls --select package:.
$ dbt ls -s .

But I'm not convinced that shorter & cleverer is better here, versus something explicit & easily understandable.

true! that would be an invalid project name right? or we could safely make it one? i do feel icky about the possibility of some company called "This Cosmetics" making a project called "this", and i also don't like the idea of outlawing 'this' as a project name.

@PedramNavid
Copy link

I love the Python API, but Winnie has it right, my integration is lua so I’m dependent on the command line. I’ve also seen integrations built using JS/TypeScript so being able to access more detailed information via stdout is a bit more flexible.

Both jq/yq are great solutions but hard for me to use for a project i want to distribute widely. Fortunately vim comes with a json parser and I can use grep/sed to extract the project name from the dbt_project.yml so I’m not blocked here at all!

@gwenwindflower
Copy link
Author

mmm yea right right, i ran into this problem with some codegen stuff i was building (re the portability of jq/yq).

i don't like the boolean as it feels a lot for not much info, but i wouldn't hate the idea of a package_type key, with values like root or this, external, internal (signaling a company's internal mesh packages) that provides a little more flexibility and information.

i do want the CLI to have this functionality regardless of the specific tooling use case.

@gwenwindflower
Copy link
Author

WIP branch https://github.com/dbt-labs/dbt-core/tree/winnie/this-package-selector

trying to figure out how to get the newly added self.manifest.metadata.project_name to actually populate (currently it's returning None) -- i don't actually understand that area of the codebase yet so will have to circle back.

appreciate if anybody has any ideas or context on that though! once that's working i think the rest of it is a pleasant cruise.

@gwenwindflower
Copy link
Author

@jtcohen6 i think i'm hitting a wall, from what i can understand in my very, very limited understanding -- i think project_name has been added to the ManifestMetadata, the property is now being returned okay, but it's always None. any thoughts on where i'm missing giving that its value?

@jtcohen6
Copy link
Contributor

@gwenwindflower I was running into that too! (While testing the commit I linked above.) Confusing, but—I figured out it was a thing with partial parsing (the existing manifest doesn't have the project_name in its metadata). Try deleting the target/ directory, or disabling partial parsing, and see if that fixes it in your local testing.

@gwenwindflower
Copy link
Author

🙌🏻 💥 brilliant!

we're cooking!

i'll try to write some tests for this this weekend and then do a proper PR.

Screenshot 2023-02-10 at 11 04 29
Screenshot 2023-02-10 at 11 04 43

@dbeatty10 dbeatty10 changed the title [CT-2059] [Feature] More easily exclude models from packages [CT-2059] [Feature] Syntax to restrict selection to the current package Nov 1, 2023
@jaklan
Copy link

jaklan commented Nov 2, 2023

Hi @gwenwindflower & @jtcohen6, this feature request seems to be a bit stale, but I have raised recently a similar issue (#8954) and @dbeatty10 redirected me here, so I wanted to bump it with a few comments.

As I can see, the core discussion here is about the proper syntax for selecting nodes from the current package. That's definitely a step forward, but I wanted to put on a table a bit different approach:

I would expect dbt to only select a model from the current project (namespace) by default.

  • If you want to match a model with the same name from an installed package - you should specify it explicitly with -s "another_package.some_model".
  • If you want to match all models with given name from all packages - you should specify it explicitly with -s "*.some_model".

I see two main benefits of such approach:

  • you don't have to remember about specifying -s package:. each time you want to run a model / a subset of models from your current package and be sure no models from installed packages are selected - it would be given as a default behaviour then
  • it's less error-prone when working with multiple models with the same name - since namespaces were introduced, the current behaviour to match nodes from all installed packages can very easily lead to unexpected situations like the one described in #8954

This way you can run simple commands like dbt compile or dbt build -s some_model and don't worry about any side effects - the default context is always limited to your current project.

That's also related to another point raised in the above issue:

I believe it shouldn't be possible to select private and protected models from a package with restrict-access: True at all.

As @dbeatty10 replied there - currently model access modifiers are only applied for ref(), and not for selection. But you can imagine to revisit that approach in the future (personally I find it quite reasonable) and with the above approach - it would be much easier to introduce such change.


If that proposal doesn't resonate with you, I have two more ideas:

  1. Enable such behaviour with project config. In this case we don't change the default behaviour of dbt, but we allow to configure it in dbt_projects.yml and ensure consistent outcomes when working on given package.

  2. Allow to combine a YAML selector with flags. Currently it's not possible:

    If I run a command that defines its own selection criteria (via --select, --exclude, or --selector), dbt will ignore the default selector and use the flag criteria instead. It will not try to combine the two.

    but if we can set a selector as default and mark it as non-overwritable - then you would be able to define -s package:. as the default one and ensure it's still applied even if additional flags are provided, so e.g. dbt build -s some_model would be resolved to dbt build -s package:. -s some_model.

@dbeatty10 dbeatty10 added the Refinement Maintainer input needed label Nov 9, 2023
@dbeatty10
Copy link
Contributor

@jaklan Thanks for offering these ideas! Since they are are extending the scope of the original issue here, would you mind opening up new issue(s) here?

Although it's related to this one, the scope is different, so we'd like to treat it as a separate proposal.

@dbeatty10
Copy link
Contributor

If anyone wants to implement the narrowly version of this issue as-is, winnie had a WIP branch here that you could take a peek at and create your own branch from.

Most of the diff is reformatting import statements. From a quick peek, these are the sections with changed logic:

@dbeatty10 dbeatty10 removed the Refinement Maintainer input needed label Nov 9, 2023
@jtcohen6
Copy link
Contributor

@barton996 Thank you for opening the PR! @aranke and I started reviewing today, and we wanted to revisit the question of whether this syntax should be:

dbt ls -s package:.
dbt ls -s package:this
# or something else

After talking it through, we like dbt ls -s package:this the best. Two reasons:

  1. It's not a Unix glob pattern (as are * or .), so we avoid confusion there. It's already more of a dbt convention: {{ this }} to mean "current" model, as package:this would means current/root project.
  2. Thinking through the fact that you can run dbt from a subfolder of a project, and dbt will recurse up to find the nearest parent project directory. It is intended behavior that this always works the same, as if you were running it from the project root, but the use of . (which often means "current directory") might seem like it should yield different behavior depending on the working directory where dbt is running.

@barton996 I will leave a comment on the PR suggesting this syntax change, if it makes sense to you!

@jaklan
Copy link

jaklan commented Apr 12, 2024

@jtcohen6 quite an edge-case, but... in that scenario you would need to restrict this as a potential package name.

I would personally vote for . as that's a similar syntax to pip install . which makes it intuitive, but it doesn't matter so much - I rather wonder what you think about the points raised in that comment: #6891 (comment)

@jtcohen6
Copy link
Contributor

@jaklan Definitely considered (in comments above) - I think the added clarity outweighs the edge case (risk of collision with installed package named this).

For the ideas further up -

  • I don't think we're going to change the default selection behavior (only root project versus all enabled resources)
  • We should support package:this within yaml selectors, including a "default" selector
  • Re: combined use of --selector + --select/--exclude: There's an issue for this, with a lot of good discussion, and a PR that I'm long overdue for responding to :)

@jaklan
Copy link

jaklan commented Apr 12, 2024

@jtcohen6 thanks for the link.

I don't think we're going to change the default selection behavior (only root project versus all enabled resources)

Well, that's a bit disappointing, especially because of the reason I mentioned above:

That's also related to another point raised in the above issue:

I believe it shouldn't be possible to select private and protected models from a package with restrict-access: True at all.

As @dbeatty10 replied there - currently model access modifiers are only applied for ref(), and not for selection. But you can imagine to revisit that approach in the future (personally I find it quite reasonable) and with the above approach - it would be much easier to introduce such change.

With no way to enforce package:this on all users - it could lead to very unexpected results as described in #8954.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors! node selection Functionality and syntax for selecting DAG nodes
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants