show missing factor levels on join report #64

Open
gmstanle opened this issue Apr 5, 2024 · 1 comment

Comments


gmstanle commented Apr 5, 2024

It would be great to have a more verbose option for joins (inner_join, left_join, etc.). It would show not just the number of rows dropped from each input data frame, but also the level combinations that were found in only one of them. For example, if you joined two datasets about cars on make, it might report that the levels Ferrari and Mazda were found only in df_left and dropped, and that Ford was found only in df_right and dropped. The catch is that the output length grows with the number of unmatched factor levels, so I think the option would have to be opt-in and truncate the output past a certain number of lines. I am happy to help with development.
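To make the request concrete, here is a rough base R sketch of the kind of report described above. The helper `report_unmatched_keys()` and its `max_values` argument are hypothetical names used for illustration only; they are not part of tidylog or dplyr, and the message format is an assumption rather than a proposal for the final output.

```r
# Hypothetical helper, NOT part of tidylog: list the key values that appear
# on only one side of a join, truncating long lists.
report_unmatched_keys <- function(x, y, by, max_values = 5) {
  # Compare as character so factors, strings, and integers are handled alike.
  x_keys <- unique(as.character(x[[by]]))
  y_keys <- unique(as.character(y[[by]]))
  only_x <- setdiff(x_keys, y_keys)
  only_y <- setdiff(y_keys, x_keys)

  # Show at most `max_values` values so high-cardinality keys stay readable.
  fmt <- function(values) {
    shown <- head(values, max_values)
    extra <- length(values) - length(shown)
    out <- paste(shown, collapse = ", ")
    if (extra > 0) out <- paste0(out, ", ... (", extra, " more)")
    out
  }

  message("only in x (", length(only_x), "): ", fmt(only_x))
  message("only in y (", length(only_y), "): ", fmt(only_y))
  invisible(list(only_x = only_x, only_y = only_y))
}

# Toy data matching the cars example above.
df_left  <- data.frame(make = factor(c("Ferrari", "Mazda", "Toyota")),
                       price = c(300, 25, 30))
df_right <- data.frame(make = factor(c("Ford", "Toyota")),
                       origin = c("US", "JP"))

res <- report_unmatched_keys(df_left, df_right, by = "make")
```

Because the keys are compared as character, the same approach would apply to strings, logicals, or integers, not only factors. It handles a single join column; multi-column keys would need the combinations pasted together or compared row-wise.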

Owner

elbersb commented Apr 17, 2024

Hi! At first glance I'm not sure whether this would be in scope for tidylog, as the join logic is already fairly complex, and this would add a lot of additional output. A few questions:

  • what's the specific use case?
  • if this is implemented, it should be opt-in. How would one opt in?
  • how to deal with high-cardinality values? As you mention, there might be hundreds of levels
  • why only factors? (could work as well for booleans, strings, even ints)
