show missing factor levels on join report #64

gmstanle · 2024-04-05T17:16:29Z

It would be great to have a more verbose option for joins (inner_join, left_join, etc.). This would show not just the number of rows dropped from each input data frame, but also the key values that appeared on only one side. So if you joined two datasets about cars on make, it might report that levels Ferrari and Mazda were only found in df_left and filtered out, and Ford was only found in df_right and filtered out. The catch is that the output length grows with the number of factor levels, so I think it would have to be optional and truncate the output past a certain number of lines. I am happy to help with development.
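To make the request concrete, here is a minimal base R sketch of the kind of report described above. The helper `unmatched_keys()` and its `max_values` argument are hypothetical names chosen for illustration; they are not part of tidylog or dplyr. The sketch assumes a single join column and compares keys as character strings.

```r
# Hypothetical helper, not part of tidylog: list the key values that occur
# in only one of the two inputs, showing at most `max_values` per side.
unmatched_keys <- function(x, y, by, max_values = 10) {
  # as.character() so factors, strings, and numbers are compared the same way
  x_keys <- unique(as.character(x[[by]]))
  y_keys <- unique(as.character(y[[by]]))
  only_x <- setdiff(x_keys, y_keys)
  only_y <- setdiff(y_keys, x_keys)
  list(
    only_in_x   = head(only_x, max_values),  # truncated for display
    only_in_y   = head(only_y, max_values),
    n_only_in_x = length(only_x),            # full counts, untruncated
    n_only_in_y = length(only_y)
  )
}

df_left  <- data.frame(make = factor(c("Ferrari", "Mazda", "Toyota")))
df_right <- data.frame(make = factor(c("Ford", "Toyota")))

res <- unmatched_keys(df_left, df_right, by = "make")
res$only_in_x  # "Ferrari" "Mazda"
res$only_in_y  # "Ford"
```

Keeping the full counts next to the truncated vectors lets a report say how many values were omitted when a column has many distinct keys.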

elbersb · 2024-04-17T09:01:10Z

Hi! At first glance I'm not sure whether this would be in scope for tidylog, as the join logic is already fairly complex, and this would add a lot of additional output. A few questions:

- What's the specific use case?
- If this is implemented, it should be opt-in. How would one opt in?
- How would it deal with high-cardinality values? As you mention, there might be hundreds of levels.
- Why only factors? (It could work just as well for booleans, strings, even ints.)