This repository was archived by the owner on Jun 29, 2021. It is now read-only.

Add public API to deal with grouped data #17

Open
piever opened this issue Nov 14, 2018 · 3 comments
Labels
enhancement New feature or request

Comments

@piever
Member

piever commented Nov 14, 2018

Reminder: while GoG assumes data is in the long tidy format, we could be more flexible by allowing more ways to construct the PlottableTable. There could be a public API to build PlottableTables manually from different data structures plus grouping information. An added benefit would be getting some equivalent of plot(x, [y1 y2]) from Plots for free.

@piever piever added the enhancement New feature or request label Nov 14, 2018
@pdeffebach

pdeffebach commented Nov 15, 2018

Note that there are two issues here. The first is something that, to my knowledge, is not possible in ggplot or any conventional plotting package: grouping based on two non-mutually exclusive dummy variables.

Say you want to graph a histogram of income for white people and hispanic people, but many people identify as both white and hispanic.

julia> df = DataFrame(income = randn(10), white = rand(Bool, 10), hispanic = rand(Bool, 10))
10×3 DataFrame
│ Row │ income     │ white │ hispanic │
│     │ Float64    │ Bool  │ Bool     │
├─────┼────────────┼───────┼──────────┤
│ 1   │ 0.490092   │ false │ true     │
│ 2   │ 1.05979    │ true  │ true     │
│ 3   │ 0.0334069  │ false │ true     │
│ 4   │ -0.391703  │ true  │ true     │
│ 5   │ -0.587518  │ true  │ false    │
│ 6   │ -1.02922   │ false │ true     │
│ 7   │ -0.0573893 │ true  │ false    │
│ 8   │ 2.3907     │ false │ true     │
│ 9   │ 1.08107    │ false │ false    │
│ 10  │ -0.324261  │ true  │ false    │

Grammar of Graphics assumes that categories are mutually exclusive: it only allows grouping by a single categorical variable, such as ethnicity.

What I would love to be able to do is a syntax along the lines of

plot(df, :income, Color = G([:white, :hispanic]))

Here, G is a function that makes it look like I did the following:

plot1 = @linq df |>
    stack([:white, :hispanic]) |>
    where(:value) |>
    plot(:income, Color = :variable)  # stack names the key column :variable

Note that the above scenario only works (I think) if both :white and :hispanic are dummy variables, so presumably any such function would have to check that this is the case.
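As a rough sketch of what such a G-like expansion could do, here is a version in plain Julia that avoids DataFrames entirely. The name expand_dummies, the :group column, and the NamedTuple-of-vectors table layout are all assumptions made for illustration; none of them are part of StatsMakie or DataFrames. It checks that every grouping column is Bool, then emits one long-format row per (observation, true dummy) pair, so a person who is both white and hispanic appears twice.

```julia
# Hypothetical helper, not an existing API: expands non-exclusive Bool
# dummy columns into long format. `table` is a NamedTuple of equal-length
# column vectors.
function expand_dummies(table::NamedTuple, dummies::Vector{Symbol})
    # Refuse anything that is not a Bool dummy variable.
    for d in dummies
        eltype(table[d]) === Bool ||
            throw(ArgumentError("column $d is not a Bool dummy variable"))
    end
    others = filter(k -> !(k in dummies), collect(keys(table)))
    rows = NamedTuple[]
    for i in eachindex(table[first(dummies)]), d in dummies
        table[d][i] || continue  # keep only rows where the dummy is true
        vals = Tuple(table[k][i] for k in others)
        push!(rows, merge(NamedTuple{Tuple(others)}(vals), (group = d,)))
    end
    return rows
end

df = (income   = [0.5, 1.1, -0.6, 1.0],
      white    = [false, true, true, false],
      hispanic = [true, true, false, false])

long = expand_dummies(df, [:white, :hispanic])
```

In this toy table the second observation belongs to both groups and therefore contributes two rows, while the fourth belongs to neither and is dropped.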

Perhaps this idea could be extended all the way to the grouping APIs themselves in JuliaDB and DataFrames. As far as I know, there isn't too much preventing a GroupedDataFrame from having non-mutually exclusive groups.

cc @nalimilan because this seems like something a demographer might have desired before.

I think that plot(x, [y1, y2]) is a related issue, but it would require a very different implementation, namely a flatten and a zip in some capacity (if at the end of the line it's all GoG-like).
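For the plot(x, [y1, y2]) case, the flatten step might look roughly like the following. This is only a sketch: to_long and the :series column name are hypothetical, and it assumes every series shares the same x vector.

```julia
# Hypothetical helper: turn one shared x vector and several y series
# into long-format (x, y, series) columns.
function to_long(x::AbstractVector, ys::AbstractVector{<:AbstractVector})
    all(y -> length(y) == length(x), ys) ||
        throw(DimensionMismatch("every series must have the same length as x"))
    xs     = repeat(x, length(ys))                     # x repeated once per series
    yflat  = collect(Iterators.flatten(ys))            # y1 followed by y2, ...
    series = repeat(1:length(ys), inner = length(x))   # series index of each point
    return (x = xs, y = yflat, series = series)
end

x  = [1, 2, 3]
y1 = [10.0, 20.0, 30.0]
y2 = [0.1, 0.2, 0.3]

long = to_long(x, [y1, y2])
```

The resulting columns could then be grouped on series like any other categorical variable in a long tidy table.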

@mkborregaard
Member

mkborregaard commented Nov 15, 2018

This type of grouping is not part of grouping APIs because it's statistically invalid, and the approach listed leads to pseudoreplication. I strongly believe that your plots should honestly portray your data and that there should be a seamless correspondence between plots and statistics.

A statistically appropriate way (which is consistent with standard grouping) is to include a third factor level for those that self-identify as more than one ethnic group.
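A minimal sketch of that recoding, assuming the two Bool dummies from the example above: each observation gets exactly one label, so standard grouping applies. The function name and the label strings are illustrative only, and the "neither" level is an added assumption to cover people with both dummies false.

```julia
# Hypothetical recoding: map two non-exclusive dummies to one
# mutually exclusive categorical label per observation.
function exclusive_group(white::Bool, hispanic::Bool)
    if white && hispanic
        return "white and hispanic"
    elseif white
        return "white only"
    elseif hispanic
        return "hispanic only"
    else
        return "neither"
    end
end

white    = [false, true, true, false]
hispanic = [true, true, false, false]

# Broadcast over both columns to build the new categorical column.
ethnicity = exclusive_group.(white, hispanic)
```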

@nalimilan

I think there are situations where it's fine to represent stats for non-exclusive subgroups. For example, it can happen if you ask a battery of yes/no questions and want to see the characteristics of people who answered "yes" to each question. In this case it's not practical (nor interesting) to have a level for each combination of possible answers.
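To make the survey case concrete, here is a small sketch with made-up data: the question names (owns_car, owns_bike) and the ages are purely hypothetical. Each statistic is computed over the respondents who answered "yes" to one question, so the subgroups overlap by design.

```julia
using Statistics

# Hypothetical survey data: one Bool column per yes/no question.
age = [25, 40, 33, 58]
answers = (owns_car  = [true, true, false, true],
           owns_bike = [false, true, true, false])

# Mean age among "yes" respondents, computed separately for each question.
# Logical indexing with the Bool column selects the matching respondents.
mean_age_yes = Dict(q => mean(age[answers[q]]) for q in keys(answers))
```

The second respondent counts towards both means, which is the intended reading here: each number describes one question's "yes" group, not a partition of the sample.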

That said, I'm not familiar enough with StatsMakie to have an opinion regarding the API.

4 participants