Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Density geometry revamp #1157

Open
wants to merge 14 commits into
base: master
Choose a base branch
from
6 changes: 3 additions & 3 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@ Each release typically has a number of minor bug fixes beyond what is listed her
# Version 0.7.1

* `Geom.contour`: add support for `DataFrame` (#1150)
* `Geom.density`: add ability to use custom kernels and adds support for scaling, stacking, vertical orientation ([#1157](https://github.com/GiovineItalia/Gadfly.jl/pull/1157))
* `Geom.violin`: add ability to adjust scaling and bandwidth and support for horizontal and split violins ([#1157](https://github.com/GiovineItalia/Gadfly.jl/pull/1157))

# Version 0.7.0

Expand All @@ -23,7 +25,7 @@ Each release typically has a number of minor bug fixes beyond what is listed her
# Version 0.6.4

* Regression testing tools (#1020)

# Version 0.6.3

* Wide format data (#1013)
Expand Down Expand Up @@ -214,5 +216,3 @@ Each release typically has a number of minor bug fixes beyond what is listed her
keys are wrapped automatically.

* Default Theme changes.


41 changes: 35 additions & 6 deletions docs/src/gallery/geometries.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,8 +92,9 @@ gridstack([p1 p2; p3 p4])
```@example
using Gadfly, RDatasets, Distributions
set_default_plot_size(21cm, 8cm)
p1 = plot(dataset("ggplot2", "diamonds"), x="Price", Geom.density)
p2 = plot(dataset("ggplot2", "diamonds"), x="Price", color="Cut", Geom.density)
data = dataset("ggplot2", "diamonds")
p1 = plot(data, x="Price", Geom.density)
p2 = plot(data, x="Price", color="Cut", Geom.density)
hstack(p1,p2)
```

Expand All @@ -102,13 +103,27 @@ using Gadfly, RDatasets, Distributions
set_default_plot_size(14cm, 8cm)
dist = MixtureModel(Normal, [(0.5, 0.2), (1, 0.1)])
xs = rand(dist, 10^5)
plot(layer(x=xs, Geom.density, Theme(default_color="orange")),
plot(layer(x=xs, Geom.density, Theme(default_color="orange")),
layer(x=xs, Geom.density(bandwidth=0.0003), Theme(default_color="green")),
layer(x=xs, Geom.density(bandwidth=0.25), Theme(default_color="purple")),
Guide.manual_color_key("bandwidth", ["auto", "bw=0.0003", "bw=0.25"],
["orange", "green", "purple"]))
```

```@example
using Gadfly, RDatasets
set_default_plot_size(21cm, 8cm)
data = dataset("ggplot2", "diamonds")
p1 = plot(data, x=:Carat, color=:Cut, Geom.density(position=:stack), Guide.title("Loses marginal densities"))
p2 = plot(data, x=:Carat, color=:Cut, Geom.density(position=:stack, scale=:count), Guide.title("Preserve marginal densities"))
hstack(p1, p2)
```

```@example
using Gadfly, RDatasets
plot(dataset("ggplot2", "diamonds"), x=:Carat, color=:Cut, Geom.density(position=:fill), Guide.title("Conditional density estimate"), Coord.cartesian(ymax=1.0, xmax=5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this long line will likely create a horizontal slider in the generated doc html. a hard line break would be nice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I had already corrected this locally.

```


## [`Geom.density2d`](@ref)

Expand Down Expand Up @@ -464,16 +479,16 @@ using Gadfly, RDatasets
set_default_plot_size(21cm, 8cm)

coord = Coord.cartesian(xmin=-2, xmax=2, ymin=-2, ymax=2)
p1 = plot(coord, z=(x,y)->x*exp(-(x^2+y^2)),
xmin=[-2], xmax=[2], ymin=[-2], ymax=[2],
p1 = plot(coord, z=(x,y)->x*exp(-(x^2+y^2)),
xmin=[-2], xmax=[2], ymin=[-2], ymax=[2],
# or: x=-2:0.25:2.0, y=-2:0.25:2.0,
Geom.vectorfield(scale=0.4, samples=17), Geom.contour(levels=6),
Scale.x_continuous(minvalue=-2.0, maxvalue=2.0),
Scale.y_continuous(minvalue=-2.0, maxvalue=2.0),
Guide.xlabel("x"), Guide.ylabel("y"), Guide.colorkey(title="z"))

volcano = Matrix{Float64}(dataset("datasets", "volcano"))
volc = volcano[1:4:end, 1:4:end]
volc = volcano[1:4:end, 1:4:end]
coord = Coord.cartesian(xmin=1, xmax=22, ymin=1, ymax=16)
p2 = plot(coord, z=volc, x=1.0:22, y=1.0:16,
Geom.vectorfield(scale=0.05), Geom.contour(levels=7),
Expand All @@ -495,3 +510,17 @@ Dsing = dataset("lattice","singer")
Dsing[:Voice] = [x[1:5] for x in Dsing[:VoicePart]]
plot(Dsing, x=:VoicePart, y=:Height, color=:Voice, Geom.violin)
```

```@example
using Gadfly, RDatasets
set_default_plot_size(14cm, 8cm)
tips = dataset("reshape2", "tips")
plot(tips, x=:Day, y=:TotalBill, color=:Sex, Geom.violin(scale=:count), Scale.x_discrete(order=[3,4,2,1]))
```

```@example
using Gadfly, RDatasets
set_default_plot_size(12cm, 16cm)
melanoma = dataset("mlmRev", "Mmmec")
plot(melanoma, y=:Nation, x=:Deaths, color=:Nation, Geom.violin(orientation=:horizontal, scale=:count))
```
76 changes: 76 additions & 0 deletions src/aesthetics.jl
Original file line number Diff line number Diff line change
Expand Up @@ -413,3 +413,79 @@ function inherit!(a::Aesthetics, b::Aesthetics;
end
nothing
end

"""
groupby(aes, by, togroupvar)

Given aesthetics to group with, `by`, and an aesthetic to group `togroupvar`
this function constructs a dictionary that maps each given combination of the
`by` aesthetics to the positions which they apply to. Thus the output is a
dictionary of tuples of each unique combination of `by` mapped to a boolean
array of length `n` where `n` is the length of the aesthetics (they have to all
have the same length). If the provided aesthetics are missing, a placeholder
`nothing` is return instead of the unique value.

## Examples

```jldoctest
using Gadfly
aes = Gadfly.Aesthetics()
aes.x = repeat([1, 2], inner=3)
aes.y = collect(1:6)

Gadfly.groupby(aes, [:x, :color], :y)

# output

DataStructures.OrderedDict{Tuple{Int64,Void},Array{Bool,1}} with 2 entries:
(1, nothing) => Bool[true, true, true, false, false, false]
(2, nothing) => Bool[false, false, false, true, true, true]
```

```jldoctest
using Gadfly
aes = Gadfly.Aesthetics()
aes.x = repeat([:a, :b], inner=2)
aes.y = collect(1:4)
aes.color = repeat([colorant"red", colorant"blue"], inner=2)

Gadfly.groupby(aes, [:x, :color], :y)

# output

DataStructures.OrderedDict{Tuple{Symbol,ColorTypes.RGB{FixedPointNumbers.Normed{UInt8,8}}},Array{Bool,1}} with 2 entries:
(:a, RGB{N0f8}(1.0,0.0,0.0)) => Bool[true, true, false, false]
(:b, RGB{N0f8}(0.0,0.0,1.0)) => Bool[false, false, true, true]
```

"""
function groupby(aes::Gadfly.Aesthetics, by::Vector{Symbol}, togroupvar::Symbol)
types = fill(Nothing, length(by))
isconcrete = fill(false, length(by))
for i in 1:length(by)
isconcrete[i] = getfield(aes, by[i]) != nothing
(!isconcrete[i]) && continue
types[i] = eltype(getfield(aes, by[i]))
@assert length(getfield(aes, togroupvar)) == length(getfield(aes, by[i])) "$togroupvar and $(by[i]) aesthetics must have same length"
end

grouped = DataStructures.OrderedDict{Tuple{types...}, Vector{Bool}}()

# gather options for each `by` aesthetic
opt = [if isconcrete[i] unique(getfield(aes, by[i])) else [nothing] end for i in 1:length(by)]

# The approach is to identify positions were multiple by aesthetics overlap
# and thus grouping the data positions. We first assume that all positions
# belong to a combination of aesthetics and then whittle it down
for combo in IterTools.product(opt...)
belongs = fill(true, length(getfield(aes, togroupvar)))
for i in 1:length(combo)
(combo[i] == nothing) && continue
belongs .&= getfield(aes, by[i]) .== combo[i]
end
# for multiple by variables we need to check whether there is any overlap
# between this specific combo before adding it to the dict
(any(belongs)) && (grouped[combo] = belongs)
end
grouped
end
191 changes: 191 additions & 0 deletions src/geom/density.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,191 @@
struct DensityGeometry <: Gadfly.GeometryElement
stat::Gadfly.StatisticElement
order::Int
tag::Symbol
end

function DensityGeometry(; n=256,
bandwidth=-Inf,
adjust=1.0,
kernel=Normal,
trim=false,
scale=:area,
position=:dodge,
orientation=:horizontal,
order=1,
tag=empty_tag)
stat = Gadfly.Stat.DensityStatistic(n, bandwidth, adjust, kernel, trim,
scale, position, orientation, false)
DensityGeometry(stat, order, tag)
end

DensityGeometry(stat; order=1, tag=empty_tag) = DensityGeometry(stat, order, tag)

"""
Geom.density(; bandwidth, adjust, kernel, trim, scale, position, orientation, order)

Draws a kernel density estimate. This is a cousin of [`Geom.histogram`](@ref)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice to document how Geom.density(Stat.identity) behaved. in other words, what aesthetics does Geom.density directly use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll add this.

that is especially useful when the datapoints originate from a underlying smooth
distribution. Unlike histograms, density estimates do not suffer from edge
effects from incorrect bin choices. Some caveats do apply:

1) Plot components do not necessarily correspond to the raw datapoints, but
instead to the kernel density estimation of the underlying distribution
2) Density estimation improves as a function of the number of data points and
can be misleadingly smooth when the number of datapoints is small.
3) Results can be sensitive to the choise of `kernel` and `bandwidth`

For horizontal histograms (default), `Geom.density` draws the kernel density
estimate of `x` optionally grouped by `color`. If the `orientation=:vertical`
flag is passed to the function, then densities will be computed along `y`. The
estimates are normalized by default to have areas equal to 1, but this can
changed by passing `scale=:count` to scale by the raw number of datapoints or
`scale=:peak` to scale by the max height of the estimate. Additionally, multiple
densities can be stacked using the `position=:stack` flag or the conditional
density estimate can be drawn using `position=:fill`. See
[`Stat.DensityStatistic`](@ref Gadfly.Stat.DensityStatistic) for details on
optional parameters that can control the `bandwidth`, `kernel`, etc used.

External links

* [Kernel Density Estimation on Wikipedia](https://en.wikipedia.org/wiki/Kernel_density_estimation)
"""
const density = DensityGeometry

element_aesthetics(::DensityGeometry) = Symbol[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

element_aesthetics should contain :x, :y, and :color, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I don't leave this blank, they are filled with autogenerated values so it's impossible to give useful error messages using Gadfly.assert_aesthetics_defined. I wasn't sure how to get around this so I leave this blank and figure out errors later: https://github.com/GiovineItalia/Gadfly.jl/pull/1157/files#diff-9ec506bf78232ae17d082c22c2e66449R616

default_statistic(geom::DensityGeometry) = Gadfly.Stat.DensityStatistic(geom.stat)

function render(geom::DensityGeometry, theme::Gadfly.Theme, aes::Gadfly.Aesthetics)
Gadfly.assert_aesthetics_defined("Geom.density", aes, :x, :y)
Gadfly.assert_aesthetics_equal_length("Geom.density", aes, :x, :y)

grouped_data = Gadfly.groupby(aes, [:color], :y)
densities = Array{NTuple{2, Float64}}[]
colors = []

for (keys, belongs) in grouped_data
xs = aes.x[belongs]
ys = aes.y[belongs]

push!(densities, [(x, y) for (x, y) in zip(xs, ys)])
push!(colors, keys[1] != nothing ? keys[1] : theme.default_color)
end

ctx = context(order=geom.order)
# TODO: This should be user controllable
if geom.stat.position == :dodge
compose!(ctx, Compose.polygon(densities, geom.tag), stroke(colors), fill(nothing))
else
compose!(ctx, Compose.polygon(densities, geom.tag), fill(colors))
end

compose!(ctx, svgclass("geometry"))
end

struct ViolinGeometry <: Gadfly.GeometryElement
stat::Gadfly.StatisticElement
order::Int
tag::Symbol
end

function ViolinGeometry(; n=256,
bandwidth=-Inf,
adjust=1.0,
kernel=Normal,
trim=true,
scale=:area,
orientation=:vertical,
order=1,
tag=empty_tag)
stat = Gadfly.Stat.DensityStatistic(n, bandwidth, adjust, kernel, trim,
scale, :dodge, orientation, true)
ViolinGeometry(stat, order, tag)
end

"""
Geom.violin[(; bandwidth, adjust, kernel, trim, order)]

Draws a violin plot which is a combination of [`Geom.density`](@ref) and
[`Geom.boxplot`](@ref). This plot type is useful for comparing differences in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't see a call to Geom.boxplot in the Geom.violin code. is this docstring correct in this regard?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant it stylistically, not technically, but that could change (see #1157 (comment))

the distribution of quantitative data between categories, especially when the
data is non-normally distributed. See [`Geom.density`](@ref) for some caveats.

In the case of standard vertical violins, `Geom.violin` draws the density
estimate of `y` optionally grouped categorically by `x` and colored
with `color`. See [`Stat.DensityStatistic`](@ref Gadfly.Stat.DensityStatistic)
for details on optional parameters that can control the `bandwidth`, `kernel`,
etc used.

```@example
using RDatasets, Gadfly

df = dataset("ggplot2", "diamonds")

p = plot(df, x=:Cut, y=:Carat, color=:Cut, Geom.violin())
draw(SVG("diamonds_violin1.svg", 10cm, 8cm), p) # hide
nothing # hide
```
![](diamonds_violin1.svg)
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no other Geom docstrings currently have examples. i'd suggest moving this to the gallery.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be nice to change, but it depends on JuliaDocs/Documenter.jl#736 anyway so I'll remove this for now.

const violin = ViolinGeometry

element_aesthetics(::ViolinGeometry) = []

default_statistic(geom::ViolinGeometry) = Gadfly.Stat.DensityStatistic(geom.stat)

function render(geom::ViolinGeometry, theme::Gadfly.Theme, aes::Gadfly.Aesthetics)

Gadfly.assert_aesthetics_defined("Geom.violin", aes, :y, :width)
Gadfly.assert_aesthetics_equal_length("Geom.violin", aes, :y, :width)

output_dims, groupon = Gadfly.Stat._find_output_dims(geom.stat)
grouped_data = Gadfly.groupby(aes, groupon, output_dims[2])
violins = Array{NTuple{2, Float64}}[]

(aes.color == nothing) && (aes.color = fill(theme.default_color, length(aes.x)))
colors = eltype(aes.color)[]
color_opts = unique(aes.color)
split = false
# TODO: Add support for dodging violins (i.e. having more than two colors
# per major category). Also splitting should not happen automatically, but
# as a optional keyword to Geom.violin
if length(keys(grouped_data)) > 2*length(unique(getfield(aes, output_dims[1])))
error("Violin plots do not currently support having more than 2 colors per $(output_dims[1]) category")
elseif length(color_opts) == 2
split = true
end

for (keys, belongs) in grouped_data
x, color = keys
ys = getfield(aes, output_dims[2])[belongs]
ws = aes.width[belongs]

if split
pos = findfirst(color_opts, color)
if pos == 1
push!(violins, [(x - w/2, y) for (y, w) in zip(ys, ws)])
else
push!(violins, reverse!([(x + w/2, y) for (y, w) in zip(ys, ws)]))
end
push!(colors, color)
else
push!(violins, vcat([(x - w/2, y) for (y, w) in zip(ys, ws)],
reverse!([(x + w/2, y) for (y, w) in zip(ys, ws)])))
push!(colors, color != nothing ? color : theme.default_color)
end
end

if geom.stat.orientation == :horizontal
for violin in violins
for i in 1:length(violin)
violin[i] = reverse(violin[i])
end
end
end

ctx = context(order=geom.order)
compose!(ctx, Compose.polygon(violins, geom.tag), fill(colors))

compose!(ctx, svgclass("geometry"))

end
10 changes: 0 additions & 10 deletions src/geom/line.jl
Original file line number Diff line number Diff line change
Expand Up @@ -51,16 +51,6 @@ geometry is equivalent to [`Geom.line`](@ref) with `preserve_order=true`.
"""
path() = LineGeometry(preserve_order=true)

"""
Geom.density[(; bandwidth=-Inf)]

Draw a line showing the density estimate of the `x` aesthetic.
This geometry is equivalent to [`Geom.line`](@ref) with
[`Stat.density`](@ref); see the latter for more information.
"""
density(; bandwidth::Real=-Inf) =
LineGeometry(Gadfly.Stat.density(bandwidth=bandwidth))

"""
Geom.density2d[(; bandwidth=(-Inf,-Inf), levels=15)]

Expand Down
Loading