Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multivariate implementation #110

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 34 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ Kernel density estimators for Julia.
### Univariate
The main accessor function is `kde`:

```
```julia
U = kde(data)
```

Expand All @@ -33,47 +33,61 @@ The `UnivariateKDE` object `U` contains gridded coordinates (`U.x`) and the dens
estimate (`U.density`). These are typically sufficient for plotting.
A related function

``` kde_lscv(data) ```
```julia
kde_lscv(data)
```

will construct a `UnivariateKDE` object, with the bandwidth selected by
least-squares cross validation. It accepts the above keyword arguments, except
`bandwidth`.


There are also some slightly more advanced interfaces:
```
```julia
kde(data, midpoints::R) where R<:AbstractRange
```
allows specifying the internal grid to use. Optional keyword arguments are
`kernel` and `bandwidth`.

```
```julia
kde(data, dist::Distribution)
```
allows specifying the exact distribution to use as the kernel. Optional
keyword arguments are `boundary` and `npoints`.

```
```julia
kde(data, midpoints::R, dist::Distribution) where R<:AbstractRange
```
allows specifying both the distribution and grid.

### Bivariate
### Multivariate

The usage mirrors that of the univariate case, except that `data` is now
either a tuple of vectors
```
B = kde((xdata, ydata))
either an `N`-tuple of vectors
```julia
M = kde((xdata, ydata, zdata))
```
or a matrix with two columns
or a matrix with `N` columns
```julia
M = kde(datamatrix)
```
B = kde(datamatrix)
Similarly, the optional arguments all now take `N`-tuple arguments:
e.g. `boundary` now takes a `N`-tuple of tuples `((xlo, xhi), (ylo, yhi), (zlo, zhi))`.

The `MultivariateKDE` object `M` contains gridded coordinates (`M.ranges`, an
`N`-tuple of `AbstractRange`s) and the multivariate density estimate (`M.density`).

### Bi- and Trivariate

Special type definitions exist for the bi- and trivariate case:
```julia
const BivariateKDE = MultivariateKDE{2, ...}
const TrivariateKDE = MultivariateKDE{3, ...}
```
Similarly, the optional arguments all now take tuple arguments:
e.g. `boundary` now takes a tuple of tuples `((xlo,xhi),(ylo,yhi))`.

The `BivariateKDE` object `B` contains gridded coordinates (`B.x` and `B.y`) and the bivariate density
estimate (`B.density`).
Their contained gridded coordinates can be accessed as `B.x` and `B.y` for
`B isa BivariateKDE` and `T.x`, `T.y`, and `T.z` for `T isa TrivariateKDE`,
respectively.

### Interpolation

Expand All @@ -83,17 +97,19 @@ intermediate values can be interpolated using the
[Interpolations.jl](https://github.com/tlycken/Interpolations.jl) package via the `pdf` method
(extended from Distributions.jl).

```
```julia
pdf(k::UnivariateKDE, x)
pdf(k::BivariateKDE, x, y)
pdf(k::TrivariateKDE, x, y, z)
pdf(k::MultivariateKDE, x...)
```

where `x` and `y` are real numbers or arrays.
where `x`, `y`, and `z` are real numbers or arrays.

If you are making multiple calls to `pdf`, it will be more efficient to
construct an intermediate `InterpKDE` to store the interpolation structure:

```
```julia
ik = InterpKDE(k)
pdf(ik, x)
```
Expand Down
4 changes: 3 additions & 1 deletion src/KernelDensity.jl
Original file line number Diff line number Diff line change
Expand Up @@ -9,12 +9,14 @@ import StatsBase: RealVector, RealMatrix
import Distributions: twoπ, pdf
import FFTW: rfft, irfft

export kde, kde_lscv, UnivariateKDE, BivariateKDE, InterpKDE, pdf
export kde, kde_lscv, UnivariateKDE, BivariateKDE, MultivariateKDE, InterpKDE, pdf

abstract type AbstractKDE end

include("univariate.jl")
include("multivariate.jl")
include("bivariate.jl")
include("trivariate.jl")
include("interp.jl")

end # module
158 changes: 10 additions & 148 deletions src/bivariate.jl
Original file line number Diff line number Diff line change
@@ -1,153 +1,15 @@
"""
$(TYPEDEF)
const BivariateKDE{Rx <: AbstractRange, Ry <: AbstractRange} =
MultivariateKDE{2, Tuple{Rx, Ry}}

Store both grid and density for KDE over the real line.

Reading the fields directly is part of the API, and
const BivariateDistribution = NTuple{2, UnivariateDistribution}

```julia
sum(density) * step(x) * step(y) ≈ 1
```

# Fields

$(FIELDS)
"""
mutable struct BivariateKDE{Rx<:AbstractRange,Ry<:AbstractRange} <: AbstractKDE
"First coordinate of gridpoints for evaluating the density."
x::Rx
"Second coordinate of gridpoints for evaluating the density."
y::Ry
"Kernel density at corresponding gridpoints `Tuple.(x, permutedims(y))`."
density::Matrix{Float64}
end

function kernel_dist(::Type{D},w::Tuple{Real,Real}) where D<:UnivariateDistribution
kernel_dist(D,w[1]), kernel_dist(D,w[2])
end
function kernel_dist(::Type{Tuple{Dx, Dy}},w::Tuple{Real,Real}) where {Dx<:UnivariateDistribution,Dy<:UnivariateDistribution}
kernel_dist(Dx,w[1]), kernel_dist(Dy,w[2])
end

const DataTypeOrUnionAll = Union{DataType, UnionAll}

# this function provided for backwards compatibility, though it doesn't have the type restrictions
# to ensure that the given tuple only contains univariate distributions
function kernel_dist(d::Tuple{DataTypeOrUnionAll, DataTypeOrUnionAll}, w::Tuple{Real,Real})
kernel_dist(d[1],w[1]), kernel_dist(d[2],w[2])
end

# TODO: there are probably better choices.
function default_bandwidth(data::Tuple{RealVector,RealVector})
default_bandwidth(data[1]), default_bandwidth(data[2])
end

# tabulate data for kde
function tabulate(data::Tuple{RealVector, RealVector}, midpoints::Tuple{Rx, Ry},
weights::Weights = default_weights(data)) where {Rx<:AbstractRange,Ry<:AbstractRange}
xdata, ydata = data
ndata = length(xdata)
length(ydata) == ndata || error("data vectors must be of same length")

xmid, ymid = midpoints
nx, ny = length(xmid), length(ymid)
sx, sy = step(xmid), step(ymid)

# Set up a grid for discretized data
grid = zeros(Float64, nx, ny)
ainc = 1.0 / (sum(weights)*(sx*sy)^2)

# weighted discretization (cf. Jones and Lotwick)
for i in 1:length(xdata)
x = xdata[i]
y = ydata[i]
kx, ky = searchsortedfirst(xmid,x), searchsortedfirst(ymid,y)
jx, jy = kx-1, ky-1
if 1 <= jx <= nx-1 && 1 <= jy <= ny-1
grid[jx,jy] += (xmid[kx]-x)*(ymid[ky]-y)*ainc*weights[i]
grid[kx,jy] += (x-xmid[jx])*(ymid[ky]-y)*ainc*weights[i]
grid[jx,ky] += (xmid[kx]-x)*(y-ymid[jy])*ainc*weights[i]
grid[kx,ky] += (x-xmid[jx])*(y-ymid[jy])*ainc*weights[i]
end
end

# returns an un-convolved KDE
BivariateKDE(xmid, ymid, grid)
end

# convolution with product distribution of two univariates distributions
function conv(k::BivariateKDE, dist::Tuple{UnivariateDistribution,UnivariateDistribution})
# Transform to Fourier basis
Kx, Ky = size(k.density)
ft = rfft(k.density)

distx, disty = dist

# Convolve fft with characteristic function of kernel
cx = -twoπ/(step(k.x)*Kx)
cy = -twoπ/(step(k.y)*Ky)
for j = 0:size(ft,2)-1
for i = 0:size(ft,1)-1
ft[i+1,j+1] *= cf(distx,i*cx)*cf(disty,min(j,Ky-j)*cy)
end
function Base.getproperty(k::BivariateKDE, s::Symbol)
if s === :x
k.ranges[1]
elseif s === :y
k.ranges[2]
else
getfield(k, s)
end
dens = irfft(ft, Kx)

for i = 1:length(dens)
dens[i] = max(0.0,dens[i])
end

# Invert the Fourier transform to get the KDE
BivariateKDE(k.x, k.y, dens)
end

const BivariateDistribution = Union{MultivariateDistribution,Tuple{UnivariateDistribution,UnivariateDistribution}}

default_weights(data::Tuple{RealVector, RealVector}) = UniformWeights(length(data[1]))

function kde(data::Tuple{RealVector, RealVector}, weights::Weights, midpoints::Tuple{Rx, Ry},
dist::BivariateDistribution) where {Rx<:AbstractRange,Ry<:AbstractRange}
k = tabulate(data, midpoints, weights)
conv(k,dist)
end

function kde(data::Tuple{RealVector, RealVector}, dist::BivariateDistribution;
boundary::Tuple{Tuple{Real,Real}, Tuple{Real,Real}} = (kde_boundary(data[1],std(dist[1])),
kde_boundary(data[2],std(dist[2]))),
npoints::Tuple{Int,Int}=(256,256),
weights::Weights = default_weights(data))

xmid = kde_range(boundary[1],npoints[1])
ymid = kde_range(boundary[2],npoints[2])

kde(data,weights,(xmid,ymid),dist)
end

function kde(data::Tuple{RealVector, RealVector}, midpoints::Tuple{Rx, Ry};
bandwidth=default_bandwidth(data), kernel=Normal,
weights::Weights = default_weights(data)) where {Rx<:AbstractRange,Ry<:AbstractRange}

dist = kernel_dist(kernel,bandwidth)
kde(data,weights,midpoints,dist)
end

function kde(data::Tuple{RealVector, RealVector};
bandwidth=default_bandwidth(data),
kernel=Normal,
boundary::Tuple{Tuple{Real,Real}, Tuple{Real,Real}} = (kde_boundary(data[1],bandwidth[1]),
kde_boundary(data[2],bandwidth[2])),
npoints::Tuple{Int,Int}=(256,256),
weights::Weights = default_weights(data))

dist = kernel_dist(kernel,bandwidth)
xmid = kde_range(boundary[1],npoints[1])
ymid = kde_range(boundary[2],npoints[2])

kde(data,weights,(xmid,ymid),dist)
end

# matrix data
function kde(data::RealMatrix,args...;kwargs...)
size(data,2) == 2 || error("Can only construct KDE from matrices with 2 columns.")
kde((data[:,1],data[:,2]),args...;kwargs...)
end
14 changes: 9 additions & 5 deletions src/interp.jl
Original file line number Diff line number Diff line change
Expand Up @@ -16,17 +16,21 @@ end
InterpKDE(kde::UnivariateKDE) = InterpKDE(kde, BSpline(Quadratic(Line(OnGrid()))))


function InterpKDE(kde::BivariateKDE, opts...)
function InterpKDE(kde::MultivariateKDE, opts...)
itp_u = interpolate(kde.density,opts...)
itp = scale(itp_u, kde.x, kde.y)
itp = scale(itp_u, kde.ranges...)
etp = extrapolate(itp, zero(eltype(kde.density)))
InterpKDE{typeof(kde),typeof(etp)}(kde, etp)
end
InterpKDE(kde::BivariateKDE) = InterpKDE(kde::BivariateKDE, BSpline(Quadratic(Line(OnGrid()))))
InterpKDE(kde::MultivariateKDE) = InterpKDE(kde, BSpline(Quadratic(Line(OnGrid()))))

pdf(ik::InterpKDE,x::Real...) = ik.itp(x...)
pdf(ik::InterpKDE,xs::AbstractVector) = [ik.itp(x) for x in xs]
pdf(ik::InterpKDE,xs::AbstractVector,ys::AbstractVector) = [ik.itp(x,y) for x in xs, y in ys]
function pdf(ik::InterpKDE{K, I}, xs::Vararg{AbstractVector, N}) where
{N, R, K <: MultivariateKDE{N, R}, I}

[ik.itp(x...) for x in Iterators.product(xs...)]
end

pdf(k::UnivariateKDE,x) = pdf(InterpKDE(k),x)
pdf(k::BivariateKDE,x,y) = pdf(InterpKDE(k),x,y)
pdf(k::MultivariateKDE,x...) = pdf(InterpKDE(k),x...)
Loading