AxisKeys.jl


This package defines a thin wrapper which, alongside any array, stores a vector of "keys" for each dimension. This may be useful to store perhaps actual times of measurements, or some strings labeling columns, etc. These will be propagated through many operations on arrays (including broadcasting, map, comprehensions, sum etc.) and altered by a few (sorting, fft, push!).

It works closely with NamedDims.jl, another wrapper which attaches names to dimensions. These names are a tuple of symbols, like those of a NamedTuple, and can be used for specifying which dimensions to sum over, etc. A nested pair of these wrappers can be made as follows:

using AxisKeys
data = rand(Int8, 2,10,3) .|> abs;
A = KeyedArray(data; channel=[:left, :right], time=range(13, step=2.5, length=10), iter=31:33)

[Screenshot: terminal pretty printing of A]

The package aims not to be opinionated about what you store in these "key vectors": they can be arbitrary AbstractVectors, and need not be sorted nor have unique elements. Integer "keys" are allowed, and should have no surprising interactions with indices. While it is further from zero-cost than NamedDims.jl, it aims to be light-weight, leaving as much functionality as possible to other packages.

See § Elsewhere below for other packages doing similar things.

Selections

Indexing still works directly on the underlying array, and keyword indexing (of a nested pair) works exactly as for a NamedDimsArray. But in addition, it is possible to pick out elements based on the keys, which for clarity I will call lookup. This is written with round brackets:

| Dimension d | Indexing: i ∈ axes(A,d) | Lookup: key ∈ axiskeys(A,d) |
|---|---|---|
| by position | A[1,2,:] | A(:left, 15.5, :) |
| by name | A[iter=1] | A(iter=31) |
| by type | -- | B = A(:left) |

When using dimension names, fixing only some of them will return a slice, such as B = A[channel=1]. You may also give just one key, provided its type matches those of just one dimension, such as B = A(:left) where the key is a Symbol.
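As a rough illustration of the table above, the sketch below rebuilds A with deterministic data and checks that each indexing form picks the same elements as its lookup counterpart. The use of reshape(1:60, ...) is an assumption made here so that results can be compared; the example at the top uses random Int8 values.

```julia
using AxisKeys

# Deterministic stand-in for the random data used above (an assumption for illustration).
data = reshape(1:60, 2, 10, 3)
A = KeyedArray(data; channel=[:left, :right], time=range(13, step=2.5, length=10), iter=31:33)

# By position: index 2 along time corresponds to the key 15.5.
by_position = A[1, 2, :] == A(:left, 15.5, :)

# By name: the first index along iter corresponds to the key 31.
by_name = A[iter=1] == A(iter=31)

# By type: :left is a Symbol, and only the channel keys are Symbols.
B = A(:left)
by_type = B == A[channel=1]
```

All three comparisons should hold, and B keeps the remaining time and iter dimensions.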

Note that indexing is the primary way to access the data. Lookup calls for example i = findfirst(isequal(:left), axiskeys(A,1)) to convert keys to indices, and thus will always be slower. If you want lookup to be the primary mode of access, then you may want a dictionary, possibly Dictionaries.jl.
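To make the cost of lookup concrete, here is a plain-Julia sketch of the key-to-index translation, using no packages. The helper key_to_index is hypothetical and is not part of AxisKeys; it only mirrors the findfirst call described above.

```julia
# Plain arrays standing in for a keyed matrix (names are illustrative only).
data    = [10 20 30; 40 50 60]   # 2×3 matrix
rowkeys = [:left, :right]        # keys for dimension 1
colkeys = [31, 32, 33]           # keys for dimension 2

# Hypothetical helper: translate one key into an index by a linear search.
function key_to_index(keys::AbstractVector, key)
    i = findfirst(isequal(key), keys)
    i === nothing && throw(KeyError(key))
    return i
end

i = key_to_index(rowkeys, :right)   # 2
j = key_to_index(colkeys, 33)       # 3
value = data[i, j]                  # 60
```

Each lookup pays for a search through the key vector before the ordinary index operation, which is why direct indexing is the faster path.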

There are also a number of special selectors, which work like this:

| Selector | Indexing | Lookup | Result |
|---|---|---|---|
| one nearest | B[time = 3] | B(time = Near(17.0)) | vector |
| all in a range | B[2:5, :] | B(Interval(14,25), :) | matrix |
| all matching | B[3:end, Not(3)] | B(>(17), !=(33)) | matrix |
| mixture | B[1, Key(33)] | B(Index[1], 33) | scalar |
| non-scalar | B[iter=[1, 3]] | B(iter=[31, 33]) | matrix |

Here Interval(14,25) can also be written 14..25; it comes from IntervalSets.jl. Any function can be used to select keys, including lambdas: B(time = t -> 0<t<17). You may give just one ::Base.Fix2 function (such as <=(18) or ==(20)) provided its argument type matches the keys of one dimension. An interval or a function always selects via findall, i.e. it does not drop a dimension, even if there is exactly one match.
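The selectors in the table can be pictured as ordinary searches over the key vector. The sketch below uses only Base Julia on the time keys from the first example. Treating Near as an argmin over distances is an assumption made for illustration, not a statement of how the package implements it.

```julia
times = range(13, step=2.5, length=10)   # 13.0, 15.5, ..., 35.5

# "one nearest": the key closest to 17.0 is 18.0, at index 3.
nearest = argmin(abs.(times .- 17.0))

# "all in a range": keys inside the closed interval 14..25.
in_range = findall(t -> 14 <= t <= 25, times)   # [2, 3, 4, 5]

# "all matching": keys greater than 17.
matching = findall(>(17), times)                # indices 3 through 10

# A function with exactly one match still returns a vector of indices,
# which is why the dimension is not dropped.
single = findall(==(20.5), times)               # [4]
```

These agree with the indexing column of the table: B[time = 3], B[2:5, :] and B[3:end, ...].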

While this table shows lookup selectors inside B(...), they can in fact all be used inside B[...], not just Key(k) as shown. They still refer to keys not indices! (This will not select dimension based on type, i.e. A[Key(:left)] is an error.) You may also write Index[end] but not Index[end-1].

By default lookup returns a view, while indexing returns a copy unless you add @views. This means that you can write into the array with B(time = <=(18)) .= 0. For scalar output, you cannot of course write B(13.0, 33) = 0 as this is parsed as a function definition, but you can write B[Key(13.0), Key(33)] = 0, or else B(13.0, 33, :) .= 0 as a trailing colon makes a zero-dimensional view.
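The difference between a copy and a view, and the role of a zero-dimensional view, can be seen with plain Base arrays. This is only an analogy for the behaviour described above, written without AxisKeys.

```julia
v = collect(1:5)

c = v[2:3]               # indexing with a range makes a copy
c .= 0                   # writes into the copy only
after_copy = copy(v)     # still [1, 2, 3, 4, 5]

w = view(v, 2:3)         # a view shares memory with v
w .= 0                   # v is now [1, 0, 0, 4, 5]
after_view = copy(v)

z = view(v, 1)           # a scalar index gives a zero-dimensional view
z .= 9                   # broadcasting assignment writes the single element
after_scalar = copy(v)   # [9, 0, 0, 4, 5]
```

Writing through a zero-dimensional view with .= is what makes the trailing-colon form usable for single elements.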

Construction

KeyedArray(rand(Int8, 2,10), ([:a, :b], 10:10:100)) # AbstractArray, Tuple{AbstractVector, ...}

A nested pair of wrappers can be constructed with keywords for names, and everything should work the same way in either order:

KeyedArray(rand(Int8, 2,10), row=[:a, :b], col=10:10:100)     # KeyedArray(NamedDimsArray(...))
NamedDimsArray(rand(Int8, 2,10), row=[:a, :b], col=10:10:100) # NamedDimsArray(KeyedArray(...))

Calling AxisKeys.keyless(A) removes the KeyedArray wrapper, if any, and NamedDims.unname(A) similarly removes the names (regardless of which is outermost).

There is another more "casual" constructor, via the function wrapdims. This does a bit more checking of inputs, and will adjust the length of ranges of keys if it can, and will fix indexing offsets if needed to match the array. The resulting order of wrappers is controlled by AxisKeys.nameouter()=false.

wrapdims(rand(Int8, 10), alpha='a':'z') 
# Warning: range 'a':1:'z' replaced by 'a':1:'j', to match size(A, 1) == 10

wrapdims(OffsetArray(rand(Int8, 10),-1), iter=10:10:100)
axiskeys(ans,1) # 10:10:100 with indices 0:9

Finally, wrapdims will also convert AxisArrays, NamedArrays, as well as NamedTuples.

Functions

The function axes(A) returns (a tuple of vectors of) indices as usual, and axiskeys(A) similarly returns (a tuple of vectors of) keys. If the array has names, then dimnames(A) returns them. These functions work like size(A, d) = size(A, name) to get just one.
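A small sketch of these accessors, assuming a 2×3 array built with deterministic data (the data and sizes are chosen here for illustration):

```julia
using AxisKeys

A = KeyedArray(reshape(1:6, 2, 3); channel=[:left, :right], iter=31:33)

all_axes  = axes(A)       # a tuple of index ranges, one per dimension
all_keys  = axiskeys(A)   # a tuple of key vectors, one per dimension
all_names = dimnames(A)   # (:channel, :iter)

# One dimension at a time, by number or by name:
channel_keys = axiskeys(A, 1)       # [:left, :right]
iter_keys    = axiskeys(A, :iter)   # 31:33
```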

The following things should work:

  • Broadcasting log.(A) and map(log, A), as well as comprehensions [log(x) for x in A].

  • Transpose etc, permutedims, mapslices.

  • Concatenation hcat(B, B .+ 100) works. Note that the keys along the glued direction may not be unique afterwards.

  • Reductions like sum(A; dims=:channel) can use dimension names. Likewise prod, mean etc., and dropdims.

  • Sorting: sort and sortslices permute keys & data by the array, while a new function sortkeys goes by the keys. reverse similarly re-orders keys to match data.

  • Some linear algebra functions like * and \ will work.

  • Getproperty returns the key vector, to allow things like for (i,t) in enumerate(A.time); fun(val = A[i,:], time = t); ....

  • Vectors support push!(V, val), which will try to extend the key vector. There is also a method push!(V, key => val) which pushes in a new key.

To allow for this limited mutability, V.keys isa Ref for vectors, while A.keys isa Tuple for matrices & higher. But axiskeys(A) always returns a tuple.

  • Named tuples can be converted to and from keyed vectors, with collect(keys(nt)) == Symbol.(axiskeys(V,1))

  • The Tables.jl interface is supported, with wrapdims(df, :val, :x, :y) creating a matrix from 3 columns.

  • Some StatsBase.jl and CovarianceEstimation.jl functions are supported. (PR#28.)

  • FFTW.fft transforms the keys; if these are times such as Unitful.s then the results are frequency labels. (PR#15.)

  • LazyStack.stack understands names and keys. Stacks of named tuples like stack((a=i, b=i^2) for i=1:5) create a matrix with [:a, :b].

  • NamedPlus has a macro which works on comprehensions: @named [n^pow for n=1:10, pow=0:2:4] has names and keys.

Absent

  • There is no automatic alignment of dimensions by name. Thus A .+ A[iter=3] is fine as both names and keys line up, but A .+ B is an error, as B's first name is :time not :channel. (See NamedPlus.@named for something like this.)

As for NamedDims.jl, the guiding idea is that every operation which could be done on ordinary arrays should still produce the same data, but propagate the extra information (names/keys), and error if it conflicts.

Both packages allow for wildcards, which never conflict. In NamedDims.jl this is the name :_, here it is a Base.OneTo(n), like the axes of an Array. These can be constructed as M = wrapdims(rand(2,2); _=[:a, :b], cols=nothing), and for instance M .+ M' is not an error.

  • There are no special types provided for key vectors, they can be any AbstractVectors. Lookup happens by calling i = findfirst(isequal(20.0), axiskeys(A,2)), or is = findall(<(18), axiskeys(A,2)).

If you need lookup to be very fast, then you will want to use a package like UniqueVectors.jl or AcceleratedArrays.jl or CategoricalArrays.jl. To apply such a type to all dimensions, you may write D = wrapdims(rand(1000), UniqueVector, rand(Int, 1000)). Then D(n) here will use the fast lookup from UniqueVectors.jl (about 60x faster).

When a key vector is a Julia AbstractRange, then this package provides some faster overloads for things like findall(<=(42), 10:10:100).

  • There is also no automatic alignment by keys, like time. But this could be done elsewhere?

  • There is no interaction with interpolation, although this seems a natural fit. Why doesn't A(:left, 13.7, :) interpolate along continuous dimensions?
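On the faster range overloads mentioned above: when the keys form an increasing range, a search such as findall(<=(42), 10:10:100) need not inspect every element. The sketch below shows one way this could be done in Base Julia. The helper findall_le is hypothetical and is not the package's own code.

```julia
# Hypothetical helper: findall(<=(x), r) for an increasing range, returned as a range.
function findall_le(x, r::AbstractRange)
    step(r) > 0 || throw(ArgumentError("this sketch assumes an increasing range"))
    n = searchsortedlast(r, x)   # index of the last element <= x, or 0 if none
    return 1:n                   # a range of indices, with no allocation
end

r = 10:10:100
fast = findall_le(42, r)      # 1:4
slow = findall(<=(42), r)     # [1, 2, 3, 4]
same = fast == slow           # true
```

Because the result is itself a range, indexing with it can stay cheap too.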

Elsewhere

This is more or less an attempt to replace AxisArrays with several smaller packages. The complaints are: (1) It's confusing to guess whether to perform indexing or lookup based on whether it is given an integer (index) or not (key). (2) Each "axis" was its own type Axis{:name} which allowed zero-overhead lookup before Julia 1.0. But this is now possible with a simpler design. (They were called axes before Base.axes() was added, hence (3) the confusing terminology.) (4) Broadcasting is not supported, as this changed dramatically in Julia 1.0. (5) There are lots of assorted functions, special categorical vector types, etc. which aren't part of the core, and are poorly documented.

Other older packages (pre-Julia-1.0):

  • NamedArrays also provides names & keys, which are always OrderedDicts. Named lookup looks like NA[:x => 13.0] instead of A(x=13.0) here; this is not very fast. Dimension names & keys can be set after creation. Has nice pretty-printing routines. Returned by FreqTables.

  • LabelledArrays adds names for individual elements, more like a NamedTuple. Only for small sizes: the storage inside is a Tuple, not an Array.

  • AxisArrayPlots has some plot recipes.

  • OffsetArrays actually changes the indices of an Array, allowing any contiguous integer range, like 0:9 or -10:10. This package is happy to wrap such arrays, and if needed will adjust indices of the given key vectors: O = wrapdims(OffsetArray(["left", "mid", "right"], -1:1), 'A':'C'), then O[-1:0] works.

Other new packages (post-1.0):

  • Dictionaries does very fast lookup only (in this terminology), with no indexing. Not <: AbstractArray, not a wrapper around an Array. And presently only one-dimensional.

  • NamedPlus is some experiments using NamedDims. Function align permutes dimensions automatically, and macro @named can introduce this into broadcasting expressions.

  • AxisSets builds on this package to handle groups of arrays as a KeyedDataset.

  • DimensionalData is another replacement for AxisArrays. It again uses types like Dim{:name} to store both name & keys, although you can use Symbol keys that are converted to types internally. There are also some special ones like X, Y of the same abstract type (which must be in scope). Named lookup can use these types DA[X(At(:a))], or use the corresponding symbols DA[X=At(:a)], for what this package would write A(x=:a) or A[x=Key(:a)].

  • AxisIndices differs mainly by storing the keys with the axes in its own Axis type. This is returned by Base.axes(A) (instead of Base.OneTo etc.) like PR#6.

See also docs/speed.jl for some checks on this package, and comparisons to other ones. And see docs/repl.jl for some usage examples, showing pretty printing.

In 🐍-land:

  • xarray does indexing x[:, 0] and lookup by "coordinate label" as x.loc[:, 'IA']; with names these become x.isel(space=0) and x.sel(space='IA').

  • pandas is really more like DataFrames, only one- and two-dimensional. Writes indexing "by position" as df.iat[1, 1] for scalars or df.iloc[1:3, :] allowing slices, and lookup "by label" as df.at[dates[0], 'A'] for scalars or df.loc['20130102':'20130104', ['A', 'B']] for slices, "both endpoints are included" in this. See also Pandas.jl for a wrapper.