Slicing with larger key vectors is slow #146

jtackm · 2023-11-08T14:39:08Z

Thanks for this great package! I find myself regularly slicing large KeyedArray matrices with large vectors of string keys (about 10% of the matrix). This is unfortunately currently slow:

using Random, AxisKeys, BenchmarkTools
A = KeyedArray(zeros(100000, 100), sid=["S$i" for i in 1:100000], oid=["O$i" for i in 1:100])
sub_sids = rand(axiskeys(A, 1), 10000)
A_sub_slow = @btime A[Key(sub_sids), :] 
# run time: 6.156 s

On my real data it can take minutes, which is why I regularly find myself using an indexin workaround:

A_sub_fast = @btime begin
    sub_sids_idx = indexin(sub_sids, axiskeys(A, 1))
    A[sub_sids_idx, :]
end 
# run time: 26.203 ms

@assert A_sub_slow == A_sub_fast

The slow method becomes much faster when slicing the matrix at the beginning (sub_sids = axiskeys(A, 1)[1:10000],
255.530 ms) and much slower when slicing at the end (sub_sids = axiskeys(A, 1)[end-10000:end],
18.904 s). The fast method is faster in all scenarios: 12.335 ms (slice beginning), 15.037 ms (slice end).

Only when performing small slices (100 elements) at the beginning can I see advantages of the default method (36.438 μs vs 77.185 μs). I may be missing something, but could the current method perhaps make use of indexin internally?

The text was updated successfully, but these errors were encountered:

aplavin · 2023-11-08T15:22:23Z

As noted in the README, AxisKeys uses whatever search method your axiskeys arrays provide (as findfirst method). Regular arrays do linear search in their findfirst, that's why it is slow for many indices. AxisKeys is composable though, so you can use any array type with faster search:

julia> using UniqueVectors

julia> A = KeyedArray(zeros(100000, 100), sid=UniqueVector(["S$i" for i in 1:100000]), oid=["O$i" for i in 1:100])

julia> @btime A[Key(sub_sids), :]
  4.374 ms (25 allocations: 7.78 MiB)

I guess it could do indexin automatically in these cases, but for now it always does findfirst.

jtackm · 2023-11-08T15:33:19Z

Oh wow, that's a great feature! This capability didn't become apparent to me when reading the README, but maybe that's just me? I had tried with Set at one point (as a blind guess) in hope of faster lookups. Anyways, thanks a lot for the pointer.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slicing with larger key vectors is slow #146

Slicing with larger key vectors is slow #146

jtackm commented Nov 8, 2023

aplavin commented Nov 8, 2023 •

edited

Loading

jtackm commented Nov 8, 2023 •

edited

Loading

Slicing with larger key vectors is slow #146

Slicing with larger key vectors is slow #146

Comments

jtackm commented Nov 8, 2023

aplavin commented Nov 8, 2023 • edited Loading

jtackm commented Nov 8, 2023 • edited Loading

aplavin commented Nov 8, 2023 •

edited

Loading

jtackm commented Nov 8, 2023 •

edited

Loading