Thanks for this great package! I find myself regularly slicing large KeyedArray matrices with large vectors of string keys (about 10% of the matrix). This is unfortunately currently slow:
using Random, AxisKeys, BenchmarkTools
A = KeyedArray(zeros(100000, 100), sid=["S$i" for i in 1:100000], oid=["O$i" for i in 1:100])
sub_sids = rand(axiskeys(A, 1), 10000)
A_sub_slow = @btime A[Key(sub_sids), :]
# run time: 6.156 s
On my real data it can take minutes, which is why I regularly find myself using an indexin workaround:
The slow method becomes much faster when slicing the matrix at the beginning (sub_sids = axiskeys(A, 1)[1:10000],
255.530 ms) and much slower when slicing at the end (sub_sids = axiskeys(A, 1)[end-10000:end],
18.904 s). The fast method is faster in all scenarios: 12.335 ms (slice beginning), 15.037 ms (slice end).
Only when performing small slices (100 elements) at the beginning can I see advantages of the default method (36.438 μs vs 77.185 μs). I may be missing something, but could the current method perhaps make use of indexin internally?
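For illustration, an indexin-based lookup of this kind might look roughly like the sketch below. It uses smaller sizes than the benchmark and is not necessarily the exact workaround referred to above. `indexin` is from Base; the `Vector{Int}` conversion is an assumption added here so that a missing key raises an error instead of passing `nothing` into the index.

```julia
using AxisKeys

# Smaller than the benchmark above so the example runs quickly.
A = KeyedArray(zeros(1000, 10), sid=["S$i" for i in 1:1000], oid=["O$i" for i in 1:10])
sub_sids = rand(axiskeys(A, 1), 100)

# indexin hashes the keys once and returns, for each requested key, its
# position in the key vector (or `nothing` if the key is absent).
# Converting to Vector{Int} throws if any key was not found.
idx = Vector{Int}(indexin(sub_sids, axiskeys(A, 1)))

# Plain integer indexing then skips the per-key search entirely.
A_sub_fast = A[idx, :]
```

The result should carry the same keys along the first dimension as `A[Key(sub_sids), :]`, since integer indexing slices the key vectors along with the data.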
As noted in the README, AxisKeys uses whatever search method your axiskeys arrays provide, via their findfirst method. Regular arrays do a linear search in findfirst, which is why lookups are slow for many indices. AxisKeys is composable, though, so you can use any array type with a faster search:
julia> using UniqueVectors
julia> A = KeyedArray(zeros(100000, 100), sid=UniqueVector(["S$i" for i in 1:100000]), oid=["O$i" for i in 1:100])
julia> @btime A[Key(sub_sids), :]
4.374 ms (25 allocations: 7.78 MiB)
I guess it could do indexin automatically in these cases, but for now it always does findfirst.
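To make the cost difference concrete, here is a small sketch using only Base Julia. It is not AxisKeys' actual code path, just an approximation of the two strategies: one linear findfirst scan per requested key versus a single hashed pass with indexin. The names `keys_vec` and `wanted` are made up for the example.

```julia
keys_vec = ["S$i" for i in 1:10_000]
wanted = keys_vec[end-99:end]   # keys near the end: worst case for a linear scan

# One full scan of keys_vec per requested key: O(length(keys_vec) * length(wanted)).
idx_linear = [findfirst(isequal(k), keys_vec) for k in wanted]

# indexin builds a hash table of keys_vec once, then does one lookup per key.
idx_hashed = indexin(wanted, keys_vec)

# Both strategies find the same positions.
@assert idx_linear == idx_hashed
```

This also suggests why the timings above depend on where the requested keys sit: a linear scan stops early for keys near the start of the vector and has to walk almost the whole vector for keys near the end.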
Oh wow, that's a great feature! This capability didn't become apparent to me when reading the README, but maybe that's just me? I had tried with Set at one point (as a blind guess) in hope of faster lookups. Anyways, thanks a lot for the pointer.