Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slicing with larger key vectors is slow #146

Open
jtackm opened this issue Nov 8, 2023 · 2 comments
Open

Slicing with larger key vectors is slow #146

jtackm opened this issue Nov 8, 2023 · 2 comments

Comments

@jtackm
Copy link

jtackm commented Nov 8, 2023

Thanks for this great package! I find myself regularly slicing large KeyedArray matrices with large vectors of string keys (about 10% of the matrix). This is unfortunately currently slow:

using Random, AxisKeys, BenchmarkTools
A = KeyedArray(zeros(100000, 100), sid=["S$i" for i in 1:100000], oid=["O$i" for i in 1:100])
sub_sids = rand(axiskeys(A, 1), 10000)
A_sub_slow = @btime A[Key(sub_sids), :] 
# run time: 6.156 s

On my real data it can take minutes, which is why I regularly find myself using an indexin workaround:

A_sub_fast = @btime begin
    sub_sids_idx = indexin(sub_sids, axiskeys(A, 1))
    A[sub_sids_idx, :]
end 
# run time: 26.203 ms

@assert A_sub_slow == A_sub_fast

The slow method becomes much faster when slicing the matrix at the beginning (sub_sids = axiskeys(A, 1)[1:10000],
255.530 ms) and much slower when slicing at the end (sub_sids = axiskeys(A, 1)[end-10000:end],
18.904 s). The fast method is faster in all scenarios: 12.335 ms (slice beginning), 15.037 ms (slice end).

Only when performing small slices (100 elements) at the beginning can I see advantages of the default method (36.438 μs vs 77.185 μs). I may be missing something, but could the current method perhaps make use of indexin internally?

@aplavin
Copy link
Collaborator

aplavin commented Nov 8, 2023

As noted in the README, AxisKeys uses whatever search method your axiskeys arrays provide (as findfirst method). Regular arrays do linear search in their findfirst, that's why it is slow for many indices. AxisKeys is composable though, so you can use any array type with faster search:

julia> using UniqueVectors

julia> A = KeyedArray(zeros(100000, 100), sid=UniqueVector(["S$i" for i in 1:100000]), oid=["O$i" for i in 1:100])

julia> @btime A[Key(sub_sids), :]
  4.374 ms (25 allocations: 7.78 MiB)

I guess it could do indexin automatically in these cases, but for now it always does findfirst.

@jtackm
Copy link
Author

jtackm commented Nov 8, 2023

Oh wow, that's a great feature! This capability didn't become apparent to me when reading the README, but maybe that's just me? I had tried with Set at one point (as a blind guess) in hope of faster lookups. Anyways, thanks a lot for the pointer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants