Performance vs DataFrames #4

Open
jariji opened this issue Oct 25, 2024 · 1 comment

jariji commented Oct 25, 2024

DataFrames.jl grouping is faster than FlexiGroups here. Maybe it works column-wise rather than row-wise? I think you mentioned an Accessors.jl way to do column operations.

# BenchmarkTools provides @btime
using BenchmarkTools, DataFrames, StructArrays, FlexiGroups, Random
n = 1_000_000
# three columns of random strings
sa = StructArray(
    a = [randstring() for _ in 1:n],
    b = [randstring() for _ in 1:n],
    c = [randstring() for _ in 1:n],
)
df = DataFrame(sa)
# FlexiGroups variants, grouping by all three fields
@btime group(x->(x.a,x.b,x.c), sa);
@btime groupview(x->(x.a,x.b,x.c), sa);
@btime group_vg(x->(x.a,x.b,x.c), sa);
@btime groupview_vg(x->(x.a,x.b,x.c), sa);
# DataFrames equivalent
@btime groupby(df, [:a,:b,:c]);

julia> @btime group(x->(x.a,x.b,x.c), sa);
  178.707 ms (72 allocations: 225.47 MiB)

julia> @btime groupview(x->(x.a,x.b,x.c), sa);
  273.819 ms (68 allocations: 278.88 MiB)

julia> @btime group_vg(x->(x.a,x.b,x.c), sa);
  474.875 ms (4999566 allocations: 690.86 MiB)

julia> @btime groupview_vg(x->(x.a,x.b,x.c), sa);
  287.869 ms (73 allocations: 362.80 MiB)

julia> @btime groupby(df, [:a,:b,:c]);
  36.635 ms (190 allocations: 31.28 MiB)

  [a93c6f00] DataFrames v1.7.0
  [1e56b746] FlexiGroups v0.1.26

julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900XT 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 24 default, 0 interactive, 12 GC (on 24 virtual cores)

aplavin (Member) commented Oct 25, 2024

This has nothing to do with column- versus row-based storage. Even simple functions that just count occurrences, whether taken from StatsBase or written manually with a Dict, are slower than DataFrames.groupby:
[image]
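
Since the screenshot is not reproduced here, the comparison described above can be sketched roughly as follows. This is only an illustration, not the code from the image: `count_dict` is a hypothetical helper written for this example, and `countmap` is assumed to be the StatsBase function meant. Timings are deliberately omitted because they depend on the machine and package versions.

```julia
using BenchmarkTools, DataFrames, StatsBase, Random

# Hypothetical helper: count occurrences of each value with a plain Dict.
function count_dict(xs)
    counts = Dict{eltype(xs),Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    return counts
end

n = 1_000_000
a = [randstring() for _ in 1:n]   # a single column of random strings
df = DataFrame(a = a)

# Interpolate globals with `$` so the benchmark does not measure global-variable access.
@btime count_dict($a);
@btime countmap($a);      # StatsBase's occurrence counter
@btime groupby($df, :a);  # DataFrames grouping on the same column
```

If the two counting approaches turn out slower than `groupby` on your setup as well, that would support the point that the gap comes from the hashing and grouping machinery rather than from row- versus column-oriented access.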

Speed improvements would be nice, but I personally don't have near-term plans to work on them; for my use cases, anything "asymptotically fast" is enough.
