Performance vs DataFrames #4

jariji · 2024-10-25T06:51:02Z

DataFrames.jl grouping is faster. Maybe it works column-wise instead of row-wise or something? I think you mentioned an Accessors.jl way to do column ops.

using BenchmarkTools, DataFrames, StructArrays, FlexiGroups, Random
n = 1000000
sa = StructArray(
    a = [randstring() for _ in 1:n],
    b = [randstring() for _ in 1:n],
    c = [randstring() for _ in 1:n],
)
df = DataFrame(sa)
@btime group(x->(x.a,x.b,x.c), sa);
@btime groupview(x->(x.a,x.b,x.c), sa);
@btime group_vg(x->(x.a,x.b,x.c), sa);
@btime groupview_vg(x->(x.a,x.b,x.c), sa);
@btime groupby(df, [:a,:b,:c]);

julia> @btime group(x->(x.a,x.b,x.c), sa);
  178.707 ms (72 allocations: 225.47 MiB)

julia> @btime groupview(x->(x.a,x.b,x.c), sa);
  273.819 ms (68 allocations: 278.88 MiB)

julia> @btime group_vg(x->(x.a,x.b,x.c), sa);
  474.875 ms (4999566 allocations: 690.86 MiB)

julia> @btime groupview_vg(x->(x.a,x.b,x.c), sa);
  287.869 ms (73 allocations: 362.80 MiB)

julia> @btime groupby(df, [:a,:b,:c]);
  36.635 ms (190 allocations: 31.28 MiB)

  [a93c6f00] DataFrames v1.7.0
  [1e56b746] FlexiGroups v0.1.26

julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 24 × AMD Ryzen 9 3900XT 12-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver2)
Threads: 24 default, 0 interactive, 12 GC (on 24 virtual cores)

aplavin · 2024-10-25T09:07:39Z

This has nothing to do with column-based vs. row-based storage. Even simple functions that just count occurrences are slower than DataFrames.groupby, whether taken from StatsBase or written manually using a Dict.
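
For readers who want a concrete picture of the "manual Dict" baseline mentioned above, here is a minimal sketch of counting occurrences of composite keys with a plain `Dict`. It is not the code that was benchmarked in this thread: the helper name `count_keys` and the tiny input vectors are hypothetical and only illustrate the technique. `StatsBase.countmap` is the library function that does the same job.

```julia
# Hypothetical helper (not from FlexiGroups or StatsBase):
# count how many times each key occurs in an iterable.
function count_keys(keys)
    counts = Dict{eltype(keys), Int}()
    for k in keys
        # get(...) returns 0 the first time a key is seen
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

# Small illustrative data, not the 1_000_000-row benchmark above.
a = ["x", "y", "x", "z"]
b = ["1", "2", "1", "2"]

# Keys are tuples of column values, mirroring x -> (x.a, x.b) in the issue.
counts = count_keys(zip(a, b))
```

To compare against `groupby`, one could time this on the same columns with `@btime count_keys(zip($a, $b))`, assuming BenchmarkTools is loaded. With all-unique random strings, most of the cost is hashing the keys and growing the `Dict`.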

Speed improvements would be nice, but I personally don't have near-term plans to work on them; for my use cases, anything "asymptotically fast" is enough.
