-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
regalloc2 makes too many memory allocations #87
Comments
Thanks for the analysis and write up! Very similar to what I've seen when profiling cranelift.
The way cranelift works is that each function is compiled on a separate thread effectively so there isn't a great way for cranelift at least to reuse these contexts since each function compilation only calls regalloc once. I wonder how many of these temporary allocations could be reused within a single invocation of the register allocator though?
There are some restrictions that |
This is not mandatory. For example cg_clif compiles all functions in the same cgu on the same thread, reusing the |
For what's it's worth, I made a PR some time ago that allows to reuse cranelift's contexts across compilations of different functions on different threads: https://github.com/bytecodealliance/wasmtime/blob/main/crates/cranelift/src/compiler.rs#L52 So if the regalloc context was stored in the cranelift context, it looks like it could be reused. |
I used heaptrack to analyze the memory allocation patterns of my compiler which uses regalloc2. I used a large benchmark (LLVM library) which has 1257955 basic blocks in 104913 functions. Out of a total of 46780667 allocations, 46779011 of them come from regalloc2 and only 1656 come from other parts of the compiler.
My compiler uses a similar structure 1 to Cranelift where a
Context
structure allows memory allocations to be reused across compilation units. Specifically, this allows variousVec
s andHashMap
s to grow to a sufficient size only once while being reused many times for compiling multiple functions.regalloc2 unfortunately doesn't follow this design:
Env
structure is allocated and freed for each function. No memory allocations are preserved.SmallVec
s andHashSet
s. Heaptrack reports that 19865290 allocations (42% of the total) are "temporary". These are allocations which are immediately followed by a free, with no other allocations in between.BTreeMap
which needs to allocate many nodes separately and cannot reuse memory unlikeHashMap
andVec
.Heaptrack results
Here are the major allocation hotspots shown by heaptrack:
In
try_to_allocate_bundle_to_reg
:FxHashSet
.PRegData::allocations
(LiveRangeSet
backed by aBTreeMap
).In
merge_bundles
:VReg
toSpillSet::vregs
(SmallVec
).LiveRangeListEntry
s toLiveBundle::ranges
(SmallVec
).In
insert_use_into_liverange
:Use
toLiveRange::uses
(SmallVec
).In
add_liverange_to_vreg
:LiveRangeListEntry
toVRegData::ranges
(SmallVec
).In
create_pregs_and_vregs
:Env::vregs
andEnv::allocs
andEnv::inst_alloc_offsets
. (Vec
)In
Env::new
:Vec
s withVec::with_capacity
.Performance impact
Benchmarking shows that approximately 10% of the time is spent in memory allocation functions (
malloc
,realloc
,free
). Additionally, another 10% of the time is spent inmemcpy
/memmove
which could be due to vector reallocation.Compiling with multiple threads can cause additional contention on the global allocator. This was noticed on musl targets where the global allocator doesn't have thread-local caches: compiler performance on 6 threads (on a 6-core machine) was reduced to that of a single thread whereas other allocators allow almost-perfect performance scaling.
Proposed solution
I believe the best way to improve regalloc2 would be to use a design similar to Cranelift with a
Context
structure that can preserve memory allocation across multiple invocations of the register allocator.SmallVec
s andHashSet
s should be replaced with a single instance inContext
.BTreeMap
s should be replaced withcranelift-bforest
which maintains a pool of nodes that can be reused.SmallVec
s should be replaced withEntityList
fromcranelift-entity
and aListPool
inContext
.Since this would require a dependency on the
cranelift-entity
andcranelift-bforest
crates, this is also a good opportunity to resolve #62 and switch to using proper indexed types internally.Footnotes
In fact I am using the
cranelift-entity
crate directly, it's very well written. ↩The text was updated successfully, but these errors were encountered: