ObjectMapper default heap consumption increased significantly from 2.13.x to 2.14.0 #3665
From my understanding, ObjectMappers should be singletons as much as possible, so this would be a bad way of using the ObjectMapper. We use native imaging throughout and don't experience anything like this when it's used properly.
I created 100 ObjectMappers to show the extent of this problem. As you'll see in https://github.com/mhalbritter/graalvm-jackson, it is even observable with only 1 ObjectMapper. 2.14.0 uses ~880x more heap for an empty ObjectMapper compared to 2.13.4.2. Spring Boot in my WebMVC sample app uses 4 of them. Not sure if we can get this down to 1.
Hmmm - @cowtowncoder this one needs your eyes
I agree with @mhalbritter, it looks like the heap allocation regression in 2.14 is pretty huge, and I hope that something can be refined on the Jackson side. @cowtowncoder Looking forward to your feedback; we still have the opportunity to update Jackson before Spring Boot 3 GA is released (expected next week).
Uh. I did have some concerns about the apparent complexity of the improved cache choice, relative to its benefits. The baseline for the cache should be solid from what I recall, but @pjfanning worked on it most closely. This is unfortunate of course. But just to make sure: that's about 0.5 MB per ObjectMapper? Given the timing, I suspect the only potentially easyish win would be changing the initial size? And even that would require a 2.14.1 release.
Ok, looks like there are just 4 cache instances by default (plus one optional):
Initial sizes of all these seem to be between 16 and 65; nothing drastic unless I am missing something. It would be good to get more numbers, if possible, on the sizes of the individual caches.
Sorry, I haven't looked at that caching logic for about 7 years (or 12 since it was written). At the time the emphasis was concurrency for server-class usages, when a concurrent LRU algorithm was a novel, hard problem. I'm sure those AtomicReferences could be replaced by an AtomicReferenceArray to slim it down. It is striped into lanes to reduce thread contention, but you could greatly simplify that by knowing what is actually needed here, e.g. a single array is probably fine. I later rewrote this to dynamically stripe lightweight, MPSC ring buffers. That is more memory conscious by being slimmer and demand based.
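A minimal sketch of the kind of change ben-manes describes here, not the actual patch: per-slot `AtomicReference` objects versus a single backing `AtomicReferenceArray`. The buffer size is illustrative.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

class ReadBufferSketch {
    static final int SIZE = 128; // illustrative; actual sizing is discussed below

    // Before: one AtomicReference object (header + one field) per slot.
    final AtomicReference<Object>[] slots = createSlots();

    // After: a single backing array; one object header total, while still
    // providing volatile-read/CAS semantics per element.
    final AtomicReferenceArray<Object> packed = new AtomicReferenceArray<>(SIZE);

    @SuppressWarnings("unchecked")
    private static AtomicReference<Object>[] createSlots() {
        AtomicReference<Object>[] slots = new AtomicReference[SIZE];
        for (int i = 0; i < SIZE; i++) {
            slots[i] = new AtomicReference<>();
        }
        return slots;
    }
}
```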
That sounds useful @ben-manes, thank you for sharing. I wonder how easy it'd be to measure -- of course it's not rocket science to just count fields and do it manually, but there are probably tools to automate it using introspection, to at least give static minimums or such.
I used OpenJDK's JOL (Java Object Layout) on the current code to verify padding rules for protecting against false sharing. For a rough footprint estimate, jamm is good enough for seeing if things line up to expectations. All are just best guesses due to JVM internal complexity (JDK-8249196).
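For reference, a minimal probe of the kind described, assuming the `org.openjdk.jol:jol-core` dependency; the objects measured here are just examples:

```java
import java.util.concurrent.atomic.AtomicReference;
import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

public class FootprintProbe {
    public static void main(String[] args) {
        // Field-by-field layout of a single object, including headers and padding.
        System.out.println(ClassLayout.parseClass(AtomicReference.class).toPrintable());

        // Deep (retained) footprint of an object graph.
        AtomicReference<?>[] buffer = new AtomicReference<?>[128];
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = new AtomicReference<>();
        }
        System.out.println(GraphLayout.parseInstance((Object) buffer).totalSize() + " bytes");
    }
}
```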
VisualVM tells me that 1 ObjectMapper on the JVM uses 132,352 B of heap. VisualVM tells me that 1 ObjectMapper in a native image uses 311,944 B of heap. However, all LRUMap instances together are 507,320 B of heap. Here are all the LRUMap instances on the JVM with the fields which hold a reference to them:
They are all the same size, because the read-buffer dimensions are derived from the CPU count:

```java
static final int NCPU = Runtime.getRuntime().availableProcessors(); // 9 on my machine

static final int NUMBER_OF_READ_BUFFERS = ceilingNextPowerOfTwo(NCPU); // 16 on my machine
static final int READ_BUFFER_THRESHOLD = 32;
static final int READ_BUFFER_DRAIN_THRESHOLD = 2 * READ_BUFFER_THRESHOLD; // 64
static final int READ_BUFFER_SIZE = 2 * READ_BUFFER_DRAIN_THRESHOLD; // 128

readBuffers = new AtomicReference[NUMBER_OF_READ_BUFFERS][READ_BUFFER_SIZE];
```

which makes this an `AtomicReference[16][128]` on my machine, i.e. 2,048 `AtomicReference` instances per cache.
Yes, concurrency is unrelated to a maximum cache size. At that time a synchronized LRU was standard and perfectly fine in most cases, since single-core machines were still common. Those who used this concurrent cache (ConcurrentLinkedHashMap) were servers that had clear needs (4-1024 cores). This led to porting it into Guava (CacheBuilder), where we defaulted to 4 for Google's needs, with a setting to adjust manually (concurrencyLevel). A more general, modern cache (Caffeine) now dynamically adjusts the size by need, so that low concurrency has a low footprint while high concurrency benefits by trading some extra space. However, CLHM is simpler to fork and embed, so decisions that aged poorly get copied over. For Jackson a modest size is enough, thanks to lossy buffers and modest needs, so slimming this down should be simple and effective. Or, if you prefer the advanced solution, copy that code and make the minor replacement edits.
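For comparison, the off-the-shelf route would look roughly like this with Caffeine; the key/value types and the size cap here are illustrative only:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class CaffeineExample {
    public static void main(String[] args) {
        // Bounded cache; eviction policy and concurrency striping
        // adapt to actual contention at runtime.
        Cache<String, Object> cache = Caffeine.newBuilder()
                .maximumSize(200)
                .build();

        cache.put("greeting", "hello");
        System.out.println(cache.getIfPresent("greeting"));
    }
}
```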
The jackson-databind use case is to actually implement LRU behavior and not clear the whole cache when it fills (unlike the 2.13-and-before implementation). The use of PrivateMaxEntriesMap (a renamed fork of https://github.com/ben-manes/concurrentlinkedhashmap) is to be able to cap the entry count of the map and to evict the least-recently-used instances when the limit is reached. We could certainly try to reduce the size of those read buffers.
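For illustration of those semantics (cap the entry count, evict least-recently-used instead of clearing everything), here is a single-threaded `LinkedHashMap` sketch; it is not a drop-in replacement for `PrivateMaxEntriesMap`, which must also be concurrent:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU-with-cap semantics only; not thread-safe.
class LruMapSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruMapSketch(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true: get() refreshes recency
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the LRU entry instead of clearing the map
    }
}
```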
I've locally switched Spring's ConcurrentLruCache to use AtomicReferenceArray buffers. Right now Spring's implementation is not the dominator here, but we might still revisit the read buffer sizes or how its implementation works.
@pjfanning (I reworded your description a bit) -- +1 for capping the size of those arrays so that no matter how many cores there are, it would not increase linearly. @bclozel Use of AtomicReferenceArray sounds good. Ultimately I think memory usage per cache should be significantly lower than the initial implementation. Load testing would be nice, but it is worth noting we didn't perf test the new implementation vs the old implementation either...
This is due to CPU cache line sharing, which causes coherency bottlenecks, as distinct fields within a block are invalidated together. As the thread distribution is not perfect, some producers will select the same lane. In a perfect world each slot would be fully padded so that once an index is acquired, the two producers do not impact each other. That of course is very wasteful (64-byte blocks). The use of object references was a middle ground for partial padding. In these Guava benchmarks from one of my attempts to revisit using ring buffers (using CLQ as a quick-and-dirty first pass), the read throughput ranged from 240M (64b) / 256M (256b) / 285M (1kb) ops/s. While the 45M spread sounds scary, in practice the cache reads won't be a bottleneck if the cache can support ~60M ops/s, because real application work means further gains won't impact system performance.
Please see this JMH benchmark as a reference.
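To make the padding trade-off concrete, a hedged sketch of the classic fully-padded layout (Disruptor-style; 64-byte cache lines assumed, class names invented for illustration). Superclass fields are laid out before subclass fields, so the hot `value` ends up surrounded by filler:

```java
// Padding before the hot field.
class LhsPadding {
    long p01, p02, p03, p04, p05, p06, p07;
}

// The hot field itself, isolated between the two padding blocks.
class PaddedValue extends LhsPadding {
    volatile long value;
}

// Padding after the hot field: with common 64-byte lines, `value`
// no longer shares a cache line with a neighboring counter.
class PaddedCounter extends PaddedValue {
    long p09, p10, p11, p12, p13, p14, p15;

    void increment() {
        value++; // not atomic; this sketch illustrates layout only
    }
}
```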
@pjfanning you forgot to account for the `AtomicReference` object headers:

```
# Running 64-bit HotSpot VM.
# Using compressed oop with 3-bit shift.
# Using compressed klass with 3-bit shift.
# WARNING | Compressed references base/shifts are guessed by the experiment!
# WARNING | Therefore, computed addresses are just guesses, and ARE NOT RELIABLE.
# WARNING | Make sure to attach Serviceability Agent to get the reliable addresses.
# Objects are 8 bytes aligned.
# Field sizes by type: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]
# Array element sizes: 4, 1, 1, 2, 2, 4, 4, 8, 8 [bytes]

... (many variations) ...

***** Hotspot Layout Simulation (JDK 15, 64-bit model, compressed class pointers, 16-byte aligned)

java.util.concurrent.atomic.AtomicReference object internals:
OFF  SZ               TYPE DESCRIPTION                    VALUE
  0   8                    (object header: mark)          N/A
  8   4                    (object header: class)         N/A
 12   4                    (alignment/padding gap)
 16   8   java.lang.Object AtomicReference.value          N/A
 24   8                    (object alignment gap)
Instance size: 32 bytes
Space losses: 4 bytes internal + 8 bytes external = 12 bytes total
```
300 kB sounds in the ballpark of what was observed: 16 × 128 = 2,048 `AtomicReference` instances at 32 bytes each is roughly 64 kB per cache, before counting the array objects themselves, and there are four or five such caches per mapper. I wonder how much we could cut things here -- to me some added overhead is fine; hundreds of kilobytes per ObjectMapper is not.
I don't think it has to be a two-dimensional array either; we could save some space by flattening it.
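A sketch of that flattening, reusing the constants quoted earlier; the class and accessor names are invented for illustration:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

class FlatReadBuffers {
    static final int NUMBER_OF_READ_BUFFERS = 16;
    static final int READ_BUFFER_SIZE = 128;

    // One flat AtomicReferenceArray replaces AtomicReference[16][128]:
    // no per-slot wrapper objects and no per-row array headers.
    final AtomicReferenceArray<Object> readBuffers =
            new AtomicReferenceArray<>(NUMBER_OF_READ_BUFFERS * READ_BUFFER_SIZE);

    Object get(int lane, int index) {
        return readBuffers.get(lane * READ_BUFFER_SIZE + index);
    }

    void lazySet(int lane, int index, Object value) {
        readBuffers.lazySet(lane * READ_BUFFER_SIZE + index, value);
    }
}
```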
@yawkat Going to …
To be fair, since Jackson's caches are likely tiny, a Clock (aka Second Chance) policy would be adequate if you wanted a much simpler rewrite. That was my original stop-gap when starting to explore a general approach. It is simply a FIFO with a mark bit that is set on a read and reset during an O(n) eviction scan. That has similar hit rates to an LRU, with lock-free reads, but writes block on a shared lock, and large caches have GC-like pauses due to scanning. For small caches like I imagine these to be, anything simple (even just FIFO or random eviction) is likely good enough. I think it would be perfectly fine to write a very simple, fast, low-overhead cache that is specific to your needs.
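A single-threaded sketch of that Clock policy, for illustration only; a real replacement would also need to be concurrent:

```java
import java.util.HashMap;
import java.util.Map;

// Clock / Second Chance: a FIFO ring where a hit sets a mark bit,
// and the eviction scan skips (and unmarks) marked slots.
class ClockCache<K, V> {
    private final Map<K, Integer> index = new HashMap<>();
    private final Object[] keys;
    private final Object[] values;
    private final boolean[] marked;
    private int hand;

    ClockCache(int capacity) {
        keys = new Object[capacity];
        values = new Object[capacity];
        marked = new boolean[capacity];
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        Integer slot = index.get(key);
        if (slot == null) {
            return null;
        }
        marked[slot] = true; // cheap "recently used" signal, set on read
        return (V) values[slot];
    }

    void put(K key, V value) {
        Integer slot = index.get(key);
        if (slot != null) {
            values[slot] = value;
            return;
        }
        // Eviction scan: marked entries get a second chance (unmark and skip).
        while (keys[hand] != null && marked[hand]) {
            marked[hand] = false;
            hand = (hand + 1) % keys.length;
        }
        if (keys[hand] != null) {
            index.remove(keys[hand]); // evict the victim
        }
        keys[hand] = key;
        values[hand] = value;
        marked[hand] = false;
        index.put(key, hand);
        hand = (hand + 1) % keys.length;
    }
}
```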
I have somewhat ambivalent feelings about this. On one hand, yes, something use-case specific would be good. On the other hand, using something (as close to) off-the-shelf (as possible) is great.
Ok, at this point I am optimistic about being able to optimize LRUMap. I wonder if use of JOL would let us add a test to guard against future size regressions?
@cowtowncoder I tried to add exactly such a test (though the limit is fairly lax), but ran into these issues with JOL on CI: #3675 (comment) |
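For the record, such a guard could look roughly like the following sketch (not the actual test from #3675; the limit is an invented placeholder):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.jupiter.api.Test;
import org.openjdk.jol.info.GraphLayout;

import static org.junit.jupiter.api.Assertions.assertTrue;

class MapperFootprintTest {
    @Test
    void mapperFootprintStaysBounded() {
        long bytes = GraphLayout.parseInstance(new ObjectMapper()).totalSize();
        // Deliberately lax limit (placeholder) to catch order-of-magnitude
        // regressions without being sensitive to JVM/GC layout noise.
        assertTrue(bytes < 100_000, "ObjectMapper retained size was " + bytes + " bytes");
    }
}
```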
Fixed for 2.14.1. |
Awesome, thanks a lot for this quick update! Looking forward to leveraging Jackson 2.14.1.
Nice, thanks a lot! |
I'll see if I can find time today to release a patch version; I cannot promise it, but I will try my best.
Ta-dah! I did it. The 2.14.1 release is mostly done; the last couple of artifacts are on their way to Maven Central (the Scala module may take a bit longer). Please LMK if you find issues.
Fix an important memory consumption regression, see FasterXML/jackson-databind#3665 for more details. Closes gh-29539
### What changes were proposed in this pull request?
This PR aims to upgrade `Jackson` related dependencies to 2.14.1.

### Why are the changes needed?
This version includes an optimization of heap memory usage for Jackson 2.14.x:
- FasterXML/jackson-databind#3665

The full release notes:
- https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.14.1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38771 from LuciferYang/SPARK-41239.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
@cowtowncoder Ugh. Or just disable it for now. I've given up on making the results resilient to GC; there is too much going on, weird caches with soft references and such. Maybe it will be less flaky with Epsilon GC, but that's not available on Java 8, so I didn't try it for the CI.
Prior to this commit, the `ConcurrentLruCache` implementation would use arrays of `AtomicReference` as operation buffers, and the buffer count would be calculated as the nearest power of two for the CPU count. This can result in significant heap memory usage, as each `AtomicReference` buffer entry adds to the memory pressure. As seen in FasterXML/jackson-databind#3665, this can add significant overhead for no real benefit in the current use case. This commit changes the implementation to use `AtomicReferenceArray` as buffers and reduces the number of buffers. JMH benchmark results are within the error margin, so we can assume that this does not change the performance characteristics for the typical use case in Spring Framework. Fixes gh-29520
Hello,
I'm currently looking at heap consumption when using native image with a Spring Boot 3 application. Spring Boot 3.0 SNAPSHOT uses Jackson 2.14.0-rc3. I noticed that with the upgrade from 2.13.4.2 to 2.14.0 the application uses more heap memory, which I tracked down to this PR. It changed the cache implementation from `ConcurrentHashMap` to `PrivateMaxEntriesMap`, which in turn preallocates a lot of `AtomicReference` instances (they appear to be all empty).

I created a reproducer here - it creates an empty ObjectMapper and then dumps the heap. It has more information and screenshots in the README.

When using Jackson 2.13.4.2 it uses 567 B for the LRUMap; with Jackson 2.14.0 it uses 507,320 B for the LRU map. The bad thing is that most of these caches are per `ObjectMapper` - create more of them and it uses even more heap.

As we're trying to optimize the memory consumption of native image applications, this is quite a bummer, and we hope you can help us out here.

Thanks!
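For anyone wanting to reproduce without the linked project, a minimal equivalent sketch (the dump file name is arbitrary):

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.sun.management.HotSpotDiagnosticMXBean;

public class Reproducer {
    public static void main(String[] args) throws Exception {
        // Hold a single empty mapper, then dump the heap for inspection
        // in a tool such as VisualVM or Eclipse MAT.
        ObjectMapper mapper = new ObjectMapper();

        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap("mapper.hprof", true); // live objects only

        System.out.println(mapper); // keep the mapper reachable through the dump
    }
}
```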