bees causes memory fragmentation and low cached memory #260
This is probably the same as #257. I have observed this behavior since switching from LTS 5.19 to 6.1. It's not that bees occupies free memory; rather, btrfs seems to create high memory fragmentation, as can be observed in […]
I see those […]
The vmalloc bug from #257 was backported to 6.1. The fix hasn't been backported yet, so 6.1 is currently broken (and so is 5.15). One thing to try is reducing the size of the buffers used for `LOGICAL_INO`.
Try lower values for the buffer size. If this does work, then maybe a workaround is possible: use the small buffer size at first, but if the buffer fills up, switch to the larger buffer. Off the top of my head I don't know why the kernel would allocate a buffer here, since userspace is already providing one--I haven't looked at the kernel code for this issue.
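A minimal C++ sketch of that workaround (illustrative only -- `Ref`, the `fetch` callback, and the 64 KiB / 16 MiB sizes are assumptions, not the actual bees code):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical result type: one (root, inode, offset) triple per reference.
struct Ref { std::uint64_t root = 0, inum = 0, offset = 0; };

// The caller supplies the real LOGICAL_INO wrapper: it fills `out` using a
// buffer of `buf_size` bytes and returns false if that buffer was too small
// to hold every reference to `logical`.
using FetchFn = std::function<bool(std::uint64_t logical, std::size_t buf_size,
                                   std::vector<Ref> &out)>;

std::vector<Ref> lookup_refs(std::uint64_t logical, const FetchFn &fetch)
{
    constexpr std::size_t small_buf = 64 * 1024;        // cheap, covers most extents
    constexpr std::size_t large_buf = 16 * 1024 * 1024; // expensive, used only on overflow
    std::vector<Ref> refs;
    if (!fetch(logical, small_buf, refs)) {
        refs.clear();
        // Small buffer filled up: repeat the whole request with the big buffer.
        fetch(logical, large_buf, refs);
    }
    return refs;
}
```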
This is the first time memory fragmentation has caused issues for me. My memory is already very fragmented; I can see the counts of lower-order blocks change significantly. I've applied the settings in #257 (comment) and it seems better now (far fewer cache misses; I no longer see my postgres processes stuck in D state). Not sure what a lower buffer size would mean for a single extent, though.
How many reflinks is "a single extent"? I should have a few dozen reflinks for most files on this filesystem. I'll skip the newer 6.1 kernels and update to the latest 6.3 to avoid those vmalloc messages. They don't seem to cause visible harm, but they waste journald's disk space.
How do we know if the buffer is too small?
If the number of references returned is equal to the maximum number for that buffer, then the buffer is probably too small (it might be exactly the right size, but that's unlikely enough not to matter). Then we can use a larger buffer and repeat the entire request.

A smaller buffer limits the total number of references that bees can create to a common block of data. Once it hits 10,000 or so, other parts of btrfs start getting slower, so it's arguably not worth creating that many references in any case. A 64K buffer holds 2730 references; a 16 MB buffer holds a little less than 700,000.
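For the curious, the arithmetic behind those counts, assuming a small fixed header plus three u64 values per reference (the exact header size is an assumption, which is why the large-buffer figure lands near, but not exactly on, the 699,049 quoted later in the thread):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main()
{
    const std::size_t header  = 16;                         // assumed container header size
    const std::size_t per_ref = 3 * sizeof(std::uint64_t);  // root, inode, offset = 24 bytes

    const std::size_t small_buf = 64UL * 1024;              // 64 KiB
    const std::size_t large_buf = 16UL * 1024 * 1024;       // 16 MiB

    std::printf("64K buffer: %zu refs\n", (small_buf - header) / per_ref); // 2730
    std::printf("16M buffer: %zu refs\n", (large_buf - header) / per_ref); // ~699,050
    return 0;
}
```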
Thanks for the explanation. That's quite a lot! It's still possible that my filesystem could reach this limit, but it's good enough for me.
So bees could dynamically increase that number by 64k, up to 3 times, until we pass around 10k references. So if it got exactly 2730 refs, it would use a 128k buffer for future requests? Or maybe even increase in steps of 4k? OTOH, the kernel should handle the allocations better. Does the kernel write to that buffer and read data back in? If so, I could imagine why it does not write directly to the user-space buffer: it could be modified while the kernel is still working with it, and thus open attack vectors. For now, I'll be shipping the Gentoo version of bees with a 64k buffer.
It would have to skip directly from the minimum size to the maximum size (which would initially be 64K and 16MB) when the minimum size fills up. It can keep track of the distribution of ref counts and adjust the minimum size until it is large enough for 99% of all requests. We don't want to do more than two requests for any given block, since all of the prior work is wasted when we repeat the request, and the large requests aren't that expensive if they are rare. We can also do things like limit the number of threads that can use a large buffer, while allowing all threads to use a small one at the same time.

The kernel's risks are the same whether it writes the buffer as it goes, or does one giant memcpy at the end; however, there might be some issues with memory fault errors interrupting the kernel thread. Either way, there is a lot of room for improvement in `LOGICAL_INO`.
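A rough sketch of what that adaptive policy could look like (hypothetical class name and thresholds, not bees code):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical adaptive policy: count requests and small-buffer overflows,
// and grow the small buffer until at most 1% of requests have to be
// repeated with the 16 MiB fallback buffer.
class BufferSizePolicy {
    std::size_t m_small = 64 * 1024;                        // current "minimum" size
    static constexpr std::size_t m_max = 16 * 1024 * 1024;  // fallback size
    std::uint64_t m_requests = 0, m_overflows = 0;
public:
    std::size_t small_size() const { return m_small; }
    std::size_t large_size() const { return m_max; }

    // Called after every LOGICAL_INO request, with whether the small buffer overflowed.
    void record(bool overflowed)
    {
        ++m_requests;
        if (overflowed) ++m_overflows;
        // Re-evaluate every 1000 requests: if more than 1% overflowed,
        // double the small buffer (never past the maximum) and reset counters.
        if (m_requests >= 1000) {
            if (m_overflows * 100 > m_requests && m_small < m_max) {
                m_small *= 2;
            }
            m_requests = m_overflows = 0;
        }
    }
};
```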
Would it be possible to first dedupe the refs found by the first lookup, and then resubmit that extent later to see if there are more refs? As far as I understand, that could provide some level of incremental work, but care would need to be taken to prevent loops if no more progress is made and the same refs show up over and over again.
It's possible, but it would only be productive on large extents. For those 4K blocks that appear a million times on a filesystem, it's better to simply ignore them when they have too many references, unless the dedupe is combined with a defrag operation to make the extents larger.
Okay, but in theory bigger extents tend to have fewer refs, statistically. So in the end, such a "resubmit" because of a small buffer would not happen anyway, and we tend to ignore the huge ref counts for small common extents anyway due to bad timing behavior. So yes, there's probably no point in resubmitting such extents, but there's also no point in having a huge 16M buffer; 64k or 128k should work just fine. Unless the defrag thing happens... So I conclude that going with my 64k patch is fine for the moment. It does leave the patch problematic for filesystems with thousands of snapshots, but under that condition btrfs doesn't behave very well anyway.
I've been collecting some data on that. 64K is a little too small (it's very common to blow past 2730 refs with a year's worth of snapshots, which is why I made the patch to increase it in the kernel), but the ideal size seems to be smaller than 256K (once an extent has 10920 refs, it takes a millisecond of CPU to process each one). Maybe go with a 128K buffer for subvol scans? My extent tree scanner prototype does much better with snapshots, but it has its own […]
You mean daily snapshots? Hourly?
Daily (365 snapshots) would leave only about 7.5 references available for each extent in a 64K buffer, so there can be only 7 duplicates within the subvol before we hit the limit and can't dedupe further. Containers, build workspaces, even rootfs upgrade snapshots can all easily generate more than 7 duplicate refs after dedupe. On build servers it can be the other way around: they average 350 or so references to each file, so only 7 or 8 snapshots can be deduplicated. There is a log message for this, which will start appearing once the 64K limit has been exceeded.
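Spelling out the arithmetic behind those figures, using the 2730-reference capacity of a 64K buffer from earlier in the thread:

```cpp
#include <cstdio>

int main()
{
    const double refs_per_64k = 2730.0;  // references a 64K LOGICAL_INO buffer can hold

    // 365 snapshots multiply every reference, so each distinct copy can be
    // referenced only ~7 times before the buffer limit is reached.
    std::printf("365 daily snapshots: %.2f refs per extent copy\n", refs_per_64k / 365);

    // Conversely, files that already average 350 references leave room
    // for only 7 or 8 snapshots.
    std::printf("350 refs per file:   %.2f snapshots before the limit\n", refs_per_64k / 350);
    return 0;
}
```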
Yeah, okay, I think I get how the numbers work out. Thanks for the explanation.
Note to self: […]
Getting back to the actual issue for a moment: for the benefit of those who haven't clicked on it, the graph at #260 (comment) does show a dramatic improvement in kernel behavior right after setting a smaller […].

bees has used 16 MiB […] Maybe the second change (in v0.9.3) accidentally triggered some unfortunate kernel behavior, or maybe bees and the kernel have always behaved this way and nobody reported it before. v0.9.3 fixed a number of bees memory-management problems which could easily have concealed an issue like this by making the cache page evictions look like "normal" bees process memory usage.
Okay, so I thought I didn't see that behavior with last year's LTS kernel, and I observed bad memory pressure after going to 6.1. But coincidentally, your change to bees in February falls into the same time period when I started to use kernel 6.1. So maybe I should try again with that change reverted? OTOH, even when I stop bees, the kernel behaves badly when memory cgroups are enabled. So I believe there has been some impactful change in kernel memory management between last year's LTS and the current LTS. I tried with the new multi-gen LRU and without. Only turning off both memory cgroups AND transparent hugepages gets this under control for me. We are still facing memory-pressure issues when bees is running on our servers, but it's mostly okay now with hugepages in madvise-only mode.
Sounds like a good idea. If we get an effect that can be used for bisection, then we might be able to find where things went wrong.
There's no rule that says kernel bugs have to appear one at a time. ;)
Since we'll never process more than BEES_MAX_EXTENT_REF_COUNT extent references by definition, it follows that we should not allocate buffer space for more than that when we perform the LOGICAL_INO ioctl. There is some evidence (particularly #260 (comment)) that the kernel subjects the page cache to a lot of disruption when trying to allocate large buffers for LOGICAL_INO.

Signed-off-by: Zygo Blaxell <[email protected]>
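A sketch of the idea described in this commit message: size the `LOGICAL_INO` buffer from the reference cap instead of always allocating 16 MiB. This is not the actual bees patch; the struct and ioctl names are taken from `linux/btrfs.h` as I understand them, and the helper name and cap value are made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

// Illustrative cap; the real constant lives in the bees source tree.
static const std::size_t MAX_REFS = 9999;

// Call LOGICAL_INO with a buffer sized for at most MAX_REFS references
// (three u64 per reference) plus the btrfs_data_container header, instead
// of a fixed 16 MiB. Returns the ioctl result; on success, `out` holds the
// raw (inode, offset, root) triples.
int logical_ino_capped(int fd, std::uint64_t logical, std::vector<std::uint64_t> &out)
{
    const std::size_t buf_bytes = sizeof(btrfs_data_container)
                                + MAX_REFS * 3 * sizeof(std::uint64_t);
    std::vector<std::uint64_t> buf(buf_bytes / sizeof(std::uint64_t) + 1, 0);
    auto *container = reinterpret_cast<btrfs_data_container *>(buf.data());

    btrfs_ioctl_logical_ino_args args;
    std::memset(&args, 0, sizeof(args));
    args.logical = logical;
    args.size    = buf_bytes;  // the v2 ioctl accepts buffer sizes other than 64K
    args.inodes  = reinterpret_cast<std::uintptr_t>(container);

    const int ret = ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, &args);
    if (ret == 0) {
        out.assign(container->val, container->val + container->elem_cnt);
    }
    return ret;
}
```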
Following up from #268 (comment):
After my repeated issue in #268, it has become very apparent that btrfs has some really bad memory fragmentation patterns. As posted before:

With cgroups enabled, the kernel has almost no chance of using physical memory any more, and swap usage increases a lot. Most of the memory stays completely free, while cgroups steal cache memory from each other and fight over a very tight memory resource. This happens within the first few hours of system operation.

With transparent huge pages, a very similar effect kicks in. It is similar to the previous issue, but cache memory fluctuates a lot more. This also happens within the first few hours of system operation.

Without cgroups and transparent huge pages, the effect takes longer to emerge, usually within a few days of system operation. But this is largely influenced by bees activity: if bees activity is high, the effect kicks in within the first hour or two of operation; if activity is low, the system can work for days before the effect kicks in and causes high desktop latency. This is different from the previous two observations, which were largely based on low bees activity.
Lately I've been running with […]
Is […]?
I've created a patch (9999 max ref count) and reverted my previous patch (64 kB max memory, as suggested previously in Jul 2023), and running that for around 48 hours now shows really great results. I'll let it run for some more time to see how it behaves under different workloads on the system.
Is there a way to know if BEES_MAX_EXTENT_REF_COUNT is too low?
I wonder if this is related to zswap using the zbud or z3fold allocators... With the latest LTS kernel update, I've switched to zsmalloc, and memory stats have looked much better since. But I haven't been able to collect a lot of data yet.
See #260 (comment) - if the log message says 9999 references have been reached, then there may be some benefit to a higher max ref count. If the log message never reaches 9999, or if it does so on only a few extents in the entire filesystem, then there will be no significant improvement when the limit is raised.

Note that the benefit will typically be very small even when it exists, since the maximum space saving after deduping 9999 copies of the data is 0.01% of the logical data size, and each additional reference adds more work for any btrfs operation that modifies the extent later on (both within bees and in other applications). Each additional reference adds a fraction of a millisecond of CPU time, but after 10,000 of those, they add up to whole seconds.

The exceptional case is where you have exactly 10000 to 19999 copies of everything in the filesystem: these would be deduped to two extents with a maximum of 9999 references. Adding one to BEES_MAX_EXTENT_REF_COUNT would result in a 50% total saving in that case, as the last two copies of the data are merged into one.
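To make the exceptional case concrete, a small worked example (the numbers come from the paragraph above; nothing here is measured data):

```cpp
#include <cstdio>

int main()
{
    const long copies = 10000;  // identical copies of one extent in the filesystem
    const long limit  = 9999;   // reference cap per deduped extent

    // ceil(copies / cap) = physical extents left after dedupe
    const long extents_now  = (copies + limit - 1) / limit;             // 2 extents survive
    const long extents_plus = (copies + (limit + 1) - 1) / (limit + 1); // 1 extent survives

    std::printf("with cap %ld: %ld extents; with cap %ld: %ld extents\n",
                limit, extents_now, limit + 1, extents_plus);
    // Going from 2 surviving extents to 1 is the 50% saving mentioned above --
    // but only in this narrow range of copy counts.
    return 0;
}
```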
699049 to 9999 is a huge decrease, but the increase from 2730 to 699049 was even larger. 2730 to 9999 is a reasonably-sized increase.
I have a server running zswap with zsmalloc, and bees still caused a lot of unused memory until I applied the buf_size patch. I can't say whether it is better with zsmalloc or not, however. Probably yes, because I've seen some improvement after a kernel update (among other packages) in the past.
@lilydjwg Well, the moment I deployed the change to zsmalloc is clearly visible: […] But this doesn't necessarily mean that bees is the only process causing it. It could be the kernel's btrfs page cache handling, or other services like databases, PHP, redis, etc. I'm currently testing my home PC and office workstation with transparent hugepages re-enabled (mode "always"). It looks mostly good so far, but more time is needed to evaluate behavior under various workloads and memory-pressure situations. On all systems, the buf_size patch is deployed. Also, I'm guessing we might get much better behavior with the next LTS kernel, 6.12, because it seems to have gained multi-size THP through folios since LTS 6.6.
The weird thing for me is that I had low cache usage and high free RAM even though fragmentation didn't seem that bad: […] Order 8 is still pretty high and order 9 is above 0. This doesn't seem that bad fragmentation-wise, yet the system was using only about 500 MB of RAM for cache and had massive amounts of free RAM. Lowering BEES_MAX_EXTENT_REF_COUNT to 9999 (in the bees code) seemingly fixed things; the system is now using almost all of its free RAM as cache, like it should. What do you mean by the buf_size patch? Is that just lowering BEES_MAX_EXTENT_REF_COUNT inside the bees code like I did, or does it do something else too?
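For anyone wanting to check the same thing on their own machine, here is a small sketch that prints the per-order free-block counts from `/proc/buddyinfo` (column N is the count of free blocks of 2^N contiguous pages; the parsing assumes the usual "Node X, zone NAME counts..." layout):

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    // Each /proc/buddyinfo line looks like:
    //   Node 0, zone   Normal   123   45   ...   0
    // with one count per order (order 0 = single 4 KiB pages on x86-64,
    // order 9 = 2 MiB blocks, order 10 = 4 MiB blocks).
    std::ifstream in("/proc/buddyinfo");
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string node_word, node_id, zone_word, zone_name;
        ss >> node_word >> node_id >> zone_word >> zone_name;
        std::vector<long> counts;
        long n = 0;
        while (ss >> n) counts.push_back(n);
        std::cout << zone_name << ":";
        for (std::size_t order = 0; order < counts.size(); ++order)
            std::cout << " order" << order << "=" << counts[order];
        std::cout << "\n";
    }
    return 0;
}
```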
I mean this change #260 (comment). I didn't change BEES_MAX_EXTENT_REF_COUNT.
My Gentoo package currently uses these patches (the two recent ones): […] It works fine with those. Maybe logging pressure without my logging patch is still causing issues for some people?
When bees is running, I get a lot of free memory but low `cached` usage, and more swap usage. You can see the change here: […]

Also, kcompactd is constantly running.

This makes the page cache hit rate lower. There are many more disk reads, making processes that access the disks slow or unresponsive. I have three disks: nvme0n1 is my system disk, sda is a data disk receiving some constant writes, and sdb is the one bees runs on (it started at 13:00; before that it was deleting unneeded snapshots). Both sda and sdb are spinning disks.

There are a lot of snapshots on sdb and I'm using `--scan-mode=0`. I've also observed the same issue on another machine running bees without spinning disks.