bees causes memory fragmentation and low cached memory #260

Open
lilydjwg opened this issue Jul 8, 2023 · 32 comments

@lilydjwg

lilydjwg commented Jul 8, 2023

When bees is running, I get a lot of free memory but low cached usage, and more swap usage, i.e.

[image]

You can see the change here:
[image]

Also, kcompactd is constantly running.

This makes the pagecache hit rate lower. There are many more disk reads, making processes that access the disks slow or unresponsive. I have three disks: nvme0n1 is my system disk, sda is a data disk receiving some constant writes, and sdb is the one bees is running on (bees started at 13:00; before that it was deleting unneeded snapshots). Both sda and sdb are spinning disks.

[image]

There are a lot of snapshots on sdb and I'm using --scan-mode=0. I've also observed the same issue on another machine running bees without spinning disks.

@kakra
Contributor

kakra commented Jul 8, 2023

This is probably the same as #257. I have been observing this behavior since switching from LTS 5.19 to 6.1. It's not that bees occupies free memory; rather, btrfs seems to create high memory fragmentation, as can be observed in /proc/buddyinfo.

@lilydjwg
Author

lilydjwg commented Jul 9, 2023

I see those vmalloc errors on 6.3 too, but this one is running 6.1. I'll check /proc/buddyinfo (I didn't know about it).

@Zygo
Owner

Zygo commented Jul 9, 2023

The vmalloc bug from #257 was backported to 6.1. The fix hasn't been backported yet, so 6.1 is currently broken (and so is 5.15).

One thing to try is reducing the size of the buffers for LOGICAL_INO by changing this line in include/crucible/fs.h:

            BtrfsIoctlLogicalInoArgs(uint64_t logical, size_t buf_size = 16 * 1024 * 1024);

Try lower values for buf_size, e.g. 64 * 1024. This will drastically reduce the number of reflinks to a single extent that bees can handle, but that size is more than enough for most filesystems. Making the buffer smaller may also reduce the size of vmallocs which might be aggravating the kernel's memory manager.
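
For example, the edited declaration with the smaller default would be:

            BtrfsIoctlLogicalInoArgs(uint64_t logical, size_t buf_size = 64 * 1024);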

If this does work, then maybe a workaround is possible: use the small buffer size at first, but if the buffer fills up, then switch to the larger buffer.
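
A rough sketch of that fallback, with assumed names (buffer_overflowed() and the do_ioctl-style wrapper are placeholders here, not the actual bees API):

    // Sketch: try the small buffer first; repeat the whole lookup with the
    // large buffer only if the small one overflowed.
    static const size_t SMALL_BUF = 64 * 1024;        // ~2730 refs
    static const size_t LARGE_BUF = 16 * 1024 * 1024; // a little under 700,000 refs

    BtrfsIoctlLogicalInoArgs lookup_refs(int fd, uint64_t logical)
    {
        BtrfsIoctlLogicalInoArgs args(logical, SMALL_BUF);
        args.do_ioctl(fd);                  // assumed ioctl wrapper method
        if (buffer_overflowed(args)) {      // hypothetical check, e.g. ref count == capacity
            args = BtrfsIoctlLogicalInoArgs(logical, LARGE_BUF);
            args.do_ioctl(fd);              // all prior work is repeated with the big buffer
        }
        return args;
    }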

Off the top of my head I don't know why the kernel would allocate a buffer here, since userspace is already providing one, but I haven't looked at the kernel code for this issue.

@lilydjwg
Author

lilydjwg commented Jul 9, 2023

This is the first time memory fragmentation has caused issues for me. My memory is already very fragmented; I can see the counts of lower-order blocks change significantly.

I've applied the settings in #257 (comment) and it seems better now (much less cache miss; I no longer see my postgres processes being in D state frequently and causing timeout everywhere).

I'm not sure what exactly the lower buf_size has achieved, but it seems that the cached memory size no longer changes a lot.

[image]
10:40: bees starts
10:55: applied those kernel settings mentioned above
11:40: bees with 64 * 1024 buf_size starts

How many reflinks to a single extent does that allow? I should have a few dozen reflinks for most files on this filesystem.

I'll skip newer 6.1 kernels and update to the latest 6.3 to avoid those vmalloc messages. They don't seem to cause visible harm, but they waste journald's disk space.

@lilydjwg changed the title from bees "occupies" free memory to bees causes memory fragmentation and low cached memory on Jul 9, 2023
@kakra
Contributor

kakra commented Jul 9, 2023

If this does work, then maybe a workaround is possible: use the small buffer size at first, but if the buffer fills up, then switch to the larger buffer.

How do we know if the buffer is too small?

@Zygo
Owner

Zygo commented Jul 9, 2023

If the number of references returned is equal to the maximum number for that buffer, then the buffer is probably too small (it might be exactly the right size, but that's unlikely enough not to matter). Then we can use a larger buffer and repeat the entire LOGICAL_INO operation. With ideal kernel behavior, we would just use the larger buffer all the time, as there's no way to restart or extend a LOGICAL_INO result.

A smaller buffer limits the total number of references that bees can create to a common block of data. Once it hits 10,000 or so, other parts of btrfs start getting slower, so it's arguably not worth creating that many references in any case. A 64K buffer holds 2730 references; a 16 MB buffer holds a little less than 700,000.
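
For reference, those capacities follow from the LOGICAL_INO result layout: each reference is reported as three u64s (inode, offset, root), i.e. 24 bytes, stored after the fixed btrfs_data_container header. A quick sketch of the arithmetic, assuming the usual 16-byte header:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Each LOGICAL_INO reference is three u64s; the btrfs_data_container
    // header (4 x u32) takes the first 16 bytes of the buffer.
    static constexpr size_t header_bytes  = 16;
    static constexpr size_t bytes_per_ref = 3 * sizeof(uint64_t); // 24

    static constexpr size_t refs_for_buf(size_t buf_size)
    {
        return (buf_size - header_bytes) / bytes_per_ref;
    }

    int main()
    {
        std::printf("64K buffer:   %zu refs\n", refs_for_buf(64 * 1024));        // 2730
        std::printf("16 MB buffer: %zu refs\n", refs_for_buf(16 * 1024 * 1024)); // ~699050
        return 0;
    }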

@lilydjwg
Author

lilydjwg commented Jul 9, 2023

A 64K buffer holds 2730 references

Thanks for the explanation. That's quite a lot! It's still possible that my filesystem could reach this limit but it's good enough for me.

@kakra
Contributor

kakra commented Jul 9, 2023

So bees could dynamically increase that number in 64k steps, up to 3 times, until we pass around 10k references. So if it got exactly 2730 refs, it would use a 128k buffer for future requests? Or maybe even increase in steps of 4k?

OTOH, the kernel should handle the allocations better.

Does the kernel write to that buffer and read data back in? Then I could imagine why it does not write directly to the user-space buffer: it could be modified while the kernel is still working with it, which would open attack vectors.

For now, I'll be shipping the Gentoo version of bees with 64k buffer.

@Zygo
Owner

Zygo commented Jul 9, 2023

It would have to skip directly from the minimum size to the maximum size (which would initially be 64K and 16MB) when the minimum size fills up. It can keep track of the distribution of ref counts and adjust the minimum size until it is large enough for 99% of all requests. We don't want to do more than two requests for any given block, since all of the prior work is wasted when we repeat the request, and the large requests aren't that expensive if they are rare. We can also do things like limit the number of threads that can use a large buffer, while allowing all threads to use a small one at the same time.
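
A minimal sketch of what that ref-count tracking could look like (hypothetical class and names, not the bees implementation):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Sketch: remember recent ref counts per LOGICAL_INO call and size the
    // small buffer so it would have covered ~99% of them; anything bigger
    // falls through to a second call with the 16 MB buffer.
    class RefCountStats {
        std::vector<size_t> m_counts;
    public:
        void record(size_t refs) {
            if (m_counts.size() >= 10000)
                m_counts.erase(m_counts.begin());   // keep a bounded window
            m_counts.push_back(refs);
        }
        // Smallest ref capacity that would have satisfied 99% of recent calls.
        size_t p99_refs() const {
            if (m_counts.empty())
                return 2730;                        // start at the 64K-buffer capacity
            std::vector<size_t> sorted(m_counts);
            std::sort(sorted.begin(), sorted.end());
            return sorted[(sorted.size() - 1) * 99 / 100];
        }
    };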

The kernel's risks are the same whether it writes the buffer as it goes, or does one giant memcpy at the end; however, there might be some issues with memory fault errors interrupting the kernel thread. Either way, there is a lot of room for improvement of LOGICAL_INO, starting from making it work incrementally, or providing a way for userspace to follow metadata tree page backrefs.

@kakra
Contributor

kakra commented Jul 11, 2023

We don't want to do more than two requests for any given block, since all of the prior work is wasted when we repeat the request, and the large requests aren't that expensive if they are rare.

Would it be possible to first dedupe the refs found by the first lookup, and then resubmit that extent later to see if there are more refs? As far as I understand, that could provide some level of incremental work but care would need to be taken to prevent loops if no more progress is made and the same refs show up over and over again.

@Zygo
Owner

Zygo commented Jul 13, 2023

Would it be possible to first dedupe the refs found by the first lookup, and then resubmit that extent later to see if there are more refs?

It's possible, but it would only be productive on large extents. For those 4K blocks that appear a million times on a filesystem, it's better to simply ignore them when they have too many references, unless the dedupe is combined with a defrag operation to make the extents larger.

@kakra
Contributor

kakra commented Jul 13, 2023

Okay, but in theory bigger extents tend to have fewer refs, statistically. So in the end, such a "resubmit" because of a small buffer would not happen anyway, and we tend to ignore the huge ref counts for small common extents anyway due to bad timing behavior. So yes, there's probably no point in resubmitting such extents, but there's also no point in having a huge 16M buffer; 64k or 128k should work just fine. Unless the defrag thing happens...

So I conclude going with my 64k patch is just fine for the moment.

This leaves the patch problematic for filesystems with thousands of snapshots - and under that condition, btrfs doesn't behave very well anyway.

@Zygo
Owner

Zygo commented Jul 13, 2023

I've been collecting some data on that. 64K is a little too small (it's very common to blow past 2730 refs with a year's worth of snapshots, which is why I made the patch to increase it in the kernel), but the ideal size seems to be smaller than 256K (once an extent has 10920 refs it takes a millisecond of CPU to process each one). Maybe go with a 128K buffer for subvol scans?

My extent tree scanner prototype does much better with snapshots, but it has its own LOGICAL_INO code (because IGNORE_OFFSETS is mandatory), so it can easily construct the object with a different buffer size. I suspect defrag will be able to do the same thing.
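
For context, constructing such a lookup with its own buffer size boils down to a raw LOGICAL_INO_V2 call roughly like this (a sketch only, not the crucible code; error handling omitted, flag and field names as in linux/btrfs.h):

    #include <linux/btrfs.h>
    #include <sys/ioctl.h>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch: LOGICAL_INO_V2 with a caller-chosen buffer size and the
    // ignore-offset flag. Results land in the btrfs_data_container at the
    // start of buf.
    int logical_ino_v2(int fd, uint64_t logical, size_t buf_size)
    {
        std::vector<uint8_t> buf(buf_size);

        btrfs_ioctl_logical_ino_args args = {};
        args.logical = logical;
        args.size    = buf_size;
        args.flags   = BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
        args.inodes  = reinterpret_cast<uintptr_t>(buf.data());

        return ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, &args);
    }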

@kakra
Contributor

kakra commented Jul 13, 2023

with a year's worth of snapshots

You mean: daily snapshots? hourly?

@Zygo
Owner

Zygo commented Jul 13, 2023

Daily (365 snapshots) would leave only 7.49 references available for each extent in a 64K buffer, so there can be only 7 duplicates within the subvol before we hit the limit and can't dedupe further. Containers, build workspaces, even rootfs upgrade snapshots can all easily generate more than 7 duplicate refs after dedupe.

On build servers it can be the other way around: they average 350 or so references to each file, so only 7 or 8 snapshots can be deduplicated.

There is a log message for this:

 addr 0x46ce881fc000 refs 3112 beats previous record 2730

which will start appearing once the 64K limit has been exceeded.

@kakra
Contributor

kakra commented Jul 13, 2023

Yeah, okay, I think I get how the numbers work out. Thanks for the explanation.

@Zygo
Owner

Zygo commented Jul 17, 2023

Note to self: LOGICAL_INO does report the total number of references when the buffer overflows, in btrfs_data_container::elem_missed (that plus elem_cnt = total number of refs). Unfortunately the kernel still does all the work to find each reference just to fill in that count, so it's not exactly efficient to e.g. call with a zero buffer size, then call again with a buffer exactly the right size for the number of refs found.
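
In code, that overflow check is just a look at the container header after the ioctl returns (a sketch; the fields are from linux/btrfs.h):

    #include <linux/btrfs.h>

    // After LOGICAL_INO returns, the btrfs_data_container header reports both
    // what fit in the buffer (elem_cnt) and what was dropped (elem_missed),
    // so overflow can be detected without second-guessing the buffer size.
    static bool logical_ino_overflowed(const btrfs_data_container &c)
    {
        return c.elem_missed != 0;
    }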

@Zygo
Owner

Zygo commented Jul 17, 2023

Getting back to the actual issue for a moment: for the benefit of those who haven't clicked on it, the graph at #260 (comment) does show a dramatic improvement in kernel behavior right after setting a smaller buf_size.

bees has used a 16 MiB buf_size since 2018, with a more recent change at 3430f16 which keeps the buffers between uses instead of allocating and deallocating 16 MiB for each ioctl call.

Maybe the second change (in v0.9.3) accidentally triggered some unfortunate kernel behavior, or maybe bees and the kernel have always behaved this way and nobody reported it before. v0.9.3 fixed a number of bees memory management problems which could easily have concealed an issue like this by making the cache page evictions look like "normal" bees process memory usage.

@kakra
Contributor

kakra commented Jul 17, 2023

Okay, so I thought I didn't see that behavior with last year's LTS kernel, and I observed bad memory pressure after going to 6.1. But coincidentally, your change to bees in February falls into the same period when I started to use kernel 6.1. So maybe I should try again with that change reverted?

OTOH, even when I stop bees, the kernel behaves badly when memory cgroups are enabled. So I believe there has been some impactful change in the kernel between last year's LTS and the current LTS, and it is in memory management. I tried with the new multi-gen LRU and without. Only turning off both memory cgroups AND transparent hugepages gets this under control for me. We are still facing issues with memory pressure when bees is running on our servers, but it's mostly okay now with hugepages in madvise-only mode.

@Zygo
Owner

Zygo commented Jul 18, 2023

So maybe I should try again with that change reverted?

Sounds like a good idea. If we get an effect that can be used for bisection, then we might be able to find where things went wrong.

I believe there has been some impactful change in the kernel between last year's LTS and the current LTS, and it is in memory management.

There's no rule that says kernel bugs have to appear one at a time. ;)

Zygo pushed a commit that referenced this issue Apr 18, 2024
Since we'll never process more than BEES_MAX_EXTENT_REF_COUNT extent
references by definition, it follows that we should not allocate buffer
space for them when we perform the LOGICAL_INO ioctl.

There is some evidence (particularly
#260 (comment)) that
the kernel is subjecting the page cache to a lot of disruption when
trying to allocate large buffers for LOGICAL_INO.

Signed-off-by: Zygo Blaxell <[email protected]>
@kakra
Contributor

kakra commented Apr 21, 2024

Following up from #268 (comment):

... find out why btrfs is causing that high memory fragmentation. I feel like this has become worse and worse with each LTS kernel version. It's probably less an issue within bees and more a side effect of btrfs doing some very unfavorable memory management in the kernel.

After my issue recurred in #268, it has become very apparent that btrfs has some really bad memory fragmentation patterns.

As posted before, with cgroups enabled, the kernel has almost no chance of using physical memory anymore, and swap usage increases a lot. Most of the memory stays completely free. cgroups steal cache memory from each other a lot and fight over a very tight memory resource. This happens within the first few hours of system operation.

With transparent huge pages, a very similar effect kicks in. This is similar to the previous issue, but cache memory fluctuates a lot more. This also happens within the first few hours of system operation.

Without cgroups and transparent huge pages, the effect takes longer to emerge, usually within a few days of system operation. But this is largely impacted by bees activity: if bees activity is high, the effect kicks in within the first 1 or 2 hours of operation. If activity is low, the system can work for days before the effect kicks in and causes high desktop latency. This is different from the previous two observations, which were largely based on low bees activity.

@Zygo
Owner

Zygo commented Apr 22, 2024

Lately I've been running with vm.swappiness=0 and BEES_MAX_EXTENT_REF_COUNT set to 9999. With those settings there's no swapping at all.

@kakra
Contributor

kakra commented Apr 22, 2024

Is BEES_MAX_EXTENT_REF_COUNT an env var, or is it a defined constant in the source code?

@kakra
Contributor

kakra commented May 2, 2024

Lately I've been running with vm.swappiness=0 and BEES_MAX_EXTENT_REF_COUNT set to 9999. With those settings there's no swapping at all.

I've created a patch (9999 max ref count) and reverted my previous patch (64kB max memory, as suggested previously in Jul 2023), and running that for around 48 hours now shows really great results. I'll let it run for some more time to see how it behaves under different workloads applied to the system.

@KeinNiemand

KeinNiemand commented Oct 16, 2024

Is there a way to know if BEES_MAX_EXTENT_REF_COUNT is too low?
Going from the default of 699049 to just 9999 seems like a huge decrease.

@kakra
Contributor

kakra commented Oct 16, 2024

I wonder if this is related to zswap using the zbud or z3fold allocators... Due to the latest LTS kernel update, I've switched to zsmalloc, and memory stats look much better since. But I couldn't collect a lot of data up to now.

@Zygo
Owner

Zygo commented Oct 17, 2024

Is there a way to know if BEES_MAX_EXTENT_REF_COUNT is too low?

See #260 (comment) - if the log message says 9999 references have been reached, then there may be some benefit to a higher max ref count. If the log message never reaches 9999, or if it does so on only a few extents in the entire filesystem, then there will be no significant improvement when the limit is raised.

Note that the benefit will typically be very small even when it exists, since the maximum space savings after deduping 9999 copies of the data is 0.01% of the logical data size, and each additional reference adds more work for any btrfs operation that modifies the extent later on (both within bees and in other applications). Each additional reference adds a fraction of a millisecond of CPU time, but after 10,000 of those, they add up to whole seconds.

The exceptional case is where you have exactly 10000 to 19999 copies of everything in the filesystem: these would be deduped to two extents with a maximum of 9999 references. Adding one to BEES_MAX_EXTENT_REF_COUNT would result in a 50% total saving in that case, as the last two copies of the data are merged into one.

Going from the default of 699049 to just 9999 seems like a huge decrease.

699049 to 9999 is a huge decrease, but the increase from 2730 to 699049 was even larger. 2730 to 9999 is a reasonably-sized increase.

@lilydjwg
Author

I wonder if this is related to zswap using the zbud or z3fold allocators... Due to the latest LTS kernel update, I've switched to zsmalloc, and memory stats look much better since. But I couldn't collect a lot of data up to now.

I have a server running zswap with zsmalloc, and bees still caused a lot of unused memory until I applied the buf_size patch. I can't say whether it is better with zsmalloc or not, however. Probably yes, because I've seen some improvement after a kernel update (among other packages) in the past.

@kakra
Contributor

kakra commented Oct 17, 2024

@lilydjwg Well, it is clearly visible when I deployed the change to zsmalloc:
[image]

But this doesn't necessarily mean that bees is the only process causing it. It could come from btrfs page cache handling inside the kernel, or from other services like databases, PHP, Redis, etc...

I'm currently testing my home PC and office workstation with transparent hugepages reenabled (mode "always"). It looks mostly good so far. But some more time is needed to evaluate behavior under various different workloads and memory pressure situations.

On all systems, the buf_size patch is deployed. Also, I'm guessing that we might get much better behavior with the next LTS kernel, 6.12, because it seems to have gained multi-size THP through folios since LTS 6.6.

@KeinNiemand

The weird thing for me is that I had low cache usage and high free RAM even though fragmentation didn't seem that bad.
This was my /proc/buddyinfo while I was having the problems:
    Node 0, zone    DMA        0      0      0      0      0      0      0      0      1      2      2
    Node 0, zone    DMA32    472    424    733    608    351    182     86     45     38      1      0
    Node 0, zone    Normal  3658   8084  10386   7519   5560   4188   3314   2902   2726      5      0

Order 8 is still pretty high and order 9 is above 0. This doesn't seem that bad fragmentation-wise, yet the system was using only about 500MB of RAM for cache and had massive amounts of free RAM.

Lowering BEES_MAX_EXTENT_REF_COUNT to 9999 (in the bees code) seemingly fixed things; the system is now using almost all of its free RAM as cache like it should.

What do you mean by the buf_size patch? Is that just lowering BEES_MAX_EXTENT_REF_COUNT inside the bees code like I did, or does it do something else too?

@lilydjwg
Author

What do you mean by the buf_size patch? Is that just lowering BEES_MAX_EXTENT_REF_COUNT inside the bees code like I did, or does it do something else too?

I mean this change #260 (comment). I didn't change BEES_MAX_EXTENT_REF_COUNT.

@kakra
Contributor

kakra commented Oct 17, 2024

My Gentoo package currently uses these patches (the two recent ones):
https://github.com/gentoo/gentoo/tree/master/sys-fs/bees/files

It works fine with those. Maybe logging pressure without my logging patch is still causing issues for some people?
