Automatic balance after btrfs-cleaner #63
Comments
Sounds like what The only thing that is new or different in this section is the "on by default" part, which is a problematic change in behavior (but it could be enabled in userspace by distros etc).
I'm not sure what you mean here:
These are very different and some of what you propose is counterproductive for one case but useful for the other.
This is a problem, and balancing makes it worse. It would be useful to slow down the deletion process so that the IOPS are less degraded, e.g. keep it down to under 100 refs/sec and split it up into smaller transactions, to avoid long stalls during transaction commits that lock out all writers to the filesystem. Some relief for this might be available through extent tree v2 changes (TL;DR don't delete everything in a huge burst in the transaction critical section, write the delayed refs to disk and process them at a sustainable rate instead). On filesystems as small as 20 TiB, big deletes can lock up the filesystem for 20 minutes or more. Balances then lock up the filesystem in 2-10 minute bursts for some hours after. We try to schedule both during maintenance windows, but sometimes you just have to delete something in the middle of the working day. We still want the balance in the maintenance window, and there's a good chance we've refilled the free space so balance doesn't have to do anything by then.
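As a rough illustration of the throttling idea above (cap deletion at about 100 refs/sec and split the work into smaller transactions), here is a minimal userspace sketch. It is not how btrfs-cleaner works. The names `throttled_batches`, `delete_throttled`, and the `delete_batch` callback are hypothetical, and the batch size of 50 is an arbitrary assumption.

```python
import time


def throttled_batches(total_refs, max_refs_per_sec=100, batch_size=50):
    """Yield (batch_len, pause_seconds) pairs.

    Pausing for batch_len / max_refs_per_sec after each batch keeps the
    average rate at or below max_refs_per_sec.
    """
    if max_refs_per_sec <= 0 or batch_size <= 0:
        raise ValueError("rate and batch size must be positive")
    remaining = total_refs
    while remaining > 0:
        n = min(batch_size, remaining)
        yield n, n / max_refs_per_sec
        remaining -= n


def delete_throttled(total_refs, delete_batch, sleep=time.sleep, **limits):
    """Run the hypothetical delete_batch(n) callback once per small batch.

    Each call stands in for one small transaction; the sleep between
    calls leaves room for other writers.
    """
    for n, pause in throttled_batches(total_refs, **limits):
        delete_batch(n)
        sleep(pause)
```

Passing a fake `sleep` makes the schedule easy to inspect without waiting; a real deployment would also need feedback from actual commit latency, which this sketch ignores.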
I would say that if you're still scheduling balances even though the kernel has supported automatic balances for years now, it's because it absolutely hurts to have both. This already exists through sysfs and it already allows using one, the other, both, or neither.
On medium-to-large filesystems, free space allocation speed drops dramatically somewhere above 95% utilization, but balances stop being possible somewhere above 90% utilization. Balances don't pack data as efficiently as normal writes do: in theory because they can't change extent sizes, and in practice because balancing also changes some other allocation parameters. There's some possible relief coming via #54 on the packing efficiency, but that could move the ENOSPC problem into REMAP block groups without solving it. On small filesystems, there are few block groups and simply no space to put any data other than in existing block groups. On those filesystems there's no benefit from balancing, so there's never a need to balance. A naive automatic balancer can end up wasting IOPS all day, pushing data back and forth between the same two locations on disk.
That is a good idea. We are having a lot of success with the formula:
which accounts for the worst case scenario:
After balancing with usage = 75%, if the above condition still isn't met, we send an alert for manual intervention. Balancing more block groups is generally futile, as a filesystem over 90% full can't balance anyway, and might ENOSPC (case 2 above, the bad one) for even trying.
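The policy described here (balance with usage = 75%, but don't even try on a filesystem over 90% full) can be sketched as a small decision helper. This is only an illustration under stated assumptions: `plan_balance` is a hypothetical name, how `used_bytes` and `total_bytes` are measured is left to the caller, and the free-space formula mentioned above is not reproduced because it is not given in this text.

```python
def plan_balance(used_bytes, total_bytes, mount_point,
                 usage_filter=75, max_utilization=0.90):
    """Return a `btrfs balance start` command line, or None.

    None means the filesystem is considered too full to balance safely,
    so the caller should alert for manual intervention instead.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    utilization = used_bytes / total_bytes
    if utilization > max_utilization:
        # Over the threshold a balance is unlikely to succeed and might
        # itself hit ENOSPC, so do not attempt it.
        return None
    # Only relocate data block groups that are at most usage_filter % full.
    return ["btrfs", "balance", "start", f"-dusage={usage_filter}", mount_point]
```

A scheduler could pass the returned list to `subprocess.run` during a maintenance window and treat `None` as the trigger for an alert.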
Hard NAK on this statement. We want ENOSPC (case 1 above, which only returns an error to userspace) before the filesystem gets slow. Right now, when we are somewhere over 95%, allocation speeds drop below 4K/second, but we have over a terabyte of data space free. It can take multiple hours to finish a commit and recover use of the filesystem if we immediately SIGSTOP or SIGKILL all writing applications. If we let applications keep trying to write, the commit time keeps exponentially increasing, until forced reboot becomes the only path to recovery (along with loss of any data that did manage to get written in the hours before the reboot). We'd definitely like a knob that stops writes with ENOSPC well before that happens (ideally subtracting the unusable space from There are multiple problems that occur in this type of scenario. Running out of space simply isn't possible on many of our filesystems because the drives will crumble to dust long before btrfs can allocate the last data block. Metadata ENOSPC is simply impossible.
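Until such a knob exists, an application could approximate "ENOSPC before the filesystem gets slow" in userspace. The sketch below is an assumption-laden illustration, not a btrfs feature: `check_headroom` is a hypothetical helper, the 95% threshold is taken from the figure above, and `os.statvfs` only reports an estimate on btrfs that does not subtract unusable space, so the real threshold would need tuning per filesystem.

```python
import errno
import os


def check_headroom(path, max_utilization=0.95, statvfs=os.statvfs):
    """Raise OSError(ENOSPC) if the filesystem at `path` is too full.

    Intended to be called before starting a large write, so the
    application fails fast instead of pushing the filesystem into the
    slow-allocation regime.
    """
    st = statvfs(path)
    if st.f_blocks == 0:
        return  # pseudo-filesystem or no size information; nothing to check
    utilization = 1 - st.f_bavail / st.f_blocks
    if utilization >= max_utilization:
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC), path)
```

The `statvfs` parameter exists only so the check can be exercised with fake numbers; callers would normally leave it at the default.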
That would require rework of the existing balance code. Right now a balance cannot be deprioritized because it holds the transaction lock for a long time, so lowering the priority causes priority inversion that prevents any other users from writing to the filesystem until the balances are done. Balances can only be deferred, i.e. scheduled to run at some later time when high-priority tasks are not running. Raising the priority of the balance helps a little, because it locks everything out for a shorter time. That balance rework might already be coming (#54, #25) but it's not here yet, and that limits what can be done in the short term.
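Since balances can only be deferred rather than deprioritized, a userspace scheduler mostly has to answer one question: is it an acceptable time to start one? A minimal sketch of that check follows, assuming a fixed daily maintenance window and a caller-supplied flag for whether high-priority work is running. Both function names are hypothetical.

```python
from datetime import datetime, time


def in_window(now, start, end):
    """True if now's time of day falls in [start, end).

    Handles windows that wrap past midnight, e.g. 22:00 to 04:00.
    """
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def should_run_balance(now, start, end, high_priority_active):
    """Defer the balance unless we are inside the maintenance window
    and no high-priority task is currently running."""
    return in_window(now, start, end) and not high_priority_active
```

For example, with a window from `time(22)` to `time(4)`, a call at 23:30 with no high-priority work returns `True`, while the same call at noon returns `False`.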
Basic idea
Intended effect
What about btrfs-maintenance?
Further refinements
Alternatives