For discussion: "slack" compression option #15215

Draft: robn wants to merge 10 commits into master from slack-compression
Conversation

@robn (Member) commented Aug 28, 2023

Motivation and Context

We have a customer that deals in very large, generally incompressible data. They run with compression=off and lately have been considering recordsize=16M. In testing, we found that this presented a real bottleneck for files that only fill a small amount of the last record, e.g. a 17M file.

The main reason for this is simply that we’re doing a ton more write IO than we need to: we write all those trailing zeroes that we don’t even need. There’s also substantial extra CPU load, e.g. a lot of time spent checksumming.

So, we’ve been experimenting with various ways of ignoring the “slack space” of trailing zeroes at the end of each record. We call this “slack compression”.

There was some interest in this at the last OpenZFS Leadership call and maybe even doing it outside of the compression system. So, I’m posting this PR as a starting point for discussion!

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Method

The method is pretty simple no matter how it’s hooked up:

  • In zio_write_compress(), call a function that searches the data buffer for the last non-zero byte. That position becomes the psize for the write.
  • In zio_read_bp_init(), push a transform that sets psize back to the lsize.

Yes, this is basically describing compression 😄

The slack_compress() function shows a fairly simple implementation of the search. It runs backwards through the data buffer as an array of uint64_t. It exits at the first non-zero value it finds. This implementation is loosely modeled on zio_compress_zeroed_cb() (indeed, our earliest experiments were simply exposing ZIO_COMPRESS_EMPTY). There are obvious optimisation opportunities available here, including vectorisation and letting it work on a scatter ABD to avoid the need to linearise the incoming buffer.
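For illustration only (this is not the PR’s exact code), the backward search boils down to something like the following, assuming a linear buffer whose length is a multiple of 8, which holds for record-sized buffers:

  #include <stddef.h>
  #include <stdint.h>

  /*
   * Illustrative sketch only: walk the buffer backwards as 64-bit words,
   * stop at the first non-zero word, then narrow to the last non-zero byte
   * inside it. The return value would be the psize for the write; 0 means
   * the whole record is zero.
   */
  static size_t
  slack_find_psize(const void *buf, size_t len)
  {
      const uint64_t *words = buf;
      size_t nwords = len / sizeof (uint64_t);

      while (nwords > 0 && words[nwords - 1] == 0)
          nwords--;
      if (nwords == 0)
          return (0);

      /* Narrow down within the last non-zero word. */
      const uint8_t *bytes = (const uint8_t *)&words[nwords - 1];
      size_t b = sizeof (uint64_t);
      while (bytes[b - 1] == 0)
          b--;

      return ((nwords - 1) * sizeof (uint64_t) + b);
  }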

The interesting part is how and when to use this functionality.

compression=slack (this PR)

This PR shows the simplest form: adding a compression=slack option. This works just fine as a substitute for compression=empty, and the ARC gets the benefit as well.

Early on, I plumbed this method directly into zio_write_compress rather than hooking it up to zio_compress_table, because we’re not actually doing any compression, so we don’t need a destination buffer or a copy; the source buffer can be reused. As it turns out it only made a small difference to performance because the search function still requires a linear buffer.

This works just fine. It could be merged today (if we had any spare DMU_BACKUP_FEATURE_* bits). But it’s not really how I would like to do it.

compression=none and force psize to lsize

I am probably wrong about this, but as far as I could tell, there is only one situation outside of compression where we store psize < lsize, and that is when the ZIL is stored on a non-raidz vdev. Before issuing IO, zio_shrink() is called to set psize as small as possible. On read, the data is read into an lsize'd buffer, so it naturally falls out right. I haven’t tried hard to find the code, but my intuition is that this doesn’t work for raidz because the block gets sliced up further, and that goes wrong if the sizes aren’t right coming in.

If that’s right, that should mean that we could set a transform in zio_read_bp_init() for any uncompressed block with psize < lsize and it would “just work”, without requiring a separate compression option. It would still need a read feature flag as older software won’t understand it (perhaps only active if written to a raidz vdev), but possibly won’t require a stream feature flag.
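If that holds, the read side would look a lot like the existing decompress transform, except the “decompression” is just a copy plus zero-fill. A rough sketch only, with zio_slack_inflate as an illustrative name and the embedded/encrypted/gang cases ignored:

  /*
   * Sketch only. At transform-pop time, zio->io_abd holds the psize bytes
   * actually read, and data/size is the original lsize buffer handed in by
   * the caller.
   */
  static void
  zio_slack_inflate(zio_t *zio, abd_t *data, uint64_t size)
  {
      if (zio->io_error == 0) {
          abd_copy(data, zio->io_abd, zio->io_size);
          abd_zero_off(data, zio->io_size, size - zio->io_size);
      }
  }

  /* In zio_read_bp_init(), for an uncompressed BP with psize < lsize: */
  if (BP_GET_COMPRESS(bp) == ZIO_COMPRESS_OFF &&
      BP_GET_PSIZE(bp) < BP_GET_LSIZE(bp)) {
      uint64_t psize = BP_GET_PSIZE(bp);
      zio_push_transform(zio, abd_alloc_sametype(zio->io_abd, psize),
          psize, psize, zio_slack_inflate);
  }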

We wouldn’t have to, but we could also teach the ARC about this, so that even though compression is “off” it could still keep only the real data portion in memory rather than larger buffers of zeroes.

Slack with content compression?

We didn’t try this, but it might be possible to use slack with data compression. This might be worthwhile, as collapsing zeroes could perhaps be made faster than a general compressor and may still have less overhead.

I’ve not thought hard about this, but I think the method should be as simple as collapsing zeroes before invoking the real compressor, and when reading back, restoring the full size after decompression. However, I think it would require that compression algorithms are able to know their decompressed size from the compressed state, since we can’t tell them ahead of time.
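Very roughly, and only as a sketch with hypothetical names (slack_find_psize() being the backward search sketched earlier, real_compress() standing in for lz4/zstd/etc.), the write side could look like:

  /*
   * Sketch only. The open question from above remains: on read, the
   * decompressor has to be able to report the live length so the tail can
   * be re-zeroed out to lsize.
   */
  static size_t
  slack_then_compress(const void *src, void *dst, size_t lsize, size_t d_len)
  {
      size_t live = slack_find_psize(src, lsize);

      if (live == 0)
          return (0);    /* all-zero record */

      /* Compress only the live prefix; trailing zeroes never get scanned. */
      return (real_compress(dst, src, live, d_len));
  }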

If that’s no good, then I can see taking a spare bit from somewhere in the BP (maybe bit 7 of the checksum type?) and using that as a signal that a further transform is necessary after decompression. Hmm, or maybe not, because where is the size? Do we even need the final size for read? Ok, I would really need to think about that more.

Slack compress the last block only?

We also spent some time trying to see if we could trigger slack compression only for the last block in an object, for when we know that the “inner” blocks are full and so pointless to search, but it seems to be a non-starter. If a file is written in order from start to finish and the ZIL isn’t used, then it’s usually possible to have a clear notion of “last block” which can be plumbed down through the write policy. If the block is written out to the ZIL though (and later hooked up with WR_INDIRECT), then there’s no way to know that it was the “last” block. Since our customer has an fsync()-heavy workload, most writes were of this type. For workloads that aren’t calling fsync() a lot it can work better, but there are also problems if holes or zero-fills are introduced into “inner” blocks: those blocks could benefit from slack compression but would no longer get it. Taken together it ends up being a very situational gain, so we haven’t pursued it further.

Make ZLE faster?

We considered that perhaps we could avoid this by optimising ZLE. The main difference is that ZLE is a true compression algorithm, and so copies the data. It could certainly be optimised, but regardless, it would require the entire block to be scanned, so we’d be going from a full scan and checksum to a full scan, copying some amount and then another shorter scan for the checksum. We think the best case for slack will be no copies, scanning backwards to find the end of data, and then a forward scan to compute the checksum, so at the end, the entire block has only been seen once and no copies made.

(That said, ZLE is ripe for some optimising work, though I doubt many people are using it directly these days)

Performance

We have a test scaffold available that runs a representative simulation of the customer’s workload: open(), write(), fsync(), close(). We use this to test the effects of config changes, patches, etc.

Below are results for varying combinations of recordsize, compression and at different file sizes. The test is done on three 12-wide raidz3 pools, each with 10 writing threads, for 10 minutes. Note that there is a lot of other tuning involved (standard for the customer workload), so these are useful for ballpark comparison only.

The “throughput” column is showing the “real” amount of file data, not the total written, which helps compare the total loss due to trailing zeroes. The “fsync” column is the average time for the fsync() call to return, which is a useful proxy for the total IO time (and an important number for our customer as their clients are blocked until this returns).

These results have mostly given us confidence that even this quite naive version helps bring us closer to where we were with none, and can even rival lz4 some of the time. Our next steps are likely to see what we can gain by removing or minimising the number of allocations and copies required, and by improving the performance of the searching function.

  recordsize  compression  file size  files written  throughput (MB/s)  fsync (ms)
  128k        none         15m        84892          707                203.000
                           16m        84054          747                202.667
                           17m        77221          729                223.000
                           24m        58366          778                295.000
                           31m        47749          822                360.333
              lz4          15m        84937          707                203.000
                           16m        83768          745                203.333
                           17m        77605          733                222.333
                           24m        57284          764                300.667
                           31m        46755          805                368.000
              zle          15m        82360          686                209.000
                           16m        82078          729                207.000
                           17m        75475          713                228.333
                           24m        54590          728                315.333
                           31m        43946          757                391.333
              slack        15m        85562          713                201.000
                           16m        84296          749                202.000
                           17m        77982          736                220.333
                           24m        58677          782                293.333
                           31m        47734          822                359.333
  16m         none         15m        85691          714                176.333
                           16m        89569          796                159.333
                           17m        47338          447                306.333
                           24m        47172          629                304.667
                           31m        46861          807                304.333
              lz4          15m        92176          768                158.667
                           16m        88380          785                161.000
                           17m        66540          628                187.667
                           24m        58349          778                225.333
                           31m        45889          790                299.333
              zle          15m        86731          723                165.000
                           16m        86580          769                163.000
                           17m        61562          581                196.333
                           24m        57622          768                232.000
                           31m        46719          804                301.333
              slack        15m        90687          756                157.667
                           16m        88439          786                161.333
                           17m        62187          587                190.667
                           24m        56066          747                216.667
                           31m        46534          801                306.000

Feedback

We’ll keep working on this, but we’re interested in feedback on any or all of the above. If you’ve got workloads you can test this on, or might like us to try, let us know. If you have thoughts about what this might mean if it wasn’t a compression option but “always on”, great! (especially in “weird” IO situations, like gang blocks, or metadata, or whatever). I’d like to hear more about what it all might mean for block pointer and IO pipeline changes. And I dunno what else; if you were interested in the dev call, please throw some thoughts down too!

@robn marked this pull request as draft August 28, 2023 02:19
@behlendorf added the Status: Design Review Needed (Architecture or design is under discussion) label Oct 20, 2023
@allanjude (Contributor)

@rincebrain Have you had a look at this?

@shodanshok (Contributor)

For what it is worth, my opinion is that optimising ZLE should be preferred. After all, if LZ4 is so fast, why should zero elimination be slower? I understand the concerns about buffer copies, but if the end result is faster/simpler, I cannot see any real issue.

@robn force-pushed the slack-compression branch 3 times, most recently from 0f4861a to 3dc96e0 on July 3, 2024 09:28
@amotin (Member) left a comment

The presence of such a trivial algorithm makes me wish even more for the ability to run compressor/decompressor code directly on an ABD instead of a linear buffer. In this trivial case even one copy looks excessive, and two are just wasteful. Our benchmarks show NAS workloads currently using 10-20 times more memory bandwidth than data read/written. We should care more about that.

Just a thought about a possible alternative: couldn't we make all of our decompressors report how much data they actually filled, and then fill the rest with zeroes up to the logical size? That way we could run the slack check before any of the normal compressors and just tell the compressor to ignore the zeroes. Slack handling would then be done on the ABD without needing to linearise the zero part of the buffer, while still using a more advanced compressor on the rest.
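A sketch of what that suggestion might look like at the call site (not existing code; zio_decompress_to_abd is a hypothetical name):

  /*
   * Sketch of the suggestion: the decompressor reports how many bytes it
   * actually produced, and the caller zero-fills the remainder up to the
   * logical size instead of requiring the decompressor to do it.
   */
  size_t filled = zio_decompress_to_abd(bp, src_abd, dst_abd, psize, lsize);
  if (filled < lsize)
      abd_zero_off(dst_abd, filled, lsize - filled);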

{
(void) level;
ASSERT3U(d_len, >=, s_len);
memcpy(dst, src, s_len);
@amotin (Member) commented Jul 3, 2024

We should wipe the rest with zeroes here.
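That is, something along these lines (a sketch; dst/src/d_len/s_len are the surrounding function's parameters):

  memcpy(dst, src, s_len);
  /* Zero the tail so stale data in the destination buffer can't leak through. */
  if (d_len > s_len)
      memset((char *)dst + s_len, 0, d_len - s_len);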

@robn (Member, Author) commented Nov 18, 2024

Last push just rebasing to master in preparation for more work, nothing to see here.

The "slack" option simply searches from the end of the block backwards
to the last non-zero byte, and sets that position as the "compressed"
size.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn (Member, Author) commented Nov 19, 2024

Experimental work from yesterday and today pushed.

First, zfs_slack_compress is converted to operate on ABDs directly, rather than linearising. In simple tests this makes very little difference, as the IO buffers are usually LINEAR_PAGE, so it's just one chunk. More chunks are going to be slower, because it has to walk forward through the chunks. It's a step though.
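For reference, a forward chunk walk like that can be expressed with abd_iterate_func(); here's a rough sketch (not the PR's code) that tracks the offset just past the last non-zero byte:

  /*
   * Rough sketch: walk the ABD forward chunk by chunk and remember the
   * offset just past the last non-zero byte seen so far.
   */
  typedef struct slack_scan {
      size_t ss_off;    /* running offset of the current chunk */
      size_t ss_last;   /* offset just past the last non-zero byte */
  } slack_scan_t;

  static int
  slack_scan_cb(void *buf, size_t len, void *priv)
  {
      slack_scan_t *ss = priv;
      const uint8_t *p = buf;

      /* Scan this chunk backwards; the first hit is its last non-zero byte. */
      for (size_t i = len; i > 0; i--) {
          if (p[i - 1] != 0) {
              ss->ss_last = ss->ss_off + i;
              break;
          }
      }
      ss->ss_off += len;
      return (0);    /* keep going; later chunks may still contain data */
  }

  /*
   * Usage:
   *   slack_scan_t ss = { 0 };
   *   abd_iterate_func(abd, 0, lsize, slack_scan_cb, &ss);
   *   psize = ss.ss_last;
   */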

Next is in-place compression. This allows a compression method to be "in-place" or "zero-copy", where it operates on the source ABD only. I've changed the zio_compress_data return semantics a little, so that now you find out if compression happened by comparing the compressed length to the source length. If compression happened, the output is in the dest buffer if the caller passed one, in a newly allocated dest buffer if one was needed, or in the source buffer otherwise. See the comment. The point is, the caller can indicate what they want to happen and understand what did happen wrt copies.
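Loosely, the caller side described above reads something like this (a sketch only; the names and the actual zio_compress_data() signature in this branch may differ):

  /* compress/src/lsize/level stand in for the usual write-path locals. */
  abd_t *dst = NULL;
  uint64_t csize = zio_compress_data(compress, src, &dst, lsize, level);

  if (csize >= lsize) {
      /* Compression didn't help; write the source uncompressed. */
  } else if (dst != NULL) {
      /* Output is in dst (caller-supplied or newly allocated). */
  } else {
      /* In-place method: the compressed output lives in src itself. */
  }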

Finally, added a very very prototype reverse iterator for ABD. About all I can say is that it works. It's not too bad, but inefficient on the "advance" step for gang ABDs. Not that those are tested really, all pretty theoretical at this point.

Very very early numbers are promising. This is short fio runs, overwrite the 16-17M range of a 17M file on recordsize 16M, on a memory-backed pool. Only tuning is primarycache=metadata, and this is not really a good machine for performance work, but it's what I have right now (working on getting access back to the machine & workload I ran the original numbers on).

95b5cb1ba rob.nor.. 2024-11-18  compress: add "slack" compression option

compression=off:
  WRITE: bw=238MiB/s (249MB/s), 238MiB/s-238MiB/s (249MB/s-249MB/s), io=10.0GiB (10.8GB), run=43161-43161msec

compression=slack:
  WRITE: bw=518MiB/s (543MB/s), 518MiB/s-518MiB/s (543MB/s-543MB/s), io=18.3GiB (19.7GB), run=36258-36258msec

compression=lz4
  WRITE: bw=561MiB/s (588MB/s), 561MiB/s-561MiB/s (588MB/s-588MB/s), io=19.7GiB (21.1GB), run=35872-35872msec


87b4315c8 rob.nor.. 2024-11-18  WIP: slack: compress in-place

compression=slack:
  WRITE: bw=636MiB/s (667MB/s), 636MiB/s-636MiB/s (667MB/s-667MB/s), io=21.7GiB (23.3GB), run=34931-34931msec


458fcefbc rob.nor.. 2024-11-19  WIP: slack: use reverse iterator

compression=slack:
  WRITE: bw=628MiB/s (658MB/s), 628MiB/s-628MiB/s (658MB/s-658MB/s), io=21.3GiB (22.9GB), run=34787-34787msec

compression=lz4
  WRITE: bw=552MiB/s (579MB/s), 552MiB/s-552MiB/s (579MB/s-579MB/s), io=19.4GiB (20.8GB), run=35994-35994msec

So roughly what we saw before, but reducing copy overhead helps a lot.

With regards to true upstreaming, at this point I'm thinking to offer something very close to this: in-place/zero-copy compress, much improved reverse iterator, and that's about all as a first cut.


I thought about doing in-place for decompress too, but so far I don't think it will be very useful just because of how it's used. Most of the time, it's going into the ARC, and we want to keep that storing the "compressed" copy, so "decompressing" into the same buffer, even if it was big enough, probably isn't going to come up very often.

Though, that's also the kind of place where I feel like compress=slack perhaps shouldn't be expressed as just another compression option. If the ARC knew, then when "decompressing" it's copy, it could instead hand out an ABD that is a view over the "compressed" ABD, plus a bunch of zero pages at the back. No copies, no double allocation. But that would of course be a special case just for this compression method. Which maybe is fine, but maybe also says its not really compression, and should be something else.

In another world, maybe I'd name it after zio_shrink() and call it "shrinkwrap". But regardless, we can't store something extra without a signal in the block pointer that some post-processing is required.


Other things I want to play with that seem adjacent to this:

  • optimise the zero search; AVX-based population counts are pretty easy to put together (a rough sketch follows this list)
  • "lazy" ABDs, where we don't allocate the buffer/chunks until first use (borrowed, mapped, iterated), then zero regions can just be empty chunk slots and/or pointers to zero_page, and most of the zero scan could just be skipping the empty chunks. Of course, that makes LINEAR_PAGE interesting; maybe scatter ends up being non-contiguous that way.

That's it. Have at it, I'll be back tomorrow :)
