Raw extent v2 #1270
Conversation
layout: RawLayout,

/// Has this block been written?
block_written: Vec<bool>,
Excuse my random outsider comment:
suggestion: This seems to never change in size after creation, so it could maybe be a boxed slice?
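For illustration, a minimal sketch of the boxed-slice suggestion (the struct and constructor here are hypothetical, not the PR's actual types):

```rust
/// Hypothetical sketch: track "has this block been written?" in a boxed
/// slice, since the length is fixed once the extent is created.
struct BlockWritten {
    block_written: Box<[bool]>,
}

impl BlockWritten {
    fn new(n_blocks: usize) -> Self {
        Self {
            // Build the Vec, then drop its unused capacity and fix its
            // length in the type by converting to a boxed slice.
            block_written: vec![false; n_blocks].into_boxed_slice(),
        }
    }
}
```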
This is now updated. It's a little simpler than the previous description: I decided to remove the fancy interleaving of blocks and contexts, and went back to the straight-forward layout.
(this may change if I see compelling benchmarks, to come soon)
};
use zerocopy::AsBytes;

pub(crate) const DEFAULT_ZFS_RECORDSIZE: u64 = 128 * 1024;
In both cases in get_record_size where this is used, it doesn't seem like the path is a ZFS file system, so maybe this should be NON_ZFS_RECORDSIZE or something?
Good point, renamed to DUMMY_RECORDSIZE in f7c13d7.
/// ## Expected recordsize
/// After the block data, we store a single `u64` representing the expected
/// recordsize when the file was written. When the file is reopened, we detect
/// if its recordsize has changed, which would be surprising!
TODO-jwm this may happen after migration
To be clear, after region / region snapshot replacement.
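For context, a rough sketch of the recordsize check that the doc-comment above describes; the function shape, offset handling, byte order, and error type are illustrative rather than the PR's actual code:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Hypothetical sketch: compare the `u64` recordsize stored after the block
/// data against the recordsize reported for the filesystem today.
/// `recordsize_offset` is assumed to come from the extent layout.
fn check_expected_recordsize(
    file: &File,
    recordsize_offset: u64,
    current_recordsize: u64,
) -> io::Result<()> {
    let mut buf = [0u8; 8];
    file.read_exact_at(&mut buf, recordsize_offset)?;
    let expected = u64::from_le_bytes(buf);
    if expected != current_recordsize {
        // A changed recordsize would be surprising (e.g. after region /
        // region snapshot replacement), so surface it loudly.
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("recordsize changed: expected {expected}, got {current_recordsize}"),
        ));
    }
    Ok(())
}
```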
if self.layout.has_padding_after(block) {
    vs.push(IoSlice::new(&padding));
    expected_bytes += padding.len();
}
TODO-jwm can pwritev partially fail?
I added this as a comment for myself to verify, but my thinking here was:
- of course pwritev can partially fail; it can return a different number of bytes written out than what was added
- are there any ZFS-specific considerations to think about?
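For what it's worth, a minimal sketch of the short-write check under discussion, using raw `libc::pwritev`; the function and its error handling are illustrative, not the PR's code:

```rust
use std::io;
use std::os::fd::RawFd;

/// Hypothetical sketch: issue a `pwritev` and treat anything other than a
/// full-length write as an error, since `pwritev` can report fewer bytes
/// written than were requested without failing outright.
unsafe fn pwritev_all(
    fd: RawFd,
    iovecs: &[libc::iovec],
    offset: libc::off_t,
    expected_bytes: usize,
) -> io::Result<()> {
    let n = libc::pwritev(fd, iovecs.as_ptr(), iovecs.len() as libc::c_int, offset);
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    if n as usize != expected_bytes {
        // A short write: the caller must decide whether to retry the
        // remaining bytes or fail the whole operation.
        return Err(io::Error::new(
            io::ErrorKind::WriteZero,
            format!("short pwritev: wrote {n} of {expected_bytes} bytes"),
        ));
    }
    Ok(())
}
```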
    (job_id.0, self.extent_number.0, n_blocks as u64)
});

// Now execute each chunk in a separate `pwritev` call
TODO-jwm this means the Crucible level write can partially succeed
To be clear! I think this was always true but was TODO for myself to 1) verify this and 2) suggest we add this as documentation somewhere?
});
if self.layout.has_padding_after(block) {
    iovecs.push(libc::iovec {
        iov_base: padding.as_mut_ptr() as *mut _,
What is preadv's behaviour when it's passed multiple iovecs with the same address?
I believe they're processed in array order, so the last one wins. It doesn't really matter here, because we're not using the padding data for anything.
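To make that concrete, here's a rough sketch of a read where two iovec entries point at the same scratch padding buffer; sizes and names are made up for illustration:

```rust
use std::io;
use std::os::fd::RawFd;

/// Hypothetical sketch: read two blocks, routing the padding between and
/// after them into a single scratch buffer referenced by two iovecs. The
/// iovecs are filled in array order, so the scratch buffer ends up holding
/// whichever padding region was read last -- fine, since it's never used.
unsafe fn read_two_blocks(
    fd: RawFd,
    offset: libc::off_t,
    block_size: usize,
    padding_size: usize,
) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut block0 = vec![0u8; block_size];
    let mut block1 = vec![0u8; block_size];
    let mut padding = vec![0u8; padding_size];

    let iovecs = [
        libc::iovec { iov_base: block0.as_mut_ptr() as *mut _, iov_len: block_size },
        libc::iovec { iov_base: padding.as_mut_ptr() as *mut _, iov_len: padding_size },
        libc::iovec { iov_base: block1.as_mut_ptr() as *mut _, iov_len: block_size },
        // Same scratch buffer again; its earlier contents are simply overwritten.
        libc::iovec { iov_base: padding.as_mut_ptr() as *mut _, iov_len: padding_size },
    ];

    let n = libc::preadv(fd, iovecs.as_ptr(), iovecs.len() as libc::c_int, offset);
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok((block0, block1))
}
```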
        iov_len: block_size,
    });
    iovecs.push(libc::iovec {
        iov_base: ctx as *mut _ as *mut _,
what's going on here haha?
lol it's casting from &mut [u8; 32] → *mut [u8; 32] → *mut c_void
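Written out, that's roughly the following (the 32-byte context array and the helper function are assumptions for illustration):

```rust
/// Hypothetical sketch: build an iovec pointing at a 32-byte context entry.
fn context_iovec(ctx: &mut [u8; 32]) -> libc::iovec {
    let iov_len = ctx.len();
    libc::iovec {
        // Two casts in a row:
        //   &mut [u8; 32]  ->  *mut [u8; 32]  ->  *mut libc::c_void
        iov_base: ctx as *mut _ as *mut _,
        iov_len,
    }
}
```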
// If the `zfs` executable isn't present, then we're
// presumably on a non-ZFS filesystem and will use a default
// recordsize
There's a remote possibility that PATH isn't set correctly but we are on a ZFS filesystem. I'm wondering if the downstairs should somehow expect that this binary is present if we're running on illumos (we already expect it for taking snapshots!)
Good idea, changed in 5c446c7 to return an error if zfs isn't present on an illumos system.
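A rough sketch of that behavior; the exact `zfs` invocation, function shape, and error type here are illustrative rather than copied from the diff:

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical sketch: ask the `zfs` executable for the recordsize of the
/// dataset backing `path`. On illumos a missing `zfs` binary is an error
/// (we already depend on it for snapshots); elsewhere we fall back to a
/// dummy recordsize, assuming a non-ZFS filesystem.
fn get_record_size(path: &Path) -> Result<u64, String> {
    const DUMMY_RECORDSIZE: u64 = 128 * 1024;

    let out = Command::new("zfs")
        .args(["get", "-Hpo", "value", "recordsize"]) // arguments illustrative
        .arg(path)
        .output();

    match out {
        Ok(out) if out.status.success() => String::from_utf8_lossy(&out.stdout)
            .trim()
            .parse()
            .map_err(|e| format!("could not parse recordsize: {e}")),
        Ok(out) => Err(format!("zfs get failed with status {:?}", out.status)),
        Err(e) if cfg!(target_os = "illumos") => {
            // On illumos the binary is expected to exist, so don't fall back.
            Err(format!("could not run zfs on illumos: {e}"))
        }
        Err(_) => Ok(DUMMY_RECORDSIZE),
    }
}
```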
impl std::fmt::Debug for RawLayout {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("RawLayout")
            .field("extent_size", &self.extent_size)
Why only this?
This was copy-pasta from extent_inner_raw.rs, but I'm not sure why they're not just deriving Debug. Fixed in 5b2a942 for both.
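i.e. presumably something along these lines instead of the hand-written impl (fields abbreviated for illustration):

```rust
// A minimal sketch of the derive-based version; the real struct has more
// fields, all of which then show up in the Debug output automatically.
#[derive(Debug)]
struct RawLayout {
    extent_size: u64,
    // ...
}
```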
Looks like I clicked "Comment" too early, so there's a few
I think we might need some additional testing, or some updates to the testing done in integration_tests/src/lib.rs. That file by default will only create RAW and SQLite backend types, and we should go through it and figure out where we need to also do the same tests for RAW_V2 (or if all the RAW tests become RAW_V2).

We have snapshots with SQLite. We have snapshots with RAW. So, we need to support the downstairs being able to "clone" either of those.
integration_test_clone_raw()

Should we add a test (or do we have a test already) of downstairs migration from SQLite to RAW_V2?
// Try to recompute the context slot from the file. If this
// fails, then we _really_ can't recover, so bail out
// unceremoniously.
self.recompute_block_written_from_file(block).unwrap();
Is this step "putting back" block context in the event we failed a write?
I'm unclear what we are doing here exactly.
This comment was out of date; fixed!
        self.extent_number
    )));
}
cdt::extent__read__file__done!(|| {
For write() we still hit the dtrace __done probe even on error. We should do the same for reads so we don't have differing behavior between the two IOs.
Fixed (both here and in extent_inner_raw.rs, which had the same behavior).
@@ -3416,9 +3417,12 @@ enum WrappedStream {
/// tests, it can be useful to create volumes using older backends.
#[derive(Copy, Clone, Default, Debug, PartialEq)]
pub enum Backend {
    #[default]
    #[cfg(any(test, feature = "integration-tests"))]
I wonder if there is some value in allowing crucible-downstairs to pick what backend it wants at creation time? I think there could be value in making "old" versions for various kinds of tests, and not hiding them behind a feature would allow that.
It's also not something to worry about for this PR though. Why do I even bring it up then? Good question. For which I have no answer...
I've gone through and made
Right now, the only migration is the automatic SQLite → RAW_V1, which happens if a read-write SQLite extent is opened. Since we don't have any read-write SQLite extents in the field (only read-only snapshots), I didn't bother implementing SQLite → RAW_V2.
Yeah, that seems like the right call here. Thanks.
(Note: this PR is staged on top of #1268; only the last two commits are relevant.)
The current raw extent implementation lays out on-disk data like this:
In other words,
(The layout is also discussed in this block comment).
This layout prioritizes keeping blocks contiguous / aligned / tightly packed. In doing so, it makes sacrifices:
The latter is the original sin of this layout, and is clearly visible in flamegraphs: reading the block data (4 KiB) and context slot (48 bytes) take basically the same amount of time, because (spoilers) each one has to read a full ZFS record (128 KiB).
(graph from https://github.com/oxidecomputer/stlouis/issues/541)
Having blocks be aligned to 512 / 4K boundaries makes sense in terms of mechanical sympathy with the drive, but we're not controlling the drive directly; we've got all of ZFS between our filesystem operations and bytes actually landing in an SSD.
I decided to experiment with an alternate layout that tightly packs blocks and context slots. (Naively, you could do [block | ctx | block | ctx | ...], but interleaving them uses fewer iovecs.)

This format means that blocks and contexts can be read and written together (using preadv / pwritev), and live in the same ZFS records (so we don't have to load 2x records to get a single block's data). Blocks and contexts are organized so that they always live in the same ZFS record (i.e. a 128 KiB chunk of the file), and are written in a single pwritev call. Based on discussion in #oxide-q&a, this means that block + context should always be written together (or not at all).
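To make the packing concrete, here's a back-of-the-envelope sketch of how block + context pairs fit into a ZFS record; the 4 KiB block size and 32-byte context slot are assumptions for illustration, not the PR's actual constants:

```rust
/// Hypothetical sketch: how many block + context pairs fit in one ZFS
/// record, and how much padding is left at the end of each record.
fn pairs_per_record(recordsize: u64, block_size: u64, ctx_size: u64) -> (u64, u64) {
    let pair = block_size + ctx_size;
    let pairs = recordsize / pair;
    let padding = recordsize - pairs * pair;
    (pairs, padding)
}

fn main() {
    // 128 KiB recordsize, 4 KiB blocks, 32-byte context slots (assumed).
    let (pairs, padding) = pairs_per_record(128 * 1024, 4096, 32);
    // -> 31 pairs per record with 3104 bytes of trailing padding, so a block
    //    and its context always land in the same record and can be handled
    //    by a single preadv/pwritev call.
    println!("{pairs} pairs per record, {padding} bytes of padding");
}
```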
The new format looks like a significant performance improvement. Doing random writes to a completely full 128 GiB disk, here's the numbers:

In graph form:
Todo
- ZFS_RECORDSIZE shenanigans / crash consistency
- RawFile and RawFileV2?