Raw extent v2 #1270
Conversation
layout: RawLayout,

/// Has this block been written?
block_written: Vec<bool>,
Excuse my random outsider comment:
suggestion: This seems to never change in size after creation, so it could maybe be a boxed slice?
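For illustration, a minimal sketch of the boxed-slice suggestion (the struct and constructor here are hypothetical, not the PR's actual types):

```rust
/// Hypothetical sketch: track "has this block been written?" in a boxed
/// slice, since the length is fixed once the extent is created.
struct BlockWritten {
    block_written: Box<[bool]>,
}

impl BlockWritten {
    fn new(n_blocks: usize) -> Self {
        Self {
            // Build the Vec, then drop its unused capacity and fix its
            // length in the type by converting to a boxed slice.
            block_written: vec![false; n_blocks].into_boxed_slice(),
        }
    }
}
```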
This is now updated. It's a little simpler than the previous description: I decided to remove the fancy interleaving of blocks and contexts, and went back to the straight-forward layout.
(this may change if I see compelling benchmarks, to come soon)
};
use zerocopy::AsBytes;

pub(crate) const DEFAULT_ZFS_RECORDSIZE: u64 = 128 * 1024;
In both cases in get_record_size where this is used, it doesn't seem like the path is a ZFS file system, so maybe this should be NON_ZFS_RECORDSIZE or something?
Good point, renamed to DUMMY_RECORDSIZE in f7c13d7.
/// ## Expected recordsize
/// After the block data, we store a single `u64` representing the expected
/// recordsize when the file was written. When the file is reopened, we detect
/// if its recordsize has changed, which would be surprising!
TODO-jwm this may happen after migration
To be clear, after region / region snapshot replacement.
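For context, a rough sketch of the recordsize check that the doc-comment above describes; the function shape, offset handling, byte order, and error type are illustrative rather than the PR's actual code:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Hypothetical sketch: compare the `u64` recordsize stored after the block
/// data against the recordsize reported for the filesystem today.
/// `recordsize_offset` is assumed to come from the extent layout.
fn check_expected_recordsize(
    file: &File,
    recordsize_offset: u64,
    current_recordsize: u64,
) -> io::Result<()> {
    let mut buf = [0u8; 8];
    file.read_exact_at(&mut buf, recordsize_offset)?;
    let expected = u64::from_le_bytes(buf);
    if expected != current_recordsize {
        // A changed recordsize would be surprising (e.g. after region /
        // region snapshot replacement), so surface it loudly.
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("recordsize changed: expected {expected}, got {current_recordsize}"),
        ));
    }
    Ok(())
}
```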
if self.layout.has_padding_after(block) {
    vs.push(IoSlice::new(&padding));
    expected_bytes += padding.len();
}
TODO-jwm can pwritev partially fail?
I added this as a comment for myself to verify, but my thinking here was:
- of course pwritev can partially fail; it can return a different number of bytes written out than what was added
- are there any ZFS-specific considerations to think about?
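For what it's worth, a minimal sketch of the short-write check under discussion, using raw `libc::pwritev`; the function and its error handling are illustrative, not the PR's code:

```rust
use std::io;
use std::os::fd::RawFd;

/// Hypothetical sketch: issue a `pwritev` and treat anything other than a
/// full-length write as an error, since `pwritev` can report fewer bytes
/// written than were requested without failing outright.
unsafe fn pwritev_all(
    fd: RawFd,
    iovecs: &[libc::iovec],
    offset: libc::off_t,
    expected_bytes: usize,
) -> io::Result<()> {
    let n = libc::pwritev(fd, iovecs.as_ptr(), iovecs.len() as libc::c_int, offset);
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    if n as usize != expected_bytes {
        // A short write: the caller must decide whether to retry the
        // remaining bytes or fail the whole operation.
        return Err(io::Error::new(
            io::ErrorKind::WriteZero,
            format!("short pwritev: wrote {n} of {expected_bytes} bytes"),
        ));
    }
    Ok(())
}
```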
    (job_id.0, self.extent_number.0, n_blocks as u64)
});

// Now execute each chunk in a separate `pwritev` call
TODO-jwm this means the Crucible level write can partially succeed
To be clear! I think this was always true but was TODO for myself to 1) verify this and 2) suggest we add this as documentation somewhere?
});
if self.layout.has_padding_after(block) {
    iovecs.push(libc::iovec {
        iov_base: padding.as_mut_ptr() as *mut _,
What is preadv's behaviour when it's passed multiple iovecs with the same address?
I believe they're processed in array order, so the last one wins. It doesn't really matter here, because we're not using the padding data for anything.
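To make that concrete, here's a rough sketch of a read where two iovec entries point at the same scratch padding buffer; sizes and names are made up for illustration:

```rust
use std::io;
use std::os::fd::RawFd;

/// Hypothetical sketch: read two blocks, routing the padding between and
/// after them into a single scratch buffer referenced by two iovecs. The
/// iovecs are filled in array order, so the scratch buffer ends up holding
/// whichever padding region was read last -- fine, since it's never used.
unsafe fn read_two_blocks(
    fd: RawFd,
    offset: libc::off_t,
    block_size: usize,
    padding_size: usize,
) -> io::Result<(Vec<u8>, Vec<u8>)> {
    let mut block0 = vec![0u8; block_size];
    let mut block1 = vec![0u8; block_size];
    let mut padding = vec![0u8; padding_size];

    let iovecs = [
        libc::iovec { iov_base: block0.as_mut_ptr() as *mut _, iov_len: block_size },
        libc::iovec { iov_base: padding.as_mut_ptr() as *mut _, iov_len: padding_size },
        libc::iovec { iov_base: block1.as_mut_ptr() as *mut _, iov_len: block_size },
        // Same scratch buffer again; its earlier contents are simply overwritten.
        libc::iovec { iov_base: padding.as_mut_ptr() as *mut _, iov_len: padding_size },
    ];

    let n = libc::preadv(fd, iovecs.as_ptr(), iovecs.len() as libc::c_int, offset);
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok((block0, block1))
}
```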
        iov_len: block_size,
    });
    iovecs.push(libc::iovec {
        iov_base: ctx as *mut _ as *mut _,
what's going on here haha?
lol it's casting from &mut [u8; 32] → *mut [u8; 32] → *mut c_void
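Written out, that's roughly the following (the 32-byte context array and the helper function are assumptions for illustration):

```rust
/// Hypothetical sketch: build an iovec pointing at a 32-byte context entry.
fn context_iovec(ctx: &mut [u8; 32]) -> libc::iovec {
    let iov_len = ctx.len();
    libc::iovec {
        // Two casts in a row:
        //   &mut [u8; 32]  ->  *mut [u8; 32]  ->  *mut libc::c_void
        iov_base: ctx as *mut _ as *mut _,
        iov_len,
    }
}
```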
// If the `zfs` executable isn't present, then we're
// presumably on a non-ZFS filesystem and will use a default
// recordsize
There's a remote possibility that PATH isn't set correctly but we are on a ZFS filesystem. I'm wondering if the downstairs should somehow expect that this binary is present if we're running on illumos (we already expect it for taking snapshots!)
Good idea, changed in 5c446c7 to return an error if zfs isn't present on an illumos system.
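A rough sketch of that behavior; the exact `zfs` invocation, function shape, and error type here are illustrative rather than copied from the diff:

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical sketch: ask the `zfs` executable for the recordsize of the
/// dataset backing `path`. On illumos a missing `zfs` binary is an error
/// (we already depend on it for snapshots); elsewhere we fall back to a
/// dummy recordsize, assuming a non-ZFS filesystem.
fn get_record_size(path: &Path) -> Result<u64, String> {
    const DUMMY_RECORDSIZE: u64 = 128 * 1024;

    let out = Command::new("zfs")
        .args(["get", "-Hpo", "value", "recordsize"]) // arguments illustrative
        .arg(path)
        .output();

    match out {
        Ok(out) if out.status.success() => String::from_utf8_lossy(&out.stdout)
            .trim()
            .parse()
            .map_err(|e| format!("could not parse recordsize: {e}")),
        Ok(out) => Err(format!("zfs get failed with status {:?}", out.status)),
        Err(e) if cfg!(target_os = "illumos") => {
            // On illumos the binary is expected to exist, so don't fall back.
            Err(format!("could not run zfs on illumos: {e}"))
        }
        Err(_) => Ok(DUMMY_RECORDSIZE),
    }
}
```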
impl std::fmt::Debug for RawLayout {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("RawLayout")
            .field("extent_size", &self.extent_size)
Why only this?
This was copy-pasta from extent_inner_raw.rs, but I'm not sure why they're not just deriving Debug. Fixed in 5b2a942 for both.
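i.e. presumably something along these lines instead of the hand-written impl (fields abbreviated for illustration):

```rust
// A minimal sketch of the derive-based version; the real struct has more
// fields, all of which then show up in the Debug output automatically.
#[derive(Debug)]
struct RawLayout {
    extent_size: u64,
    // ...
}
```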
Looks like I clicked "Comment" too early, so there's a few
I think we might need some additional testing, or some updates to the testing done in integration_tests/src/lib.rs. That file by default will only create RAW and SQLite backend types, and we should go through it and figure out where we need to also do the same tests for RAW_V2 (or if all the RAW tests become RAW_V2).

We have snapshots with SQLite. We have snapshots with RAW. So, we need to support the downstairs being able to "clone" either of those.
integration_test_clone_raw()

Should we add a test (or do we have a test already) of downstairs migration from SQLite to RAW_V2?
// Try to recompute the context slot from the file. If this
// fails, then we _really_ can't recover, so bail out
// unceremoniously.
self.recompute_block_written_from_file(block).unwrap();
Is this step "putting back" block context in the event we failed a write?
I'm unclear what we are doing here exactly.
This comment was out of date; fixed!
        self.extent_number
    )));
}
cdt::extent__read__file__done!(|| {
For write() we still hit the dtrace __done probe even on error. We should do the same for reads so we don't have differing behavior between the two IOs.
Fixed (both here and in extent_inner_raw.rs, which had the same behavior).
@@ -3416,9 +3417,12 @@ enum WrappedStream {
/// tests, it can be useful to create volumes using older backends.
#[derive(Copy, Clone, Default, Debug, PartialEq)]
pub enum Backend {
    #[default]
    #[cfg(any(test, feature = "integration-tests"))]
I wonder if there is some value in allowing crucible-downstairs to pick what backend it wants at creation time? I think there could be value in making "old" versions for various kinds of tests, and not hiding them behind a feature would allow that.
It's also not something to worry about for this PR though. Why do I even bring it up then? Good question. For which I have no answer...
I've gone through and made
Right now, the only migration is the automatic SQLite → RAW_V1, which happens if a read-write SQLite extent is opened. Since we don't have any read-write SQLite extents in the field (only read-only snapshots), I didn't bother implementing SQLite → RAW_V2.
Yeah, that seems like the right call here. Thanks.
(Note: this PR is staged on top of #1268; only the last two commits are relevant.)
The current raw extent implementation lays out on-disk data like this:
In other words,
(The layout is also discussed in this block comment).
This layout prioritizes keeping blocks contiguous / aligned / tightly packed. In doing so, it makes sacrifices:
The latter is the original sin of this layout, and is clearly visible in flamegraphs: reading the block data (4 KiB) and context slot (48 bytes) take basically the same amount of time, because (spoilers) each one has to read a full ZFS record (128 KiB).
(graph from https://github.com/oxidecomputer/stlouis/issues/541)
Having blocks be aligned to 512 / 4K boundaries makes sense in terms of mechanical sympathy with the drive, but we're not controlling the drive directly; we've got all of ZFS between our filesystem operations and bytes actually landing in an SSD.
I decided to experiment with an alternate layout that tightly packs blocks and context slots. (Naively, you could do [block | ctx | block | ctx | ...], but interleaving them uses fewer iovecs.)

This format means that blocks and contexts can be read and written together (using preadv / pwritev), and live in the same ZFS records (so we don't have to load 2x records to get a single block's data). Blocks and contexts are organized so that they always live in the same ZFS record (i.e. a 128 KiB chunk of the file), and are written in a single pwritev call. Based on discussion in #oxide-q&a, this means that block + context should always be written together (or not at all).
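To make the packing concrete, here's a back-of-the-envelope sketch of how block + context pairs fit into a ZFS record; the 4 KiB block size and 32-byte context slot are assumptions for illustration, not the PR's actual constants:

```rust
/// Hypothetical sketch: how many block + context pairs fit in one ZFS
/// record, and how much padding is left at the end of each record.
fn pairs_per_record(recordsize: u64, block_size: u64, ctx_size: u64) -> (u64, u64) {
    let pair = block_size + ctx_size;
    let pairs = recordsize / pair;
    let padding = recordsize - pairs * pair;
    (pairs, padding)
}

fn main() {
    // 128 KiB recordsize, 4 KiB blocks, 32-byte context slots (assumed).
    let (pairs, padding) = pairs_per_record(128 * 1024, 4096, 32);
    // -> 31 pairs per record with 3104 bytes of trailing padding, so a block
    //    and its context always land in the same record and can be handled
    //    by a single preadv/pwritev call.
    println!("{pairs} pairs per record, {padding} bytes of padding");
}
```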
The new format looks like a significant performance improvement. Doing random writes to a completely full 128 GiB disk, here's the numbers:

In graph form:
Todo
- ZFS_RECORDSIZE shenanigans / crash consistency
- RawFile and RawFileV2?