Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-contiguous put payloads / vectored writes (#5514) #5538

Merged
merged 4 commits into from
Apr 15, 2024

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Mar 21, 2024

Which issue does this PR close?

Closes #5514

Rationale for this change

Allowing non-contiguous payloads avoids needing to bump allocate buffers when writing data where the size isn't known up front. This can be significantly more efficient.

It will also open up interesting possibilities for the parquet writer, where the data is already buffered in chunks

What changes are included in this PR?

Are there any user-facing changes?

FYI @roeap @wjones127 @alamb

@github-actions github-actions bot added the object-store Object Store Interface label Mar 21, 2024
@@ -438,16 +443,8 @@ impl S3Client {

let mut builder = self.client.request(Method::POST, url);

// Compute checksum - S3 *requires* this for DeleteObjects requests, so we default to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup given SHA256 is the only checksum we support and are ever likely to

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and to be clear for anyone else reading this PR -- this code is not exposed publicly so it can be changed in the future if needed as well)

async move {
let mut retries = 0;
let now = Instant::now();

loop {
let s = req.try_clone().expect("request body must be cloneable");
match client.execute(s).await {
let mut s = self.try_clone().expect("request body must be cloneable");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We defer setting the body until this point when we have a streaming body, as otherwise this would fail

@tustvold tustvold force-pushed the non-contiguous-write-payloads branch 4 times, most recently from 3640a8c to 4e20f2d Compare April 9, 2024 13:27
@tustvold tustvold force-pushed the non-contiguous-write-payloads branch from 4e20f2d to 70de4ee Compare April 9, 2024 15:02
@@ -468,8 +468,8 @@ impl S3Client {
let response = builder
.header(CONTENT_TYPE, "application/xml")
.body(body)
.with_aws_sigv4(credential.authorizer(), payload_sha256.as_deref())
.send_retry_with_idempotency(&self.config.retry_config, false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false is the default for a POST request

@@ -186,11 +186,7 @@ impl DynamoCommit {
to: &Path,
) -> Result<()> {
self.conditional_op(client, to, None, || async {
client
.copy_request(from, to)
.set_idempotent(false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

false is the default

@@ -237,21 +242,23 @@ impl AzureClient {
builder = builder.header(CONTENT_TYPE, value);
}

builder = builder
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved into send above

.expect("request body must be cloneable");

if let Some(x) = &self.payload {
*s.body_mut() = Some(x.body());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting this here sidesteps issues around clone-ability of stream based request bodies

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a function call like s.set_body(Some(x.body()) instead might be easier to understand the intent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an upstream type unfortunately

@@ -327,20 +335,24 @@ impl GoogleCloudStorageClient {
let builder = self
.client
.request(Method::PUT, url)
.header(header::CONTENT_TYPE, content_type)
.header(header::CONTENT_LENGTH, payload.len())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved into send

@@ -231,7 +231,7 @@ impl std::fmt::Debug for BufWriter {

enum BufWriterState {
/// Buffer up to capacity bytes
Buffer(Path, Vec<u8>),
Buffer(Path, PutPayloadMut),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change means that we no longer bump allocate 🎉

@@ -121,7 +122,8 @@ impl WriteMultipart {
pub fn new_with_chunk_size(upload: Box<dyn MultipartUpload>, chunk_size: usize) -> Self {
Self {
upload,
buffer: Vec::with_capacity(chunk_size),
chunk_size,
buffer: PutPayloadMut::new(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change means we avoid bump allocating but also don't allocate an entire 10MB buffer up-front only to potentially use some fraction of it

object_store/src/payload.rs Outdated Show resolved Hide resolved
@tustvold tustvold marked this pull request as ready for review April 9, 2024 15:07
@tustvold tustvold changed the title Support non-contiguous put payloads (#5514) Support non-contiguous put payloads / vectored writes (#5514) Apr 9, 2024
object_store/src/payload.rs Show resolved Hide resolved

impl FromIterator<Bytes> for PutPayload {
fn from_iter<T: IntoIterator<Item = Bytes>>(iter: T) -> Self {
Self(iter.into_iter().collect())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Arc::from_iter() directly to reduce one alloaction (after TrustedLen stablized in the future).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this does use Arc::from_iter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arc::from_iter has special logic for TrustedLen that can avoid the collect()

Copy link
Contributor Author

@tustvold tustvold Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the definition of collect it just invokes from_iter for the collection being collected into, in this case Arc

fn collect<B: FromIterator<Self::Item>>(self) -> B
    where
        Self: Sized,
    {
        FromIterator::from_iter(self)
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank yoU @tustvold -- I think the structure of this code / API is quite nice and elegant 👌 👏

I do think it is a non trivial API change (not complicated, but will require a bunch of mechanical changes) and so marked the PR as API change. To that end I suggest we add some better documentation to help users map types they are familiar with to the new PutPayload.

Here is a suggestion: tustvold#81

Other than that I had a few suggestions / nits, the only one I feel somewhat strongly about is adding a new put rather than updating the existing write API.

@@ -266,7 +270,8 @@ pub(crate) struct Request<'a> {
path: &'a Path,
config: &'a S3Config,
builder: RequestBuilder,
payload_sha256: Option<Vec<u8>>,
payload_sha256: Option<digest::Digest>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are unrelated cleanups right? Or is there a reason digest::Digest is preferred over Vec<u8>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It avoids an allocation, but yes an unrelated cleanup

@@ -438,16 +443,8 @@ impl S3Client {

let mut builder = self.client.request(Method::POST, url);

// Compute checksum - S3 *requires* this for DeleteObjects requests, so we default to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and to be clear for anyone else reading this PR -- this code is not exposed publicly so it can be changed in the future if needed as well)

/// should be able to observe a partially written object
async fn put(&self, location: &Path, bytes: Bytes) -> Result<PutResult> {
self.put_opts(location, bytes, PutOptions::default()).await
async fn put(&self, location: &Path, payload: PutPayload) -> Result<PutResult> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For anyone else reviewing this PR. I think this is an (and the key) API change -- to take a slice of [Bytes] bytes rather than a Bytes directly

}

/// Append a [`Bytes`] to this [`PutPayloadMut`]
pub fn push(&mut self, bytes: Bytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please document that this is a zero copy API -- specifically that the progress is closed off and the underlying data is not copied.

Comment on lines +260 to +264
if !self.in_progress.is_empty() {
let completed = std::mem::take(&mut self.in_progress).into();
self.completed.push(completed);
}
PutPayload(self.completed.into())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think you can save typing and allocating another Vec with something like

Suggested change
if !self.in_progress.is_empty() {
let completed = std::mem::take(&mut self.in_progress).into();
self.completed.push(completed);
}
PutPayload(self.completed.into())
let Self { in_progress, completed, ..} = self;
if !in_progress.is_empty() {
completed.push(in_progress);
}
PutPayload(completed.into())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec::default doesn't allocate

assert_eq!(chunks[3].len(), 23);
assert_eq!(chunks[4].len(), 20);
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add a test for push as well (esp to document the non copying behavior)

/// Put a chunk of data into this [`WriteMultipart`]
///
/// See [`Self::write`] for information on backpressure
pub fn put(&mut self, mut bytes: Bytes) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are changing the API anyways, is there a reason to keep both write and put? As in why not change write to take a Bytes as one can could then call write via write(my_payload.into()) 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are circumstances where having the writer perform the buffering might be advantageous, e.g. when integrating with the Write trait

Copy link
Contributor

@alamb alamb Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would be great to include in the docs as I found it non obvious.

Update the docs were updated (thanks 🙏 )

    /// Put a chunk of data into this [`WriteMultipart`] without copying
    ///
    /// Data is buffered using [`PutPayloadMut::push`]. Implementations looking to
    /// perform writes from non-owned buffers should prefer [`Self::write`] as this
    /// will allow multiple calls to share the same underlying allocation.

.expect("request body must be cloneable");

if let Some(x) = &self.payload {
*s.body_mut() = Some(x.body());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a function call like s.set_body(Some(x.body()) instead might be easier to understand the intent

}

impl RetryableRequest {
/// Set whether this request is idempotent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document what idempotent means in this context? Like it means the entire request can be retried?

loop {
let s = req.try_clone().expect("request body must be cloneable");
match client.execute(s).await {
let mut s = self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you like brevity but one character variable names that are different than the first letter of the things they are holding are pretty hard for me to grok. (I realize you can't use r b/c there is another variable below named r).

How about we rename s to request and x to payload?

@tustvold tustvold merged commit 8731760 into apache:master Apr 15, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ObjectStore] Non-Contiguous Write Payloads
3 participants