
Reliable storage lock #2014

Open · wants to merge 3 commits into master from reliable-storage-lock
Conversation


@IvoDD IvoDD commented Nov 25, 2024

What does this implement or fix?

Introduces a reliable storage lock

Introduces a new ReliableStorageLock and ReliableStorageLockGuard to
be used as a slower but more reliable alternative to the existing
StorageLock.

It uses the new If-None-Match atomic put operations in S3.

First commit (Reliable storage lock):

  • Upgrades the aws-sdk-cpp in vcpkg (which needed a few additions
    because of some problematic dependencies)
  • Adds write_if_none_match capability to AsyncStore's S3 and to
    InMemoryStore
  • Logic for ReliableStorageLock
  • C++ tests using the InMemoryStore
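The conditional-put primitive the lock builds on can be sketched roughly as follows (a minimal stand-in, not ArcticDB's actual `InMemoryStore`; the method name follows the PR description, and on real S3 this corresponds to a `PutObject` with the `If-None-Match: *` header, which fails with `412 Precondition Failed` if the key already exists):

```cpp
#include <map>
#include <mutex>
#include <string>

// Illustrative in-memory stand-in: an atomic put-if-absent over a key space.
class InMemoryStoreSketch {
public:
    // Returns true if the object was written, false if the key already existed
    // (the in-memory analogue of a 412 Precondition Failed from S3).
    bool write_if_none_match(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> g(mutex_);
        return objects_.emplace(key, value).second;
    }

private:
    std::mutex mutex_;
    std::map<std::string, std::string> objects_;
};
```

Exactly one of any number of concurrent writers can succeed for a given key, which is the property the lock algorithm relies on.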

Second commit (Real S3 storage python tests for ReliableStorageLock)

  • Adds a ReliableStorageLockManager which exposes functions to acquire and free the lock in Python. (The guard structure is unusable from Python.)
  • Adds a new storage lock test with the existing real_s3_version_store fixture.

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@IvoDD IvoDD force-pushed the reliable-storage-lock branch 4 times, most recently from f06bd01 to 776bc2d on November 25, 2024 18:26
Introduces a new `ReliableStorageLock` and `ReliableStorageLockGuard` to
be used as a slower but more reliable alternative to the existing
`StorageLock`.

It uses the new If-None-Match atomic put operations in S3.

This commit:
- Upgrades the aws-sdk-cpp in vcpkg (which needed a few additions
  because of some problematic dependencies)
- Adds `write_if_none_match` capability to `AsyncStore`'s S3 and to
  `InMemoryStore`
- Logic for `ReliableStorageLock`
- C++ tests using the `InMemoryStore`

A follow-up commit will introduce a Python integration test with real AWS S3.

Adds a real S3 storage test (currently to be run with the persistent storage tests mark) for the lock.

IvoDD commented Nov 26, 2024

Evidence of real s3 storage tests passing here

@@ -60,6 +60,7 @@ KeyData get_key_data(KeyType key_type) {
STRING_REF(KeyType::APPEND_REF, aref, 'a')
STRING_KEY(KeyType::MULTI_KEY, mref, 'm')
STRING_REF(KeyType::LOCK, lref, 'x')
STRING_REF(KeyType::SLOW_LOCK, lref, 'x')
@poodlewars Nov 27, 2024

This clashes with the character and prefix for LOCK above which is very odd. Is this intentional? Doesn't it mean we'd list out both sorts of lock when we iterate type?

Collaborator

They all have to be unique, otherwise bad things will happen. Also, I'm not sure why we're characterising this as a slow lock, since if anything it will be faster than the other lock which required variable (and ultimately unknowable) wait periods to ensure that it really was the holder of the lock.

My preference would be that the eventual lock key (the copy destination) is just a normal KeyType::LOCK, and the one that gets copied to it (the first one written) is called a PENDING_LOCK or a LOCK_ATTEMPT or something like that

@@ -57,7 +57,8 @@ class S3ClientWrapper {
virtual S3Result<std::monostate> put_object(
const std::string& s3_object_name,
Segment&& segment,
const std::string& bucket_name) = 0;
const std::string& bucket_name,
bool if_none_match = false) = 0;
Collaborator

I hate bools on public APIs. Could we have two different public methods without the bool that push down to a single private method with the bool or something?

Collaborator

also bools tend to multiply out of control, it's preferable to use an enum, that way callers have to have a name on them (WriteConditions::IF_NONE_MATCH is way more self-explanatory at the call site than 'true')
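The enum suggestion could look something like this (a rough sketch; `WriteConditions`, the method names, and the set-backed store are illustrative, not the PR's actual API):

```cpp
#include <set>
#include <string>

// Named conditions instead of a bare bool: self-explanatory at the call site.
enum class WriteConditions { NONE, IF_NONE_MATCH };

class S3ClientSketch {
public:
    // Two bool-free public methods, as the first comment suggests...
    bool put_object(const std::string& key) {
        return put_object_impl(key, WriteConditions::NONE);
    }
    bool put_object_if_none_match(const std::string& key) {
        return put_object_impl(key, WriteConditions::IF_NONE_MATCH);
    }

private:
    // ...pushing down to a single implementation that branches on the enum.
    bool put_object_impl(const std::string& key, WriteConditions cond) {
        if (cond == WriteConditions::IF_NONE_MATCH && objects_.count(key)) {
            return false;  // would be a 412 Precondition Failed on real S3
        }
        objects_.insert(key);
        return true;
    }
    std::set<std::string> objects_;
};
```

Call sites then read `put_object_if_none_match(key)` rather than `put_object(key, true)`.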

@@ -83,7 +83,8 @@ S3Result<Segment> MockS3Client::get_object(
S3Result<std::monostate> MockS3Client::put_object(
const std::string &s3_object_name,
Segment &&segment,
const std::string &bucket_name) {
const std::string &bucket_name,
bool if_none_match[[maybe_unused]]) {
Collaborator

We should extend the mock client so it can give us precondition failed response codes shouldn't we?

@@ -14,4 +16,11 @@ struct PilotedClock {
}
};

struct PilotedClockNoAutoIncrement {
@poodlewars Nov 27, 2024

I think there's already a clock type in the codebase that works like this - ManualClock?

@@ -1,3 +1,5 @@
#pragma once
Collaborator

Yuck

// provide a slower but more reliable lock than the StorageLock. It should be completely consistent unless a process
// holding a lock get's paused for times comparable to the lock timeout.
// The lock follows the algorithm described here:
// https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/
Collaborator

Maybe we should save a copy of this blog somewhere; it will suck to have no docs of how this works if it vanishes
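In case the blog does disappear, the core of the algorithm it describes can be sketched as follows (hypothetical helper names; the real lock writes one S3 key per epoch via If-None-Match puts rather than using a local map):

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

using Epoch = uint64_t;
using Timestamp = int64_t;

// Stand-in for the storage: each epoch's lock key maps to its expiration time.
struct LockStore {
    std::map<Epoch, Timestamp> locks;
    // Models the atomic If-None-Match put: only one contender can create a key.
    bool put_if_absent(Epoch e, Timestamp expiry) {
        return locks.emplace(e, expiry).second;
    }
    std::optional<std::pair<Epoch, Timestamp>> latest() const {
        if (locks.empty()) return std::nullopt;
        return *locks.rbegin();
    }
};

// Try to take the lock at time `now`: if the latest lock is unexpired, give
// up; otherwise race to create the next epoch's key with a put-if-absent.
std::optional<Epoch> try_take_lock(LockStore& store, Timestamp now, Timestamp timeout) {
    auto latest = store.latest();
    if (latest && latest->second > now) {
        return std::nullopt;  // an unexpired lock is held by someone else
    }
    Epoch next = latest ? latest->first + 1 : 0;
    if (store.put_if_absent(next, now + timeout)) {
        return next;  // we won the race for this epoch
    }
    return std::nullopt;  // another contender created the epoch first
}
```

The atomicity of the conditional put guarantees at most one winner per epoch, so two processes can never both believe they hold the lock for the same epoch.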

@IvoDD IvoDD force-pushed the reliable-storage-lock branch 3 times, most recently from 9853ee8 to 3b41a46 on November 27, 2024 18:26
Currently all backends apart from S3 are unsupported (and even for S3, only some providers such as AWS support the atomic operation).

Unfortunately it's impossible to differentiate between AWS and e.g. VAST backends apart from looking at the endpoint, which can be subject to rerouting etc.

This commit also reworks the Guard to work only with acquired locks.

// The ReliableStorageLock is a storage lock which relies on atomic If-None-Match Put and ListObject operations to
// provide a slower but more reliable lock than the StorageLock. It should be completely consistent unless a process
// holding a lock get's paused for times comparable to the lock timeout.
Collaborator

get's should be gets


}

#include "arcticdb/util/reliable_storage_lock.tpp"
Collaborator

We normally put all the templated definitions straight in the .hpp file. It's an arbitrary choice, so I think we should be consistent with the rest of the codebase here.

Collaborator

The other alternative is to put them in reliable_storage_lock-inl.hpp, which is how we usually indicate that really this is implementation but it's templated implementation. You can look at any of the other -inl.hpp files for the usual mechanism

Collaborator

As in normally we include them at the bottom of the header file, and ensure that they're not included anywhere else, that way you get file separation, but you don't need to know about the additional file in your cpp files, you just include the header as normal

template <class ClockType>
ReliableStorageLock<ClockType>::ReliableStorageLock(const std::string &base_name, const std::shared_ptr<Store> store, timestamp timeout) :
base_name_(base_name), store_(store), timeout_(timeout) {
auto s3_timeout = ConfigsMap::instance()->get_int("S3Storage.RequestTimeoutMs", 200000) * ONE_MILLISECOND;
Collaborator

It seems quite weird to have references to S3 in this layer of the code. Also, didn't we discover some quite odd properties of the S3 request timeout, in the sense that it does not prevent S3 requests taking longer than the value set, but instead defines a window of time in which we expect a certain number of bytes to be transferred?

Collaborator

Yeah we'll want to use it for azure etc which already has similar operations, it would be better to pass it in as a policy from the storage where we just expect a function that is void() that does whatever we need
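The policy idea floated here might look roughly like this (a sketch under assumed names; the point is that the backend supplies its own timeout rather than the lock reading `S3Storage.RequestTimeoutMs` itself):

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>

using timestamp = int64_t;

// The storage backend injects its own request-timeout policy, so this layer
// stays agnostic of S3, Azure, etc. (names here are illustrative).
struct StoragePolicy {
    std::function<timestamp()> request_timeout;  // backend-specific lookup
};

class ReliableStorageLockSketch {
public:
    ReliableStorageLockSketch(timestamp lock_timeout, const StoragePolicy& policy)
        // Illustrative use of the policy: e.g. ensure the effective lock
        // timeout covers at least one storage request.
        : timeout_(std::max(lock_timeout, policy.request_timeout())) {}
    timestamp timeout() const { return timeout_; }

private:
    timestamp timeout_;
};
```

An S3 backend would then construct the lock with a policy reading its own config, and an Azure backend with its equivalent, without this file naming either.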


namespace lock {

using Epoch = uint64_t;
Collaborator

I know it's the nomenclature that the blog used, but I find this Epoch name really confusing. I think most people would expect it to be some sort of timestamp, but it's a counter (and then we have timestamps in the segment itself, to add to the confusion).

return;
}
auto lock_stream_id = get_stream_id(held_lock_epoch);
auto expiration = ClockType::nanos_since_epoch(); // Write current time to mark lock as expired as of now
Collaborator

Managing the expiry with this process' clock isn't right is it? The blog uses the last modification time on S3. This is probably something we can improve afterwards.

Clock::time_ = 10;
ASSERT_EQ(lock2.try_take_lock(), std::nullopt);
Clock::time_ = 19;
ASSERT_EQ(lock1.try_take_lock(), std::nullopt);
Collaborator

The std::nullopt assertions should all be for both lock1 and lock2, right? We should document that the lock API is not re-entrant

that->lock_lost_ = true;
});
auto value_before_sleep = cnt_;
// std::cout<<"Taken a lock with "<<value_before_sleep<<std::endl;
Collaborator

Remove the print statements

{
"name": "aws-sdk-cpp",
"$version reason": "Minimum version in the baseline that works with aws-c-io above.",
"version>=": "1.11.405",
Collaborator

Why isn't the version down in the overrides section like all the other packages?


TEST(ReliableStorageLock, StressMultiThreaded) {
// It is hard to use a piloted clock for these tests because the folly::FunctionScheduler we use for the lock
// extensions doesn't support a custom clock. Thus this test will need to run for about 2 minutes.
Collaborator

It's useful to keep test_unit_arcticdb running fast. Might be better to add this stress test to a new cmake target. Can be done as a follow up.


read_df = lib.read(symbol).data
expected_df = pd.DataFrame({"col": [num_processes]})
assert_frame_equal(read_df, expected_df)
Collaborator

Also assert on the version number after all these writes?


const auto EXTENDS_PER_TIMEOUT = 5u;

inline StreamDescriptor lock_stream_descriptor(const StreamId &stream_id) {
return StreamDescriptor{stream_descriptor(
Collaborator

I don't think it needs the other constructor as the function already returns a stream descriptor


}

template <class ClockType>
StreamId ReliableStorageLock<ClockType>::get_stream_id(Epoch e) const {
Collaborator

e is commonly used for error, no harm in calling it 'epoch'
