Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add common logic for creating and cleaning buckets in tests #277

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it more like an infrastructure thing? It's for developers, not the actual users of the SDK, right?

* Wrapper for B2Api class which can be used for test purposes

## [1.21.0] - 2023-04-17

### Added
Expand Down
9 changes: 9 additions & 0 deletions b2sdk/_test_manager/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
######################################################################

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this directory named with a leading underscore? I don't believe I saw this notation anywhere else. Have you considered doing another package, if that was supposed to be installed with [dev]?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it also seems to me that creating a separate package would be a cleaner approach, something like b2sdk-test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a moments's thought, I concluded that we could actually go down the most common path I've encountered, which is simply package the test in the main package, so it would be b2sdk/test. Numpy, pandas or django use such a patter. What do you think?

#
# File: b2sdk/_test_manager/__init__.py
#
# Copyright 2023 Backblaze Inc. All Rights Reserved.
#
# License https://www.backblaze.com/using_b2_code.html
#
######################################################################
188 changes: 188 additions & 0 deletions b2sdk/_test_manager/api.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
######################################################################
#
# File: b2sdk/_test_manager/api.py
#
# Copyright 2023 Backblaze Inc. All Rights Reserved.
#
# License https://www.backblaze.com/using_b2_code.html
#
######################################################################
import contextlib
import random
import string
import time

from datetime import datetime
from os import environ
from typing import Union

import backoff

from .bucket_tracking import BucketTrackingMixin
from .._v3.exception import BucketIdNotFound as v3BucketIdNotFound
from ..v2 import NO_RETENTION_FILE_SETTING, B2Api, Bucket, InMemoryAccountInfo, InMemoryCache, LegalHold, RetentionMode
from ..v2.exception import BucketIdNotFound, DuplicateBucketName, FileNotPresent, TooManyRequests, NonExistentBucket

ONE_HOUR_MILLIS = 60 * 60 * 1000
ONE_DAY_MILLIS = ONE_HOUR_MILLIS * 24

BUCKET_NAME_LENGTH = 50
BUCKET_NAME_CHARS = string.ascii_letters + string.digits + '-'

BUCKET_NAME_PREFIX = 'b2tst'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a hard limit on the length of the bucket name that we're hitting? If not, why not name it b2test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently after using the uuid it will normally be b2test-{uuid4} :) As far as i know the maximum limit is 63 char


# RUNNER_NAME is the only variable exposed by the GitHub CI that was changing for each matrix entry.
# Example values are "GitHub Actions N" (with N being a whole number, starting from 2) and "Hosted Agent".
# Here, we're using these names as long as time as seeds to start the random number generator.
# Name fraction is used for runners inside the same matrix, time fraction is used for runners in different runs.
# To avoid collision when the same runners are fired in different commits at the same time we also use GITHUB_SHA
random.seed(
environ.get('RUNNER_NAME', 'local') + environ.get('GITHUB_SHA', 'local') + str(time.time_ns())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we skip all that and use str(uuid.uuid4())? I've checked and Python library uses system uuid.h which in turns is said to be good enough to provide random value. This would remove reliance on the clock entirely.
Also, worth noting that SystemRandom would be nice, but I'm not sure whether it's even available on Windows :/
What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes in the new version I got rid of seed and used uui4, we will see how it goes

)


def generate_bucket_name() -> str:
suffix_length = BUCKET_NAME_LENGTH - len(BUCKET_NAME_PREFIX)
suffix = ''.join(random.choice(BUCKET_NAME_CHARS) for _ in range(suffix_length))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When thinking about it... Wouldn't uuid4 be good enough of a replacement? It contains letters, digits and hyphens. It is said to be "good enough" for a unique identifier for past, present and the future by Debian man page: https://manpages.debian.org/jessie/uuid-dev/uuid_generate_time_safe.3.en.html and I quote:

The UUID is 16 bytes (128 bits) long, which gives approximately 3.4x10^38 unique values
(there are approximately 10^80 elementary particles in the universe according to Carl
Sagan's Cosmos). The new UUID can reasonably be considered unique among all
UUIDs created on the local system, and among UUIDs created on other systems
in the past and in the future.

What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all in all, I assumed in advance that the uuid had been tested before and there were some problems with it :) But yes I would give uuid a chance

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly, I cannot neither confirm nor dismiss this assumption. Let's try with uuid.

return f"{BUCKET_NAME_PREFIX}{suffix}"


def current_time_millis() -> int:
return int(round(time.time() * 1000))


class Api(BucketTrackingMixin, B2Api):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about a name that would reflect the information that it's only for testing purposes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ApiTestManager?

"""
B2Api wrapper which should only be used for testing purposes!
"""

def __init__(self, account_id: str, application_key: str, realm: str, *args, **kwargs):
info = InMemoryAccountInfo()
cache = InMemoryCache()
super().__init__(info, cache=cache, *args, **kwargs)
self.authorize_account(realm, account_id, application_key)

@backoff.on_exception(
backoff.constant,
DuplicateBucketName,
max_tries=8,
)
def create_test_bucket(self, bucket_type="allPublic", **kwargs) -> Bucket:
bucket_name = generate_bucket_name()
print('Creating bucket:', bucket_name)
try:
return self.create_bucket(bucket_name, bucket_type, **kwargs)
except DuplicateBucketName:
self._duplicated_bucket_name_debug_info(bucket_name)
raise

@backoff.on_exception(
backoff.expo,
TooManyRequests,
max_tries=8,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to extract file removal as a function/method and put a backoff on it? What do you think?

)
def clean_bucket(self, bucket: Union[Bucket, str]) -> None:
if isinstance(bucket, str):
bucket = self.get_bucket_by_name(bucket)

files_leftover = False
file_versions = bucket.ls(latest_only=False, recursive=True)

for file_version_info, _ in file_versions:
if file_version_info.file_retention:
if file_version_info.file_retention.mode == RetentionMode.GOVERNANCE:
print('Removing retention from file version:', file_version_info.id_)
self.update_file_retention(
file_version_info.id_, file_version_info.file_name,
NO_RETENTION_FILE_SETTING, True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it probably came from previous version, but would you mind using kwargs notation for literals, please? This True is not conveying and information on its own, but with parameter name it would be pretty nice.

)
elif file_version_info.file_retention.mode == RetentionMode.COMPLIANCE:
if file_version_info.file_retention.retain_until > current_time_millis(): # yapf: disable
print(
'File version: %s cannot be removed due to compliance mode retention' %
(file_version_info.id_,)
)
files_leftover = True
continue
elif file_version_info.file_retention.mode == RetentionMode.NONE:
pass
else:
raise ValueError(
'Unknown retention mode: %s' % (file_version_info.file_retention.mode,)
)
if file_version_info.legal_hold.is_on():
print('Removing legal hold from file version:', file_version_info.id_)
self.update_file_legal_hold(
file_version_info.id_, file_version_info.file_name, LegalHold.OFF
)
print('Removing file version:', file_version_info.id_)
try:
self.delete_file_version(file_version_info.id_, file_version_info.file_name)
except FileNotPresent:
print(
'It seems that file version %s has already been removed' %
(file_version_info.id_,)
)

if files_leftover:
print('Unable to remove bucket because some retained files remain')
else:
print('Removing bucket:', bucket.name)
try:
self.delete_bucket(bucket)
except BucketIdNotFound:
print('It seems that bucket %s has already been removed' % (bucket.name,))
print()

def clean_buckets(self) -> None:
for bucket in self.buckets:
with contextlib.suppress(BucketIdNotFound, v3BucketIdNotFound, NonExistentBucket):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per chance, could you explain to me the difference between v2 and v3 versions of this exception? My memory is foggy and I was wondering whether we could just use one of them.

self.clean_bucket(bucket)
self.buckets = []

def clean_all_buckets(self) -> None:
buckets = self.list_buckets()
print(f'Total bucket count: {len(buckets)}')

for bucket in buckets:
if not bucket.name.startswith(BUCKET_NAME_PREFIX):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want an operation like that? What about other runners running in parallel on the same set of keys?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, this function is intended as a kind of garbage collector after all the tests have been performed in order to collect the "leftovers". You are right about if several commits run the tests there may be unwanted deletions. It seems to me that it is enough here to add the first 10-15 characters from the SHA of the commit to our BUCKET_NAME_PREFIX. This way we will be sure that after all finished runs if there are any buckets left the garbage collector will remove them. What do you think?

print(f'Skipping bucket removal: "{bucket.name}"')
continue

print(f'Removing bucket: "{bucket.name}"')
try:
self.clean_bucket(bucket)
except (BucketIdNotFound, v3BucketIdNotFound):
print(f'It seems that bucket "{bucket.name}" has already been removed')

buckets = self.list_buckets()
print(f'Total bucket count after cleanup: {len(buckets)}')
for bucket in buckets:
print(bucket)

def count_and_print_buckets(self) -> int:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple search provides me with no usages of this function. Can you point me to an actual usage, please?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was intended to be a bit of statistic in total as you write in the comment below also we can put this function into cleanup_buckets function to always track how many buckets have been created when closing tests

buckets = self.list_buckets()
count = len(buckets)
print(f'Total bucket count at {datetime.now()}: {count}')
for i, bucket in enumerate(buckets, start=1):
print(f'- {i}\t{bucket.name} [{bucket.id_}]')
return count

def _duplicated_bucket_name_debug_info(self, bucket_name: str) -> None:
# Trying to obtain as much information as possible about this bucket.
print(' DUPLICATED BUCKET DEBUG START '.center(60, '='))
bucket = self.get_bucket_by_name(bucket_name)

print('Bucket metadata:')
bucket_dict = bucket.as_dict()
for info_key, info in bucket_dict.items():
print('\t%s: "%s"' % (info_key, info))

print('All files (and their versions) inside the bucket:')
ls_generator = bucket.ls(recursive=True, latest_only=False)
for file_version, _directory in ls_generator:
# as_dict() is bound to have more info than we can use,
# but maybe some of it will cast some light on the issue.
print('\t%s (%s)' % (file_version.file_name, file_version.as_dict()))

print(' DUPLICATED BUCKET DEBUG END '.center(60, '='))
32 changes: 32 additions & 0 deletions b2sdk/_test_manager/bucket_tracking.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
######################################################################
#
# File: b2sdk/_test_manager/bucket_tracking.py
#
# Copyright 2023 Backblaze Inc. All Rights Reserved.
#
# License https://www.backblaze.com/using_b2_code.html
#
######################################################################
from b2sdk.v2 import Bucket


class BucketTrackingMixin:
"""
Mixin class for B2Api, which enables bucket tracking.
This mixin will add a `buckets` member to the B2Api instance and will use it track created and
deleted buckets. The main purpose of this are tests -- the `buckets` member can be used in test
teardown to ensure proper bucket cleanup.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a statistic of how many buckets in total is created in tests and how many buckets exists at once when tests are ran? This could help us to determine whether we can use -xdist to parallelise tests or to run several workflows in parallel.

"""

def __init__(self, *args, **kwargs):
self.buckets = []
super().__init__(*args, **kwargs)

def create_bucket(self, name: str, *args, **kwargs) -> Bucket:
bucket = super().create_bucket(name, *args, **kwargs)
self.buckets.append(bucket)
return bucket

def delete_bucket(self, bucket: Bucket):
super().delete_bucket(bucket)
self.buckets = [b for b in self.buckets if b.id_ != bucket.id_]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know there's only 100 buckets tops. But I still find this total rewrite of a list kinda ugly... But maybe that's just me?
Btw – doesn't our account info / cache hold this information already? Can't we just extract it from one of these places? Or is it just too convoluted?

4 changes: 2 additions & 2 deletions noxfile.py
Original file line number Diff line number Diff line change
Expand Up @@ -134,15 +134,15 @@ def unit(session):
@nox.session(python=PYTHON_VERSIONS)
def integration(session):
"""Run integration tests."""
install_myself(session)
install_myself(session, ['dev'])
session.run('pip', 'install', *REQUIREMENTS_TEST)
session.run('pytest', '-s', *session.posargs, 'test/integration')


@nox.session(python=PYTHON_DEFAULT_VERSION)
def cleanup_old_buckets(session):
"""Remove buckets from previous test runs."""
install_myself(session)
install_myself(session, ['dev'])
session.run('pip', 'install', *REQUIREMENTS_TEST)
session.run('python', '-m', 'test.integration.cleanup_buckets')

Expand Down
1 change: 1 addition & 0 deletions requirements-dev.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
backoff>=1.4.0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding also an upper limit on the library version?

5 changes: 4 additions & 1 deletion setup.py
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,10 @@ def read_requirements(extra=None):
# dependencies). You can install these using the following syntax,
# for example:
# $ pip install -e .[dev,test]
extras_require={'doc': read_requirements('doc')},
extras_require={
'doc': read_requirements('doc'),
'dev': read_requirements('dev')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comma on the end of this line, please. This way, if another line is added, this line is not changed. I believe it's explained in the handbook.

},
setup_requires=['setuptools_scm<6.0'],
use_scm_version=True,

Expand Down
7 changes: 6 additions & 1 deletion test/integration/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,9 +8,10 @@
#
######################################################################
import os
from typing import Tuple


def get_b2_auth_data():
def get_b2_auth_data() -> Tuple[str, str]:
kkalinowski-reef marked this conversation as resolved.
Show resolved Hide resolved
application_key_id = os.environ.get('B2_TEST_APPLICATION_KEY_ID')
if application_key_id is None:
raise ValueError('B2_TEST_APPLICATION_KEY_ID is not set.')
Expand All @@ -19,3 +20,7 @@ def get_b2_auth_data():
if application_key is None:
raise ValueError('B2_TEST_APPLICATION_KEY is not set.')
return application_key_id, application_key


def get_realm() -> str:
return os.environ.get('B2_TEST_ENVIRONMENT', 'production')
64 changes: 7 additions & 57 deletions test/integration/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,10 +13,7 @@

import pytest

from b2sdk.v2 import current_time_millis
from b2sdk.v2.exception import DuplicateBucketName
from .bucket_cleaner import BucketCleaner
from .helpers import GENERAL_BUCKET_NAME_PREFIX, BUCKET_NAME_LENGTH, BUCKET_CREATED_AT_MILLIS, bucket_name_part, authorize
from b2sdk._test_manager.api import Api


class IntegrationTestBase:
Expand All @@ -26,30 +23,11 @@ def set_http_debug(self):
http.client.HTTPConnection.debuglevel = 1

@pytest.fixture(autouse=True)
def save_settings(self, dont_cleanup_old_buckets, b2_auth_data):
type(self).dont_cleanup_old_buckets = dont_cleanup_old_buckets

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this flag had some logic behind it. Could you tell me, in short, if the new code "has something similar" or was it just not needed, please?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having changed the logic for removing bucket (using mixin) it seem to me that we no longer need to keep this flag

type(self).b2_auth_data = b2_auth_data

@classmethod
def setup_class(cls):
cls.this_run_bucket_name_prefix = GENERAL_BUCKET_NAME_PREFIX + bucket_name_part(8)
kkalinowski-reef marked this conversation as resolved.
Show resolved Hide resolved

@classmethod
def teardown_class(cls):
BucketCleaner(
cls.dont_cleanup_old_buckets,
*cls.b2_auth_data,
current_run_prefix=cls.this_run_bucket_name_prefix
).cleanup_buckets()

@pytest.fixture(autouse=True)
def setup_method(self):
self.b2_api, self.info = authorize(self.b2_auth_data)

def generate_bucket_name(self):
return self.this_run_bucket_name_prefix + bucket_name_part(
BUCKET_NAME_LENGTH - len(self.this_run_bucket_name_prefix)
)
def setup_method(self, b2_auth_data, realm):
self.b2_api = Api(*b2_auth_data, realm)
self.info = self.b2_api.account_info
yield
self.b2_api.clean_buckets()

def write_zeros(self, file, number):
line = b'0' * 1000 + b'\n'
Expand All @@ -60,32 +38,4 @@ def write_zeros(self, file, number):
written += line_len

def create_bucket(self):
bucket_name = self.generate_bucket_name()
try:
return self.b2_api.create_bucket(
bucket_name,
'allPublic',
bucket_info={BUCKET_CREATED_AT_MILLIS: str(current_time_millis())}
)
except DuplicateBucketName:
self._duplicated_bucket_name_debug_info(bucket_name)
raise

def _duplicated_bucket_name_debug_info(self, bucket_name: str) -> None:
# Trying to obtain as much information as possible about this bucket.
print(' DUPLICATED BUCKET DEBUG START '.center(60, '='))
bucket = self.b2_api.get_bucket_by_name(bucket_name)

print('Bucket metadata:')
bucket_dict = bucket.as_dict()
for info_key, info in bucket_dict.items():
print('\t%s: "%s"' % (info_key, info))

print('All files (and their versions) inside the bucket:')
ls_generator = bucket.ls(recursive=True, latest_only=False)
for file_version, _directory in ls_generator:
# as_dict() is bound to have more info than we can use,
# but maybe some of it will cast some light on the issue.
print('\t%s (%s)' % (file_version.file_name, file_version.as_dict()))

print(' DUPLICATED BUCKET DEBUG END '.center(60, '='))
return self.b2_api.create_test_bucket()
Loading