Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MDB-27106] Use system.remote_data_paths table for cleaning S3 #94

Merged

Conversation

aalexfvk
Copy link
Contributor

@aalexfvk aalexfvk commented Jan 25, 2024

Use system.remote_data_paths table for enumerating objects known to ClickHouse.
Enumerated objects from S3 is placed into table in ClickHouse. And selecting of objects presented only on S3 is performed by ANTIJOIN query with system.remote_data_paths table.

TODO (not implemented in the PR):

  1. Counting total size of objects
  2. Improve embedded ClickHouse client to make streaming INSERTs.
  3. Add additional checks for incomplete deletion (requests might not throw an error while disconnecting while stream request)

Note:
PR is big due to updating of poetry.lock.

@aalexfvk aalexfvk force-pushed the MDB-27106_Use_remote_data_paths_system_table_for_cleaning_s3 branch 5 times, most recently from e5347d7 to 2211ce4 Compare January 26, 2024 10:16
@aalexfvk aalexfvk force-pushed the MDB-27106_Use_remote_data_paths_system_table_for_cleaning_s3 branch from 140bb8e to 9df88c1 Compare January 26, 2024 14:29
@aalexfvk aalexfvk changed the title [MDB-27106] Use remote_data_paths system table for cleaning S3 [MDB-27106] Use system.remote_data_paths table for cleaning S3 Jan 26, 2024
pyproject.toml Outdated Show resolved Hide resolved
tests/features/object_storage.feature Outdated Show resolved Hide resolved
tests/features/object_storage.feature Outdated Show resolved Hide resolved
tests/environment.py Outdated Show resolved Hide resolved
Makefile Outdated Show resolved Hide resolved
try:
execute_query(
ctx,
f"CREATE TABLE IF NOT EXISTS {listing_table} (obj_path String) ENGINE MergeTree ORDER BY obj_path",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what database this table will be created? Suggest to make it configurable through ch-tools configuration file (https://github.com/yandex/ch-tools/blob/master/ch_tools/common/config.py).

Copy link
Contributor Author

@aalexfvk aalexfvk Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. Done it.
It makes sense when we have different listing tables for different clusters/hosts, but script command remains the same.

default=False,
help="List objects that are not referenced in the metadata.",
)
@object_storage_group.command("clean")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for removing the list command? afaik, we have no alternative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This command is useless for now. Before it was intended for generating a list of orpahned objects to stdout for reading by another cleaning script. But now we do this in one clean command.
Nevertheless, we can reproduce this functionality for now by chadmin object-storage clean --keep-paths --dry-run.

for _, obj in s3_object_storage_iterator(
ctx.obj["disk_configuration"], object_name_prefix=prefix
):
if obj.last_modified > now - to_time:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the knobs description:

"Objects with a modification time falling interval [now - from_time, now - to_time] are considered.

But here now-to_time is the lower bound.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is an upper bound of an interval

if obj.last_modified > now - to_time:
continue

Actually we filter objects newer than now - 24h point in time and pass objects created more than 24h ago.

@aalexfvk aalexfvk force-pushed the MDB-27106_Use_remote_data_paths_system_table_for_cleaning_s3 branch 2 times, most recently from 1c0d66b to 3dfdae2 Compare January 30, 2024 09:31
@aalexfvk aalexfvk force-pushed the MDB-27106_Use_remote_data_paths_system_table_for_cleaning_s3 branch from 3dfdae2 to a486698 Compare January 30, 2024 10:06
Copy link
Contributor

@MikhailBurdukov MikhailBurdukov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the rest looks good

file = GzipFile(fileobj=file)
file = TextIOWrapper(file)
obj_paths_batch = []
counter = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added info message with it

@aalexfvk aalexfvk force-pushed the MDB-27106_Use_remote_data_paths_system_table_for_cleaning_s3 branch from 4474325 to 3798e77 Compare January 30, 2024 15:04
@aalexfvk aalexfvk merged commit 89b33d8 into main Jan 30, 2024
15 of 16 checks passed
@aalexfvk aalexfvk deleted the MDB-27106_Use_remote_data_paths_system_table_for_cleaning_s3 branch January 30, 2024 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants