[MDB-27106] Use system.remote_data_paths table for cleaning S3 #94
e5347d7 to 2211ce4
140bb8e to 9df88c1
try:
    execute_query(
        ctx,
        f"CREATE TABLE IF NOT EXISTS {listing_table} (obj_path String) ENGINE MergeTree ORDER BY obj_path",
In which database will this table be created? I suggest making it configurable through the ch-tools configuration file (https://github.com/yandex/ch-tools/blob/master/ch_tools/common/config.py).
Yes, good point. Done.
It makes sense when different clusters/hosts use different listing tables while the script command stays the same.
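As a rough illustration of the suggestion above, the listing table's database could be read from configuration before the CREATE TABLE statement is built. The config keys (`object_storage`, `clean`, `listing_table_database`, `listing_table`) and their defaults are hypothetical and not taken from ch-tools; only the CREATE TABLE statement mirrors the snippet under review.

```python
from typing import Any, Dict


def listing_table_name(config: Dict[str, Any]) -> str:
    """Build a fully qualified listing table name from a config dict.

    The key names and defaults below are assumptions made for this sketch.
    """
    section = config.get("object_storage", {}).get("clean", {})
    database = section.get("listing_table_database", "default")
    table = section.get("listing_table", "s3_listing")
    return f"{database}.{table}"


def create_listing_table_query(config: Dict[str, Any]) -> str:
    """Return the CREATE TABLE statement for the configured listing table."""
    listing_table = listing_table_name(config)
    return (
        f"CREATE TABLE IF NOT EXISTS {listing_table} (obj_path String) "
        "ENGINE MergeTree ORDER BY obj_path"
    )


# Two hosts can run the same command but end up with different tables.
host_a = {"object_storage": {"clean": {"listing_table_database": "maintenance"}}}
print(create_listing_table_query(host_a))
print(create_listing_table_query({}))
```

With this layout the command line stays identical across hosts, and only the configuration file decides where the listing table lives.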
    default=False,
    help="List objects that are not referenced in the metadata.",
)
@object_storage_group.command("clean")
What is the reason for removing the list command? AFAIK, we have no alternative.
This command is no longer useful. Previously it printed a list of orphaned objects to stdout so that another cleaning script could read it. Now the clean command does both steps itself.
If needed, the old behavior can still be reproduced with chadmin object-storage clean --keep-paths --dry-run.
for _, obj in s3_object_storage_iterator(
    ctx.obj["disk_configuration"], object_name_prefix=prefix
):
    if obj.last_modified > now - to_time:
The knobs description says:
"Objects with a modification time falling in the interval [now - from_time, now - to_time] are considered."
But here now - to_time is the lower bound.
It is the upper bound of the interval:
ch-tools/ch_tools/chadmin/cli/object_storage_group.py
Lines 238 to 239 in 3dfdae2
if obj.last_modified > now - to_time:
    continue
In effect, we filter out objects newer than the now - 24h point in time and let through objects created more than 24h ago.
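The interval check discussed above can be sketched as a small predicate. This is only an illustration: the function name `in_cleanup_window` is hypothetical, and treating `from_time` as optional is an assumption; only the `last_modified > now - to_time` comparison comes from the quoted lines.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def in_cleanup_window(
    last_modified: datetime,
    now: datetime,
    from_time: Optional[timedelta],
    to_time: timedelta,
) -> bool:
    """Return True if last_modified lies in [now - from_time, now - to_time]."""
    # Upper bound: objects newer than now - to_time are skipped.
    if last_modified > now - to_time:
        return False
    # Lower bound (assumed optional here): objects older than now - from_time are skipped.
    if from_time is not None and last_modified < now - from_time:
        return False
    return True


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
day = timedelta(hours=24)
week = timedelta(days=7)

print(in_cleanup_window(now - timedelta(hours=1), now, week, day))   # too new
print(in_cleanup_window(now - timedelta(days=2), now, week, day))    # inside the window
print(in_cleanup_window(now - timedelta(days=30), now, week, day))   # too old
```

Reading `now - to_time` as a point on the time axis makes it clear why it is the upper bound: a larger `to_time` moves that point further into the past.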
1c0d66b to 3dfdae2
3dfdae2 to a486698
the rest looks good
file = GzipFile(fileobj=file)
file = TextIOWrapper(file)
obj_paths_batch = []
counter = 0
Unused?
Added an info message that uses it.
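For context, a counter like this is typically used to report progress while a gzipped listing is read in batches. The sketch below is an assumption about how that might look, not the PR's code: the function name `iter_path_batches`, the batch size, and the log message are all hypothetical, while the `GzipFile` and `TextIOWrapper` wrapping follows the quoted lines.

```python
import gzip
import io
import logging
from gzip import GzipFile
from io import TextIOWrapper
from typing import BinaryIO, Iterator, List


def iter_path_batches(raw: BinaryIO, batch_size: int) -> Iterator[List[str]]:
    """Yield object paths from a gzipped, newline-separated stream in batches."""
    file = TextIOWrapper(GzipFile(fileobj=raw))
    obj_paths_batch: List[str] = []
    counter = 0
    for line in file:
        obj_paths_batch.append(line.rstrip("\n"))
        counter += 1
        if len(obj_paths_batch) >= batch_size:
            yield obj_paths_batch
            obj_paths_batch = []
    # Flush the last, possibly incomplete batch.
    if obj_paths_batch:
        yield obj_paths_batch
    # The counter is only used for reporting.
    logging.info("Read %s object paths from the listing", counter)


payload = gzip.compress(b"data/a\ndata/b\ndata/c\n")
for batch in iter_path_batches(io.BytesIO(payload), batch_size=2):
    print(batch)
```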
4474325 to 3798e77
Use the system.remote_data_paths table to enumerate objects known to ClickHouse.
Objects enumerated from S3 are placed into a table in ClickHouse, and the objects present only on S3 are selected with an ANTI JOIN query against the system.remote_data_paths table.
TODO (not implemented in the PR):
requests might not throw an error when disconnected during a streaming request)
Note: the PR is big because poetry.lock was updated.
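The anti-join selection described above can be sketched as follows. The actual query in the PR is not shown here, so this is an assumption: the listing table name is hypothetical, and matching `obj_path` against the `remote_path` column of system.remote_data_paths presumes both hold keys in the same format. The `anti_join` helper only emulates the query's semantics in plain Python.

```python
from typing import Iterable, List


def build_orphaned_objects_query(listing_table: str) -> str:
    """Return a query selecting listed paths that ClickHouse does not reference."""
    return (
        f"SELECT l.obj_path FROM {listing_table} AS l "
        "LEFT ANTI JOIN system.remote_data_paths AS r "
        "ON l.obj_path = r.remote_path"
    )


def anti_join(listed_paths: Iterable[str], known_paths: Iterable[str]) -> List[str]:
    """Emulate the anti join: keep listed paths with no match among known paths."""
    known = set(known_paths)
    return [path for path in listed_paths if path not in known]


# Paths found on S3 versus paths ClickHouse knows about (made-up values).
on_s3 = ["data/a", "data/b", "data/c"]
known_to_clickhouse = ["data/b"]

print(build_orphaned_objects_query("default.s3_listing"))
print(anti_join(on_s3, known_to_clickhouse))
```

Doing the comparison inside ClickHouse avoids holding either path set in the client's memory, which matters when a bucket contains many objects.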