
CountItems to consider MPUs in storage metrics #333

Open

wants to merge 3 commits into base: development/1.14
Conversation

williamlardier (Contributor)

  • Sum all MPU parts' sizes into the current metric for the associated bucket
  • Follow the same logic as for regular objects to ensure we properly map the parts to the right accounts
  • Use the overview keys to count the number of incomplete/pending objects

The logic to process the object was kept inline in the function, as the number of accessed variables is high; this avoids unnecessary complexity. The function is already unit-tested and will hopefully be dropped sooner or later...

Issue: S3UTILS-186
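The approach above (part sizes count toward the bucket's bytes, but only overview keys count as objects) can be sketched roughly as follows. This is an illustrative sketch only: the function and field names (`consolidate`, `isMPUOverviewKey`, `incompleteMPUBytes`, ...) are hypothetical, not the actual s3utils implementation.

```javascript
// Hypothetical sketch of the counting rule: MPU part sizes are added to
// the bucket's current size, but only overview keys increment the
// object count, so parts are never double-counted as objects.
function consolidate(metrics, entry) {
    if (entry.isMPUOverviewKey) {
        // One overview key == one incomplete upload: count it as an object.
        metrics.objectCount += 1;
        metrics.incompleteMPUCount += 1;
    } else if (entry.isMPUPart) {
        // Parts carry the actual stored bytes, but are not counted as objects.
        metrics.currentBytes += entry.size;
        metrics.incompleteMPUBytes += entry.size;
    } else {
        // Regular object: counts toward both size and object count.
        metrics.objectCount += 1;
        metrics.currentBytes += entry.size;
    }
    return metrics;
}

const metrics = { objectCount: 0, currentBytes: 0, incompleteMPUCount: 0, incompleteMPUBytes: 0 };
[
    { isMPUOverviewKey: true, size: 0 },
    { isMPUPart: true, size: 5 * 1024 * 1024 },
    { isMPUPart: true, size: 5 * 1024 * 1024 },
    { size: 1024 },
].forEach(entry => consolidate(metrics, entry));
// metrics now reports 2 objects (1 regular + 1 incomplete upload)
// and 10 MiB + 1 KiB of current bytes.
```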

- The MPU parts' size is included in the current size of the bucket
- Only the count of overview keys is included in the object count
- The metrics are detailed in a dedicated field
- The getObjectMDStats function is updated: the logic to process
  each cursor entry is shared, and MPU entries are processed
  in the same way as regular objects, with some specifics.

Issue: S3UTILS-186
@bert-e (Contributor) commented Dec 23, 2024

Hello williamlardier,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
  /after_pull_request: Wait for the given pull request id to be merged before continuing with the current one.
  /bypass_author_approval: Bypass the pull request author's approval
  /bypass_build_status: Bypass the build and test status
  /bypass_commit_size: Bypass the check on the size of the changeset (TBA)
  /bypass_incompatible_branch: Bypass the check on the source branch prefix
  /bypass_jira_check: Bypass the Jira issue check
  /bypass_peer_approval: Bypass the pull request peers' approval
  /bypass_leader_approval: Bypass the pull request leaders' approval
  /approve: Instruct Bert-E that the author has approved the pull request. ✍️
  /create_pull_requests: Allow the creation of integration pull requests.
  /create_integration_branches: Allow the creation of integration branches.
  /no_octopus: Prevent Wall-E from doing any octopus merge and use multiple consecutive merges instead
  /unanimity: Change review acceptance criteria from at least one reviewer to all reviewers
  /wait: Instruct Bert-E not to run until further notice.

Available commands
  /help: Print Bert-E's manual in the pull request.
  /status: Print Bert-E's current status in the pull request (TBA)
  /clear: Remove all comments from Bert-E from the history (TBA)
  /retry: Re-start a fresh build (TBA)
  /build: Re-start a fresh build (TBA)
  /force_reset: Delete integration branches & pull requests, and restart the merge process from the beginning.
  /reset: Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e (Contributor) commented Dec 23, 2024

Incorrect fix version

The Fix Version/s in issue S3UTILS-186 contains:

  • None

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 1.14.17

  • 1.15.7

Please check the Fix Version/s of S3UTILS-186, or the target
branch of this pull request.

Comment on lines 426 to 431
return callback(err);
}
const retResult = this._handleResults(collRes, isVer);
retResult.stalled = stalledCount;
return callback(null, retResult);
},
Contributor

multiple calls to callback: after processing both the cursor and the MPU cursor, and possibly once more after the in-flight processing...

...which may actually not be the issue: looking at the MongoDB driver documentation, I don't see any mention of a second callback, so I am thinking this may be a left-over from the upgrade to promises, and this callback should just be removed (errors will raise an exception, caught eventually; and _handleResults is called in the normal case at line 434)

Contributor Author

I will remove the err callbacks; they are indeed not used with MongoDB driver v5, so there is no impact
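The point made above (a promise-based cursor has a single completion path, so the extra error callback is dead code) can be sketched as follows. This is a minimal illustration, not the actual getObjectMDStats code; the cursor is stubbed so the sketch runs without a database, assuming only that the real cursor's `forEach` returns a promise, as in MongoDB driver v5.

```javascript
// With a promise-based cursor, iteration errors reject the promise, so
// the old `callback(err)` invocation has no purpose and can be removed.
async function processCursor(cursor, handleEntry) {
    // forEach resolves once the cursor is exhausted and rejects on error:
    // one completion path instead of multiple callback invocations.
    await cursor.forEach(handleEntry);
}

// Stub mimicking the driver's async forEach (hypothetical, for the demo).
function fakeCursor(docs) {
    return {
        async forEach(fn) {
            docs.forEach(fn);
        },
    };
}

const sizes = [];
processCursor(fakeCursor([{ size: 1 }, { size: 2 }]), doc => sizes.push(doc.size))
    .then(() => {
        // Normal completion path: aggregate results here, exactly once.
    })
    .catch(err => {
        // Single error path, replacing the removed callback(err).
        console.error(err);
    });
```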

collRes.account[account].locations[location][targetCount]++;
collRes.account[account].locations[location].deleteMarkerCount += res.value.isDeleteMarker ? 1 : 0;
collRes[metricLevel][resourceName][targetData] += data[metricLevel][resourceName];
// Do not count the MPU parts as objects
@francoisferrand (Contributor) commented Dec 26, 2024

not an issue in this PR, but worth considering for the future: eventually, do we want to report MPUs by number of uploads (i.e. "potential objects") or by actual parts (which are stored internally as separate documents, and each actually references some data)...

thinking about this,

  • it may actually be better to count each part as an object: that way we can also report left-over parts even when the overview key is missing. The semantics may not be as good (part vs object), but I'd rather report something that more closely matches the storage than the user's business logic for now, and thus not mask any issue.
  • as far as semantics go, mpuPartsCount should be a number of parts: if we count only the overview keys, it should be named something like mpuUploadsCount instead (here and in other field names)

A compromise may be to count (and store) partsCount, uploadsCount and partsSize, but only aggregate uploadsCount into the object count and partsSize into the object size. But I am not sure it is worth the extra effort: it may be best to just keep it "simple", counting and measuring individual parts and reporting them like objects...
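The compromise described in this comment (track all three counters, but fold only uploads into the object count and only bytes into the object size) could be sketched as below. The field names (`partsCount`, `uploadsCount`, `partsSize`) follow the comment's wording and are not actual s3utils fields.

```javascript
// Hypothetical compromise: store parts, uploads, and bytes separately,
// but surface only uploads in objectCount and only bytes in objectBytes,
// so the reported size and count stay correlated (1 upload == 1 "object").
function aggregateMPU(metrics, mpu) {
    metrics.mpu.partsCount += mpu.partsCount;
    metrics.mpu.uploadsCount += 1;
    metrics.mpu.partsSize += mpu.partsSize;
    // Only the upload count and total size reach the object-level metrics.
    metrics.objectCount += 1;
    metrics.objectBytes += mpu.partsSize;
    return metrics;
}

const metrics = {
    objectCount: 0,
    objectBytes: 0,
    mpu: { partsCount: 0, uploadsCount: 0, partsSize: 0 },
};
aggregateMPU(metrics, { partsCount: 3, partsSize: 300 });
aggregateMPU(metrics, { partsCount: 2, partsSize: 200 });
// metrics.objectCount is 2 (one per upload), while the part-level
// detail (5 parts, 500 bytes) remains available under metrics.mpu.
```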

Contributor Author

not an issue in this PR, but worth considering for the future

I formally disagree that more effort should be put into this obsolete script. But it's worth discussing how to count MPUs for Scuba.

Contributor

But it's worth discussing how to count MPUs for Scuba.

that is my point: this discussion is not about the component where it is implemented (utapi, s3utils, scuba, ...) but really about the semantics and data we want to measure.

@@ -23,6 +24,7 @@ describe('CountItems::utils::consolidateDataMetrics', () => {
_currentRestoring: 0,
_nonCurrentRestored: 0,
_nonCurrentRestoring: 0,
_incompleteMPUParts: 0,
Contributor

as implemented today (see my other comment), this is not the count of parts but the count of uploads, so the field should be named _incompleteMPUUploads. Or maybe it would be better to actually count parts instead.

Contributor Author

Counting parts does not make sense from a client's point of view, as clients cannot control, most of the time, how the MPUs are split. What is important is knowing how many MPUs are incomplete and how much data these MPUs occupy. It's not important here anyway, because our only use cases are quotas (only on storage bytes) and reflecting the current usage in the UI (no object count there either). We can however consider it for Scuba, as separate work.

Contributor

as discussed above, it seems presumptuous to say what makes or does not make sense from a client's point of view: anyway the APIs are limited, it is not presented to the user...

the request from product (as a proxy for customers) today is really just to count the "size" used by incomplete MPUs (parts or uploads is the same here). As for the number of parts vs uploads, each actually fits different uses, both for "customers" (but different personas):

  • the number of parts may make more sense for an admin who wants to understand why mongo is overloaded;
  • the number of uploads may make more sense for the user who performed the upload and wants to understand how many extra uploads he did... but maybe he does not care so much about the number of uploads, and more about the amount of data he has to re-upload...

I also don't know what is expected, and it will need to be considered for Scuba: but as soon as you crystallize a behavior here and it gets shipped, customers may start to use it, and we will have to support it... So we must really refrain from adding something quickly just because we can, especially if the semantics may be ambiguous.

@williamlardier (Contributor Author) commented Jan 8, 2025

What I mean is that in the S3 world we would typically use ListMultipartUploads to get the number of in-progress/incomplete MPUs: this does not return all the parts. If we need the parts we can use ListParts.

S3 doesn't expose the number of parts in its metrics either, but:

  • Incomplete Multipart Upload Storage Bytes – The total bytes in scope with incomplete multipart uploads
  • Incomplete Multipart Upload Object Count – The number of objects in scope that are incomplete multipart uploads

So the number of uploads will certainly be required for us to be standard. We cannot report more storage utilization while reporting 0 objects (for example, reporting 1 TB of data with 0 objects if everything is incomplete MPUs): there needs to be a correlation between the two. And the number of uploads is the natural (and standard) information to have.

Then maybe we can consider the number of parts; not supporting them here is actually what I suggested and what you seem to align on: if we ship it we'll need it in SUR, yet it may not be needed, or it may be hard to track within Scuba...
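If the goal is to match the S3 Storage Lens semantics quoted above, the exported metrics could map like this. The Storage Lens metric names come from the AWS documentation quoted in the comment; the internal field names (`mpuPartsSize`, `mpuUploadsCount`) are hypothetical.

```javascript
// Sketch of a mapping from internal counters to Storage-Lens-style
// metrics: bytes come from the parts, the object count comes from the
// uploads (one per overview key), never from individual parts.
function toStorageLensMetrics(internal) {
    return {
        // "Incomplete Multipart Upload Storage Bytes"
        incompleteMultipartUploadStorageBytes: internal.mpuPartsSize,
        // "Incomplete Multipart Upload Object Count"
        incompleteMultipartUploadObjectCount: internal.mpuUploadsCount,
    };
}

const out = toStorageLensMetrics({ mpuPartsSize: 1024, mpuUploadsCount: 2 });
// out reports 1024 bytes across 2 incomplete uploads, keeping size and
// count correlated as discussed above.
```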

Also remove unused callback in the mongodb foreach

Issue: S3UTILS-186
4 participants