-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add prometheus metrics to the ndt8 handlers #13
Conversation
Pull Request Test Coverage Report for Build 7617005033Warning: This coverage report may be inaccurate.We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)
internal/handler/handler.go
line 48 at r1 (raw file):
Help: "Counter of ndt8 tests", }, []string{"direction"},
My instinct is to recommend including an additional label here also to distinguish between success and errors easily.
As currently written the "total number of tests (successful or otherwise)" requires two metrics, i.e. sum(msak_ndt8_tests_total) + sum(msak_ndt8_connection_errors_total)
rather than an expression like sum(msak_ndt8_tests_total)
(or similar).
I created a breakdown like this in ndt-server so we could see aggregate counts easily or dive into detailed errors https://github.com/m-lab/ndt-server/tree/main/ndt7/metrics. I'm not saying do the same thing but I wanted to cite that as an example. In particular, ndt7_client_test_results_total
which has a short set of states ("ok", "error", etc) and ndt7_client_sender_errors_total
which has more information about the error cases.
internal/handler/handler.go
line 190 at r1 (raw file):
err = conn.SetCC(requestCC) if err != nil { log.Info("Failed to set cc", "ctx", fmt.Sprintf("%p", req.Context()),
Counting these events could also be instructive even if they're not fatal errors.
internal/handler/handler.go
line 221 at r1 (raw file):
}() testsTotal.WithLabelValues(string(kind)).Inc()
Are these all successes?
internal/handler/handler.go
line 260 at r1 (raw file):
log.Info("Connection closed unexpectedly", "context", fmt.Sprintf("%p", timeout), "error", err) // TODO: Add Prometheus metric
Do we need a metric or label on testsTotal
here to distinguish between two kinds of results? e.g. successes and premature errors? Should these be counted as successes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL again?
Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)
internal/handler/handler.go
line 48 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
My instinct is to recommend including an additional label here also to distinguish between success and errors easily.
As currently written the "total number of tests (successful or otherwise)" requires two metrics, i.e.
sum(msak_ndt8_tests_total) + sum(msak_ndt8_connection_errors_total)
rather than an expression likesum(msak_ndt8_tests_total)
(or similar).I created a breakdown like this in ndt-server so we could see aggregate counts easily or dive into detailed errors https://github.com/m-lab/ndt-server/tree/main/ndt7/metrics. I'm not saying do the same thing but I wanted to cite that as an example. In particular,
ndt7_client_test_results_total
which has a short set of states ("ok", "error", etc) andndt7_client_sender_errors_total
which has more information about the error cases.
Now there are two metrics, similarly to how it's done in ndt-server:
client_connections_total
: every connection that reaches the ul/dl handler.status
indicates if there was an error (i.e. missing mandatory parameter) or it successfully reached the websocket upgradetests_total
: every connection that upgraded to websocket. Must correspond tofile_writes_total
since from that point an archive is created in any case
internal/handler/handler.go
line 190 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Counting these events could also be instructive even if they're not fatal errors.
I suspect I'd need a separate metric, since the existing ones would otherwise count a connection where we failed to set cc twice... msak_throughput1_errors_total{error="..."}
?
internal/handler/handler.go
line 221 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Are these all successes?
Changed.
internal/handler/handler.go
line 260 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Do we need a metric or label on
testsTotal
here to distinguish between two kinds of results? e.g. successes and premature errors? Should these be counted as successes?
I've tried to clarify what the possible cases are:
- close errors
- normal close
- unexpected close (i.e. EOF)
- non-close errors (something went wrong in the protocol code)
There is still a remaining issue though: when the client closes the connection after 5s it does so abruptly, which is the counted as close-error
. Not sure if I should consider this an actual error or include CloseAbnormalClosure
in the "ok" case. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)
internal/handler/handler.go
line 48 at r1 (raw file):
Previously, robertodauria (Roberto D'Auria) wrote…
Now there are two metrics, similarly to how it's done in ndt-server:
client_connections_total
: every connection that reaches the ul/dl handler.status
indicates if there was an error (i.e. missing mandatory parameter) or it successfully reached the websocket upgradetests_total
: every connection that upgraded to websocket. Must correspond tofile_writes_total
since from that point an archive is created in any case
Since client_connections_total
is just about the websocket upgrade path, please consider a rename to make this self-describing. Also please hint in the help text.
e.g. client_websocket_upgrades_total
or similar
internal/handler/handler.go
line 190 at r1 (raw file):
Previously, robertodauria (Roberto D'Auria) wrote…
I suspect I'd need a separate metric, since the existing ones would otherwise count a connection where we failed to set cc twice...
msak_throughput1_errors_total{error="..."}
?
What if every connection fails to set the requested CC? Can we monitor this?
internal/handler/handler.go
line 260 at r1 (raw file):
Previously, robertodauria (Roberto D'Auria) wrote…
I've tried to clarify what the possible cases are:
- close errors
- normal close
- unexpected close (i.e. EOF)
- non-close errors (something went wrong in the protocol code)
There is still a remaining issue though: when the client closes the connection after 5s it does so abruptly, which is the counted as
close-error
. Not sure if I should consider this an actual error or includeCloseAbnormalClosure
in the "ok" case. Thoughts?
I recommend adding an issue and TODO to record abnormal close errors. This is something always missing from ndt7. If the server knows that something that shouldn't have happened happened, we should record this in the results.
internal/handler/handler.go
line 56 at r3 (raw file):
Namespace: "msak", Subsystem: "throughput1", Name: "tests_total",
Clarifying: sum(client_connections_total{status="ok"}) == sum(tests_total)
?
internal/handler/handler.go
line 65 at r3 (raw file):
Namespace: "msak", Subsystem: "throughput1", Name: "file_writes_total",
Clarifying: sum(tests_total) == sum(file_writes_total)
?
internal/handler/handler.go
line 252 at r3 (raw file):
case <-timeout.Done(): // If the test has timed out count it as a success and return. testsTotal.WithLabelValues(string(kind), "ok").Inc()
Is this truly the same as "websocket.CloseNormalClosure"?
Which is the expected common case? server timeout or client close? Which ever is the expected common case, call "ok". The other, please qualify with something like "ok-foo" whatever foo is, e.g. maybe "close-ok" below for the normal close case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)
internal/handler/handler.go
line 48 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Since
client_connections_total
is just about the websocket upgrade path, please consider a rename to make this self-describing. Also please hint in the help text.e.g.
client_websocket_upgrades_total
or similar
Done.
internal/handler/handler.go
line 190 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
What if every connection fails to set the requested CC? Can we monitor this?
I agree that this would be useful to have. 🙂 What do you think of the name above?
internal/handler/handler.go
line 260 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
I recommend adding an issue and TODO to record abnormal close errors. This is something always missing from ndt7. If the server knows that something that shouldn't have happened happened, we should record this in the results.
Hmm... both the server and the client should close the connection cleanly with a proper CloseMessage
but it not generally expected that every client implementation does so. When the client terminates the connection in advance (e.g. CTRL+C, to replicate this scenario) the server sees a websocket.CloseAbnormalClosure
(unexpected EOF) not a websocket.CloseNormalClosure
. These "abnormal" closures are totally expected and should not be classified as an error. I've updated the code and added a comment explaining this better.
internal/handler/handler.go
line 56 at r3 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Clarifying:
sum(client_connections_total{status="ok"}) == sum(tests_total)
?
Yes. Every connection that reached the websocket upgrade successfully is counted in tests_total
.
internal/handler/handler.go
line 65 at r3 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Clarifying:
sum(tests_total) == sum(file_writes_total)
?
Yes. Every connection that reached the websocket upgrade successfully must generate an archive file, regardless of the outcome of the test.
internal/handler/handler.go
line 252 at r3 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
Is this truly the same as "websocket.CloseNormalClosure"?
Which is the expected common case? server timeout or client close? Which ever is the expected common case, call "ok". The other, please qualify with something like "ok-foo" whatever foo is, e.g. maybe "close-ok" below for the normal close case.
Both CloseNormalClosure
and CloseAbnormalClosure
are (somewhat unfortunately) expected -- see below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)
internal/handler/handler.go
line 190 at r1 (raw file):
Previously, robertodauria (Roberto D'Auria) wrote…
I agree that this would be useful to have. 🙂 What do you think of the name above?
I don't understand. I can imagine this being a separate metric. I recommend adding one. I agree we don't want to double count errors on an existing metric. Sorry if I've misunderstood.
internal/handler/handler.go
line 260 at r1 (raw file):
Previously, robertodauria (Roberto D'Auria) wrote…
Hmm... both the server and the client should close the connection cleanly with a proper
CloseMessage
but it not generally expected that every client implementation does so. When the client terminates the connection in advance (e.g. CTRL+C, to replicate this scenario) the server sees awebsocket.CloseAbnormalClosure
(unexpected EOF) not awebsocket.CloseNormalClosure
. These "abnormal" closures are totally expected and should not be classified as an error. I've updated the code and added a comment explaining this better.
"This protocol has no errors" seems suspicious to me, no? Doesn't have to be this PR. Maybe create a TODO referencing an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)
internal/handler/handler.go
line 190 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
I don't understand. I can imagine this being a separate metric. I recommend adding one. I agree we don't want to double count errors on an existing metric. Sorry if I've misunderstood.
Done.
internal/handler/handler.go
line 260 at r1 (raw file):
Previously, stephen-soltesz (Stephen Soltesz) wrote…
"This protocol has no errors" seems suspicious to me, no? Doesn't have to be this PR. Maybe create a TODO referencing an issue.
Done. Discussed on VC, we'll add a field to the archival data to mark tests that should be considered invalid (e.g. if the requested duration does not match the actual stream duration)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r3.
Reviewable status: complete! 1 of 1 approvals obtained (waiting on @robertodauria)
internal/handler/handler.go
line 260 at r1 (raw file):
Previously, robertodauria (Roberto D'Auria) wrote…
Done. Discussed on VC, we'll add a field to the archival data to mark tests that should be considered invalid (e.g. if the requested duration does not match the actual stream duration)
I'd be satisfied with just documenting the trade off in an issue for now. e.g. bake the policy into msak server or increase complexity of bq filtering and analysis.
This PR adds the following Prometheus metrics:
msak_ndt8_tests_total{direction}
: number of tests that started, by direction (download/upload)msak_ndt8_connection_errors_total{direction, error}
: number of connections that failed before starting a test, by direction and error typemsak_ndt8_file_writes_total{direction, status}
: number of files written to disk, by test direction and status (ok/error)This change is