Add prometheus metrics to the ndt8 handlers #13

Merged · 9 commits merged into main on Jan 23, 2024

Conversation

@robertodauria (Contributor) commented on Jun 1, 2023

This PR adds the following Prometheus metrics (a sketch of the corresponding declarations follows the list):

  • msak_ndt8_tests_total{direction}: number of tests that started, by direction (download/upload)
  • msak_ndt8_connection_errors_total{direction, error}: number of connections that failed before starting a test, by direction and error type
  • msak_ndt8_file_writes_total{direction, status}: number of files written to disk, by test direction and status (ok/error)
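
A minimal sketch of how these counters could be declared with prometheus/client_golang; variable names and Help strings below are illustrative, and the actual definitions live in internal/handler/handler.go:

```go
package handler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Sketch only: metric names and labels follow the list above; the Help
// strings and variable names are illustrative.
var (
	testsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_tests_total",
		Help: "Number of ndt8 tests that started, by direction.",
	}, []string{"direction"})

	connectionErrorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_connection_errors_total",
		Help: "Connections that failed before starting a test, by direction and error type.",
	}, []string{"direction", "error"})

	fileWritesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_file_writes_total",
		Help: "Files written to disk, by direction and status (ok/error).",
	}, []string{"direction", "status"})
)
```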


@coveralls commented on Jun 1, 2023

Pull Request Test Coverage Report for Build 7617005033

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.7%) to 69.534%

Totals Coverage Status

  • Change from base Build 7504079297: -0.7%
  • Covered Lines: 881
  • Relevant Lines: 1267

💛 - Coveralls

@stephen-soltesz (Contributor) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 48 at r1 (raw file):

			Help:      "Counter of ndt8 tests",
		},
		[]string{"direction"},

My instinct is to recommend including an additional label here also to distinguish between success and errors easily.

As currently written the "total number of tests (successful or otherwise)" requires two metrics, i.e. sum(msak_ndt8_tests_total) + sum(msak_ndt8_connection_errors_total) rather than an expression like sum(msak_ndt8_tests_total) (or similar).

I created a breakdown like this in ndt-server so we could see aggregate counts easily or dive into detailed errors https://github.com/m-lab/ndt-server/tree/main/ndt7/metrics. I'm not saying do the same thing but I wanted to cite that as an example. In particular, ndt7_client_test_results_total which has a short set of states ("ok", "error", etc) and ndt7_client_sender_errors_total which has more information about the error cases.
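
For illustration, a hypothetical msak analogue of that breakdown (the metric and variable names here are invented, not proposed names): one coarse results counter with a small set of states, plus a separate counter carrying detailed error information.

```go
package handler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical layout: with a coarse status label,
// sum(msak_ndt8_test_results_total) alone gives "all tests, successful or
// otherwise", while detailed error causes live in a second counter.
var (
	testResultsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_test_results_total",
		Help: "ndt8 test results by direction and coarse status (ok, error, ...).",
	}, []string{"direction", "status"})

	testErrorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_test_errors_total",
		Help: "Detailed ndt8 error counts by direction and error type.",
	}, []string{"direction", "error"})
)
```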


internal/handler/handler.go line 190 at r1 (raw file):

		err = conn.SetCC(requestCC)
		if err != nil {
			log.Info("Failed to set cc", "ctx", fmt.Sprintf("%p", req.Context()),

Counting these events could also be instructive even if they're not fatal errors.


internal/handler/handler.go line 221 at r1 (raw file):

	}()

	testsTotal.WithLabelValues(string(kind)).Inc()

Are these all successes?


internal/handler/handler.go line 260 at r1 (raw file):

				log.Info("Connection closed unexpectedly", "context",
					fmt.Sprintf("%p", timeout), "error", err)
				// TODO: Add Prometheus metric

Do we need a metric or label on testsTotal here to distinguish between two kinds of results? e.g. successes and premature errors? Should these be counted as successes?

@robertodauria (Contributor, Author) left a comment


PTAL again?

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/handler/handler.go line 48 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

My instinct is to recommend including an additional label here also to distinguish between success and errors easily.

As currently written the "total number of tests (successful or otherwise)" requires two metrics, i.e. sum(msak_ndt8_tests_total) + sum(msak_ndt8_connection_errors_total) rather than an expression like sum(msak_ndt8_tests_total) (or similar).

I created a breakdown like this in ndt-server so we could see aggregate counts easily or dive into detailed errors https://github.com/m-lab/ndt-server/tree/main/ndt7/metrics. I'm not saying do the same thing but I wanted to cite that as an example. In particular, ndt7_client_test_results_total which has a short set of states ("ok", "error", etc) and ndt7_client_sender_errors_total which has more information about the error cases.

Now there are two metrics, similar to how it's done in ndt-server (a rough sketch of the increment points follows the list):

  • client_connections_total: every connection that reaches the ul/dl handler. status indicates whether there was an error (e.g. a missing mandatory parameter) or the connection successfully reached the websocket upgrade.
  • tests_total: every connection that upgraded to websocket. Must correspond to file_writes_total, since from that point an archive is created in any case.
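
To make the flow concrete, a rough sketch of the increment points. The helper code, parameter names, and label values are hypothetical, not the exact handler code:

```go
package handler

import (
	"net/http"

	"github.com/gorilla/websocket"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical counters and flow, illustrating why every connection counted
// as "ok" in client_connections_total is also counted in tests_total.
var (
	clientConnections = promauto.NewCounterVec(prometheus.CounterOpts{
		Namespace: "msak",
		Subsystem: "throughput1",
		Name:      "client_connections_total",
		Help:      "Connections reaching the upload/download handlers, by direction and status.",
	}, []string{"direction", "status"})

	testsStarted = promauto.NewCounterVec(prometheus.CounterOpts{
		Namespace: "msak",
		Subsystem: "throughput1",
		Name:      "tests_total",
		Help:      "Connections that completed the websocket upgrade, by direction.",
	}, []string{"direction"})
)

var upgrader = websocket.Upgrader{}

func uploadSketch(w http.ResponseWriter, r *http.Request) {
	// Hypothetical mandatory-parameter check before the upgrade.
	if r.URL.Query().Get("mid") == "" {
		clientConnections.WithLabelValues("upload", "missing-mid").Inc()
		http.Error(w, "missing mid", http.StatusBadRequest)
		return
	}
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		clientConnections.WithLabelValues("upload", "websocket-upgrade-failed").Inc()
		return
	}
	defer conn.Close()
	clientConnections.WithLabelValues("upload", "ok").Inc()
	// From this point an archive is written regardless of the test outcome,
	// so tests_total and file_writes_total should track each other.
	testsStarted.WithLabelValues("upload").Inc()
	// ... run the measurement over conn ...
}
```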

internal/handler/handler.go line 190 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Counting these events could also be instructive even if they're not fatal errors.

I suspect I'd need a separate metric; otherwise, with the existing ones, a connection where we failed to set the cc would end up being counted twice... something like msak_throughput1_errors_total{error="..."}?
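
Roughly (a sketch; the metric and label names are just a proposal):

```go
package handler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Proposed counter for non-fatal events such as a failed SetCC. Keeping it
// separate means a connection can record the event without being
// double-counted in the connection/test counters.
var errorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "msak",
	Subsystem: "throughput1",
	Name:      "errors_total",
	Help:      "Non-fatal errors observed while serving a measurement, by direction and error type.",
}, []string{"direction", "error"})

// At the call site quoted above, the increment would look roughly like:
//
//	if err := conn.SetCC(requestCC); err != nil {
//		errorsTotal.WithLabelValues(string(kind), "set-cc-failure").Inc()
//		// Not fatal: log and continue.
//	}
```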


internal/handler/handler.go line 221 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Are these all successes?

Changed.


internal/handler/handler.go line 260 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Do we need a metric or label on testsTotal here to distinguish between two kinds of results? e.g. successes and premature errors? Should these be counted as successes?

I've tried to clarify what the possible cases are:

  • close errors
    • normal close
    • unexpected close (i.e. EOF)
  • non-close errors (something went wrong in the protocol code)

There is still a remaining issue, though: when the client closes the connection after 5s it does so abruptly, which is then counted as a close error. I'm not sure if I should consider this an actual error or include CloseAbnormalClosure in the "ok" case. Thoughts?

@stephen-soltesz (Contributor) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 48 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Now there are two metrics, similar to how it's done in ndt-server:

  • client_connections_total: every connection that reaches the ul/dl handler. status indicates whether there was an error (e.g. a missing mandatory parameter) or the connection successfully reached the websocket upgrade.
  • tests_total: every connection that upgraded to websocket. Must correspond to file_writes_total, since from that point an archive is created in any case.

Since client_connections_total is just about the websocket upgrade path, please consider a rename to make this self-describing. Please also hint at this in the help text.

e.g. client_websocket_upgrades_total or similar
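
For instance (illustrative only; not the code in this PR):

```go
package handler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative only: a self-describing name plus a Help string that makes
// the websocket-upgrade scope explicit.
var websocketUpgrades = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "msak",
	Subsystem: "throughput1",
	Name:      "client_websocket_upgrades_total",
	Help:      "Client connections reaching the websocket upgrade, by direction and upgrade status.",
}, []string{"direction", "status"})
```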


internal/handler/handler.go line 190 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

I suspect I'd need a separate metric; otherwise, with the existing ones, a connection where we failed to set the cc would end up being counted twice... something like msak_throughput1_errors_total{error="..."}?

What if every connection fails to set the requested CC? Can we monitor this?


internal/handler/handler.go line 260 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

I've tried to clarify what the possible cases are:

  • close errors
    • normal close
    • unexpected close (i.e. EOF)
  • non-close errors (something went wrong in the protocol code)

There is still a remaining issue, though: when the client closes the connection after 5s it does so abruptly, which is then counted as a close error. I'm not sure if I should consider this an actual error or include CloseAbnormalClosure in the "ok" case. Thoughts?

I recommend adding an issue and a TODO to record abnormal close errors. This is something that has always been missing from ndt7. If the server knows that something happened that shouldn't have, we should record it in the results.


internal/handler/handler.go line 56 at r3 (raw file):

			Namespace: "msak",
			Subsystem: "throughput1",
			Name:      "tests_total",

Clarifying: sum(client_connections_total{status="ok"}) == sum(tests_total) ?


internal/handler/handler.go line 65 at r3 (raw file):

			Namespace: "msak",
			Subsystem: "throughput1",
			Name:      "file_writes_total",

Clarifying: sum(tests_total) == sum(file_writes_total)?


internal/handler/handler.go line 252 at r3 (raw file):

		case <-timeout.Done():
			// If the test has timed out count it as a success and return.
			testsTotal.WithLabelValues(string(kind), "ok").Inc()

Is this truly the same as "websocket.CloseNormalClosure"?

Which is the expected common case: server timeout or client close? Whichever is the expected common case, call it "ok". Please qualify the other with something like "ok-foo", whatever foo is; e.g. maybe "close-ok" below for the normal close case.

@robertodauria (Contributor, Author) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/handler/handler.go line 48 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Since client_connections_total is just about the websocket upgrade path, please consider a rename to make this self-describing. Please also hint at this in the help text.

e.g. client_websocket_upgrades_total or similar

Done.


internal/handler/handler.go line 190 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

What if every connection fails to set the requested CC? Can we monitor this?

I agree that this would be useful to have. 🙂 What do you think of the name above?


internal/handler/handler.go line 260 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

I recommend adding an issue and a TODO to record abnormal close errors. This is something that has always been missing from ndt7. If the server knows that something happened that shouldn't have, we should record it in the results.

Hmm... both the server and the client should close the connection cleanly with a proper CloseMessage, but it's not generally expected that every client implementation does so. When the client terminates the connection early (e.g. with CTRL+C, which replicates this scenario) the server sees a websocket.CloseAbnormalClosure (unexpected EOF), not a websocket.CloseNormalClosure. These "abnormal" closures are entirely expected and should not be classified as an error. I've updated the code and added a comment explaining this better.
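
For illustration, a sketch of how the read-loop error could be mapped to a status label, assuming gorilla/websocket (the function and label names are hypothetical):

```go
package handler

import (
	"errors"
	"fmt"

	"github.com/gorilla/websocket"
)

// closeStatus maps the error returned by the read loop to a metric label.
// Both a clean close and an abrupt client disconnect (CloseAbnormalClosure,
// e.g. the client hit CTRL+C) are treated as expected; any other close code
// is reported individually, and everything else is a protocol error.
func closeStatus(err error) string {
	if err == nil {
		return "ok"
	}
	if websocket.IsCloseError(err, websocket.CloseNormalClosure,
		websocket.CloseAbnormalClosure) {
		return "ok"
	}
	var closeErr *websocket.CloseError
	if errors.As(err, &closeErr) {
		return fmt.Sprintf("close-%d", closeErr.Code)
	}
	return "error"
}
```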


internal/handler/handler.go line 56 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Clarifying: sum(client_connections_total{status="ok"}) == sum(tests_total) ?

Yes. Every connection that reached the websocket upgrade successfully is counted in tests_total.


internal/handler/handler.go line 65 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Clarifying: sum(tests_total) == sum(file_writes_total)?

Yes. Every connection that reached the websocket upgrade successfully must generate an archive file, regardless of the outcome of the test.


internal/handler/handler.go line 252 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Is this truly the same as "websocket.CloseNormalClosure"?

Which is the expected common case: server timeout or client close? Whichever is the expected common case, call it "ok". Please qualify the other with something like "ok-foo", whatever foo is; e.g. maybe "close-ok" below for the normal close case.

Both CloseNormalClosure and CloseAbnormalClosure are (somewhat unfortunately) expected -- see below.

@stephen-soltesz (Contributor) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 190 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

I agree that this would be useful to have. 🙂 What do you think of the name above?

I don't understand. I can imagine this being a separate metric. I recommend adding one. I agree we don't want to double count errors on an existing metric. Sorry if I've misunderstood.


internal/handler/handler.go line 260 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Hmm... both the server and the client should close the connection cleanly with a proper CloseMessage, but it's not generally expected that every client implementation does so. When the client terminates the connection early (e.g. with CTRL+C, which replicates this scenario) the server sees a websocket.CloseAbnormalClosure (unexpected EOF), not a websocket.CloseNormalClosure. These "abnormal" closures are entirely expected and should not be classified as an error. I've updated the code and added a comment explaining this better.

"This protocol has no errors" seems suspicious to me, no? Doesn't have to be this PR. Maybe create a TODO referencing an issue.

@robertodauria (Contributor, Author) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/handler/handler.go line 190 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

I don't understand. I can imagine this being a separate metric. I recommend adding one. I agree we don't want to double count errors on an existing metric. Sorry if I've misunderstood.

Done.


internal/handler/handler.go line 260 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

"This protocol has no errors" seems suspicious to me, no? Doesn't have to be this PR. Maybe create a TODO referencing an issue.

Done. Discussed on VC: we'll add a field to the archival data to mark tests that should be considered invalid (e.g. if the requested duration does not match the actual stream duration).

@stephen-soltesz (Contributor) left a comment


:lgtm:

Reviewed 1 of 2 files at r3.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 260 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Done. Discussed on VC: we'll add a field to the archival data to mark tests that should be considered invalid (e.g. if the requested duration does not match the actual stream duration).

I'd be satisfied with just documenting the trade-off in an issue for now, e.g. bake the policy into the msak server vs. increase the complexity of BigQuery filtering and analysis.

@robertodauria merged commit 068bb2b into main on Jan 23, 2024
7 checks passed
@robertodauria deleted the add-writtenfiles-metric branch on January 23, 2024 at 14:06