Add prometheus metrics to the ndt8 handlers #13

Merged · 9 commits merged into main on Jan 23, 2024

Conversation

@robertodauria (Contributor) commented on Jun 1, 2023

This PR adds the following Prometheus metrics (a sketch of the corresponding declarations follows the list):

  • msak_ndt8_tests_total{direction}: number of tests that started, by direction (download/upload)
  • msak_ndt8_connection_errors_total{direction, error}: number of connections that failed before starting a test, by direction and error type
  • msak_ndt8_file_writes_total{direction, status}: number of files written to disk, by test direction and status (ok/error)
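
A minimal sketch of how these counters could be declared with prometheus/client_golang; variable names and Help strings below are illustrative, and the actual definitions live in internal/handler/handler.go:

```go
package handler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Sketch only: metric names and labels follow the list above; the Help
// strings and variable names are illustrative.
var (
	testsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_tests_total",
		Help: "Number of ndt8 tests that started, by direction.",
	}, []string{"direction"})

	connectionErrorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_connection_errors_total",
		Help: "Connections that failed before starting a test, by direction and error type.",
	}, []string{"direction", "error"})

	fileWritesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_file_writes_total",
		Help: "Files written to disk, by direction and status (ok/error).",
	}, []string{"direction", "status"})
)
```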


@coveralls commented on Jun 1, 2023

Pull Request Test Coverage Report for Build 7617005033

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.7%) to 69.534%

Totals Coverage Status

  • Change from base Build 7504079297: -0.7%
  • Covered Lines: 881
  • Relevant Lines: 1267

💛 - Coveralls

@stephen-soltesz (Contributor) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 48 at r1 (raw file):

			Help:      "Counter of ndt8 tests",
		},
		[]string{"direction"},

My instinct is to recommend including an additional label here also to distinguish between success and errors easily.

As currently written the "total number of tests (successful or otherwise)" requires two metrics, i.e. sum(msak_ndt8_tests_total) + sum(msak_ndt8_connection_errors_total) rather than an expression like sum(msak_ndt8_tests_total) (or similar).

I created a breakdown like this in ndt-server so we could see aggregate counts easily or dive into detailed errors https://github.com/m-lab/ndt-server/tree/main/ndt7/metrics. I'm not saying do the same thing but I wanted to cite that as an example. In particular, ndt7_client_test_results_total which has a short set of states ("ok", "error", etc) and ndt7_client_sender_errors_total which has more information about the error cases.
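
For illustration, a hypothetical msak analogue of that breakdown (the metric and variable names here are invented, not proposed names): one coarse results counter with a small set of states, plus a separate counter carrying detailed error information.

```go
package handler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical layout: with a coarse status label,
// sum(msak_ndt8_test_results_total) alone gives "all tests, successful or
// otherwise", while detailed error causes live in a second counter.
var (
	testResultsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_test_results_total",
		Help: "ndt8 test results by direction and coarse status (ok, error, ...).",
	}, []string{"direction", "status"})

	testErrorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "msak_ndt8_test_errors_total",
		Help: "Detailed ndt8 error counts by direction and error type.",
	}, []string{"direction", "error"})
)
```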


internal/handler/handler.go line 190 at r1 (raw file):

		err = conn.SetCC(requestCC)
		if err != nil {
			log.Info("Failed to set cc", "ctx", fmt.Sprintf("%p", req.Context()),

Counting these events could also be instructive even if they're not fatal errors.


internal/handler/handler.go line 221 at r1 (raw file):

	}()

	testsTotal.WithLabelValues(string(kind)).Inc()

Are these all successes?


internal/handler/handler.go line 260 at r1 (raw file):

				log.Info("Connection closed unexpectedly", "context",
					fmt.Sprintf("%p", timeout), "error", err)
				// TODO: Add Prometheus metric

Do we need a metric or label on testsTotal here to distinguish between two kinds of results? e.g. successes and premature errors? Should these be counted as successes?

@robertodauria (Contributor, Author) left a comment


PTAL again?

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/handler/handler.go line 48 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

My instinct is to recommend including an additional label here also to distinguish between success and errors easily.

As currently written the "total number of tests (successful or otherwise)" requires two metrics, i.e. sum(msak_ndt8_tests_total) + sum(msak_ndt8_connection_errors_total) rather than an expression like sum(msak_ndt8_tests_total) (or similar).

I created a breakdown like this in ndt-server so we could see aggregate counts easily or dive into detailed errors https://github.com/m-lab/ndt-server/tree/main/ndt7/metrics. I'm not saying do the same thing but I wanted to cite that as an example. In particular, ndt7_client_test_results_total which has a short set of states ("ok", "error", etc) and ndt7_client_sender_errors_total which has more information about the error cases.

Now there are two metrics, similar to how it's done in ndt-server (a rough sketch of the increment points follows the list):

  • client_connections_total: every connection that reaches the ul/dl handler. status indicates whether there was an error (e.g. a missing mandatory parameter) or the connection successfully reached the websocket upgrade.
  • tests_total: every connection that upgraded to websocket. Must correspond to file_writes_total, since from that point an archive is created in any case.
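
To make the flow concrete, a rough sketch of the increment points. The helper code, parameter names, and label values are hypothetical, not the exact handler code:

```go
package handler

import (
	"net/http"

	"github.com/gorilla/websocket"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical counters and flow, illustrating why every connection counted
// as "ok" in client_connections_total is also counted in tests_total.
var (
	clientConnections = promauto.NewCounterVec(prometheus.CounterOpts{
		Namespace: "msak",
		Subsystem: "throughput1",
		Name:      "client_connections_total",
		Help:      "Connections reaching the upload/download handlers, by direction and status.",
	}, []string{"direction", "status"})

	testsStarted = promauto.NewCounterVec(prometheus.CounterOpts{
		Namespace: "msak",
		Subsystem: "throughput1",
		Name:      "tests_total",
		Help:      "Connections that completed the websocket upgrade, by direction.",
	}, []string{"direction"})
)

var upgrader = websocket.Upgrader{}

func uploadSketch(w http.ResponseWriter, r *http.Request) {
	// Hypothetical mandatory-parameter check before the upgrade.
	if r.URL.Query().Get("mid") == "" {
		clientConnections.WithLabelValues("upload", "missing-mid").Inc()
		http.Error(w, "missing mid", http.StatusBadRequest)
		return
	}
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		clientConnections.WithLabelValues("upload", "websocket-upgrade-failed").Inc()
		return
	}
	defer conn.Close()
	clientConnections.WithLabelValues("upload", "ok").Inc()
	// From this point an archive is written regardless of the test outcome,
	// so tests_total and file_writes_total should track each other.
	testsStarted.WithLabelValues("upload").Inc()
	// ... run the measurement over conn ...
}
```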

internal/handler/handler.go line 190 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Counting these events could also be instructive even if they're not fatal errors.

I suspect I'd need a separate metric; otherwise, with the existing ones, a connection where we failed to set the cc would end up being counted twice... something like msak_throughput1_errors_total{error="..."}?
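
Roughly (a sketch; the metric and label names are just a proposal):

```go
package handler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Proposed counter for non-fatal events such as a failed SetCC. Keeping it
// separate means a connection can record the event without being
// double-counted in the connection/test counters.
var errorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "msak",
	Subsystem: "throughput1",
	Name:      "errors_total",
	Help:      "Non-fatal errors observed while serving a measurement, by direction and error type.",
}, []string{"direction", "error"})

// At the call site quoted above, the increment would look roughly like:
//
//	if err := conn.SetCC(requestCC); err != nil {
//		errorsTotal.WithLabelValues(string(kind), "set-cc-failure").Inc()
//		// Not fatal: log and continue.
//	}
```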


internal/handler/handler.go line 221 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Are these all successes?

Changed.


internal/handler/handler.go line 260 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Do we need a metric or label on testsTotal here to distinguish between two kinds of results? e.g. successes and premature errors? Should these be counted as successes?

I've tried to clarify what the possible cases are:

  • close errors
    • normal close
    • unexpected close (i.e. EOF)
  • non-close errors (something went wrong in the protocol code)

There is still a remaining issue, though: when the client closes the connection after 5s it does so abruptly, which is then counted as a close error. I'm not sure if I should consider this an actual error or include CloseAbnormalClosure in the "ok" case. Thoughts?

@stephen-soltesz (Contributor) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 48 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Now there are two metrics, similar to how it's done in ndt-server:

  • client_connections_total: every connection that reaches the ul/dl handler. status indicates whether there was an error (e.g. a missing mandatory parameter) or the connection successfully reached the websocket upgrade.
  • tests_total: every connection that upgraded to websocket. Must correspond to file_writes_total, since from that point an archive is created in any case.

Since client_connections_total is just about the websocket upgrade path, please consider a rename to make this self-describing. Please also hint at this in the help text.

e.g. client_websocket_upgrades_total or similar
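
For instance (illustrative only; not the code in this PR):

```go
package handler

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative only: a self-describing name plus a Help string that makes
// the websocket-upgrade scope explicit.
var websocketUpgrades = promauto.NewCounterVec(prometheus.CounterOpts{
	Namespace: "msak",
	Subsystem: "throughput1",
	Name:      "client_websocket_upgrades_total",
	Help:      "Client connections reaching the websocket upgrade, by direction and upgrade status.",
}, []string{"direction", "status"})
```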


internal/handler/handler.go line 190 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

I suspect I'd need a separate metric; otherwise, with the existing ones, a connection where we failed to set the cc would end up being counted twice... something like msak_throughput1_errors_total{error="..."}?

What if every connection fails to set the requested CC? Can we monitor this?


internal/handler/handler.go line 260 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

I've tried to clarify what the possible cases are:

  • close errors
    • normal close
    • unexpected close (i.e. EOF)
  • non-close errors (something went wrong in the protocol code)

There is still a remaining issue, though: when the client closes the connection after 5s it does so abruptly, which is then counted as a close error. I'm not sure if I should consider this an actual error or include CloseAbnormalClosure in the "ok" case. Thoughts?

I recommend adding an issue and a TODO to record abnormal close errors. This is something that has always been missing from ndt7. If the server knows that something happened that shouldn't have, we should record it in the results.


internal/handler/handler.go line 56 at r3 (raw file):

			Namespace: "msak",
			Subsystem: "throughput1",
			Name:      "tests_total",

Clarifying: sum(client_connections_total{status="ok"}) == sum(tests_total) ?


internal/handler/handler.go line 65 at r3 (raw file):

			Namespace: "msak",
			Subsystem: "throughput1",
			Name:      "file_writes_total",

Clarifying: sum(tests_total) == sum(file_writes_total)?


internal/handler/handler.go line 252 at r3 (raw file):

		case <-timeout.Done():
			// If the test has timed out count it as a success and return.
			testsTotal.WithLabelValues(string(kind), "ok").Inc()

Is this truly the same as "websocket.CloseNormalClosure"?

Which is the expected common case: server timeout or client close? Whichever is the expected common case, call it "ok". Please qualify the other with something like "ok-foo", whatever foo is; e.g. maybe "close-ok" below for the normal close case.

@robertodauria (Contributor, Author) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/handler/handler.go line 48 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Since client_connections_total is just about the websocket upgrade path, please consider a rename to make this self-describing. Please also hint at this in the help text.

e.g. client_websocket_upgrades_total or similar

Done.


internal/handler/handler.go line 190 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

What if every connection fails to set the requested CC? Can we monitor this?

I agree that this would be useful to have. 🙂 What do you think of the name above?


internal/handler/handler.go line 260 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

I recommend adding an issue and a TODO to record abnormal close errors. This is something that has always been missing from ndt7. If the server knows that something happened that shouldn't have, we should record it in the results.

Hmm... both the server and the client should close the connection cleanly with a proper CloseMessage, but it's not generally expected that every client implementation does so. When the client terminates the connection early (e.g. with CTRL+C, which replicates this scenario) the server sees a websocket.CloseAbnormalClosure (unexpected EOF), not a websocket.CloseNormalClosure. These "abnormal" closures are entirely expected and should not be classified as an error. I've updated the code and added a comment explaining this better.
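
For illustration, a sketch of how the read-loop error could be mapped to a status label, assuming gorilla/websocket (the function and label names are hypothetical):

```go
package handler

import (
	"errors"
	"fmt"

	"github.com/gorilla/websocket"
)

// closeStatus maps the error returned by the read loop to a metric label.
// Both a clean close and an abrupt client disconnect (CloseAbnormalClosure,
// e.g. the client hit CTRL+C) are treated as expected; any other close code
// is reported individually, and everything else is a protocol error.
func closeStatus(err error) string {
	if err == nil {
		return "ok"
	}
	if websocket.IsCloseError(err, websocket.CloseNormalClosure,
		websocket.CloseAbnormalClosure) {
		return "ok"
	}
	var closeErr *websocket.CloseError
	if errors.As(err, &closeErr) {
		return fmt.Sprintf("close-%d", closeErr.Code)
	}
	return "error"
}
```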


internal/handler/handler.go line 56 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Clarifying: sum(client_connections_total{status="ok"}) == sum(tests_total) ?

Yes. Every connection that reached the websocket upgrade successfully is counted in tests_total.


internal/handler/handler.go line 65 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Clarifying: sum(tests_total) == sum(file_writes_total)?

Yes. Every connection that reached the websocket upgrade successfully must generate an archive file, regardless of the outcome of the test.


internal/handler/handler.go line 252 at r3 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

Is this truly the same as "websocket.CloseNormalClosure"?

Which is the expected common case: server timeout or client close? Whichever is the expected common case, call it "ok". Please qualify the other with something like "ok-foo", whatever foo is; e.g. maybe "close-ok" below for the normal close case.

Both CloseNormalClosure and CloseAbnormalClosure are (somewhat unfortunately) expected -- see below.

@stephen-soltesz (Contributor) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 190 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

I agree that this would be useful to have. 🙂 What do you think of the name above?

I don't understand. I can imagine this being a separate metric. I recommend adding one. I agree we don't want to double count errors on an existing metric. Sorry if I've misunderstood.


internal/handler/handler.go line 260 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Hmm... both the server and the client should close the connection cleanly with a proper CloseMessage, but it's not generally expected that every client implementation does so. When the client terminates the connection early (e.g. with CTRL+C, which replicates this scenario) the server sees a websocket.CloseAbnormalClosure (unexpected EOF), not a websocket.CloseNormalClosure. These "abnormal" closures are entirely expected and should not be classified as an error. I've updated the code and added a comment explaining this better.

"This protocol has no errors" seems suspicious to me, no? Doesn't have to be this PR. Maybe create a TODO referencing an issue.

@robertodauria (Contributor, Author) left a comment


Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


internal/handler/handler.go line 190 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

I don't understand. I can imagine this being a separate metric. I recommend adding one. I agree we don't want to double count errors on an existing metric. Sorry if I've misunderstood.

Done.


internal/handler/handler.go line 260 at r1 (raw file):

Previously, stephen-soltesz (Stephen Soltesz) wrote…

"This protocol has no errors" seems suspicious to me, no? Doesn't have to be this PR. Maybe create a TODO referencing an issue.

Done. Discussed on VC: we'll add a field to the archival data to mark tests that should be considered invalid (e.g. if the requested duration does not match the actual stream duration).

@stephen-soltesz (Contributor) left a comment


:lgtm:

Reviewed 1 of 2 files at r3.
Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @robertodauria)


internal/handler/handler.go line 260 at r1 (raw file):

Previously, robertodauria (Roberto D'Auria) wrote…

Done. Discussed on VC: we'll add a field to the archival data to mark tests that should be considered invalid (e.g. if the requested duration does not match the actual stream duration).

I'd be satisfied with just documenting the trade-off in an issue for now, e.g. bake the policy into the msak server vs. increase the complexity of BigQuery filtering and analysis.

@robertodauria merged commit 068bb2b into main on Jan 23, 2024
7 checks passed
@robertodauria deleted the add-writtenfiles-metric branch on January 23, 2024 at 14:06