
Add client metrics #751

Merged: 38 commits merged into master from add-client-metrics on Nov 11, 2024
Conversation

@isacikgoz (Member):

Summary

Add the client metrics implementation. Since the performance report is generated from user activity, it's better to reflect that behaviour, which makes the payload much more realistic. I wanted to hear early feedback on whether this is a good approach.

I'll add the ClientFirstContentfulPaint and ClientLargestContentfulPaint metrics in the meantime.

@isacikgoz added the "2: Dev Review" label (Requires review by a core committer) on May 28, 2024.
@agarciamontoro (Member) left a comment:

Looks great, I like the approach! I left some comments, looking forward to seeing more metrics 👀 Thanks for this work!

Comment on lines 157 to 160
err := c.user.ObserveClientMetric(model.ClientTimeToFirstByte, float64(time.Now().UnixMilli()-start)/1000)
if err != nil {
mlog.Warn("Failed to store observation", mlog.Err(err))
}
@agarciamontoro (Member):

At this point we've already made 3-4 requests, so I don't think this is the time to first byte, right? How are we defining the time to first byte in the webapp?
We can go quite low-level here and use something like https://pkg.go.dev/net/http/httptrace#WithClientTrace; there's a callback for GotFirstResponseByte where we can observe the metric. But this is a per-request metric, so are we planning on measuring that in all requests? Or in some specific one?
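
For reference, a minimal, self-contained sketch of that approach (the target URL and the print are illustrative only, not the load-test code):

package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://example.com", nil)
	if err != nil {
		panic(err)
	}

	// GotFirstResponseByte fires when the first byte of the response
	// headers arrives, which is the canonical TTFB moment.
	start := time.Now()
	trace := &httptrace.ClientTrace{
		GotFirstResponseByte: func() {
			fmt.Printf("TTFB: %v\n", time.Since(start))
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}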

@isacikgoz (Member, Author):

TIL! We want to measure that whenever a user logs in (thinking of a scenario where a user opens a web page). But the user could also be already logged in with a session, so we may need to cover that scenario too. The goal of the PR is to measure the load on the server caused by client metrics, but I think it could be a good idea to actually get realistic metrics from the agents.

Maybe we can add this to the backlog, WDYT?

(Further outdated comments, since resolved, on loadtest/control/simulcontroller/controller.go and loadtest/store/memstore/store.go.)
@streamer45 (Contributor) left a comment:

Thanks! Gave this a first quick pass and left some comments.

(Outdated comments, since resolved, on .golangci.yml, go.mod, loadtest/control/simulcontroller/actions.go, loadtest/control/simulcontroller/controller.go, and loadtest/store/memstore/store.go.)
@agnivade (Member):

@isacikgoz - Just checking on this, is there anything left other than us reviewing this again?

Comment on lines 60 to 61
func randomUserAgent() string {
i := rand.Intn(len(userAgents))
@streamer45 (Contributor):

Whaaat? Are you saying there's an equal chance of Safari sending metrics as Chrome? :p

@isacikgoz (Member, Author):

There are quite a lot of vanilla macOS users among devs, to be fair :p

(Resolved comments on .golangci.yml, loadtest/control/simulcontroller/periodic.go, loadtest/user/userentity/report.go, and loadtest/user/userentity/user.go.)
Comment on lines +158 to +162
elapsed := time.Since(start).Seconds()
err := c.user.ObserveClientMetric(model.ClientTimeToFirstByte, elapsed)
if err != nil {
mlog.Warn("Failed to store observation", mlog.Err(err))
}
@agarciamontoro (Member):

I still think this is not semantically correct. I'm not sure if it's a problem with the naming (how are we using it exactly in the webapp?) or with the code here. For reference: I would understand this if the metric were called something like model.ClientTimeToLogin.

@agarciamontoro (Member):

Also, answering your comment in the other thread (sorry, just read it now):

> TIL! We want to measure that whenever a user logs in (thinking of a scenario where a user opens a web page). But the user could also be already logged in with a session, so we may need to cover that scenario too. The goal of the PR is to measure the load on the server caused by client metrics, but I think it could be a good idea to actually get realistic metrics from the agents.
> Maybe we can add this to the backlog, WDYT?

I'm OK with adding a backlog ticket to refine this, but I'd still like to understand how the webapp uses this specific metric; I think I lack some context here, sorry.

@isacikgoz (Member, Author):

An explanation can be found here: https://web.dev/articles/ttfb, and it turns out my interpretation was also incorrect. Just pushed a canonical way of measuring it using your suggestion 😅

isacikgoz and others added 9 commits July 30, 2024 11:59
Co-authored-by: Alejandro García Montoro <[email protected]>
* Enable ethtool metrics

* Increase network interface rx size on proxy instance

* Use retransmission rate instead of timeouts

* Review socket buffer sizes

* Update panel unit

* Add TCP retransmissions panel

* Fix datasource

* Make RX size dynamic

* Fix comment
* Update postgres client

* Make assets
Halve the value based on latest test results
Recover broken images and tweak things that have changed since this was
first written.
* MM-59319: Allow opensearch installations to be created

We unblock the prefix limitation and allow either prefix.

https://mattermost.atlassian.net/browse/MM-59319

* Fix test
* Allow 0 shard replicas

For 1-node ES clusters, we need 0 shard replicas, not 1. Otherwise, the cluster status check is never valid, since there are always unassigned shards (one per index) and consequently the cluster status is always yellow.

* Validate RestoreSnapshot options before marshaling
* Remove Cloudwatch log policy

There's a limit of 10 CloudWatch log policies per region per account, so it's not scalable to create one with every deployment. Instead, we rely on such a policy already being present in the AWS account. If it is not present, the only downside is that logs cannot be viewed through CloudWatch, but everything else should keep working.

* make assets

* Check needed policy and create if it doesn't exist

* Refactor CloudWatch logic to another file and test
@isacikgoz (Member, Author):

@agarciamontoro after discussing with @hmhealey, I reverted the TTFB measurement to be made only once, on login. I think that more or less reflects our current usual scenario. I'll merge it as it is now if you are okay with it.

@agarciamontoro (Member):

> @agarciamontoro after discussing with @hmhealey, I reverted the TTFB measurement to be made only once, on login. I think that more or less reflects our current usual scenario. I'll merge it as it is now if you are okay with it.

Ok! I still don't think the name is correct, though, since it's not the time to first byte, but the time for the whole login to finish. Is there a way we can rename it? Don't get me wrong, I think that "the time for the whole login to finish" is a great metric to measure, it's just that "TTFB" means a very specific thing.

@isacikgoz (Member, Author):

@agarciamontoro Yes, you are right in terms of naming, but this is just to simulate the load that client metrics bring. I'm not sure if we can use this metric for the agents anyway. We could add a new metric to simulate the same payload, but we would just be adding another time series in Prometheus. Is that something we should do?

@agarciamontoro (Member):

Yeah, for this PR I'm OK with what we have, as long as we don't deviate from the implementation in our clients. My concern is with the metric itself: if that's the name that we use in the clients, I think we should change it. But that's off-topic for this PR, actually.

My TL;DR for this PR is: if the implementation here mimics what our clients do, then let's merge it (and keep it updated with the clients' implementation in the future) :)

@isacikgoz (Member, Author):

@agarciamontoro Gotcha, yes, this is only to mimic what our clients do. I'd say let's go with it for now and keep discussing whether to rename it, use it per request, etc.

@agarciamontoro added the "4: Reviews Complete" label (All reviewers have approved the pull request) and removed the "2: Dev Review" label on Sep 16, 2024.
@agarciamontoro (Member):

Sounds good! All yours to merge, then :)

@hmhealey (Member):

@isacikgoz @agarciamontoro I think there might be some misunderstanding here still. The TTFB measurement that the web app reports is indeed a proper TTFB, but it's the TTFB for the initial request made from the app to the server when loading the page, not the TTFB for every single request. It's sent exactly once on each page load.

There's more technical information on it and the other web app metrics here: https://mattermost.atlassian.net/wiki/spaces/ICU/pages/2715418659/Grafana+Metrics#Time-to-First-Byte

@agarciamontoro (Member):

@hmhealey, can you point to the code measuring and reporting this?

@hmhealey (Member):

@agarciamontoro The code that reports it is here, but the calculations are all done either by the browser itself or by the Chrome team's web-vitals library. There's very little that we do ourselves here.

@agarciamontoro (Member):

Ah, ok ok, thanks, @hmhealey! Then I think we should move the login TTFB metric to the lowest level possible and measure it at the actual first byte. If we measure it in SimulController.login as we're doing now, and we defer the measurement, we're adding several finished requests to that metric. SimulController.login calls control.Login, which calls UserEntity.Login, which in turn calls the final Client4.Login. That's where we need to measure it, I believe.
I agree that in terms of load it will be the same, but now that we're adding it, I would like to get it right. That metric may be useful in tests as well. @isacikgoz, thoughts?
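
Something like this hypothetical helper is what I have in mind; the observe callback would be wired to ObserveClientMetric, and the context would be passed down to the final Client4.Login request (the package and names here are illustrative, not the merged code):

package loadtest

import (
	"context"
	"net/http/httptrace"
	"time"
)

// withTTFBObserver returns a context that calls observe with the elapsed
// seconds once the first response byte of the next request made with this
// context arrives. Attaching it right before the final login request means
// the metric covers exactly one request, not the whole login flow.
func withTTFBObserver(ctx context.Context, observe func(seconds float64)) context.Context {
	start := time.Now()
	trace := &httptrace.ClientTrace{
		GotFirstResponseByte: func() {
			observe(time.Since(start).Seconds())
		},
	}
	return httptrace.WithClientTrace(ctx, trace)
}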

@isacikgoz (Member, Author):

@agarciamontoro If you think that metric will be useful, then let's go for it. So basically we'll measure it in the HTTP trace, but a single time, with sync.Once, right?
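
Something along these lines (an illustrative sketch, not the final code):

package main

import (
	"fmt"
	"sync"
)

// ttfbOnce forwards only the first observation and ignores the rest, so the
// TTFB is reported a single time per simulated session, like one page load.
type ttfbOnce struct {
	once sync.Once
}

func (t *ttfbOnce) Observe(seconds float64) {
	t.once.Do(func() {
		// In the load test this would call user.ObserveClientMetric instead.
		fmt.Printf("TTFB: %.3fs\n", seconds)
	})
}

func main() {
	var t ttfbOnce
	t.Observe(0.123) // recorded
	t.Observe(0.456) // ignored
}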

@agnivade (Member):

I think we are spending way too much time here. The purpose of this PR is just to add coverage for the client metrics feature. It should not be taken as real data, because the clients aren't going to be on the same network as the server. Where it will be useful is when someone actually logs in with a browser and points it at the instance. That's why we are using "other"/"other" as the browser/platform combination for the load-test agent, so that we can clearly see real data when used from browsers.

It's already been months with this PR :)

@agarciamontoro (Member):

I'm OK with unblocking this PR and getting it merged. But I don't feel comfortable having a metric that doesn't do what its name suggests. So here's my proposal: let's merge this and, in parallel, create a ticket to address that concern, so that the tool comes closer to the real implementation. Thoughts?

@agnivade (Member):

Sounds good to me. 👍

@agarciamontoro (Member):

@isacikgoz, feel free to merge whenever you want! We can create the ticket at any time :)

@isacikgoz (Member, Author):

@agarciamontoro I just didn't find the time to run another load test with these changes. Should we merge without it?

@agarciamontoro (Member):

@isacikgoz A smoke test would be good, just to ensure all is well. I can try to run one if you don't have the time.

@isacikgoz (Member, Author):

@agarciamontoro It's not about the time; I'm generally not so familiar with the perf testing tool itself. If you have time, I'd appreciate the help :D

@agarciamontoro (Member):

Understandable! I'll run something quick as soon as I can, and maybe we can sync later at some point to do it together :)

@agarciamontoro (Member):

@isacikgoz, can you fix the conflicts so I can test over a clean branch?

@isacikgoz (Member, Author):

@agarciamontoro It should be good now.

@agarciamontoro (Member):

Just ran a dummy comparison and I can confirm it works perfectly: https://snapshots.raintank.io/dashboard/snapshot/ukK9uieg0hd1eca2S0caXwvZyCdwAKcM

As a bonus, here's a plot of the number of users against the number of reports submitted per minute, and it's the same, with exactly one minute of delay. Perfect alignment :)

@agarciamontoro merged commit 25b9f90 into master on Nov 11, 2024 (1 check passed) and deleted the add-client-metrics branch on November 11, 2024 at 19:55.