Add client metrics #751
Conversation
Looks great, I like the approach! I left some comments, looking forward to seeing more metrics 👀 Thanks for this work!
```go
err := c.user.ObserveClientMetric(model.ClientTimeToFirstByte, float64(time.Now().UnixMilli()-start)/1000)
if err != nil {
	mlog.Warn("Failed to store observation", mlog.Err(err))
}
```
At this point we've already made 3-4 requests, so I don't think this is the time to first byte, right? How are we defining the time to first byte in the webapp?
We can go quite low-level here and use something like https://pkg.go.dev/net/http/httptrace#WithClientTrace; there's a callback for GotFirstResponseByte, where we can observe the metric. But this is a per-request metric, so are we planning on measuring that in all requests? Or in some specific one?
TIL! We want to measure that whenever a user logs in (thinking of a scenario where a user opens a web page). But the user could also be logged in with the session, so we may need to cover that scenario. The goal of the PR is to measure the load on the server with client metrics, but I think it could be a good idea to actually get realistic metrics from the agents.
Maybe we can add this to backlog, WDYT?
Thanks! Gave this a first quick pass and left some comments.
Force-pushed from 2cfc265 to e8be503.
@isacikgoz - Just checking on this, is there anything left other than us reviewing this again?
loadtest/user/userentity/report.go (Outdated)
```go
func randomUserAgent() string {
	i := rand.Intn(len(userAgents))
```
Whaaat? Are you saying there's an equal chance of Safari sending metrics as Chrome? :p
Vanilla macOS users are quite common among devs, to be fair :p
```go
elapsed := time.Since(start).Seconds()
err := c.user.ObserveClientMetric(model.ClientTimeToFirstByte, elapsed)
if err != nil {
	mlog.Warn("Failed to store observation", mlog.Err(err))
}
```
I still think this is not semantically correct. I'm not sure if it's a problem with the naming (how are we using it exactly in the webapp?) or with the code here. For reference: I would understand this if the metric were called something like model.ClientTimeToLogin.
Also, answering your comment in the other thread (sorry, just read it now):

> TIL! We want to measure that whenever a user logs in (thinking of a scenario where a user opens a web page). But the user could also be logged in with the session, so we may need to cover that scenario. The goal of the PR is to measure the load on the server with client metrics, but I think it could be a good idea to actually get realistic metrics from the agents.
> Maybe we can add this to backlog, WDYT?
I'm ok adding a ticket to refine this to the backlog, but I'd still like to understand how the webapp uses this specific metric, I think I lack some context here, sorry.
An explanation can be found here: https://web.dev/articles/ttfb, and it turns out my interpretation is also incorrect. Just pushed a canonical way of measuring it with your suggestion 😅
Co-authored-by: Alejandro García Montoro <[email protected]>
* Enable ethtool metrics
* Increase network interface rx size on proxy instance
* Use retransmission rate instead of timeouts
* Review socket buffer sizes
* Update panel unit
* Add TCP retransmissions panel
* Fix datasource
* Make RX size dynamic
* Fix comment
* Update postgres client
* Make assets
Half the value based on latest test results
Recover broken images and tweak things that have changed since this was first written.
* MM-59319: Allow opensearch installations to be created

  We unblock the prefix limitation, and allow either prefix.
  https://mattermost.atlassian.net/browse/MM-59319
* Fix test
* Allow 0 shard replicas

  For 1-node ES clusters, we need 0 shard replicas, not 1. Otherwise, the cluster status check is never valid, since there are always unassigned shards (one per index) and subsequently the cluster status is always yellow.
* Validate RestoreSnapshot options before marshaling
* Remove Cloudwatch log policy

  There's a limit of 10 Cloudwatch log policies per region per account, so it's not scalable to create one with every deployment. Instead, we rely on such a policy already being present in the AWS account. If it is not present, the only downside is that logs cannot be viewed through Cloudwatch, but everything else should keep working.
* make assets
* Check needed policy and create if it doesn't exist
* Refactor CloudWatch logic to another file and test
@agarciamontoro after discussing with @hmhealey I reverted the TTFB measurement to be made only once, on login. I think that more or less reflects our current usual scenario. I'll merge it as it is now if you are okay with it.
Ok! I still don't think the name is correct, though, since it's not the time to first byte, but the time for the whole login to finish. Is there a way we can rename it? Don't get me wrong, I think that "the time for the whole login to finish" is a great metric to measure, it's just that "TTFB" means a very specific thing.
@agarciamontoro Yes, you are right in terms of naming, but this is just to simulate the load that client metrics bring. I'm not sure if we can use this metric for the agents anyway. We could add a new metric to simulate the same payload, but we would just add another time series in Prometheus. Is that something we should do?
Yeah, for this PR I'm ok with what we have as long as we don't deviate from the implementation in our clients. My concern is with the metric itself: if that's the name that we use in the clients, I think we should change it. But that's off-topic for this PR, actually. My TL;DR for this PR is: if the implementation here mimics what our clients do, then let's merge it (and keep it updated with the clients' implementation in the future) :)
@agarciamontoro gotcha, yes, this is only here to mimic what our clients do. I'd say let's go with it for now and keep discussing renaming it or using it per request, etc.
Sounds good! All yours to merge, then :)
@isacikgoz @agarciamontoro I think there might be some misunderstanding here still. The TTFB measurement that the web app reports is indeed a proper TTFB, but it's the TTFB for the initial request made from the app to the server when loading the page, not the TTFB for every single request. It's sent exactly once on each page load. There's more technical information on it and the other web app metrics here: https://mattermost.atlassian.net/wiki/spaces/ICU/pages/2715418659/Grafana+Metrics#Time-to-First-Byte
@hmhealey, can you point to the code measuring and reporting this?
@agarciamontoro The code that reports it is here, but the calculations are all done by either the browser itself or by the Chrome team's
Ah, ok ok, thanks, @hmhealey! Then I think we should move the login TTFB metric to the lowest level possible, and get it for the actual first byte. If we measure it in
@agarciamontoro If you think that metric will be useful then let's go for it. So basically we'll record it again in the http trace, but a single time with sync.Once, right?
I think we are spending way too much time here. The purpose of this PR is just to add coverage for the client metrics feature. It should not be taken as real data, because clients aren't going to be on the same network as the server. Where it will be useful is when someone actually logs in with a browser and points to the instance. That's why we are using "other"/"other" as the browser/platform combination for the load-test agent, so that we can clearly see real data when it's used from browsers. It's already been months with this PR :)
I'm ok unblocking this PR and getting it merged. But I don't feel comfortable having a metric that doesn't do what its name suggests. So here's my proposal: let's merge this and, in parallel, create a ticket to address that concern, so that the tool comes closer to the real implementation. Thoughts?
Sounds good to me. 👍
@isacikgoz, feel free to merge whenever you want! We can create the ticket at any time :)
@agarciamontoro I just didn't find the time to run another load test with these changes. Should we merge without it?
@isacikgoz A smoke test would be good, just to ensure all is well. I can try to run one if you don't have the time.
@agarciamontoro It's not about the time; generally I'm just not so familiar with the perf testing tool itself. If you have time, I'd appreciate the help :D
Understandable! I'll run something quick as soon as I can, and maybe we can sync later at some point to do it together :)
@isacikgoz, can you fix the conflicts so I can test over a clean branch?
@agarciamontoro It should be good now.
Just ran a dummy comparison and I can confirm it works perfectly: https://snapshots.raintank.io/dashboard/snapshot/ukK9uieg0hd1eca2S0caXwvZyCdwAKcM
As a bonus, here's a plot of the number of users against the number of reports submitted per minute, and it's the same, with exactly one minute of delay. Perfect alignment :)
Summary
Add a client metrics implementation. Since the performance report is generated from user activity, it's better to reflect that behaviour, which makes the payload much more realistic. I wanted to hear early feedback on whether this is a good approach.
I'll add the `ClientFirstContentfulPaint` and `ClientLargestContentfulPaint` metrics in the meantime.