add handler metrics to bus and saga #101

danielwitz · 2019-07-15T17:06:17Z

Adding:

handler metrics: success, failure counters + running latency
rejected messages counter.
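
A minimal sketch of what wrapping a handler with these metrics could look like. It uses plain fields instead of Prometheus collectors, so `handlerStats` and `runWithStats` are hypothetical names for illustration and not the code in this PR; the sketch is also not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBoom is a sample handler error used by the demo below.
var errBoom = errors.New("boom")

// handlerStats is a hypothetical stand-in for the per-handler Prometheus
// counters and latency metric that the PR registers.
type handlerStats struct {
	success      int
	failure      int
	totalLatency time.Duration
}

// runWithStats invokes handle, records how long the call took, and bumps
// the success or failure counter depending on the returned error.
func runWithStats(stats *handlerStats, handle func() error) error {
	start := time.Now()
	err := handle()
	stats.totalLatency += time.Since(start)
	if err != nil {
		stats.failure++
	} else {
		stats.success++
	}
	return err
}

func main() {
	stats := &handlerStats{}
	_ = runWithStats(stats, func() error { return nil })
	_ = runWithStats(stats, func() error { return errBoom })
	fmt.Printf("success=%d failure=%d\n", stats.success, stats.failure)
}
```

The wrapper returns the handler's error unchanged, so callers can keep their existing ack/reject logic.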

Relates to this issue

danielwitz · 2019-07-17T13:36:39Z

gbus/metrics/handler_metrics.go

+	return &HandlerMetrics{
+		result: promauto.NewCounterVec(
+			prometheus.CounterOpts{
+				Namespace: GrabbitNamespace,


this will add the grabbit prefix to the metrics name
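
For reference, a small sketch of how the prefix ends up in the metric name. The Prometheus Go client joins the non-empty `Namespace`, `Subsystem` and `Name` options with underscores; `fqName` below mimics that rule with the standard library only. The value `"grabbit"`, the subsystem `"handlers"` and the name `"result"` are assumptions for illustration, since the diff only shows the `GrabbitNamespace` identifier.

```go
package main

import (
	"fmt"
	"strings"
)

// grabbitNamespace is assumed to be "grabbit"; the diff does not show the
// actual value of GrabbitNamespace.
const grabbitNamespace = "grabbit"

// fqName mimics how the Prometheus Go client builds a fully qualified
// metric name: non-empty parts are joined with underscores, and an empty
// name yields an empty result.
func fqName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	// Hypothetical subsystem and metric name.
	fmt.Println(fqName(grabbitNamespace, "handlers", "result"))
}
```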

rhinof · 2019-07-17T16:03:43Z

gbus/metrics/handler_metrics.go

+}
+
+func RunHandlerWithMetric(handleMessage func() error, handlerName string, logger logrus.FieldLogger) error {
+	handlerMetrics, ok := handlerMetricsByHandlerName[handlerName]


Synchronizing access to the map?

danielwitz · Jul 18, 2019: Why do I need to synchronize reads?

rhinof · Jul 18, 2019: A read can conflict with a concurrent write; Go maps are not safe for that. Maybe change the type to sync.Map instead of the regular map and put this behind you.
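
A rough sketch of the sync.Map suggestion, using only the standard library. `handlerMetrics`, `metricsByHandler` and `getOrCreate` are hypothetical names, not the code that was eventually merged.

```go
package main

import (
	"fmt"
	"sync"
)

// handlerMetrics is a hypothetical placeholder for the per-handler metrics.
type handlerMetrics struct {
	name string
}

// metricsByHandler replaces a plain map. sync.Map is safe for concurrent
// reads and writes without additional locking.
var metricsByHandler sync.Map

// getOrCreate returns the metrics stored for handlerName, creating them on
// first use. LoadOrStore guarantees every caller gets the same instance.
func getOrCreate(handlerName string) *handlerMetrics {
	if m, ok := metricsByHandler.Load(handlerName); ok {
		return m.(*handlerMetrics)
	}
	m, _ := metricsByHandler.LoadOrStore(handlerName, &handlerMetrics{name: handlerName})
	return m.(*handlerMetrics)
}

func main() {
	var wg sync.WaitGroup
	results := make([]*handlerMetrics, 8)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = getOrCreate("OrderHandler")
		}(i)
	}
	wg.Wait()

	same := true
	for _, r := range results {
		same = same && r == results[0]
	}
	fmt.Println("all goroutines share one instance:", same)
}
```

One caveat when applying this to real collectors: LoadOrStore may build a value that ends up discarded. That is harmless here, but promauto panics on duplicate registration, so the real code would need to create the collector only once.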

rhinof · 2019-07-17T18:45:48Z

gbus/abstractions.go

@@ -109,6 +112,12 @@ type HandlerRegister interface {
 //MessageHandler signature for all command handlers
 type MessageHandler func(invocation Invocation, message *BusMessage) error

+func (mg MessageHandler) Name() string {


It doesn't feel right that an implementation is defined in the abstractions.go file.
Maybe extract both the MessageHandler type and the Name() method to a message_handler.go file?
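
The body of Name() is not shown in the diff. One common way to derive a name from a function value, sketched here as an assumption rather than as what the PR does, is to ask the runtime for the function behind the value's code pointer. `messageHandler` is a simplified stand-in for gbus.MessageHandler.

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
	"strings"
)

// messageHandler is a simplified stand-in for gbus.MessageHandler.
type messageHandler func(message string) error

// Name returns the fully qualified function name reported by the Go
// runtime, for example "main.handleOrder". It returns "" for a nil handler.
func (mh messageHandler) Name() string {
	return runtime.FuncForPC(reflect.ValueOf(mh).Pointer()).Name()
}

// handleOrder is a sample handler.
func handleOrder(message string) error { return nil }

func main() {
	name := messageHandler(handleOrder).Name()
	// The package prefix depends on where the code is compiled, so only
	// the suffix is checked here.
	fmt.Println(strings.HasSuffix(name, "handleOrder"))
}
```

Keeping the type and its method together in one file, as suggested above, also keeps this reflection detail out of the interface definitions.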

rhinof · 2019-07-17T18:46:55Z

gbus/builder/builder.go

@@ -107,6 +107,7 @@ func (builder *defaultBuilder) Build(svcName string) gbus.Bus {
 			panic(err)
 		}
 	}
+


This file contains no changes (other than a blank line); it should be reverted and not included in the PR.

rhinof · 2019-07-17T18:51:53Z

gbus/metrics/handler_metrics.go

+	return err
+}
+
+func ReportHandlerExceededMaxRetries(handlerName string, logger logrus.FieldLogger) {


I do not think this metric should be reported on handlers but rather on rejected messages.
In the current setup, and with the above code, if a message is rejected it may result in multiple metrics being reported for the same message, which seems wrong.
I think this should be in a message_metrics.go file and be named ReportRejectedMessage.
@danielwitz, @vladshub WDYT?

danielwitz · Jul 18, 2019: Regardless, the metric for failed message handling might be reported multiple times for the same message; I didn't find a way around that.
In the current setup, if the handler exceeds the retry count, the failure metric will have been incremented by that count and the exceeded-retries metric will have been incremented only once.
Rejected message might be a better name, though.

rhinof · Jul 18, 2019: The handler can indeed be invoked more than once; that is why the rejected message metric should only be reported (once) after all retries fail.
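
To make the counting concrete, here is a minimal sketch of the behaviour discussed above: the failure counter moves on every failed attempt, while the rejected counter moves once, after the last attempt fails. `counters`, `processWithRetries` and `maxAttempts` are hypothetical names; grabbit's actual retry loop is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errHandler is a sample handler error used by the demo below.
var errHandler = errors.New("handler failed")

// counters is a hypothetical stand-in for the Prometheus counters.
type counters struct {
	handlerFailures  int // incremented on every failed attempt
	rejectedMessages int // incremented once per rejected message
}

// processWithRetries calls handle up to maxAttempts times. Every failed
// attempt is counted as a handler failure; the message is counted as
// rejected only once, after all attempts have failed.
func processWithRetries(c *counters, maxAttempts int, handle func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = handle(); err == nil {
			return nil
		}
		c.handlerFailures++
	}
	c.rejectedMessages++
	return err
}

func main() {
	c := &counters{}
	_ = processWithRetries(c, 3, func() error { return errHandler })
	fmt.Printf("failures=%d rejected=%d\n", c.handlerFailures, c.rejectedMessages)
}
```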

rhinof · 2019-07-17T18:55:25Z

gbus/worker.go

@@ -321,6 +320,9 @@ func (worker *worker) processMessage(delivery amqp.Delivery, isRPCreply bool) {
 	if err == nil {
 		_ = worker.ack(delivery)
 	} else {
+		for _, handler := range handlers {
+			metrics.ReportHandlerExceededMaxRetries(handler.Name(), worker.log())


I think this metric should be replaced with a ReportMessageRejected metric that is called only once if the message is rejected, and not per handler.

vladshub · Jul 18, 2019: And the report should include the message name rather than the handler name?

danielwitz · Jul 18, 2019: I think maybe we should have both, because it's on the line between the messages domain and the handlers domain.

rhinof · Jul 18, 2019: @danielwitz what benefit do we gain from reporting this metric at the handler level? Won't it just create information overload that won't be used?

danielwitz · Jul 18, 2019: @rhinof I thought about it some more and came to the conclusion that a metric about rejected messages is exactly what we wanted.
I'll add that metric instead of ExceededMaxRetries.
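
A sketch of the agreed direction, assuming a counter keyed by message name: the rejection is reported once in the error branch instead of once per handler in a loop. `reportMessageRejected`, `rejectedByMessage` and the `"OrderCreated"` message name are hypothetical, and this `processMessage` is heavily simplified compared to the one in gbus/worker.go.

```go
package main

import (
	"errors"
	"fmt"
)

// errFail is a sample handler error used by the demo below.
var errFail = errors.New("handler failed")

// rejectedByMessage is a hypothetical counter keyed by message name,
// standing in for a Prometheus counter with a message-name label.
var rejectedByMessage = map[string]int{}

// reportMessageRejected records one rejection for the given message name.
func reportMessageRejected(messageName string) {
	rejectedByMessage[messageName]++
}

// processMessage runs the handlers in order and stops at the first error.
// A rejection is reported once per message, however many handlers are
// registered. It returns true when the message would be acked.
func processMessage(messageName string, handlers []func() error) bool {
	var err error
	for _, handle := range handlers {
		if err = handle(); err != nil {
			break
		}
	}
	if err != nil {
		reportMessageRejected(messageName)
		return false
	}
	return true
}

func main() {
	failing := func() error { return errFail }
	acked := processMessage("OrderCreated", []func() error{failing, failing, failing})
	fmt.Printf("acked=%t rejected=%d\n", acked, rejectedByMessage["OrderCreated"])
}
```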

* add handler metrics to bus and saga (#101)
  * add handler metrics to bus and saga + tests
  * fix build
  * add 0 to the default buckets to catch fast message handling
  * PR correction - changed latency to summary (removed bucket configuration), add registration for saga handlers
  * PR correction - getting logger as a param
  * PR correction - new line in eof
  * PR corrections message handler + sync.map + latency as summary
  * add rejected messages metric
* dead letter handler should reject messages on failures and rollbacks and ack on commit success (#105)
  * dead letter handler should reject messages on failures and rollbacks
* return an error from the saga store when deleting a saga if saga can not be found (#110)
  In order to deal with concurrent deletes of the same saga instance we want to indicate that deleting the saga failed if the saga is not stored, so callers can take proper action
* Persisted timeouts (#107)
  * decouple transaction manager from glue
  * moved timeout manager to gbus/tx package
  * initial commit in order to support persisted timeouts
  * first working version of a mysql persisted timeout manager
  * fixing ci lint errors
  * refactored ensure schema of timeout manager
  * cleanup timeout manager when bus shuts down
  * fixing formatting issues
  * changed logging level from Info to Debug when inserting a new timeout
  * reusing timeouts tablename (PR review)
  * renamed AcceptTimeoutFunction to SetTimeoutFunction on the TimeoutManager interface (PR review)
  * refactored glue to implement the Logged interface and use the GLogged helper struct
  * locking timeout record before executing timeout
    In order to prevent a timeout being executed twice due to two concurrent grabbit instances running the same service, a lock (FOR UPDATE) has been placed on the timeout record in the scope of the executing transaction
  * Committing the select transaction when querying for pending timeouts
  * feat(timeout:lock): This is done in order to reduce contention and allow for parallel processing in case of multiple grabbit instances
* Enable returning a message back from the dead to the queue (#112)
  * enable sending raw messages
  * return to q
  * return dead to q
  * allow no retries
  * test - resend dead to queue
  * test - resend dead to queue - fixes after cr
* added metric report on saga timeout (#114)
  1) added reporting saga timeouts to the glue component
  2) fixed mysql timeoutmanager error when trying to clear a timeout
* Added documentation for grabbit metrics (#117)
  * added initial documentation for grabbit metrics
  * including metrics section in readme.md
* fixing goreportcard issues (#118)
* removed logging a warning when worker message channel returns an error (#116)
* corrected saga metrics name and added to metrics documentation (#119)
  * corrected saga metrics name and added documentation
  * corrected saga metric name
  * corrected typos
* removed non transactional bus mode (#120)
  * remove fields
* go fmt and go lint error fixes to improve goreportcard (#126)
  * go fmt on some files
  * go fmt
  * added comments on exported types
* consume the messages channel via ranging over the channel to prevent empty deliveries (#125)
* Migrations functionality (#111)
  * implement migrations
  * migrations
  * fix tests error
  * add migrations
  * migrations - timeout table migration
  * test - resend dead to queue - fixes after cr
  * migration to grabbit (use forked migrator)
  * remove fields
* sanitize migrations table name (#130)
* more linting fixes for goreportcard (#129)
* added metrics on deadLetterHandler, refactored HandleDeadLetter inter… (#122)
  * added metrics on deadLetterHandler, refactored HandleDeadLetter interface to receive new DeadLetterMessageHandler type
  * fix dead letter test and a build error
  * added documentation for DeadLetterMessageHandler, also fixed poison spelling throughout code
  * retrigger build
* align migrations table name with grabbit convention (#140)
* Improved tracing and added documentation (#142)
* Support handling raw message (#138)
* added call to worker.span.Finish() when exiting processMessage (#145)
* bug fix - when a deadletterhandler panics grabbit fails to reject the… (#136)
  * bug fix - when a deadletterhandler panics grabbit fails to reject the message
  * BPINFRA125 - MERGE MASTER INTO BRANCH
* calling channel.Cancel when worker is stopped (#149)
* Handle empty body messages (#147)
* fixing golint warnings from goreport card (#150)
* more golint fixes (#152)

rhinof · 2019-10-09T10:38:16Z

#36

Guy Baron and others added 9 commits June 9, 2019 12:01

Merge pull request wework#85 from wework/v1.x

51a7b20

Merge the 1.0.2 release into master

Merge pull request wework#93 from wework/v1.x

5624310

merge v1.x into master

Merge pull request wework#99 from wework/v1.x

5d077c4

merge v1.x into master

add handler metrics to bus and saga + tests

28bc512

fix build

bdf85a1

add 0 to the default buckets to catch fast message handling

0bdb7fd

PR correction - changed latency to summary(removed bucket configurati…

38eaddb

…on), add registration for saga handlers

PR correction - getting logger as a param

8c551a6

PR correction - new line in eof

e546eb2

danielwitz commented Jul 17, 2019

rhinof changed the base branch from master to v1.x July 17, 2019 16:00

rhinof reviewed Jul 17, 2019

daniel witz added 2 commits July 21, 2019 09:41

PR corrections message handler + sync.map + latency as summary

b573744

add rejected messages metric

4a6fcd9

danielwitz mentioned this pull request Jul 21, 2019

Report message handling metrics #103

Closed

rhinof approved these changes Jul 22, 2019

rhinof merged commit f617e04 into wework:v1.x Jul 22, 2019
