Additional logs for the grpc conn. #1319

Merged
merged 6 commits into kubeshop:main from platform-conn-check on Nov 28, 2023

Conversation

huseyinbabal
Contributor

@huseyinbabal huseyinbabal commented Nov 23, 2023

Description

Changes proposed in this pull request:

  • Added additional logs for the gRPC connection
  • Set a timeout for the gRPC connection methods, since the retry mechanism somehow got stuck on the second retry after a 502 server response and kept the ctx open indefinitely (rough sketch below)
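
A rough sketch of the timeout idea described above (the grpcConn variable, its Connect method, and the 30-second value are placeholders, not the actual connector API; assumes context, fmt, and time imports):

// Bound only the connection phase so a retry that hangs after a 502 cannot
// keep the parent ctx open indefinitely.
connectCtx, cancel := context.WithTimeout(ctx, 30*time.Second) // timeout value is an assumption
defer cancel()

if err := grpcConn.Connect(connectCtx); err != nil {
	return fmt.Errorf("while connecting to the gRPC server: %w", err)
}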

Testing

Related issue(s)

@huseyinbabal huseyinbabal requested review from a team, madebyrogal and pkosiec and removed request for madebyrogal November 23, 2023 11:28
@huseyinbabal huseyinbabal added the enhancement New feature or request label Nov 23, 2023
Collaborator

@pkosiec pkosiec left a comment

I'm not sure whether the timeout is too small, but let's test the PR image on our Botkube environments to see if that helps for a longer period of time 👍

Also, consider adding deadlines (https://grpc.io/docs/guides/deadlines/) on the server side too, if that makes sense 👍
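
For reference, a Go gRPC server handler already receives the client's deadline via its ctx, so honoring it is mostly a matter of checking and propagating that context. A rough sketch (the server type, the ProcessEvent method, and the pb messages are hypothetical placeholders, not Botkube's actual protos):

func (s *server) ProcessEvent(ctx context.Context, req *pb.EventRequest) (*pb.EventResponse, error) {
	// ctx carries the client's deadline; bail out early if it has already expired.
	if err := ctx.Err(); err != nil {
		return nil, status.FromContextError(err).Err()
	}
	// Propagate ctx downstream so long-running work is cancelled when the deadline hits.
	return s.processor.Handle(ctx, req)
}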

@huseyinbabal
Contributor Author

As a summary:

  • I tested the initial version in dev, but the timeout alone didn't fix it.
  • The problem was caused by the exponential back-off not being reset, which left the retry stuck even after we restarted the Slack processor.
  • I have added an exponential back-off reset mechanism based on a failure counter (rough sketch below).
  • Now, whenever the agent connects to the router, it resets failuresNo; since the back-off strategy is based on failuresNo, retries after processor restarts on the agent side are faster.
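
A minimal sketch of the resettable back-off idea, matching the retry.DelayType(resettableBackoff) option visible in the diff below; the growth formula and the cap are assumptions, only the "delay depends on b.failuresNo, which is reset on success" part comes from this PR:

// Derive the delay from our own failuresNo counter instead of the attempt
// number retry-go passes in, so setting b.failuresNo = 0 after a successful
// connection restarts the exponential back-off from the beginning.
resettableBackoff := func(_ uint, _ error, _ *retry.Config) time.Duration {
	shift := b.failuresNo
	if shift > 10 { // hypothetical cap to keep the delay bounded
		shift = 10
	}
	return retryDelay * time.Duration(1<<shift)
}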

@huseyinbabal huseyinbabal requested a review from pkosiec November 27, 2023 10:03
Collaborator

@pkosiec pkosiec left a comment

Looks very promising!

if !lastFailureTimestamp.IsZero() && time.Since(lastFailureTimestamp) >= successIntervalDuration {
	// if the last run was long enough, we treat it as a success, so we reset failures
	log.Infof("Resetting failures counter as last failure was more than %s ago", successIntervalDuration)
	b.failuresNo = 0
Collaborator

@pkosiec pkosiec Nov 27, 2023

We could also reset the failure message here, or doesn't that make sense?

},
retry.OnRetry(func(_ uint, err error) {
	log.Warnf("Retrying Cloud Teams startup (attempt no %d/%d): %s", b.failuresNo, maxRetries, err)
}),
retry.Delay(retryDelay),
retry.DelayType(resettableBackoff),
retry.Attempts(0), // infinite; we cancel it on our own
retry.LastErrorOnly(true),
retry.Context(ctx),
Collaborator

Should we also add something similar here, like the following?

	b.failuresNo = 0 // Reset the failures to start exponential back-off from the beginning
	b.setFailureReason("")
	b.log.Info("Botkube connected to Slack!")


Contributor Author

@huseyinbabal huseyinbabal Nov 27, 2023

I couldn't get the point here at first, could you please elaborate a bit more? My bad, yes, now I get it. I will modify Teams as well.

Collaborator

While it wasn't the actual reason, IMHO it would be good to add the same thing we have in Cloud Teams:

ctxTimeout, cancelFn := context.WithTimeout(ctx, cloudTeamsConnectTimeout)
defer cancelFn()
err = svc.Start(ctxTimeout)
if err != nil {
	return fmt.Errorf("while starting gRPC connector %w", err)
}

Contributor Author

Actually, I should remove this from Teams too; I was testing Slack first. The problem with ctxTimeout is that it cancels the existing streaming connection even after the start has been initiated. Say I provide a 15-second timeout: the connection is created successfully, but during streaming it just gets cancelled, since the context is not only used for the connection but also for the gRPC streaming operation.
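
To illustrate that pitfall, a rough sketch reusing the snippet above (the 15-second value comes from the example in this comment):

// ctxTimeout governs the whole lifetime of Start, not just the dial.
ctxTimeout, cancelFn := context.WithTimeout(ctx, 15*time.Second)
defer cancelFn()
// The connection may be established well within 15s, but the stream running
// inside Start is cancelled as soon as the deadline fires, even though it is healthy.
err = svc.Start(ctxTimeout)

That is why the timeout is dropped here, and the back-off reset covers the stuck-retry case instead.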

Collaborator

aha, ok, makes sense 👍

Collaborator

@pkosiec pkosiec left a comment

LGTM based on the code review and testing on dev 👍

I think we can merge this PR, run Botkube from the latest main image for all our envs, and observe the behavior 👍

@huseyinbabal huseyinbabal merged commit 24ddf27 into kubeshop:main Nov 28, 2023
15 checks passed
@huseyinbabal huseyinbabal deleted the platform-conn-check branch November 28, 2023 09:22