Additional logs for the grpc conn. #1319

Merged
merged 6 commits into kubeshop:main from platform-conn-check on Nov 28, 2023

Conversation

huseyinbabal
Contributor

@huseyinbabal huseyinbabal commented Nov 23, 2023

Description

Changes proposed in this pull request:

  • Added additional logs for the gRPC connection
  • Set a timeout for the gRPC connection methods, since the retry mechanism somehow got stuck on the second retry after a 502 server response and kept the ctx open indefinitely (rough sketch below)
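
A rough sketch of the timeout idea described above (the grpcConn variable, its Connect method, and the 30-second value are placeholders, not the actual connector API; assumes context, fmt, and time imports):

// Bound only the connection phase so a retry that hangs after a 502 cannot
// keep the parent ctx open indefinitely.
connectCtx, cancel := context.WithTimeout(ctx, 30*time.Second) // timeout value is an assumption
defer cancel()

if err := grpcConn.Connect(connectCtx); err != nil {
	return fmt.Errorf("while connecting to the gRPC server: %w", err)
}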

Testing

Related issue(s)

@huseyinbabal huseyinbabal requested review from a team, madebyrogal and pkosiec and removed request for madebyrogal November 23, 2023 11:28
@huseyinbabal huseyinbabal added the enhancement New feature or request label Nov 23, 2023
Collaborator

@pkosiec pkosiec left a comment

I'm not sure whether the timeout is too small, but let's test the PR image on our Botkube environments to see if that helps for a longer period of time 👍

Also, consider adding deadlines (https://grpc.io/docs/guides/deadlines/) on the server side too, if that makes sense 👍
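
For reference, a Go gRPC server handler already receives the client's deadline via its ctx, so honoring it is mostly a matter of checking and propagating that context. A rough sketch (the server type, the ProcessEvent method, and the pb messages are hypothetical placeholders, not Botkube's actual protos):

func (s *server) ProcessEvent(ctx context.Context, req *pb.EventRequest) (*pb.EventResponse, error) {
	// ctx carries the client's deadline; bail out early if it has already expired.
	if err := ctx.Err(); err != nil {
		return nil, status.FromContextError(err).Err()
	}
	// Propagate ctx downstream so long-running work is cancelled when the deadline hits.
	return s.processor.Handle(ctx, req)
}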

@huseyinbabal
Contributor Author

As a summary:

  • I tested the initial version in dev, but the timeout alone didn't fix it.
  • The problem was caused by the exponential back-off not being reset, which left the retry stuck even after we restarted the Slack processor.
  • I have added an exponential back-off reset mechanism based on a failure counter (rough sketch below).
  • Now, whenever the agent connects to the router, it resets failuresNo; since the back-off strategy is based on failuresNo, retries after processor restarts on the agent side are faster.
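
A minimal sketch of the resettable back-off idea, matching the retry.DelayType(resettableBackoff) option visible in the diff below; the growth formula and the cap are assumptions, only the "delay depends on b.failuresNo, which is reset on success" part comes from this PR:

// Derive the delay from our own failuresNo counter instead of the attempt
// number retry-go passes in, so setting b.failuresNo = 0 after a successful
// connection restarts the exponential back-off from the beginning.
resettableBackoff := func(_ uint, _ error, _ *retry.Config) time.Duration {
	shift := b.failuresNo
	if shift > 10 { // hypothetical cap to keep the delay bounded
		shift = 10
	}
	return retryDelay * time.Duration(1<<shift)
}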

@huseyinbabal huseyinbabal requested a review from pkosiec November 27, 2023 10:03
Collaborator

@pkosiec pkosiec left a comment

Looks very promising!

if !lastFailureTimestamp.IsZero() && time.Since(lastFailureTimestamp) >= successIntervalDuration {
	// if the last run was long enough, we treat it as a success, so we reset failures
	log.Infof("Resetting failures counter as last failure was more than %s ago", successIntervalDuration)
	b.failuresNo = 0
Collaborator

@pkosiec pkosiec Nov 27, 2023

We could also reset the failure message here, or doesn't that make sense?

},
retry.OnRetry(func(_ uint, err error) {
	log.Warnf("Retrying Cloud Teams startup (attempt no %d/%d): %s", b.failuresNo, maxRetries, err)
}),
retry.Delay(retryDelay),
retry.DelayType(resettableBackoff),
retry.Attempts(0), // infinite; we cancel it on our own
retry.LastErrorOnly(true),
retry.Context(ctx),
Collaborator

Should we also add something similar here, like the following?

	b.failuresNo = 0 // Reset the failures to start exponential back-off from the beginning
	b.setFailureReason("")
	b.log.Info("Botkube connected to Slack!")


Contributor Author

@huseyinbabal huseyinbabal Nov 27, 2023

I couldn't get the point here at first, could you please elaborate a bit more? My bad, yes, now I get it. I will modify Teams as well.

Collaborator

While it wasn't the actual reason, IMHO it would be good to add the same thing we have in Cloud Teams:

ctxTimeout, cancelFn := context.WithTimeout(ctx, cloudTeamsConnectTimeout)
defer cancelFn()
err = svc.Start(ctxTimeout)
if err != nil {
	return fmt.Errorf("while starting gRPC connector %w", err)
}

Contributor Author

Actually, I should remove this from Teams too; I was testing Slack first. The problem with ctxTimeout is that it cancels the existing streaming connection even after the start has been initiated. Say I provide a 15-second timeout: the connection is created successfully, but during streaming it just gets cancelled, since the context is not only used for the connection but also for the gRPC streaming operation.
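
To illustrate that pitfall, a rough sketch reusing the snippet above (the 15-second value comes from the example in this comment):

// ctxTimeout governs the whole lifetime of Start, not just the dial.
ctxTimeout, cancelFn := context.WithTimeout(ctx, 15*time.Second)
defer cancelFn()
// The connection may be established well within 15s, but the stream running
// inside Start is cancelled as soon as the deadline fires, even though it is healthy.
err = svc.Start(ctxTimeout)

That is why the timeout is dropped here, and the back-off reset covers the stuck-retry case instead.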

Collaborator

aha, ok, makes sense 👍

Collaborator

@pkosiec pkosiec left a comment

LGTM based on the code review and testing on dev 👍

I think we can merge this PR, run Botkube from the latest main image for all our envs, and observe the behavior 👍

@huseyinbabal huseyinbabal merged commit 24ddf27 into kubeshop:main Nov 28, 2023
15 checks passed
@huseyinbabal huseyinbabal deleted the platform-conn-check branch November 28, 2023 09:22