Fixed DefaultWaitTimeoutForHCPControlPlaneInMinutes and timeouts while still installing #723

Open
chrisahl opened this issue Jul 18, 2024 · 3 comments

Comments

@chrisahl

We are attempting to bulk-create 20 ROSA clusters at a time in the same AWS account and region. Creation appears to be throttled to 13 ROSA clusters at a time, so no additional ROSA deployment starts until one of those 13 completes. As a result, we are seeing timeouts.
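
To illustrate why a fixed wait could expire for the queued clusters, here is a rough sketch in Go. It rests on assumptions that are not confirmed anywhere in this thread: every install takes the same time (15 minutes is a made-up figure), at most 13 installs run at once, and the wait clock for every cluster starts when the bulk create is submitted. The function names are hypothetical and are not part of the provider.

```go
package main

import "fmt"

// finishMinutes returns, for each of n clusters submitted at minute 0, the
// minute at which its install completes when at most `slots` installs run
// concurrently and each install takes `installMinutes`.
// All inputs are hypothetical illustration values.
func finishMinutes(n, slots, installMinutes int) []int {
	out := make([]int, n)
	for i := range out {
		wave := i / slots // 0 for the first batch, 1 for the next, and so on
		out[i] = (wave + 1) * installMinutes
	}
	return out
}

// countTimedOut counts the clusters whose install finishes after the timeout.
func countTimedOut(finish []int, timeoutMinutes int) int {
	late := 0
	for _, f := range finish {
		if f > timeoutMinutes {
			late++
		}
	}
	return late
}

func main() {
	// 20 clusters, 13 concurrent installs, assumed 15 minutes per install.
	finish := finishMinutes(20, 13, 15)
	fmt.Println(countTimedOut(finish, 20)) // 7
}
```

Under these assumptions the 7 clusters in the second batch finish at minute 30 and exceed a 20-minute wait, even though each individual install is well within the limit.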

Is there a dynamic way to change this hard-coded value?

DefaultWaitTimeoutForHCPControlPlaneInMinutes = int64(20)

Is there a reason for 20 minutes rather than something larger? Are there any other suggestions for achieving higher success rates?

Thanks.

@willgarcia

willgarcia commented Sep 16, 2024

Hi @chrisahl

I am a user of this provider and suspect I am hitting the same issue when deploying clusters in bulk.

My clusters show in ready state but Terraform fails with the following error: "Waiting for cluster creation finished with the error".

Is that the error you see?

Judging by the places in the code that emit this message, the underlying error should be appended to the end of it, but that does not seem to happen in my case. I would like to confirm that it is timeout related.

When Terraform is re-run, the clusters also get deleted because of the errored Terraform state, so it takes a long time to get lucky.

@chrisahl
Author

@willgarcia In my case the error reports "installing", because the wait is timing out. I think it would be good if DefaultWaitTimeoutForHCPControlPlaneInMinutes were parameterized, similar to how DefaultWaitTimeoutInMinutes can be overridden with:

resource "rhcs_cluster_wait" "rosa_cluster" {
  cluster = rhcs_cluster_rosa_classic.rosa_sts_cluster.id
  # timeout in minutes
  timeout = 60
}

This would help because some AWS regions take longer than others to provision, depending on your geographic location and on the time of day or load.

@chrisahl
Author

https://issues.redhat.com/browse/OCM-12006 was recently opened and may help get this addressed.
