
Nuke stalls #426

Closed · YuriGal opened this issue Nov 21, 2024 · 42 comments

YuriGal commented Nov 21, 2024

We have an account in desperate need of cleanup; it's been used as a playground and has tons of old, stale resources. When I run nuke on this account, it just stalls without outputting anything, not even the list of resources that would be removed. Any idea what's causing it?

YuriGal (Author) commented Nov 21, 2024

My theory is that it hits a resource type with a huge number of resources and just gets stuck on it. If so, can I bypass it somehow? (I know I can add it to the exclusion list, but I'd have to guess which resource type it is every time this happens.) And also, how do I actually clean up these resources?

YuriGal (Author) commented Nov 21, 2024

Like, maybe have an option for a maximum number of resources to retrieve? That way the nuke could be done in several consecutive runs.

ekristen (Owner) commented:

Is S3Object excluded in your configuration?

YuriGal (Author) commented Nov 21, 2024

Yes. As a matter of fact, I tried to target CloudWatchLogsLogGroup exclusively, but this account has 34K of those.

ekristen (Owner) commented:

Well, this gives us a place to start...

Unfortunately, 34k is a LOT, and there are rate limits involved with those APIs.

We can only make 10 describe queries per second, returning 50 log groups at a time, and then 15 log-stream queries per second.

https://github.com/ekristen/aws-nuke/blob/main/resources/cloudwatchlogs-loggroups.go#L44-L49

34k at 50 per query is 680 queries just to describe the log groups. Because we also query the log streams for some additional metadata, there's one extra query per log group at a maximum of 15 per second, so that's 34k at 15/second, roughly 37 minutes just to discover everything.

I can add some debug logging to a special build for testing via GitHub Actions. It's likely just taking forever to query everything.

It's possible I could add a setting to bypass querying the log streams; that would cut out the 37 minutes of extra time.
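
For illustration, here is a rough sketch of what that kind of rate-limited discovery looks like with aws-sdk-go v1 and golang.org/x/time/rate; the limiter values mirror the numbers above, and this is not aws-nuke's actual code (see the linked source for that):

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
	"golang.org/x/time/rate"
)

func main() {
	svc := cloudwatchlogs.New(session.Must(session.NewSession()))

	// DescribeLogGroups: at most 10 calls per second, 50 log groups per page.
	groupLimiter := rate.NewLimiter(rate.Limit(10), 1)
	// DescribeLogStreams: at most 15 calls per second, one call per log group.
	streamLimiter := rate.NewLimiter(rate.Limit(15), 1)

	ctx := context.Background()
	input := &cloudwatchlogs.DescribeLogGroupsInput{Limit: aws.Int64(50)}
	total := 0

	for {
		_ = groupLimiter.Wait(ctx)
		page, err := svc.DescribeLogGroups(input)
		if err != nil {
			panic(err)
		}
		for _, lg := range page.LogGroups {
			total++
			// One extra call per log group for stream metadata, capped at 15/second.
			_ = streamLimiter.Wait(ctx)
			_, _ = svc.DescribeLogStreams(&cloudwatchlogs.DescribeLogStreamsInput{
				LogGroupName: lg.LogGroupName,
				Limit:        aws.Int64(1),
			})
		}
		if page.NextToken == nil {
			break
		}
		input.NextToken = page.NextToken
	}

	// At 34k log groups and 15 stream calls per second, discovery alone takes
	// roughly 34000 / 15 ≈ 2267 seconds, i.e. about 37-38 minutes.
	fmt.Println("discovered", total, "log groups")
}
```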

YuriGal (Author) commented Nov 21, 2024

It's not just log groups; there are other resource types with huge numbers of resources. Would it be possible to add a generic option to limit the number of resources nuke enumerates per resource type, something like --max-resources 5000, so it doesn't have to enumerate everything?

ekristen (Owner) commented:

Interesting idea. What do you think would be more useful, per resource type or global?

--max-per-resource-type=50 would limit to top 50 of each resource type.

--max-resources=5000 would limit to the first 5000 discovered (this might be harder to implement)

YuriGal (Author) commented Nov 21, 2024

I think per resource type is the better option. Nuke has no problem going over a large number of resource types; it's when a particular resource type has a huge number of resources that it gets stuck. Limiting the number of resources retrieved per resource type should solve this issue.

ekristen (Owner) commented:

My only question is whether it should be per resource type instead of global for all resource types, like --max-per-resource-type resourcetype=blah.

YuriGal (Author) commented Nov 21, 2024

Having the option per specific resource type is good if you know which resource type is causing trouble; then you can target that type specifically. But sometimes you don't know in advance which type it's going to be. If it's possible to have it both ways (if a type is specified, apply the limit only to that type, otherwise to all types), that would be the best of both worlds.
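
To make the "both ways" idea concrete, here is a hypothetical sketch in Go; neither --max-resources nor --max-per-resource-type exists in aws-nuke today, and the names, types, and defaults below are illustrative only:

```go
package main

import "fmt"

// scanLimits holds a hypothetical global cap plus per-resource-type overrides,
// as they might be parsed from --max-resources and --max-per-resource-type flags.
type scanLimits struct {
	global  int
	perType map[string]int
}

// reached reports whether discovery should stop for the given resource type.
// A per-type limit takes precedence; otherwise the global limit applies.
func (l scanLimits) reached(resourceType string, found int) bool {
	if limit, ok := l.perType[resourceType]; ok {
		return found >= limit
	}
	return l.global > 0 && found >= l.global
}

func main() {
	limits := scanLimits{
		global:  5000,
		perType: map[string]int{"CloudWatchLogsLogGroup": 1000},
	}
	fmt.Println(limits.reached("CloudWatchLogsLogGroup", 1000)) // true: per-type cap hit
	fmt.Println(limits.reached("QuickSightUser", 4999))         // false: under the global cap
}
```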

mdgm88 commented Nov 26, 2024

You would need to run aws-nuke a lot more times to clean everything up with this limit in place, but at least it wouldn't get stuck.

If aws-nuke is run regularly against an account it should work well most of the time.

Cost Explorer can help identify where the spend is in an account but that doesn't necessarily correspond to what you have a lot of. You can have a huge amount of something that costs a negligible amount, and not much of something that costs a lot.

ekristen (Owner) commented:

It definitely feels like an advanced feature. It's something that would have to be implemented per resource too.

YuriGal (Author) commented Nov 26, 2024

> If aws-nuke is run regularly against an account it should work well most of the time.

That's the idea. It's the initial cleanup that's problematic; once it's done, we plan to schedule a weekly nuke run that should keep things tidy. We already do this with our other sandbox accounts, and it works pretty well.

> Cost Explorer can help identify where the spend is in an account but that doesn't necessarily correspond to what you have a lot of. You can have a huge amount of something that costs a negligible amount, and not much of something that costs a lot.

Same here, e.g. we have thousands of log groups that are over 5 years old and contain literally gigabytes of logs.

ekristen (Owner) commented:

@YuriGal this is not hard to implement, but it is very time-consuming, and beyond this use case I'm not sure it makes sense to do just yet. However, I'd be willing to create a branch, make a hard-coded change for the 1 or 2 resources you are having issues with, and upload a release against the issue; or you can build the Docker image yourself with the changes. What do you think?

YuriGal (Author) commented Nov 26, 2024

If I can get a Darwin ARM binary for a release that has this feature for CloudWatch log groups and QuickSight users, it would be a great help, thanks!

ekristen (Owner) commented:

It would be custom, so not an actual release. Just a custom branch; I can build it for you and post a link to download.

YuriGal (Author) commented Nov 26, 2024

Oh yes, I understand, I didn't mean it would be a general release. And I really appreciate you doing this.

ekristen (Owner) commented Dec 3, 2024

ekristen closed this as completed Dec 3, 2024
ekristen reopened this Dec 3, 2024

ekristen (Owner) commented Dec 4, 2024

@YuriGal builds are at the above link, let me know how it goes. Wish you luck.

YuriGal (Author) commented Dec 4, 2024

Thanks! What is the name of the flag you implemented? It doesn't seem to recognize --max-per-resource-type

ekristen (Owner) commented Dec 4, 2024

No flag. I just hard-coded it to 1000 max per run.
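
As a rough illustration of what a hard-coded cap like that does (hypothetical code, not the actual change in the custom branch), pagination simply stops once N resources have been collected, and the remainder gets picked up on later runs:

```go
package main

import "fmt"

// maxPerRun is the kind of hard-coded cap used in the test build.
const maxPerRun = 1000

// collect walks pages of resource names and stops early at the cap.
func collect(pages [][]string) []string {
	var out []string
	for _, page := range pages {
		for _, name := range page {
			if len(out) >= maxPerRun {
				return out // remaining resources are left for subsequent runs
			}
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// Simulate 30 pages of 50 log groups each (1500 total); only 1000 are kept.
	pages := make([][]string, 30)
	for i := range pages {
		for j := 0; j < 50; j++ {
			pages[i] = append(pages[i], fmt.Sprintf("/aws/lambda/group-%d-%d", i, j))
		}
	}
	fmt.Println(len(collect(pages))) // 1000
}
```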

YuriGal (Author) commented Dec 4, 2024

I am getting this error:

FATA[0120] failed get caller identity: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp: lookup sts.amazonaws.com: i/o timeout

I am running commands under aws-vault, and the released nuke doesn't have this issue.

ekristen (Owner) commented Dec 4, 2024

I just tested on two different machines. Works ok.

YuriGal (Author) commented Dec 4, 2024

Weird. How are you supplying AWS credentials?

ekristen (Owner) commented Dec 4, 2024

Always environment variables. :)

An i/o timeout indicates to me that the local system is preventing the network connection for whatever reason.
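
The failing step here is the DNS lookup itself ("lookup sts.amazonaws.com: i/o timeout"), so one quick sanity check (an illustrative standalone snippet, assuming the problem really is name resolution) is to resolve the host outside of aws-nuke:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// A plain hostname lookup; if this also times out, the problem is the
	// machine's network/DNS setup rather than the custom build.
	addrs, err := net.LookupHost("sts.amazonaws.com")
	if err != nil {
		fmt.Println("DNS lookup failed:", err)
		return
	}
	fmt.Println("resolved:", addrs)
}
```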

YuriGal (Author) commented Dec 4, 2024

Still no luck. Maybe it's because it's an unsigned executable? macOS wouldn't even let me run the file until I found where to enable it, but I can't find a similar setting for the denied network connection.

ekristen (Owner) commented Dec 4, 2024

That's it. I think I have it set up to only sign tagged builds. In System Preferences > Security you can hit Allow to run it; that should fix things.

YuriGal (Author) commented Dec 4, 2024

That's the thing - I did that, and it allowed me to run this build. But apparently it doesn't allow it to connect.

YuriGal (Author) commented Dec 5, 2024

Sorry, still unable to run it. I enabled it in the security settings, so I can execute it, and I added it to the firewall allowed list, but I am still getting

FATA[0120] failed get caller identity: RequestError: send request failed
caused by: Post "https://sts.amazonaws.com/": dial tcp: lookup sts.amazonaws.com: i/o timeout

when running it. Just to reiterate, this does not happen with the released nuke version.

ekristen (Owner) commented Dec 5, 2024

adamLShine commented:

Hi all,

I believe I may also be having this issue. I have approximately 2400 resources in my AWS account, and aws-nuke is flagging 1800 resources to be removed. My pipeline running aws-nuke has been going for 75 minutes; does that seem too long for that number of resources?

ekristen (Owner) commented Dec 9, 2024

Yuri's problem is likely different.

What resource types? How far does it get? Send logs and config.

adamLShine commented:

Here is a copy of my config file. I have 6000 log groups, and when I run aws-nuke within my pipeline it just stalls; I have had it running for 30+ minutes (see screenshot below).

```yaml
regions:
  - global
  - us-east-1
  - ap-southeast-2

bypass-alias-check-accounts:
  - "111111111111"

accounts:
  111111111111: # Sandpit
    resource-types:
      includes:
        - CloudWatchLogsLogGroup
```

[Screenshot: 2024-12-09 11:48 PM]

ekristen (Owner) commented Dec 9, 2024

Oh, you probably want to try the binaries I built for this; they limit it to 1000 per run and strip out a bunch of extra queries. Just to get you a decent baseline.

adamLShine commented:

OK, thanks. I'm using the Docker images; are the binaries included in them, or will they need to be added separately?

ekristen (Owner) commented Dec 9, 2024

You'll need to grab them from GitHub Actions. This is a special build to try and help you all out for the time being while I think through whether I can easily implement a limits CLI option.

adamLShine commented:

Thanks! That definitely helped, and it's no longer stalling on my CloudWatch log groups.

However, if I remove the resource-types section from my config file and run again, aws-nuke stalls again. From my testing it seems like aws-nuke has a difficult time running over a large AWS account without stalling.

[Screenshot: 2024-12-10 10:55 AM]

ekristen (Owner) commented Dec 9, 2024

That message is normal. Do you have S3Object excluded?

Please open a new issue about this; provide the version and config, run with log level trace, and provide that output.

adamLShine commented:

Thanks, yes I do. The new issue is here #453

YuriGal (Author) commented Dec 11, 2024

> @YuriGal fixed the signing -- https://github.com/ekristen/aws-nuke/actions/runs/12189713328/artifacts/2282303070

Thanks! I finally managed to run it. Just to confirm, does the hard-coded 1000-resource limit apply only to QuickSight users and CloudWatch log groups, or to all resource types in the account?

ekristen (Owner) commented:

Just the CloudWatch log groups. I don't mind doing more hard-coding in a special build if needed to help you out, but making it configurable will take a bit of effort that I don't have time for at the moment.

YuriGal (Author) commented Dec 11, 2024

I think having CloudWatch log group support should suffice for us for now; they're the main offender. For the rest, I'd rather wait until the feature is officially supported in a release. Thanks again!
