This repository has been archived by the owner on Aug 22, 2024. It is now read-only.

EBS volumes not getting deleted #30

Open
likku123 opened this issue Jul 8, 2021 · 5 comments

@likku123 commented Jul 8, 2021

I have tried using EBS autoscale for my use case and it's working like a charm. One thing I have noticed is that whenever an EBS volume is created for my instance (to use as scratch space) with

sh /opt/amazon-ebs-autoscale/install.sh -m /var/lib/docker -d /dev/sdc -s 50 2>&1 > /var/log/ebs-autoscale-install.log

the volume is not deleted when the instance is terminated, so orphaned volumes keep piling up.
Has anybody faced this issue before? Any pointers on how to overcome it would be appreciated.

@wleepang (Contributor) commented

Check what version you are running. The following was added to ensure volumes were deleted when an instance was terminated:

aws ec2 modify-instance-attribute \
--region $region \
--instance-id $instance_id \
--block-device-mappings "DeviceName=$device,Ebs={DeleteOnTermination=true,VolumeId=$volume_id}"
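To verify whether the flag is actually being set on an instance's volumes, a quick check with the standard AWS CLI (not part of this project) should work:

aws ec2 describe-instances --region "$region" --instance-ids "$instance_id" \
--query 'Reservations[].Instances[].BlockDeviceMappings[].[DeviceName,Ebs.VolumeId,Ebs.DeleteOnTermination]' \
--output table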

@BrunoGrandePhD commented

@likku123: Did you ever find a solution to this issue? I'm experiencing the same thing. Out of curiosity, were you using spot instances?

@wleepang: Thanks for pointing out the create-ebs-volume script! I initially didn't suspect amazon-ebs-autoscale because I was seeing the additional volumes being configured with DeleteOnTermination. However, now that I'm looking at the script, I'm noticing that the modify-instance-attribute command is the last step of the script (line 342), whereas volume creation is initiated much earlier (line 269). I do appreciate that DeleteOnTermination cannot be enabled before the volume is attached, so this is an AWS-imposed limitation.

Based on this, the following scenario seems feasible to me: the script initiates volume creation, but script execution is interrupted for some reason, resulting in an EBS volume without DeleteOnTermination, which is expected to linger indefinitely in the account after the EC2 instance is terminated.
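In the meantime, orphaned volumes can at least be enumerated with something like the following sketch (region is illustrative). Note that the status filter matches all volumes in the "available" state, not just ones created by amazon-ebs-autoscale, so a tag filter specific to your setup would be needed before deleting anything:

# List unattached EBS volumes in a region (cleanup candidates).
# WARNING: matches ALL 'available' volumes; add a tag filter
# before acting on the results.
aws ec2 describe-volumes \
--region us-east-1 \
--filters Name=status,Values=available \
--query 'Volumes[].[VolumeId,CreateTime,Size]' \
--output table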

I realize this might appear unlikely, but I ended up with over 700 unattached volumes in my account over the course of two months. Hence, there might be certain conditions that increase the risk of orphaned volumes. As you can see in the Cost Explorer plot below, the ongoing EBS cost (proportional to the number of unattached volumes) jumps at specific points. For context, I'm using amazon-ebs-autoscale indirectly via Nextflow Tower, which deploys AWS Batch jobs that are configured to auto-scale EBS volumes.

My current hypothesis is that spot termination causes the premature termination of the create-ebs-volume script and that the risk of ending up with unattached volumes is proportional to the number of jobs. For example, on March 25th (the day before the first jump in EBS cost visible in the middle of the plot), I ran a workflow with over 100,000 jobs. I'm currently trying out workflow runs with almost 40,000 jobs using on-demand instances and unattached volumes haven't appeared yet. I might perform the same test with spot instances to see if unattached volumes start accumulating.

@wleepang: Does my logic seem sound to you? Can you think of other possible causes?

[Cost Explorer plot: ongoing EBS cost over time, with step increases at specific points]

@wleepang (Contributor) commented

@BrunoGrandePhD - That is a probable scenario. Attaching volumes and rebalancing the filesystem can take up to a minute to complete. A spot termination should send a signal to the EC2 instance giving it about two minutes to handle things gracefully. However, there is no functionality in EBS autoscale to specifically monitor for these signals.
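For reference, a minimal sketch of what such monitoring could look like; this is not part of amazon-ebs-autoscale, and it assumes IMDSv1 is reachable (IMDSv2 would additionally require a session token):

# Poll the instance metadata service for a spot interruption notice.
# On notice (~2 minutes before termination), force DeleteOnTermination
# on every block device mapping as a last-ditch cleanup.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)

while sleep 5; do
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    http://169.254.169.254/latest/meta-data/spot/instance-action)
  [ "$code" != "200" ] && continue
  # Notice received: flag all attached devices for deletion.
  aws ec2 describe-instances --region "$REGION" --instance-ids "$INSTANCE_ID" \
    --query 'Reservations[].Instances[].BlockDeviceMappings[].DeviceName' \
    --output text | tr '\t' '\n' | while read -r dev; do
      aws ec2 modify-instance-attribute --region "$REGION" \
        --instance-id "$INSTANCE_ID" \
        --block-device-mappings "DeviceName=$dev,Ebs={DeleteOnTermination=true}"
    done
  break
done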

@MikeKroell (Contributor) commented

I have run across this situation as well. At times of heavy load, I came across thousands of orphaned volumes (costing upwards of $300/day).

I have switched from launching the instance with a pre-attached device for ebs-autoscale to having ebs-autoscale create the initial device itself. All of my orphaned volumes come from hitting API rate limits.

I see this in the logs:

Nov  9 10:47:39 ip-10-125-224-246 cloud-init: An error occurred (RequestLimitExceeded) when calling the ModifyInstanceAttribute operation (reached max retries: 2): Request limit exceeded.
Nov  9 10:47:39 ip-10-125-184-11 cloud-init: An error occurred (RequestLimitExceeded) when calling the ModifyInstanceAttribute operation (reached max retries: 2): Request limit exceeded.
Nov  9 10:47:39 ip-10-125-166-23 cloud-init: An error occurred (RequestLimitExceeded) when calling the ModifyInstanceAttribute operation (reached max retries: 2): Request limit exceeded.
Nov  9 10:47:39 ip-10-125-246-38 cloud-init: An error occurred (RequestLimitExceeded) when calling the ModifyInstanceAttribute operation (reached max retries: 2): Request limit exceeded.
Nov  9 10:47:39 ip-10-125-94-250 cloud-init: An error occurred (RequestLimitExceeded) when calling the ModifyInstanceAttribute operation (reached max retries: 2): Request limit exceeded.
Nov  9 10:47:39 ip-10-125-245-99 cloud-init: An error occurred (RequestLimitExceeded) when calling the ModifyInstanceAttribute operation (reached max retries: 2): Request limit exceeded.

which then corresponds to:

Nov  9 10:54:22 ip-10-125-161-181 cloud-init: Error: could not attach volume to instance
Nov  9 10:54:22 ip-10-125-207-131 cloud-init: Error: could not attach volume to instance
Nov  9 10:54:22 ip-10-125-160-175 cloud-init: Error: could not attach volume to instance
Nov  9 10:54:22 ip-10-125-84-179 cloud-init: Error: could not attach volume to instance
Nov  9 10:54:22 ip-10-125-225-0 cloud-init: Error: could not attach volume to instance
Nov  9 10:54:22 ip-10-125-149-241 cloud-init: Error: could not attach volume to instance
Nov  9 10:54:22 ip-10-125-129-227 cloud-init: Error: could not attach volume to instance
Nov  9 10:54:22 ip-10-125-192-238 cloud-init: Error: could not attach volume to instance

What are your suggestions for adding a better retry and backoff mechanism here?
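One possible shape for this, sketched in bash (the function name and attempt/backoff parameters are illustrative, not from the project):

# Retry a command with exponential backoff and jitter, so that many
# instances hitting RequestLimitExceeded at once don't retry in lockstep.
retry() {
  local max_attempts=5 attempt=1 delay=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "Command failed after $attempt attempts: $*" >&2
      return 1
    fi
    # Sleep a random duration within the current backoff window,
    # then double the window for the next attempt.
    sleep $(( RANDOM % delay + 1 ))
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

Alternatively, recent versions of the AWS CLI support client-side retry tuning via the AWS_RETRY_MODE=adaptive and AWS_MAX_ATTEMPTS environment variables, which might be a lighter-weight mitigation for RequestLimitExceeded.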

@geertvandeweyer commented

We occasionally see the same issue of lingering volumes. I'm wondering if you could add a retry around the attach-volume and modify-instance-attribute calls, like the one already used around e.g. the create-volume call.
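For illustration, with a helper like the retry function sketched above, that could look like:

retry aws ec2 attach-volume \
--region $region \
--instance-id $instance_id \
--volume-id $volume_id \
--device $device

retry aws ec2 modify-instance-attribute \
--region $region \
--instance-id $instance_id \
--block-device-mappings "DeviceName=$device,Ebs={DeleteOnTermination=true,VolumeId=$volume_id}"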
