Migrate dockerhost-azure system to lower cost disks #3796

Closed

sxa opened this issue Nov 4, 2024 · 11 comments

@sxa
Member

sxa commented Nov 4, 2024

Currently dockerhost-azure-ubuntu2204-x64-1-intel is configured with three Premium SSD LRS disks:

  • 512GiB sdb for /var/lib/docker (only 256GiB allocated, plus a 16GiB swap partition and a 60GiB Solaris partition)
  • 60GiB sdd for Solaris (defined as LVM)
  • 256GiB sdc (defined as LVM) with ~120GiB allocated to /home/jenkins

This means half of sdc is unused (/home/jenkins currently has 21GiB in use, 92GiB free) and a lot of the 512GiB disk is also unused.
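
For reference, the current layout can be confirmed on the host with standard tooling; a minimal sketch (output will obviously differ from the summary above):

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT   # block devices, partitions and LVM volumes
df -h /var/lib/docker /home/jenkins  # filesystem usage on the mounts in question
sudo vgs && sudo lvs                 # LVM volume group / logical volume layout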

I suggest we replace the above with "Standard HDD" disks, which cost about a third as much, laid out as follows:

  • 256GiB for /var/lib/docker
  • 256GiB: a 16GiB swap partition, then LVM split between Solaris (two VMs currently using about half of the 50GiB FS) and /home/jenkins (alternative option: allocate this as a single non-LVM /home)
  • We could potentially replace the swap partition with a /swapfile - it's unlikely to be heavily used since the machine has 64GiB of RAM.
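
As a rough sketch of the proposal, assuming the Azure CLI is used (the resource group and disk names below are made up for illustration; the VM name is as referred to above), a Standard HDD data disk and the optional swapfile could be set up like this:

# create and attach a 256GiB Standard HDD (SKU Standard_LRS) data disk
az disk create --resource-group ADOPT_RG --name dockerhost-docker-hdd --size-gb 256 --sku Standard_LRS
az vm disk attach --resource-group ADOPT_RG --vm-name dockerhost-azure-ubuntu2204-x64-1-intel --name dockerhost-docker-hdd

# optional: 16GiB swapfile instead of a dedicated swap partition
sudo fallocate -l 16G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab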

Goal here should be:

  1. Create new disks
  2. Shut down the docker process and the Solaris VMs
  3. Copy all data from the old disks to the new disks
  4. Unmount everything
  5. Switch /etc/fstab to point to the new disks
  6. Remount everything (or maybe reboot?)
  7. Start everything up again and make sure Jenkins jobs work (particularly the Solaris build pipelines)
  8. Remove the old disks (maybe after a weekly EA build cycle)
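
A minimal shell sketch of steps 2-7 (device paths and mount points below are placeholders, and shutting down the Solaris VMs via vagrant/VBoxManage is left out):

sudo systemctl stop docker                          # step 2: stop services holding the old disks open
sudo mount /dev/new_docker_disk /mnt/newdocker      # step 3: mount a new volume (placeholder device)
sudo rsync -aHAXS /var/lib/docker/ /mnt/newdocker/  #         copy, preserving perms/xattrs/sparse files
sudo umount /mnt/newdocker /var/lib/docker          # step 4: unmount everything
sudoedit /etc/fstab                                 # step 5: point entries at the new disks
sudo mount -a                                       # step 6: remount (or reboot)
sudo systemctl start docker                         # step 7: start everything back up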

If the above is successful we should also look at shrinking the second machine, particularly if we can select an AMD system to complement the Intel one. Looking at current usage, a D8as_v4 with 8 vCPUs/32GiB RAM should be adequate for the second machine, which doesn't host any Solaris VMs at present, with a similar disk layout to the above (potentially 128GiB instead of 256GiB for the second disk since it does not host Solaris images). This would likely reduce the cost of that machine by over 50%.

@sxa
Member Author

sxa commented Nov 6, 2024

Quick bit of gap analysis:
The machine running the Solaris VMs has the following containers:

  • Cent7, Alpine 3.20, UBI9, U2404, U2004, U2204, F39, U2004 and Alpine 3.19

The other machine has:

  • CS9, U2410, F41, F40, U2204 (2 of them), UBI8, AL2023, U2004, Alpine 3.19 (2 of them), DEB12 (2 of them)

So only the second one has the later Fedoras, U2410, UBI8, AL2023, CS9 and Debian. Noting also that the second machine does have a Solaris Vagrant VM defined, but its log shows a lack of virtualisation support:

Stderr: VBoxManage: error: VT-x is disabled in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component ConsoleWrap, interface IConsole
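
For what it's worth, whether the guest actually sees hardware virtualisation (which VirtualBox needs for the Solaris VMs, and which on Azure requires a VM size that supports nested virtualisation) can be checked with:

lscpu | grep -i virtualization        # shows VT-x / AMD-V when exposed to the guest
grep -c -E 'vmx|svm' /proc/cpuinfo    # non-zero means the vmx/svm CPU flag is visible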

Haroon-Khel self-assigned this Nov 11, 2024
@Haroon-Khel
Contributor

On dockerhost-azure-ubuntu2204-x64-1-intel, the new docker disk is /dev/docker_hdd/docker_hdd and it's mounted on /var/lib/docker:

sde                                       8:64   0  256G  0 disk 
└─sde1                                    8:65   0  256G  0 part 
  └─docker_hdd-docker_hdd               252:2    0  255G  0 lvm  /var/lib/docker

The docker contents have been moved over to this disk and the containers are up and running https://ci.adoptium.net/label/hw.dockerhost.dockerhost-azure-ubuntu2204-x64-1/
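
For the record, an LVM layout like the one shown could be built roughly as follows; this is a sketch, not the exact commands run (the ext4 filesystem type in particular is an assumption):

sudo pvcreate /dev/sde1                           # mark the partition as an LVM physical volume
sudo vgcreate docker_hdd /dev/sde1                # volume group docker_hdd
sudo lvcreate -n docker_hdd -L 255G docker_hdd    # logical volume docker_hdd/docker_hdd
sudo mkfs.ext4 /dev/docker_hdd/docker_hdd         # filesystem type assumed
echo '/dev/docker_hdd/docker_hdd /var/lib/docker ext4 defaults 0 2' | sudo tee -a /etc/fstab
sudo mount /var/lib/docker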

@Haroon-Khel
Contributor

On the machine, there's a partition called temp-docker. I used this to stage the docker contents while unmounting the old /var/lib/docker (as I couldn't have two /var/lib/docker directories existing simultaneously). I'll clear the temp-docker partition once we're comfortable with the migrated docker containers.
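
That staging shuffle, as a hedged sketch (docker stopped throughout; the paths match the description above but the exact commands are assumed):

sudo systemctl stop docker
sudo rsync -aHAXS /var/lib/docker/ /temp-docker/        # stage onto the temp-docker partition
sudo umount /var/lib/docker                             # release the old SSD-backed mount
sudo mount /dev/docker_hdd/docker_hdd /var/lib/docker   # bring in the new HDD-backed volume
sudo rsync -aHAXS /temp-docker/ /var/lib/docker/        # copy back onto the new volume
sudo systemctl start docker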

@sxa
Member Author

sxa commented Nov 12, 2024

On dockerhost-azure-ubuntu2204-x64-1-intel, the new docker disk is /dev/docker_hdd/docker_hdd and it's mounted on /var/lib/docker:

sde                                       8:64   0  256G  0 disk 
└─sde1                                    8:65   0  256G  0 part 
  └─docker_hdd-docker_hdd               252:2    0  255G  0 lvm  /var/lib/docker

The docker contents have been moved over to this disk and the containers are up and running https://ci.adoptium.net/label/hw.dockerhost.dockerhost-azure-ubuntu2204-x64-1/

Thanks - I've kicked off a JDK21u pipeline at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/windbld/1206/ which should run the tests on a few of the containers and give them a bit of a workout :-)

EDIT: LGTM other than the jobs that got scheduled on systems labelled rhel6 which isn't suitable for JDK21. I've kicked off another at 1207 but that should not be considered a blocker :-)

@sxa
Member Author

sxa commented Nov 13, 2024

Recording the old/new disk info here :-) It may be beneficial to enable write-caching on the docker file system at some point to see if it makes a difference to e.g. setup time for the tests, but that's not necessary for this task at the moment.

[Attached image: dhx1disks - screenshot of the old/new disk configuration]
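
If we do experiment with host caching later, the Azure CLI can toggle it per data-disk LUN. A sketch (resource group name is a placeholder; worth verifying the flag syntax against the installed CLI version and which caching modes the Standard HDD SKU actually permits):

# 1 = LUN of the docker data disk (check with 'az vm show')
az vm update --resource-group ADOPT_RG --name dockerhost-azure-ubuntu2204-x64-1-intel --disk-caching 1=ReadWrite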

@Haroon-Khel
Contributor

I've wiped the temp-docker disk and split it into two Solaris logical volumes and one Jenkins logical volume:

sdf                                       8:80   0  256G  0 disk 
└─sdf1                                    8:81   0  256G  0 part 
  ├─home_hdd-solaris_build              252:3    0   50G  0 lvm  
  ├─home_hdd-solaris_test               252:4    0   50G  0 lvm  
  └─home_hdd-jenkins_home               252:5    0  100G  0 lvm 
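
A sketch of how a single volume group can be carved into those logical volumes (VG/LV names taken from the output above, everything else assumed):

sudo pvcreate /dev/sdf1
sudo vgcreate home_hdd /dev/sdf1
sudo lvcreate -n solaris_build -L 50G home_hdd
sudo lvcreate -n solaris_test -L 50G home_hdd
sudo lvcreate -n jenkins_home -L 100G home_hdd
# an LV can be grown later with e.g. 'lvextend -r', which would account for
# jenkins_home showing 120G in the next comment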

@Haroon-Khel
Contributor

Haroon-Khel commented Nov 14, 2024

The directories are mounted and the content of each has been moved over:

sde                                       8:64   0  256G  0 disk 
└─sde1                                    8:65   0  256G  0 part 
  ├─home_hdd-solaris_build              252:1    0   50G  0 lvm  /home/solarisbuild
  ├─home_hdd-solaris_test               252:2    0   50G  0 lvm  /home/solaris
  └─home_hdd-jenkins_home               252:3    0  120G  0 lvm  /home/jenkins

The Solaris and dockerhost agents are all online:
https://ci.adoptium.net/computer/test-azure-solaris10-x64-1/
https://ci.adoptium.net/computer/build-azure-solaris10-x64-1/
https://ci.adoptium.net/computer/dockerhost-azure-ubuntu2204-x64-1/

All that's left is to delete the SSDs in the Azure console. We can do this once we're content with the HDDs.
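
When we get there, the clean-up is roughly the following (disk and resource group names are placeholders for whatever is shown in the Azure portal):

az vm disk detach --resource-group ADOPT_RG --vm-name dockerhost-azure-ubuntu2204-x64-1-intel --name old-premium-ssd-1
az disk delete --resource-group ADOPT_RG --name old-premium-ssd-1 --yes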

@Haroon-Khel
Contributor

https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin/519/ has been running on build-azure-solaris10-x64-1, and its tests are running on test-azure-solaris10-x64-1.

Looking good so far

@sxa
Member Author

sxa commented Nov 15, 2024

Excellent - now we'll mess it all up by changing the way the Solaris pipelines work ;-) But we're now in a nice, stable, known-good state to start that work from.

@Haroon-Khel
Contributor

I've deleted the SSDs from the Azure console. This issue can be closed.
