EPIC: New Machine requirement: Windows dockerBuild containers #3286

Open
6 of 8 tasks
sxa opened this issue Dec 6, 2023 · 36 comments · Fixed by #3702
Comments

@sxa
Member

sxa commented Dec 6, 2023

I need to request a new machine:

  • New machine operating system (e.g. linux/windows/macos/solaris/aix): Windows
  • New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): x64
  • Provider (leave blank if it does not matter): Docker :-)
  • Desired usage: Build containers, similar to what we have for Linux
  • Any unusual specification/setup required:
  • How many of them are required: n/a - they should be created dynamically

Please explain what this machine is needed for:
Running builds in an isolated way where we can achieve SLSA build level 3 compliance on Windows along with the other primary platforms. Ideally we'll be able to create windows-on-windows container images which we share and then download and run the builds in.

As background info:

So the tasks required would be:

  • Identify the appropriate software for running containers and ensure no licensing concerns (Likely something from the microsoft site linked above)
  • See if we can verify that a "basic" dockerfile works in that environment and whether we can map directories into it (same as -v on linux) which are read+write in the container
  • Determine whether we can create a container from the playbooks using a dockerfile equivalent to the Linux ones
  • Once we create the container, map a directory from the host into it with -v and use that to build Temurin in the container on the mapped volume so that the output is visible on the host system.
  • Understand whether we can reasonably push the resulting container images with the compiler up to dockerhub
  • Integrate this into the build pipelines
  • Implement processes to regenerate the images when playbook updates are made, likely as an addition to what we do for Linux in https://github.com/adoptium/infrastructure/blob/master/FAQ.md#what-about-the-builds-that-use-the-dockerbuild-tag
  • Declare SLSA Build level 3 on Windows :-)

Once this level of analysis and expertise is gained it will likely make windows installer testing, or any other such activities simpler and give us more options moving forward.

Related for historic reference:

@RadekCap

Please, assign this task to me. Thank you.

@sxa
Member Author

sxa commented Jul 4, 2024

Of the three options listed on the Microsoft website:

  • The first (Docker CE / Moby) seems to work well out of the box
  • The second (Mirantis) appears to be a commercial offering
  • The third (Containerd+nerdctl) appears functional, although networking doesn't work out of the box and it fails to start the eclipse-temurin container's default jshell process.

@sxa
Member Author

sxa commented Jul 4, 2024

OK First phase done ...

  • docker run -p 5986:5986 -v c:\Users\sxa:c:\sxa mcr.microsoft.com/windows/servercore:ltsc2022
  • Run ConfigureRemotingForAnsible.ps1 with the usual parameters with the netsh commands disabled (They require windows defender which isn't in the image)
  • Create a user to connect with for the playbooks (MyPassword is not what I've used on the live system!):
net user ansible MyPassword /ADD
net localgroup "Administrators" ansible /ADD
net localgroup "Remote Management Users" ansible /ADD

This allows the machine to be accessible via ansible running on a remote machine :-)

(Also, for my own notes, to debug powershell scripts use Set-PSDebug -Trace 2)
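
The container start-up command from this comment can also be assembled programmatically, for example when a wrapper script launches it. This is a minimal Python sketch: the function name `docker_run_args` is hypothetical, while the port, bind mount and image are the ones quoted above.

```python
def docker_run_args(image, port, host_dir, container_dir):
    """Build the argv for `docker run` with one published port and one bind mount."""
    return [
        "docker", "run",
        "-p", f"{port}:{port}",               # publish the WinRM HTTPS port
        "-v", f"{host_dir}:{container_dir}",  # map a host directory into the container
        image,
    ]


if __name__ == "__main__":
    args = docker_run_args(
        "mcr.microsoft.com/windows/servercore:ltsc2022",
        5986,
        r"c:\Users\sxa",
        r"c:\sxa",
    )
    print(" ".join(args))
```

Returning a list rather than a single string means it can be passed straight to `subprocess.run` without shell quoting concerns.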

@sxa sxa self-assigned this Jul 4, 2024
@sxa
Member Author

sxa commented Jul 5, 2024

Playbook execution notes:

  • VS2013 requires the archive under /Vendor_Files/windows, otherwise MSVS_2013 needs to be skipped
  • NTP_TIME needs to be skipped as that has issues that are presumably related to running in a container: FAILED! => {"changed": false, "msg": "Unhandled exception while executing module: Service 'Windows Time (W32Time)' cannot be started due to the following error: Cannot start service W32Time on computer '.'."}
  • In the absence of the fixed layout files for VS2019 and VS2022, adoptopenjdk needs to be skipped to allow them to complete successfully

@sxa sxa moved this to In Progress in 2024 3Q Adoptium Plan Jul 5, 2024
@sxa sxa added this to the 2024-07 (July) milestone Jul 5, 2024
@sxa
Member Author

sxa commented Jul 5, 2024

Ansible can be run on the host and pointed at the container if you install cygwin, which has ansible as one of its installable options (you probably want to include git too if it's a clean install on the host system). Note that if you use localhost/127.0.0.1 in your hosts file you should specify -e git_sha=12345 or something appropriate, otherwise the execution will trip up over this task:

- name: Get Latest git commit SHA (Windows Localhost)

Noting that WSL could probably be used too, but that requires a system with virtualization extension instructions to be available which is not the case on all systems.
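
The localhost rule above can be captured in a small helper so the extra variable is never forgotten. This is a sketch only: `extra_vars_for` and `LOCAL_HOSTS` are hypothetical names, and the only behaviour taken from the comment is that a localhost/127.0.0.1 target needs `-e git_sha=...`.

```python
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def extra_vars_for(host, git_sha=None):
    """Return the extra `-e` arguments ansible-playbook needs for this host."""
    if host in LOCAL_HOSTS:
        if not git_sha:
            # Without it the "Get Latest git commit SHA" task is expected to fail.
            raise ValueError("git_sha must be supplied when targeting localhost")
        return ["-e", f"git_sha={git_sha}"]
    return []


if __name__ == "__main__":
    print(extra_vars_for("127.0.0.1", "12345"))
    print(extra_vars_for("some-remote-host"))
```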

@sxa
Member Author

sxa commented Jul 26, 2024

Latest attempt is with:
--skip-tags adoptopenjdk,reboot,MSVS_2013,MSVS_2017,NTP_TIME
(Note: MSVS_2013 is skipped because I didn't have the installer on the machine, and MSVS_2017 did not work. Dragonwell could also be added to skip that install, which is not required for Temurin.)
Playbook changes to make it complete:

  • Set ansible_connection/ansible_winrm_transport in ansible.cfg
  • Set ansible_user/ansible_password in group_vars/all/adoptopenjdk_variables.yml
  • Remove win_reboot: from Common/roles/main.yml Line 60
  • Remove win_reboot: from MSVS_2013 role line 50
  • Remove win_reboot: from MSVS_2017 role line 37
  • Remove checksum parameters MSVS_2022 role line 103 as it's been updated
  • Remove win_reboot from WMF_5.1 role line 29
  • Remove win_reboot from cygwin role line 45 (Although it's already covered with the reboot tag)

After ansible run is complete, run the commands shown in this article

docker ps
docker stop <container>
docker commit <container> win2022_build_image

After which it can be started again and used

@sxa
Member Author

sxa commented Jul 29, 2024

docker commit didn't work on my image:
Error response from daemon: re-exec error: exit status 1: output: mkdir \\?\C:\Windows\SystemTemp\hcs376450290\Files: Access is denied
This is specific to the new image which has had the playbook run on it and does not occur when attempting to commit an image with only basic changes applied.

EDIT: This seems to be the temporary location where it is storing the entire image before it is committed and the machine ran out of space.

Noting that outside that directory most of the docker data is stored in C:\ProgramData\docker

EDIT 2: The docker commit command on the second machine which had adequate space used around 95GB of space in C:\windows\SystemTemp to perform the commit (excluded VS2013 and 2017) and took about 40 minutes at 40-50Mb/sec showing on resource monitor, followed by about 10 minutes of using another 15GB on C: then moving data back to the docker directory at a faster rate (Maybe ~100Mb/sec)

It did, however, hit an error: Error response from daemon: re-exec error: exit status 1: output: hcsshim::ImportLayer failed in Win32: Access is denied. (0x5) (Probably hit a zero disk space condition on C: since DOCKER_TMPDIR apparently isn't working to relocate that since docker 25)
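
Since both failures above look like disk exhaustion during the commit, a free-space check before running `docker commit` could save a 40 minute wait. A rough Python sketch, assuming the figures in this comment (about 95GB in SystemTemp plus about another 15GB on C:) are representative; the names and the threshold are illustrative, not measured limits.

```python
import os
import shutil

GIB = 1024 ** 3

# Figures reported above: ~95GB in SystemTemp plus ~15GB more on C:.
TEMP_GIB = 95
EXTRA_GIB = 15


def has_room_for_commit(free_bytes, needed_gib=TEMP_GIB + EXTRA_GIB):
    """Return True if free_bytes covers the scratch space a commit may need."""
    return free_bytes >= needed_gib * GIB


if __name__ == "__main__":
    # SystemDrive is set on Windows (e.g. "C:"); fall back to "/" elsewhere.
    drive = os.environ.get("SystemDrive", "") + os.sep
    free = shutil.disk_usage(drive).free
    print(f"{free / GIB:.0f} GiB free on {drive}: enough = {has_room_for_commit(free)}")
```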

@sxa
Member Author

sxa commented Jul 29, 2024

This is unfortunate. The builds aren't working because automatic shortname generation (fsutil behavior set disable8dot3 0) does not appear to work within the container, but it is mandatory for the openjdk build process. Directories can have a shortname created manually with fsutil file setshortname "Long name" shortname, but that is not ideal to do for each possible path.

EDIT: Noting that https://github.com/adoptium/infrastructure/blob/master/ansible/playbooks/AdoptOpenJDK_Windows_Playbook/roles/shortNames/tasks/main.yml already has some explicit short name creation.
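
If short names do have to be created explicitly, the `fsutil file setshortname` calls could be generated from a table of paths. The sketch below is illustrative only: the two example entries are assumptions based on the short path components seen in build logs, not the list used by the shortNames role.

```python
import re

# 8.3 form: up to 8 characters, optionally a dot and up to 3 more.
SHORT_NAME_RE = re.compile(r"^[^\s.\\/]{1,8}(\.[^\s.\\/]{1,3})?$")


def is_valid_short_name(name):
    """Check that a name fits the 8.3 pattern."""
    return SHORT_NAME_RE.match(name) is not None


def setshortname_args(long_path, short_name):
    """Build the argv for `fsutil file setshortname <path> <shortname>`."""
    if not is_valid_short_name(short_name):
        raise ValueError(f"not an 8.3 name: {short_name}")
    return ["fsutil", "file", "setshortname", long_path, short_name]


# Hypothetical examples, for illustration only.
SHORT_NAMES = {
    r"C:\Program Files": "PROGRA~1",
    r"C:\Program Files\Microsoft Visual Studio\2022\Community": "COMMUN~1",
}

if __name__ == "__main__":
    for path, short in SHORT_NAMES.items():
        print(setshortname_args(path, short))
```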

@sxa
Member Author

sxa commented Jul 29, 2024

Manually created a few of the shortnames that the configure step was objecting to and I have a JDK21u build complete in a container, so this seems feasible 👍🏻

@sxa
Member Author

sxa commented Jul 30, 2024

Noting that we should look at doing this with the MS build tools installer which is suitable for use by Open Source projects. The jdk21u builds currently use:

10:04:20  * C Compiler:     Version 19.37.32822 (at /cygdrive/c/progra~1/micros~3/2022/commun~1/vc/tools/msvc/1437~1.328/bin/hostx64/x64/cl.exe)
10:04:20  * C++ Compiler:   Version 19.37.32822 (at /cygdrive/c/progra~1/micros~3/2022/commun~1/vc/tools/msvc/1437~1.328/bin/hostx64/x64/cl.exe)

Other references (this numbering is more confusing than I realised - I thought we only had the '2022' vs '19.xx' versioning differences to worry about before today...)
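
To make the numbering concrete: in the log above the compiler reports 19.37.32822 while its directory is the 8.3 short form of 14.37.32822, so the toolset major version is the compiler major version minus 5. A small Python sketch of that mapping follows; `toolset_version` is a hypothetical helper, and the offset is inferred from this log rather than taken from a published table.

```python
def toolset_version(cl_version):
    """Map a cl.exe version (e.g. 19.37.32822) to its MSVC toolset directory name."""
    major, minor, build = cl_version.split(".")[:3]
    # Offset of 5 between compiler and toolset major versions, as seen in the log.
    return f"{int(major) - 5}.{minor}.{build}"


if __name__ == "__main__":
    print(toolset_version("19.37.32822"))
```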

@sxa
Member Author

sxa commented Jul 30, 2024

Struggling with the GPG role at the moment which is called during the ANT role (I'm getting gnupg as a requirement which supplies gpg2 instead of gpg). Also Wix has to be skipped as I don't have ansible.builtin.runs available.

Other than that a two-phase dockerfile is looking quite promising. The first sets up WinRM (will only be invoked locally) and installs cygwin with git and ansible, then triggers a reboot to ensure the cygwin path takes effect.

The second runs the playbooks as normal, although for now I have it running in multiple layers to speed up testing, since each layer can then be cached independently:

  1. --skip-tags adoptopenjdk,reboot,ANT,NTP_TIME,Wix,MSVS_2013,MSVS_2017,MSVS_2019,MSVS_2022
  2. -t ANT
  3. -t MSVS_2019
  4. -t MSVS_2022

This is currently using the playbook branch at https://github.com/sxa/infrastructure/tree/sxa_allhosts which makes a few changes to support this execution.
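
The four layers listed above differ only in their tag arguments, so they can be generated from two lists. This is a sketch under that assumption; `layer_args` is a hypothetical helper and the rest of the ansible-playbook command line (inventory, playbook path) is deliberately left out.

```python
BASE_SKIP = [
    "adoptopenjdk", "reboot", "ANT", "NTP_TIME", "Wix",
    "MSVS_2013", "MSVS_2017", "MSVS_2019", "MSVS_2022",
]
TAG_LAYERS = ["ANT", "MSVS_2019", "MSVS_2022"]


def layer_args(skip_tags, tag_layers):
    """Return one list of ansible-playbook tag arguments per image layer."""
    layers = [["--skip-tags", ",".join(skip_tags)]]  # layer 1: everything else
    layers += [["-t", tag] for tag in tag_layers]    # one layer per heavy role
    return layers


if __name__ == "__main__":
    for number, args in enumerate(layer_args(BASE_SKIP, TAG_LAYERS), start=1):
        print(number, " ".join(args))
```

Keeping the slow roles in their own layers means a change to one of them only invalidates the cache from that layer onwards.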

@sxa sxa modified the milestones: 2024-07 (July), 2024-08 (August) Jul 31, 2024
@sxa
Member Author

sxa commented Aug 1, 2024

The above approach seemed to work yesterday now that the machine is rebooted after adding cygwin to the PATH, and I had a system which was able to successfully build jdk21u using two dockerfiles (the first to configure WinRM, the second to run the playbooks using the individual layers from the previous comment). Next steps are as follows:

  • Verify that it works on a clean image (I made some changes inside the image after my infrastructure branch was extracted, so that needs to be confirmed as captured in the branch)
  • Fix Wix install
  • Fix the git_sha detection
  • Update the MSVS_2022 role to use MS build tools to ensure reproducibility of the builds
  • Ideally test with the MSVS_2013 and 2017 installers available in the image so those roles do not need to be skipped.

Noting that the image without VS2013 or 2017 is 99GB in size.

@sxa
Member Author

sxa commented Aug 1, 2024

Now fixed the path setting so that it only requires one dockerfile so we have something consistent with what we have on Linux now 👍🏻

It still currently requires a username/password for the authentication, but the password can be passed into the dockerfile with --build-arg PW=SomeAcceptablePassword on the docker build command.
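
Passing the password as a build argument can be wrapped so that it comes from the environment instead of being typed on the command line. A minimal Python sketch: the `ANSIBLE_PW` environment variable and the `docker_build_args` helper are hypothetical, while `PW`, the dockerfile name and the image tag are the ones mentioned in this thread.

```python
import os


def docker_build_args(dockerfile, tag, password, context="."):
    """Build the argv for `docker build` with the PW build argument set."""
    return [
        "docker", "build",
        "-f", dockerfile,
        "-t", tag,
        "--build-arg", f"PW={password}",
        context,
    ]


if __name__ == "__main__":
    # ANSIBLE_PW is an assumed variable name; fall back to a placeholder.
    password = os.environ.get("ANSIBLE_PW", "SomeAcceptablePassword")
    args = docker_build_args("Dockerfile.win2022v2.txt", "win2022_build_image", password)
    # Mask the password when echoing the command.
    print(" ".join(a if not a.startswith("PW=") else "PW=****" for a in args))
```

Build arguments can end up visible in the image history, so this is probably only suitable while the password is a throwaway one.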

I haven't got it picking up the git_sha properly yet so that is currently hard-coded. Everything else is good enough to be able to run a jdk21u build on, but it's missing the compilers for some earlier versions (we will need those on the host and mapped in via Vendor_Files, similar to what we do with AWX). We'll also want the jenkins_user role (currently skipped via adoptopenjdk) unless we're happy with the processes running as an administrator within the container (need to check how well user mapping works in these containers).

Otherwise, here is the dockerfile Dockerfile.win2022v2.txt which uses the playbook changes from https://github.com/sxa/infrastructure/tree/windows_docker_fixes

@sxa sxa pinned this issue Aug 1, 2024
@sxa
Member Author

sxa commented Aug 22, 2024

For my own reference - the build times on the docker machine (Not as powerful as the main build machines - it's 2 core / 8GiB) are:

| Version | Time for 2-core docker build | Typical time on Azure 4-core machine |
|---|---|---|
| jdk8u | 52m | 31m |
| jdk11u | 2h14 | 1h31 |
| jdk17u | 2h20 | 1h27 |
| jdk21u | 2h32 | 1h29 |
jdk24 1h45
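
To compare the two columns, the times can be converted to minutes and expressed as a slowdown factor. A small Python sketch using only the four rows that have both values (the jdk24 row has a single time, so it is left out); the helper names are hypothetical.

```python
import re


def to_minutes(text):
    """Convert strings such as '52m' or '2h14' to a number of minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(\d+)m?", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text}")
    hours = int(match.group(1) or 0)
    return hours * 60 + int(match.group(2))


def slowdown(docker_time, azure_time):
    """Ratio of the 2-core docker build time to the 4-core Azure time."""
    return to_minutes(docker_time) / to_minutes(azure_time)


TIMES = {
    "jdk8u": ("52m", "31m"),
    "jdk11u": ("2h14", "1h31"),
    "jdk17u": ("2h20", "1h27"),
    "jdk21u": ("2h32", "1h29"),
}

if __name__ == "__main__":
    for version, (docker_time, azure_time) in TIMES.items():
        print(f"{version}: {slowdown(docker_time, azure_time):.2f}x")
```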

@sxa
Member Author

sxa commented Aug 28, 2024

First build using the main pipelines on the dockerhost machine: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk21u/job/jdk21u-windows-x64-temurin/151/
"NODE_LABEL": "dockerhost-azure-win2022-x64-1",
"DOCKER_IMAGE": "notrhel_build_image",
USER_REMOTE_CONFIGS:

{
    "branch": "docker_windows_shortpath",
    "remotes": {
        "url": "https://github.com/sxa/ci-jenkins-pipelines.git"
    }
}

DEFAULTS_JSON:

        "pipeline_branch": "docker_windows_shortpath",
        "pipeline_url": "https://github.com/sxa/ci-jenkins-pipelines.git",

@sxa sxa added the Epic label Sep 4, 2024
@sxa
Member Author

sxa commented Sep 25, 2024

It's been quite a lot of work but the sign_Verification job now has a working run after a refactor of the code that does the signing and assembly within the pipelines. Ref: #3709 (comment)
A bit of cleaning up, and then verifying that it can create reproducible builds, will mean this can go in as a PR.

@sxa
Member Author

sxa commented Sep 26, 2024

--create-sbom wasn't working as ant is not in the PATH on the machine. For now I've added that to the path of the environment variables in the jenkins machine definition, but that's probably something we want to cover in the container image setup.
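
A quick check for tools missing from the PATH could be run at the start of a job or as a final step of the image build. This Python sketch uses `shutil.which`; the `missing_tools` helper and the particular tool list are illustrative assumptions.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # ant is the tool --create-sbom needed; git is included as an example.
    missing = missing_tools(["ant", "git"])
    print("missing from PATH:", missing or "nothing")
```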

@sxa
Member Author

sxa commented Oct 12, 2024

Noting that when attempting to run a build using a fixed SCM_REF for a reproducibility comparison some problems occur

The create_installer_windows job needs to have the PRODUCT_*_VERSION fields matching the directory layout for the build. As an example, windbld#986, which was built with an SCM_REF of jdk-21.0.4+7_adopt, produced a zip file with a top level directory of jdk-21.0.5+9, but the JDK inside has OpenJDK Runtime Environment Temurin-21.0.4+7-202410111909 (build 21.0.4-beta+7-202410111909) in the java -version output. This causes the installer job to baulk at the end of a loop searching for a path name with 21.0.4 in it (these lines are not consecutive but there's a lot of debug stuff in these logs). Once it fails to find something it shows the directory it has, which clearly has an unexpected version number in it.

looking for .\SourceDir\OpenJDK21\hotspot\x64\jdk-21.0.4
looking for .\SourceDir\OpenJDK21\hotspot\x64\jdk21u4-b7
looking for .\SourceDir\OpenJDK21\hotspot\x64\jdk-21+7
looking for .\SourceDir\OpenJDK21\hotspot\x64\jdk-21.0.4+7
looking for .\SourceDir\OpenJDK21\hotspot\x64\jdk-21.0.4.0+7
looking for .\SourceDir\OpenJDK-Latest\hotspot\x64\jdk-21.0.4+7
SOURCE Dir not found / failed
Listing directory :
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK21
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK21\hotspot
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK21\hotspot\x64
F:\workspace\workspace\build-scripts\release\create_installer_windows\wix\SourceDir\OpenJDK21\hotspot\x64\jdk-21.0.5+9

This is probably unrelated to the docker changes, but it is something we should address as a general build issue when building anything other than the latest version, such as a previous GA level. (FYI @andrew-m-leonard)
I've locked create_installer_windows#779 which shows the issue so it can be looked at and re-run if desired. The lock should be released once this is resolved. Similarly create_installer_windows#785 has been locked which was a re-run with the PRODUCT_*_VERSION fields corrected to be consistent with the directory name in the zip file.
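
The mismatch can be reproduced on paper by generating the directory names the installer job looks for and comparing them with what the zip contains. The patterns in this Python sketch are inferred from the "looking for" lines in the log above, not taken from the installer scripts, and `candidate_dirs` is a hypothetical name.

```python
def candidate_dirs(major, minor, maintenance, patch, build):
    """Directory names seen in the 'looking for' log lines for a given version."""
    return [
        f"jdk-{major}.{minor}.{maintenance}",
        f"jdk{major}u{maintenance}-b{build}",
        f"jdk-{major}+{build}",
        f"jdk-{major}.{minor}.{maintenance}+{build}",
        f"jdk-{major}.{minor}.{maintenance}.{patch}+{build}",
    ]


if __name__ == "__main__":
    actual = "jdk-21.0.5+9"  # top level directory found in the zip
    # PRODUCT_*_VERSION fields matching the SCM_REF: no candidate matches.
    print(actual in candidate_dirs(21, 0, 4, 0, 7))
    # Fields corrected to match the directory name in the zip.
    print(actual in candidate_dirs(21, 0, 5, 0, 9))
```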

@sxa sxa reopened this Oct 12, 2024
@sxa
Member Author

sxa commented Oct 28, 2024

Reproducibility tests based on 21.0.5+11:

Job docker UCRT params devkit param Result
1065 Differences in jmods - retry passed 100%
1066 ReproduciblePercent = 100 %
1067 ReproduciblePercent = 100 %
1068 UCRT paths with /cygdrive/C instead of /c: ReproduciblePercent = 100%

@sxa
Member Author

sxa commented Nov 5, 2024

Reproducibility is confirmed good after a number of tests.

A couple of additional notes on this:

  • https://lippertmarkus.com/2021/09/04/containers-without-docker-desktop/ may be another option for installing docker but I have not tried this
  • The docker commands from the CE version I've used need to be run as administrator - there is no way to have this run as a normal user (the jenkins one we typically use for the agent does not generally run as an administrative user)
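
Given the administrator requirement, a wrapper could fail early with a clear message rather than letting the docker command fail. This is a sketch only: `is_admin` is a hypothetical helper that calls the Windows shell32 `IsUserAnAdmin` function, with a non-Windows fallback so the example runs anywhere.

```python
import ctypes
import os


def is_admin():
    """Return True if the current process has administrative rights."""
    if os.name == "nt":
        # shell32.IsUserAnAdmin returns non-zero for an elevated process.
        return bool(ctypes.windll.shell32.IsUserAnAdmin())
    # Fallback for non-Windows hosts: treat root as the administrator.
    return os.geteuid() == 0


if __name__ == "__main__":
    if not is_admin():
        print("docker commands are likely to fail: not running as administrator")
```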

@sxa
Member Author

sxa commented Nov 7, 2024

pipelines PR was merged yesterday so the code is in and this can now be used once we identify suitable systems which can run docker, bearing in mind that for the current docker tests you need to have jenkins running as an administrative user, which is not the case for our existing machines.

@sxa sxa changed the title New Machine requirement: Windows dockerBuild containers EPIC: New Machine requirement: Windows dockerBuild containers Nov 26, 2024
@sxa sxa unpinned this issue Nov 26, 2024
@sxa
Member Author

sxa commented Nov 26, 2024

New machines being tested:

The AMD machine completed a jdk21u build in a container from the command line in just under 3h so it is possible to build with 4GiB of RAM. It was slightly slower than the numbers from the B2ms systems in the earlier comment, but it's also a different CPU, plus those tests were done with the machine having been worked a bit from the start so may have been subject to bursting limits. Since the AMD one seems to work I will also look at loading up its 256GiB C: drive with a normal playbook run (excluding VS2013, 2017 and 2019) so it can act as a drop-in replacement for an existing build machine even without enabling the containerised builds. The first two here are my prototype machines which will be replaced by the two new ones, but here is a spec comparison so we have the info stored:

Machine Cores Docker Disk RAM jdk8u jdk21u jdk24
dh-w22-1 Xeon 8370C 250GiB Premium SSD v2 8GiB 57m29 (re-run)
dh-w22-sxa1/3 Xeon 8370C 200GiB Premium SSD 8GiB 33m02 (Now deleted)
dh-w22-1-intel Xeon 8171M 128GiB HDD 8GiB TBC (Failed when run with docker support)
dh-w22-2-amd AMD EPYC 7763 128GiB HDD 4GiB 1h04
dh-w22-3-intel Xeon E5-2673v4 128GiB HDD 4GiB

@sxa
Member Author

sxa commented Nov 27, 2024

I have got "static" containers running jenkins agents which are running on burstable machines to test the process with the ea pipelines this week. This will mean we can switch to/from this performance testing without modifying the pipelines (i.e. the new explicit docker support is not yet being enabled). A couple of notes on this:

  1. Other than the first build, this is likely to be quite slow due to the use of burstable machines
  2. The wix label has been removed from these agents as there is a problem with the locales in the containers (Potentially we could install enough to make WIX work on the docker host system) [*]
  3. I have removed the build label from the two existing machines so that they will not be used for the builds, but they are kept online so that the installer jobs requiring wix can run on them.

[*] - Sample failure: https://ci.adoptium.net/job/build-scripts/job/release/job/create_installer_windows/1108/console

Building setup translation for culture "de-de" with LangID "1031"...
Input Error: Can not find script file "C:\Program Files (x86)\Windows Kits\10\bin\10.0.17763.0\x64\WiLangId.vbs".
WiLangId failed with : 1
Failed to generate setup translation of culture "de-de" with LangID "1031".
failed to build translation de-de 1031
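
Whether an image or host has the script that this failure complains about can be checked before enabling the wix label. A Python sketch, assuming the layout shown in the error message (Windows Kits\10\bin\<version>\x64\WiLangId.vbs); `find_wilangid` is a hypothetical helper.

```python
from pathlib import Path

# Root taken from the error message above.
DEFAULT_KITS_ROOT = r"C:\Program Files (x86)\Windows Kits"


def find_wilangid(kits_root=DEFAULT_KITS_ROOT):
    """Return every WiLangId.vbs found under any installed Windows 10 kit version."""
    return sorted(Path(kits_root).glob("10/bin/*/x64/WiLangId.vbs"))


if __name__ == "__main__":
    found = find_wilangid()
    print(found if found else "WiLangId.vbs not found: installer translations will fail")
```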

@sxa
Member Author

sxa commented Nov 28, 2024

I've done a bit of rebasing to allow us to use the updated playbooks (Currently the dockerfile is still pointing at my original fork/branch of the infrastructure repo, which has resulted in it not having recent updates such as ant)

I still have to have fixes for ant-contrib (The download isn't working - I have to pull it from a copy I've put in place) and #3828 is preventing parts of the playbook from completing too with the latest versions.

@sxa
Member Author

sxa commented Nov 29, 2024

Ref the WiX error above, I have done the following to try to allow it to run on the 8GiB Intel machine:

This means that right now the machine can technically run jobs from three different jenkins agents running on it:

The first two should not be enabled in parallel (same for the other similar machines) as this will overwhelm them. None of these agents are currently running as a service during this prototype phase.

@sxa
Member Author

sxa commented Nov 29, 2024

First pass with JDK8

| Machine | Time |
|---|---|
| build-docker-1-amd (on dh-2-amd) | 1h06 |
| build-docker-2-intel (on dh-1-intel) | 1h32 |
| build-docker-3-intel (on dh-2-intel) | 1h40 |

With jdk21 (I'll kick off some runs and populate this table once adoptium/installer#1063 is merged):

| Machine | Time | Reproducible? | Notes |
|---|---|---|---|
| b-d-w22-2-intel | XhXX + XhXX | | |
| b-d-w22-1-amd | XhXX + XhXX | | |
| b-d-w22-3-intel | XhXX + XhXX | | |
| dh-w22-1-intel | | | |
| dh-w22-2-amd | | | |
| dh-w22-3-intel | | | |

@sxa
Member Author

sxa commented Nov 29, 2024

Noting that all -ea builds from this week were run in the static docker containers, and are therefore the first set to be built with machines that only have the MS VS2022 Build Tools installation, which meets the requirement in adoptium/temurin-build#3787

Projects
Status: In Progress
2 participants