docker on EC2 aarch64 dynamic images restarting after ~30 minutes (during build) #1630
Comments
It looks like this failure started on 17th Oct.
The common theme seems to be that the job fails after running for exactly 30 minutes?
At some point in the middle of this afternoon this seems to have started working again. All affected builds during that period appeared to abort after 30 minutes, with no obvious reason why, and we have not knowingly taken any remedial action. I will close for now and reopen if it recurs. Sample error:
The above PR made no difference: https://ci.adoptopenjdk.net/job/build-scripts-pr-tester/job/build-test/job/jobs/job/jdk/job/jdk-linux-aarch64-openj9/317/consoleFull showed the same failure. I have added the appropriate labels to https://ci.adoptopenjdk.net/computer/docker-packet-ubuntu1604-armv8-1/ and my proposal is that, unless we can determine the cause of the failures, we disable the dynamically provisioned EC2 aarch64 systems for the GA and single-thread all builds through that Packet machine. I am running a build at https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-linux-aarch64-hotspot/285/console to verify whether it can run the dockerBuilds successfully. If that works, the only potential problem with the proposal is a recurrence of adoptium/temurin-build#1804.
(NOTE: @gdams increased the capacity on the systems yesterday, but that did not resolve the issue.)
After seeing that the docker image was no longer running at the time the logs stopped, I wondered if we were hitting an issue with the docker subsystem on the dynamically provisioned hosts being updated and restarted while the builds were taking place. Looking at the Jenkins job log and the package logs on the host system, they seemed almost exactly an hour out at a time when the
vs the update logs:
I therefore believe these issues were caused by an automatic update on the machine trying to update docker from the version in the one-off template image used for creating the dynamic instances. If we rebuild the template with a more up-to-date docker.io package, I believe it will resolve the problem.
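For anyone wanting to repeat this kind of check, here is a rough Python sketch of how the comparison could be automated. It is not the procedure used above, which was done by eye. It assumes the host's package log uses dpkg-style lines starting with `YYYY-MM-DD HH:MM:SS`, and that the host clock differs from the Jenkins timestamps by a fixed offset. The function names, sample lines, dates, and package versions are all invented for illustration and are not taken from the affected machines.

```python
from datetime import datetime, timedelta


def parse_log_line(line):
    """Split a 'YYYY-MM-DD HH:MM:SS rest...' line into (timestamp, rest)."""
    parts = line.split(maxsplit=2)
    if len(parts) < 3:
        return None
    try:
        stamp = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None  # not a timestamped line, skip it
    return stamp, parts[2]


def docker_events_in_window(lines, start, end, host_offset=timedelta(0)):
    """Return docker-related package events that fall inside the build window.

    host_offset is added to each host timestamp to bring it onto the same
    clock as the Jenkins job log (an assumption: the offset is constant).
    """
    hits = []
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        stamp, rest = parsed
        adjusted = stamp + host_offset
        if "docker" in rest and start <= adjusted <= end:
            hits.append((adjusted, rest))
    return hits


# Hypothetical package log lines (invented data, dpkg-style layout assumed).
sample_lines = [
    "2000-01-01 03:00:00 upgrade curl:arm64 1.0-old 1.0-new",
    "2000-01-01 07:13:05 upgrade docker.io:arm64 1.0-old 1.0-new",
    "2000-01-01 07:13:09 status installed docker.io:arm64 1.0-new",
]

# Hypothetical build window as seen in the Jenkins job log.
build_start = datetime(2000, 1, 1, 8, 0, 0)
build_end = build_start + timedelta(minutes=35)

without_offset = docker_events_in_window(sample_lines, build_start, build_end)
with_offset = docker_events_in_window(
    sample_lines, build_start, build_end, host_offset=timedelta(hours=1)
)
```

In this made-up data the docker.io events only line up with the build window once the one-hour offset is applied, which is the sort of pattern described above. A real check would read the actual package log on the host and take the build start and end times from the job console.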
Looks ok after rebuilding the image (and increasing the space again, as the first rebuild was too small to run a build on). Closing as I'm now reasonably confident that the issue has been identified and resolved.
https://ci.adoptopenjdk.net/view/Failed%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-linux-aarch64-openj9-linuxXL/397/console
Node:
08:42:28 All nodes of label ‘build&&linux&&aarch64&&dockerBuild’ are offline
08:43:32 Running on EC2 (adopt_aws) - Dynamic Linux aarch64 VM provisioned from AWS (i-0fbd7d9f07a7b18f3) in /home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9-linuxXL