Revert ARM changes #92

Merged
merged 2 commits into main from revert-arm-changes on Jan 29, 2025

Conversation

andyundso
Member

it seems that if you build an image for different architectures and on different machines, the last one pushed overwrites the previous one, even if their architecture is different. right now the pipeline on main passed and only the ARM64 versions are present ...
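A minimal sketch of the usual workaround for this, assuming the per-arch images are first pushed under separate tags (the tag names below are made up for illustration): stitch them into one multi-arch manifest with buildx imagetools instead of letting each machine's push overwrite the shared tag.

# Illustration only: push each architecture under its own suffix first,
# then combine both into a single multi-arch tag.
docker buildx imagetools create \
  -t pgautoupgrade/pgautoupgrade:dev \
  pgautoupgrade/pgautoupgrade:dev-amd64 \
  pgautoupgrade/pgautoupgrade:dev-arm64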


@andyundso andyundso merged commit 536d763 into main Jan 29, 2025
14 of 26 checks passed
@justinclift justinclift deleted the revert-arm-changes branch February 1, 2025 10:12
@justinclift
Member

it seems that if you build an image for different architectures and on different machines, the last one pushed overwrites the previous one, even if their architecture is different

Well damn. 😬

@justinclift
Member

Weirdly, after the revert things are still not building correctly.

Looking at the failing base image (Alpine, PG 11), the log shows:

437  54.23 configure: error: zlib library not found
438  54.23 If you have zlib already installed, see config.log for details on the
439  54.23 failure.  It is possible the compiler isn't looking in the proper directory.
440  54.23 Use --without-zlib to disable zlib support.

Which makes no sense, as we're specifically installing the zlib-dev package as part of the Dockerfile for Alpine:

apk add --update build-base icu-data-full icu-dev linux-headers lz4-dev musl musl-locales musl-utils tzdata zlib-dev zstd-dev && \

The installation of the zlib dev library happens on line 207:

207  #10 [linux/amd64 base-build 4/4] RUN apk update &&   apk upgrade &&   apk add --update build-base \
icu-data-full icu-dev linux-headers lz4-dev musl musl-locales musl-utils tzdata zlib-dev zstd-dev && \
apk cache clean

I wonder if the problem is bad caching of some variety?
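One hedged way to test the caching theory (the Dockerfile path and tag below are assumptions, not the repo's actual names) is to force a rebuild with the builder cache bypassed:

# Illustration only: rebuild the Alpine base image without any cached layers.
docker buildx build \
  --no-cache \
  --platform linux/amd64,linux/arm64 \
  -f Dockerfile.alpine \
  -t pgautoupgrade/build-11:alpine \
  .

If that succeeds where CI fails, stale or cross-architecture layer reuse becomes a much stronger suspect.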

@andyundso
Member Author

... could be? I mean we could delete the base-image from Docker Hub and see if it rebuilds.

@justinclift
Member

Ahhh, hadn't thought of that. That's a good idea.

In the meantime, I've just temporarily disabled zlib support in the PG 11 build to see if the failure changes at all. I'm kind of suspecting it won't, but let's see what happens. 😄

@andyundso
Member Author

funny enough, the final images were all able to build the base-11 on their own. :D

@justinclift
Member

My temporary commit didn't help at all, so I'm going to revert it and try your idea of killing the base-image from Docker Hub. 😄

@justinclift
Member

justinclift commented Feb 1, 2025

Hmmm, I'm not seeing image(s) on Docker Hub with base in their name. Are you meaning the build-* ones?

@andyundso
Member Author

ah yeah, build, sorry 😅

@justinclift
Member

Killed all of the build-* images, but things are still failing. 😬

Looking over the failing log this time (a different one, yet again):

7878  #18 6.005 checking whether we are cross compiling... configure: error: in `/buildroot/postgresql-14.15':
7879  #18 6.153 configure: error: cannot run C compiled programs.

It seems like it's running the compile process for PG twice (for the same ARM64 arch), and in this particular case it somehow thinks during the second compile that the C compiler isn't producing valid programs.

I say "during the second compile" because it runs through the whole process a few thousand lines earlier, successfully installing and compile PG for ARM64 the first time around.

Any ideas?

I wonder if we need to get the config.log to be output (i.e. cat config.log) to try and figure out wtf is going wrong? As per the error output...

7897  6.153 See `config.log' for more details
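A rough sketch of how that could look in the build step (the configure flags here are abbreviated placeholders, not the repo's actual ones): keep the step failing, but cat config.log first so the underlying compiler error ends up in the CI log.

# Illustration only: print config.log before propagating the failure.
cd postgresql-14.* && \
  { ./configure --prefix=/usr/local-pg14 || { cat config.log; exit 1; }; }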

@justinclift
Member

Meh, I'm pretty certain it's more likely some kind of flakiness with GitHub's infrastructure.

I just re-ran one of the successful CI runs from last week, and it failed this time around.

https://github.com/pgautoupgrade/docker-pgautoupgrade/actions/runs/12970259213

@andyundso
Member Author

maybe some of the ARM infrastructure is leaking over to the AMD64 infrastructure? 😄

I mean, shouldn't be possible ... as long as we can build the final images it is okay, but the pipeline will take much longer.

@andyundso
Member Author

I am really confused why the builds do not work on main. My initial thought was that the base images were somehow broken since the ARM changes, but this workflow from @spwoodcock does not use any caching. I deleted all caches for GitHub Actions; still the same.
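(For reference, a sketch of that cache purge from the CLI, assuming a gh version that ships the cache subcommand:)

# Illustration only: wipe every GitHub Actions cache for the repository.
gh cache delete --all --repo pgautoupgrade/docker-pgautoupgrade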

at least it appears to always fail at the same point in the Alpine image:

21.99 configure: error: in `/buildroot/postgresql-9.5.25':
21.99 configure: error: C compiler cannot create executables

I rebuilt the Alpine image on my local machine without any issues.

The first time the build failed was two weeks ago (26th of January). The build the week before (19th of January) passed. So it is likely not related to the fact that GitHub changed the default Ubuntu image from 22.04 to 24.04.


but what is also strange is that fail-fast no longer seems to work. usually, when building the base images failed, it did not build the target images. but now it suddenly does (at least this is how I remember it).


@justinclift
Member

justinclift commented Feb 6, 2025

Yeah, it makes no sense to me either. I was wondering if one of the external things we call might have changed (i.e. the Alpine or Debian base image), but that doesn't seem like it would consistently cause the behaviour we're seeing either.

@andyundso
Member Author

I now re-activated fail-fast on my feature branch and re-ran the Build dev images workflow - now suddenly it works ...

@andyundso
Member Author

well, nevermind. segmentation fault when building the ARM version of the Postgres v15 base image:

2025-02-06T10:41:42.7953077Z  > [linux/arm64 build-15 2/2] RUN cd postgresql-15.* &&   ./configure --prefix=/usr/local-pg15 --with-openssl=no --without-readline --with-icu --with-lz4 --with-system-tzdata=/usr/share/zoneinfo --enable-debug=no CFLAGS="-Os" &&   make -j $(nproc) &&   make install-world &&   rm -rf /usr/local-pg15/include:
2025-02-06T10:41:42.7960267Z 1018.0 gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -Os -DFRONTEND -I. -I../../src/common -I../../src/include  -D_GNU_SOURCE  -DVAL_CC="\"gcc\"" -DVAL_CPPFLAGS="\"-D_GNU_SOURCE\"" -DVAL_CFLAGS="\"-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -Os\"" -DVAL_CFLAGS_SL="\"-fPIC\"" -DVAL_LDFLAGS="\"-Wl,--as-needed -Wl,-rpath,'/usr/local-pg15/lib',--enable-new-dtags\"" -DVAL_LDFLAGS_EX="\"\"" -DVAL_LDFLAGS_SL="\"\"" -DVAL_LIBS="\"-lpgcommon -lpgport -llz4 -lz -lm \""  -c -o exec.o exec.c
2025-02-06T10:41:42.7966186Z 1019.3 gcc: internal compiler error: Segmentation fault signal terminated program cc1
2025-02-06T10:41:42.7966925Z 1019.3 Please submit a full bug report, with preprocessed source (by using -freport-bug).
2025-02-06T10:41:42.7967695Z 1019.3 See <file:///usr/share/doc/gcc-12/README.Bugs> for instructions.

@spwoodcock
Member

spwoodcock commented Feb 6, 2025

I'm far from certain as I haven't checked very thoroughly yet, but using qemu and buildx runs the ARM and AMD builds in parallel on the same machine, right?

If there is caching that is defined for the whole workflow, would the AMD and ARM builds be sharing the same cache?

Could that cause the issue we are seeing?
Worth testing disabling the cache?
Or adding the arch variable to the cache-from key?

Saying that, I have never experienced such an issue before with similar multi-arch workflows.
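On the cache-from idea, a hedged sketch of what per-architecture cache scoping could look like with buildx's GitHub Actions cache backend (the scope names here are invented for illustration, and type=gha only works inside an Actions run):

# Illustration only: give each architecture its own cache scope so the
# amd64 and arm64 builds cannot overwrite each other's cached layers.
docker buildx build \
  --platform linux/arm64 \
  --cache-from type=gha,scope=build-arm64 \
  --cache-to type=gha,mode=max,scope=build-arm64 \
  .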

@andyundso
Member Author

but using qemu and buildx runs the ARM and AMD builds in parallel on the same machine, right?

correct.

If there is caching that is defined for the whole workflow, would the AMD and ARM builds be sharing the same cache?

i am not quite sure how the GitHub Actions cache created by the Docker build-and-push action behaves. It is also quite small (23 MB), so I do not think it contains anything related to our image.

Saying that, I have never experienced such an issue before with similar multi-arch workflows.

same here. although at this point I think the failing build on main is disconnected from the failing build on my feature branch using your new workflow. I assume the build on the feature branch simply puts too much load on one machine, which is why it's crashing.

I'll open a new PR to switch to the Ubuntu 22.04 image. maybe this could help, not sure.
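On the load theory, one hedged option (a sketch only, not what the Dockerfile currently does, and with the configure flags abbreviated) would be to cap make's parallelism for the emulated ARM build so both architectures aren't saturating the same runner at once:

# Illustration only: the real build uses "make -j $(nproc)"; under qemu a
# lower cap trades build time for less peak CPU and memory pressure.
cd postgresql-15.* && \
  ./configure --prefix=/usr/local-pg15 --with-icu --with-lz4 CFLAGS="-Os" && \
  make -j 2 && \
  make install-world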

@andyundso
Member Author

very funky, the build appears to work for all base images on Ubuntu 22.04. so it could really be that the 24.04 image has some kind of issue?

I think 22.04 will be supported for quite some time yet ... so let's stay on that version until we need to upgrade 😄

@justinclift
Member

Yeah, this was a super weird one. Glad something worked to get things building properly again though. 😄
