Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big endian PowerPC CI #17258

Draft
wants to merge 23 commits into
base: master
Choose a base branch
from
Draft

Conversation

NattyNarwhal
Copy link
Member

@NattyNarwhal NattyNarwhal commented Dec 24, 2024

I'll clean up the history later, but some notes:

  • This is a nightly run; it takes approximately half an hour to run to completion.
  • This runs on Gentoo, because it has good support for ppc64 and I know people familiar with it.
    • If we want to stick with a binary distribution, Debian unstable, openSUSE, or Adelie would probably be the best bet.
  • The runner system is an LPAR on a POWER8 system I have racked in a real datacentre.
    • The partition has 4 cores (8 threads per core, so 32 threads) assigned to it and 128 GB of RAM (which it rarely uses it seems, tops out at just over 16 GB in my experience if we need to resize). The storage is some spinning rust in RAID 5.
    • We could probably do ppc64le CI with this as well. Or developer systems for PHP people who need to test things.
  • The official runner is incompatible, so I have to use an unofficial implementation.
    • This is joined to my php-src fork right now; if we want to merge this, we'd obviously need to assign this to php/php-src. Is there any process for this?
    • I have some tweaks to the runner scripts to run it as a systemd unit and pick up the right LLVM version on path.
    • This isn't being run containerized. It'd be nice to figure that out.
  • Sample run here.
    • ext/ffi/tests/gh11934.phpt will be fixed by Fix GH-16013 and bug #80857: Big endian issues #17255. I don't know what's up with the POSIX tests; I believe these used to pass until I updated the system. I'll look into this.
    • A lot tests are slow; the number may need tuning because POWER8 is a bit old. The gd tests being very slow seem very unusual to me though.

@cmb69
Copy link
Member

cmb69 commented Dec 30, 2024

  • The gd tests being very slow seem very unusual to me though.

The 5 slowest test use helpers written in PHP which check all pixels; the 6th slowest calls imagecolormatch() (which also checks all pixels) on a 16 mega-pixel image. Given that the tests use -j32 and that the L1 cache appears to be very small, I guess this is partly a RAM performance issue. I wouldn't be too worried about that.

However, it would make sense to rewrite the GD test helpers (particularly, test_image_equals_file() and calc_image_dissimilarity()) in C. See #17305.

@NattyNarwhal
Copy link
Member Author

Looks like all ran tests pass now after a rebase! The tests are still slow (in particular gd), but the server is a little older and we are running several tests in parallel with sanitizers.

Copy link
Member

@iluuu1994 iluuu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good and simple enough. Is mysql already set up on the machine? I'm guessing this runs without docker? I'm not sure if those tests are actually executed.

net-nds/openldap \
dev-db/unixODBC \
dev-db/postgresql \
|| true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the || true be removed? Was it added for testing purposes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think this is to allow a trailing backslash on the "last" line (like trailing comma for arrays).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that it also suppresses errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&& true instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user that CI is running as doesn't have permission to install anything, it's purely for documentation until it does (though w/o binpkgs on Gentoo, we probably don't want to anyways).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of hundreds of lines of untested code :) In that case, we should probably just remove it. We can keep it in the PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by keep it in the PR; the PR description?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a comment, in case it needs to be added at some point so it's not fully lost.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've turned these into asserts that the package is installed.

@NattyNarwhal
Copy link
Member Author

Is mysql already set up on the machine? I'm guessing this runs without docker? I'm not sure if those tests are actually executed.

This does not run containerized. I'd like to, just have to figure out how with the runner implementation.

There's also no DB server running, so I suspect those tests are skipped. Would it be important to get that working?

@iluuu1994
Copy link
Member

iluuu1994 commented Jan 6, 2025

Would it be important to get that working?

That's up to you. I think the mysql extension does involve some bit shifting stuff for the protocol implementation, so it might make sense to test it with big endian.

@NattyNarwhal
Copy link
Member Author

Rebasing and adding MySQL tests, run here. Postgres and Firebird in theory shouldn't also be much more work, although ideally I'd like to run all these DBs (and the job itself) in a container.

@NattyNarwhal
Copy link
Member Author

Failed tests after enabling that + rebasing:

  • There's some DOM tests failing. Not sure what's up with that.
  • ext/mysqli/tests/functions/mysqli_character_set_name.phpt, ext/mysqli/tests/functions/mysqli_character_set_name_oo.phpt, ext/mysqli/tests/mysqli_change_user_set_names.phpt, ext/mysqli/tests/mysqli_get_charset.phpt: Looks like MariaDB has a different default for utf8 (instead of utf8mb3). Considering mb4 should be used nowadays AFAIK...
  • ext/mysqli/tests/mysqli_thread_id.phpt, ext/mysqli/tests/mysqli_stmt_get_result.phpt: looks like differences in warnings for closed connections
  • ext/mysqli/tests/mysqli_get_client_stats.phpt, ext/mysqli/tests/mysqli_query.phpt, ext/mysqli/tests/mysqli_query_unicode.phpt: Looks like our unprivileged DB user can't drop tables? Pretty sure GRANT ALL ON test.* [...] should include DROP...
  • ext/mysqli/tests/bug74021.phpt: Might also be related to the fact we're running unprivileged.

Note that this is MariaDB instead of MySQL. I can try with MySQL instead. Some stuff is also because we're not running it as root; I'd really prefer to avoid that until again, some kind of containerization.

@NattyNarwhal
Copy link
Member Author

NattyNarwhal commented Jan 8, 2025

Having some issues with MySQL (MariaDB works, so whatever, upstream bug filed), so setting up Pg in a similar manner. Firebird looks like a piece of work, so putting that off. I might try to clean up the tests so they work on both MariaDB and MySQL, since MariaDB is pretty common in the ecosystem nowadays.

Run here (oops, forgot pg is in skipSlow)

@NattyNarwhal
Copy link
Member Author

Running MySQL 8 instead of MariaDB, tests work as expected except for the ones that assume it's running as a superuser so far: https://github.com/NattyNarwhal/php-src/actions/runs/12696420415/job/35390469488

@NattyNarwhal
Copy link
Member Author

Did a run with skipSlow set to false to test more things:

  • DOM still failing for mysterious reasons; I thought this may have been libxml2 regression since I did an upgrade, but reverting that version didn't change much
  • Enchant tests fail because no dicts, adding those next run.
  • ext/pgsql/tests/80_bug14383.phpt: I didn't even know pg could work with gdbm here; not sure what's up here.
  • SNMP and LDAP are basically skipped even though they're built, SNMP should be trivial to set up (though I don't like rwusers...), LDAP more involved (likely have to adapt setup-slapd).

@cmb69
Copy link
Member

cmb69 commented Jan 10, 2025

  • DOM still failing for mysterious reasons

Possibly a big-endian issue, or otherwise platform related. cc @nielsdos

  • ext/pgsql/tests/80_bug14383.phpt: I didn't even know pg could work with gdbm here; not sure what's up here.

Maybe @Girgias has an idea?

@nielsdos
Copy link
Member

@NattyNarwhal I'll take a look at DOM tomorrow on your big endian machine.

@nielsdos
Copy link
Member

@NattyNarwhal Please rebase, the DOM issue is fixed on master.

@Girgias
Copy link
Member

Girgias commented Jan 11, 2025

  • DOM still failing for mysterious reasons

Possibly a big-endian issue, or otherwise platform related. cc @nielsdos

  • ext/pgsql/tests/80_bug14383.phpt: I didn't even know pg could work with gdbm here; not sure what's up here.

Maybe @Girgias has an idea?

This is annoying as I don't know which tests of the suite fails, there might some weird conflict with the other DBA tests being run in parallel?

@NattyNarwhal
Copy link
Member Author

Thanks for fixing DOM!

Run here; MySQL tests skipped because I brought it down to reduce noise and just focus on that failing Pg test.,

@NattyNarwhal
Copy link
Member Author

NattyNarwhal commented Jan 11, 2025

My guess is it picks a different DBA backend that it maybe shouldn't:

ppc system
checking for QDBM support... no
checking for gdbm_open in -lgdbm... yes
checking for GDBM support... yes
checking for NDBM support... no
checking for TCADB support... no
checking for mdb_env_open in -llmdb... yes
checking for LMDB support... yes
checking for Berkeley DB4 support... no
checking for Berkeley DB3 support... no
checking for Berkeley DB2 support... no
checking for DB1 support... no
checking for DBM support... no
checking for CDB support... builtin
checking for INI File support... builtin
checking for FlatFile support... builtin
checking whether to enable DBA interface... yes
amd64 runner
checking for dpopen in -lqdbm... yes
checking for QDBM support... yes
checking for GDBM support... no
checking for NDBM support... no
checking for tcadbopen in -ltokyocabinet... yes
checking for TCADB support... yes
checking for mdb_env_open in -llmdb... yes
checking for LMDB support... yes
checking for Berkeley DB4 support... no
checking for Berkeley DB3 support... no
checking for Berkeley DB2 support... no
checking for DB1 support... no
checking for DBM support... no
checking for CDB support... builtin
checking for INI File support... builtin
checking for FlatFile support... builtin
checking whether to enable DBA interface... yes
checking whether to enable dl-test extension... yes, shared

I had copied the configure invocation from the Alpine section from nightly, which doesn't run pg (on account of not running a pg server, on top of skipSlow).

@NattyNarwhal
Copy link
Member Author

Yeah. it's QDBM (works) vs. GDBM (doesn't).

The DBA test setup has this function to pick the first applicable handler:

function get_any_handler(): string {
    foreach (dba_handlers() as $handler) {
        // Those are weird
        if ($handler !== 'cdb' && $handler !== 'cdb_make' && $handler !== 'inifile') {
            echo 'Using handler: "', $handler, '"', \PHP_EOL;
            return $handler;
        }
    }
    return 'should_not_happen';
}

dba_handlers returns:

array(7) {
  [0]=>
  string(3) "cdb"
  [1]=>
  string(8) "cdb_make"
  [2]=>
  string(7) "inifile"
  [3]=>
  string(8) "flatfile"
  [4]=>
  string(4) "qdbm"
  [5]=>
  string(5) "tcadb"
  [6]=>
  string(4) "lmdb"
}

So it's presumably picking qdbm out this after disqualifying cdb and inifile. No idea why it doesn't pick flatfile even though it's not excluded...

After testing outside of CI with that, it passes.

@NattyNarwhal
Copy link
Member Author

run here with that fixed and also running MySQL tests

This assumes gentoo (which has best ppc64be support of mainstream
distributions).

(Rebased onto the new workflow_call approach)
Not using containers (yet), for now just setting up MariaDB with an
unprivileged user and seeing how that goes. Postgres and Firebird also
seem doable, although SQL Server for ODBC definitely won't be.
PDO_sqlite, the Pg extensions, SNMP, LDAP, Enchant, Tidy, ODBC, and
PDO_DBLIB. Note that DBLIB and ODBC will be hard to test since the tests
assume SQL Server. LDAP and SNMP may require config on the PPC CI
server, but Pg shhould be configured and ready to go.
The checks for things like ext/ldap will assume libraries are in ./lib,
and not lib64 like they are on many multilib systems, Gentoo included.

Gross that we're hardcoding it, but we should also probably convert
ldap to pkg-config sometime. That may require getting rid of Oracle or
Solaris LDAP support though.
Gentoo default USE flags seem to use hunspell for enchant.
GDBM is broken with ext/pgsql/tests/80_bug14383.phpt
@NattyNarwhal
Copy link
Member Author

run with setting log_bin_trust_function_creators in MySQL.

  • ext/mysqli/tests/mysqli_get_client_stats.phpt: Needs ext/mysql: Don't mention errors for optional part of the test #17466.
  • ext/mysqli/tests/bug73462.phpt: Can't seem to reproduce this one out of CI.
  • ext/standard/tests/general_functions/proc_open_multiplex.phpt: Might be spurious, I can't seem to reproduce outside of CI.
  • Some tests are still horribly slow (mysqli, gd, bcmath). This might be an artifact of sanitizers being really nasty for performance. It doesn't help run-tests's parallel runners mode seems to allocate jobs in a way where you have might have several tests and several CPUs, but a few tests in, only one is running tests and the rest are zombies. (CONFLICTS doesn't help, but not that many tests are marked that way.)
  • The runner has a habit of cutting off stdout/err sometimes when steps end. Not too big of a deal, since the most important stuff is up front.

@NattyNarwhal
Copy link
Member Author

  • Some tests are still horribly slow (mysqli, gd, bcmath). This might be an artifact of sanitizers being really nasty for performance. It doesn't help run-tests's parallel runners mode seems to allocate jobs in a way where you have might have several tests and several CPUs, but a few tests in, only one is running tests and the rest are zombies. (CONFLICTS doesn't help, but not that many tests are marked that way.)

Testing this with gd at least (since that's compute heavy); -O0 didn't much make of a difference vs. -O2, but sanitizers on is where the performance impact is. About what me and everyone else presumably expected, but good to confirm. With mysql, most of the time is spent in waiting in poll, so it makes me think MySQL itself is being slow. Not too important to look at now, but something to check later.

@iluuu1994
Copy link
Member

iluuu1994 commented Jan 15, 2025

The build looks indeed pretty slow and I don't think it would be good to add this to the push job in this condition. I assume that because the runner isn't containerized, it also cannot reliably run jobs in parallel. You could try dropping UBSan, which I think causes a bigger slowdown compared to ASan, but also less value. If that doesn't help, it might be better to just add the job to nightly. This should largely be enough, these breaks are rare and I look at nightly regularly. I am somewhat surprised that you saw no significant improvement for -O2, are you sure this flag was actually propagated correctly? Normally this would be set by the build system with --disable-debug --enable-debug-assertions.

@NattyNarwhal
Copy link
Member Author

The build looks indeed pretty slow and I don't think it would be good to add this to the push job in this condition.

Yes, I was thinking nightly is fine for now. I think for now it's probably best to get nightly going (especially if you're looking at it, it's not going to get ignored), and

I assume that because the runner isn't containerized, it also cannot reliably run jobs in parallel.

Indeed, because a lot of tests assume /tmp is unique per run as well as the DB servers. For a push run, the latter could probably be skipped.

You could try dropping UBSan, which I think causes a bigger slowdown compared to ASan, but also less value. If that doesn't help, it might be better to just add the job to nightly. This should largely be enough, these breaks are rare and I look at nightly regularly.

I think it's better to have the sanitizers with their performance impact than not in a nightly build. In a push build, I'd probably pare down the built extensions and maybe the sanitizers.

I am somewhat surprised that you saw no significant improvement for -O2, are you sure this flag was actually propagated correctly? Normally this would be set by the build system with --disable-debug --enable-debug-assertions.

Sorry, should have been more clear: This was comparing -O2 vs. -O0 vs. sanitizers on (0/2 doesn't seem to make a perf difference, I assume sanitizers implicitly kill optimization). The CI build should be doing the latter because --enable-debug IIRC implies -O0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants