Propolis panic during zone uninstall #827

Open · askfongjojo opened this issue Dec 14, 2024 · 5 comments

askfongjojo commented Dec 14, 2024

I found two sets of propolis core files on a certain sled after running the "parking" script that halted and uninstalled all zones on the sled.

Here are the core files of the two propolis zones in question; there are 5 occurrences for each of them:

root@oxz_switch0:~# pilot host exec -c 'ls -lh /pool/*/*/crypt/debug/core\.* || true' 0-31
 7  BRM27230045        ok: 
 8  BRM44220011        ok: 
 9  BRM44220005        ok: -rw-r--r--   1 root     root        309M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_0807fc69-c1ad-4768-9cb6-9c54746b36d5.propolis-server.14698.1734168037
-rw-r--r--   1 root     root       69.4M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_0807fc69-c1ad-4768-9cb6-9c54746b36d5.propolis-server.18482.1734168039
-rw-r--r--   1 root     root       69.4M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_0807fc69-c1ad-4768-9cb6-9c54746b36d5.propolis-server.18516.1734168040
-rw-r--r--   1 root     root       69.4M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_0807fc69-c1ad-4768-9cb6-9c54746b36d5.propolis-server.18551.1734168040
-rw-r--r--   1 root     root       69.4M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_0807fc69-c1ad-4768-9cb6-9c54746b36d5.propolis-server.18571.1734168041
-rw-r--r--   1 root     root        186M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_75e401d6-7502-4231-b30c-b2bfa32a1a6f.propolis-server.18383.1734168037
-rw-r--r--   1 root     root       69.4M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_75e401d6-7502-4231-b30c-b2bfa32a1a6f.propolis-server.18453.1734168038
-rw-r--r--   1 root     root       69.4M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_75e401d6-7502-4231-b30c-b2bfa32a1a6f.propolis-server.18487.1734168039
-rw-r--r--   1 root     root       69.4M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_75e401d6-7502-4231-b30c-b2bfa32a1a6f.propolis-server.18521.1734168040
-rw-r--r--   1 root     root       69.4M Dec 14  2024 /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_75e401d6-7502-4231-b30c-b2bfa32a1a6f.propolis-server.18554.1734168040
10  BRM42220009        ok: 
...

The stacks appear identical; here is one of them:

BRM44220005 # mdb /pool/ext/f8b11629-ced6-412a-9c3f-d169b99ee996/crypt/debug/core.oxz_propolis-server_75e401d6-7502-4231-b30c-b2bfa32a1a6f.propolis-server.18554.1734168040
Loading modules: [ libumem.so.1 libnvpair.so.1 libc.so.1 ld.so.1 ]
> $C ! demangle
fffff5ffeddffa20 libc.so.1`_lwp_kill+0xa()
fffff5ffeddffa50 libc.so.1`raise+0x22(6)
fffff5ffeddffaa0 libc.so.1`abort+0x58()
fffff5ffeddffab0 0x28c9ab9()
fffff5ffeddffac0 0x28c9aa9()
fffff5ffeddffb20 rust_panic+0xd()
fffff5ffeddffbe0 std::panicking::rust_panic_with_hook::h503ea5292ea6f2f4+0x231()
fffff5ffeddffc20 std::panicking::begin_panic_handler::{{closure}}::h2eb8efd06bcdc46a+0x98()
fffff5ffeddffc30 0x28b0699()
fffff5ffeddffc60 0x28b2d3c()
fffff5ffeddffc90 0x28f7f6f()
fffff5ffeddffd70 0xf881f7()
fffff5ffeddffee0 std::sys::backtrace::__rust_begin_short_backtrace::h68463c8dc06c772b+0xd6()
fffff5ffeddfff60 core::ops::function::FnOnce::call_once{{vtable.shim}}::hc5463a0161650b1d+0xa3()
fffff5ffeddfffb0 std::sys::pal::unix::thread::Thread::new::thread_start::he13a45effb26dfc6+0x2b()
fffff5ffeddfffe0 libc.so.1`_thrp_setup+0x77(fffff5ffeef30240)
fffff5ffeddffff0 libc.so.1`_lwp_start()
askfongjojo added this to the 12 milestone on Dec 14, 2024
@askfongjojo (Author)

I uploaded an instance of the core files for each propolis zone to /staff/core/propolis-827.

@askfongjojo (Author)

The two instances had been running for some time and most likely didn't have any new processes in them that triggered new guest OS-related failures:

root@[fd00:1122:3344:105::3]:32221/omicron> select instance.id, name, instance.time_created, vmm.id, vmm.time_created from instance join
vmm on instance.id = vmm.instance_id where vmm.id in ('0807fc69-c1ad-4768-9cb6-9c54746b36d5', '75e401d6-7502-4231-b30c-b2bfa32a1a6f');
                   id                  |   name    |         time_created          |                  id                  |         time_created
---------------------------------------+-----------+-------------------------------+--------------------------------------+--------------------------------
  228b79fd-b4ff-4f50-97a5-286c949e695d | rocky     | 2024-05-24 19:04:52.732903+00 | 75e401d6-7502-4231-b30c-b2bfa32a1a6f | 2024-12-14 00:16:57.880984+00
  4378f1f2-f09a-4d29-b30d-83e9a9037de0 | sbmysql-3 | 2024-11-27 00:35:49.433431+00 | 0807fc69-c1ad-4768-9cb6-9c54746b36d5 | 2024-12-12 00:41:03.934748+00
(2 rows)

Unfortunately, their propolis logs were destroyed as part of the zone uninstall, so all we have for debugging are the core files.


leftwo commented Dec 15, 2024

> Unfortunately, their propolis logs were destroyed as part of the zone uninstall, so all we have for debugging are the core files.

This is unfortunate, yes.

I wonder if we could, as part of the parking, force an archive/rotation of the logs? Though if the panic happens at the time of zone uninstall, archiving logs is not going to catch it. Perhaps things were in an abnormal state that contributed to the panic, and the earlier logs would be a clue. As far as I know, we have not previously seen core files from a running propolis during a rack parking.

As a second thing to try, we could spin up instances on a racklette, park it, and see if we can get a panic, perhaps while tailing the propolis log from the global zone.

Third, it's possible Crucible came off the rails because things were not shut down "properly", i.e., tasks just dying at unexpected moments. While not ideal, that may not be an actual problem given that we are parking the rack; nonetheless, it should be understood.


jclulow commented Dec 15, 2024

Perhaps we ought to add an in-memory ring buffer of recent log records, in such a way that it could be fished out easily from the core file?
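
A minimal sketch of what such a ring buffer could look like, using only the Rust standard library; the names here (RING_CAPACITY, LOG_RING, ring_log) are hypothetical, not existing Propolis code, and in practice this would presumably be wired into propolis-server's logging rather than called directly:

use std::collections::VecDeque;
use std::sync::Mutex;

// Number of recent log records to keep (illustrative value).
const RING_CAPACITY: usize = 1024;

// A fixed-capacity ring of recent log lines. Keeping it in a static gives
// it a stable symbol that a debugger could locate in a core file even
// after the on-disk logs are gone.
static LOG_RING: Mutex<VecDeque<String>> = Mutex::new(VecDeque::new());

// Record one log line, evicting the oldest entry once at capacity.
fn ring_log(line: String) {
    let mut ring = LOG_RING.lock().unwrap();
    if ring.len() == RING_CAPACITY {
        ring.pop_front();
    }
    ring.push_back(line);
}

Because the buffer lives at a fixed static address, its contents could be fished out of a core with mdb even when the zone's log files have already been destroyed.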

@askfongjojo (Author)

> I wonder if we could, as part of the parking, force an archive/rotation of the logs?

Related:
oxidecomputer/omicron#7012
oxidecomputer/omicron#4906
oxidecomputer/omicron#3860
