panic: unexpected error during closedir "Bad file descriptor" for unix::fs::Dir #124105
Comments
@ConradIrwin Can you discuss the tracebacks for this, or does your telemetry lack sufficiently sophisticated crash reporting to make that possible? |
I know you meant something slightly aside from the literal reading of this, but I do expect the standard library to panic on indexing a slice with its length, or greater. And people write code like that every day. So clearly the criterion has to be something more like
|
Fair: I do expect panics on programmer error or in a situation where it's unsafe to continue. I am relatively (but it's obviously hard to be 100%) sure this is not programmer error, and I'm convinced that it would be just fine if the program kept running (cf. how this error is ignored when dropping a file descriptor).

The full backtrace is below. All line numbers in Zed code are relative to v0.131.6, and rust is 1.77.1. I'm not sure why the top line number is missed by symbolicate, but the panic message narrows it down to this assertion. I believe the reason this panics on line 4018 of worktree.rs in Zed's code is that
|
Thank you for sharing! I am not necessarily opposed to this change, I just felt it would be useful to know a bit more context. Hmm. I wonder if, in another thread that is not bound by the executor, but does wait until after the worktree code has started up, destroying one of the directories in question would do it? That would make sense, and would explain why we get such a lovely "error" from the OS. That would be a classic case of "force majeure" that we typically do want to ignore, yes. |
This is different because closing regular files can emit IO errors for which there is no good recovery strategy, since after calling close it is unclear whether the FD has already been closed successfully or not; the recommended strategy for error handling in those cases is to call

EBADF is very different. It indicates that something closed a file descriptor it did not own, which is a violation of IO safety. Unless you have some evidence that this is just macOS weirdness, I would expect this to indicate a very serious bug in userspace, which can lead to data corruption and UB due to file descriptor confusion. This error should not be suppressed. Arguably we should also not silently ignore EBADF for OwnedFd, and only suppress other errors such as EIO. And maybe we should add a |
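A minimal sketch of the policy being suggested here, assuming a hand-rolled close helper rather than the actual standard-library code: surface EBADF loudly because it means the descriptor was not ours to close, while other close errors can only be swallowed since close must not be retried.

```rust
// Hedged illustration only -- not the std implementation.
use std::os::fd::{IntoRawFd, OwnedFd};

fn close_checked(fd: OwnedFd) {
    let raw = fd.into_raw_fd(); // take over the raw fd so no double close happens
    if unsafe { libc::close(raw) } == -1 {
        let err = std::io::Error::last_os_error();
        if err.raw_os_error() == Some(libc::EBADF) {
            // EBADF: some code closed a descriptor it did not own -- fd confusion.
            panic!("closed a file descriptor we did not own: {err}");
        }
        // Other errors (EIO, ...): nothing sensible can be done once close() has run.
    }
}
```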
Caveat: |
It is, though as with everything Apple, it's a mess. Errors can only occur from the |
Thanks for looking it up. I think that is enough evidence that there's probably some library-UB going on in zed and we shouldn't try to silence it. |
The only place in

As for dependencies, it might help to get a complete backtrace to see which |
It's not necessarily readdir that's causing this, it could be the victim of something else going rogue and closing the wrong fd numbers. |
It must be... Though I'm hoping that there is enough locality between the |
I have an idea for how to harden std types against these kinds of things, but it would be fairly expensive and imo only makes sense for debug_assert builds: randomization of file descriptor numbers. But I'm not sure if people would be on board with it. |
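To make that idea concrete, here is a rough sketch (my own illustration, not a proposed std patch): dup each newly acquired fd up into a sparse, less predictable region of the fd table and close the original, so code that reuses a stale or guessed descriptor number hits EBADF instead of silently aliasing a live descriptor.

```rust
// Illustration only; the offsets and the randomness source are arbitrary.
use std::os::fd::{FromRawFd, IntoRawFd, OwnedFd};
use std::time::{SystemTime, UNIX_EPOCH};

fn randomize_fd(fd: OwnedFd) -> std::io::Result<OwnedFd> {
    // Cheap, dependency-free jitter; must stay below RLIMIT_NOFILE (often 1024).
    let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos();
    let min = 600 + (nanos % 300) as libc::c_int;

    let raw = fd.into_raw_fd();
    // F_DUPFD duplicates to the lowest free descriptor >= min.
    let dup = unsafe { libc::fcntl(raw, libc::F_DUPFD, min) };
    let result = if dup < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(unsafe { OwnedFd::from_raw_fd(dup) })
    };
    // Close the original low-numbered descriptor either way.
    unsafe { libc::close(raw) };
    result
}
```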
I think that is a very interesting idea and would support merging it as a debug assertion if it finds a bug in Zed. I am not yet convinced this is a bug in Zed. |
Yeah, without a root cause or even a reproducer I'm not claiming certainty. Another possibility that I see is some race in the |
@ConradIrwin I don't think this is a "this just happens sometimes" thing. File descriptors should not randomly vanish from a process. File descriptors are densely allocated resources holding critical data. If they were to get misnumbered under the nose of a process, things could go seriously wrong, including data corruption. And if

So there could be a serious bug somewhere, and that imo warrants further investigation rather than silencing it in the standard library. That it is some benign (although surprising) macOS behavior is also a possibility, but this should be proven before we silence it.

I have opened #124130 to abort on EBADF in OwnedFd. That should provide a signal to tell whether this is specific to ReadDir or whether file descriptors are getting clobbered in general. But you'll have to build Zed with a custom rustc build for that, or nightly if it gets merged. |
Old MacDonald had a farm... Yes, I'm in favor of us, at least for as long as we can tolerate it, silencing specifically whatever errors we expect to happen and can do nothing about, instead of "any error at all". I will simply trust @the8472 that EBADF is actually a highly specific error, in this context at least, and not merely the surface description of "something bad happened lol". |
Too much trust 😅. It's entirely possible that the usual suspects (FUSE, network filesystems, bugs in filesystem impls) do something crazy. But there's hope, let it at least interact with reality before it gets crushed again. At least libfuse docs say this about the release operation:
|
Error codes really ought to include ENFS and EFUSE.
|
Thanks for all the input here, and particularly the reassurance that "this shouldn't happen". It seems like the next steps are to try and narrow down:
For the in-process case it seems like a "bad file descriptor" means "one that the OS doesn't think you have", so either:
Although most of our code is rust, we do link to the Cocoa runtime and a few other non-rust libraries, so the search space is rather large. Ideas for narrowing it down would be much appreciated! (In rust, I think we just need to audit all our crates for

For the filesystem bug case, I haven't yet tried reproducing this on other filesystems. I know we have users using Samba at least, and FUSE/sshfs seems quite probable too, so that might be a fruitful path. If anyone's interested in this kind of spelunking, I'd love a second brain while working through this: https://calendly.com/conradirwin/pairing; otherwise any interesting hypothesis we can test for how to get a reproduction would be awesome.

If this does turn out to be a bug on the filesystem side, then I think we do need to not panic in rust. If it turns out to be a bug on the user-space side, then we'll be glad to have tracked it down, and the current behavior seems ok (until we do find a filesystem bug, at which point we'll be back to square one). In parallel we might want to explore ideas from #98338 on how we can handle fallible drops for things like this. |
Yes, that's the intent. I'm currently writing a FUSE testcase on linux to see if I can get an EBADF propagated from one of its replies to a |
Thanks! I should be able to do something similar for macFUSE / Samba next week. |
strace excerpt:
Hope has been canceled once again. Test program:

```rust
use std::{fs::File, thread, time::{Duration, UNIX_EPOCH}};

use fuser::{FileAttr, FileType};

struct TestFs {}

const DUMMY_ATTR: FileAttr = FileAttr {
    ino: 2,
    size: 13,
    blocks: 1,
    atime: UNIX_EPOCH,
    mtime: UNIX_EPOCH,
    ctime: UNIX_EPOCH,
    crtime: UNIX_EPOCH,
    kind: FileType::RegularFile,
    perm: 0o644,
    nlink: 1,
    uid: 501,
    gid: 20,
    rdev: 0,
    flags: 0,
    blksize: 512,
};

impl fuser::Filesystem for TestFs {
    fn init(&mut self, _req: &fuser::Request<'_>, _config: &mut fuser::KernelConfig) -> Result<(), libc::c_int> {
        Ok(())
    }

    // Pretend every lookup succeeds so the File::open below gets a handle.
    fn lookup(&mut self, _req: &fuser::Request<'_>, _parent: u64, _name: &std::ffi::OsStr, reply: fuser::ReplyEntry) {
        let d = Duration::from_secs(1);
        reply.entry(&d, &DUMMY_ATTR, 0);
    }

    fn open(&mut self, _req: &fuser::Request<'_>, _ino: u64, _flags: i32, reply: fuser::ReplyOpen) {
        reply.opened(1, 0);
    }

    // Reply with EBADF on flush (which the kernel sends on close) to see
    // whether the error propagates back to the caller.
    fn flush(&mut self, _req: &fuser::Request<'_>, _ino: u64, _fh: u64, _lock_owner: u64, reply: fuser::ReplyEmpty) {
        reply.error(libc::EBADF);
    }
}

fn main() {
    let fs = TestFs {};
    let t = thread::spawn(|| {
        //let options = [MountOption::RO, MountOption::AutoUnmount, MountOption::AllowRoot];
        fuser::mount2(fs, "fuse-test", &[]).expect("mount failed");
    });
    thread::sleep(Duration::from_millis(500));
    let f = File::open("./fuse-test/foo").unwrap();
    drop(f);
    t.join().unwrap();
}
```
 |
If this replicates on macOS and users have some janky FUSE drivers, those would be a likely suspect. Though those drivers should have a bug filed against them. |
A quick (though inconclusive) update: although I'm not certain it doesn't reproduce, I was unable to reproduce with the example program (it fails before it gets to the end); and in my testing with Zed itself, I could only reproduce libgit2/libgit2#6796, not this error with

Still to try: updating the example program to run to completion, and setting up a Samba server to test that one. |
I just tried compiling Zed on Linux with the rustc_codegen_cranelift backend. It crashes reliably with:
Upon further investigation, I found alacritty/alacritty#7996, which fixes the crash I found here. I think it is very likely that the bug this fixes is responsible for the panic described in this issue too. |
If you can reproduce this reliably, I'd love to debug it together - do you have time to talk the first week of July? If so, please reach out to [email protected]. In my mind there's quite a chance this is a bug in zed or its dependencies, but it's also possible that this is a failure introduced by the filesystem. Things I'd like to try:
|
Nvm, I just read the attached issue. I'll pull that into zed and see if it fixes our problems |
I can confirm that using the current master branch of alacritty-terminal fixes the IO safety crash with rustc_codegen_cranelift. It does still crash, but that is because of some unimplemented things in rustc_codegen_cranelift (rust-lang/rustc_codegen_cranelift#1492), not a bug in Zed or any of its dependencies.

```diff
diff --git a/crates/terminal/Cargo.toml b/crates/terminal/Cargo.toml
index ef6260730..3ce869597 100644
--- a/crates/terminal/Cargo.toml
+++ b/crates/terminal/Cargo.toml
@@ -14,7 +14,7 @@ doctest = false
 [dependencies]
-alacritty_terminal = "0.23"
+alacritty_terminal = { git = "https://github.com/alacritty/alacritty", version = "0.24.1-dev" }
 anyhow.workspace = true
 collections.workspace = true
 dirs = "4.0.0"
```

alacritty/alacritty#7996 hasn't been published to crates.io yet, so I had to use a git dependency. |
We've been running the fixed alacritty in preview for a week, and on stable for a few days, and have seen 0 instances of this - this either means we've been unusually lucky, or this is now solved. Thank you very much to @the8472 for adding the debug asserts, and to @jakobhellermann for figuring out the problem. |
At Zed we see 2-5 panics per week across our userbase from dropping a `Dir`. It is possible that this issue is macOS specific (as all of our telemetry is from macOS today), but we are unfortunately not able to reproduce this panic internally.

The problem for us is that this assertion sometimes fails with `unexpected error during closedir: Os { code: 9, kind: Uncategorized, message: "Bad file descriptor" }`.

I expect that the standard library should not panic in situations that do happen in reality. In particular, if you are trying to close a bad file descriptor, that seems "fine". For comparison, a failure to close a file during the drop of a file descriptor is silently ignored, as I would expect.
Not panicking is obviously not ideal; but in the context of a long-running interactive app like a text editor, we want to be able to take the trade-off of higher uptime in exchange for lower visibility.
For more discussion of possible solutions to fallible drops, #98338 has some good ideas, and I think if any are implemented we should do an equivalent thing for directories too (I particularly like the idea of a manual close that returns a result, and which causes the Drop to become a no-op). In the meantime, I think we should make rust programs crash less by ignoring this failure.
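To illustrate the kind of API being asked for (a sketch of my own over libc, not an existing std interface), a directory handle could expose an explicit close that returns a Result, after which Drop has nothing left to do:

```rust
// Hypothetical sketch; DirHandle and its methods are invented for illustration.
use std::ffi::CStr;
use std::io;

struct DirHandle {
    dir: *mut libc::DIR,
}

impl DirHandle {
    fn open(path: &CStr) -> io::Result<Self> {
        let dir = unsafe { libc::opendir(path.as_ptr()) };
        if dir.is_null() {
            Err(io::Error::last_os_error())
        } else {
            Ok(DirHandle { dir })
        }
    }

    /// Close explicitly, surfacing any error; Drop will not run for this handle.
    fn close(self) -> io::Result<()> {
        let dir = self.dir;
        std::mem::forget(self); // we take over the close, so Drop must not run too
        if unsafe { libc::closedir(dir) } == 0 {
            Ok(())
        } else {
            Err(io::Error::last_os_error())
        }
    }
}

impl Drop for DirHandle {
    fn drop(&mut self) {
        // Best-effort close on drop; errors are ignored rather than panicking.
        unsafe { libc::closedir(self.dir) };
    }
}
```

With something along these lines, a caller that cares could call close() and decide for itself whether to log or ignore the error, while the implicit drop path stays silent.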
Meta

rustc --version --verbose:

(Though this was happening in earlier versions of rust too)