Hung? #161

Open
aaraujo666 opened this issue Sep 22, 2024 · 3 comments

Comments

@aaraujo666

Current status (running for the past 24 hrs)

(DRYRUN MODE) Now scanning "/nfs", found 6924758 files.
(DRYRUN MODE) Now have 6924758 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 75907323718686 bytes or 69 TiB
Removed 292875 files due to unique sizes from list.6631883 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes:
removed 365136 files from list.6266747 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes
removed 4229642 files from list.2037105 files left.

Now it's been sitting here for the last 20 hours with this message:

(DRYRUN MODE) Now eliminating candidates based on sha1 checksum:

I get that calculating checksums is a more intensive process, and 2 million files is not exactly a walk in the park, but...

CPU utilization is holding at 5%, and memory utilization is no longer changing (it did during the previous stages, while processing).

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 2257015 5.0 1.8 2659852 2440212 pts/0 D+ Sep20 90:31 rdfind -n true /nfs

@jmbreuer

jmbreuer commented Nov 2, 2024

I'm also experiencing an rdfind instance stalled on, in my case, "eliminating candidates based on first bytes".

I didn't have that much patience, but I do see that it's not using any CPU or I/O.

Assuming it is hung, I set about aborting it, only to find that it doesn't react to any of Ctrl-Z, kill -STOP, kill -TERM or even kill -KILL...

The filesystem (an external NTFS drive) I'm trying to run rdfind on is still accessible normally, so whatever happened, it's not caused by or related to filesystem or device access in general.

I'll have to reboot now anyway to be able to unmount that external drive. I'll see whether the same thing happens again, and hopefully I'll be a little better prepared for it (in the sense of trying to attach a debugger and figuring out where rdfind gets stuck).

@jmbreuer

jmbreuer commented Nov 2, 2024

... it happened again, this time a bit later during "eliminating candidates based on last bytes".

gdb also hangs when trying to attach to the stalled rdfind process. I see no smoking guns in dmesg (no messages at all correlating with the time when rdfind stalled). A third-party pstack tool also hangs, with the slightly more valuable diagnostic of "LWP pid cannot be stopped: Operation not permitted".

ps shows that process in state D+, i.e. uninterruptible sleep. I'm running kernel 6.6.52 on x86_64.
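
For anyone who wants to check the same thing without ps: on Linux the state letter is also exposed in /proc/<pid>/stat, right after the parenthesised command name. The sketch below is only an illustration of reading it; the function names `parse_proc_state` and `read_proc_state` are made up for this example and are not part of rdfind. A `D` here means uninterruptible sleep, which would be consistent with the process not reacting to signals while it stays in that state.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not part of rdfind): extracts the state character
// from one line of /proc/<pid>/stat, which looks like "pid (comm) state ...".
// comm may itself contain spaces or parentheses, so we look for the LAST ')'
// instead of splitting on spaces. Returns '?' if the line looks malformed.
char parse_proc_state(const std::string& stat_line) {
  const std::string::size_type close = stat_line.rfind(')');
  if (close == std::string::npos || close + 2 >= stat_line.size()) {
    return '?';
  }
  if (stat_line[close + 1] != ' ') {
    return '?';
  }
  return stat_line[close + 2];
}

// Hypothetical convenience wrapper, Linux only: reads /proc/<pid>/stat and
// returns the state character, or '?' if the file cannot be read.
char read_proc_state(long pid) {
  std::ifstream f("/proc/" + std::to_string(pid) + "/stat");
  std::string line;
  if (!f.is_open() || !std::getline(f, line)) {
    return '?';
  }
  return parse_proc_state(line);
}
```

Calling `read_proc_state` with the PID of the stalled rdfind process should give the same letter that ps prints in its STAT column, minus modifiers such as `+`.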

@fire-eggs

fire-eggs commented Nov 2, 2024

There is a comment in rdutil.h :

  // if there is trouble with too much disk reading, sleeping for nsecsleep
  // nanoseconds can be made between each file.

This is set via the -sleep command line option, e.g. -sleep 1ms. Unfortunately, the smallest sleep time is 1ms; since the code uses nanoseconds, it'd be nice if the sleep time could be reduced.

Given the behavior described, I wonder whether the sleep time became corrupt and went to infinity.

All the "stages" (first bytes / last bytes / etc) call the same underlying code in Fileinfo.cc, Fileinfo::fillwithbytes, which does:

  std::fstream f1;
  f1.open(m_filename.c_str(), std::ios_base::in);

Unfortunately the code neither checks the stream status nor clears errors, which might result in a hang (e.g. as I found described here).

Also note the f1.open call does NOT open the file in binary mode!
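
To make the two points above concrete, here is a minimal sketch of what a checked, binary-mode read of the first bytes could look like. This is not rdfind's code and not a proposed patch: `read_first_bytes` and its signature are invented for illustration. Whether checks like these would change the stall reported above is not established, since a process in uninterruptible sleep is blocked inside the kernel rather than in user code.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not rdfind's Fileinfo::fillwithbytes): reads up to
// `count` bytes from the start of `filename` into `out`.
// Returns false if the file cannot be opened or a hard read error occurs.
bool read_first_bytes(const std::string& filename, std::size_t count,
                      std::vector<char>& out) {
  out.clear();

  // Open explicitly in binary mode so no text-mode translation can happen.
  std::ifstream f(filename, std::ios_base::in | std::ios_base::binary);
  if (!f.is_open()) {
    return false;  // open failed: report it instead of reading anyway
  }

  out.resize(count);
  f.read(out.data(), static_cast<std::streamsize>(count));

  // badbit signals an unrecoverable stream error. A short file only sets
  // eofbit/failbit, which is not an error for this purpose.
  if (f.bad()) {
    out.clear();
    return false;
  }

  // Keep only the bytes that were actually read.
  out.resize(static_cast<std::size_t>(f.gcount()));
  return true;
}
```

A caller would check the return value and skip or report the file on failure, rather than continuing with whatever happens to be in the buffer.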
