RuntimeError: pg 18.6a to be moved to osd.117 is misplaced with -198781.0<0 objects already transferred #38
I was having the same issue: 259 OSDs on 15 hosts, running Reef (18.2.4). The problem PG belongs to a 2.4 PB 8+2 erasure-coded pool and is active+undersized+degraded+remapped+backfill_toofull. Its number of objects is close to, but not identical to, the negated value in the error message (it could have been exactly the same at the time the error was reported). Note that the "acting" set is missing one OSD that is present in the "up" set (see below).

The cluster is also in the middle of a major rebalancing due to an increased pg_num: several OSDs are overfull while other recently added OSDs have plenty of space, many PGs are backfill_toofull, and 1.3% of objects are degraded. The cluster is currently doing 2-3 GB/s (1.3k obj/s) of recovery/backfill on the HDDs, with several more weeks to go.

The problem appeared immediately or shortly after running commands output by placementoptimizer.py; however, the problem PG was not one of the PGs it had suggested moving. HOWEVER, since I started writing this report, the problem has gone away! I think we had it for a couple of hours, so apparently it is transient. I diffed the `ceph pg 18.1b4 query` output from when the problem was present against the output after it disappeared, and the only differences are small changes in various counts/timestamps and in the last_update/last_complete fields; the "up" and "acting" values have not changed. The output below is from when I was still having the problem.

(Attached outputs: `./placementoptimizer.py balance --osdfrom limiting` and `ceph pg 18.1b4 query`)
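For context on where a negative number like this can come from, here is a minimal, hypothetical sketch (not the actual placementoptimizer.py code; the formula and field names are assumptions loosely mirroring `ceph pg <pgid> query` stats) of how an "objects already transferred" estimate can dip below zero when a PG is degraded, i.e. when the acting set is missing an OSD that is present in the up set:

```python
# Hypothetical sketch only -- the field names mirror PG stats from
# `ceph pg <pgid> query`, but the formula is an assumption, not the real
# placementoptimizer.py logic.

def estimate_transferred(num_objects: int, num_objects_misplaced: int, moves: int) -> float:
    """Rough estimate of how many objects already sit on the backfill target(s)."""
    # Ceph counts misplaced object *copies*, so for a degraded EC PG (one shard
    # missing from the acting set) the misplaced count can exceed the number of
    # objects attributable to the pending moves, pushing the estimate negative.
    return num_objects - num_objects_misplaced / moves


if __name__ == "__main__":
    # Numbers chosen only to reproduce the -198781.0 seen in the error message.
    est = estimate_transferred(num_objects=600000, num_objects_misplaced=798781, moves=1)
    print(est)  # -198781.0
    if est < 0:
        # A strict sanity check like this would then abort the run.
        raise RuntimeError(f"pg is misplaced with {est}<0 objects already transferred")
```

If the real script performs a similar sanity check, that would also be consistent with the error clearing on its own once backfill reduces the degraded/misplaced counts.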
I am using the script to watch the progress of backfills on a broken cluster, but it fails with the exception above.
I will send you the dump via email. Yes, I know that one PG is not recoverable without the manual export/import.