KeyError: 66 (in used += self.cluster.osd_transfer_remainings[osdid]) #35

patrakov · 2024-03-11T18:57:19Z

While trying to rebalance an especially broken cluster, my colleague found this exception:

# ./placementoptimizer.py --osdsize device balance --osdused delta --max-pg-moves 50 --osdfrom fullest
Traceback (most recent call last):
  File "./placementoptimizer.py", line 5475, in <module>
    exit(main())
  File "./placementoptimizer.py", line 5470, in main
    run()
  File "./placementoptimizer.py", line 5434, in <lambda>
    run = lambda: balance(args, state)
  File "./placementoptimizer.py", line 4600, in balance
    need_simulation=True)
  File "./placementoptimizer.py", line 3260, in __init__
    self.init_analyzer.analyze(self)
  File "./placementoptimizer.py", line 4264, in analyze
    self._update_stats()
  File "./placementoptimizer.py", line 4350, in _update_stats
    self.cluster_variance = self.pg_mappings.get_cluster_variance()
  File "./placementoptimizer.py", line 3771, in get_cluster_variance
    for crushclass, usages in self.get_class_osd_usages().items():
  File "./placementoptimizer.py", line 3509, in get_class_osd_usages
    ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
  File "./placementoptimizer.py", line 3509, in <dictcomp>
    ret[crushclass] = {osdid: self.get_osd_usage(osdid) for osdid in self.osd_candidates_class[crushclass]}
  File "./placementoptimizer.py", line 3757, in get_osd_usage
    used = self.get_osd_usage_size(osdid, add_size)
  File "./placementoptimizer.py", line 3714, in get_osd_usage_size
    used += self.cluster.osd_transfer_remainings[osdid]
KeyError: 66

Note that osd.66 is the only OSD which has the hdd_test class:

$ ceph osd tree | grep test
 66  hdd_test     14.55269          osd.66              up   1.00000  1.00000

As we are not permitted to publicly post anything containing UUIDs that can be used to identify the customer's cluster, I am going to submit the debug info via private email.

patrakov · 2024-03-11T20:06:05Z

I have successfully worked around the crash by adding --only-crushclass hdd
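
One plausible reading of why the workaround helps, sketched in Python: the traceback shows a per-class loop over `osd_candidates_class` followed by a direct lookup in `osd_transfer_remainings`, so skipping the `hdd_test` class would skip the lookup for osd.66. The dictionary contents, the `class_osd_usages` helper, and its `only_crushclass` parameter are hypothetical stand-ins, not the tool's actual code.

```python
# Hypothetical stand-ins for the structures named in the traceback.
osd_candidates_class = {"hdd": [1, 2, 3], "hdd_test": [66]}
# Assumption: the transfer bookkeeping has no entry for osd.66.
osd_transfer_remainings = {1: 0, 2: 1024, 3: 0}


def class_osd_usages(only_crushclass=None):
    """Look up every candidate OSD, optionally restricted to one class."""
    ret = {}
    for crushclass, osdids in osd_candidates_class.items():
        if only_crushclass is not None and crushclass != only_crushclass:
            continue  # classes that are filtered out are never looked up
        ret[crushclass] = {
            osdid: osd_transfer_remainings[osdid] for osdid in osdids
        }
    return ret


# Without the filter, the direct lookup fails on the missing OSD ID.
try:
    class_osd_usages()
    failed_key = None
except KeyError as exc:
    failed_key = exc.args[0]

# With the filter, osd.66 is never touched.
restricted = class_osd_usages(only_crushclass="hdd")
```

This only illustrates the failure mode; it does not explain why osd.66 is missing from the bookkeeping in the first place.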

TheJJ · 2024-03-15T19:13:25Z

Thanks for the report and the file. I can probably figure this out from the data, but you may know directly:
how is the hdd_test class OSD selected when the others are from the hdd class? A manual CRUSH root?

patrakov · 2024-03-15T19:28:04Z

No idea. Apparently they just set one OSD to this class and later created a pool that uses it. Today they set more OSDs to this class (see the dump from #36).

TheJJ · 2024-03-16T15:07:30Z

Sounds wild :D Let's hope they know what they're doing (but then again, they seem to have asked you for help :)
Looking at the CRUSH rules, all the hdd rules pick default~hdd, and hdd_test devices are not part of this. Maybe it's possible to have upmaps to devices of different classes? These would not be movement candidates but would still need to be accounted for when moving data away from them. But this seems rather special, phew. I'm going to think a bit about what this means for handling it properly 🤔
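
The hypothesis above can be sketched in Python, assuming upmaps really can point at OSDs outside the candidate class. This is not the fix that was applied to placementoptimizer.py. The helper names, argument shapes, and sample numbers are all hypothetical. The sketch finds OSDs that upmaps reference but that are not movement candidates, and treats a missing transfer entry as zero instead of raising.

```python
def non_candidate_upmap_osds(upmap_items, candidates):
    """Return OSDs referenced by upmaps that are not movement candidates."""
    referenced = set()
    for pairs in upmap_items.values():
        for osd_from, osd_to in pairs:
            referenced.update((osd_from, osd_to))
    return referenced - set(candidates)


def osd_usage_size(osdid, osd_used, transfer_remainings):
    """Used bytes plus pending transfers; a missing entry counts as 0."""
    return osd_used[osdid] + transfer_remainings.get(osdid, 0)


# Hypothetical sample data: PG "1.a" is upmapped from osd.1 to osd.66.
upmap_items = {"1.a": [(1, 66)], "1.b": [(2, 3)]}
candidates = [1, 2, 3]
osd_used = {1: 100, 2: 200, 3: 300, 66: 50}
transfer_remainings = {1: 10}

outside = non_candidate_upmap_osds(upmap_items, candidates)
usage_66 = osd_usage_size(66, osd_used, transfer_remainings)
usage_1 = osd_usage_size(1, osd_used, transfer_remainings)
```

Defaulting to zero avoids the crash but may hide a real accounting gap, so whether it is the right behavior depends on how the tool is meant to treat such OSDs.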

patrakov · 2024-03-16T15:16:23Z

I should have expressed myself better. I now think that I observed a state with only one OSD in the "hdd_test" CRUSH device class purely because my work and theirs were poorly coordinated. The end result (two hosts full of hdd_test OSDs, which makes more sense but still triggers the issue unless --only-crushclass hdd is added) is available in the dump that I sent you for issue #36.

patrakov mentioned this issue Mar 15, 2024

ZeroDivisionError: division by zero (osd_objs_acting) #36

Closed
