it's breaking CRUSH rule #41
I want to add another violation of the CRUSH rule, in a stretch-mode (dual data center) setup. Note that stretch mode does not need to be enabled for this issue to occur; the CRUSH rule just has to be in use. The result is that some PGs live on 3 OSDs in one data center and on only one in the other. This leads to inactive PGs when one data center is offline (I hit this issue in production before stretch mode was enabled, while min_size=2 was still enforced on the pool). CRUSH rule:
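The original rule text is not part of this excerpt. A typical rule for this kind of setup picks two datacenters and two OSDs in each; the following is a hypothetical sketch, not the reporter's actual rule (the rule name, id, and bucket names are assumptions):

    rule stretch_rule {
        id 2
        type replicated
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
    }

With a rule of this shape, CRUSH itself always yields a 2+2 split across the datacenters, so the 3+1 split described above can only come from upmap exceptions layered on top of it.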
Tested on Ceph 18.2.1 and 18.2.2. Example:
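The reporter's example output is not reproduced here. The symptom can be inspected on a cluster roughly as follows (the PG id and OSD ids are hypothetical placeholders):

    # Show the up/acting set of a suspect PG
    ceph pg map 2.1a

    # Show the CRUSH location (datacenter/host) of each OSD in that set
    ceph osd find 11
    ceph osd find 42

    # List the upmap exceptions currently installed in the OSD map
    ceph osd dump | grep pg_upmap_items

If three of the acting OSDs resolve to the same datacenter, the placement violates the intended 2-per-DC rule.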
As @dthpulse mentions, AFAIK this upmap should be ignored by Ceph, as it violates the CRUSH policy for the pool. I will verify this with the Ceph developers. Adding insult to injury (IMHO), this issue will go unnoticed when stretch mode is enabled, because as it currently stands min_size gets set to 1 in stretch degraded mode (I opened https://tracker.ceph.com/issues/64842 to fix that).
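For reference, the pool's effective replication settings can be checked directly (the pool name is a placeholder):

    # min_size drops to 1 while the cluster is in stretch degraded mode
    ceph osd pool get mypool min_size
    ceph osd pool get mypool size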
oh dear. i'll have to go over the placement constraints once again - could you please send me a state dump of your cluster as well, to jj at sft dawt lol?
I have sent you the state file by mail. Thanks for looking into it.
Hi
On Ceph Quincy 17.2.7, with an EC pool using this CRUSH rule:
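The rule itself is not included in this excerpt. For an EC pool spread as two chunks per datacenter, a rule of this general shape is commonly used; this is a hypothetical sketch, not the reporter's actual rule:

    rule ec_dc_rule {
        id 3
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 2 type datacenter
        step chooseleaf indep 2 type host
        step emit
    }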
EC profile:
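The profile is likewise not included. With two chunks per datacenter across two datacenters, k+m would total 4; a profile of that shape could be created like this (the profile name and k/m values are assumptions, not the reporter's settings):

    ceph osd erasure-code-profile set ec_dc_profile \
        k=2 m=2 \
        plugin=jerasure technique=reed_sol_van \
        crush-failure-domain=host

Note that when the pool is assigned a hand-written CRUSH rule such as the one above, the profile's crush-failure-domain only influences the auto-generated rule, not the custom one.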
I originally had PGs distributed over 2 OSDs per DC, but after running this balancer I found that this distribution is broken for a lot of PGs: one DC now holds 3 OSDs and the other only 1.
It looks to me like the balancer is ignoring the custom CRUSH rule for EC pools.
It is also strange that pg-upmap-items allows this; according to the docs it shouldn't be applied if it breaks the CRUSH rule. Let me know if you need more details to debug; for now I wrote a little script to fix this issue on my cluster.
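The reporter's script is not included here. One way to repair the placement is simply to drop the offending upmap exceptions so that CRUSH recomputes the 2-per-DC mapping; the sketch below assumes the violating PG ids have already been identified (the ids shown are placeholders):

    #!/bin/sh
    # List current exceptions with: ceph osd dump | grep pg_upmap_items
    # Then remove the exceptions for the PGs that violate the rule;
    # the affected PGs backfill back to their CRUSH-computed placement.
    for pg in 14.1a 14.2b 14.3c; do
        ceph osd rm-pg-upmap-items "$pg"
    done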
Thank you!