
it's breaking CRUSH rule #41

Open
dthpulse opened this issue Apr 22, 2024 · 3 comments

Comments

@dthpulse

Hi

on Ceph Quincy 17.2.7, with EC pool using CRUSH rule:

{
    "rule_id": 10,
    "rule_name": "ec33hdd_rule",
    "type": 3,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "datacenter"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}

EC profile:

crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=3
plugin=jerasure
technique=reed_sol_van
w=8
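
Together, the rule and the profile are supposed to spread the k+m = 6 chunks as 2 OSDs in each of the 3 datacenters. A quick sanity check of that arithmetic (just an illustration of the intended layout, not output from the cluster):

k, m = 3, 3                # from the EC profile
datacenters = 3            # "choose_indep", num 3, type datacenter
osds_per_datacenter = 2    # "choose_indep", num 2, type osd

assert datacenters * osds_per_datacenter == k + m  # all 6 chunks placed
# With one datacenter down, 4 chunks survive, which is still >= k = 3,
# so the data stays readable as long as the rule is actually respected.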

Originally the PGs were distributed over 2 OSDs per DC, but after running this balancer I found that for a lot of PGs this distribution is broken: some DCs now hold 3 OSDs of a PG while another holds only 1.

It looks to me like the balancer is ignoring the custom CRUSH rule for EC pools.

It is also strange that pg-upmap-items allows this; according to the docs it should not apply a mapping that breaks the CRUSH rule.
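
To spot the affected PGs, here is a minimal sketch of such a check. The pool name, the expected count of 2 OSDs per datacenter, and the JSON layouts of ceph osd tree and ceph pg ls-by-pool are assumptions that may need adjusting for your release:

import json
import subprocess
from collections import Counter

POOL = "ecpool"        # placeholder pool name
OSDS_PER_DC = 2        # what the CRUSH rule above is supposed to enforce

def ceph_json(*args):
    # Run a ceph CLI command and parse its JSON output.
    return json.loads(subprocess.check_output(["ceph", *args, "-f", "json"]))

tree = ceph_json("osd", "tree")
nodes = {n["id"]: n for n in tree["nodes"]}
parent = {child: node["id"]
          for node in tree["nodes"]
          for child in node.get("children", [])}

def datacenter_of(osd_id):
    # Walk up the CRUSH tree until a bucket of type "datacenter" is found.
    cur = osd_id
    while cur in parent:
        cur = parent[cur]
        if nodes[cur]["type"] == "datacenter":
            return nodes[cur]["name"]
    return None

pgs = ceph_json("pg", "ls-by-pool", POOL)
pg_stats = pgs["pg_stats"] if isinstance(pgs, dict) else pgs  # layout differs between releases
for pg in pg_stats:
    per_dc = Counter(datacenter_of(osd) for osd in pg["up"])
    if any(count != OSDS_PER_DC for count in per_dc.values()):
        print(pg["pgid"], dict(per_dc))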

Let me know if you need more details to debug; for now I wrote a little script to fix this issue on my cluster.

Thank you!

@hydro-b

hydro-b commented Jun 17, 2024

I want to add another violation of the CRUSH rule, this time in a stretch-mode (dual data center) setup. Note: stretch mode does not need to be enabled for the issue to occur, only the CRUSH rule has to be in use. Instead of the intended two OSDs per data center, some PGs end up on 3 OSDs in one data center and on just one in the other. This results in inactive PGs when one data center is offline (I hit this issue in production before stretch mode was enabled, while min_size=2 was still enforced on the pool).

CRUSH rule:

rule stretch_replicated_rule {
	id 3
	type replicated
	step take default
	step choose firstn 0 type datacenter
	step choose firstn 0 type host
	step chooseleaf firstn 2 type osd
	step emit
}
{
        "rule_id": 3,
        "rule_name": "stretch_replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "datacenter"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "osd"
            },
            {
                "op": "emit"
            }
        ]
    }

Tested on Ceph 18.2.1 and 18.2.2. Example:

ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         0.78394  root default                                      
-10         0.39197      datacenter DC1                             
 -3         0.39197          host host1                           
  0    hdd  0.09798              osd.0            up   1.00000  1.00000
  1    hdd  0.09798              osd.1            up   1.00000  1.00000
  4    ssd  0.09798              osd.4            up   1.00000  1.00000
  5    ssd  0.09798              osd.5            up   1.00000  1.00000
-11         0.39197      datacenter DC2                             
 -5         0.39197          host host2                           
  2    hdd  0.09798              osd.2            up   1.00000  1.00000
  3    hdd  0.09798              osd.3            up   1.00000  1.00000
  6    ssd  0.09798              osd.6            up   1.00000  1.00000
  7    ssd  0.09798              osd.7            up   1.00000  1.00000
./placementoptimizer.py -v balance --max-pg-moves 10
...
output omitted
...
ceph osd pg-upmap-items 3.38 0 2
ceph osd pg-upmap-items 3.38 0 2
ceph pg ls |grep ^3.38
3.38       17         0          0        0  67375104            0           0   9306         0  active+clean    33s   1296'9306  1595:31340  [3,6,5,2]p3  [3,6,5,2]p3  2024-06-16T23:40:43.524716+0200  2024-06-14T15:41:41.842701+0200                    1  periodic scrub scheduled @ 2024-06-18T00:44:32.626640+0200

As @dthpulse mentions, AFAIK this upmap should be ignored by Ceph as it violates the CRUSH policy for the pool. I will verify this with Ceph developers.
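
Until that is clarified, the offending mapping can be reverted by hand so that CRUSH recomputes the placement from the rule alone; a minimal sketch (the PG id is simply the one from the example above):

import subprocess

# Drop the pg-upmap-items entry for PG 3.38 (the PG shown above) so its
# placement falls back to what the CRUSH rule computes on its own.
subprocess.run(["ceph", "osd", "rm-pg-upmap-items", "3.38"], check=True)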

To add insult to injury (IMHO), this issue will go unnoticed when stretch mode is enabled, because min_size currently gets set to 1 in stretch degraded mode (I opened https://tracker.ceph.com/issues/64842 to fix that).

@TheJJ
Owner

TheJJ commented Jun 18, 2024

oh dear. i'll have to go over the placement constraints once again - could you please send me a state dump of your cluster as well to jj at sft dawt lol?

@hydro-b

hydro-b commented Jun 18, 2024

> oh dear. i'll have to go over the placement constraints once again - could you please send me a state dump of your cluster as well to jj at sft dawt lol?

I have sent you the state file by mail. Thanks for looking into it.
