
it's breaking CRUSH rule #41

Open
dthpulse opened this issue Apr 22, 2024 · 3 comments

Comments

@dthpulse

Hi

on Ceph Quincy 17.2.7, with EC pool using CRUSH rule:

{
    "rule_id": 10,
    "rule_name": "ec33hdd_rule",
    "type": 3,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "datacenter"
        },
        {
            "op": "choose_indep",
            "num": 2,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
}

EC profile:

crush-device-class=hdd
crush-failure-domain=datacenter
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=3
plugin=jerasure
technique=reed_sol_van
w=8
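
Together, the rule and the profile are supposed to spread the k+m = 6 chunks as 2 OSDs in each of the 3 datacenters. A quick sanity check of that arithmetic (just an illustration of the intended layout, not output from the cluster):

k, m = 3, 3                # from the EC profile
datacenters = 3            # "choose_indep", num 3, type datacenter
osds_per_datacenter = 2    # "choose_indep", num 2, type osd

assert datacenters * osds_per_datacenter == k + m  # all 6 chunks placed
# With one datacenter down, 4 chunks survive, which is still >= k = 3,
# so the data stays readable as long as the rule is actually respected.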

Originally the PGs were distributed over 2 OSDs per DC, but after running this balancer I found that for a lot of PGs this distribution is broken: some DCs now hold 3 OSDs of a PG while another holds only 1.

It looks to me like the balancer is ignoring the custom CRUSH rule for EC pools.

It is also strange that pg-upmap-items allows this; according to the docs it should not apply a mapping that breaks the CRUSH rule.
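
To spot the affected PGs, here is a minimal sketch of such a check. The pool name, the expected count of 2 OSDs per datacenter, and the JSON layouts of ceph osd tree and ceph pg ls-by-pool are assumptions that may need adjusting for your release:

import json
import subprocess
from collections import Counter

POOL = "ecpool"        # placeholder pool name
OSDS_PER_DC = 2        # what the CRUSH rule above is supposed to enforce

def ceph_json(*args):
    # Run a ceph CLI command and parse its JSON output.
    return json.loads(subprocess.check_output(["ceph", *args, "-f", "json"]))

tree = ceph_json("osd", "tree")
nodes = {n["id"]: n for n in tree["nodes"]}
parent = {child: node["id"]
          for node in tree["nodes"]
          for child in node.get("children", [])}

def datacenter_of(osd_id):
    # Walk up the CRUSH tree until a bucket of type "datacenter" is found.
    cur = osd_id
    while cur in parent:
        cur = parent[cur]
        if nodes[cur]["type"] == "datacenter":
            return nodes[cur]["name"]
    return None

pgs = ceph_json("pg", "ls-by-pool", POOL)
pg_stats = pgs["pg_stats"] if isinstance(pgs, dict) else pgs  # layout differs between releases
for pg in pg_stats:
    per_dc = Counter(datacenter_of(osd) for osd in pg["up"])
    if any(count != OSDS_PER_DC for count in per_dc.values()):
        print(pg["pgid"], dict(per_dc))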

Let me know if you need more details to debug; for now I wrote a little script to fix this issue on my cluster.

Thank you!

@hydro-b

hydro-b commented Jun 17, 2024

I want to add another violation of the CRUSH rule, this time in a stretch-mode (dual data center) setup. Note: stretch mode does not need to be enabled for the issue to occur, only the CRUSH rule has to be in use. Instead of the intended two OSDs per data center, some PGs end up on 3 OSDs in one data center and on just one in the other. This results in inactive PGs when one data center is offline (I hit this issue in production before stretch mode was enabled, while min_size=2 was still enforced on the pool).

CRUSH rule:

rule stretch_replicated_rule {
	id 3
	type replicated
	step take default
	step choose firstn 0 type datacenter
	step choose firstn 0 type host
	step chooseleaf firstn 2 type osd
	step emit
}
{
        "rule_id": 3,
        "rule_name": "stretch_replicated_rule",
        "type": 1,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "datacenter"
            },
            {
                "op": "choose_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 2,
                "type": "osd"
            },
            {
                "op": "emit"
            }
        ]
    }

Tested on Ceph 18.2.1 and 18.2.2. Example:

ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                STATUS  REWEIGHT  PRI-AFF
 -1         0.78394  root default                                      
-10         0.39197      datacenter DC1                             
 -3         0.39197          host host1                           
  0    hdd  0.09798              osd.0            up   1.00000  1.00000
  1    hdd  0.09798              osd.1            up   1.00000  1.00000
  4    ssd  0.09798              osd.4            up   1.00000  1.00000
  5    ssd  0.09798              osd.5            up   1.00000  1.00000
-11         0.39197      datacenter DC2                             
 -5         0.39197          host host2                           
  2    hdd  0.09798              osd.2            up   1.00000  1.00000
  3    hdd  0.09798              osd.3            up   1.00000  1.00000
  6    ssd  0.09798              osd.6            up   1.00000  1.00000
  7    ssd  0.09798              osd.7            up   1.00000  1.00000
./placementoptimizer.py -v balance --max-pg-moves 10
...
output omitted
...
ceph osd pg-upmap-items 3.38 0 2
ceph osd pg-upmap-items 3.38 0 2
ceph pg ls |grep ^3.38
3.38       17         0          0        0  67375104            0           0   9306         0  active+clean    33s   1296'9306  1595:31340  [3,6,5,2]p3  [3,6,5,2]p3  2024-06-16T23:40:43.524716+0200  2024-06-14T15:41:41.842701+0200                    1  periodic scrub scheduled @ 2024-06-18T00:44:32.626640+0200

As @dthpulse mentions, AFAIK this upmap should be ignored by Ceph as it violates the CRUSH policy for the pool. I will verify this with Ceph developers.
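
Until that is clarified, the offending mapping can be reverted by hand so that CRUSH recomputes the placement from the rule alone; a minimal sketch (the PG id is simply the one from the example above):

import subprocess

# Drop the pg-upmap-items entry for PG 3.38 (the PG shown above) so its
# placement falls back to what the CRUSH rule computes on its own.
subprocess.run(["ceph", "osd", "rm-pg-upmap-items", "3.38"], check=True)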

To add insult to injury (IMHO), this issue will go unnoticed when stretch mode is enabled, because min_size currently gets set to 1 in stretch degraded mode (I opened https://tracker.ceph.com/issues/64842 to fix that).

@TheJJ
Owner

TheJJ commented Jun 18, 2024

oh dear. i'll have to go over the placement constraints once again - could you please send me a state dump of your cluster as well to jj at sft dawt lol?

@hydro-b

hydro-b commented Jun 18, 2024

> oh dear. i'll have to go over the placement constraints once again - could you please send me a state dump of your cluster as well to jj at sft dawt lol?

I have sent you the state file by mail. Thanks for looking into it.
