Changing SMT level based on node label #46

schabrolles · 2020-08-11T08:54:17Z

Hi all,

I just read the following optimization document for CP4D. https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/cpd/install/node-settings.html
There is a session about SMT configuration implementation with openshift.

It deploy a shell script and service on the nodes in order to control the SMT level of each node via Labels.

oc label nodes worker-1 SMT=4

=> This switch the node to SMT4...

I find this pretty handy and think it could be deployed via this ansible playbook during the customization section.

the shell script:

#!/bin/bash
export PATH=/root/.local/bin:/root/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
export KUBECONFIG=/var/lib/kubelet/kubeconfig
COREPS=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^Core\(s\) per socket$/ {print $2}'|/bin/xargs)
SOCKETS=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^Socket\(s\)$/ {print $2}'|/bin/xargs)
let TOTALCORES=$COREPS*$SOCKETS
MAXTHREADS=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^CPU\(s\)$/ {print $2}'|/bin/xargs)
let MAXSMT=$MAXTHREADS/$TOTALCORES
CURRENTSMT=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^Thread\(s\) per core$/ {print $2}'|/bin/xargs)

while :
do
  ISNODEDEGRADED=$(/bin/oc get node $HOSTNAME -o yaml |/bin/grep machineconfiguration.openshift.io/reason |/bin/grep "unexpected on-disk state validating")
  SMTLABEL=$(/bin/oc get node $HOSTNAME -L SMT --no-headers |/bin/awk '{print $6}')
  if [[ -n $SMTLABEL ]]
    then
      case $SMTLABEL in
        1) TARGETSMT=1
      ;;
        2) TARGETSMT=2
      ;;
        4) TARGETSMT=4
      ;;
        8) TARGETSMT=8
      ;;
        *) TARGETSMT=$CURRENTSMT ; echo "SMT value must be 1, 2, 4, or 8 and smaller than Maximum SMT."
      ;;
      esac
    else
      TARGETSMT=$MAXSMT
  fi

  if [[ -n $ISNODEDEGRADED ]]
    then
      touch /run/machine-config-daemon-force
  fi

  CURRENTSMT=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^Thread\(s\) per core$/ {print $2}'|/bin/xargs)

  if [[ $CURRENTSMT -ne $TARGETSMT ]]
    then
      INITONTHREAD=0
      INITOFFTHREAD=$TARGETSMT
      if [[ $MAXSMT -ge $TARGETSMT ]]
        then
          while [[ $INITONTHREAD -lt $MAXTHREADS ]]
          do
            ONTHREAD=$INITONTHREAD
            OFFTHREAD=$INITOFFTHREAD

            while [[ $ONTHREAD -lt $OFFTHREAD ]]
            do
              /bin/echo 1 > /sys/devices/system/cpu/cpu$ONTHREAD/online
              let ONTHREAD=$ONTHREAD+1
            done
            let INITONTHREAD=$INITONTHREAD+$MAXSMT
            while [[ $OFFTHREAD -lt $INITONTHREAD ]]
            do
              /bin/echo 0 > /sys/devices/system/cpu/cpu$OFFTHREAD/online
              let OFFTHREAD=$OFFTHREAD+1
            done
            let INITOFFTHREAD=$INITOFFTHREAD+$MAXSMT
          done
        else
          echo "Target SMT must be smaller or equal than Maximum SMT supported"
      fi
  fi
  /bin/sleep 30
done

The text was updated successfully, but these errors were encountered:

Fixes: ocp-power-automation#46 Signed-off-by: Sebastien Chabrolles <[email protected]>

schabrolles · 2020-08-11T14:53:46Z

First draft to discuss about it.

schabrolles · 2020-08-12T06:56:08Z

Another Idea would be to add this functionality into the rsct container (used for DLPAR) as it as the real ppc64_cpu command to deal with it.
This container could be the real power toolbox for all this kind of dynamic functionalities

But for that, we need to add oc command and give rights to get read node labels... I don't know if it is a good idea ...

bpradipt · 2020-08-13T07:02:59Z

@mkumatag do you foresee any issues with using oc inside the worker nodes ?
The approach here is to use oc inside the node to get it's associated SMT label and then use it for SMT setting.

mkumatag · 2020-08-13T08:41:45Z

@mkumatag do you foresee any issues with using oc inside the worker nodes ?
The approach here is to use oc inside the node to get it's associated SMT label and then use it for SMT setting.

I'm just wondering if we can make use of the Machine Config Operator concept of OCP to make changes to the OS.

bpradipt · 2020-08-13T09:31:54Z

@mkumatag eventually this uses MCO itself. However what the machine config uses is the script above for every node where it uses oc to get the label and set the appropriate value

mkumatag · 2020-08-13T10:41:41Z

@mkumatag eventually this uses MCO itself. However what the machine config uses is the script above for every node where it uses oc to get the label and set the appropriate value

In MCO flow, we can directly feed the configs to the machines in the machinepool, and the same information is passed directly to the respective host hence no need to worry about the node list and the labels associated.

Meanwhile, we can try oc way of doing things to read the node labels with having very strict roles and role bindings and service account tied to the pod so that it can talk to required APIs.

I personally prefer MCO way of doing it where we can avoid some of these extra steps required.

schabrolles · 2020-08-13T10:56:33Z

@mkumatag, MCO, this is what is implemented in #47 could you please have a look ?
The disciussion to use a pod was because we already have one for RSCT which has ppc64_cpu installed. It is cleaner compare what is implemented in the script proposed by #47

The best would be to have all this ppc64_utils packages inside coreOS.

Fixes: ocp-power-automation#46 Signed-off-by: Sebastien Chabrolles <[email protected]>

bpradipt · 2020-08-17T17:26:36Z

So today we have two machineconfigpools - master and worker. In order to correctly set SMT we can use a machineconfig and let MCO take care of the node reboot (using rolling update). This is safest and controlled. The drawback is ofcourse we can't have different SMT settings for a group of nodes.

The other approach (described in this issue) of using the script which will keep checking for node label and set the SMT is non-deterministic and can have unintended consequence for the overall cluster stability. I'm also not sure if setting of node labels is a privileged operation or not and if not then this also has the risk of being misused or incorrectly used with unintended consequence. Given these I'm personally not in favour of the approach.

For now, my suggestion will be to go with the approach of machineconfig with the limitation of same SMT setting for all the worker nodes. In parallel we continue to explore better alternatives to handle mix of SMT settings and more flexibility.

@schabrolles @mkumatag

theresax · 2020-09-14T22:02:24Z

@dancasali
for the completeness sake, it make be useful to put what you have proposed the smt_crio_slub.ymal here, and describe how you expect it to be used.

@bpradipt, I saw your recommendation regarding that we should set these values using different ymal files. I agree that we should test out these changes separately, but for this to work for customers may be it is easier for them to be in the same file?

dancasali · 2020-09-15T10:48:16Z

@theresax For testing internally it is easier to have on the same file but even on CP4D Knowledge Center they are shown as different implementation places because crio.conf might differ on futures releases and we might need that to be controlled differently.
I am ok having slub and SMT together because both seem needed always together.
Here attached I have the yaml. We can have MCO to take care of the reboot while still using labels so it would provide flexibility. I don't think SMT4 is needed for all applications. Only the ones that forcefully limit CPU usage by throttling the Thread (like CP4D does).
Step 1 - After the cluster is installed apply labels (eg SMT=4 or SMT=2). If no label applied SMT will be 8 (unchanged) on the node
Step 2 - Apply yaml. It will automatically apply MCO and reboot servers.
@schabrolles I might need your help here to improve with the changes you mentioned (I could not find the changes you mentioned in this thread).
@bpradipt I am not asking to merge this because I am far from being a developer... let me know how can I better help get this out to a broader team
Attaching it in gz and zip for consume on your preferred method
smt_crio_slub.zip
smt_crio_slub.yaml.gz

theresax · 2020-09-15T16:20:55Z

Adding @CharlieW to the ticket as he will also test this out under his OCP cluster

bpradipt · 2020-09-25T18:16:59Z

Thanks @dancasali @CharlieW .. .. this is very helpful. We'll start integrating this in the code..

/cc @yussufsh @Prajyot-Parab @schabrolles

schabrolles added a commit to schabrolles/ocp4-playbooks that referenced this issue Aug 11, 2020

smt_via_label via machine config

6135815

Fixes: ocp-power-automation#46 Signed-off-by: Sebastien Chabrolles <[email protected]>

schabrolles mentioned this issue Aug 11, 2020

smt_via_label via machine config #47

Closed

schabrolles added a commit to schabrolles/ocp4-playbooks that referenced this issue Aug 13, 2020

smt_via_label via machine config

f1a52e2

Fixes: ocp-power-automation#46 Signed-off-by: Sebastien Chabrolles <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changing SMT level based on node label #46

Changing SMT level based on node label #46

schabrolles commented Aug 11, 2020 •

edited

Loading

schabrolles commented Aug 11, 2020

schabrolles commented Aug 12, 2020 •

edited

Loading

bpradipt commented Aug 13, 2020

mkumatag commented Aug 13, 2020

bpradipt commented Aug 13, 2020

mkumatag commented Aug 13, 2020

schabrolles commented Aug 13, 2020 •

edited

Loading

bpradipt commented Aug 17, 2020 •

edited

Loading

theresax commented Sep 14, 2020

dancasali commented Sep 15, 2020

theresax commented Sep 15, 2020

bpradipt commented Sep 25, 2020 •

edited

Loading

Changing SMT level based on node label #46

Changing SMT level based on node label #46

Comments

schabrolles commented Aug 11, 2020 • edited Loading

schabrolles commented Aug 11, 2020

schabrolles commented Aug 12, 2020 • edited Loading

bpradipt commented Aug 13, 2020

mkumatag commented Aug 13, 2020

bpradipt commented Aug 13, 2020

mkumatag commented Aug 13, 2020

schabrolles commented Aug 13, 2020 • edited Loading

bpradipt commented Aug 17, 2020 • edited Loading

theresax commented Sep 14, 2020

dancasali commented Sep 15, 2020

theresax commented Sep 15, 2020

bpradipt commented Sep 25, 2020 • edited Loading

schabrolles commented Aug 11, 2020 •

edited

Loading

schabrolles commented Aug 12, 2020 •

edited

Loading

schabrolles commented Aug 13, 2020 •

edited

Loading

bpradipt commented Aug 17, 2020 •

edited

Loading

bpradipt commented Sep 25, 2020 •

edited

Loading