Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing SMT level based on node label #46

Open
schabrolles opened this issue Aug 11, 2020 · 12 comments
Open

Changing SMT level based on node label #46

schabrolles opened this issue Aug 11, 2020 · 12 comments

Comments

@schabrolles
Copy link
Contributor

schabrolles commented Aug 11, 2020

Hi all,

I just read the following optimization document for CP4D. https://www.ibm.com/support/producthub/icpdata/docs/content/SSQNUZ_current/cpd/install/node-settings.html
There is a session about SMT configuration implementation with openshift.

It deploy a shell script and service on the nodes in order to control the SMT level of each node via Labels.

oc label nodes worker-1 SMT=4

=> This switch the node to SMT4...

I find this pretty handy and think it could be deployed via this ansible playbook during the customization section.

the shell script:

#!/bin/bash
export PATH=/root/.local/bin:/root/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
export KUBECONFIG=/var/lib/kubelet/kubeconfig
COREPS=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^Core\(s\) per socket$/ {print $2}'|/bin/xargs)
SOCKETS=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^Socket\(s\)$/ {print $2}'|/bin/xargs)
let TOTALCORES=$COREPS*$SOCKETS
MAXTHREADS=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^CPU\(s\)$/ {print $2}'|/bin/xargs)
let MAXSMT=$MAXTHREADS/$TOTALCORES
CURRENTSMT=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^Thread\(s\) per core$/ {print $2}'|/bin/xargs)

while :
do
  ISNODEDEGRADED=$(/bin/oc get node $HOSTNAME -o yaml |/bin/grep machineconfiguration.openshift.io/reason |/bin/grep "unexpected on-disk state validating")
  SMTLABEL=$(/bin/oc get node $HOSTNAME -L SMT --no-headers |/bin/awk '{print $6}')
  if [[ -n $SMTLABEL ]]
    then
      case $SMTLABEL in
        1) TARGETSMT=1
      ;;
        2) TARGETSMT=2
      ;;
        4) TARGETSMT=4
      ;;
        8) TARGETSMT=8
      ;;
        *) TARGETSMT=$CURRENTSMT ; echo "SMT value must be 1, 2, 4, or 8 and smaller than Maximum SMT."
      ;;
      esac
    else
      TARGETSMT=$MAXSMT
  fi

  if [[ -n $ISNODEDEGRADED ]]
    then
      touch /run/machine-config-daemon-force
  fi

  CURRENTSMT=$(/bin/lscpu | /bin/awk -F: ' $1 ~ /^Thread\(s\) per core$/ {print $2}'|/bin/xargs)

  if [[ $CURRENTSMT -ne $TARGETSMT ]]
    then
      INITONTHREAD=0
      INITOFFTHREAD=$TARGETSMT
      if [[ $MAXSMT -ge $TARGETSMT ]]
        then
          while [[ $INITONTHREAD -lt $MAXTHREADS ]]
          do
            ONTHREAD=$INITONTHREAD
            OFFTHREAD=$INITOFFTHREAD

            while [[ $ONTHREAD -lt $OFFTHREAD ]]
            do
              /bin/echo 1 > /sys/devices/system/cpu/cpu$ONTHREAD/online
              let ONTHREAD=$ONTHREAD+1
            done
            let INITONTHREAD=$INITONTHREAD+$MAXSMT
            while [[ $OFFTHREAD -lt $INITONTHREAD ]]
            do
              /bin/echo 0 > /sys/devices/system/cpu/cpu$OFFTHREAD/online
              let OFFTHREAD=$OFFTHREAD+1
            done
            let INITOFFTHREAD=$INITOFFTHREAD+$MAXSMT
          done
        else
          echo "Target SMT must be smaller or equal than Maximum SMT supported"
      fi
  fi
  /bin/sleep 30
done
schabrolles added a commit to schabrolles/ocp4-playbooks that referenced this issue Aug 11, 2020
@schabrolles
Copy link
Contributor Author

First draft to discuss about it.

@schabrolles
Copy link
Contributor Author

schabrolles commented Aug 12, 2020

Another Idea would be to add this functionality into the rsct container (used for DLPAR) as it as the real ppc64_cpu command to deal with it.
This container could be the real power toolbox for all this kind of dynamic functionalities

But for that, we need to add oc command and give rights to get read node labels... I don't know if it is a good idea ...

@bpradipt
Copy link
Contributor

@mkumatag do you foresee any issues with using oc inside the worker nodes ?
The approach here is to use oc inside the node to get it's associated SMT label and then use it for SMT setting.

@mkumatag
Copy link
Member

@mkumatag do you foresee any issues with using oc inside the worker nodes ?
The approach here is to use oc inside the node to get it's associated SMT label and then use it for SMT setting.

I'm just wondering if we can make use of the Machine Config Operator concept of OCP to make changes to the OS.

@bpradipt
Copy link
Contributor

@mkumatag eventually this uses MCO itself. However what the machine config uses is the script above for every node where it uses oc to get the label and set the appropriate value

@mkumatag
Copy link
Member

@mkumatag eventually this uses MCO itself. However what the machine config uses is the script above for every node where it uses oc to get the label and set the appropriate value

In MCO flow, we can directly feed the configs to the machines in the machinepool, and the same information is passed directly to the respective host hence no need to worry about the node list and the labels associated.

Meanwhile, we can try oc way of doing things to read the node labels with having very strict roles and role bindings and service account tied to the pod so that it can talk to required APIs.

I personally prefer MCO way of doing it where we can avoid some of these extra steps required.

@schabrolles
Copy link
Contributor Author

schabrolles commented Aug 13, 2020

@mkumatag, MCO, this is what is implemented in #47 could you please have a look ?
The disciussion to use a pod was because we already have one for RSCT which has ppc64_cpu installed. It is cleaner compare what is implemented in the script proposed by #47

The best would be to have all this ppc64_utils packages inside coreOS.

schabrolles added a commit to schabrolles/ocp4-playbooks that referenced this issue Aug 13, 2020
@bpradipt
Copy link
Contributor

bpradipt commented Aug 17, 2020

So today we have two machineconfigpools - master and worker. In order to correctly set SMT we can use a machineconfig and let MCO take care of the node reboot (using rolling update). This is safest and controlled. The drawback is ofcourse we can't have different SMT settings for a group of nodes.

The other approach (described in this issue) of using the script which will keep checking for node label and set the SMT is non-deterministic and can have unintended consequence for the overall cluster stability. I'm also not sure if setting of node labels is a privileged operation or not and if not then this also has the risk of being misused or incorrectly used with unintended consequence. Given these I'm personally not in favour of the approach.

For now, my suggestion will be to go with the approach of machineconfig with the limitation of same SMT setting for all the worker nodes. In parallel we continue to explore better alternatives to handle mix of SMT settings and more flexibility.

@schabrolles @mkumatag

@theresax
Copy link

@dancasali
for the completeness sake, it make be useful to put what you have proposed the smt_crio_slub.ymal here, and describe how you expect it to be used.

@bpradipt, I saw your recommendation regarding that we should set these values using different ymal files. I agree that we should test out these changes separately, but for this to work for customers may be it is easier for them to be in the same file?

@dancasali
Copy link

@theresax For testing internally it is easier to have on the same file but even on CP4D Knowledge Center they are shown as different implementation places because crio.conf might differ on futures releases and we might need that to be controlled differently.
I am ok having slub and SMT together because both seem needed always together.
Here attached I have the yaml. We can have MCO to take care of the reboot while still using labels so it would provide flexibility. I don't think SMT4 is needed for all applications. Only the ones that forcefully limit CPU usage by throttling the Thread (like CP4D does).
Step 1 - After the cluster is installed apply labels (eg SMT=4 or SMT=2). If no label applied SMT will be 8 (unchanged) on the node
Step 2 - Apply yaml. It will automatically apply MCO and reboot servers.
@schabrolles I might need your help here to improve with the changes you mentioned (I could not find the changes you mentioned in this thread).
@bpradipt I am not asking to merge this because I am far from being a developer... let me know how can I better help get this out to a broader team
Attaching it in gz and zip for consume on your preferred method
smt_crio_slub.zip
smt_crio_slub.yaml.gz

@theresax
Copy link

Adding @CharlieW to the ticket as he will also test this out under his OCP cluster

@bpradipt
Copy link
Contributor

bpradipt commented Sep 25, 2020

Thanks @dancasali @CharlieW .. .. this is very helpful. We'll start integrating this in the code..

/cc @yussufsh @Prajyot-Parab @schabrolles

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants