WMAgent deployment
Pre-requisites
- A condor_schedd daemon must be deployed and running on your node.
- The schedd needs to be added to the glideinWMS pool (if it is not already).
- Create an environment setup file under /data/admin/wmagent/env.sh (check other agents to see its content). This file needs to be sourced every time you want to operate the WMAgent.
- Create a secrets file with service information/URLs and database credentials under /data/admin/wmagent/WMAgent.secrets (check other agents to see its content, or the illustrative sketch after this list). This file is used during the WMAgent deployment to override some of the default configuration.
- NOTE: you need to be very, very careful with this file, especially if you are copying it from another agent. Make sure to:
  - overwrite the Oracle settings or replace them with MySQL credentials; otherwise, you may delete the production Oracle database!!!
  - update COUCH_HOST with the proper node IP;
  - update the service URLs in case you are using cmsweb-testbed or your own private virtual machine.
- Copy the service certificate files (service{cert,key}.pem from vocms0230) to the /data/certs/ directory. Note that their permissions must be at least 600.
- Copy the short-term proxy (myproxy.pem from vocms0230) to the /data/certs directory.
- Finally, this script will be used for the deployment: https://github.com/dmwm/WMCore/blob/master/deploy/deploy-wmagent.sh
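For orientation, here is a hypothetical sketch of the two admin files mentioned above. Every variable name and value below is an assumption for illustration only; always copy the real content from an existing agent:
# /data/admin/wmagent/env.sh (illustrative excerpt)
export WMAGENT_SECRETS_LOCATION=/data/admin/wmagent/WMAgent.secrets
export X509_USER_CERT=/data/certs/servicecert.pem
export X509_USER_KEY=/data/certs/servicekey.pem
export manage=/data/srv/wmagent/current/config/wmagentpy3/manage  # enables the $manage commands used below
# /data/admin/wmagent/WMAgent.secrets (illustrative excerpt)
MYSQL_USER=some_user       # MySQL credentials here, never the production Oracle ones
MYSQL_PASS=some_password
COUCH_HOST=127.0.0.1       # replace with this node's IP
COUCH_PORT=5984
REQMGR2_URL=https://cmsweb-testbed.cern.ch/reqmgr2  # point to testbed or your private VM as needed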
At this point, you should have gone through the pre-requisites, especially the changes required in WMAgent.secrets (if not, go back there!). From lxplus or aiadm, access the node with your own account and then switch to the cmst1 account:
ssh vocmsXXX
sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
Download the deployment script (the master branch should work, but you can replace master with the wmcore_tag you want):
cd /data/srv
wget -nv https://raw.githubusercontent.com/dmwm/WMCore/master/deploy/deploy-wmagent.sh
First, read the help/usage of the script:
sh deploy-wmagent.sh
There are several arguments you need to provide on the command line; again, read the script help printed by the command above. Here is an example of a WMAgent deployment:
sh deploy-wmagent.sh -w 2.2.3.1 -t testbed-dev -p "111 222"
The command above deploys WMAgent version 2.2.3.1, sets the agent team name to testbed-dev, and applies patches 111 and 222 from official pull requests to the WMCore repository.
Once you finish the deployment of the agent, it is worth checking whether config.py contains the correct configuration (according to the command line arguments and the secrets file). Run:
source /data/admin/wmagent/env.sh  # or use the alias agentenv
less config/wmagentpy3/config.py
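For a quick sanity check, one can also grep the settings that matter most here; the attribute names teamName and agentNumber are the same ones used further down in this guide:
grep -E 'teamName|agentNumber' config/wmagentpy3/config.py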
If everything is OK, you just need to start the components, since the services (CouchDB and MySQL) are already started during the deployment procedure. To start all the components (the agent itself), run:
$manage start-agent
If you made some changes to the code and want to restart the agent (all components), type:
$manage stop-agent
$manage start-agent
If you want to restart only specific components, type:
$manage execute-agent wmcoreD --restart --components=DBS3Upload
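In case several components need a restart at once, this sketch assumes --components accepts a comma-separated list (the extra component name, JobSubmitter, is purely illustrative):
$manage execute-agent wmcoreD --restart --components=DBS3Upload,JobSubmitter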
New machine
Ask the VOC for a new machine configured by Puppet. The machine needs to be registered as a proper schedd in the CERN HTCondor global pool. Then follow the deployment procedure explained above.
Agent redeployment
Access the node and load the agent environment:
ssh vocmsXXX
sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
agentenv
!!! DO NOT PROCEED !!! with any further actions if the agent is not completely drained. Check the condor queue:
condor_q
You should see an empty queue:
-- Schedd: vocms0283.cern.ch : <137.138.153.30:4080?... @ 11/15/19 16:21:15
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Also run the runningagent alias to confirm that no agent processes are left running; on an idle node, only the grep command itself shows up:
runningagent
cmst1 1376194 0.0 0.0 112712 948 pts/1 S+ 16:22 0:00 grep -E couch|wmcore|mysql|beam
Check the status of the agent:
$manage status
And in case there is something still running:
$manage stop-agent
$manage stop-services
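Before wiping anything from the databases, it is worth re-running the checks above to confirm that everything is really down:
condor_q
runningagent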
Unregister the agent from WMStats, i.e. delete its document from the WMStats database:
$manage execute-agent wmagent-unregister-wmstats `hostname -f`
Then wipe the agent's Oracle database:
$manage execute-agent clean-oracle
Executing clean-oracle ...
Are you sure you want to wipe out CMS_WMBS_PROD13 oracle database (yes/no): yes
Alright, dropping and purging everything
SQL*Plus: Release 11.2.0.4.0 Production on Fri Nov 15 16:26:10 2019
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
SQL>
SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
Done!
Back up the current configuration file and remove the old agent release (adjust the version directory to whatever is actually installed):
cp -av /data/srv/wmagent/current/config/wmagent/config.py /data/srv/config.py.$(date -I)
rm -fr /data/srv/wmagent/v1.2.4.patch2/
Log out from the cmst1 account and reboot the node:
exit
sudo reboot
Once the machine is up again, log in and run Puppet manually. Even though the machines run Puppet on startup, sometimes more than a single run is needed to apply a new change:
[lxplus** ]$ ssh vocms**.cern.ch
sudo -s
sudo /opt/puppetlabs/bin/puppet agent -tv
Then switch back to the cmst1 account and remove the old deployment scripts:
sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc
cd /data/srv
rm -rf deploy*
vi /data/admin/wmagent/WMAgent.secrets
Watch for the 'ORACLE_TNS' and 'RUCIO_ACCOUNT' settings.
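A quick grep shows at a glance whether both settings are in place:
grep -E 'ORACLE_TNS|RUCIO_ACCOUNT' /data/admin/wmagent/WMAgent.secrets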
Download the deployment script again:
wget -nv https://raw.githubusercontent.com/dmwm/WMCore/master/deploy/deploy-wmagent.sh
Before executing the command, check for the correct values of:
- agent tag: e.g. "1.2.8"; check which WMAgent tag is expected for this deployment
- team name: e.g. "production"
- agent number: e.g. "13"
Take the team name and agent number from the previous WMAgent config file (backed up above) and run:
agentTag='1.3.0'
teamName=$(grep -i teamName config.py.$(date -I)|awk '{print $3}') && teamName=${teamName#\'} && teamName=${teamName%\'}
agentNumber=$(grep -i agentNumber config.py.$(date -I)|awk '{print $3}')
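# Sanity-check the parsed values before deploying; all three should be non-empty
echo "agentTag=$agentTag teamName=$teamName agentNumber=$agentNumber"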
sh deploy-wmagent.sh -w $agentTag -t $teamName -n $agentNumber |tee -a /data/srv/deployment.log.$(date -I)
Or in case we need a patched deployment:
agentTag='1.3.0'
teamName=$(grep -i teamName config.py.$(date -I)|awk '{print $3}') && teamName=${teamName#\'} && teamName=${teamName%\'}
agentNumber=$(grep -i agentNumber config.py.$(date -I)|awk '{print $3}')
patchNum="9439"
sh deploy-wmagent.sh -w $agentTag -t $teamName -n $agentNumber -p "$patchNum" |tee -a /data/srv/deployment.log.$(date -I)
Watch out for errors. Go through every step of the installation and confirm that it finished without errors, especially the parts related to CouchDB.
Check that the newly generated config file differs from the previous one only by the agent version (or reflects changes that you made intentionally):
agentenv
diff -u config/wmagent/config.py /data/srv/config.py.$(date -I) |less
Check the status of the agent in its local CouchDB (adjust the machine name accordingly). Then start the agent and remove the backup and log files created today:
agentenv
$manage start-agent
rm /data/srv/*$(date -I)
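Once the components are starting, confirm that everything came up cleanly, using the same status command as earlier:
$manage status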
For reference, the current WMAgent nodes, their owners and condor pools:
Node | Site | Responsible | Condor pool |
---|---|---|---|
vocms0192 | CERN | DMWM stable | Global |
vocms0193 | CERN | DMWM stable | Global |
vocms0260 | CERN | Todor | ITB |
vocms0261 | CERN | Alan | Global |
vocms0262 | CERN | Alan | ITB |
vocms0263 | CERN | Kenyi | ITB |
vocms0264 | CERN | Todor | ITB |
vocms0265 | CERN | Kenyi | Global |
vocms0267 | CERN | CMS@Home | CMS@Home |
vocms0290 | CERN | Todor | Global |
vocms0291 | CERN | Valentin | Global |
cmsgwms-submit1 | FNAL | DMWM stable | ITB |
cmssrv217 | FNAL | ??? | HEPCloud |
cmssrv620 | FNAL | ??? | HEPCloud |