JobOverseer
===========
Requires Python 2.6 and the Paramiko SSH library (http://pypi.python.org/pypi/paramiko).
Behold the wonder that is JobOverseer. It is all-powerful software for remotely executing forbidden commands on all-powerful sentient HPC clusters across the vast realm of internets. In non-D&D terms, it's just a script that uses your SSH credentials to log in to one or more computers and execute commands. It is primarily geared towards HPC users who work on multiple computing resources. To be honest, it's just a wrapper around an SSH library with some bells and whistles!
How it works right now: you save cluster data periodically, either by script or manually, and then graph or parse the output on your own time. A script is attached below as an example of what a data-saving script might look like for a group of two users interested in checking their jobs on numerous clusters.
Features
--------
-Parsing the qstat/showq output and saving it locally in a database (see the sketch after this list)
-CPU usage reports from qstat data
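
For illustration only (this is not JobOverseer's own storage code; the table layout and the save_snapshot name are made up for this sketch), here is one way per-user core counts pulled from a qstat/showq snapshot could be kept in a local SQLite database using only the sqlite3 and time modules that ship with Python 2.6:

import sqlite3
import time

def save_snapshot(db_path, cluster_name, usage):
    # usage is assumed to be a dict of {username: cores_in_use}
    # extracted from an already-parsed qstat/showq snapshot.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS queue_snapshots "
                 "(stamp REAL, cluster TEXT, user TEXT, cores INTEGER)")
    now = time.time()
    for user, cores in usage.items():
        conn.execute("INSERT INTO queue_snapshots VALUES (?, ?, ?, ?)",
                     (now, cluster_name, user, cores))
    conn.commit()
    conn.close()

# e.g. save_snapshot("queue_data.db", "SciNet", {"coolc": 64, "blobl": 128})
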
Planned Features
----------------
-Actual error handling (when a cluster goes down, the script currently fails; see the sketch after the example script below)
-Tracking of disk usage for multiple users
-Remote submission of jobs to the best available resource in a list of clusters
-Automatic creation of cluster-specific submit scripts
-Logging
-Persistent process mode
Example Script for Saving Cluster Data
--------------------------------------
#Run this script several times per day for a week
#Remember that you need to have authorized ssh keys uploaded
#to every cluster so the script can easily connect and run commands
#Look at average statistics with: python generate_queue_report.py queue_data 7
from clusters import *
UserList = []
UserList.append(User("Bob Loblaw", ["blobl","loblowb","boblowb12"]))
UserList.append(User("Chris Ing", ["coolc","cooling","ingcool"]))
ClusterList = []
ClusterStats = []
ClusterList.append(Cluster("SciNet", 8, "coolc", UserList, "login.scinet.utoronto.ca"))
ClusterList[0].setQueueCommand("showq --xml")
ClusterList.append(Cluster("Colosse", 8, "cooling", UserList, "colosse.clumeq.ca"))
ClusterList[1].setQueueCommand("qstat -xml -g d -u '*'")
ClusterList.append(Cluster("Orca", 24, "coolc", UserList, "orca.sharcnet.ca"))
ClusterList[2].setQueueCommand("/opt/sharcnet/torque/current/bin/qstat -x")
ClusterList.append(Cluster("Nestor", 8, "coolc", UserList, "nestor.westgrid.ca"))
ClusterList[3].setQueueCommand("/opt/bin/qstat -x -l nestor")
ClusterList.append(Cluster("Lattice", 8, "coolc", UserList, "lattice.westgrid.ca"))
ClusterList[4].setQueueCommand("/usr/local/torque/bin/qstat -xt")
ClusterList.append(Cluster("MP2", 24, "ingcool", UserList, "secret-mp2.ccs.usherbrooke.ca", True))
ClusterList[5].setQueueCommand("/opt/torque/bin/qstat -x")
ClusterList.append(Cluster("Guillimin", 12, "coolc", UserList, "guillimin.clumeq.ca"))
ClusterList[6].setQueueCommand("showq --xml")
for cluster in ClusterList:
print "Querying ", cluster.name
cluster.refreshQueueData()
print "Writing ", cluster.name
cluster.writeQueueData()
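
Since clusters do go down (see Planned Features), a more cautious variant of the loop above might wrap each query in a try/except so that one unreachable login node does not abort the whole run. This is only a sketch of what that planned error handling could look like; the current script does not do this:

for cluster in ClusterList:
    try:
        print "Querying ", cluster.name
        cluster.refreshQueueData()
        print "Writing ", cluster.name
        cluster.writeQueueData()
    except Exception, err:
        # Skip clusters that are down or unreachable instead of dying.
        print "Skipping ", cluster.name, ": ", err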