gilgamesh: help - my jobs aren't running... or are they?
John Kitchin
jkitchin at andrew.cmu.edu
Tue Oct 5 16:07:48 EDT 2010
Some of you may have noticed that your jobs are not running, even though
there are many open cores right now. Don't worry, the queue is working fine.
Here is what is happening.
The queue runs jobs in an order of priority that is determined to maximize
the fair sharing *and* utilization of the cluster. Job priority is roughly
determined by how many jobs you have been running recently, and the queue's
memory of that decays over time. You can see what jobs are queued with
either the qstat or showq command. The showq command is more useful here.
Here is some of the output:
158 Active Jobs    333 of 640 Processors Active (52.03%)
                   19 of 20 Nodes Active (95.00%)

IDLE JOBS----------------------
JOBNAME  USERNAME  STATE  PROC  WCLIMIT     QUEUETIME
16521    minyoung  Idle   160   7:00:00:00  Sat Oct 2 18:24:44
16601    minyoung  Idle   160   7:00:00:00  Sun Oct 3 16:40:29
16623    haibin    Idle   16    7:00:00:00  Mon Oct 4 12:08:40
16624    haibin    Idle   16    7:00:00:00  Mon Oct 4 12:23:06
16625    haibin    Idle   16    7:00:00:00  Mon Oct 4 12:28:43
You can see here that almost half the cluster is free, but there are jobs
not running. Job 16521 is not running because it needs whole nodes: the
output of qstat -f 16521 shows "Resource_List.nodes = 5:ppn=32", so this job
is waiting for 5 nodes to be available with 32 cores free on each node. Only
one node is completely idle right now (19 of 20 nodes are active), and 4
more need to clear out. The queue system is not scheduling new jobs on some
nodes so that the jobs already on those nodes will finish and this 160 core
job can run.
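
To make the node arithmetic concrete, here is a minimal Python sketch that
reads a request string like the "5:ppn=32" above and checks it against a
list of free cores per node. The function names and the snapshot list are
hypothetical, made up for illustration; the real scheduler considers much
more than this (priorities, reservations, walltime).

```python
def parse_nodes_request(spec):
    """Split a simple 'N:ppn=M' request into (nodes, cores_per_node).

    Only the simple form seen in the qstat -f output is handled.
    """
    nodes_part, _, ppn_part = spec.partition(":ppn=")
    nodes = int(nodes_part)
    ppn = int(ppn_part) if ppn_part else 1  # assumption: 1 core per node if ppn is omitted
    return nodes, ppn


def can_start_now(spec, free_cores_per_node):
    """Return True if enough nodes each have at least ppn free cores."""
    nodes, ppn = parse_nodes_request(spec)
    usable = sum(1 for free in free_cores_per_node if free >= ppn)
    return usable >= nodes


nodes, ppn = parse_nodes_request("5:ppn=32")
print(nodes * ppn)  # 160 cores requested in total

# Hypothetical snapshot of 20 nodes: one fully idle, the rest partly busy.
# The 307 free cores match "333 of 640 Processors Active"; the split is invented.
snapshot = [32] + [15] * 9 + [14] * 10
print(sum(snapshot))                        # 307
print(can_start_now("5:ppn=32", snapshot))  # False: only one node has 32 free cores
```

The point of the sketch is that free cores scattered across busy nodes do
not help a job that asks for whole nodes.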
Our queue system, however, has backfilling capability. So, if you have a job
that would finish before those nodes are needed, it can be scheduled on one
of them in the meantime. You can use the showbf command to find out how long
a job is allowed to run:
15:59:22 1100> showbf
backfill window (user: 'jkitchin' group: 'kitchingroup' partition: ALL) Tue Oct 5 16:00:13
308 procs available for 00:09:29
284 procs available for 00:17:32
260 procs available for 00:17:33
240 procs available for 00:17:35
222 procs available for 00:21:51
200 procs available for 00:21:53
190 procs available for 00:21:55
158 procs available for 00:21:56
144 procs available for 00:24:20
141 procs available for 00:26:41
136 procs available for 00:26:46
121 procs available for 00:26:48
107 procs available for 3:46:18
98 procs available for 5:30:15
72 procs available for 5:30:33
43 procs available for 5:36:09
13 procs available for 22:37:56
11 procs available for 22:50:26
The last line of the output says you can run a job for up to ~22 hours on up
to 11 cores, and it will likely run immediately.
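
If you want to check this automatically, the following is a rough Python
sketch that parses lines like the showbf output above and tests whether a
given core count and walltime fit into any window. The helper names
(parse_duration, parse_showbf, fits) are hypothetical, and the sample text
is copied from the output above rather than read from the live command.

```python
import re
from datetime import timedelta

SHOWBF_LINE = re.compile(r"(\d+) procs available for\s+([\d:]+)")


def parse_duration(text):
    """Convert '[DD:]HH:MM:SS' into a timedelta."""
    parts = [int(p) for p in text.split(":")]
    parts = [0] * (4 - len(parts)) + parts  # pad a missing days field
    days, hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def parse_showbf(output):
    """Return a list of (procs, timedelta) windows found in the text."""
    return [(int(m.group(1)), parse_duration(m.group(2)))
            for m in SHOWBF_LINE.finditer(output)]


def fits(windows, procs, walltime):
    """True if some window offers at least `procs` cores for `walltime`."""
    return any(p >= procs and t >= walltime for p, t in windows)


sample = """\
107 procs available for 3:46:18
98 procs available for 5:30:15
13 procs available for 22:37:56
11 procs available for 22:50:26
"""
windows = parse_showbf(sample)
print(fits(windows, 11, parse_duration("22:50:00")))    # True
print(fits(windows, 16, parse_duration("7:00:00:00")))  # False: no 7-day window
```

Note that a 16 core, 7 day job does not fit any of the windows, which is why
those jobs are still idle.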
If you have a job in the queue that might fit that window, you can alter its
options with:
qalter -l walltime=22:50:00,nodes=11 jobid
Then it will probably run quickly. If your job does not fit this way, you
just have to be patient. It will run eventually. You can get an estimate of
when your job might start with "showstart jobid":
16:00:39 1101> showstart 16521
job 16521 requires 160 procs for 7:00:00:00
Earliest start in 00:18:28 on Tue Oct 5 16:22:08
Earliest completion in 7:00:18:28 on Tue Oct 12 16:22:08
Best Partition: DEFAULT
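
As a sanity check on how to read these numbers, here is a small Python
sketch that pulls the durations out of the showstart text above and confirms
that the earliest completion is just the earliest start plus the requested
walltime. The function names are hypothetical, and the sample text is pasted
from the output above.

```python
import re
from datetime import timedelta


def parse_duration(text):
    """Convert '[DD:]HH:MM:SS' into a timedelta."""
    parts = [int(p) for p in text.split(":")]
    parts = [0] * (4 - len(parts)) + parts  # pad a missing days field
    days, hours, minutes, seconds = parts
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def parse_showstart(output):
    """Extract procs, walltime, and start/completion delays from the text."""
    req = re.search(r"requires (\d+) procs for ([\d:]+)", output)
    start = re.search(r"Earliest start in\s+([\d:]+)", output)
    done = re.search(r"Earliest completion in\s+([\d:]+)", output)
    return {
        "procs": int(req.group(1)),
        "walltime": parse_duration(req.group(2)),
        "start": parse_duration(start.group(1)),
        "completion": parse_duration(done.group(1)),
    }


sample = """\
job 16521 requires 160 procs for 7:00:00:00
Earliest start in 00:18:28 on Tue Oct 5 16:22:08
Earliest completion in 7:00:18:28 on Tue Oct 12 16:22:08
"""
info = parse_showstart(sample)
print(info["start"])                                         # 0:18:28
print(info["start"] + info["walltime"] == info["completion"])  # True
```

So the completion estimate assumes the job uses its full walltime limit; a
shorter, more realistic walltime request gives a better estimate.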
These are just some suggestions for how you can figure out why your job
appears to be sitting in the queue, and how you can increase your
computational throughput.
John
-----------------------------------
John Kitchin
Assistant Professor
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
http://kitchingroup.cheme.cmu.edu