gilgamesh: help - my jobs aren't running... or are they?

Tue Oct 5 16:07:48 EDT 2010

Some of you may have noticed that your jobs are not running, even though
there are many open cores right now. Don't worry, the queue is working fine.
Here is what is happening.

The queue runs jobs in a priority order chosen to maximize both the fair
sharing *and* the utilization of the cluster. Job priority is roughly
determined by how many jobs you have been running recently, and the queue's
memory of that decays over time. You can see what jobs are queued with
either the qstat or showq command. The showq command is more useful here.
Here is some of the output:

  158 Active Jobs     333 of  640 Processors Active (52.03%)
                        19 of   20 Nodes Active      (95.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

16521              minyoung       Idle   160  7:00:00:00  Sat Oct  2 18:24:44
16601              minyoung       Idle   160  7:00:00:00  Sun Oct  3 16:40:29
16623                haibin       Idle    16  7:00:00:00  Mon Oct  4 12:08:40
16624                haibin       Idle    16  7:00:00:00  Mon Oct  4 12:23:06
16625                haibin       Idle    16  7:00:00:00  Mon Oct  4 12:28:43

You can see here that almost half the cluster is free, but there are jobs
not running. Job 16521 is not running because it is waiting for 5 nodes
with 32 cores available on each: the output of qstat -f 16521 shows
"Resource_List.nodes = 5:ppn=32". Only one node is idle right now (19 of
20 nodes are active), and 4 more need to clear out. The queue system is
holding some nodes back from new jobs so that the jobs already on those
nodes will finish and this 160 core job can run.
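
The core count behind that request can be checked with a short Python
sketch. This is only an illustration of the arithmetic, not part of the
queue system; the function name total_cores is made up for this example, and
it assumes the simple "N:ppn=M" form shown by qstat -f above.

```python
def total_cores(nodes_spec):
    """Split a spec like '5:ppn=32' into (nodes, cores_per_node, total_cores).

    Assumes the simple 'N:ppn=M' form; ppn defaults to 1 if it is absent.
    """
    count, _, rest = nodes_spec.partition(":")
    ppn = 1
    for part in rest.split(":"):
        if part.startswith("ppn="):
            ppn = int(part[len("ppn="):])
    nodes = int(count)
    return nodes, ppn, nodes * ppn


print(total_cores("5:ppn=32"))  # (5, 32, 160)
```

So the job needs 5 whole 32-core nodes, which is why the free cores
scattered across busy nodes do not help it.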

Our queue system, however, has backfilling capability. So, if you have a job
that would finish in time, it can be scheduled on one of those nodes. You
can use the showbf command to find out how long a job is allowed to run:

15:59:22 1100> showbf
backfill window (user: 'jkitchin' group: 'kitchingroup' partition: ALL) Tue Oct  5 16:00:13

308 procs available for      00:09:29
284 procs available for      00:17:32
260 procs available for      00:17:33
240 procs available for      00:17:35
222 procs available for      00:21:51
200 procs available for      00:21:53
190 procs available for      00:21:55
158 procs available for      00:21:56
144 procs available for      00:24:20
141 procs available for      00:26:41
136 procs available for      00:26:46
121 procs available for      00:26:48
107 procs available for       3:46:18
 98 procs available for       5:30:15
 72 procs available for       5:30:33
 43 procs available for       5:36:09
 13 procs available for      22:37:56
 11 procs available for      22:50:26

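The output here says you can run a job for just under 23 hours on up to 11
cores and it will likely get run immediately. Jobs that need more cores have
to fit in the shorter windows higher up the list, for example up to 43 cores
for about 5.5 hours.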
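
If you want to pick the window automatically, here is a minimal Python
sketch that reads showbf-style lines and finds the longest window that still
has enough cores. It is an illustration only: the function names are made up
for this example, the sample text is an excerpt of the output above, and it
assumes every line looks like "N procs available for [D:]H:MM:SS".

```python
import re

# Excerpt of the showbf output above, pasted in as sample input.
SHOWBF_SAMPLE = """\
308 procs available for      00:09:29
107 procs available for       3:46:18
 43 procs available for       5:36:09
 13 procs available for      22:37:56
 11 procs available for      22:50:26
"""


def to_seconds(duration):
    """Convert '[D:]H:MM:SS' (e.g. '5:36:09' or '7:00:00:00') to seconds."""
    parts = [int(p) for p in duration.split(":")]
    while len(parts) < 4:          # pad missing leading fields with zeros
        parts.insert(0, 0)
    days, hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_showbf(text):
    """Return a list of (procs, duration_string) pairs from showbf-style text."""
    windows = []
    for line in text.splitlines():
        m = re.match(r"\s*(\d+) procs available for\s+([\d:]+)", line)
        if m:
            windows.append((int(m.group(1)), m.group(2)))
    return windows


def longest_window(windows, procs_needed):
    """Longest window offering at least procs_needed cores, or None."""
    fitting = [w for w in windows if w[0] >= procs_needed]
    if not fitting:
        return None
    return max(fitting, key=lambda w: to_seconds(w[1]))


windows = parse_showbf(SHOWBF_SAMPLE)
print(longest_window(windows, 11))   # (11, '22:50:26')
print(longest_window(windows, 16))   # (43, '5:36:09')
```

A 16 core job, for example, would need a walltime under 5:36:09 to be
backfilled right now.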

If you have a job in the queue that might fit that window, you can alter its
options with:

qalter -l walltime=22:50:00,nodes=11 jobid

Then it will probably run quickly. If your job does not fit this way, you
just have to be patient. It will run eventually. You can get an estimate of
when your job might start with "showstart jobid":

16:00:39 1101> showstart 16521
job 16521 requires 160 procs for 7:00:00:00
Earliest start in         00:18:28 on Tue Oct  5 16:22:08
Earliest completion in  7:00:18:28 on Tue Oct 12 16:22:08
Best Partition: DEFAULT

These are just some suggestions for how you can figure out why your job
appears to be sitting in the queue, and how you can increase your
computational throughput.

John

-----------------------------------
John Kitchin
Assistant Professor
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
http://kitchingroup.cheme.cmu.edu