gilgamesh: updates on the cluster

John Kitchin jkitchin at andrew.cmu.edu
Thu Sep 9 09:06:38 EDT 2010


Some of you have noticed intermittent problems with parallel jobs on
gilgamesh: errors about "hardware contexts" from the psm interface
that cause jobs to fail or not run. I have modified the default
behavior of the cluster to avoid this problem (hopefully). By default,
jobs will no longer use psm. For most of you this should have no
effect on your jobs other than to make them run without the errors
seen before.

psm: as fast an MPI implementation as you can get on our cluster, BUT
you must configure all the details correctly, and it is not flexible
across widely different types of MPI jobs. For example, it will work
if you always use 1node:32ppn, but it is likely to have problems if
multiple users try to use some of the cores on a node without the
proper settings. PSM latencies are on the order of 2-3 microseconds.

verbs: slightly higher latency than psm, about 6 microseconds.
However, that is still far lower than GigE, where the latencies are
on the order of 50-100 microseconds. This will be the default
behavior on the cluster.

If you know what you are doing, and really want the psm interface you
can run your MPI jobs like this:

    mpirun -np `cat $PBS_NODEFILE | wc -l` -mca mtl psm ./a.out
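
If you want to check the command before committing a job to psm, here
is a minimal sketch that builds the same mpirun line and prints it
instead of running it. The function names count_ranks and psm_command
are hypothetical helpers made up for this illustration, and the sketch
assumes the node file has one line per allocated core, as the command
above implies.

```shell
#!/bin/bash
# Sketch only: count_ranks and psm_command are hypothetical helper names,
# not part of the cluster setup.

# Count MPI ranks, assuming one line per allocated core in the node file.
count_ranks() {
    wc -l < "$1" | tr -d ' '
}

# Print (do not run) the mpirun command that selects the psm interface.
psm_command() {
    printf 'mpirun -np %s -mca mtl psm %s\n' "$(count_ranks "$1")" "$2"
}

# Inside a PBS job, PBS_NODEFILE is set, so show the command for ./a.out.
if [ -n "${PBS_NODEFILE:-}" ]; then
    psm_command "$PBS_NODEFILE" ./a.out
fi
```

Once the printed command looks right, put that line directly in your
job script.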

You will be notified if your jobs cause problems.

Let me know if you continue having problems with parallel jobs. Thanks,

John

-----------------------------------
John Kitchin
Assistant Professor
Doherty Hall A207F
Department of Chemical Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
412-268-7803
http://kitchingroup.cheme.cmu.edu

