LARGE single-system Cyrus installs?

Pascal Gienger Pascal.Gienger at uni-konstanz.de
Fri Nov 9 02:24:34 EST 2007


Vincent Fox <vbfox at ucdavis.edu> wrote:

> Our working hypothesis is that CYRUS is what is choking up at a certain
> activity level due to bottlenecks with simultaneous access to some shared
> resource for each instance.

Did you do a

lockstat -Pk sleep 30

(with "-x destructive" when it complains about the system being 
unresponsive)?
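
That is, if it refuses to run otherwise:

lockstat -x destructive -Pk sleep 30

(-P sorts by the count*time product, -k coalesces kernel PCs within 
functions, and the "sleep 30" merely provides the 30-second measurement 
window.)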


We got this result, among others:

Adaptive mutex block: 2339 events in 30.052 seconds (78 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
  778  79%  79% 0.00 456354473 0xffffffffa4867730     zfs_zget
   61   6%  85% 0.00 466021696 0xffffffffa4867130     zfs_zget
    8   1%  87% 0.00 748812180 0xffffffffa4867780     zfs_zget
   26   1%  88% 0.00 200187703 0xffffffff9cf97598     dmu_object_alloc
    2   1%  89% 0.00 1453472066 0xffffffffa4867de0     zfs_zget
   12   1%  89% 0.00 204437906 0xffffffffa4863ad8     dmu_object_alloc
    4   1%  90% 0.00 575866919 0xffffffffa4867838     zfs_zinactive
    5   1%  90% 0.00 458982547 0xffffffffa48677b8     zfs_zget
    4   1%  91% 0.00 563367350 0xffffffffa4867868     zfs_zinactive
    3   0%  91% 0.00 629688255 0xffffffffa48677b0     zfs_zinactive

Nearly all blocking is caused by ZFS. The SAN disk system is NOT the 
bottleneck, though: average service times are 5-8 ms and there is no wait 
queue.

456354473 nsec is 0.456 sec, which is *LONG*.
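
To put that in perspective (back-of-the-envelope, assuming the mean holds 
for all 778 events on that lock): 778 blocks * 0.456 s each is roughly 
355 s of cumulative sleep time inside a 30-second window, i.e. on average 
about a dozen threads parked on that single zfs_zget mutex at any moment.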


What's also interesting is tracing open() calls via DTrace.
Just use this:

#!/usr/sbin/dtrace -s
#pragma D option destructive
#pragma D option quiet

syscall::open:entry
{
        /* remember start time and filename pointer per thread */
        self->ts = timestamp;
        self->filename = arg0;
}

syscall::open:return
/self->ts > 0/
{
        /* this-> makes zeit clause-local, so concurrent opens in
           different threads cannot clobber each other's value */
        this->zeit = timestamp - self->ts;
        printf("%10d %s\n", this->zeit, copyinstr(self->filename));
        @["open duration"] = quantize(this->zeit);
        self->ts = 0;
        self->filename = 0;
}
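
Save it to a file, make it executable, and run it as root, e.g. (the name 
opentimes.d is arbitrary):

chmod +x opentimes.d
./opentimes.d          (stop with CTRL-C)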

It will show you every file being opened and the time (in nanoseconds) 
needed to do so. After you hit CTRL-C, it prints a summary:

  open duration
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@                                        80
            4096 |@@@@@@@@@@@@@@@@@@@@@                    1837
            8192 |@@@@@@                                   521
           16384 |@@@@@@@                                  602
           32768 |@@@                                      229
           65536 |@                                        92
          131072 |                                         2
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         1
         2097152 |                                         1
         4194304 |                                         3
         8388608 |                                         12
        16777216 |@                                        51
        33554432 |                                         38
        67108864 |                                         25
       134217728 |                                         9
       268435456 |                                         2
       536870912 |                                         3
      1073741824 |                                         0

You can see the ARC (in-memory) activity at 4-65 microseconds and disk 
activity at 8-33 ms. And you can see some "big hits" of 0.13-0.5 sec (!). 
That is far too long, and I have not figured out why it happens. The more 
users connect, the more frequent these really long opens become.
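
A minimal variation for chasing those outliers (a sketch; the 100 ms 
threshold and the extra process columns are my own choices) prints only 
the slow opens together with the process doing them:

#!/usr/sbin/dtrace -s
#pragma D option quiet

syscall::open:entry
{
        self->ts = timestamp;
        self->filename = arg0;
}

/* report only opens slower than 100 ms, with the offending process */
syscall::open:return
/self->ts && timestamp - self->ts > 100000000/
{
        printf("%10d ns %-16s pid %5d %s\n", timestamp - self->ts,
            execname, pid, copyinstr(self->filename));
}

/* always clean up the thread-local state */
syscall::open:return
/self->ts/
{
        self->ts = 0;
        self->filename = 0;
}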

We have a Postfix spool running on the same machine, and we got some 
relief by deactivating its directory hashing scheme. ZFS seems to get 
very "angry" about deep directory structures. But these "long opens" 
still occur.
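
In case anyone wants to try the same, this is roughly what deactivating 
the queue directory hashing looks like in main.cf (a sketch; check your 
current hash_queue_names first, as by default Postfix only hashes the 
defer and deferred queues):

hash_queue_names =
hash_queue_depth = 1

followed by a "postfix reload".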


Pascal

