LARGE single-system Cyrus installs?
Pascal Gienger
Pascal.Gienger at uni-konstanz.de
Fri Nov 9 02:24:34 EST 2007
Vincent Fox <vbfox at ucdavis.edu> wrote:
> Our working hypothesis is that CYRUS is what is choking up at a certain
> activity level due to bottlenecks with simultaneous access to some shared
> resource for each instance.
Did you run

lockstat -Pk sleep 30

(with "-x destructive" when it complains that the system might become
unresponsive)?
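For reference, a commented version of that invocation (flag meanings per
lockstat(1M)):

# -P  sort events by the product of count and average time, so a few
#     very long blocks rank above many short ones
# -k  coalesce kernel PCs within functions, so callers appear as names
# -x  pass a DTrace option through; "destructive" is what lockstat asks
#     for when it warns that tracing might leave the system unresponsive
lockstat -P -k -x destructive sleep 30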
Among other results, we got this:
Adaptive mutex block: 2339 events in 30.052 seconds (78 events/sec)
Count indv cuml rcnt       nsec Lock               Caller
-------------------------------------------------------------------------------
  778  79%  79% 0.00  456354473 0xffffffffa4867730 zfs_zget
   61   6%  85% 0.00  466021696 0xffffffffa4867130 zfs_zget
    8   1%  87% 0.00  748812180 0xffffffffa4867780 zfs_zget
   26   1%  88% 0.00  200187703 0xffffffff9cf97598 dmu_object_alloc
    2   1%  89% 0.00 1453472066 0xffffffffa4867de0 zfs_zget
   12   1%  89% 0.00  204437906 0xffffffffa4863ad8 dmu_object_alloc
    4   1%  90% 0.00  575866919 0xffffffffa4867838 zfs_zinactive
    5   1%  90% 0.00  458982547 0xffffffffa48677b8 zfs_zget
    4   1%  91% 0.00  563367350 0xffffffffa4867868 zfs_zinactive
    3   0%  91% 0.00  629688255 0xffffffffa48677b0 zfs_zinactive
Nearly all blocking is caused by ZFS locks. The SAN disk system is NOT the
bottleneck, though: average service times are 5-8 ms, with no wait queue.
456354473 ns is 0.456 s, and that is *LONG*.
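To drill further into those zfs_zget blocks, a minimal D sketch along these
lines times each call (assuming the fbt provider exposes zfs_zget on your
kernel; the probe and variable names here are mine, not from the trace above):

#!/usr/sbin/dtrace -s
/* Time every zfs_zget() call and aggregate the durations. */
fbt::zfs_zget:entry
{
    self->zts = timestamp;
}

fbt::zfs_zget:return
/self->zts/
{
    @["zfs_zget duration (ns)"] = quantize(timestamp - self->zts);
    self->zts = 0;
}

If that histogram shows the same hundreds-of-milliseconds outliers as the
block times above, the stall is in zfs_zget itself rather than further down
the stack.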
What's also interesting is tracing open() calls via dtrace.
Just use this:
#!/usr/sbin/dtrace -s
#pragma D option quiet

syscall::open:entry
{
    self->ts = timestamp;     /* per-thread start time */
    self->filename = arg0;    /* user address of the path argument */
}

syscall::open:return
/self->ts > 0/
{
    /* clause-local, not a global: safe when probes fire on several CPUs */
    this->zeit = timestamp - self->ts;
    printf("%10d %s\n", this->zeit, copyinstr(self->filename));
    @["open duration"] = quantize(this->zeit);
    self->ts = 0;
    self->filename = 0;       /* release the thread-local storage */
}
It will show you every file opened and the time (in nanoseconds) needed to
accomplish that. After you hit Ctrl-C, it summarizes:
  open duration
           value  ------------- Distribution ------------- count
            1024 |                                         0
            2048 |@                                        80
            4096 |@@@@@@@@@@@@@@@@@@@@@                    1837
            8192 |@@@@@@                                   521
           16384 |@@@@@@@                                  602
           32768 |@@@                                      229
           65536 |@                                        92
          131072 |                                         2
          262144 |                                         0
          524288 |                                         1
         1048576 |                                         1
         2097152 |                                         1
         4194304 |                                         3
         8388608 |                                         12
        16777216 |@                                        51
        33554432 |                                         38
        67108864 |                                         25
       134217728 |                                         9
       268435456 |                                         2
       536870912 |                                         3
      1073741824 |                                         0
You see the ARC (memory cache) activity at 4-65 microseconds and the disk
activity at 8-33 ms. And you see some "big hits" of 0.13-0.5 seconds (!).
That is far too long, and I have not figured out why it happens. The more
users connect, the more frequent these really long opens become.
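To catch just those outliers together with the file names, a variant of the
script above should work (same entry/return pattern; the 100 ms threshold is
an arbitrary choice of mine):

#!/usr/sbin/dtrace -s
#pragma D option quiet

syscall::open:entry
{
    self->ts = timestamp;
    self->filename = arg0;
}

/* Print only opens slower than 100 ms (100000000 ns). */
syscall::open:return
/self->ts && timestamp - self->ts > 100000000/
{
    printf("%d ms %s\n", (timestamp - self->ts) / 1000000,
        copyinstr(self->filename));
}

/* Always clean up the thread-locals, slow open or not. */
syscall::open:return
/self->ts/
{
    self->ts = 0;
    self->filename = 0;
}

Matching the printed paths against Cyrus activity (cyrus.header/cyrus.index
versus message files) might show which opens are the slow ones.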
We have a Postfix spool running on the same machine, and we got some relief
by deactivating its directory hashing scheme; ZFS seems to be very "angry"
about deep directory structures. But even so, these "long opens" still
occur.
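For anyone wanting to try the same, the hashing is controlled from main.cf;
something like the following keeps it shallow (the exact queue list here is
illustrative, not our literal config):

# Hash only the queues that can grow huge, and only one level deep,
# so ZFS sees flat directories for the hot queues.
hash_queue_names = deferred, defer
hash_queue_depth = 1

postconf(5) says to run "postfix reload" after changing these parameters.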
Pascal