Solaris ZFS & Cyrus : BugID 6535160
Vincent Fox
vbfox at ucdavis.edu
Thu Dec 13 23:25:15 EST 2007
I have a candidate for our running into performance issues with Cyrus
under load. I think it could be in the ZFS filesystem layer. Our man
Omen has come back from DTrace class and started running it, and
while it's early days yet, there seem to be some fdsync calls with
long completion times, 30 seconds and more, under duress.
I look at things from an aerospace engineering perspective with "envelopes".
Every filesystem has its zone where it performs well; when you get
outside the envelope, everything goes to hell. In UFS, for example, you
often had to tweak things or switch to directio or even fastfs under some
load patterns. ZFS is still fairly young, so there are still edges to find.
With ZFS there are a number of hits on fdsync-related bugids. This connects
back, I believe, to Pascal Giengers' thread, which I think resulted in him
turning off the ZIL to bump up performance. This is speculation at this
point; anyhow, here's the BugID information for those who are interested:
===================================================
Bug ID: 6535160
Synopsis: Lock contention on zl_lock from zil_commit
Category: kernel
Subcategory: zfs
State: 10-Fix Delivered
Description:
I've been running the filebench varmail load on a T2000 system, upping
the load a little to account for the larger size of the system, and
encountered some significant lock contention coming from zil_commit().
First, the configuration I'm running on:
32-strand T2000 (Ontario) @ 1GHz
32 GB RAM
4 FC HBAs
12 T3B FC arrays (three per HBA), 4 LUNs presented each (total of 48 LUNs)
The system is running Krister's ARC scalability bits
One pool has been created, using a simple stripe (no mirror, no
RAIDZ/RAIDZ2). Compression is off (default), recordsize is default
(128KB). The pool has one filesystem, as it's not entirely
straightforward to work with multiple filesystems and filebench. I've
tweaked the varmail config with the following settings:
filebench> set $nfiles=100000
filebench> set $nthreads=64
The lock contention picture looks like the following. First, typical
output of lockstat -cwPW:
Adaptive mutex spin: 1156154 events in 30.069 seconds (38449 events/sec)
Count indv cuml rcnt spin Hottest Lock Hottest Caller
-------------------------------------------------------------------------------
1141286 100% 100% 0.00 21 0x60040ce6180 zil_commit+0x40
...
Adaptive mutex block: 2828 events in 30.069 seconds (94 events/sec)
Count indv cuml rcnt nsec Hottest Lock Hottest Caller
-------------------------------------------------------------------------------
2760 99% 99% 0.00 447336 0x60040ce6180 zil_commit+0x40
This means that we're spinning on the lock an average of 1189 times per
CPU per second, and blocking on it an average of 2.9 times per CPU per
second for roughly 447us each time (per the nsec column above).
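Those per-CPU figures follow directly from the raw lockstat totals. A
quick sketch of the arithmetic, assuming the T2000's 32 hardware strands
as the CPU count:

```python
# Derive per-CPU spin/block rates from the raw lockstat totals above.
# Assumes 32 CPUs (the T2000's 32 hardware strands).

def per_cpu_rate(events, interval_sec, ncpus=32):
    """Events per CPU per second."""
    return events / interval_sec / ncpus

spin_rate = per_cpu_rate(1141286, 30.069)   # zil_commit+0x40 spins
block_rate = per_cpu_rate(2760, 30.069)     # zil_commit+0x40 blocks

print(f"spins/CPU/s:  {spin_rate:.0f}")        # roughly 1.2k, as stated
print(f"blocks/CPU/s: {block_rate:.1f}")       # roughly 2.9, as stated
print(f"block hold:   {447336 / 1e3:.0f} us")  # nsec column -> microseconds
```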
We get to zil_commit() via:
dtrace: description 'fbt::zil_commit:entry ' matched 2 probes
CPU ID FUNCTION:NAME
3 50356 :tick-30s
zfs`zfs_fsync+0xa4
genunix`fop_fsync+0x14
genunix`fdsync+0x3c
unix`syscall_trap+0xac
20244
So at first blush this is all fsync()-related. I'm assuming, of course,
that the varmail profile is accurate for the general industry in this
regard, so we'll march on. Who else is taking that lock?
Adaptive mutex spin: 620972 events in 30.055 seconds (20661 events/sec)
Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
574654 93% 93% 0.00 25 0x60040ce6180 zil_commit+0x40
15094 2% 95% 0.00 18 0x60040ce6180 zil_commit_writer+0x200
12918 2% 97% 0.00 15 0x60040ce6180 zil_itx_assign+0x4
9241 1% 99% 0.00 13 0x60040ce6180 zil_lwb_write_start+0x22c
6103 1% 100% 0.00 11 0x60040ce6180 zil_commit+0x14
2015 0% 100% 0.00 11 0x60040ce6180 zil_lwb_write_done+0x24
941 0% 100% 0.00 7 0x60040ce6180 zil_commit_writer+0x32c
4 0% 100% 0.00 19 0x60040ce6180 zil_clean+0x4
2 0% 100% 0.00 7 0x60040ce6180 zil_sync+0x28
-------------------------------------------------------------------------------
So it's still mostly zil_commit() coming from fsync(). Let's peek at
the hold times.
Count indv cuml rcnt nsec Lock Caller
-------------------------------------------------------------------------------
23624 82% 82% 0.00 31656 0x60040ce6180 cv_wait+0x34
1249 4% 86% 0.00 17210 0x60040ce6180 zil_commit_writer+0x1a8
1248 4% 90% 0.00 13491 0x60040ce6180 zil_itx_assign+0x44
760 3% 93% 0.00 10400 0x60040ce6180 zio_done+0x124
760 3% 96% 0.00 10431 0x60040ce6180 zil_lwb_write_start+0x240
640 2% 98% 0.00 398762 0x60040ce6180 zfs_fsync+0xa4
635 2% 100% 0.00 12352 0x60040ce6180 zil_commit_writer+0x290
5 0% 100% 0.00 15032 0x60040ce6180 zil_commit_writer+0x238
2 0% 100% 0.00 18063206 0x60040ce6180 dmu_objset_sync+0x158
2 0% 100% 0.00 13400 0x60040ce6180 dsl_pool_zil_clean+0x28
-------------------------------------------------------------------------------
Nothing outrageous, aside from the dmu_objset_sync() entry (which is
really zil_sync()), which is the source of a different CR. Now I'm going
to bet that the cv_wait() call is really zil_commit() in disguise:
-------------------------------------------------------------------------------
Count indv cuml rcnt nsec Lock Caller
37186 79% 79% 0.00 39440 0x60040ce6180 cv_wait+0x34
nsec ------ Time Distribution ------ count Stack
32768 |@@@@@@@ 9062 zil_commit+0x40
65536 |@@@@@@@@@@@@@@@@@@@@@@ 28093 zfs_fsync+0xa4
131072 | 8 fop_fsync+0x14
262144 | 3 fdsync+0x3c
524288 | 7 syscall_trap+0xac
1048576 | 10
2097152 | 1
4194304 | 2
And it is. We do have some outliers, but nothing approaching the
dmu_objset_sync() case.
There was also a report in the hold times for zfs_fsync(), although my
trolling through the code doesn't show it taking the suspect lock
directly; rather, it calls zil_commit(), which does. For completeness,
the hold times for the zfs_fsync() case look like this:
Count indv cuml rcnt nsec Lock Caller
1152 2% 90% 0.00 424344 0x60040ce6180 zfs_fsync+0xa4
nsec ------ Time Distribution ------ count Stack
16384 | 15 fop_fsync+0x14
32768 | 12 fdsync+0x3c
65536 |@ 44 syscall_trap+0xac
131072 |@@ 94
262144 |@@@@@ 207
524288 |@@@@@@@@@@ 385
1048576 |@@@@@@@@@@ 386
2097152 | 3
4194304 | 6
The profile (in the form of <probes:function>) looks like this:
7491134:cpu_halt
2896815:mutex_vector_enter
141563:exec_fault
92557:disp_getwork
68408:lzjb_compress
43288:lwpchan_cache_mapping
36614:_resume_from_idle
35462:mutex_enter
26316:fletcher_2_native
24701:lock_set_spl_spin
Which means that when we're doing work (cpu_halt() being, of course,
idle time) we're spending most of our time in mutex_vector_enter(). The
first "real" work we get to is lzjb_compress(), which I'm going to assume
is from metadata compression. The amount of idle time suggests this load
could be increased considerably before we become CPU saturated. A
brief check at increased load shows that the contention does in fact
continue to grow in proportion, and overall throughput begins to
suffer. At nearly-full CPU saturation we now see a profile like this:
8821287:mutex_vector_enter
1454131:cpu_halt
130854:exec_fault
93425:sleepq_insert
88498:lock_set_spl_spin
50079:lzjb_compress
46588:disp_getwork
32659:_resume_from_idle
27340:lwpchan_cache_mapping
26751:fletcher_2_native
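The shift between the two profiles is easy to quantify by comparing just
the idle (cpu_halt) and lock-spin (mutex_vector_enter) sample counts
quoted above:

```python
# Compare the two profile snapshots above: sample counts for idle
# (cpu_halt) vs. lock spinning (mutex_vector_enter).
light = {"cpu_halt": 7491134, "mutex_vector_enter": 2896815}
heavy = {"cpu_halt": 1454131, "mutex_vector_enter": 8821287}

for name, prof in (("light load", light), ("near saturation", heavy)):
    frac = prof["mutex_vector_enter"] / sum(prof.values())
    print(f"{name}: {frac:.0%} of (idle + spin) samples spent spinning")
```

Spinning goes from a minority of samples under light load to the large
majority near saturation, which is the degradation the text describes.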
Date Modified: 2007-03-16 01:05:19 GMT+00:00
I took the opportunity to hack up the tsd-based suggestion in the
Evaluation and saw some very positive results. On entry into
zfs_fsync() we set a tsd flag which is used in zfs_log_write() to mark
any writes by that thread as synchronous.
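For readers unfamiliar with the tsd (thread-specific data) trick, here
is a minimal user-level sketch of the same pattern in Python, using
threading.local as a stand-in for kernel tsd; the names (fsync,
log_write) are illustrative, not the actual ZFS entry points:

```python
# Sketch of the per-thread "synchronous" flag described above, using
# Python's threading.local as a stand-in for kernel thread-specific data.
# The names fsync/log_write are illustrative, not from the ZFS source.
import threading

_tsd = threading.local()

def log_write(data, log):
    # Only writes issued while the calling thread is inside fsync are
    # marked synchronous, instead of serializing every writer.
    sync = getattr(_tsd, "in_fsync", False)
    log.append((data, sync))

def fsync(pending, log):
    _tsd.in_fsync = True        # set the per-thread flag on entry
    try:
        for d in pending:
            log_write(d, log)   # these records are marked synchronous
    finally:
        _tsd.in_fsync = False   # clear it on exit

log = []
log_write("a", log)             # ordinary write: asynchronous
fsync(["b"], log)               # write issued during fsync: synchronous
print(log)                      # [('a', False), ('b', True)]
```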
With the baseline bits we get:
IO Summary: 679672 ops 5605.4 ops/s, (862/863 r/w) 27.5mb/s, 6466us cpu/op, 36.7ms latency
and with the change we get:
IO Summary: 885634 ops 7300.9 ops/s, (1123/1124 r/w) 37.5mb/s, 1327us cpu/op, 11.6ms latency
In particular, the fsyncfile ops see a reduction in latency from
~70ms/op to ~19ms/op. The lock contention picture now looks like this:
Adaptive mutex spin: 917767 events in 30.061 seconds (30531 events/sec)
Count indv cuml rcnt spin Hottest Lock Hottest Caller
-------------------------------------------------------------------------------
176094 26% 26% 0.00 8 pr_pidlock[8] pr_p_lock+0xc
76925 16% 42% 0.00 11 0x600442aa4c0 zil_commit+0x40
Which is, obviously, considerably better for the zil_commit() issue.
The pr_p_lock() contention is, unfortunately, endemic to filebench's use
of microstate accounting for each op with high numbers of threads in a
single process, and thus unrelated to the issue at hand.
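The size of the win quoted in the two IO summaries is easy to
sanity-check:

```python
# Compare the baseline and patched filebench runs quoted above.
base_ops, base_lat = 5605.4, 36.7    # ops/s, ms latency (baseline)
tsd_ops,  tsd_lat  = 7300.9, 11.6    # ops/s, ms latency (tsd change)

throughput_gain = (tsd_ops / base_ops - 1) * 100
latency_drop = (1 - tsd_lat / base_lat) * 100

print(f"throughput: +{throughput_gain:.0f}%")   # about +30%
print(f"latency:    -{latency_drop:.0f}%")      # about -68%
```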
Date Modified: 2007-06-19 17:06:55 GMT+00:00
Work Around:
Suggested Fix:
Evaluation:
Fixed by patch:
Integrated in Build: snv_71
Duplicate of:
Related Change Request(s):
6413510 6579989
Date Modified: 2007-11-29 01:30:17 GMT+00:00
Public Summary: