Solaris ZFS & Cyrus : BugID 6535160

Thu Dec 13 23:25:15 EST 2007

I have a candidate for our running into performance issues with Cyrus
under load.  I think it could be in the ZFS filesystem layer.  Our man
Omen has come back from DTrace class and started running it, and
while it's early days yet, seems to be some fdsync calls with long times
associated with them 30 seconds and more to complete under duress.

I look at things from an aerospace engineering perspective with "envelopes".
Every filesystem has it's zone where it performs well, then when you get
outside the envelope, everything goes to hell.  In UFS for example you
often had to tweak things or switch to directio or even fastfs in some
load patterns.  ZFS is still fairly young, so still edges to find.

With ZFS there are a number of hits with fdsync bugids.  This connects
back to I believe Pascal Giengers' thread, which I think resulted in him
turning off ZIL to bump up performance.  This is speculation at this
point, anyhow here's the BugID information for those what's interested:

===================================================
Bug ID: 6535160

Synopsis: Lock contention on zl_lock from zil_commit

Category: kernel

Subcategory: zfs

State: 10-Fix Delivered

Description:

I've been running the filebench varmail load on an T2000 system, upping 
the load a little to account for the larger size of the system, and 
encountered some significant lock contention coming from zil_commit().  
First, the configuration I'm running on:

32-strand T2000 (Ontario) @ 1GHz
32 GB RAM
4 FC HBAs
12 T3B FC arrays (three per HBA), 4 LUNs presented each (total of 48 LUNs)
The system is running Krister's ARC scalability bits

One pool has been created, using a simple stripe (no mirror, no 
RAIDZ/RAIDZ2).  Compression is off (default), recordsize is default 
(128KB).  The pool has one filesystem, as it's not entirely 
straightforward to work with multiple filesystems and filebench.  I've 
tweaked the varmail config with the following settings:

filebench> set $nfiles=100000
filebench> set $nthreads=64

The lock contention picture looks like the following.  First, typical 
output of lockstat -cwPW:

Adaptive mutex spin: 1156154 events in 30.069 seconds (38449 events/sec)

Count indv cuml rcnt     spin Hottest Lock           Hottest Caller
-------------------------------------------------------------------------------
1141286 100% 100% 0.00       21 0x60040ce6180          zil_commit+0x40
...
Adaptive mutex block: 2828 events in 30.069 seconds (94 events/sec)

Count indv cuml rcnt     nsec Hottest Lock           Hottest Caller
-------------------------------------------------------------------------------
 2760  99%  99% 0.00   447336 0x60040ce6180          zil_commit+0x40

This means that we're spinning on the lock an average of 1189 times per 
CPU per second, and blocking on it an average of 2.9 times per CPU per 
second for roughly 447ms each time.

We get to zil_commit() via:
dtrace: description 'fbt::zil_commit:entry ' matched 2 probes
CPU     ID                    FUNCTION:NAME
  3  50356                        :tick-30s

              zfs`zfs_fsync+0xa4
              genunix`fop_fsync+0x14
              genunix`fdsync+0x3c
              unix`syscall_trap+0xac
            20244

So at first blush this is all fsync()-related.  I'm assuming, of course, 
that the varmail profile is accurate for the general industry in this 
regard, so we'll march on.  Who else is taking that lock?
Adaptive mutex spin: 620972 events in 30.055 seconds (20661 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
574654  93%  93% 0.00       25 0x60040ce6180          zil_commit+0x40
15094   2%  95% 0.00       18 0x60040ce6180          zil_commit_writer+0x200
12918   2%  97% 0.00       15 0x60040ce6180          zil_itx_assign+0x4
 9241   1%  99% 0.00       13 0x60040ce6180          
zil_lwb_write_start+0x22c
 6103   1% 100% 0.00       11 0x60040ce6180          zil_commit+0x14
 2015   0% 100% 0.00       11 0x60040ce6180          zil_lwb_write_done+0x24
  941   0% 100% 0.00        7 0x60040ce6180          zil_commit_writer+0x32c
    4   0% 100% 0.00       19 0x60040ce6180          zil_clean+0x4
    2   0% 100% 0.00        7 0x60040ce6180          zil_sync+0x28
-------------------------------------------------------------------------------

So it's still mostly zil_commit() coming from fsync().  Let's peak at 
the hold times.
Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
23624  82%  82% 0.00    31656 0x60040ce6180          cv_wait+0x34
 1249   4%  86% 0.00    17210 0x60040ce6180          zil_commit_writer+0x1a8
 1248   4%  90% 0.00    13491 0x60040ce6180          zil_itx_assign+0x44
  760   3%  93% 0.00    10400 0x60040ce6180          zio_done+0x124
  760   3%  96% 0.00    10431 0x60040ce6180          
zil_lwb_write_start+0x240
  640   2%  98% 0.00   398762 0x60040ce6180          zfs_fsync+0xa4
  635   2% 100% 0.00    12352 0x60040ce6180          zil_commit_writer+0x290
    5   0% 100% 0.00    15032 0x60040ce6180          zil_commit_writer+0x238
    2   0% 100% 0.00 18063206 0x60040ce6180          dmu_objset_sync+0x158
    2   0% 100% 0.00    13400 0x60040ce6180          dsl_pool_zil_clean+0x28
-------------------------------------------------------------------------------

Nothing outrageous, aside from the dmu_objset_sync() (which is really 
zil_sync()) one which is the source of a different CR.  Now I'm going to 
bet that the cv_wait() call is really zil_commit() in disguise:
-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller
37186  79%  79% 0.00    39440 0x60040ce6180          cv_wait+0x34

      nsec ------ Time Distribution ------ count     Stack
     32768 |@@@@@@@                        9062      zil_commit+0x40
     65536 |@@@@@@@@@@@@@@@@@@@@@@         28093     zfs_fsync+0xa4
    131072 |                               8         fop_fsync+0x14
    262144 |                               3         fdsync+0x3
    524288 |                               7         syscall_trap+0xac
   1048576 |                               10
   2097152 |                               1
   4194304 |                               2

And it is.  We do have some outliers, but nothing approaching the 
dmu_objset_sync() case.

There was also a report in the hold times for zfs_fsync() although my 
trolling through the code doesn't see it taking the suspect lock but 
rather calling zil_commit() which does.  For completeness, the hold 
times for the zfs_fsync() case look like this:
Count indv cuml rcnt     nsec Lock                   Caller
 1152   2%  90% 0.00   424344 0x60040ce6180          zfs_fsync+0xa4

      nsec ------ Time Distribution ------ count     Stack
     16384 |                               15        fop_fsync+0x14
     32768 |                               12        fdsync+0x3c
     65536 |@                              44        syscall_trap+0xac
    131072 |@@                             94
    262144 |@@@@@                          207
    524288 |@@@@@@@@@@                     385
   1048576 |@@@@@@@@@@                     386
   2097152 |                               3
   4194304 |                               6

The profile (in the form of <probes:function>) looks like this:
7491134:cpu_halt
2896815:mutex_vector_enter
141563:exec_fault
92557:disp_getwork
68408:lzjb_compress
43288:lwpchan_cache_mapping
36614:_resume_from_idle
35462:mutex_enter
26316:fletcher_2_native
24701:lock_set_spl_spin

Which means that when we're doing work (cpu_halt() being, of course, 
idle) we're spending most of our time in mutex_vector_enter().  The 
first "real" work we get to is lzjb_compress() which I'm going to assume 
is from metadata compression.  The amount of idle suggests this load 
could be increased considerably before we would become CPU saturated.  A 
brief check at increasing the load shows that in fact  the contention 
does continue to degrade in proportion and overall throughput begins to 
suffer.  At nearly-full CPU saturation we now see a profile like this:

8821287:mutex_vector_enter
1454131:cpu_halt
130854:exec_fault
93425:sleepq_insert
88498:lock_set_spl_spin
50079:lzjb_compress
46588:disp_getwork
32659:_resume_from_idle
27340:lwpchan_cache_mapping
26751:fletcher_2_native

    Date Modified: 2007-03-16 01:05:19 GMT+00:00

I took the opportunity to hack up the tsd-based suggestion in the 
Evaluation and saw some very positive results.  On entry into 
zfs_fsync() we set a tsd flag which is used in zfs_log_write() to mark 
any writes by that thread as synchronous.

With the baseline bits we get:
IO Summary:      679672 ops 5605.4 ops/s, (862/863 r/w)  27.5mb/s,   
6466us cpu/
op,  36.7ms latency

and with the change we get:
IO Summary:      885634 ops 7300.9 ops/s, (1123/1124 r/w)  37.5mb/s,   
1327us cp
u/op,  11.6ms latency

In particular, the fsyncfile ops see a reduction in latency from 
~70ms/op to ~19ms/op.  The lock contention picture now looks like this:
Adaptive mutex spin: 917767 events in 30.061 seconds (30531 events/sec)

Count indv cuml rcnt     spin Hottest Lock           Hottest 
Caller          
-------------------------------------------------------------------------------
176094  26%  26% 0.00        8 pr_pidlock[8]          
pr_p_lock+0xc          
 76925  16%  42% 0.00       11 0x600442aa4c0          zil_commit+0x40

Which is, obviously, considerably better for the zil_commit() issue.  
The pr_p_lock() issue is, unfortunately, endemic to filebench's use of 
microstate accounting for each op with high numbers of threads in a 
single process but thus unrelated to the issue at hand.

    Date Modified: 2007-06-19 17:06:55 GMT+00:00

Work Around:

Suggested Fix:

Evaluation:

Fixed by patch:

Integrated in Build: snv_71

Duplicate of:

Related Change Request(s):

6413510  6579989

Date Modified: 2007-11-29 01:30:17 GMT+00:00

Public Summary: