LARGE single-system Cyrus installs?
Ian G Batten
ian.batten at uk.fujitsu.com
Wed Nov 21 04:50:25 EST 2007
On 20 Nov 07, at 17:56, David Lang wrote:
>
> however a fsync on a journaled filesystem just means the data needs to
> be written to the journal; it doesn't mean that the journal needs to be
> flushed to disk.
>
> on ext3 if you have data=journal then your data is in the journal as
> well, and all that the system needs to do on a fsync is to write things
> to the journal (a nice sequential write),
Assuming the journal is on a distinct device and the distinct device
can take the load. On ZFS it isn't, although work on separate log
devices is in progress. One of the many benefits of the sadly
underrated Solstice DiskSuite product was the metatrans devices, which
at least permitted metadata updates to go to a distinct device. When
the UFS logging code went into core Solaris (the ON integration) that
facility was, sadly, dropped. My Pillar NFS server does data logging
to distinct disk groups, but mostly, as such boxes tend to do, relies
on 12GB of RAM and a battery. A sequential write is only of benefit if
the head is in the right place, the platter is at the right rotational
position, and the write is well-matched to the transfer rate of the
spindle: if the spindle is doing large sequential writes while also
servicing reads and writes elsewhere, or can't keep up with writing
tracks flat out, the problems increase.
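For ext3 the journal can indeed live on a distinct device; a sketch of
how one might arrange it, with entirely hypothetical device names and
nothing to do with my actual setup:

    # create an external journal device
    mke2fs -O journal_dev /dev/sdc1
    # detach the internal journal, then attach the external one
    # (filesystem unmounted for both steps)
    tune2fs -O ^has_journal /dev/sdb1
    tune2fs -j -J device=/dev/sdc1 /dev/sdb1
    # mount with full data journalling so file data goes via the journal too
    mount -o data=journal /dev/sdb1 /var/spool/imap

ZFS should eventually get the equivalent with separate intent-log
devices.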
>
> for cyrus you should have the same sort of requirements that you would
> have for a database server, including the fact that without a
> battery-backed disk cache (or solid state drive) to handle your
> updates, you end up being throttled by your disk rotation rate (you
> can only do a single fsync write per rotation, and that's good only if
> you don't have to seek). RAID 5/6 arrays are even worse, as almost all
> systems will require a read of the entire stripe before writing a
> single block (and its parity block) back out, and since the stripe is
> frequently larger than the OS readahead, the OS throws much of the
> data away immediately.
>
> if we can identify the files that are the bottlenecks it would be very
> interesting to see the result of putting them on a solid-state drive.
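The one-fsync-per-rotation ceiling David mentions is easy to
demonstrate. A minimal sketch of a microbenchmark, assuming a scratch
file on the filesystem under test and the drive's write cache
disabled; the loop rate converges on the spindle speed, roughly 120/s
for a 7200rpm disk:

    /*
     * fsync-rate microbenchmark: a minimal sketch, not a rigorous tool.
     * Appends a 4K block and fsyncs, over and over; with the drive's
     * write cache off, the loop rate approaches one commit per rotation.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        /* scratch path is hypothetical; point it at the fs under test */
        const char *path = argc > 1 ? argv[1] : "/var/tmp/fsync-test.dat";
        const int iters = 1000;
        char block[4096];
        memset(block, 'x', sizeof block);

        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (fd < 0) { perror("open"); return 1; }

        time_t start = time(NULL);
        for (int i = 0; i < iters; i++) {
            if (write(fd, block, sizeof block) != (ssize_t)sizeof block) {
                perror("write"); return 1;
            }
            if (fsync(fd) != 0) { perror("fsync"); return 1; }
        }
        time_t elapsed = time(NULL) - start;

        printf("%d fsyncs in %ld s (%.0f/s)\n", iters, (long)elapsed,
               elapsed > 0 ? (double)iters / elapsed : 0.0);
        close(fd);
        unlink(path);
        return 0;
    }

On a battery-backed cache or a solid-state drive the same loop runs
orders of magnitude faster, which is David's point.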
I've split the meta-data out into separate partitions (the imapd.conf
arrangement behind this split is sketched at the end of this message).
The meta-data is stored in ZFS filesystems in a pool which is a
four-disk RAID 0+1 group on SAS drives; the message data comes out of
the lowest QoS on my Pillar. A ten-second fsstat sample of VM
operations shows that, by request (this measures filesystem activity,
not the implied disk activity), it's the meta partitions taking the
pounding:
 map addmap delmap getpag putpag pagio
   0      0      0     45      0     0  /var/imap
  11     11     11     17      0     0  /var/imap/meta-partition-1
 290    290    290    463      5     0  /var/imap/meta-partition-2
 139    139    139    183      3     0  /var/imap/meta-partition-3
  66     66     66    106     10     0  /var/imap/meta-partition-7
 347    347    342    454     16     0  /var/imap/meta-partition-8
  57     57     57     65      5     0  /var/imap/meta-partition-9
   4      4      8      4      0     0  /var/imap/partition-1
  11     11     22     14      0     0  /var/imap/partition-2
   1      1      2      1      0     0  /var/imap/partition-3
   6      6     12     49     10     0  /var/imap/partition-7
  15     15     28    457      0     0  /var/imap/partition-8
   1      1      2      2      0     0  /var/imap/partition-9
Similarly, by non-VM operation:
 new  name  name  attr  attr lookup rddir  read  read write write
file remov  chng   get   set    ops   ops   ops bytes   ops bytes
   0     0     0 2.26K     0  6.15K     0     0     0    45 1.22K  /var/imap
   0     0     0   356     0    707     0     0     0     6 3.03K  /var/imap/meta-partition-1
   3     0     3   596     0    902     0     6  135K    90  305K  /var/imap/meta-partition-2
   0     0     0   621     0  1.08K     0     0     0     3 1.51K  /var/imap/meta-partition-3
   3     0     3 1.04K     0  1.70K     0     6  149K    36  650K  /var/imap/meta-partition-7
   0     0     0 2.28K     0  4.24K     0     0     0     7 1.87K  /var/imap/meta-partition-8
   0     0     0    18     0     32     0     0     0     2   176  /var/imap/meta-partition-9
   2     2     2    22     0     30     0     1 2.37K     2 7.13K  /var/imap/partition-1
   3     4    12    84     0    157     0     1   677     3 7.51K  /var/imap/partition-2
   1     1     1 1.27K     0  2.16K     0     0     0     1 3.75K  /var/imap/partition-3
   2     2     4    35     0     56     0     1 3.97K    36  279K  /var/imap/partition-7
   1     2     1   256     0    514     0     0     0     1 3.75K  /var/imap/partition-8
   0     0     0     0     0      0     0     0     0     0     0  /var/imap/partition-9
And looking at the real I/O load, ten seconds of zpool iostat (for the
meta-data and /var/imap)
                capacity     operations    bandwidth
pool          used  avail   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
pool1         51.6G  26.4G      0    142  54.3K  1001K
  mirror      25.8G  13.2G      0     68  38.4K   471K
    c0t0d0s4      -      -      0     36  44.7K   471K
    c0t1d0s4      -      -      0     36      0   471K
  mirror      25.8G  13.2G      0     73  15.9K   530K
    c0t2d0s4      -      -      0     40  28.4K   531K
    c0t3d0s4      -      -      0     39  6.39K   531K
------------  -----  -----  -----  -----  -----  -----
is very different to ten seconds of sar for the NFS mounts:
09:46:34   device   %busy   avque   r+w/s   blks/s   avwait   avserv
[...]
           nfs73        1     0.0       3      173      0.0      4.2
           nfs86        3     0.1      12      673      0.0      6.5
           nfs87        0     0.0       0        0      0.0      0.0
           nfs89        0     0.0       0        0      0.0      0.0
           nfs96        0     0.0       0        0      0.0      1.8
           nfs101       1     0.0       1       25      0.0      8.0
           nfs102       0     0.0       0        4      0.0      9.4
The machine has a _lot_ of memory (32GB), so it's likely that mail
which is delivered and then read within ten minutes is served from
cache and never read back from the message store: the NFS load, as
seen from the server, is almost entirely writes.
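For completeness, the meta/data split above is just the stock Cyrus
metapartition machinery; roughly this sort of imapd.conf fragment,
abbreviated and from memory rather than a verbatim copy of my config:

    partition-1: /var/imap/partition-1
    metapartition-1: /var/imap/meta-partition-1
    partition-2: /var/imap/partition-2
    metapartition-2: /var/imap/meta-partition-2
    # ...and likewise for partitions 3, 7, 8 and 9...
    metapartition_files: header index cache expunge squatter

Splitting the meta files out is what makes the fsstat figures above
meaningful: the seek-heavy index and cache traffic lands on the SAS
mirrors, the bulk message bodies on the Pillar.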
ian