LARGE single-system Cyrus installs?

David Lang david.lang at digitalinsight.com
Tue Nov 20 12:56:37 EST 2007


On Tue, 20 Nov 2007, Ian G Batten wrote:

> On 20 Nov 07, at 1332, Michael R. Gettes wrote:
>
>> I am wondering about the use of fsync() on journal'd file systems
>> as described below.  Shouldn't there be much less use of (or very
>> little use) of fsync() on these types of systems?  Let the journal
>> layer do its job and not force it within cyrus?  This would likely
>> save a lot of system overhead.
>
> fsync() forces the data to be queued to the disk.  A journaling
> filesystem won't usually make any difference, because no one wants to
> keep an intent log of every 1 byte write, or the 100 overwrites of
> the same block.  If you want every write() to go to disk,
> immediately, the filesystem layout doesn't really matter: it's just a
> matter of disk bandwidth.  Journalling filesystems are more usually
> concerned with metadata consistency, so that the filesystem isn't
> actively corrupt if the music stops at the wrong point in a directory
> create or something.

However, an fsync() on a journaled filesystem just means the data has to be
written to the journal; it doesn't mean the journal has to be flushed to
disk.

On ext3, if you have data=journal then your data is in the journal as well, and
all the system needs to do on an fsync() is write to the journal (a nice
sequential write), and everything is perfectly safe. If you have data=ordered
(the default for most journaled filesystems), then your data isn't safe merely
because the journal has been written, and two writes must happen on an fsync()
(one for the data, one for the metadata).

For cyrus you should have the same sort of requirements that you would have for
a database server. Without a battery-backed disk cache (or solid-state drive)
to absorb your updates, you end up throttled by your disk's rotation rate: you
can only complete a single fsync() write per rotation, and that only if you
don't have to seek. RAID 5/6 arrays are even worse, as almost all
implementations require a read of the entire stripe before writing a single
block (and its parity block) back out, and since the stripe is frequently
larger than the OS readahead, the OS throws much of that data away
immediately.

If we can identify the files that are the bottlenecks, it would be very
interesting to see the result of putting them on a solid-state drive.

David Lang

