Updating /seen from concurrent sessions

Lawrence Greenfield leg+ at andrew.cmu.edu
Thu Nov 14 22:54:52 EST 2002


--On Friday, November 15, 2002 2:40 PM +1100 Andrew McNamara 
<andrewm at object-craft.com.au> wrote:

>> In general none of Cyrus will necessarily work over NFS. If you're only
>> accessing the NFS store from a single client, things have a much better
>> chance of working---
>
> By single client, do you mean a single NFS client hitting the NFS server?
> If so, this is guaranteed in our configuration.

Yes.

[...]
> It's hard to find any hard information amongst the traditional NFS
> hysteria. I suspect Sleepycat's warning is there simply because the
> quality of NFS implementations is often poor, and it involves so many
> other variables they can't control.

A lot of problems also result when people try to run the application on 
more than one computer hitting the same NFS server. But the thing that 
drives us application writers mad is the idea that rename() can return 
failure yet actually have happened; if you're trying to write a reliable 
application, you can't rely on the chance of this being minimized, since 
you know it's going to happen eventually and you're going to be sorry.
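To make the ambiguity concrete: over NFS, the server can perform a rename but lose the reply, so the client's retransmitted request fails with ENOENT even though the operation succeeded. A minimal sketch of one defensive workaround (the function name and the inode-stamp check are my own illustration, not anything Cyrus does):

```python
import os

def robust_rename(src, dst):
    """Rename src to dst, treating failure as ambiguous (NFS retry semantics).

    Record an identifying stamp (the inode number of src) before renaming,
    so an apparent failure can be re-checked: if src is gone and dst now
    has src's inode, an earlier attempt succeeded and only the reply was
    lost.
    """
    src_ino = os.stat(src).st_ino
    try:
        os.rename(src, dst)
        return True
    except FileNotFoundError:
        # src has vanished -- did our rename already happen?
        try:
            if os.stat(dst).st_ino == src_ino:
                return True  # earlier attempt succeeded; reply was lost
        except FileNotFoundError:
            pass
        raise
```

Even this only narrows the window; it assumes no other writer reuses the inode between the failure and the re-check, which is exactly the kind of reasoning a reliable application would rather not depend on.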

>> skiplist should work over NFS with a single client and map_nommap.
>
> So, do you mean a single process or a single server (potentially with
> multiple processes hitting the file).

I would hope it would work with a single server with multiple processes. 
But I really haven't thought about all the possibilities with NFS. (The 
"return error and succeed" problem is just one that springs to mind, and 
I've never audited the code thinking about that.)

> Indeed, however if you are talking about increasing the frequency of
> writes to the file, and if you retain a few old versions, you will
> almost certainly get away with it (so, worst case on restart, you try
> progressively older files). This wouldn't be an answer for critical
> data, but it may be acceptable for the \Seen state. Shrug.

Great, now I need to do bookkeeping to do this. Plus, on most Unix 
filesystems, rename() is a more expensive operation than one fsync() and 
probably even two fsync()s. And how am I supposed to determine 
programmatically whether or not a given version is valid?
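The usual answer to the validity question is to embed a checksum in each version and, on restart, walk the retained files newest-first until one checks out. A minimal sketch under those assumptions (the MAGIC value, function names, and CRC32-header layout are hypothetical, not Cyrus's actual \Seen format):

```python
import os
import struct
import zlib

MAGIC = b"SEEN"  # hypothetical 4-byte magic, for illustration only

def write_version(path, payload):
    """Write payload prefixed with a CRC32 header, fsync, then rename."""
    data = MAGIC + struct.pack(">I", zlib.crc32(payload)) + payload
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)

def read_first_valid(paths):
    """Return the payload of the first file whose CRC checks out, else None.

    The caller passes paths ordered newest-first, so a torn or corrupt
    latest version falls back to an older one.
    """
    for path in paths:
        try:
            with open(path, "rb") as f:
                data = f.read()
        except OSError:
            continue
        if len(data) < 8 or data[:4] != MAGIC:
            continue
        (crc,) = struct.unpack(">I", data[4:8])
        if zlib.crc32(data[8:]) == crc:
            return data[8:]
    return None
```

This illustrates the bookkeeping cost being objected to: every reader now needs header parsing, CRC verification, and a fallback scan, just to paper over the lack of a durable rename().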

> BTW, Linux up until very recently synced way too much data on an fsync()
> (it behaved more like a sync()). Yet, even after the new improved fsync(),
> it still doesn't guarantee the file won't be lost (since it doesn't
> sync the directory entry for the file, only the file data and metadata,
> whereas the BSDs and Solaris do). This is a massive pain in the arse
> for MTA authors.

Linux ext2 has this metadata problem. ext3 and reiserfs are both supposed 
to force metadata to disk when fsync() is called, similarly to softupdates 
on BSD, Veritas, and most other modern filesystems. I'm willing to bet that 
I've wasted more time than you have worrying about the semantics of fsync() 
on various Unix filesystems.
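The standard MTA-author workaround for the directory-entry problem is to fsync() the containing directory as well as the file itself. A minimal sketch of that pattern (the function name is mine; the technique is the well-known "fsync the directory too" idiom):

```python
import os

def durable_create(path, data):
    """Create a file durably: fsync the file, then fsync its directory.

    On filesystems where fsync() flushes only the file's data and
    metadata (e.g. Linux ext2 at the time), the new directory entry
    itself can be lost on crash unless the containing directory is
    fsync()ed too.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    # Flush the directory entry: open the directory and fsync it.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Note this doubles the fsync() count per created file, which feeds directly into the cost comparison with rename() above.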

> I think my point is that the cost of open() is roughly equivalent to
> the cost of stat() under Solaris - so rather than keep a file open,
> and stat it periodically to see if it's changed under you, you can close
> and reopen the file (resulting in simpler code, but similar performance).

You need to do the stat() regardless if you want the latest data. By 
keeping the file open, you potentially amortize the cost of an open(), 
another fstat() (to find the size of the file behind your open fd), and an 
mmap(). All of these have varying costs depending on your platform and 
your Unix.

Keeping the file open costs almost nothing (just the disk space of a 
replaced version that can't be freed while it's still open, when and if 
there is write contention).
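The amortization being described can be sketched as follows: keep the file open and mapped, pay one stat() per access to check freshness, and reopen/remap only when the file has actually been replaced. This is my own illustrative sketch of the pattern, not Cyrus's code (the class name and the inode/size/mtime stamp are assumptions):

```python
import mmap
import os

class MappedFile:
    """Keep a file open and mmap()ed; remap only when stat() shows change."""

    def __init__(self, path):
        self.path = path
        self.fd = os.open(path, os.O_RDONLY)
        self.map = None
        self._remap()

    def _remap(self):
        # fstat() the open fd to learn the size to map.
        st = os.fstat(self.fd)
        self.stamp = (st.st_ino, st.st_size, st.st_mtime)
        if self.map is not None:
            self.map.close()
        self.map = mmap.mmap(self.fd, st.st_size, access=mmap.ACCESS_READ)

    def data(self):
        # The unavoidable per-access cost: one stat() of the path.
        st = os.stat(self.path)
        if (st.st_ino, st.st_size, st.st_mtime) != self.stamp:
            # File was replaced (e.g. by rename()); reopen and remap.
            os.close(self.fd)
            self.fd = os.open(self.path, os.O_RDONLY)
            self._remap()
        return self.map[:]
```

The open()/fstat()/mmap() triple runs only on the slow path; if open() is really as cheap as stat() on a given platform, closing and reopening each time gives similar performance with simpler code, which is the trade-off under discussion.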

[...]
> Actually, it scaled better than initially expected - this map type
> was used specifically for tables that changed very frequently (the
> pop-before-smtp pre-auth mechanism being a case in point). The only
> synchronous operation was the rename(). The lookup read()'s would have
> been pulling the data from the buffer cache, and sequential searches
> beat more complex schemes every time when the dataset is small (less
> than 100kB was the figure we found when comparing to things like libdb).
> The saving in resident set size was critical too - the machine had 4G
> of RAM, and no more could be fitted.

You have one database and weren't fsync()ing the data. Cyrus has thousands 
of active databases and cares about the reliability of the data.

Larry





More information about the Info-cyrus mailing list