Updating /seen from concurrent sessions

Andrew McNamara andrewm at object-craft.com.au
Thu Nov 14 22:40:51 EST 2002


>In general none of Cyrus will necessarily work over NFS. If you're only 
>accessing the NFS store from a single client, things have a much better 
>chance of working---

By single client, do you mean a single NFS client hitting the NFS server?
If so, this is guaranteed in our configuration. 

>but I really don't know what semantics Sun's NFS client and NetApp's NFS
>filer guarantee with regards to mmap() and write(). If it doesn't support
>mmap() showing changes by write() immediately (Cyrus tests for this in the
>configure script but the configure script is probably not doodling on an
>NFS partition) you need to use map_nommap, which is very slow.

Actually, the build directory was NFS mounted, but the server was another
Solaris machine. I just extracted the mmap tests from configure, and
ran them on the test platform, and they passed (for what that's worth).
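
For illustration, here is a rough Python sketch of the kind of check being
discussed: map a file, change it through write(), and see whether the
mapping shows the change. This is not the test from the Cyrus configure
script; the function name and the directory argument are made up for the
example.

```python
import mmap
import os
import tempfile

def mmap_sees_write(directory):
    """Return True if a shared mapping reflects a later write() to the file."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"a" * mmap.PAGESIZE)       # give the file one page of data
        view = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_READ)
        try:
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, b"b" * 16)              # change the file via write()
            return view[:16] == b"b" * 16        # is the change visible via mmap()?
        finally:
            view.close()
    finally:
        os.close(fd)
        os.unlink(path)
```

Pointing `directory` at the NFS-mounted partition, rather than wherever the
build happens to run, is what makes the result meaningful for the store in
question.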

>Berkeley db makes no guarantees of working over NFS.

It's hard to find any hard information amongst the traditional NFS
hysteria. I suspect Sleepycat's warning is there simply because the
quality of NFS implementations is often poor, and it involves so many
other variables they can't control.

While there are real unsolvable problems with NFS, they tend to kick in
only when there's packet loss or duplication on the wire, and we've
done everything humanly possible to minimise this in our environment.

>skiplist should work over NFS with a single client and map_nommap.

So, do you mean a single process, or a single server (potentially with
multiple processes hitting the file)?

>> Another question - it looks to me like I have to recompile to switch
>> database types - is this true? The code looks like it would be flexible
>> enough to allow a run-time config option to choose the method with very
>> little modification?
>
>It probably could be made a run-time option. Since you need to convert all 
>of the different files, making it an easy run-time switch has never been a 
>priority.

It would make life a lot easier in our environment - the build platforms
are slow, and a recompile will take me an afternoon. I have very little
data stored on the test Cyrus platform, and can afford to nuke it and
start again. Having a run-time switch would let me rapidly compare
options.

>> If the imapd already can cope with asynchronous events, I would flush the
>> state after a second or two of inactivity from the client. Failing that,
>> I would probably flush the state before replying to the client (yes,
>> this would hurt performance, although probably not much, particularly
>> if we skip the fsync()).
>
>You can't skip the fsync() because the fsync()s are what guarantees that 
>the files will be in a consistent form if the system crashes. (The fsync()s 
>are needed for ordering guarantees of operation. This is true for Berkeley 
>db, skiplist, flat files, whatever.)

Indeed. However, if you are talking about increasing the frequency of
writes to the file, and you retain a few old versions, you will almost
certainly get away with skipping the fsync() (worst case on restart, you
try progressively older files). This wouldn't be an answer for critical
data, but it may be acceptable for the \Seen state. Shrug.
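
A minimal Python sketch of the "keep a few old versions" idea, under
stated assumptions: the function names, the `.1`/`.2`/`.3` naming, and the
checksum header used to spot a damaged file are all invented for the
example, and none of this is Cyrus code.

```python
import hashlib
import os

def save_state(path, data, keep=3):
    """Write data to path without fsync(), keeping `keep` older generations."""
    # Shift path.2 -> path.3, path.1 -> path.2, path -> path.1 (oldest first).
    for n in range(keep, 0, -1):
        older = f"{path}.{n}"
        newer = path if n == 1 else f"{path}.{n - 1}"
        if os.path.exists(newer):
            os.replace(newer, older)
    digest = hashlib.sha1(data).hexdigest().encode()
    with open(path, "wb") as f:
        f.write(digest + b"\n" + data)   # checksum lets us detect a torn write

def load_state(path, keep=3):
    """Return the newest generation whose checksum verifies, or None."""
    for candidate in [path] + [f"{path}.{n}" for n in range(1, keep + 1)]:
        try:
            with open(candidate, "rb") as f:
                digest, _, data = f.read().partition(b"\n")
        except FileNotFoundError:
            continue
        if hashlib.sha1(data).hexdigest().encode() == digest:
            return data
    return None
```

The trade-off is the one described above: a crash can cost the most recent
updates, which may be tolerable for \Seen state but not for critical data.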

BTW, Linux up until very recently synced way too much data on an fsync()
(it behaved more like a sync()). Yet, even after the new improved fsync(),
it still doesn't guarantee the file won't be lost (since it doesn't
sync the directory entry for the file, only the file data and metadata,
whereas the BSDs and Solaris do). This is a massive pain in the arse
for MTA authors.
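
The usual workaround where fsync() on the file does not cover the
directory entry is to fsync() the containing directory as well. A small
POSIX-only Python sketch, with a hypothetical function name:

```python
import os

def create_durably(path, data):
    """Create path with data, syncing both the file and its directory entry."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())             # file data and metadata
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)                  # the directory entry naming the file
    finally:
        os.close(dirfd)
```

That is two synchronous operations per new file, which is where the cost
for an MTA queue comes from.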

>> But this just fixes the OE problem - Cyrus would still have a problem
>> (as far as I can see): all the other copies accessing that mailbox
>> will still have their old seen files open (maybe using skiplist fixes
>> this). The flat-file seen implementation needs to check to see if the
>> file has been renamed under it (and do what?).
>
>The flat file database layer (cyrusdb_flat) already knows how to do this at 
>the appropriate time. The caching is being implemented in the seen layer 
>(seen_db.c) not the flat file implementation.

Okay - I'll need to look closer at the code. I'm clearly missing some
detail.

>> To be honest, the flat file seen implementation is way more complicated
>> than I would have thought was worthwhile. My preference would be to
>> not hold the file open, and simply re-write the whole file each time we
>> updated it, renaming the replacement into place (to make the operation
>> atomic - this is also the only synchronous operation). My experience has
>> been that unix is quite happy doing naive things like this while the
>> file remains small (say less than 10k).
>
>Whenever there is a change, the flat file does rewrite the entire file. The 
>database layer holds the file open because the database layer assumes that 
>other operations (reads on other keys, things like that) will follow. Updates are very 
>frequent, which is why the skiplist implementation can perform better.

I think my point is that the cost of open() is roughly equivalent to
the cost of stat() under Solaris - so rather than keep a file open,
and stat it periodically to see if it's changed under you, you can close
and reopen the file (resulting in simpler code, but similar performance).
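
The two approaches, sketched in Python for comparison. The function names
are made up for the example, and this says nothing about how cyrusdb_flat
actually does it.

```python
import os

def replaced_under_us(f, path):
    """True if path no longer names the file that the open handle f refers to."""
    held = os.fstat(f.fileno())
    try:
        current = os.stat(path)
    except FileNotFoundError:
        return True
    # A rename() into place gives the path a different device/inode pair.
    return (held.st_dev, held.st_ino) != (current.st_dev, current.st_ino)

def read_fresh(path):
    """The simpler alternative: hold nothing open, reopen for every read."""
    with open(path, "rb") as f:
        return f.read()
```

The first needs a stat() per check plus the reopen logic when the check
fails; the second pays one open() per read and has no stale state to
manage.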

>However, updates can be an order of magnitude more frequent if we're going 
>to write for every flag change. Cyrus is written with the expectation that 
>you will have thousands of simultaneous clients working on tens or hundreds 
>of thousands of mailboxes.

And an excellent design goal that is... 8-)

I'm guessing, but I suspect OE updates the \Seen flag each time it
downloads a message, and presumably this occurs each time a user selects
a message. So you may only see an update every couple of seconds from
each client - obviously that adds up.

BTW, there may be paid consulting opportunities for people with
demonstrable advanced Cyrus hacking skills in this project. If anyone
is interested, let me know.

>> I implemented a Postfix map that works this way - for lookups, it simply
>> does a linear read/search of the file. For update, it writes a new file,
>> and moves it into place. Generally this performed much better than
>> more complex schemes such as the Sleepycat DBs - particularly when you
>> consider memory footprint (this was on a machine with about 100k users,
>> handling tens of messages per second).
>
>It doesn't scale when there are frequent updates. That's why we have the 
>database abstraction, so we can choose the file format that does the job 
>most effectively. cyrusdb_flat does exactly this, and it works ok when you 
>don't need frequent updates. Seen state has frequent updates.

Actually, it scaled better than initially expected - this map type
was used specifically for tables that changed very frequently (the
pop-before-smtp pre-auth mechanism being a case in point). The only
synchronous operation was the rename(). The lookup read()s would have
been pulling the data from the buffer cache, and sequential searches
beat more complex schemes every time when the dataset is small (less
than 100kB was the figure we found when comparing to things like libdb).
The saving in resident set size was critical too - the machine had 4G
of RAM, and no more could be fitted.
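
A rough Python sketch of a map that works this way: linear search for
lookups, whole-file rewrite plus rename() for updates. The tab-separated
file format and the function names are assumptions for the example; the
original Postfix map is not shown here, and locking between concurrent
writers is left out.

```python
import os

def lookup(path, key):
    """Linear scan of a 'key<TAB>value' text file; returns None if absent."""
    try:
        with open(path, "r") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition("\t")
                if k == key:
                    return v
    except FileNotFoundError:
        pass
    return None

def update(path, key, value):
    """Rewrite the whole file with key set to value, then rename into place."""
    entries = {}
    try:
        with open(path, "r") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition("\t")
                entries[k] = v
    except FileNotFoundError:
        pass
    entries[key] = value
    tmp = f"{path}.tmp.{os.getpid()}"
    with open(tmp, "w") as f:
        for k, v in entries.items():
            f.write(f"{k}\t{v}\n")
    os.replace(tmp, path)   # atomic on POSIX: readers see old or new, never partial
```

Readers never need a lock, since they always open either the complete old
file or the complete new one.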

-- 
Andrew McNamara, Senior Developer, Object Craft
http://www.object-craft.com.au/



