MUPDATE database problems -- help greatly appreciated

Sat Jun 13 16:22:03 EDT 2009

Hello all,

We're in the middle of moving from our single-server installation to a
new murder installation on all-new hardware.  We were in the late stages
of setup when we ran into a killer problem: getting the old server to sync
up with the MUPDATE server so that we can migrate off of it.  We're under
a deadline to get the expensive new hardware rolled out into production,
so any help would be enormously appreciated.

The test installation with a test backend of, oh, a couple dozen mailboxes 
worked flawlessly.  Syncing happened just as it was supposed to, and 
everything looked good for production.  The next step was to start the old 
server syncing its database with the MUPDATE server, and that's where we're 
stuck.

The initial sync from the old backend works just fine.  During the second
sync, however (ctl_mboxlist -m), the backend connects to the MUPDATE
server and executes a LIST <servername>.  The server returns somewhere
between 2,500 and 10,000 lines (of an 830k+ mailbox database) and then
freezes.  A combination of telemetry logs and truss output shows that the
server records itself as having sent more data than the client has
received, but truss'ing the client shows it waiting expectantly in a read
state.  (The server continues to spin in a fstat/stat/fcntl/fcntl cycle on
the mailboxes database, which as far as I can tell is normal behavior for
the skiplist driver, but still looks really weird in a truss.)

Now, here's where it gets even weirder: if I connect using mupdatetest and 
issue the same LIST command and let it run, the command runs to completion 
without error.  However, if at some point I use flow control on my ssh
session and hit ^S, then ^Q, the scrolling continues briefly, and then
the server hangs in much the same way as above.  To make things even
odder, when I run a super-aggressive truss on the process (truss -aeflE -v 
all), the error never occurs.  It's as if slowing down the mupdate process 
keeps it out of whatever error state it gets into.
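
One way to take ssh flow control out of the picture might be a small
slow-reader harness that pauses once mid-stream and then reports how many
bytes arrived, so the count can be compared against the server's telemetry
log.  The sketch below is illustrative only: read_with_pause and all of its
parameters are made-up names, it does not speak MUPDATE or authenticate,
and it assumes you hand it an already-connected socket on which the LIST
has been issued.

```python
import socket
import time


def read_with_pause(sock, pause_after=65536, pause_seconds=5.0,
                    stall_timeout=30.0, chunk=4096):
    """Read from sock, pausing once after pause_after bytes.

    Returns (total_bytes, stalled).  stalled is True when no data arrived
    for stall_timeout seconds, False when the peer closed the connection.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    sock.settimeout(stall_timeout)
    total = 0
    paused = False
    while True:
        if not paused and total >= pause_after:
            # Stop reading for a while, roughly what ^S / ^Q does to the
            # terminal side of an interactive session.
            time.sleep(pause_seconds)
            paused = True
        try:
            data = sock.recv(chunk)
        except socket.timeout:
            # Nothing arrived within stall_timeout: treat as a hang.
            return total, True
        if not data:
            # Peer closed the connection cleanly.
            return total, False
        total += len(data)
```

If the returned byte count is lower than what the server logged as sent and
stalled is True, that would match the hang described above.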

To make matters stranger, when I switched the MUPDATE mboxlist to the
berkeley-hash driver, the MUPDATE server failed to return anything from a
LIST command, even when its database was full of matching entries.  When
ctl_mboxlist -m was run, an assert() failed and the process exited without
performing any work.

Because of all of this, I suspect something is going wrong with a buffer
filling up ungracefully somewhere.  The spot I'm attacking right now is the
64-bit build -- I'm spending the weekend in the office rebuilding
everything as 32-bit instead (libraries from the ground up), in case
there's some problem with a different interpretation of size_t or some such
thing in the 64-bit world.  I'll share any findings in a few days, but I
wanted to get this out earlier.
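
For reference, this is the kind of buffer handling I mean by "graceful".
It is a generic Python sketch of a non-blocking send loop, not the Cyrus
code (which is C), and send_all_nonblocking is a made-up name.  The point
is only that a short write or a full kernel send buffer has to be followed
by a wait-until-writable and a retry from the right offset; if either step
is skipped, the sender can believe it has sent more than the receiver ever
sees.

```python
import select
import socket


def send_all_nonblocking(sock, data, timeout=30.0):
    """Send all of data on a non-blocking socket and return the byte count.

    Hypothetical helper for illustration.  Raises TimeoutError if the peer
    does not drain its side within timeout seconds.
    """
    sock.setblocking(False)
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        try:
            # send() may accept only part of the buffer (a short write),
            # so always resume from the current offset.
            sent += sock.send(view[sent:])
        except BlockingIOError:
            # Kernel send buffer is full: wait until the socket is
            # writable again instead of dropping data or spinning.
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                raise TimeoutError(f"peer stalled after {sent} bytes")
    return sent
```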

We've eliminated hardware, OS, network, and compiler-specific errors by
uploading the same database from numerous different clients to numerous
different servers.  (See the combinations tried below.)  I'm open to any
and all suggestions at this point.

Michael Bacon
ITS Messaging
UNC Chapel Hill

Current system information:
Hardware:  Sun T5220s (Sparc CoolThreads architecture) running Solaris 10
Build: 64-bit binaries built using the Sun SPro compiler (to get 
CoolThreads optimizations)
Configuration: tlscache, duplicate, and mboxlist_db all defined to skiplist

Combinations tried: (backend client -> mupdate server)
(all builds currently 64 bit 2.3.13)

Sun 6800+Sol 9+gcc build -> Sun 5220+Sol 10+spro build
Sun 6800+Sol 9+gcc build -> Sun 5120+Sol 10+spro build
Sun 280R+Sol 9+gcc build -> Sun 5220+Sol 10+spro build
Sun 280R+Sol 9+gcc build -> same machine, separate cyrus install over localhost
Sun 5220+Sol 10+spro build -> Sun 5220+Sol 10+spro build
Sun 5220+Sol 10+spro build -> Sun 280R+Sol 9+gcc build

We tried others too, but this covers most of the important combinations, I
think.