Repeat recovers on databases

Thu Jun 18 17:44:19 EDT 2009

Another one stomped here.  This time, it's a 32/64 bit issue.  myinit in 
cyrusdb_skiplist.c assumes that time_t is 4 bytes long, and writes out that 
many from the current timestamp when creating $confdir/db/skipstamp.  On 
64-bit Solaris, time_t is 8 bytes (it's typedef'ed as a long).  x86 is 
little-endian, so there the first four bytes are the ones with the real 
data in them, and something meaningful gets written out.  SPARC is 
big-endian, so the first four bytes are the high-order half of the value.  
No such luck.
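
To make the byte layout concrete, here is a small sketch that simulates what 
the first four bytes of an 8-byte time_t look like on each kind of host after 
the htonl() call.  The helper names are made up for illustration and are not 
Cyrus code, uint64_t stands in for the 8-byte time_t, and the timestamp is 
just an example value from around the date of this post.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value (what htonl() does on x86). */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Memory image of a 64-bit value on a big-endian host. */
static void store_be64(uint64_t v, unsigned char out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Memory image of a 64-bit value on a little-endian host. */
static void store_le64(uint64_t v, unsigned char out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(v >> (8 * i));
}

/* Bytes that write(fd, &a, 4) would emit on a big-endian 64-bit host. */
static void first4_big_endian(uint32_t stamp, unsigned char out[4])
{
    unsigned char img[8];
    uint64_t a = stamp;            /* htonl() is a no-op here */
    store_be64(a, img);
    memcpy(out, img, 4);           /* high-order half: all zeroes */
}

/* Bytes that write(fd, &a, 4) would emit on a little-endian 64-bit host. */
static void first4_little_endian(uint32_t stamp, unsigned char out[4])
{
    unsigned char img[8];
    uint64_t a = swap32(stamp);    /* htonl() byte-swaps here */
    store_le64(a, img);
    memcpy(out, img, 4);           /* the stamp, most significant byte first */
}
```

Feeding both helpers the example stamp 1245361459 gives four zero bytes for 
the big-endian case and 4A 3A B5 33 for the little-endian case.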

So, when ctl_cyrusdb decides to recover the database, it writes out four 
bytes of data, all of which happen to be zeroes.  Henceforth, every process 
that looks at the database goes, "oh, look, the database needs recovering!" 
then spends 55 seconds recovering it before it does any meaningful work, 
then proceeds to write out 4 bytes of zeroes into the skipstamp file.  The 
next process comes along, reads the skipstamp file, and goes, "oh, look, 
the database needs recovering!"

The fix for it is below.  I will also open a bugzilla issue for this.

Always remember boys and girls, when you ASS-UM-E the bit size of types, 
you make lots of ASSemblers go "UM...." exponentially.

Michael Bacon
ITS Messaging
UNC Chapel Hill


===================================================================
RCS file: /cvs/src/cyrus/lib/cyrusdb_skiplist.c,v
retrieving revision 1.64
diff -u -r1.64 cyrusdb_skiplist.c

--- cyrusdb_skiplist.c  8 Oct 2008 15:47:08 -0000       1.64
+++ cyrusdb_skiplist.c  18 Jun 2009 21:42:30 -0000
@@ -239,7 +239,7 @@

        if (r != -1) r = ftruncate(fd, 0);
        a = htonl(global_recovery);
-       if (r != -1) r = write(fd, &a, 4);
+       if (r != -1) r = write(fd, &a, sizeof(time_t));
        if (r != -1) r = close(fd);

        if (r == -1) {
@@ -252,7 +252,7 @@

        fd = open(sfile, O_RDONLY, 0644);
        if (fd == -1) r = -1;
-       if (r != -1) r = read(fd, &a, 4);
+       if (r != -1) r = read(fd, &a, sizeof(time_t));
        if (r != -1) r = close(fd);

        if (r == -1) {
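

One thing to note about the patch above: it makes the size of skipstamp 
depend on sizeof(time_t), so a 32-bit and a 64-bit build would disagree 
about the file.  A possible alternative, sketched here purely as an 
illustration and not as what the patch does, is to keep the on-disk stamp at 
a fixed four bytes by encoding it explicitly.  stamp_encode and stamp_decode 
are hypothetical names, not functions in cyrusdb_skiplist.c.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: pack a timestamp into exactly four bytes,
 * most significant byte first, whatever sizeof(time_t) is.
 * Assumption: truncating to 32 bits is acceptable for this stamp. */
static void stamp_encode(time_t when, unsigned char out[4])
{
    uint32_t v = (uint32_t)when;
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Hypothetical helper: the inverse of stamp_encode. */
static time_t stamp_decode(const unsigned char in[4])
{
    uint32_t v = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    return (time_t)v;
}
```

With helpers like these, the write and read calls would pass the 4-byte 
buffer rather than the address of a time_t, and the length could stay at 4 on 
every platform.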


--On June 15, 2009 10:07:34 AM -0400 Michael Bacon <baconm at email.unc.edu> 
wrote:

> This appears to be an issue in addition to the freeze-ups we're having.
>
> Given all the dumping and undumping I'm doing in the name of debugging,
> this may not be surprising, but I keep seeing instances where a database
> gets into some state wherein any process that opens it decides to run a
> recover on it before doing anything.  Running a ctl_cyrusdb -r, even with
> all other processes stopped, does not seem to change this behavior.  The
> next time a cyrus process starts up, whether it's an imapd, mupdate, or
> ctl_mboxlist, the process goes and does a recover before doing anything
> else.
>
> Has anyone else seen this?  I've seen it on brand-new, newly "undumped"
> databases in the past week.
>
> Michael Bacon
> ITS Messaging
> UNC Chapel Hill