Trouble restoring BDB databases on new OS... confused

Sat Mar 8 11:07:01 EST 2008

SUMMARY: After upgrading Ubuntu, Cyrus's "ctl_cyrusdb -r" spins at 100% CPU in a tight loop, not in a system call, possibly somewhere inside Berkeley DB.  Blowing away the db dir works, and I don't think there was anything important there, but what happened?

DETAILS:

I've just upgraded from Ubuntu Dapper Drake 32-bit to Ubuntu Gutsy Gibbon 64-bit.  I *think* I was running a source-built Cyrus IMAPd 2.2.13.  I had repointed my /var/lib/imap to /mail/imap, and was using Berkeley DB 4.3 for anything that still used BDB.  I do know that checkpointing hasn't run for a very, very long time, because when I switched from the packaged version to the source-built one, I forgot to update the path for ctl_cyrusdb in cyrus.conf.  Oops.

So, back on Gutsy: I installed Cyrus from the Ubuntu packages; it says it's 2.2.13-11ubuntu1.  I also installed libdb4.3.  Started up Cyrus, and "ctl_cyrusdb -r" loops at 100% CPU and has to be killed with kill -9.  db_recover, db_verify, etc. all had the same symptoms (but see below).  According to strace, ctl_cyrusdb was not opening *any* of the .db files; it only looked at:

1. itself
2. some libraries
3. imapd.conf
4. /mail/imap/DB_CONFIG (doesn't exist)
5. /var/tmp
6. /mail/imap/db/__db.001

When I look in /mail/imap/db, I see:

  2006-09-25 13:14 log.0000000050.old
  2006-09-26 18:01 skipstamp
--
  2006-09-26 18:01 __db.001 
    through 
  2006-09-26 18:01 __db.005 
--
  2007-06-17 12:16 log.0000000060
    through
  2008-03-05 13:49 log.0000000076

So as I understand it, this isn't "the" database; it's BDB's environment files and transaction logs, shared by whichever of the various databases use BDB:

/mail/imap# file *.db
annotations.db:  Cyrus skiplist DB
deliver.db:      Berkeley DB (Btree, version 8, native byte-order)
mailboxes.db:    Cyrus skiplist DB
tls_sessions.db: Berkeley DB (Btree, version 8, native byte-order)

Not understanding that, I was trying to db4.3_recover the db/ directory itself, and saw the same symptoms: db_recover would open and mmap the __db.001 file, and then completely hang and need a kill -9 to go away.  Same for db_stat, db_verify, db_dump, db_printlog.

(Hey, maybe that's a clue: Is it trying to open a 32-bit shared-memory region on a 64-bit OS, or something like that?)
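
If that guess is right, the __db.00N files would be the suspects: as far as I understand, they are the environment's shared-memory region files, laid out for the machine that created them, whereas the log.* files and the .db files themselves are meant to be portable.  A minimal sketch of setting just those files aside instead of deleting the whole directory (stash_bdb_regions is a made-up helper name, and this assumes no Cyrus or other BDB process has the environment open):

```shell
# Move Berkeley DB region files (__db.NNN) out of an environment directory,
# leaving log.* files and everything else in place.
# Assumption: nothing currently has this BDB environment open.
stash_bdb_regions() {
    env_dir=$1
    stash_dir=$2
    mkdir -p "$stash_dir" || return 1
    for f in "$env_dir"/__db.[0-9][0-9][0-9]; do
        # The glob stays literal when nothing matches; skip that case.
        [ -e "$f" ] || continue
        mv "$f" "$stash_dir"/ || return 1
    done
}
```

Something like "stash_bdb_regions /mail/imap/db /mail/imap/db-regions.old" followed by "db4.3_recover -h /mail/imap/db" would then be the thing to try.  This is untested against the hang described here.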

Figuring that the db/ directory would contain nothing but deliver.db and tls_sessions.db data, and that (from what I read) neither is important state info for a one-user mail system, I just blew away the db/ directory and deliver.db.  (I didn't think to blow away tls_sessions.db but it seems happy enough now.)

So everything *seems* to be working OK now, but I don't quite understand what happened, or what I should have done to fix it properly (other than having run checkpointing in the first place).

If I go back to a freshly-restored backup, understanding what the different DBs are now, I still see weird behavior:  

/tmp/berkeley-recover# ls
annotations.db  db/  deliver.db  log.0000000001  mailboxes.db  tls_sessions.db
/tmp/berkeley-recover# db4.3_recover
[returns instantly]
/tmp/berkeley-recover# db4.3_verify deliver.db
db_verify: Page 239: incorrect next_pgno 244 found in leaf chain (should be 60)
db_verify: Page 60: incorrect prev_pgno 44 found in leaf chain (should be 239)
... [many more linking errors] ...
db_verify: Page 0: page 250 encountered a second time on free list
db_verify: deliver.db: DB_VERIFY_BAD: Database verification failed
/tmp/berkeley-recover# db4.3_recover -c
[returns instantly]
/tmp/berkeley-recover# db4.3_verify -N deliver.db
[same results as before]
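
One avenue not tried in the transcript above is a salvage dump and reload, which is the usual route when db_verify reports a damaged Btree.  A rough sketch, assuming the Ubuntu-style tool names db4.3_dump and db4.3_load are installed (salvage_bdb and the DB_DUMP / DB_LOAD overrides are made-up names for illustration):

```shell
# Rebuild a damaged Berkeley DB file from whatever db_dump can salvage.
# Assumptions: db4.3_dump and db4.3_load are on PATH (override them with
# DB_DUMP / DB_LOAD), and the destination file does not exist yet.
salvage_bdb() {
    src=$1
    dst=$2
    dump_file=$dst.dump
    # -r asks db_dump to salvage readable key/data pairs from a corrupt file.
    # It may exit non-zero on a damaged file even when it recovered records,
    # so judge by whether it produced any output.
    "${DB_DUMP:-db4.3_dump}" -r "$src" > "$dump_file"
    [ -s "$dump_file" ] || return 1
    # db_load reads the dump on stdin and builds a fresh database file.
    "${DB_LOAD:-db4.3_load}" "$dst" < "$dump_file" || return 1
    rm -f "$dump_file"
}
```

For example, "salvage_bdb deliver.db deliver.db.new", then run db4.3_verify on the new file before swapping it in.  For a throwaway database like deliver.db, simply deleting it is probably just as good.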

Can anyone give me more insights into what I'm seeing, so I know better next time?