Trouble restoring BDB databases on new OS... confused
jlevitt at berklee.net
Sat Mar 8 11:07:01 EST 2008
SUMMARY: After upgrading Ubuntu, the Cyrus and Berkeley DB tools spin in a 100% CPU tight loop (not blocked in a system call), apparently somewhere inside Berkeley DB. Blowing away the db dir works, and I don't think there was anything important there, but what happened?
I've just upgraded from Ubuntu Dapper Drake 32-bit to Ubuntu Gutsy Gibbon 64-bit. I *think* I was running a source-built Cyrus IMAPD 2.2.13. I had repointed my /var/lib/imap to /mail/imap, and was using BerkeleyDB 4.3 for anything that still used BDB. I do know that checkpointing hasn't run for a very, very long time, because when I switched from the packaged version to the source-built one, I forgot to update the path for ctl_cyrusdb in cyrus.conf. Oops.
So, back on Gutsy: I installed Cyrus from the Ubuntu packages; it says it's 2.2.13-11ubuntu1. I also installed libdb4.3. I started up Cyrus, and "ctl_cyrusdb -r" loops at 100% CPU and has to be killed with kill -9. db_recover, db_verify, etc. all had the same symptoms (but see below). According to strace, ctl_cyrusdb was not opening *any* of the .db files; it only looked at:
2. some libraries
4. /mail/imap/DB_CONFIG (doesn't exist)
When I look in /mail/imap/db, I see:
2006-09-25 13:14 log.0000000050.old
2006-09-26 18:01 skipstamp
2006-09-26 18:01 __db.001
2006-09-26 18:01 __db.005
2007-06-17 12:16 log.0000000060
2008-03-05 13:49 log.0000000076
So as I understand it, this isn't "the" database; it's BDB transactions and/or log files for ALL of the various databases:
/mail/imap# file *.db
annotations.db: Cyrus skiplist DB
deliver.db: Berkeley DB (Btree, version 8, native byte-order)
mailboxes.db: Cyrus skiplist DB
tls_sessions.db: Berkeley DB (Btree, version 8, native byte-order)
Not understanding that, I was trying to db4.3_recover the db/ directory itself, and saw the same symptoms: db_recover would open and mmap the __db.001 file, and then completely hang and need a kill -9 to go away. Same for db_stat, db_verify, db_dump, db_printlog.
(Hey, maybe that's a clue: Is it trying to open a 32-bit shared-memory region on a 64-bit OS, or something like that?)
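One way that guess could be tested is sketched below. It rests on an assumption: as far as I understand the Berkeley DB documentation, the __db.NNN files are shared-memory region files that are not portable across architectures and are recreated when the environment is recovered, while the log.* files are portable. The helper name set_aside_regions and the regions.stale directory are made up for illustration; they are not part of Cyrus or Berkeley DB. It moves the region files aside rather than deleting them, and leaves the logs and skipstamp alone.

```shell
#!/bin/sh
# Sketch only: set aside the Berkeley DB region files (__db.NNN) in an
# environment directory, leaving log.* and everything else untouched.
# set_aside_regions is a hypothetical helper, not a Cyrus or BDB tool.
set_aside_regions() {
    env_dir=$1
    stash="$env_dir/regions.stale"   # hypothetical holding directory
    mkdir -p "$stash" || return 1
    for f in "$env_dir"/__db.*; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -e "$f" ] || continue
        mv "$f" "$stash"/ || return 1
    done
}

# Intended use, on a COPY of the restored backup, with Cyrus stopped
# (assumes the db/ subdirectory is the BDB environment home):
#   set_aside_regions /tmp/berkeley-recover/db
#   db4.3_recover -h /tmp/berkeley-recover/db
```

If db4.3_recover then finishes instead of spinning, that would point at the stale 32-bit region files rather than at the logs or the databases themselves. It would not by itself explain the db_verify errors on deliver.db.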
Figuring that the db/ directory would contain nothing but deliver.db and tls_sessions.db data, and that (from what I read) neither is important state info for a one-user mail system, I just blew away the db/ directory and deliver.db. (I didn't think to blow away tls_sessions.db, but it seems happy enough now.)
So everything *seems* to be working OK now, but I don't quite understand what happened, and what I was supposed to do to "fix" it more properly (other than having run checkpointing in the first place).
If I go back to a freshly-restored backup, understanding what the different DBs are now, I still see weird behavior:
annotations.db db/ deliver.db log.0000000001 mailboxes.db tls_sessions.db
/tmp/berkeley-recover# db4.3_verify deliver.db
db_verify: Page 239: incorrect next_pgno 244 found in leaf chain (should be 60)
db_verify: Page 60: incorrect prev_pgno 44 found in leaf chain (should be 239)
... [many more linking errors] ...
db_verify: Page 0: page 250 encountered a second time on free list
db_verify: deliver.db: DB_VERIFY_BAD: Database verification failed
/tmp/berkeley-recover# db4.3_recover -c
/tmp/berkeley-recover# db4.3_verify -N deliver.db
[same results as before]
Can anyone give me more insights into what I'm seeing, so I know better next time?