Seen databases

Wed Apr 28 22:58:22 EDT 2010

On Thu, Apr 29, 2010 at 08:44:54AM +1000, Rob Mueller wrote:
> Whether to go seen_db or seen_bigdb, that's trickier. seen_db is
> what almost everyone uses now, but seen_bigdb seems almost sane
> since in most cases, the user's own seen state will be in the
> cyrus.index.

That's what I figured...

> There's one issue with seen_bigdb though, you really would have to
> use a real DB (eg bdb or skiplist), not the text file db.

Yes, definitely.  We use skiplist for seen_db at the moment anyway.
Also seen_db is what most people use, so it's pretty well tested.

> The other issue I can see is that the seen db is indexed by folder
> uniqueid. How "unique" are folder IDs? They're generated in a pretty
> ad hoc fashion, and it's always scared me that it might be too easy
> to generate clashes (when restoring from backups especially), which
> would be especially bad for a seen_bigdb.

It doesn't really matter for a seen_bigdb, because they'll be keyed
by user AND uniqueid - meaning they are no more likely to generate
clashes than they were before under seen_db.
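
To make the keying argument concrete, here is a rough sketch.  The key
layouts and the NUL separator are assumptions for illustration only, not
the actual Cyrus on-disk format, and both function names are made up.

```python
# Hypothetical key layouts; NOT the real Cyrus on-disk format.

def seen_db_key(uniqueid: str) -> bytes:
    # Per-user seen_db: the database already belongs to one user,
    # so the folder uniqueid alone identifies the record.
    return uniqueid.encode()

def seen_bigdb_key(userid: str, uniqueid: str) -> bytes:
    # Shared seen_bigdb: prefix the user so two users can never share a key.
    # The NUL separator is an assumption (it must not occur in either part).
    return userid.encode() + b"\x00" + uniqueid.encode()

# Two users with seen state for a folder that has the same uniqueid.
bigdb = {}
bigdb[seen_bigdb_key("alice", "4a3b2c1d")] = "1:10"
bigdb[seen_bigdb_key("bob", "4a3b2c1d")] = "1:3"
```

Both records survive side by side, so a clash still needs the same user
AND the same uniqueid, exactly as it did with one seen_db per user.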

Besides, clashes only matter within the non-user folders now, since a
user's own seen state lives in the cyrus.index.

More interesting is the potential for clashes during replication, which
would generate a rename event across users.  That could get super-ugly!

But it's not a high risk - the ad hoc uniqueid is a hash of the folder
name concatenated with the uidvalidity, so you'd have to have a hash
collision and creation at the same second.  Restore from backup after
a rename is the disaster case.  The best way to protect against that is
to move the cyrus.header data into a central DB and scan it for matches
before creating an entry.  Either key an "index" db against the uniqueid
directly, or just do a full table scan.  The IMAP "LIST" command already
does a full table scan, so it can't be TOO expensive :)
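
A sketch of the two lookup options, under loud assumptions: CRC32 stands
in for whatever hash Cyrus really uses, I've read "hash of the folder
name concatenated with the uidvalidity" as hash(name) followed by the
uidvalidity, and a plain dict stands in for the central DB.  All of the
function names are hypothetical.

```python
import zlib
from typing import Dict, Optional

def make_uniqueid(name: str, uidvalidity: int) -> str:
    # Stand-in for the ad hoc uniqueid: CRC32 of the folder name,
    # followed by the uidvalidity.  Not the real Cyrus function.
    return f"{zlib.crc32(name.encode()):08x}{uidvalidity:08x}"

def find_clash_by_scan(mailboxes: Dict[str, str], uniqueid: str) -> Optional[str]:
    # Option 1: full table scan over name -> uniqueid.
    for name, existing in mailboxes.items():
        if existing == uniqueid:
            return name
    return None

def create_entry(mailboxes: Dict[str, str], index: Dict[str, str],
                 name: str, uniqueid: str) -> None:
    # Option 2: consult an "index" db keyed by uniqueid before creating.
    clash = index.get(uniqueid)
    if clash is not None:
        raise ValueError(f"uniqueid {uniqueid} already used by {clash}")
    mailboxes[name] = uniqueid
    index[uniqueid] = name

mailboxes: Dict[str, str] = {}
index: Dict[str, str] = {}

# The folder was created as user.alice.work, then renamed; it keeps its uniqueid.
uid = make_uniqueid("user.alice.work", 1272495502)
create_entry(mailboxes, index, "user.alice.projects", uid)

# Restoring the old name from backup brings back the same uniqueid.
clash = find_clash_by_scan(mailboxes, uid)
```

Here the scan reports that user.alice.projects already owns the uniqueid,
so the restore can be refused or given a fresh one.  The index lookup gives
the same answer without walking the whole table, at the cost of keeping a
second db in sync.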

Bron.