choosing a file system

Thu Jan 8 22:32:04 EST 2009

On Thu, 08 Jan 2009 20:03 -0500, "Dale Ghent" <daleg at elemental.org> wrote:
> On Jan 8, 2009, at 7:46 PM, Bron Gondwana wrote:
> 
> > We run one zfs machine.  I've seen it report issues on a scrub
> > only to not have them on the second scrub.  While it looks shiny
> > and great, it's also relatively new.
>
> Wait, weren't you just crowing about ext4? The filesystem that was  
> marked GA in the linux kernel release that happened just a few weeks  
> ago? You also sound pretty enthusiastic, rather than cautious, when  
> talking about btrfs and tux3.

I was saying I find it interesting.  I wouldn't seriously consider
using it for production mail stores just yet.  But I have been testing
it on my laptop, where I'm running an offlineimap replicated copy of
my mail.  I wouldn't consider btrfs for production yet either, and
tux3 isn't even on the radar.  They're interesting to watch though,
as is ZFS.

I also said (or at least meant) that if you have commercial support,
ext4 is probably going to be the next evolutionary step from ext3.

> ZFS, and anyone who even remotely seriously follows Solaris would know  
> this, has been GA for 3 years now. For someone who doesn't have their  
> nose buried in Solaris much or with any serious attention span, I  
> guess it could still seem new.

Yeah, it's true - but I've heard anecdotes of people losing entire
zpools due to bugs.  Google turns up things like:

http://www.techcrunch.com/2008/01/15/joyent-suffers-major-downtime-due-to-zfs-bug/

which points to this thread:

http://www.opensolaris.org/jive/thread.jspa?threadID=49020&tstart=0

and finally this comment:

http://www.joyeur.com/2008/01/16/strongspace-and-bingodisk-update#c008480

Not something I would want happening to my entire universe, which is
why having ~280 separate filesystems (at the moment) with our email
spread across them means that a rare filesystem bug is only likely to
affect a single store if it bites - and we can restore one store's
worth of users a lot quicker than the whole system.
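The arithmetic behind that can be sketched quickly.  The 280 comes from
the paragraph above; the total user count is a made-up number purely
for illustration, and the even spread of users across stores is an
assumption.

```python
def blast_radius(total_users, stores):
    """Users hit when one store is lost, assuming an even spread of users."""
    per_store = total_users / stores
    return per_store, per_store / total_users


# total_users is hypothetical; only the store count comes from the text.
per_store, fraction = blast_radius(total_users=1_000_000, stores=280)
print(f"{per_store:.0f} users per store, {fraction:.2%} of the system")
```

Restore time scales roughly the same way, since only one store's worth
of data has to be copied back.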

It's the same reason we prefer Cyrus replication (and put a LOT of work
into making it stable - check this mailing list from a couple of years
ago.  I wrote most of the patches that stabilised replication between
2.3.3 and 2.3.8).

If all your files are on a single filesystem then a rare bug only has
to hit once.  A frequent bug on the other hand, well - you'll know
about it pretty fast... :)  None of the filesystems mentioned have
frequent bugs (except btrfs and probably tux3 - but they ship with
big fat warnings all over).

> As for your x4500, I can't tell if those syslog lines you pasted were  
> from Aug. 2008 or 2007, but certainly since 2007 the marvell SATA
> driver has seen some huge improvements to work around some pretty  
> nasty bugs in the marvell chipset. If you still have that x4500, and  
> have not applied the current patch for the marvell88sx driver, I  
> highly suggest doing so. Problems with that chip are some of the  
> reasons Sun switched to the LSI 1068E as the controller in the x4540.

I think it was 2007 actually.  We haven't had any trouble with it for
a while, but then it doesn't do much.  The big zpool is just used
for backups, which are pretty much one .tar.gz and one .sqlite3 file
per user - and the .sqlite3 file is just an index of the .tar.gz file,
so we can rebuild it by reading the tar file if needed.
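As a rough illustration of why the index is disposable, here is a
minimal sketch of rebuilding one from the archive.  This is not the
actual backup code: the function name, the "members" table and its
columns are all invented for the example.

```python
import sqlite3
import tarfile


def rebuild_index(tar_path, db_path):
    """Recreate a SQLite index of a .tar.gz by scanning the archive once.

    The schema (a single hypothetical "members" table) is an assumption.
    Returns the number of regular files found in the archive.
    """
    with tarfile.open(tar_path, "r:gz") as tar:
        rows = [(m.name, m.size, int(m.mtime)) for m in tar if m.isfile()]

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS members")
            conn.execute(
                "CREATE TABLE members "
                "(name TEXT PRIMARY KEY, size INTEGER, mtime INTEGER)"
            )
            # If a name appears twice in the tar, the later entry wins.
            conn.executemany(
                "INSERT OR REPLACE INTO members VALUES (?, ?, ?)", rows
            )
    finally:
        conn.close()
    return len(rows)
```

Losing the .sqlite3 file then costs one sequential read of the tar
file, not a restore.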

As a counterpoint to some of the above, we had an issue with Linux
where there was a bug in 64-bit writev handling of mmaped space.  If
you were doing a writev with a mmaped space that crossed a page boundary
and the following page wasn't mapped in, it would inject spurious zero
bytes in the output where the start of the next page belonged.
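The call pattern involved looks roughly like the sketch below.  This is
not the original test case and makes no claim to trigger the bug; it
just writes a buffer that straddles a page boundary of an mmaped file
with writev and hands back what came out, so it can be compared with
what went in.  The function name and the fill bytes are invented for
the example, and it needs a platform with os.writev (e.g. Linux).

```python
import mmap
import os
import tempfile


def writev_across_page_boundary(span=64):
    """writev() a slice of an mmaped file that crosses a page boundary.

    Returns (expected, actual) so the caller can compare the two.
    """
    page = mmap.PAGESIZE
    half = span // 2
    with tempfile.TemporaryDirectory() as d:
        src_path = os.path.join(d, "src")
        dst_path = os.path.join(d, "dst")
        # Two pages of non-zero bytes, so injected zeros would stand out.
        with open(src_path, "wb") as f:
            f.write(b"\xaa" * page + b"\xbb" * page)
        with open(src_path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        base = memoryview(mm)
        view = base[page - half:page - half + span]  # straddles the boundary
        fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.writev(fd, [view])
        finally:
            os.close(fd)
            view.release()
            base.release()
            mm.close()
        with open(dst_path, "rb") as f:
            actual = f.read()
    expected = b"\xaa" * half + b"\xbb" * (span - half)
    return expected, actual
```

On a kernel without the bug the two values should be identical; the
corruption described above would show up as zero bytes where the
second page's data belongs.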

It took me a few days to prove it was the kernel and create a repeatable
test case, and then, after some back and forth with Linus and a couple
of other developers, we fixed it and tested it _that_day_.  I don't know
anyone with even unobtanium level support with a commercial vendor who
has actually had that sort of turnaround.

This caused pretty massive file corruption, especially of our skiplist
files, but of bits of every other meta file too.  Luckily, as per above,
we had only upgraded one machine.  We generally do that with new kernels
or software versions - upgrade one production machine and watch it for
a bit.  We also test things on testbed machines first, but you always
find something different on production.  The mmap over boundaries case
was pretty rare - only a few per day would actually cause a crash, the
others were silent corruption that wasn't detected at the time.

If something like this had hit our only machine, we would have been
seriously screwed.  Since it only hit one machine, we could apply the
fix and re-replicate all the damaged data from the other machine.  No
actual data loss.

Bron.
-- 
  Bron Gondwana
  brong at fastmail.fm