MMAP performance and using mmap writes

Fri Nov 30 06:42:35 EST 2018

On Fri, Nov 30, 2018, at 22:09, Howard Chu wrote: 
> Bron Gondwana wrote: 
> > Hi All, 
> >  
> > We were debugging the CPU usage in a ctl_conversationsdb rebuild yesterday, and noticed an interesting thing. 70% of the CPU utilisation for this one process 
> > was inside the kernel! Mostly with dirty pages. 
> >  
> > ctl_conversationsdb -R is particularly heavy on the twoskip database - it's rewriting a lot of random keys. This leads to writes all over the place, as it 
> > stitches records into the skiplists. 
> >  
> > Of course the "real answer"[tm] is zeroskip, which doesn't do random writes - but until then, we suspect that the cost is largely due to the fact that we use 
> > mmap to read, and fwrite to write! We know that might be less efficient already from Linus' comments about 10 years ago! And I guess here's the proof. 
> >  
> > An option would be to switch to using mmap to write as well. We could easily modify lib/mappedfile to memcpy to do the writes. 
> >  
> > Does anybody see any strong reason not to? 
>  
> I've covered the reasons for/against writing thru mmap in my LMDB design papers. I 
> don't know how relevant all of these are for your use case: 
>  
> 1: writing thru mmap loses any control over write ordering - the OS will page dirty pages out in arbitrary order. 
> If you're using a filesystem that supports ordered writes, it will preserve the ordering of data from write() calls. 

This is not a concern at all. Twoskip is deliberately designed so that it does a single write and then a flush to "dirty" the file. All changes made while dirty are fully revertible if it crashes. It then does an fsync (msync now, I guess!) before a single write which clears the dirty flag. So long as a single 256-byte write is consistent, it's safe.
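To make that sequence concrete, here is a minimal sketch of the same dirty-flag ordering done through a writable mapping. It is not the twoskip code: HEADER_SIZE, DIRTY_OFFSET, map_file_rw() and commit_through_mmap() are hypothetical names, and the real header layout is different. It only shows where the msync calls would sit relative to the memcpy.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HEADER_SIZE  256   /* assumption: the header fits in one small write */
#define DIRTY_OFFSET 0     /* assumption: where the dirty flag lives */

/* Create (or open) a file of `len` bytes and map it read/write, shared. */
unsigned char *map_file_rw(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)len) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);             /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}

/* Apply one change with the dirty-flag ordering. Returns 0 or -1. */
int commit_through_mmap(unsigned char *base, size_t maplen,
                        size_t off, const void *data, size_t len)
{
    assert(off >= HEADER_SIZE && off + len <= maplen);

    /* 1. set the dirty flag and flush the header before anything else */
    base[DIRTY_OFFSET] = 1;
    if (msync(base, HEADER_SIZE, MS_SYNC) < 0) return -1;

    /* 2. make the change; the OS may page this out in any order */
    memcpy(base + off, data, len);

    /* 3. flush the whole mapping so every change is on disk */
    if (msync(base, maplen, MS_SYNC) < 0) return -1;

    /* 4. only now clear the dirty flag, and flush the header again */
    base[DIRTY_OFFSET] = 0;
    if (msync(base, HEADER_SIZE, MS_SYNC) < 0) return -1;
    return 0;
}
```

The arbitrary page-out order only applies during step 2, which is the window in which the file is already marked dirty and therefore recoverable.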

> 2: making the mmap writable opens the possibility of undetectable data structure corruption if any other code 
> is doing stray writes through arbitrary pointers. You need to be very sure your code is bug-free. 

Yes, this is a significant concern. 

> 3: if your DB is larger than RAM, writing thru mmap is slower than using write() syscalls. Whenever you 
> access a page for the first time, the OS will page it in. This is a wasted I/O if all you're doing is 
> overwriting the page with new data. 

I doubt it... especially now we're running on servers with 256GB of RAM. These databases are usually under a gigabyte in size. I also don't think we ever overwrite a page without reading from it first - we're usually updating pointers which we've just had to read.

> 4: you can't use mmap exclusively, if you need to grow the output file. You can only write thru the mapping 
> to pages that already exist. If you need to grow the file, you must preallocate the space, otherwise you 
> get a SEGV when referencing unallocated pages. 

We always know what we're planning to write, so I'm fine with using an ftruncate call on the file descriptor to extend it. 
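A rough sketch of what that could look like: extend the file with ftruncate, remap, then memcpy into the new region. The struct wmap type and the wmap_* functions are made up for illustration and are not the lib/mappedfile API.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical writable mapping; not the lib/mappedfile struct. */
struct wmap {
    int fd;
    unsigned char *base;   /* NULL while the file is empty */
    size_t len;            /* mapped length == file size */
};

/* Drop any old mapping and map the first `len` bytes of the file. */
static int wmap_remap(struct wmap *m, size_t len)
{
    if (m->base && munmap(m->base, m->len) < 0) return -1;
    m->base = NULL;
    m->len = 0;
    if (len == 0) return 0;            /* mmap rejects zero-length maps */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, m->fd, 0);
    if (p == MAP_FAILED) return -1;
    m->base = p;
    m->len = len;
    return 0;
}

int wmap_open(struct wmap *m, const char *path)
{
    struct stat sb;
    m->base = NULL;
    m->len = 0;
    m->fd = open(path, O_RDWR | O_CREAT, 0600);
    if (m->fd < 0) return -1;
    if (fstat(m->fd, &sb) < 0 || wmap_remap(m, (size_t)sb.st_size) < 0) {
        close(m->fd);
        return -1;
    }
    return 0;
}

/* Write through the mapping, growing the file first if needed. */
int wmap_pwrite(struct wmap *m, size_t off, const void *data, size_t len)
{
    size_t end = off + len;
    if (end > m->len) {
        /* extend the file BEFORE touching the new pages, otherwise
         * the access past end-of-file faults */
        if (ftruncate(m->fd, (off_t)end) < 0) return -1;
        if (wmap_remap(m, end) < 0) return -1;
    }
    memcpy(m->base + off, data, len);
    return 0;
}
```

Remapping on every extension is the simplest thing that works; growing in larger chunks would cut down on munmap/mmap calls at the cost of tracking the logical end of the file separately.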

> And a side note, multiple studies have shown that skiplists are not cache-friendly, and thus have 
> inferior performance to B+tree organizations. A skiplist is a very poor choice for a read/write data structure. 

Yeah, hence zeroskip - it's coming. 

> Obviously I would recommend you use something carefully designed and heavily tested, like LMDB, instead 
> of whatever you're using. 

We tried and had a bad experience last time - it didn't fit in well with the expectations of how our code uses databases. I'm not super keen to try again right now. I do appreciate your persistence and passion for your project though :) It's good to see this level of engagement.

> There's one point in favor of writing thru mmap - if you take care of all the other potential gotchas, 
> it will work on every OS that implements mmap. Using mmap for reads, and syscalls for writes, is only 
> valid on OSs with a unified buffer cache. While this isn't a problem on most modern OSs, OpenBSD is a 
> notable example of an OS that lacks this, and so that approach always results in file corruption there. 

Yeah - that's an interesting point to me as well. At the moment we use a wrapper which is called map_stupidshared (don't blame me, was named before my time) which unmaps and remaps every time if the file has been changed. Insanity. It gets tested for during the configure stage. 

We have something even more awful called map_nommap which just reads the entire file into a buffer every time. As you can imagine, performance is awful - but it does work!

Bron. 

-- 
 Bron Gondwana, CEO, FastMail Pty Ltd 
 brong at fastmailteam.com 
