MMAP performance and using mmap writes

Fri Nov 30 06:09:23 EST 2018

Bron Gondwana wrote:
> Hi All,
> 
> We were debugging the CPU usage in a ctl_conversationsdb rebuild yesterday, and noticed an interesting thing.  70% of the CPU utilisation for this one process
> was inside the kernel!  Mostly with dirty pages.
> 
> ctl_conversationsdb -R is particularly heavy on the twoskip database - it's rewriting a lot of random keys.  This leads to writes all over the place, as it
> stitches records into the skiplists.
> 
> Of course the "real answer"[tm] is zeroskip, which doesn't do random writes - but until then, we suspect that the cost is largely due to the fact that we use
> mmap to read, and fwrite to write!  We know that might be less efficient already from Linus' comments about 10 years ago!  And I guess here's the proof.
> 
> An option would be to switch to using mmap to write as well.  We could easily modify lib/mappedfile to memcpy to do the writes.
> 
> Does anybody see any strong reason not to?

I've covered the reasons for/against writing thru mmap in my LMDB design papers. I
don't know how relevant all of these are for your use case:

1: writing thru mmap loses any control over write ordering - the OS will flush dirty pages to disk in arbitrary order,
and msync() is the only way to force them out at a known point.
If you're using a filesystem that supports ordered writes, it will preserve the ordering of data from write() calls.

2: making the mmap writable opens the possibility of undetectable data structure corruption if any other code
is doing stray writes through arbitrary pointers. You need to be very sure your code is bug-free.

3: if your DB is larger than RAM, writing thru mmap is slower than using write() syscalls. Whenever you
access a page for the first time, the OS will page it in. This is a wasted I/O if all you're doing is
overwriting the page with new data.

4: you can't use mmap exclusively if you need to grow the output file. You can only write thru the mapping
to pages that already exist. If you need to grow the file, you must preallocate the space first (e.g. with
ftruncate() or posix_fallocate()), otherwise you get a SIGBUS when referencing pages beyond the end of the file.
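
To make points 1 and 4 concrete, here is a minimal sketch of a memcpy-based write path. The
helper name mapped_pwrite() is hypothetical and is not the lib/mappedfile API; it assumes a
POSIX system and a regular file opened read/write. It grows the file before mapping, copies
the data in, and uses msync() as the only available ordering point.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of lib/mappedfile): write len bytes at
 * offset through a shared writable mapping. Returns 0 on success, -1 on
 * error. */
int mapped_pwrite(int fd, const void *buf, size_t len, off_t offset)
{
    struct stat sb;
    long pagesize = sysconf(_SC_PAGESIZE);

    assert(buf != NULL && offset >= 0);
    if (len == 0)
        return 0;
    if (pagesize <= 0 || fstat(fd, &sb) < 0)
        return -1;

    /* Point 4: the file must already cover the range we touch. Touching
     * mapped pages past EOF raises a signal instead of growing the file.
     * ftruncate() only extends the size; posix_fallocate() would also
     * reserve the disk blocks. */
    off_t end = offset + (off_t)len;
    if (end > sb.st_size && ftruncate(fd, end) < 0)
        return -1;

    /* mmap offsets must be page aligned, so map from the page boundary
     * at or below the requested offset. */
    off_t map_off = offset - (offset % pagesize);
    size_t map_len = (size_t)(end - map_off);

    char *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fd, map_off);
    if (base == MAP_FAILED)
        return -1;

    /* The "write" itself: no syscall, just dirtying pages. */
    memcpy(base + (offset - map_off), buf, len);

    /* Point 1: until msync() returns, the OS may flush these pages in
     * any order relative to other dirty pages. */
    int rc = msync(base, map_len, MS_SYNC);
    if (munmap(base, map_len) < 0)
        rc = -1;
    return rc;
}
```

A real implementation would keep one long-lived mapping rather than mapping per write; the
per-call mmap/munmap here is only to keep the sketch self-contained. Calling
mapped_pwrite(fd, "hello", 5, 5000) on an empty file should leave a 5005-byte file with the
data at offset 5000.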

As a side note, multiple studies have shown that skiplists are not cache-friendly, and thus have
inferior performance to B+tree organizations. A skiplist is a very poor choice for a read/write data structure.

Obviously I would recommend you use something carefully designed and heavily tested, like LMDB, instead
of whatever you're using.

There's one point in favor of writing thru mmap - if you take care of all the other potential gotchas,
it will work on every OS that implements mmap. Using mmap for reads, and syscalls for writes, is only
valid on OSs with a unified buffer cache. While this isn't a problem on most modern OSs, OpenBSD is a
notable example of an OS that lacks this, and so that approach always results in file corruption there.
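
A rough way to see what "unified buffer cache" means in practice is a small probe like the one
below. The function name write_visible_through_mmap() is made up for illustration, and the
probe is only a sketch: it assumes a POSIX system and a scratch file it is allowed to modify.
It reads a byte through a shared read-only mapping, changes that byte with pwrite(), and checks
whether the mapping sees the new value without any msync().

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical probe: returns 1 if a pwrite() to fd is visible through an
 * existing MAP_SHARED read mapping, 0 if the mapping still shows stale
 * data, -1 on error. Overwrites the first byte of the file, so use a
 * scratch file. */
int write_visible_through_mmap(int fd)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pagesize <= 0 || ftruncate(fd, (off_t)pagesize) < 0)
        return -1;

    /* volatile so the compiler really re-reads the mapping each time */
    volatile const char *map = mmap(NULL, (size_t)pagesize, PROT_READ,
                                    MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    /* Fault the page in, then pick a value guaranteed to differ from it. */
    char before = map[0];
    char marker = (char)(before ^ 0x5a);

    int visible = -1;
    if (pwrite(fd, &marker, 1, 0) == 1)
        visible = (map[0] == marker) ? 1 : 0;

    munmap((void *)map, (size_t)pagesize);
    return visible;
}
```

On a system with a unified buffer cache this should return 1. A result of 0 would mean the
mmap-read/syscall-write combination is not safe there without extra synchronization.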

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/