sync_server "memory leak" with giant new mailbox first sync

Sun Sep 10 15:25:52 EDT 2006

I saw this problem the first time I enabled replication on a machine  
hosting all the IT support staff at the University of Michigan.  Plenty  
of large mailboxes there!

My solution (such as it is) was to reduce the wasteful amount of  
space sync_server was allocating per message:

--- cyrus-imapd-2.3.7/imap/sync_support.c	2006-06-14 14:03:24.000000000 -0400
+++ cyrus-imapd-2.3.7p3/imap/sync_support.c	2006-07-29 12:34:59.000000000 -0400
@@ -912,9 +912,9 @@
      result = xzmalloc(sizeof(struct sync_message));
      message_uuid_set_null(&result->uuid);

-    result->msg_path = xzmalloc(5 * (MAX_MAILBOX_PATH+1) * sizeof(char));
+    result->msg_path = xzmalloc((MAX_MAILBOX_PATH+1) * sizeof(char));
      result->msg_path_end = result->msg_path +
-	5 * (MAX_MAILBOX_PATH+1) * sizeof(char);
+	(MAX_MAILBOX_PATH+1) * sizeof(char);

      snprintf(result->stagename, sizeof(result->stagename), "%lu.", l->count);
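
To put rough numbers on that hunk, here is a back-of-envelope sketch.
It assumes MAX_MAILBOX_PATH is 4096 (check imap/mailbox.h in your own
tree; I have not verified it for every release), and the helper name
path_bytes is made up for illustration:

```c
/* ASSUMPTION: MAX_MAILBOX_PATH is 4096 in this build; verify in imap/mailbox.h. */
#define ASSUMED_MAX_MAILBOX_PATH 4096ULL

/*
 * Hypothetical helper: bytes pre-allocated for msg_path across `messages`
 * sync_message structs, given the multiplier used in the xzmalloc() call
 * (5 before the patch, 1 after).
 */
unsigned long long path_bytes(unsigned long long messages,
                              unsigned long long multiplier)
{
    return messages * multiplier * (ASSUMED_MAX_MAILBOX_PATH + 1ULL);
}
```

Under that assumption, path_bytes(102000, 5) is 2,089,470,000 bytes,
which is in the same ballpark as the memory use Bron reports below,
while path_bytes(102000, 1) is 417,894,000 bytes.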

The times-5 is completely gratuitous.  In fact, pre-allocating any  
memory for paths is wasteful, but I was not up for reengineering  
the memory scheme in sync_server at the time.  As for the RESTART  
that is supposed to happen after 1000 messages, I wonder if the  
client couldn't just initiate it.  After all, how smart is it to  
transmit 1000 messages before deciding that a more efficient  
approach might be needed?  Especially since sync_client has an idea  
of how many messages it's going to attempt to send.
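
A minimal sketch of what client-initiated batching could look like.
This is not the sync_client code: BATCH_LIMIT, send_batch_fn,
upload_in_batches and count_batch are all hypothetical names, and the
callback stands in for "send one UPLOAD and then restart":

```c
#include <stddef.h>

/* Hypothetical limit, mirroring the server's 1000-message threshold. */
#define BATCH_LIMIT 1000

/* Hypothetical callback: upload messages [first, first + count) as one
 * batch.  Returns 0 on success, non-zero on error. */
typedef int (*send_batch_fn)(size_t first, size_t count, void *ctx);

/* Split `total` pending messages into batches of at most BATCH_LIMIT and
 * hand each batch to `send`.  Stops at the first failure and returns its
 * error code; returns 0 if every batch was sent. */
int upload_in_batches(size_t total, send_batch_fn send, void *ctx)
{
    size_t first = 0;

    while (first < total) {
        size_t count = total - first;
        int r;

        if (count > BATCH_LIMIT)
            count = BATCH_LIMIT;

        r = send(first, count, ctx);
        if (r != 0)
            return r;

        first += count;
    }
    return 0;
}

/* Example callback for illustration: just counts how many batches were sent. */
int count_batch(size_t first, size_t count, void *ctx)
{
    (void)first;
    (void)count;
    ++*(size_t *)ctx;
    return 0;
}
```

With count_batch as the callback, a 180,000-message folder goes out as
180 batches, so the server never has to hold more than one batch's
worth of state at a time.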

:wes

On 10 Sep 2006, at 11:15, Bron Gondwana wrote:
> When sync_client has a large folder to send (for
> the sake of far too many hours of me trying to
> make this work let's just say it's 180,000
> messages), then it just sends a single
> "UPLOAD [lastuid] [lastappenddate]" followed by
> every single message one after the other.
>
> There's logic on the server end to send a [RESTART]
> back after 1000 new files arrive, but it doesn't
> get to be called until all 180,000 messages have
> arrived... or at least it would be if the sync_server
> process didn't receive a SIGABRT somewhere around
> 102,000 messages in.  I tried all sorts of things
> to find the underlying cause, then finally just
> watched 'top' on the sync_server machine as it ran.
>
> This machine has 8Gb of memory, and was seeing over
> 30% being used by this one sync_server before it
> died!
>
> Well, the attached isn't the most elegant patch in
> the world, and may not be the best way to solve the
> problem, but at least it got that user replicated
> and happy.  The first time we had to deal with it
> was moving the user off a corrupted filesystem that
> I could only mount read-only, and it took about 3
> hours for each run to fail thanks to the insanely
> high IO load on that drive unit, so debugging was
> more of a pain than you'd hope.
>
> I hope something inspired by this can be merged
> upstream to solve the "spam sync_server until it
> falls over" failure mode.