time for cyrus-imap v3.2?

Anatoli me at anatoli.ws
Wed Nov 13 00:12:12 EST 2019


> How do you know when the current write operations are finished?

The pseudocode from my previous email:

__start:

  atomic_inc(write_operations_in_execution_counter)
  atomic_load(write_barrier)

  if (write_barrier == ON) {
    atomic_dec(write_operations_in_execution_counter)
    sleep(100ms)
    goto __start
  } else {
    *perform_write_operation_with_its_own_locks()*
    atomic_dec(write_operations_in_execution_counter)
  }


Here perform_write_operation_with_its_own_locks() stands for any of the
functions (there are many) that today perform a write operation; the
algorithm implies that every such write call site gets wrapped with this
pseudocode.
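
As a minimal sketch in C11 atomics (write_barrier and
write_operations_in_execution_counter are the names from the pseudocode
above; the enter/leave helpers and the single-process threading are my
assumptions, just to illustrate the shape):

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

atomic_bool write_barrier = false;
atomic_int  write_operations_in_execution_counter = 0;

/* Call before any write operation. Spins (with 100ms sleeps) while the
 * barrier is ON and returns with the in-flight counter incremented. */
void write_barrier_enter(void)
{
    for (;;) {
        /* announce the write first, then check the barrier, so the
         * barrier thread can never miss an in-flight write */
        atomic_fetch_add(&write_operations_in_execution_counter, 1);
        if (!atomic_load(&write_barrier))
            return;  /* barrier OFF: go ahead with the write */
        /* barrier ON: back out, wait 100ms, retry */
        atomic_fetch_sub(&write_operations_in_execution_counter, 1);
        usleep(100 * 1000);
    }
}

/* Call after the write operation has completed. */
void write_barrier_leave(void)
{
    atomic_fetch_sub(&write_operations_in_execution_counter, 1);
}

A write call site would then become: write_barrier_enter();
perform_write_operation_with_its_own_locks(); write_barrier_leave();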

Once a write function returns (meaning that particular write operation
has completed), write_operations_in_execution_counter is decremented.
When all in-flight operations have completed, the counter drops to 0.
The barrier thread detects this, performs the sync/fsync so that all
files are fully written to disk, and notifies the external process
waiting to take a snapshot. It then turns the barrier off and the daemon
resumes normal operation.

This is similar to a single global lock as you describe it, and it
behaves much like one, but it doesn't need to be acquired as such every
time and it has practically zero overhead, so it could be placed inside
high-frequency loops. It's similar to how modern lock-free concurrency
is implemented. The point is that it's opportunistic locking, which can
be implemented with very low overhead.

And if the code inside perform_write_operation_with_its_own_locks() is
guaranteed not to block on other
perform_write_operation_with_its_own_locks() calls, then this
implementation is lock-free while the barrier is OFF, i.e. there are no
potential deadlocks. While the barrier is ON there are also no
deadlocks, but the operation as such isn't fully lock-free.

Also, the way I propose to implement the barrier thread, it won't block
the server for more than a configurable number of seconds no matter
what. With that, the entire implementation stays lock-free even if we
consider the snapshot part of the process, i.e. it guarantees progress
in a finite amount of time, though some threads could starve:

atomic_store(write_barrier, ON)

__check_write_counter:

  atomic_load(write_operations_in_execution_counter)

  if (write_operations_in_execution_counter == 0) {
    sync_data_to_disk()
    signal_data_ready()
    *wait_for_lock_release_with_timeout(5s)*
    atomic_store(write_barrier, OFF)
  } else {
    sleep(1ms)
    goto __check_write_counter
  }


Here the wait_for_lock_release_with_timeout(5s) function waits for the
release-lock signal for at most 5 seconds and then turns the barrier off
whether or not the external operation (the snapshot-taking backup tool)
has completed, so the server resumes normal operation once the 5s
timeout expires.
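
For what it's worth, a sketch of that barrier thread in C11/pthreads.
The condition variable, the release_received flag and the
sync/notification helpers are my assumptions, since the actual signaling
mechanism between the daemon and the backup tool is still open:

#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>
#include <time.h>
#include <unistd.h>

extern atomic_bool write_barrier;
extern atomic_int  write_operations_in_execution_counter;

/* Assumed helpers: how the data is synced and how the backup tool is
 * notified (socket, fifo, signal, ...) is not decided yet. */
extern void sync_data_to_disk(void);
extern void signal_data_ready(void);

static pthread_mutex_t release_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  release_cond = PTHREAD_COND_INITIALIZER;
static int release_received;  /* set by whatever gets the release signal */

void barrier_and_snapshot(int timeout_sec)
{
    atomic_store(&write_barrier, true);

    /* wait until all in-flight writes have drained */
    while (atomic_load(&write_operations_in_execution_counter) != 0)
        usleep(1000);  /* 1ms */

    sync_data_to_disk();
    signal_data_ready();

    /* wait_for_lock_release_with_timeout(timeout_sec) */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&release_mtx);
    while (!release_received) {
        int rc = pthread_cond_timedwait(&release_cond, &release_mtx,
                                        &deadline);
        if (rc != 0)
            break;  /* ETIMEDOUT: stop waiting */
    }
    release_received = 0;
    pthread_mutex_unlock(&release_mtx);

    /* turn the barrier off whether or not the release arrived in time */
    atomic_store(&write_barrier, false);
}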

So while the barrier thread waits for the release-lock signal, the
backup tool takes a snapshot and then sends the release-lock signal. The
result of that signaling tells the backup tool whether the lock had
already been released by the timeout. If the backup tool receives the
code indicating that the lock was released before it finished, the
snapshot it took could be inconsistent.

In this case the backup tool could retry the operation or proceed in
another way (e.g. notify the admin that the snapshot takes longer than
the preconfigured lock-wait time). Again, this is opportunistic locking:
we try to perform an operation without a guarantee of success, so we
don't need to wait indefinitely, which again preserves the lock-free
guarantee. If we succeed, all is good. If not, we retry or abandon the
task with an error.

And all of this would be external to Cyrus; it would be implemented in
the backup utility.
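
Just to illustrate the backup-utility side, here's a sketch that assumes
a hypothetical pair of FIFOs as the transport and a ZFS snapshot as the
operation. The paths, the snapshot command and the protocol are all
placeholders (nothing like this exists today), and requesting the
barrier in the first place is left out:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical rendezvous points; the real mechanism is undecided. */
#define READY_FIFO   "/run/cyrus/snapshot-ready"
#define RELEASE_FIFO "/run/cyrus/snapshot-release"

int main(void)
{
    char status[16] = {0};

    /* 1. Wait for the daemon's data-ready signal. */
    int ready = open(READY_FIFO, O_RDONLY);
    if (ready < 0 || read(ready, status, sizeof(status) - 1) <= 0) {
        perror("waiting for data-ready signal");
        return 1;
    }
    close(ready);

    /* 2. Take the snapshot (placeholder command). */
    if (system("zfs snapshot tank/cyrus@backup") != 0) {
        fprintf(stderr, "snapshot failed\n");
        return 1;
    }

    /* 3. Send the release-lock signal. A response telling us whether
     *    the barrier had already timed out (snapshot possibly
     *    inconsistent) would come back over a similar channel;
     *    omitted here. */
    int release = open(RELEASE_FIFO, O_WRONLY);
    if (release < 0 || write(release, "release\n", 8) != 8) {
        perror("sending release signal");
        return 1;
    }
    close(release);
    return 0;
}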

I guess the best way to start with this is to identify all the places
where data write operations occur (I suppose this is where the mail data
and all sorts of databases are written). Once they are identified, they
could be tweaked a bit for better concurrency and lockability, and then
we could analyze how to wrap them with a global lock/barrier.
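
Once those call sites are identified, the per-site change could be kept
mechanical, e.g. with a small macro built on the enter/leave helpers
sketched earlier. Again, this is only an assumption about the shape, and
mailbox_commit() below is just an example of a write entry point:

/* from the earlier sketch */
void write_barrier_enter(void);
void write_barrier_leave(void);

#define WITH_WRITE_BARRIER(write_expr)  \
    do {                                \
        write_barrier_enter();          \
        write_expr;                     \
        write_barrier_leave();          \
    } while (0)

/* An existing call such as
 *     mailbox_commit(mailbox);
 * would become
 *     WITH_WRITE_BARRIER(mailbox_commit(mailbox));
 * (error handling of the wrapped call is omitted in this sketch). */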

Regards,
Anatoli


On 12/11/19 06:20, Bron Gondwana wrote:
> 
> 
> On Tue, Nov 12, 2019, at 14:50, Anatoli wrote:
>> Bron,
>>
>> The proposed algo is a barrier before any single-lock. In itself it's a
>> single lock, but the same code (the pseudocode for the *worker thread*
>> in my previous mail) should be inserted at *every* single-lock/write
>> operation location. If there's no need to pause, the overhead is
>> non-existent. If a pause is requested, all worker threads would pause at
>> the entrance to any single-lock/write code.
>>
>> It would make the entire Cyrus daemon to complete all pending write
>> operations and pause new ones. At this stage, if I understand it
>> correctly, the data on disk would be in a consistent state, ready to
>> take a snapshot or to perform some other operation.
> 
> "complete all pending write operations and pause new ones"
> 
> How do you know when the current write operations are finished?
> 
>> Without that, if we just take a snapshot of the fs, it could happen that
>> a) some files are not written entirely (i.e. caught in the middle of a
>> write operation) or b) the contents of some files are newer than the
>> other, i.e. the logical write operation was not atomic (e.g. mail data
>> is written but indexes are not updated yet or something similar).
>>
>> Maybe I didn't understand you correctly. Do you mean that finishing all
>> writes and pausing new ones is not enough to guarantee an integral state
>> of files on disk? If it's the case, what would have to be done to
>> guarantee it (i.e. to make it like Cyrus was shutdown normally)?
> 
> I mean that to finish all writes and pause new ones, you need to know
> that the writes are finished.  And not just writes, but sets of writes
> that are held under a lock together.  The way I know to do this is a
> single global lock with the following properties:
> 
> 1) every action which locks any file within Cyrus for writing takes a
> SHARED global lock before it takes the write lock on the file.
> 
> 2) the SHARED lock is held for the duration of the writes, and released
> once the writes are finished.
> 
> 3) the "backup utility" takes an EXCLUSIVE lock on the global lock,
> which will only be granted once each write is finished.  It then takes a
> snapshot, and releases the EXCLUSIVE lock.
> 
> This guarantees full consistency.
> 
> The question that always exists for locks is "what granularity" - too
> wide, and you hold the lock for a long time.  Too narrow, and you take
> and release it very frequently, adding overhead.
> 
> My first and most dumbest theory is to go quite wide - add the lock in
> every runloop and command line utility such that it's held for the
> entire running of the loop or the utility!  Mostly these are done within
> a fraction of a second.  The one place that might be interesting is
> FETCH 1:* RFC822.PEEK or similar in imapd, where we already have some
> locking magic that holds a shared namelock on the mailbox to stop
> repacking while it releases the index lock to allow other actions on the
> mailbox in the meanwhile.
> 
> So we could go down a layer and only lock when we lock mailboxes or
> cyrusdbs, and refcount the global lock.  This seems more likely to be a
> good layer, and not too horrible.
> 
> The other thing is that we'll need to assert that the lock isn't being
> held during each daemon's command loop, so that bugs don't leak out to
> deadlock entire servers.
> 
> And I think that's nearly it :)
> 
> Bron.
> 
> --
>   Bron Gondwana, CEO, Fastmail Pty Ltd
>   brong at fastmailteam.com
> 
> 

