time for cyrus-imap v3.2?

Wed Nov 13 01:06:22 EST 2019

> Given that Cyrus is unix processes, not threads, what mechanism are
> you proposing for the atomic here?

__atomic_load/store and __atomic_fetch_add/sub:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html. These
are not related to threads or processes, these are compiler built-ins
that leverage processor-specific instructions like CAS
(compare-and-swap). The compiler knows which instructions to use for
each arch. More info here: https://lwn.net/Articles/509102/. Clang/LLVM
also has similar built-ins: https://libcxx.llvm.org/atomic_design_a.html.

So instead of using some old POSIX synchro primitives, we go one level
lower and implement them ourselves in opportunistic lock-free way with
compiler built-ins. The built-ins are basically inline asm
implementation abstractions available for all archs supported by the
compiler.

> I'd be keen to see something shaped like a pull request against Cyrus
> showing how this would interact with the existing locking
> architecture.

The issue is I don't know the internals of cyrus enough to be able to
identify all the places where write operations occur and I don't have
enough understanding of the entire sync/write logic to be able to
provide a working solution. But once the write operations are
encapsulated in separate functions and there's a list of all of them, I
could implement the efficient global locking and the backup tool that
would leverage it.

Regards,
Anatoli

On 13/11/19 02:25, Bron Gondwana wrote:
> Given that Cyrus is unix processes, not threads, what mechanism are you
> proposing for the atomic here?
> 
> I'd be keen to see something shaped like a pull request against Cyrus
> showing how this would interact with the existing locking architecture.
> 
> Cheers,
> 
> Bron.
> 
> On Wed, Nov 13, 2019, at 16:12, Anatoli via Cyrus-devel wrote:
>> > How do you know when the current write operations are finished?
>>
>> The pseudocode from my previous email:
>>
>> __start:
>>
>> atomic_inc(write_operations_in_execution_counter)
>>
>> atomic_load(write_barrier)
>>
>> if (write_barrier == ON) {
>>
>>   atomic_dec(write_operations_in_execution_counter)
>>   sleep(100ms)
>>   goto __start
>>
>> } else {
>>
>>   *perform_write_operation_with_its_own_locks()*
>>   atomic_dec(write_operations_in_execution_counter)
>>
>> }
>>
>>
>> Here perform_write_operation_with_its_own_locks() is a function (one
>> among many) that today performs a write operation (the algorithm implies
>> that all occurrences of write operations should be wrapped with this
>> pseudocode).
>>
>> Once a write function returns (which would mean that this particular
>> write operation completed), the write_operations_in_execution_counter is
>> decremented. When all operations complete, the counter would be == 0,
>> which is detected by the barrier thread and it proceeds with the
>> sync/fsync to have all files fully written to disk & notifies the
>> external process waiting to take a snapshot. Then it turns off the
>> barrier and the daemon continues normal operation.
>>
>> This is similar to a single global lock as you describe it, and it sort
>> of behaves like it, but it doesn't need to be acquired as such every
>> time and it has practically 0 overhead so it could be placed inside
>> high-frequency loops. It's similar to how modern lock-free concurrency
>> is implemented. The point it that it's an opportunistic locking which
>> could be implemented with very low overhead.
>>
>> And if the code inside the perform_write_operation_with_its_own_locks()
>> is guaranteed not to block on other
>> perform_write_operation_with_its_own_locks() functions, then this
>> implementation would be lock-free for when the barrier is OFF, i.e. no
>> potential deadlocks. For when it's ON, there also would be no deadlocks,
>> but the operation as such won't be fully lock-free.
>>
>> Also, the way I propose to implement the barrier thread, it won't block
>> the server for more than a configurable amount of seconds no matter what
>> (with this, the entire implementation would be lock-free (if we consider
>> the snapshot as part of the process), i.e. it guarantees progress in a
>> finite amount of time, though some threads could starve):
>>
>> atomic_store(write_barrier, ON)
>>
>> __check_write_counter:
>>
>> atomic_load(write_operations_in_execution_counter)
>>
>> if (write_operations_in_execution_counter == 0) {
>>
>>   sync_data_to_disk()
>>   signal_data_ready()
>>   *wait_for_lock_release_with_timeout(5s)*
>>   atomic_store(write_barrier, OFF)
>>
>> } else {
>>
>>   sleep(1ms)
>>   goto __check_write_counter
>>
>> }
>>
>>
>> Here the wait_for_lock_release_with_timeout(5s) function will wait for
>> the release-lock signal for 5 seconds and would turn off the barrier no
>> matter if the external operation (snapshot-taking backup tool)
>> completed, so the server would continue its normal operation once the 5s
>> timeout expires.
>>
>> So while the barrier thread waits for the release-lock signal, the
>> backup tool performs a snapshot and then sends the release-lock signal.
>> The result of the signaling indicates whether the lock was released
>> before or not. If the backup tool receives the code indicating that the
>> lock was released before, it would mean that the snapshot that was taken
>> could be inconsistent.
>>
>> In this case the backup tool could try to perform the operation again or
>> proceed in another way (e.g. to notify the admin that the snapshot takes
>> more than the preconfigured lock-wait time). Again this is the
>> opportunistic locking, i.e. we try to perform an operation without a
>> guarantee of success, so we don't need to wait indefinitely, again
>> providing a lock-free guarantee. If we succeed, then all is good. If
>> not, we try again or abandon the task with an error.
>>
>> And all this would be external to cyrus, it would be implemented in the
>> backup utility.
>>
>> I guess the best way to start with this is to identify all places where
>> data write operations occur (I suppose this is where the mail data and
>> all sorts of databases are written). Once they are identified they could
>> be tweaked a bit for better concurrency and lockability and then we
>> could analyze how to wrap them with a global lock/barrier.
>>
>> Regards,
>> Anatoli
>>
>>
>> On 12/11/19 06:20, Bron Gondwana wrote:
>> > 
>> > 
>> > On Tue, Nov 12, 2019, at 14:50, Anatoli wrote:
>> >> Bron,
>> >>
>> >> The proposed algo is a barrier before any single-lock. In itself it's a
>> >> single lock, but the same code (the pseudocode for the *worker thread*
>> >> in my previous mail) should be inserted at *every* single-lock/write
>> >> operation location. If there's no need to pause, the overhead is
>> >> non-existent. If a pause is requested, all worker threads would
>> pause at
>> >> the entrance to any single-lock/write code.
>> >>
>> >> It would make the entire Cyrus daemon to complete all pending write
>> >> operations and pause new ones. At this stage, if I understand it
>> >> correctly, the data on disk would be in a consistent state, ready to
>> >> take a snapshot or to perform some other operation.
>> > 
>> > "complete all pending write operations and pause new ones"
>> > 
>> > How do you know when the current write operations are finished?
>> > 
>> >> Without that, if we just take a snapshot of the fs, it could happen
>> that
>> >> a) some files are not written entirely (i.e. caught in the middle of a
>> >> write operation) or b) the contents of some files are newer than the
>> >> other, i.e. the logical write operation was not atomic (e.g. mail data
>> >> is written but indexes are not updated yet or something similar).
>> >>
>> >> Maybe I didn't understand you correctly. Do you mean that finishing all
>> >> writes and pausing new ones is not enough to guarantee an integral
>> state
>> >> of files on disk? If it's the case, what would have to be done to
>> >> guarantee it (i.e. to make it like Cyrus was shutdown normally)?
>> > 
>> > I mean that to finish all writes and pause new ones, you need to know
>> > that the writes are finished.  And not just writes, but sets of writes
>> > that are held under a lock together.  The way I know to do this is a
>> > single global lock with the following properties:
>> > 
>> > 1) every action which locks any file within Cyrus for writing takes a
>> > SHARED global lock before it takes the write lock on the file.
>> > 
>> > 2) the SHARED lock is held for the duration of the writes, and released
>> > once the writes are finished.
>> > 
>> > 3) the "backup utility" takes an EXCLUSIVE lock on the global lock,
>> > which will only be granted once each write is finished.  It then takes a
>> > snapshot, and releases the EXCLUSIVE lock.
>> > 
>> > This guarantees full consistency.
>> > 
>> > The question that always exists for locks is "what granularity" - too
>> > wide, and you hold the lock for a long time.  Too narrow, and you take
>> > and release it very frequently, adding overhead.
>> > 
>> > My first and most dumbest theory is to go quite wide - add the lock in
>> > every runloop and command line utility such that it's held for the
>> > entire running of the loop or the utility!  Mostly these are done within
>> > a fraction of a second.  The one place that might be interesting is
>> > FETCH 1:* RFC822.PEEK or similar in imapd, where we already have some
>> > locking magic that holds a shared namelock on the mailbox to stop
>> > repacking while it releases the index lock to allow other actions on the
>> > mailbox in the meanwhile.
>> > 
>> > So we could go down a layer and only lock when we lock mailboxes or
>> > cyrusdbs, and refcount the global lock.  This seems more likely to be a
>> > good layer, and not too horrible.
>> > 
>> > The other thing is that we'll need to assert that the lock isn't being
>> > held during each daemon's command loop, so that bugs don't leak out to
>> > deadlock entire servers.
>> > 
>> > And I think that's nearly it :)
>> > 
>> > Bron.
>> > 
>> > --
>> >   Bron Gondwana, CEO, Fastmail Pty Ltd
>> >   brong at fastmailteam.com <mailto:brong at fastmailteam.com>
>> > 
>> > 
>>
> 
> --
>   Bron Gondwana, CEO, Fastmail Pty Ltd
>   brong at fastmailteam.com <mailto:brong at fastmailteam.com>
> 
>