time for cyrus-imap v3.2?
Bron Gondwana
brong at fastmailteam.com
Wed Nov 13 00:25:38 EST 2019
Given that Cyrus is unix processes, not threads, what mechanism are you proposing for the atomic here?
I'd be keen to see something shaped like a pull request against Cyrus showing how this would interact with the existing locking architecture.
Cheers,
Bron.
On Wed, Nov 13, 2019, at 16:12, Anatoli via Cyrus-devel wrote:
> > How do you know when the current write operations are finished?
>
> The pseudocode from my previous email:
>
> __start:
>     atomic_inc(write_operations_in_execution_counter)
>     atomic_load(write_barrier)
>
>     if (write_barrier == ON) {
>         atomic_dec(write_operations_in_execution_counter)
>         sleep(100ms)
>         goto __start
>     } else {
>         *perform_write_operation_with_its_own_locks()*
>         atomic_dec(write_operations_in_execution_counter)
>     }
>
>
> Here perform_write_operation_with_its_own_locks() is any one of the
> functions that today perform a write operation; the algorithm implies
> that every write-operation call site should be wrapped in this
> pseudocode.
>
> Once a write function returns (meaning that particular write operation
> has completed), write_operations_in_execution_counter is decremented.
> When all operations have completed the counter reaches 0; the barrier
> thread detects this and proceeds with the sync/fsync so that all files
> are fully written to disk, then notifies the external process waiting
> to take a snapshot. Finally it turns off the barrier and the daemon
> resumes normal operation.
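>
> For illustration, here is a minimal C11 sketch of the worker-side
> check, assuming the barrier flag and counter live in a shared-memory
> segment mapped by all Cyrus processes (so the atomics work across
> processes, not just threads); all names here are hypothetical:
>
>     #include <stdatomic.h>
>     #include <unistd.h>
>
>     /* one instance in a MAP_SHARED mmap() or shm_open() segment;
>      * assumes atomic_int/atomic_long are lock-free on the platform */
>     struct barrier_state {
>         atomic_int  write_barrier;       /* 0 = OFF, 1 = ON */
>         atomic_long writes_in_progress;
>     };
>     extern struct barrier_state *bs;     /* mapped once at startup */
>
>     void guarded_write(void (*perform_write)(void))
>     {
>         for (;;) {
>             atomic_fetch_add(&bs->writes_in_progress, 1);
>             if (atomic_load(&bs->write_barrier)) {
>                 /* barrier is up: undo the increment, back off, retry */
>                 atomic_fetch_sub(&bs->writes_in_progress, 1);
>                 usleep(100 * 1000);      /* 100ms */
>                 continue;
>             }
>             perform_write();             /* takes its own file locks, as today */
>             atomic_fetch_sub(&bs->writes_in_progress, 1);
>             return;
>         }
>     }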
>
> This is similar to the single global lock you describe, and it sort of
> behaves like one, but it doesn't need to be acquired as such every
> time and it has practically zero overhead, so it could be placed inside
> high-frequency loops. It's similar to how modern lock-free concurrency
> is implemented. The point is that it's opportunistic locking, which
> can be implemented with very low overhead.
>
> And if the code inside perform_write_operation_with_its_own_locks() is
> guaranteed not to block on other
> perform_write_operation_with_its_own_locks() calls, then this
> implementation is lock-free while the barrier is OFF, i.e. there are
> no potential deadlocks. While it's ON there would also be no
> deadlocks, but the operation as such won't be fully lock-free.
>
> Also, the way I propose to implement the barrier thread, it won't block
> the server for more than a configurable number of seconds no matter
> what. With this, the entire implementation is lock-free (if we consider
> the snapshot part of the process), i.e. it guarantees progress in a
> finite amount of time, though some threads could starve:
>
> atomic_store(write_barrier, ON)
>
> __check_write_counter:
>     atomic_load(write_operations_in_execution_counter)
>
>     if (write_operations_in_execution_counter == 0) {
>         sync_data_to_disk()
>         signal_data_ready()
>         *wait_for_lock_release_with_timeout(5s)*
>         atomic_store(write_barrier, OFF)
>     } else {
>         sleep(1ms)
>         goto __check_write_counter
>     }
>
>
> Here the wait_for_lock_release_with_timeout(5s) function waits up to 5
> seconds for the release-lock signal and then turns off the barrier
> whether or not the external operation (the snapshot-taking backup tool)
> has completed, so the server resumes normal operation once the 5s
> timeout expires.
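>
> A minimal sketch of this barrier side under the same shared-memory
> assumptions, using a named POSIX semaphore as the release-lock signal
> (the semaphore name and signal_data_ready() are hypothetical):
>
>     #include <stdatomic.h>
>     #include <semaphore.h>
>     #include <time.h>
>     #include <unistd.h>
>
>     extern struct barrier_state *bs;      /* shared segment, as above */
>     extern sem_t *release_sem;            /* sem_open("/cyrus-snap-release", ...) */
>     extern void signal_data_ready(void);  /* hypothetical: wake the backup tool */
>
>     void quiesce_for_snapshot(void)
>     {
>         atomic_store(&bs->write_barrier, 1);          /* barrier ON */
>
>         /* wait for in-flight writes to drain */
>         while (atomic_load(&bs->writes_in_progress) != 0)
>             usleep(1000);                             /* 1ms */
>
>         sync();                                       /* flush data to disk */
>         signal_data_ready();
>
>         /* wait up to 5s for the release-lock signal, then move on */
>         struct timespec deadline;
>         clock_gettime(CLOCK_REALTIME, &deadline);
>         deadline.tv_sec += 5;
>         sem_timedwait(release_sem, &deadline);        /* ETIMEDOUT is fine */
>
>         atomic_store(&bs->write_barrier, 0);          /* barrier OFF */
>     }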
>
> So while the barrier thread waits for the release-lock signal, the
> backup tool takes the snapshot and then sends the release-lock signal.
> The result of the signaling tells the backup tool whether the lock had
> already been released. If the backup tool receives the code indicating
> that the lock was released beforehand (i.e. the barrier timed out), the
> snapshot it took may be inconsistent.
>
> In this case the backup tool could retry the operation or proceed in
> another way (e.g. notify the admin that the snapshot takes longer than
> the preconfigured lock-wait time). Again, this is opportunistic
> locking: we try to perform an operation without a guarantee of
> success, so we don't need to wait indefinitely, which again preserves
> the lock-free guarantee. If we succeed, all is good. If not, we retry
> or abandon the task with an error.
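>
> Sketched from the backup tool's side (again, all names hypothetical;
> re-reading the shared barrier flag before signaling stands in for the
> result code described above):
>
>     extern struct barrier_state *bs;
>     extern sem_t *release_sem;
>     extern void wait_for_data_ready(void);       /* hypothetical */
>     extern void take_filesystem_snapshot(void);  /* hypothetical: LVM/ZFS/btrfs */
>
>     int snapshot_once(void)
>     {
>         wait_for_data_ready();           /* data-ready signal from barrier thread */
>         take_filesystem_snapshot();
>         int held = atomic_load(&bs->write_barrier);  /* barrier still ON? */
>         sem_post(release_sem);           /* the release-lock signal */
>         return held;  /* 0 => the 5s timeout hit first; snapshot may be inconsistent */
>     }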
>
> And all of this would be external to Cyrus; it would be implemented in
> the backup utility.
>
> I guess the best way to start is to identify all the places where data
> write operations occur (I suppose this is wherever the mail data and
> the various databases are written). Once they are identified, they
> could be tweaked a bit for better concurrency and lockability, and
> then we could analyze how to wrap them with a global lock/barrier.
>
> Regards,
> Anatoli
>
>
> On 12/11/19 06:20, Bron Gondwana wrote:
> >
> >
> > On Tue, Nov 12, 2019, at 14:50, Anatoli wrote:
> >> Bron,
> >>
> >> The proposed algo is a barrier before any single-lock. In itself it's a
> >> single lock, but the same code (the pseudocode for the *worker thread*
> >> in my previous mail) should be inserted at *every* single-lock/write
> >> operation location. If there's no need to pause, the overhead is
> >> non-existent. If a pause is requested, all worker threads would pause at
> >> the entrance to any single-lock/write code.
> >>
> >> It would make the entire Cyrus daemon complete all pending write
> >> operations and pause new ones. At this stage, if I understand it
> >> correctly, the data on disk would be in a consistent state, ready to
> >> take a snapshot or to perform some other operation.
> >
> > "complete all pending write operations and pause new ones"
> >
> > How do you know when the current write operations are finished?
> >
> >> Without that, if we just take a snapshot of the fs, it could happen
> >> that a) some files are not written entirely (i.e. caught in the middle
> >> of a write operation) or b) the contents of some files are newer than
> >> others, i.e. the logical write operation was not atomic (e.g. mail
> >> data is written but indexes are not updated yet, or something similar).
> >>
> >> Maybe I didn't understand you correctly. Do you mean that finishing
> >> all writes and pausing new ones is not enough to guarantee a
> >> consistent state of the files on disk? If so, what would have to be
> >> done to guarantee it (i.e. to make it as if Cyrus had been shut down
> >> normally)?
> >
> > I mean that to finish all writes and pause new ones, you need to know
> > that the writes are finished. And not just writes, but sets of writes
> > that are held under a lock together. The way I know to do this is a
> > single global lock with the following properties:
> >
> > 1) every action which locks any file within Cyrus for writing takes a
> > SHARED global lock before it takes the write lock on the file.
> >
> > 2) the SHARED lock is held for the duration of the writes, and released
> > once the writes are finished.
> >
> > 3) the "backup utility" takes an EXCLUSIVE lock on the global lock,
> > which will only be granted once each write is finished. It then takes a
> > snapshot, and releases the EXCLUSIVE lock.
> >
> > This guarantees full consistency.
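> >
> > A minimal sketch of those three properties with flock(2) on a
> > dedicated lock file (the path and function names are illustrative
> > only):
> >
> >     #include <sys/file.h>
> >     #include <fcntl.h>
> >
> >     /* each process opens its own fd once, e.g. on a "global.lock" file */
> >     int open_global_lock(const char *path)
> >     {
> >         return open(path, O_RDONLY | O_CREAT, 0600);
> >     }
> >
> >     /* every writer, around each locked set of writes */
> >     void with_global_shared(int global_fd)
> >     {
> >         flock(global_fd, LOCK_SH);   /* blocks only while a snapshot runs */
> >         /* ... take the per-file write locks and do the writes ... */
> >         flock(global_fd, LOCK_UN);
> >     }
> >
> >     /* the backup utility */
> >     void snapshot(int global_fd)
> >     {
> >         flock(global_fd, LOCK_EX);   /* granted once all writers finish */
> >         /* ... snapshot the filesystem ... */
> >         flock(global_fd, LOCK_UN);
> >     }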
> >
> > The question that always exists for locks is "what granularity" - too
> > wide, and you hold the lock for a long time. Too narrow, and you take
> > and release it very frequently, adding overhead.
> >
> > My first and dumbest theory is to go quite wide: add the lock in
> > every runloop and command line utility such that it's held for the
> > entire run of the loop or the utility! Mostly these are done within
> > a fraction of a second. The one place that might be interesting is
> > FETCH 1:* RFC822.PEEK or similar in imapd, where we already have some
> > locking magic that holds a shared namelock on the mailbox to stop
> > repacking while it releases the index lock to allow other actions on the
> > mailbox in the meanwhile.
> >
> > So we could go down a layer and only lock when we lock mailboxes or
> > cyrusdbs, and refcount the global lock. This seems more likely to be a
> > good layer, and not too horrible.
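> >
> > For instance (illustrative only), refcounting per process so that
> > nested mailbox/cyrusdb locks touch the global lock just once:
> >
> >     static int global_fd = -1;       /* opened once at startup */
> >     static int global_refcount = 0;
> >
> >     void global_shared_lock(void)
> >     {
> >         if (global_refcount++ == 0)
> >             flock(global_fd, LOCK_SH);
> >     }
> >
> >     void global_shared_unlock(void)
> >     {
> >         if (--global_refcount == 0)
> >             flock(global_fd, LOCK_UN);
> >     }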
> >
> > The other thing is that we'll need to assert that the lock isn't being
> > held during each daemon's command loop, so that bugs don't leak out to
> > deadlock entire servers.
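> >
> > Something like this at the top of each daemon's command loop
> > (illustrative, building on the refcount sketch above):
> >
> >     #include <assert.h>
> >
> >     /* no writer should re-enter the loop still holding the global lock */
> >     assert(global_refcount == 0);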
> >
> > And I think that's nearly it :)
> >
> > Bron.
> >
> > --
> > Bron Gondwana, CEO, Fastmail Pty Ltd
> > brong at fastmailteam.com
> >
> >
>
--
Bron Gondwana, CEO, Fastmail Pty Ltd
brong at fastmailteam.com