<!DOCTYPE html><html><head><title></title><style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body>Given that Cyrus is unix processes, not threads, what mechanism are you proposing for the atomic here?<br><br>I'd be keen to see something shaped like a pull request against Cyrus showing how this would interact with the existing locking architecture.<br><br>Cheers,<br><br>Bron.<br><div style="font-family:Arial;"><br></div><div>On Wed, Nov 13, 2019, at 16:12, Anatoli via Cyrus-devel wrote:<br></div><blockquote type="cite" id="qt"><div style="font-family:Arial;">> How do you know when the current write operations are finished?<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">The pseudocode from my previous email:<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">__start:<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">atomic_inc(write_operations_in_execution_counter)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">atomic_load(write_barrier)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">if (write_barrier == ON) {<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">  atomic_dec(write_operations_in_execution_counter)<br></div><div style="font-family:Arial;">  sleep(100ms)<br></div><div style="font-family:Arial;">  goto __start<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">} else {<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">  *perform_write_operation_with_its_own_locks()*<br></div><div style="font-family:Arial;">  atomic_dec(write_operations_in_execution_counter)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">}<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Here perform_write_operation_with_its_own_locks() is a function (one<br></div><div style="font-family:Arial;">among many) that today performs a write operation (the algorithm implies<br></div><div style="font-family:Arial;">that all occurrences of write operations should be wrapped with this<br></div><div style="font-family:Arial;">pseudocode).<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Once a write function returns (which would mean that this particular<br></div><div style="font-family:Arial;">write operation completed), the write_operations_in_execution_counter is<br></div><div style="font-family:Arial;">decremented. When all operations complete, the counter would be == 0,<br></div><div style="font-family:Arial;">which is detected by the barrier thread and it proceeds with the<br></div><div style="font-family:Arial;">sync/fsync to have all files fully written to disk & notifies the<br></div><div style="font-family:Arial;">external process waiting to take a snapshot. Then it turns off the<br></div><div style="font-family:Arial;">barrier and the daemon continues normal operation.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">This is similar to a single global lock as you describe it, and it sort<br></div><div style="font-family:Arial;">of behaves like it, but it doesn't need to be acquired as such every<br></div><div style="font-family:Arial;">time and it has practically 0 overhead so it could be placed inside<br></div><div style="font-family:Arial;">high-frequency loops. It's similar to how modern lock-free concurrency<br></div><div style="font-family:Arial;">is implemented. The point it that it's an opportunistic locking which<br></div><div style="font-family:Arial;">could be implemented with very low overhead.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">And if the code inside the perform_write_operation_with_its_own_locks()<br></div><div style="font-family:Arial;">is guaranteed not to block on other<br></div><div style="font-family:Arial;">perform_write_operation_with_its_own_locks() functions, then this<br></div><div style="font-family:Arial;">implementation would be lock-free for when the barrier is OFF, i.e. no<br></div><div style="font-family:Arial;">potential deadlocks. For when it's ON, there also would be no deadlocks,<br></div><div style="font-family:Arial;">but the operation as such won't be fully lock-free.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Also, the way I propose to implement the barrier thread, it won't block<br></div><div style="font-family:Arial;">the server for more than a configurable amount of seconds no matter what<br></div><div style="font-family:Arial;">(with this, the entire implementation would be lock-free (if we consider<br></div><div style="font-family:Arial;">the snapshot as part of the process), i.e. it guarantees progress in a<br></div><div style="font-family:Arial;">finite amount of time, though some threads could starve):<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">atomic_store(write_barrier, ON)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">__check_write_counter:<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">atomic_load(write_operations_in_execution_counter)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">if (write_operations_in_execution_counter == 0) {<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">  sync_data_to_disk()<br></div><div style="font-family:Arial;">  signal_data_ready()<br></div><div style="font-family:Arial;">  *wait_for_lock_release_with_timeout(5s)*<br></div><div style="font-family:Arial;">  atomic_store(write_barrier, OFF)<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">} else {<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">  sleep(1ms)<br></div><div style="font-family:Arial;">  goto __check_write_counter<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">}<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Here the wait_for_lock_release_with_timeout(5s) function will wait for<br></div><div style="font-family:Arial;">the release-lock signal for 5 seconds and would turn off the barrier no<br></div><div style="font-family:Arial;">matter if the external operation (snapshot-taking backup tool)<br></div><div style="font-family:Arial;">completed, so the server would continue its normal operation once the 5s<br></div><div style="font-family:Arial;">timeout expires.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">So while the barrier thread waits for the release-lock signal, the<br></div><div style="font-family:Arial;">backup tool performs a snapshot and then sends the release-lock signal.<br></div><div style="font-family:Arial;">The result of the signaling indicates whether the lock was released<br></div><div style="font-family:Arial;">before or not. If the backup tool receives the code indicating that the<br></div><div style="font-family:Arial;">lock was released before, it would mean that the snapshot that was taken<br></div><div style="font-family:Arial;">could be inconsistent.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">In this case the backup tool could try to perform the operation again or<br></div><div style="font-family:Arial;">proceed in another way (e.g. to notify the admin that the snapshot takes<br></div><div style="font-family:Arial;">more than the preconfigured lock-wait time). Again this is the<br></div><div style="font-family:Arial;">opportunistic locking, i.e. we try to perform an operation without a<br></div><div style="font-family:Arial;">guarantee of success, so we don't need to wait indefinitely, again<br></div><div style="font-family:Arial;">providing a lock-free guarantee. If we succeed, then all is good. If<br></div><div style="font-family:Arial;">not, we try again or abandon the task with an error.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">And all this would be external to cyrus, it would be implemented in the<br></div><div style="font-family:Arial;">backup utility.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">I guess the best way to start with this is to identify all places where<br></div><div style="font-family:Arial;">data write operations occur (I suppose this is where the mail data and<br></div><div style="font-family:Arial;">all sorts of databases are written). Once they are identified they could<br></div><div style="font-family:Arial;">be tweaked a bit for better concurrency and lockability and then we<br></div><div style="font-family:Arial;">could analyze how to wrap them with a global lock/barrier.<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">Regards,<br></div><div style="font-family:Arial;">Anatoli<br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;"><br></div><div style="font-family:Arial;">On 12/11/19 06:20, Bron Gondwana wrote:<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> On Tue, Nov 12, 2019, at 14:50, Anatoli wrote:<br></div><div style="font-family:Arial;">>> Bron,<br></div><div style="font-family:Arial;">>><br></div><div style="font-family:Arial;">>> The proposed algo is a barrier before any single-lock. In itself it's a<br></div><div style="font-family:Arial;">>> single lock, but the same code (the pseudocode for the *worker thread*<br></div><div style="font-family:Arial;">>> in my previous mail) should be inserted at *every* single-lock/write<br></div><div style="font-family:Arial;">>> operation location. If there's no need to pause, the overhead is<br></div><div style="font-family:Arial;">>> non-existent. If a pause is requested, all worker threads would pause at<br></div><div style="font-family:Arial;">>> the entrance to any single-lock/write code.<br></div><div style="font-family:Arial;">>><br></div><div style="font-family:Arial;">>> It would make the entire Cyrus daemon to complete all pending write<br></div><div style="font-family:Arial;">>> operations and pause new ones. At this stage, if I understand it<br></div><div style="font-family:Arial;">>> correctly, the data on disk would be in a consistent state, ready to<br></div><div style="font-family:Arial;">>> take a snapshot or to perform some other operation.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> "complete all pending write operations and pause new ones"<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> How do you know when the current write operations are finished?<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">>> Without that, if we just take a snapshot of the fs, it could happen that<br></div><div style="font-family:Arial;">>> a) some files are not written entirely (i.e. caught in the middle of a<br></div><div style="font-family:Arial;">>> write operation) or b) the contents of some files are newer than the<br></div><div style="font-family:Arial;">>> other, i.e. the logical write operation was not atomic (e.g. mail data<br></div><div style="font-family:Arial;">>> is written but indexes are not updated yet or something similar).<br></div><div style="font-family:Arial;">>><br></div><div style="font-family:Arial;">>> Maybe I didn't understand you correctly. Do you mean that finishing all<br></div><div style="font-family:Arial;">>> writes and pausing new ones is not enough to guarantee an integral state<br></div><div style="font-family:Arial;">>> of files on disk? If it's the case, what would have to be done to<br></div><div style="font-family:Arial;">>> guarantee it (i.e. to make it like Cyrus was shutdown normally)?<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> I mean that to finish all writes and pause new ones, you need to know<br></div><div style="font-family:Arial;">> that the writes are finished.  And not just writes, but sets of writes<br></div><div style="font-family:Arial;">> that are held under a lock together.  The way I know to do this is a<br></div><div style="font-family:Arial;">> single global lock with the following properties:<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> 1) every action which locks any file within Cyrus for writing takes a<br></div><div style="font-family:Arial;">> SHARED global lock before it takes the write lock on the file.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> 2) the SHARED lock is held for the duration of the writes, and released<br></div><div style="font-family:Arial;">> once the writes are finished.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> 3) the "backup utility" takes an EXCLUSIVE lock on the global lock,<br></div><div style="font-family:Arial;">> which will only be granted once each write is finished.  It then takes a<br></div><div style="font-family:Arial;">> snapshot, and releases the EXCLUSIVE lock.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> This guarantees full consistency.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> The question that always exists for locks is "what granularity" - too<br></div><div style="font-family:Arial;">> wide, and you hold the lock for a long time.  Too narrow, and you take<br></div><div style="font-family:Arial;">> and release it very frequently, adding overhead.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> My first and most dumbest theory is to go quite wide - add the lock in<br></div><div style="font-family:Arial;">> every runloop and command line utility such that it's held for the<br></div><div style="font-family:Arial;">> entire running of the loop or the utility!  Mostly these are done within<br></div><div style="font-family:Arial;">> a fraction of a second.  The one place that might be interesting is<br></div><div style="font-family:Arial;">> FETCH 1:* RFC822.PEEK or similar in imapd, where we already have some<br></div><div style="font-family:Arial;">> locking magic that holds a shared namelock on the mailbox to stop<br></div><div style="font-family:Arial;">> repacking while it releases the index lock to allow other actions on the<br></div><div style="font-family:Arial;">> mailbox in the meanwhile.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> So we could go down a layer and only lock when we lock mailboxes or<br></div><div style="font-family:Arial;">> cyrusdbs, and refcount the global lock.  This seems more likely to be a<br></div><div style="font-family:Arial;">> good layer, and not too horrible.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> The other thing is that we'll need to assert that the lock isn't being<br></div><div style="font-family:Arial;">> held during each daemon's command loop, so that bugs don't leak out to<br></div><div style="font-family:Arial;">> deadlock entire servers.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> And I think that's nearly it :)<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> Bron.<br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> --<br></div><div style="font-family:Arial;">>   Bron Gondwana, CEO, Fastmail Pty Ltd<br></div><div style="font-family:Arial;">>   <a href="mailto:brong@fastmailteam.com">brong@fastmailteam.com</a><br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;">> <br></div><div style="font-family:Arial;"><br></div></blockquote><div style="font-family:Arial;"><br></div><div id="sig56629417"><div class="signature">--<br></div><div class="signature">  Bron Gondwana, CEO, Fastmail Pty Ltd<br></div><div class="signature">  <a href="mailto:brong@fastmailteam.com">brong@fastmailteam.com</a><br></div><div class="signature"><br></div></div><div style="font-family:Arial;"><br></div></body></html>