Backup compaction optimization in a block-level replication environment
Deborah Pickett
debbiep at polyfoam.com.au
Thu Nov 7 21:35:40 EST 2019
On 2019-11-08 09:13, ellie timoney wrote:
> I'm not sure if I'm just not understanding, but if the chunk offsets were to remain the same, then there's no benefit to compaction? A (say) 2 GB file full of zeroes between small chunks is still the same 2 GB on disk as one that's never been compacted at all!
That's true. I suppose I'm imagining a threshold: if the file hits,
say, 20% wasted space, I can "defrag" the file and recover the lost
space, on the understanding that the next sync will then have to copy
the entire file again.
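That decision rule is simple enough to sketch. Nothing below is Cyrus code; the function name and parameters are illustrative, and it only shows the arithmetic behind "defrag once waste crosses a threshold":

```python
# Sketch of the "defrag when waste exceeds a threshold" rule described
# above. Names are illustrative, not part of any Cyrus API.

def should_compact(total_bytes: int, live_bytes: int,
                   threshold: float = 0.20) -> bool:
    """Return True when the fraction of dead space in the backup file
    meets or exceeds the threshold, signalling that a full-copy
    compaction (and hence a full re-sync of the file) is worth it."""
    if total_bytes == 0:
        return False
    wasted = (total_bytes - live_bytes) / total_bytes
    return wasted >= threshold

# A 2 GiB file holding only 1.5 GiB of live chunks is 25% waste.
print(should_compact(2 * 1024**3, int(1.5 * 1024**3)))  # True
```

The trade-off is exactly as described: below the threshold you keep block-level sync cheap; above it you pay for one full copy to reclaim the space.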
But you mentioned:
> And if you don't use the compaction feature, you might as well skip the backups system entirely, and have your backup server just be a normal replica that doesn't accept client traffic (maybe with a very long cyr_expire -D time?), and then you shut it down on schedule for safe block/file system backups to your offsite location.
... and that seems a more reasonable approach. I didn't know if copying
the filesystem of a (paused) Cyrus replica was a supported way of
backing up, but now I do. Is there a list of which database and index
files I need to copy apart from the files inside the partition structure?
> This setting might be helpful:
>
>> backup_compact_work_threshold: 1
>> The number of chunks that must obviously need compaction before the
>> compact tool will go ahead with the compaction. If set to less than
>> one, the value is treated as being one.
> If you set your backup_compact_min/max_sizes to a size that's comfortable/practical for your block backup algorithm, but then set a very lax backup_compact_work_threshold, you might be able to find a sweet spot where you're getting the benefits of compaction eventually, but are not constantly changing every block in the file (until you do). The default (1) is basically for compaction to occur as soon as there's something to compact out, just because the default had to be something, and without experiential data any other value would just be a hat rabbit. But this sounds like a case where a big number would play nicer.
>
> I guess I'd try to target a minimum size of 1 disk block per chunk, and a maximum of (fair dice roll) 4 disk blocks? But you'd need some experimentation to figure out ballpark numbers, and won't be able to tune it to exact block sizes, because the configured thresholds are the uncompressed data size, not the compressed chunk size on disk.
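Putting the quoted suggestions together, an imapd.conf fragment might look something like this. The option names are as documented in imapd.conf(5); the numeric values are purely illustrative starting points for experimentation, and since the size thresholds apply to uncompressed data they won't map directly onto on-disk block counts:

```
# Hypothetical tuning for a filesystem with 4 KiB blocks.
# Sizes are in kB of *uncompressed* chunk data, so expect the
# compressed on-disk chunks to come out smaller; tune empirically.
backup_compact_minsize: 4
backup_compact_maxsize: 16
# A deliberately lax threshold so compaction (and the resulting
# whole-file re-copy) only happens once plenty of waste accumulates.
backup_compact_work_threshold: 32
```

The idea being that between compactions most chunks stay byte-identical, so the block-level replication only ships the tail of the file.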
Thanks, I saw that setting but didn't really think through how it would
help me. I'll experiment with it and report back.
--
*Deborah Pickett*
System Administrator
*Polyfoam Australia Pty Ltd*