Another 2.4 upgrade horror story

Deniss cyrus at sad.lv
Tue Sep 25 14:57:49 EDT 2012



On 25.09.2012 15:28, Eric Luyten wrote:
> On Tue, September 25, 2012 2:01 pm, Sebastian Hagedorn wrote:
>> Hi,
>>
>>
>> about three weeks ago we upgraded our Cyrus installation from 2.3.x to 2.4.16.
>> We were aware of the reindexing issue, so we took precautionary
>> measures, but they didn't help a lot. We've got about 7 TB of mail data for
>> almost 200,000 mailboxes. We did the upgrade on a Sunday and had told our
>> users that mail access wouldn't be possible for the whole day. After the
>> actual software upgrade we ran distributed scripts that triggered the index
>> upgrades. We started with the largest mailboxes. The idea was that after those
>> that took the longest had been upgraded, the rest should be OK overnight and
>> early Monday. However, even though our storage infrastructure was kept at 99 %
>> I/O saturation, progress was much slower than anticipated.
>>
>>
>> Ultimately the server was virtually unusable for the whole Monday and
>> parts of Tuesday. The last mailbox was finally upgraded on Thursday, although
>> on Wednesday most things were already working normally.
>>
>> I realize that some of our problems were caused by infrastructure that's
>> not up to current standards, but nonetheless I would really urge you to never
>> again use an upgrade mechanism like that. Give admins a chance to upgrade
>> indexes in the background and over time.
>
>
> +1
>
>
> Sebastian,
>
>
> Thank you for sharing your experiences.
>
> As a site willing/needing to upgrade from 2.3.16 to 2.4.X this fall, we
> are interested in learning about your storage backend characteristics.
>
> What read/write IOPS rates were you registering before/during/after your
> upgrade process?
>
> I'd understand your reluctance to share this information in a public forum.
> No offence taken whatsoever!
>
>
> Kind regards,
> Eric Luyten, Computing Centre VUB/ULB,     Eric.Luyten at vub.ac.be
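
For illustration, the kind of distributed upgrade driver Sebastian
describes (largest mailboxes first) might look roughly like the sketch
below. The spool path, the unhashed netnews-style mailbox naming and
the worker count are assumptions made for the example, not his actual
setup; "reconstruct -V max" is the stock 2.4 command that rewrites a
mailbox index to the newest version.

#!/usr/bin/env python3
# Minimal sketch: force Cyrus 2.4 index upgrades largest-mailbox-first
# with a small worker pool. Assumes an unhashed spool at /var/spool/imap
# and netnews-style mailbox names (user.jane); adjust for your site.
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

SPOOL = "/var/spool/imap"   # site-specific assumption
WORKERS = 4                 # keep small: the job is disk-bound, not CPU-bound

def mailbox_dirs():
    """Yield (size_in_bytes, directory) for every mailbox directory."""
    for root, _dirs, files in os.walk(os.path.join(SPOOL, "user")):
        size = sum(os.path.getsize(os.path.join(root, f)) for f in files)
        yield size, root

def upgrade(directory):
    """Map the spool path back to a mailbox name and reindex it."""
    # e.g. /var/spool/imap/user/jane/Sent -> user.jane.Sent
    name = os.path.relpath(directory, SPOOL).replace(os.sep, ".")
    subprocess.run(["reconstruct", "-V", "max", name], check=False)

if __name__ == "__main__":
    # Largest mailboxes first, so the slowest upgrades start immediately.
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for _size, directory in sorted(mailbox_dirs(), reverse=True):
            pool.submit(upgrade, directory)

A small worker pool is deliberate: as the figures below confirm, the
reindex is bound by disk I/O rather than CPU, so piling on workers only
deepens the saturation Sebastian saw.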


The migration from 2.3 to 2.4 took about one year for our installation;
we converted roughly 200 TB of user data.
As a first step we spread the data across many nodes using Cyrus
replication. Next we started converting the nodes one by one on weekend
nights to minimize the I/O load felt by users.
In effect Cyrus reads all the data from disk to generate the new
indexes, so the conversion is limited mainly by disk I/O, while CPU is
pretty cheap nowadays. We saw a rate of around 500 GB per 8 hours for a
forced reindex at 100% disk load.
We started the forced reindex with the most active users, meanwhile
allowing other users to log in and trigger the reindex of their own
mailboxes.
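
To put that rate in perspective: 500 GB per 8 hours is 62.5 GB/hour
(about 17 MB/s sustained), so reindexing all 200 TB on a single node
would have taken on the order of 200,000 / 62.5 = 3,200 hours of
continuous 100% disk load, i.e. more than four months. Spreading the
data across many nodes first and converting them over a year is what
kept that load out of production hours.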


