De-duping attachments

Shuvam Misra shuvam.misra at merceworld.com
Tue Sep 14 23:45:13 EDT 2010


Dear Bron,

> http://www.newegg.com/Product/Product.aspx?Item=N82E16822148413
> 
> 2TB - US $109.

Don't want to nit-pick here, but the effective price we pay is about
ten times this. To set up a mail server with a few TB of disk space,
we usually land up deploying a separate chassis with RAID controllers and
a RAID array, with FC connections from servers, etc, etc.  All this adds
up to about $1,000/TB of usable space if you're using something like the
"low-end" IBM DS3400 box or Dell/EMC equivalent. This is even with
inexpensive 7200RPM SATA-II drives, not 15KRPM SAS drives.

    http://www-07.ibm.com/storage/in/disk/ds3000/ds3400/

And most of our customers actually double this cost because they keep two
physically identical chassis for redundancy. (We recommend this too,
because we can't trust a single RAID 5 array to withstand controller or
PSU failures.) In that case, it's $2,000/TB.
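
To make the arithmetic above concrete, here is a rough sketch. The per-TB figure is the one quoted in this message; the function name and the 4TB example are only illustrative, and real quotes will vary by vendor and region.

```python
# Approximate cost per usable TB for a low-end RAID chassis, as quoted above.
COST_PER_USABLE_TB = 1000  # USD


def storage_cost(usable_tb, mirrored_chassis=False):
    """Return the estimated hardware cost in USD for usable_tb of mail store.

    mirrored_chassis=True models the second, physically identical chassis
    kept for redundancy, which doubles the cost.
    """
    chassis_count = 2 if mirrored_chassis else 1
    return usable_tb * COST_PER_USABLE_TB * chassis_count


print(storage_cost(4))                         # single chassis, 4TB store
print(storage_cost(4, mirrored_chassis=True))  # two chassis, 4TB store
```

For a 4TB IMAP store this works out to $4,000 on one chassis and $8,000 on two.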

And you do reach 5-10 TB of email store quite rapidly --- our company
has many corporate clients (< 500 email users) whose IMAP store has
reached 4TB. No one wants to enforce disk quotas (corporate policy),
and most users don't want to delete emails on their own.

We keep hearing the logic that storage is cheap, and stories of cloud
storage through Amazon and unlimited mailboxes on Gmail are reinforcing
the belief. But at the ground level in mid-market corporate IT budgets,
storage costs in data centres (as against inside desktops) are still
too high to be trivial, and they have little to do with
the prices of raw SATA-II drives. A fully-loaded DS3400 costs a little
over $12,000 in India, with a full set of 1TB SATA-II drives from IBM,
but even with the high cost of IBM drives, the drives themselves contribute
less than 30% of the total cost.

If we really want to put our collective money where our mouth is, and
deliver the storage-is-cheap promise at the ground level, we need to
rearchitect every file server and IMAP server to work in map-reduce mode
and use disks inside desktops. Anyone game for this project? :)

> Now de-duping messages on copy is valuable, not so much because of
> the space it saves, but because of the IO it saves.  Copying the file
> around is expensive.
> 
> De-duping components of messages and then reconstructing?  Not so much.
> You'll be causing MORE IO in general looking for the message, finding the
> parts.

I agree. My aim was not to reduce IOPS but to cut disk space usage.
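
To illustrate the kind of space saving I have in mind, here is a minimal sketch that stores each distinct attachment once, keyed by a hash of its decoded content. It is an assumption-laden illustration, not how Cyrus stores messages: the function name `dedup_attachments`, the use of SHA-256, and the choice to de-dupe only parts marked as attachments are all hypothetical.

```python
import hashlib
from email import message_from_bytes


def dedup_attachments(raw_messages):
    """Keep one copy of each distinct attachment payload.

    raw_messages is an iterable of RFC 822 messages as bytes.
    Returns (store, total_bytes, unique_bytes), where store maps a
    SHA-256 hex digest to the decoded attachment bytes.
    """
    store = {}       # digest -> decoded attachment bytes
    total_bytes = 0  # size if every copy were kept separately
    for raw in raw_messages:
        msg = message_from_bytes(raw)
        for part in msg.walk():
            # Skip body text and multipart containers; only de-dupe
            # parts explicitly marked as attachments.
            if part.get_content_disposition() != "attachment":
                continue
            payload = part.get_payload(decode=True) or b""
            total_bytes += len(payload)
            digest = hashlib.sha256(payload).hexdigest()
            store.setdefault(digest, payload)
    unique_bytes = sum(len(p) for p in store.values())
    return store, total_bytes, unique_bytes
```

The difference between `total_bytes` and `unique_bytes` is the space saved. Reading a message back would need an extra lookup per attachment, which is the additional IO mentioned in the quoted text.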

There are two areas where we are seeing a huge increase in "inactive"
disk utilisation for emails. One is for the archive, which is being kept
for security and compliance reasons. Every company we work with wants an
archive with at least a few years' retention. They search the archive
every few weeks to trace "lost" emails, not for compliance reasons but to
find missing information. This means that we can't ask them to move the
data out to removable storage.

The second area is shared mail folders where all communication with each
client/topic/project is stored practically forever.

A 500-user company can easily acquire an email archive of 2-5TB. I don't
care how much the IO load of that archive server increases, but I'd like
to reduce disk space utilisation. If the customer can stick to 2TB of
space requirements, he can use a desktop with two 2TB drives in RAID
1, and get a real cheap archive server. If this figure reaches 3-4TB,
he goes into a separate RAID chassis --- the hardware cost goes up 5-10
times. These are tradeoffs a lot of small to mid-sized companies in my
market fuss about.
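
The threshold above can be sketched as a simple check. The 2TB limit comes from the RAID 1 desktop described in this message; the `dedup_saving` fraction is purely hypothetical, since the actual saving depends on how much duplicate data a given archive holds.

```python
DESKTOP_LIMIT_TB = 2  # usable space of two 2TB drives in RAID 1


def fits_on_desktop(archive_tb, dedup_saving=0.0):
    """Return True if the archive fits on the RAID 1 desktop after de-duping.

    dedup_saving is the fraction of space removed by de-duping
    (a hypothetical input, not a measured figure).
    """
    if not 0.0 <= dedup_saving < 1.0:
        raise ValueError("dedup_saving must be in [0, 1)")
    return archive_tb * (1.0 - dedup_saving) <= DESKTOP_LIMIT_TB


print(fits_on_desktop(3))        # a 3TB archive without de-duping
print(fits_on_desktop(3, 0.4))   # the same archive if de-duping saved 40%
```

If de-duping keeps the archive under the limit, the customer avoids the jump to a separate RAID chassis.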

And in a more generic context, I am seeing that all kinds of intelligent
de-duping of infrequently-accessed data is going to become the crying
need of every mid-sized and large company.  Data is growing too fast,
and no one wants to impose user discipline or data cleaning. When we
tell the business head "This is crazy!", he turns around and tells the
CTO "But disk space is cheap! Haven't you heard of Google? What are you
cribbing about? You must be doing something really inefficient here,
wasting money!"

thanks and regards,
Shuvam

