[Storage-research-list] Release of a MacOS server and FSLHomes deduplication datasets

Vasily Tarasov vtarasov at us.ibm.com
Wed Jul 9 11:04:28 EDT 2014


Announcement date: 2014 July 8

Dear All,

Stony Brook University, in collaboration with Harvey Mudd College and
EMC, is announcing the release of the MacOS and FSLHomes deduplication
datasets.

The MacOS dataset includes over 650 file-system snapshots collected on
the FSL MacOS production server. The dataset contains daily file
system scans from 2011 to 2014 and covers over 130TiB of data across
over a billion files. The actual anonymized, compressed dataset we are
releasing is 1.1TiB in size.

The FSLHomes dataset includes over 3,600 per-user file-system
snapshots collected at the File-systems and Storage Laboratory (FSL)
home-directory server. The dataset contains daily file system scans of
user home directories from 2011 to 2014 and covers over 530TiB of data
across 1.5 billion files. The actual anonymized, compressed dataset we
are releasing is 1.5TiB in size.

Along with rich metadata information, the snapshots include the hashes
of all chunks in the scanned files. Variable chunking with 2KiB, 4KiB,
8KiB, 16KiB, 32KiB, 64KiB, and 128KiB sizes was used during snapshot
collection. The anonymized traces can be used in a variety
of studies related to deduplication system performance, efficiency
analysis, and more.
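
As an illustration of the kind of efficiency analysis the chunk hashes
make possible, the following is a minimal sketch in Python that
computes a deduplication ratio (logical bytes divided by unique bytes)
from a stream of chunk records. The (hash, size) tuple layout, the
function name dedup_ratio, and the sample values are assumptions made
for this example only; they do not reflect the actual snapshot format,
which is read with the tools in the fs-hasher package.

```python
from typing import Iterable, Tuple


def dedup_ratio(chunks: Iterable[Tuple[str, int]]) -> float:
    """Return logical bytes / unique bytes for a stream of chunk records.

    Each record is a hypothetical (chunk_hash, chunk_size_in_bytes) pair;
    the real snapshots use their own on-disk format.
    """
    seen = set()   # chunk hashes encountered so far
    logical = 0    # bytes before deduplication
    unique = 0     # bytes that would actually be stored
    for chunk_hash, size in chunks:
        logical += size
        if chunk_hash not in seen:
            seen.add(chunk_hash)
            unique += size
    # An empty stream has nothing to deduplicate; report a neutral ratio.
    return logical / unique if unique else 1.0


# Made-up sample: five chunks, two of which repeat earlier hashes.
sample = [
    ("a1", 4096),
    ("b2", 8192),
    ("a1", 4096),
    ("c3", 2048),
    ("b2", 8192),
]

if __name__ == "__main__":
    print(f"deduplication ratio: {dedup_ratio(sample):.2f}")
```

Keeping every hash in an in-memory set is only practical for small
inputs; at the scale of these datasets a real study would likely need
an on-disk index or a probabilistic structure instead.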

Along with the datasets, we are releasing the fs-hasher software
package that was used to collect the snapshots; it includes tools
and examples for reading the snapshots.

Detailed information, the snapshots themselves, and accompanying
software can be found at:

http://tracer.filesystems.org/

If you have any questions or comments about these datasets, please
send them to:

fsltraces at fsl.cs.sunysb.edu


