Seminar: High Performance Large Scale Data Deduplication

Deepavali Bhagwat will give a talk on High Performance Large Scale Data Deduplication.

Abstract: There is a growing consensus that data deduplication is a valuable technique for reducing storage space requirements. Data deduplication identifies unique data and eliminates duplicate data. One method used to identify duplicate data in archival systems is chunking. When a file is added to the archive, it is split up into chunks and a signature of each chunk is extracted. An index containing the signatures of chunks already in the archive is queried to identify which chunks of the file do not need to be stored. As the archive grows, the index grows and its performance begins to suffer. It is not uncommon for every index access to cause a disk access, thereby slowing down the deduplication process. The index needs to be partitioned to accommodate the growing archive. To handle a high volume of data arriving at high speed, it is imperative that every index access be fast and that a few index accesses suffice. We show a technique for organizing the index to accomplish this goal. We present two variants representing a trade-off between the speed and quality of deduplication. Our technique is scalable: the ability to find duplicate data does not depend on the number of partitions. It is also fast; our experimental evaluation shows that with 128 partitions it is possible to identify, on average, over 90% of duplicate data with 5 index accesses, for a data set of files having 540 chunks.
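To illustrate the basic chunk-and-index approach the abstract describes (not the specific partitioned technique being presented), here is a minimal Python sketch. The fixed-size chunking, SHA-1 signatures, routing of a signature to a partition by its leading byte, and the 4 KB chunk size are all illustrative assumptions, not details from the talk.

import hashlib

CHUNK_SIZE = 4096        # assumed fixed-size chunks; real systems often use content-defined chunking
NUM_PARTITIONS = 128     # hypothetical partition count, echoing the abstract's example

# The index: one set of chunk signatures per partition.
index = [set() for _ in range(NUM_PARTITIONS)]


def chunk_signatures(data: bytes):
    """Split data into fixed-size chunks and yield a signature (SHA-1 digest) for each chunk."""
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        yield hashlib.sha1(chunk).digest()


def partition_of(signature: bytes) -> int:
    """Route a signature to a partition; here simply by its leading byte modulo the partition count (an assumption)."""
    return signature[0] % NUM_PARTITIONS


def deduplicate(data: bytes):
    """Return the signatures of chunks not yet in the archive, and add them to the index."""
    new_signatures = []
    for sig in chunk_signatures(data):
        partition = index[partition_of(sig)]
        if sig not in partition:      # duplicate chunks are skipped; only new chunks need storing
            partition.add(sig)
            new_signatures.append(sig)
    return new_signatures


if __name__ == "__main__":
    first = b"hello world" * 1000
    second = b"hello world" * 1000 + b"extra data"
    print(len(deduplicate(first)))    # all chunks of the first file are new
    print(len(deduplicate(second)))   # only the final, differing chunk is new

The point of partitioning in this sketch is that each lookup touches only one partition; the technique described in the talk organizes the index so that a small, bounded number of such accesses suffices even as the archive grows.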

This is joint work with Kave Eshghi (HP Labs), Darrell Long, and Witold Litwin.

When:
Wednesday, April 2, 2008 at 12:00 PM

Where:
E2-280

CRSS Contact:
Bhagwat, Deepavali

Streaming video is available for this event.
