HANDS: A Heuristically Arranged Non-Backup In-line Deduplication System

Published as Storage Systems Research Center Technical Report UCSC-SSRC-12-03.

Abstract

Deduplication is rarely used on primary storage because of the disk bottleneck problem, which results from the need to keep an index mapping chunks of data to hash values in memory in order to detect duplicate blocks. This index grows with the number of unique data blocks, creating a scalability problem, and at current prices the cost of additional RAM approaches the cost of the indexed disks. Thus, previously, deduplication ratios had to be over 45% to see any cost benefit. The HANDS technique that we introduce in this paper reduces the amount of in-memory index storage required by up to 99% while still achieving between 30% and 90% of the deduplication of a full memory-resident index, making primary deduplication cost-effective in workloads with a low deduplication rate. We achieve this by dynamically prefetching fingerprints from disk into an in-memory cache according to working sets derived from access patterns. We demonstrate the effectiveness of our approach using a simple neighborhood grouping that requires only timestamp and block number, making it suitable for a wide range of storage systems without the need to modify host file systems.

Publication date:
March 2012

Authors:
Avani Wildani
Ethan L. Miller
Ohad Rodeh

Projects:
Archival Storage
Deduplication
Prediction and Grouping

Available media

Full paper text: PDF

Bibtex entry

@techreport{wildani-ssrctr-12-03,
  author       = {Avani Wildani and Ethan L. Miller and Ohad Rodeh},
  title        = {{HANDS}: A Heuristically Arranged Non-Backup In-line Deduplication System},
  institution  = {University of California, Santa Cruz},
  number       = {UCSC-SSRC-12-03},
  month        = mar,
  year         = {2012},
}

Last modified 24 May 2019