Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup
Appeared in Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009).
Abstract
Data deduplication is an essential and critical component of backup systems. Essential, because it reduces storage space requirements; critical, because the performance of the entire backup operation depends on its throughput. Traditional backup workloads consist of large data streams with high locality, and existing deduplication techniques require this locality to provide reasonable throughput. We present Extreme Binning, a scalable deduplication technique for backup workloads made up of individual files with no locality among consecutive files in a given window of time. Because this locality is absent, existing techniques perform poorly on such workloads. Extreme Binning exploits file similarity instead of locality, and makes only one disk access for chunk lookup per file to maintain throughput. The backup system scales gracefully with the amount of data; more backup nodes can be added easily to boost throughput. In such a multi-node backup system, every file is allocated to exactly one node using a stateless routing algorithm, allowing for maximum parallelization. Each backup node is autonomous, with no dependencies across nodes, which makes data management tasks robust and low overhead.
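The "one disk access per file" and stateless routing both hinge on reducing each file to a single representative chunk ID; the paper uses the minimum of a file's chunk hashes, so that similar files tend to share a representative and land in the same bin on the same node. The Python sketch below is illustrative only: the fixed-size chunking and the mod-based node assignment are simplifying assumptions (the paper uses content-defined chunking), not the authors' exact implementation.

import hashlib

NUM_NODES = 4      # hypothetical cluster size
CHUNK_SIZE = 4096  # simplification: the paper uses variable-size, content-defined chunks

def chunk_ids(data: bytes):
    # Hash each chunk; fixed-size chunking stands in for content-defined chunking.
    return [hashlib.sha1(data[i:i + CHUNK_SIZE]).digest()
            for i in range(0, len(data), CHUNK_SIZE)]

def representative_id(ids):
    # The file's representative chunk ID: its minimum chunk hash.
    # Similar files are likely to share this minimum, so they dedupe together.
    return min(ids)

def route(rep_id: bytes, num_nodes: int = NUM_NODES) -> int:
    # Stateless routing: the destination node is a pure function of the
    # representative ID, so any front end can route a file without
    # consulting shared state or other nodes.
    return int.from_bytes(rep_id[:8], "big") % num_nodes

file_data = b"example file contents" * 1000
ids = chunk_ids(file_data)
node = route(representative_id(ids))
print(f"file with {len(ids)} chunks -> backup node {node}")

Because the routing function depends only on the file's own contents, two front ends processing the same file independently will send it to the same node, which is what allows nodes to remain autonomous with no cross-node dependencies.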
Publication date:
September 2009
Authors:
Deepavali Bhagwat
Kave Eshghi
Darrell D. E. Long
Mark Lillibridge
Projects:
Deduplication
Available media
Full paper text: PDF
Bibtex entry
@inproceedings{bhagwat-mascots09,
  author    = {Deepavali Bhagwat and Kave Eshghi and Darrell D. E. Long and Mark Lillibridge},
  title     = {Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup},
  booktitle = {Proceedings of the 17th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009)},
  month     = sep,
  year      = {2009},
}