Protecting Against Rare Event Failures in Archival Systems

Published as Storage Systems Research Center Technical Report UCSC-SSRC-09-03. Preliminary version of a paper that appeared in MASCOTS 2009.

Abstract

Digital archives are growing rapidly, necessitating stronger reliability measures than RAID to avoid data loss from device failure. Mirroring, a popular solution, is too expensive over time. We present a compromise solution that uses multi-level redundancy coding to reduce the probability of data loss from multiple simultaneous device failures. This approach handles small-scale failures of one or two devices efficiently while still allowing the system to survive rare-event, larger-scale failures of four or more devices. In our approach, each disk is split into a set of fixed-size disklets, which are used to construct reliability stripes. To protect against rare-event failures, reliability stripes are grouped into larger "uber-groups," each of which has a corresponding "uber-parity"; uber-parity is only used to recover data when disk failures overwhelm the redundancy in a single reliability stripe. Uber-parity can be stored on a variety of devices, such as NV-RAM and always-on disks, to offset write bottlenecks while still keeping the number of active devices low. Our calculations of failure probabilities found that the addition of uber-groups allowed the system to absorb many more disk failures without data loss. Through discrete event simulation, we found that adding uber-groups only negatively impacts performance when these groups need to be used for a rebuild. Since rebuilds using uber-parity occur very rarely, they minimally impact system performance over time. Finally, we showed that robustness against rare events can be achieved for under 5% of total system cost.
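The two-level idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain XOR parity at both levels, one parity disklet per reliability stripe, and an uber-parity that is the column-wise XOR of the data disklets across all stripes in an uber-group. The function names (`xor_blocks`, `stripe_parity`, `uber_parity`, `recover`) and the layout are hypothetical, chosen only to show how uber-parity comes into play once a single stripe's own redundancy is overwhelmed.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of a non-empty list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def stripe_parity(stripe):
    """Assumed stripe-level code: a single XOR parity over the stripe's data disklets."""
    return xor_blocks(stripe)


def uber_parity(stripes):
    """Assumed uber-level code: entry j is the XOR of the j-th data disklet of every stripe."""
    return [xor_blocks(list(column)) for column in zip(*stripes)]


def recover(stripes, parities, uber):
    """Repair lost disklets (marked None) in place; return True if all were recovered.

    Stripe parity is tried first; uber-parity is consulted in the same pass
    for any column that has exactly one missing disklet.
    """
    progress = True
    while progress:
        progress = False
        # Level 1: a stripe with exactly one lost disklet repairs itself.
        for i, stripe in enumerate(stripes):
            missing = [j for j, d in enumerate(stripe) if d is None]
            if len(missing) == 1:
                survivors = [d for d in stripe if d is not None]
                stripe[missing[0]] = xor_blocks(survivors + [parities[i]])
                progress = True
        # Level 2: uber-parity repairs a column with exactly one lost disklet.
        for j, u in enumerate(uber):
            column = [s[j] for s in stripes]
            missing = [i for i, d in enumerate(column) if d is None]
            if len(missing) == 1:
                survivors = [d for d in column if d is not None]
                stripes[missing[0]][j] = xor_blocks(survivors + [u])
                progress = True
    return all(d is not None for s in stripes for d in s)


# One uber-group of 3 stripes, each with 3 data disklets of 4 bytes.
original = [[bytes([i * 10 + j] * 4) for j in range(3)] for i in range(3)]
parities = [stripe_parity(s) for s in original]
uber = uber_parity(original)

# Two disklets lost in the same stripe: more than its single parity can handle.
damaged = [list(s) for s in original]
damaged[0][0] = damaged[0][1] = None
recovered = recover(damaged, parities, uber)
```

In this toy layout the uber-parity occupies as much space as one stripe's data, and it is read only when a stripe has lost more disklets than its own parity can rebuild. The redundancy codes, group sizes, and device placement evaluated in the report may differ from these assumptions.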

Publication date:
April 2009

Authors:
Avani Wildani
Thomas Schwarz
Ethan L. Miller
Darrell D. E. Long

Projects:
Archival Storage
Reliable Storage

Available media

Full paper text: PDF

Bibtex entry

@techreport{wildani-ssrctr0903,
  author       = {Avani Wildani and Thomas Schwarz and Ethan L. Miller and Darrell D. E. Long},
  title        = {Protecting Against Rare Event Failures in Archival Systems},
  institution  = {University of California, Santa Cruz},
  number       = {UCSC-SSRC-09-03},
  month        = apr,
  year         = {2009},
}
Last modified 5 Aug 2020