RESAR Storage: a System for Two-Failure Tolerant, Self-Adjusting Million Disk Storage Clusters
Speaker(s) : Pr. Darrell LONG (UCSC)
The demand for large-scale storage is greater than ever. The wide availability of broadband networking has made cloud-based storage a vibrant and growing market. Additionally, as we explore exascale high performance computing (HPC) systems with exabytes of data, power considerations become a significant factor. Most existing systems rely on replication to protect user data, maintaining as many as six copies. This high overhead leads to unnecessary costs in equipment, maintenance and energy. While storage appliances using erasure coding schemes are available, their long rebuild times and lack of continuity of service during rebuild make them unsuitable as building blocks for large-scale storage systems.
We present RESAR (Robust, Efficient, Scalable, Autonomous, Reliable) storage, a reliable distributed storage volume provider that scales to millions of drives. We implemented our system and tested it on a large-scale emulation platform called Megatux. Our results show that RESAR is capable of scaling to millions of drives, and its rebuild performance benefits from this scale by distributing the recovery across many disks. In our emulations, the work of rebuilding a one-terabyte hard drive was distributed across 400 disks and completed in less than four minutes with no interruption of service. With an annual durability of 99.999999% and a storage overhead cost of 20%, RESAR has great promise for both exascale HPC and cloud storage.
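As a rough illustration of the figures quoted above, the following back-of-the-envelope sketch works out what they imply per disk. It is not taken from the talk: it assumes decimal units (1 TB = 10**12 bytes), an even split of the rebuild work, and the full four minutes as an upper bound on rebuild time. It counts only the rebuilt data itself, not any additional reads needed to reconstruct it. The function names are hypothetical, and reading "six copies" as five extra copies (500% overhead) is likewise an assumption.

```python
# Back-of-the-envelope check of the rebuild and overhead figures in the abstract.
# Assumptions (not from the talk): decimal units, evenly split work, and
# four minutes taken as an upper bound on the rebuild time.

TB = 10**12  # bytes in one terabyte (decimal)
MB = 10**6   # bytes in one megabyte (decimal)


def per_disk_share(capacity_bytes, num_disks):
    """Bytes of rebuilt data each participating disk handles, if split evenly."""
    return capacity_bytes / num_disks


def min_aggregate_rate(capacity_bytes, seconds):
    """Lowest aggregate rate (bytes/s) that finishes within the time bound."""
    return capacity_bytes / seconds


def replication_overhead_percent(copies):
    """Extra storage, as a percentage of user data, when keeping `copies` replicas."""
    return (copies - 1) * 100


share = per_disk_share(TB, 400)             # bytes per disk
aggregate = min_aggregate_rate(TB, 4 * 60)  # bytes per second, all disks together
per_disk_rate = aggregate / 400             # bytes per second, per disk

print(f"share per disk:      {share / MB:.0f} MB")
print(f"aggregate rate:      at least {aggregate / MB:.0f} MB/s")
print(f"rate per disk:       at least {per_disk_rate / MB:.1f} MB/s")
print(f"six-copy overhead:   {replication_overhead_percent(6)}% (versus the 20% quoted for RESAR)")
```

Under these assumptions each of the 400 disks handles about 2.5 GB, so the per-disk load stays small even though the aggregate rebuild rate exceeds what a single drive could sustain.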
Joint work with my Ph.D. student Ignacio Corderí, Thomas Kroeger of Sandia National Laboratories and Thomas Schwarz of Universidad Católica del Uruguay.
Marc.Shapiro (at) lip6.fr