We present RESAR (Robust, Efficient, Scalable, Autonomous, Reliable) storage, a reliable distributed storage volume provider that scales to millions of drives. We implemented the system and tested it on Megatux, a large-scale emulation platform. Our results show that RESAR scales to millions of drives and that its rebuild performance benefits from this scale, because recovery work is distributed across many disks. In our emulations, the work of rebuilding a one-terabyte hard drive was distributed across 400 disks and completed in less than four minutes with no interruption of service. With an annual durability of 99.999999% and a storage overhead of 20%, RESAR shows great promise for both exascale HPC and cloud storage.
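To put the rebuild figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch under stated assumptions that do not come from the emulation itself: one terabyte is taken as 10^12 bytes, the work is assumed to be spread evenly over the 400 disks, and the rebuild is assumed to take the full four minutes. Because the reported time is less than four minutes, the rates are lower bounds. The calculation counts only the bytes of the rebuilt drive, not any additional reads needed to reconstruct them.

```python
# Back-of-the-envelope check of the rebuild figures quoted above.
# Assumptions (not from the source): 1 TB = 10**12 bytes, the work is
# spread evenly over all disks, and the rebuild takes the full 4 minutes.

DRIVE_BYTES = 10**12        # one-terabyte drive (decimal terabyte assumed)
DISKS = 400                 # disks sharing the rebuild work
REBUILD_SECONDS = 4 * 60    # upper bound on the reported rebuild time


def rebuild_estimate(drive_bytes, disks, seconds):
    """Return (bytes per disk, aggregate bytes/s, per-disk bytes/s)."""
    per_disk_bytes = drive_bytes / disks
    aggregate_rate = drive_bytes / seconds
    per_disk_rate = aggregate_rate / disks
    return per_disk_bytes, aggregate_rate, per_disk_rate


per_disk_bytes, aggregate_rate, per_disk_rate = rebuild_estimate(
    DRIVE_BYTES, DISKS, REBUILD_SECONDS
)

print(f"share per disk:      {per_disk_bytes / 1e9:.2f} GB")
print(f"aggregate rate (>=): {aggregate_rate / 1e9:.2f} GB/s")
print(f"per-disk rate (>=):  {per_disk_rate / 1e6:.2f} MB/s")
```

Under these assumptions each disk handles about 2.5 GB of the rebuild, at a per-disk rate of roughly 10 MB/s. That is a modest load for any single disk, which is consistent with the rebuild completing without interrupting service.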
Joint work with my Ph.D. student Ignacio Corderí, Thomas Kroeger of Sandia National Laboratories, and Thomas Schwarz of Universidad Católica del Uruguay.