THIS IS THE ARCHIVED SSRC SITE.
Maintained by Ethan L. Miller.
The current CRSS site is at https://www.crss.us/.

Scalable High-Performance QoS

This project is no longer active. Information is still available below.

Large-scale, high-performance storage systems are gaining momentum in data centers and high-performance computing systems. Quality of Service (QoS) will be an essential feature as storage systems scale out in capacity and in the number of clients they support, because QoS helps ensure high resource utilization and fair resource allocation among competing clients. Few existing QoS solutions are designed to work at a scale that can support millions of concurrent client accesses. This research project aims to design scalable QoS solutions for these large-scale, high-performance storage systems.
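
As a concrete illustration of per-client fairness (a generic sketch, not a design from this project), the Python snippet below throttles each client with a token bucket so that no single client can monopolize a shared storage device. All names, rates, and thresholds are hypothetical.

import time


class TokenBucket:
    """Refill at `rate` tokens per second up to `burst`; each I/O spends its size in tokens."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens (bytes) added per second
        self.burst = burst        # maximum tokens that can accumulate
        self.tokens = burst
        self.last = time.monotonic()

    def try_consume(self, cost):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False              # caller should queue or retry the request


# Hypothetical equal shares of a 200 MB/s device between two competing clients.
buckets = {"client_a": TokenBucket(rate=100e6, burst=10e6),
           "client_b": TokenBucket(rate=100e6, burst=10e6)}

def admit(client, request_bytes):
    """Admit the request only if the client still has budget; otherwise defer it."""
    return buckets[client].try_consume(request_bytes)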


The previous focus of this project was Automating Contention Management for High-Performance Storage Systems; see the Ascar project page for more information.


Status

The current focus of this project is automated performance enhancement through data layout optimization.

Large distributed storage systems, such as the high performance computing (HPC) systems used by national and international laboratories, must deliver sufficient performance and scale for demanding scientific workloads and must handle shifting workloads with ease. Ideally, data is placed in locations that optimize performance, but the size and complexity of large storage systems inhibit rapid, effective restructuring of data layouts to maintain performance as workloads shift.

To address these issues, we are developing Geomancy, a tool that models the placement of data within a distributed storage system and reacts to drops in performance. Using a combination of machine learning techniques suited to temporal modeling, Geomancy determines when and where bottlenecks may occur due to changing workloads and suggests changes to the layout that mitigate or prevent them. Our approach to optimizing throughput offers benefits for storage systems such as avoiding potential bottlenecks and increasing overall I/O throughput by 11% to 30%.
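
The Python sketch below illustrates this kind of feedback loop in miniature; it is not Geomancy's actual implementation. A simple moving average stands in for the temporal machine learning model, and the object names, locations, and threshold are hypothetical: recent throughput is recorded per (object, location) pair, and a move is suggested only when the predicted gain at another location clears a threshold.

from collections import defaultdict, deque


class LayoutOptimizer:
    """Toy layout tuner: records throughput history and suggests data moves."""

    def __init__(self, window=16, improvement_threshold=0.10):
        # Recent throughput samples (MB/s) per (object, location) pair.
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.threshold = improvement_threshold

    def record(self, obj, location, throughput_mbs):
        """Feed the loop with observed performance of an object at a location."""
        self.history[(obj, location)].append(throughput_mbs)

    def predict(self, obj, location):
        """Stand-in temporal model: a moving average of recent samples."""
        samples = self.history[(obj, location)]
        return sum(samples) / len(samples) if samples else 0.0

    def suggest_move(self, obj, current, candidates):
        """Return a better location if its predicted gain clears the threshold."""
        now = self.predict(obj, current)
        best = max(candidates, key=lambda loc: self.predict(obj, loc))
        if now and self.predict(obj, best) > now * (1 + self.threshold):
            return best
        return None


opt = LayoutOptimizer()
opt.record("dataset.h5", "ost03", 220.0)
opt.record("dataset.h5", "ost03", 90.0)   # throughput drops under a shifting workload
opt.record("dataset.h5", "ost07", 310.0)  # probe of an alternative placement target
print(opt.suggest_move("dataset.h5", "ost03", ["ost03", "ost07"]))  # -> ost07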



Last modified 19 Oct 2020