FAST '23 - Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems

Published: 02 March 2023
on channel: USENIX

Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems

Ruiming Lu, Shanghai Jiao Tong University; Erci Xu, Alibaba Inc. and Shanghai Jiao Tong University; Yiming Zhang, Xiamen University; Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, and Zongpeng Zhu, Alibaba Inc.; Guangtao Xue, Shanghai Jiao Tong University; Jiwu Shu, Xiamen University; Minglu Li, Shanghai Jiao Tong University and Zhejiang Normal University; Jiesheng Wu, Alibaba Inc.

Awarded Best Paper!
Deployed-Systems Paper

Newly emerging "fail-slow" failures plague both software and hardware: the victim components keep functioning, but with degraded performance. To address this problem, this paper presents Perseus, a practical fail-slow detection framework for storage devices. Perseus uses a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of individual drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the node-level 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, and use it to provide root cause analysis of fail-slow drives, covering ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.
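
The abstract does not spell out the regression model, so the following is only a generic illustration of what regression-based, per-drive slowness detection can look like, not Perseus's actual algorithm. It assumes each drive reports (throughput, latency) samples, fits a straight line to the samples of a drive's peers, and flags the drive when most of its latencies sit above the peers' fitted line plus a margin. The function names, the choice of features, and the thresholds `k` and `min_fraction` are all hypothetical.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def flag_slow_drives(samples, k=3.0, min_fraction=0.5):
    """Flag drives whose latency is unusually high for their throughput.

    samples: {drive_id: [(throughput, latency), ...]} (assumed input shape).
    k and min_fraction are illustrative thresholds, not values from the paper.
    """
    flagged = []
    for drive, points in samples.items():
        if not points:
            continue
        # Fit the latency-vs-throughput line on every other drive's samples.
        peers = [p for d, pts in samples.items() if d != drive for p in pts]
        xs = [x for x, _ in peers]
        ys = [y for _, y in peers]
        a, b = fit_line(xs, ys)
        # Spread of the peers around their own fitted line.
        sigma = pstdev([y - (a + b * x) for x, y in peers])
        # Count this drive's samples above the upper bound.
        over = sum(1 for x, y in points if y > a + b * x + k * sigma)
        if over / len(points) >= min_fraction:
            flagged.append(drive)
    return flagged


def demo_samples():
    """Synthetic data: three normal drives and one drive with 3x latency."""
    loads = [10, 20, 30, 40]
    base = lambda x: 1.0 + 0.05 * x
    return {
        "d0": [(x, base(x) + 0.02) for x in loads],
        "d1": [(x, base(x) - 0.02) for x in loads],
        "d2": [(x, base(x)) for x in loads],
        "d3": [(x, 3 * base(x)) for x in loads],
    }


print(flag_slow_drives(demo_samples()))
```

On this synthetic data only `d3` is flagged. Comparing a drive against its peers under the same load, rather than against a fixed latency limit, is what lets a check like this separate a slow drive from a busy one.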

View the full FAST '23 Technical Sessions at https://www.usenix.org/conference/fast23

