FAST '23 - Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems

Published: 02 March 2023
Channel: USENIX
1,366 views
10 likes

Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems

Ruiming Lu, Shanghai Jiao Tong University; Erci Xu, Alibaba Inc. and Shanghai Jiao Tong University; Yiming Zhang, Xiamen University; Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, and Zongpeng Zhu, Alibaba Inc.; Guangtao Xue, Shanghai Jiao Tong University; Jiwu Shu, Xiamen University; Minglu Li, Shanghai Jiao Tong University and Zhejiang Normal University; Jiesheng Wu, Alibaba Inc.

Awarded Best Paper!
Deployed-Systems Paper

The newly emerging "fail-slow" failures plague both software and hardware: the victim components still function, but with degraded performance. To address this problem, this paper presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of individual drives. Over 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the node-level 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis of fail-slow drives, covering ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.
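The abstract says only that Perseus uses a light regression-based model at drive granularity. As a rough illustration of that idea, and not the authors' actual pipeline, the sketch below fits one polynomial latency-versus-throughput curve over all drives in a node and flags drives whose latency sits above a robust upper bound for most of their samples. The choice of features, the MAD-based bound, and every name and threshold here (`flag_fail_slow`, `degree`, `k`, `min_fraction`) are assumptions made for the example.

```python
import numpy as np


def fit_latency_model(throughput, latency, degree=2):
    """Fit latency ~ poly(throughput) and estimate a robust residual scale.

    The scale uses the median absolute deviation (MAD), so a minority of
    slow drives does not inflate the bound. This is an illustrative choice.
    """
    coeffs = np.polyfit(throughput, latency, degree)
    residuals = latency - np.polyval(coeffs, throughput)
    mad = np.median(np.abs(residuals - np.median(residuals)))
    return coeffs, 1.4826 * mad  # 1.4826 scales MAD to a normal std. dev.


def flag_fail_slow(samples, degree=2, k=3.0, min_fraction=0.5):
    """Return the ids of drives that look fail-slow relative to their peers.

    samples: dict mapping a (hypothetical) drive id to a pair of equal-length
             arrays (throughput, latency) collected on the same node.
    A drive is flagged when at least `min_fraction` of its samples exceed
    the fitted curve plus `k` robust standard deviations.
    """
    all_tp = np.concatenate([tp for tp, _ in samples.values()])
    all_lat = np.concatenate([lat for _, lat in samples.values()])
    coeffs, sigma = fit_latency_model(all_tp, all_lat, degree)

    flagged = []
    for drive, (tp, lat) in samples.items():
        upper = np.polyval(coeffs, tp) + k * sigma
        if np.mean(lat > upper) >= min_fraction:
            flagged.append(drive)
    return flagged
```

Requiring a sustained fraction of slow samples, rather than reacting to a single spike, is meant to separate a persistently degraded drive from ordinary latency jitter. A production detector would also need to handle workload differences between drives, which this sketch ignores.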

View the full FAST '23 Technical Sessions at https://www.usenix.org/conference/fast23

