In a war room full of SREs and Incident Commanders you can almost hear the muffled cries: Why is it so slow? Where has the data gone? What happened to our deployments? Is this seriously not encrypted? Well, it’s because … you see … we forgot to set up backups. We forgot to set CPU hard limits. We didn’t test it beforehand. We should have gone canary.
This talk explores the core lessons from top-severity on-call incidents involving data loss, latency, degradation, under-scaling and under-configuration. We will start by analysing the root causes, then examine the automation measures we implemented, the alerting tools we had to invent and the capacity issues we had to overcome. By the end of this talk, we will understand from an SRE perspective why continuously improving the system design is the golden key to successful and reliable systems.
Read your postmortems a quest for SRE continuous improvement - Diana Todea - JOTB24. Video uploaded by J On The Beach on 27 May 2024.