How We Drained Every Backbone Router Simultaneously
Francois Richard, Meta
On October 4, 2021, we experienced a severe outage lasting approximately 6 hours.
Our engineering teams learned that commands issued during routine infrastructure maintenance changed the configuration of the backbone routers that coordinate network traffic between our data centers, and that these changes interrupted that traffic. The disruption had a cascading effect on how our data centers communicate, bringing our services to a halt.
In this presentation, we describe the chain of events that led to this situation and how the underlying cause of the outage also affected the internal tools and systems we use in our day-to-day operations. We also share our reflections after the event: how continuous validation of support structures, DR capabilities, tooling, and processes has helped us, and how we are thinking about the future.
View the full SREcon22 Europe/Middle East/Africa program at https://www.usenix.org/conference/sre...
Video of the SREcon22 Europe/Middle East/Africa talk "How We Drained Every Backbone Router Simultaneously", published by USENIX on December 15, 2022.