SREcon22 Europe/Middle East/Africa - How We Drained Every Backbone Router Simultaneously

Published: 15 December 2022
on channel: USENIX

How We Drained Every Backbone Router Simultaneously

Francois Richard, Meta

On October 4, 2021, we experienced a severe outage lasting approximately 6 hours.

Our engineering teams learned that commands issued during routine infrastructure maintenance changed the configuration of the backbone routers that coordinate network traffic between our data centers, and that these changes caused issues that interrupted this communication. The disruption to network traffic had a cascading effect on how our data centers communicate, bringing our services to a halt.

In this presentation, we describe the chain of events that led to this situation and how the underlying cause of the outage also impacted the internal tools and systems we use in our day-to-day operations. We also share our reflections after the event: how continuous validation of support structures, DR capabilities tooling, and processes has helped us, and how we are thinking about the future.

View the full SREcon22 Europe/Middle East/Africa program at https://www.usenix.org/conference/sre...

