A Facebook network engineer is having a very bad day. Total outages due to BGP SNAFUs are as old as BGP. This change appears to have broken internal routing as well. I am a bit dubious that on-site folks would have to deal with such a thing, though; there should be out-of-band connectivity that doesn't rely on the production network.



Cold-starting a Facebook cluster is somewhat challenging due to Facebook's heavy reliance on caching. When I worked there, we'd configure a "fallback cluster" for cache lookups during cold start. On a cache miss, a service would check the fallback cluster before hitting the database, and populate the local cache from whichever tier hit.
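That lookup path might be sketched like this. This is a minimal illustration, not Facebook's actual code; `local_cache`, `fallback_cache`, and `database` are stand-ins (here, plain dicts):

```python
def get(key, local_cache, fallback_cache, database):
    """Cold-start lookup: cold local cache, then the warm fallback
    cluster, then the database, warming the local cache on the way."""
    # 1. Normal path: the (cold) local cache.
    if key in local_cache:
        return local_cache[key]
    # 2. Cold-start path: ask the designated fallback cluster before
    #    hammering the database.
    if key in fallback_cache:
        value = fallback_cache[key]
    else:
        # 3. Last resort: the source of truth.
        value = database[key]
    # Populate the local cache from whichever tier hit.
    local_cache[key] = value
    return value
```

The point of the extra tier is that during cold start almost every request misses locally, and the fallback cluster absorbs that miss storm instead of the database.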

As far as I know, there was never a complete cold start test during the entire 4 years I was there.


Facebook outage 

Since Facebook doesn't use TTLs for the most part, you'd think they might be able to pick up where they left off when the network comes back. Even with the risk of cache consistency issues, they might well decide to try that, then do rolling cache flushes to clear out any inconsistencies.
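A rolling flush of that sort might look like the following hypothetical sketch (shards modeled as dicts; the delay parameter is made up):

```python
import time

def rolling_flush(shards, delay_s=0.0):
    """Clear cache shards one at a time, pausing between flushes so
    only one shard's worth of misses hits the database at once."""
    for shard in shards:
        shard.clear()        # drop possibly-inconsistent entries
        time.sleep(delay_s)  # let this shard re-warm before the next
```

The staggering is the whole trick: flushing everything at once would just recreate the cold-start problem.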

But if they have to do a complete cold start, it's going to require allowing traffic to "trickle" back in.


The most obvious machinery to use to slow-start is the same machinery they use to run experiments. Roll out an experiment that turns off Facebook entirely to 99% of users, then gradually ramp that number down.
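Percentage-based gating of the kind experiment frameworks use can be sketched like this (hypothetical; the hashing scheme is illustrative, not Facebook's):

```python
import hashlib

def in_blocked_bucket(user_id: str, blocked_percent: float) -> bool:
    """Deterministically assign a user to a bucket in [0, 100);
    users below `blocked_percent` get the 'Facebook is off' arm."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < blocked_percent
```

Because the assignment hashes the user ID rather than rolling a die per request, each user stays consistently in or out as the operator walks `blocked_percent` down from 99 toward 0.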


Of course, that assumes that the experiments machinery doesn't itself depend on a warm cache. If it does, they might need to start with gating by IP address, letting in one chunk of the Internet at a time.
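IP-based gating might be sketched like this (assumed names throughout; the multiplier just scrambles the order so each ramp step admits scattered /8s rather than numerically adjacent ones):

```python
import ipaddress

def is_admitted(addr: str, admitted_slices: int, total_slices: int = 256) -> bool:
    """Let in one /8-sized slice of IPv4 space at a time: an address
    is admitted once its (scrambled) slice has been opened."""
    first_octet = int(ipaddress.IPv4Address(addr)) >> 24
    # 167 is coprime with 256, so this permutes the slices.
    slice_id = (first_octet * 167) % total_slices
    return slice_id < admitted_slices
```

An operator would ramp `admitted_slices` from 0 up to 256, which admits traffic in roughly equal chunks without needing any warm application state.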

Facebook almost certainly already had a plan for how to cold start even before this happened. The question is how well that plan survives contact with the enemy.

R E T R O  S O C I A L

A social network for the 19A0s.