Cloudflare outage caused by BGP misconfiguration took 19 major data centers offline

Cloudflare, the CDN provider, has released a report on the network outage that occurred on June 21, 2022 (the afternoon of that day in Japan time).

The outage became apparent around 06:34 UTC, with effects reported on services such as Discord, Pixiv, and Notion in Japan, and recovery was generally seen between 08:00 and 09:00 UTC.

Cloudflare reports that the outage was caused by a BGP (Border Gateway Protocol) misconfiguration, which took down 19 of the company’s major data centers.

Here is an overview of the report.

BGP misconfiguration disconnects 19 data centers

Over the past year and a half, the company has been rolling out a new, more resilient architecture called “Multi-Colo PoP” (MCP) at 19 of its highest-traffic data centers. It adds an additional layer of routing that creates a mesh of connections within each of those data centers.

Using this mesh, parts of a data center’s internal network could easily be disabled or enabled, which made maintenance and troubleshooting easier.
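The report does not go into implementation detail here, but the basic idea can be illustrated with a minimal Python sketch (hypothetical names, not Cloudflare’s code): a full mesh of internal routers in which individual devices can be taken out of service while traffic keeps flowing over the remaining links.

```python
# Minimal sketch with hypothetical names (not Cloudflare's implementation):
# a data center's internal mesh where individual routers can be disabled
# for maintenance while the rest of the mesh stays interconnected.

from dataclasses import dataclass, field


@dataclass
class InternalMesh:
    """Full mesh of internal routers inside one data center."""
    routers: set[str]
    disabled: set[str] = field(default_factory=set)

    def disable(self, router: str) -> None:
        """Take one router out of service (e.g. for maintenance)."""
        self.disabled.add(router)

    def enable(self, router: str) -> None:
        self.disabled.discard(router)

    def active_paths(self) -> list[tuple[str, str]]:
        """Pairs of routers that can still exchange traffic directly."""
        active = sorted(self.routers - self.disabled)
        return [(a, b) for i, a in enumerate(active) for b in active[i + 1:]]


mesh = InternalMesh(routers={"spine1", "spine2", "spine3", "spine4"})
mesh.disable("spine2")      # maintenance on one device
print(mesh.active_paths())  # the remaining routers still form a mesh
```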

However, the failure is reported to have been caused by a mistake in the BGP configuration that controls routing between data centers in this new architecture. To quote that part of the report:

“While deploying a change to our prefix advertisement policies, a re-ordering of terms caused us to withdraw a critical subset of prefixes.”

The withdrawal of these prefixes is said to have caused a critical failure for the new architecture.
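The original report walks through the exact configuration change; purely as an illustration of the general mechanism (not Cloudflare’s actual configuration), the following Python sketch models first-match evaluation of policy terms and shows how moving a catch-all reject term ahead of the term that advertises a critical set of prefixes ends up withdrawing those prefixes.

```python
# Illustrative sketch only: a first-match prefix advertisement policy,
# showing how re-ordering terms can withdraw prefixes. The prefixes and
# term names are made up and are not Cloudflare's real configuration.

from ipaddress import ip_network


def evaluate(policy, prefix):
    """Return True if the prefix is advertised under first-match semantics."""
    for name, matcher, action in policy:
        if matcher(prefix):
            return action == "accept"
    return False


def is_site_local(p):
    # Hypothetical "critical" internal prefix range, for illustration only.
    return p.subnet_of(ip_network("10.0.0.0/8"))


def match_all(p):
    return True


# Intended order: advertise the critical prefixes, then reject everything else.
intended = [
    ("advertise-site-local", is_site_local, "accept"),
    ("reject-the-rest", match_all, "reject"),
]

# After a faulty re-ordering, the catch-all reject term is evaluated first,
# so the critical prefixes are never advertised, i.e. they are withdrawn.
reordered = [
    ("reject-the-rest", match_all, "reject"),
    ("advertise-site-local", is_site_local, "accept"),
]

prefix = ip_network("10.20.0.0/16")
print(evaluate(intended, prefix))   # True  -> advertised
print(evaluate(reordered, prefix))  # False -> withdrawn
```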

However, when this BGP change was first deployed at 03:56 UTC, it went to a location running the older architecture, so it did not immediately cause a failure.

The timeline, in chronological order, is as follows (all times UTC):

  • At 06:27, the rollout reached the MCP-enabled locations. As the change took effect, the prefixes were withdrawn and the 19 data centers went offline, causing the outage.
  • At 06:32, the failure was detected and an incident was declared inside Cloudflare.
  • At 06:51, the first change was made on a router to verify the root cause.
  • At 06:58, the root cause was identified and work began to revert the problematic change.
  • At 07:42, the last of the reverts was completed. One reason the work took so long was that a revert made by one network engineer went unnoticed by another, whose own change put the configuration back into the failed state (a sketch of one way to guard against such conflicts appears further below).

The incident was closed at 09:00 UTC.

Because the data centers affected by the failure could not be reached over the normal network, this recovery work is said to have been carried out using backup access methods prepared in advance for incident response.
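The report does not say what tooling Cloudflare will use to avoid the conflicting reverts mentioned in the timeline, but one common safeguard is a compare-and-set check on the configuration version, as in this hypothetical Python sketch: a change is refused if the configuration has already moved on since the engineer last read it.

```python
# Hypothetical sketch (not Cloudflare's tooling): a compare-and-set guard so
# that a second engineer's change is refused if the configuration has changed
# since they last read it, instead of silently undoing an earlier revert.

class ConflictError(Exception):
    pass


class RouterConfigStore:
    def __init__(self, config: str):
        self.config = config
        self.version = 1

    def read(self) -> tuple[str, int]:
        return self.config, self.version

    def apply(self, new_config: str, expected_version: int) -> int:
        """Apply a change only if nobody else has changed the config in between."""
        if self.version != expected_version:
            raise ConflictError(
                f"config is at version {self.version}, expected {expected_version}"
            )
        self.config = new_config
        self.version += 1
        return self.version


store = RouterConfigStore("bad-policy")

# Engineer A reads the config and reverts it.
_, seen_by_a = store.read()
store.apply("good-policy", expected_version=seen_by_a)

# Engineer B, working from a stale read, tries to "revert" as well and is
# refused instead of reinstating the broken state.
try:
    store.apply("bad-policy", expected_version=seen_by_a)
except ConflictError as e:
    print("rejected:", e)
```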

If you are interested, please refer to the original report for details on what kind of misconfiguration caused the failure.

Working to improve automation of testing and rollbacks

In response to this incident, in which an architecture designed to be resilient to failure itself failed, the company says it will review its test and deployment processes for MCP, revisit the architecture, and improve the automation of rollbacks. The report concludes as follows:

“We are deeply sorry for the disruption to our customers and to all the users who were unable to access Internet properties during the outage. We have already started working on the changes outlined above and will continue our diligence to ensure this cannot happen again.”
