On Sunday, August 30, 2020, it all started with a simple question: “What’s happening?”
At around 10:00 UTC, the global Internet started experiencing a very specific connectivity problem: inside the network of one of the largest Tier-1 operators in the world, CenturyLink (primary AS3356), something was clearly going wrong.
At 17:00 UTC (13:00 EDT), a message was received stating:
“Summary: On August 30, 2020 10:04 GMT, CenturyLink identified an issue to be affecting users across multiple markets. The IP Network Operations Center (NOC) was engaged, and initial research identified that an offending flowspec announcement prevented Border Gateway Protocol (BGP) from establishing across multiple elements throughout the CenturyLink Network. The IP NOC deployed a global configuration change to block the offending flowspec announcement, which allowed BGP to begin to correctly establish. As the change propagated through the network, the IP NOC observed all associated service affecting alarms clearing and services returning to a stable state.”
So, what exactly happened to CenturyLink, one of the largest Tier-1 ISPs in the world and, according to Qrator.Radar’s upcoming National Internet Segments Reliability Research, the most critical autonomous system in the United States?
To answer that question properly, we should probably wait for a more detailed RCA of this incident from CenturyLink.
Still, we can analyze the aftermath from the perspective we have.
“Successful traceroutes through AS3356, AS209 and AS3549 drop precipitously as world's largest telecom network suffers outage this morning” - Oracle InternetIntelligence
At around 10:00 UTC, as (allegedly) correctly outlined by CenturyLink in what is (allegedly) its support ticket, the autonomous systems of CenturyLink and its related businesses (Level3) started dropping their internal BGP sessions, which disrupted the propagation of routing information across their network.
Qrator.Radar at first saw a fluctuation in sessions established with our BGP collectors:
Shortly afterwards, we had a more detailed picture of what percentage of regional sessions with our BGP collectors had been dropped:
United States: 49%
United Kingdom: 36%
South Africa: 25%
As you can see, in this case a malfunction in one particular network led to a significant drop in established sessions, which is confirmed by CenturyLink itself.
We do not want to speculate on how exactly an “offending flowspec announcement” affected BGP reflectors. These days flowspec is used for many things, including packet filtering rules and RTBH, so in theory too many things could have collided, and only CenturyLink knows which one it was.
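The regional percentages above are presumably the share of collector sessions lost between two snapshots. A minimal sketch of that calculation follows; the function name, the region labels, and the session counts are hypothetical and chosen only to illustrate the arithmetic, they are not Qrator.Radar data.

```python
def dropped_share(before: int, after: int) -> float:
    """Percentage of sessions lost between two snapshots."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Hypothetical session counts per region: (before, after).
# These numbers are illustrative only, not measured data.
sessions = {"Region A": (200, 102), "Region B": (50, 32)}

shares = {
    region: round(dropped_share(before, after))
    for region, (before, after) in sessions.items()
}
print(shares)
```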
The prefixes announced by AS3356 looked as follows during the incident:
As you can see, announcements dropped roughly 8x, from 13478 at 9:56 UTC to 1691 at 10:11 UTC.
At the same time, in the United States alone, the overall number of announcements dropped by about 30,000:
In other words, announcements from US-based networks (according to RipeDB) fell by about 30k during the CenturyLink incident, while AS3356 itself accounted for only about 12k of them.
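As a quick sanity check, the AS3356 figures quoted above can be worked through directly. The sketch below uses only the two prefix counts from the text; the variable names are ours.

```python
# AS3356 announced prefixes, as quoted in the text:
# 13478 at 9:56 UTC and 1691 at 10:11 UTC.
before, after = 13478, 1691

lost = before - after            # absolute number of prefixes withdrawn
ratio = before / after           # how many times smaller the table became
lost_pct = 100 * lost / before   # share of prefixes that disappeared

print(f"lost {lost} prefixes ({lost_pct:.1f}%), a {ratio:.2f}x drop")
```

This gives 11787 lost prefixes, which matches the "about 12k" figure, and a reduction factor close to 8.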
A number of Internet companies reportedly felt the effect of this incident: Cloudflare, Twitter, Discord and probably more. Despite so much circumstantial evidence, this particular incident was obscure from the outside: rather than the typical connection drops associated with BGP problems, there was simply no connectivity at all.
Among many reports, there was one that caught our attention: “Chess Olympiad: India and Russia both get gold after controversial final.” According to the BBC article, both players lost their connection in the final round, at the time of the incident.
Due to its size, the CenturyLink network is nowadays essential to the Internet. As we saw, it is the most critical autonomous system in the United States, and it is at least partly responsible for the US region regaining 10 positions in the 2020 reliability Top-20 rating, which we will present soon. Beyond that, networks all over the world rely on CenturyLink to provide them with good connectivity and latency for serving content to users.