Cloudflare have been having a rough couple of weeks! Last week significant chunks of the Internet were effectively knocked offline when large parts of Cloudflare IP space (and that of other autonomous systems) were routed to a non-transit network belonging to a “specialty metals company” in Pennsylvania. This was ultimately due to a cockup at a small ISP, whose leaked routes should have been filtered by Verizon - but weren’t. You can read more about it here.
Clearly this was no fault of Cloudflare’s, but unfortunately yesterday they experienced another massive outage, this time due to a bad software deployment. This affected a huge number of their services for about half an hour. An initial placeholder writeup on this is here.
But this got me thinking - how much of the Internet nowadays relies on individual companies in this way? And what potential problems does this bring?
Old School Internet
Before the advent of cloud-based services, organisations would host their Internet-facing services in their own datacentres or colo facilities. This had its own significant drawbacks in comparison to cloud-based computing, but one thing it did mean was that access to these resources depended largely on the ISP they used and on their own infrastructure - both of which were relatively distributed throughout the globe (to an extent anyway).
Sure, if an ISP had major issues it could wipe out a lot of sites and we’ve always had an ultimate reliance on a relatively small number of tier 1 ISPs, but on the whole issues were usually relatively confined in terms of global impact.
In a shift to cloud-based services, access to these distributed Internet resources has now become somewhat “centralised”. I use speech-marks because clearly from a physical and geographical perspective these cloud-based services are not remotely “central” - they are completely distributed, but they are ultimately part of the same company. And as we saw from Cloudflare’s issues yesterday, things can go drastically wrong for the entire infrastructure.
Think about some of the main providers that spring to mind when you talk about “the cloud”. AWS, Azure, GCP, Cloudflare, Akamai etc. A huge number of organisations and the applications they provide flow through these in some way, and when they go wrong, they impact a massive number of organisations.
Look at Cloudflare alone - Bleeping Computer reported that they have more than 16 million websites using their services. That is a significant chunk of the Internet to go down for half an hour (caveat - I don’t know how many of these were affected yesterday)!
Multi-Cloud To The Rescue?
Ok, so surely the answer to these problems is to take a “multi-cloud approach” right? For example, don’t just rely on the likes of AWS - have a backup/DR plan in place that perhaps moves services to Azure or GCP. Maybe. But maybe not…
Clearly this multi-cloud strategy is a good approach and can protect against a number of different failures, but let’s look at this from the perspective of a worldwide outage like we saw yesterday. Hopefully these are rare, but as we’ve seen they can and do happen. What if we saw this with the likes of AWS?
Let’s just say AWS had a significant global outage and everyone was using GCP or Azure as a multi-cloud backup strategy. Would GCP and Azure cope with this sudden increase in demand? I mean clearly, they have significant amounts of spare capacity to be able to provide the services that they do, but is it enough capacity to cope with such a sudden and drastic increase? Maybe. I genuinely don’t know! I’m just thinking out loud here. Whilst multi-cloud approaches may mitigate a number of risks, they perhaps wouldn’t help if such a scenario ever did become reality.
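To make that thought experiment a bit more concrete, here is a minimal back-of-the-envelope sketch. The numbers are entirely made up for illustration (I have no idea what the real capacity figures are), and the `can_absorb` helper is hypothetical. It simply asks whether the combined spare capacity of the surviving providers covers the load that the failed provider was carrying.

```python
def can_absorb(failed_load, survivors):
    """Check whether surviving providers have enough spare capacity.

    failed_load: load carried by the provider that went down.
    survivors:   list of (capacity, current_load) tuples.
    All values are in the same arbitrary units.
    """
    spare = sum(capacity - load for capacity, load in survivors)
    return spare >= failed_load, spare


# Hypothetical figures only, NOT real provider data.
survivors = [(100, 70), (60, 45)]
ok, spare = can_absorb(failed_load=80, survivors=survivors)
print(f"spare capacity: {spare}, enough to absorb failover: {ok}")
```

With these invented figures the survivors fall short, which is the worry: a failover plan that works for one customer may not work when everyone triggers it at once.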
“But A Global Outage Will Not Happen”
Well hopefully not, but they clearly can, as we have seen. Just look at yesterday. Sometimes “shit happens” no matter how hard you try to prevent or prepare for it. I’ve no doubt that Cloudflare have robust systems in place and do things pretty well, but they still suffered a major failure yesterday. They are usually pretty transparent post-issue as well, so we will likely see a detailed write-up in the coming days. I’d praise them for obviously having a good rollback plan as well - whilst 30 mins is a long time, it could have been a lot worse without the right tools in place to detect and react.
Clearly how the cloud infrastructure is divided, how code updates are rolled out, how good security is, how well-trained staff are, how much testing and monitoring is carried out - all these things play a part in how likely these kinds of scenarios are. But these things are possible no matter how unlikely - whether it is mass hardware defects, bad software deployments, human error or malicious attacks.
We are in a situation nowadays where large portions of the Internet are dependent on a handful of individual organisations for their global presence. I’m not saying that this is a bad thing - cloud computing has brought with it huge benefits that have enabled numerous organisations to exist that may otherwise not have been able to. But this is clearly one of the downsides to the new world we find ourselves in.
I guess time will tell how this all pans out.
2019-07-03 01:00 +0100