Cloudflare outage, broken fiber, and IPv6 routing error cap bad week for IoT stability

The IoT is built, as the name suggests, entirely on the foundation that is the internet. Because of this, we spend next to no time wondering how the IoT would function if there were no internet, and yet, for a few hours this week, sysadmins and engineers were left pondering that particular question, after a trio of errors cropped up in quick succession.

The most pressing was a bad software deployment from Cloudflare, a provider of content delivery networks (CDNs), DDoS mitigation, caching, and DNS services, which led to outages across its network. As a rather large number of major web services rely on Cloudflare, the failure cascaded into outages across popular websites and applications.

Essentially, an error in the code for a new software deployment caused CPU utilization on Cloudflare’s servers to spike to 100%, apparently the work of a pathological regular expression that had slipped through the automated testing process. This caused Cloudflare processes to fail, which in turn took down the myriad web services that rely on them.

But this highlights just how precarious the internet really is. What looks like a single error in a bog-standard software update led to global outages, because one company is relied upon by so many others. This was not even a targeted malicious attack, simply a deployment error that had been run through what Cloudflare thought was a sufficiently thorough testing process.

So if this was not the work of a bad actor – the damage such actors can do was amply demonstrated by the Dyn DNS attack, which had a far worse effect on global internet access and stability – then it illustrates another problem that is only going to become more apparent in the IoT: automated processes.

If each of the many billions of IoT devices deployed annually required only ten minutes of human coding time, the demand would far outstrip the supply of coding expertise in the labor market. There are simply not enough programmers to pay individual attention to each device, and so, naturally, batch rollouts and automation have taken hold.
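
A back-of-envelope calculation makes the point. The figures below are illustrative assumptions, not numbers from any shipment report:

```python
# Illustrative assumptions only - the article gives no exact figures.
devices_per_year = 5_000_000_000   # assumed "many-billion" annual deployments
minutes_per_device = 10            # the ten minutes of human coding time
work_hours_per_year = 2_000        # rough full-time programmer-year

total_hours = devices_per_year * minutes_per_device / 60
programmer_years = total_hours / work_hours_per_year
print(f'{programmer_years:,.0f} full-time programmer-years needed annually')
```

Even under these modest assumptions, the answer runs to hundreds of thousands of dedicated programmer-years per year, which is exactly why per-device human attention was never on the table.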

For simpler IoT devices, say a connected light, there might only be a few hundred kilobytes of code to examine, but the total quickly climbs by an order of magnitude or more when you look at the code needed for cameras, hubs, or edge processors. The most complex connected devices are running gigabytes’ worth of code, and those simply cannot be put through human review.

As such, an automated filtering and checking tool has much better odds of not slipping up on the simpler devices than on the millions of lines of code that power something like a connected smart grid asset. Automation could then free humans to review the more complex devices, or at least critical segments of their code, or perhaps to supervise a more sophisticated suite of software testing tools.

But this process evidently failed at Cloudflare, and similar review processes are going to fail in the enterprise world, where most IT departments are already under-funded and over-worked. It becomes a question of when, not if, you cause a crippling customer outage – and the likely scale of that outage only grows as the business does.

Calls for a fundamental rethinking of how the IT role is supported have largely fallen on deaf ears, and that apparent corporate reluctance is only going to get worse as the issue of integrating the IT department more closely with the OT (Operational Technology) department becomes more pressing. As more of a business’ systems become connected, there is a bigger driver to tie those systems in with the rest of the IT functions, to take advantage of all those ‘synergies’ that have been promised by the most evangelical consulting firms.

So then, it seems that the enterprise world is setting itself up for disaster if it does not start channeling investment into this IT-OT integration, and more importantly, into the tools needed to make up for the shortage of specialized workers for the task. Promises of seamless upgrades will be used to batter reputations after high-profile failures. If Cloudflare can mess this up, many more enterprises are going to push patchy code that causes significant pain.

Cloudflare reckons it fixed the problem within a few hours. Issues had been mounting before a status update at 1352 UTC acknowledged that Cloudflare was aware of problems; a fix was deployed at 1415, and the all-clear was given at 1457. A placeholder report will apparently be replaced by a full post-mortem, which should provide some gory details.

As it stands, Cloudflare says that the “cause of this outage was deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules,” which was intended to improve the blocking of inline JavaScript that is used in attacks.

Cloudflare was running the new rules in a simulated mode, where they weren’t blocking traffic, in order to examine how they fared in terms of false positives. “Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%.”
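
The full post-mortem was still pending at the time of writing, but a single regular expression driving CPU to 100% is the classic signature of catastrophic backtracking. A minimal sketch of that failure class, using a generic pathological pattern rather than Cloudflare’s actual WAF rule:

```python
import re
import time

# Illustrative only - not Cloudflare's rule. The nested unbounded
# quantifiers in (a+)+ give the backtracking engine exponentially many
# ways to split the input, so a near-miss string pegs the CPU.
pattern = re.compile(r'(a+)+$')

def time_match(s: str) -> float:
    start = time.perf_counter()
    pattern.match(s)
    return time.perf_counter() - start

fast = time_match('a' * 22)        # matches: returns almost instantly
slow = time_match('a' * 22 + 'b')  # near-miss: exponential backtracking
print(f'match: {fast:.6f}s  near-miss: {slow:.6f}s')
```

Each extra ‘a’ in the near-miss input roughly doubles the running time, which is why a pattern that sails through testing on friendly inputs can still melt production CPUs.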

But it doesn’t have to be code that brings down a network. Google suffered a nine-hour outage in one of its US East Coast regions after fiber optic cabling was damaged, taking down many systems running in GCP us-east1. There’s still no word on what caused the damage to the “multiple concurrent fiber bundles serving network paths in us-east1”.

And Cloudflare was involved in another outage, prior to its more high-profile failure, in which Verizon propagated a packet routing error that took down some major web platforms. A small Pennsylvania ISP, DQE Communications, wrongly announced new routes for about 20,000 IP address prefixes via the Border Gateway Protocol (BGP), and Verizon rubber-stamped those announcements, sending traffic that should have flowed through major data centers into the small ISP instead. The ISP performed as expected – it was hammered.
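
The mechanics come down to BGP’s longest-prefix-match rule: when a leaked route is more specific (a longer prefix) than the legitimate one, routers prefer it. A sketch of that rule using Python’s `ipaddress` module, with made-up documentation prefixes rather than the routes actually involved in the incident:

```python
import ipaddress

# Hypothetical routing table: a legitimate /22 aggregate alongside a
# leaked, more-specific /24 nested inside it. These are illustrative
# prefixes, not the real Verizon/DQE routes.
routes = {
    ipaddress.ip_network('203.0.112.0/22'): 'legitimate transit provider',
    ipaddress.ip_network('203.0.113.0/24'): 'leaked more-specific route',
}

def best_route(addr: str) -> str:
    """Longest-prefix match: the most specific covering prefix wins."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in routes if ip in net]
    return routes[max(matches, key=lambda net: net.prefixlen)]

print(best_route('203.0.113.50'))   # covered by both; the /24 wins
print(best_route('203.0.112.50'))   # only the /22 covers this address
```

Because the leaked routes were more specific, every router that accepted them dutifully funneled traffic toward DQE – no malice required, just the protocol doing what it was designed to do.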

But the final, and perhaps most potentially dangerous, internet routing error to crop up in the past week was announced to the world by Cloudflare’s director of network engineering. One of Bharti Airtel’s data centers had announced the IPv6 block 2400::/12, when it should have announced 2400::/127. The latter covers just two IP addresses, while the former spans some 83 decillion. As The Register points out, this shows both how powerful and expansive IPv6 is compared to IPv4, and how little it is currently used, given that the error was live for a week without anyone spotting it. Imagine that mistake being made in a heavily populated IPv6 world.
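
The scale of that fat-finger is easy to check with Python’s `ipaddress` module: a /12 leaves 116 free host bits, so it covers 2^116 addresses.

```python
import ipaddress

# The mistyped /12 announcement versus the intended /127.
announced = ipaddress.ip_network('2400::/12')
intended = ipaddress.ip_network('2400::/127')

print(f'2400::/12  -> {announced.num_addresses:.2e} addresses')  # 2**116
print(f'2400::/127 -> {intended.num_addresses} addresses')       # just 2
```

That 2^116 works out to roughly 8.3 x 10^34 – the 83 decillion figure – against exactly two addresses for the intended /127.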