A newly unveiled security flaw in Intel’s CPU designs has sent programmers into a frenzy as they rush to patch the Meltdown kernel memory leak in Linux and Windows environments. The bug would let user applications discern supposedly protected virtual kernel memory, which could let attackers read its stored contents, like passwords or encryption keys. The patches to fix the VM-to-VM vulnerabilities could inflict a 30% performance hit on those CPUs and could permanently damage Intel’s reputation, as the flaw means that one cloud application could read the contents of other instances running on the same CPU.
It’s worth remembering Intel’s ongoing issues with its bug-riddled DOCSIS Puma 6 SoC, whose jitter problems put its contract at Arris, where it is deployed on some 135 million devices, on the line last year. This time around, Intel’s cloud business hangs in the balance, as the flaw is a major burden for any cloud workload and means purchasing more CPUs to make up for the anticipated throughput shortfall. While most Android and other Linux-based systems are ARM-based, some use Intel chips, which as far as we know are currently the only ones affected by Meltdown. The chip firm can therefore expect some disgruntled customers in the video space threatening large-scale switch-outs.
Since the kernel-leak bug, now referred to as Meltdown, was publicly revealed, another bug affecting Intel, AMD, and ARM processors has been announced to the world. This new one is called Spectre, and is present in Intel’s Haswell Xeon family, AMD’s FX, Pro, and Ryzen families, and in a wide range of ARM CPUs, including Qualcomm’s new Snapdragon 845. Spectre allows user applications to pull protected information from other processes on the CPU, and will also threaten multi-tenant cloud instances. It seems much more difficult to exploit than Meltdown, but is in turn much trickier to patch.
The Meltdown flaw appears fixable only at the operating system level, meaning that Intel can’t remotely flash the chips to solve the problem. Details of the exploit are still embargoed, but the fix requires the implementation of Kernel Page Table Isolation (KPTI), which will inflict that expected slowdown. It appears that KPTI is needed to fix the error in the handoff when the CPU switches between its user and kernel modes (during system calls). It does so by moving the kernel into an entirely separate memory address space, hiding kernel memory from user code altogether.
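The separation that KPTI introduces can be pictured with a toy model. The Python sketch below is an illustration only, not how any kernel implements it: the `Mapping` class, the `lookup` helper, and every address are invented for the example, and real page tables are multi-level hardware structures. It shows the one idea that matters here: before KPTI, the table active in user mode still lists kernel pages (marked as forbidden), while with KPTI those entries are simply absent.

```python
# Toy model only: the names and addresses below are hypothetical,
# chosen for illustration, and do not come from any real kernel.
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    physical: int          # physical frame the virtual page points at
    supervisor_only: bool  # True if only kernel mode may touch the page


USER_PAGE, KERNEL_PAGE = 0x1000, 0xFFFF0000

# Without KPTI: one table per process, holding user AND kernel pages.
shared_table = {
    USER_PAGE: Mapping(0x2000, supervisor_only=False),
    KERNEL_PAGE: Mapping(0x9000, supervisor_only=True),
}

# With KPTI: the table active in user mode omits the kernel pages.
# (The small entry/exit stub that must stay mapped is left out of this toy.)
kpti_user_table = {v: m for v, m in shared_table.items() if not m.supervisor_only}
kpti_kernel_table = dict(shared_table)  # used only after entering the kernel


def lookup(table, virtual, kernel_mode):
    """Return the physical address, or None where real hardware would fault."""
    mapping = table.get(virtual)
    if mapping is None:
        return None  # no translation exists in this table at all
    if mapping.supervisor_only and not kernel_mode:
        return None  # translation exists but is forbidden in user mode
    return mapping.physical
```

In both setups a user-mode lookup of `KERNEL_PAGE` is refused; the difference is that under the KPTI tables there is no kernel translation present in user mode to begin with, at the price of swapping tables on every system call.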
The extra processing overhead of this KPTI implementation varies with the workload, because every jump between those separate address spaces means dumping cached address translations. Many expect a hit ranging from 5% to 30%, with a worst-case benchmark from GRSecurity finding a 63% hit on an i7-6700 CPU. Newer Intel processors (those with Process-Context Identifiers, or PCID, in the 8000 Series) might see a lessened impact, however.
Intel’s reputation in the cloud computing market could be in tatters. The cloud providers will all be rushing patches through to protect their own customers, who will be running applications and even entire businesses on Intel-powered servers. As it stands, these applications could be attacked by other applications that share those same virtualized cloud instances, and accordingly, the providers are scheduling maintenance to correct this.
But this could mean that the likes of AWS would have to purchase more servers to make up for this sudden shortfall in compute power – and they aren’t going to jump for joy at the idea of buying these new resources from a company that has apparently dropped the ball so impressively. However, Intel’s main rivals in the data center, AMD and ARM, aren’t yet positioned to drive a stake into Intel’s heart today – but had this been revealed a few quarters into the future, this could be an entirely different story.
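To put rough numbers on that shortfall, the short Python sketch below converts a slowdown into the extra capacity needed to get back to the original throughput. It is a back-of-the-envelope calculation, and the function name `extra_capacity_needed` is made up for this example. It assumes throughput scales linearly with server count and that every server slows by the same fraction, neither of which a real fleet will match exactly.

```python
def extra_capacity_needed(slowdown):
    """Fraction of additional servers needed to restore original throughput.

    Assumes (simplistically) linear scaling: each server now delivers
    (1 - slowdown) of its old throughput, so the fleet must grow by a
    factor of 1 / (1 - slowdown).
    """
    if not 0 <= slowdown < 1:
        raise ValueError("slowdown must be in [0, 1)")
    return 1 / (1 - slowdown) - 1


# The 5%, 30% and 63% figures are the ones quoted in the article.
for hit in (0.05, 0.30, 0.63):
    print(f"{hit:.0%} hit -> {extra_capacity_needed(hit):.1%} more servers")
```

Under these assumptions a 30% hit calls for roughly 43% more servers, not 30%, because the replacement servers are slowed by the patch as well.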
This comes after Intel said it had become a 50:50 company, where half of its revenue came from its traditional business lines and the other half from its new bets. If this bug ends up denting the core revenues, it will increase the pressure on Intel to ensure that its new enterprises, which include the IoT and automotive, begin churning out cash – as the core CPU lines are being threatened by both an AMD x86 resurgence and the increasing popularity of ARM-based server options.
The KPTI Meltdown fix looks like it will be a burden for all upcoming Intel CPUs, and so a problem for all designs going forward. OS developers will get better at working around the KPTI requirement, meaning that the 30% hit will ease with time, but Intel’s benchmarks for anything requiring access to virtual memory will need re-testing, and that is going to severely tarnish its previously gleaming reputation.
Currently, the OS manages a series of page tables (or arrays) that describe the link between the physical RAM and the virtual memory, an abstraction developed to improve robustness by preventing memory errors from crashing the entire system. Until now, it was thought that those tables were secure, but the attack has shown that you can manipulate one cached table to infer the contents of another, by inferring the contents of the Memory Management Unit (MMU). This would allow one application to calculate things like passwords and encryption keys, even if it was running on a separate virtualized instance on that shared CPU, a pretty common configuration on cloud platforms.
Python Sweetness (a blog which was one of the first to cover what came to be known as Meltdown) did some digging and posits that the fix is a response to a new type of memory attack that was unveiled in December, a new variant of the Rowhammer attack that is used to manipulate virtual memory in RAM. Its favorite guess is that we will soon be told of “the mother of all hypervisor privilege escalation bugs, or something similarly systematic as to drive so much urgency.” Python Sweetness signs off with a warning: “invest in popcorn, 2018 is going to be fun.”
According to Thomas Lendacky of the Linux OS group at AMD, “AMD processors are not subject to the types of attacks that the KPTI feature protects against. The AMD microarchitecture does not allow memory references, including speculative references, that access higher privileged data when running in a lesser privileged mode when that access would result in a page fault.” Lendacky’s wording strongly implies what the specific Intel problem is.
However, AMD is unlikely to be able to turn this event into a huge market opportunity, as it has been very slow to bring its new Ryzen and Epyc CPUs to market, and won’t be able to match the scale needed to properly seize such a market from its incumbent rival. Similarly, it is early days for ARM-based rivals from the likes of Cavium and Qualcomm, but the incident has significantly lowered the barriers that these two rivals faced. If this is a performance flaw that carries forward for a couple of sales cycles, Intel could see its lead in the data center severely eroded, or potentially lost entirely.
The first sign that something was wrong came from speculation about redacted Linux kernel documentation and comments, and the inclusion of AWS and Google staff on email threads. The ‘official’ reason for the patch, which is moving at a much quicker pace than normal Linux kernel development, is to support the fix for the KASLR (Kernel Address Space Layout Randomization) feature. That fix was proposed following research from Graz University, which recommended splitting the kernel and user memory spaces in a patch called KAISER. The clue to the problem seems to be in the name, however, as the kernel flag is called ‘X86_BUG_CPU_INSECURE.’
As such, there has been rampant internet speculation about the scope of the problem. It appears that Microsoft has been working on a fix for Windows since November, and in the initial Linux fix, the kernel will treat AMD processors in the same manner as Intel ones, meaning they will suffer a performance hit too. For this reason, AMD is recommending that the patch not be enabled for its processors on Linux. Microsoft has yet to outline its fix.
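For readers who want to see whether a given Linux machine has the isolation patch active, a small Python sketch follows. It is an illustration under assumptions: the `pti` and `pcid` entries on the `flags` line of `/proc/cpuinfo`, the `cpu_meltdown` entry on the `bugs` line, and the sysfs status file are the names used by mainline kernels released after the patch was merged, and may be absent or named differently on older or vendor-patched kernels. The helper names `cpu_flags` and `kpti_summary` are invented for this example.

```python
import os


def cpu_flags(cpuinfo_text, field="flags"):
    """Return the tokens on the first '<field> : ...' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == field:
            return set(value.split())
    return set()  # field not present (e.g. an old kernel with no 'bugs' line)


def kpti_summary(cpuinfo_text):
    """Summarise what the kernel reports; flag names are assumptions (see above)."""
    flags = cpu_flags(cpuinfo_text)
    bugs = cpu_flags(cpuinfo_text, field="bugs")
    return {
        "pcid_supported": "pcid" in flags,         # may soften the KPTI cost
        "kpti_active": "pti" in flags,             # page table isolation is on
        "marked_vulnerable": "cpu_meltdown" in bugs,
    }


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(kpti_summary(f.read()))
    # Later kernels also expose a one-line, human-readable status here.
    status = "/sys/devices/system/cpu/vulnerabilities/meltdown"
    if os.path.exists(status):
        with open(status) as f:
            print(f.read().strip())
```

The parsing is kept separate from the file reads so that it can be tried on pasted `/proc/cpuinfo` output from another machine. A missing flag only means the kernel did not report it, not that the machine is safe.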