Spectre and Meltdown may force long term changes in CPU design

Intel at least was relieved when Spectre emerged quickly to join Meltdown as a dual security threat, because now the problem is shared with AMD and makers of ARM processors. Spectre afflicts just about the whole PC, tablet and smartphone industry as well as the COTS (Commercial Off-The-Shelf) hardware used widely in virtualized infrastructures and cloud services.

With no evidence that any of the vulnerabilities had been exploited, concerns switched more to the performance implications of the patches than the risk of major breaches. But these fears appear to have been overblown after initial stories of increases in CPU time of up to 30%. Amazon Web Services has insisted that impact on virtual machine performance of fixes administered so far has only been slight, while Linux kernel king Linus Torvalds has suggested a 5% slowdown is typical. Microsoft indicates that performance hits in the range of 2% to 14% have been found in its tests with fixes for Spectre tending to be costlier.

But this is perhaps jumping the gun and at this stage it is worth clarifying the important differences between Spectre and Meltdown while highlighting their considerable implications for the future of CPU design.

The countermeasures currently being deployed are short term fixes, especially in the case of Spectre, and the bigger picture is that the foundation of software based security towards which cloud based services, including pay TV, have been migrating for some years, will need to be revised and reinforced. The demarcation and relationship between instruction sets in CPUs and the secure parts of the processor will need to be redefined more rigorously. This situation has arisen because the endless tradeoff between security and performance or processing efficiency has shifted towards the latter, under pressure in particular from the mobile industry where battery life is such a constraint. This has meant that the complex optimizations that have evolved around all the critical components of a system, including processors, compilers, device drivers and operating systems, have inadvertently introduced security risks.

This has come to light with Spectre and Meltdown at a time when there is growing recognition that the balance should shift back towards security as piracy of premium video content for example is an escalating issue, as well as threats right across the emerging Internet of Things ecosystem.

While it is true Meltdown is somewhat easier to fix, both will require long term redesign at the hardware level to prevent possible future exploitation of the vulnerabilities. Meltdown is effectively confined to Intel chips, including most CPUs made since 1995, and is known officially as CVE-2017-5754 or “rogue data cache load”, because it allows a process to read kernel memory. It does this by bypassing the protections that modern CPUs build in to support multitasking: isolation between the memory of different user applications, and a bar on those applications reading or writing core kernel memory. In outline, the attack is relatively straightforward, involving three key steps.

Firstly, a memory location is chosen by the attacker as a target. At this stage that location is inaccessible to the attacker, but its content can still be loaded into a register transiently. Then in the second step the attacker issues a transient instruction that accesses a line of cache based on the secret content of the register, even though the content of that line itself is still unknown.

Then in step 3 the attacker uses Flush+Reload, a cache side channel technique that works under operating systems such as Windows 10 and Linux, to determine which cache line was accessed and hence the secret stored at the chosen memory location. By repeating these steps for different memory locations, the attacker can in time dump the whole kernel memory, including the entire physical memory of the machine.
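The three steps can be sketched as a toy simulation. This is only a model under stated assumptions, not exploit code: the `ToyCPU` class, the kernel addresses, the 256 line probe array and the cycle counts are all hypothetical, chosen just to show how a transient access leaves a footprint that a timed reload can recover.

```python
# Toy model only: a simulated CPU and cache, not a real attack.
# Hypothetical kernel addresses holding the bytes of b"secret".
KERNEL_MEMORY = {0xFFFF0000 + i: b for i, b in enumerate(b"secret")}


class ToyCPU:
    def __init__(self):
        # Indices of probe-array lines currently held in the simulated cache
        self.cached_lines = set()

    def flush(self):
        """Evict every probe line (the 'Flush' half of Flush+Reload)."""
        self.cached_lines.clear()

    def transient_access(self, kernel_addr):
        """Steps 1 and 2: the forbidden load and a dependent access run
        transiently. The register value is discarded, the cache fill is not."""
        secret = KERNEL_MEMORY[kernel_addr]  # never returned to the caller
        self.cached_lines.add(secret)        # touches probe line number `secret`
        # The privilege check fails here and architectural state is rolled back.

    def reload_time(self, line):
        """Simulated latency in cycles: cached lines are fast, others slow."""
        return 40 if line in self.cached_lines else 300


def leak_byte(cpu, kernel_addr):
    cpu.flush()
    cpu.transient_access(kernel_addr)
    # Step 3: time a reload of every probe line; the fast one reveals the byte
    timings = [cpu.reload_time(line) for line in range(256)]
    return timings.index(min(timings))


def dump(cpu, start, length):
    """Repeat the three steps for consecutive addresses."""
    return bytes(leak_byte(cpu, start + i) for i in range(length))
```

In this model, calling `dump(ToyCPU(), 0xFFFF0000, 6)` recovers the six bytes one at a time, even though `transient_access` never hands the secret back directly.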

This rather glosses over the technical details, which are set out well in a brace of in-depth papers covering Meltdown and Spectre that have been published online. For Meltdown the principle is to overcome the privilege checks that are conducted by the CPU to prevent unauthorized reads by rogue instructions. The flaw lies in the fact that, in order to improve efficiency, the CPU initially schedules both the first steps of executing a command and the privilege check simultaneously.

This means that even an unauthorized process is allowed to start fetching data from a protected location while the privilege check is taking place. This does not normally matter because that read data should be deleted before being made available, in the event that the privilege check completes and fails to authorize a process. Meanwhile the CPU cache has been temporarily updated, hence the slightly misleading use of the term “pre-emptive caching” with reference to the Meltdown exploit.

At this stage the CPU cache is still not readable by the unauthorized process, because it is integral to the CPU, but then by instigating a form of side channel attack known as a cache timing attack, the rogue process can determine whether data from a specific address is held within the CPU cache. This attack involves issuing a second instruction to read that data, enabling the rogue process to entice the CPU to use the cache for the purpose if the data is held there, because it is faster. If the data is not in cache then the access will be slower, and the cache timing attack determines which was the case by observing that difference. Even this is not fatal, but Meltdown can then combine this information with other features of the CPU instruction set to gain full access to all mapped memory.
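The timing comparison at the heart of the cache timing attack can be illustrated with a short sketch. The latency figures below are invented for illustration and real values vary from machine to machine; the helper names `calibrate_threshold` and `was_cached` are likewise hypothetical.

```python
from statistics import median


def calibrate_threshold(hit_samples, miss_samples):
    """Pick a cut-off halfway between typical hit and miss latencies."""
    return (median(hit_samples) + median(miss_samples)) / 2


def was_cached(samples, threshold):
    """Treat an address as cached if its median access time is below the cut-off.
    Using the median makes a single noisy measurement less likely to mislead."""
    return median(samples) < threshold


# Hypothetical latencies in CPU cycles for known-cached and known-uncached reads
hits = [38, 41, 40, 44, 39]
misses = [290, 310, 305, 298, 320]
threshold = calibrate_threshold(hits, misses)

print(was_cached([42, 37, 250], threshold))   # mostly fast reads
print(was_cached([300, 45, 310], threshold))  # mostly slow reads
```

The point of the sketch is that the attacker never reads the cache itself. It only needs a clock precise enough to separate the two latency populations.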

At least Meltdown has a universally agreed fix called Kaiser which should ensure security in the immediate future, which is why it is being implemented by the three main operating system groups, Linux, Windows and iOS, within the coming weeks. Kaiser was designed initially just for Linux to protect against attempts to bypass a feature called kernel address space layout randomization (KASLR). But since it essentially improves isolation between the user’s and kernel’s space it also protects against Meltdown. It prohibits any kernel space being mapped into user space, except for some specific parts such as interrupt handlers required by the Intel x86 architecture, which is still widespread. This prevents Meltdown from leaking any kernel or physical memory data, except for data associated with those few less critical processes.
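The isolation idea can be pictured with a toy model of the two address space views. This is a simplification under stated assumptions: the region names are hypothetical, and real page tables map addresses rather than named regions.

```python
# Toy model: each "page table" is just the set of regions it maps.
KERNEL_REGIONS = ["kernel_text", "kernel_data", "physmap", "interrupt_entry"]
# Hypothetical name for the small part that must stay mapped in user mode
ALWAYS_MAPPED = {"interrupt_entry"}


def build_page_tables(kaiser_enabled):
    """Return (user_view, kernel_view) for one process."""
    kernel_view = {"user_code", "user_data", *KERNEL_REGIONS}
    if kaiser_enabled:
        # User mode sees only its own memory plus the unavoidable entry stubs
        user_view = {"user_code", "user_data", *ALWAYS_MAPPED}
    else:
        # Kernel is mapped everywhere and guarded only by privilege checks
        user_view = set(kernel_view)
    return user_view, kernel_view


def transiently_reachable(region, user_view):
    """In this model a Meltdown-style read only works on mapped regions."""
    return region in user_view


user_before, _ = build_page_tables(kaiser_enabled=False)
user_after, _ = build_page_tables(kaiser_enabled=True)
print(transiently_reachable("physmap", user_before))
print(transiently_reachable("physmap", user_after))
```

With the isolation switched on, the only kernel region still visible from user mode in this sketch is the entry stub, which mirrors the exception the article describes for interrupt handlers.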

What is most interesting is that while this is just a temporary fix its mode of operation is likely to be embodied in a permanent redesign of CPU architectures. Meltdown has exposed the current situation where hardware optimizations can endanger secure software implementations. This could also have implications for trusted zones on chips that are seen as the way forward for content security, like the TEE (Trusted Execution Environment).

Meanwhile we also have the thornier question of Spectre, which at one level is a greater issue, because it embraces a whole category of threats. Spectre differs from Meltdown in that all the main chips, including those made by AMD or designed by ARM, are potentially affected. It also represents just one example in a whole class of attacks, so it is already becoming clear that there will be no one patch fixing all of them. It will therefore haunt us for much longer.

Like Meltdown, Spectre in essence exploits the speculative execution performed by modern CPUs, but does so in a more generic way that exposes all principal chips and not just Intel’s. All modern processors use speculative execution, combined usually with branch prediction, to optimize performance.

The aim is simply to minimize the latency associated with each operation by ensuring that relevant data is available in cache, as near as possible to the core, so that signals have the shortest distance to travel. So if, say, the destination of a branch depends on a memory value that is in the process of being read, a CPU will guess that destination and thereby attempt to execute ahead of itself on the basis of probabilities. When the memory value finally arrives, the CPU will discard the speculative work if the guess turned out to be wrong, or else keep the results and continue. This makes use of resources that would otherwise have been idle, so that provided the speculation is right just some of the time, the average performance will still be improved.
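The claim about average performance can be checked with some simple arithmetic. The sketch below assumes a deliberately crude cost model, with hypothetical cycle counts: a correct guess hides the whole wait behind useful work, while a wrong guess costs the full wait plus a rollback penalty.

```python
def expected_stall(p_correct, wait_cycles, rollback_cycles):
    """Average cycles lost per branch when the CPU speculates.
    Correct guess: the wait is hidden behind useful work (0 cycles lost).
    Wrong guess: the full wait plus the cost of discarding speculative work."""
    return (1 - p_correct) * (wait_cycles + rollback_cycles)


def break_even_accuracy(wait_cycles, rollback_cycles):
    """Guess accuracy above which speculating beats always waiting."""
    return rollback_cycles / (wait_cycles + rollback_cycles)


# Hypothetical figures: a 200-cycle memory wait and a 20-cycle rollback
print(expected_stall(0.5, 200, 20))   # average loss when right half the time
print(break_even_accuracy(200, 20))   # accuracy needed to beat a 200-cycle wait
```

Under these assumptions, a CPU that always waits loses 200 cycles per branch, so speculation pays off as soon as the guess is right more often than the break-even figure, which is small whenever the rollback is cheap relative to the wait.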

But the speculative logic does not adhere to the same rules and can access the victim’s memory and registers, while performing operations with measurable side effects. Spectre attacks involve inducing a victim to speculatively perform operations that would not occur during normal program execution and which leak the victim’s confidential information via a side channel to the adversary.

Spectre then exploits branch prediction, which is one particular case of speculative execution. Unlike Meltdown, which relies on poor memory separation, it does not exploit any specific feature of a given processor’s memory protection system, which is why it is not confined to Intel chips.

The effect is to fool CPUs into making guesses about future instructions that would not otherwise be allowed and through that gain access to privileged information within the kernel address space, or data in other running processes.

So far two different techniques exploiting Spectre have been identified by Google’s Project Zero security team, known as CVE-2017-5753 or “bounds check bypass” and CVE-2017-5715 or “branch target injection”, and they require different fixes. The bounds check bypass attack requires analysis and recompilation of vulnerable code, while the branch target injection attack can be dealt with via a CPU microcode update, such as Intel’s IBRS microcode, or through a software patch to the operating system kernel.
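The bounds check bypass variant, and the kind of code change that closes it, can be sketched with a toy simulation. Everything here is an assumption made for illustration: the `ToyVictim` class, the one-bit stand-in for a branch predictor and the `mask_index` flag are hypothetical, and the clamping only loosely models hardening techniques such as index masking or speculation barriers.

```python
ARRAY1 = [1, 2, 3, 4]        # public data the victim is allowed to index
SECRET = list(b"key")        # sits right after ARRAY1 in this toy memory
MEMORY = ARRAY1 + SECRET


class ToyVictim:
    def __init__(self, mask_index=False):
        self.mask_index = mask_index      # True models recompiled, hardened code
        self.predict_in_bounds = False    # one-bit stand-in for the branch predictor
        self.cached_lines = set()         # probe lines touched, visible via timing

    def read(self, x):
        in_bounds = x < len(ARRAY1)
        # The body runs either for real or speculatively on a wrong prediction
        if in_bounds or self.predict_in_bounds:
            idx = x
            if self.mask_index and not in_bounds:
                idx = 0                   # clamp so speculation stays inside ARRAY1
            self.cached_lines.add(MEMORY[idx])  # footprint survives the rollback
        self.predict_in_bounds = in_bounds      # predictor learns the real outcome
        return ARRAY1[x] if in_bounds else None  # architectural result obeys the check


def attack(victim, offset):
    victim.read(0)                        # train the predictor with a valid index
    victim.cached_lines.clear()           # flush the probe lines
    victim.read(len(ARRAY1) + offset)     # out-of-bounds call runs speculatively
    return victim.cached_lines            # what a timed reload would reveal


print(attack(ToyVictim(), 0))
print(attack(ToyVictim(mask_index=True), 0))
```

In the unhardened case the footprint encodes the first secret byte, although `read` itself correctly returns nothing for an out-of-bounds index. With the index clamped, only public data is ever touched, which is why this variant has to be fixed in the vulnerable code rather than by a single system-wide patch.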

Such countermeasures are just arriving, and the situation is still confusing. While consumers can do little more than await developments, cloud providers in particular cannot stand idly by because they are going to face calls from their customers when performance sags or service interruptions occur.

As Richard Morrell, CTO and security lead of cyber defense firm Falanx, wrote in a technical note to customers, “Amazon, Rackspace, and Verizon along with Microsoft are rebooting swathes of their infrastructure during Friday – Sunday 5th – 8th January. If you are a cloud customer of any provider, please seek clarification from your provider.” He also advised DevOps/Agile leads to consult their vendor to determine if they expect impact at this time. It may be that the vendor does not yet know.

But the underlying message is that all these patches are stop-gap measures, especially in the case of Spectre. As the Spectre paper made clear, sound long term solutions will entail fixes to processor designs as well as updates to instruction set architectures. This will be needed to give hardware architects and software developers a common understanding of what information CPU implementations are permitted to expose from computations and what they are not. Many would have assumed that understanding had long been there, especially in an era of hardware roots of trust in generic CPUs, but its absence is the source of the problem.