That Intel chip flaw means a large – and long – headache for CIOs and IT teams

Martin Banks, January 7, 2018
The Intel flaw is in the design of the processors it is engineered in, so it can't easily be engineered out. A problem for CIOs that will not go away soon.

By now the news that processor chips from Intel, AMD and ARM carry a well-buried but serious design flaw should be fairly common knowledge.

The industry, both chip makers and system software suppliers, is, not surprisingly, hell-bent on plugging the hole with patches to operating systems and processor firmware, and these are expected to become available over the next few days.

Though it appears the flaw first emerged last June, much of the industry has maintained a strict radio silence about it until the necessary software patches have been issued and, hopefully, installed by the majority of users of all systems running AMD, ARM and, principally, Intel processors. The release of any information could bust the majority of such systems wide open by informing hackers.

The issue here, of course, is what CIOs should do about it, both now and in the longer term, and this does require information. In the immediate to short term, there are some obvious steps that need to be taken, but further out there are some wider questions, such as managing the impact on application performance and on future hardware plans.

It is, perhaps, telling that the CERT Software Engineering Institute at Carnegie Mellon University in the USA has recommended a brutal fix for the flaw - throw away current CPUs and buy non-vulnerable ones.

The immediate issue will, of course, be to ensure that all patches and updates are installed as soon as a business gets them – and IT teams should go hunting for every one of them that might be helpful. As these will be fairly significant updates to operating systems – particularly Windows and Linux – it may well require the full nine yards of a dead stop, install, and restart. So the biggest problem may be convincing C-level management that this is essential rather than optional and deferrable.

What are they working round?

In essence, modern processors have the capability to predict what data is going to be needed next by an application and load it into cache memory ready to be called for execution. This is achieved by firing off speculative references based on predictions of what instructions will need to be run next. It is particularly useful in applications such as databases where it is usually easy to predict what instruction is needed next, so the high hit rate speeds up the performance of the application overall. The weak point is that a wrong prediction is ‘unwound’ and discarded only after the instruction has been called up. In other words, the security check on the action is post-hoc.

It is therefore possible for this process to also be used against the core on-chip memory space set aside for secure data, such as passwords, and appropriate code can be inserted into normal applications to make this happen. The result is that instructions can be read from this secure cache, and executed before the security process steps in after the event.
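The mechanism described above can be sketched as a toy model. This is a simulation in plain Python and not an exploit: the "cache" is just a set, and "speculation" is an explicit step that runs before the bounds check. Every name here (`ToyCPU`, `PUBLIC`, `SECRET`, `MEMORY`, `read_public`, `recover_byte`) is hypothetical and chosen for illustration only. Real attacks infer which cache line is warm by timing memory accesses rather than by inspecting the cache directly.

```python
PUBLIC = [1, 2, 3, 4]                # data the program is allowed to read
SECRET = [ord(c) for c in "pw"]      # hypothetical protected bytes
MEMORY = PUBLIC + SECRET             # the secret sits right after the public array


class ToyCPU:
    def __init__(self):
        self.cache = set()           # probe slots that are currently "cached"

    def read_public(self, index):
        # Speculative step: the load and the dependent cache access happen
        # before the bounds check has been resolved.
        speculative_value = MEMORY[index]
        self.cache.add(speculative_value)   # side effect that is never undone

        # Post-hoc check: the result is thrown away on failure,
        # but the cache footprint left above stays behind.
        if index >= len(PUBLIC):
            return None
        return speculative_value


def recover_byte(cpu, index):
    """Attacker side: empty the cache, trigger the read, see which slot is warm."""
    cpu.cache.clear()
    cpu.read_public(index)           # returns None for out-of-bounds indexes
    hot = [slot for slot in range(256) if slot in cpu.cache]
    return hot[0] if hot else None


cpu = ToyCPU()
leaked = "".join(
    chr(recover_byte(cpu, len(PUBLIC) + i)) for i in range(len(SECRET))
)
print(leaked)
```

The point of the sketch is that `read_public` never returns a secret value, yet the caller can still reconstruct it from what was left in the cache.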

The flaw has led to the identification of three vulnerability types, with the first two – a bounds check bypass and branch target injection – being grouped together under the collective name ‘Spectre’. The third, a rogue data cache load, has been named ‘Meltdown’. The latter allows normal programs to be used to read the contents of this private memory. It affects all Intel x86-64 processors produced since 2011, which suggests just about all cloud service providers are affected. It is reported that it does not affect AMD processors, though it will affect processors using the upcoming ARM Cortex-A75 core, such as the Qualcomm Snapdragon 845.

Patches are either available or coming shortly for Linux and Windows, and Apple’s MacOS is said to be already patched. It is obviously important that CIOs ensure that all relevant patches are installed as soon as they are available, not least because now this story is out in the public domain, every cybercriminal will be considering an exploit or two before patching is completed. This may be one of those occasions when patching takes precedence over any other work in the IT department, including running normal production workloads.
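For IT teams auditing Linux hosts, one way to check what the running kernel reports is sketched below. It assumes a kernel recent enough to expose the `/sys/devices/system/cpu/vulnerabilities` directory, which arrived alongside the mitigations; older or unpatched kernels will simply not have it. The function names `read_mitigation_status` and `unpatched` are hypothetical, and the set of files present varies by kernel version.

```python
from pathlib import Path

# Assumption: a Linux kernel new enough to expose this sysfs directory.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(vuln_dir=VULN_DIR):
    """Return {vulnerability_name: status_text}, or {} if the directory is absent."""
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


def unpatched(status):
    """Names whose status text starts with 'Vulnerable' (the kernel's wording)."""
    return [name for name, text in status.items() if text.startswith("Vulnerable")]


if __name__ == "__main__":
    report = read_mitigation_status()
    if not report:
        print("No vulnerabilities directory found; kernel may predate the interface.")
    for name, text in report.items():
        print(f"{name}: {text}")
```

An empty result is not proof of safety. It only means this kernel does not report its status this way, so the vendor's patch notes remain the authority.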

Spectre allows cybercriminals to use a number of approaches, such as extracting information from other processes on the same system – so an application running on a VM can access the private memory of the host physical server to allow access to other VMs running on the machine. In theory at least, this would allow rogue code on a VM allocated to customer A in a multi-tenanted cloud service to access applications and data used by customer B in a different VM.

Spectre not only affects Intel processors, but may also affect AMD devices, including its new Ryzen family. Researchers into the flaw say Ryzen is affected by Spectre, while AMD claims it is ‘practically immune’.

The performance implications

Working with, or round, the performance implications of the patching process is going to be the big headache for CIOs, and one where they are currently not getting too much help from the major software vendors.

The issue here is that because this is a fundamental design flaw in the processor architecture, it is unlikely that updating the processor chip’s firmware microcode will resolve anything. That means operating systems that have been written to take advantage of the speculative reference capability will need that functionality not just removed, but rewritten to ensure that the private memory is now logically completely separated and cannot be accessed.

This means that the ability to pre-load internal pipelines with instructions will no longer be possible, or will at least be severely reduced, resulting in a negative hit on application performance.

The extent of the hit is the subject of some guesswork, but most estimates put it at a reduction of between 5% and 30% in application speed. Some applications are unlikely to be hit at all, while others could be hit badly. It has to be suspected that any application with repetitive processes, such as databases and analytics, could be amongst those hardest hit.

Unfortunately, there is no word yet from the major software vendors about the impact on individual applications.

Microsoft had this to say:

We’re aware of this industry-wide issue and have been working closely with chip manufacturers to develop and test mitigations to protect our customers. We are in the process of deploying mitigations to cloud services and released security updates on January 3 to protect Windows customers against vulnerabilities affecting supported hardware chips from Intel, ARM, and AMD. We have not received any information to indicate that these vulnerabilities had been used to attack our customers.

Later, news came that the company has now released updates for Windows, though this sprang another complication. It is reported that the fix may not be compatible with the anti-virus software users already have installed. Getting it to work will involve changing a registry key, otherwise installing it may cause that old Windows favourite – the blue screen of death.

Meanwhile, Oracle followed a similar line, though its spokesperson was a tad more succinct:

This is not something we're going to be able to comment.

While it is easy to surmise that neither company has yet had a chance to bench test current applications individually on either patched operating system, and therefore does not want to speculate, the situation is of little help to CIOs. And now is a time when CIOs are going to require such help.

This is a case where applying the patches will not make everything right again. The performance hit is going to continue for a while, a point observed by John Abbott, the Research VP for infrastructure at 451 Research.

This is a complex and messy situation with no quick fixes, and the consequences will be with us for several years until new chip designs reach the market. There may be a short term performance hit for some workloads, but the bigger issue will be ensuring that all necessary patches and mitigations are put in place before security holes can be exploited. Unfortunately fixes come from multiple eco-system players, from chip companies to operating system and hypervisor providers, and in some cases code modifications and re-compilation will be required. Cloud providers are already scheduling mass re-boots in order to patch their own hypervisors.

IT teams will need to check with their cloud service providers as to when they will be patching the resource pools, for there will certainly be a performance hit. For some it may even involve a service shut down and restart.

The performance hit could have a direct impact on business operations where latency is an issue, such as in financial services. A 30% hit here – roughly 300 ms on top of a 1-second transaction – could easily mean a transaction lost and a trade not made. CIOs will be the ones to bear the brunt of responsibility for this, and if the performance hit continues for any length of time, the overall impact on business activity could be significant enough for some customers to find it potentially actionable.
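The arithmetic behind that figure depends on how a "30% hit" is read, and the small sketch below works through both readings. The function names are hypothetical, and the two interpretations are assumptions about what the estimates mean, not something the vendors have specified.

```python
def added_latency_ms(base_ms, hit):
    """Extra time if the hit means each transaction takes `hit` more time."""
    return base_ms * hit


def added_latency_if_throughput_drops_ms(base_ms, hit):
    """Extra time if the hit means the system does `hit` less work per second."""
    return base_ms / (1.0 - hit) - base_ms


base = 1000.0  # the 1-second transaction from the text, in milliseconds
for hit in (0.05, 0.30):  # the 5% to 30% range quoted in the estimates
    print(
        f"{hit:.0%}: +{added_latency_ms(base, hit):.0f} ms "
        f"or +{added_latency_if_throughput_drops_ms(base, hit):.0f} ms"
    )
```

Under the first reading a 30% hit adds 300 ms to a 1-second transaction, matching the figure above. Under the second, where throughput falls by 30%, the same transaction takes roughly 430 ms longer, so latency-sensitive teams may want to plan for the worse of the two.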

Further out, the implications

The implications for Intel certainly don’t look overly encouraging. The CERT analysis that users should throw away their current Intel-based servers shows the precipice the company stands beside. A redesign of the processor is not likely to happen quickly. Intel operates what it calls the Tick-Tock roadmap, a two-year cycle of new micro-architecture one year, and a shrink of the design (reduction in the size of individual components, interconnections and overall die dimensions) the following year.
The next redesign of the microarchitecture, codenamed Ice Lake, is scheduled for 2019 and is likely to be already well on the way to being finalised. So making adjustments to accommodate the removal of the flaw may not be a simple task.

That means Intel is unlikely to be able to engineer out the flaw for a while yet, making the reduced performance capabilities the ‘new norm’, at least for a while.

The only alternative will be to invest in new server hardware based on AMD processors. It is said that the company’s Ryzen processor is both cheaper and faster than anything from Intel, so there could be some side benefits to such a move. However, this would involve no small investment, not just in the new hardware but also in the testing to ensure that existing applications already optimised to run on Intel platforms run just as well on AMD. This process is likely to take both time and money, so while it may be worth the investment for specific applications or business processes, it is likely to mean users hold off until the next planned round of hardware upgrades for the bulk of their infrastructure.

But, even if Intel has Ice Lake ready by that time, it may find that its reputation is sufficiently damaged to push users in the AMD direction.

One tiny speck of (obviously outlandish) hope for Intel may come in the form of its Itanium processor. The 64-bit device is not dead. It is still used by HPE and a new version is expected to appear mid-2018. Its Very Long Instruction Word (VLIW) design and significant parallelisation mean it is radically different from x86-64 devices. But Microsoft at least has an installer available for it for its server applications, and it is immune from Meltdown and Spectre.

With cloud services now well-established – where parallelisation of services could be an advantage – users are likely to have to face up to some major changes anyway, sometime soon. There is... well... you know... outside chance... But in practice it seems fair to bet that Intel is in for a pretty hard time over the next couple of years.

My Take

What a mess. On the technology side there is an interesting lesson here – the onion development model will always bite back, given time. This is where vendors have a tech that works, so they upgrade it by overlaying the new onto the existing tech. It works, so they do it again, and again. The basic x86 architecture started with the 8086 launched back in 1978. It is into its 40th year and, looking at its roadmap, now onto its 25th onion layer. That is one complicated processor, so it is hardly much of a surprise that a ‘bomb’ of some kind was lurking in there somewhere.

The result is as yet unpredictable. No one knows how long cybercriminals have known about it and been exploiting it, and no one knows whether the patches just being released will work or carry some horrors of their own. To be fair, it seems they have had some time to develop and test the patches, so the chances are good.

But while the results will close off a serious security hole, the downside of performance reductions could cause real problems for the user community, and problems that may not go away any time soon. And sadly, though none of this is the fault of the CIO community, it may well be they who get to carry the blame as the impact starts to bite.

To misquote Fergus Cashin, an old time theatre critic of the Daily Sketch back in the 1960s: “this story will run and run”.
