Re: Bricked x86 CPU with software?

From: Tim Mouraveiko
Date: Fri Jan 05 2018 - 13:53:09 EST


> On 2018-01-05 10:21, Tim Mouraveiko wrote:
> >> On Thu 2018-01-04 14:13:56, Tim Mouraveiko wrote:
> >> Actually... I don't think your code works. That's why I'm curious. But
> >> if it works, it's rather big news... and I'm sure Intel and cloud
> >> providers are going to be interested.
> >>
> >
> > I first discovered this issue over a year ago, quite by accident. I changed the code I was
> > working on so as not to kill the CPU (as that is not what I was trying to do). We made Intel aware
> > of it. They didn't care much, one of their personnel suggesting that they already knew about it
> > (whether this is true or not I couldn't say). It popped up again later, so I had to fix the code
> > again. It could be a buggy implementation of a certain x86 functionality, but I left it at that
> > because I had better things to do with my time.
> >
> > Now this news came up about meltdown and spectre and I was curious if anyone else had
> > experienced a dead CPU by software, too. Meltdown and spectre are undeniably a problem,
> > but their magnitude and practicality are questionable.
> >
> > I suspect that what I discovered is either a kill switch, an unintentional flaw that was
> > implemented at the time the original feature was built into x86 functionality and kept
> > propagating through successive generations of processors, or could well be that I have a
> > very destructive and targeted solar flare that is after my CPUs. So, I figured I would put the
> > question out there, to see if anyone else had a similar experience. Putting the solar flare idea
> > aside, I can't conclusively say whether it is a flaw or a feature. Both options are supported at
> > this time by my observations of the CPU behavior.
> >
>
> If you made Intel aware of the issue a year ago, and they weren't
> interested, then the responsible thing to do is disclose the problem
> publicly. This is a security issue (if trusted code can brick a CPU,
> it's an issue for bare metal hosting providers; if untrusted code can
> brick a CPU, it's a *huge* issue for every cloud provider and many, many
> others who run code in various sandboxes). If the vendor is not
> receptive to coordinated disclosure, the only option is public
> disclosure to at least make people aware of the problem and allow for
> mitigations to be developed, if possible.
>
> Personally, I would be very interested in seeing such code. We've seen
> several ways to brick nonvolatile firmware (writable BIOSes, bad CMOS
> data, etc.), but bricking a CPU is a first. The only way that can happen
> is either blowing a kill fuse, or causing actual hardware damage, since
> CPUs have no nonvolatile memory other than fuses. Either way this would
> be a very interesting result.

We discovered the issue but chose not to distill the code into a standalone CPU-killing app.
Once we realized that the CPU had been killed by the software and that the code caused
other CPUs to behave the same way, and once Intel said what they said, I made my pitch to
pursue it, but the decision was made not to. I wasn't to test the existing code beyond
removing the offending part of it. Granted, I snuck a few tests in while removing it, and a few
times, for a few seconds, I held my breath. A few months later I had to fix it again.

Among the considerations was the question of what the possible purpose of designing such
an application would be. Is this a kill switch or an unintentional flaw, particularly in light of
Intel's position? The consequence of a successful execution on a compatible CPU is a loss of
physical property. Intel must have had good reasons to take the position that they did.

This issue would have been a non-starter prior to the Pentium FDIV story. Since then, Atmel
popularized storable fuses, and things have gone on from there.

I did consider and investigate the electrical issue as a possible cause. I ruled it out before I
tested other CPUs and different motherboards. Our OS is not a derivative of Linux/FreeBSD,
neither in concept nor in design. In the relevant parts, all mainstream operating systems are the
same design carried over from a long time ago, and I dare say most if not all
non-mainstream ones copied the relevant part as well (maybe not exactly), as there was/is no
good reason not to. In our case we did not have certain features in the OS, as there was no
good reason to have them, until I needed a way to catch a bug. In the end I did find the bug,
albeit without using the feature.