Hardware Failure Analysis

Back in April 2011 a Microsoft Research team published an interesting paper, “Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs,” which I found through Joel Hruska’s summary at ExtremeTech. The team analyzed crash reports sent from ca. 950,000 machines over an 8-month period in 2008.

While any computer whose crash report reaches Microsoft is obviously running Windows, the paper focuses exclusively on hardware failures concerning CPU, DRAM, or disks. The findings should therefore apply to consumer PCs running any operating system. Let’s first look at some results that meet expectations:

  • Initial failures over an 8-month observation period are not ubiquitous but not negligible either, exceeding 0.5% for CPUs, and components that fail once are two orders of magnitude more likely to fail again. This fits the rule of thumb that you’d best get rid of dodgy hardware as soon as possible.

  • Overclocking makes hardware failure up to 19x more likely. This is hardly surprising. Overclockers push the boundaries to find the highest stable performance, and crashes simply indicate they’ve gone too far.

  • Brand-name PCs are more reliable than “white box” PCs, reducing failures by 29% for CPUs and by a factor of nearly 3 for DRAM. Well, it’s a relief that you do get something for that premium price!
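
The figures above mix percentage reductions with multiplicative factors, which makes them hard to compare at a glance. Here is a minimal Python sketch that converts between the two forms. The function names are my own, and reading “nearly 3x” as “brand-name DRAM fails about one third as often” is my assumption, not a definition taken from the paper.

```python
def reduction_to_factor(reduction_pct: float) -> float:
    """Turn 'X% fewer failures' into 'the worse group fails N times as often'."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


def factor_to_reduction(factor: float) -> float:
    """Turn 'fails N times less often' into a percentage reduction."""
    return (1.0 - 1.0 / factor) * 100.0


# 29% fewer CPU failures, expressed as a factor
print(f"{reduction_to_factor(29):.2f}x")

# 3x fewer DRAM failures (assumed reading), expressed as a percentage
print(f"{factor_to_reduction(3):.0f}%")
```

Under that reading, a 29% reduction means white-box CPUs fail roughly 1.4 times as often, while a 3x factor corresponds to a reduction of about two thirds, so the DRAM gap is considerably larger than the CPU gap.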

Perhaps more interesting are the report’s unexpected results:

  • Underclocking significantly reduces hardware failures, by 39% to 80%. This really shouldn’t be happening. Manufacturers apparently sell a lot of hardware that doesn’t quite meet advertised specifications, or else is inadequately integrated with the system (e.g. poor ventilation).

  • Laptops are 25% to 60% more reliable than desktops. Portable systems have much tougher operating conditions – smaller cases with greater potential for heat buildup, physical movement and battering while active – but evidently their sturdier design overcompensates for these conditions.

In conclusion, if you want a stable system you’d best get a brand-name laptop… and then underclock it.

2 thoughts on “Hardware Failure Analysis”

  1. Richard

    My admittedly anecdotal experience contradicts the above report: laptops are much more likely to fail. In fact I would go further: many laptops are designed to fail! Why? Most laptops I come across after a couple of years of use have cooling systems clogged up by fluff and dust. If they are used for gaming or media work, this usually results in excessive overheating, causing soldered joints on the motherboard to fail – graphics chips are notorious for this. Manufacturers are to blame. They design laptops that have to be completely dismantled in order to clean the cooling systems. Yet paradoxically this doesn’t apply to all models; some designs make it easy to clean the cooling system. Why not all models? All manufacturers are affected, including the big boys HP, Dell, Sony, Toshiba, Acer and Apple. But if you want a concrete example of what I’m talking about, just Google: HP dv9000 or dv6000 GPU failure. Or try: MacBook Pro overheating problems.

    Oh, and one last point. Manufacturers happily load up new laptops with what’s known as ‘crapware’. Why don’t they include some useful utilities, such as temperature monitoring software to warn users that the laptop is overheating?

    1. cnahr

      Richard, I have a few guesses why the laptop issue you describe isn’t reflected in the survey. First, as you say, the dust takes years to build up, and many laptops may have been replaced with newer models before that. Second, most laptops probably aren’t used for CPU/GPU-intensive work that would trigger heat death; or if they are, then they are especially likely to be replaced with faster machines before serious dust buildup. Third, some may feature automatic underclocking to prevent heat death; the survey did show that underclocking reduced hardware failures.
