One of the more interesting features built into Windows Vista is the Reliability Monitor that appears as part of performance monitoring. You can launch this tool by working your way through the menu sequence Start, Control Panel, Administrative Tools, Reliability and Performance Monitor, but I find it’s a heck of a lot easier to just type “perfmon” in the search box immediately above the Start button.
To those not already in the know, perfmon is short for perfmon.exe, an old familiar to Windows bitheads since the NT era: it’s the built-in Windows performance monitoring tool.
Reliability Monitor appears as an entry in the Monitoring Tools section of the left-hand Perfmon menu, as shown in Figure 1. Click on the Reliability Monitor entry, and you’ll see a display like the one shown in Figure 2.
What you see in the basic Reliability Monitor window is a line graph that strings daily “system stability” rating numbers together into a chart of system status. Indicators below individual days let you know when updates or new software have been installed (shown as a blue “i” information icon), or when trouble has occurred (shown as a white X inside a red circle, the traditional Windows failure alert icon). Below the graph are detail areas: highlight any day in the chart, and you can expand them to view details about any of the following reporting categories:
- Software (Un)Installs for <date>: Here, I use 6/22/2008 as my sample date, where you can see numerous software, device driver, and even some Windows update installs in the detail listing depicted in Figure 3. One of those installs, Gigabyte EasyTune 6, wouldn’t run on my machine: the app crashed when I tried to run it following installation, which is why you also see my reliability score decline from 9.53 the day before to 8.57 that day (more on this later).
- Application Failures for <date>: This is where you can see that the primary EasyTune 6 executable file, named GUI.exe, stopped working on June 22, 2008. It turns out this software is not compatible with the GA-P35T-DQ6 motherboard on my primary production machine (and again: more on this later).
- Hardware Failures for <date>: Should any hardware component on your system ever quit working in a non-catastrophic way (that is, in a way that lets enough of Windows keep running to detect that a hardware component has failed without causing the system to crash), you’ll get a failure icon in this category of Reliability Monitor. My logs go back to 11/8/2007, and I have never had a hardware failure of this type show up in them.
- Windows Failure for <date>: Should Windows itself fail in a way that lets it record a crash dump, Reliability Monitor will tell you what kind of error occurred, and share error codes from that dump in the “Failure Detail” field, as shown in Figure 4. According to MS Help and Support (http://support.microsoft.com/kb/820362/), the error shown there is a stop error that occurs when a RAID array encounters serious problems. This system uses two mirrored (RAID 1) drives for the system and boot drive, and the problem occurred during the shakedown/burn-in phase as the system was being brought online.
- Miscellaneous Failure for <date>: Whenever some kind of problem or failure occurs that Windows can’t neatly assign to one of the aforementioned categories, it shows up here. The Failure Type shown in Figure 5 (disruptive shutdown) is probably the most common failure of this kind; certainly, it’s by far the most common type I’ve experienced myself. It usually results from a hung system, when errors leave the user no way to resume work except through a hard reset. This turns out to be a real pain on a system like this one with mirrored drives, because the first step in recovery upon reboot is for the system to inspect and re-sync the mirrored drives. That process takes between one and two hours for the 320 GB Seagate 7200.10 drives I’m using, and it slows processing somewhat while recovery is underway. In the 8 months this particular system has been up and running, application problems that caused the system to hang have been far and away the biggest cause of this type of failure. Looking back at my logs, I can see that issues with Word, a couple of audio programs, IE, and even one case of FreeCell going south have all played roles in this kind of problem, with IE in first place and Word in second place.
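To make the score arithmetic concrete, here is a minimal Python sketch that scans a list of daily stability ratings and flags any day where the rating fell noticeably from the day before. It makes several assumptions: the ratings are typed in by hand from the chart (the two values are the ones quoted above for 6/22/2008 and the day before), and the function name `stability_drops` and its 0.5-point `threshold` are hypothetical choices of mine, not anything Reliability Monitor itself provides.

```python
from datetime import date


def stability_drops(ratings, threshold=0.5):
    """Return (day, previous, current, drop) for each day-over-day decline
    of at least `threshold` points.

    `ratings` is a list of (date, score) pairs in chronological order.
    The 0.5 default is an arbitrary, illustrative cutoff.
    """
    drops = []
    # Pair each day's rating with the rating from the day before it.
    for (_, prev), (day, cur) in zip(ratings, ratings[1:]):
        drop = round(prev - cur, 2)
        if drop >= threshold:
            drops.append((day, prev, cur, drop))
    return drops


# Ratings copied by hand from the chart: the day before the EasyTune 6
# install, and the day of the install and crash.
ratings = [
    (date(2008, 6, 21), 9.53),
    (date(2008, 6, 22), 8.57),
]

for day, prev, cur, drop in stability_drops(ratings):
    print(f"{day}: {prev} -> {cur} (down {drop})")
```

Run against those two days, the sketch flags 6/22/2008 with a drop of 0.96 points, which matches the dip visible in the chart on the day GUI.exe stopped working.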
On the plus side, Reliability Monitor is a great tool for observing system stability and for acquiring a sense of where system problems or difficulties are coming from. Paying attention to its reporting has definitely helped me improve the stability of my Vista systems, primarily by warning me away from indulging in risky behaviors (installing questionable apps, over-doing my multi-tasking, overloading IE with too many tabs, and so forth). But it irks me no end that I’ve never managed to attain a “perfect 10.0” stability rating on any of my Vista machines. That goes double when my stability rating takes a hit because I installed an incompatible or unstable application (Gigabyte EasyTune 6 has already been proffered as an example), even when I drop back to a Restore Point that precedes the wonky app’s install and restore my system to the state it occupied before the questionable app was introduced into the mix.
Maybe it’s asking too much for Microsoft to take Restore Points into account when assessing system stability for any given day, but that doesn’t mean it can’t bother me anyway. Even when shooting for perfection, the occasional misstep will occur, but if a genuine correction can be applied (and in my mind at least, using a Restore Point to return to a state where the problem never happened qualifies as such), why should the failure still count against the stability rating? My best guess is that this occurs because MS decided not to code in the extra logic necessary to go back and rework the logs when Restore Points are invoked. ‘Nuff said!