If your organization has deployed any version of Microsoft System Center Operations Manager (SCOM), you are ready to install the Agentless Exception Monitoring (AEM) feature and collect data about application errors and system crashes. AEM lets you intercept and interact with Windows Error Reporting (WER) data across your enterprise without deploying any new software. WER is built into Windows Vista, Windows Server 2008, and later Microsoft operating systems, just as the "Dr. Watson" debugger was included with Windows XP and Windows Server 2003.
What do "hangs and crashes" really mean in business terms? Simple disruptions, such as the time spent waiting for an application to restart after a crash, are a nuisance to a single user. However, when such interruptions occur frequently across many clients, the combined loss of productivity can have a dramatic effect, similar to or greater than that of a server outage. The question is: How can the IT department gain visibility into which applications are crashing the most? SCOM's AEM feature gives you that insight, and you can improve your company's bottom line by fixing the top problems.
From Dr. Watson to WER
About twenty years ago, Microsoft released a simple debugger application for the Windows NT 3.0 beta named Dr. Watson (drwatson.exe), taking its name from the fictional medical doctor John Watson, sidekick of Sherlock Holmes. Early Watson versions only wrote crash state information, such as stack traces, to local files on the hard drive. This information was available to in-house developers or sent to Microsoft as part of a product support troubleshooting investigation. With the release of Office XP in 2001, Microsoft added the Watson debugger, which, with user permission, anonymously forwarded crash information to Microsoft itself.
Since that time, Microsoft has collected, aggregated, and analyzed vast numbers of application error reports to identify problems that users experience with their Windows computers and applications. This information helps prioritize development resources for bug fixes and service packs. In 2003, Microsoft made Corporate Error Reporting (CER) 2.0 available to Software Assurance (SA) customers through the Volume Licensing program. Installing CER creates a shared folder structure to contain error report data on the server hosting the service, and deploys an Active Directory domain Group Policy that configures Watson and WER clients to upload data to that server. SCOM AEM keeps the core functionality of CER and adds support for newer client operating systems. Microsoft renamed the Watson functionality in Windows Vista to Windows Error Reporting (WER). Figure A shows the successor to Dr. Watson, the Action Center in Windows 7. Settings in the Maintenance menu of the Action Center control the behavior of WER in Windows 7, Windows Server 2008, and later operating systems.
The Action Center is where you configure Windows Error Reporting (WER) in later Microsoft operating systems.
Crash and hang monitoring
AEM monitoring for crashes and hangs does not require the presence of a SCOM agent, so you can use the AEM feature of SCOM to detect application hangs and OS crashes on computers with or without SCOM agents. First enable AEM on a selected SCOM management server, and then deploy the AEM Group Policy Object (GPO) with Active Directory. All computers subject to the GPO will start reporting crash and hang information to the AEM server, regardless of whether they have a SCOM agent installed.
The SCOM console shows when Microsoft has a help link available for an issue.
The error group view is created by reading the application names and contents of the shared folder structure where error reports are stored. The view includes an entry for each error signature reported. A single application name may appear several times if the same application reports different error signatures. From this view, you can select application errors and customize the error buckets, which are the settings that guide the behavior of the Watson or WER client.
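The grouping behind that view can be sketched in a few lines of Python. This is only an illustration of the idea, not how SCOM itself reads its file share: the `ErrorReport` type, the `group_errors` function, and the application names and signatures are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ErrorReport(NamedTuple):
    """One uploaded crash or hang report (hypothetical shape, for illustration)."""
    application: str
    signature: str


def group_errors(reports: Iterable[ErrorReport]) -> Counter:
    """Count reports per (application, signature) pair, i.e. per error group."""
    return Counter((r.application, r.signature) for r in reports)


# Made-up sample data: one application reporting two different signatures.
reports = [
    ErrorReport("winword.exe", "sig-A"),
    ErrorReport("winword.exe", "sig-A"),
    ErrorReport("winword.exe", "sig-B"),
    ErrorReport("cardreader.exe", "sig-C"),
]

groups = group_errors(reports)

# List the error groups, most frequently reported first.
for (application, signature), count in groups.most_common():
    print(f"{application}  {signature}  {count}")
```

Because the key is the pair of application and signature, `winword.exe` shows up in two separate groups here, which mirrors how one application name can appear several times in the view.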
In addition to identifying the top errors in your organization, AEM can optionally collect additional, custom information and reach back to users with a custom solution link. For example, say you used the SCOM console to discover that a common error in your organization was due to a missing driver for a memory card reader. Using the SCOM console, you can customize the error bucket with an internal URL that downloads the driver immediately whenever a user encounters the same application error. This targeted, self-help technique is easy to implement.
Reporting on fastest growing application issues
In addition to the five views available in the Agentless Exception Monitoring view folder in the SCOM Monitoring space, there are four client monitoring reports in the Client Monitoring Views Library in the Reporting space of the Operations console. These reports aggregate the information in the monitoring views and perform data analysis. The Top N Error Groups Growth and Resolution report shown in Figure C measures the percentage by which errors increase or decrease over time. The significance of this report is that it lets you determine which applications are getting worse at a faster rate than others, even if the most common errors occur in much larger quantities than the emerging ones. This error-trending capability is a great way to detect a problem before it affects a large number of users.
The Top N Error Groups report helps you spot fast-trending problems early.
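To make the growth measure concrete, here is a minimal Python sketch of ranking error groups by percent change between two periods. It is not the report's actual query: the function names, the choice to skip groups with no earlier reports, and the weekly counts are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def percent_growth(previous: int, current: int) -> Optional[float]:
    """Percent change from previous to current; None when there is no baseline."""
    if previous == 0:
        return None
    return (current - previous) * 100.0 / previous


def top_n_by_growth(
    previous: Dict[str, int], current: Dict[str, int], n: int
) -> List[Tuple[str, float]]:
    """Return the n error groups whose report counts grew fastest."""
    ranked = []
    for group, count in current.items():
        growth = percent_growth(previous.get(group, 0), count)
        if growth is not None:  # groups without a baseline are skipped here
            ranked.append((group, growth))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Made-up weekly report counts per error group.
last_week = {"winword.exe / sig-A": 400, "cardreader.exe / sig-C": 10}
this_week = {"winword.exe / sig-A": 420, "cardreader.exe / sig-C": 30}

top = top_n_by_growth(last_week, this_week, 2)
for group, growth in top:
    print(f"{group}: {growth:+.1f}%")
```

In this sample the card reader error ranks first even though it accounts for far fewer reports, because it tripled while the larger group grew only slightly. That is the kind of emerging problem a growth-based ranking is meant to surface.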
Public access to Microsoft's WER back-end database
If you are a software developer, you can write better software if you know what errors your application is encountering in the field. Microsoft's Dev Center services enable software and hardware vendors to access reports to analyze and respond to problems caused by their applications. You must register with Microsoft to participate in the Dev Center program; there is no charge.
You can use the Dev Center dashboard to view driver-specific, application-specific, or operating system-specific error reports associated with your organization, which are stored in error buckets in the Microsoft WER back-end. Each error report provides details related to that bucket, and you can then request a file of the associated data. For more information about Dev Center, visit http://sysdev.microsoft.com/en-US/Hardware/signup.
John Joyner, MCSE, CMSP, MVP Cloud and Datacenter Management, is senior architect at ClearPointe, a cloud provider of systems management services. He is co-author of the "System Center Operations Manager: Unleashed" book series from Sams Publishing, and is developing cloud-based management solutions based on the Microsoft System Center 2012 suite. John is a retired U.S. Navy Lt. Commander 'Surface Warfare Officer', with the subspeciality 'Computer Scientist, Proven'. His tours of duty included Chief of Network Operations for NATO's southern region and network administrator aboard the aircraft carrier USS CARL VINSON (CVN-70).