When Microsoft added the comprehensive Resource Monitor to Windows, it gave administrators an outstanding at-a-glance tool for gleaning deep intelligence about the operating condition of mission-critical Windows servers. In my four-part series about the Resource Monitor, I will focus on each resource monitoring aspect of the tool: CPU, Memory, Disk, and Network. In this installment, I discuss the various disk-related metrics that you can view with Resource Monitor, explain the graphs you see, and provide some context around each metric.
For the purposes of this article, we’ll use the screenshot in Figure A. This figure shows a Resource Monitor view from a production server running Windows Server 2008 R2 and Exchange Server 2010 with all Exchange roles installed; as such, this server has significant need for storage resources that operate within acceptable boundaries. (Note: Like all of our other servers, this server is running as a virtual machine under VMware vSphere 4.1.)
One look at Resource Monitor in Windows Server 2008 R2
Let’s start with an overall look at the console. Occupying most of the window is the statistics area, which I’ll be explaining in depth. On the right side of the window are a number of graphs, each depicting a key storage-based performance metric.
In the sections below, I will provide details for each metric. I won’t repeat metrics; if one type of metric appears in multiple areas, I only list it once.
Processes With Disk Activity
This section of the Resource Monitor window shows you a list of all of the running processes that are using disk resources. You are shown the name of the executable and a number of performance statistics.
- Image. Process executable file name. This is the name of the process that is actively using the disk.
- PID. The ID number associated with the process. This is useful if you want to use other utilities to manage processes, or if you want to easily match up processes with Task Manager.
- Read (B/sec). The average number of bytes read per second by the process in the past minute.
- Write (B/sec). The average number of bytes written per second by the process in the past minute.
- Total (B/sec). The average number of bytes read and written per second by the process in the past minute (that is, reads plus writes).
The information you’re provided in this section isn’t particularly useful when troubleshooting except to show you which processes are consuming the most disk performance resources. In Figure A, you can see that the process named DPMRA.exe is doing a ton of reads from the disk.
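To make the "who is using the disk most" idea concrete, here is a minimal Python sketch that ranks processes by total throughput, where Total is simply Read plus Write. The sample rows are hypothetical values invented for illustration (they are not read from Figure A), and `rank_by_total` is a helper name I made up for this example; Resource Monitor itself does this sorting for you when you click the Total (B/sec) column header.

```python
# Hypothetical sample rows: (image, pid, read B/sec, write B/sec).
# The numbers are invented for illustration, not taken from Figure A.
samples = [
    ("store.exe", 2388, 310_000, 450_000),
    ("DPMRA.exe", 4120, 5_200_000, 12_000),
    ("System", 4, 8_000, 95_000),
]


def rank_by_total(rows):
    """Return one dict per process, sorted by read + write, highest first."""
    ranked = [
        {
            "image": image,
            "pid": pid,
            "read": read,
            "write": write,
            "total": read + write,  # Total (B/sec) = Read + Write
        }
        for image, pid, read, write in rows
    ]
    return sorted(ranked, key=lambda row: row["total"], reverse=True)


ranked = rank_by_total(samples)
for row in ranked:
    print(f'{row["image"]:<12} PID {row["pid"]:>5}  total {row["total"]:>10,} B/sec')
```

With these made-up numbers, the read-heavy process lands at the top of the list, which is the same conclusion you would draw by eye from the Read (B/sec) column.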
Disk Activity
This section of the Resource Monitor window provides you with more useful troubleshooting information. In particular, the response time metric is probably the most useful metric of the bunch, as it’s directly observable without having to really understand the underlying storage configuration.
To the right side of this section label, you’ll see two quick-glance information boxes. The green box shows you the current disk I/O (i.e., the amount of data that is being transferred right now), and the blue box lists the highest amount of active time for the disks in the system.
- File. The name of the file that is being used by the active process. You’ll notice that you’re pointed to the full path so it’s easy to find the file.
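- I/O Priority. The priority that Windows has assigned to the I/O transfers for this file. Most activity runs at normal priority, while background work can run at a lower priority so that it doesn’t compete with foreground I/O.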
- Response Time (ms). Disk response time in milliseconds. For this metric, a lower number is definitely better; in general, anything less than 10 ms is considered good performance. If you occasionally go beyond 10 ms, you should be okay, but if the system is consistently waiting more than 20 ms for response from the storage, then you may have a problem that needs attention, and it’s likely that users will notice performance degradation. At 50 ms and greater, the problem is serious. Figure A shows that my Exchange server is seeing 5 and 6 ms response times, so the storage is doing fine as per this metric.
Storage
- Logical Disk. The drive letter associated with a disk.
- Physical Disk. The physical disk that is being monitored on this line.
- Active Time (%). This shows the percentage of time that the disk is not idle and is actively serving requests. If a disk is constantly running at very, very high levels (say, more than 80%), it may point to a storage-related bottleneck. If your users are seeing performance issues, and you’re seeing 100% active times, you might need faster or more disks.
- Available Space (MB). The amount of free disk space on the volume, in megabytes.
- Total Space (MB). The total size of the volume, in megabytes.
- Disk Queue Length. Average disk queue length. The queue length metric displays the number of outstanding requests (read and write) at any given time. A high number can indicate that there are not enough disk spindles to service the needs of the application or that the existing storage is too slow to keep up with requests. However, how you define a “high number” in this context requires a deep understanding of how the underlying volume is created on the SAN. Each disk that makes up the underlying volume can service requests in parallel, so (simplistically) the more disks you have, the higher the disk queue length can be before it signals a problem.
To complicate matters further, the RAID level and stripe size you choose can affect this value as well. If you’re running on a single-disk system and your queue length remains consistently greater than 2, you should add more resources. If it goes beyond 5, you have serious problems that need to be addressed. If you know how many disks make up your underlying volume, multiply the number of disks by 2 to get a very rough, ballpark maximum queue length. So, if you’re running on a 10-disk system and the queue length is 18, you should be fine (10 × 2 = 20).
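The rules of thumb for response time and queue length can be sketched in a few lines of Python. Treat this as a rough illustration of the thresholds described in this article, not as an official Microsoft formula. The function names and the band labels ("good", "acceptable", "problem", "serious") are my own assumptions for the example, and the values you feed in should be sustained readings rather than momentary spikes.

```python
def classify_response_time(ms):
    """Map a sustained disk response time (in ms) to a rough band."""
    if ms < 10:
        return "good"        # under 10 ms is considered good performance
    if ms <= 20:
        return "acceptable"  # occasional excursions past 10 ms are okay
    if ms < 50:
        return "problem"     # consistently above 20 ms: users may notice
    return "serious"         # 50 ms and greater


def queue_length_ceiling(disk_count):
    """Very rough ballpark: about 2 outstanding requests per disk."""
    return disk_count * 2


def queue_length_ok(queue_length, disk_count=1):
    """True if the sustained queue length is within the ballpark ceiling."""
    return queue_length <= queue_length_ceiling(disk_count)


# Figure A shows 5 ms and 6 ms response times on the Exchange server.
print(classify_response_time(5), classify_response_time(6))

# The 10-disk example: a queue length of 18 is under the ceiling of 20.
print(queue_length_ceiling(10), queue_length_ok(18, disk_count=10))

# A single disk with a sustained queue length of 3 exceeds the ceiling of 2.
print(queue_length_ok(3, disk_count=1))
```

Keep in mind that the per-disk multiplier ignores RAID level and stripe size, so use the ceiling as a starting point for investigation rather than a hard limit.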
The graphs are very useful tools. The top graph shows you the transfer rates between the storage and the system for the past minute. The green portion is the current overall I/O, while the blue line displays the disk active time for that period. The remaining graphs show you the queue length for each disk in your system.
I have four disks (SAN volumes) in this Exchange server. Because of the way that the underlying SAN volumes are created in my array, I have no queue length-based performance issues at all.
In part two of my series about the Windows Resource Monitor, I’ll focus on CPU performance.