In this article, I'd like to share an incident that took place in my office. While it did not become a significant problem, it did point out the need to periodically verify the configuration of our iSCSI-based storage network.
If you've followed my iSCSI ramblings for the past 18 months, you already know that, last March, I implemented an EqualLogic PS200E iSCSI storage array in my data center. Connecting the various servers to the storage array is a pair of HP ProCurve 2848 switches. Each switch sports 48 10/100/1000 Ethernet ports, and both were cabled exactly to EqualLogic's specifications.
When we initially installed the array with our switches, we implemented flow control and jumbo frames wherever possible. Flow control provides the real iSCSI performance gain. Jumbo frames add less, but they do ease some of the burden on each server, since there are fewer packets to package up for the same amount of data. For the past 18 months, things have hummed along, with only minor problems here and there.
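To put a rough number on the "fewer packets" point, the short Python sketch below counts the frames needed to carry the same payload at a standard 1,500-byte MTU and at a 9,000-byte jumbo MTU. The figures are assumptions for illustration, not measurements from our network: it assumes 20-byte IPv4 and 20-byte TCP headers with no options, it ignores iSCSI PDU headers, and the function name is made up for this example.

```python
import math

IP_HEADER = 20   # assumed: IPv4 header without options
TCP_HEADER = 20  # assumed: TCP header without options


def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Return how many frames it takes to carry payload_bytes at the given MTU."""
    per_frame = mtu - IP_HEADER - TCP_HEADER  # TCP payload bytes per frame
    return math.ceil(payload_bytes / per_frame)


if __name__ == "__main__":
    one_mib = 1024 * 1024
    standard = frames_needed(one_mib, 1500)
    jumbo = frames_needed(one_mib, 9000)
    print(f"1 MiB at MTU 1500: {standard} frames")
    print(f"1 MiB at MTU 9000: {jumbo} frames")
```

Under those assumptions, 1 MiB takes 719 standard frames but only 118 jumbo frames, so each server handles roughly one-sixth as many packets.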
Last week, a combination of firmware revision levels and hard drive types in our SAN caused the SAN's performance to drop far enough to be noticeable across the organization. The help desk went nuts while IT staff began troubleshooting. The problem ended up being a bad hard drive that was not showing as bad due to a firmware bug, but the situation and the resulting call with tech support highlighted the fact that it's important to revisit the storage configuration from time to time.
During the call with tech support, it became apparent that, since our initial implementation, EqualLogic has learned a lot about various network devices and has refined its recommendations. For the particular model of switch that we are using, EqualLogic recommended that we disable the jumbo frames feature and use only flow control. The reason: The HP 28xx series of switches doesn't have a whole lot of buffer memory, and trying to use both flow control and jumbo frames on these switches can lead to communications problems.
At the time of our installation, EqualLogic also recommended, for full redundancy, two separate switches and provided a cabling diagram for getting the best results in the event of hardware failure somewhere in the chain. EqualLogic also recommended connecting the two switches together with an uplink cable, so communication would keep flowing, regardless of what hardware failed.
In our call this week, EqualLogic amended this recommendation by indicating that we should bond two channels together on the switches so that, in the event of a failure, communication could be maintained at full speed. The PS series arrays have three Gigabit Ethernet ports on each of two controllers, for a total of six ports. From the first controller, two of the three ports are cabled to one switch and the third to the second switch. The second controller is cabled the same way, but with its two connections going to the opposite switch from the first controller's. What this means is that there could be 2 Gbps worth of traffic trying to get through that uplink between the switches, so it's important to make sure enough bandwidth is there.
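The uplink arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every array port runs at 1 Gbps, a single controller carries the traffic at any one time, and in the worst case all of that controller's traffic on one switch has to cross to the other. The cabling map and function names are hypothetical and simply mirror the layout described above.

```python
import math
from collections import Counter

PORT_GBPS = 1  # assumed: each array port is Gigabit Ethernet

# Hypothetical cabling map: controller -> switch each of its three ports plugs into.
CABLING = {
    "controller_1": ["switch_a", "switch_a", "switch_b"],
    "controller_2": ["switch_b", "switch_b", "switch_a"],
}


def worst_case_uplink_gbps(cabling: dict) -> int:
    """Largest amount of one controller's port bandwidth that sits on a single switch."""
    worst = 0
    for ports in cabling.values():
        busiest = max(Counter(ports).values())  # most ports this controller has on one switch
        worst = max(worst, busiest * PORT_GBPS)
    return worst


def links_to_bond(cabling: dict, link_gbps: int = 1) -> int:
    """Number of inter-switch links needed to carry the worst case at full speed."""
    return math.ceil(worst_case_uplink_gbps(cabling) / link_gbps)


if __name__ == "__main__":
    print(worst_case_uplink_gbps(CABLING), "Gbps worst case across the uplink")
    print(links_to_bond(CABLING), "bonded 1-Gbps links needed")
```

With the layout described here, the sketch arrives at 2 Gbps and two bonded links, which lines up with EqualLogic's advice.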
Some of you may read this article and wonder why we didn't check these things before. This was truly a case of "set it and forget it." Since the storage network was working fine, and we had a ton of other projects, we kept up with regular firmware updates but hadn't followed up on possible changes in configuration recommendations. Now that we've updated the configurations, however, we have a storage network that is more stable and more resilient in the event of a failure.
The lesson for us: Schedule time to review our storage infrastructure and make sure we're running with the latest recommendations for best performance and stability. Of course, this applies to all services!