10 troubleshooting tips for your storage and SAN environment

Remote replication and the associated hardware are powerful tools for ensuring business continuity, but care must be taken in their deployment and use.

SAN switches and storage arrays have been around for many years, and over time they have become more fault-tolerant. However, there are still issues that can occur and take some time to resolve, in particular issues with remote replication and the associated hardware. I’ve come up with a few guidelines that may help you to prevent some of these problems, or at least cut down on the time taken to fix them.

#1 Check that your remote replication works.

Remote replication comes in a few flavours: IBM has Global Mirror, HP uses Business Continuance (or BC), and EMC uses Business Continuance Volumes (or BCV). Mostly, these are easy enough to configure and maintain. The main issue is ensuring that the disks that need to be replicated are indeed replicated.

After the initial setup of replication, ensure that there are procedures in place to describe which file systems are to be replicated. The pitfall is when disks are added to a server’s file system on the production site, but not on the remote site. The end result can be potentially disastrous if you find that a file system has grown on the production site, but the remotely replicated file system does not include the extra disks. There is no easy way back from this type of issue, particularly when a restore is required.
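One way to catch this kind of drift is to compare, on a schedule, the disks behind each production file system with the disks in the replication set. The following is only a minimal sketch in Python: the two dictionaries are hypothetical stand-ins for whatever your volume manager and array actually report, and no real vendor API is being called.

```python
def find_unreplicated_disks(production, replicated):
    """Return {file_system: sorted disks present on production but not replicated}."""
    missing = {}
    for fs, disks in production.items():
        gap = set(disks) - set(replicated.get(fs, ()))
        if gap:
            missing[fs] = sorted(gap)
    return missing


# Hypothetical inventory: file system -> disks backing it.
production = {
    "/data": ["disk01", "disk02", "disk03"],
    "/logs": ["disk10"],
}
replicated = {
    "/data": ["disk01", "disk02"],  # disk03 was added on production only
    "/logs": ["disk10"],
}

for fs, disks in find_unreplicated_disks(production, replicated).items():
    print(f"{fs}: not replicated -> {', '.join(disks)}")
```

Run after every storage change, a report like this flags the gap while it is still cheap to fix, rather than at restore time.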

#2 Ensure that you have functioning dual paths for remote replication.

Make sure that the two paths between sites take physically separate routes to the remote site. Although the instances of a backhoe operator severing data cables have become less frequent due to modern mapping systems, breaks still happen. Most service providers route the links between sites along different physical paths, but you should check.

#3 Ensure the DWDM hardware is supported by your SAN vendor.

Remote replication is usually done via a SAN switch connecting into another switch, which typically uses Dense Wavelength Division Multiplexing (DWDM) at the physical level to send the data to a remote site. A big question is: are your SAN switches running in a supported mode for remote replication? It might seem basic, but even changing a card on the DWDM switch could put you into an unsupported configuration.

Upgrades to DWDM hardware also mean you have to check that the SAN switch is running a supported firmware version. This is usually the case; however, some customers leave long gaps between firmware upgrades. Be sure to check, and advise the customer if there is going to be an issue. Gently remind them that keeping firmware at the latest version, or just below it, is in their best interests.
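The firmware check itself is easy to script once you have the vendor's support matrix to hand. Here is a rough sketch in Python; the card names, version ranges and the `SUPPORT_MATRIX` table are invented for illustration, and real firmware strings often carry letter suffixes that this simple dotted-number parser does not handle.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '8.2.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_supported(switch_fw, min_fw, max_fw):
    """True if switch_fw lies within the supported range, inclusive."""
    return parse_version(min_fw) <= parse_version(switch_fw) <= parse_version(max_fw)


# Hypothetical support matrix: DWDM card -> (minimum, maximum) SAN switch firmware.
SUPPORT_MATRIX = {
    "card-A": ("7.4.0", "8.2.3"),
    "card-B": ("8.0.1", "8.2.3"),
}


def check(card, switch_fw):
    """Report whether the switch firmware is supported for the given card."""
    if card not in SUPPORT_MATRIX:
        return "unknown card: ask the vendor"
    low, high = SUPPORT_MATRIX[card]
    return "supported" if is_supported(switch_fw, low, high) else "UNSUPPORTED"


print("card-A with 7.4.2:", check("card-A", "7.4.2"))
print("card-B with 7.4.2:", check("card-B", "7.4.2"))
```

Comparing versions as tuples of integers rather than as strings matters here: as plain text, "7.10.0" would sort before "7.4.0".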

#4 Document the order to bring hosts up.

Storage is very robust, isn’t it? It certainly is, but you still need to know whether the order in which hosts and applications are started matters. A document that states the correct order can save you time.
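If the dependencies between hosts are written down, the start-up order can be derived from them rather than remembered. As an illustration only, the sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to sort a made-up set of hosts; the host names and their dependencies are assumptions, not a recommended layout.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical hosts: each key must start after every host in its set.
depends_on = {
    "nfs-server": set(),
    "db-server": {"nfs-server"},
    "app-server": {"db-server"},
    "web-server": {"app-server"},
    "report-server": {"db-server"},
}


def startup_order(graph):
    """Return the hosts in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


order = startup_order(depends_on)
for step, host in enumerate(order, start=1):
    print(f"{step}. {host}")
```

The printed list can be pasted straight into the start-up document, and a circular dependency is reported as an error instead of being discovered during a recovery.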

#5 Ensure that you have contact details for your DWDM hardware vendors.

It’s not enough to know where the hardware is located in the data centre. You should also know who to call if you think there is an issue. Sometimes DWDM hardware is looked after by the network team; sometimes it is an outside company. Whoever it is, make sure you know who to call 24/7/365.

#6 Ensure that you know who the service provider for DWDM is.

This may seem like common sense, but I have known customers who did not know who actually looks after the DWDM service. Sometimes it is the same people who supply the hardware; sometimes it is not. If it isn’t, ensure you have the details of the service provider.

#7 Don't ignore what looks like a spurious alert on SAN switches.

If you notice an odd alert on a SAN switch, check it out and get to the root cause. It may be nothing, or it may be a sign that something is amiss. One customer had an alert that appeared after a weekend. It turned out to be a warning about degradation on the DWDM link. However, it was ignored for almost two weeks, until an investigation into a slowdown of backups turned to the SAN. The link degradation was due to a cable; the cable was replaced, the alerts disappeared, and the backup slowdown was resolved.

#8 Know the business impact if your remote replication is down.

If the remote links go down completely, what is the impact? For many sites, it might mean backups are affected. Other sites may run reporting-type databases against the replicated data, which could mean an impact during business hours. Whatever the impact, make sure you know what it is.

#9 Know when a vendor makes a change to a piece of DWDM kit.

Make sure you know exactly what they are doing, and make sure it is done one path at a time. Once one path is done, run through checks to ensure the replication is still working. Check any alert messages on the SAN side. Once you are satisfied there is no impact, then you can allow work on the second path. The bottom line is that you want at least one fully functioning path.
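The "one path at a time" rule can be written down as a simple gate that the change procedure has to pass before work starts on any path. This is a minimal Python sketch; the path names and the health flags are hypothetical, and in practice the flags would come from your own replication and SAN alert checks.

```python
def can_start_work(target, path_health):
    """Allow work on `target` only if at least one other path is healthy."""
    return any(ok for name, ok in path_health.items() if name != target)


# Hypothetical state before any work: both paths carry replication.
health = {"path-A": True, "path-B": True}
print("Start on path-A?", can_start_work("path-A", health))

# After the vendor's change, path-A has not yet passed its checks.
health["path-A"] = False
print("Start on path-B?", can_start_work("path-B", health))

# Once replication over path-A is verified again, path-B may be released.
health["path-A"] = True
print("Start on path-B?", can_start_work("path-B", health))
```

The middle case is the one that matters: work on the second path is refused until the first path has been shown to be fully functioning again.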

#10 Ensure cables in the data centre are separated by distance for redundancy and label them.

I have seen the case where cables for multipathing of servers have been loomed together. I’ve also seen a contractor removing old fibre cables by cutting the links and pulling the slack out. The inevitable happened one day. He cut through a live cable; several in fact. Luckily, it only affected one path for each of the impacted servers. All the paths successfully failed over, but it did trigger a few software alarms, and a rapid visit from the data centre manager to the area of misdemeanour.

Remote replication (and the hardware that supports it) is a wonderful tool, if built and maintained correctly. These guidelines should help to prevent some issues, and to enable timely resolution of other issues.