SAN switches and storage arrays have been around
for many years, and over time they have become more fault-tolerant. However,
there are still issues that can occur and take some time to resolve, in
particular issues with remote replication and its associated hardware. I’ve
come up with a few guidelines that may help you to prevent some of these
problems, or at least cut down on the time taken to fix them.
#1 Check that your
remote replication works.
Remote replication has a few flavours: IBM has Global
Mirror, HP has Business Continuance (BC), and EMC has Business Continuance
Volumes (BCVs). Mostly, these are easy enough to configure and maintain. The
main issue is ensuring that disks that need to be replicated are indeed
replicated. After the initial setup of replication, ensure
that there are procedures in place to describe which file systems are to be
replicated. The pitfall is when disks are added to a server’s file system on
the production site, but not on the remote site. The end result can be
potentially disastrous if you find that a file system has grown on the
production site, but the remotely replicated file system does not include the
extra disks. There is no easy way back from this type of issue, particularly
when a restore is required.
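One way to catch this drift early is a routine check that compares the disks backing each production file system with the disks in its remote replication group. The sketch below is illustrative only: the LUN names and the two inventories are placeholders, and in practice they would come from your array vendor's CLI or management API.

```python
# Hypothetical sketch: flag production LUNs that have no remote replica,
# e.g. disks added to a file system on the production site but never
# added to the replication group. LUN names here are made up.

def unreplicated_luns(production_luns, replicated_luns):
    """Return production LUNs that are missing from the replication group."""
    return sorted(set(production_luns) - set(replicated_luns))

# Illustrative inventories; real ones would be pulled from the array.
production = ["lun0001", "lun0002", "lun0003", "lun0004"]
replicated = ["lun0001", "lun0002", "lun0003"]

missing = unreplicated_luns(production, replicated)
if missing:
    print("WARNING: not replicated:", ", ".join(missing))
```

Run as a scheduled job, a check like this turns a silent configuration gap into an alert long before a restore is needed.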
#2 Ensure that you have
functioning dual paths for remote replication.
Make sure that the dual paths between sites take a separate route
to the remote site. Although the instances of a backhoe operator severing data
cables have become less frequent due to modern mapping systems, breaks still happen.
Most service providers have links between sites going via a different physical
route, but you should check.
#3 Ensure the DWDM hardware
is supported by your SAN vendor.
Remote replication is usually done via a SAN switch
connecting into another switch, which typically uses Dense Wavelength Division
Multiplexing (DWDM) at the physical level to send the data to a remote site. A
big question is: are your SAN switches running a supported mode for remote
replication? It might seem basic, but even changing a card on the DWDM switch
could put you into an unsupported configuration.
Upgrades to DWDM hardware also mean you have to check that the SAN
switch is running a supported firmware version. This is usually the case;
however, sometimes customers have long lag times between firmware upgrades. Be
sure to check and advise the customer if there is going to be an issue. Gently
remind them that upgrading firmware to the latest version or just below is in
their best interests.
#4 Document the order to
bring hosts up.
Is storage very robust? It certainly is, but you still need to
know whether there are any issues with the order in which hosts and
applications are started. A document that states this order can save you time.
#5 Ensure that you have
contact details for your DWDM hardware vendors.
It’s not enough to know where the hardware is located in the
data centre; you should also know who to call if you think there is an issue. Sometimes
DWDM hardware is looked after by network teams, sometimes it is an outside
company. Whoever it is, make sure you know who to call 24/7/365.
#6 Ensure that you know
who the service provider for DWDM is.
This may seem like common sense, but I have known customers not to
know who actually looks after the DWDM service. Sometimes it is the same people who
supply the hardware, sometimes it is not. If it isn’t, ensure you know the
details of the service provider.
#7 Don’t ignore what looks
like a spurious alert on SAN switches.
If you notice some odd alert on a SAN switch, check it out, and
get to the root cause. It may be nothing. Or it may be a sign that something is
amiss. One customer had an alert that appeared after a weekend. It turned out
to be a warning about link degradation on the DWDM link. However, it had been
ignored for almost two weeks before an investigation into a slowdown of backups
turned to checking out the SAN. The link degradation was due to a cable; the cable
was replaced, alerts disappeared, and the backup slowdown issue was resolved.
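A simple guard against that kind of two-week blind spot is to escalate any alert that has been outstanding longer than some threshold. The sketch below is a hypothetical illustration: the alert records stand in for whatever your switch's syslog or management tool actually produces, and the two-day threshold is an assumption.

```python
# Hypothetical sketch: escalate switch alerts that have gone unresolved
# for longer than a threshold. Alert records are illustrative stand-ins
# for real switch syslog/management output.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=2)  # assumed escalation threshold

def stale_alerts(alerts, now):
    """Return alerts still unresolved after STALE_AFTER."""
    return [a for a in alerts if now - a["first_seen"] > STALE_AFTER]

alerts = [
    {"msg": "link degradation on ISL port 12", "first_seen": datetime(2024, 5, 1)},
    {"msg": "fan speed high", "first_seen": datetime(2024, 5, 14)},
]
for alert in stale_alerts(alerts, now=datetime(2024, 5, 15)):
    print("ESCALATE:", alert["msg"])
```

The point is not the mechanism but the discipline: an alert that survives the threshold gets a human looking at it, rather than waiting for a backup slowdown to force the investigation.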
#8 Know the business
impact if your remote replication is down.
If the remote links go down completely, what is the impact? For many
sites it might mean backups are affected. Other sites may run reporting-type
databases, in which case there could be a business-hours impact. Whatever the impact, make
sure you know what it is.
#9 Know when a vendor makes a
change to a piece of DWDM kit.
Make sure you know exactly what they are doing,
and make sure it is done one path at a time. Once one path is done, run through checks to ensure the
replication is still working. Check any alert messages on the SAN side. Once
you are satisfied there is no impact, then you can allow work on the second
path. The bottom line is that you want at least one fully functioning path.
#10 Ensure cables in the
data centre are separated by distance for redundancy and label them.
I have seen the case where cables for multipathing of servers have
been loomed together. I’ve also seen a contractor removing old fibre cables by
cutting the links and pulling the slack out. The inevitable happened one day:
he cut through a live cable; several, in fact. Luckily, it only affected one
path for each of the impacted servers. All the paths successfully failed over,
but it did trigger a few software alarms, and a rapid visit from the data
centre manager to the area of misdemeanour.
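After an incident like that, it is worth confirming that every multipathed device still has its full complement of active paths, not just that the file systems are up. The sketch below is a hypothetical illustration: the path table is made up, and on Linux you might instead derive the counts by parsing `multipath -ll` output.

```python
# Hypothetical sketch: flag multipathed devices running on fewer active
# paths than expected. The path counts here are illustrative; real ones
# would be parsed from your multipathing tool's output.

EXPECTED_PATHS = 2  # assumed dual-path design

def degraded_devices(path_counts):
    """Return devices with fewer active paths than expected."""
    return {dev: n for dev, n in path_counts.items() if n < EXPECTED_PATHS}

# Illustrative active-path count per device.
paths = {"mpatha": 2, "mpathb": 1, "mpathc": 2}

for dev, n in degraded_devices(paths).items():
    print(f"ALERT: {dev} has {n} active path(s), expected {EXPECTED_PATHS}")
```

A failover that works is only half the story; this kind of check tells you that you are now running without redundancy and need the severed path repaired.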
Remote replication (and the hardware
that supports it) is a wonderful tool, if built and maintained correctly. These
guidelines should help to prevent some issues, and to enable timely resolution
of other issues.