Mark Pimperton describes how more secure handling of ARP packets by a new router caused a baffling loss of both Internet connections after 15 minutes.
We recently upgraded our router from a Zyxel Zywall 35 to a SonicWALL NSA 240. (We have two Internet connections and our venerable Zyxel was unable to cope with rising demand. Every so often the CPU would hit 100% and then we'd lose connectivity on both connections.)
During configuration, testing and initial deployment of the SonicWALL all seemed well. It was only when we went live that things unraveled. Web browsing was very slow - a real disappointment for Day 1! After a while we figured we had a DNS problem because all our nslookups, pings and tracerts to external sites were failing. We played around with DNS settings on the SonicWALL, but we knew they shouldn't have been relevant because DNS requests from users are handled by our DNS server. (The SonicWALL uses its own DNS settings to resolve names in reports, for example, but ordinary Web browsing requests should be handled by the DNS server.)
The router included bundled subscriptions to SonicWALL security services (e.g. content filtering) but our intention was to operate with all those switched off in the first instance in case of performance problems. I checked and found one of them still switched on in one of the zones. I switched it off and - bingo! Our DNS and browsing all came to life again.
Unfortunately it all broke again a few minutes later.
To make matters worse, I then realised our Exchange server wasn't sending any email out. Opening the Exchange Queue Viewer showed a stack of undelivered messages with - guess what - DNS failures.
I searched discussion forums and took some comfort from apparently not being the only one, but the thread I found didn't offer me a solution. We went back to checking our settings, including NAT Policies. There was one we weren't sure about so we disabled it. Everything started to work again, and our email was flowing once more. 15 minutes later, it all broke again. Eventually we realised that making any setting change on the SonicWALL - enabling or disabling a rule or a policy - would fix it for about 15 minutes.
Curiouser and curiouser, as they say.
I logged a support case with SonicWALL and also posted on the Spiceworks community. Responses from the community led me to think we'd cracked it and that it was caused by packet splitting when spilling over from one WAN to the other. Unfortunately that proved to be a dead end as well.
We could tell it was something to do with having two WAN connections because when we ran on only one (which was our faster one), everything was fine. It was when we reconnected the secondary connection that it would start to fail.
We tried a few other changes - like deleting a route policy that forced all HTTPS traffic to use WAN1, regardless of load balancing settings - to no avail.
Finally SonicWALL support came up with the goods. Their knowledgebase article describes our problem exactly, and it's something our old Zyxel was blissfully unaware of. Evidently our secondary ISP sends ARP (Address Resolution Protocol) requests to check which of our static IP addresses are in use. The SonicWALL sees these requests as coming from an unknown subnet and promptly drops them, because it regards them as a security risk. After a while (about 15 minutes in our case), the ISP's ARP cache no longer has any record of how to reach us, so it doesn't know where to send packets we should receive. Result: no connectivity for that ISP.
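To make the 15-minute pattern concrete, here is a minimal Python sketch of an ISP-side ARP cache whose entries age out unless our router answers the ISP's requests. It is a toy model, not how any real ISP equipment or the SonicWALL is implemented: the class and function names, the 203.0.113.10 address (from a documentation range) and the 15-minute lifetime are all assumptions chosen to match what we saw.

```python
ARP_CACHE_TTL = 15 * 60  # seconds; assumed, picked to match the ~15 minutes we observed


class IspArpCache:
    """Toy model of the ISP's ARP cache: an entry expires unless refreshed."""

    def __init__(self, ttl=ARP_CACHE_TTL):
        self.ttl = ttl
        self.entries = {}  # IP address -> time (in seconds) the entry was last refreshed

    def learn(self, ip, now):
        self.entries[ip] = now

    def can_reach(self, ip, now):
        learned = self.entries.get(ip)
        return learned is not None and now - learned < self.ttl


def simulate(router_answers_arp, minutes=30, probe_every=5):
    """Return (minute, reachable) pairs for our hypothetical static IP."""
    cache = IspArpCache()
    our_ip = "203.0.113.10"  # hypothetical address, for illustration only
    cache.learn(our_ip, now=0)  # the ISP learns about us once, e.g. when the router comes up
    timeline = []
    for minute in range(0, minutes + 1, probe_every):
        now = minute * 60
        if minute > 0 and router_answers_arp:
            # The ISP's ARP request was answered, so its entry is refreshed.
            cache.learn(our_ip, now)
        timeline.append((minute, cache.can_reach(our_ip, now)))
    return timeline


if __name__ == "__main__":
    print("Requests dropped: ", simulate(router_answers_arp=False))
    print("Requests answered:", simulate(router_answers_arp=True))
```

With the requests dropped, the entry is never refreshed and the address becomes unreachable once the lifetime runs out. With the requests answered, it stays reachable throughout.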
Because of the load balancing between our two connections, whenever the primary connection reached the preset threshold, the SonicWALL would stop using it for new connections and try to use the secondary connection - which was broken. Hence we lost both connections, and it was just like the bad old days with the Zyxel. Only more frequent.
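The spill-over behaviour can be sketched in a few lines of Python. This is only an illustration of the logic as I understand it, not the SonicWALL's actual algorithm: the function names, the percentage-based load figure and the 80% threshold are assumptions.

```python
def pick_wan(primary_load_pct, threshold_pct):
    """Spill-over: use WAN1 until it reaches the threshold, then send new connections to WAN2."""
    return "WAN1" if primary_load_pct < threshold_pct else "WAN2"


def new_connection_works(primary_load_pct, threshold_pct, wan2_reachable):
    """A new connection succeeds only if the link it was assigned to is really usable."""
    wan = pick_wan(primary_load_pct, threshold_pct)
    return wan == "WAN1" or wan2_reachable


if __name__ == "__main__":
    # Quiet period: everything goes out over WAN1, so the broken WAN2 goes unnoticed.
    print(new_connection_works(40, threshold_pct=80, wan2_reachable=False))
    # Busy period: new connections spill over to WAN2, which the ISP can no longer reach.
    print(new_connection_works(90, threshold_pct=80, wan2_reachable=False))
```

The point of the sketch is that the router keeps choosing WAN2 whenever WAN1 is busy, whether or not WAN2 is actually working, which is why the failure looked intermittent.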
The SonicWALL article describes three steps to diagnosing and fixing this problem:
- Using a hidden option to send "gratuitous ARP requests" from the router to restore connectivity. (This seems to be what we were effectively doing when we made setting changes, though we didn't realise it.)
- Using Packet Capture to see the incoming ARP requests being dropped. Just like the article shows, I could see the relevant IP address and the packets being rejected.
- Adding a static route to tell the SonicWALL that requests from this IP address are acceptable.
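The last two steps can be illustrated with a short Python sketch: pull the sender IP address out of a raw ARP packet (the field you look for in the packet capture), then check it against the networks the router knows about, with and without the extra static route. This is a simplified model of the check, not SonicWALL's implementation. The function names and all addresses are hypothetical (taken from documentation ranges); only the 28-byte ARP packet layout for Ethernet and IPv4 is standard.

```python
import ipaddress
import struct

# Standard ARP layout for Ethernet/IPv4: hardware type, protocol type,
# hardware length, protocol length, operation, sender MAC, sender IP,
# target MAC, target IP (28 bytes in total).
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_LENGTH = struct.calcsize(ARP_FORMAT)


def arp_sender_ip(payload):
    """Extract the sender IP address from a raw ARP packet."""
    fields = struct.unpack(ARP_FORMAT, payload[:ARP_LENGTH])
    return ipaddress.IPv4Address(fields[6])


def accept_arp(payload, wan_subnet, static_routes=()):
    """Accept the packet only if its sender is in a network the router knows about."""
    sender = arp_sender_ip(payload)
    return any(sender in network for network in (wan_subnet, *static_routes))


def build_arp_request(sender_ip, target_ip):
    """Build a minimal ARP request with a made-up sender MAC, for the demo."""
    return struct.pack(
        ARP_FORMAT, 1, 0x0800, 6, 4, 1,  # Ethernet, IPv4, lengths 6 and 4, operation 1 = request
        bytes.fromhex("020000000001"),
        ipaddress.IPv4Address(sender_ip).packed,
        bytes(6),
        ipaddress.IPv4Address(target_ip).packed,
    )


if __name__ == "__main__":
    wan_subnet = ipaddress.ip_network("203.0.113.0/29")   # hypothetical WAN2 subnet
    isp_route = ipaddress.ip_network("198.51.100.1/32")   # hypothetical ISP ARP source
    request = build_arp_request("198.51.100.1", "203.0.113.2")
    print(arp_sender_ip(request))
    print(accept_arp(request, wan_subnet))                 # sender is outside the WAN subnet
    print(accept_arp(request, wan_subnet, [isp_route]))    # static route makes it acceptable
```

If the ISP changes the address it sends ARP requests from, only the `isp_route` value in this model would need to change, which mirrors what the article says about updating the static route.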
Finally we could load-balance, browse and send email without problems. The article does warn that if the ISP ever changes the source IP address for the ARP packets we'll hit the same problem - but this time we'll be prepared and can just change the static route.
This was easy to fix once we found the relevant article but I did begin to wonder if I'd bought a bad router! I'm no expert on networking but I've learnt that ARP requests are important and that normally you'd only see them on your internal LAN. A router that checks them as strictly as the SonicWALL does will drop incoming requests from an address it doesn't recognise, and if those requests are from your ISP your connection will break.