Been tasked with a new project: design and set up a fully redundant network for a new data center. The old data center just doesn’t have enough uptime. It wasn’t designed right, and we suffered a major production outage a couple of months ago due to a car crash. A couple of cars lost control, and one of them hit the telephone pole that carries electricity to our facility. Bingo-Bongo, power outage for 12 hours. Finally (!) the powers that be agreed to move the servers to a REAL facility (thank you thank you thank you).
The Systems Administrator is working on the fully redundant systems pieces (VMware ESX 4, a SAN, tape backups and autoloader and all that goes with it), so fortunately I don’t have to sweat any of that. However, this new center will be one of the hubs for our dual-hub, hub-and-spoke network. We’ve got something like 35 remote sites that will each need two VPN tunnels to this location, and will be running EIGRP over each tunnel.
This will give the remote sites the redundancy they need should one of our data center routers go down. Of course, we have dual routers and dual switches, and will run HSRP on our routers’ LAN interfaces. Additionally, the Data Center will run HSRP on their routers’ LAN interfaces, to protect us against a circuit/equipment failure on their side. So far, so good (in theory).
So I set about to create a simulation of this whole setup (since the routers and switches came in early), and test the failure scenarios to make sure the redundancy operated like we expected it to (lab network diagram below). And boy, how it did NOT work!
In the setup (see diagram below), I had one host (a PC) connected to a pair of switches, which in turn are connected to a pair of routers running HSRP on their LAN interfaces. These routers are connected to the dual handoff from the Data Center switches, which of course are connected to dual Data Center routers. These lead to the internet, and ultimately, our remote router. The remote router has two VPNs, one to each of our pair of routers.
The failure test itself is pretty simple: establish the VPNs, start a continuous ping, and start disconnecting interfaces. Theoretically, we should have a few dropped pings with every disconnect (and reconnect), but no more. And at no point should we lose ping altogether.
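The pass/fail rule for that test can be written down in a few lines of Python. This is only a sketch: it assumes the ping results have already been captured as a list of True (reply) and False (timeout) values, and the threshold of 5 consecutive drops is my own stand-in for "a few", not a number from the actual test.

```python
from itertools import groupby


def longest_outage(results):
    """Return the longest run of consecutive lost pings (False values)."""
    longest = 0
    for ok, run in groupby(results):
        if not ok:
            longest = max(longest, len(list(run)))
    return longest


def failover_ok(results, max_consecutive_drops=5):
    """Pass if ping recovered by the end and no outage ran too long.

    max_consecutive_drops is an assumed threshold for "a few" drops.
    """
    if not results or not results[-1]:
        return False  # no data, or ping never came back
    return longest_outage(results) <= max_consecutive_drops


# Hypothetical captures: True = reply, False = timeout.
good_failover = [True] * 3 + [False] * 3 + [True] * 4
lost_altogether = [True] * 3 + [False] * 30

print(failover_ok(good_failover))    # short blip, then recovery
print(failover_ok(lost_altogether))  # ping never returns
```

Each disconnect and reconnect would get its own capture, so a single bad step is easy to pin down.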
It all worked fine, until I disconnected the Data Center routers’ HSRP interfaces. Then we lost ping altogether, even though the Data Center routers successfully failed over. Took me a while to figure it out, but in the end, it was pretty simple.
Any guesses? I’ll post what I found and the resolution in Part 2. No doubt some of you have found better solutions than me!