Rebuilding the WAN…

Have you ever inherited a mess? Surely it’s happened once or twice. You’re brought in as the Go-To-Guy, and now you’re responsible for maintaining, improving, and more importantly, fixing the network. Of course, you never know what’s needed until you start getting your hands dirty.

In my case, the WAN needed the most fixing. It was a huge, neglected problem. All the income-generating assets for this company resided at Branch sites. These amounted to Point of Sale units, and units that communicated with the data center averaged more money than units that didn’t. Like everywhere else, uptime was a major company priority. When the network was up, everything was fine. But when it went down, it went down hard, and did not come back quickly or easily.

The existing WAN comprised just over 190 remote sites, each with three VPN tunnels: one tunnel to a data center and two tunnels to the main office. They used GRE over IPSEC with EIGRP. In theory, this was fine, but in practice? Butchery.

There were five major problems:

1. Due to GRE over IPSEC, all Branch sites needed a static IP address (a point-to-point tunnel has to be configured with a fixed peer address). Since our sales team loved to forget to tell IT about new sites, they often left us scrambling, especially in places like northern Montana, 20 miles from the nearest small town. Net effect: decreased speed and increased cost to deploy Branch sites.

2. Each VPN tunnel required four router config sections (pre-shared key, crypto map, tunnel interface and access-list). This made the Headend router configs very, very long. On the office Headend routers, the config was pushing 3500 lines! Net effect: increased chance of manual error for any moves, adds or changes.

3. To keep 190 tunnels straight, the four config sections for a given tunnel were originally supposed to share the same number. But the section numbers didn’t always match, even on the same Headend. Awesome. Net effect: more time needed for troubleshooting.

4. Each VPN tunnel had its own subnet – a necessity. However, just like the config sections, there was little consistency. You’d expect every tunnel from a Branch to share some number, like 192.168.100.X, right? Yeah, not the case. I think they used the Dart Board method to assign subnets. Net effect: even more time needed for troubleshooting.

5. No route distribute lists meant too many routing updates. Branch sites really only needed to reach the office and the data center, nothing more. Instead, 190 routers constantly updated each other, and routing tables were way too big. When a Headend router went down, you could expect 15 minutes before all Branches could reach the data center again. Net effect: decreased fault tolerance, slow network recovery times, and less uptime.
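To make the config bloat concrete, here’s a sketch of roughly what one Branch’s four config sections looked like on a Headend. All names, keys, numbers and addresses are invented for illustration, not taken from the actual configs:

```
! --- Repeated once per Branch, ~190 times per Headend ---
! 1. Pre-shared key, tied to the Branch's static public IP
crypto isakmp key BRANCH42-SECRET address 203.0.113.42

! 2. Crypto map entry (one sequence number per Branch)
crypto map WAN-MAP 42 ipsec-isakmp
 set peer 203.0.113.42
 set transform-set WAN-TS
 match address 142

! 3. Point-to-point GRE tunnel interface
interface Tunnel42
 ip address 192.168.100.1 255.255.255.252
 tunnel source GigabitEthernet0/0
 tunnel destination 203.0.113.42

! 4. Access-list matching the GRE traffic for this one peer
access-list 142 permit gre host 198.51.100.1 host 203.0.113.42
```

Four sections times roughly 190 Branches is how a config balloons past 3500 lines, and why a single mismatched sequence number is so easy to introduce and so hard to spot.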

See a pattern here?

Like many technically complex problems, this one wasn’t completely understood. It didn’t get a lot of airplay in IT discussions, and had gone largely unaddressed. As a pervasive problem, it couldn’t be fixed quickly or easily (unless reconfiguring every router in the company without incurring any downtime is quick and easy). Selling a Fix-It plan to the IT director or his VP was daunting, to say the least. But this finicky, fragile, non-fault-tolerant network regularly made us look bad, and made our competitors look that much better. It HAD to be remedied; the sooner, the better.

So after much discussion and lab testing to perfect the Order of Operations needed, here’s the Fix-It solution:

1. All GRE over IPSEC tunnels were replaced with Dynamic Multipoint VPN (DMVPN). DMVPN uses multipoint GRE over IPSEC, which allowed one Headend tunnel interface to talk to all Branch routers, rather than a separate tunnel for each Branch. This shrank the Headend router configs by 90%. DMVPN also uses Next Hop Resolution Protocol (NHRP), which lets Branches register their addresses with the Headend dynamically, eliminating the need for Branch static IPs. Branch routers could now have DSL modems with dynamic external IP addresses. Net effect: faster troubleshooting, and reduced time and cost to provision Branch sites.

2. Branch VPN tunnel subnets were aligned in a consistent pattern. Now, the last octet of the tunnel IP address was the same for every tunnel from a Branch. In an armada of IP addresses sailing around the network, this eased Branch identification during troubleshooting. Net effect: even faster troubleshooting, less confusion.

3. Route distribute lists were put in place to limit route exchange between the Headends and Branches. Now the Branches only knew about the office and data center – just three peers, not 190. Net effect: better fault tolerance, and quicker network recovery.
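A sketch of what the replacement design looks like: one mGRE tunnel on the Headend, NHRP registration from the Branch, and route filtering so Branches only learn the routes they need. Again, the interface names, IPs, network-id, prefixes and EIGRP AS number are all invented for illustration:

```
! --- Headend: one multipoint GRE tunnel replaces ~190 point-to-point tunnels ---
interface Tunnel0
 ip address 192.168.100.1 255.255.255.0
 ip nhrp network-id 1
 ip nhrp map multicast dynamic       ! learn Branch addresses as they register
 tunnel source GigabitEthernet0/0
 tunnel mode gre multipoint
 tunnel protection ipsec profile WAN-PROF

! --- Branch 42: registers with the hub, so it can sit behind a dynamic DSL IP ---
interface Tunnel0
 ip address 192.168.100.42 255.255.255.0   ! last octet identifies the Branch
 ip nhrp network-id 1
 ip nhrp map 192.168.100.1 198.51.100.1    ! static mapping to the hub only
 ip nhrp nhs 192.168.100.1
 tunnel source Dialer0
 tunnel mode gre multipoint
 tunnel protection ipsec profile WAN-PROF

! --- Headend route filtering: Branches only learn office and data center routes ---
router eigrp 100
 distribute-list prefix CORE-ONLY out Tunnel0
!
ip prefix-list CORE-ONLY seq 5 permit 10.0.0.0/16     ! office
ip prefix-list CORE-ONLY seq 10 permit 10.1.0.0/16    ! data center
```

Note how the Branch number (42) shows up in the tunnel address, per fix #2, and how adding a new Branch no longer touches the Headend config at all.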

See a pattern?

This took weeks to accomplish, and there were few solid metrics available to compare the previous environment with the new one. However, everyday experience proved the network ran dramatically better. With the Headend configs down from 3500 lines to under 400, we could begin adding other network services, rather than waste time wandering through the config labyrinth. Identifying problem sites was quick and easy. And network convergence had shrunk from 10 minutes to less than 2 minutes.

Despite the time it took, it was worth it. The network went from a liability to a strength, and enabled a variety of other services, from reliable and comprehensive site monitoring (a company first) to advanced SQL reporting for performance metrics and financial data (another company first).

The network enables so many aspects of a company that, with the right perspective, technical expertise and persistence, you can leverage it to serious competitive advantage.

