Monthly Archives: January 2012

When to reboot a Cisco router?

Yeah, when do you do it? Or more precisely, when will it HELP?

In my early tech days of Windows 95/98 and NT/2000, it seemed like every single time you did something, you had to reboot the stupid computer. Change the IP address? Reboot. Change the DNS? Reboot. Install a program? Reboot. Change the desktop wallpaper? Reboot (just kidding).

In contrast, though, in my experience with Cisco, a reboot hardly ever fixed a problem. That was mostly because the problems were my own config errors. However, recently, a reboot DID fix a problem, much to my surprise.

I was changing VPN tunnel encryption methods on a core VPN router, moving the Virtual Tunnel Interfaces from using a crypto map to being protected by an IPsec profile. After changing about 40 of the tunnels, we started seeing aberrant routing behavior. Some of the remote VPN peers have primary and secondary IP addresses on their internal interfaces, and while we could ping one, we couldn't ping the other. On one peer, we could ping the primary but not the secondary. On another, we could ping the secondary but not the primary.
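Roughly, the "after" state of each tunnel looked something like this. This is a minimal sketch, not the actual config; the transform set, profile name, interface number and addresses are all invented:

crypto ipsec transform-set AES-SHA esp-aes esp-sha-hmac
!
crypto ipsec profile VPN-PROFILE
 set transform-set AES-SHA
!
! The tunnel interface now carries its own protection; no crypto map on the physical interface
interface Tunnel40
 ip address 192.168.100.41 255.255.255.252
 tunnel source GigabitEthernet0/0
 tunnel destination 203.0.113.40
 tunnel mode ipsec ipv4
 tunnel protection ipsec profile VPN-PROFILE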

And of course, our monitoring alerts were going crazy in the background.

I did some quick troubleshooting: no ACLs affecting traffic on either side; no EIGRP distribute lists blocking traffic on either side; checked the routing tables on core and hub – fine; checked the EIGRP topology tables on both – fine; ran traceroutes from core and hub – traffic from the remote could reach the core side of the VPN tunnel, but no farther. So routing looked fine, except for the mystery blocking at the core.
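For the record, the checks amounted to something like this (the address is a placeholder; show ip protocols is where any EIGRP distribute lists would show up, listed as update filter lists):

show access-lists
show ip route 10.1.2.1
show ip eigrp topology
show ip protocols
traceroute 10.1.2.1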

Really, only one thing left to do – reload the core, since that was where all the changes had been made. A couple of nerve-wracking minutes later (would it come back? did we have a SmartNet?), it came back, and? Voila. Problem fixed.

Hunh. I'd never seen a reload fix a problem before. But it worked for this IPsec issue.

Cisco ACLs – The Hard Way

Apparently, my Norwegian/German brain only learns things the hard way, so here’s a little tip on writing ACLs.

If you're specifying ports, the order in which you put the source, destination and ports is important. I was walking my way through some ACLs tonight, using the Cisco context-sensitive help (?) liberally. According to the help, this cmd was perfectly acceptable:

conf t

ip access-list extended WebAccess

permit tcp 10.0.150.0 0.0.0.255 eq 80 any

Unfortunately, that didn't let ANYTHING through! It took me one drink to figure out that the any destination comes before the port designation of eq 80. Confusing at first, especially since the little router was perfectly happy with this statement (and about eight others just like it). So, this is what made all the packets flow the right way:

permit tcp 10.0.150.0 0.0.0.255 any eq 80

Note to self – ACL cmd goes like this:

permit or deny – protocol – source – dest – port designation – options
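Put together, a minimal example; the ACL name and addresses are the ones from above, and the log keyword is only there to show where options land:

ip access-list extended WebAccess
 ! action - protocol - source - destination - port - options
 permit tcp 10.0.150.0 0.0.0.255 any eq 80 log

Incidentally, that's also why the router happily accepted the broken version: a port placed right after the source is a valid source-port match, so eq 80 there meant traffic coming FROM port 80, not going to it.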

The Network is Magic

“The Network shall be assumed Guilty, until proven Innocent. Even then, it will still be Guilty.” – C. Gates

Three of the many challenges of IT Network support are: (A) most users don't know what a network is or does, (B) they can't figure out where it starts or stops, and (C) when they say, "there's a network issue", the phrase is so broad that it applies to nearly any context or problem.

What they’re really saying is “My Thing is Not Working”.

Even if you prove that the network is blameless, and prove it with system-generated evidence like screenshots, they won’t believe you, since Their Thing is Still Not Working.

Why is this so? Why won't they believe you? Because largely, they don't care. Nor should they, unless they're in a technical position. And there's nothing wrong with that, though it can lead to chaos when you start troubleshooting their problem.

It's like Captain Kirk telling you the Warp Drive doesn't work, and you start troubleshooting. Engineering says the gravity phase-shifting field-coils are working in tandem harmony with the spatial-antigravity field. Just like they should. And there's nothing wrong with the matter/anti-matter chamber, or the high-velocity induction tubes, and all lights are green on your Engineering dashboard. It's only after chatting up the Navigation Officer that you discover the multi-spectral sensors are misaligned. Now, you know the engines won't engage until navigation works, but does Kirk care? No! All that matters to him is going to Warp speed. It's the same with your users.

To them, The Network, as a bridge to their content, is completely indistinguishable from The Servers, which house their content. They know what their PC is and what it can and can't do, because they can see it, touch it. But the other 99% remains an unseen, untouched, impenetrable mystery. It's magic. And though you may be only a part of IT, when Their Thing Doesn't Work, you are guilty.

With a smile, I tell my users, “I can build you the bridge you want, but that’s no guarantee you’ll find what you’re looking for.”

They never believe me.

Rebuilding the WAN…

Have you ever inherited a mess? Surely it's happened once or twice. You're brought in as the Go-To-Guy, and now you're responsible for maintaining, improving, and more importantly, fixing the network. Of course, you never know what's needed till you start getting your hands dirty.

In my case, the WAN network needed the most fixing. It was a huge, neglected problem. All the income-generating assets for this company resided at Branch sites. These amounted to Point of Sale units, and units that communicated with the data center averaged more money than units that didn't. Like everywhere else, uptime was a major company priority. When the network was up, everything was fine. But when it went down, it went down hard, and did not come back quickly or easily.

The existing WAN comprised about two hundred remote sites (190+), each with three VPN tunnels: one tunnel to the data center and two tunnels to the main office. They used GRE over IPsec with EIGRP. In theory, this was fine, but in practice? Butchery.

There were five major problems:

1. Due to GRE over IPSEC, all Branch sites needed a static IP address. Since our sales team loved to forget to tell IT about new sites, they often left us scrambling, especially in places like northern Montana, 20 miles from the nearest small town. Net effect: decreased speed and increased cost to deploy Branch sites.

2. Each VPN tunnel required four router config sections (pre-shared key, crypto map, tunnel interface and access-list). This made the Headend router configs very, very long. On the office Headend routers, the config was pushing 3500 lines (a sketch of one such tunnel follows this list)! Net effect: increased chance of manual error for any moves, adds or changes.

3. To keep 190 tunnels straight, the four config sections for a given tunnel were originally supposed to share the same number. But the section numbers didn't always match, even on the same Headend. Awesome. Net effect: more time needed for troubleshooting.

4. Each VPN tunnel had its own subnet – a necessity. However, just like the config sections, there was little consistency. You’d expect every tunnel from a Branch to share some number, like 192.168.100.X, right? Yeah, not the case. I think they used the Dart Board method to assign subnets. Net effect: even more time needed for troubleshooting.

5. No distribute lists meant too many routing updates. Branch sites really only needed to reach the office and the data center, nothing more. Instead, 190 routers constantly updated each other, and routing tables were way too big. When a Headend router went down, you could expect 15 minutes before all Branches could reach the data center again. Net effect: decreased fault tolerance, slow network recovery times, and less uptime.

See a pattern here?
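To give a sense of problem 2, here is roughly what one Branch cost in Headend config under the old design. This is a simplified sketch, not the real config; the key, ACL number, crypto map name and addresses are all invented:

! Section 1: pre-shared key for the Branch's static public IP
crypto isakmp key BranchKey123 address 198.51.100.45
!
! Section 2: access-list matching the GRE traffic for this Branch
access-list 145 permit gre host 203.0.113.10 host 198.51.100.45
!
! Section 3: crypto map entry for this Branch
! (the WAN-MAP crypto map itself is applied once to the outside interface)
crypto map WAN-MAP 145 ipsec-isakmp
 set peer 198.51.100.45
 set transform-set AES-SHA
 match address 145
!
! Section 4: point-to-point GRE tunnel to this Branch
interface Tunnel145
 ip address 192.168.145.1 255.255.255.252
 tunnel source 203.0.113.10
 tunnel destination 198.51.100.45

Multiply that by 190+ Branches, across more than one Headend, and 3500 lines shows up fast.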

Like many technically complex problems, this one wasn't completely understood. It didn't get a lot of airplay in IT discussions, and had gone largely unaddressed. As a pervasive problem, it couldn't be fixed quickly or easily (unless you consider reconfiguring every router in the company without incurring any downtime quick and easy). Selling a Fix-It plan to the IT director or his VP was daunting, to say the least. But this finicky, fragile, non-fault-tolerant network regularly made us look bad, and made our competitors look that much better. It HAD to be remedied; the sooner, the better.

So after much discussion and lab testing to perfect the Order of Operations needed, here’s the Fix-It solution:

1. All GRE over IPsec tunnels were replaced with Dynamic Multipoint VPN (DMVPN). DMVPN uses multipoint GRE over IPsec tunnels, which allowed one Headend tunnel to talk to all Branch routers, rather than a separate tunnel for each Branch. This shrank the Headend router configs by 90%. Also, DMVPN used Next Hop Resolution Protocol (NHRP), eliminating the need for Branch static IPs; Branch routers could now have DSL modems with dynamic external IP addresses (a config sketch follows this list). Net effect: faster troubleshooting, and reduced time and cost to provision Branch sites.

2. Branch VPN tunnel subnets were aligned in a consistent pattern. Now the last octet of the tunnel IP address was the same for every tunnel from a Branch. In an armada of IP addresses sailing around the network, this eased Branch identification during troubleshooting. Net effect: even faster troubleshooting, less confusion.

3. Distribute lists were put in place to limit route exchange between the Headends and Branches. Now the Branches only knew about the office and data center – just three peers, not 190. Net effect: better fault tolerance, and quicker network recovery.

See a pattern?
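For the curious, the Headend side of the new design boiled down to something like this. It is a trimmed sketch, not the production config; the names, addresses, EIGRP AS number and prefixes are all invented, and the Branch configs are omitted:

! One multipoint GRE tunnel, protected by an IPsec profile, replaces the per-Branch tunnels
crypto ipsec transform-set AES-SHA esp-aes esp-sha-hmac
!
crypto ipsec profile DMVPN-PROFILE
 set transform-set AES-SHA
!
interface Tunnel0
 ip address 192.168.200.1 255.255.255.0
 ip nhrp map multicast dynamic
 ip nhrp network-id 100
 tunnel source GigabitEthernet0/0
 tunnel mode gre multipoint
 tunnel protection ipsec profile DMVPN-PROFILE
!
! Only send the office and data center routes down to the Branches
router eigrp 100
 network 192.168.200.0 0.0.0.255
 distribute-list prefix CORE-ONLY out Tunnel0
!
ip prefix-list CORE-ONLY seq 5 permit 10.10.0.0/16
ip prefix-list CORE-ONLY seq 10 permit 10.20.0.0/16

The Branch side is the mirror image: a single tunnel interface with ip nhrp nhs and ip nhrp map entries pointing at the Headend, which is why a DSL line with a dynamic address works just fine.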

This took weeks to accomplish, and there were few solid metrics available to compare the previous environment with the new one. However, everyday experience proved the network ran dramatically better. With the Headend configs down from 3500 lines to under 400, we could begin adding other network services, rather than waste time wandering through the config labyrinth. Identifying problem sites was quick and easy. And network convergence had shrunk from 10 minutes to less than 2 minutes.

Despite the time it took, it was worth it. The network went from a liability to a strength, and enabled a variety of other services, from reliable and comprehensive site monitoring (a company first), to advanced SQL reporting for performance metrics and financial data (another company first).

The network enables so many aspects of a company; with the right perspective, technical expertise and persistence, you can leverage it to serious competitive advantage.

TACACS+, Part 2: tac_plus install and config

In TACACS+ Part 1, I discussed the reasons to use TACACS+, and why I chose the version written by Marc Huber. Here, I'll dive into installing tac_plus on Ubuntu 10.04, and configuring the tac_plus daemon itself.

Despite my being a Linux beginner, installing tac_plus on an Ubuntu server wasn't too hard. Below is my checklist.

  1. Install the libssl-dev package
  2. Install CPAN modules:
    1. Net::SSLeay
    2. IO::Socket::SSL
  3. Install the tac_plus program. I created a working folder of /home/installs for this:
    1. /home/installs# mkdir tac_plus
    2. /home/installs# cd tac_plus
    3. /home/installs/tac_plus# wget http://www.pro-bono-publico.de/projects/src/DEVEL.201111101610.tar.bz2
    4. /home/installs/tac_plus# tar -xvf DEVEL.201111101610.tar.bz2
    5. /home/installs/tac_plus# cd PROJECTS
    6. /home/installs/tac_plus/PROJECTS# ./configure
    7. /home/installs/tac_plus/PROJECTS# make
    8. /home/installs/tac_plus/PROJECTS# make install
  4. Copy the attached tac_plus.conf file to /etc
  5. Verify the file tac_plus has been copied to /etc/init.d
  6. Make that init script executable:
    1. /etc/init.d# chmod +x tac_plus
  7. Make the tac_plus daemon start at boot:
    1. update-rc.d tac_plus defaults
  8. Manually start the daemon:
    1. tac_plus /etc/tac_plus.conf

Once the server was running with the default configuration, it was time to tweak it to authenticate against AD. Most importantly, I needed tac_plus to encrypt its entire communication with AD – nothing in the clear. In nearly all cases, the most powerful IT accounts in the enterprise were being passed, and they must be protected, even internally.

Your environment may be slightly different, depending on your DC setup, but the configuration below eventually worked (after much cursing and wiresharking). Especially note the "ldaps://yourdomain.com:636". The combination of ldaps:// and :636 is what finally provided the encrypted communication.

id = tac_plus {
    debug = MAVIS

    accounting log = /var/log/tac_plus/acct.log
    authentication log = /var/log/tac_plus/authen.log

    mavis module = external {
        # Optionally:
        script out = {
            # Require group membership:
            if (undef($TACMEMBER) && $RESULT == ACK) set $RESULT = NAK

            # Don't cache passwords:
            if ($RESULT == ACK) set $PASSWORD_ONESHOT = 1
        }
        setenv LDAP_SERVER_TYPE = "microsoft"
        setenv LDAP_HOSTS = "ldaps://yourdomain.com:636"
        setenv LDAP_SCOPE = sub
        setenv LDAP_BASE = "dc=yourdomain,dc=com"
        setenv LDAP_FILTER = "(&(objectclass=user)(sAMAccountName=%s))"
        setenv LDAP_USER = "!daps@yourdomain.com"
        setenv LDAP_PASSWD = "DeuxManySecrets"
        setenv AD_GROUP_PREFIX = Tacacs
        setenv REQUIRE_AD_GROUP_PREFIX = 1
        exec = /usr/local/lib/mavis/mavis_tacplus_ldap.pl
    }
    # The host definitions (like the one shown below) also live inside this
    # id = tac_plus block; its closing brace comes at the end of the file.

While this gave us tac_plus server-to-AD encryption, we still needed encryption between the router and the tac_plus server. Happily, tac_plus provides this via a shared key (make it a complex one). On the tac_plus server, it looks like this:

host = Router {
    address = 192.168.1.1/32
    key = >>Complex_Key_Goes_Here<<
}

On the router, the cmd is:

tacacs-server key >>Complex_Key_Goes_Here<<

Testing showed this is not true encryption; TACACS+ obfuscates the packet body by XORing it with an MD5-derived pad built from the shared key, but it gets the job done. The remainder of the server config is explained well in the tac_plus docs; these sections were the only tough nuts to crack.

In TACACS+ Part 3 of this series, I’ll cover configuration of the routers, using IOS 12.2, 12.4 and 15.0, as well as ASA firewalls on 8.2.