After deciding that I can’t do what I want to do with OpenStack, I figured I needed to just reallocate address space so I could open up a subnet to allocate to OpenStack. I decided to convert our corporate office to a /16 instead of the /24 I was trying to cram everything into, so late last night I reconfigured everything.

While changing subnets on several hundred access-list entries, I accidentally changed one netmask that shouldn’t have been touched. It was the one that kept private network traffic from reaching the NAT tables. I discovered it fairly quickly and put it back, but not before some remote clients sent dhcp requests.
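
To picture what that acl does: the nat statement on the office router references an access list, and the deny lines in that list are what keep traffic bound for the other sites’ 10.x networks out of the translation table. Something roughly along these lines (interface name, list number, and masks here are illustrative, not my actual config):

ip nat inside source list 100 interface GigabitEthernet0/1 overload
!
! deny = don't translate: office /16 talking to the other sites over the vpns
access-list 100 deny ip 10.0.0.0 0.0.255.255 10.0.0.0 0.255.255.255
! everything else from the office /16 gets natted on its way to the internet
access-list 100 permit ip 10.0.0.0 0.0.255.255 any

Fat-finger the wildcard on that deny line and a dhcp reply headed for a remote relay no longer matches it, falls through to the permit, and gets a translation built for it.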

See, I use a single ISC dhcp server (well, actually a pair, but for all intents and purposes…) for all of our 26 locations. The routers that establish the ipsec vpns do dhcp forwarding. So when a machine in 10.2.0.0/24 needs an address, its router (10.2.0.254) forwards that request to the dhcp servers (10.0.0.1 - now in the /16). Those servers reply to 10.2.0.254, which then forwards the response to the mac address of the correct client. It’s all worked beautifully for years.
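
The relay side is nothing exotic; it’s just an ip helper-address on each remote router’s LAN interface pointing at the central server (the interface name here is an example, and the second server of the pair would get its own helper-address line):

interface FastEthernet0/0
 ip address 10.2.0.254 255.255.255.0
 ! turn the client's dhcp broadcast into a unicast to the central server
 ip helper-address 10.0.0.1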

Last night, however, a few sites had dhcp expirations during the short window in which I had a broken acl. This triggered a response from the dhcp server which was erroneously natted and sent out to the internet, where it was quickly rejected. That rejection did nothing, of course, to dissuade ios from keeping that translation in the table long after I fixed the acl.

Eventually, as dhcp leases expired, machines started dropping off the network at those locations that’d had their routers’ addresses natted. All I could see was that, at a handful of locations, some machines weren’t receiving packets (I had erroneously assumed that it was all packets). I could see them sending dhcp requests, but when they came up, they couldn’t connect to anything, and I, obviously, couldn’t ping them at the address the dhcp server tried to assign.

I figured that maybe it was all dhcp requests and that some machines just hadn’t had their leases expire yet, so I went to a nearby location to test. Everything there worked perfectly.

I added some logging to the remote router at one of the locations where I’d been having them reboot whenever I’d try something new, and had them reboot again.

It worked.

It couldn’t have worked, I didn’t change anything.

But it did.

I realized that the only way it could have worked is if something had expired or been bumped off a table…

Like the NAT table…

I checked:

show ip nat translations

Sure enough, there were a bunch of translations for port 67.
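
(If the table is long, the usual pipe filter narrows it down:

show ip nat translations | include :67

That would have shown just the port-67 entries, but eyeballing worked fine here.)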

clear ip nat translation *

Sure, it’s the sledgehammer approach, but I’m lazy that way sometimes. Okay, and there’s also nothing anyone at the office should be streaming anyway. They’ve got work to be doing.
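
Had I felt like being surgical, individual entries can be cleared instead, along the lines of the following (the exact argument order varies a bit between ios versions, and 203.0.113.5 here is just a stand-in for the office’s public address):

clear ip nat translation udp inside 203.0.113.5 67 10.0.0.1 67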

Suddenly I see DHCPACKs, as the responses are now actually making it back to the relays.

It’s been an interesting morning.