2024-11-21

The Network's Down

The ramblings of an aging Networking Mentor… / Estoy enredado en las redes… (I’m tangled up in the networks…)

Reverse Engineering Pt. 3

Don't mind the redacted photo provided. It's a good read...
(Photo caption: “Traverse the Snapple”)

The third installment of the “Reverse Engineering Series” on the blog… If you haven’t read the previous installments (Part 1 or Part 2), feel free to check those out if you’re looking for some fun, a little excitement, a good laugh, or a career change…

Example 4: Can you hear me now? “Ha-Ha, very funny… Wait, are you serious? Is it underneath the Snapple? How did you even find that?” Micro-bursts are a real thing, btw…

One of the most unique things about my career, so far anyway, has been the ability to utilize X-Ray vision. Actually, that’s a lie; Iron Man doesn’t have X-Ray vision, he’s just a fan of “working smarter, not harder.” So how did I actually see “underneath that Snapple” shown above? Let me explain.

The title above was based on an observation that one of my managers made, b/c I had traced out an ongoing issue and found its source… The map happened to be on my desk, under a Snapple bottle. There you go, story over… “or is it?”

During my long days as an MSP “típico ingeniero loco” (your typical crazy engineer), there were hopes, dreams, great ideas, and cool switches and routers to configure. But… there was also a STRONG love for Layer 2 (L2) in that environment. I mean, it was gross. The amount of pruning I had accomplished, as well as what still had to be done, was an entire job in itself: consolidating spanning-tree domains, standardizing L2 configs, optimizations, etc. So… In the primary DataCenter, the voice team had built out a new platform to serve as a new offering for our clients: “Hosted Voice…” You know, before “Public Cloud” was a thing. Some people did do this before AWS, Azure, and Google scooped everyone’s stuff up. Anyway, a ticket gets escalated to me in the DataCenter: some clients are having issues completing calls. As I dove into the tickets, I was seeing some super useful info from the service desk stating “the voice quality sucks. Assigned to networking.” As usual, a super useful start.

So let’s put a few more brush strokes of color on this picture to make it a little clearer. At the “Core” of the switched network, there were 4 Nexus 7Ks with dual Supes, set up in a back-to-back vPC config. From there we branched out to an array of various switches, Layer 1 devices, and connectivity. I had installed a few more FEXes for ToR (Top of Rack) connectivity in some new racks in both rooms, with a good friend of mine aka “Hombré”. If you’re not aware, FEXes don’t learn MAC addresses; they rely on the parent switch for everything. Aka trombone city with VN-Tags… But that conversation is for another post, ugh, DataCenter and Nexus… Anyway…

I log in to the client’s core, check for Auto-QoS, and it dawns on me: I have to double-check with the voice team that we are actually matching their DSCP value, b/c like most engineers, I have a hard time “getting” QoS… So after verifying that their value (something in the 40s that I can’t remember offhand) actually fell within the Auto-QoS policy we had deployed at the site, I did some packet captures to verify it was being marked on the way out of the phone ports and on the way into the phone ports. I called the P2P ISP, pulled some reports in SolarWinds, and we didn’t see any contention at the times being reported. And it wasn’t happening to everyone (shocker, I know). So after verifying the client side seemed OK, I started tracing from the P2P into our “edge,” following the physical path from that switch down to the voice server handling the calls / call setup.
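(For anyone who has never done that kind of spot-check, here’s a minimal sketch of what I mean, in Python with Scapy. The capture filename and the expected DSCP value are placeholders, not what we actually used, and in real life you’d filter down to the actual voice/RTP streams first.)

```python
# Rough sketch of a DSCP spot-check on a capture, assuming Scapy is installed
# (pip install scapy). The pcap name and expected DSCP are placeholders.
from collections import Counter
from scapy.all import rdpcap, IP

EXPECTED_DSCP = 46  # placeholder; use whatever value the voice team is actually marking

packets = rdpcap("phone_port_capture.pcap")
dscp_counts = Counter()

for pkt in packets:
    if IP in pkt:
        # IPv4 ToS byte: the top 6 bits are the DSCP value
        dscp_counts[pkt[IP].tos >> 2] += 1

print("DSCP values seen:", dict(dscp_counts))
mismatched = sum(count for dscp, count in dscp_counts.items() if dscp != EXPECTED_DSCP)
print(f"{mismatched} packets not marked {EXPECTED_DSCP} (filter to the voice streams before panicking)")
```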

So, a pretty standard start for me: checking the ports for errors, I didn’t see anything on the P2P connection between the DataCenter and the client site, and bandwidth wasn’t spiking. Following that physical path back to the “Switching Core” at Layer 1 were some bundled 10G links that were seriously over-engineered, so no contention there, and the return path was physically the same. I then saw that random packets were losing their QoS markings, which was quite odd, and it was happening both to traffic sent from the client and to traffic sent from the voice server. Then I went up a layer, from Layer 1 to Layer 2. And “my friend… my friend… I got kicked in the nuts.”

So, Layer 2. The topic that keeps bringing stuff to its knees, while being ever-present in many of the environments I get called to work in. Sigh… Anyway. Layer 2 has this funny rule: if I’m the root bridge for a specific VLAN, no traffic moves (except for locally switched traffic) without my consent. So why does that matter, Mr. Robot? You just told us you were in the “Switching Core”; do you actually know what you’re talking about? Probably not. However… In adding L2 Rs and Ds to my physical diagram, following both VLANs, things got very interesting. The Rs were for “Root” ports (aka “everything for this VLAN leaving this switch toward the root goes out this interface”), and the Ds were for “Designated” ports (the forwarding port elected for each segment, aka “yes, I’m running spanning-tree”).
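(And for the “why does the root end up in weird places” part: the root bridge is simply whoever has the lowest bridge ID, which is priority plus MAC address. Leave every switch at the default priority and the tie-breaker becomes the lowest MAC, which tends to be the oldest box in the building. A tiny sketch, with made-up names and MACs:)

```python
# How the root bridge gets picked: lowest bridge ID wins, where the bridge ID
# is (priority, MAC). With everything left at the default priority of 32768,
# the tie-breaker is the lowest MAC address. Names and MACs below are made up.

switches = {
    "NEXUS-CORE-A": (32768, "00:2a:6a:aa:aa:aa"),
    "NEXUS-CORE-B": (32768, "00:2a:6a:bb:bb:bb"),
    "CATOS-6509":   (32768, "00:05:00:11:11:11"),  # ancient MAC, default priority
}

root = min(
    switches,
    key=lambda sw: (switches[sw][0], int(switches[sw][1].replace(":", ""), 16)),
)
print("Root bridge for this VLAN:", root)  # CATOS-6509 wins the election
```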

I followed the Rs for the VLAN that I was chasing, from the server side. So I started from the “destination”: from B2 to A2. OK, that makes sense, because A1 is directly connected to A2, where Switch A lives. But then on A1, instead of pointing at Switch A, it said “go to B1.” So I went to B1, and then it sent me to a SAN switch that had NOTHING to do with data networking; it was used “solely for storage,” I was told… Then after that switch, it went to our archaic CatOS switch, where someone apparently had created a bunch of VLANs “to test something” and never removed them nor cared about their impact. Ah yeah, and VTP… So, me being me, I continued to follow the path after “getting routed” on this 6509… Back down through the SAN switch to B1, across the interconnects to A1, and then finally up to Switch A. After verifying it two more times, I finally put together a change and moved the root bridge back to where it should have been. You know… On the “Switching Core” of the network.
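(If it helps to picture that walk as something other than scribbles on a whiteboard, here’s a rough sketch of “following the Rs” in Python. The switch names and interfaces loosely mirror the story but are made up, and in practice the root-port and neighbor data came from spanning-tree and CDP output on each box, not a hand-built dict.)

```python
# Illustrative only: walking toward the root bridge, one root port at a time.
# root_port[switch] = the local interface that switch uses to reach the root
# (the "R" on the whiteboard); neighbor[(switch, interface)] = device on the
# far end of that link. All names and ports here are made up for the example.

root_port = {
    "B2": "Po1",
    "A2": "Po3",
    "A1": "Po7",
    "B1": "Gi1/1",
    "SAN-SW": "Gi0/2",
    # the CatOS 6509 has no root port: it *is* the root bridge
}

neighbor = {
    ("B2", "Po1"): "A2",
    ("A2", "Po3"): "A1",
    ("A1", "Po7"): "B1",        # surprise: not Switch A
    ("B1", "Gi1/1"): "SAN-SW",
    ("SAN-SW", "Gi0/2"): "CATOS-6509",
}

def trace_to_root(start: str) -> list[str]:
    """Follow each switch's root port until we land on the root bridge."""
    path, current = [start], start
    while current in root_port:              # the root bridge has no root port
        current = neighbor[(current, root_port[current])]
        path.append(current)
    return path

print(" -> ".join(trace_to_root("B2")))
# B2 -> A2 -> A1 -> B1 -> SAN-SW -> CATOS-6509
```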

The resolution was twofold. Part one: moving the root bridge to the actual “Switching Core” of the network, so there were no unnecessary and unintended Layer 2 paths and hops in the L2 network. Part two: applying the QoS policy on the “Switching Core,” across the required server uplinks as well as the edge switch uplinks, to preserve the QoS markings end to end within the DataCenter network.
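(Part one, sketched out as the kind of script you’d push today with something like Netmiko; the hostnames, credentials, and VLAN numbers are all placeholders, and the QoS piece is platform-specific enough that it’s left as a comment.)

```python
# Sketch only: pinning the root (and a backup root) for the VLANs in question
# onto the actual switching core. Hostnames, creds, and VLANs are placeholders.
# Assumes Netmiko is installed (pip install netmiko).
from netmiko import ConnectHandler

VLANS = [100, 200]  # placeholder VLAN IDs for the voice/data VLANs being chased

core_a = {"device_type": "cisco_nxos", "host": "core-a.example.net",
          "username": "netops", "password": "REDACTED"}
core_b = {**core_a, "host": "core-b.example.net"}

def set_root(device: dict, role: str) -> str:
    """Push the spanning-tree root macro for each VLAN and save the config."""
    cmds = [f"spanning-tree vlan {vlan} root {role}" for vlan in VLANS]
    with ConnectHandler(**device) as conn:
        output = conn.send_config_set(cmds)
        conn.save_config()
    return output

print(set_root(core_a, "primary"))    # core A becomes the root bridge
print(set_root(core_b, "secondary"))  # core B takes over if A dies
# Part two (trusting / re-marking DSCP on the server and edge uplinks) is a
# separate, platform-specific policy push and isn't shown here.
```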

So, time and time again, Layer 2 is just… It has its place. And I’ll leave it at that. BTW, sidebar: this “find” of course sparked my interest in the rest of the layout of the network and launched another project to get every root bridge moved to its proper core, to assist with convergence, as well as to convert the edge switches up to Rapid PVST+ to take advantage of some of the enhancements available to the L2 network.

Example 5: Now THAT my friend… Is one CLEAN NETWORK…

Switching gears here, back to 2011… I worked at a company that specialized in IT support within the “Academia” vertical. We had just gotten an opportunity to start in the university space to assist with “some network stuff.” One noteworthy project that had just been completed by the university, prior to our contract there, was “finishing the upgrade of core routing” from RIPv1 to RIPv2. I wish I had made that statement up. I’m going to let you take a minute and let that sink in… Yay, support for classless routing? For the non-technical people reading this: you probably read RIP as "Rest In Peace," and that is exactly how we (as networking people) feel about it as well.
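(For the curious, the practical difference: RIPv1 updates don’t carry subnet masks, so routes crossing a major-network boundary get summarized back to their old classful boundary. A quick illustration of what that boundary is; the addresses are just examples, not the university’s actual space.)

```python
# Why "classless" mattered: RIPv1 carries no subnet mask, so across a
# major-network boundary a route is advertised as its classful network.
import ipaddress

def classful_network(addr: str) -> ipaddress.IPv4Network:
    """Return the old classful network an IPv4 address would collapse to."""
    first_octet = int(addr.split(".")[0])
    if first_octet < 128:
        prefix = 8       # Class A
    elif first_octet < 192:
        prefix = 16      # Class B
    else:
        prefix = 24      # Class C
    return ipaddress.ip_network(f"{addr}/{prefix}", strict=False)

# Across a major-network boundary, a campus /24 carved out of a Class B block
# is advertised by RIPv1 as the whole /16; that's how subnetted designs break.
print(classful_network("150.50.42.0"))  # 150.50.0.0/16
```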

Anyway… This university had quite a large footprint compared to what I was used to dealing with, so most of the beginning of my journey there involved a lot of white-boarding: physical path diagrams, “blast radiuses” of VLANs, and their routing in general. Sidebar: this university actually owned a publicly routable address block, so everyone's IP address was public. Not really relevant for the post as a whole, but it was a first for me, so I thought I'd mention it... One initiative I started there was setting up RADIUS authentication for all of the network devices, as the 200+ boxes were all being manually managed with a single shared username and password. Feel free to shudder at that thought; years later, I myself still do.
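(To give a sense of what that 200-box problem turns into once you script it, here’s a rough sketch of pushing IOS-style AAA/RADIUS config with Netmiko. Netmiko didn’t exist back in 2011, every hostname, server IP, and key below is a placeholder, and the Alcatel gear would need its own command set entirely.)

```python
# Rough sketch: pushing RADIUS (AAA) login config to a pile of IOS-style edge
# switches with Netmiko. Hostnames, the RADIUS server IP, and the shared key
# are placeholders; keep "local" as a fallback so you can't lock yourself out.
from netmiko import ConnectHandler

DEVICES = ["edge-sw-01.example.edu", "edge-sw-02.example.edu"]  # ...and ~200 more

AAA_COMMANDS = [
    "aaa new-model",
    "radius-server host 192.0.2.10 key PLACEHOLDER_KEY",
    "aaa authentication login default group radius local",
]

for host in DEVICES:
    device = {"device_type": "cisco_ios", "host": host,
              "username": "netops", "password": "REDACTED"}
    try:
        with ConnectHandler(**device) as conn:
            conn.send_config_set(AAA_COMMANDS)
            conn.save_config()
        print(f"{host}: RADIUS auth applied")
    except Exception as exc:  # keep going; circle back to the stragglers by hand
        print(f"{host}: failed ({exc})")
```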

At the time, the university was heavily invested in Alcatel core routing and switching gear. They had just started replacing some edge switching with Cisco because, if you can actually believe it, Cisco’s cost ended up being cheaper. Sidebar: Blue Socket (who I am all but sure has been purchased and rebranded at this point in time) was the Wi-Fi of choice... What was the AP count again? (I do miss you guys: Nando and Ivan. LOL.) But anyway, let’s zoom in on a particular 2-3 week period when “reverse engineering” was front and center for us at this place.

We started getting reports from the dorm building in NYC that half of the floor (whatever number it was) would go dark from a network connectivity perspective, and “the WiFi sucked” during this time. About 40-60 minutes after this, everything went back to normal. So, like most things, we started checking into the symptoms in hopes of finding the root cause. Of course we had our own reservations. Oh really, the network goes dark? When everyone is playing Gears of War 3, Modern Warfare 3, Halo 3, or Crysis 2 online? That’s odd. Oh, it happens to be during an exam week? You don’t say… Sure… It’s “the network,” right?

After a couple of escalations from onsite support, we started reviewing route tables and interface counters. Everything seemed to check out as normal. The device uptimes indicated that the time-frame being reported was legit, but nothing made sense. And me being me, and not trusting many people, I started asking things like “this IDF is behind a locked door, right?” Or, “can someone please test that Telnet/SSH ACL for me? Let’s make sure it’s actually blocking non-IT machines from connecting.” I then rolled out RADIUS AuthC (short for Authentication, sorry; ISE makes me do that shorthand; she's quite an intriguing woman tho... I'll have to start writing about her...) to that stack and hoped that if it was someone logging in, they wouldn’t be able to anymore. And we wanted to be able to get the logs on IAS or NPS or whatever version of RADIUS we were running on the Windows servers at the time. The next night, the same thing happened. “OK… That still doesn’t rule out someone with physical access,” I said to the rest of the team.

This pattern continued for a couple of weeks. At some point we engaged the facilities staff and asked for access to the camera system. We were met with a loud and abrupt laugh and a hard “NO.” So they assigned someone to watch the hallway and the door to the IDF, because I was fresh out of other options and ideas. The next night, the same thing happened. We got with the Facilities resource, who assured us “that door hasn’t opened, so I’m not sure what you’re looking for.” In the meantime, we opened a case with Alcatel and provided some debugs. The weird thing, to me at least, was that we had this same gear almost everywhere; why was this happening only here? The next night, the same thing happened. So, mental checklist… Same code rev, same SFPs, same model of hardware, etc…

We checked back in with the Facilities person, who was getting agitated at this point after our multiple inquiries. “I don’t know what to tell you, everything looks normal.” So I asked, “So, the door didn’t open?” “No,” was the response. I followed up with “At all? Like it wasn’t opened by anyone? A ghost, perhaps?” “Guy, I don’t know what to tell you, the janitor was the only guy who went in.” I’m like, “I thought the door didn’t open…” Now we’re getting somewhere. Unfortunately, there were more people than just myself and this super-helpful gentleman from facilities on the line, and there was quite the ruckus stirring: “Who is this janitor? Where did he come from? Is it even dirty in the IDF? We thought Pete actually showered once in a while. Please put that in his contract renewal…” So we were granted temp access to the security camera, and we watched the janitor go into the IDF. Mind you, this camera is a fixed one, i.e. it doesn’t move, so “what you see is what you get.” About 5 minutes later he (the janitor) comes out with the vacuum. And then about 2-3 minutes later he goes back into the IDF, and I see it. Right there. Clear as day… The door doesn’t shut b/c he’s got the vacuum plugged in inside the IDF. I kept watching for the next 20-30 mins in real-time playback b/c I was of the opinion that someone was going to go into that closet, reboot the gear, and add to my lack of sleep in general. Then all of a sudden, maybe 10 mins later, the camera goes dark. And it’s nothing, for about 40-60 minutes.

I’m talking with my co-workers and all of a sudden it hits me. That camera is PoE. Is the switch stack actually losing power? So, as a last-ditch effort, we sent a member of the network team (I did not draw the short straw in this case) to the city to literally sit there and hang out. My phone rang at about 1:30am overnight. “Bro, you’re not going to believe this,” was how the call started. “What’s that?” I replied. My co-worker said, “Bro, the f*****g janitor is unplugging the UPS to plug the f***ing vacuum in.”

It was at that point I took a moment. And before consciously acknowledging what I had just heard, I tilted my head back toward the sky and asked, “God, why? Why?”

So, what was happening? Oddly enough… <snark>It wasn’t a network problem.</snark> The UPS was doing its job; until it couldn’t. Once the batteries were depleted, “me no workie no-MO…” Once the janitor finished up cleaning, he swapped the plug back, the UPS charged up, and voilà, the switch stack booted back up; connectivity was restored and everyone was happy.

But all in all, we all got a good laugh out of it; you know… once the yelling, screaming, and cursing stopped. Don’t underestimate what we all take for granted: power, heat, cooling, light, etc. We like to jump to conclusions (myself included here), and sometimes we just need to take a step back… and install outlets in the proper location for maintenance workers…

  • Stark out.