2024-11-21

The Network's Down

The ramblings of an aging Networking Mentor… / Estoy enrede en los redes…

Reverse Engineering Pt. 2

At -7 degrees Fahrenheit in December in Wisconsin, two engineers battle more than stripped rack screws, BNC cabling and the elements to complete an onsite network transformation.

Picking up on the heels of my first Troubleshooting/Reverse Engineering article, I just told mom and pops another story about an incident where troubleshooting and Reverse Engineering both came into play, in quite a different way. Let’s see what we uncover this time…

Example 3: All packets must wash hands before returning to work…

Ah yes, the days of consulting on a long-term staff augmentation contract for the technical business unit responsible for IT throughout the Worldwide Enterprise… Aka my customer, who ended up becoming my professional mentor and close personal friend. In the words of another wise man, “There’s a lot to unpack there”. Consulting as a Network Engineer in Corporate America is very fun, but it can get a little confusing as to who’s paying whom, as well as who you are reporting to for a particular half hour of your life. But I’ll skip the politics of consulting for the time being… Although that may make a good article…

Let’s walk down memory lane to a time when Reverse Engineering & Troubleshooting took center stage. I happened to be onsite with a long-standing good friend of mine, and fellow networking junkie, performing a network transformation (aka upgrade) together. Mind you, this was mid-December 2013, in Wisconsin. So it was slightly colder than NY… Just slightly…

Throughout this article we’ll refer to my networking buddy as “The Ghost”, a title he has been crowned with, against his wishes, by multiple other people in my life…

The Ghost and I were sent onsite to perform a network transformation that was scheduled to last about a week, followed by first-day support after the full cutover, so we’ll round up to 8 days with travel. This was a multi-site, campus-style upgrade including all of the IDFs, the MDF, new fiber runs, new electric runs to all the IDFs and MDF, and new WiFi, aka the epitome of a network facelift, or a “rip and replace” as he and I call it. So during the week, we worked together ripping out just a few pieces of gear from the existing network and replacing them with some more “manageable” devices. Exhibit A below:

As previously stated… We removed just a few items above…

As this was a multi-day transformation, The Ghost and I went about it in our patented “Bronx Boys” style of upgrading; by racking all the new gear below the existing gear in the main building, “opening” a couple of “locked doors”, then fanning out to the smaller buildings and IDFs and then tying in the new Cores and routing mid-way through the week.

Before we move forward, I have to show everyone what we found buried in more than a few spots of the drop ceiling scattered throughout the outer buildings of the site that we were replacing. There were actually 10Base2 to 10BaseT converters providing some of the existing uplinks back to the pseudo-core switching infrastructure. Clearly this was why all new Fiber Infrastructure had been set up for us to use. In the following picture you can also see an additional Fiber to Cat5 media converter and, waaaay in the back, a new WAP (green light) making a cameo, which The Ghost had just installed before I snapped this pic.

A legitimate surprise to The Ghost and me: a live and functioning 10Base2 to 10BaseT converter. The kind of stuff you hear legends about…

After overcoming the feeling of nostalgia, The Ghost and I got the cores racked and cabled up the fiber patches. Again, a pretty straightforward approach for “The Bronx Boys”. Exhibit B:

Brand new cores (Switch Stack) and brand new fiber infrastructure. The 2960 below was for additional copper connectivity as we were consolidating the existing infrastructure.

Most of the closet transformations went according to plan… But there was one (well, there’s always one) that likes to challenge you. There was one IDF back in the warehouse where things didn’t exactly pan out as expected. We were replacing the bulk of the WAPs after-hours as our last item for the night, tying in the new PoE+ switches and leveraging the fiber lines. We ran our checks for the night, and everything looked good. Our overnight monitoring team confirmed everything was reporting back well, and all of the ARPs and pings came back properly. So The Ghost and I left for the evening.

When I woke up, I had about 30 emails stating that one specific IDF in the back warehouse had “gone dark” and “fell off the map” and then suddenly “came back up” several hours later. Some quick thoughts pointed to faulty hardware, maybe a bad IOS load (MD5 mismatch issue), memory leaks, some bad SFPs, power, or the universe conspiring against us. When we arrived onsite we went over to the IDF and everything seemed normal. Pretty blinking lights, and the workers seemed uninterrupted at this point. I was able to console in as well as SSH in from our secured jump-box and the cores. At this point of the morning, our monitoring systems seemed to say that the switches were behaving properly. There were syslogs on the cores telling us “interface X and Y as well as PortChannel W” went down and came back up. So I SSH-ed over to the IDF switches and checked show ver | i System|system, and it just said “System returned to ROM by power-on”. We grabbed a show tech just for insurance purposes… As that day came to a close, we met with the electrician, who tested the power going into the IDF and then gave us a thumbs up stating that it was OK. Then The Ghost and I replaced a couple more WAPs, swapped the power cables going to the switch, confirmed a proper boot-up of the stack, and then verified the MD5 of all bin files on flash. We took another show tech after the fresh boot to prep a TAC case, if needed. We left for the night after verifying with the overnight team that everything seemed good again.
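For anyone who wants to automate that first sanity check, here is a minimal Python sketch of pulling the reload reason out of that one line of output. The only reason string taken from this story is “power-on”; the function names and the POWER_REASONS set are my own illustrative assumptions, not an exhaustive list of what IOS can report.

```python
import re
from typing import Optional

# Assumption: only "power-on" comes from the story above. Extend this set
# yourself if your platform reports other power-related reasons.
POWER_REASONS = {"power-on"}


def reload_reason(show_version_text: str) -> Optional[str]:
    """Return the text after 'returned to ROM by', or None if the line is absent."""
    match = re.search(r"returned to ROM by (.+)", show_version_text)
    return match.group(1).strip() if match else None


def points_to_power(show_version_text: str) -> bool:
    """True when the reported reload reason is one we treat as power related."""
    return reload_reason(show_version_text) in POWER_REASONS


if __name__ == "__main__":
    line = "System returned to ROM by power-on"
    print(reload_reason(line))
    print(points_to_power(line))
```

A “power-on” reason means the box lost and regained power rather than crashing or being reloaded by someone at the CLI, which is why this one line already made power the prime suspect.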

The next morning was like Groundhog Day. At almost the same time as the evening before, the switches dropped off the map. When I saw The Ghost in the morning, I gave him a thumbs down (making fun of the electrician) and he gave another universal sign, using another finger, to show his enthusiasm for the same issue repeating itself. We escalated the issue internally with my Client/Project Manager/Boss/Mentor (again, that’s a whole other story), and he called the electrician back, who was now “coming by later to meet” us. The Ghost and I went about replacing some more gear around the site and cutting it over. Then the Electrician caught up with us, brought me to the breaker box/sub panel, showed me the breaker and tested it. All seemed well. Mind you, I’m no Electrical Engineer…

So the Ghost and I decided to stay onsite late to catch the issue in the act. All the people in the warehouse cleared out around 7:30pm so The Ghost and I ordered some grub, started verifying SNMP, config backups, and additional connectivity as well as updating Visios. We had one laptop consoled into the switch stack so we could check the console logs for any creep factors. I went to test some additional APs we had converted around the warehouse and The Ghost ended up going back to the MDF to check for our label maker. About 25 minutes had passed since I had started testing the new WAP placement and all was going great. WCS saw the WAPs, the guest anchor tunnels were working from the onsite WLC, and then all of a sudden it was like I walked into a Faraday Cage. There was ZERO WiFi. Nothing, not one bar. I looked up above me and all of the nice Green and Blue lights were out. The WAPs were all dead. As I made my way back to the actual IDF where the laptop was, I saw the switches booting up. You know, when the SYS lights are blinking rapidly and you’re asking yourself “I did do a write mem right?”… And there was no sign of The Ghost. He emerged moments later, real cloak and dagger like, so I asked “Are you messing with me? I was almost done with the WiFi updates.”
He replied, “What are you talking about? I came back like 2 mins ago. I finally found the label maker. Why did you reboot the switches if you weren’t done w the WiFi yet?”

I’ll spare you the colorful banter we exchanged at this point, but essentially neither of us had touched the switches, and we both noticed that the laptop we had sitting in the IDF/cabinet had some battery drain on it. This helped drive home the point that the issue still seemed power related. As he and I were pointing fingers at each other, my phone rang and our overnight team started by saying “This is the third night that this has happened but it’s starting to come back online now”. Very encouraging support from our overnight staff at this point…

When the switches came up, they were stating “System returned to ROM by power-on”. We escalated it again to our Customer/Project Manager. We then requested the Electrician stick around until 7:30 or 8pm with us the following evening. We then packed up and left for the night.

The next morning was “more of the same” with a little twist: one of the workers let us know that earlier in the morning, 6:30am or so, there was no WiFi, but it seemed to come on around 6:45am or so. We thanked the worker for the info and continued decommissioning old gear, updating our inventories and wondering why this closet was being a jerk. So 6:30pm rolled around and the electrician came to talk to us again and said he would be on the campus looking at some other stuff, but to call him around 7:15pm so he could hang out with us in the warehouse. The Ghost and I started tidying up the cabling a bit in the Warehouse. I looked at my watch at 7:20pm, and the electrician called me to say he would be there in about 5 mins. Perfect timing. Or so I thought…

I left The Ghost manning the consoled laptop, as I told him he “couldn’t be trusted” after his “label-maker tactics” of wandering away the evening prior. So I went back to my WiFi spot checks after washing my hands, since we had just eaten dinner. The electrician was talking to The Ghost when I got back around 7:50pm. I had a smirk on because I knew that the IDF would be dead. It had happened three times already. When I came within eyeshot of the closet, I saw all of the gear up with lights blinking. I’m not gonna lie, I was annoyed. Then I said to the electrician, “So what was it?” He explained he hadn’t done anything, and The Ghost confirmed that the Electrician hadn’t left his sight. So I pressed him a little more: “Are you sure? I’m just happy that this is working”. He assured me that nothing was done and that it probably was hardware related. (Blaming the network? Why, that’s a first.) The three of us walked to the breakers together just to verify voltage again, and when we got back the closet was off. Dead. Nothing. It was then that I decided to embrace the fact and vocalize, “Ok, this whole possessed-network-closet game is played out.” The Electrician went to the breaker box and flipped the breaker off and then on, twice. Nothing. The laptop was now running on battery power, so yet again everything was pointing to power. The electrician tested the leads, but everything was showing ample voltage being supplied.

By this time (around 8:15pm) we had gotten our PM/Boss/Buddy on the phone and were brainstorming. In his infinite wisdom he decided to declare, “Well Mr. Li-Marzi… Maybe it just doesn’t like you”. Of course he was on speaker and The Ghost’s face lit up at my expense, sort of business as usual. LOL. While the PM was talking to the Electrician, The Ghost and I went back to our dinner, which was now cold. We finished eating, threw out the garbage and both went to wash up when I heard a voice on my cellphone (still on speaker) say “Mr. LiMarzi… Let’s call your ‘favorite Uncle’ and get Mr. TAC on the line to go over the Inter-VRF routing at the other site while the electrician figures this out”. About 5 seconds later, the rack lit back up. The Electrician goes “Well that’s weird”. We informed the PM that the IDF came back up and I went looking for The Ghost, but he was nowhere to be found… (I’m starting to see a pattern here). He showed up wiping his hands about 5 seconds later. And something hit me. Other than the thought of leaving IT and getting a less mentally taxing job…

“Ghost,” I yelled out… “What [edited] did you just do?”
He said “Um… I washed my hands… Why?”
I said, “No, before that.” He stared blankly back at me… “How much do you want to bet someone cut corners on the electric?” His facial expression was full of confusion, and at the same time I could see the electrician start to roll his eyes. I then said, “OK, let’s all just hang out here for 20 mins again.” During that 20 mins, the switches came back up, WiFi came back, and my PM seemed to be intrigued with my live experiment, which I hadn’t exactly stated out loud yet. About 20/25 minutes later, the IDF went dark. The laptop chirped that it had switched to battery and I put on my patented “mmmyea” grin. I said, “OK, Ghost, why don’t you go take an Eco-break.” He explained that he didn’t really have the urge, but I insisted. He shook his head and walked toward the bathroom. I said to the Electrician, “Watch this”. Recalling an episode of one of my favorite shows, Burn Notice (Season 3 Episode 11 if you are interested), as he pushed the door of the men’s room open I snapped my fingers and the rack’s power came back on. The Electrician looked at me and started asking me how I did that.

Although I do identify as IronMan, I am far from a superhero. And as far as I am aware, I do not possess any “powers”. So, what happened when I snapped my fingers? What the electrician failed to do was actually tie the new electrical panel directly into the power of the IDF. Or rather, we weren’t “swung over” so to speak… So when he was testing the breaker, it had power… But that power wasn’t going to the IDF. What he DID end up doing… was “tapping” into an existing live wire, which happened to feed the lights in the men’s room, to provide us “new power”. The company had a green initiative throughout North America to shut down the lights in the restrooms when they weren’t needed. So, when you open the door to the men’s room, the motion sensor gets triggered and the lights come on automatically; and after a predetermined amount of time with no movement, oh let’s say 20/25 minutes or so… those same lights shut off to conserve energy. The only difference in this men’s room was that the IDF’s power was subject to the same rules. And it would in turn come to help author a great troubleshooting article years later.
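If you want to see why the outages lined up so neatly with foot traffic, here is a small Python sketch of a motion-sensor timer feeding a circuit. It is a toy model, not anything we ran onsite: the 25-minute timeout is my assumption based on the “20/25 minutes” above, and the door-open times in the example are made up for illustration.

```python
def powered_intervals(door_events, timeout_min):
    """Merge motion events (minutes since midnight) into (start, end)
    intervals during which the lighting circuit, and anything tapped
    into it, has power. Each event restarts the sensor's timer."""
    intervals = []
    for t in sorted(door_events):
        end = t + timeout_min
        if intervals and t <= intervals[-1][1]:
            # Motion arrived before the timer expired: extend the window.
            intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
        else:
            # Lights were already off: a new powered window begins.
            intervals.append((t, end))
    return intervals


def is_powered(minute, intervals):
    """True if the given minute falls inside any powered window."""
    return any(start <= minute < end for start, end in intervals)


if __name__ == "__main__":
    # Hypothetical door openings at 19:00, 19:10 and 20:15.
    events = [19 * 60, 19 * 60 + 10, 20 * 60 + 15]
    windows = powered_intervals(events, timeout_min=25)
    print(windows)
    print(is_powered(19 * 60 + 40, windows))  # 19:40, nobody has been in
    print(is_powered(20 * 60 + 20, windows))  # 20:20, just after a visit
```

During the workday the events arrive faster than the timer expires, so the windows merge into one long powered stretch. Once everyone goes home, the last window simply runs out, and so does the IDF.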

And in case you were wondering, yes, the vast majority of workers in this specific warehouse were men. And again, probability states that at some point during the day, these men would enter the bathroom, whether to wash their hands or use it. So roughly 20-30 minutes after the last person left the warehouse, the IDF would cut off, triggering alerts to our monitoring system and teams… Driving the Dynamic “Bronx Boys” Duo of CCNPs bananas… Below is the IDF that gave us all that trouble. And the wall it is mounted to; well, you guessed it… The men’s room…

Wrap up. So, when in doubt… Double check everything. Your switches may be new, your md5s unaltered, and your fiber freshly fused… But if your power source is iffy (at best), or is dependent on someone triggering a motion sensor… You may find yourself wondering if in fact all of your packets did actually wash their hands before returning to work… Until the next one.

  • 51406 out.
