“Bienvenido a Miami” – WiFi, RADIUS and VRFs
“What we’re gonna do right here… Is go back…. WAY BACK INTO TIME”
(waits for beat to drop, then for Hip-Hop and R&B heads to start nodding their heads…)
“If you take WiFi [your love] away from me… I’ll go CRAZY!!!” – CCIE 51406, not an official member of Blackstreet.
OK, maybe not THAT far back… Maybe the summer of 2021, let’s say… Oh, I don’t know… Maybe June-ish… Perhaps remembering live music playing, bongo drums perhaps; an outdoor dining area around Brickell Ave/’Not quite South Beach’ hotel? Mmmyea? 😏 Anyway…
The concern raised: Random WiFi inconsistency in the new Miami office.
The Preamble: The office reporting the issues had been designed and literally built out with 95% pre-configuration w/ Excel templates (obv I couldn’t very well route Florida Public Static IPs outside of the local ISP {deep-cut/pro tip: 'Florida' is the name of a state 🤓}), smart licensed (smash face on keyboard to continue), burned-in for a few days, Multiple Lightweight Access Point Setup, out of band with Cellular modem setup, 802.1x to the access port, VPN Failover tested between Data Centers… Basically, a thoroughly tested office, as far as you could go while living in your rental apartment, before shipping it out to the site.
However, prior to shipping the hardware out, a commemorative photo with my (now late) best friend was taken for posterity.
The reported issue(s): Ranging from “it’s dropping” to “it just times out ‘the webpage’ “; and who could forget the most colorfully conveyed and helpful piece of information that anyone in IT can ever hear when asking a serious question while trying to gather information: “it sucks!”
So, me being me, I build out a cloned test environment by pushing the same H-REAP, excuse me “Flex Connect” config, along with the appropriate policies and tags down to my test environment, at home. Almost instantly, I saw wireless authentications and client traffic getting successfully tagged for the Internal, Guest and BYOD networks respectively. I also spied… my old familiar Guest Captive Portal (this was a good thing), as well as the BYOD Enrollment portal and then the corresponding Certificate Provisioning portals from my “on-again off-again girlfriend,” Cisco ISE. Basically, I couldn’t re-replicate any of the issues being reported, but… I continued to test, because I do love me some consistency to help ease/reinforce my Engineering OCD; CDO.
After a couple of days, several diverse endpoints, and google-fooing quite a bit, I still could not reproduce any of the issues or errors being continually reported. The decision was made to physically send me to the office to continue to troubleshoot the issues live and resolve any other items that cropped up. As an interesting aside to this task, I was asked as a prerequisite to “apply” for my own job. That meant providing a complete write up of my past clients and experiences, my qualifications and accomplishments, as well as an updated resume (Corporate America is so weird). But anyway…
After proving that “I was me” (see paragraph above for context), I headed down to the Miami office to identify, investigate and hopefully resolve whatever issues the user(s) seemed to be experiencing at the time. In the Miami office it’s a pretty simple setup. It’s a single switch (super collapsed core), a single point to point line to the DataCenter, a single internet handoff for internet failover, two APs for coverage, a single Out of band device with Cellular backup, and an internet router. You did see the picture of gear I shipped down there, right?
Testing on the first day went off without any issues. Everything seemed to be working fine. Before I left, to prepare for the following day’s tests, I ended up having my brother revoke my BYOD cert from ISE as to test re-enrollment, and also blow away my MAC address for Guest access, which he overly obliged with vast amounts of enthusiasm.
So on day 2, “early-in-the-mornin’ rising to my feet” with a questionable building badge that had been provided to me by arguably my favorite co-worker in NY, I set off to the office to continue troubleshooting.
I started by connecting my iPhone to the BYOD provisioning portal (hosted by ISE as both the provisioning portal of Certificates as well as the BYOD Certificate’s CA. I can get super in the weeds about this design, maybe in another post). So upon successfully connecting and navigating through the Provisioning Portal, I did get a new certificate bundle pushed down to my iPhone and then was successfully connected/authenticated via EAP-TLS after allowing the ISE Cert to be utilized on the iPhone (thanks networking, you’re SO AWESOME! Seriously, you are and you are really underappreciated).
With that tested, I went onto the next device. I powered up my work laptop, and upon getting to the Windows Logon Screen, it connected “auto-magically” (if you know… you know) to the Internal WiFi, again using EAP-TLS, however this machine cert was cut straight out of Compton, I mean from AD’s CA. I did confirm that my laptop received the correct AuthZ and dACL pushed to my Wireless Session, prior to logging in fully (aka nice walled garden for limiting network access in a pre-authenticated environment). So I logged onto my laptop, and confirmed that the “not quite fully, but sort-of because we weren’t doing ‘real’ EAP chaining at the time’s” second EAP-TLS login with my user account’s certificate had subsequently took over (took what Brian?) from the Machine’s previous initial Authentication. Then I hopped on the internet to confirm all was good.
OK, so 2 successful WiFi tests down on day 2, and 1 to go. I took out my personal iPad and attempted to connect to the guest network. I got the AUP/splash page/login page, so I logged in with my guest creds and was then redirected to the Post-Login Banner/acceptance page, as shown below.
OK, looking good so far. So I clicked on the large “Continue” button and nothing happened.
Like nothing-nothing. So, I refreshed the page, and verified the URL was correct (in the following pic I removed the real URL b/c “gente son gente…”) shown below.
So that’s weird… So, the redirect is DNS based right? I can ping and resolve that. I clearly have an IP address, b/c I literally just saw a sign-in webpage as well as a Post-Login webpage. Maybe the iPad is being dumb or something. So I toggle WiFi off, and then back on again. I see it searching for WiFi, then it finds the Guest Network, and b/c I’m just a little bit of a creep 👀, I watch the RADIUS Live Logs in ISE to follow along. I see my iPad grab an IP, and see the AuthZ session being sent in the RADIUS response back to the WLC saying “yep, that’s the guest user ‘tonystank’.” Everything seems to be going according to plan. And yet… My iPad’s screen still shows the “loading” circle from the photo above.
So, I take the iPad’s mac-addy out of the guest endpoint identity group, and toggle WiFi off and then on yet again. And so I connect again to the Guest WiFi and get the guest aup/login page presented to me. I refresh my ISE session on my admin machine, and yep, I see the redirect URL in the RADIUS AuthZ being sent. So what the heck? (I may have used another word).
My brother had created me a new colorful username and password to test with this time around, and no, I don’t care to share anything about it, no matter how funny or inappropriate it was. Using the new creds I attempt to login, we see the session change to the “Stage 2” aka Authenticated Guest AuthZ, and apply the “Permit All” set to my session, and I’m dead in the water with that Guest session. Nothing. Well this is frustrating.
So I start checking the data path between the AP to the controller, then from the controller to ISE to ensure connectivity via that Guest network, which in full disclosure was segmented by a firewall and using a NAT to keep people from getting my girlfriend ISE’s number… Her IP address. (more on that setup in another post at some point).
Routing looks good. NATs look good, I see XLATEs and CONNs on the ASA, so the Firewall is good. And just to triple check, past the firewall EIGRP is doing its job bidirectionally, showing at each hop that both source and dest routes exist. So I hop onto ISE’s CLI aka ADE-OS aka “I’m not fully Cisco’s CLI, but I’ll be damned if I give you any Linux access, or useful visibility without TAC setting up the root patch”. Pings and traceroutes to my gateway on that network being NATed look good.
Well this is now both frustrating AND annoying. So, I call up TAC and start troubleshooting it with them. And we find that the behavior of not loading the full session after a successful login is inconsistently happening. Sometimes it lets me on, sometimes it doesn’t. I tried logging in with a few other devices as well and it was not quite a 50/50 split, but it was clearly not as dramatic as the “70/30 ratio” that I was once told a story about (yikes 😬) , once upon a time in lower East-side Manhattan setting, during an evening out while in the company of potentially intoxicated peers (hey if you know… you know).
Ok, back to the task at hand. TAC and I were seeing packets (via several packet caps and TCP dumps) leave the WLC, reach ISE and then come back properly (and within a few seconds) on the successful attempts. However, on the unsuccessful attempts/page loads that died, we would see the traffic leave ISE in the TCP dump and, at first, arrive at the controller. After some more Radioactive Tracing, hours in Wireshark with our traces, and re-verifying the topology, I got asked why the TACACS config was configured as VRF-aware in the 9800 config. To which I said, “because we manage the devices via the management VRF at this site.” This was also the case for the RADIUS config as well as the HTTPS config, to keep all of the management traffic of the Wireless Controller out of the data plane. We kept at it for a bit, and then reconvened the WEBEX troubleshooting after lunch.
So then, TAC asked me where the wireless management interface was defined, and it was on an inband VLAN, which is where all of the APs’ CAPWAP tunnels terminated. After the CAPWAPed/authenticated traffic was dumped off of the wireless management interface, the clients were dropped into a VLAN according to the AuthZ policy pushed back from ISE, and then locally trunked off one of the controller’s port-channels within the DataCenter, depending on the dot1q tag.
He asked me to capture traffic on that inband interface, even though neither of us expected to see much of anything. And wouldn’t you know it: for some reason the controller was trying to process random replies coming back from the management VRF for RADIUS sessions (aka WLC to ISE and ISE back to WLC) out the inband wireless management interface.
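For anyone who wants to hunt for the same symptom, here is a rough idea of the check in code form. This is only a minimal Python sketch and not the tooling TAC and I actually used: the `Packet` record, the interface labels `mgmt-vrf` and `wireless-mgmt`, and the sample addresses (documentation ranges) are all hypothetical. The only hard facts in it are the standard RADIUS UDP ports, 1812 for authentication and 1813 for accounting.

```python
from dataclasses import dataclass

RADIUS_PORTS = {1812, 1813}  # standard RADIUS auth / accounting UDP ports


@dataclass(frozen=True)
class Packet:
    """One simplified capture record (hypothetical shape, for illustration)."""
    interface: str  # label of the interface the packet was captured on
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def misrouted_radius_replies(packets, expected_interface):
    """Return RADIUS replies (source port 1812/1813) captured on any
    interface other than the one the RADIUS config is supposed to use."""
    return [
        p for p in packets
        if p.src_port in RADIUS_PORTS and p.interface != expected_interface
    ]


# Hypothetical sample: two replies from ISE to the WLC, one of which shows
# up on the inband wireless management interface, plus unrelated HTTPS.
capture = [
    Packet("mgmt-vrf", "192.0.2.10", "198.51.100.5", 1812, 50001),
    Packet("wireless-mgmt", "192.0.2.10", "198.51.100.5", 1812, 50002),
    Packet("wireless-mgmt", "203.0.113.7", "198.51.100.5", 443, 50003),
]

stray = misrouted_radius_replies(capture, expected_interface="mgmt-vrf")
```

With the sample above, `stray` holds only the second record: a RADIUS reply that turned up somewhere other than the management VRF, which is the kind of thing that should make you raise an eyebrow.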
After seeing that a few times, TAC asked me if we could swap the RADIUS config to the wireless inband management interface (and take it out of the management VRF) just to test. So I failed away all of the APs, other than those in the Miami office, to the other DataCenter, and reconfigured the RADIUS servers in the config to not use the management VRF. Then I updated ISE with a new Network Device, to only accept RADIUS, using the inband wireless management interface IP address. Finally, I then removed the RADIUS service from the originally defined Network Device that was utilizing the VRF aware Management IP addresses. I did leave the TACACS service configured under this Network Device, and just unchecked the RADIUS box to disable that service.
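The ISE side of that change is easy to get backwards, so here it is as a tiny model. This is a hedged Python sketch, assuming a made-up `NetworkDevice` record rather than anything from ISE's real API; the device names and IP addresses are hypothetical placeholders. It only illustrates the order of operations: add a RADIUS-only entry for the inband IP, then strip RADIUS (and nothing else) from the original entry.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkDevice:
    """Simplified stand-in for an ISE Network Device entry (hypothetical)."""
    name: str
    ip: str
    services: set = field(default_factory=set)


def move_radius_inband(devices, old_name, new_name, inband_ip):
    """Add a RADIUS-only device for the inband IP, then drop RADIUS from the
    original entry, leaving its other services (e.g. TACACS) untouched."""
    old = next(d for d in devices if d.name == old_name)
    devices.append(NetworkDevice(new_name, inband_ip, {"radius"}))
    old.services.discard("radius")
    return devices


# Hypothetical names and documentation-range IPs.
devices = [NetworkDevice("wlc-mgmt-vrf", "192.0.2.20", {"radius", "tacacs"})]
devices = move_radius_inband(
    devices, "wlc-mgmt-vrf", "wlc-inband", "198.51.100.20"
)
```

After the call, the original entry only answers TACACS from the management VRF address, and the new entry only answers RADIUS from the inband wireless management address.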
I went back to my iPad and started a new session, boom instant guest internet access. I had my brother then rip my session out of ISE, along with my MAC from the guest Endpoint identity group and then de-auth me manually from the WLC. And successful flows of traffic again, again and agaaaaaaain…
So this was weird, given all we had been told about these physical Catalyst 9800s: that they were built off of the Catalyst 9k universal image, so they were “basically in parity with the cat 9k switching platforms,” at least within management. Evidently, I found a bug. TAC took some additional debug files and was going to work with Engineering on the issue, but I don’t think it ever got fully resolved.
The interesting takeaway of this entire situation was that both TACACS and HTTP/HTTPS were not affected at all. We never had any issues with remote management logins via SSH using TACACS (the same ISE Servers btw) nor with HTTPS admin sessions for going into the GUI of these controllers. I mean they do both run via TCP, so maybe there was/is an issue with how UDP was being processed within the universal code when being referenced in a VRF? Bottom line, for RADIUS connectivity on any Cisco 9800 Wireless Platform (physical or virtual), use the wireless management interface as the RADIUS source interface and address.
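If you want to audit a controller config for this, a quick text check is enough. The following is a minimal Python sketch under a few assumptions: that your RADIUS server groups are defined with `aaa group server radius <name>` stanzas, and that VRF awareness shows up as an `ip vrf forwarding` line inside the stanza. The sample config, group names, VRF name, and VLAN are hypothetical, so compare against your own running config and software release before trusting it.

```python
def radius_groups_using_vrf(config_text):
    """Return the names of 'aaa group server radius' stanzas that contain
    an 'ip vrf forwarding' line (TACACS+ groups are ignored)."""
    flagged = []
    current = None  # name of the RADIUS group we are inside, if any
    for line in config_text.splitlines():
        if line.startswith("aaa group server radius "):
            current = line.split()[-1]
        elif not line.startswith(" "):
            current = None  # any other top-level line ends the stanza
        elif current and line.strip().startswith("ip vrf forwarding"):
            flagged.append(current)
    return flagged


# Hypothetical sample config, for illustration only.
SAMPLE = """\
aaa group server radius ISE-RADIUS
 server name ISE-1
 ip vrf forwarding Mgmt-vrf
aaa group server tacacs+ ISE-TACACS
 server name ISE-1-T
 ip vrf forwarding Mgmt-vrf
aaa group server radius ISE-RADIUS-INBAND
 server name ISE-1
 ip radius source-interface Vlan100
"""

flagged = radius_groups_using_vrf(SAMPLE)
```

Here `flagged` contains only `ISE-RADIUS`. The TACACS+ group stays VRF-aware on purpose, since TACACS never gave us any trouble in the management VRF.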
At any rate, I hope this was either insightful or comical to whoever is reading this. If I can save you hours/days of troubleshooting I am here to help. Although, I did get a trip to Miami out of it… So until next time, I will leave you with another shot of Brickell Avenue…
Keep learning, keep labbing and keep having fun out there, kids… – 51406 out.