TO LAN switch or bridge blocks DHCP replies intermittently

Xiche · June 20, 2020, 4:49pm

I have a fairly simple setup where I have the Turris Omnia (@TOS 5.0.1) configured as a “dumb AP” along with another AP to cover a different part of the house. Both, along with my DHCP server, are attached to a managed switch (Cisco SG300). Every time my phone roams/reassociates from the other AP to the TO AP, it will be unable to receive a DHCP lease for a few minutes (stuck in “Obtaining IP address”). I have verified in these scenarios that:

The DHCP server is seeing the requests and responding.
The switch port the TO is attached to is seeing both the DHCP requests and replies (by way of port monitoring/mirroring).
The bridge interface on the TO (br-lan) is seeing only the DHCP requests.

So that means that the DHCP replies are getting lost somewhere between the external switch and the bridge interface on the TO, which pretty much just leaves the TO’s internal switch. My wild guess is that the TO’s internal switch or the associated DSA-backed Linux bridge is doing some MAC learning/filtering and is letting the DHCP request through as it has a broadcast MAC target, but blocking the targeted reply as it thinks that MAC lives elsewhere until its entry times out after a few minutes.

(This is potentially similar to the thread “TO in dumb AP mode does not relay DHCP to WLAN clients”.)

anon82920800 · June 20, 2020, 5:01pm

Roughly 5 minutes? Then you might have run into an issue that others have observed.

There appears to be a (communication) gap between the kernel’s bridge FDB and the switch’s MAC address database, the Address Translation Unit (ATU).

Xiche · June 20, 2020, 5:15pm

Last attempt was almost exactly 4 minutes, but yeah - about 5 minutes. I could see in bridge fdb show that the MAC address was showing up twice during the broken time:

# bridge fdb | grep 1a:2b:3c:4d:5e:6f
1a:2b:3c:4d:5e:6f dev lan0 vlan 1 self 
1a:2b:3c:4d:5e:6f dev wlan0 master br-lan

When it started working again, the MAC address showed up with only a single entry like so:

1a:2b:3c:4d:5e:6f dev wlan1 master br-lan

(The wlan0 vs wlan1 is probably because it will switch freely between the two.)
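
The duplicate-entry symptom described above can be checked for mechanically. What follows is only a minimal sketch: the function name `find_conflicting_macs` is hypothetical, and it assumes that wireless interfaces are named `wlan*` and that switch-port entries carry the `self` flag, as in the output quoted in this thread. It reads `bridge fdb` output on stdin and prints each MAC that is listed both on a switch port and behind a wireless interface.

```shell
#!/bin/sh
# Hypothetical helper, not part of Turris OS or iproute2.
# Reads `bridge fdb` output on stdin and prints "MAC switch-port wlan-if"
# for every MAC that has both a "self" entry on a non-wireless port and a
# "master" entry on a wlan* interface (the interface naming is an assumption).
find_conflicting_macs() {
    awk '
        / self( |$)/ && $3 !~ /^wlan/ { hw[$1] = $3 }   # entry on a switch port
        / master / && $3 ~ /^wlan/    { sw[$1] = $3 }   # entry learned via Wi-Fi
        END {
            for (mac in hw)
                if (mac in sw)
                    print mac, hw[mac], sw[mac]
        }
    '
}

# Intended use on the router:
#   bridge fdb | find_conflicting_macs
```

An empty result would suggest that no client is currently in the conflicting state; a line of output names the switch port that still holds the old entry.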

anon82920800 · June 20, 2020, 5:32pm

The lan0 entry (dev lan0 vlan 1 self) is from the time when the device was connected to the other AP (which presumably routes through the SG300).

The wlan0 entry (dev wlan0 master br-lan) was then added when the device roamed to the TO’s AP.

The problem is that the TO’s switch is unaware of upstream interfaces, as neither the DSA driver nor the bridge driver appears to be communicating that information to the switch’s ATU.

The ATU is only aware of the switch’s downstream ports. Once a client roams to a port upstream of the TO’s switch, the switch does not know what to do with the packets (it does not have connection tracking) and drops them, so the client cannot obtain a DHCP lease until the ATU’s ageing period clears the client’s MAC. This user leverages a workaround
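
As an illustration of what manually clearing such a stale entry could look like, here is a dry-run sketch. It is not the workaround referred to above, and the function name `print_stale_fdb_deletes` is hypothetical. It only prints `bridge fdb del` commands and does not run them. Whether deleting the `self` entry actually clears the ATU entry on this kernel is an assumption that the thread does not confirm.

```shell
#!/bin/sh
# Hypothetical dry-run helper: reads `bridge fdb` output on stdin and PRINTS
# a `bridge fdb del` command for each switch-port ("self") entry whose MAC
# is also known behind a wlan* interface. Nothing is deleted by this script.
# Assumptions: wireless interfaces are named wlan*, and removing the "self"
# entry is enough to make the switch forget the MAC (unverified).
print_stale_fdb_deletes() {
    awk '
        / self( |$)/ && $3 !~ /^wlan/ {
            vid = ""
            # keep the VLAN ID if the entry has one
            for (i = 4; i < NF; i++)
                if ($i == "vlan") vid = " vlan " $(i + 1)
            stale[$1] = "bridge fdb del " $1 " dev " $3 vid " self"
        }
        / master / && $3 ~ /^wlan/ { roamed[$1] = 1 }
        END {
            for (mac in stale)
                if (mac in roamed)
                    print stale[mac]
        }
    '
}

# Intended use on the router (review the printed commands before running any):
#   bridge fdb | print_stale_fdb_deletes
```

Printing rather than executing keeps the sketch safe to try: the output can be inspected first, and a single command can be run by hand to see whether DHCP replies start arriving before the ageing period expires.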

Xiche · June 20, 2020, 6:05pm

Thanks a lot for the pointers and explanation. It seems like for my use case, I’d be better off just bridging the WAN port to the WLAN interfaces and connecting that to the external switch to avoid the internal switch altogether.

I’m not going to mark your post as the “Solution” because I still consider this to be a bug that should be fixed by the Turris team if possible. My other AP (a repurposed Greenwave G1100) has no problem correctly performing the role with what appears to be a very similar hardware layout. Best case scenario would be having the software update the switch FDB to reconcile duplication that it notices. Worst case would be a mechanism to disable the MAC learning/FDB in the switch chip.

I’m not sure how popular it would be (probably more so for the Mox), but I would add to my wishlist a “dumb AP” mode offered by Foris in the initial setup.

anon82920800 · June 20, 2020, 6:14pm

The difference could be that on that device the WLAN is on the same bus as the switch chip and the two communicate directly with each other (which is not the hardware layout in the TO), rather than through the CPU/kernel as an intermediary or some other sort of bridge.

The bug likely lies in the upstream kernel source rather than with OpenWrt or CZ.NIC, and patching the DSA or bridge driver might not be that easy. It might also already be fixed in more recent kernel versions.

Xiche · June 20, 2020, 6:53pm

Last update for anyone else that runs into this:
Adding the WAN port to the LAN group in Foris and connecting the WAN port to my switch instead of a LAN port worked around the issue and I no longer have any trouble with roaming.

focusaurus · December 23, 2020, 12:06am

Thanks for posting this follow-up! I believe I was having the same issue, so I addressed it following your suggestion. Hopefully it will keep my Android phone connecting equally well to either access point.

AreYouLoco · April 9, 2021, 4:07pm