LAN ports aren't accessible to WiFi clients for the first 250-330 seconds after connection

I have 2 easily reproducible cases:

  1. Reboot of a Raspberry Pi 4 with LibreELEC, connected via 2.4 or 5 GHz. It can’t access my server (connected to Omnia directly via ethernet) for the first ~250 s; after that it works fine.
  2. A laptop reconnects from a different AP (connected to Omnia via ethernet) to Omnia directly (I test with 5 GHz only). The laptop can’t access the server for the first ~250 s either; after that it works fine. Tested with a MacBook Air and a Lenovo running Linux.

I also double checked: it’s not just my server that can’t be accessed; nothing connected to Omnia via ethernet can be reached. All the devices connected via WiFi are accessible.

I’d be happy to help debug this, but until next week I can only test with the R-Pi because I’m traveling.

I think I do understand what’s going on. The cases I described above are only somewhat related and have different causes:

  1. I had an ancient Netgear AP connected via ethernet to Omnia, which was sending weird replies to all DHCP requests and confusing Omnia’s switch until it reset after ~250 s. Turning the AP off fixed the issue.

  2. When the laptop reconnects from the secondary AP (different from the one above) to Omnia, Omnia’s switch still remembers its MAC and sends all the replies via ethernet to the secondary AP, and those packets aren’t even visible to Omnia itself because everything happens in the hardware switch.

Question: is there any way to reduce that timeout or somehow reset it? Or even make all the packets on the switch go through Omnia’s CPU? (I understand there will be a performance penalty, but that’s a different topic.) Is there any way to control it at all?

I don’t have 100% proof of that yet, because I can’t sniff packets between the AP and Omnia, but I’ll try to do it in the future.

Thanks,

Guys,
I spent some time reading all the info about DSA, and I think a short-term solution will be to set different VLANs on all the ports to force the switch to push all traffic through the CPU. Am I right?
Can I achieve that with the following commands:

  # activate VLAN filtering
  ip link set dev br0 type bridge vlan_filtering 1

  # tag traffic on ports
  bridge vlan add dev lan0 vid 10 pvid untagged
  bridge vlan add dev lan1 vid 11 pvid untagged
  bridge vlan add dev lan2 vid 12 pvid untagged
  bridge vlan add dev lan3 vid 13 pvid untagged
  bridge vlan add dev lan4 vid 14 pvid untagged

Also I think that I need to switch from 4.0.5 to HBL/HBD, otherwise VLAN filtering won’t work. Right?
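Just to double-check myself: whether filtering actually ended up enabled can be verified afterwards with iproute2 (assuming the ip binary is new enough to print bridge details; br0 is the bridge from the commands above, on TurrisOS it may be called br-lan instead):

  # detailed bridge attributes; look for "vlan_filtering 1"
  ip -d link show dev br0 | grep -o 'vlan_filtering [01]'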

P.S. Most of the info I got from DSA and 802.1Q tagging.

Are you suspecting that the switch’s hardware bridge has stale entries in its fdb that clear after ~ 250 seconds?

Tried to find some sort of manual for the 88E6176-A1-TFJ2 switch chip but came up empty; the closest I came across is for the 88E6060.

Some general stuff

The normal packet flow involves learning how to switch packets only to the correct MACs. The switch learns which port an end station is connected to by remembering each packet’s Source Address along with the port number on which the packet arrived.
When a packet is directed to a new, currently unlearned MAC address, the packet is transmitted out of all of the ports except for the one on which it arrived. Once a MAC address/port number mapping is learned, all future packets directed to that end station’s MAC address (as defined in a frame’s Destination Address field) are directed to the learned port number only. This ensures that the packet is received by the correct end station (if it exists), and when the end station responds, its address is learned by the switch for the next series of packets.

Here is perhaps the interesting bit

The address database is stored in the embedded SRAM and has a default size of 1024 entries with a default aging time of about 300 seconds or 5 minutes. The size of the address database can be modified to a maximum of 256, 512, or 1024 entries. Decreasing the size of the address database increases the number of buffers available for frames. The age time can be modified in 16 second increments from 0 seconds (aging disabled) to 4080 seconds (or 68 minutes). These options are set in the ATU Control register

ATU Age Time. These bits determine the time that each ATU Entry remains
valid in the database, since its last access as a Source Address, before being
purged. The value in this register times 16 is the age time in seconds.
For example:
The default value of 0x13 is 19 decimal.
19 x 16 = 304 seconds or just over 5 minutes.
The minimum age time is 0x1 or 16 seconds.
The maximum age time is 0xFF or 4080 seconds or 68 minutes.
When the AgeTime is set to 0x0 the Aging function is disabled, and all
learned addresses will remain in the database forever.

Whilst the above is pertinent to the 88E6060 chip, the default (or programmed) ageing time for the 88E6176 might be ~250 seconds instead.
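Before filing anything it might be worth checking whether the bridge’s own ageing time is offloaded to the switch by the DSA driver; I am not certain the TOS 4.x kernel does that, so the following is only a sketch (the value is in hundredths of a second):

  # set the software bridge's ageing time to 30 s; whether this propagates
  # down to the switch chip's ATU depends on the DSA driver
  ip link set dev br-lan type bridge ageing_time 3000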

I suppose an issue should be opened at the TOS GitLab, asking:

  • for a tool enabling the user to change the ATU Age Time register, or
  • to program the switch chip with a shorter ATU Age Time than the ~ 250 seconds.

n8v8r, thanks for the info! Setting the age time to 16 s would improve things, but not solve the issue.
AFAIK something is wrong with the switch, because it should update the fdb when it receives a packet with a known MAC address from an unexpected port, but it looks like it doesn’t.

My network:

Omnia <-lan4-> Server
      <-lan1-> Secondary AP < WiFi >
      < WiFi >

WiFi is set to the same SSID, so clients can easily roam between the secondary AP and Omnia.

Packet flow, as far as I understand it:

The laptop, connected to the secondary AP, sends packets to the server. Omnia’s switch learns that the laptop’s MAC is on lan1 and the server’s MAC is on lan4, and delivers the packets without ever showing them to Omnia’s CPU. When the laptop reconnects to Omnia’s WiFi directly and tries to send a packet to the server (ARP first, since the laptop’s ARP table is flushed on reconnect), the packet gets delivered correctly, and I expected the switch to learn at this point that the laptop’s MAC is now on eth0 (port 6) instead of lan1, but that doesn’t happen. So the server’s ARP reply goes to lan1 and never reaches the laptop.
Omnia doesn’t see that ARP reply either, as it never reaches the CPU.

And this continues until aging happens.

Changing the age time from ~250 s to 16 s would help a bit, but I think it would still break TCP sessions. At this point I’d prefer all the traffic to go through the CPU so that migration between WiFi APs is seamless.
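In theory I could also kick the stale entry out by hand as soon as the laptop shows up on WiFi, instead of waiting for ageing. Something like the following, where the MAC address is just an example and I don’t know whether the delete actually reaches the hardware ATU on this kernel:

  # drop the stale software-bridge entry for the laptop's MAC learned on lan1
  bridge fdb del 00:11:22:33:44:55 dev lan1 master
  # some drivers need the hardware entry addressed separately
  bridge fdb del 00:11:22:33:44:55 dev lan1 self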

I am wondering whether there is an fdb bug in the switch’s firmware. I just tested with

TO
|-> lan0 -> AP1
|-> lan1 -> AP2

and had an Android client connect first to AP1 and then move to AP2. I ran a network analyser app on the client but could not reproduce the issue: ARP responses from all clients connected to AP1 were immediate, and I was able to reach the web interface of AP1 as well, without any such 250 s delay.

Maybe this issue is particular to a certain batch of hardware only. Debugging it would require access to the switch’s fdb, but I would not know how to gain such access.


I am not sure that is feasible with DSA, since packets designated to remain within the boundary of the switch’s hardware bridge should not escape to the CPU and then be rerouted back to the switch (basically defeating the purpose of the switch), which might be clearer from [3]:

DSA currently supports 5 different tagging protocols, and a tag-less mode as well. The different protocols are implemented in:

net/dsa/tag_trailer.c: Marvell’s 4 trailer tag mode (legacy)
net/dsa/tag_dsa.c: Marvell’s original DSA tag
net/dsa/tag_edsa.c: Marvell’s enhanced DSA tag
net/dsa/tag_brcm.c: Broadcom’s 4 bytes tag
net/dsa/tag_qca.c: Qualcomm’s 2 bytes tag

The exact format of the tag protocol is vendor specific, but in general, they all contain something which:

identifies which port the Ethernet frame came from/should be sent to
provides a reason why this frame was forwarded to the management interface

[3] https://www.kernel.org/doc/html/latest/networking/dsa/dsa.html#switch-tagging-protocols

What is not so clear to me is whether this happens when moving a client between TO’s Lan ports (direct wire or a wired external AP/extender should not matter) or when moving a client between TO’s own Wlan and Lan port(s)?

If the latter, it should be kept in mind that the TO’s own Wlan connects via PCIe (bridge chip) to the CPU and is then potentially slaved into a software bridge together with a DSA port. In that case the switch’s fdb would have no information about the client connecting to the TO via its Wlan, unless the same client had previously been connected to a Lan port and the switch’s fdb thus still holds the client’s MAC as source/destination for those 250 - 300 seconds.
In this light it would be an expected (logical) outcome and not a bug.

This happens only when a client moves from a Lan port to the Wlan. But I still think the switch should learn that the client has moved to a CPU port when it receives the first packet from it. The client is sending traffic to the server, which is connected to the switch, so the switch should learn that the client has moved from port 1 to port 5 or 6 (one of the CPU ports).

If it were the server trying to reach the client after the move, then it would be understandable that it can’t find it until the new location is learnt by the switch.

Is my understanding correct?

Does the switch treat all the ports equally?

Then the max. 250 - 300 s lag can be expected, as explained previously.


The fdb of any bridge, hardware or software, is by design only fed (learning) from the downstream ports, not the upstream ports.

The switch’s upstream CPU facing ports (5 & 6 in this case) are:

  • not DSA ports (check with ethtool -i eth0 and ethtool -i eth1)
  • not part of the hardware bridge
  • not learning ports that feed into the switch’s fdb

The software bridge that eventually enslaves the WLan port and the DSA port has its own fdb, which the switch knows nothing about, and equally the software bridge is unaware of the switch’s fdb.
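The software bridge’s half of that can at least be inspected with iproute2; this only shows the software view, and I would not expect the switch’s hardware-learned entries to appear there on this kernel:

  # MAC addresses the software bridge has learned, and the port each was seen on
  bridge fdb show br br-lan

  # restrict the listing to a single port
  bridge fdb show br br-lan brport lan1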

Not sure whether creating a multicast group (bridge mdb) could be a potential workaround.


Afaik that is the case. Some switches provide QoS but that is then an upper protocol layer.

I tried to understand how to set that up, but setting VLANs was easier. Also I’m afraid that multicasting all packets would create too much unneeded load.

Looks like I managed to fix the issue by pushing all the traffic through the CPU with the following commands:

bridge vlan add dev lan0 vid 10 pvid untagged
bridge vlan add dev lan1 vid 11 pvid untagged
bridge vlan add dev lan2 vid 12 pvid untagged
bridge vlan add dev lan3 vid 13 pvid untagged
bridge vlan add dev lan4 vid 14 pvid untagged
bridge vlan del dev lan0 vid 1               
bridge vlan del dev lan1 vid 1               
bridge vlan del dev lan2 vid 1               
bridge vlan del dev lan3 vid 1               
bridge vlan del dev lan4 vid 1  

So bridge vlan looks like:

root@ap:~# bridge v                          
port    vlan ids                             
lan0     10 PVID Egress Untagged             
lan1     11 PVID Egress Untagged             
lan2     12 PVID Egress Untagged             
lan3     13 PVID Egress Untagged                  
lan4     14 PVID Egress Untagged             
br-guest_turris  1 PVID Egress Untagged      
br-lan   1 PVID Egress Untagged              
wlan1    1 PVID Egress Untagged              
wlan0    1 PVID Egress Untagged         

BTW, I also found out that the Asus RT-AC68U with the original Asus firmware has a similar bug. I hope I’ll be able to fix it in a similar way after flashing it with OpenWrt.

Thank you for all the advice!


That bit is new to me and surprising (not least because I do not use the node’s Wlan). But what is interesting is that it should render obsolete the creation of a software bridge that enslaves WLan and Lan ports; e.g. say one wants lan0 and wlan0 to be members of the same VLAN, it should presumably work with:

bridge v d dev lan0 vid 1
bridge v d dev wlan0 vid 1
bridge v a dev lan0 vid 10 pvid untagged
bridge v a dev wlan0 vid 10 pvid untagged

I am not sure whether, this way, traffic designated to remain within the boundaries of the switch’s hardware bridge (lan0 - lan4) reaches the CPU and gets redirected back to the switch.

It works: all the traffic that was previously forwarded directly by the switch now gets forwarded via the CPU. Latency almost doubled, but I don’t have 5-minute delays any more.

Pings between 2 nodes connected via ethernet directly to Omnia; before:
rtt min/avg/max/mdev = 0.221/0.351/3.798/0.087 ms
After:
rtt min/avg/max/mdev = 0.397/0.579/10.699/0.083 ms

And now I can tcpdump all that traffic on Omnia.

An attempt at a potential explanation:
  • two address databases are at work and one perhaps superseding the other
    • switch address database
      • ageing of 300 seconds
      • learning on its downstream ports
      • no learning on the WLan ports (by design)
    • bridge fdb
      • learning on switch downstream ports
      • learning on WLan ports

If the switch address database is superseding the bridge fdb it would explain why clients that move from a Lan port to a WLan port in less than 300 seconds are exhibiting the connectivity issue.
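One way to watch at least the bridge-fdb half of this would be to poll where the software bridge thinks a roaming client is (the MAC below is a placeholder; on newer kernels entries pushed to the hardware additionally carry an offload/self flag, though I am not sure this kernel exposes that):

  # poll which port the software bridge currently associates with the client's MAC
  while true; do
      bridge fdb show br br-lan | grep -i 00:11:22:33:44:55
      sleep 1
  done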

Hi. This happens on my Turris MOX as well. At first the workaround seemed to be working, but on testing again it doesn’t. I am not sure why it seemed to work before.

It doesn’t work even with WDS AP enabled on all of my APs (to my surprise, I have read that this may be necessary to get the bridging working at all).

I have three APs:

[Turris MOX]
|  L ethernet -- [ Openwrt on TPLink WR1043ND]
L_   ethernet -- [ Openwrt on D-Link DIR 825 ]  -- wireless -- [raspberry-pi]

Each device has a Linux bridge over the eth and wlan devices. With my mobile phone I roam between them. When I roam from the D-Link to the MOX, it seems not to work until some timeout. When roaming from the TP-Link to the D-Link, it seems fine: the handover to the new AP just works and pings to the RPi keep going with no problem.

This is on my MOX at the moment

lan1	 11 Egress Untagged
lan2	 12 Egress Untagged
lan3	 13 Egress Untagged
lan4	 14 Egress Untagged
lan5	 15 Egress Untagged
lan6	 16 Egress Untagged
lan7	 17 Egress Untagged
lan8	 18 Egress Untagged
br-guest_turris	 1 PVID Egress Untagged
br-lan	 1 PVID Egress Untagged
wlan0	 20 Egress Untagged
wlan1	 21 Egress Untagged

There is someone else apparently having the same issue:


Indeed I’m encountering the same issue.

Could it be related to ARP caching somehow? I only seem to get it when roaming from my other APs to the Turris, and not when I turn on my device in the Turris range.

I’m sure it’s not ARP caching on the main router, because everything works fine switching between other APs.

I just took my laptop and phone into Turris range; the laptop had no issues and the phone was blocked for a while.

The phone was visible in the wireless clients list with its IP address, but could not be pinged. From the phone, none of the other wireless clients could be reached. After a while the phone disconnected from wifi and tried again but didn’t get a DHCP response. Shortly after that it did get a DHCP response and everything worked.

I think the above behavior matches the phone MAC address being cached and any packets for it going to the LAN interface (that it was on before) and not to the WiFi interface.

This sounds very similar to this issue: OMNIA: Vlan on DSA port breaks arp responses (TOS 4.0.5). There is a script that someone posted which monitors the wifi interface and removes stale entries from the lan bridge when a client roams to wifi. That completely solved the issue for me.
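For anyone curious, the idea behind such a script is roughly the following. This is only a sketch of the approach as I understand it, not the script from the link; it assumes hostapd manages the router’s own wlan interfaces and that hostapd_cli is available:

  #!/bin/sh
  # Invoked by: hostapd_cli -i wlan0 -a /path/to/this/script
  # When a station associates to the local AP, remove any stale fdb entries
  # for its MAC that still point at a LAN port, so replies stop being
  # switched out of the old port.
  IFACE="$1"   # e.g. wlan0
  EVENT="$2"   # e.g. AP-STA-CONNECTED
  MAC="$3"     # station MAC address

  if [ "$EVENT" = "AP-STA-CONNECTED" ]; then
      for port in lan0 lan1 lan2 lan3 lan4; do
          bridge fdb del "$MAC" dev "$port" master 2>/dev/null  # software bridge entry
          bridge fdb del "$MAC" dev "$port" self 2>/dev/null    # hardware entry, if supported
      done
  fi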

Thanks! I installed it (from https://github.com/stanojr/tempfix_fdb_dsa) and let’s hope it works.

I hope Turris will fix this properly at some point; is there any way to track that?

You can follow the issue on their gitlab issue tracker: https://gitlab.nic.cz/turris/turris-build/-/issues/165
