LAN ports aren't accessible to WiFi clients for the first 250-330 seconds after connection


Could you please help me fix an issue: when a client connects to Omnia’s WiFi, it can’t reach anything connected to Omnia’s LAN ports for the first 250-330 seconds; even ARP requests aren’t forwarded to them. After that everything suddenly works. I wasn’t able to find anything in dmesg or iptables. The only suspect is the bridge. I tried switching STP on and off and changing the ageing time from 300 s to 10 s; neither helped.
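
For reference, these are the knobs I toggled (assuming the LAN bridge is br-lan; note that iproute2 takes ageing_time in centiseconds):

  # toggle STP on the bridge (0 = off, 1 = on)
  ip link set dev br-lan type bridge stp_state 0
  # set the ageing time to 10 s (value is in centiseconds)
  ip link set dev br-lan type bridge ageing_time 1000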

All the WiFi-connected devices are reachable from the first seconds, and so is Omnia itself.
Both IPv4 and IPv6 are affected.
Turris OS 4.0.3; I think I’ve seen similar behavior with 3.13.


Thought I’d point out [1], since it seems to exhibit a similar issue for another TO user, though on the upstream OS; perhaps efforts to get it sorted can be pooled.


Thanks for the link!
Just to confirm: it wasn’t me, but the problem looks exactly the same. I also checked brctl showmacs and can see that the MAC of the new device is attached to the correct interface.

The matching time frame observed in both instances is certainly curious.

Is STP (Spanning Tree) enabled on the bridge?

There seems to be a bug [2] related to mcast_last_member_interval which may or may not compound the issue.


The issue happens with STP both on and off, but by default it’s off.
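
I read the state back from sysfs to be sure (bridge name br-lan assumed):

  cat /sys/class/net/br-lan/bridge/stp_state   # 0 = off, 1 = on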

Apparently some sort of timing issue with the AP, either:

  • in the bridge device, or
  • in the communication between bridge device and WLan device, or
  • in the WLan device

Since the issue does not show up on the LAN ports, the bridge device can likely be ruled out.

The time frame seems pretty distinctive, considering that it matches the case reported on the other forum. But neither report mentions the frequency band this happens on, e.g. 2.4 GHz, 5 GHz, or both?

If it is only on the 5 GHz band then it might be caused by the Dynamic Frequency Selection (DFS) scan. Not sure whether client association is permitted during the scan, but it certainly blocks traffic while the scan runs.
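
A rough way to check this (interface and phy names are examples; needs a reasonably recent iw):

  iw dev wlan0 info     # shows the channel the radio currently uses
  iw phy phy0 channels  # channels flagged 'radar detection' are DFS channels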

I’m the user from the OpenWrt topic. We can continue the discussion in one forum if that’s easier. At this point I don’t know whether this is something specific to the Turris Omnia or a more general OpenWrt issue.


Then I suppose we stay here, considering it happens on the same hardware platform and shows up on OpenWrt as well as TOS.

Just patching in @TnT’s last post from the other forum:

I have both 2.4 and 5 GHz networks with the same SSID. I didn’t pay attention to which one I was connecting to. I just tried with 5 GHz disabled, and it shows the same problem. So it’s probably not the DFS scan. Note that during the time when pinging the access point fails, I can access the internet just fine (and also ping the main router). So it’s not like I have no connectivity at all.

In case it is hardware (WLAN cards) or driver specific, it is perhaps worth comparing whether those are the same on both nodes?
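
For instance (interface name is an example):

  lspci -nn             # lists the WLAN cards with their PCI IDs
  ethtool -i wlan0      # reports the driver bound to the interface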

I have two easily reproducible cases:

  1. Rebooting a Raspberry Pi 4 with LibreELEC, connected via 2.4 or 5 GHz. It can’t access my server (connected to Omnia directly via Ethernet) for the first ~250 s; after that it works fine.
  2. The laptop reconnects from a different AP (connected to Omnia via Ethernet) to Omnia directly (I test with 5 GHz only). The laptop can’t access the server for the first ~250 s either; after that it works fine. Tested with a MacBook Air and a Lenovo running Linux.

I also double-checked: it’s not only my server that can’t be accessed, but anything connected to Omnia via Ethernet. All the devices connected via WiFi are accessible.

I’d be happy to help debug this, but until next week I can only test with the R-Pi due to traveling.

I think I now understand what’s going on. The cases I described above are only somewhat related and had different causes:

  1. I had an ancient Netgear AP connected via Ethernet to Omnia, which was replying with something weird to all DHCP requests, which confused Omnia’s switch until it reset after ~250 s. Turning the AP off fixed the issue.

  2. When the laptop reconnects from the secondary AP (different from the previous one) to Omnia, Omnia’s switch remembers its MAC and sends all the replies via Ethernet to the secondary AP, and those packets aren’t even visible to Omnia itself, as the forwarding happens in the hardware switch.

Question: is there any way to reduce that timeout or somehow reset it? Or even make all the packets on the switch go through Omnia’s CPU? (I understand there will be a performance penalty, but that’s a different topic.) Is there any way to control it at all?

I don’t have 100% proof of that yet, because I can’t sniff packets between the AP and Omnia, but I’ll try to do it in the future.
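
For instance, I was hoping something along these lines might flush the stale entry (purely a guess; the MAC is a placeholder and I don’t know whether DSA accepts deleting hardware-learned entries):

  bridge fdb del aa:bb:cc:dd:ee:ff dev lan1 self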


I spent some time reading all the info about DSA, and I think a short-term solution would be to set different VLANs on all the ports to force the switch to push all traffic through the CPU. Am I right?
Can I achieve that with the following commands?

  # activate VLAN filtering
  ip link set dev br0 type bridge vlan_filtering 1

  # give each LAN port its own VLAN so the switch cannot forward between ports directly
  bridge vlan add dev lan0 vid 10 pvid untagged
  bridge vlan add dev lan1 vid 11 pvid untagged
  bridge vlan add dev lan2 vid 12 pvid untagged
  bridge vlan add dev lan3 vid 13 pvid untagged
  bridge vlan add dev lan4 vid 14 pvid untagged

Also I think I need to switch from 4.0.5 to HBL/HBD, otherwise VLAN filtering won’t work. Right?
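
As a quick sanity check, I assume the running kernel should report the flag once it supports it (bridge name as in the commands above):

  ip -d link show dev br0 | grep -o 'vlan_filtering [01]'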

P.S. I got most of the info from DSA and 802.1Q tagging.

Are you suspecting that the switch’s hardware bridge has stale entries in its fdb that only clear after ~250 seconds?

Tried to find some sort of manual for the 88E6176-A1-TFJ2 switch chip but came up empty; the closest I came across is for the 88E6060.

Some general stuff

The normal packet flow involves learning how to switch packets only to the correct MACs. The switch learns which port an end station is connected to by remembering each packet’s Source Address along with the port number on which the packet arrived.
When a packet is directed to a new, currently unlearned MAC address, the packet is transmitted out of all of the ports except for the one on which it arrived. Once a MAC address/port number mapping is learned, all future packets directed to that end station’s MAC address (as defined in a frame’s Destination Address field) are directed to the learned port number only. This ensures that the packet is received by the correct end station (if it exists), and when the end station responds, its address is learned by the switch for the next series of packets.

Here is perhaps the interesting bit

The address database is stored in the embedded SRAM and has a default size of 1024 entries with a default aging time of about 300 seconds or 5 minutes. The size of the address database can be modified to a maximum of 256, 512, or 1024 entries. Decreasing the size of the address database increases the number of buffers available for frames. The age time can be modified in 16 second increments from 0 seconds (aging disabled) to 4080 seconds (or 68 minutes). These options are set in the ATU Control register

ATU Age Time. These bits determine the time that each ATU Entry remains
valid in the database, since its last access as a Source Address, before being
purged. The value in this register times 16 is the age time in seconds.
For example:
The default value of 0x13 is 19 decimal.
19 x 16 = 304 seconds or just over 5 minutes.
The minimum age time is 0x1 or 16 seconds.
The maximum age time is 0xFF or 4080 seconds or 68 minutes.
When the AgeTime is set to 0x0 the Aging function is disabled, and all
learned addresses will remain in the database forever.

Whilst the above pertains to the 88E6060 chip, the default (or programmed) ageing time for the 88E6176 might be ~250 seconds instead.

I suppose an issue should be opened at the TOS GitLab, asking:

  • for a tool enabling the user to change the ATU Age Time register, or
  • to program the switch chip with a shorter ATU Age Time than the ~ 250 seconds.
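
Alternatively, if the kernel’s DSA driver already offloads the bridge ageing time to the switch (an assumption for this kernel version), no new tool would be needed; something like this might program the ATU directly (bridge name assumed, value in centiseconds):

  ip link set dev br-lan type bridge ageing_time 1600   # 16 s, the chip's minimum per the 88E6060 doc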

n8v8r, thanks for the info! Setting the age time to 16 s would improve, but not solve, the issue.
AFAIK something is wrong with the switch, because it should update its fdb when receiving a packet with a known MAC address from an unexpected port, but it looks like it doesn’t.

My network:

Omnia <-lan4-> Server
      <-lan1-> Secondary AP < WiFi >
      < WiFi >

WiFi is set to the same SSID on both, so clients can easily roam between the secondary AP and Omnia.

Packet flow, as far as I understand it:

The laptop, connected to the secondary AP, sends packets to the server. Omnia’s switch learns that the laptop’s MAC is on lan1 and the server’s MAC is on lan4, and delivers the packets without even showing them to Omnia’s CPU. When the laptop reconnects to Omnia’s WiFi directly and tries to send a packet to the server (ARP first, since the laptop’s ARP table is flushed on reconnect), the packet gets delivered correctly. I expected the switch to learn at this point that the laptop’s MAC is now on eth0 (port 6) instead of lan1, but that doesn’t happen. So the server’s ARP reply goes to lan1 and never reaches the laptop.
Omnia doesn’t see that ARP reply either, as it never reaches the CPU.

And this continues until aging happens.

Changing the age time from ~250 s to 16 s would help a bit, but I think it would still cause broken TCP sessions. At this point I’d prefer all the traffic to go through the CPU, to get seamless migration between the WiFi APs.
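
If anyone wants to verify that half of the theory, sniffing ARP on the server should show the request arriving and the reply being sent even though the laptop never receives it (interface name is an example):

  tcpdump -eni eth0 arp   # -e prints MAC addresses, -n skips name resolution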

I am wondering whether there is an fdb bug in the switch’s firmware. I just tested with

|-> lan0 → AP1
|-> lan1 → AP2

and had an Android client connect first to AP1 and then move to AP2. I ran a network analyser app on the client but could not reproduce the issue: I got immediate ARP responses from all clients connected on AP1 and was able to reach the web interface of AP1 as well, without any such 250 s delay.

Maybe this issue is particular to a certain hardware batch only. Debugging it would require access to the switch’s fdb, but I would not know how to gain such access.
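
Unless iproute2 can dump the hardware entries per port on DSA kernels; untested, and whether the mv88e6xxx driver supports the dump on this kernel is an assumption:

  bridge fdb show dev lan1   # switch-learned entries would show up flagged 'self'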

I am not sure that is feasible with DSA, since packets designated to remain within the boundary of the switch’s hardware bridge should not escape to the CPU and then be rerouted back to the switch (basically defeating the purpose of the switch), which might be clearer from [3]:

DSA currently supports 5 different tagging protocols, and a tag-less mode as well. The different protocols are implemented in:

net/dsa/tag_trailer.c: Marvell’s 4 trailer tag mode (legacy)
net/dsa/tag_dsa.c: Marvell’s original DSA tag
net/dsa/tag_edsa.c: Marvell’s enhanced DSA tag
net/dsa/tag_brcm.c: Broadcom’s 4 bytes tag
net/dsa/tag_qca.c: Qualcomm’s 2 bytes tag

The exact format of the tag protocol is vendor specific, but in general, they all contain something which:

identifies which port the Ethernet frame came from/should be sent to
provides a reason why this frame was forwarded to the management interface

[3] Architecture — The Linux Kernel documentation

What is not so clear to me is whether this happens when moving a client between the TO’s LAN ports (direct wire or wired external AP/extender should not matter) or when moving a client between the TO’s own WLAN and LAN port(s)?

If the latter, it should be kept in mind that the TO’s own WLAN connects via PCIe (bridge chip) to the CPU and is then potentially slaved into a software bridge together with a DSA port. In that case the switch’s fdb would have no information about the client connecting to the TO via its WLAN, unless the same client had previously been connected to a LAN port and the switch’s fdb thus still holds the client’s MAC source/destination for those 250-300 seconds.
In this light it would be an expected (logical) outcome and not a bug.

This happens only when a client moves from a LAN port to the WLAN. But I still think the switch should learn that the client has moved to a CPU port when it receives the first packet from it. The client is sending traffic to the server, which is connected to the switch, so the switch should learn that the client moved from port 1 to port 5 or 6 (one of the CPU ports).

If it’s the server trying to reach the client after the move, then it’s understandable that it can’t find it until the new location is learnt by the switch.

Is my understanding correct?

Does the switch treat all the ports equally?

Then the max. 250 - 300 s lag can be expected, as explained previously.

The fdb of any bridge, hardware or software, is by design only fed (learning) from the downstream ports, not the upstream ports.

The switch’s upstream CPU-facing ports (5 & 6 in this case) are:

  • not DSA ports (check with ethtool -i eth0 / ethtool -i eth1)
  • not part of the hardware bridge
  • not learning ports that feed into the switch’s fdb

The software bridge that eventually enslaves the WLAN port and the DSA port has its own fdb, which the switch knows nothing about, and equally the software bridge is unaware of the switch’s fdb.
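
Which can be illustrated by dumping both tables side by side (names assumed):

  bridge fdb show br br-lan   # the software bridge's fdb
  bridge fdb show dev lan0    # per-port view; switch entries appear flagged 'self'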

Not sure whether creating a multicast group (bridge mdb) could be a potential workaround.

AFAIK that is the case. Some switches provide QoS but that is then an upper protocol layer.

I tried to understand how to set that up, but setting VLANs was easier. Also I’m afraid that sending all packets as multicast would create too much unneeded load.

Looks like I managed to fix the issue by pushing all the traffic through the CPU with the following commands:

bridge vlan add dev lan0 vid 10 pvid untagged
bridge vlan add dev lan1 vid 11 pvid untagged
bridge vlan add dev lan2 vid 12 pvid untagged
bridge vlan add dev lan3 vid 13 pvid untagged
bridge vlan add dev lan4 vid 14 pvid untagged
bridge vlan del dev lan0 vid 1               
bridge vlan del dev lan1 vid 1               
bridge vlan del dev lan2 vid 1               
bridge vlan del dev lan3 vid 1               
bridge vlan del dev lan4 vid 1  

So bridge vlan now looks like this:

root@ap:~# bridge v                          
port    vlan ids                             
lan0     10 PVID Egress Untagged             
lan1     11 PVID Egress Untagged             
lan2     12 PVID Egress Untagged             
lan3     13 PVID Egress Untagged                  
lan4     14 PVID Egress Untagged             
br-guest_turris  1 PVID Egress Untagged      
br-lan   1 PVID Egress Untagged              
wlan1    1 PVID Egress Untagged              
wlan0    1 PVID Egress Untagged         
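
To keep this across reboots I plan to simply replay the commands at boot, e.g. from /etc/rc.local (there may well be a cleaner UCI way):

  # appended to /etc/rc.local, before 'exit 0'
  bridge vlan add dev lan0 vid 10 pvid untagged
  bridge vlan del dev lan0 vid 1
  # ...and likewise for lan1-lan4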

BTW, I also found out that the Asus RT-AC68U with the original Asus firmware has a similar bug. I hope I’ll be able to fix it in a similar way after flashing it with OpenWrt.

Thank you for all the advice!


That bit is new to me and surprising (not least because I do not use the node’s WLAN). But what is interesting is that it should render the creation of a software bridge that enslaves WLAN and LAN ports obsolete. E.g. say one wants lan0 and wlan0 to be members of the same VLAN; it should presumably work with:

bridge vlan del dev lan0 vid 1                  # drop the default VLAN from both ports
bridge vlan del dev wlan0 vid 1
bridge vlan add dev lan0 vid 10 pvid untagged   # make them members of the same VLAN
bridge vlan add dev wlan0 vid 10 pvid untagged
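
The resulting membership could then be checked with:

  bridge vlan show   # lan0 and wlan0 should both list vid 10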

I am not sure whether, configured this way, traffic designated to remain within the boundaries of the switch’s hardware bridge (lan0-4) actually reaches the CPU and gets redirected back to the switch.

It works; all the traffic that was previously forwarded by the switch directly is now forwarded via the CPU. Latency almost doubled, but I don’t have 5-minute delays any more.

Pings between two nodes connected via Ethernet directly to Omnia.
Before:
rtt min/avg/max/mdev = 0.221/0.351/3.798/0.087 ms
After:
rtt min/avg/max/mdev = 0.397/0.579/10.699/0.083 ms

And now I can tcpdump all that traffic on Omnia.