Solution found for WiFi client disconnects

Since many of you have problems with clients “randomly” disconnecting from WiFi, I decided to share my solution.

In my case some Android devices were disconnecting a few minutes after the display turned off.
You have probably already tried the disassoc_low_ack option recommended on some forums, but for me it made no sense because my log did not show anything like “disconnected due to excessive missing ACKs”.

In my case the message was “deauthenticated due to local deauth request”.
To find the reason for your own disconnects, just keep this running:
tail -f -n +0 /tmp/log/messages | grep 'hostapd\|ath:\|ath10\|wlan'
so you can immediately see all WiFi-related events (for ath9k or another driver, update the command accordingly).
What I saw is that WPA group key rekeying is done every 10 minutes (600 seconds) (info hostapd[]: wlan0: STA xx:xx:xx:xx:xx:xx WPA: pairwise key handshake completed (RSN)).
Right after this the Android device disconnected, and a hostapd message like “deauthenticated due to local deauth request” appeared in the log.
I think this is related to some device power save / standby mode: the devices simply did not receive or process the rekeying.
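
To see which disconnect reason dominates in a saved log, a small helper along these lines may be useful. It is only a sketch: the function name count_disconnect_reasons is made up for this example, and it assumes the hostapd messages use the wording quoted in this post (“deauthenticated due to …”, “disconnected due to …”) and that the log is at /tmp/log/messages unless another file is given.

```shell
# Count how often each hostapd disconnect reason appears in a log file.
# $1: log file to scan (assumed default: /tmp/log/messages)
count_disconnect_reasons() {
    grep -Eo '(deauthenticated|disassociated|disconnected) due to .*' "${1:-/tmp/log/messages}" \
        | sort | uniq -c | sort -rn
}
```

Calling count_disconnect_reasons with no argument scans the default log; the most frequent reason is printed first. If “local deauth request” dominates rather than “excessive missing ACKs”, disassoc_low_ack is unlikely to be the relevant knob.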

Because my setting is WPA2 PSK/CCMP, I changed the rekeying interval from 600 s (10 minutes) to 86400 s (24 hours), which the hostapd documentation gives as the default for CCMP (http://w1.fi/cgit/hostap/plain/hostapd/hostapd.conf):

# Time interval for rekeying GTK (broadcast/multicast encryption keys) in seconds. (dot11RSNAConfigGroupRekeyTime)
# This defaults to 86400 seconds (once per day) when using CCMP/GCMP as the group cipher and 600 seconds (once per 10 minutes) when using TKIP as the group cipher.
wpa_group_rekey=86400

So in /etc/config/wireless update all required interfaces:

config wifi-iface
   ...
   option wpa_group_rekey '86400'
   ...

After wifi reload, check /tmp/run/hostapd-phy0.conf (or phy1) to confirm that the interfaces were updated with wpa_group_rekey=86400.
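
If you prefer the command line over editing the file by hand, the same change can be sketched with uci. The function names set_group_rekey and show_group_rekey are made up for this example, and the loop assumes your interfaces are anonymous wifi-iface sections that can be addressed as @wifi-iface[0], @wifi-iface[1] and so on; adapt it if your sections are named.

```shell
# Set wpa_group_rekey on every wifi-iface section, then reload WiFi.
# $1: rekey interval in seconds
set_group_rekey() {
    interval="$1"
    i=0
    # "uci -q get" fails once the index runs past the last section
    while uci -q get "wireless.@wifi-iface[$i]" >/dev/null; do
        uci set "wireless.@wifi-iface[$i].wpa_group_rekey=$interval"
        i=$((i + 1))
    done
    uci commit wireless
    wifi reload
}

# Print the value found in the generated hostapd config files.
# $@: one or more hostapd config files
show_group_rekey() {
    grep -H '^wpa_group_rekey=' "$@"
}
```

Typical use would be set_group_rekey 86400 followed by show_group_rekey /tmp/run/hostapd-phy*.conf, which should print one wpa_group_rekey=86400 line per config file.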

I don’t know whether this addresses the problem directly or is just a workaround or temporary fix, but it works great.

Just to mention: I was also playing with other options like skip_inactivity_poll and max_inactivity, which seem very useful for tuning to a specific environment.

Leave a comment if this works for you or if you have found another fix for the disconnect problem…

It is not a problem for me now, as I replaced the USB dongles with a PCI card, but before, with WiFi on USB dongles, it was a REALLY ANNOYING issue. I will update my wireless config accordingly, and if this is proven to correct the unwanted behaviour, maybe it should be retrofitted into the Turris default configuration.

I can confirm. I modified the same value months ago and it was a great stability improvement for some clients (Cyanogen Dashboards); in my config I do the rekeying at 2400.

There is no such default configuration. If I understand the change correctly, it has to be made somewhere inside LuCI. We don’t maintain LuCI. I would suggest you contact upstream about this (and before that, check whether they have fixed it already; we have an old version of LuCI because of dependencies and the old OpenWrt tree).

I think many Omnia BFUs consider this issue to be an Omnia problem. They don’t care about the upstream approach; they just want plug-and-forget functionality.
I think the Omnia team should take responsibility for fixing this issue, either with its own workaround or by contacting upstream, because, as I said, customers expect 100% functionality out of the box, and these “upstream” problems damage the name of Omnia as a product (“don’t buy Omnia because this and this is not working”). At the very least, elementary functions like WiFi on the default radios deserve more care.

For some environments the change can be a security issue. My personal view is that for most environments you could modify the value slightly.

Quote from “Real 802.11 Security: Wi-Fi Protected Access and 802.11i:”

"If a mobile device chooses to leave the Wi-Fi LAN, it should notify the access point by sending an IEEE 802.11 disassociate message. When it does this, the access point erases the copy of the pairwise keys for the departing mobile device and stops sending it messages. If the device wants to rejoin later, it must go through the whole key establishment phase from scratch. But what about the group key? Even though the device has left the network, it can still receive and decrypt the multicasts that are sent because it still has a valid group key available. This is not acceptable from a security standpoint; if a device leaves the network, it should no longer be allowed any access at all.

The solution to this problem is to change the group key when a device leaves the network. This is a bit like changing the locks on your house after a long-term guest leaves; you don’t want anyone to have a door key who is not living in your house. So group keys have an added complication: the need to rekey."

So changing the rekey time opens this (possible) gap. Please consider.

You’re right. Anyway, for a home environment it’s OK, I think. I will try to find the minimum rekeying period that works for devices in standby.
On the other hand, I have no problem if a device that was once authorized keeps listening to multicast.
In the real world I see no risk in that, and no reason for anyone to do it: connect, then disconnect, never connect again, and only listen to multicast.
If that were a real security risk, there would be many others that would stop me from using WiFi at all.

BTW, I think this option is related:

# Rekey GTK when any STA that possesses the current GTK is leaving the BSS.
# (dot11RSNAConfigGroupRekeyStrict)
wpa_strict_rekey=1

So with this option enabled, the rekeying period doesn’t matter so much, because once a device disconnects the remaining devices should be rekeyed.

Well, I understand this point. But now I suspect that I have seen such disconnects on a PCIe WiFi card as well: with a MacBook that had been running for a couple of days and was in sleep mode, and with an Android phone. So I am going to update my wireless configuration accordingly. But since this is not even manageable in LuCI, not everyone will put as much effort and knowledge into investigating this issue as @blbeczech82 did, and so others blame, guess what, the whole device itself. There should be a troubleshooting page in the official documentation where this is mentioned.

I wasn’t saying anything against that. I would even suggest creating it in the user section of the documentation, because that way anyone can update it. But I don’t think it’s ready to be added to the troubleshooting page yet; more users have to test it first.

In an ideal world we would. But in short, we really have more pressing issues on our hands. You can create an issue for us on our GitLab (gitlab.labs.nic.cz), but I can’t promise that someone will take care of it immediately (note also that Foris allows only WPA2). So yes, you can leave it to us: we would contact upstream, maybe in a few months we would sort it out with them, upstream would fix it, and we would pull in the patch. But it would be low priority for us at the moment, as we have bigger fish to fry. So if you really want it fixed, contact upstream directly (you can still create an issue so we know the problem exists). Not only will you have a better chance of having it resolved in just a few weeks (or even days), you will also help upstream, not only us. I am not saying “don’t bother us with it”; I am just saying that you should contact upstream, as this would fall pretty deep in our priority list.

I did some checking myself, and from what I can conclude it’s not a problem with LuCI; it’s in hostapd. It’s true that wpa_group_rekey is set to 86400 by default in hostapd, but only in more recent versions. The version of hostapd used in Turris is too old and doesn’t contain this change.
Specifically, it’s this commit:
https://w1.fi/cgit/hostap/commit/src/ap/ap_config.c?id=90f837b0bfb26f9c26111fef39199190b9f820f2

It may be a pain to get the most recent version of hostapd to work on the Turris, but it should be fairly easy to cherry-pick just that commit and backport it. It does depend on a new struct member, wpa_group_rekey_set, which makes it binary-incompatible, so that will need to be worked around somehow.

I have created a patch that should work, but I can’t be 100% certain:
https://drive.google.com/open?id=1iD6fVua-z7XloDrFL5e_4vvENx9ftmzB

Thank you for the great work tracking down the cause. We will try to apply that patch and see.

I have already switched to the RC branch and made option wpa_group_rekey '86400' part of my wireless config file, as I experienced those issues too.

There is a default configuration, created by Foris. For some reason it enables TKIP on WPA2, which I think doesn’t make much sense and may actually be the reason for this: TKIP uses 10-minute rekeying by default, while CCMP uses 24 hours.
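
To check whether your running configuration still allows TKIP, you could grep the generated hostapd config. This is a rough sketch: the function names allows_tkip and report_ciphers are made up for this example, and it assumes the ciphers show up in wpa_pairwise= or rsn_pairwise= lines, which are the option names in the hostapd.conf linked earlier in the thread. In /etc/config/wireless the cipher is normally chosen through the encryption option (for example psk2+ccmp versus psk2+tkip+ccmp).

```shell
# Succeeds (exit status 0) if the given hostapd config allows TKIP
# as a pairwise cipher.
allows_tkip() {
    grep -Eq '^(wpa|rsn)_pairwise=.*TKIP' "$1"
}

# Print one status line per hostapd config file given as an argument.
report_ciphers() {
    for conf in "$@"; do
        if allows_tkip "$conf"; then
            echo "$conf: TKIP enabled"
        else
            echo "$conf: no TKIP"
        fi
    done
}
```

For example, report_ciphers /tmp/run/hostapd-phy*.conf prints one line per radio. A “TKIP enabled” line would be consistent with the 600-second rekeying described at the start of this thread.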

Anyway, I’ve just created a pull request for Foris to adjust these settings: https://github.com/CZ-NIC/foris/pull/16

Can you please add one more thing on GitHub? I don’t have my GitHub login at hand. In case the team decides to keep TKIP for some reason: macOS also disconnects when someone tries to hack into your network. AFAIK this behavior is not the same in Windows. Because of that, I manually change to CCMP. But when I then want to set a new password in Foris, I have to go back to LuCI and change from TKIP + CCMP to CCMP again.

Here you can see the macOS dialog that appears: “Qualcomm Atheros QCA9880 802.11bgnac (radio0) not stable”

My pull request changes Foris to use CCMP only, so that should fix your case as well.

I don’t think there is any good reason for the original TKIP choice. It has been there since the initial commit with a “TODO: find in docs” comment, which was later removed without any explanation in an unrelated commit.

IIRC I put the TODO there to check if it’s really the correct option - the goal was to support both ciphers for best device compatibility. Yet it was more than four years ago and dropping TKIP today is more than reasonable if we consider the increased implicit security.

Yeah, I see.
I’m sorry, my ear has been in a bad mood since New Year’s Eve, and so have I.
Thank you anyway.

Hmmm…I tried this but it doesn’t seem to be working. I still get random disconnects every so often. And the most frustrating part is that the logs don’t offer any meaningful explanation of what’s happening.

These are the only pertinent entries I can find in the system log (the entries before that are entirely unrelated to WLAN or DHCP etc):

2018-01-10 17:31:06 info hostapd[]: wlan1: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated
2018-01-10 17:31:06 info hostapd[]: wlan1: STA xx:xx:xx:xx:xx:xx IEEE 802.11: associated (aid 9)
2018-01-10 17:31:06 info hostapd[]: wlan1: STA xx:xx:xx:xx:xx:xx RADIUS: starting accounting session 41CC6D591F90C4F2
2018-01-10 17:31:06 info hostapd[]: wlan1: STA xx:xx:xx:xx:xx:xx WPA: pairwise key handshake completed (RSN)
2018-01-10 17:31:06 info dnsmasq-dhcp[14004]: DHCPREQUEST(br-lan) 192.168.1.10x xx:xx:xx:xx:xx:xx 
2018-01-10 17:31:06 info dnsmasq-dhcp[14004]: DHCPACK(br-lan) 192.168.1.10x xx:xx:xx:xx:xx:xx MyDevice

Why is my device re-authenticating midway through usage?

What is your disconnect period?
What device?
Is there any link to device usage?
Is it one specific client or more?
What is DHCP lease time?
Do you have static IP?
Is there any link between disconnect and router operation?
Is it 5 GHz, 2.4 GHz or both?
…?

As in how long the disconnect lasted for? Probably 60-90 seconds at most I think?

Surface Pro 3. The disconnect happened mid-way through usage (i.e. it wasn’t in hibernation or anything like that).

Just this particular device on this occasion.

The default 12 hours.

Yes.

Not that I’m aware of. The router was (and is) running as usual. No link that I can see.

The SP3 is connected via 2.4 GHz.