Turris OS 3.10: DHCPv6 loses address after 1 hour

BearPerson · August 25, 2018, 12:22am

I fear that it’s not that simple. noserverunicast=1 is correct, but there also seem to be periods where the dhcpv6 server is legitimately down/unreachable/overloaded for a few hours, causing leases to tick down. At least, my tcpdumps are seeing renews and sometimes eventually rebinds echo into the void for a while every so often. Might be more instances of the T1=0 bug pounding the servers into mush. I am seeing odd bursts of replies come back when it recovers, which would lean towards an overloaded server of some kind.
However, on my end the last such extended outage was on Wednesday; things have been pretty solid since.

In theory, short outages shouldn’t cause problems, though - the assignments are coming back with preferred time 1h, valid time 24h, so the network should keep functioning for a day even after losing all contact with the dhcpv6 server.
That isn’t what we observed, though, which is strange: we’ve seen issues an hour after the last proper reply.
After a >1h outage, my logs say:

2018-08-22T09:19:11+02:00 warning odhcpd[2110]: A default route is present but there is no public prefix on br-lan thus we don't announce a default route!

(Beware: the timestamps in this log seem inconsistent; this is most likely actually 11:19 +0200.)
Around the same time, the DHCPv6 script hook I set up shows that dhcpv6 is still faithfully reporting a prefix, with the preferred time expired but the valid time still running:

========== DHCPv6 eth1 ra-updated ==========
Wed Aug 22 11:19:10 CEST 2018
RA_ROUTES=::/0,fe80::523d:e5ff:fe16:55ff,1799,512 2a02:168:2000:xxxx::/64,,2591999,256
PREFIXES=2a02:168:xxxx::/48,0,82046
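
As a rough illustration of how to read that PREFIXES line, here is a small shell sketch. It assumes each entry has the form "prefix,preferred,valid" with lifetimes in seconds, which is inferred from the output above and not checked against the odhcp6c documentation; classify_prefix is a made-up helper name.

```sh
#!/bin/sh
# Classify one PREFIXES entry of the assumed form "prefix,preferred,valid".
# classify_prefix is a hypothetical helper, not part of odhcp6c or netifd.
classify_prefix() {
    entry=$1
    old_ifs=$IFS
    IFS=,
    # Split the entry on commas into positional parameters.
    set -- $entry
    IFS=$old_ifs
    prefix=$1
    preferred=$2
    valid=$3
    if [ "$valid" -le 0 ]; then
        echo "$prefix expired"
    elif [ "$preferred" -le 0 ]; then
        # Deprecated: no longer preferred, but still usable until valid runs out.
        echo "$prefix deprecated (still valid for ${valid}s)"
    else
        echo "$prefix preferred"
    fi
}

classify_prefix "2a02:168:xxxx::/48,0,82046"
```

For the entry logged above this reports the prefix as deprecated but still valid, i.e. it should remain usable for roughly another 22 hours.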

Very strange. Unfortunately, this is probably going to send us into the bowels of netifd and odhcpd.
Either netifd is prematurely deassigning the address, or odhcpd’s detection of whether there is a valid prefix is broken. Ultimately, there’s a bug where something is effectively enforcing preferred lifetimes as if they were validity lifetimes, which is just all sorts of wrong.

BearPerson · August 27, 2018, 1:05am

Okay, the issue of premature expiration is in odhcpd. The version shipped in turris is OLD:

root@turris:~# opkg info odhcpd
Package: odhcpd
Version: 2015-05-21-2ebf6c8216287983779c8ec6597d30893b914a7c

The code at https://github.com/openwrt/odhcpd/blob/2ebf6c8216287983779c8ec6597d30893b914a7c/src/router.c#L294
specifically (and incorrectly) ignores non-preferred addresses when determining if we seem to have a valid prefix.
The first version that correctly falls back to non-preferred addresses when no preferred ones are available is likely https://github.com/openwrt/odhcpd/commit/01bfec4c333d906ca4d2230c804dfe361779f42f#diff-117f402862ed1a3c5e8cbc4d78e409bb from 2015-07-14, but we probably just want a newer version - there have been a lot of bugfixes since.
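
To make the difference concrete, here is a hedged sketch of the two behaviours in shell. This is not odhcpd's actual code (odhcpd is written in C and structured differently); the helper names are hypothetical, and each argument stands for one address on the interface as "preferred,valid" lifetimes in seconds.

```sh
#!/bin/sh
# Hypothetical helpers illustrating the logic only; not taken from odhcpd.
# Each argument is "preferred,valid" (seconds) for one address.

# Old behaviour as described above: only preferred addresses count.
has_prefix_strict() {
    for lt in "$@"; do
        [ "${lt%%,*}" -gt 0 ] && return 0
    done
    return 1
}

# Fixed behaviour: fall back to still-valid addresses if none is preferred.
has_prefix_fallback() {
    has_prefix_strict "$@" && return 0
    for lt in "$@"; do
        [ "${lt##*,}" -gt 0 ] && return 0
    done
    return 1
}

# Lifetimes from the hook output: preferred expired, valid still running.
if has_prefix_strict "0,82046"; then
    echo "strict: announce default route"
else
    echo "strict: withdraw default route"
fi
if has_prefix_fallback "0,82046"; then
    echo "fallback: announce default route"
else
    echo "fallback: withdraw default route"
fi
```

With the lifetimes seen during the outage, the strict check withdraws the default route while the fallback check keeps announcing it, which matches the symptom of losing IPv6 after one hour instead of after 24.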

The package at https://github.com/CZ-NIC/turris-os/blob/stable/package/network/services/odhcpd/Makefile
has been pinned at that version for a long time (the lede-merge branch has a 2017 version, but I don’t see any of the other branches using that).
Meanwhile https://github.com/openwrt/openwrt/blob/master/package/network/services/odhcpd/Makefile has been tracking actual odhcpd head for a while.

I might do some fiddling later to see if I can dig up a newer build somewhere from openwrt and see how much blows up when I yank it into my system…

BearPerson · August 27, 2018, 2:23am

@Init7 - as a workaround for the occasional dhcpv6 server downtime, can you guys configure your dhcpv6 server to provide a longer pltime when handing out prefixes? So long as we make sure pltime is longer than the longest dhcp server downtime, this should fix the temporary route drops we currently see on turris omnia during server outages. I’d probably start with 43200 (12h) as a reasonable setting so long as we don’t expect the prefixes to change frequently/soon, but you may have reason to set it otherwise. The current setting of 3600 is a bit short given the odhcpd bug here.

Of course, if rather than changing the config you’d rather finish reworking the setup so dhcpv6 doesn’t go down to begin with, that’d also work 8)

@others: On the client side, we can work around this via option ra_default '1' inside the config dhcp 'lan' section of /etc/config/dhcp - this will force odhcpd to keep sending default-router announcements even if it doesn’t believe there is a public prefix, so long as there’s a default route on the wan interface.
BEWARE: There’s a significant risk with that setting of stranding devices on your lan so they think they have IPv6 connectivity even when they do not. In particular, there’s a scenario where if your router restarts while there’s a dhcpv6 server outage, it’ll still pick up a default route on wan from stateless router advertisements (which seem to work quite reliably), but it won’t have the prefix info on lan anymore, and likely fail to forward traffic correctly.
So the tradeoff here is that ra_default=1 will attempt IPv6 in more corner cases but may cause actual breakage, whereas the default behavior (ra_default=0) actively brings down IPv6 when things seem weird, to ensure clients don’t even try using it.
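
For reference, a sketch of what that workaround looks like as a config fragment. On stock OpenWrt the config dhcp 'lan' section lives in /etc/config/dhcp; this assumes Turris OS follows the same layout, and that the other options already in the section are left untouched.

```
config dhcp 'lan'
	option ra_default '1'
```

odhcpd presumably needs a restart (for example via /etc/init.d/odhcpd restart) before the change takes effect.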

bdeblier · August 27, 2018, 7:16am

My 2 cents worth of observation: I no longer lose the IPv6 wan address after 12 hours, but the default route disappears.

bdeblier · August 27, 2018, 7:26am

Correction: IPv6 address has disappeared again, though it now took longer than 12 hours. Can anybody tell me how to enable logging for odhcpd?

BearPerson · August 27, 2018, 1:29pm

You can add logging for odhcp6c with -v on the command line, which causes it to log some information to syslog, mostly whenever it begins a new command transaction (every 20min+ usually). Unfortunately there’s currently no config option for this; you need to edit /lib/netifd/proto/dhcpv6.sh directly and change the odhcp6c command line there, which means the change gets clobbered by Turris updates. (It would probably be nice to have a uci variable for this.)

The odhcpd version we are running predates support for configurable log verbosity, unfortunately - you get what you get. Later versions would allow you to configure log verbosity either on the commandline or via dhcp.odhcpd.loglevel in uci.
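
For later odhcpd versions, the uci setting would look roughly like the fragment below. The section name follows the dhcp.odhcpd.loglevel path mentioned above; the value 7 assumes the option takes a syslog priority (0-7, with 7 being debug), so treat it as an assumption and check the odhcpd README for the version you end up running.

```
config odhcpd 'odhcpd'
	option loglevel '7'
```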

dns2utf8 · June 25, 2019, 3:10pm

This problem is very old now but still persists for me with the latest firmware from yesterday.
Is there any news?

yorik · June 25, 2019, 3:46pm

I also have that problem right now.

maetthu · June 25, 2019, 4:01pm

Same here. It was stable for quite some time - until now. It’s been almost a year since the last update on init7’s status page - apparently there’s still no fixed DHCPv6 service yet.

bdeblier · June 26, 2019, 12:33pm

Has anybody tested if this problem is fixed in Turris OS 4.0 beta?

ivanek · June 26, 2019, 1:42pm

For me IPv6 works (I am connected via Init7, Switzerland). I have the option

option noserverunicast '1'

in the section wan6 in the /etc/config/network. I am using Turris 4.0 (beta 4).

yorik · June 26, 2019, 1:55pm

I have the same setting but yesterday I lost IPv6 connection for half a day. Now it works again.

bdeblier · June 26, 2019, 1:58pm

Cool - I’ll attempt an upgrade then at some point in the near future

kollerq · June 28, 2019, 1:59pm

I also haven’t had an IPv6 connection with Init7 (Switzerland) for some time now. Unfortunately I don’t know exactly since when. Yesterday it still worked. Now I will just wait and see. (Turris Omnia 3.11.5)

simon · December 14, 2019, 2:17pm

Sorry for bringing up this old post. What is the current status on this issue?
Can anyone confirm that static ipv6 prefix delegation is now working reliably with init7 and turris?

maetthu · December 15, 2019, 3:23pm

It has been on and off for a very long time for me, although my last issues getting a working prefix were about 2 months ago and it has been stable since. But the init7 support also told me back then that until they replace their DHCP servers, there isn’t really a workaround if there are issues (apart from noserverunicast, which didn’t help in my case). Apparently, rollout of new DHCP infrastructure has begun this month, so there’s finally hope for a more permanent solution: https://as13030.net/status.php?ticket=14283

simon · December 16, 2019, 3:01pm

Thanks. I’ve found the mail from init7; they did indeed update the dhcpv6 infrastructure very recently. It has been running fine for 48 hours now. With pfSense I never had issues in the past 1.5 years, but unfortunately it didn’t reach the full bandwidth.