Turris OS 3.10: DHCPv6 loses address after 1 hour

bdeblier · August 17, 2018, 7:06am

No change in 3.10.4. The WAN IPv6 address has gone AWOL again.

Pepe · August 17, 2018, 7:19am

Will check that and will get back to you.

yorik · August 17, 2018, 8:41am

After yesterday’s update my router started sending DHCPv6 renew messages in a loop, so Init7 had to turn IPv6 off for me to protect their DHCP server. Is there any way to fix it without compiling and installing custom packages?

# tcpdump -n -i eth1 -vv '(udp port 546 or 547) or icmp6'
10:21:22.126283 IP6 (flowlabel 0x9ed25, hlim 64, next-header UDP (17) payload length: 184) 2a02:168:2000:9:da58:d7ff:fe00:50a4.546 > 2001:1620:2777:19:1::9.547: [udp sum ok] dhcp6 renew (xid=71aa0d (elapsed-time 100) (option-request SIP-servers-domain SIP-servers-address DNS-server DNS-search-list server-unicast SNTP-servers NTP-server AFTR-Name opt_67 opt_82 opt_83 opt_94 opt_95 opt_96) (client-ID hwaddr type 1 d858d70050a4) (server-ID hwaddr/time type 1 time 574856739 525400faac14) (Client-FQDN) (IA_NA IAID:1 T1:0 T2:0 (IA_ADDR 2a02:168:2000:9:a2d3:42ec:f1e:49f5 pltime:0 vltime:0)) (IA_PD IAID:1 T1:0 T2:0 (IA_PD-prefix 2a02:168:1234::/48 pltime:0 vltime:0)))
10:21:22.127273 IP6 (flowlabel 0x6ef01, hlim 1, next-header UDP (17) payload length: 166) fe80::da58:d7ff:fe00:50a4.546 > ff02::1:2.547: [udp sum ok] dhcp6 rebind (xid=d95e41 (elapsed-time 0) (option-request SIP-servers-domain SIP-servers-address DNS-server DNS-search-list server-unicast SNTP-servers NTP-server AFTR-Name opt_67 opt_82 opt_83 opt_94 opt_95 opt_96) (client-ID hwaddr type 1 d858d70050a4) (Client-FQDN) (IA_NA IAID:1 T1:0 T2:0 (IA_ADDR 2a02:168:2000:9:a2d3:42ec:f1e:49f5 pltime:0 vltime:0)) (IA_PD IAID:1 T1:0 T2:0 (IA_PD-prefix 2a02:168:1234::/48 pltime:0 vltime:0)))
10:21:22.327224 IP6 (class 0xe0, hlim 255, next-header UDP (17) payload length: 198) fe80::ca9c:1dff:fe93:343f.547 > fe80::da58:d7ff:fe00:50a4.546: [udp sum ok] dhcp6 reply (xid=d95e41 (IA_NA IAID:1 T1:1200 T2:1800 (IA_ADDR 2a02:168:2000:9:a2d3:42ec:f1e:49f5 pltime:3600 vltime:86400) (status-code Success)) (IA_PD IAID:1 T1:0 T2:0) (server-ID hwaddr/time type 1 time 574856739 525400faac14) (client-ID hwaddr type 1 d858d70050a4) (preference 0) (server-unicast) (DNS-server 2001:1620:2777:1::10 2001:1620:2777:2::20))
10:21:22.328710 IP6 (flowlabel 0x9ed25, hlim 64, next-header UDP (17) payload length: 184) 2a02:168:2000:9:da58:d7ff:fe00:50a4.546 > 2001:1620:2777:19:1::9.547: [udp sum ok] dhcp6 renew (xid=8d7ca2 (elapsed-time 0) (option-request SIP-servers-domain SIP-servers-address DNS-server DNS-search-list server-unicast SNTP-servers NTP-server AFTR-Name opt_67 opt_82 opt_83 opt_94 opt_95 opt_96) (client-ID hwaddr type 1 d858d70050a4) (server-ID hwaddr/time type 1 time 574856739 525400faac14) (Client-FQDN) (IA_NA IAID:1 T1:0 T2:0 (IA_ADDR 2a02:168:2000:9:a2d3:42ec:f1e:49f5 pltime:0 vltime:0)) (IA_PD IAID:1 T1:0 T2:0 (IA_PD-prefix 2a02:168:1234::/48 pltime:0 vltime:0)))
10:21:22.812365 IP6 (flowlabel 0x29b9e, hlim 64, next-header ICMPv6 (58) payload length: 113) 2a02:168:2000:9:da58:d7ff:fe00:50a4 > 2001:1620:2777:1::10: [icmp6 sum ok] ICMP6, destination unreachable, unreachable port, 2a02:168:2000:9:da58:d7ff:fe00:50a4 udp port 46499
10:21:23.336383 IP6 (flowlabel 0x9ed25, hlim 64, next-header UDP (17) payload length: 184) 2a02:168:2000:9:da58:d7ff:fe00:50a4.546 > 2001:1620:2777:19:1::9.547: [udp sum ok] dhcp6 renew (xid=8d7ca2 (elapsed-time 100) (option-request SIP-servers-domain SIP-servers-address DNS-server DNS-search-list server-unicast SNTP-servers NTP-server AFTR-Name opt_67 opt_82 opt_83 opt_94 opt_95 opt_96) (client-ID hwaddr type 1 d858d70050a4) (server-ID hwaddr/time type 1 time 574856739 525400faac14) (Client-FQDN) (IA_NA IAID:1 T1:0 T2:0 (IA_ADDR 2a02:168:2000:9:a2d3:42ec:f1e:49f5 pltime:0 vltime:0)) (IA_PD IAID:1 T1:0 T2:0 (IA_PD-prefix 2a02:168:1234::/48 pltime:0 vltime:0)))

Init7 · August 17, 2018, 9:38am

Hello everyone

We are Init7 and have been following this topic for some time, thanks to Michael (@sECuRE).

As the project to harmonize our DHCP infrastructure will take some time, we are interested in a temporary solution for our customers. Since yesterday’s update, the situation has worsened dramatically and we have seen numerous Turris Omnia routers flooding our servers with hundreds of thousands of requests. For this reason, we have already had to deactivate IPv6 for a large number of customers.

We are aware of the problem, but unfortunately we cannot provide a short-term solution on our DHCP servers. Our attention was recently drawn to the following workaround, but we do not know whether it solves the problem completely:

https://blog.printk.io/2018/08/ipv6-renew-issue-with-fiber7-and-openwrt/

We are open to discussion on this subject.

Kind regards,
Init7 NOC
^dw

yorik · August 17, 2018, 11:41am

Unfortunately that solution doesn’t work, because the odhcp6c package (2018-06-20-12) does not support the -U parameter:

odhcp6c -U -s /lib/netifd/dhcpv6.script -Ntry -P0 -t120 eth1
odhcp6c: unrecognized option: U
Usage: odhcp6c [options] <interface>

Could it be related to the very old ticket https://gitlab.labs.nic.cz/turris/openwrt/issues/182 ?

Pepe · August 17, 2018, 11:43am

We’re working on it.

yorik · August 17, 2018, 11:48am

Pepe,
could you tell me how to revert odhcp6c to the previous version? I’m afraid that I’ll have no connectivity for the weekend.

Pepe · August 17, 2018, 12:01pm

Sent PM.

Pepe · August 17, 2018, 8:05pm

Hello guys,

I am sorry it has taken me so long to respond to your query. We’ve been discussing how we can help you in your situation, and we decided to release Turris OS 3.10.5 to RC at the beginning of next week.

In the upcoming release, odhcp6c has been updated to the latest version, which includes the option to ignore the Server Unicast option. We’d like to thank our user @koalatux, who brought this option to odhcp6c! We’re really glad to have such an amazing community, one that can debug a problem and send a pull request to get it fixed upstream.

Recently, we updated odhcp6c, which added the multicast option. Because unicast is causing problems in some networks, we decided to disable unicast support by default. Unicast support can be enabled in the configuration file /etc/config/network. This will be included in the release notes, and the support team (including me) is ready to help users who would like to enable it.

I have very good news for users who are experienced with the CLI and SSH.
If you would like to try the RC before we release it, you can do so with the following command:

switch-branch --force stable

Earlier I was in touch with @yorik: once you switch to the stable branch, you’ll need to ask Init7 to re-enable IPv6.

We’d really appreciate feedback on whether that works better for you.

Greetings from Prague,
Pepe

koalatux · August 18, 2018, 9:08am

Hi all

Something seems to be broken since the Turris update on Wednesday.

Yesterday evening I ran tcpdump again and saw a flood of DHCPv6 messages of types renew and reply. I’ve seen about one renew per second (but sometimes with pauses of tens of minutes). This roughly matches what @Init7 described earlier in this thread.

I still have my patched version of odhcp6c (with the -U parameter) running, so the renew messages were sent to the multicast address. Because of that I assume the -U / noserverunicast option won’t fix the flooding problem; this must be something unrelated.

I just wanted to share this. I don’t have time for debugging this weekend and currently I also don’t get replies to DHCPv6 Solicit messages from Init7’s DHCPv6-Server anymore.

Cheers, Adi

EDIT: I forgot to mention, the DHCPv6 reply messages I captured with tcpdump looked correct. This made me conclude there must be an error at Turris.

BearPerson · August 21, 2018, 10:47pm

Hrm. I think I understand the DoSing failure mode, at least somewhat.

TL;DR: This might happen after T1 and T2 of an assigned IA_PD-prefix have run out but before its valid lifetime expires, if the server stops returning that prefix (but doesn’t actively revoke it) while still providing IA_NA addresses.

Note the “elapsed-time 100” on the renew retransmission (aka 1s, the option is in centiseconds). odhcp6c seems to have an odd habit of sending one final retransmission AT the final operation timeout and then immediately giving up. The initial timeout for the first renew transmission is normally 10s ±rand(0.1s), so this is being cut short by the overall timeout.
Sifting through the code, this means T2 = T1 + 1. In this case, I think T2=1, T1=0. We’d get there from dhcpv6.c:1177 - that branch is taken if we just parsed a REBIND response and T1 and T2 were both 0, and it sets T2=1.
By my reading, dhcpv6_calc_refresh_timers() will quite happily pull T1=T2=0 out of a stored address, and odhcp6c_expire_list() will just tick those down to 0 over time if nothing refreshes them.

This puts us into the observed failure state of REBIND/REPLY/RENEW/RENEW/REBIND.
Doesn’t seem particularly intended, overall.
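To make the decay concrete, here is a minimal Python model of the state described above. It is not odhcp6c's code: the `Entry` fields, `expire()` and `refresh_timers()` are hypothetical stand-ins for the behaviour attributed to odhcp6c_expire_list() and dhcpv6_calc_refresh_timers(). The address timers (T1=1200, T2=1800, vltime=86400) are taken from the reply in the capture; the prefix is assumed to have started with the same values.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    t1: int      # seconds until RENEW
    t2: int      # seconds until REBIND
    valid: int   # seconds of valid lifetime left


def expire(entries, elapsed):
    """Tick every timer down by `elapsed` seconds, flooring T1/T2 at 0.

    Entries are only dropped once their valid lifetime has run out.
    """
    kept = []
    for e in entries:
        if e.valid <= elapsed:
            continue
        kept.append(Entry(e.name,
                          max(e.t1 - elapsed, 0),
                          max(e.t2 - elapsed, 0),
                          e.valid - elapsed))
    return kept


def refresh_timers(entries):
    """Naive version: smallest T1 and T2 across all stored entries."""
    return min(e.t1 for e in entries), min(e.t2 for e in entries)


entries = [
    Entry("IA_PD prefix", t1=1200, t2=1800, valid=86400),   # assumed values
    Entry("IA_NA address", t1=1200, t2=1800, valid=86400),  # from the capture
]

# 30 minutes pass; the server keeps refreshing the address but never
# mentions the prefix again (and never revokes it either).
entries = expire(entries, 1800)
entries[1] = Entry("IA_NA address", t1=1200, t2=1800, valid=86400)

print(refresh_timers(entries))  # (0, 0): the stale prefix drags both timers to 0
```

The stale prefix still has most of its valid lifetime left, so in this model it keeps depressing the timers for many hours.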

I’m not familiar with the precise standard here, but it seems odd that odhcp6c just hangs on to all configured addresses it was ever told about, even after repeated responses from the server no longer mention them, though maybe that’s intended - when we last got it the address was valid, it hasn’t expired, and nobody told us otherwise.
The behavior of expiring T1/T2 to 0 doesn’t seem to be intended, though.
I think dhcpv6_calc_refresh_timers() should probably ignore values of 0 in its min(), only yielding 0 if ALL addresses are now 0, but I’m not confident in that - we might need to get dedeckeh to weigh in.
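A rough Python sketch of that idea, purely to illustrate the arithmetic. The function name and the (t1, t2) tuple layout are made up here and do not mirror odhcp6c's data structures.

```python
def calc_refresh_timers(entries):
    """Return the smallest non-zero T1 and T2 over (t1, t2) pairs in seconds.

    A timer of 0 is ignored unless every entry has 0 for it, in which
    case 0 is returned for that timer.
    """
    t1_values = [t1 for t1, _ in entries if t1 > 0]
    t2_values = [t2 for _, t2 in entries if t2 > 0]
    t1 = min(t1_values) if t1_values else 0
    t2 = min(t2_values) if t2_values else 0
    return t1, t2


# A stale prefix stuck at 0/0 next to a freshly renewed address:
print(calc_refresh_timers([(0, 0), (1200, 1800)]))  # (1200, 1800)
# Everything has run down to 0:
print(calc_refresh_timers([(0, 0), (0, 0)]))        # (0, 0)
```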

I don’t think I fully understand that updated_IAs interaction, either - I get that we yield failure if no IAs changed and we’re stuck at T1=T2=0, but some IAs changing doesn’t mean all of them did. Perhaps there needs to be code here where we re-up T1/T2 on entries to some reasonable value if they weren’t mentioned in the response, perhaps max(1200, 0.5*preferred) or so - what should we do when preferred expires but valid hasn’t yet? The standard seems thin on this, but I may have missed something.
The best I got is that the standard wants clients to maintain T1/T2 per IA, as in, per set of addresses returned aka per “session”, not per individual address. Thus, after the server has responded with a new T1/T2 on some addresses, the client should use those T1/T2, even if it also has other addresses it received in the past that weren’t mentioned in the response.
So instead of invoking dhcpv6_calc_refresh_timers(), we might want dhcpv6_handle_reply() to track the smallest T1/T2 effective on any IA_NA/IA_PD address it received in this reply, and use those values.
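Roughly what I have in mind, as a Python sketch rather than a patch. The dict layout and the function name are invented for illustration, and treating a T1/T2 of 0 as "not specified" is a simplification on my part.

```python
def timers_from_reply(reply_ias):
    """Pick T1/T2 from the IAs in a single REPLY.

    reply_ias: list of dicts such as
        {"type": "IA_NA", "t1": 1200, "t2": 1800, "entries": ["2001:db8::1"]}
    Only IAs that actually carried an address or prefix contribute, and
    addresses remembered from earlier replies are not consulted at all.
    Returns (t1, t2), or None if the reply carried no usable timers.
    """
    t1_values = []
    t2_values = []
    for ia in reply_ias:
        if not ia["entries"]:
            continue  # e.g. an IA_PD that came back empty
        if ia["t1"] > 0:
            t1_values.append(ia["t1"])
        if ia["t2"] > 0:
            t2_values.append(ia["t2"])
    if not t1_values or not t2_values:
        return None
    return min(t1_values), min(t2_values)


# Shape of the REPLY in the capture: IA_NA with an address, IA_PD empty.
reply = [
    {"type": "IA_NA", "t1": 1200, "t2": 1800,
     "entries": ["2a02:168:2000:9:a2d3:42ec:f1e:49f5"]},
    {"type": "IA_PD", "t1": 0, "t2": 0, "entries": []},
]
print(timers_from_reply(reply))  # (1200, 1800)
```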

@Init7 - I’m not sure what to offer as workaround here.
If you can, the simplest would be ensuring that prefixes stay bound for 24h, so that you can always reply to REBIND by refreshing the requested prefixes and they don’t run out. This may mean that servers that can’t reach the assigned-PD database can’t send responses.
If you have control over server code, implementing the REBIND bit of https://tools.ietf.org/html/rfc3633#section-12.2 would help: If the client mentioned a prefix in its IA_PD option on renew, but the server doesn’t want it to use it anymore, reply with an IA_PD-prefix option mentioning that prefix with vltime=0. This would correctly cause odhcp6c to immediately discard that prefix, stopping it from depressing T1/T2.
If you can’t touch your dhcp server, the best I can come up with is a packet-inspecting filter that discards REBIND requests with an elapsed-time option < 6000, ensuring that clients don’t get a REBIND response within the first minute. They should keep retransmitting, eventually sending a request marked old enough to make it through, and timeouts should be high enough that 60s isn’t a huge deal. A bit of a dirty move, but it might slow down the flood enough that your servers survive, as the speed of the loop depends on how fast clients get a response to REBIND.

Edit: After writing this, I realized a very similar failure mode exists where T1 hits 0 but T2 hasn’t yet.
In that case, the client will send a RENEW with normal timeout (T2-T1 == T2-1), upon receiving a response set T1=1 and restart the stateful loop: Wait 1s (T1) for RECONFIGURE, then immediately send RENEW and normal retransmissions thereof, looping on response - again, a request spam of cycle time 1s + renew_rtt.
So unfortunately, the workaround of filtering low-age REBIND isn’t sufficient; you’d also need to filter low-age RENEW. Still, as the message is retransmitted and not that time-critical, it might be survivable for a while.
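For the filtering idea, the decision logic could look something like the Python sketch below. It only decides whether a UDP payload should be dropped; hooking it into an actual packet path is left out. The message type numbers (RENEW=5, REBIND=6) and the elapsed-time option code (8) are the RFC 3315 values; the 6000 centisecond threshold is the one suggested above, and the function names are made up. Requests without an elapsed-time option are let through, which is an arbitrary choice.

```python
import struct

RENEW, REBIND = 5, 6          # DHCPv6 message types
OPTION_ELAPSED_TIME = 8       # value is 2 bytes, in centiseconds
MIN_ELAPSED_CS = 6000         # 60 s


def elapsed_time_cs(payload):
    """Return the elapsed-time option value, or None if absent or malformed."""
    offset = 4  # 1 byte message type + 3 bytes transaction ID
    while offset + 4 <= len(payload):
        code, length = struct.unpack_from("!HH", payload, offset)
        offset += 4
        if offset + length > len(payload):
            return None  # truncated option
        if code == OPTION_ELAPSED_TIME and length == 2:
            return struct.unpack_from("!H", payload, offset)[0]
        offset += length
    return None


def should_drop(payload):
    """Drop RENEW/REBIND requests that are younger than the threshold."""
    if len(payload) < 4 or payload[0] not in (RENEW, REBIND):
        return False
    elapsed = elapsed_time_cs(payload)
    return elapsed is not None and elapsed < MIN_ELAPSED_CS


def build_request(msg_type, elapsed_cs):
    """Build a minimal test message carrying only an elapsed-time option."""
    return (bytes([msg_type]) + b"\x71\xaa\x0d"
            + struct.pack("!HHH", OPTION_ELAPSED_TIME, 2, elapsed_cs))


print(should_drop(build_request(RENEW, 100)))    # True
print(should_drop(build_request(REBIND, 0)))     # True
print(should_drop(build_request(REBIND, 6000)))  # False
```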

BearPerson · August 21, 2018, 11:15pm

Worth mentioning that the breaking client behavior (synthesizing T2=1) was introduced in https://git.openwrt.org/?p=project/odhcp6c.git;a=commitdiff;h=473f248e2db6c6c39e7aecf78f888e44f36ff5c4 in early April; older builds of odhcp6c would not behave this way.
If I read this right, those older builds would instead retransmit REBIND 20 times, with normal exponential backoff, refreshing any returned addresses but considering the response subtly invalid, eventually restarting SOLICIT, which forcibly discards any PDs still held, until it gets through REQUEST/REPLY.

BearPerson · August 22, 2018, 1:27pm

@koalatux (or anyone else) when you have time, could you check out https://github.com/AlsoBearPerson/odhcp6c/commit/250f56a73e7a1b5e1e90f53982e8915065947450 please?
This should fix the flood we’ve been seeing, but I don’t have a suitable development environment nearby, so I can’t even syntax check that right now. Probably has a few loose bits to shake out, so it seems rude to pull request in its current shape.
I would like to tweak things a tad more - changing the static local T1/T2/T3 variables from relative times to absolute timestamps would remove the need for much of the ticking-down shenanigans, though I’m not sure if I should keep piling that into this change…

bdeblier · August 23, 2018, 7:07am

Just rebooted with 3.10.5. Will check back in 12-24 hours to see if my problem persists.

Init7 · August 23, 2018, 7:56am

Many thanks to everyone who has worked so intensively on the topic. We really appreciate it. The first tests with some customers look very good. ^dw

maetthu · August 23, 2018, 9:48am

I enabled unicast in /etc/config/network (set “option noserverunicast” to “0”) and rebooted with 3.10.5, looks good so far indeed.

yorik · August 23, 2018, 11:23am

I have unicast disabled and it works for 48+ hours. Thank you for fixing this!

BearPerson · August 23, 2018, 12:12pm

Note that if you’re on Init7 (or another setup with busted unicast) you need noserverunicast set to 1 (yay for double negatives).

As for the cause, I kind of suspect relay agent shenanigans - judging by addresses involved, my upstream router is also acting as dhcpv6 relay agent on multicast queries, which might allow it to e.g. add an interface-id option indicating which port the request came from, when relaying to the actual server. As unicast will likely be routed as normal packet traffic instead, it won’t be modified, and the server might miss information about which customer this is, unless it remembers such metadata by client_id…
The RFC says “Therefore, a server should only send a Unicast option to a client when Relay Agents are not sending Relay Agent options.” - I’m not seeing any relay agent options on the client side, but that doesn’t mean there aren’t any server-side. If my theory is correct, then either the server should not generate a unicast option to begin with, or the relay agent should cut it out of the response when relaying.
My uplink looks like a point-to-point link (not seeing any neighbor discovery traffic from my actual neighbors) so it’s not like multicast vs. unicast is going to make a huge difference in traffic fanout here.

maetthu · August 23, 2018, 1:27pm

Thx. I suspected as much, since about 2h after the reboot I lost IPv6 connectivity again. I set noserverunicast back to 1 and it has been stable again for about 2h (although it was working for that long after the last reboot as well, so we’ll see…)

bdeblier · August 24, 2018, 6:46am

Lost IPv6 address again after 12 hours. Now retrying with option noserverunicast set to 0.