DHCP/Network up seems to be flaky

Softcoder · November 21, 2016, 5:23pm

I have received my Turris Omnia here in Canada and for the most part it's working quite well.

One strange problem I noticed over the last few days is that if I need to reboot the router or disable/re-enable my NIC in Ubuntu, I often do not get an assigned IP address and the NIC keeps trying to get an assignment from the router.

The only way to solve this seems to be rebooting the Turris Omnia again and then it works (assigns connected client its static IP as configured in the static DHCP table).

Is this a known issue?

TheChaZ · November 21, 2016, 9:23pm

Hi - are you using Ethernet or WIFI?

Softcoder · November 21, 2016, 9:59pm

Wired Ethernet. I’ve seen this multiple times now when I reconfigure my WAN LTE router, which sits in front of the Turris (providing the WAN cell connection).

At one point I had problems even accessing the WAN LTE router login page (when it had the same subnet) so I switched it to another subnet and it worked fine afterwards, but always seemed to require reboots of the turris.

So to reiterate, the key issue I am looking to resolve here is that the Turris seems to have trouble dynamically handling wired Ethernet port changes.

Mateusz_Czeladka · November 21, 2016, 10:47pm

I believe I had the same issue with DHCP (Ethernet).

Softcoder · November 22, 2016, 3:51pm

FYI, rebooting the Turris does NOT fix this. The wired Ethernet clients that are already plugged in (Windows and Linux) do not get reconnected to the network. If I manually unplug their cable and plug it into a different empty port, then it works.

Definitely something wrong here.

fik · November 22, 2016, 4:08pm

Actually I have a similar problem: a wired Linux computer is getting a reserved DHCP address from turris, but when the computer runs and turris is restarted, there is no network on the computer. I need to reconnect the cable or do /etc/init.d/net.eth0 restart. This was not happening with previous routers.

GRMoreton · November 28, 2016, 5:29am

I don’t want to start another thread on the woes of DHCP on the Turris Omnia, so I will add to this one. I have been investigating why I have to revert to a previous configuration backup each morning after having a fully working Omnia the night before. There are no obvious clues in the logs, despite placing the Omnia in DEBUG mode, but I have now concluded that something in the night is causing me to lose DHCP on anything connected to VLAN1 / Eth0. When this happens, despite performing reboots etc., the interfaces on my plugged-in Ethernet Cat6 devices all report “NO DHCP”; however, anything connected via the Wifi does get DHCP with an allocated IP address.

The strange thing is that the Wifi is also Bridged via Eth0, so can only assume that it is the Ports on the physical Router that are somehow to blame for losing DHCP?

Any suggestions on how you get DHCP enabled on the Eth0 Ports without reverting back to a previous backup?

GRMoreton · November 28, 2016, 6:01am

Scratch that about DHCP not being served. I just realized it says that because I have static addresses set. I will try disabling DHCP on VLAN0 / Eth0 and see what happens tomorrow morning. I noticed that even though I have static IPs set for a number of devices, Wifi connectivity included, the DHCP leases are still counting down and constantly seem to be reset throughout the day, almost as if something is conflicting with these devices being set as static. Everything else stays the same, in that each night everything is running fine, and the next morning the only devices that can connect are those connected via Wifi, including those ordinarily connected via Ethernet Cat6?

nexusnet · January 6, 2017, 5:15pm

This sounds like the same problem I’m having. I created a topic earlier before seeing this one. When I reboot my Omnia, many servers don’t reconnect until I reboot each server. These are wired, using DHCP with IP address reservation. Last night (once again) I had made a variety of updates to my network and had it working exactly as I wanted. This morning, my iMac (from which I admin the net) had a self-assigned IP and would not refresh its lease. When I switched the iMac to a fixed manual IP, I could see a few clients on the network. Most were not there. Many of my servers that have reserved IP addresses have been given different IP addresses, and I can’t get them to correct without rebooting each. It looks like some are not getting their reserved IP even after a reboot.

white · January 6, 2017, 5:27pm

Have you checked if the link is up for those interfaces?

nexusnet · January 6, 2017, 6:57pm

Link is up @ 1Gb on each of the 4 units that I’m working on. Most of my network has been down since this morning. DHCP does not appear to be serving. I disabled the DHCP server on Omnia, and the laptop I’m using to test (link also up) keeps getting the same wrong IP address (x.x.x.4). I’ve rebooted the Omnia and the core switch multiple times. No change. Based on my reading online, the dhcp leases should be stored in /tmp/dhcp.leases. That file is 0 bytes. I’ve verified that the core switch is not serving dhcp. Baffled.
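For anyone who wants to check the lease file the same way, here is a minimal Python sketch. It assumes the file uses the dnsmasq lease layout (expiry epoch, MAC, IP, hostname, client ID, one lease per line); the helper names `parse_leases` and `mismatched_reservations` and the sample reservation table are made up for illustration and are not part of Turris OS.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Lease:
    expires: Optional[datetime]  # None when the file records expiry 0 (no expiry)
    mac: str
    ip: str
    hostname: str


def parse_leases(text: str) -> List[Lease]:
    """Parse lease-file text, assuming the dnsmasq layout described above."""
    leases = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and non-lease lines such as the DHCPv6 "duid" entry.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        epoch = int(fields[0])
        expires = datetime.fromtimestamp(epoch, tz=timezone.utc) if epoch else None
        leases.append(Lease(expires, fields[1].lower(), fields[2], fields[3]))
    return leases


def mismatched_reservations(leases: List[Lease],
                            reserved: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return {mac: leased_ip_or_None} for every reservation that is not honoured.

    `reserved` is a hypothetical MAC -> IP table copied by hand from the
    router's static DHCP configuration.
    """
    current = {lease.mac: lease.ip for lease in leases}
    result = {}
    for mac, wanted_ip in reserved.items():
        mac = mac.lower()
        if current.get(mac) != wanted_ip:
            result[mac] = current.get(mac)
    return result
```

On the router, calling `parse_leases(open("/tmp/dhcp.leases").read())` on a 0-byte file returns an empty list, which matches the "no leases recorded" symptom described above; a `None` value from `mismatched_reservations` means the client has no lease at all.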

nexusnet · January 11, 2017, 7:14pm

This is befuddling. With dhcp running on the Omnia, I see about 10-11 devices on my network - including Omnia, core managed switch (ubiquiti es48), my iMac with fixed ip, and a few others. If I start the dhcp server on the switch, I see 19-21 devices on the network. Static assignments thru dhcp don’t seem to work in either scenario. And even with both dhcp servers running, there are 7 servers that do not show up. All servers picked up addresses from dhcp on my prior router (asus ac1900) without fail - and without having to enable a dhcp server on the core switch.

Added: I’ve tried rebooting the router with the switch disconnected (cable removed); rebooted the Ubiquiti switch; neither corrected the problem. I rebooted one of the still-missing servers. Others came online at about the same time that I did the reboot - likely coincidence - may be that the dhcp server is taking a long time to recognize servers. IMO the dhcp server on the switch should not be necessary. In prior tests, however, if I removed it (used only Omnia), I’d be back to the same 10-11 devices. It looks as though the Omnia dhcp server is not recognizing the presence of devices on or requests from the core switch.

white · January 11, 2017, 8:18pm

Run tcpdump on the Omnia to find out whether the DHCP requests reach the Omnia and whether the Omnia responds.
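A rough sketch of how that capture could be put together, written as a small Python helper that only builds the command line. The interface name `br-lan` is an assumption (the usual OpenWrt LAN bridge name; substitute whatever your LAN interface is called), and `dhcp_capture_command` is a made-up helper name.

```python
import shlex
from typing import List, Optional


def dhcp_capture_command(interface: str = "br-lan",
                         client_mac: Optional[str] = None) -> List[str]:
    """Build a tcpdump argument list that shows DHCPv4 traffic only.

    -n   skip name resolution (DNS may be broken while debugging)
    -e   print link-level headers, so client MAC addresses are visible
    -vv  decode the DHCP message contents
    """
    cmd = ["tcpdump", "-i", interface, "-n", "-e", "-vv"]
    capture_filter = "udp and (port 67 or port 68)"
    if client_mac:
        # Narrow the capture to one client that fails to get an address.
        capture_filter = f"ether host {client_mac} and {capture_filter}"
    cmd.append(capture_filter)
    return cmd


# Print a shell-safe command line that can be pasted into an SSH session.
print(shlex.join(dhcp_capture_command()))
```

If a client's Discover or Request shows up in the capture with no Offer or ACK following it, the request reached the router and the DHCP server did not answer; if nothing shows up at all, the request never arrived on that interface.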

nexusnet · January 12, 2017, 12:27am

Thanks @white, wish I could explain why the network is finally working, but it is. The up-boards required hard reboots. Other devices have simply started to get ip assignments. Static dhcp assignments are working - reboot server, takes correct ip. I’ve shut off the second dhcp server on the Ubiquiti core switch. Lots of config saves along the way. I’ve also turned off auto-reboots to have more control over timing.

fuller · January 12, 2017, 9:41am

that sounds like a wise course of action

nexusnet · January 12, 2017, 2:16pm

Internal network including dhcp working fine this morning. WAN wasn’t operational. Firewall machine on the wan pipe had good connectivity. Since I access the firewall box via ssh over the Omnia, that indicated to me that routing thru the wan was functioning, even though other traffic from the network (i.e. email, Omnia trying to send statistics, web) was failing. I restarted resolver, odhcpd and firewall processes on the Omnia - wan is working once again for all traffic.

Some info from last night’s log that I’d appreciate folks’ assessments of:

Resolver is being restarted frequently - this is from the last 1000 lines of messages log:
2017-01-12T11:45:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T11:55:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T12:05:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T12:15:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T12:25:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T12:35:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T12:45:08-05:00 warning watchdog[]: Restarted resolver
2017-01-12T12:55:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T13:05:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T13:15:08-05:00 warning watchdog[]: Restarted resolver
2017-01-12T13:25:08-05:00 warning watchdog[]: Restarted resolver
2017-01-12T13:35:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T13:45:07-05:00 warning watchdog[]: Restarted resolver
2017-01-12T13:55:08-05:00 warning watchdog[]: Restarted resolver
No messages in last 1000 log lines about odhcpd

30 firewall messages in last 1000 log lines seem operational, not indicative of errors:
2017-01-12T12:00:01-05:00 info /usr/sbin/cron[20875]: (root) CMD (/usr/share/firewall/turris-download)
2017-01-12T12:00:11-05:00 err turris-firewall-rules[]: (v62) Failed to download https://api.turris.cz/firewall/turris-ipsets.gz.sign
2017-01-12T12:05:01-05:00 info /usr/sbin/cron[21313]: (root) CMD (/usr/share/firewall/turris)
2017-01-12T12:05:01-05:00 info turris-firewall-rules[]: (v62) IPv4 WAN interface used - 'eth1'
2017-01-12T12:05:01-05:00 info turris-firewall-rules[]: (v62) IPv6 WAN interface used - 'lo'
2017-01-12T12:05:02-05:00 info turris-firewall-rules[]: (v62) 3403 ipv4 address(es) and 0 ipv6 address(es) were loaded (bc4b7d351917fe6864073e9991d55b9d), 0 rule(s) overriden, 0 rule(s) skipped
2017-01-12T13:00:01-05:00 info /usr/sbin/cron[26560]: (root) CMD (/usr/share/firewall/turris-download)
2017-01-12T13:00:12-05:00 err turris-firewall-rules[]: (v62) Failed to download https://api.turris.cz/firewall/turris-ipsets.gz.sign
2017-01-12T13:05:01-05:00 info /usr/sbin/cron[26999]: (root) CMD (/usr/share/firewall/turris)
2017-01-12T13:05:01-05:00 info turris-firewall-rules[]: (v62) IPv4 WAN interface used - 'eth1'
2017-01-12T13:05:01-05:00 info turris-firewall-rules[]: (v62) IPv6 WAN interface used - 'lo'
2017-01-12T13:05:02-05:00 info turris-firewall-rules[]: (v62) 3403 ipv4 address(es) and 0 ipv6 address(es) were loaded (bc4b7d351917fe6864073e9991d55b9d), 0 rule(s) overriden, 0 rule(s) skipped
2017-01-12T14:00:01-05:00 info /usr/sbin/cron[2390]: (root) CMD (/usr/share/firewall/turris-download)
2017-01-12T14:00:11-05:00 err turris-firewall-rules[]: (v62) Failed to download https://api.turris.cz/firewall/turris-ipsets.gz.sign
2017-01-12T14:01:19-05:00 info turris-firewall-rules[]: (v62) IPv4 WAN interface used - 'eth1'
2017-01-12T14:01:19-05:00 info turris-firewall-rules[]: (v62) IPv6 WAN interface used - 'lo'
2017-01-12T14:01:20-05:00 info turris-firewall-rules[]: (v62) 3403 ipv4 address(es) and 0 ipv6 address(es) were loaded (bc4b7d351917fe6864073e9991d55b9d), 0 rule(s) overriden, 0 rule(s) skipped
2017-01-12T14:02:00-05:00 notice firewall[]: Reloading firewall due to ifup of wan (eth1)
2017-01-12T14:02:00-05:00 info turris-firewall-rules[]: (v62) IPv4 WAN interface used - 'eth1'
2017-01-12T14:02:00-05:00 info turris-firewall-rules[]: (v62) IPv6 WAN interface used - 'lo'
2017-01-12T14:02:02-05:00 info turris-firewall-rules[]: (v62) 3403 ipv4 address(es) and 0 ipv6 address(es) were loaded (bc4b7d351917fe6864073e9991d55b9d), 0 rule(s) overriden, 0 rule(s) skipped
2017-01-12T14:05:01-05:00 info /usr/sbin/cron[13134]: (root) CMD (/usr/share/firewall/turris)
2017-01-12T14:05:01-05:00 info turris-firewall-rules[]: (v62) IPv4 WAN interface used - 'eth1'
2017-01-12T14:05:01-05:00 info turris-firewall-rules[]: (v62) IPv6 WAN interface used - 'lo'
2017-01-12T14:05:02-05:00 info turris-firewall-rules[]: (v62) 3403 ipv4 address(es) and 0 ipv6 address(es) were loaded (bc4b7d351917fe6864073e9991d55b9d), 0 rule(s) overriden, 0 rule(s) skipped

The LuCI status overview section for mwan indicated that the primary (currently only - configuring second wan is a task for today or tomorrow) wan was offline. After restarting the 3 processes above, that page once again showed primary wan as enabled rather than offline.
Routing functions were working (i.e. I could access the firewall machine on wan line even though external dns resolution was failing). DNS resolution from the firewall machine itself was tested and working fine.

I am leaning toward the resolver messages indicating a problem. Next time (if this happens again) I will reload only one process at a time - starting with resolver - to isolate cause. Would internal traffic from x.x.1.0 lan to firewall on x.x.10.0 route even though dns resolution was failing? I think so - checking opinions on that. Resolver problems would explain the dns resolution failures. Any thoughts on the log info or general scenario?
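One thing that stands out in the resolver lines above is how evenly spaced they are. A small Python sketch for measuring that from a saved log, assuming each line starts with an ISO 8601 timestamp as in the excerpt; the function names are made up for illustration.

```python
from datetime import datetime
from typing import Iterable, List

MARKER = "watchdog[]: Restarted resolver"


def restart_times(log_lines: Iterable[str]) -> List[datetime]:
    """Collect the timestamps of all watchdog resolver restarts."""
    times = []
    for line in log_lines:
        if MARKER in line:
            # The first whitespace-separated field is the timestamp,
            # e.g. 2017-01-12T11:45:07-05:00
            times.append(datetime.fromisoformat(line.split()[0]))
    return times


def restart_gaps(times: List[datetime]) -> List[float]:
    """Seconds between consecutive restarts."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Run against the excerpt, every gap comes out within a second of 600, i.e. one restart every ten minutes. That may mean a periodic watchdog check is failing on each run rather than the resolver crashing at random, which would fit the DNS failures, but it is only a guess from the timing.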