Omnia: LAN interfaces all went down. No idea why, no idea how to fix it but want to learn

(Not sure, software or hardware)

Seriously weird happening last night:

  1. I have a Turris Omnia - been in service for years, best thing ever.
  2. I plug a Netgear WAP into my LAN (actually two switch hops away from the Omnia) and go to my desktop to see if I can bring up its web interface, but my desktop has no network (it is plugged directly into the Omnia).
  3. I spend ages trying to work out what’s up with my desktop’s network, find nix, so reboot it, but still no good.
  4. I check the Omnia, it has two LEDs flashing orange in rapid cycle, the Internet one (with the little globe) and PCI3 (no idea what that one is and can’t find out either - labelled unassigned in the manual). Could be the USB HDD I have plugged in?
  5. I connect to the Omnia WLAN with my phone and it’s up and I have internet. I can get Foris and Luci up fine.
  6. I open a laptop and connect it with CAT6 to the back of the Omnia. No net. Cable Disconnected symbol in the Systray of Linux Mint.
  7. I connect the laptop to the Omnia WLAN and I’m on-line and can ssh to the Omnia and access Foris, reForis and Luci fine and surf the web.
  8. I open a second laptop and connect it with CAT6 to the Omnia but there is no net. I keep the two open so I can experiment through the WLAN connected laptop and ping the Omnia from the CAT6 connected one.
  9. I reboot the Omnia. No change
  10. I use the reset button to rollback one snapshot on Schnapps (hold down wait for two LEDs to light and release). Reboot. No change.
  11. Repeat 10. No Change.
  12. Repeat 10. No Change.
  13. Repeat 10. No change. I am now on a snapshot that’s prior to the last update and a month (of reliable running) old so it’s not a system configuration issue.
  14. It just so happens, that I have a spare Omnia ;-). Alas it’s a cold spare and not configured. I spent ages configuring that and it got too late (am almost done, but I need to copy over DHCP, DDNS, and Lighttpd configs and test).
  15. Because it was now after midnight and I had to crash, on the WLAN connected laptop, on Luci, under network interfaces, I restarted the LAN interface, just because … and voila the CAT6 connected laptop now pings the Omnia. Quick check and everything is back up. I retire.
  16. In the morning it’s all down again! That is the LAN ports, the WLAN and WAN are up! And the Omnia is flashing those two LEDs again (i.e. none of the LAN port ones)
  17. I use the still WLAN connected laptop still open at Luci, to restart the LAN again. All comes good. Has been good for some hours now.

Would this merit a giant WTF?

What is going on here?

I will continue configuring my spare so it’s a warm swap in spare, but for now am on the original Omnia and all is good. It seems.

What possible causes are there for this? The TLDR:

  1. Reliable Omnia been running for years, then out of the blue for no apparent reason…
  2. WAN is up and works well
  3. WLAN is up and works well
  4. All 5 LAN ports are down and nothing can connect.
  5. Reboot does not fix it.
  6. Rollbacks don’t fix it.
  7. Luci LAN interface restart does fix it (for a while then it happens again)

As I’m paranoid and need this line to be up, after restarting the interface worked again I turned off the switch the new WAP is on (and a TV and an Xbox are on). I will of course wait to see if I get a day’s clean running and will then turn that on again and see if it correlates. My bet is no. But I wonder whether flooding can bring the LAN interface down on the Omnia. That is, if for any reason this newly connected WAP was flooding the line, in the manner of a classic DoS attack, could that bring the Omnia’s LAN interface down?

Here is Uptimerobot’s view:

The first green band was last night after I restarted the LAN interface. It lasted 2.5 hours (30 min resolution as it pings every 30 mins) and it’s been up now for almost 2 hours this morning (is crunch time coming?).

Which version of Turris OS? How does it work when you remove the switches and wap? I mean pure omnia.

Foris reports:

Device Turris Omnia
Serial number 47244708159
Turris OS version 5.4.4
Turris OS branch hbs
Kernel version 4.14.294

It’s run reliably since yesterday, so I can’t test isolation. My plan is to explore the marginally small likelihood that the correlated connection of a Netgear WAP was causative by connecting that again soon after a clear run of reliable connection. If I get sudden correlation again, it begins to look causative, plus I’d have a reproducible scenario to answer the “in isolation” question.

I remain bamboozled as to what can cause these symptoms, and maintain flooding of the LAN interface as a (poorly informed) mild possibility. I will learn more when I connect the Netgear WAP once more, likely this evening (after two clear days of connectivity).

I would be suspicious of this WAP. :face_with_monocle:

What configuration did you do on the WAP? You didn’t say what model it is, but check that it is truly in AP mode and not trying to serve DHCP, et cetera.

It’s a Netgear EX6200.

It’s in Extender mode, and was operating in Extender mode for weeks, connected to this selfsame Omnia’s WLAN. I installed a new switch recently in its vicinity, because I had a single CAT6 line running through walls from years back that went to an Xbox. I recently put a smart TV above the Xbox which also has an RJ45 and connects on-line, so I put in a spare Netgear GS108Ev3 to split that line to the Xbox and TV, and that ran fine for a few weeks.

I had this Wi-Fi range extender situated in the same area for some while as an experiment in reach and coverage. It worked fine, connected the Omnia’s WLAN and extended its range significantly (outside to reach IoT devices outside the house).

The same device can also work as a wired in WAP and so one day (after all the above had been working fine for weeks) I thought I’d plug it in, so I ran a patch cable between the extender (WAP to be) and the switch. Then I went to my PC to bring up its LAN interface. But my PC was offline and the OP commences.

I am cautious here, as this is a correlation and by no means a clear or proven cause. I had not used the PC or any wired device for some hours (I started this little thing after the whole family dinner, kids to bed routine etc). There is no evidence that the issue arose when I plugged this Extender in.

That said, I will try repeating just that tonight to see if it repeats, and thus comes closer to a causal relationship. It is conceivable perhaps that in extender mode, plugged into the LAN, there’s some intriguing feedback loop over the WLAN and LAN connections (a bit like remoting in to a PC with VNC or RDP or whatever and in that session remoting back to the source computer, that can be fun ;-)). I say “conceivable” but I have no reason to believe it and there are counter indicators (the WLAN remained up, only the LAN was down on the Omnia, affording me the interesting diagnostic evening with one laptop WLAN connected, and one LAN connected pinging the Omnia constantly while I played around with it on the WLAN connected laptop to see if and when the LAN responded).
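
The two-laptop ping test described above can be left running unattended. Here is a minimal Python sketch, assuming a Linux machine on the wired side with the usual iputils `ping`. The function names and the 5 second interval are made up for illustration, and the router address is whatever your Omnia's LAN address happens to be.

```python
import subprocess
import time
from datetime import datetime


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Return True if a single ICMP echo to host gets a reply (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def state_changes(samples):
    """Reduce (timestamp, is_up) samples to the points where the state flips."""
    changes = []
    previous = None
    for stamp, is_up in samples:
        if is_up != previous:
            changes.append((stamp, "UP" if is_up else "DOWN"))
            previous = is_up
    return changes


def watch(host: str, interval_s: float = 5.0) -> None:
    """Ping forever and print a line only when the link state changes."""
    previous = None
    while True:
        is_up = ping_once(host)
        if is_up != previous:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(stamp, "UP" if is_up else "DOWN")
            previous = is_up
        time.sleep(interval_s)
```

Calling `watch("192.168.1.1")` (an assumed address) on the CAT6 connected laptop would print a timestamped line each time the wired path to the Omnia drops or comes back, which gives a much finer record than a 30 minute external check when correlating an outage with plugging a device in.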

Will certainly be able to report more later when I repeat the exercise, as a learning thing.

Do you have another switch to try?
I would do something like this: replace one device at a time. First replace the WAP and see if the problem persists, then replace the switch, etc.
That way we can find out if it is the switch (most likely) or the WAP.
It could be a hardware issue with the switch, for example some bad voltage hitting the LAN ports.

Of course here I am assuming that you configured the switch and WAP correctly.

I am wondering if the EX6200 knows you are trying to use the ethernet as an input, as opposed to trying to feed the router its own network which is potentially introducing a conflict.

It looks like the assumed behavior would be to take the network over 802.11 and feed it out over the ethernet ports.

Alas, failed to find time to try this again last night. But yes, something like that may be happening, though I’m a little confused by the notion of input and output ports. Ethernet is not a directional protocol in any way, shape or form. Some services like DHCP are of course, and have a client/server relationship.

What may play a role, and this interests me to learn over time, is another observation, and one reason I want to move to the WAP mode. This device as an extender practices the loathsome habit of MAC rewriting. So a device connecting to any other WAP on my network has a given MAC, but when connected to this one in extender mode it has another MAC. Which plays mild havoc with security based on MAC identification (iPhones do the same now, so they have to turn that off on my network, as, I’m sorry, but security trumps anonymity when you’re on my watch/network ;-).
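
On spotting such MACs: randomised addresses of the kind phones use are normally "locally administered", which is flagged by one bit in the first octet. A small Python sketch of that check follows; the function name is made up, and whether the EX6200's rewritten MACs set this bit is an assumption that nothing in this thread confirms.

```python
def is_locally_administered(mac: str) -> bool:
    """True if the MAC has the locally administered (U/L) bit set.

    The U/L bit is the second-least-significant bit of the first octet.
    Accepts colon or hyphen separated addresses.
    """
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0x02)


# Made-up example addresses, for illustration only.
print(is_locally_administered("02:00:00:00:00:01"))  # True
print(is_locally_administered("00:11:22:33:44:55"))  # False
```

A MAC that fails this check is not necessarily genuine, and one that passes is not necessarily a phone, so at best this is a hint when reading a DHCP lease list.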

Of course internet “clients” sort of exist as well in the cited examples, TVs, game consoles and Blu-ray players, in that they really only issue a DHCP request, get an IP assigned and then surf the net through the NAT (which the Omnia provides). As an extender it should simply have one WLAN link to the Omnia and offer a connection to WiFi devices, acting as a bridge between the two.

I look forward to trying this again and learning if the WAP is to blame. We shall see. A weekend is coming and may afford more time. The time issue is that if it brings my LAN ports down, I have an effective network outage I need to attend to ASAP and may resolve quickly (turn it off again and restart the LAN interface on the Omnia) and may not (in short any experiment entails a small risk of blowing out in effort).

Well today I made time to experiment. And yes we can conclude it is the Netgear EX6200 that brings down the Omnia’s LAN interface, I’ll stop just short of blaming it, and ask the obvious question whether the Omnia’s LAN interface should be so fickle as to permit such a thing, let alone without any clear indicators. First the evidence:

  1. Turned off the Netgear EX6200
  2. Switched on the switch it was plugged into (which has a TV and Xbox on it too)
  3. Everything ran fine for hours, and hours and hours
  4. Turn on the Netgear EX6200
  5. After a short moment, the LAN interface goes dead on the Omnia (WAN and WLAN remain active and useable)
  6. Turn off the Netgear EX6200
  7. Restart the LAN interface on the Luci Interfaces page of the Omnia
  8. After a while (it takes a minute or so) the LAN interface comes up again.
  9. Remove LAN cable from the Netgear EX6200
  10. Turn on the Netgear EX6200
  11. LAN stays up, all is good. Netgear EX6200 works as a WAP in extender mode.

Then the steps to reproduce:

  1. Have a Netgear EX6200 configured as a Wi-Fi Extender
  2. Plug it into a LAN for which the Omnia acts as a gateway to the WAN (DHCP server, NAT server etc).
    • It doesn’t have to be connected to the Omnia, just on the same LAN - I have two intervening switches
  3. The Omnia’s LAN interface goes dead (non-responsive)
  4. It still serves WLAN and WAN and hence Foris, reForis and Luci can be used via a Wi-Fi link, but none of these lend any clue that the LAN interface is down, it looks up, and reports traffic, modest traffic, nothing wild. There is simply no clue that there is an issue.

Then the expectation:

  1. I expect the Omnia’s LAN interface not to go down
  2. I expect that if it does, and I can connect with WLAN that there are clues, ways of:
    a. seeing that the interface is down
    b. diagnosing the reason
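
One place to look for such clues from an ssh session is the kernel's own view of each interface under `/sys/class/net`. Here is a minimal Python sketch, assuming Python is available on the machine you run it on; `br-lan` is the bridge named in this thread, while `lan0` to `lan4` are assumed port names that may differ on your system. Given that Luci showed the interface as up with traffic flowing, this may well report "up" during the fault too, so treat it as a starting point rather than a diagnosis.

```python
from pathlib import Path


def link_state(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read operstate and carrier for one interface from Linux sysfs."""
    base = Path(sysfs_root) / iface
    state = {"iface": iface, "operstate": None, "carrier": None}
    try:
        state["operstate"] = (base / "operstate").read_text().strip()
    except OSError:
        pass  # interface does not exist on this machine
    try:
        state["carrier"] = (base / "carrier").read_text().strip() == "1"
    except OSError:
        pass  # carrier is unreadable while the interface is administratively down
    return state


# Assumed interface names; adjust to what `ip link` shows on your router.
for name in ["br-lan", "lan0", "lan1", "lan2", "lan3", "lan4"]:
    print(link_state(name))
```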

Wow, this sounds so much like a network loop of some sort causing a broadcast storm.

I was also thinking that you have a loop in your network. Maybe try enabling STP on your br-lan, so that it disables only the one port in the loop instead of the whole LAN going unresponsive.

I agree, but:

  1. For there to be a loop, it includes the WLAN and LAN interfaces. Yet the LAN on the Omnia goes down, and the WLAN stays responsive.
  2. Because of 1 I can see network traffic on Luci’s Interfaces pages. Not only can I see it, I can see it’s active, there is traffic on the LAN interface, the numbers move, and I can see it is not at all voluminous.

So, storm is not the word I’d use, but something is up, indeed.

Moreover, I connected to the EX6200 over its WLAN and reconfigured it from extender mode into WAP mode, and it now runs fine as a WAP.

The issue arises (Omnia’s LAN interface dies) when the EX6200 is connected to both the Omnia’s LAN and WLAN (is in extender mode), but does not “appear” to relate to a traffic flood or storm, but something more subtle.

Still, this is interesting reading:

But does not impact any of the observations, that if a broadcast storm were to blame I’d want to see that on the monitored traffic count on Luci, and also I’d expect it to flood all the links in the loop (so in this case the Omnia’s LAN and WLAN).

Interesting, had to read about that, and found it on Luci. It remains an exercise in learning, alas, as in AP mode my issues are resolved, and I have no issue to resolve, only more to learn about why this issue existed at all (i.e. growing personally). That has some value, but in a busy life has limits too. I’d have to make time to experiment with STP, risking network downtime in the process (not a crisis: I deliver free services to community groups, and beggars accept that 100% uptime is not promised, cannot be, but I manage OK, with 98.693% uptime in the past month and only 94.646% in the past week because of this issue).
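
To put those uptime percentages in hours, here is a quick Python calculation. The 30-day month is an assumption, and with checks only every 30 minutes the results are rough.

```python
def downtime_hours(uptime_percent: float, window_days: float) -> float:
    """Hours of downtime implied by an uptime percentage over a window."""
    window_hours = window_days * 24
    return window_hours * (100.0 - uptime_percent) / 100.0


week = downtime_hours(94.646, 7)    # past week
month = downtime_hours(98.693, 30)  # past month, assuming a 30-day window
print(f"week:  {week:.1f} h down")   # week:  9.0 h down
print(f"month: {month:.1f} h down")  # month: 9.4 h down
```

So roughly nine hours of outage in the week accounts for nearly all of the month's downtime, which fits the picture of a reliable router with one bad episode.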

STP is interesting and was a nice read. This is perhaps one of the more interesting reads:

Because it highlights that multiple paths are an exploited feature, not a problem.

I’m not sure how a loop differs from multiple paths. Again, I am in fact a rusty network engineer (studied networks intensively in the 1990s) and my recollection of the Ethernet is as a broadcast protocol. A given NIC simply shouts out packets with appropriate addressing, and only listeners identifying under that address act on it. Of course, in an IP network the remaining Ethernet layer operates primarily between two endpoints only (the evolution toward switched networks enabling maximum bandwidth on the Ethernet). Still, the NICs are all operating Ethernet at the conceptual OSI data link layer (with MAC as the addressor), with IP routing on top of that (with an IP address as the addressor). Local switches though are switched Ethernet I presume, so forward frames using MACs (with ARP mapping IP addresses to MACs). And that gets doubly interesting as the LAN and WLAN on the EX6200 have distinct MACs.

Seems STP is low risk mind you and given the LAN has a one route topology (is a tree really with its root at the WAN link on the Omnia and 5 initial branches, growing with each of the subsequent switches), STP might be a good way of ensuring if a wired loop were accidentally created, a stable network is maintained.
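
On how a loop differs from a tree at layer 2, a toy Python simulation may help. It floods a single broadcast frame through bridges that forward it out of every link except the one it arrived on. The topology names are made up, and modelling the extender's wireless link back to the Omnia as just another cable is an assumption; nothing in this thread proves that a loop was the actual cause.

```python
def flood(links, start, max_hops=10):
    """Flood one broadcast frame and return frames in flight after each hop.

    links: list of (a, b) pairs, each an undirected cable between two bridges.
    A bridge forwards a broadcast out of every link except the one it came in on.
    Ethernet frames carry no hop limit, so nothing here ever expires a frame.
    """
    neighbours = {}
    for index, (a, b) in enumerate(links):
        neighbours.setdefault(a, []).append((index, b))
        neighbours.setdefault(b, []).append((index, a))

    # Each frame in flight is (link it travels on, bridge it is heading to).
    in_flight = [(index, peer) for index, peer in neighbours.get(start, [])]
    history = [len(in_flight)]
    for _ in range(max_hops):
        next_round = []
        for arrived_on, node in in_flight:
            for index, peer in neighbours[node]:
                if index != arrived_on:
                    next_round.append((index, peer))
        in_flight = next_round
        history.append(len(in_flight))
    return history


tree = [("omnia", "sw1"), ("sw1", "sw2"), ("sw2", "extender")]
loop = tree + [("extender", "omnia")]  # assumed: wireless link modelled as a cable

print(flood(tree, "omnia", max_hops=6))  # [1, 1, 1, 0, 0, 0, 0]
print(flood(loop, "omnia", max_hops=6))  # [2, 2, 2, 2, 2, 2, 2]
```

In the tree the frame reaches the last bridge and dies out. In the loop the same frame keeps circulating in both directions, and every further broadcast adds to the pile. STP's job is to block one link so that the loop behaves like the tree again.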

Anyhow, nice tip, nice reading and I learned something (unlike old dogs, old network engineers do learn new tricks!).