Diagnosing bizarre lockups on the Omnia

Now and again my Omnia router locks up, albeit in the most bizarre way. I’m curious whether, after a reboot, there’s a reliable place to look to diagnose what went wrong.

Here are the symptoms I consider bizarre:

  1. WAN link not being served (local machines can’t connect to WAN)
  2. Omnia web interface not being served (no response)
  3. ssh not working, can’t ssh to the router.
  4. Nameserver not working (not getting LAN names resolved)
  5. LAN switching IS working … that is two LAN machines connected via LAN ports on the Omnia can talk to one another

My conclusion, which I’d like to check with the skilled pros here if possible, is that LAN switching operates at a firmware level beneath the OpenWrt layer, so that it keeps working when OpenWrt is locked up (for whatever reason). I don’t like that conclusion one jot, as I can’t see any reason why WAN switching would not be included in that. Maybe it is, and I haven’t checked closely enough: with name serving down, web browsers would not work, but if I knew and remembered the IP address of a public server I might still be able to ping it, or even load a web page from it by IP address. This I haven’t tested, to be honest, because when it’s down my top priority has been getting it up again, as I have complaints from the family.
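For what it's worth, the untested idea above (name serving down, but traffic to a literal IP address still getting through) could be checked from a LAN machine with something like this. It is only a sketch: the hostname example.com, the address 1.1.1.1 and port 443 are assumed test targets, nothing Omnia-specific, and a local DNS cache on the client could make the name lookup succeed even when the router's resolver is dead.

```python
import socket


def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return True if the hostname resolves, i.e. some DNS server answered."""
    try:
        resolver(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def can_connect(ip, port=443, timeout=3.0, connector=socket.create_connection):
    """Return True if a TCP connection to a literal IP succeeds (no DNS involved)."""
    try:
        with connector((ip, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable or timed out
        return False


def diagnose(dns_ok, ip_ok):
    """Turn the two observations into a rough verdict."""
    if dns_ok and ip_ok:
        return "DNS and WAN routing both work"
    if ip_ok:
        return "WAN routing works, only DNS is down"
    if dns_ok:
        return "DNS answers but the test IP is unreachable"
    return "neither DNS nor WAN routing works"


def check(hostname="example.com", ip="1.1.1.1"):
    """Run both probes against the assumed test targets and summarise."""
    return diagnose(can_resolve(hostname), can_connect(ip))
```

Running `print(check())` during a lockup would show whether only the resolver has died or whether nothing is being forwarded to the WAN at all.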

Now, how I know it’s still serving the LAN is roundabout. I have an HTPC and a NAS connected to the Omnia, and the HTPC plays music off the NAS, which works the whole time. In fact, if it’s not playing, I start it playing on purpose. The router is under our house (where it’s cool and dark), as are the NAS and various other devices, and I have these under-house devices on RF-switched power points specifically so I can do a hard reset from the house and not have to descend to the basement (because, like it or not, one device or another locks up too often). So if the WAN is down and I can’t access the Omnia in any way, I can restart it remotely. But because I can’t easily see when it’s up again and I’m impatient, I start music playing on my HTPC, reading off the NAS through the Omnia (that this works is what blows my mind). Then I turn the Omnia off and the music stops; I turn it on again, and when the music starts I know it’s up, and I test the WAN and all is good again.
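A less musical way to see when the router is back would be to poll one of its TCP ports from a LAN machine. The following is a rough sketch under assumptions: 192.168.1.1 is only the usual default LAN address and may differ on a given setup, and port 22 assumes the ssh server is enabled.

```python
import socket
import time


def wait_until_up(host, port=22, interval=2.0, max_wait=300.0,
                  connector=socket.create_connection,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll host:port until a TCP connection succeeds.

    Returns the number of seconds waited, or None if max_wait ran out first.
    """
    start = clock()
    while clock() - start < max_wait:
        try:
            # A successful connect means the service is listening again.
            with connector((host, port), timeout=interval):
                return clock() - start
        except OSError:
            # Refused, unreachable or timed out: wait and try again.
            sleep(interval)
    return None
```

Calling `wait_until_up("192.168.1.1")` right after switching the power back on blocks until ssh answers, or gives up after five minutes.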

What puzzles me, then, is what to do afterwards to diagnose why the Omnia was locked up. I would like to think we can endure such lockups with the consolation of nailing down, in time, what causes them and finding a resolution. But the problem I see is that neither dmesg nor /var/log/messages is persistent across reboots; both start afresh at every system start.

Is there a persistent log, or a common way of making either or both of these persistent, say by configuring them to live on a mounted SSD?

You might consider attaching a serial cable to get to the console. That should allow you to see if there are any errors when you can’t get to the web interface or ssh.

seems to have a decent discussion on this.

This is quite typical: in many routers the LAN ports are located on a dedicated switch chip that can be configured from the router’s SoC but, once set up correctly, will just continue to work as long as it is powered up.

Thanks. That’s handy to know.

  1. Diagnosing with a serial line seems like a lot of effort, alas, for a pricey top-of-the-line router. Grrrr. Like a throwback to the 60s with a console …

  2. I guess LAN switching can work at the chip level where WAN switching needs the kernel in order to apply the various WAN related rules (firewall, NAT etc.)

I really wish some Turris folk would step up and suggest a way to create persistent logs. I’m tempted to mount an SSD and then see if I can get log files written to that. I could mount it as /tmp, but that might impact performance, and I’m eager to hear from Turris whether that is sensible.

Even the persistence of an SSD won’t help if /var/log/messages is overwritten at startup, though, unless that can be configured to append instead. And where dmesg stores its messages is a mystery to me at present, let alone whether there are any other logs that can be enabled or that I should know about in trying to diagnose such issues.
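On the dmesg part: its messages live in the kernel's ring buffer in RAM, which is why they vanish on reboot. One workaround, sketched here rather than recommended as the proper fix, is to append a timestamped snapshot of the `dmesg` output to a file on persistent storage at regular intervals (from cron, say), so the last state before a lockup survives. The destination path in the usage note is hypothetical, and the sketch assumes Python 3 and the `dmesg` command are available on the router.

```python
import subprocess
import time
from pathlib import Path


def read_dmesg():
    """Return the current kernel ring buffer as text (needs the dmesg command)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def append_snapshot(text, dest, now=None):
    """Append one timestamped snapshot to dest; the file is never truncated."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" appends, so earlier snapshots survive reboots and reruns.
    with dest.open("a", encoding="utf-8") as fh:
        fh.write(f"===== snapshot {stamp} =====\n")
        fh.write(text if text.endswith("\n") else text + "\n")
```

A periodic job would then call `append_snapshot(read_dmesg(), "/mnt/ssd/log/dmesg.log")`, where `/mnt/ssd` stands in for wherever the SSD is actually mounted. The file grows without bound, so it would need rotating or trimming now and then.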

There’s a chance the lockups relate to /tmp filling up. I just noticed that happening. So I turned off some rather verbose lighttpd logging that I had running (and will have to look at getting it to log to an SSD and/or check whether it has a sensible rolling log system, as I am keen on logs).
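To catch /tmp filling up before it takes services down, a small periodic check could raise a warning once usage crosses a threshold. This is a minimal sketch; the 80% threshold is an arbitrary assumption, and what to do with the warning (write it to a persistent file, send a notification) is left open.

```python
import shutil


def percent_used(total, used):
    """Percentage of a filesystem in use; 0.0 if the size is reported as zero."""
    return 100.0 * used / total if total else 0.0


def tmp_warning(path="/tmp", threshold=80.0, usage=shutil.disk_usage):
    """Return a warning string if path is at least threshold percent full, else None."""
    total, used, _free = usage(path)
    pct = percent_used(total, used)
    if pct >= threshold:
        return f"{path} is {pct:.0f}% full ({used} of {total} bytes)"
    return None
```

Run from cron every few minutes, `tmp_warning()` returns `None` while there is headroom and a one-line message once /tmp is nearly full.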

Not enough space in tmpfs might be the cause of the router not responding on the LAN, as the DHCP and DNS servers store temporary files in tmpfs. It might also bring down lighttpd, but I am not sure about the ssh server. I don’t see how that would be affected by this (possibly killed because of not enough RAM?). But anyway, the first thing I would do is resolve the issue you know about, and then yes, look at logging. You can configure syslog-ng to store a copy of /var/log/messages in some other location: look at /etc/syslog-ng.conf and copy the destination block, giving it a different name and a different path. But although the console is so 60s, it’s the best tool devised since then. Logging doesn’t work during the early and late boot/reboot phases. If it seems like all services running on the router died, it is probably because they did, and in that case logging died too, so there will be little to no information in the log. Looking at the console, which is hardware driven, is the best way.

DNS: knot-resolver will probably be killed by filling /tmp, as that’s where its cache is mmap-ed (LMDB) on Omnia. IIRC it will lead to SIGBUS. Details upstream (partially in CZ unfortunately).

This is great. It might well be running out of space in /tmp that is causing all this. Thanks deeply for the config tips, I’ll look into all those.

I’m curious what I’d plug into the serial port as a console. When you do that, is there an inexpensive device you’d recommend? I mean, back in the 80s I’d have taken any randomly available VT100 terminal ;-). Today is a different story and, as I said, it’s so dated I’m not familiar with what I’d use.

We have a whole one line in the documentation about it. And that is enough: https://www.turris.cz/doc/cs/troubleshooting/serial_link
See: http://www.ftdichip.com/Products/Cables/USBTTLSerial.htm
And inexpensive they are. You can get one for about 40Kč from ebay.