Reboot loop when starting

lbschenkel · February 20, 2017, 8:39am

This weekend I rebooted my router, and during start-up it would reboot itself again. The LAN/wireless are brought up and I can ssh in, but around 5-10 seconds after that the router reboots. On the next boot the same thing happens. This is the exact problem that has been reported here:

By the way, the router is fully up-to-date: TurrisOS 3.5.3 with Kernel 4.4.39-80079e1c1e5f9ca7ad734044462a761a-4.

I connected via serial to inspect the boot process. While inspecting the output I see nothing unusual until the point in which the router reboots without warning. No message is ever logged. I had to do a full factory reset to be able to recover.

Once I reconfigure the router I reboot and it starts happening again. I had to do a new factory reset. Given the symptoms and the reboot without warning, I naturally suspect the watchdog. From what I can see everything is fine:

orion_wdt module is enabled (via lsmod) and I can see orion_wdt: Initial timeout 171 sec in dmesg
/dev/watchdog exists and is open by procd (checked via lsof)
/etc/init.d/watchdog_adjust is started at boot time (when running it again I can see the same settings were already applied, 2 second frequency and 12 seconds timeout)
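
The checks above could be scripted roughly as follows. This is only a sketch: the function name wdt_initial_timeout is made up for illustration, and it only understands the orion_wdt line format quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the initial
# watchdog timeout in seconds, based on a line of the form
# "orion_wdt: Initial timeout 171 sec". Prints nothing if no such line exists.
wdt_initial_timeout() {
    sed -n 's/.*orion_wdt: Initial timeout \([0-9][0-9]*\) sec.*/\1/p' | head -n 1
}

# On the router itself the three checks would look something like this
# (left as comments because they only make sense on the device):
#   lsmod | grep orion_wdt           # is the driver loaded?
#   dmesg | wdt_initial_timeout      # timeout reported by the driver
#   lsof | grep /dev/watchdog        # which process holds the device open?
```

An empty result from wdt_initial_timeout would suggest the driver never printed its start-up line.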

Once more, I configure the router and the reboot loop comes back. At this point I start trying to do some experimentation and I found out that if I boot the router with the cable disconnected from the WAN port, then the router starts up successfully and does not reboot! If I connect it later then everything works fine, no spontaneous reboots.

Now with this new knowledge I try out something else: I set network.wan.auto='0' in order to avoid the WAN being brought up at boot, leave the cable connected to the WAN port and try out rebooting. The router starts fine, no spontaneous reboots. Rebooted many times to confirm and it’s 100% reproducible — always successful, no reboots.

Now I try out bringing up the WAN at the end of the start-up: I put ifup wan in /etc/rc.local. The router still starts up successfully. No reboots. Again, repeated this many times and again it’s 100% reproducible.
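
For anyone wanting to reproduce the workaround from the last two paragraphs, a minimal sketch follows. Both function names are hypothetical, and the assumption that rc.local ends with a line reading exactly "exit 0" is mine; the script appends at the end of the file if that line is missing.

```shell
#!/bin/sh
# Sketch only; both function names are hypothetical.

# Stop the WAN from being brought up at boot (run on the router, needs uci).
disable_wan_autostart() {
    uci set network.wan.auto='0' && uci commit network
}

# Add "ifup wan" to an rc.local-style file: before the first "exit 0" line if
# there is one, otherwise at the end. Does nothing if the line already exists.
add_ifup_to_rc_local() {
    rc="${1:-/etc/rc.local}"
    grep -qx 'ifup wan' "$rc" && return 0
    awk '/^exit 0$/ && !seen { print "ifup wan"; seen = 1 }
         { print }
         END { if (!seen) print "ifup wan" }' "$rc" > "$rc.tmp" &&
        cat "$rc.tmp" > "$rc" && rm -f "$rc.tmp"
}
```

On the router this would be used as `disable_wan_autostart && add_ifup_to_rc_local`, followed by a reboot. Setting network.wan.auto back to '1' and removing the rc.local line undoes it.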

Just as a sanity check, once more I configure the WAN interface to start at boot. The reboot loop comes back — a few seconds after networking is brought up, it reboots. Now that I know how to stop it, I disconnect the cable from the WAN port and the reboot loop stops.

Still, the only explanation that I can think of is the hardware watchdog and somehow procd is not pinging it early enough at startup, and the router reboots. Moving the WAN initialization to the end of the startup process somehow fixes/works around it.

So, as an experiment I try editing /etc/init.d/watchdog_adjust and bumping the timeout to 120 seconds. Unfortunately that did not solve the problem: when the WAN is enabled at boot and the cable is connected, the reboot loop happens again.

I’m running out of ideas here. I still believe it must be some timing problem and some weird interaction between procd and the watchdog. As I understand, U-Boot enables the watchdog with a 120 second timeout before booting Linux. Is it possible to disable the watchdog on the U-Boot command line before booting? I would like to try that next.

Is there somebody at Omnia (hopefully an engineer) who could help me troubleshoot this? I can survive for now with the hack of starting the WAN in /etc/rc.local, but I would really like to get to the bottom of this. It should not be happening. I wonder if I’m the only one experiencing this.

Pepe · February 20, 2017, 9:28am

Hello
I’m afraid in this case you will need to email them at tech.support(at)turris(dot)cz.

It will be faster and they will reply to you ASAP.
Right now they don’t really support the forum.
Sometimes they reply here, but they do it in their free time, which I really appreciate.
I tried to convince Vaclav (Turris Community Manager) to offer official support here, but without success so far. Maybe in the future they will have official support on this forum, too.

brill · February 20, 2017, 12:56pm

Hello! Thanks for the analysis. The problem with the U-Boot watchdog is that enabling it is hard-coded into the startup sequence and cannot be turned off without recompiling U-Boot (I know, I know… Shame on me for not putting this option into the production U-Boot.)

I might try to add a “disablewatchdog” option to U-Boot, recompile and test it, and send the image to you. You should be able to update U-Boot with mtdwrite from TurrisOS. (I’ll have to check and try it, because I normally write new U-Boot images to SPI flash from U-Boot itself, but that is less convenient for a one-off test.)

In the meantime I think it might be interesting to consult @miska, who might know more about procd and the system init.

Btw. there are two watchdogs, as you might know from the U-Boot startup. The first one is the MCU watchdog running in the STM32F0 chip that manages power regulators etc. It has a 120-second timeout and we use it in the first stage of initialization to protect us from U-Boot hangs. Then U-Boot starts the CPU watchdog and stops the MCU watchdog after relocation. I would say that the problem lies in pinging the CPU watchdog from procd and it hopefully does not involve the MCU watchdog…

miska · February 20, 2017, 2:40pm

Hmmm, procd should handle the watchdog by itself, and watchdog_adjust is set to fire before the network is set up, so everything should be fine by that time already. When calling ubus to set the timeout, does it report the watchdog as running? How is your WAN configured? Simple dhcp? dhcpv6?

lbschenkel · February 20, 2017, 4:15pm

OK, I did some more troubleshooting and we can discard the watchdog explanation. The reason for that is that I just noticed the following in the serial output when the router reboots:

2017-02-20T15:48:05+01:00 info procd[]: - shutdown -
[   42.671714] reboot: Restarting system

Which is an orderly shutdown and cannot happen if the watchdog reboots the machine. I even tested myself by stopping the procd watchdog via ubus call system watchdog '{ "stop": true }' and I could see the unclean reboot.
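
The distinction above (an orderly shutdown leaves "reboot: Restarting system" in the serial output, a watchdog reset leaves nothing) can be checked mechanically on a capture. A small sketch, where classify_reboot is a hypothetical name and the input is assumed to be the serial output of a single boot:

```shell
#!/bin/sh
# Hypothetical helper: read the serial capture of one boot on stdin and print
# "orderly" if the kernel logged a normal restart, "abrupt" otherwise.
classify_reboot() {
    if grep -q 'reboot: Restarting system'; then
        echo orderly
    else
        echo abrupt
    fi
}
```

For example, `classify_reboot < boot.log`, with boot.log standing in for whatever file the capture was saved to. An "abrupt" result only says the restart message is missing; it does not prove the watchdog was responsible.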

I also realized that once the router is in the reboot loop it actually succeeds once in a while (around 10/20 reboots). So, I did the following: changed syslog-ng.conf to write to /dev/ttyS0 and changed the priority to 01 so it starts very early in the boot process. Then I started capturing the output via minicom and let it reboot in a loop until it eventually started up successfully. I kept the output of the last two boots: the one before last (which failed) and the last one (which succeeded). I did not touch the router during the process besides turning it on and waiting until I got a successful startup.

The capture has quite some output, and I’m not sure if I should post it here. Let me know if I should send via e-mail to any of you guys or post to the forum.

Now I’m assuming that some script or daemon is failing and triggering a reboot. I examined the serial capture more than once and I cannot find anything out of the ordinary. I grepped most of the filesystem trying to figure out if there was any script trying to run reboot, shutdown or halt on failure but I couldn’t find any.
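
A search like the one described above might look as follows. This is a sketch rather than the exact command that was used; find_reboot_callers is a hypothetical name, and it only matches the three command names as whole words.

```shell
#!/bin/sh
# Hypothetical helper: list files under the given directories that mention
# reboot, shutdown or halt as a whole word.
# -r recurse, -l print file names only, -w whole words, -E extended regex.
find_reboot_callers() {
    grep -rlwE 'reboot|shutdown|halt' "$@" 2>/dev/null
}
```

On the router this could be run as `find_reboot_callers /etc /usr/lib /lib`. The directory list is just a starting point, and the search will not notice a reboot that is triggered without calling one of these commands by name.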

The scenario is still the same, though: when trying to bring up the WAN during boot, the router triggers a reboot (except for the eventual case when it manages to succeed). If I leave the cable unplugged or disable the autostart for the WAN and start it manually in /etc/rc.local, it always works — I have never experienced a reboot in these conditions.

Please let me know what the next step is.

lbschenkel · February 20, 2017, 4:25pm

Regarding your questions about my WAN, here is the config:

config interface 'wan'
        option ifname 'eth1'
        option proto 'dhcp'
        option auto '1'

config interface 'wan6'
        option ifname '@wan'
        option proto 'dhcpv6'
        option reqaddress 'try'
        option reqprefix 'auto'

My ISP does not provide IPv6. I do use an IPv6 tunnel but every other WAN interface is disabled for now so I can narrow down what the issue is.

lbschenkel · February 20, 2017, 6:57pm

You won’t believe this. I noticed the following lines in the serial output:

debug procd[]: Triggering reboot
debug procd[]: Shutting down system with event 1234567
info procd[]: - shutdown -
debug procd[]: running /etc/rc.d/K* shutdown
procd[]: Triggering reboot

By inspecting procd source code and considering the order of the messages above, I managed to take an educated guess that procd is rebooting because it’s getting a SIGINT or SIGTERM. To get more insight about what was going on, I renamed /sbin/procd to /sbin/procd.real and created a script in its place:

#!/bin/sh
export PROCD_DEBUG=9
exec /sbin/procd.real -d9 "$@"

To be honest that did not help a lot, except that I could confirm that nothing was sending a shutdown command to procd via ubus. But then I realized that most of the reboots happened soon after ddns scripts failed (because name resolution was not working yet at that point). I decided to check the implementation of ddns-scripts, specifically /usr/lib/ddns/dynamic_dns_functions.sh, and saw that there’s a lot of logic involving killing ‘dead’ scripts. I reasoned that it is not completely impossible that there’s a race condition: the service and/or the script fails, becomes an orphan/zombie, and the signal ends up being sent to procd instead. (Maybe that’s impossible, but it is the reasoning that led me down this path.) To check it out, I edited the file, prepended logger to all kill commands and rebooted…

To my surprise, the boot loop didn’t happen! I saw many kills in the log but the boot succeeded. I rebooted many, many more times and did not experience the reboot loop again. Then I restored the original script and the reboot loop came back. If I disable the service, no more reboot loops.
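
A less invasive way to get the same logging would be a small wrapper instead of editing every kill line by hand. This is only a sketch: logged_kill and the ddns-kill tag are hypothetical and not part of ddns-scripts, and the extra logger call changes timing just like the manual edit did.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of ddns-scripts: record every kill in syslog
# (calling shell's PID plus the exact arguments) before actually sending it.
logged_kill() {
    logger -t ddns-kill "pid $$ running: kill $*"
    kill "$@"
}
```

Replacing `kill` with `logged_kill` in the script under test and then searching the log for the ddns-kill tag shows which PIDs were targeted. A target of 1 would point at procd.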

I would never ever have suspected ddns-scripts.

Maybe I’m wrong about the root cause, maybe it’s really something else and I just disturbed the timing that triggered it. Anyhow, I can survive for now without the ddns service so I’ll run it for some time to check if the set-up is stable now. I’ll let you know if there are any new developments.

Thanks to the members of the Omnia team for chiming in, and let me know if you still want any of the data that I have captured.