Router reboots caused by memory fragmentation


#1

I have had problems with spontaneous router reboots every 1-3 days since we changed ISPs. Almost every time, the internet connection dropped shortly before the reboot. Once, with some luck, I was able to read the router logs just before the reboot happened, and I found this very interesting entry:

2018-03-04T02:55:04+01:00 warning kernel[]: [86429.609956] netifd: page allocation failure: order:5, mode:0x24000c0

This means that the netifd process requested a memory allocation (if I am not wrong, order 5 means 2^5 = 32 contiguous 4 kB pages, i.e. 128 kB), but this request failed. As the system had lots of free memory:

Normal free:17796kB min:3504kB low:4380kB high:5256kB active_anon:26476kB inactive_anon:28948kB active_file:271888kB inactive_file:258968kB .…

the reason is NOT memory exhaustion. The only remaining candidate is memory fragmentation. The following log entries confirm this hypothesis, since the Normal zone has no free block of 128 kB or larger:

2018-03-04T02:55:04+01:00 warning kernel[]: [86429.610316] Normal: 1176*4kB (UME) 576*8kB (UME) 412*16kB (UME) 50*32kB (UM) 6*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17888kB
2018-03-04T02:55:04+01:00 warning kernel[]: [86429.610337] HighMem: 98*4kB (UM) 43*8kB (UM) 8*16kB (UM) 6*32kB (UM) 3*64kB (M) 5*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1888kB
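
The reasoning above can be checked with a short script. This is only a sketch: it assumes 4 kB pages (consistent with the smallest block size in the log), and the helper names order_to_kb, parse_free_blocks and can_satisfy are made up for illustration. It parses the Normal line from the log and checks whether any free block is large enough for an order-5 request.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, matching the smallest block size in the log


def order_to_kb(order, page_kb=PAGE_KB):
    """Size in kB of a contiguous allocation of 2**order pages."""
    return page_kb * (1 << order)


def parse_free_blocks(line):
    """Return {block_size_kb: count} from a 'Normal: 1176*4kB (UME) ...' log line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def can_satisfy(blocks, order):
    """True if at least one free block is big enough for the requested order."""
    need_kb = order_to_kb(order)
    return any(count > 0 for size, count in blocks.items() if size >= need_kb)


normal = ("Normal: 1176*4kB (UME) 576*8kB (UME) 412*16kB (UME) 50*32kB (UM) "
          "6*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17888kB")
blocks = parse_free_blocks(normal)
total_free_kb = sum(size * count for size, count in blocks.items())

print(order_to_kb(5))          # 128
print(total_free_kb)           # 17888, same as the total in the log line
print(can_satisfy(blocks, 5))  # False: plenty of free memory, but only in small pieces
```

So about 17 MB is free in the Normal zone, but the largest free block is 64 kB, which is too small for a 128 kB request.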

I found out that there was a similar issue on Arch Linux ARM (https://github.com/archlinuxarm/PKGBUILDs/pull/630), and the solution is to enable the following kernel compile options:

CONFIG_COMPACTION=y
CONFIG_SLUB=y
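
To see whether a given kernel already has these two options set, something like the following sketch could be used. It assumes the running kernel exposes its configuration at /proc/config.gz, which is only the case if it was built with CONFIG_IKCONFIG_PROC; otherwise the same check can be run on the .config file from the kernel build. The function names and the sample config text are made up for illustration.

```python
import gzip
import os

WANTED = ("CONFIG_COMPACTION", "CONFIG_SLUB")


def enabled_options(config_text, wanted=WANTED):
    """Return {option: True/False} depending on whether 'OPTION=y' is present."""
    lines = {line.strip() for line in config_text.splitlines()}
    return {name: f"{name}=y" in lines for name in wanted}


def read_running_config(path="/proc/config.gz"):
    """Return the running kernel's config text, or None if it is not exposed."""
    if not os.path.exists(path):
        return None
    with gzip.open(path, "rt") as fh:
        return fh.read()


# Illustrative sample only, not taken from a real Turris kernel config.
sample = "CONFIG_COMPACTION=y\n# CONFIG_SLUB is not set\nCONFIG_SLAB=y\n"
print(enabled_options(sample))

running = read_running_config()
if running is not None:
    print(enabled_options(running))
```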

I have recompiled and installed the kernel with these options, and my router has been running stably for months!

Initially I informed support about the issue (ticket #001486), and later, once I had found the solution, I requested a kernel change with these options. Unfortunately, I did not receive any reply from Turris support to my request.

To everybody with a self-rebooting Turris: this small patch can hopefully help you. Send an email to support with a link to this post and request this small kernel change.


#2

Good job! But one question remains. You said the change point was the ISP change. I have no idea how a new ISP could trigger the problems you mentioned.
What were the connection type and config with each ISP, before and after?


#3

Good catch! No idea why OpenWRT has those disabled by default. I have enabled them in nightly now, and unless it breaks something else, they will be part of the next release.


#4

My previous ISP used a cable connection. The cable modem simply exposed the external IP on its Ethernet interface (very useful), so the Turris was used as a simple router without any need to dial anything. Then we moved, and I did not use the Turris for 2-3 months. The new provider uses VDSL, so the Turris is now configured to dial PPPoE over a VDSL modem.

I don’t think the ISP change triggered this problem; more likely it was some software update during the offline time. But I am not sure and cannot prove anything.


#5

Thank you for accepting the proposed options. I am happy to hear that they will be part of the standard kernel (if everything goes right).


#6

I still experience this problem in 3.11.2, but only if the Turris Omnia performs the PPPoE dial on WAN. Routers with 3.11.2 where the cable modem handles the dial-in run stably and are not affected.

Are the proposed options part of the standard kernel in 3.11.2?