Hi I am experiencing random OOM killings crashes. I use HBL branch on TOS 7.0 now. I tried to look on the internet for general info on how to debug OOM crash and I have found only workarounds by clearing cache and stuff not relevant. How may I dig deeper to the real cause?
By now I don’t use many applications that are memory hungry. In normal operating state I have
root@router:~# free
total used free shared buff/cache availab
Mem: 2063488 354976 641040 83732 1067472 14991
Swap: 12582908 0 12582908
I will post a log from dmesg when it happens again. But I couldnt find a pattern. Sometimes its every two days sometimes two weeks uptime and it simply works. I also posted in: Unprovoked OOM mass killing because it was the only thing that was related to the problem in the meantime but MGLRU should be disabled 2 months ago and I still experience OOM’s?
I know HBL is not supported by Turris Support so I ask Community.
I’m going a bit off topic, but available is the interesting metric and the graph above isn’t very useful IMO. For example, /tmp(fs) memory will be counted into cached, but it’s not available for use (though if you have swap, it can get swapped like normal used memory).
Its easier for me to restore uplink just by rebooting. But indeed there was some messages in dmesg. Next it will blow up I will save dmesg and post it here.
[65562.563741] mvneta f1034000.ethernet eth2: Link is Down
[65562.574825] sfp sfp: module transmit fault indicated
[65563.901033] sfp sfp: module transmit fault recovered
[65563.906071] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65565.826141] mvneta f1034000.ethernet eth2: Link is Down
[65565.844573] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65565.888075] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65565.898356] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65566.975551] mvneta f1034000.ethernet eth2: Link is Down
[65567.006071] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65567.050087] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65567.059863] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65568.175397] mvneta f1034000.ethernet eth2: Link is Down
[65568.203978] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65568.249943] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65568.271468] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65569.316967] mvneta f1034000.ethernet eth2: Link is Down
[65569.344401] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65569.391793] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65569.402413] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65570.611083] mvneta f1034000.ethernet eth2: Link is Down
[65570.664375] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65570.708118] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65570.718825] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65571.765190] mvneta f1034000.ethernet eth2: Link is Down
[65571.798280] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65571.844701] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65571.853631] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65572.885231] mvneta f1034000.ethernet eth2: Link is Down
[65572.916059] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65572.959104] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65572.972471] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65574.085054] mvneta f1034000.ethernet eth2: Link is Down
[65574.113590] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65574.157126] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65574.180280] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65575.244995] mvneta f1034000.ethernet eth2: Link is Down
[65575.262749] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65575.306123] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65575.318127] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65576.386129] mvneta f1034000.ethernet eth2: Link is Down
[65576.404710] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65576.448596] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65576.466374] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65577.504983] mvneta f1034000.ethernet eth2: Link is Down
[65577.518225] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65577.562131] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65577.580751] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65578.673795] mvneta f1034000.ethernet eth2: Link is Down
[65578.698130] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65578.742133] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65578.760587] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65579.814734] mvneta f1034000.ethernet eth2: Link is Down
[65579.833663] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65579.878862] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65579.907934] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65581.034394] mvneta f1034000.ethernet eth2: Link is Down
[65581.057063] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65581.100221] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65581.108311] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65582.194777] mvneta f1034000.ethernet eth2: Link is Down
[65582.218014] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65582.261147] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65582.272565] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65583.334444] mvneta f1034000.ethernet eth2: Link is Down
[65583.351512] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65583.395196] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65583.409879] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65584.504344] mvneta f1034000.ethernet eth2: Link is Down
[65584.522677] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65584.566158] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65584.599572] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65585.654408] mvneta f1034000.ethernet eth2: Link is Down
[65585.669509] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65585.713163] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65585.733836] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65586.774176] mvneta f1034000.ethernet eth2: Link is Down
[65586.820191] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65586.864170] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65586.888694] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65587.024335] mvneta f1034000.ethernet eth2: Link is Down
[65587.046489] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65587.090929] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65587.099060] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65588.193956] mvneta f1034000.ethernet eth2: Link is Down
[65588.219131] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65588.263179] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65588.271729] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65589.345547] mvneta f1034000.ethernet eth2: Link is Down
[65589.379953] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65589.423182] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65589.433099] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65590.503766] mvneta f1034000.ethernet eth2: Link is Down
[65590.568019] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65590.611180] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65590.629667] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65591.663691] mvneta f1034000.ethernet eth2: Link is Down
[65591.682233] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65591.726190] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65591.739004] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65592.824067] mvneta f1034000.ethernet eth2: Link is Down
[65592.846717] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65592.890197] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65592.903017] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65593.963523] mvneta f1034000.ethernet eth2: Link is Down
[65593.987744] mvneta f1034000.ethernet eth2: configuring for inband/1000base-x link mode
[65594.031203] mvneta f1034000.ethernet eth2: Link is Up - 1Gbps/Full - flow control rx/tx
[65594.039548] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[65595.174035] mvneta f1034000.ethernet eth2: Link is Down
[65595.203246] netifd: page allocation failure: order:6, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
[65595.213964] CPU: 1 PID: 3080 Comm: netifd Not tainted 5.15.96 #0
After reboot everything is back operational and the crash happens every 2 days or maximally every 2 weeks.
Now when I think of it. Comparing to my second Omnia running as dumb AP (same branch HBL) this doesnt happen. And the only thing that is different we might say is SFP port in use. I really suspect SFP.
I don’t think this would be created by the addition of the support for the HALNy SFP module, because nothing was changed in relation to memory allocation - it’s all already-allocated data structures or static (const) data.
What it looks like to me is a memory allocation problem “netifd: page allocation failure” kind of gives it away - and it’s an order 6 allocation, meaning its a request for 256KiB of contiguous memory. Probably due to memory fragmentation.
I think a question to ask is why the network interface is bouncing up and down - and looking at the backtraces, it looks like something in userspace (netifd?) is responsible for bringing the interface up and down. Certainly, this memory allocation failure backtrace points to netifd requesting that the network interface is brought up.
It does look like mvneta_cleanup_txqs() is failing after mvneta_setup_txqs() fails, which won’t be helping - indeed, looking at the code in mainline, we will end up freeing memory that was previously freed. Here’s a patch for that:
At least that should reduce the number of kernel warnings, possibly down to just the memory allocation failure. If that is down to just memory fragmentation, then I think the only solution is to check that one has CMA enabled and has enough CMA to satisfy mvneta in the presence of fragmentation.
Thanky you rmk for the insight. What I noticed is when the crash happens the LED of WAN interface is not blinking. Need to wait it happens again to verify if the interface is flapping between up/down.
Edit:
AFAICT the crash was not there when I switched to HBL so it must be a regression in recent kernels. Could someone from Turris Team apply the patch in HBL? I will test it for some time and report back.
@mvasilek / @miska Could you apply the patch? Crashes happen every two days and it’s anoyying and posibble fix is ready. I would suggest to first apply the patch. I will report if mvneta is missing CMA and if so then lets enable it in kernel.
But this could affect MOX and Turris1.x as well. Thats why there is HBL to test stuff. Thanks
Edit: I reported issue on gitlab. Maybe like that devs will see it:
I’ve sent a RFC patch series to netdev that removes the order-6 allocation entirely - it’s not a simple fix so its going to be targetted for net-next. However, getting some testing from someone with a reliable reproducer would be great.