Random kernel crashes on mvneta [HBS/HBL]

rmk · April 11, 2023, 3:45pm

Any news from the Turris side on my proposed patch set?

rmk · April 11, 2023, 3:53pm

I will paste here what I just stated on netdev concerning my proposed patch set. This is aimed more at those who provide the kernels. Note that the merge window likely opens this weekend, which means if the patch set isn’t in net-next by then, it won’t hit mainline for another three months - and it won’t get backported to stable kernels either.

Hi,

I think the Turris folk are waiting for me to get this into the kernel
and backported to stable before they merge it into their tree and we
therefore end up with it being tested.

We are now at -rc7, and this series is in danger of missing the
upcoming merge window.

So, I think it’s time that I posted a wake-up call here to say that no,
that’s not going to happen until such time that we know whether these
patches solve the problem that they identified. I’m not bunging patches
into the kernel for problems people have without those people testing
the proposed changes.

I think if the Turris folk want to engage with mainline for assistance
in resolving issues, they need to do their part and find a way to
provide kernels to test out proposed fixes for their problems.

AreYouLoco · April 11, 2023, 4:24pm

Well, I moved back to the stable branch, but I could easily revert to the snapshot that had this crash and test it. I won’t compile a kernel on my own just for that, though. The question is whether the Turris team could reproduce the problem, or whether it was some misconfiguration on my side.

Edit: In other words I am willing to test further.

miska · April 12, 2023, 7:49am

Hi,

thank you very much for debugging this issue. Just to post an update: we are not waiting; I’m working on backporting it to a testing branch. I started with your original patch, but then scrapped it, and I am now working on the patchset you sent to net-next. It took some time, as fixing 6.3.1 was more important and then there was the Easter holiday. But I hope to have something ready to test by the end of the week.

rmk · April 12, 2023, 8:11am

Sorry, but that’s too late to get it into the net-next tree for the opening of the merge window.

Also, there are two problems here:

1. The kernel warnings. This is what my “original” patch addresses.
2. The memory allocation failure. This is what the later patch series addresses.

Dropping one or the other is going to leave one of these problems unsolved. You need both of them - and in fact the series I posted depends on the original patch already being in place.

To be clear: all the patches I’ve mentioned are required.

mbehun · April 12, 2023, 11:49am

@rmk I’ve backported 7 patches: the RFC series (5 patches), the “fix potential double-frees” patch, and another patch that is a dependency for the series.

miska · April 12, 2023, 1:19pm

It is now available in the crashlab branch. You can test it using switch-branch --force crashlab.

rmk · April 12, 2023, 3:15pm

Thanks @miska for getting that sorted so quickly!

miska · April 14, 2023, 7:41am

All credit goes to @mbehun, who finished the rebase.

AreYouLoco · April 17, 2023, 12:32pm

I switched back to HBS (reflash), and this morning I was greeted with the same behaviour: no link on the WAN interface. I didn’t look at the log because I needed connectivity, so I just rebooted. But I suspect it is the same issue on HBS; I will have to confirm that when it happens again. I am just surprised I am the only one who has reported this. @miska Were you able to reproduce the issue in your lab?

peci1 · April 17, 2023, 6:42pm

Something similar happened to me years ago. But not in the last year or two. My suspicion at that time was that it was caused by link flapping at the provider’s side. But I can’t help with it now…

And just to add - Marek Behun contacted me last week asking for help with verifying the crashlab branch patch. So CZ.NIC is actually trying hard to resolve it.

AreYouLoco · April 23, 2023, 9:11am

So tonight the crash also happened on the HBS branch. I have saved the dmesg output and will post it here once I have cleaned out personal info, if it contains more information.

Edit: So there is nothing new. Interface is flapping between up/down and then at the end I got this:

[361856.476910] mvneta f1034000.ethernet eth2: mvneta_setup_txqs: can't create txq=7
[361857.576233] mvneta f1034000.ethernet eth2: mvneta_setup_txqs: can't create txq=7
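
For anyone who wants to check a saved dmesg for the same failure, here is a minimal sketch in Python. It only looks for the message format shown in the log above, and it assumes each kernel message sits on a single, unwrapped line. The function name `find_txq_failures` is made up for this example and is not part of any Turris or kernel tooling.

```python
import re

# Matches lines such as:
# [361856.476910] mvneta f1034000.ethernet eth2: mvneta_setup_txqs: can't create txq=7
# The pattern is derived only from the two log lines quoted above.
TXQ_FAILURE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+mvneta\s+\S+\s+(?P<iface>\S+):\s+"
    r"mvneta_setup_txqs: can't create txq=(?P<txq>\d+)"
)


def find_txq_failures(dmesg_text):
    """Return a list of (timestamp, interface, txq) for each matching line."""
    failures = []
    for line in dmesg_text.splitlines():
        match = TXQ_FAILURE.match(line.strip())
        if match:
            failures.append(
                (float(match.group("ts")), match.group("iface"), int(match.group("txq")))
            )
    return failures
```

To use it, read the saved log into a string and pass it to `find_txq_failures`; an empty list means no matching lines were found, which does not by itself prove the problem is absent.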

lucenera · April 23, 2023, 8:34pm

Have you tried switching to crashlab? It seems to me to give no problems with high uptime on all branches.

mbehun · April 25, 2023, 8:55am

Can you please try the crashlab branch?

lucenera · April 25, 2023, 9:41am

I am using it and I don’t have kernel crashes (actually I don’t have them on any branch) and I have an Omnia 2020. The only thing on TurrisOS 7.x is that Sentinel does not collect data, but you probably already knew that.

mbehun · April 25, 2023, 9:59am

And did you have kernel crashes before?

We need someone who encountered this bug to let us know if this solution fixes the bug.

AreYouLoco · April 25, 2023, 2:14pm

Well, yes, I could try crashlab, but only to test it out; if it works, then let’s move it forward to the stable branches. I am concerned about security on test branches. But yeah.

I will test it when possible and let you know whether the patches work.

mbehun · April 25, 2023, 3:38pm

@AreYouLoco if possible, please try to reproduce the issue on crashlab to see whether the patches actually help.

AreYouLoco · April 25, 2023, 4:26pm

I switched to crashlab. I will let it run for a week or two and see if I can reproduce the issue. The update is installing. I hope I will at least have network connectivity.

HBL was the furthest I had gone into the future until now.

Edit: uptime 5 days on crashlab… so far so good

AreYouLoco · May 6, 2023, 9:42am

I updated the PR with the comment:

github.com/openwrt/openwrt

WIP: kernel: Backport mvneta crash fix to 5.15

openwrt:master ← elkablo:mvneta-random-crash-fix
opened 11:42AM - 12 Apr 23 UTC by elkablo (+567 -12)

Backport Russell King's series [1] "net: mvneta: reduce size of TSO header allocation" to pending-5.15 to fix random crashes on Turris Omnia. This also backports two patches that are dependencies to this series:

net: mvneta: Delete unused variable
net: mvneta: fix potential double-frees in mvneta_txq_sw_deinit()

[1] https://lore.kernel.org/netdev/ZCsbJ4nG+So%2Fn9qY@shell.armlinux.org.uk/

@rmk Would you mind taking a look? I now have an uptime of 10 days without a crash.