Random kernel crashes on mvneta [HBS/HBL]

Any news from the Turris side on my proposed patch set?

I will paste here what I just stated on netdev concerning my proposed patch set. This is aimed more at those who provide the kernels. Note that the merge window likely opens this weekend, which means if the patch set isn’t in net-next by then, it won’t hit mainline for another three months - and it won’t get backported to stable kernels either.

Hi,

I think the Turris folk are waiting for me to get this into the kernel
and backported to stable before they merge it into their tree and we
therefore end up with it being tested.

We are now at -rc7, and this series is in danger of missing the
upcoming merge window.

So, I think it’s time that I posted a wake-up call here to say that no,
that’s not going to happen until such time that we know whether these
patches solve the problem that they identified. I’m not bunging patches
into the kernel for problems people have without those people testing
the proposed changes.

I think if the Turris folk want to engage with mainline for assistance
in resolving issues, they need to do their part and find a way to
provide kernels to test out proposed fixes for their problems.

Well, I moved back to the stable branch, but I could easily revert to the snapshot that had this crash and test. I won’t compile a kernel on my own just to test that, though. The question is whether the Turris team could reproduce the problem, or whether it was some misconfiguration on my side.

Edit: In other words I am willing to test further.

Hi,

thank you very much for debugging this issue. Just to post an update: we are not waiting, I’m working on backporting it to a testing branch. I started with your original patch, but then scrapped it, and I am now working on the patch set you sent to net-next. It took some time, as fixing 6.3.1 was more important and then came the Easter holiday. But I hope to have something ready to test by the end of the week.

Sorry, but that’s too late to get it into the net-next tree for the opening of the merge window.

Also, there are two problems here:

  1. The kernel warnings. This is what my “original” patch addresses.
  2. The memory allocation failure. This is what the later patch series addresses.

Dropping one or the other is going to leave one of these problems unsolved. You need both of them - and in fact the series I posted depends on the original patch already being in place.

To be clear: all the patches I’ve mentioned are required.

@rmk I’ve backported 7 patches: the RFC series (5 patches), the “fix potential double-frees” patch, and another patch that is a dependency for the series.

It is now available in the crashlab branch. You can test it using switch-branch --force crashlab.

Thanks @miska for getting that sorted so quickly!

All credit goes to @mbehun, who finished the rebase.

I switched back to HBS (reflash), and this morning I was greeted with the same behaviour: no link on the WAN interface. I didn’t look at the log because I needed connectivity, so I just rebooted. But I suspect it is the same issue on HBS; I will have to confirm that when it happens again. I am just surprised I am the only one who has reported this. @miska Were you able to reproduce the issue in your lab?

Something similar happened to me years ago. But not in the last year or two. My suspicion at that time was that it was caused by link flapping at the provider’s side. But I can’t help with it now…

And just to add - Marek Behun contacted me last week asking for help with verifying the crashlab branch patch. So CZ.NIC is actually trying hard to resolve it.

So tonight the crash also happened on the HBS branch. I have saved the dmesg output and will post it here once I have cleaned the personal info out of it, if it contains any more information.

Edit: So there is nothing new. The interface is flapping between up and down, and then at the end I got this:

[361856.476910] mvneta f1034000.ethernet eth2: mvneta_setup_txqs: can't create txq=7
[361857.576233] mvneta f1034000.ethernet eth2: mvneta_setup_txqs: can't create txq=7
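
For anyone who wants to check a saved dmesg dump for the same symptom, here is a minimal sketch of a filter. It is only modeled on the two log lines quoted above and assumes each message sits on a single line; the function name `find_txq_failures` and the regular expression are illustrative assumptions, not anything taken from the driver or from Turris tooling.

```python
import re

# Pattern modeled on the two log lines quoted in this thread (an assumption:
# other kernel versions may word or format the message differently).
TXQ_FAIL = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+mvneta\s+(?P<dev>\S+)\s+(?P<iface>\S+):\s+"
    r"mvneta_setup_txqs: can't create txq=(?P<txq>\d+)"
)


def find_txq_failures(lines):
    """Return a list of (timestamp, interface, txq) for each matching line."""
    hits = []
    for line in lines:
        match = TXQ_FAIL.match(line.strip())
        if match:
            hits.append(
                (float(match.group("ts")), match.group("iface"), int(match.group("txq")))
            )
    return hits
```

Feeding it the lines of a saved log (for example `open("dmesg.txt").read().splitlines()`, where the file name is hypothetical) returns one tuple per failed queue setup, which makes it easier to see how often the failure repeats and on which interface.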

Have you tried switching to crashlab? It seems to me to give no problems with high uptime on all branches.

Can you please try the crashlab branch?

I am using it and I don’t have kernel crashes (actually I don’t have them on any branch) and I have an Omnia 2020. The only thing on TurrisOS 7.x is that Sentinel does not collect data, but you probably already knew that.

And did you have kernel crashes before?

We need someone who encountered this bug to let us know if this solution fixes the bug.

Well, I could try crashlab, yes. But only to test it out, and if it works then let’s move it forward to the stable branches. I am concerned about security on test branches. But yeah.

I will test it out when possible and let you know whether the patches are working.

@AreYouLoco if possible, please try to reproduce the issue on crashlab to see whether the patches actually help.

I switched to crashlab. I will let it run for a week or two and see if I can reproduce the issue. The update is installing. I hope I will at least have network connectivity.

HBL was the furthest I had gone into the future until now.

Edit: uptime 5 days on crashlab… so far so good

I updated the PR with the comment:

@rmk Would you mind taking a look? I now have an uptime of 10 days and it didn’t crash.