Router is inaccessible: Memory Leak in kresd?

Hi. I have been using a Turris Omnia since the beginning of this year and am generally very satisfied with it.

Recently (since the latest or the first November patch), however, the router regularly stops accepting SSH connections and returns a 500 error on all web frontends. When this happens I cannot access the logs, and only a reboot helps.

While monitoring manually I have seen the memory usage of kresd grow: overnight its memory consumption increased from ~5% to ~53%. This fits the observation that the router needs to be restarted regularly.
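The manual monitoring described above could be automated with a small script. This is only a minimal sketch, assuming Python 3 is available on the router and that the resolver shows up under the process name kresd in /proc; the function names are hypothetical and not part of any Turris tooling.

```python
import os
import time


def parse_kib(proc_text, field):
    """Return the KiB value of a field such as 'VmRSS' or 'MemTotal', or None if absent."""
    for line in proc_text.splitlines():
        if line.startswith(field + ":"):
            # Typical line: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def find_pids(name, proc_root="/proc"):
    """Return the PIDs whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return pids


def rss_percent(name="kresd", proc_root="/proc"):
    """Resident memory of all processes called `name`, as a percentage of total RAM."""
    with open(os.path.join(proc_root, "meminfo")) as f:
        total_kib = parse_kib(f.read(), "MemTotal")
    rss_kib = 0
    for pid in find_pids(name, proc_root):
        try:
            with open(os.path.join(proc_root, str(pid), "status")) as f:
                rss_kib += parse_kib(f.read(), "VmRSS") or 0
        except OSError:
            continue
    return 100.0 * rss_kib / total_kib


def log_samples(count, interval_s=600, name="kresd"):
    """Print `count` timestamped samples, one every `interval_s` seconds."""
    for _ in range(count):
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "%.1f%%" % rss_percent(name))
        time.sleep(interval_s)
```

Calling log_samples(144) would record one sample every ten minutes for a day. Redirecting the output to a file on persistent storage would keep the history across a reboot, which matters here because the logs become unreachable once the router stops responding.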

A colleague of mine also uses this router and does not have this issue. We both have the following installed:
knot-libs - 2.9.6-1
knot-libzscanner - 2.9.6-1
knot-resolver - 5.1.3-3

Could this be a memory leak in kresd? How can I diagnose it further?

//edit: device information
Device: Turris Omnia
Turris OS version: 5.1.4
Turris OS branch: HBS
Kernel version: 4.14.206

53% is certainly some bug in knot-resolver (unless you sustain a huge DNS throughput; it might also be a problem in some library). I can’t remember anyone experiencing such leaks, and I expect it will be very difficult to track down unless we can reproduce them locally.

During November the version of knot-resolver was unchanged (in HBS). It’s possible that something in your traffic patterns or “DNS content” has changed to trigger the problem. The only change I see as possibly relevant is a reduction of the message size limit above which TCP is used instead of UDP. That’s configurable, so you could try setting the config back to net.bufsize(4096), though I don’t expect that will significantly help to discover the exact cause anyway.

In any case, I’ll try to reproduce it. In the meantime you’ll probably want to deploy some kind of mitigation. For example, if you’re an advanced user, you could run /etc/init.d/resolver restart periodically from cron (the cache is kept by default, so it should be very cheap).
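As a variation on the plain periodic restart, a cron job could restart the resolver only once its memory use crosses a threshold. This is a rough sketch, assuming Python 3 is installed on the router; the 300 MiB limit and the function names are made up for illustration, and only the restart command itself comes from the suggestion above.

```python
import glob
import subprocess

RESTART_CMD = ["/etc/init.d/resolver", "restart"]  # command from the post above
LIMIT_KIB = 300 * 1024  # assumed threshold (300 MiB); tune for your device


def kresd_rss_kib(proc_root="/proc", name="kresd"):
    """Sum VmRSS (in KiB) over all processes named `name`."""
    total = 0
    for status_path in glob.glob(proc_root + "/[0-9]*/status"):
        try:
            with open(status_path) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # the process went away while we were reading
        if fields.get("Name", "").strip() == name and "VmRSS" in fields:
            # Value looks like "   123456 kB"
            total += int(fields["VmRSS"].split()[0])
    return total


def restart_if_over(rss_kib, limit_kib=LIMIT_KIB, run=subprocess.call):
    """Run the restart command when rss_kib exceeds limit_kib; return True if it ran."""
    if rss_kib > limit_kib:
        run(RESTART_CMD)
        return True
    return False
```

A script run from cron every few minutes would then just call restart_if_over(kresd_rss_kib()). The run parameter exists only so the logic can be tried out without actually restarting anything.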

What we actually managed to reproduce are leaks when running a DoH server with knot-resolver 5.2.x (on x86), but so far we’re not even sure whether the two are related (and DoH is not served by default on Turris).

EDIT: main sources of the DoH leaks should be fixed in https://gitlab.nic.cz/knot/knot-resolver/-/merge_requests/1117