Kresd eats file handles for breakfast, lunch and dinner

Hi!

I had to restart kresd for the third time this week, and this time I was able to get some clue about what is going on.

The behavior of the TO is the following:

  • The TO stops answering DNS requests, returning SERVFAIL on a test with dig (see the debug output below)
  • The messages file shows resolver errors
  • Logging in via SSH and taking a look at kresd with lsof shows that kresd uses all 1024 file handles (a quick way to check this is sketched after the list)
  • Restarting resolves the situation for now, but it is only a matter of days until it breaks again…
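
A quick way to check this yourself; a minimal sketch, assuming a single kresd process and the usual BusyBox tools on the router:

# count the open file descriptors of the running kresd;
# around 50 is normal here, close to the 1024 limit means it has leaked again
ls /proc/"$(pgrep kresd)"/fd | wc -l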

I have added the debug output below.

I am not sure whether this is related to "DNS broken after factory reset and update to 3.11.1" or "Repeated problems with kresd"; feel free to merge if appropriate.

cheerio
Steve

Debug output

Testing with dig

steve@steve-ThinkPad-T560:~$ dig heise.de @192.168.1.1

; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> heise.de @192.168.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 55512
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;heise.de.			IN	A

;; Query time: 0 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sun Jan 06 19:36:53 CET 2019
;; MSG SIZE  rcvd: 37

Errors in messages:

2019-01-06 19:43:52 err ucollect[5035]: Failed to resolve uplink api.turris.cz:5679: Try again
2019-01-06 19:43:57 err ucollect[5035]: Failed to resolve uplink api.turris.cz:5679: Try again

lsof output:

 3574 root      1108 S    grep kresd
27293 root      116m S    /usr/bin/kresd -c /tmp/kresd.config -f 1 /tmp/kresd -a 0.0.0.0 53 -a :: 53 -k /etc/root.keys
root@firefly:~# lsof -p 27293
COMMAND   PID USER   FD      TYPE     DEVICE SIZE/OFF     NODE NAME
kresd   27293 root  cwd       DIR       0,20      120     2831 /tmp/kresd
kresd   27293 root  rtd       DIR       0,16      126      256 /
kresd   27293 root  txt       REG       0,16   157204     9070 /usr/bin/kresd
kresd   27293 root  mem       REG       0,13              9070 /usr/bin/kresd (path dev=0,16)
kresd   27293 root  mem       REG       0,20 20971520  2204735 /tmp/kresd/data.mdb
kresd   27293 root  mem       REG       0,13              8763 /usr/lib/kdns_modules/stats.so (path dev=0,16)
kresd   27293 root  mem       REG       0,13              8782 /usr/lib/kdns_modules/hints.so (path dev=0,16)
kresd   27293 root  mem       REG       0,13              8733 /usr/lib/libstdc++.so.6.0.19 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              8740 /usr/lib/kdns_modules/ahocorasick.so (path dev=0,16)
kresd   27293 root  mem       REG       0,13              2814 /usr/lib/libgmp.so.10.3.2 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              2884 /usr/lib/libhogweed.so.4.3 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              4668 /usr/lib/libnettle.so.6.3 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              5133 /usr/lib/libunistring.so.2.0.0 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              1436 /lib/libgcc_s.so.1 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              4593 /usr/lib/libgnutls.so.30.14.6 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              5071 /usr/lib/libluajit-5.1.so.2.0.5 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              2829 /usr/lib/libuv.so.1.0.0 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              8735 /usr/lib/libdnssec.so.6.0.0 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              8822 /usr/lib/libzscanner.so.2.0.0 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              2858 /usr/lib/libknot.so.8.0.0 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              2826 /usr/lib/libkres.so.8 (path dev=0,16)
kresd   27293 root  mem       REG       0,13              1858 /lib/libsetlbf.so (path dev=0,16)
kresd   27293 root  mem       REG       0,13              1789 /lib/libc.so (path dev=0,16)
kresd   27293 root  mem-r     REG       0,20     8192     2886 /tmp/kresd/lock.mdb
kresd   27293 root  mem       REG       0,20       27      658 /tmp/TZ
kresd   27293 root    0r      CHR        1,3      0t0      415 /dev/null
kresd   27293 root    1w     FIFO        0,8      0t0  3811121 pipe
kresd   27293 root    2w     FIFO        0,8      0t0  3811122 pipe
kresd   27293 root    3r     FIFO        0,8      0t0  3811124 pipe
kresd   27293 root    4w     FIFO        0,8      0t0  3811124 pipe
kresd   27293 root    5u  a_inode        0,9        0     1034 [eventpoll]
kresd   27293 root    6r     FIFO        0,8      0t0  3811125 pipe
kresd   27293 root    7w     FIFO        0,8      0t0  3811125 pipe
kresd   27293 root    8r     FIFO        0,8      0t0     2679 pipe
kresd   27293 root    9r     FIFO        0,8      0t0     2548 pipe
kresd   27293 root   10r     FIFO        0,8      0t0   393367 pipe
kresd   27293 root   11r     FIFO        0,8      0t0     2549 pipe
kresd   27293 root   12u  a_inode        0,9        0     1034 [eventfd]
kresd   27293 root   13r     FIFO        0,8      0t0     2680 pipe
kresd   27293 root   14u     IPv4    3811126      0t0      UDP *:domain 
kresd   27293 root   15r     FIFO        0,8      0t0   393368 pipe
kresd   27293 root   16r      CHR        1,3      0t0      415 /dev/null
kresd   27293 root   17u     IPv4    3811127      0t0      TCP *:domain (LISTEN)
kresd   27293 root   18u     IPv6    3811128      0t0      UDP *:domain 
kresd   27293 root   19u     IPv6    3811129      0t0      TCP *:domain (LISTEN)
kresd   27293 root   20ur     REG       0,20     8192     2886 /tmp/kresd/lock.mdb
kresd   27293 root   21u      REG       0,20 20971520  2204735 /tmp/kresd/data.mdb
kresd   27293 root   22u     unix 0xdeff2d00      0t0  3811133 tty/27293 type=STREAM
kresd   27293 root   23u     sock        0,7      0t0  6560176 protocol: TCPv6
kresd   27293 root   24u     sock        0,7      0t0  4604254 protocol: TCP
[..]
kresd   27293 root  992u     sock        0,7      0t0 11688336 protocol: TCPv6
kresd   27293 root  993u     sock        0,7      0t0 11690277 protocol: TCPv6
kresd   27293 root  994u     sock        0,7      0t0 11693066 protocol: TCPv6
kresd   27293 root  995u     sock        0,7      0t0 11692336 protocol: TCPv6
kresd   27293 root  996u     sock        0,7      0t0 11688436 protocol: TCPv6
kresd   27293 root  997u     sock        0,7      0t0 11688464 protocol: TCPv6
kresd   27293 root  998u     sock        0,7      0t0 11688438 protocol: TCPv6
kresd   27293 root  999u     sock        0,7      0t0 12735122 protocol: TCPv6
kresd   27293 root 1007u     sock        0,7      0t0 12734969 protocol: TCP
kresd   27293 root 1008u     sock        0,7      0t0 12735145 protocol: TCPv6
kresd   27293 root 1009u     sock        0,7      0t0 12734971 protocol: TCP
kresd   27293 root 1013u     sock        0,7      0t0 12735119 protocol: TCPv6
kresd   27293 root 1015u     sock        0,7      0t0 12735121 protocol: TCP
kresd   27293 root 1016u     sock        0,7      0t0 12735135 protocol: TCPv6
kresd   27293 root 1018u     sock        0,7      0t0 12735137 protocol: TCPv6
kresd   27293 root 1019u     sock        0,7      0t0 12735138 protocol: TCPv6
root@firefly:~# 

Very interesting, thanks. Your configuration is TLS forwarding, right?

Hi vcunat

I have tried different settings; currently kresd runs as a standalone resolver without forwarding.

/tmp/kresd.config
root@firefly:~# cat /tmp/kresd.config 
--Automatically generated file; DO NOT EDIT
modules = {
    'hints > iterate'
  , 'policy'
  , 'stats'
  , predict = {
        window = 30 -- 30 minutes sampling window
      , period = 24*(60/30) -- track last 24 hours
  }
}
hints.use_nodata(true)
cache.ns_tout(5000)
hints.config('/tmp/kresd/hints.tmp')
net.bufsize(4096)
net.ipv4=true
net.ipv6=true
cache.open(20*MB)

I have not captured the netstat output, but as far as I remember it did not show any active UDP or TCP sessions for kresd.

cheerio
Steve

To be clear, this situation happened within a few days without forwarding, right? If so, that’s rather bad.

I have had problems with DNS since about the start of last December. Before Christmas I restarted the resolver twice; at that time my resolver was also configured as a standalone resolver.

On 31.12. I replaced the Turris Omnia with a spare one, which I re-flashed from a flash drive, and configured everything manually using the Foris UI. At that time the Turris ran with forwarding to Cloudflare using DNS over HTTPS.

Last week I ran into some odd DNS problems where most of the resolving was OK but some names (notably www.postbank.de) did not return any results. Tests with dig at the same time against 8.8.8.8 or 1.1.1.1 did return a working result. Following that I reconfigured the router as a standalone resolver again.

This evening I will turn on the debug log on the Omnia and set up some DNS probing in my Icinga monitoring.

cheerio
Steve

Thanks, that confirms our suspicion that a knot-resolver version introduced this.

postbank.de is in a group of domains that return “incorrect” DNSSEC proofs, due to using broken DNS software.

“Incorrect” = their denial of AAAA also includes records claiming there is no A on the same name, even though they do serve an A on that name. Given that the problem has been known for many months and across many sites, it doesn’t appear there is enough motivation for them to fix it… and the big DNS resolvers affected just add explicit exceptions for such DNS subtrees. We are considering adding a similar list to the Omnia defaults, too, but it’s just :face_vomiting: You can configure these exceptions yourself (a sketch of one possible way follows).
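
If you want to add such an exception yourself, one possible sketch built from the policy module's standard building blocks is below; the domain is just the example from this thread, and I'm not claiming this is exactly the mechanism the big resolvers use. Disabling the cache for the affected subtree keeps kresd from reusing the broken denial proofs:

-- sketch: never cache (and thus never reuse) answers within the broken subtree
policy.add(policy.suffix(policy.FLAGS('NO_CACHE'), policy.todnames({'postbank.de'})))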

Nitpick: you surely meant DNS over TLS and not HTTPS, as we don’t support the latter in knot-resolver :slight_smile: It’s a bit confusing that these two competing standards were introduced so soon after each other…

With the version in stable Omnia, the TLS forwarding is still a little flaky, I’m afraid. Some improvements are on their way already.
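
For reference, TLS forwarding in kresd is configured roughly along these lines; a minimal sketch with Cloudflare's addresses and hostname purely as an illustration (Foris normally generates the equivalent for you, and older versions may additionally need a ca_file= entry):

-- sketch: forward all queries over TLS (DNS-over-TLS) to Cloudflare
policy.add(policy.all(policy.TLS_FORWARD({
    {'1.1.1.1', hostname='cloudflare-dns.com'},
    {'1.0.0.1', hostname='cloudflare-dns.com'},
})))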

I have used the Foris UI option for this, so you are right.

I wonder, is there something unusual or extreme about your network connection or the DNS queries you send? So far we’ve been unable to cause any FD leak at all.

EDIT: to be clear, this is about finding the cause in kresd, not at all about implying it’s your fault somehow.

I am not aware of any unusual pattern in my network. The router stats at http://project.turris.cz are not detailed enough to see any pattern.

The network connection is a 100/10 MBit FTTH connection from Tel.Quick (wilhelm.tel is the technical service provider and the actual LIR) with IPv4 and IPv6. There are about 30 network devices, ranging from PCs (Windows/Linux) through Android phones and tablets to Amazon Fire TVs and a Samsung TV.

One other problem I struggled with is the PPPoE/DHCPv6 setup. While I get a globally reachable IPv4 address, the DHCPv6 response also contains an AFTR gateway address, and the TO then uses DS-Lite for IPv4 instead of the globally reachable IPv4.

As a workaround I call the following script every minute from cron:

#!/bin/sh
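# if the default route points at the DS-Lite tunnel, move it back to the PPPoE uplink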

if ip route show | grep -q "default dev ds-wan6_4" ; then
  ip route del default dev ds-wan6_4
  ip route add default dev pppoe-wan
fi

but this is a problem for another day…

Hello there,
I have the same problem. I have already participated in this thread: Poslední aktualizace 3.11.1 - Turris Omnia - problémy

I can confirm this problem under very similar conditions. My resolver is set NOT to forward. It started during December. When I read this thread earlier, I checked how many sockets were opened by kresd and it was something around 50. However, after kresd stopped working, there were 1048 sockets open by kresd. Fortunately this time I have a log from kresd. Should I upload it somewhere? It is 1.7 MB and I really don’t want to mess up this thread with such a long log.

Viktor Cerny

It will be easiest to send the log to vladimir.cunat@nic.cz – thanks!

9 posts were split to a new topic: ISP issue with CDN Mozilla

Hi @vcunat

I had the issue with the non-responding kresd again and have the debug log from kresd.

Shortly before kresd stops responding, I see a bunch of outgoing queries for which I see no corresponding incoming queries.

The log is large; if helpful I can send it to you by mail.

Yes, please :heart: vladimir.cunat@nic.cz

I might have the same problem: DNS requests start failing after a couple of days. Restarting the resolver fixes it. Unfortunately I’m clueless compared to the other contributors in this topic, but I can execute instructions if it helps the analysis.

I assume you know how to SSH to the router. You can confirm it’s this problem by waiting for the problem to occur and verifying that kresd holds lots of file descriptors: execute lsof -p $(pgrep kresd) and check whether there are about a thousand lines instead of the usual roughly fifty (a one-liner that just counts the lines is sketched below).
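
A one-liner that just counts the lines; a minimal sketch, assuming BusyBox pgrep and the lsof package are available on the router:

# roughly 50 is normal, around a thousand means the descriptor leak has hit
lsof -p "$(pgrep kresd)" | wc -l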

[to everyone] This problem really proves difficult for us to track down. On the RC branch there’s an updated knot-resolver with some related fixes; there’s a (slight) hope that this issue is also better there, so you might want to try RC before it gets into “stable Omnia”.

Yes, I checked this before I restarted the service and it was close to 50 lines. I wasn’t sure if I queried it correctly, but then apparently I did. Different issue then? Anything I should do next time?

Most likely it’s something else, so let’s keep the rest of the discussion to a separate thread. The best next step will be to obtain verbose logs from around a failure: https://doc.turris.cz/doc/en/howto/dnsdebug (note that before installing the package you may need to click the update button on that page). You can e.g. send the logs directly to vladimir.cunat@nic.cz.
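
If you would rather flip the switch by hand instead of via the package from that howto, the relevant kresd call is verbose(); a minimal sketch (note that /tmp/kresd.config on Turris is generated automatically, so the package route from the howto survives restarts better):

-- enable verbose query logging; it is very chatty, so disable it again afterwards
verbose(true)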

I have updated to RC yesterday, we will see if the issue arises again.

cheerio Steve
