DNS resolver fails at random times

For the second time I have observed that the DNS resolver (kresd?) stops working at some point, days or weeks after boot.

For example:

$ dig example.com @192.168.1.1

; <<>> DiG 9.18.16 <<>> example.com @192.168.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 40262
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;example.com.			IN	A

;; Query time: 908 msec
;; SERVER: 192.168.1.1#53(192.168.1.1) (UDP)
;; WHEN: Wed Jul 05 15:59:53 CEST 2023
;; MSG SIZE  rcvd: 40

returned a SERVFAIL. The modem's DNS works:

$ dig example.com @10.0.0.138

; <<>> DiG 9.18.16 <<>> example.com @10.0.0.138
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14296
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;example.com.			IN	A

;; ANSWER SECTION:
example.com.		66070	IN	A	93.184.216.34

;; Query time: 52 msec
;; SERVER: 10.0.0.138#53(10.0.0.138) (UDP)
;; WHEN: Wed Jul 05 15:59:57 CEST 2023
;; MSG SIZE  rcvd: 56

On the Omnia I see only:

# grep -i kresd /var/log/messages
Jul  4 21:50:16 turris1 dhcp_host_domain_ng.py: Refresh kresd leases
Jul  5 05:50:07 turris1 dhcp_host_domain_ng.py: Refresh kresd leases
Jul  5 06:58:05 turris1 dhcp_host_domain_ng.py: Refresh kresd leases
Jul  5 09:00:39 turris1 dhcp_host_domain_ng.py: Refresh kresd leases
Jul  5 09:31:04 turris1 dhcp_host_domain_ng.py: Refresh kresd leases
Jul  5 09:50:16 turris1 dhcp_host_domain_ng.py: Refresh kresd leases
# cat /var/log/resolver
Jun 26 09:26:02 turris1 kresd[5666]: [system] warning: hard limit for number of file-descriptors is only 4096 but recommended value is 524288
Jul  4 21:26:08 turris1 kresd[5666]: [taupd ] active refresh failed for . with rcode: 2
Jul  4 23:50:10 turris1 kresd[5666]: [taupd ] active refresh failed for . with rcode: 2
Jul  5 02:14:12 turris1 kresd[5666]: [taupd ] active refresh failed for . with rcode: 2
Jul  5 04:38:14 turris1 kresd[5666]: [taupd ] active refresh failed for . with rcode: 2
Jul  5 07:02:16 turris1 kresd[5666]: [taupd ] active refresh failed for . with rcode: 2
Jul  5 09:26:18 turris1 kresd[5666]: [taupd ] active refresh failed for . with rcode: 2
Jul  5 11:50:20 turris1 kresd[5666]: [taupd ] active refresh failed for . with rcode: 2

No errors were logged at the moment the query failed (just now), though the [taupd] lines show that the periodic trust-anchor refresh for the root zone had already been failing with rcode 2 (SERVFAIL).

TurrisOS 6.3.3, Turris Omnia
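
Since the failure shows up at an unpredictable time, I could run something like this probe from a LAN machine to pin down the exact moment kresd starts answering SERVFAIL (a minimal sketch – the query name, interval and log path are arbitrary):

#!/bin/sh
# Query the router's resolver once a minute; log any answer that is not NOERROR.
while true; do
    out=$(dig +time=3 +tries=1 example.com @192.168.1.1)
    echo "$out" | grep -q 'status: NOERROR' \
        || printf '%s %s\n' "$(date)" "$(echo "$out" | grep 'status:')" >> kresd-probe.log
    sleep 60
done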

And your setting (in the reForis DNS tab) is forwarding to the ISP's resolvers? Based on what you wrote above, I'd first try avoiding that combination.

(That's fairly easy to try, unlike really debugging this – which, after lots of work, would most likely end with the conclusion that “we” consider their DNS servers broken in some way… and then what's next anyway?)

So, I’d personally mainly recommend either of these two choices:

  1. Uncheck forwarding. That way your router's resolver will query the respective authoritative servers directly, exactly as the DNS protocol was designed. At least I'd hope that your ISP doesn't intercept DNS queries not aimed at them. (See the CLI sketch after this list.)
  2. Keep forwarding, but to some known-reliable public resolvers. Turris provides a menu of good ones with encrypted configuration; encryption is usually the main reason why you might prefer this over option (1). (Though the CZ.NIC option will probably get slow if you're very far from .cz.) Generally you'd choose one you trust with your privacy, as it will see your DNS queries instead of your ISP.
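
For completeness, a CLI equivalent of option (1) – unchecking the forwarding checkbox – might look like this on Turris OS (assumption: reForis stores the checkbox as resolver.common.forward_upstream; verify in /etc/config/resolver first):

# cat /etc/config/resolver            # check the current value first
# uci set resolver.common.forward_upstream='0'
# uci commit resolver
# /etc/init.d/resolver restart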

The ISP's resolvers work fine (see the dig output above); it's kresd that returns SERVFAIL.

Forwarding is and was disabled:
[screenshot: reForis DNS tab with forwarding unchecked]

I saw the dig output, but that doesn't really prove anything, because DNSSEC validation involves far more complex work than this trivial query.
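
To actually exercise the validation path, queries like these tell more (assuming the resolver at 192.168.1.1, as above; dnssec-failed.org is a public, deliberately mis-signed test zone):

$ dig example.com @192.168.1.1 +dnssec      # the 'ad' flag should be set if validation succeeded
$ dig dnssec-failed.org @192.168.1.1        # a validating resolver should answer SERVFAIL here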

I have now switched to forwarding to CZ.NIC's DNS servers. We'll see whether the bug reappears – worst case in a few months.
Kresd progressively failing at some point - #5 by iron-maiden describes the same bug; maybe the culprit is a too-low default file-descriptor limit? (A sketch for raising it is below.)
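
If it were the limit, I guess the procd init script could raise it. A sketch only – on Turris OS the service file is /etc/init.d/resolver and its actual layout may differ:

# inside start_service() of the procd init script
procd_open_instance
procd_set_param command /usr/bin/kresd ...        # existing command line, unchanged
procd_set_param limits nofile="524288 524288"     # soft/hard limit, per the warning in the log
procd_close_instance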

The descriptor limit is certainly not causing such issues.