Knot resolver not resolving

This is driving me crazy and only started recently. But the Knot resolver gets hung up and stops responding out of the blue. No idea why. And no idea how to diagnose. All I do know is that this fixes the problem:

# /etc/init.d/resolver restart
Called /etc/init.d/kresd stop
Called /etc/init.d/kresd start

What I want to know desperately is how to diagnose. I see no clues in /var/log/messages and I’d love to be a blt ask the resolver “what giuves?”, or “why the heck ain’t you responding to requests?” given it’s clearly running:

# ps | grep kresd
14156 root     30724 S    /usr/bin/kresd -c /tmp/kresd.config -f 1 /tmp/kresd -a 0.0.0.0 53 -a :: 53 -k /etc/root.keys

Can I configure to start it in a debugging mode? Can I just lose it and move back to dnsmasq as a resolver?

Not impressed is all I can say, for a pretty slick router out of the box. But that’s life, what I need to know is how to diagnose, and how to work around.

Oh, and:

# cat /tmp/kresd.config
--Automatically generated file; DO NOT EDIT
modules = {
    'hints > iterate'
  , 'policy'
  , 'stats'
  , predict = {
        window = 30 -- 30 minutes sampling window
      , period = 24*(60/30) -- track last 24 hours
  }
}
hints.config('/tmp/kresd/hints.tmp')
net.bufsize(4096)
net.ipv4=true
net.ipv6=true
cache.open(20*MB)
cache.clear()
policy.add(policy.all(policy.FORWARD({
	'203.12.160.35',
	'203.12.160.36',
})))
# ll /tmp/kresd
-rw-r-----    1 root     root      20971520 Jun 10 20:57 data.mdb
-rw-r--r--    1 root     root           812 Jun 10 20:55 hints.tmp
-rw-r-----    1 root     root          8192 Jun 10 20:55 lock.mdb
drwxr-x---    2 root     root            80 Jun 10 20:55 tty/

Relevant tot he command line I see in ps.

And, further till as exemplified by the present moment and yesterday alas, restarting it fixes the problem, for a few minutes … then it’s back and I can repeat the cycle … and a reboot gets it working in a more lasting manner (well for about a day since it last happened).

  1. disable startup of everything dns-related (resolver, kres, unbound, …) but
  2. enable startup of dnsmasq and make sure it listens on 0.0:53 (remove any port stancas in /etc/config/dhcp [dnsmasq]-section )
  3. reboot

cheers

Thank you kindly, I’m happy to try (I just had to reboot again because Knot was not resolving - making fun of its own name I guess). In a nutshell though you’re saying disable Knot and use dnsmasq as a DNS. If so, is it not best to stick to using /etc/init.d/resolver as the starter and fix this:

prefered_resolver() {
	local pref_resolver
	config_load resolver
	config_get pref_resolver common prefered_resolver
	echo -ne "$pref_resolver"
}

PREF_RESOLVER=$(prefered_resolver)

So that dnsmasq is the preferred resolver? Just curious as it seems designed to make it easy to define a preference, though I’ll have to drill down into config_load and config_get to find where. Alas am typing one handed now unable to do that right now, but later today.

It would be awesome to hear from Turris folk re the Knot resolver though and diagnosis, debugging and stability issues?

OK a little further reading. It seems this is complicated marginally by loads of interest and some history on this forum that suggest that Knot was developed by Turris specifically because dnsmasq did not support DNSSEC among other concerns with existing DNS proxy services dnsmaq offers I guess (though I am still mildly perplexed at the propensity of the OSS world to split, fork, start anew etc rather than say hopping in and helping fix dnsmasq - aside from the fact that humans are conflict ego driven beasts ;-).

Here is a telling post for example:

And given my hassles with Knot I tend to concur. Though part of me bought a Turris router, is kind of happy to stay standardish Turris with Turris support and I’d rather see Knot if it’s installed, a) reliable, b) diagnosable when not! The experience I report here - i.e. the knot resolver not responding, every device on my LAN not having a DNS, even though the Omnia is up, kresd is running and the problem goes away when I restart kresd, and goes away longer if restart the router to wit - there must be some way to diagnose and provide diagnostics to Knot devs and see Knot fixed or the reasons understood and worked around.

1 Like

It does seem that kresd has a --verbose and one is left wondering how to ask (via /etc/config/resolver) that it start in verbose mode and log the output to /var/log/resolver for example.

Alas, I did find this:

http://knot-resolver.readthedocs.io/en/latest/daemon.html#cli-interface

Which is neat! And bothered as I am by such inordinate klunkiness I wrote a quick helper:

https://github.com/bernd-wechner/OpenWRT-Tools/blob/master/knot

Alas I tried it and it seems that the installed kresd is compiled with -DNOVERBOSELOG! Aaargh. I turn it on anyhow and see nothing in return. So it ain’t helping much. Dear me.

Perhaps I have to build it myself (somehow for the ARM architecture, oh the joys) with that in, or maybe some Turris devs could pipe up and let me know what’s going on here and how to do some diagnostics?

Well, that’s certainly not exactly like that. It is correct that knot-resolver and turris projects are developed by the same organization. Still, there are other DNS resolvers that support DNSSEC, most notably unbound is a very good one (and Turris 1.x used that by default).

No, it certainly isn’t compiled with that deprecated option, in any packaging we know of, specifically to avoid users having to recompile. Logging can be turned on by verbose(true) command (which echoes the new state for confirmation). However, the logs themselves are not printed to the $tty but to standard and error outputs.

I’ll check status of some Omnia-specific stuff and get back to you. I’m really interested in finding and solving the cause.

OK, so the GUI helper isn’t deployed ATM. Still, after you feed kresd with verbose(true) command, the log lines should simply appear within /var/log/messages. Can you send me (at least) the relevant bits of the log? (Privately if you prefer, e.g. to vladimir.cunat@nic.cz)

Hey, thanks I just tired verbose(true) in the CLI and which echoed:

[string "return table_print(verbose(true))"]:1: attempt to call global 'verbose' (a boolean value)

and this appeared in /var/log/messages

2017-06-12T08:36:54+10:00 info kresd[27250]: > verbose(true)
2017-06-12T08:36:54+10:00 err kresd[27250]: [string "return table_print(verbose(true))"]:1: attempt to call global 'verbose' (a boolean value)

but then I did nslookup google.com from a LAN server that uses the Omnia as gateway and DNS and nothing new in /var/log/messages so I went to a browser and surfed to google.com and nothing new in /var/log/messages either (well, nothing from the Knot resolver).

I’m not sure where I got verbose=true from because the doc seems clear at:

http://knot-resolver.readthedocs.io/en/stable/daemon.html

my bad there!

Resolver hasn’t hung in past day mind you. Which is great, albeit not useful for checking logs. But verbose is set now, albeit with a log message labelled err which is not very informative but suggests maybe something went wrong?

Well, if you ran command verbose = true, you overrode that identifier and lost access to the verbose function, so attempting that verbose(true) call won’t have any effect anymore.

Ah, thanks for that. I shall restart kresd, and try again! I hope its memory of the override does not span sessions!

The lua state isn’t persistent. Restart will clear that up.

Awesome! I restarted and set verbose(true) this morning and it took, and is spewing forth info int /var/log/messages.

Now I’ll just leave this running and see if the issue recurs (where kresd stops resolving as describe in the OP). Since the weekend it has been behaving (which is a two edged sword … 1) Phew, I hope it just keeps working! and 2) alas, doesn’t help diagnose what was happening and leaves an air of uncertainty around reliability).

Oh, for now a sample of the output is:

2017-06-14T00:17:49+10:00 info kresd[12439]: [39448][iter]   '41.96.139.121.in-addr.arpa.' type 'PTR' id was assigned, parent id 0
2017-06-14T00:17:49+10:00 info kresd[12439]: [39448][resl]   => querying: '203.12.160.35' score: 390 zone cut: '.' m12n: '41.96.139.121.IN-ADdR.aRpa.' type: 'PTR' proto: 'udp'
2017-06-14T00:17:49+10:00 info kresd[12439]: [39448][ pc ]   => answer cached for TTL=900
2017-06-14T00:17:49+10:00 info kresd[12439]: [39448][resl]   <= server: '203.12.160.35' rtt: 42 ms
2017-06-14T00:17:49+10:00 info kresd[12439]: [    0][resl] finished: 4, queries: 1, mempool: 16392 B
2017-06-14T00:17:49+10:00 info kresd[12439]: [    0][plan] plan '13.65.241.188.in-addr.arpa.' type 'PTR'
2017-06-14T00:17:49+10:00 info kresd[12439]: [25337][iter]   '13.65.241.188.in-addr.arpa.' type 'PTR' id was assigned, parent id 0
2017-06-14T00:17:49+10:00 info kresd[12439]: [25337][resl]   => querying: '203.12.160.35' score: 216 zone cut: '.' m12n: '13.65.241.188.IN-aDdR.ArPa.' type: 'PTR' proto: 'udp'
2017-06-14T00:17:49+10:00 info kresd[12439]: [25337][resl]   => querying: '203.12.160.36' score: 216 zone cut: '.' m12n: '13.65.241.188.IN-aDdR.ArPa.' type: 'PTR' proto: 'udp'
2017-06-14T00:17:49+10:00 info kresd[12439]: [25337][resl]   => querying: '203.12.160.35' score: 216 zone cut: '.' m12n: '13.65.241.188.IN-aDdR.ArPa.' type: 'PTR' proto: 'udp'
2017-06-14T00:17:50+10:00 info kresd[12439]: [    0][plan] plan 'forum.turris.cz.' type 'A'
2017-06-14T00:17:50+10:00 info kresd[12439]: [ 2351][iter]   'forum.turris.cz.' type 'A' id was assigned, parent id 0
2017-06-14T00:17:50+10:00 info kresd[12439]: [ 2351][resl]   => querying: '203.12.160.35' score: 216 zone cut: '.' m12n: 'ForUm.TUrris.CZ.' type: 'A' proto: 'udp'
2017-06-14T00:17:50+10:00 info kresd[12439]: [    0][plan] plan 'forum.turris.cz.' type 'AAAA'
2017-06-14T00:17:50+10:00 info kresd[12439]: [65486][iter]   'forum.turris.cz.' type 'AAAA' id was assigned, parent id 0
2017-06-14T00:17:50+10:00 info kresd[12439]: [65486][resl]   => querying: '203.12.160.35' score: 216 zone cut: '.' m12n: 'FOrum.TUrrIS.cz.' type: 'AAAA' proto: 'udp'
2017-06-14T00:17:50+10:00 info kresd[12439]: [ 2351][resl]   <= server: '203.12.160.35' rtt: 22 ms
2017-06-14T00:17:50+10:00 info kresd[12439]: [    0][resl] finished: 4, queries: 1, mempool: 16392 B
2017-06-14T00:17:50+10:00 info kresd[12439]: [65486][resl]   <= server: '203.12.160.35' rtt: 24 ms
2017-06-14T00:17:50+10:00 info kresd[12439]: [    0][resl] finished: 4, queries: 1, mempool: 16392 B
2017-06-14T00:17:50+10:00 info kresd[12439]: [25337][resl]   => querying: '203.12.160.36' score: 216 zone cut: '.' m12n: '13.65.241.188.IN-aDdR.ArPa.' type: 'PTR' proto: 'udp'
2017-06-14T00:17:50+10:00 info kresd[12439]: [59297][iter]   '182.167.173.78.in-addr.arpa.' type 'PTR' id was assigned, parent id 0
2017-06-14T00:17:50+10:00 info kresd[12439]: [59297][resl]   => querying: '203.12.160.35' score: 30 zone cut: '.' m12n: '182.167.173.78.iN-aDdR.aRPa.' type: 'PTR' proto: 'udp'
2017-06-14T00:17:50+10:00 info kresd[12439]: [28124][iter]   '182.167.173.78.in-addr.arpa.' type 'PTR' id was assigned, parent id 0
2017-06-14T00:17:50+10:00 info kresd[12439]: [59297][resl]   <= server: '203.12.160.35' rtt: 13 ms
2017-06-14T00:17:50+10:00 info kresd[12439]: [    0][resl] finished: 4, queries: 1, mempool: 16392 B
2017-06-14T00:17:50+10:00 info kresd[12439]: [59297][resl]   <= server: '203.12.160.35' rtt: 14 ms
2017-06-14T00:17:50+10:00 info kresd[12439]: [    0][resl] finished: 4, queries: 1, mempool: 16392 B

The interesting thing being I tried this:

# ping nessie
PING nessie (192.168.0.13): 56 data bytes
64 bytes from 192.168.0.13: seq=0 ttl=64 time=0.286 ms
64 bytes from 192.168.0.13: seq=1 ttl=64 time=0.240 ms
^C
--- nessie ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.240/0.263/0.286 ms
# ping nessie.lan
PING nessie.lan (192.168.0.13): 56 data bytes
64 bytes from 192.168.0.13: seq=0 ttl=64 time=0.279 ms
64 bytes from 192.168.0.13: seq=1 ttl=64 time=0.245 ms
64 bytes from 192.168.0.13: seq=2 ttl=64 time=0.228 ms
^C
--- nessie.lan ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.228/0.250/0.279 ms
# ping google.com
PING google.com (172.217.25.46): 56 data bytes
64 bytes from 172.217.25.46: seq=0 ttl=57 time=26.009 ms
64 bytes from 172.217.25.46: seq=1 ttl=57 time=25.921 ms
64 bytes from 172.217.25.46: seq=2 ttl=57 time=25.898 ms
^C
--- google.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 25.898/25.942/26.009 ms

from another LAN server but see nothing in /var/log/messages to suggest I did that. As in it’s resolving, which is great but it seems the verbose output doesn’t include details of specific name queries being performed.