Intermittent DNS errors with Knot Resolver without forwarding

elodg · May 27, 2020, 7:23pm

Hi,
Ever since I disabled forwarding and started relying on recursive resolution, I have been getting intermittent DNS errors in the browser that disappear on the second try. I suspect a timeout, but don’t know how to solve it. I managed to connect to kresd and enable verbose logging. The log is from a single attempt at visiting bbc.com: https://pastebin.com/YqiGXvYD
Any help on what to look for is appreciated.

vcunat · May 27, 2020, 7:40pm

In the log, lots of “delays” are caused by sending queries over IPv6 without getting any reply (probably because your IPv6 connectivity does not work at all). This should normally be detected about one minute after a resolver restart or reconfiguration, and the resolver then logs

net.ipv6 = false

(I’m not certain whether that line is shown even in non-verbose mode.)

On a cold cache, resolving the whole query takes many steps, so the extra delay on many of them leads your DNS client to lose patience before the answer arrives.
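A minimal Python sketch for checking by hand whether IPv6 queries leaving the router get any reply at all. It is not how kresd performs its own detection; it simply sends one UDP DNS query for the root zone's NS records to a root server's IPv6 address. The default address `2001:500:2::c` (c.root-servers.net) is an assumption worth re-checking, and the helper names `build_query` and `ipv6_answers` are hypothetical.

```python
import os
import socket
import struct


def build_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal RFC 1035 query for `name`, class IN (assumes labels <= 63 chars)."""
    txid = int.from_bytes(os.urandom(2), "big")
    # Header: ID, flags = 0 (no recursion desired: we ask an authoritative server), QDCOUNT = 1.
    header = struct.pack("!HHHHHH", txid, 0x0000, 1, 0, 0, 0)
    labels = [part.encode("ascii") for part in name.rstrip(".").split(".") if part]
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def ipv6_answers(server: str = "2001:500:2::c", timeout: float = 2.0) -> bool:
    """Return True if `server` (assumed: c.root-servers.net) answers over IPv6."""
    query = build_query(".", qtype=2)  # NS records of the root zone
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
    except OSError:
        # No IPv6 support or route, or no reply before the timeout
        # (socket.timeout is a subclass of OSError).
        return False
    # Only count a reply that echoes our transaction ID.
    return len(reply) >= 12 and reply[:2] == query[:2]


if __name__ == "__main__":
    print("IPv6 DNS replies received:", ipv6_answers())
```

On a router whose IPv6 uplink is broken, `ipv6_answers()` should return `False`, which is the situation where setting `net.ipv6 = false` in the kresd configuration helps. A single lost packet can also produce `False`, so repeating the check a few times gives a more reliable picture.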

elodg · May 28, 2020, 1:35pm

Thanks, I disabled IPv6 resolving in my configuration. Queries now resolve faster, which makes me think the missing IPv6 connectivity wasn’t detected automatically before. I will keep a close eye on DNS failures.

vcunat · May 28, 2020, 3:12pm

For completeness, which Turris OS version have you been using?

elodg · May 28, 2020, 6:15pm

TurrisOS 4.0.5.


elodg · July 4, 2020, 12:45pm

Revisiting the issue: I am on TurrisOS 5.0.2 now and still getting intermittent query timeouts. I ran namebench with Firefox as the data source, limited to 250 queries. 2 of the 250 timed out at 3500 ms. When queried from the console afterwards, both returned a valid cached response. The names are wiki.bash-hackers.org and www.compexshop.com.
How can I debug this further?

```
root@turris:~# cat /tmp/kresd.config
--Automatically generated file; DO NOT EDIT
modules = {
    'hints > iterate'
  , 'policy'
  , 'stats'
  , predict = {
        window = 30 -- 30 minutes sampling window
      , period = 24*(60/30) -- track last 24 hours
    }
}
hints.use_nodata(true)
hints.config('/tmp/kresd/hints.tmp')
trust_anchors.remove('.')
trust_anchors.add_file('/etc/root.keys', true)
net.bufsize(4096)
net.ipv4=true
net.ipv6=true
cache.open(20*MB)

-- Included custom configuration file from: --
-- /etc/kresd/custom.conf
--log DNSSEC failures
modules.load('bogus_log')
--our wan does not have ipv6
net.ipv6=false
--recursive resolver might need a bigger cache
cache.size=100*MB
```

vcunat · July 5, 2020, 9:42am

Yes, these names can sometimes take longer to resolve, e.g. around four seconds for me on a completely empty cache (a bit unrealistic, as common TLDs are typically cached), and the bench client probably gives up on the first attempt before then. (In that case kresd keeps resolving, so client retries should get the answer faster, often immediately from cache.)
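To reproduce the behaviour described above outside of a browser or namebench, a client can send the same query with a fixed deadline and retry once. A minimal Python sketch, assuming kresd listens on 127.0.0.1:53 and borrowing the 3.5 s limit from the namebench run; `build_query` and `timed_attempts` are hypothetical helpers, not part of any tool mentioned in this thread.

```python
import os
import socket
import struct
import time


def build_query(name: str, qtype: int = 1) -> bytes:
    """Minimal RFC 1035 query, class IN, with the RD (recursion desired) bit set."""
    txid = int.from_bytes(os.urandom(2), "big")
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = [part.encode("ascii") for part in name.rstrip(".").split(".") if part]
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def timed_attempts(name, server="127.0.0.1", timeout=3.5, attempts=2):
    """Query `server` up to `attempts` times, stopping at the first success.

    Returns a list of (ok, seconds) tuples, one per attempt made.
    """
    results = []
    for _ in range(attempts):
        query = build_query(name)
        start = time.monotonic()
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
                sock.settimeout(timeout)
                sock.sendto(query, (server, 53))
                reply, _ = sock.recvfrom(4096)
            ok = reply[:2] == query[:2]  # reply must echo our transaction ID
        except OSError:  # includes socket.timeout: the client "gave up"
            ok = False
        results.append((ok, time.monotonic() - start))
        if ok:
            break
    return results


if __name__ == "__main__":
    for name in ("wiki.bash-hackers.org", "www.compexshop.com"):
        print(name, timed_attempts(name))
```

On a cold cache one would expect the first attempt to fail at the deadline and the retry to succeed quickly, because kresd keeps resolving after the client gives up. Once the names are cached, the first attempt should already succeed.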

For the first domain, half of the authoritative servers’ IPv4 addresses don’t respond at all, and for the second one it’s two thirds (!), so they don’t make this easy. Also, the nameservers aren’t glued, so just finding their IP addresses takes additional round-trips. Still, we do want kresd to perform better in these half-broken cases even without retransmitting too aggressively, and work on that is underway.
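A rough way to see why the fraction of unresponsive servers matters is a memoryless model: each try picks an authoritative address uniformly at random, and every dead pick wastes one timeout. The number of dead picks before a responsive one is then geometric with mean p / (1 - p), i.e. 1 when half the addresses are dead and 2 when two thirds are. This is a simplifying assumption, not kresd's server-selection algorithm (real resolvers track per-server responsiveness), and the function names and timeout values are hypothetical.

```python
def expected_dead_tries(p_dead: float) -> float:
    """Expected unanswered tries before reaching a responsive server.

    Assumed memoryless model: each try picks a server uniformly at random,
    independent of earlier failures, so dead picks are geometric: p / (1 - p).
    """
    if not 0 <= p_dead < 1:
        raise ValueError("p_dead must be in [0, 1)")
    return p_dead / (1 - p_dead)


def expected_wasted_seconds(p_dead: float, timeout_s: float) -> float:
    """Expected waiting lost to dead servers if each dead try costs `timeout_s`."""
    return expected_dead_tries(p_dead) * timeout_s


if __name__ == "__main__":
    for label, p in (("half dead", 0.5), ("two thirds dead", 2 / 3)):
        print(label, expected_dead_tries(p))
```

Under this model a single lookup step against the second domain's servers wastes about twice as many timeouts as one against the first domain's. The missing glue compounds this, since the nameservers' own addresses must be resolved first, and each of those lookups adds its own round-trips.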