DNS queries randomly fail

Dreki · March 4, 2020, 1:43pm

I am running Turris OS 4.0.5

I can set DNS servers using any of the TLS options in Foris and DNS will work, but forwarding is slow (>150 ms, with some requests >400 ms) and some requests randomly fail. Out of 250 requests, about 35-50 fail or time out.

What could be wrong? I’m not sure how to troubleshoot this.
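
One way to put a number on the failures described above is to resolve a batch of names and count how many fail or come back slowly. The following is only a minimal sketch, not an official Turris tool: the function names `system_resolve` and `measure` are made up for illustration, the 150 ms threshold is taken from the figure in this post, and lookups go through whatever resolver the client machine is configured to use.

```python
import socket
import time


def system_resolve(name):
    """Resolve a name through the system resolver; raises OSError on failure."""
    # socket.gaierror (lookup failure or timeout) is a subclass of OSError.
    socket.getaddrinfo(name, None)


def measure(names, resolve=system_resolve, slow_ms=150.0):
    """Resolve each name once and count failures and slow answers."""
    failed = 0
    slow = 0
    for name in names:
        start = time.monotonic()
        try:
            resolve(name)
        except OSError:
            failed += 1
            continue
        elapsed_ms = (time.monotonic() - start) * 1000.0
        if elapsed_ms > slow_ms:
            slow += 1
    total = len(names)
    return {
        "total": total,
        "failed": failed,
        "slow": slow,
        "failure_rate": failed / total if total else 0.0,
    }
```

Calling `measure(list_of_names)` with around 250 names would give a figure comparable to the 35-50 failures mentioned above. Repeated names are likely to be answered from a cache, so a list of distinct names should give a more honest picture.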

I also noticed that each of the files in "/etc/resolver/dns_servers" lists only one server. Is it possible to set a secondary server?
For example, in /etc/resolver/dns_servers/99_cloudflare.conf, where
ipv4="1.1.1.1"
could you set
ipv4="1.1.1.1, 1.0.0.1"
or something similar to have a secondary DNS server?

/etc/config/resolver

config resolver 'common'
    list interface '0.0.0.0'
    list interface '::0'
    option port '53'
    option keyfile '/etc/root.keys'
    option verbose '0'
    option msg_buffer_size '4096'
    option msg_cache_size '20M'
    option net_ipv6 '1'
    option net_ipv4 '1'
    option forward_upstream '1'
    option prefered_resolver 'kresd'
    option ignore_root_key '0'
    option prefetch 'yes'
    option static_domains '1'
    option dynamic_domains '0'
    option forward_custom '99_quad9'

config resolver 'kresd'
    option rundir '/tmp/kresd'
    option log_stderr '1'
    option log_stdout '1'
    option forks '1'
    option keep_cache '1'

config resolver 'unbound'
    option outgoing_range '60'
    option outgoing_num_tcp '1'
    option incoming_num_tcp '1'
    option msg_cache_slabs '1'
    option num_queries_per_thread '30'
    option rrset_cache_size '100K'
    option rrset_cache_slabs '1'
    option infra_cache_slabs '1'
    option infra_cache_numhosts '200'
    list access_control '0.0.0.0/0 allow'
    list access_control '::0/0 allow'
    option pidfile '/var/run/unbound.pid'
    option root_hints '/etc/unbound/named.cache'
    option target_fetch_policy '2 1 0 0 0'
    option harden_short_bufsize 'yes'
    option harden_large_queries 'yes'
    option qname_minimisation 'yes'
    option harden_below_nxdomain 'yes'
    option key_cache_size '100k'
    option key_cache_slabs '1'
    option neg_cache_size '10k'
    option prefetch_key 'yes'

config resolver 'unbound_remote_control'
    option control_enable 'yes'
    option control_use_cert 'no'
    list control_interface '127.0.0.1'

vcunat · March 4, 2020, 2:36pm

3.11.x already has this (by default), and so does 5.x, if I'm reading it correctly.

vcunat · March 4, 2020, 4:07pm

Can you add the rough typical round-trip time to the servers (e.g. ping 1.1.1.1)? Note that TLS_FORWARDING may need multiple round trips to validate a single query. In particular, our implementation currently isn't optimal in this respect (it favours reliability over speed for forwarding), and the target servers might not have everything ready in cache either.

I can't really think of any better approach than to catch the failure while collecting verbose logs (they get large) and then look through them around the point of the failure.
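
To make it easier to find the point of failure in a large verbose log, a client on the network could keep probing and note the time of each failed lookup. This is a rough sketch under assumptions: `probe` is a hypothetical helper, it uses the client's system resolver rather than talking to kresd directly, and timestamps are recorded in UTC, so they may need converting to the time zone used in the router's log.

```python
import socket
import time
from datetime import datetime, timezone


def probe(names, rounds, resolve=None, pause_s=1.0, sleep=time.sleep):
    """Resolve names repeatedly; return (UTC timestamp, name, error) per failure."""
    if resolve is None:
        # Default: the system resolver; failures raise socket.gaierror (an OSError).
        resolve = lambda name: socket.getaddrinfo(name, None)
    failures = []
    for _ in range(rounds):
        for name in names:
            try:
                resolve(name)
            except OSError as exc:
                stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
                failures.append((stamp, name, str(exc)))
        # Pause between rounds so the probe does not flood the resolver.
        sleep(pause_s)
    return failures
```

The returned list of timestamps and names can then be matched against the verbose log entries from around the same moments.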

eckso · March 4, 2020, 5:59pm

fwiw DNS over TLS (Cloudflare) has been sporadically failing hard for me over the last 2-3 days (e.g. google.com and many others not resolving on PCs, laptops, phones in my flat). Omnia, TurrisOS 4.0.5. Thanks @vcunat for the debug docs.

Dreki · March 4, 2020, 7:02pm

Pings to both Cloudflare servers:
root@turris:~# ping 1.1.1.1
PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
64 bytes from 1.1.1.1: icmp_req=1 ttl=58 time=24.9 ms
64 bytes from 1.1.1.1: icmp_req=2 ttl=58 time=23.9 ms
64 bytes from 1.1.1.1: icmp_req=3 ttl=58 time=22.9 ms
64 bytes from 1.1.1.1: icmp_req=4 ttl=58 time=21.7 ms
64 bytes from 1.1.1.1: icmp_req=5 ttl=58 time=29.9 ms
^C
--- 1.1.1.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4005ms
rtt min/avg/max/mdev = 21.790/24.718/29.939/2.821 ms
root@turris:~# ping 1.0.0.1
PING 1.0.0.1 (1.0.0.1) 56(84) bytes of data.
64 bytes from 1.0.0.1: icmp_req=1 ttl=58 time=29.1 ms
64 bytes from 1.0.0.1: icmp_req=2 ttl=58 time=28.4 ms
64 bytes from 1.0.0.1: icmp_req=3 ttl=58 time=36.9 ms
64 bytes from 1.0.0.1: icmp_req=4 ttl=58 time=36.0 ms
64 bytes from 1.0.0.1: icmp_req=5 ttl=58 time=24.9 ms
^C
--- 1.0.0.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4004ms
rtt min/avg/max/mdev = 24.904/31.113/36.984/4.661 ms

DNS-over-TLS query from a device on the network directly to the DNS server:

dreki@1:~$ kdig -d @1.1.1.1 +tls-ca +tls-host=cloudflare-dns.com reddit.com
;; DEBUG: Querying for owner(reddit.com.), class(1), type(1), server(1.1.1.1), port(853), protocol(TCP)
;; DEBUG: TLS, imported 133 system certificates
;; DEBUG: TLS, received certificate hierarchy:
;; DEBUG: #1, C=US,ST=California,L=San Francisco,O=Cloudflare, Inc.,CN=cloudflare-dns.com
;; DEBUG: SHA-256 PIN: V6zes8hHBVwUECsHf7uV5xGM7dj3uMXIS9//7qC8+jU=
;; DEBUG: #2, C=US,O=DigiCert Inc,CN=DigiCert ECC Secure Server CA
;; DEBUG: SHA-256 PIN: PZXN3lRAy+8tBKk2Ox6F7jIlnzr2Yzmwqc3JnyfXoCw=
;; DEBUG: TLS, skipping certificate PIN check
;; DEBUG: TLS, The certificate is trusted.
;; TLS session (TLS1.2)-(ECDHE-ECDSA-SECP256R1)-(CHACHA20-POLY1305)
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 36177
;; Flags: qr rd ra; QUERY: 1; ANSWER: 4; AUTHORITY: 0; ADDITIONAL: 1

;; EDNS PSEUDOSECTION:
;; Version: 0; flags: ; UDP size: 1452 B; ext-rcode: NOERROR

;; QUESTION SECTION:
;; reddit.com. IN A

;; ANSWER SECTION:
reddit.com. 186 IN A 151.101.1.140
reddit.com. 186 IN A 151.101.65.140
reddit.com. 186 IN A 151.101.129.140
reddit.com. 186 IN A 151.101.193.140

;; Received 103 B
;; Time 2020-03-04 13:46:38 EST
;; From 1.1.1.1@853(TCP) in 31.3 ms

The same query again from the same device, but using the normally configured DNS settings:
dreki@1:~$ dig reddit.com

; <<>> DiG 9.11.3-1ubuntu1.11-Ubuntu <<>> reddit.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7080
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;reddit.com. IN A

;; ANSWER SECTION:
reddit.com. 300 IN A 151.101.1.140
reddit.com. 300 IN A 151.101.65.140
reddit.com. 300 IN A 151.101.129.140
reddit.com. 300 IN A 151.101.193.140

;; Query time: 173 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Wed Mar 04 13:57:59 EST 2020
;; MSG SIZE rcvd: 103

I could live with 173 msec, but I can't live with all the dropped DNS queries. It is interfering with my email server, not to mention annoying users.
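
As a rough cross-check against the earlier remark that TLS forwarding may need multiple round trips, the timings in this post can be expressed as a number of round trips to 1.1.1.1. This is back-of-the-envelope arithmetic only: it assumes the whole query time is spent on network round trips, and it ignores that the `dig` query also passed through the local stub at 127.0.0.53 and that the configured upstream may not have been Cloudflare.

```python
# Figures copied from the command outputs in this post.
avg_rtt_ms = 24.718      # ping 1.1.1.1, rtt avg
direct_dot_ms = 31.3     # kdig straight to 1.1.1.1 over TLS
via_router_ms = 173.0    # dig through the normally configured resolver


def round_trips(total_ms, rtt_ms):
    """How many network round trips would fit into total_ms."""
    return total_ms / rtt_ms


print(f"direct: ~{round_trips(direct_dot_ms, avg_rtt_ms):.1f} round trips")
print(f"via router: ~{round_trips(via_router_ms, avg_rtt_ms):.1f} round trips")
```

Under those assumptions the forwarded query corresponds to roughly seven round trips, which is in line with a slow but working forwarder and does not by itself explain the dropped queries.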

Dreki · March 4, 2020, 7:19pm

Is there a way to set a secondary server in my version (4.0.5)?

vcunat · March 4, 2020, 7:42pm

No, current 4.x does not have this; it supports only one address each for IPv4 and IPv6.

czlada · March 4, 2020, 7:53pm

DoT does not work in TOS 4.x …
T1.x

vcunat · March 4, 2020, 7:57pm

Turris 1.x (still) does not use Knot Resolver but Unbound (by default), so that should be put into a separate discussion thread.

dinsdale · March 4, 2020, 8:18pm

I have had the same issue with Cloudflare DNS over HTTPS. I run cloudflared on a Raspberry Pi, not on the Turris, so the problem seems to be with Cloudflare.

Dreki · March 4, 2020, 8:33pm

I’m not trying to use any special resolver on my Turris. I only used kdig on another machine to conduct a query as a control example. Sorry for the confusion.

vcunat · March 4, 2020, 8:40pm

I don’t think this relates to you. Turris 1.x is earlier HW, different than Omnias and MOX.

Dreki · March 4, 2020, 8:45pm

Oh, I see that now. My mistake.

czlada · March 5, 2020, 3:28am

Where do you read that this thread is about Knot?

vcunat · March 5, 2020, 8:06am

That was a highly probable conclusion on my side, especially after the last posts like

and ruling out Turris 1.x (which seemed unlikely anyway, given the poster’s non-Czech name).

eckso · March 5, 2020, 3:56pm

No longer seeing DNS failures after switching DoT from Cloudflare to Quad9.

Dreki · March 6, 2020, 8:11am

I am still getting DNS failures using any of the options. I have managed to generate the DNS debug and dnslog by following the directions outlined here: https://doc.turris.cz/doc/en/howto/dnsdebug

I'm afraid I don't know enough about DNS and DNS over TLS to know exactly what I'm looking for. Is it relatively safe to post those here on the forum so that someone could help me understand what is wrong?

I’m still getting resolve failures constantly.

vcunat · March 6, 2020, 8:43am

They contain quite a lot of information, some of which may be considered private (just the set of resolved names can tell quite a lot). The officially recommended way is to send them to tech.support@turris.cz. If you remember e.g. the failing names (or timestamps), that might be useful in case the failure doesn't stand out in the log.