Dig shows SERVFAIL but googleapps is successful

While trying to debug why a certain web page is not showing as expected, I saw that dig on my router and dig via googleapps produce different output.

On my router with kresd and TurrisOS 5.1.4:

root@turris:/etc/config# dig media.os.fressnapf.com

; <<>> DiG 9.16.8 <<>> media.os.fressnapf.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 50292
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

Via googleapps:

id 60228 
opcode QUERY 
rcode NOERROR 
flags QR RD RA 
;QUESTION media.os.fressnapf.com. IN A
;ANSWER media.os.fressnapf.com. 3599 IN CNAME fn-aurora-prod.azurefd.net.
fn-aurora-prod.azurefd.net. 29 IN CNAME star-azurefd-prod.trafficmanager.net.
star-azurefd-prod.trafficmanager.net. 29 IN CNAME t-0003.t-msedge.net.
t-0003.t-msedge.net. 59 IN CNAME Edge-Prod-STOr3.ctrl.t-0003.t-msedge.net.
Edge-Prod-STOr3.ctrl.t-0003.t-msedge.net. 239 IN CNAME standard.t-0003.t-msedge.net.
standard.t-0003.t-msedge.net. 239 IN A 13.107.246.13
;AUTHORITY
;ADDITIONAL

Any idea where the difference comes from?
Any hints on how I can debug this problem?

Peter

I did a quick test and it is resolving fine on my side, with DNSSEC enabled.
How is kresd resolving? Is it forwarding or resolving locally?

Does this help?

root@turris:~# cat /etc/config/resolver

config resolver 'common'
	list interface '0.0.0.0'
	list interface '::0'
	option port '53'
	option keyfile '/etc/root.keys'
	option verbose '0'
	option msg_buffer_size '65552'
	option msg_cache_size '20M'
	option net_ipv6 '1'
	option net_ipv4 '1'
	option prefered_resolver 'kresd'
	option prefetch 'yes'
	option static_domains '1'
	option dynamic_domains '1'
	option forward_upstream '0'
	option ignore_root_key '0'
	option edns_buffer_size '1232'

config resolver 'kresd'
	option rundir '/tmp/kresd'
	option log_stderr '1'
	option log_stdout '1'
	option forks '1'
	option keep_cache '1'
	option include_config '/etc/kresd/custom.conf'
	list rpz_file '/etc/kresd/adb_list.overall'

config resolver 'unbound'
	option outgoing_range '60'
	option outgoing_num_tcp '1'
	option incoming_num_tcp '1'
	option msg_cache_slabs '1'
	option num_queries_per_thread '30'
	option rrset_cache_size '100K'
	option rrset_cache_slabs '1'
	option infra_cache_slabs '1'
	option infra_cache_numhosts '200'
	list access_control '0.0.0.0/0 allow'
	list access_control '::0/0 allow'
	option pidfile '/var/run/unbound.pid'
	option root_hints '/etc/unbound/named.cache'
	option target_fetch_policy '2 1 0 0 0'
	option harden_short_bufsize 'yes'
	option harden_large_queries 'yes'
	option qname_minimisation 'yes'
	option harden_below_nxdomain 'yes'
	option key_cache_size '100k'
	option key_cache_slabs '1'
	option neg_cache_size '10k'
	option prefetch_key 'yes'

config resolver 'unbound_remote_control'
	option control_enable 'yes'
	option control_use_cert 'no'
	list control_interface '127.0.0.1'

Yes, this shows that you are doing local resolution (which is good, because it avoids relying on an external DNS resolver). It doesn't explain what is going on, though.

I can reproduce it with plain Knot Resolver (without forwarding). TL;DR: I see their nameserver replying incorrectly, and perhaps some other resolvers manage to recover from that. Detailed analysis below, for reference.

The first weird reply I see is:

[20505.06][resl]   => id: '47107' querying: '8.20.243.107#00053' score: 10 zone cut: 'fressnapf.com.' qname: 'os.FrESsnApf.coM.' qtype: 'NS' proto: 'udp'
[20505.06][iter]   <= answer received: 
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 47107
;; Flags: qr aa cd  QUERY: 1; ANSWER: 4; AUTHORITY: 0; ADDITIONAL: 1

;; EDNS PSEUDOSECTION:
;; Version: 0; flags: do; UDP size: 4096 B; ext-rcode: Unused

;; QUESTION SECTION
os.fressnapf.com.               NS

;; ANSWER SECTION
os.fressnapf.com.       3600    NS      ns1-04.azure-dns.com.
os.fressnapf.com.       3600    NS      ns2-04.azure-dns.net.
os.fressnapf.com.       3600    NS      ns3-04.azure-dns.org.
os.fressnapf.com.       3600    NS      ns4-04.azure-dns.info.

;; ADDITIONAL SECTION

[20505.06][iter]   <= rcode: NOERROR
[20505.06][iter]   <= continuing with qname minimization

Here the ns?.eurodns.com. server replies authoritatively, although I expect that it really wanted to reply with a referral (without the AA flag; EDIT: and with the NS records in the AUTHORITY section instead of ANSWER). Consequently, kresd also sends the deeper media.os.fressnapf.com queries to the same set of servers:

[20505.07][resl]   => id: '13137' querying: '8.20.243.108#00053' score: 10 zone cut: 'os.fressnapf.com.' qname: 'mEDIA.Os.FRESSnAPf.coM.' qtype: 'A' proto: 'udp'
[20505.07][iter]   <= answer received: 
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 13137
;; Flags: qr cd  QUERY: 1; ANSWER: 0; AUTHORITY: 4; ADDITIONAL: 1

;; EDNS PSEUDOSECTION:
;; Version: 0; flags: do; UDP size: 4096 B; ext-rcode: Unused

;; QUESTION SECTION
media.os.fressnapf.com.         A

;; AUTHORITY SECTION
os.fressnapf.com.       3600    NS      ns1-04.azure-dns.com.
os.fressnapf.com.       3600    NS      ns2-04.azure-dns.net.
os.fressnapf.com.       3600    NS      ns3-04.azure-dns.org.
os.fressnapf.com.       3600    NS      ns4-04.azure-dns.info.

;; ADDITIONAL SECTION

[20505.07][iter]   <= rcode: NOERROR
[20505.07][iter]   <= lame response: non-auth sent negative response

At which point the server apparently attempts a referral, but it has already answered authoritatively for os.fressnapf.com, and now we're one level deeper, so kresd gets completely confused (with respect to what the other server apparently meant).
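The distinction kresd is tripping over can be illustrated with a small classifier. This is a hypothetical, simplified sketch for illustration only (not kresd's actual logic, and real replies carry more signals): a correct referral has no AA flag and carries the NS set in the AUTHORITY section, while the first broken reply above sets AA and puts the NS set in ANSWER.

```python
# Hypothetical sketch: classify an upstream reply to an NS query
# by its AA flag and where the NS records appear. Not kresd code.

def classify_reply(aa: bool, ns_in_answer: bool, ns_in_authority: bool) -> str:
    """Roughly classify a reply as referral, authoritative, or confusing."""
    if not aa and ns_in_authority and not ns_in_answer:
        return "referral"        # delegation: follow the NS set downward
    if aa and ns_in_answer:
        return "authoritative"   # server claims authority for the zone
    return "lame/confusing"      # mixed signals, as in this thread

# First reply from 8.20.243.107 above: AA set, NS records in ANSWER.
print(classify_reply(aa=True, ns_in_answer=True, ns_in_authority=False))
# What a correct referral would look like: no AA, NS records in AUTHORITY.
print(classify_reply(aa=False, ns_in_answer=False, ns_in_authority=True))
```

Under this simple model, the first reply classifies as "authoritative" even though a referral was evidently intended, which matches the confusion described above.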

So… behavior in this case could be improved even on the Knot Resolver side, but as (1) this only happens when the other side is incorrect and (2) it seems to be relatively rare, I currently don't see it as a priority.

I notified domain-service@fressnapf.com and itservices@eurodns.com about their incorrect answers (taken from whois).


Thanks a lot, @vcunat, for looking into that.

It looks as if they already fixed it.

root@turris:~# dig media.os.fressnapf.com

; <<>> DiG 9.16.8 <<>> media.os.fressnapf.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9674
;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;media.os.fressnapf.com.		IN	A

;; ANSWER SECTION:
media.os.fressnapf.com.	3600	IN	CNAME	fn-aurora-prod.azurefd.net.
fn-aurora-prod.azurefd.net. 30	IN	CNAME	star-azurefd-prod.trafficmanager.net.
star-azurefd-prod.trafficmanager.net. 30 IN CNAME t-0003.t-msedge.net.
t-0003.t-msedge.net.	60	IN	CNAME	edge-prod-dus30r3a.ctrl.t-0003.t-msedge.net.
edge-prod-dus30r3a.ctrl.t-0003.t-msedge.net. 240 IN CNAME standard.t-0003.t-msedge.net.
standard.t-0003.t-msedge.net. 240 IN	A	13.107.246.13

;; Query time: 2270 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sat Jan 09 10:21:50 CET 2021
;; MSG SIZE  rcvd: 245

Again, many thanks, @vcunat

Peter

No, it seems to be the same. The interaction does not always lead to failure; maybe repeated queries help, too (I don't remember).

I got no reaction from either e-mail so far. By the way, it's possible to avoid the problem by adding this config line:

policy.add(policy.suffix(policy.FLAGS('NO_MINIMIZE'), policy.todnames({'os.fressnapf.com.'})))

Does this line have any drawbacks?
I mean, there is probably a reason why this flag is not always set, right?

Peter

Well, traditionally (for decades) basically all resolvers always sent the full query to every layer in the DNS (including the root servers). In Knot Resolver we're cutting it down by default, and lately that's becoming a more popular choice.

The main advantage of minimization is better privacy (only TLD names are sent towards the root, for example). The disadvantages are typically bugs like this one, as some servers just didn't expect or test such queries and get them wrong. And in some (deeper) cases the minimized approach needs more round-trips.
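For illustration, the minimized query sequence for the name in this thread can be sketched like this (a simplified model of QNAME minimization; real kresd also adapts the query type and restarts from known zone cuts):

```python
# Simplified sketch of QNAME minimization: instead of sending the full
# name everywhere, the resolver walks from the TLD down to the full name.

def minimized_qnames(qname: str) -> list[str]:
    """Return progressively longer suffixes, TLD first."""
    labels = qname.rstrip(".").split(".")
    return [".".join(labels[i:]) + "." for i in range(len(labels) - 1, -1, -1)]

print(minimized_qnames("media.os.fressnapf.com"))
# -> ['com.', 'fressnapf.com.', 'os.fressnapf.com.', 'media.os.fressnapf.com.']
# With NO_MINIMIZE, the full name would be sent to every server instead,
# which is why the flag sidesteps the broken replies seen above.
```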

EDIT: that’s also why, above, I suggested overriding the setting only for the problematic subtree.
