DNS-over-TCP: Just a single transaction?

Foris → DNS → Use forwarding: Use provider’s DNS resolver

Currently, I am debugging an issue of mine. To ease debugging, I disabled DNSSEC. For testing, I use SSH and run pkgupdate there. In that case, DNS resolution does not work (still investigating; I know several workarounds, so I am fine). In Wireshark, I see that, after two (?) attempts, my Turris MOX falls back from DNS-over-UDP to DNS-over-TCP. However, instead of using one TCP connection for each DNS transaction, my Turris MOX re-uses a single TCP connection. In some cases, it even issues several DNS transactions side by side. I call that ‘bulk behavior’.

The problem: My DNS server allows only one transaction per TCP connection. Only the very first DNS query is answered. With a new TCP connection, again, only the first DNS query is answered. I have not looked up in the RFCs who is to blame. My question: Does anyone know a configuration flag or setting I can tweak so that my Turris MOX opens a new TCP connection for each DNS transaction?

You can’t tweak that. Sending multiple messages over a single TCP connection is over 30 years old.
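For illustration, what makes this possible on the wire is the two-byte length prefix that precedes every DNS message sent over TCP (RFC 1035, section 4.2.2). A minimal sketch of that framing (the payloads below are placeholders, not real DNS messages):

```python
import struct

def frame_dns_message(message: bytes) -> bytes:
    """Prefix a DNS message with its two-byte length (RFC 1035, 4.2.2).

    This prefix is what lets a client send several queries back-to-back
    on one TCP connection and lets each side delimit the messages."""
    return struct.pack("!H", len(message)) + message

def split_dns_stream(stream: bytes) -> list:
    """Split a received TCP byte stream back into individual DNS messages."""
    messages = []
    offset = 0
    while offset + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, offset)
        messages.append(stream[offset + 2 : offset + 2 + length])
        offset += 2 + length
    return messages

# Two (placeholder) messages travelling over a single TCP connection:
stream = frame_dns_message(b"query-1") + frame_dns_message(b"query-2")
print(split_dns_stream(stream))  # → [b'query-1', b'query-2']
```

A server that only ever reads one message per connection is ignoring this framing, which is the behavior described in the question above.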

If the ISP’s resolver is of this quality, I don’t think you’d want to forward to them. My first (personal) choice would be to go without forwarding, but now there are various public providers, too (some as one-click in Foris).

There seems to be a particular RFC pertinent to connection reuse in DoT, at least judging from this bug [1] still (shockingly) being unresolved for unbound.

[1] https://www.nlnetlabs.nl/bugs-script/show_bug.cgi?id=4089

Thanks, I know the alternatives and workarounds. However, forwarding was my initial choice when I clicked through the configuration wizard, and other users may have the same scenario. Therefore, I am looking into this issue.

Well, this DNS resolver is probably the most used one in Germany: it is the DNS server in AVM FRITZ!OS. My Turris MOX has severe issues with FRITZ!OS as the upstream router. One isolated issue is this single-transaction thing. I still have to figure out why I face all my other issues (some services in Turris OS ignore the forwarding setting and still go for full-resolver mode, some services use a different DNS server, why does my Turris MOX dislike the initial answers and fall back to TCP at all, …).

By the way, because nearly every Internet provider gives away a FRITZ!Box, used ones are terribly cheap on eBay. For example, the FRITZ!Box 7362 SL or FRITZ!Box 7520 still get the latest AVM FRITZ!OS. Although they come with a DSL modem, they can be re-configured via the Web wizard to re-use your existing broadband connection. It might be a good testing device in front of Turris OS.

Since FRITZ!OS on the FRITZ!Box is proprietary closed-source code, it is unknown what its DNS daemon implementation really entails.

I’m sad this will affect so many people by default. I’m generally not fond of doing workarounds because of others breaking long-established protocol standards; we actually tend to push the other way now (example).

I don’t know what the workaround would even look like. Doing one query per connection by default is certainly out of the question; it’s just expensive and harming “good implementations” in order to help “bad” ones.

I’ve personally never liked the default of forwarding to ISP resolvers, exactly because their quality is relatively commonly a problem, and doing “advanced” stuff like DNSSEC validation (or this topic) tends to expose hidden issues. I heard some ISPs offer custom services that rely on using their DNS, so one default probably can’t please (almost) everyone.

I was not asking to change the default for everyone. Especially for DNS-over-TLS, doing several transactions per connection is more than obvious. I just wanted to tweak this to debug further. Currently, my Turris MOX behaves totally crazy and I do not know why. I identified this issue and hoped it would help me understand the other issues. Is there any trick to debug the Knot Resolver in detail?

Depends. There’s verbose logging in kresd; probably difficult to understand for those who don’t know how it works. Then there’s the classic tcpdump (Wireshark).

Mhm. You are right, I do not understand what that debug package does, although I recently added DANE (and DNSSEC via Unbound) to a bigger open-source project. My Turris MOX is attached to a switch with port mirroring. With that, I see the DNS traffic in Wireshark. However, I do not see anything from that package. I have to look deeper into its shell script now.

Nevertheless, if you have a debug guide somewhere floating around, that would be much appreciated. Especially because my Turris MOX goes into full-resolver mode after each start, and only then honors what I set in Foris (go for my ISP).

Today, I found a way to replicate the issue via a normal command line, for example Ubuntu 19.10:
dig @fritz.box +short +tcp +keepopen example.com A example.net A

This creates two DNS transactions in one TCP connection. And yes, the current FRITZ!OS 07.12 fails to handle it. That confirms my theory and allows me to report it. However, while writing the issue report for AVM, I am no longer sure this is a bug: RFC 7766 from 2016 states that connection reuse over TCP is only a SHOULD. Consequently, this would be a missing feature rather than a bug.

Actually, kresd is the one that fails, because it does not re-send the second query after the TCP connection got closed. I am going to file a feature request with AVM. However, where do I report that fail-over issue with kresd?

Finally, in my setup, I am still not sure why kresd moves over to TCP at all, and I am still not sure how to debug that. Can I compile kresd on a desktop computer and debug it via gdb, for example? In other words, if I follow the Knot Resolver compile steps on Ubuntu, for example, do I get very similar behavior to kresd on my Turris MOX? Or would anyone recommend another approach? I have to figure out why kresd is not doing UDP. And in the next step, I have to figure out who calls kresd, as I see full-resolver usage at the very start of my Turris MOX although I disabled this via Foris.

Well, it’s also a SHOULD in 7766 :slight_smile:

DNS clients SHOULD retry unanswered queries if the connection closes before receiving all outstanding responses. No specific retry algorithm is specified in this document.
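To illustrate what that SHOULD asks of a client, here is a minimal sketch (not kresd's actual code) of retrying unanswered queries on fresh connections, run against a fake server that, like the FRITZ!OS behavior described above, answers only the first query per connection:

```python
def drain_with_retry(send_batch, queries, max_retries=3):
    """Sketch of the RFC 7766 retry SHOULD.

    `send_batch` stands in for "send these queries on one fresh TCP
    connection" and returns the subset answered before the server
    closed the connection; anything unanswered is retried on a new
    connection, up to `max_retries` connections in total."""
    answered = []
    outstanding = list(queries)
    for _ in range(max_retries):
        if not outstanding:
            break
        got = send_batch(outstanding)
        answered.extend(got)
        outstanding = [q for q in outstanding if q not in got]
    return answered, outstanding

# Fake server answering only the first query per connection:
one_per_connection = lambda qs: qs[:1]
answered, left = drain_with_retry(one_per_connection, ["q1", "q2", "q3"])
print(answered, left)  # → ['q1', 'q2', 'q3'] []
```

With retries, even the one-answer-per-connection server eventually serves everything; without them, the second and third queries are simply lost, which matches the failure described in this thread.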

If you already have packet captures from this showing the amounts of UDP and TCP retries, I think that would be helpful. I suspect the verbose logs won’t add anything on top of that, but why not save them in case you already have them. You can post here or directly to the upstream issue I now opened.

Yes, that seems a separate issue, probably around configuration on Turris side. I haven’t heard about anything like that yet. My own insight into these parts is much worse than into kresd itself, especially around MOX early startup. @paja: any immediate idea?

For this step the best way seems the verbose log and Wireshark. But what you wrote in the original post is intentional – after a few UDP attempts without reaction, TCP is tried. And once there’s a TCP already, I think all queries towards the same target go over it.

You may note the disadvantage of UDP: kresd has no way of knowing if the query/answer got lost (unlikely towards local Fritz) or if the target is working hard on getting the answer (e.g. waiting on other servers). TCP is more reliable in some senses, which is why it serves as fallback.
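As an illustration of the behavior described in the two paragraphs above, here is a sketch of the transport-selection policy (NOT kresd's real code, just the idea):

```python
def next_transport(udp_attempts_failed, tcp_already_open,
                   max_udp_attempts=2):
    """Illustrative policy: retry over UDP a few times; after that,
    fall back to TCP. And once a TCP connection to the target exists,
    keep sending further queries over it rather than opening more."""
    if tcp_already_open:
        return "tcp"
    if udp_attempts_failed >= max_udp_attempts:
        return "tcp"
    return "udp"

print(next_transport(0, False))  # → udp (first attempt)
print(next_transport(2, False))  # → tcp (fallback after failed retries)
print(next_transport(0, True))   # → tcp (reuse the open connection)
```

The last case is exactly why, once one query has triggered the TCP fallback, subsequent queries end up bundled onto the same connection ("bulk behavior").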

We provide binary packages for several distributions, including ones with debugging symbols, and there are compilation instructions as well. But I don’t think it will be that easy, as you can’t debug a protocol timeout issue by manually stepping in a debugger :wink: On x86 Linux with recent CPUs, one can record an execution and replay it in a debugger later (arbitrarily stepping back and forth).

Yes, here in this thread, let us concentrate on UDP for a moment.

There is nothing in the logs, or I am looking at the wrong logs. The problem with Wireshark is that I see only symptoms; I do not see the cause or who is causing the issue. This makes debugging something like this so difficult. I know that pkgupdate fails. I know it fails because of DNS. Something does not like the UDP answers. However, I do not know what sits between the DNS query (which I see in Wireshark) and the DNS command (created by the script pkgupdate). Therefore, I would like to dig into the states via a debugger.

On my Ubuntu 19.10, I installed the .deb and then went for
sudo apt install knot-dnsutils
kdig @ +short +bufsize=4096 repo.turris.cz A
which gives me very much the same DNS query as with my Turris OS 4.0.5. That worked, no TCP failover. The next step is to add the next layer in between which, as far as I understand the architecture of Turris OS, is kresd.
sudo apt install knot-resolver
sudo service kresd@1 start
kdig @localhost +short -dnssec repo.turris.cz A
That worked, no TCP failover, either.

Although I can trigger the retries in Knot Resolver (30 ms for the first retry, 200 ms for the second retry), I do not see that as the issue in my Wireshark trace with Turris OS. There, the UDP answer is faster than 30 ms, but I see a retry after 680 ms. Who starts this (and why)? And literally at the same time (or after the first UDP response!?), Knot Resolver changes to TCP, too.

I would probably look at the verbose log from the occasion, it should explain the reasons better. For example:

# start verbose logging
echo 'verbose(true)' | nc -U /run/knot-resolver/control/1
# now somehow reproduce what you want to study
# inspect logs
journalctl -u kresd@1.service
# perhaps stop verbose logging
echo 'verbose(false)' | nc -U /run/knot-resolver/control/1

Step by step, we come closer to the causing issue. However, verbose logging on my Turris MOX was a bit more complicated, because nc in Turris OS does not have a parameter for UNIX domain sockets. Actually, I had to do most of the stuff from that script (within the package resolver-debug) myself.

The issue can be reproduced with two simple dig repo.turris.cz invocations on the SSH command line of my Turris MOX already. If the second command is executed rapidly, within 15 seconds, dig returns an empty answer.

In knot-resolver, the check is_paired_to_query fails; at least tail -f /var/log/messages prints <= ignoring mismatching response. Therefore, my Turris MOX changes from UDP to TCP. And this happens for DNS responses which contain a CNAME + A/AAAA. It does not happen for answers which contain just an A or AAAA record, for example when I run the very same dig for proxy.turris.cz.

Now, as the next step, I have to replicate the same on my Ubuntu computer. Then I will compile knot-resolver myself and step into it with a debugger. I really wonder what fails within is_paired_to_query. By the way, on the webpage the latest version is 5.0.1. In Turris OS, the version is 3.2.1, and the very same version is used in Ubuntu 19.10. However, where exactly do I find the source code which Turris OS uses? Perhaps they use 3.2.1 patched somehow.

This should work on Turris out of the box instead of nc:

socat - /tmp/kresd/tty/*

Knot Resolver version 3.2.1 was used in the Turris OS 4.x release. In Turris OS 3.x and 5.x, you can find version 4.3.0, but we know there’s a newer version and there were some incompatible changes, which I need to handle; it is on my to-do list.

About the source code, it is a little bit problematic, as we have merged Turris OS 5.0.0 into the branch from which Turris OS 4.x was released, but we have tags. You can take a look here:

If you would like, you can try Turris OS 5.0.0. Also, there’s a git-hash file, which can be found in our repository; in that file, you can see from which feeds it was built.

To be clear, nothing reported by this user is “fixed” in any Knot Resolver release so far.

I went for rescue mode 6, kept everything at default, switched the channel, and tried 5.0.0: same issue. That rules out merge request 839.

However, I am getting closer to the issue because I am able to reproduce it on my Ubuntu computer now. When I query repo.turris.cz via my Turris MOX (the very first query works as expected via UDP), and then immediately query repo.turris.cz via my Ubuntu computer, the computer changes to TCP because of a ‘mismatch’. I have not figured out the timing yet; it looks related to the time-to-live of the A record. Anyway, this should enable me to debug this much more easily.

If I do not find the time to do that, or fail otherwise: Could the Knot Resolver team add more verbose statements to is_paired_to_query, for example one for each condition checked? Could the Turris team backport that change to their version of the Knot Resolver? The current ‘mismatch’ message is not sufficient; at least I do not get it.

By the way, the posted Wireshark log contains those ‘mismatching’ records. At the start, the first query is resolved via UDP as expected. However, the Knot Resolver does not cache that (why ever), starts a new query for the subsequent request, declares the answer as mismatching, and then goes for TCP.

I again inspected the pcap sent to the GitLab issue, and some of the UDP answers indeed have mismatching QNAME case randomization. The function checks some properties linking the request to its answer, which is a long-standing mitigation against some kinds of attacks (from the time before DNSSEC was deployable, as that provides much stronger security).

As a generic workaround for the mismatch, TCP is used, but that’s also bad luck in this case.
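To make the case-randomization mismatch concrete, here is a sketch of the idea (not kresd's actual is_paired_to_query): the client randomizes the letter case of the query name, and a genuine answer is expected to echo the name back with the exact same case; an upstream that normalizes the name to lowercase fails the comparison.

```python
import random

def randomize_qname_case(qname: str) -> str:
    """Randomize the letter case of a query name (the "0x20" technique).

    DNS names compare case-insensitively, so a correct server echoes the
    query name unchanged; an off-path spoofer has to guess the case bits."""
    return "".join(c.upper() if random.random() < 0.5 else c.lower()
                   for c in qname)

def answer_matches_query(query_qname: str, answer_qname: str) -> bool:
    """Sketch of the pairing check: compare case-sensitively. An upstream
    that lowercases the name in its answer fails this check, producing
    the 'mismatching response' situation described in this thread."""
    return query_qname == answer_qname

q = "rEpO.tUrRiS.cZ"
print(answer_matches_query(q, q))                 # → True
print(answer_matches_query(q, "repo.turris.cz"))  # → False
```

So even though both names refer to the same domain, a lowercased echo looks like a potential spoof to the resolver, and TCP is the generic fallback.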