Haas_proxy >90% cpu usage + kresd errors

Skywalker-11 · February 5, 2018, 12:06am

I noticed that the haas proxy uses >90% of one of the cpu cores:
/usr/bin/python2.7 -m haas_proxy -n --pidfile=/tmp/haas.pid --syslog haas_proxy --device-token xxxxxxxxxxx --port 2525
The xxxxxxxxxxx matches the token on https://haas.nic.cz/device
Also syslog-ng -F uses 40-60% of the second cpu core

I have this errors all over the system log:

2018-02-05T00:00:01+01:00 info /usr/sbin/cron[27381]: (root) CMD (/usr/share/server-uplink/registration_code.sh)
2018-02-05T00:00:01+01:00 info /usr/sbin/cron[27382]: (root) CMD (/etc/init.d/haas-proxy enabled && /etc/init.d/haas-proxy check_fw)
2018-02-05T00:00:01+01:00 info /usr/sbin/cron[27385]: (root) CMD (/usr/share/nikola/bin/nikola.sh)
2018-02-05T00:00:01+01:00 info twisted[]: [haas_proxy.proxy.ProxySSHFactory] Could not accept new connection (EMFILE)
2018-02-05T01:00:01+01:00 info twisted[25670]: Last message '[haas_proxy.proxy.Pr' repeated 215 times, suppressed by syslog-ng on turris
2018-02-05T00:00:01+01:00 info /usr/sbin/cron[27363]: (root) MAIL (mailed 40 bytes of output but got status 0x0001
)
2018-02-05T00:00:01+01:00 info twisted[]: [haas_proxy.proxy.ProxySSHFactory] Could not accept new connection (EMFILE)
2018-02-05T01:00:02+01:00 info twisted[25670]: Last message '[haas_proxy.proxy.Pr' repeated 2387 times, suppressed by syslog-ng on turris
2018-02-05T01:00:02+01:00 info nikola[]: (v42) recognized WAN interfaces: eth1, lo
2018-02-05T00:00:02+01:00 info twisted[]: [haas_proxy.proxy.ProxySSHFactory] Could not accept new connection (EMFILE)
2018-02-05T01:00:04+01:00 info twisted[25670]: Last message '[haas_proxy.proxy.Pr' repeated 5232 times, suppressed by syslog-ng on turris
2018-02-05T00:00:04+01:00 err kresd[5774]: [priming] cannot resolve '.' NS, next priming query in 10 seconds

EDIT: Problem maybe related to the fact that the dns server wasn’t reachable or it happens after some time because it is working again with little to no cpu usage. I will monitor it further to find it out

MiKe · February 5, 2018, 5:32am

Hi, are you on the latest turris version (3.9.4)? I’ve noticed similar behavior but on oldest version - 3.9.1-2?
There was still port forwarding rule from old honeypot in firewall rules and after manual removal and restart CPU usage went back to normal but I’m not sure if it was related.

Pepe · February 5, 2018, 2:08pm

Hello,
somebody can be looking for a solution, so I will write it also here.
We’ll disable HaaS logging by default in 3.9.5 (currently it is in RC).
So in future, this shouldn’t happen at all.

Skywalker-11 · February 5, 2018, 10:56pm

Yes I’m running now 3.9.4 but not sure which version ran when the problem happened. Could have been 3.9.1-2. It looks fine for now.

Skywalker-11 · February 13, 2018, 9:42pm

I’m running 3.9.5 now but the problem with high cpu usage still exists except that syslog-ng doesn’t require so much cpu anymore. Syslog shows errors:

2018-02-13T21:21:06+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds
2018-02-13T21:21:16+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds
2018-02-13T21:21:26+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds
2018-02-13T21:21:36+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds
2018-02-13T21:21:46+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds
2018-02-13T21:21:56+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds
2018-02-13T21:22:01+01:00 info /usr/sbin/cron[6031]: (root) CMD (/usr/bin/rainbow_button_sync.sh)
2018-02-13T21:22:01+01:00 info /usr/sbin/cron[6030]: (root) CMD (nethist_stats.lua)
2018-02-13T21:22:06+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds
2018-02-13T21:22:13+01:00 warning []: Unable to use //tmp//yin2yang.xsl (No such file or directory).
2018-02-13T21:22:13+01:00 warning []: YANG format data models will not be available via get-schema.
2018-02-13T21:22:13+01:00 warning []: Unable to get tty (Not a tty) to get the client's hostname (session 6133).
2018-02-13T21:22:16+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds
2018-02-13T21:22:26+01:00 err kresd[30868]: [priming] cannot resolve '.' NS, next priming query in 10 seconds

The errors from kresd are shown right after a reboot of the router. But the haas_proxy is still fine then (0% CPU). The CPU usage seem to rise some time after reboot
I also noticed that foris shows that firewall logs can not be send (“Unknown status”)

vcunat · February 14, 2018, 9:52am

And does DNS even start to work after some time? The lines from kresd shows that it’s unable to contact upstream servers for more than a minute – that seems very unusual to me.

Skywalker-11 · February 14, 2018, 10:07pm

I am not using the internal kresd dns server but an external bind9 server. It is working fine as I can do nslookups from the TO and the foris dns test shows that everything is fine.
But I noticed that nslookup on the turris seems to be broken as it ignores the dns server if it is specified and instead uses the default dns server(in my case the bind9 server).
Dig is working as expected and uses the specified server but shows a servfail for the TO kresd for every query.

Skywalker-11 · February 15, 2018, 7:22pm

Is there a way to debug why kresd always gets a servfail?

vcunat · February 15, 2018, 9:35pm

How did you set it up in Foris / DNS? In particular, if you wanted forwarding, do you forward to some reasonable servers? (Some that don’t block DNSSEC records.)

EDIT: BTW, I understood you don’t want to use it anyway, so why bother?

Skywalker-11 · February 15, 2018, 10:42pm

I looked into the foris settings again and entered my dns server for the wan connection which resulted in multiple queries incoming to the bind9 servers. This also seems to have fixed the errors in the logs and kresd address resolving.
What seems to be weird is that after reverting that change by removing the server from the wan settings and restarting the kresd service /etc/init.d/kresd it was still working but after a reboot of the TO the problem was back again. Adding the server back to the wan config makes it working again.

Just for reference the foris interface says for the dns servers in the wan settings: DNS server address is not required as the built-in DNS resolver is capable of working without it.

I defined my own server with DNS_SERVERS="192.168.230.5" in /etc/init.d/dnsmasq so that the server is added to the resolv.conf and so is available to the TO as dns server.

vcunat · February 15, 2018, 10:46pm

It is capable to resolve DNS by itself, but that’s what happens iff you disable forwarding in Foris / DNS.

Skywalker-11 · February 17, 2018, 11:35am

Thanks for the clarification @vcunat . Maybe the info text should be updated to say that.

Now only the problem with nslookup not using the defined server is left. Eg. nslookup turris.cz 8.8.8.8 shows the following output but isn’t querying the google dns but the default local dns server.

nslookup turris.cz 8.8.8.8
Server:    8.8.8.8
Address 1: 8.8.8.8 google-public-dns-a.google.com

Name:      turris.cz
Address 1: 2001:1488:ac15:ff80::69
Address 2: 217.31.192.69

vcunat · February 19, 2018, 10:03am

How did you determine it’s querying the local server? The output shows it’s querying Google (the first two lines of output).

Skywalker-11 · February 19, 2018, 7:00pm

I have a public domain example.com with public dns entry A of x.x.x.x and also have the same domain on my local bind9 which resolves to a different local ip y.y.y.y
If I do nslookup example.com 8.8.8.8 this resolves to address y.y.y.y on the TO as if I would do nslookup example.com local.bind9.dns.server. It should resolve to x.x.x.x because the google dns only knows that public dns entry. Also I see the queries in the dns log of the bind9

Doing the same queries from windows or linux machines work as expected:
querying the 8.8.8.8 server resolves to x.x.x.x
querying the local.bind9.dns.server resolves to y.y.y.y

Skywalker-11 · February 19, 2018, 7:16pm

I also noticed that not providing a dns server to nslookup results in

# nslookup turris.cz
nslookup: can't resolve '(null)': Name does not resolve

Name:      turris.cz
Address 1: 2001:1488:ac15:ff80::69
Address 2: 217.31.192.69

nslookup is a symlink to busybox from 15th January 2018

# ls -l /usr/bin/nslookup
lrwxrwxrwx    1 root     root            17 Jan 15 12:10 /usr/bin/nslookup -> ../../bin/busybox
# ls -le /bin/busybox
-rwxr-xr-x    1 root     root        416452 Mon Jan 15 12:10:44 2018 /bin/busybox

Skywalker-11 · February 22, 2018, 6:57pm

@vcunat can you reproduce the problem or is it only a local problem on my TO?

vcunat · February 23, 2018, 7:21am

I can reproduce both. The '(null)' part is a known bug (on this forum), and probably a harmless one.

The other thing is very strange – clearly even if one specifies a server, nslookup does ~~(also?)~~ query Omnia’s DNS for the same name. I tried also with newer BusyBox v1.28.0 from a(nother) Linux distribution and it acts the same; on the other hand, full nslookup from BIND doesn’t suffer from this. EDIT: tried also with valid IP that does not provide DNS at all, and nslookup doesn’t notice any problem.

vcunat · February 23, 2018, 7:37am

Ah, these people have dug deeper: https://dev.openwrt.org/ticket/18333 and https://bugs.busybox.net/show_bug.cgi?id=675

Skywalker-11 · February 26, 2018, 11:51pm

Just confirmed. Problem is actually in source of busybox. But thanks for the advice

Skywalker-11 · March 13, 2018, 1:11am

The initial problem with high cpu usage of haas_proxy is back

EDIT: After one day the cpu usage was still near 100%. Stopping the service resulted in following output:

~# /etc/init.d/haas-proxy stop
iptables: No chain/target/match by that name.

After /etc/init.d/haas-proxy start the cpu usage was OK again <2%