Majordomo takes up all space in /tmp and crashes DNS

anarcat · March 25, 2017, 1:57pm

Hi,

Summary: majordomo filled /tmp and broke DNS resolution. Removing ucollect_majordomo and restarting dnsmasq fixed the problem for me, and I uninstalled majordomo. Out-of-disk conditions in /tmp should not disable DNS, and majordomo should not fill the disk.

Yesterday evening, I noticed the DNS resolution was broken on the local network. I didn’t investigate further, thinking it was a transient upstream issue - but it was still a problem this morning. I also noticed my IRC connexions were still up, which meant that the network (the TCP/IP layer) itself was operational. So I turned to the DNS server for more information. Let’s see:

root@octavia:~# cat /etc/resolv.conf 
search anarc.at
nameserver 127.0.0.1

Okay, so the nameserver is local, is it dnsmasq? Yes it is:

root@octavia:~# ps | grep dnsm
 3349 nobody     900 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf -k -x /var/run/dnsmasq/dnsmasq.pid

Here’s its configuration:

root@octavia:/tmp# cat /var/etc/dnsmasq.conf
# auto-generated config file from /etc/config/dhcp
conf-file=/etc/dnsmasq.conf
dhcp-authoritative
domain-needed
localise-queries
read-ethers
expand-hosts
local-service
port=0
domain=anarc.at
server=/anarc.at/
server=/anarc.at/192.168.0.3
server=/anarcat.ath.cx/192.168.0.3
server=/orangeseeds.org/192.168.0.3
server=/168.192.in-addr.arpa/192.168.0.3
server=/16.172.in-addr.arpa/192.168.0.3
server=/9.0.0.0.1.0.0.0.8.2.9.1.1.0.0.2.ip6.arpa/192.168.0.3
server=/a.0.d.4.b.e.1.9.0.c.2.f.7.0.6.2.ip6.arpa/192.168.0.3
server=/0.0.f.8.f.0.0.f.0.c.2.f.7.0.6.2.ip6.arpa/192.168.0.3
dhcp-leasefile=/tmp/dhcp.leases
resolv-file=/tmp/resolv.conf.auto
addn-hosts=/tmp/hosts
conf-dir=/tmp/dnsmasq.d
stop-dns-rebind
dhcp-broadcast=tag:needs-broadcast

dhcp-host=[...censored...]



dhcp-range=lan,192.168.0.100,192.168.0.249,255.255.255.0,12h
dhcp-option=lan,6,192.168.0.1
no-dhcp-interface=pppoe-wan

So far, so good: nothing looks wrong there. Yet clients can’t resolve anything. Hmm… So how do I diagnose dnsmasq? Normally, on OpenWRT I would look at the logread command, but that is broken on the Omnia:

root@octavia:/tmp# logread 
Failed to find log object: Not found
Failed to find log object: Not found

Instead, I must look into /var/log/messages, and there I found this gem:

2017-03-25T00:10:12-04:00 notice syslog-ng[1966]: Suspending write operation because of an I/O error; fd='10', time_reopen='60'
2017-03-25T00:11:12-04:00 notice syslog-ng[1966]: Error suspend timeout has elapsed, attempting to write again; fd='10'
2017-03-25T00:11:12-04:00 err syslog-ng[1966]: I/O error occurred while writing; fd='10', error='No space left on device (28)'
2017-03-25T00:11:12-04:00 notice syslog-ng[1966]: Suspending write operation because of an I/O error; fd='10', time_reopen='60'
2017-03-25T00:12:12-04:00 notice syslog-ng[1966]: Error suspend timeout has elapsed, attempting to write again; fd='10'
2017-03-25T00:12:12-04:00 err syslog-ng[1966]: I/O error occurred while writing; fd='10', error='No space left on device (28)'
2017-03-25T00:12:12-04:00 notice syslog-ng[1966]: Suspending write operation because of an I/O error; fd='10', time_reopen='60'
2017-03-25T00:13:12-04:00 notice syslog-ng[1966]: Error suspend timeout has elapsed, attempting to write again; fd='10'
2017-03-25T00:13:12-04:00 warning syslog-ng[1966]: syslog-ng internal() messages are looping back, preventing loop by suppressing further messages; recurse_count='2'
2017-03-25T00:14:12-04:00 notice syslog-ng[1966]: Error suspend timeout has elapsed, attempting to write again; fd='10'
2017-03-25T00:14:12-04:00 warning syslog-ng[1966]: syslog-ng internal() messages are looping back, preventing loop by suppressing further messages; recurse_count='2'
2017-03-25T00:15:12-04:00 notice syslog-ng[1966]: Error suspend timeout has elapsed, attempting to write again; fd='10'
2017-03-25T00:15:12-04:00 err syslog-ng[1966]: I/O error occurred while writing; fd='10', error='No space left on device (28)'

There are tons of messages like this. And, lo and behold, /tmp is full!

root@octavia:/var/log# df -h 
Filesystem                Size      Used Available Use% Mounted on
/dev/mmcblk0p1            7.3G    374.6M      6.9G   5% /
tmpfs                   503.6M    503.6M         0 100% /tmp
tmpfs                   512.0K      4.0K    508.0K   1% /dev

What’s going on in /tmp? Well, majordomo is what’s going on:

root@octavia:/tmp# du -schL *
4.0K    TZ
0       beaker
4.0K    dhcp.leases
0       dnsmasq.d
0       empty
8.0K    etc
0       extroot
0       fastcgi.python.socket-0
8.0K    hosts
18.0M   kresd
4.0K    kresd.config
4.0K    lcollect
4.0K    lcollect.md5
0       lib
4.0K    libatsha204.lock
0       lock
2.0M    log
4.0K    lvm
0       mounts
0       nethist
4.0K    resolv.conf
0       resolv.conf.auto
4.0K    resolv.conf.ppp
80.0K   run
0       screens
du: spool/cron/crontabs: No such file or directory
0       spool
4.0K    state
8.0K    sysinfo
0       syslog-ng.ctl
4.0K    syslog-ng.pid
483.5M  ucollect_majordomo
8.0K    update-state
4.0K    updater_crash.log
12.0K   user_notify
503.6M  total

Bam. 480MB of “something” by majordomo. Remove the file, restart dnsmasq, and the problem is solved:

root@octavia:/tmp# rm ucollect_majordomo 
root@octavia:/tmp# df -h 
Filesystem                Size      Used Available Use% Mounted on
/dev/mmcblk0p1            7.3G    374.6M      6.9G   5% /
tmpfs                   503.6M     20.1M    483.5M   4% /tmp
tmpfs                   512.0K      4.0K    508.0K   1% /dev
root@octavia:/tmp# /etc/init.d/dnsmasq restart
udhcpc: started, v1.25.1
udhcpc: sending discover
udhcpc: no lease, failing
root@octavia:/tmp# ping -c 1 example.com
PING example.com (93.184.216.34): 56 data bytes
64 bytes from 93.184.216.34: seq=0 ttl=59 time=36.141 ms

--- example.com ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 36.141/36.141/36.141 ms
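
The diagnosis above boils down to two checks: which filesystem is full, and which entry on it is the largest. Here is a minimal sketch of the second check, assuming a POSIX shell with du, sort and tail available (BusyBox provides all three). The name largest_entries is made up for this example; it is not an existing Turris or OpenWrt tool.

```shell
#!/bin/sh
# Hypothetical helper (not part of Turris OS or OpenWrt):
# list the N largest entries directly under DIR, sizes in KiB, largest last.
largest_entries() {
    dir=$1
    n=${2:-5}    # default: show the 5 largest entries
    # -s: one total per argument, -k: sizes in 1 KiB blocks.
    # Errors (e.g. dangling symlinks) are hidden; dotfiles are not matched by *.
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "$n"
}

# Example call:
#   largest_entries /tmp 5
```

Unlike the `du -schL *` call above, this does not follow symlinks, so the numbers can differ for entries that are links to other filesystems.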

Now, yay, my problem is solved, but it will probably happen again soon.

So, workaround: uninstall majordomo.

There are multiple solutions here:

1. dnsmasq should continue operating even without any disk space left - from my experience with other nameservers (e.g. bind9), there are ways to operate even if the disk is full
2. failing that, dnsmasq should use another directory than /tmp to store its state, for example somewhere under /var
3. failing that, majordomo should be the one putting its files somewhere else
4. ultimately, majordomo should rotate its storage and should not fill the disk
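
Until something like the last point exists, a stopgap could be a small watchdog run from cron that warns before /tmp fills up. The following is only a sketch under assumptions: usage_pct and check_usage are hypothetical names, the 90% default limit is arbitrary, and it relies on `df -P` printing the usage percentage in the fifth column (which both GNU and BusyBox df do).

```shell
#!/bin/sh
# Hypothetical watchdog helpers; none of this ships with Turris OS.

# Print the usage percentage (digits only) of the filesystem holding PATH.
usage_pct() {
    # -P forces the POSIX one-line-per-filesystem format;
    # column 5 is the capacity, e.g. "100%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Return 1 and print a warning on stderr if usage is at or above LIMIT percent.
check_usage() {
    mnt=$1
    limit=${2:-90}    # arbitrary default threshold
    pct=$(usage_pct "$mnt")
    case "$pct" in
        ''|*[!0-9]*) echo "cannot read usage of $mnt" >&2; return 2 ;;
    esac
    if [ "$pct" -ge "$limit" ]; then
        echo "warning: $mnt is ${pct}% full (limit ${limit}%)" >&2
        return 1
    fi
    return 0
}

# Example call, e.g. from a cron job:
#   check_usage /tmp 90 || echo "time to look at /tmp"
```

This only reports the condition; it deliberately does not delete anything, since deciding what is safe to remove from /tmp is a separate question.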

Any feedback welcome!

Pepe · March 25, 2017, 4:27pm

The workaround is to go to LuCI - Stats - Majordomo - Settings.
There is a “location” setting, described as:
Path to majordomo DB. Set it to permanent storage like USB disk to preserve data between reboots.

It depends on your usage. My Majordomo data since 1.3 takes up 27 MB.

Maxmilian_Picmaus · March 26, 2017, 4:56pm

https://www.turris.cz/doc/en/howto/majordomo

There is a recommendation there to store the Majordomo data on an SD card, flash drive, or SSD/HDD, to prevent this kind of out-of-space situation (so option 3 is my winner).

ad_1: I think any *nix machine has trouble serving services when /tmp or /proc is full.
ad_2: /var is linked to /tmp, so that is also off the list (unless you change it to some other location, but I am not sure what else might go wrong later on if you do that).
ad_3: That is the only way, and it also applies to minidlna, transmission, collectd (rrdtool) and so on. By default those use /var or /tmp, so you had better change that to some USB flash/SSD/HDD drive.
ad_4: Isn't that automatic? (I have about 4 months of data, taking about 65.5 MB.)

anarcat · March 27, 2017, 12:55pm

As I said in the original post, UNIX machines are generally fine with running out of disk space, depending on the service. Some that need to write (like MySQL, for example) freak out, crash and burn, but others that don’t (like Bind) continue running fine.

The nameserver on Turris should not crash when it runs out of disk space, period. It’s mostly stateless. The fact that it also provides DHCP services is, to me, weird, but shouldn’t keep the DNS part from running…

I prefer avoiding extra moving parts in that machine.

Well, I have no idea, but there was obviously something wrong here if I had a 500MB file.

I have been running the Omnia for about 4 months as well.

Maxmilian_Picmaus · March 27, 2017, 2:29pm

I am well aware of that behavior on other Unix systems, but this is more like LEDE, so there are always some limitations built in.

Plugging in a 32 GB flash drive and redirecting such log/stats files to it is quite easy and does no harm, and it solves a lot of possible trouble. It is also recommended in every official guide/howto for a lot of services.
I have a 64 GB thumb drive dedicated to logs/configs/stats/certificates/LXC area/user homes, trying to keep all my stuff outside the original filesystem structure.

Hm, interesting file size.

Indeed. Still, I will say, never run out of space in /tmp.
Name resolution on Turris is a special category, with its own bunch of issues here.

anarcat · March 27, 2017, 2:49pm

For me that means just uninstalling majordomo, which is unfortunate. I figured there would be a better way than telling users “just add more disk space” because that’s just working around the issue. What happens when that file reaches 32GB? I need to buy a new drive to fix DNS again? (Arguably, at the current rate it would take 24 years, but still…)
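
A back-of-the-envelope version of that estimate, as a sketch: it assumes linear growth, takes the 483.5M file from the du output rounded up to 484 MB, uses the "about 4 months" of uptime mentioned earlier in the thread, and counts 32 GB as 32768 MB. The name months_to_fill is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: months until CAPACITY_MB is full, assuming the file
# keeps growing linearly at GROWN_MB per OVER_MONTHS (integer arithmetic).
months_to_fill() {
    capacity_mb=$1
    grown_mb=$2
    over_months=$3
    echo $(( capacity_mb * over_months / grown_mb ))
}

# 32 GB drive, 484 MB accumulated over 4 months:
#   months_to_fill 32768 484 4    # prints 270
```

270 months is a bit over 22 years, so under these assumptions the result is in the same ballpark as the figure quoted above.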

vcunat · March 27, 2017, 5:17pm

For reference, knot-resolver persists its cache via an mmap-ped file (LMDB), so continuously sucking all free space from the same drive might cripple DNS. (It’s the default DNS provider in Omnia.)