Catastrophic disk failure - discovered after an update

bremby · May 18, 2018, 11:21am

Hi,

This is a long post; I’m looking for ideas, brainstorming, discussion and experience. Admins: I may have picked the wrong subforum, so feel free to move it.

I have a Turris Omnia with an attached USB 3.0 disk (a new WD MyBook, 4TB), formatted as ext4 and shared through Samba, Syncthing and SSHFS. Two days after an update we noticed the disk was not accessible, and SSHing into the router revealed that the disk was completely empty, having lost around 200GB of data. Not only had the data disappeared, but the lost+found directory, which is created automatically when the filesystem is formatted, was gone too. The disk looked as if someone had run rm -rf on it, except that the journal was also empty.

Omnia is being updated regularly and automatically, with its factory settings. I only receive e-mails about updates and planned maintenance reboots. Last update occurred on the 10th of May and the device rebooted at 3:30 on the 13th of May, which was this Sunday. On Tuesday we discovered the failure.

I have been running ext4magic since last evening to recover the lost data. It appears the data is still there, so I might be able to recover it without major loss. The ext4magic tool revealed the journal to be empty. I haven’t run smartctl yet, because data recovery has priority. dmesg did not reveal any errors related to the disk, but I admit that wasn’t the first thing I ran. As I didn’t know what was happening, the first thing I ran was fsck, which may have “repaired” the filesystem by clearing the journal or restoring some of the filesystem control structures on the disk.

Now I’m looking for explanations of what might have happened. Of course, if the disk failed, I want to have it replaced, as it’s about 4 months old. If it’s the Omnia’s fault, then this should be explored and fixed. However, I can’t currently wrap my head around this. It doesn’t make any sense to me. If the disk had failed, I’d expect some bad sectors, error messages in dmesg, or partial data inaccessibility. I don’t expect the described behaviour to be caused by a HW failure. Similarly, I don’t see how an update would cause this, or how any unintended SW action would result in this. If something, maybe a bug in Syncthing, had deleted the data in the normal manner, the journal would describe that, wouldn’t it? Even then, my daemons run under unprivileged users, not root. The data belonged to multiple users, including root (as in the case of the lost+found folder). Does anybody have any idea? Anything else I should look into?

Fortunately, the disk is mostly a backup, not really storage, so no real damage except for the stress and time loss.

jklaas · May 18, 2018, 1:25pm

This is pure speculation, but did this perhaps get caught in the new Storage option?

“Once you choose a drive, it will be formatted to Btrfs filesystem and on next reboot your /srv (directory where all IO intesive applications should reisde) will get moved to this new drive.”

I personally think this is a fine option for a brand new install, but there really should be big red warning signs and klaxons for existing systems instead of boring plain unremarkable text.

Fenevadkan · May 18, 2018, 1:31pm

I also suggest a program called “testdisk”, which has pulled me out of trouble a couple of times in the past when partition information got corrupted. Unfortunately it does not seem to be available on OpenWrt. But your issue also looks like a partition issue.
My first thought was the same as jklaas’s: that the new Storage option may have deleted everything. I also agree that it is pretty dangerous and should be handled with much more caution!

anon50890781 · May 18, 2018, 2:24pm

My understanding is that the storage activation would require actual interaction through Foris. Reading the original post, though, it sounds like it happened after an automated upgrade/reboot without user interaction.

bremby · May 18, 2018, 2:43pm

Storage was also on my mind, but I haven’t looked at it while the disk was connected. And I also think it requires some user interaction, and that certainly wasn’t the case.

Furthermore, if it was formatted to btrfs, then it would get mounted as such, isn’t that so? Because it still gets mounted as ext4. What’s also strange is that while re-mounting and debugging it for the first time on Tuesday, sometimes I got a clean mount as ext4, and sometimes I got an error about being unable to mount a NTFS partition for some reason (requiring a filesystem check first, if I recall correctly).

anon50890781 · May 18, 2018, 2:50pm

That sounds a bit like an issue with partition headers not being read correctly. Which should not have been messed with by the upgrade.

Maybe there is some issue with the drive that became apparent only after the reboot (unmount/mount)

bremby · May 18, 2018, 2:55pm

That sounds a bit like an issue with partition headers not being read correctly. Which should not have been messed with by the upgrade.

Maybe there is some issue with the drive that became apparent only after the reboot (unmount/mount)

Yeah, but how can I test that? Is there a tool that analyzes the state of those headers, something like file? I can try to mount it as btrfs and ntfs to see if there is a clash between filesystems (if their structures overlap only partially perhaps).

anon50890781 · May 18, 2018, 3:10pm

As a basic step I was going to say sfdisk, but it seems that has some issues on the TO. The router might not be the best place to troubleshoot a potential drive issue. Perhaps better to plug the drive into another machine with suitable tools. WD probably has some diagnostic tools for checking the hardware.

bremby · May 18, 2018, 3:14pm

As a basic step I was going to say sfdisk, but it seems that has some issues on the TO. The router might not be the best place to troubleshoot a potential drive issue. Perhaps better to plug the drive into another machine with suitable tools. WD probably has some diagnostic tools for checking the hardware.

That’s how it already is: it’s plugged to an Arch Linux laptop and data are being recovered to another external disk. With Arch Linux I have a pretty broad range of tools, so feel free to suggest anything.

anon50890781 · May 18, 2018, 3:23pm

I am not familiar with what Arch Linux has on offer. Maybe gdisk, if you have a GPT partition table.
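
Before reaching for gdisk, it can help to know whether the disk carries a GPT header at all. Below is a minimal read-only sketch in Python, not a replacement for gdisk or sfdisk. It only looks for the MBR boot signature (bytes 0x55 0xAA at offset 510) and the "EFI PART" signature at the start of LBA 1, trying both 512-byte and 4096-byte sectors since the drive's logical sector size is not stated in this thread. The function name and the in-memory demo image are made up for illustration.

```python
import io


def partition_table_hints(stream, sector_sizes=(512, 4096)):
    """Return a list of partition-table signatures found in a seekable binary stream."""
    hints = []

    # A classic or protective MBR ends its first 512 bytes with 0x55 0xAA.
    stream.seek(0)
    sector0 = stream.read(512)
    if sector0[510:512] == b"\x55\xaa":
        hints.append("mbr-boot-signature")

    # The GPT header sits at LBA 1, so its byte offset equals the logical sector size.
    for size in sector_sizes:
        stream.seek(size)
        if stream.read(8) == b"EFI PART":
            hints.append(f"gpt-header@{size}")
    return hints


# Demo on a fake in-memory "disk" (illustrative data, not taken from the real drive).
fake_disk = bytearray(8192)
fake_disk[510:512] = b"\x55\xaa"
fake_disk[512:520] = b"EFI PART"
print(partition_table_hints(io.BytesIO(bytes(fake_disk))))
```

On a real machine the stream would be the whole-disk device opened with `open(path, "rb")`, which needs read permission on the device and never writes to it.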

jklaas · May 19, 2018, 8:00pm

I think this may be an artefact of the mount command. If mount can’t automatically figure out what kind of filesystem it is, it complains about being unable to automatically detect NTFS filesystems or some such message.
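
To illustrate what that kind of automatic detection looks at, here is a rough Python sketch that checks a partition for the well-known magic bytes of ext2/3/4, NTFS and Btrfs. It is only an approximation of the idea, not how mount or blkid are actually implemented, and the function name and demo data are made up. If more than one signature turned up on the same partition, that might explain getting ext4 on one attempt and an NTFS complaint on another, but that is a guess.

```python
import io

# (name, byte offset from the start of the partition, expected magic bytes)
SIGNATURES = [
    ("ext2/3/4", 1024 + 56, b"\x53\xef"),    # superblock at 1024, s_magic 0xEF53 little-endian
    ("ntfs", 3, b"NTFS    "),                # OEM ID in the boot sector
    ("btrfs", 0x10000 + 0x40, b"_BHRfS_M"),  # primary superblock at 64 KiB
]


def probe_signatures(stream):
    """Return the names of all known filesystem signatures present in the stream."""
    found = []
    for name, offset, magic in SIGNATURES:
        stream.seek(offset)
        if stream.read(len(magic)) == magic:
            found.append(name)
    return found


# Demo on a fake in-memory partition carrying both an ext4 and a stale NTFS signature.
fake_partition = bytearray(0x10000 + 0x100)
fake_partition[1080:1082] = b"\x53\xef"
fake_partition[3:11] = b"NTFS    "
print(probe_signatures(io.BytesIO(bytes(fake_partition))))
```

Pointing it at the real partition would mean opening the device node read-only with `open(path, "rb")` and passing the file object in; nothing is written to the disk.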

bremby · May 30, 2018, 6:41pm

Since May 10, when my Omnia was last updated, the router deletes data on the external drive on reboot. This is a follow-up to my previous post: Catastrophic disk failure.

I do have some custom software installed on the router, but it is certainly not running as root, which would be required to remove everything on the disk. I have ruled out a hardware failure for the following reasons:

Data seem to be deleted instead of corrupted. Last time the ext4 journal confirmed changes on the fs sometime around restart of the device (I am guessing it is more likely to happen during boot-up instead of shutdown).
I have run several S.M.A.R.T. tests, all finished successfully.
The disk is only a few months old, used without serious load.
Any other operation with the disk seems fine; the files disappear only when the router is rebooted.

Obviously I am suspicious of the storage module, or whatever it is, that was added to Foris in that last update; it is not enabled in Foris though (at least I didn’t do anything). I am also suspicious of the mountd daemon, which keeps mounting the disk to its own folder, /tmp/run/mountd/sda1. In the Mount Points screen in LuCI I have configured it to be mounted elsewhere, which works, so it ends up mounted in both locations. However, I don’t know why that’s happening, as I’ve unchecked the “Anonymous Mount” and “Automount Filesystem” options.
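
For anyone who wants to check for the same double mount, here is a small Python sketch that groups the lines of a /proc/mounts-style listing by device and reports devices mounted more than once. The sample text is made up: /tmp/run/mountd/sda1 is the path from above, while /mnt/backup and the root line are only placeholders.

```python
from collections import defaultdict


def duplicate_mounts(mounts_text):
    """Map each /dev/ device that is mounted more than once to its mount points."""
    by_device = defaultdict(list)
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts lines look like: device mountpoint fstype options dump pass
        if len(fields) < 2 or not fields[0].startswith("/dev/"):
            continue
        by_device[fields[0]].append(fields[1])
    return {dev: points for dev, points in by_device.items() if len(points) > 1}


# Illustrative sample; /mnt/backup is a hypothetical manually configured mount point.
SAMPLE = """\
/dev/sda1 /tmp/run/mountd/sda1 ext4 rw,relatime 0 0
/dev/sda1 /mnt/backup ext4 rw,relatime 0 0
/dev/mmcblk0p1 / btrfs rw,noatime 0 0
"""
print(duplicate_mounts(SAMPLE))
```

On a live system the input would be the contents of /proc/mounts, for example `duplicate_mounts(open("/proc/mounts").read())`.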

Please, how can I debug this? How can I discover what’s deleting my files? How do I disable it?

Jirka · May 31, 2018, 4:44am

These duplicate mounts also started happening to me after one of the previous updates. For me, the solution was to disable the mountd daemon (I had already configured the mount point manually).

Or you can try this: