Uncorrectable errors on freshly set up btrfs RAID 1

I have a Turris NAS bundle with two WD Red Plus 4TB drives that were originally in an mdadm RAID with ext4, and I want to move them to btrfs RAID. After I recreated the RAID with btrfs, copied some data onto it and ran btrfs scrub, some uncorrectable errors appeared. I have since recreated the btrfs setup a few times, and every time I copy data onto the fresh setup, I immediately get uncorrectable errors.

I ran extended smartctl tests on the disks, but they did not find anything wrong.

Do you have any suggestions as to what could be wrong, please?

Try replacing the SATA cables and post some logs. Hard to tell without any logs.

Also try re-plugging all connectors, including re-seating the SATA card.

After that, I would start monitoring the syslog (/var/log/messages) for ata errors and btrfs messages, and also the output of btrfs device stats /<mount_point_of_the_btrfs_volume> for increasing error counters, both especially during file transfers. Depending on the kind of errors you get (if any), you can start searching for possible solutions or things to try.
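If watching the counters by hand gets tedious, the comparison can be scripted. The following is only a minimal sketch: it assumes that btrfs device stats prints one counter per line in the form `[/dev/sdX].counter_name  value`, and the helper names (`parse_device_stats`, `increased`, `read_stats`) are made up for this example.

```python
import re
import subprocess

# Assumed line format: "[/dev/sda].write_io_errs    0"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Return {device: {counter: value}} parsed from `btrfs device stats` output."""
    stats = {}
    for line in text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m:
            stats.setdefault(m["dev"], {})[m["counter"]] = int(m["value"])
    return stats


def increased(before, after):
    """List (device, counter, old, new) for every counter that grew between snapshots."""
    changes = []
    for dev, counters in after.items():
        for name, new in counters.items():
            old = before.get(dev, {}).get(name, 0)
            if new > old:
                changes.append((dev, name, old, new))
    return changes


def read_stats(mount_point):
    """Run `btrfs device stats <mount_point>` and parse its output (usually needs root)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)
```

Take one snapshot with `read_stats` before a large file transfer and another one afterwards, then pass both to `increased`. Any entry it returns names the drive and the counter that went up, which tells you which disk, cable or port to look at first.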

I faced a similar issue with my setup (2x WD60EFRX HDD), although for me the errors were correctable by scrubbing (but I never tried scrubbing before power-cycling the entire router).
In my case, there were ata errors on one of the SATA ports during file transfers, and btrfs device stats showed increasing error counters on one of the drives. smartctl showed no problems, and temperatures also seemed fine. After swapping around the drives, cables and mPCIe slots, I was able to isolate the issue to my ASM1061-based SATA card, which seems to be notorious for having problems handling both drives at full speed simultaneously. Under such load it would drop one of the drives, which would then fail to re-initialise properly. The solution was to set the jumper on the drives so that they were limited to SATA 3Gb/s speeds.
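To check whether the jumper actually took effect, you can look at the link speed the kernel negotiated. This is a rough sketch that assumes your kernel exposes the libata `sata_spd` attribute under /sys/class/ata_link (this may not be the case on every build); the function name `link_speeds` is made up for this example.

```python
from pathlib import Path


def link_speeds(sysfs_root="/sys/class/ata_link"):
    """Return {link_name: speed_string} read from each link's sata_spd file.

    Assumes the libata sysfs layout; returns an empty dict if it is absent.
    """
    speeds = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return speeds
    for link in sorted(root.iterdir()):
        spd_file = link / "sata_spd"
        if spd_file.is_file():
            speeds[link.name] = spd_file.read_text().strip()
    return speeds
```

Calling `link_speeds()` with no arguments reads the real sysfs tree. With the jumper set, the links that have the drives attached should no longer report the 6 Gbps speed.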

Are these SMR drives? I don’t know whether BTRFS behaves like ZFS, where some operations time out because of the real speed of SMR rewriting. Not even the enormous (256MB) cache can keep up.

Nope, those are CMR drives.