SATA HDD issues

stefanpr · October 29, 2016, 8:27am

Hello,

I’m trying to set up a RAID 1 array using mdadm. I have connected two WD Red 3 TB drives to the SATA controller (supplied as part of the NAS perk), but running mdadm --create always fails after a little while: the array is created successfully, but during syncing sdb always gets marked as failed.
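For reference, creating a two-disk mirror of this kind might look like the sketch below. The device names /dev/md0, /dev/sda1 and /dev/sdb1 are the ones that appear in the logs in this thread; `create_mirror` is a hypothetical wrapper name, and the command overwrites whatever is on both partitions.

```shell
#!/bin/sh
# create_mirror is a hypothetical helper, not part of mdadm.
# Usage (as root): create_mirror /dev/md0 /dev/sda1 /dev/sdb1
create_mirror() {
    md="$1"; shift
    # --level=1 selects RAID 1; --raid-devices is set to the number of members given.
    mdadm --create "$md" --level=1 --raid-devices="$#" "$@"
}

# The initial sync can then be followed with: cat /proc/mdstat
```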

I have tried different drives and different SATA cables but that does not help, so I’m thinking it’s either a dodgy SATA controller or there’s another hardware / driver issue.

Any suggestions?? Thank you very much in advance!!

Here’s some dmesg output:

[49355.725523] ata2: hard resetting link
[49365.730597] ata2: softreset failed (1st FIS failed)
[49365.735494] ata2: hard resetting link
[49375.740606] ata2: softreset failed (1st FIS failed)
[49375.745494] ata2: hard resetting link
[49394.960619] ata1.00: exception Emask 0x0 SAct 0xc0 SErr 0x0 action 0x6 frozen
[49394.967786] ata1.00: cmd 60/08:30:08:08:00/00:00:00:00:00/40 tag 6 ncq 4096 in
[49394.967786] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[49394.982450] ata1.00: cmd 60/20:38:00:08:00/00:00:00:00:00/40 tag 7 ncq 16384 in
[49394.982450] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[49394.997201] ata1: hard resetting link
[49405.000600] ata1: softreset failed (1st FIS failed)
[49405.005498] ata1: hard resetting link
[49410.750598] ata2: softreset failed (1st FIS failed)
[49410.755495] ata2: limiting SATA link speed to 3.0 Gbps
[49410.755500] ata2: hard resetting link
[49415.010597] ata1: softreset failed (1st FIS failed)
[49415.015492] ata1: hard resetting link
[49415.760603] ata2: softreset failed (1st FIS failed)
[49415.765494] ata2: reset failed, giving up
[49415.769512] ata2.00: disabled
[49415.769550] ata2: EH complete
[49415.769601] sd 1:0:0:0: [sdb] tag#28 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[49415.769610] sd 1:0:0:0: [sdb] tag#28 CDB: opcode=0x8a 8a 00 00 00 00 00 00 d0 4d 80 00 00 05 00 00 00
[49415.769615] blk_update_request: I/O error, dev sdb, sector 13651328
[49415.775945] sd 1:0:0:0: [sdb] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[49415.775952] sd 1:0:0:0: [sdb] tag#29 CDB: opcode=0x8a 8a 00 00 00 00 00 00 d0 4a 80 00 00 03 00 00 00
[49415.775956] blk_update_request: I/O error, dev sdb, sector 13650560
[49415.776505] md/raid1:md0: Disk failure on sdb1, disabling device.
[49415.776505] md/raid1:md0: Operation continuing on 1 devices.
[49415.776573] md: md0: recovery interrupted.
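Note that the log above shows failed soft resets on both ata1 and ata2, not only on the port behind sdb. A minimal sketch for counting those events per port is given below; `count_softreset_failures` is a hypothetical helper name, and it only looks for the exact "softreset failed" wording seen in this log.

```shell
#!/bin/sh
# Count "softreset failed" events per ATA port in a kernel log read from stdin.
# count_softreset_failures is a hypothetical helper, not an existing tool.
count_softreset_failures() {
    grep -o 'ata[0-9]*: softreset failed' |
        sort | uniq -c |
        awk '{ sub(":", "", $2); print $2, $1 }'   # prints "<port> <count>"
}

# Usage on a live system: dmesg | count_softreset_failures
```

If both ports show a similar count, a single bad drive becomes a less likely explanation than the controller, power, or driver.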

white · October 29, 2016, 9:18am

If smartmontools is installed, you could check whether the SMART data from the disks or the SMART self-tests report anything. You could also check that the disks are not getting too hot if the NAS box has no fan.
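A minimal sketch of such a check, assuming smartmontools is installed and the disk is /dev/sdb; `smart_report` is a hypothetical wrapper name, while the smartctl options themselves are standard.

```shell
#!/bin/sh
# smart_report is a hypothetical wrapper; it only calls standard smartctl options.
smart_report() {
    dev="$1"
    smartctl -H "$dev"            # overall health self-assessment
    smartctl -A "$dev"            # attribute table (includes temperature on most drives)
    smartctl -l selftest "$dev"   # results of earlier self-tests
}

# Usage (as root): smart_report /dev/sdb
# A long self-test is started separately with: smartctl -t long /dev/sdb
```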

stefanpr · October 29, 2016, 9:24am

Thanks for the suggestions - I do have a fan installed, so everything stays nice and cool. I also ran a long smart test on sdb and that completed without errors. I am pretty sure it’s not the actual drive as I swapped them around and it’s always the one connected to channel 2 (sdb) that fails. I have confirmed this using the actual drive serial numbers. I have also tried different SATA cables but that does not solve the problem either…
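One way to double-check which disk sits on which channel is to look at the resolved sysfs path of each block device, which on libata systems normally contains the ataN port name. The sketch below assumes that layout; `ata_port_from_path` is a hypothetical helper name.

```shell
#!/bin/sh
# Extract the first ataN component from a resolved sysfs device path.
# ata_port_from_path is a hypothetical helper; it assumes the path contains the port name.
ata_port_from_path() {
    printf '%s\n' "$1" | grep -o 'ata[0-9][0-9]*' | head -n 1
}

# Usage on a live system:
#   for d in /sys/block/sd?; do
#       echo "${d##*/} $(ata_port_from_path "$(readlink -f "$d")")"
#   done
# The serial number of each disk can be read with: smartctl -i /dev/sdX
```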

Stiglar · October 30, 2016, 11:09am

I’m doing the same right now, and I was surprised by how long the sync takes.

root@turris:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid1 sdb1[1] sda1[0]
      2930134272 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  3.8% (113765312/2930134272) finish=323.7min speed=145003K/sec

unused devices: <none>
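The estimate in the resync line follows directly from its other numbers: the remaining blocks (in KiB) divided by the current speed. A small sketch that redoes the arithmetic; `eta_minutes` is a hypothetical helper name and the three arguments are copied from the output above.

```shell
#!/bin/sh
# Recompute the resync ETA: (total - synced) KiB / speed (K/sec) / 60 = minutes.
# eta_minutes is a hypothetical helper, not part of mdadm.
eta_minutes() {
    awk -v synced="$1" -v total="$2" -v speed="$3" \
        'BEGIN { printf "%.1f\n", (total - synced) / speed / 60 }'
}

eta_minutes 113765312 2930134272 145003   # prints 323.7
```

So a full pass over a 3 TB mirror at roughly 145 MB/s takes well over five hours, which is normal for drives of this size.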

Stiglar · October 30, 2016, 11:51am

Hi again, so now I have (F) as well:

root@turris:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid1 sdb1[1](F) sda1[0]
      2930134272 blocks super 1.2 [2/1] [U_]

unused devices: <none>

root@turris:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sun Oct 30 13:02:41 2016
     Raid Level : raid1
     Array Size : 2930134272 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930134272 (2794.39 GiB 3000.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sun Oct 30 13:11:34 2016
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed

       1       8       17        -      faulty spare   /dev/sdb1
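Once a member is listed as faulty, md does not use it again until it is removed and added back. A sketch of that step is below; `readd_member` is a hypothetical helper name, and unless the underlying link problem is dealt with first, the recovery can be expected to fail again the same way.

```shell
#!/bin/sh
# readd_member is a hypothetical helper around two standard mdadm manage-mode calls.
readd_member() {
    md="$1"; part="$2"
    # Drop the faulty member, then add it back so md starts a recovery onto it.
    mdadm "$md" --remove "$part" &&
    mdadm "$md" --add "$part"
}

# Usage (as root): readd_member /dev/md0 /dev/sdb1
```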

I also found these lines in the system log:

2016-10-30T14:20:22+01:00 err kernel[]: [ 473.810716] ata1: softreset failed (1st FIS failed)
2016-10-30T14:20:22+01:00 info kernel[]: [ 473.815617] ata1: hard resetting link
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.020728] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.020760] ata1: EH complete
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.020805] sd 0:0:0:0: [sda] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.020814] sd 0:0:0:0: [sda] tag#11 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d2 00 00 00 02 00 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.020819] blk_update_request: I/O error, dev sda, sector 16830976
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.027129] sd 0:0:0:0: [sda] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.027136] sd 0:0:0:0: [sda] tag#12 CDB: opcode=0x88 88 00 00 00 00 00 01 00 ce 00 00 00 04 00 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.027140] blk_update_request: I/O error, dev sda, sector 16829952
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.033448] blk_update_request: I/O error, dev sda, sector 2056
2016-10-30T14:20:27+01:00 warning kernel[]: [ 479.039381] md: super_written gets error=-5
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.039679] sd 0:0:0:0: [sda] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.039689] sd 0:0:0:0: [sda] tag#14 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d2 00 00 00 00 08 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.039695] blk_update_request: I/O error, dev sda, sector 16830976
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.046045] sd 1:0:0:0: [sdb] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.046054] sd 1:0:0:0: [sdb] tag#2 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d2 00 00 00 00 08 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.046059] blk_update_request: I/O error, dev sdb, sector 16830976
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.052371] md/raid1:md0: sda: unrecoverable I/O read error for block 16566784
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.059678] sd 0:0:0:0: [sda] tag#15 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.059686] sd 0:0:0:0: [sda] tag#15 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d2 80 00 00 00 08 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.059690] blk_update_request: I/O error, dev sda, sector 16831104
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.066022] sd 1:0:0:0: [sdb] tag#3 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.066030] sd 1:0:0:0: [sdb] tag#3 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d2 80 00 00 00 08 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.066034] blk_update_request: I/O error, dev sdb, sector 16831104
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.072335] md/raid1:md0: sda: unrecoverable I/O read error for block 16566912
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.079623] sd 0:0:0:0: [sda] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.079630] sd 0:0:0:0: [sda] tag#16 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d3 00 00 00 00 08 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.079634] blk_update_request: I/O error, dev sda, sector 16831232
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.085961] sd 1:0:0:0: [sdb] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.085969] sd 1:0:0:0: [sdb] tag#4 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d3 00 00 00 00 08 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.085973] blk_update_request: I/O error, dev sdb, sector 16831232
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.092330] md/raid1:md0: sda: unrecoverable I/O read error for block 16567040
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.099632] sd 0:0:0:0: [sda] tag#17 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.099640] sd 0:0:0:0: [sda] tag#17 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d3 80 00 00 00 08 00 00
2016-10-30T14:20:27+01:00 err kernel[]: [ 479.099645] blk_update_request: I/O error, dev sda, sector 16831360
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.106136] sd 1:0:0:0: [sdb] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.106144] sd 1:0:0:0: [sdb] tag#5 CDB: opcode=0x88 88 00 00 00 00 00 01 00 d3 80 00 00 00 08 00 00
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.106157] md/raid1:md0: sda: unrecoverable I/O read error for block 16567168
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.113529] md/raid1:md0: sda: unrecoverable I/O read error for block 16565760
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.120847] md/raid1:md0: sda: unrecoverable I/O read error for block 16565888
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.128181] md/raid1:md0: sda: unrecoverable I/O read error for block 16566016
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.135499] md/raid1:md0: sda: unrecoverable I/O read error for block 16566144
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.142826] md/raid1:md0: sda: unrecoverable I/O read error for block 16566272
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.150128] md/raid1:md0: sda: unrecoverable I/O read error for block 16566400
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.157442] md/raid1:md0: sda: unrecoverable I/O read error for block 16566528
2016-10-30T14:20:27+01:00 alert kernel[]: [ 479.164760] md/raid1:md0: sda: unrecoverable I/O read error for block 16566656
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.172301] md: checkpointing resync of md0.
2016-10-30T14:20:27+01:00 warning kernel[]: [ 479.172338] md: super_written gets error=-5
2016-10-30T14:20:27+01:00 warning kernel[]: [ 479.172432] md: super_written gets error=-5
2016-10-30T14:20:27+01:00 debug kernel[]: [ 479.172459] RAID1 conf printout:
2016-10-30T14:20:27+01:00 debug kernel[]: [ 479.172463] --- wd:1 rd:2
2016-10-30T14:20:27+01:00 debug kernel[]: [ 479.172467] disk 0, wo:0, o:1, dev:sda1
2016-10-30T14:20:27+01:00 debug kernel[]: [ 479.172470] disk 1, wo:1, o:0, dev:sdb1
2016-10-30T14:20:27+01:00 debug kernel[]: [ 479.210726] RAID1 conf printout:
2016-10-30T14:20:27+01:00 debug kernel[]: [ 479.210731] --- wd:1 rd:2
2016-10-30T14:20:27+01:00 debug kernel[]: [ 479.210736] disk 0, wo:0, o:1, dev:sda1
2016-10-30T14:20:27+01:00 warning kernel[]: [ 479.210784] md: super_written gets error=-5
2016-10-30T14:20:27+01:00 warning kernel[]: [ 479.210866] md: super_written gets error=-5
2016-10-30T14:20:27+01:00 warning kernel[]: [ 479.210912] md: super_written gets error=-5
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.210967] md: resync of RAID array md0
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.210971] md: minimum guaranteed speed: 1000 KB/sec/disk.
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.210974] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.210979] md: using 128k window, over a total of 2930134272k.
2016-10-30T14:20:27+01:00 info kernel[]: [ 479.211180] md: md0: resync done.
2016-10-30T14:20:27+01:00 warning kernel[]: [ 479.211419] md: super_written gets error=-5
2016-10-30T14:20:27+01:00 warning kernel[]: [ 479.211473] md: super_written gets error=-5

Big_boss · October 31, 2016, 12:23am

I don’t really have anything to add to this, only a tip.

Guys, keep hard disk vibration in mind. Try to put something soft or rubbery under the NAS case. I am not sure whether this problem also occurs with two hard disks, but with multiple HDDs you get vibration, which can damage them.

Stiglar · October 31, 2016, 1:38am

The drives are fixed really firmly in the NAS box.
That won’t be the problem.

Googling, I found that similar problems occurred for people on PCs with an emulated IDE interface.
Or maybe we should use a different partition type (I was trying Linux RAID).

By the way, the disks work when I create two folders (one for each drive) and fill both with data at the same time. There were no errors; I put almost 300 GB there and it was fine.

stefanpr · October 31, 2016, 8:28am

I created a btrfs raid-1 filesystem and copied some data to and from the device, and that seems to be working fine. I’m going to see what happens if I try to use dmcrypt…

Paul_Totterman · October 31, 2016, 11:21am

I’ve been wondering how to layer dmcrypt with raid. Should it be:

HDDs
dmcrypt
mdraid
xfs

or

HDDs
mdraid
dmcrypt
xfs

or

HDDs
dmcrypt
btrfs

or what?

Why?

Stiglar · October 31, 2016, 10:32pm

I have a little problem with that: it looks like mkfs.btrfs doesn’t want to do the job for me…

Any help here?

root@turris:~# mkfs.btrfs -m raid1 -d raid1 -f /dev/sda1 /dev/sdb1
btrfs-progs v4.5.1
See http://btrfs.wiki.kernel.org for more information.

Warning, could not drop caches
Warning, could not drop caches
Label:              (null)
UUID:               c676dab7-744c-4901-94a5-a45c4b9b7738
Node size:          16384
Sector size:        4096
Filesystem size:    5.46TiB
Block group profiles:
  Data:             RAID1             1.01GiB
  Metadata:         RAID1             1.01GiB
  System:           RAID1            12.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata
Number of devices:  2
Devices:
   ID        SIZE  PATH
    1     2.73TiB  /dev/sda1
    2     2.73TiB  /dev/sdb1

Warning, could not drop caches
Warning, could not drop caches

Paul_Totterman · November 1, 2016, 7:30am

SATA or USB? Can you disable write cache if you can’t drop it?
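For SATA disks, the drive's own write cache can be queried and switched off with hdparm, as sketched below. Whether this has anything to do with the "could not drop caches" warnings is an open question here, not something the output above confirms; `write_cache_off` is a hypothetical wrapper name.

```shell
#!/bin/sh
# write_cache_off is a hypothetical wrapper around hdparm's -W option.
write_cache_off() {
    for dev in "$@"; do
        hdparm -W 0 "$dev"   # 0 = disable the drive's volatile write cache
    done
}

# Usage (as root): write_cache_off /dev/sda /dev/sdb
# Query the current setting without changing it: hdparm -W /dev/sda
```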

Maxmilian_Picmaus · November 1, 2016, 5:23pm

My small issues with twin Seagate IronWolf 3TB drives …

I faced a similar issue, with I/O errors on one drive. I bought two identical drives and connected them as usual. The first partitioning pass went okay. Creating filesystems failed on sdb (during the final ioctl). So I checked with fdisk/cfdisk/sfdisk/partx to see what was wrong. Nothing suspicious. I connected the drives to Windows and checked the disks there. The first one showed up correctly as GPT with the ext3 I wanted. The second one showed up as an MBR disk preformatted with NTFS.
I reformatted both on Windows and plugged them into the Turris. First drive okay, second drive got an I/O error.
I cleared everything again and tried the disks one at a time on the first channel. No luck. The second drive always had issues, no matter which tool I used: several “dd” runs (zero fill, zeros then ones, urandom fill, 1:1 cloning from sda to sdb). Hm, I even reassembled the router (re-plugged everything, including the PCI SATA card). Later that disk started complaining about sector 0, sector 2048, 4096 … something about an offset and/or a zero being found.
So what the hell, I ran the hdparm tool based on the info I had from the first drive. That actually went okay, after about 12 hours of waiting. After that the disk was fine and I got it running with 3x 1T ext3 partitions. While not in use, it disconnected several times. So I had to reconnect it to the Windows box for a check: SeaTools/DiscWizard/GParted and so on. In the end the disk refused any GPT, so only MBR was possible. So I ended up with a 2T partition on a 3T disk. I put it back and ran cfdisk (changed everything to GPT + 2x 1.5T NTFS).
Finally a working set. Ready to mount and use.
After two days under load/testing, the sdb2 partition got sick. Another check and voilà: ‘bad disk’ and/or ‘read-only’.
Then I found that wonderful --yes-i-know-what-i-am-doing flag for hdparm and did a security enhanced erase/factory reset. After a night and a few hours the disk looked fine. To avoid repeating everything, I quickly made 3x 1T NTFS. For about 10 hours it was okay, but then it disconnected, and since then I can’t do any write operation after sector 4096 or at the end of the disk (anything in between was okay).
Bricked drive, doh …meh …mffffp…
(I can erase sectors 0-4095 and I can ‘repair partition’ (but I can’t write that repair back); the LLF tool fails as well. ‘dd’/‘shred’/‘scrub’ always get an I/O error …). I can also play within the DCO area a bit (read, clear security/set security …), but issuing the security-erase/security-enhanced-erase command again fails with an I/O error …

After a few days of playing, the drive was sent back for replacement. Obviously a defective drive.
That taught me a lesson and forced me to read a bunch of man pages for those ‘disk’ tools.
Be a hacker; sfdisk is really useful.

Btw: on the other hand, the Windows built-in ‘diskpart.exe’, run as admin in cmd.exe, is also a very helpful tool (why the hell was I using third-party tools before … PartitionMagic and such …)

Cheers mates.
-vh-

Stiglar · November 1, 2016, 8:21pm

SATA drives.
I finally figured out that it works even with these errors.
So now I have RAID 1 running, created with mkfs.btrfs -m raid1 -d raid1 -f /dev/sda1 /dev/sdb1
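To confirm that the filesystem really is mirrored and that both disks are being written without errors, something along these lines could be run after mounting it. This is a sketch; `btrfs_mirror_check` is a hypothetical helper name and the mount point /mnt/nas is an assumption.

```shell
#!/bin/sh
# btrfs_mirror_check is a hypothetical helper; it only calls standard btrfs subcommands.
btrfs_mirror_check() {
    mnt="$1"
    btrfs filesystem df "$mnt"    # Data and Metadata lines should report RAID1
    btrfs device stats "$mnt"     # per-device read/write/corruption error counters
    btrfs scrub start -B "$mnt"   # -B: stay in the foreground and print a summary
}

# Usage (as root, with the filesystem mounted): btrfs_mirror_check /mnt/nas
```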

mina86 · November 2, 2016, 5:01pm

If you put encryption between hard drives and mdraid you’ll end up encrypting the same data twice (and unless the same key is used in both cases, this may have security implications). To save CPU time, put encryption between RAID configuration and file system.

Whether to use btrfs’ RAID support is another matter. I will go with ext4 on top of mdraid.
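The layering described here (disks, then mdraid, then encryption, then the file system) could be set up roughly as follows. This is a sketch under assumptions: the array is /dev/md0, "cryptraid" is an arbitrary mapping name, and `encrypt_on_raid` is a hypothetical helper; luksFormat destroys any data already on the array.

```shell
#!/bin/sh
# encrypt_on_raid is a hypothetical helper: LUKS on top of an md array, ext4 on top of LUKS.
encrypt_on_raid() {
    md="$1"; name="$2"
    # 1. Write a LUKS header to the array (prompts for a passphrase).
    # 2. Open it, which creates /dev/mapper/$name.
    # 3. Create the file system on the decrypted mapping, not on the array itself.
    cryptsetup luksFormat "$md" &&
    cryptsetup luksOpen "$md" "$name" &&
    mkfs.ext4 "/dev/mapper/$name"
}

# Usage (as root): encrypt_on_raid /dev/md0 cryptraid
```

With this order each block is encrypted once and then mirrored by md, instead of being encrypted separately for each disk.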

Stiglar · November 2, 2016, 5:14pm

Let us know if you successfully create the RAID array…

kolaCZek · November 3, 2016, 4:58am

I have the same issue with raid1 (mdadm) and two WD Red 3TB HDDs. :-/

white · November 3, 2016, 4:45pm

Does it help if you give “libata.force=1.5” or “libata.force=noncq” as a kernel parameter at boot time?
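A small sketch for checking whether such a parameter is actually in effect is shown below. The kernel documentation spells the speed value as 1.5Gbps, and several values can be combined with commas; `libata_force_setting` is a hypothetical helper name. How the boot arguments are changed on a particular device is not covered here.

```shell
#!/bin/sh
# Print the value of libata.force=... if present in a kernel command line string.
# libata_force_setting is a hypothetical helper, not an existing tool.
libata_force_setting() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^libata\.force=//p'
}

# Usage on a live system: libata_force_setting "$(cat /proc/cmdline)"
# Runtime alternative to noncq for a single disk (as root), without rebooting:
#   echo 1 > /sys/block/sdb/device/queue_depth
```

In the logs earlier in the thread the kernel had already dropped the links to 3.0 Gbps and 1.5 Gbps on its own during error handling, so forcing the speed may not change much; disabling NCQ is the less explored of the two options.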

schovi · November 4, 2016, 4:44pm

Hi,

Both the Omnia and the NAS have finally arrived. I had exactly the same idea of buying two WD Red 3TB drives, but I have never had to configure hardware on Unix. Do you have a good manual on how to set it all up and, ideally, solve this I/O error?

Thank you!

Maxmilian_Picmaus · November 4, 2016, 6:09pm

From my experience with my 2x 3T drives, I recommend using GParted separately, or any similar live distro (with the parted/gparted tool), to prepare the partitions in advance.

Doing that in the shell is no big deal, just a few basic commands. The OpenWrt wiki has a very nice guide: https://wiki.openwrt.org/doc/howto/storage or https://www.turris.cz/doc/cs/howto/nas

While trying to find out what went wrong, I was looking for the ‘parted’ tool (and for the reason it is not installed by default), and I found an article (I really can’t find it again) stating that GPT is supported by TurrisOS (for use on drives), but that due to some limitations of the corresponding OpenWrt fork it is not recommended to create 2T+ partitions (and format them) directly over SSH.
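On a live distro that ships parted, preparing one GPT partition spanning the whole disk could look like the sketch below. `make_raid_partition` is a hypothetical helper name, /dev/sdb is an assumed device, and the first command wipes the existing partition table.

```shell
#!/bin/sh
# make_raid_partition is a hypothetical helper; it replaces the partition table of the given disk.
make_raid_partition() {
    dev="$1"
    # New GPT label, one aligned partition over the whole disk, flagged for Linux RAID.
    parted -s "$dev" mklabel gpt &&
    parted -s -a optimal "$dev" mkpart primary 1MiB 100% &&
    parted -s "$dev" set 1 raid on
}

# Usage (as root): make_raid_partition /dev/sdb
# Inspect the result with: parted -s /dev/sdb print
```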

Big_boss · November 5, 2016, 8:20am

I would advise you NOT to buy the WD Red 3 TB (WD Red WD30EFRX, 3TB); I have read many reviews saying that those are unreliable.