SATA HDD issues

Nice find. I hope it proves to work for others too. I am planning to go for a RAID setup in the near future; I only have a single drive at the moment.

Hopefully I will test it (libata force=1.5,noncq and/or echo 1 > /sys/block/sda/device/queue_depth) next week. I’ll keep you posted.

1 Like

Encouraged by the latest findings, I got a pair of 4 TB Reds, installed them in the NAS perk box, disabled NCQ, set up a native btrfs RAID1 array, and just finished backing up two laptops over Time Machine. No issues so far!
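
For anyone wanting to reproduce the setup, a minimal sketch (device names and mountpoint are just examples; this wipes both disks):

mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
mount /dev/sda /mnt/nas    # any member device of the array can be mounted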

1 Like

Happy news. :smiley_cat: What does the Turris staff have to say about this? I wonder what their findings have been so far. @brill

Googling NCQ and OpenWrt yields some interesting results.

Interesting results… like what?

@technik007cz another data point, but without disabling NCQ:
While I had not dared run this earlier, after upgrading to Turris Omnia 3.6.0 I was able to successfully run a ‘check’ sync_action on an ext4 volume on mdraid RAID1 over two 5 TB drives.
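
For reference, the check is kicked off through the md sysfs interface (this assumes the array is md0):

echo check > /sys/block/md0/md/sync_action
cat /proc/mdstat    # watch the progress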

[380161.047639] md: data-check of RAID array md0
[380161.047647] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[380161.047651] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[380161.047656] md: using 128k window, over a total of 4883638272k.
[410153.391120] md: md0: data-check done.

At the beginning, cat /proc/mdstat was reporting over 200000 KB/sec, but towards the end of the sync only 150000 KB/sec; the average from the dmesg report above is 159 MB/sec. During the ‘check’, one of the CPUs was pegged and the thermometer reported sustained temperatures of 108-109 °C. Anecdotally, it looked like throughput went down as temperature went up, so if there was any throttling due to the temperature, that might be what avoided exceeding other limits. (Just guessing.)

My previous tests writing to disk over the network maxed out the CPU running Samba, with throughput up to 111 MB/sec, close to saturating the 1 Gbps link, but they did not crash the array (the array had been created on the drives on an external machine). Since I don’t want to rebuild the array and transfer all the data again, maybe I’ll instead try some tests using dd if=/dev/zero of=/path/to/file/on/disk to see if I can reproduce the issue with the Turris Omnia (now) 3.6.1 release, or whether the 3.6.0 kernel update fixed things for me.
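
If I do, a minimal sketch of such a test (path and size are just examples; conv=fsync makes dd flush to disk before reporting the result):

dd if=/dev/zero of=/mnt/md0/testfile bs=1M count=10240 conv=fsync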

Do not share useless questions and information. You decided to go to the toilet in the middle of your comment? And then pressed the reply button when you got back? This is not Facebook, mate.

1 Like

Check my notes from my experiences with twin drives (no RAID at all) in SATA HDD issues… As for dd: before you run it, back up / dump the first and last blocks and the superblocks; in case of an I/O error that might be very handy…
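
A minimal sketch of such a backup (device names and sizes are examples; the tail copy assumes the disk size is a whole number of MiB):

# first MiB of the disk (partition table etc.)
dd if=/dev/sda of=/root/sda-head.img bs=1M count=1
# last MiB of the disk (GPT keeps its backup header at the very end)
dd if=/dev/sda of=/root/sda-tail.img bs=1M skip=$(( $(blockdev --getsz /dev/sda) / 2048 - 1 ))
# list the ext backup superblock locations
dumpe2fs /dev/sda1 | grep -i superblock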

Uptime: 2 days.
Copying between the internal SATA drives is ongoing and still stable.

1 Like

Some other findings:

The new kernel version changes nothing: if you try to sync SW RAID on vanilla Turris OS today, you get the same errors as months ago.

Correcting (this) post: you can create a file in /etc/modules.d to pass parameters to the libata module, but its name must start with “40-aaa”. It seems “41-aaa” (as in the OP) is too late; the module is already loaded and the parameters cannot be passed. So create a file in /etc/modules.d named 40-aaa-libata; then you must symlink it into /etc/modules-boot.d for it to work. The file can contain one of:

libata force=noncq
or
libata force=1.5
or
libata force=1.5,noncq
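
Put together, a sketch of the steps described above (using the combined variant):

cat > /etc/modules.d/40-aaa-libata << 'EOF'
libata force=1.5,noncq
EOF
ln -s /etc/modules.d/40-aaa-libata /etc/modules-boot.d/40-aaa-libata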

SW raid syncing works in ALL cases.

And some speed results:

libata force=1.5 scenario:

Checking setup:
root@turris:~# smartctl -a /dev/sda | grep ATA
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)

Starting sync speed:
root@turris:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid1 sdb[1] sda[0]
      3906887360 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.4% (15761472/3906887360) finish=490.5min speed=132188K/sec

Ending sync speed:
root@turris:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid1 sdb[1] sda[0]
      3906887360 blocks super 1.2 [2/2] [UU]
      [===================>.]  resync = 99.9% (3904464000/3906887360) finish=0.5min speed=71850K/sec


libata force=noncq scenario:

Starting sync speed:
root@turris:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid1 sdb[1] sda[0]
      3906887360 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.4% (15743488/3906887360) finish=414.4min speed=156477K/sec

I did not catch the ending sync speed.


libata force=1.5,noncq scenario:

root@turris:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath]
md0 : active raid1 sdb[1] sda[0]
      3906887360 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.4% (15754240/3906887360) finish=493.9min speed=131305K/sec

I did not catch the ending sync speed either, but it looks like the same case as libata force=1.5.

IMHO, the noncq parameter (at least at 1.5 Gb/s) has no significant effect on sync speed.

Regards
Q.

1 Like

NCQ was designed to speed up random reads/writes, so in theory it should not affect syncing a RAID pool when nothing else is reading from or writing to it. Measuring how much performance is really affected would need a more sophisticated benchmark, but I remember benchmarks from when NCQ first came to market, and it was not a big deal.
It probably makes a difference for SSDs, but who wants to sacrifice stability, getting better performance for a couple of minutes and then a crash, lost data, lost time, etc.?
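
Whether NCQ is actually in effect can be checked quickly; a queue depth of 1 means NCQ is effectively off, while 31 is the usual maximum:

cat /sys/block/sda/device/queue_depth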

@Maxmilian_Picmaus @Pepe @CIJOML @Kes @quick

I have been running rsync tests between the internal drives with checksum control on (the -c parameter) for 5 days.
If you have no experience with this rsync feature, imagine you have two folders with 95% the same data. Normally rsync quickly reads the structure of both folders and transfers from source to target whatever is missing.
But with checksums on, it computes checksums of the files’ content and transfers from source to target based on checksum differences. It is more time-consuming, but it will not skip files with the same attributes and size (as in the first scenario) when the content differs (e.g. files corrupted during transfer).
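
Such a test run looks roughly like this (paths are just examples; -a preserves attributes, -c forces the full-content checksum comparison):

rsync -ac --progress /mnt/sda1/data/ /mnt/sdb1/data/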
The router is still stable: I have not experienced any errors causing a SATA link reset like in the past, and rsync has not reported anything wrong yet.
So thank you, guys.

Below is the magic that helped me. I put it in /etc/rc.local:
echo 1 > /sys/block/sda/device/queue_depth
echo 1 > /sys/block/sdb/device/queue_depth

1 Like

I think we should ping @brill that we found a workaround.

Thanks @Pepe, this sounds really interesting. It seems to me that it in fact broadens the spectrum of possible causes, but maybe I’ll be able to find something in the SATA driver or libata. I’ll try…

2 Likes

I used the libata force=1.5 scenario because I didn’t want to turn off NCQ, and it synced my drives like a charm (mdadm RAID1, 6 TB WD Red drives). Speed was pretty good, about 130 MB/s.

1 Like

@Guus_Houtzager
I received a 4-port miniPCIe card about a week ago and tested it. It is a card with a SAS multiport connector. This card had nearly the same issue, namely a reset of the SATA links during startup when the system tried to mount/connect to the drives. But the difference was that the system did not freeze and I could use the drives.
I tried the card with NCQ turned off, but it did not make any difference. Then I tried the solution below, and it has been working for 4 days without problems. (It successfully passed a parallel scrub test on four drives a couple of minutes ago.)

If somebody is interested, I bought it on AliExpress, model number LRST8615-4IT.
The card does not fit perfectly because the router’s coin battery is in the way, but instead of desoldering the battery and reattaching it on wires, I decided to mount the card on taller hexagonal plastic spacers and secure it with plastic screws.

Does a cable like that work with the multiport card?

@maurer, it will work with the multiport card I bought, but I am not sure whether that one is meant for SATA drives.
The cable I bought and am using is linked below:

https://www.aliexpress.com/item/50cm-internal-Mini-SAS-4i-36Pin-SFF-8087-Host-to-4-SATA-7pin-hard-disk-target/1613006680.html?spm=2114.13010608.0.0.v7kojs

So I’ve finally got time to test it again.
2x 3 TB drives; sdc = 1T + 1T + 800G partitions, sdb = 2.8T, all ext3, no RAID, both GPT.

I used both fixes mentioned here:

  1. echo 1 > /sys/block/sdX/device/queue_depth
  2. both drives have jumpers to force 1.5G speed
  3. /etc/modules.d/40-aaa-libata created, forcing 1.5G speed and noncq

I mounted /dev/sdb, listed the data, checked the status via smartctl… and tried to copy one movie; the drive went to ‘I/O failure’ during that operation. And again, when I connect the same drive via the USB port, there are no issues. It is always the second channel; the first one is just fine.

I will re-check and reboot it, just to be doubly sure.

2017-04-16T10:30:48+02:00 err kernel[]: [   41.930649] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen
2017-04-16T10:30:48+02:00 err kernel[]: [   41.938241] ata2.00: irq_stat 0x08000000, interface fatal error
2017-04-16T10:30:48+02:00 err kernel[]: [   41.944194] ata2.00: cmd ca/00:a8:b0:08:00/00:00:00:00:00/e0 tag 2 dma 86016 out
2017-04-16T10:30:48+02:00 err kernel[]: [   41.944194]          res 50/00:00:af:08:00/00:00:00:00:00/e0 Emask 0x10 (ATA bus error)
2017-04-16T10:30:48+02:00 info kernel[]: [   41.959628] ata2: hard resetting link
... and later
2017-04-16T10:30:48+02:00 err kernel[]: [   51.950634] ata2: softreset failed (1st FIS failed)
2017-04-16T10:30:48+02:00 info kernel[]: [   51.955527] ata2: hard resetting link
2017-04-16T10:30:48+02:00 err kernel[]: [   61.950635] ata2: softreset failed (1st FIS failed)
2017-04-16T10:30:48+02:00 info kernel[]: [   61.955530] ata2: hard resetting link
2017-04-16T10:30:48+02:00 err kernel[]: [   96.950635] ata2: softreset failed (1st FIS failed)
2017-04-16T10:30:48+02:00 info kernel[]: [   96.955565] ata2: hard resetting link
2017-04-16T10:30:48+02:00 err kernel[]: [  101.960629] ata2: softreset failed (1st FIS failed)
2017-04-16T10:30:48+02:00 err kernel[]: [  101.965566] ata2: reset failed, giving up
2017-04-16T10:30:48+02:00 warning kernel[]: [  101.969585] ata2.00: disabled
2017-04-16T10:30:48+02:00 info kernel[]: [  101.969605] ata2: EH complete
2017-04-16T10:30:48+02:00 info kernel[]: [  101.969683] sd 1:0:0:0: [sdb] tag#3 UNKNOWN(0x2003)Result: hostbyte=0x04 driverbyte=0x00
2017-04-16T10:30:48+02:00 info kernel[]: [  101.969691] sd 1:0:0:0: [sdb] tag#3 CDB: opcode=0x8a 8a 00 00 00 00 00 00 00 08 b0 00 00 00 a8 00 00
2017-04-16T10:30:48+02:00 err kernel[]: [  101.969696] blk_update_request: I/O error, dev sdb, sector 2224
2017-04-16T10:30:48+02:00 err kernel[]: [  101.975649] Buffer I/O error on dev sdb1, logical block 176, lost async page write
2017-04-16T10:30:48+02:00 err kernel[]: [  101.983243] Buffer I/O error on dev sdb1, logical block 177, lost async page write
2017-04-16T10:30:48+02:00 err kernel[]: [  101.990843] Buffer I/O error on dev sdb1, logical block 178, lost async page write
2017-04-16T10:30:48+02:00 err kernel[]: [  101.998437] Buffer I/O error on dev sdb1, logical block 179, lost async page write