SATA HDD issues

I’ve used WD Green, Hitachi and Seagate drives, all with problems. I also use dm-crypt, so maybe the combined load of CPU (encryption) plus 2x HDD is enough to push the limits.

Power is not the problem.
I left the 2 HDDs in a desktop case, powered by the desktop power supply, with the SATA cables connected to the Turris Omnia (TO). The RAID 1 sync failed very quickly, at 0.3%.

2x Seagate IronWolf ST3000VN007, no matter what combination/configuration …
/dev/sdb gives me an I/O error during or shortly after mount. When connected as /dev/sda, all is normal …
No RAID configuration; I tested many options, including another disk with the same firmware/revision.
Identical twins, stand-alone, and in the log I have only seen something like “/dev/sdb … to big for me!” “trying[16]”; later on an I/O error is shown …

I will check it once more after today’s system update … (I noticed some kmod updates …) so maybe I will be lucky this time.

/KYP

Could you PM me or post your log containing that here, please? And if you have the possibility, could you connect a serial cable and record what happens when you reproduce it? It doesn’t seem like the thing we reproduced. From my own tests, if the drive stops communicating, we see something like this in the log:

[16814.454535] ata1: softreset failed (1st FIS failed)
[16814.459443] ata1: limiting SATA link speed to 3.0 Gbps
[16819.464240] ata1: softreset failed (1st FIS failed)
[16819.469136] ata1: reset failed, giving up
[16819.473161] ata1.00: disabled
[16819.476144] ata1.00: device reported invalid CHS sector 0
[16819.481555] ata1.00: device reported invalid CHS sector 0

Nothing like “to big for me!”. Or did I misunderstand you?

Here is the shorter output from my syslog (the long one is in your PM):

2016-12-17T16:15:19+01:00 err kernel[]: [ 4599.859189] ata2: softreset failed (1st FIS failed)
2016-12-17T16:15:19+01:00 warning kernel[]: [ 4599.864113] ata2: limiting SATA link speed to 1.5 Gbps
2016-12-17T16:15:19+01:00 info kernel[]: [ 4599.864119] ata2: hard resetting link
2016-12-17T16:15:24+01:00 err kernel[]: [ 4604.868963] ata2: softreset failed (1st FIS failed)
2016-12-17T16:15:24+01:00 err kernel[]: [ 4604.873859] ata2: reset failed, giving up
2016-12-17T16:15:24+01:00 warning kernel[]: [ 4604.877877] ata2.00: disabled
2016-12-17T16:15:24+01:00 info kernel[]: [ 4604.877897] ata2: EH complete
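
For anyone comparing their own logs against the excerpts above, here is a minimal sketch that counts the reset-related messages per ATA port. It assumes the lines look like the ones posted in this thread (raw dmesg or syslog-prefixed); the event names and the `summarize` helper are made up for illustration and are not part of any Turris tooling.

```python
import re
from collections import Counter, defaultdict

# Finds "[seconds.micros] ataN[.MM]: message" anywhere in the line, so it
# works for raw dmesg output and for syslog lines with a timestamp prefix.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(ata\d+)(?:\.\d+)?:\s+(.*)$")

# Substrings copied from the log excerpts in this thread (labels are hypothetical).
PATTERNS = {
    "softreset_failed": "softreset failed",
    "speed_limited": "limiting SATA link speed",
    "gave_up": "reset failed, giving up",
    "disabled": "disabled",
}


def summarize(lines):
    """Return {port: {event_name: count}} for the events listed in PATTERNS."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        _timestamp, port, message = match.groups()
        for name, needle in PATTERNS.items():
            if needle in message:
                counts[port][name] += 1
    return {port: dict(counter) for port, counter in counts.items()}


sample = [
    "2016-12-17T16:15:19+01:00 err kernel[]: [ 4599.859189] ata2: softreset failed (1st FIS failed)",
    "2016-12-17T16:15:19+01:00 warning kernel[]: [ 4599.864113] ata2: limiting SATA link speed to 1.5 Gbps",
    "2016-12-17T16:15:19+01:00 info kernel[]: [ 4599.864119] ata2: hard resetting link",
    "2016-12-17T16:15:24+01:00 err kernel[]: [ 4604.868963] ata2: softreset failed (1st FIS failed)",
    "2016-12-17T16:15:24+01:00 err kernel[]: [ 4604.873859] ata2: reset failed, giving up",
    "2016-12-17T16:15:24+01:00 warning kernel[]: [ 4604.877877] ata2.00: disabled",
    "2016-12-17T16:15:24+01:00 info kernel[]: [ 4604.877897] ata2: EH complete",
]

print(summarize(sample))
```

Feeding it a whole saved dmesg (one line per list element) gives a quick view of which port gave up first.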

The drives have jumpers to limit the speed to 1.5G.

Any progress in ironing out the SATA issues?

I am also curious about the status of this issue. Yesterday my BTRFS array crashed while decompressing a 3GB package (the compressed package was located on the array and was decompressed to the same array).
Today I tried to run btrfs scrub (it reads the data on both disks and compares checksums with the metadata) to see whether the data is consistent, but after reading 450GB it failed again.

Here is a kernel log:
turris-sata.csv (29.9 KB)

The latest news I know of is that it is caused by some errors on PCI. They are probably caused by a bug in the kernel or (more probably) in the CPU itself. It is being tackled by @brill, so he might give you a more current update.

Hi there,
my Omnia with the NAS perk had an issue when I started btrfs defrag with compression on the first HDD and btrfs filesystem defrag (without compression) on the second HDD. Two LXC containers were running in the background, so the CPU was under heavy load. A swap partition had been added on the first HDD (this explains the last error). Here is the output of dmesg:

[681060.012730] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[681065.012723] ata2.00: qc timeout (cmd 0xec)
[681065.012738] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[681065.012743] ata2.00: revalidation failed (errno=-5)
[681065.017730] ata2: hard resetting link
[681075.022738] ata2: softreset failed (1st FIS failed)
[681075.027731] ata2: hard resetting link
[681085.032716] ata2: softreset failed (1st FIS failed)
[681085.037711] ata2: hard resetting link
[681120.032716] ata2: softreset failed (1st FIS failed)
[681120.037701] ata2: hard resetting link
[681125.252729] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[681135.252723] ata2.00: qc timeout (cmd 0xec)
[681135.252738] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[681135.252743] ata2.00: revalidation failed (errno=-5)
[681135.257725] ata2: hard resetting link
[681145.262732] ata2: softreset failed (1st FIS failed)
[681145.267717] ata2: hard resetting link
[681148.764158] scsi_io_completion: 2438 callbacks suppressed
[681148.764174] sd 0:0:0:0: [sda] tag#26 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[681148.764182] sd 0:0:0:0: [sda] tag#26 CDB: opcode=0x88 88 00 00 00 00 00 04 02 13 08 00 00 00 08 00 00
[681148.764186] blk_update_request: 2438 callbacks suppressed
[681148.764190] blk_update_request: I/O error, dev sda, sector 67244808
[681148.770562] Read-error on swap-device (8:0:67244816)
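
The failing request can be read straight from the CDB bytes in the log above. A minimal sketch, assuming the standard 16-byte SCSI READ(16)/WRITE(16) layout (opcode in byte 0, big-endian LBA in bytes 2 to 9, transfer length in bytes 10 to 13); `decode_cdb16` is a hypothetical helper name:

```python
def decode_cdb16(cdb_hex):
    """Decode a READ(16)/WRITE(16) CDB given as space-separated hex bytes.

    Returns (operation, starting LBA, number of logical blocks).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16:
        raise ValueError("expected a 16-byte CDB")
    opcodes = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
    if cdb[0] not in opcodes:
        raise ValueError("only READ(16) and WRITE(16) are handled here")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return opcodes[cdb[0]], lba, blocks


# CDB copied from the "[sda] tag#26 CDB" line above.
print(decode_cdb16("88 00 00 00 00 00 04 02 13 08 00 00 00 08 00 00"))
# -> ('READ(16)', 67244808, 8)
```

The decoded LBA matches the sector reported on the `blk_update_request: I/O error` line, so this was an 8-block read that was in flight when the link went away.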

I found a way to reproduce the SATA errors described in this thread. I have different drives in a non-RAID setup.
Both of them have a btrfs filesystem with some data on it (in my case exactly 29.4G on /dev/sda2 and 4.2T on /dev/sdb1). I just needed to start a scrub on both drives and wait for the crash.
What have I tested that is error free?

  1. btrfs scrub on “sda” only, on one drive while the second one is not used
  2. btrfs defrag with compression running together on “sda” (tested with zlib compression)
  3. Rsyncing most of the data stored on “sdb” from my laptop to the Turris
  4. Copying data to a Samba shared folder created on the “sdb” drive. This has been happening every day for about a month, a couple of gigs daily.

What difference do I see between the error and error-free cases? It is visible when I run “iostat -m” and htop in a console. Scrub on two drives pushes the load on the PCI line and the CPU to the maximum, compared with the error-free cases.
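
The failing case described above (scrub on both filesystems at once) can be sketched as follows. This is only an illustration: the mount points are hypothetical placeholders, and `run_parallel` is defined but deliberately not called, since it needs root and real btrfs filesystems. `-B` keeps `btrfs scrub start` in the foreground so each process exits when its scrub ends.

```python
import subprocess


def scrub_command(mount_point):
    """Build the foreground scrub command for one mounted btrfs filesystem."""
    return ["btrfs", "scrub", "start", "-B", mount_point]


def run_parallel(mount_points):
    """Start one scrub per filesystem at the same time and wait for all of them.

    Returns the exit codes in the same order as mount_points.
    """
    procs = [subprocess.Popen(scrub_command(mp)) for mp in mount_points]
    return [proc.wait() for proc in procs]


# Hypothetical mount points for /dev/sda2 and /dev/sdb1; adjust to your setup.
MOUNT_POINTS = ["/mnt/sda2", "/mnt/sdb1"]

for mp in MOUNT_POINTS:
    print(" ".join(scrub_command(mp)))
```

Watching `dmesg` in a second terminal while both scrubs run should show whether the link resets appear.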

I have the same issue with two disks Seagate Archive 8TB (ST8000AS0002).

Very good to hear that you’ve managed to reproduce the issue. Please keep us posted.

Well, not much to say… We are going to investigate a possible underlying problem with a PCI Express analyzer. Hopefully we will be able to find out what is going on there.

Tomas

I received a new kernel update today. Even though there was no information saying the SATA errors were fixed, I decided to test it again.
I ran btrfs scrub on both drives, sda (4TB) and sdb (8TB). When scrub had checked about 500GB of data on sda, I started btrfs filesystem defrag on sda while scrub was still running on sdb. When about 1.25TB of data had been scrubbed on sdb and defrag was still running on sda, the SATA link reset and the processes running on the drives stopped. So unfortunately the last update changed nothing about this serious error.

Here is a part of dmesg:

[ 8427.828378] ata2: softreset failed (1st FIS failed)
[ 8427.833274] ata2: hard resetting link
[ 8428.308306] ata1: softreset failed (1st FIS failed)
[ 8428.313200] ata1: hard resetting link
[ 8437.837694] ata2: softreset failed (1st FIS failed)
[ 8437.842590] ata2: hard resetting link
[ 8438.307660] ata1: softreset failed (1st FIS failed)
[ 8438.312643] ata1: hard resetting link
[ 8472.845396] ata2: softreset failed (1st FIS failed)
[ 8472.850291] ata2: limiting SATA link speed to 1.5 Gbps
[ 8472.850296] ata2: hard resetting link
[ 8473.305362] ata1: softreset failed (1st FIS failed)
[ 8473.310258] ata1: limiting SATA link speed to 3.0 Gbps
[ 8473.310263] ata1: hard resetting link
[ 8478.055073] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 8478.315042] ata1: softreset failed (1st FIS failed)
[ 8478.319938] ata1: reset failed, giving up
[ 8478.323955] ata1.00: disabled
[ 8478.323962] ata1.00: device reported invalid CHS sector 0
[ 8478.323967] ata1.00: device reported invalid CHS sector 0
[ 8478.323971] ata1.00: device reported invalid CHS sector 0
[ 8478.323974] ata1.00: device reported invalid CHS sector 0
[ 8478.323978] ata1.00: device reported invalid CHS sector 0
[ 8478.323981] ata1.00: device reported invalid CHS sector 0
[ 8478.323985] ata1.00: device reported invalid CHS sector 0
[ 8478.323988] ata1.00: device reported invalid CHS sector 0
[ 8478.323994] ata1.00: device reported invalid CHS sector 0
[ 8478.323997] ata1.00: device reported invalid CHS sector 0
[ 8478.324001] ata1.00: device reported invalid CHS sector 0
[ 8478.324004] ata1.00: device reported invalid CHS sector 0
[ 8478.324008] ata1.00: device reported invalid CHS sector 0
[ 8478.324011] ata1.00: device reported invalid CHS sector 0
[ 8478.324015] ata1.00: device reported invalid CHS sector 0
[ 8478.324019] ata1.00: device reported invalid CHS sector 0
[ 8478.324023] ata1.00: device reported invalid CHS sector 0
[ 8478.324026] ata1.00: device reported invalid CHS sector 0
[ 8478.324030] ata1.00: device reported invalid CHS sector 0
[ 8478.324034] ata1.00: device reported invalid CHS sector 0
[ 8478.324037] ata1.00: device reported invalid CHS sector 0
[ 8478.324041] ata1.00: device reported invalid CHS sector 0
[ 8478.324044] ata1.00: device reported invalid CHS sector 0
[ 8478.324048] ata1.00: device reported invalid CHS sector 0
[ 8478.324051] ata1.00: device reported invalid CHS sector 0
[ 8478.324054] ata1.00: device reported invalid CHS sector 0
[ 8478.324058] ata1.00: device reported invalid CHS sector 0
[ 8478.324062] ata1.00: device reported invalid CHS sector 0
[ 8478.324094] ata1: EH complete
[ 8478.324145] sd 0:0:0:0: [sda] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.324155] sd 0:0:0:0: [sda] tag#20 CDB: opcode=0x8a 8a 00 00 00 00 00 43 5d 62 38 00 00 04 00 00 00
[ 8478.324161] blk_update_request: I/O error, dev sda, sector 1130193464
[ 8478.324166] sd 0:0:0:0: [sda] tag#26 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.324171] sd 0:0:0:0: [sda] tag#26 CDB: opcode=0x8a 8a 00 00 00 00 00 43 5d 4a 38 00 00 04 00 00 00
[ 8478.324174] blk_update_request: I/O error, dev sda, sector 1130187320
[ 8478.324182] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[ 8478.324347] sd 0:0:0:0: [sda] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.324352] sd 0:0:0:0: [sda] tag#29 CDB: opcode=0x8a 8a 00 00 00 00 00 43 5d 36 38 00 00 04 00 00 00
[ 8478.324354] blk_update_request: I/O error, dev sda, sector 1130182200
[ 8478.324360] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[ 8478.324504] sd 0:0:0:0: [sda] tag#30 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.324508] sd 0:0:0:0: [sda] tag#30 CDB: opcode=0x8a 8a 00 00 00 00 00 43 5d 2e 38 00 00 08 00 00 00
[ 8478.324510] blk_update_request: I/O error, dev sda, sector 1130180152
[ 8478.324515] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[ 8478.324640] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[ 8478.324770] sd 0:0:0:0: [sda] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.324775] sd 0:0:0:0: [sda] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 00 43 5d 7a 38 00 00 08 00 00 00
[ 8478.324777] blk_update_request: I/O error, dev sda, sector 1130199608
[ 8478.324782] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[ 8478.324903] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[ 8478.325044] sd 0:0:0:0: [sda] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.325048] sd 0:0:0:0: [sda] tag#14 CDB: opcode=0x8a 8a 00 00 00 00 00 43 5c de 38 00 00 04 00 00 00
[ 8478.325050] blk_update_request: I/O error, dev sda, sector 1130159672
[ 8478.325056] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[ 8478.325175] sd 0:0:0:0: [sda] tag#15 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.325179] sd 0:0:0:0: [sda] tag#15 CDB: opcode=0x8a 8a 00 00 00 00 00 22 91 42 f8 00 00 01 20 00 00
[ 8478.325181] blk_update_request: I/O error, dev sda, sector 579945208
[ 8478.325185] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[ 8478.325228] sd 0:0:0:0: [sda] tag#26 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.325232] sd 0:0:0:0: [sda] tag#26 CDB: opcode=0x88 88 00 00 00 00 00 1f dd 36 50 00 00 00 88 00 00
[ 8478.325234] blk_update_request: I/O error, dev sda, sector 534591056
[ 8478.325238] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 8, rd 1, flush 0, corrupt 0, gen 0
[ 8478.325248] sd 0:0:0:0: [sda] tag#29 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.325251] sd 0:0:0:0: [sda] tag#29 CDB: opcode=0x88 88 00 00 00 00 00 2e 6b 8e c0 00 00 00 20 00 00
[ 8478.325253] blk_update_request: I/O error, dev sda, sector 778800832
[ 8478.325256] BTRFS error (device sda2): bdev /dev/sda2 errs: wr 8, rd 2, flush 0, corrupt 0, gen 0
[ 8478.328632] sd 0:0:0:0: [sda] tag#30 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
[ 8478.328637] sd 0:0:0:0: [sda] tag#30 CDB: opcode=0x88 88 00 00 00 00 00 1f dd 36 50 00 00 00 08 00 00
[ 8478.328640] blk_update_request: I/O error, dev sda, sector 534591056
[ 8478.329702] ------------[ cut here ]------------
[ 8478.329715] WARNING: CPU: 1 PID: 1296 at fs/btrfs/extent-tree.c:2927 btrfs_run_delayed_refs+0x29c/0x2d4()
[ 8478.329716] BTRFS: Transaction aborted (error -5)
[ 8478.329776] Modules linked in: qcserial option iptable_nat ip6table_nat ath9k uvcvideo usb_wwan snd_usb_audio rndis_host qmi_wwan pppoe nf_nat_pptp nf_nat_ipv6 nf_nat_ipv4 nf_nat_amanda nf_conntrack_pptp nf_conntrack_netlink nf_conntrack_ipv6 nf_conntrack_ipv4 nf_conntrack_amanda ipt_REJECT ipt_MASQUERADE ebtable_nat ebtable_filter ebtable_broute cdc_ether ath9k_common armada_thermal xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_socket xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TPROXY xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY videobuf2_v4l2 usbserial usbnet usblp ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda ts_kmp ts_fsm ts_bm thermal_sys snd_usbmidi_lib pppox ppp_mppe ppp_async nfnetlink nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_nat_h323 nf_nat_ftp nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtcache nf_conntrack_proto_gre nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp nf_conntrack_broadcast mvsdio iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables hwmon ebtables ebt_vlan ebt_stp ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_among ebt_802_3 crc_ccitt cdc_wdm ath9k_hw fuse sch_teql sch_tbf sch_sfq sch_red sch_prio sch_pie sch_netem sch_htb sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp cls_basic act_police act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev ath10k_pci ath10k_core ath mac80211 cfg80211 compat ledtrig_usbdev ledtrig_oneshot xt_LED ledtrig_morse ledtrig_heartbeat ledtrig_gpio cryptodev ip6t_NPT ip6t_MASQUERADE 
nf_nat_masquerade_ipv6 nf_nat nf_conntrack ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables pppoatm ppp_generic slhc nfsd nfsv3 nfs msdos ip_gre gre ifb sit ip6_tunnel tunnel6 tunnel4 ip_tunnel veth tun snd_compress snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_rawmidi snd_seq_device snd_hwdep snd input_core soundcore rxkad vfat fat udf crc_itu_t ntfs lockd sunrpc grace minix isofs hfsplus hfs cramfs configfs cifs autofs4 kafs af_rxrpc dns_resolver dm_crypt dm_mirror dm_region_hash dm_log dm_mod br2684 atm multipath fscache raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx raid10 raid1 raid0 linear md_mod nls_utf8 nls_koi8_r nls_cp1255 nls_iso8859_6 nls_iso8859_2 nls_iso8859_15 nls_iso8859_13 nls_iso8859_1 nls_cp932 nls_cp866 nls_cp864 nls_cp862 nls_cp852 nls_cp850 nls_cp775 nls_cp437 nls_cp1251 nls_cp1250 dma_shared_buffer xts algif_skcipher algif_hash af_alg sha512_generic sha256_generic sha1_generic seqiv jitterentropy_rng drbg pcbc md5 md4 marvell_cesa hmac gf128mul fcrypt ecb des_generic ctr cmac ccm cbc authenc usb_storage xhci_plat_hcd xhci_pci xhci_hcd orion_wdt uhci_hcd ledtrig_transient ahci_mvebu ahci ehci_orion ehci_platform ehci_hcd sd_mod ahci_platform libahci_platform libahci libata scsi_mod xfs libcrc32c jfs f2fs exfat usbcore nls_base usb_common mii aead crypto_null
[ 8478.329983] CPU: 1 PID: 1296 Comm: btrfs-transacti Not tainted 4.4.39-80079e1c1e5f9ca7ad734044462a761a-3 #1
[ 8478.329985] Hardware name: Marvell Armada 380/385 (Device Tree)
[ 8478.329987] Backtrace:
[ 8478.329993] [] (dump_backtrace) from [] (show_stack+0x18/0x1c)
[ 8478.329998] r6:00000000 r5:60000013 r4:c06ae224 r3:00000000
[ 8478.330003] [] (show_stack) from [] (dump_stack+0x98/0xac)
[ 8478.330007] [] (dump_stack) from [] (warn_slowpath_common+0x8c/0xbc)
[ 8478.330012] r6:00000b6f r5:c0213718 r4:ecf1de48 r3:ecf1c000
[ 8478.330016] [] (warn_slowpath_common) from [] (warn_slowpath_fmt+0x38/0x40)
[ 8478.330020] r8:ed9eb400 r7:ea9d6340 r6:ee4bb848 r5:000005c0 r4:fffffffb
[ 8478.330026] [] (warn_slowpath_fmt) from [] (btrfs_run_delayed_refs+0x29c/0x2d4)
[ 8478.330028] r3:fffffffb r2:c05ced24
[ 8478.330034] [] (btrfs_run_delayed_refs) from [] (btrfs_commit_transaction+0x34/0xc38)
[ 8478.330040] r10:00000000 r9:00000000 r8:00010da0 r7:ed9eb400 r6:ee4bb848 r5:000005c0
[ 8478.330041] r4:ee5ad678
[ 8478.330046] [] (btrfs_commit_transaction) from [] (transaction_kthread+0x1a8/0x218)
[ 8478.330051] r10:00000000 r9:00000000 r8:00010da0 r7:ecf1c000 r6:00000bb8 r5:000005c0
[ 8478.330053] r4:ed9eb400
[ 8478.330058] [] (transaction_kthread) from [] (kthread+0xf0/0x104)
[ 8478.330063] r10:00000000 r9:00000000 r8:00000000 r7:c02242e4 r6:ed9eb400 r5:00000000
[ 8478.330064] r4:ed3e8900
[ 8478.330069] [] (kthread) from [] (ret_from_fork+0x14/0x3c)
[ 8478.330072] r7:00000000 r6:00000000 r5:c00438e8 r4:ed3e8900
[ 8478.330074] ---[ end trace 4dbae35872572c17 ]---
[ 8478.330078] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2927: errno=-5 IO failure
[ 8478.330080] BTRFS info (device sda2): forced readonly
[ 8478.337669] pending csums is 28614656
[ 8478.478079] BTRFS error (device sda2): error loading props for ino 82125 (root 258): -5
[ 8483.054735] ata2.00: qc timeout (cmd 0xec)
[ 8483.054750] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 8483.054755] ata2.00: revalidation failed (errno=-5)
[ 8483.059648] ata2: hard resetting link
[ 8493.064090] ata2: softreset failed (1st FIS failed)
[ 8493.068986] ata2: hard resetting link
[ 8503.073429] ata2: softreset failed (1st FIS failed)
[ 8503.078329] ata2: hard resetting link
[ 8538.071169] ata2: softreset failed (1st FIS failed)
[ 8538.076071] ata2: hard resetting link
[ 8543.280847] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 8553.280203] ata2.00: qc timeout (cmd 0xec)
[ 8553.280218] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 8553.280224] ata2.00: revalidation failed (errno=-5)
[ 8553.285118] ata2: hard resetting link
[ 8563.279554] ata2: softreset failed (1st FIS failed)
[ 8563.284450] ata2: hard resetting link
[ 8573.288911] ata2: softreset failed (1st FIS failed)
[ 8573.293811] ata2: hard resetting link
[ 8608.296675] ata2: softreset failed (1st FIS failed)
[ 8608.301573] ata2: hard resetting link
[ 8613.506350] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 8643.504444] ata2.00: qc timeout (cmd 0xec)
[ 8643.504459] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 8643.504465] ata2.00: revalidation failed (errno=-5)
[ 8643.509354] ata2.00: disabled
[ 8643.509369] ata2.00: device reported invalid CHS sector 0
[ 8643.509373] ata2.00: device reported invalid CHS sector 0
[ 8643.509377] ata2.00: device reported invalid CHS sector 0
[ 8643.509380] ata2.00: device reported invalid CHS sector 0
[ 8643.509384] ata2.00: device reported invalid CHS sector 0
[ 8643.509387] ata2.00: device reported invalid CHS sector 0
[ 8643.509390] ata2.00: device reported invalid CHS sector 0
[ 8643.509394] ata2.00: device reported invalid CHS sector 0
[ 8643.509398] ata2.00: device reported invalid CHS sector 0
[ 8643.509401] ata2.00: device reported invalid CHS sector 0
[ 8643.509404] ata2.00: device reported invalid CHS sector 0
[ 8643.509408] ata2.00: device reported invalid CHS sector 0
[ 8643.509411] ata2.00: device reported invalid CHS sector 0
[ 8643.509414] ata2.00: device reported invalid CHS sector 0
[ 8643.509418] ata2.00: device reported invalid CHS sector 0
[ 8643.509421] ata2.00: device reported invalid CHS sector 0
[ 8643.509425] ata2.00: device reported invalid CHS sector 0
[ 8643.509429] ata2.00: device reported invalid CHS sector 0
[ 8643.509432] ata2.00: device reported invalid CHS sector 0
[ 8643.509436] ata2.00: device reported invalid CHS sector 0
[ 8643.509439] ata2.00: device reported invalid CHS sector 0
[ 8643.509442] ata2.00: device reported invalid CHS sector 0
[ 8643.509445] ata2.00: device reported invalid CHS sector 0
[ 8643.509449] ata2.00: device reported invalid CHS sector 0
[ 8643.509452] ata2.00: device reported invalid CHS sector 0
[ 8643.509456] ata2.00: device reported invalid CHS sector 0
[ 8643.509459] ata2.00: device reported invalid CHS sector 0
[ 8643.509462] ata2.00: device reported invalid CHS sector 0
[ 8643.509474] ata2: hard resetting link

Late to the party, but “me, too”. I encountered the same issue with two Toshiba X300 drives.

Since I did not want to take the drives out of the Turris Omnia NAS, I only took the cover off, left the drives powered by the Turris Omnia, and connected internal SATA cables from a desktop machine. From the desktop machine, I was able to zero the drives (in parallel). Once that was done, I disconnected the drives from the desktop machine and reconnected everything in the Turris Omnia. Then I was able to run mdadm --create --assume-clean --level=raid1 … and mkfs.ext4. Since then, I have not had issues accessing the drive over Samba from a single client, but I am hesitant to put excessive load on the box, especially since the CPU temperature was sustained at 119°C while the single client was copying a few hundred GB over Samba (as noted in Operating temperature).

Another “me too” here. Using 2 brand new WD Red 6 TB drives…

Thank you all for your responses. They give us an idea of where the problem lies. Currently we know that it is a problem with PCIe, as @brill wrote, and he is looking into analyzing it using specialized hardware. So further “me too” posts will not help at the moment. When there is some progress, we will inform you.

I had the ‘softreset failed’ issue even with just one hard drive connected, even when I didn’t send any data to or from it[1]. I’d boot up with a disk connected, it would be detected fine, and then after a few minutes I’d get ‘hard resetting link’ messages, eventually followed by ‘softreset failed’. Strangely, even power-cycling wouldn’t always fix the issue. It looked as if the disk would no longer respond as soon as it was spun down.

So what I’m saying is: I don’t think this is bandwidth related. At least not in my case.

I have tried different disks (HDD and SSD) as well as an external PSU, to no avail. I’m currently waiting for a new controller in the hope that things will work with it.

[1] Mind you, I didn’t get the NAS perk, but I bought the same controller, based on the ASM1062.

[ 4438.793490] ata1.00: exception Emask 0x0 SAct 0xe0000 SErr 0x0 action 0x6 frozen
[ 4438.800916] ata1.00: cmd 60/00:88:80:fe:4b/05:00:01:00:00/40 tag 17 ncq 655360 in
[ 4438.815832] ata1.00: cmd 60/80:90:80:03:4c/00:00:01:00:00/40 tag 18 ncq 65536 in
[ 4438.830657] ata1.00: cmd 60/01:98:00:00:04/00:00:00:00:00/40 tag 19 ncq 512 in
[ 4438.845602] ata1: hard resetting link
[ 4448.843078] ata1: softreset failed (1st FIS failed)
[ 4448.847973] ata1: hard resetting link
[ 4458.852299] ata1: softreset failed (1st FIS failed)
[ 4458.857197] ata1: hard resetting link
[ 4493.849536] ata1: softreset failed (1st FIS failed)
[ 4493.854431] ata1: limiting SATA link speed to 3.0 Gbps
[ 4493.854436] ata1: hard resetting link
[ 4499.059218] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[ 4504.058896] ata1.00: qc timeout (cmd 0xec)
[ 4504.058910] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 4504.058915] ata1.00: revalidation failed (errno=-5)
[ 4504.063807] ata1: hard resetting link
[ 4514.068236] ata1: softreset failed (1st FIS failed)
[ 4514.073142] ata1: hard resetting link
[ 4524.077488] ata1: softreset failed (1st FIS failed)
[ 4524.082387] ata1: hard resetting link
[ 4559.074889] ata1: softreset failed (1st FIS failed)
[ 4559.079789] ata1: limiting SATA link speed to 1.5 Gbps
[ 4559.079794] ata1: hard resetting link
[ 4564.284493] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 4574.283766] ata1.00: qc timeout (cmd 0xec)
[ 4574.283780] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 4574.283785] ata1.00: revalidation failed (errno=-5)
[ 4574.288679] ata1: hard resetting link
[ 4584.293065] ata1: softreset failed (1st FIS failed)
[ 4584.297965] ata1: hard resetting link
[ 4594.302341] ata1: softreset failed (1st FIS failed)
[ 4594.307279] ata1: hard resetting link
[ 4629.309893] ata1: softreset failed (1st FIS failed)
[ 4629.314789] ata1: hard resetting link
[ 4634.519536] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 4664.517486] ata1.00: qc timeout (cmd 0xec)
[ 4664.517499] ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 4664.517504] ata1.00: revalidation failed (errno=-5)
[ 4664.522446] ata1.00: disabled
[ 4664.522460] ata1.00: device reported invalid CHS sector 0
[ 4664.522465] ata1.00: device reported invalid CHS sector 0
[ 4664.522468] ata1.00: device reported invalid CHS sector 0
[ 4664.522482] ata1: hard resetting link
[ 4674.526810] ata1: softreset failed (1st FIS failed)
[ 4674.531716] ata1: hard resetting link
[ 4684.536146] ata1: softreset failed (1st FIS failed)
[ 4684.541053] ata1: hard resetting link
[ 4719.533613] ata1: softreset failed (1st FIS failed)
[ 4719.538510] ata1: hard resetting link
[ 4724.743226] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 4724.743253] ata1: EH complete
[52721.241081] ata1: exception Emask 0x10 SAct 0x0 SErr 0x10002 action 0xe frozen
[52721.248354] ata1: irq_stat 0x00400000, PHY RDY changed
[52721.253512] ata1: hard resetting link
[52721.994130] ata1: SATA link down (SStatus 0 SControl 300)
[52721.994150] ata1: EH complete
[52721.994165] ata1.00: detaching (SCSI 0:0:0:0)
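
The SStatus and SControl numbers in the log above are hexadecimal register dumps. A minimal sketch of how to read them, assuming the standard SATA field layout (DET in bits 0 to 3, SPD in bits 4 to 7, IPM in bits 8 to 11); the function names are hypothetical:

```python
# SPD field values as defined for the SATA status/control registers.
SPEEDS = {0: None, 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def split_fields(hex_text):
    """Split a register value printed by the kernel (hex, no 0x prefix)."""
    value = int(hex_text, 16)
    return {
        "det": value & 0xF,         # device detection
        "spd": (value >> 4) & 0xF,  # speed (negotiated or allowed)
        "ipm": (value >> 8) & 0xF,  # interface power management
    }


def describe_sstatus(hex_text):
    """SStatus: DET == 3 means a device is present and the PHY link is up."""
    fields = split_fields(hex_text)
    return {"link_up": fields["det"] == 3, "speed": SPEEDS.get(fields["spd"])}


def describe_scontrol(hex_text):
    """SControl: SPD == 0 means no limit, otherwise the highest allowed speed."""
    fields = split_fields(hex_text)
    return {"speed_limit": SPEEDS.get(fields["spd"])}


for sstatus, scontrol in [("123", "320"), ("113", "310"), ("0", "300")]:
    print(sstatus, describe_sstatus(sstatus), scontrol, describe_scontrol(scontrol))
```

Read this way, the log shows the kernel stepping the allowed speed down from 3.0 Gbps to 1.5 Gbps before the link finally drops (SStatus 0), which fits the “limiting SATA link speed” messages.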

The problem is with the PCI Express bus, which is not reliable and disconnects your drives during the initial RAID sync, or later under high load such as btrfs scrubbing or when something in an LXC container wants to use the maximum HDD bandwidth.
I think there is something wrong in the hardware design, because I have not found any information on how they want to resolve this problem, even though they confirmed they know about it.

Check the post from @cynerd in this thread. As far as I remember, they know about it and know how to simulate a few test scenarios, but have not found a generic fix yet. No date/time has been specified. We just have to wait.