SATA HDD issues

So is there any solution yet, or a step-by-step guide to disabling NCQ? Right now I am running on a single drive, but that is not safe.

opkg install asm1062-fix

The fix doesn’t help. I still have issues. Maybe the problem is that I’m using SSDs instead of noisy HDDs. But sdb is very unreliable. This is such a shame. The NAS box upgrade is almost useless. Has anyone had a good experience with some other controller in the NAS box? Does the Turris team have any response?

Can you please check whether you really have NCQ disabled? I have had no issues since installing asm1062-fix, but of course I have HDDs, not SSDs.

What does unreliable mean? Do you still get timeouts? Is only sdb problematic, or do both drives go down simultaneously? Because that is how it behaved for me. I never had a problem with just sdb.

Is this checking for it?

# cat /sys/block/sda/device/queue_depth 
1
# cat /sys/block/sdb/device/queue_depth 
1
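
For anyone who wants to run the same check on every disk at once, here is a minimal sketch. It assumes, as the posts in this thread do, that a `queue_depth` of 1 means NCQ is effectively off. The function name `ncq_report` and its optional directory argument are made up for illustration.

```sh
# ncq_report [SYSFS_BLOCK_DIR]
# Prints one line per sdX disk with its queue depth.
# The optional argument (default /sys/block) is a hypothetical convenience
# so the function can be tried against a copy of the sysfs tree.
ncq_report() {
    block_dir="${1:-/sys/block}"
    for f in "$block_dir"/sd*/device/queue_depth; do
        [ -r "$f" ] || continue        # glob matched nothing, or file unreadable
        dev="${f#"$block_dir"/}"       # drop the leading directory
        dev="${dev%%/*}"               # keep only "sda", "sdb", ...
        depth="$(cat "$f")"
        if [ "$depth" -eq 1 ]; then
            echo "$dev: queue_depth=$depth (NCQ off)"
        else
            echo "$dev: queue_depth=$depth (NCQ on)"
        fi
    done
}
```

On the router itself, calling `ncq_report` with no arguments reads `/sys/block` directly.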

I have no idea why it only seems to affect sdb. Maybe because I’m writing to it? I was trying to grow the btrfs filesystem from sda into a mirror on sdb, and it aborted with an I/O error.

I got errors in dmesg very similar to the ones others reported, about soft resets and hard resets.

In my case the second channel/second drive was the problem, no matter what.

Now (after the fix and the latest OS updates) it seems to be fine: at least I can mount the drive/partitions without an instant I/O error, and I can work with files and folders. Hope it will stay this way :slight_smile:

It worked for about 120 minutes. After copying some .iso files it just gave me the same old “I/O” error again.
Some output from the logs (with “cz” comments): https://pastebin.com/kqA2g8BY

Hi,
can you check your HDD health, or post the output of smartctl -a /dev/sdb?

Here it is …

[details=smartctl_output]=== START OF INFORMATION SECTION ===
Device Model: ST3000VN007-2E4166
Serial Number: W6A24HNA
LU WWN Device Id: 5 000c50 09d3adcd4
Firmware Version: SC60
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5900 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is: Sat Dec 30 01:00:23 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 107) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 390) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   100   006    Pre-fail  Always       -       19725048
  3 Spin_Up_Time            0x0003   095   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       67
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       314846
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4905
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       15
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   074   074   000    Old_age   Always       -       26
190 Airflow_Temperature_Cel 0x0022   065   059   045    Old_age   Always       -       35 (Min/Max 29/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       11
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   035   041   000    Old_age   Always       -       35 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       18

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.[/details]
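
If you want to keep an eye on one attribute rather than reading the whole dump each time, something like the following might help. It is only a sketch: `smart_raw` is a hypothetical helper that picks the RAW_VALUE column out of `smartctl -A` output. Attribute 199 (UDMA_CRC_Error_Count, raw value 18 in the output above) counts CRC errors on the link between the drive and the controller, so a value that keeps growing usually suggests a cable, connector, or controller problem rather than bad sectors.

```sh
# smart_raw ATTRIBUTE_NAME
# Reads `smartctl -A` style attribute lines on stdin and prints the
# RAW_VALUE column (the 10th whitespace-separated field) of the first
# line whose ATTRIBUTE_NAME matches. Prints nothing if there is no match.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}
```

For example: `smartctl -A /dev/sdb | smart_raw UDMA_CRC_Error_Count`. Logging that number once a day would show whether the count is still climbing.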

I am having the same problems, but apparently much less frequently, with two different USB 3 HDDs (no RAID! One is powered, the other unpowered).
At some point, sometimes after several days, “Buffer I/O error on device dm-0” appears and the device disappears.

LXC containers are running from an encrypted partition on the device, mounted as /srv; another partition serves media.

SMART is OK. I have a cron job touch a file on the unencrypted partition every hour to rule out power saving. hdidle is not active.

I have had problems with self-powered USB 3.0 drives on my laptop in the past. One thing that turned out to be at fault was the cable. Connect your drive with a USB 2.0 cable and test it again.
In theory the speed drop should be noticeable, but on a router it is probably smaller than you would expect.

Will try USB 2.0, thanks for the idea! For my applications, USB 2.0 speed should be sufficient.

After some time and a few OS updates, I tried mounting my second drive again…

I still have issues with my two 3TB drives (my story is somewhere earlier in this thread):

  1. Each drive works perfectly fine when running alone, no matter which channel is used.
  2. Running together, the drive on channel 1 is fine; the drive on channel 2 gets I/O errors.
  3. Each drive connected via USB works perfectly for days.

The usual scenario: the disk is visible, and after fsck everything seems fine. All tools (blkid, block info, fdisk, cfdisk) report no issues. Mounting then results in an I/O error. I force a rescan of the SATA links and the drive is back online (and the circle is complete).
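
The forced rescan step can be wrapped in a small helper. This is only a sketch: `rescan_host` and its optional directory argument are hypothetical names, and it needs to run as root. In the logs in this post, sdb sits on SCSI host 1 (`sd 1:0:0:0`).

```sh
# rescan_host HOST_NUMBER [SCSI_HOST_DIR]
# Asks the kernel to rescan every channel, target and LUN on one SCSI host.
# The second argument (default /sys/class/scsi_host) is an illustrative
# convenience for trying the function outside the real sysfs tree.
rescan_host() {
    scan="${2:-/sys/class/scsi_host}/host$1/scan"
    if [ ! -w "$scan" ]; then
        echo "cannot write $scan (wrong host number, or not root?)" >&2
        return 1
    fi
    # "- - -" is the wildcard for channel, target and LUN.
    echo "- - -" > "$scan"
}
```

Calling `rescan_host 1` does the same thing as the `echo "- - -"` command shown in the logs in this post. It does not fix the underlying link errors, it only brings the drive back until the next failure.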

Any hints or tricks are welcome. Do not hesitate to send me a direct message (Czech or English).

/dev/sdb1: LABEL="Osiris" UUID="xxxx" SEC_TYPE="ext2" TYPE="ext3" PARTUUID="yyyyyy"

mount -L Osiris /mnt/Osiris

		2018-05-10 18:10:22 err kernel[]: [10447.673820] blk_update_request: I/O error, dev sdb, sector 2048
		2018-05-10 18:10:22 err kernel[]: [10447.679759] Buffer I/O error on dev sdb1, logical block 0, lost sync page write
		2018-05-10 18:10:22 info kernel[]: [10447.687145] sd 1:0:0:0: [sdb] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
		2018-05-10 18:10:22 info kernel[]: [10447.687155] sd 1:0:0:0: [sdb] tag#5 CDB: opcode=0x88 88 00 00 00 00 01 5d 50 a3 00 00 00 00 08 00 00
		2018-05-10 18:10:22 err kernel[]: [10447.687164] blk_update_request: I/O error, dev sdb, sector 5860532992


smartctl -a /dev/sdb
=== START OF INFORMATION SECTION ===
Vendor:               /1:0:0:0
Product:
Compliance:           SPC-5
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.


echo "- - -" > /sys/class/scsi_host/host1/scan
		2018-05-10 18:12:16 info kernel[]: [10561.407201] ata2: hard resetting link


smartctl -a /dev/sdb

Device Model:     ST3000VN007-2E4166
Serial Number:    W6A24HNA
LU WWN Device Id: 5 000c50 09d3adcd4
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu May 10 18:13:57 2018 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 390) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   100   006    Pre-fail  Always       -       30886608
  3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       119
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   253   030    Pre-fail  Always       -       538728
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7945
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       22
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   074   074   000    Old_age   Always       -       26
190 Airflow_Temperature_Cel 0x0022   058   058   045    Old_age   Always       -       42 (Min/Max 31/42)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       119
194 Temperature_Celsius     0x0022   042   042   000    Old_age   Always       -       42 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       21

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


fsck.ext3 /dev/sdb1

e2fsck 1.43.5 (04-Aug-2017)
Osiris: clean, 12/183148544 files, 11546855/732566385 blocks
		2018-05-10 18:18:31 err kernel[]: [10936.764472] Buffer I/O error on dev sdb1, logical block 2929987602, lost async page write
		2018-05-10 18:18:31 err kernel[]: [10936.772681] Buffer I/O error on dev sdb1, logical block 2929987603, lost async page write
		2018-05-10 18:18:31 err kernel[]: [10936.797293] Buffer I/O error on dev sdb1, logical block 2929987606, lost async page write

fdisk -l /dev/sdb1
fdisk: cannot open /dev/sdb1: I/O error		
		2018-05-10 18:18:53 info kernel[]: [10958.416459] sd 1:0:0:0: [sdb] tag#23 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
		2018-05-10 18:18:53 info kernel[]: [10958.416468] sd 1:0:0:0: [sdb] tag#23 CDB: opcode=0x88 88 00 00 00 00 00 00 00 08 00 00 00 00 01 00 00
		2018-05-10 18:18:53 err kernel[]: [10958.416472] blk_update_request: I/O error, dev sdb, sector 2048
		2018-05-10 18:18:53 err kernel[]: [10958.422423] Buffer I/O error on dev sdb1, logical block 0, async page read
		2018-05-10 18:18:53 err kernel[]: [10958.473870] Buffer I/O error on dev sdb1, logical block 4, async page read

		partx /dev/sdb1
partx: /dev/sdb: failed to read partition table
partx: write failed: I/O error
echo "- - -" > host1/scan
partx /dev/sdb1
NR START        END    SECTORS SIZE NAME UUID
 1  2048 5860533134 5860531087 2.7T      0b246952-9325-4bf9-8b82-afcb1af48d1c
		2018-05-10 18:22:24 info kernel[]: [11169.550891] sd 1:0:0:0: [sdb] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
		2018-05-10 18:22:24 info kernel[]: [11169.550904] sd 1:0:0:0: [sdb] tag#6 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
		2018-05-10 18:22:24 err kernel[]: [11169.550909] blk_update_request: I/O error, dev sdb, sector 0
		2018-05-10 18:22:24 err kernel[]: [11169.556606] Buffer I/O error on dev sdb, logical block 0, async page read
		2018-05-10 18:22:24 info kernel[]: [11169.564627] sd 1:0:0:0: [sdb] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
		2018-05-10 18:22:24 info kernel[]: [11169.564639] sd 1:0:0:0: [sdb] tag#7 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
		2018-05-10 18:22:24 err kernel[]: [11169.564649] blk_update_request: I/O error, dev sdb, sector 0

EDIT: Caught in the kernel log:

[17896.493526] EXT4-fs (sdb1): mounting ext3 file system using the ext4 subsystem
[17896.555672] ata2.00: exception Emask 0x10 SAct 0x4 SErr 0x400000 action 0x6 frozen
[17896.563278] ata2.00: irq_stat 0x08000000, interface fatal error
[17896.569263] ata2.00: cmd 61/08:10:00:08:00/00:00:00:00:00/40 tag 2 ncq 4096 out
[17896.569263]          res 40/00:10:00:08:00/00:00:00:00:00/40 Emask 0x10 (ATA bus error)
[17896.584666] ata2: hard resetting link
[17906.574968] ata2: softreset failed (1st FIS failed)
[17906.579889] ata2: hard resetting link
[17916.574351] ata2: softreset failed (1st FIS failed)
[17916.579283] ata2: hard resetting link
[17951.582104] ata2: softreset failed (1st FIS failed)
[17951.587022] ata2: limiting SATA link speed to 1.5 Gbps
[17951.587029] ata2: hard resetting link

For me, the issue appears to be fixed for USB 2.0 with the new UAS driver introduced with TurrisOS 3.10.

Sorry for reviving this topic…
While checking something else, this popped up in Google: Libata error messages - ata Wiki
… so the message with the bunch of numbers might get a readable explanation… hurray!

Not so fast with the happiness… even after decoding it is not much help. There is no detailed list, only an explanation of what each “position” in the numbers stands for, not of the values themselves.
But at least it confirms that the issue is somewhere between the chip and the ATA bus, not on the disk.
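
For what it is worth, two fields of the exception line quoted earlier in this thread (`SAct 0x4 SErr 0x400000`) are plain bitmasks and can be unpacked mechanically. The sketch below assumes the SError bit positions from the kernel's `include/linux/ata.h` and the short names libata prints in its own `SError: { ... }` lines; please double-check them against the wiki page before relying on this. The function names `decode_serr` and `active_tags` are made up.

```sh
# decode_serr HEX
# Prints the short names of the bits set in an SErr value
# (bit layout assumed from include/linux/ata.h).
decode_serr() {
    val=$(( $1 ))
    out=""
    for entry in \
        0:RecovData 1:RecovComm 8:UnrecovData 9:Persist 10:Proto \
        11:HostInt 16:PHYRdyChg 17:PHYInt 18:CommWake 19:10B8B \
        20:Dispar 21:BadCRC 22:Handshk 23:LinkSeq 24:TrStaTrns \
        25:UnrecFIS 26:DevExch
    do
        bit="${entry%%:*}"      # part before the colon: bit number
        name="${entry#*:}"      # part after the colon: short name
        if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out }$name"
        fi
    done
    echo "${out:-none}"
}

# active_tags HEX
# Prints the NCQ tag numbers whose bits are set in an SAct value.
active_tags() {
    val=$(( $1 ))
    tag=0
    out=""
    while [ "$tag" -lt 32 ]; do
        if [ $(( (val >> tag) & 1 )) -eq 1 ]; then
            out="${out:+$out }$tag"
        fi
        tag=$(( tag + 1 ))
    done
    echo "${out:-none}"
}
```

With the values from that log, `decode_serr 0x400000` prints `Handshk` (a handshake error on the link) and `active_tags 0x4` prints `2`, which matches the `tag 2` in the `cmd` line of the same log. Both point at the link rather than the disk surface.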

The NCQ fix is just a workaround (normally you should be able to benefit from having NCQ enabled). Apart from that, some users noted that for some drives (over 2.7 TB) there might be a power issue: in some situations the mini board may not be able to feed both drives at the same time when there is a peak (boot-up, sleep on/off…??). Although I have tried several cable setups, I have not tried powering the drives from an external source. I will try it.

let’s use force, Luke!

Any update on this topic? I was a happy user of two WD60EFRX drives in a RAID mirror. Suddenly, after the 5.0.2 or 5.0.3 upgrade, I have the same experience as described here. One HDD still works fine whereas the other one fails… The disk, checked in Windows while connected over USB, works flawlessly. SMART does not point to anything unusual.

I am experiencing the same. After the upgrade to 5.0.x, one of my disks in the RAID is lost again.
asm1062-fix is present and I can see that NCQ is disabled.

root@…# cat /sys/block/sda/device/queue_depth
1
root@…# cat /sys/block/sdb/device/queue_depth
1

No update from my side; I am using just one drive. My last post still stands.
If you have a related kernel message, we can decode it… https://ata.wiki.kernel.org/index.php/Libata_error_messages