What's wrong when messages like "UnrecovData 10B8B BadCRC" and "failed command: READ FPDMA QUEUED" appear on Linux?

I keep seeing messages like the following in dmesg, with the sequence "exception Emask 0x10" -> "SError: { UnrecovData 10B8B BadCRC }" -> "failed command: READ FPDMA QUEUED" -> "hard resetting link":

[ 7395.936692] ata4.00: exception Emask 0x10 SAct 0xe000 SErr 0x280100 action 0x6 frozen
[ 7395.936701] ata4.00: irq_stat 0x08000000, interface fatal error
[ 7395.936703] ata4: SError: { UnrecovData 10B8B BadCRC }
[ 7395.936705] ata4.00: failed command: READ FPDMA QUEUED
[ 7395.936708] ata4.00: cmd 60/00:68:20:7c:b0/04:00:7c:00:00/40 tag 13 ncq 524288 in
         res 40/00:78:20:84:b0/00:00:7c:00:00/40 Emask 0x10 (ATA bus error)
[ 7395.936709] ata4.00: status: { DRDY }
[ 7395.936710] ata4.00: failed command: READ FPDMA QUEUED
[ 7395.936712] ata4.00: cmd 60/00:70:20:80:b0/04:00:7c:00:00/40 tag 14 ncq 524288 in
         res 40/00:78:20:84:b0/00:00:7c:00:00/40 Emask 0x10 (ATA bus error)
[ 7395.936713] ata4.00: status: { DRDY }
[ 7395.936714] ata4.00: failed command: READ FPDMA QUEUED
[ 7395.936715] ata4.00: cmd 60/e0:78:20:84:b0/03:00:7c:00:00/40 tag 15 ncq 507904 in
         res 40/00:78:20:84:b0/00:00:7c:00:00/40 Emask 0x10 (ATA bus error)
[ 7395.936716] ata4.00: status: { DRDY }
[ 7395.936719] ata4: hard resetting link
[ 7396.241404] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 7396.243180] ata4.00: configured for UDMA/133
[ 7396.243195] ata4: EH complete
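
The "cmd" lines above can be unpacked to see which sectors each failed read covered. What follows is a minimal sketch, not a libata tool. It assumes the field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and that for NCQ commands the sector count sits in the feature fields and the tag in the upper bits of nsect. The function name decode_ncq_cmd is made up for illustration.

```python
import re

# Assumed layout of a libata "cmd" field (see lead-in):
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
FIVE_BYTES = r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
CMD_RE = re.compile(r"cmd ([0-9a-f]{2})/" + FIVE_BYTES + "/" + FIVE_BYTES + r"/([0-9a-f]{2})")


def decode_ncq_cmd(line, sector_size=512):
    """Decode tag, starting LBA and length from one libata 'cmd' line (NCQ only)."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd field found")
    command = int(m.group(1), 16)
    if command not in (0x60, 0x61):  # READ / WRITE FPDMA QUEUED
        raise ValueError("not an NCQ read/write command")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # NCQ carries the 16-bit sector count in the feature fields; 0 means 65536.
    sectors = ((hob_feature << 8) | feature) or 65536
    # 48-bit LBA, most significant byte first.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "tag": nsect >> 3,  # the tag lives in bits 7:3 of the count field
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


if __name__ == "__main__":
    line = "ata4.00: cmd 60/00:68:20:7c:b0/04:00:7c:00:00/40 tag 13 ncq 524288 in"
    print(decode_ncq_cmd(line))
```

Applied to the three failed commands above, this gives tags 13, 14 and 15 and byte counts of 524288, 524288 and 507904, which agrees with the "tag" and "ncq" values the kernel already prints. The three reads are back to back (each starts 1024 sectors after the previous one), so they look like one large sequential read that was in flight when the link error hit.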

These seem to be disk-related warnings/errors.

ata4 seems to be the disk sdd:

# ls -l /sys/class/block/ | grep ata4
lrwxrwxrwx 1 root root 0 May  5 15:04 sdd -> ../../devices/pci0000:00/0000:00:1f.2/ata4/host3/target3:0:0/3:0:0:0/block/sdd
lrwxrwxrwx 1 root root 0 May  5 15:04 sdd1 -> ../../devices/pci0000:00/0000:00:1f.2/ata4/host3/target3:0:0/3:0:0:0/block/sdd/sdd1
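
The same lookup can be scripted. The sketch below assumes the sysfs layout shown in the listing above, where each entry under /sys/class/block is a symlink whose target contains an "ataN" path component. The helper names ata_port_of and block_devices_on are made up for illustration. Unlike "grep ata4", it matches the path component exactly, so a port such as ata40 would not be picked up by mistake.

```python
import os
import re


def ata_port_of(link_target):
    """Return the 'ataN' component of a sysfs link target, or None if absent."""
    for part in link_target.split("/"):
        if re.fullmatch(r"ata\d+", part):
            return part
    return None


def block_devices_on(port, sysfs_dir="/sys/class/block"):
    """List block device names whose sysfs symlink passes through the given ATA port."""
    names = []
    for name in sorted(os.listdir(sysfs_dir)):
        path = os.path.join(sysfs_dir, name)
        if os.path.islink(path) and ata_port_of(os.readlink(path)) == port:
            names.append(name)
    return names


if __name__ == "__main__":
    # On the machine from the question this should list sdd and sdd1.
    print(block_devices_on("ata4"))
```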

Running smartctl on sdd shows:

# smartctl -a /dev/sdd
smartctl 6.2 2014-07-16 r3952 [x86_64-linux-3.17.8-200.fc20.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-1CH164
Serial Number:    .........
LU WWN Device Id: 5 000c50 072c9ba25
Firmware Version: CC29
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue May  5 15:08:13 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  600) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 225) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x3085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       229791872
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21876240
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3330
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       6
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       3 3 3
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   064   045    Old_age   Always       -       30 (Min/Max 29/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   060   060   000    Old_age   Always       -       81321
194 Temperature_Celsius     0x0022   030   040   000    Old_age   Always       -       30 (0 22 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       5
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2784h+08m+57.117s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       8817986301
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       18410704178

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The results seem fine. What do the dmesg messages indicate is wrong on this server?

asked May 5, 2015 by Eric Z Ma (44,280 points)

1 Answer

 
Best answer

I found some good sources of information on ATA error messages, along with some useful tips.

Understanding these messages

DRDY: Device ready. Normally 1, when all is OK.
UnrecovData: Data integrity error occurred, interface did not recover
BadCRC: Link layer CRC error occurred

Reference: Libata error messages
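
The names inside "SError: { ... }" are a decoded form of the SErr value on the exception line. As a rough sketch, the table below follows the SATA SError register bit layout, with the short names libata prints as far as I know. Treat the name strings as an assumption and check them against your kernel source. The function name decode_serr is made up for illustration.

```python
# Bit position -> short name, following the SATA SError register layout.
# The name strings are assumed to match what libata prints.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(serr):
    """Return the names of all bits set in an SErr value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if serr & (1 << bit)]


if __name__ == "__main__":
    print(decode_serr(0x280100))  # ['UnrecovData', '10B8B', 'BadCRC']
```

For the question's "SErr 0x280100" this yields UnrecovData, 10B8B and BadCRC, the same three flags the kernel printed on the SError line.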

Possible reasons

An example:

ata3.00: exception Emask 0x50 SAct 0x1 SErr 0x280900 action 0x6 frozen
ata3.00: irq_stat 0x08000000, interface fatal error
ata3: SError: { 10B8B BadCRC }

(The SError line often also includes DisPar, UnrecovData, and/or HostInt.)

From an expert:

"Your machine seems to be suffering genuine link layer problem. In most cases, this indicates hardware problem and in my experience, common causes are (in the order of ballpark frequency)...

# inadequate power supply
# device and controller don't like each other on 3Gbps
# cable too long or flaky connector (especially with eSATA cables or genders or backplanes)
# faulty controller or drive"

-- tejun ( http://lkml.org/lkml/2008/12/2/426 ) (written by one of the foremost experts)

The presence of BadCRC is a pretty good indicator of a poor quality SATA cable. However, if a better cable does not solve the issue, then it is probably a power problem (loose power cable or backplane connection, poor connectors, poor power splitter, overloaded power supply, too many drives on power rail, bad power supply, etc).
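
One way to test the cable theory is to watch SMART attribute 199 (UDMA_CRC_Error_Count), which counts CRC errors seen on the interface. In the smartctl output from the question its raw value is 5. If it keeps rising after the cable is swapped, the cable is probably not the cause. The sketch below pulls a raw value out of smartctl's attribute table. It assumes the ten-column layout shown in the question, and the function name smart_raw_value is made up for illustration.

```python
def smart_raw_value(smartctl_output, attr_id):
    """Return the first token of RAW_VALUE for a SMART attribute ID as an int.

    Returns None if the attribute is not listed. Attributes whose raw value
    is not a plain number (e.g. Head_Flying_Hours) raise ValueError.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0] == str(attr_id):
            return int(fields[9])
    return None


if __name__ == "__main__":
    import subprocess

    # Needs root and smartmontools; /dev/sdd is the disk from the question.
    out = subprocess.run(
        ["smartctl", "-A", "/dev/sdd"], capture_output=True, text=True
    ).stdout
    print("UDMA_CRC_Error_Count:", smart_raw_value(out, 199))
```

Recording the value before and after swapping the cable, with some disk load in between, gives a simple before/after comparison.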

Reference: The Analysis of Drive Issues

answered May 5, 2015 by Eric Z Ma (44,280 points)


Copyright © SysTutorials. User contributions licensed under cc-wiki with attribution required.

...