Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1001873.1
Update Date: 2011-04-04
Keywords:

Solution Type  Technical Instruction Sure

Solution  1001873.1 :   Sun StorEdge[TM] T3 Array: Understanding Sense Key Errors (Disk Failures); Clearing Disk Errors Using Vol Verify; Performing a Single-Drive Read Operation  


Related Items
  • Sun Storage T3 Array
  • Sun Storage T3+ Array
Related Categories
  • GCS>Sun Microsystems>Storage - Disk>Modular Disk - Other

PreviouslyPublishedAs
202564


Applies to:

Sun Storage T3 Array
All Platforms

Goal

This document provides the following information about the Sun StorEdge[TM] array:
  • Understanding sense key errors (disk failures)

  • Clearing disk errors using Vol Verify

  • Performing a single-drive read operation

Solution

Section I:  Understanding Sense Key Errors (Disk Failures)

The Sun StorEdge[TM] T3 firmware logs several types of disk errors into the syslog file, based on the sense data it receives from the drive.  Not all disk errors require immediate action.  This document describes sense keys that are common and the required action that needs to be taken when such error messages are observed in the Sun StorEdge T3 syslog file.

A. Sense Key = 0x1

The Sense Key = 0x1 error message indicates that the command completed successfully with some recovery action performed by the disk.  When multiple recovered errors occur, the last error that occurred is reported by the additional sense bytes.  Normally, all 0x1 sense keys are soft errors.  Proactive replacement is not required.  The following is a list of common errors observed in T3 syslogs.  Note:  For some Mode settings the last command may have terminated before completing.  

Sense Key = 0x1, Asc = 0x3, Ascq = 0x0
Sense Data Description = Peripheral Device Write Fault

The 0x1/0x3 error is a soft recoverable error that typically occurs in transient environmental conditions, such as vibration or shock.  When the drive is exposed to such a condition, the data is written slightly off the center of the track.  The drive is able to recover the data by adjusting the head in the positive or negative position from the center of the track (but within the guard band).  This condition is fixed when another write to the same sector is completed (again, assuming no sustained vibration or shock effect).

Sense Key = 0x1, Asc = 0x9, Ascq = 0x0
Sense Data Description = Track Following Error

The data tracks on the disk media have elliptical shapes, a result of the imperfect circular rotation of the media.  The drive compensates for this condition by implementing Repeatable Run Out (or RRO) logic.  RRO is part of the drive's servo code that allows the read/write head to follow each track's elliptical shape.  It is possible (because of mechanical/vibration effect) that the head may not be properly centered.  The drive detects such a condition by moving the head's position (+ or - from the center) to read the data.  When the servo logic makes such an adjustment for several sectors on a track, it flags this condition as a Non-Repeatable Run Out (NRRO) and returns the 0x1/0x9 sense key.  This condition is normally recovered during the next write operation on the track.

Sense Key = 0x1, Asc = 0x9, Ascq = 0x1
Sense Data Description = Write Fault - Write Fault Status During Read

This error message indicates that the disk servo failed to allocate the disk head to the correct data track before the write operation.  The disk servo corrects this condition by reallocating the disk head to the correct data track.

Sense Key = 0x1, Asc = 0xc, Ascq = 0x1
Sense Data Description = Write Error - Recovered With Auto Reallocation

This error message is good news.  The message reports that the drive had a write error but it was recovered and data was reallocated.  Normally, when the feedback signal from the write head is not strong, the drive reallocates the data to a spare sector.

Sense Key = 0x1, Asc = 0x15, Ascq = 0x1
Sense Data Description = Mechanical Positioning Error

This error message reports a transient condition.  The drive realizes that the head is not where it is supposed to be, makes the proper adjustments, and reads or writes the data during the next rotation.  The drive has the logic to continuously resynchronize itself.

Sense Key = 0x1, Asc = 0x17, Ascq = 0x1
Sense Data Description = Recovered Data With Retries

This error results from a read exception condition.  The error message reports that the drive was able to recover the data using the normal number of retries.

Sense Key = 0x1, Asc = 0x17, Ascq = 0x2
Sense Data Description = Recovered Data With Positive Head Offset

This error results from a read exception condition.  The error message indicates that the drive was able to recover the data using the normal number of retries.

Sense Key = 0x1, Asc = 0x17, Ascq = 0x3
Sense Data Description = Recovered Data With Negative Head Offset

All 0x1/0x17 sense keys are a result of read exception conditions.  When the disk receives too many of these errors within a short time, it reaches its failure prediction threshold and returns a 0x1/0x5d sense key.

Sense Key = 0x1, Asc = 0x17, Ascq = 0x8
Sense Data Description = Recovered Data w/o ECC-Recommended Rewrite

This error results from a read exception condition.  The error message indicates that the data was recovered but a rewrite is recommended.

Sense Key = 0x1, Asc = 0x18, Ascq = 0x1
Sense Data Description = Recovered Data Using ECC After Normal Retries

This error results from a read condition.  The error message indicates that the drive was able to recover the data using the normal number of retries.

Sense Key = 0x1, Asc = 0x18, Ascq = 0x2
Sense Data Description = Recovered Data - Data Auto-Reallocated

This error results from a read condition.  The error message indicates that the drive was able to recover the data by reallocating the sector.

All 0x1/0x18 sense keys are read conditions.  For Ascq of 0x1, the drive was able to recover the data using the normal number of retries.  Note that the drive can detect and correct up to 156 ECC bits.  For Ascq of 0x2, the drive worked hard to get the data, and decided to reallocate the sector.

Sense Key = 0x1, Asc = 0x5d, Ascq = 0x0
Sense Data Description = Failure Prediction Threshold Exceeded

In this condition, the drive reached its failure prediction threshold, so it is necessary to replace the disk.

None of the above messages, except Sense Key = 0x1, Asc = 0x5d, requires replacement of the disk if the frequency of occurrence is low.  In this document, "low" means once a week.  However, if these messages appear more frequently, especially for the same Logical Block Address (LBA), then replacement is recommended.  A Sense Key = 0x1, Asc = 0x5d requires the replacement of the disk.  The Sun StorEdge T3 controller never fails a drive with this kind of error condition.

B. Sense Key = 0x2

Sense Key = 0x2, Asc = 0x4, Ascq = 0x1
Sense Data Description = Logical Unit Not Ready, Becoming Ready

This error message indicates that the drive is in good condition; it is in the process of spinning up, but it is not yet ready.

Sense Key = 0x2, Asc = 0x4, Ascq = 0x2
Sense Data Description = Logical Unit Not Ready, Init. Cmd Required

This error message indicates that the drive is ready, but it is not spinning.  Generally, the drive needs to be spun up.

Sense Key = 0x2, Asc = 0x4, Ascq = 0x3
Sense Data Description = Logical Unit Not Ready, Manual Intervention is Required.

All 0x2 sense keys indicate that the drive is not in a ready state and cannot be accessed.  Note that Asc = 0x4, Ascq = 0x3 requires manual intervention.  This sense key is a SCSI specification definition and applies more to just a bunch of disks (JBOD).  In a redundant array of independent disks (RAID) controller, such as the Sun StorEdge T3 controller, manual intervention is handled by disabling and reconstructing the failed disk after two failed retries.  There is one exception to the way the T3 firmware handles 0x2 sense keys.  For 0x2/0x4/0x1, the Sun StorEdge T3 firmware does not disable the drive.  When this happens, the recommended action is to run "fru stat" twice.  If the disk is still not ready, then force a disable and reconstruct the drive by using the "vol disable to stand_by."

C. Sense Key = 0x3

Sense Key = 0x3, Asc = 0x11, Ascq = 0x0
Sense Data Description = Unrecovered Read Error
Sense Key = 0x3, Asc = 0x11, Ascq = 0x1
Sense Data Description = Read Retries Exhausted
Sense Key = 0x3, Asc = 0x11, Ascq = 0x2
Sense Data Description = Unrecovered error was detected during Data Read (BCRC error detected by SCSI)
Sense Key = 0x3, Asc = 0x11, Ascq = 0x4
Sense Data Description = Unrecovered Read Error, Auto Reallocation Failed
Sense Key = 0x3, Asc = 0x16, Ascq = 0x0
Sense Data Description = Data Synchronization Mark Missing Or Incorrect

The 0x3 sense key indicates that the command terminated with a non-recovered error condition.  A 0x3 sense key that occurs once or twice a month is not a cause for concern.  The Sun StorEdge T3 firmware, during a normal read or vol verify fix operation (with 2.01.03 or 1.18.2 or higher firmware revisions), corrects the bad sector on the drive by reconstructing the data (assuming a RAID 1+0 or RAID 5 configuration) and writes the data back to the drive.  The drive then writes the data to a spare sector.  When the 0x3 sense key occurs more frequently, it is highly recommended that the drive be replaced.  Also, the list value should be used as a determining factor to replace the disk.  The only exception that requires disk replacement, regardless of the RAID level, is 0x3/0x16/0x0.  This is a fatal condition:  The track is lost and the Sun StorEdge T3 firmware disables the disk.

D. Sense Key = 0x4

Sense Key = 0x4, Asc = 0x15, Ascq = 0x1
Sense Data Description = Mechanical Positioning Error
Sense Key = 0x4, Asc = 0x15, Ascq = 0x2
Sense Data Description = Positioning Error Detected by Read of Media
Sense Key = 0x4, Asc = 0x19, Ascq = 0x0
Sense Data Description = Defect List Error
Sense Key = 0x4, Asc = 0x32, Ascq = 0x0
Sense Data Description = No Defect Spare Sectors Available
Sense Key = 0x4, Asc = 0x32, Ascq = 0x1
Sense Data Description = Defect List (G) Update Failure
Sense Key = 0x4, Asc = 0x3E, Ascq = 0x3
Sense Data Description = Logical Unit Failed Self-Test
Sense Key = 0x4, Asc = 0x3E, Ascq = 0x4
Sense Data Description = Logical Unit Unable to Update Self-Test Result Log

The 0x4 sense key indicates that the disk drive detected a non-recoverable hardware failure while performing the command or during a self-test.  Replace the drive when such messages appear in the syslog file, even though the Sun StorEdge T3 firmware does not normally fail a drive with this kind of error condition.

Sometimes, an ASC code of 0x80 with sense key 0x4 appears.  The 0x80 ASC code is vendor-specific, which in the following case refers to the Seagate drive.  The following provides a description of some of the common 0x80 ASC/ASCQ codes.

ASC   ASCQ  Description
----  ----  -----------
0x80  0x00  General firmware error qualifier
0x80  0x80  FC FIFO error during read transfer
0x80  0x81  FC FIFO error during write transfer
0x80  0x82  Disk FIFO error during read transfer
0x80  0x83  Disk FIFO error during write transfer
0x80  0x84  LBA seeded CRC error on read
0x80  0x85  LBA seeded CRC error on write
0x80  0x86  IOEDC error on read
0x80  0x87  IOEDC error on write
0x80  0x3d  Cache mirror failed

E. Sense Key = 0x5 to 0x9

NOTE: Sense Keys 0x5 to 0x9 are all informational.

Sense Key = 0x5, Asc = 0x25, Ascq = 0x0
Sense Data Description = Logical Unit Number Not supported
Sense Key = 0x5, Asc = 0x26, Ascq = 0x97
Sense Data Description = Invalid Field Parameter - TMS Firmware Tag
Sense Key = 0x5, Asc = 0x26, Ascq = 0x99
Sense Data Description = Invalid Field Parameter - Firmware Tag

The 0x5 sense key is a host/T3 interaction sense key, which indicates that the Sun StorEdge T3 controller received an illegal request from the host.  Sometimes, the host is sending out a command that asks an individual drive to perform a read or write. However, the problem is that no host can talk directly to the individual drives in a hardware RAID environment.  The host can only talk to the Sun StorEdge T3 controller.  Therefore, this error message indicates that the Sun StorEdge T3 controller is returning an "illegal request" statement back to the host, which indicates that the controller doesn't recognize the host's request.  Note: The Sun StorEdge T3 controller never fails a drive with this kind of error condition.

F. Sense Key = 0xb

Sense Key = 0xb, Asc = 0x45, Ascq = 0x0
Sense Data Description = Select/Reselect failure
Sense Key = 0xb, Asc = 0x47, Ascq = 0x0
Sense Data Description = SCSI Parity Error

The 0xb sense key indicates that the disk drive aborted the command and that the initiator might be able to recover by retrying the command.  The Asc/Ascq values of 0x47 and 0x0 imply that a parity error condition occurred on the disk.

Possible causes of parity errors:

  • Bad hardware components on the disk

  • Data corrupted somewhere on the fibre channel loop

T3 Disk Replacement Guidelines

Sun StorEdge T3 firmware revisions 1.18.01 and 2.01.03, or higher, incorporate enhanced vol verify fix functionality to facilitate better error handling.

These versions provide the capability of disabling disk drives for certain error conditions that indicate a disk is failing or is about to fail.

The action taken by these firmware revisions is to disable and reconstruct failed disk drives for each of the sense keys described in Section I.  This action is already embedded in these firmware revisions, and there is no need to proactively replace disk drives.  It is strongly recommended that all Sun StorEdge T3A and T3B arrays be upgraded to firmware revision 1.18.01 or 2.01.03, respectively, or higher.  This document describes the recommended action for different Sense Key error conditions.
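
As a quick reference, the per-sense-key guidance in Section I can be condensed into a lookup.  The following is a minimal sketch in POSIX shell; it is not part of the T3 firmware or of any Sun tool.  The function name sense_action and the wording of the returned strings are assumptions made for illustration.  It takes the Sense Key, Asc, and Ascq values as they appear in the syslog messages and prints the action this document recommends.

```shell
#!/bin/sh
# Sketch only: sense_action is a hypothetical helper, and the strings it
# prints paraphrase this document; they are not firmware output.
sense_action() {
    # Arithmetic expansion accepts the 0x-prefixed hex values from the log
    # and yields decimal, so 0x3E, 0x3e, and 0x03e all compare equal.
    key=$(( $1 )); asc=$(( $2 )); ascq=$(( $3 ))

    case "$key/$asc/$ascq" in
        1/93/*)  # 0x1/0x5d
            echo "replace: failure prediction threshold exceeded" ;;
        1/*)
            echo "monitor: recovered error, replace only if frequent" ;;
        2/4/1)
            echo "check: run fru stat twice, then disable and reconstruct if still not ready" ;;
        2/*)
            echo "none: firmware disables and reconstructs the drive" ;;
        3/22/0)  # 0x3/0x16/0x0
            echo "replace: track lost, firmware disables the disk" ;;
        3/*)
            echo "monitor: unrecovered error, replace if frequent" ;;
        4/*)
            echo "replace: non-recoverable hardware failure" ;;
        5/*|6/*|7/*|8/*|9/*)
            echo "none: informational" ;;
        11/*)    # 0xb
            echo "retry: command aborted, initiator may retry" ;;
        *)
            echo "unknown: not covered by this document" ;;
    esac
}

# Example: the values come straight from a syslog sense line.
sense_action 0x1 0x5d 0x0
```

The more specific patterns are listed before the catch-all pattern for each sense key, because a case statement takes the first match.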

Section II:  Clearing Disk Errors Using Vol Verify

Typically, a drive has latent disk errors that can only be detected when the affected disk sector is accessed.  Normally, this condition results from particles that come between the read/write head and the media.  If too many of these latent errors are on a drive that is a member of a RAID set, the likelihood of a double-disk failure during a reconstruction operation increases.  To reduce the likelihood of double-disk failures, it is recommended that vol verify be run on a regular basis during off-peak hours that are convenient to the customer.

The Sun StorEdge T3A firmware release 1.18.01 and T3B firmware release 2.01.03 (or later) fix the problem with the Unrecovered Read Error during the vol verify fix operation.  When the firmware encounters such an error, it reconstructs the data from the other drives and writes the data back to the drive exhibiting the error, thus forcing Auto-Reallocation to a spare sector.  Both conditions (the Unrecovered Read Error and the Auto-Reallocation) are logged in the syslog file.  The vol verify operation performs the verify function on all of the LBAs in a volume/LUN and does not stop when an error is encountered, as it did with previous Sun StorEdge T3 firmware releases.  The following guidelines can be used to determine if vol verify needs to be run:

  • Monitor the syslog file on a regular basis and look for SCSI I/O errors.  If a high frequency of errors occurs, latent disk errors might be the cause.

  • The vol verify command has two options: verify without the fix option and verify with the fix option.  The vol verify without the fix option generates the XOR checksum of a RAID stripe and compares the checksum with the data stored on the parity drive.  If the data mis-compares or there is a SCSI Unrecovered Read error, then the Sun StorEdge T3B firmware logs the condition into the syslog file.  It is the responsibility of the user to take the next action.

However, when an XOR checksum mis-compare occurs, the vol verify with the fix option fixes the data on the parity drive (the assumption is the data on the data drives is correct).  Also, this option corrects Unrecovered Read errors by reconstructing the data and writing it back to force Auto-Reallocation.

  • If you want to recover or move the data from bad blocks, on an array with firmware before the mentioned versions, you can use the following procedure:

1. Run vol verify [rate <1-8>] and note the LBA of the drive that has the Unrecovered Read error.

2. Convert the LBA number (which is displayed in hex), shown to the right of the Valid Information field, to a decimal number.  This number is equal to the DISK_LBA variable in the OFFSET equation.

3. On the host system, execute the following command:

dd if=/dev/rdsk/cxtxdxs2 of=/dev/null iseek=OFFSET bs=BLK_SIZE count=N


Where:

cxtxdxs2 = The device name for the T3 LUN that owns the drive with the Unrecovered Read error.


For a RAID 5:

         (512) (DISK_LBA - 411009) (N - 1)
OFFSET = --------------------------------- - 1
                     BLK_SIZE


For a RAID 1+0:

         (512) (DISK_LBA - 411009) (N)
OFFSET = -----------------------------
                (2) (BLK_SIZE)

DISK_LBA = The decimal number obtained in step 2.

BLK_SIZE = The T3 block size in bytes, which is set with the sys blocksize command.  This value can only be set to 16K, 32K, or 64K.  For 64K, use 65536 instead of 64000.

N = The number of drives in the RAID-5 set.  For example, N = 8 for a RAID-5 set with eight data drives and one standby drive.

Notes:

  • If (DISK_LBA - 411009) is less than or equal to N + 1, then use the following command:

dd if=/dev/rdsk/cxtxdxs2 of=/dev/null bs=BLK_SIZE count=N
  • Both the RAID-5 and RAID-1+0 calculations can produce a fractional OFFSET value.  Round the OFFSET value up to the next whole number.

  • To ensure that a back-end disk operation takes place, set the cache, rd_ahead, and mirror sys settings to off.  Remember to set these parameters back to their original values after completing this exercise.
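
The OFFSET calculation above can be sketched as a small shell function.  This is an illustration under stated assumptions, not a Sun-supplied tool: the function name t3_dd_offset and its argument order are invented here, "round to the next whole number" is read as rounding up before the RAID-5 "- 1" term is applied, and the small-LBA case from the notes is reported as OFFSET 0 (that is, dd without iseek).  It accepts the LBA in hex, as displayed by vol verify, and performs the decimal conversion of step 2 itself.

```shell
#!/bin/sh
# Sketch only: t3_dd_offset is a hypothetical helper name.
# Usage: t3_dd_offset RAID_LEVEL DISK_LBA_HEX BLK_SIZE_BYTES N
#   RAID_LEVEL is 5 or 10 (for RAID 1+0).
t3_dd_offset() {
    raid=$1
    lba_hex=${2#0x}      # accept the LBA with or without a 0x prefix
    blk=$3               # BLK_SIZE in bytes, for example 65536 for 64K
    n=$4                 # N, the number of drives in the RAID set

    # Step 2: hex to decimal, then subtract the constant from the formula.
    rel=$(( 0x$lba_hex - 411009 ))

    # Note in the procedure: for small values, read from the start instead.
    if [ "$rel" -le $(( n + 1 )) ]; then
        echo 0
        return 0
    fi

    case "$raid" in
        5)  num=$(( 512 * rel * (n - 1) )); den=$blk;           adj=1 ;;
        10) num=$(( 512 * rel * n ));       den=$(( 2 * blk )); adj=0 ;;
        *)  echo "unsupported RAID level: $raid" >&2; return 1 ;;
    esac

    # Integer ceiling of num/den, then the RAID-5 "- 1" adjustment.
    echo $(( (num + den - 1) / den - adj ))
}

# Example: LBA 0x100000 on a RAID-5 set with N = 8 and a 64K block size.
t3_dd_offset 5 0x100000 65536 8
```

The result is the value to pass as iseek=OFFSET in the dd command from step 3.  Integer arithmetic is used throughout because dd only accepts whole block counts.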

Section III:  Performing a Single-Drive Read Operation

Sometimes, a single-drive read operation on a RAID-5 T3 volume is required to monitor drive failures.  The following script allows the user to issue dd read operations to the full drive capacity or to a selected number of LBAs.

#!/usr/bin/ksh
#
# Description: This script performs a dd operation on a single T3 drive in a
# RAID-5 configuration.
#
# Syntax: ./single_dd <drive_position> <max_blocks> <block_size> <raw_device> <n_drives>
# (The argument names are descriptive placeholders for the five parameters
# described below.)
#
# First parameter is the drive position in a raid set <0-n>
#
# Second parameter is the maximum number of blocks, which is obtained from
# format disk->partition->print.
#
# Third parameter is the block size that's defined in the T3.
# No need to add the "k" next to the blocksize.
#
# The fourth parameter is the raw device name. example /dev/rdsk/c2t1d0s2
#
# The fifth parameter is the number of drives in a raid set.
#
# Note: Disk access will not take place 100% of the time if the "cache" and
# "rd_ahead" sys parameters are set. This is OK since the data must have been
# read from cache, which means that the data was recently read from the disk
# and follows the LRU (Least Recently Used) algorithm.
if [ $# -ne 5 ]; then
    echo "USAGE: $0 <drive_position> <max_blocks> <block_size> <raw_device> <n_drives>" >/dev/tty 2>&1
    exit 1
fi
if [ ! -c "$4" ]; then
    echo "$4: character device not found" > /dev/tty 2>&1
    exit 2
fi
if [ "$1" -lt 0 -o "$1" -gt 8 ]; then
    echo "$1: invalid drive position" > /dev/tty 2>&1
    exit 3
fi
let n_drives=$5           # number of disks in set
let bs=$3*1024            # block size (bytes)
let iseek=$1              # starting block on drive
let loop=$2*512           # bytes / raid set
let loop=$loop/$bs        # blocks / raid set
let loop=$loop/$n_drives  # blocks / drive
while [ $loop -gt 0 ]; do
    # Read one block, then check the exit status of dd.
    dd if="$4" of=/dev/null iseek=$iseek bs=$bs count=1
    if [ $? -ne 0 ]; then
        echo "$4: i/o error"
        exit 3
    fi
    let iseek=$iseek+$n_drives
    let loop=$loop-1
done
exit 0
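
Before running the script against a live volume, it can help to see which blocks it would read.  The following dry-run sketch repeats the script's loop arithmetic but only prints the iseek values instead of calling dd.  The function name single_dd_plan and its argument order (the raw device is omitted because no device is touched) are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: single_dd_plan is a hypothetical helper that mirrors the
# arithmetic of the single_dd script without reading any device.
# Usage: single_dd_plan DRIVE_POSITION MAX_BLOCKS BLOCK_SIZE_K N_DRIVES
single_dd_plan() {
    pos=$1
    blocks=$2
    bs_k=$3
    n_drives=$4

    bs=$(( bs_k * 1024 ))                        # block size in bytes
    loop=$(( blocks * 512 / bs / n_drives ))     # reads issued per drive
    iseek=$pos                                   # first block to read

    while [ "$loop" -gt 0 ]; do
        echo "$iseek"                            # block single_dd would read
        iseek=$(( iseek + n_drives ))
        loop=$(( loop - 1 ))
    done
}

# Example: drive position 2, 2048 disk blocks, 64K block size, 4 drives.
single_dd_plan 2 2048 64 4
```

Each printed value corresponds to one dd invocation with count=1 in the script, so the number of output lines is the number of reads the script would issue.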




Refer to Quick Troubleshooting Guide Doc (QTSG): T3 Sense Key Errors and Replacement Guidelines at:
http://tsc-storage.us.oracle.com/products/T3/documentation.html



  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.