Document Audience:INTERNAL
Document ID:I0966-1
Title:Best practices guidelines are available for StorEdge T3/T3+ arrays which encounter "disk error 03" messages.
Copyright Notice:Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date:2003-10-06

---------------------------------------------------------------------
            - Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                        FIELD INFORMATION NOTICE
               (For Authorized Distribution by SunService)
FIN #: I0966-1
Synopsis: Best practices guidelines are available for StorEdge T3/T3+ arrays which encounter "disk error 03" messages.
Create Date: May/07/03
SunAlert: No
Top FIN/FCO Report: No
Products Reference: StorEdge T3/T3+ Array
Product Category: Storage / Service
Product Affected: 
Systems Affected:
-----------------  
Mkt_ID    Platform    Model    Description                 Serial Number
------    --------    -----    -----------                 -------------
  -        Anysys      ALL     System Platform Independent       -


X-Options Affected:
-------------------
Mkt_ID     Platform   Model   Description            Serial Number
------     --------   -----   -----------            -------------
  -          T3        ALL    T3 StorEdge Array            -
  -          T3+       ALL    T3+ StorEdge Array           -


Part Number     Description      Model
-----------     -----------      -----
     -               -             -
References: 
URL: http://webhome.sfbay/spiroux/fin_releases/T3Error_021803.html
Issue Description: 
Sun StorEdge T3/T3+ "disk error 03" messages are an indication that an
operation within the array did not complete successfully.  Failure to
properly investigate the cause of these errors could result in
incorrect resolution for T3/T3+ array issues, leading to unnecessary
system downtime.

This issue affects any Sun StorEdge T3 (T3A) or T3+ (T3B) array which
has experienced an operational error, indicated by "disk error 03"
displayed in the syslog file.

A "disk error 3" is an informational notice.  It indicates that
additional investigation is required and that further action may be
necessary.  That action could include monitoring the health of the
array or replacing parts.  These errors are logged to the array
syslog file in the format shown in the examples under OPERATOR
GUIDELINES below.

Implementation: 
         ---
        |   |   MANDATORY (Fully Proactive)
         ---

         ---
        | X |   CONTROLLED PROACTIVE (per Sun Geo Plan)
         ---

         ---
        |   |   REACTIVE (As Required)
         ---
Corrective Action: 
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the
above-mentioned problem.

It is important that StorEdge T3/T3+ disk errors are monitored on a 
regular basis and that proper action be taken based on the guidelines 
provided below:

1. Monitor disk errors in syslog.

   Periodically run "vol verify" to minimize the possibility of
   encountering media-related double faults during a data reconstruction.
   The recommended period is once a month.  T3 firmware version 1.18.1
   and T3+ firmware version 2.1.3 incorporate enhanced "vol verify"
   functionality to facilitate this.  These versions also handle disk
   errors more intelligently: they disable a disk for certain error
   conditions that indicate that the disk is failing or is about to fail.

2. Sense error codes and the corresponding recommendations:
   
   Note that "REPLACE" in the following table means:

   For T3 firmware version 1.18.1 and above, and for T3+ firmware version
   2.1.3 and above, the system will automatically disable the drive and
   the drive will be ready for removal.

   For firmware versions earlier than the above, manual checking of error
   codes in the syslog file and manual disabling of the drive are required
   before it can be replaced.

 =====================================================================
|                  |                                                  |
|Sense Error Codes |  Action Required                                 |      
|(with exceptions) |                                                  |
|==================+==================================================|
| 01/5d/xx         |  REPLACE.                                        |
|                  |  The failure of this drive is imminent. It is    |
|                  |  recommended to back up the data from this       |
|                  |  drive, if possible, and replace the drive as    |
|                  |  soon as possible.                               |
|                  |                                                  |
|------------------+--------------------------------------------------|
| 02/04/01         |  Validate (Is it in process of becoming ready?)  |
|                  |  Run "fru stat" twice with a one minute interval.|
|                  |  If the drive is still in the "NOT READY" state, |
|                  |  REPLACE the drive.                              |
| For any other    |                                                  |
| 02/xx/xx         |  REPLACE                                         |
|------------------+--------------------------------------------------|
| 03/11/xx         | If in a RAID-1 or RAID-5 configuration, the RAID |
|                  | controller (or manager) will reconstruct the     |
|                  | data from the remaining disks in the volume and  |
|                  | write it back to the failed LBA.  This will cause|
|                  | the drive to automatically replace the failed    |
|                  | LBA.  If RAID-0, or not in a RAID config, replace|
|                  | the drive.                                       |
| For any other    |                                                  |
| 03/xx/xx         | REPLACE                                          |
|------------------+--------------------------------------------------|
|                  |                                                  |
| 04/xx/xx         | REPLACE. Hardware failure.                       |
 =====================================================================
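
The table and the firmware note above can be sketched as a small lookup.
This is a minimal illustration in Python, not part of any T3 tool.  The
function names, the raid_level parameter, and the returned labels
(including "AUTO-REPAIR" and "NOT LISTED") are hypothetical and were
chosen only for this sketch.

```python
# Minimum firmware that disables a failing drive automatically,
# taken from the note above the table.
AUTO_DISABLE_MIN = {"T3": (1, 18, 1), "T3+": (2, 1, 3)}


def parse_version(text):
    """Turn a dotted version such as '1.18.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def firmware_auto_disables(model, version):
    """True if this firmware disables the drive itself (model is 'T3' or 'T3+')."""
    return parse_version(version) >= AUTO_DISABLE_MIN[model]


def recommend_action(sense_key, asc, ascq, raid_level=None):
    """Map a sense key/ASC/ASCQ triple to the action in the table.

    raid_level is 0, 1, 5, or None when the drive is not in a RAID
    configuration (hypothetical parameter for this sketch).
    """
    if sense_key == 0x01:
        # Only 01/5d/xx is listed in the table.
        return "REPLACE" if asc == 0x5D else "NOT LISTED"
    if sense_key == 0x02:
        # 02/04/01: check "fru stat" twice, one minute apart, before replacing.
        if (asc, ascq) == (0x04, 0x01):
            return "VALIDATE"
        return "REPLACE"
    if sense_key == 0x03:
        # 03/11/xx is repaired by the controller in RAID-1 and RAID-5 only.
        if asc == 0x11:
            return "AUTO-REPAIR" if raid_level in (1, 5) else "REPLACE"
        return "REPLACE"
    if sense_key == 0x04:
        return "REPLACE"
    return "NOT LISTED"


print(recommend_action(0x03, 0x11, 0x00, raid_level=5))
print(firmware_auto_disables("T3+", "2.1.2"))
```

A "REPLACE" result still has to be read together with the firmware
check: on firmware older than the versions listed, the drive must be
disabled manually before it is removed.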


OPERATOR GUIDELINES:
====================

In general, a "disk error 0x3" displayed in the T3 syslog file is
handled and recovered by the T3 firmware itself.  However, multiple
"disk error 0x3" messages might indicate that one of the FRUs is not
working properly or is defective.  This determination cannot be made
without checking the health of the whole system.

"disk error 3" messages in the syslog are generally preceded by the
sense error codes, which specify the reason for the failure of a
particular disk.

  Example 1:

     Jul 12 23:52:11 ISR1[1]: W: u1d6 SCSI Disk Error Occurred 
                              (path = 0x0)
     Jul 12 23:52:11 ISR1[1]: W: Sense Key = 0x3, Asc = 0x11, 
                              Ascq = 0x0
     Jul 12 23:52:11 ISR1[1]: W: Sense Data Description = Unrecovered 
                              Read Error
     Jul 12 23:52:11 ISR1[1]: W: Valid Information = 0x68fb4
     Jul 12 23:52:11 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x3
     Jul 12 23:52:11 ISR1[1]: N: u1d6 sid 148 stype 1001 disk error 3

This error, Sense Key = 0x3, Asc = 0x11, is recoverable in a RAID-1 or
RAID-5 configuration (see the 03/11/xx entry in the table above).  No
operator action is required in that case.
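
The sense fields in a message block like Example 1 could be extracted
with a short script.  The following is a rough sketch in Python that
assumes the syslog text uses exactly the wording shown in the example.
The regular expression and the function name are hypothetical, and a
disk error that is logged without sense data would be paired with the
next sense line that follows it.

```python
import re

# Matches the drive name on the "SCSI Disk Error Occurred" line and the
# sense fields that follow it, even when the message is wrapped.
SENSE_RE = re.compile(
    r"(u\d+d\d+) SCSI Disk Error Occurred.*?"
    r"Sense Key = (0x[0-9a-fA-F]+),\s*"
    r"Asc = (0x[0-9a-fA-F]+),\s*"
    r"Ascq = (0x[0-9a-fA-F]+)",
    re.DOTALL,
)


def extract_sense(text):
    """Return a list of (drive, sense_key, asc, ascq) tuples."""
    return [
        (drive, int(key, 16), int(asc, 16), int(ascq, 16))
        for drive, key, asc, ascq in SENSE_RE.findall(text)
    ]


sample = """\
Jul 12 23:52:11 ISR1[1]: W: u1d6 SCSI Disk Error Occurred
                         (path = 0x0)
Jul 12 23:52:11 ISR1[1]: W: Sense Key = 0x3, Asc = 0x11,
                         Ascq = 0x0
"""

for drive, key, asc, ascq in extract_sense(sample):
    print(f"{drive}: {key:02x}/{asc:02x}/{ascq:02x}")
```

The printed triple uses the same key/asc/ascq notation as the sense
error code table, so it can be looked up there directly.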

  Example 2:

     Jul 24 07:50:20 ISR1[1]: N: u2d8 SCSI Disk Error Occurred 
                              (path = 0x1)
     Jul 24 07:50:20 ISR1[1]: N: Sense Key = 0x1, Asc = 0x5d, 
                              Ascq = 0x0
     Jul 24 07:50:20 ISR1[1]: N: Sense Data Description = Failure 
                              Prediction Threshold Exceeded
     Jul 24 07:50:20 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x3
     Jul 24 07:50:20 ISR1[1]: N: u1d6 sid 148 stype 1001 disk error 3

In this case, many "disk error 3" messages may be expected.

This error, Sense Key = 0x1, Asc = 0x5d, indicates that failure of the
disk is imminent (see the 01/5d/xx entry in the table above).  Replace
the disk.
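
Because repeated "disk error 3" messages for one drive are a prompt to
check the health of the system, it can help to count them per drive.
The sketch below, in Python, assumes the message wording shown in the
examples above.  The function names and the threshold are hypothetical;
this notice does not define a numeric threshold.

```python
import re
from collections import Counter

# Matches lines such as "... N: u1d6 sid 148 stype 1001 disk error 3".
DISK_ERROR_RE = re.compile(r"\b(u\d+d\d+) sid \d+ stype \d+ disk error 3\b")


def count_disk_error_3(lines):
    """Count "disk error 3" messages per drive."""
    counts = Counter()
    for line in lines:
        match = DISK_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def drives_over_threshold(counts, threshold):
    """Drives with at least `threshold` messages (threshold is an assumption)."""
    return sorted(drive for drive, n in counts.items() if n >= threshold)


log_lines = [
    "Jul 12 23:52:11 ISR1[1]: N: u1d6 SVD_DONE: Command Error = 0x3",
    "Jul 12 23:52:11 ISR1[1]: N: u1d6 sid 148 stype 1001 disk error 3",
    "Jul 12 23:52:15 ISR1[1]: N: u1d6 sid 148 stype 1001 disk error 3",
]

counts = count_disk_error_3(log_lines)
print(dict(counts))
print(drives_over_threshold(counts, threshold=2))
```

A high count only identifies which drive to investigate; the sense
codes and the "vol stat" and "fru stat" output still determine the
action.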

  Example 3:

    While I/O is in progress, "Disk error 0x3" messages could be displayed
    repeatedly when a disk becomes bad and is disabled by the T3 firmware.
    Note state 4D for u1d1 in the "vol stat" output below, and the
    fault/disabled status of u1d1 in the "fru stat" output.  In this
    case, replace the bad/disabled disk.

      hws26-118:/etc:<113>vol stat
 
         v1            u1d1   u1d2   u1d3   u1d9
         mounted        4D     0      0      0
         v2            u2d1   u2d2   u2d3
         mounted        0      0      0

    OR

      hws26-118:/etc:<117>fru stat

         CTLR    STATUS   STATE       ROLE        PARTNER    TEMP
         ------  -------  ----------  ----------  -------    ----
         u1ctr   ready    enabled     master      u2ctr      30.5
         u2ctr   ready    enabled     alt master  u1ctr      30.5


    ----------------------------------------------------------------------
   | DISK | STATUS | STATE    | ROLE      | PORT1 | PORT2 | TEMP | VOLUME |
   |      |        |          |           |       |       |      |        |
   |======+========+==========+===========+=======+=======+======+========|
   | u1d1 | fault  | disabled | data disk | bypass| bypass| -    | v1     |
   | u1d2 | ready  | enabled  | data disk | ready | ready | 34   | v1     |
   | u1d3 | ready  | enabled  | data disk | ready | ready | 38   | v1     |
   | u1d4 | ready  | enabled  | unassigned| ready | ready | 30   | -      |
   | u1d5 | ready  | enabled  | unassigned| ready | ready | 34   | -      |
   | u1d6 | ready  | enabled  | unassigned| ready | ready | 38   | -      |
   | u1d7 | ready  | enabled  | unassigned| ready | ready | 37   | -      |
   | u1d8 | ready  | enabled  | unassigned| ready | ready | 36   | -      |
   | u1d9 | ready  | enabled  | standby   | ready | ready | 30   | v1     |
    ----------------------------------------------------------------------
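
The check in Example 3 can be sketched as a small parser for captured
"vol stat" output.  This Python sketch assumes the text contains only
the volume/status line pairs shown above, with no prompt line, and it
flags any drive state other than 0.  That rule is an assumption: this
notice documents only state 4D.  The function name is hypothetical.

```python
def flagged_drives(vol_stat_text):
    """Return (volume, drive, state) for every drive whose state is not '0'."""
    rows = [line.split() for line in vol_stat_text.splitlines() if line.strip()]
    flagged = []
    # Rows come in pairs: "v1 u1d1 u1d2 ..." then "mounted 4D 0 ...".
    for header, status in zip(rows[0::2], rows[1::2]):
        volume, drives = header[0], header[1:]
        states = status[1:]  # status[0] is the volume status, e.g. "mounted"
        for drive, state in zip(drives, states):
            if state != "0":
                flagged.append((volume, drive, state))
    return flagged


vol_stat_output = """\
v1            u1d1   u1d2   u1d3   u1d9
mounted        4D     0      0      0
v2            u2d1   u2d2   u2d3
mounted        0      0      0
"""

print(flagged_drives(vol_stat_output))
```

Any drive reported here should be confirmed against the "fru stat"
output before it is replaced.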
Comments: 
None

============================================================================
Implementation Footnote: 
i)   In case of MANDATORY FINs, Sun Services will attempt to contact   
     all affected customers to recommend implementation of the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Sun Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Sun Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.central/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.central/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://spe.sun.com
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------
Status: active