Document Audience:INTERNAL
Document ID:I0670-1
Title:D1000 ESM board replacement procedure on A3x00/A3500FC
Copyright Notice:Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date:2001-06-15

---------------------------------------------------------------------
- Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------  
                            FIELD INFORMATION NOTICE
                  (For Authorized Distribution by SunService)
FIN #: I0670-1
Synopsis: D1000 ESM board replacement procedure on A3x00/A3500FC
Create Date: Jun/15/01
Keywords: 

D1000 ESM board replacement procedure on A3x00/A3500FC

SunAlert: No
Top FIN/FCO Report: No
Products Reference: D1000 ESM Boards
Product Category: Storage / Documentation
Product Affected: 
Mkt_ID   Platform   Model   Description                    Serial Number
------   --------   -----   -----------                    -------------
Systems Affected
----------------
  -      ANYSYS       -     System Platform Independent           -

X-Options Affected
------------------
  -                A3500 -   A3500 Storage Array                  -
SG-XARY351A-180G     -   -   A3500 1 CONT MOD/5 TRAYS/18GB        -
SG-XARY353A-1008G    -   -   A3500 2 CONT/7 TRAYS/18GB            -
SG-XARY353A-360G     -   -   A3500 2 CONT/7 TRAYS/18GB            -
SG-XARY355A-2160G    -   -   A3500 3 CONT/15 TRAYS/18GB           -
SG-XARY360A-545G     -   -   545-GB A3500 (1X5X9-GB)              -
SG-XARY360A-90G      -   -   A3500 1 CONT/5 TRAYS/9GB(10K)        -
SG-XARY362A-180G     -   -   A3500 2 CONT/7 TRAYS/9GB(10K)        -
SG-XARY362A-763G     -   -   A3500 2 CONT/7 TRAYS/9GB(10K)        -
SG-XARY364A-1635G    -   -   A3500 3 CONT/15 TRAYS/9GB(10K)       -
SG-XARY366A-72G      -   -   A3500 1 CONT/2 TRAYS/9GB(10K)        -
SG-XARY380A-1092G    -   -   1092-GB A3500 (1x5x18-GB)            -
SG-XARY360B-90G      -   -   ASSY TOP OPT 1X5X9 MIN 9GB 10K       -
SG-XARY360B-545G     -   -   ASSY TOP OPT 1X5X9 MAX 9GB 10K       -
SG-XARY362B-180G     -   -   X-OPT 2X7X9 MIN FCAL 9GB 10K         -
SG-XARY374B-273G     -   -   ASSY TOP OPT 3X15X9 MIN 9GB 10K      -
SG-XARY380B-182G     -   -   X-OPT FC-SN 1X5X18MIN 18GB 10K       -
SG-XARY380B-1092G    -   -   ASSY FC-SNL 1X5X18MAX 18GB 10K       -
SG-XARY382B-364G     -   -   ASSY FC-SN 2X7X18 MIN 18GB 10K       -
SG-XARY384B-546G     -   -   ASSY FC 3X15X18 MIN 18GB             -
SG-XARY381B-364G     -   -   ASSY FC-SN 1X5X36 MIN 36GB 10K       -
SG-XARY381B-1456G    -   -   ASSY FC-SN 1X5X36 MAX 36GB 10K       -
SG-XARY383B-728G     -   -   ASSY FC-SN 2X7X36 MIN 36GB 10K       -
SG-XARY385B-1092G    -   -   ASSY FC-SN 3X15X36 MIN 36GB 10K      -
UG-A3500-FC-545G     -   -   ASSY TOP OPT 1X5X9 MAX 9GB 10K       -
CU-A3500-FC-545G     -   -   ASSY TOP OPT 1X5X9 MAX 9GB 10K       -
UG-A3500FC-182-10K   -   -   FCTY A3500FC/SCSI 1X5X18 MIN 18/10K  -
CU-A3500FC-182-10K   -   -   FCTY A3500FC/SCSI 1X5X18 MIN 18/10K  -
UG-A3500FC-364-10K   -   -   FCTY A3500FC/SCSI 2X7X18 MIN 18/10K  -
CU-A3500FC-364-10K   -   -   FCTY A3500FC/SCSI 2X7X18 MIN 18/10K  -
UG-A3500FC-546-10K   -   -   FCTY A3500FC/SCSI 3X15X18 MIN 18G10K -
CU-A3500FC-546-10K   -   -   FCTY A3500FC/SCSI 3X15X18 MIN 18G10K -
UG-A3500-A3500FC     -   -   ASSY UPGRADEA3500FC/DILBERT          -
X6538A               -   -   X-OPT A3500FC CONTROLLER             -
6538A                -   -   FCTY CONTROLLER A3500FC              -
Parts Affected: 
Part Number   Description                             Model
-----------   -----------                             -----
798-0522-02   RAID Manager6.1.1 Update 1                -
798-0522-03   RAID Manager6.1.1 Update 2                -
704-6708-10   CD SUN STOREDGE RAID Manager6.22         -
References: 
BugId:  4369971 - Power cycle of A3500 controller module is required after 
                  D1000 tray repair. 
        4382087 - Incorrect RM6 recovery guru procedure for ESM card 
                  replacement.
        4295842 - All drives fail in a d1000 tray of A3500.
        4306903 - A3x00 turns on amber LED of drives in the same drive grp 
                  in D1000s (fail drv).
        4309564 - hotspare kicked in for a dead lun.
        4309556 - Cannot revive luns/fail drives after power up of D1000 
                  tray in A3500.
        4307364 - module profile & serial port information conflict.
 
ESC:    524511

Manual: 805-2624-10: A1000/D1000 Installation, Operations, and 
                     Service Manual - Page 26.
        806-6419-11: Sun StorEdge A3x00/A3500FC Best Practices Guide.
Issue Description: 
The replacement procedure for the SCSI controller board (ESM card) in a
D1000 disk array is not well-documented.  If the proper procedure is
not followed, complications can arise which might lead to data loss.
This FIN describes correct ESM replacement procedures and gives
examples of additional problems which might occur as a result of an 
ESM card failure and its subsequent replacement. 

Replacement of the D1000 ESM card may result in drives being "failed"
in one or more drive trays of an A3x00/A3500FC array.  A failure of an
ESM card takes the entire D1000 tray offline, so all disks in that
tray remain offline until the ESM card is replaced.

The ESM card replacement procedure requires a power-cycle of the
affected D1000 tray.  It is possible that marginal disk drives within
the power-cycled D1000 tray could fail at this point.  If the entire
array is power-cycled, the risk of losing marginal disk drives within
other D1000 trays is increased.  This could potentially result in a
double disk failure of a RAID 5 LUN which, by definition, causes all
data in that LUN to be destroyed.
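
The double-failure risk described above can be illustrated with a
small sketch.  This is a hypothetical Python model, not an RM6 tool:
the function name, the drive labels, and the 4+1 layout across five
trays are assumptions made for illustration only.  It classifies a
RAID 5 LUN by how many of its member drives have failed.

```python
def raid5_lun_state(member_drives, failed_drives):
    """Classify a RAID 5 LUN from its members and the set of failed drives.

    RAID 5 tolerates one failed member; a second failed member loses the LUN.
    """
    failed_members = [d for d in member_drives if d in failed_drives]
    if not failed_members:
        return "optimal"
    if len(failed_members) == 1:
        return "degraded"
    return "dead"


# Hypothetical 4+1 LUN with one member drive in each of five D1000 trays.
lun = ["tray1-d0", "tray2-d0", "tray3-d0", "tray4-d0", "tray5-d0"]

# An ESM failure takes tray 3 offline: one member is lost.
print(raid5_lun_state(lun, {"tray3-d0"}))              # degraded

# A marginal drive in another tray then fails during a full-array
# power cycle: a second member is lost.
print(raid5_lun_state(lun, {"tray3-d0", "tray5-d0"}))  # dead
```

The second case is why only the affected D1000 tray should be
power-cycled: every additional tray that is power-cycled is another
chance to lose a second member of an already degraded LUN.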

Below are some of the possible problems that could arise as a result 
of an ESM card failure:
 
   . All drives fail in a D1000 tray of an A3x00/A3500FC array.  If the 
     failure is reported as a non-media component failure specified by 
     a 3FC7 event then it is most likely a termination failure.  
     By default, the drives in this tray are "failed" for this error.

   . Multiple disks in different D1000 trays fail in the A3x00/A3500FC
     array.  If the A3x00/A3500FC controller module is power-cycled or
     reset while reconstruction is in progress, disks may be off-lined
     on a LUN-by-LUN basis.

   . One cannot revive LUNs or fail drives after power up of the D1000
     tray.  This is because only one reconstruction can run per active
     controller at a time.

   . Sometimes dead LUNs are seen as either degraded or in reconstruction
     mode.  The file system could be corrupted or a degraded LUN with no
     hotspare is indicated as reconstructing.

Various types of behavior are seen when a power cycle of the
A3x00/A3500FC controller module is done after power cycling a D1000
tray.  Two examples of these behaviors involve either a random drive
failing or drives failing on a LUN-by-LUN basis.

The root cause of random drive failures has been identified as
A3x00/A3500FC controller firmware not handling a "no sense" status from
the disks.  The reason the disks return no sense data is not known.
Current A3x00/A3500FC controller firmware will retry on this
condition. 

An indication of the LUN-by-LUN drive failure problem is an "amber"
light on all the drives in the same drive group.  One drive after
another turns "amber" as the A3x00/A3500FC controller scans the drives
after the A3x00/A3500FC controller module has been power-cycled.  The
net effect is that the A3x00/A3500FC controller off-lines the LUN.
Multiple disk failures cause a LUN to die, leading to data loss.

It is likely that one or more LUNs could be in the "reconstructing"
state after the ESM card replacement.  Certain limitations should be
kept in mind during the reconstruction period.

The A3x00/A3500FC controllers support only one reconstruction per
controller at a time, i.e. two reconstructions per controller pair (a
module).
Reconstruction does indeed occur after powering off a D1000 tray
provided I/O is sent to the affected LUN(s).  As long as the
A3x00/A3500FC controllers are in dual-active state, two LUNs will
reconstruct concurrently and the rest will be "waiting to
reconstruct".  However, there have been cases where the LUNs will still
be marked as "Degraded" until after the first LUN is finished
reconstructing.  Reconstruction will also occur after powering the
D1000 tray back on (again, provided I/O is sent to the affected LUNs)
unless reconstruction is still in progress from powering the tray off,
in which case the drives won't return to their drive groups until
that is complete.
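
The one-reconstruction-per-controller limit can be sketched as a
simple queue.  This is an illustrative Python model only: the function
name and LUN labels are hypothetical, and it ignores which controller
owns which LUN, which the real firmware would take into account.

```python
def schedule_reconstructions(degraded_luns, active_controllers=2):
    """Split degraded LUNs into those reconstructing now and those waiting.

    Assumes one reconstruction slot per active controller, as stated above.
    """
    if active_controllers < 0:
        raise ValueError("active_controllers must not be negative")
    reconstructing = list(degraded_luns[:active_controllers])
    waiting = list(degraded_luns[active_controllers:])
    return reconstructing, waiting


# Dual-active module with four degraded LUNs (labels are hypothetical).
running, waiting = schedule_reconstructions(["LUN 0", "LUN 1", "LUN 2", "LUN 3"])
print(running)  # ['LUN 0', 'LUN 1']
print(waiting)  # ['LUN 2', 'LUN 3']
```

With only one active controller (active_controllers=1), a single LUN
reconstructs and the remaining LUNs wait, so recovery of a tray with
many affected LUNs takes correspondingly longer.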

When LUN reconstruction is started on some of the drives, the
reconstruction may or may not be visible at all.  If LUN
reconstruction is in progress, wait for it to finish on these
drives before power-cycling the controller module.  If the drives
remain in a dead or unresponsive mode, LUN reconstruction will not
start on these drives and the LEDs on these drives will remain
"green".  Because a power-cycle of the A3x00/A3500FC controller module
while LUNs are reconstructing could result in data loss, a complete
backup of the data stored on the array should be performed.  After the
A3x00/A3500FC controller module is power-cycled, the data should be
restored if necessary.

When the Recovery Guru is run as part of the ESM card replacement
procedure, it indicates whether the drives need to be reconstructed or
revived.  To reconstruct a drive, click on the "Reconstruct" button;
to revive a drive, click on the "revive" button.  In either case, a
pop-up window message will indicate if an error occurs during the
attempt.  Pulling out the drives, waiting for 30 seconds, and then
attempting to revive the drives may display the same error
information.  Even after power cycling the D1000 tray, it may not be
possible to revive the drives.  In this case, the A3x00/A3500FC
controller must be power cycled.
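
The escalation order in the paragraph above can be written down as a
short lookup.  This is a hypothetical Python sketch for illustration;
the step labels and the function name are not part of RM6, and the
list only restates the order given in this FIN.

```python
# Escalation order taken from the paragraph above; labels are illustrative.
REVIVE_ESCALATION = [
    "revive the drives from the Recovery Guru",
    "pull the drives, wait 30 seconds, reinsert them, revive again",
    "power-cycle the affected D1000 tray, revive again",
    "power-cycle the A3x00/A3500FC controller",
]


def next_revive_step(failed_attempts):
    """Return the step to try after `failed_attempts` steps have failed.

    Returns None once every step in the list has been tried.
    """
    if failed_attempts < 0:
        raise ValueError("failed_attempts must not be negative")
    if failed_attempts >= len(REVIVE_ESCALATION):
        return None
    return REVIVE_ESCALATION[failed_attempts]


print(next_revive_step(0))
print(next_revive_step(3))
```

The controller power cycle is deliberately last in the list because,
as noted earlier in this FIN, power-cycling the controller module
while LUNs are reconstructing can result in data loss.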

The "Failed Environmental Card" procedure in the RM6 Recovery Guru
states that if the firmware level is 2.05.02 or later, the power cycle
of the controller module can be skipped.  However, it has been found
that a power cycle of the A3x00/A3500FC controller module is needed
regardless of the firmware level, provided that the array has returned
to an optimal state.

The A3x00/A3500FC controller will find the status of the D1000 tray upon
polling or will recognize it when the tray comes up.  However, if the
A3x00/A3500FC controller was power cycled at any time while the tray
was not active, the A3x00/A3500FC array will not see the D1000 tray
because the tray does not exist in its configuration.  In this
situation, the A3x00/A3500FC controller module needs to be power cycled
again so that the controller can register the tray.  This condition
has been seen at a major customer site.
Implementation: 
         ---
        |   |   MANDATORY (Fully Pro-Active)
         ---    
         
  
         ---
        |   |   CONTROLLED PRO-ACTIVE (per Sun Geo Plan) 
         --- 
         
                                
         ---
        | X |   REACTIVE (As Required)
         ---
Corrective Action: 
An Authorized Enterprise Field Service Representative may avoid the
above-mentioned problems by following the recommendations below.

When replacement of an ESM card in a D1000 tray is necessary, please
follow these guidelines.

1) The D1000 tray must be powered down to replace the ESM card.  It is
   essential that *ONLY* the D1000 which suffered the ESM card failure
   be power-cycled.  Do *NOT* power-cycle the other D1000 trays and do
   *NOT* power-cycle the entire rack as a convenient method to power-
   cycle the failed D1000 tray.

2) Replace the ESM card and then power up the D1000 tray.  See page 26 of
   the A1000/D1000 Installation, Operations, and Service Manual for
   details.

3) After the ESM card is replaced, the A3x00/A3500FC controller will
   recognize the ESM card.  When the tray comes up with the new ESM
   card, the drives generally come up with the LED "green" and
   participate in LUN reconstruction if they are in a degraded mode.
   If the LED on the drives is "amber", then the drives need to be
   unfailed using the CLI utility drivutil(1M).

4) Wait until all LUNs are in an optimal state.  If degraded LUNs do 
   not start reconstructing in a timely fashion *after* all other 
   reconstruction has completed, a full backup of the data is required, 
   followed by a power cycle of the A3x00/A3500FC controller module.

5) Power-cycle the A3x00/A3500FC controller module.  It is important
   that the entire array be in an optimal state before performing this
   step.
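
Steps 3 through 5 above can be summarized as a decision sketch.  The
following Python is illustrative only: ArrayStatus and next_step are
hypothetical names, the counts would have to be read by the engineer
from RM6 and the drive LEDs, and the "timely fashion" wait in step 4
is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ArrayStatus:
    amber_drives: int         # drives showing an amber LED after tray power-up
    luns_reconstructing: int  # LUNs currently reconstructing
    luns_degraded_idle: int   # degraded LUNs that are not reconstructing


def next_step(status):
    """Return the next action after the ESM card has been replaced."""
    if status.amber_drives > 0:
        return "unfail the amber drives with drivutil"             # step 3
    if status.luns_reconstructing > 0:
        return "wait for reconstruction to complete"               # step 4
    if status.luns_degraded_idle > 0:
        # Step 4 fallback: degraded LUNs that never start reconstructing.
        return "take a full backup, then power-cycle the controller module"
    return "power-cycle the controller module"                     # step 5


print(next_step(ArrayStatus(amber_drives=0, luns_reconstructing=1, luns_degraded_idle=2)))
```

The ordering reflects the guidance in this FIN: the controller module
is power-cycled only when nothing is reconstructing, and a backup is
taken first whenever the array is not fully optimal.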
Comments: 
------------------------------------------------------------------------------
Implementation Footnote: 
i)   In case of MANDATORY FINs, Enterprise Services will attempt to    
     contact all affected customers to recommend implementation of 
     the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Enterprise Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Enterprise Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.ebay/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.Corp/

* From there, select the appropriate link to browse the FIN or FCO index.

Supporting Documents:
---------------------
* Supporting documents for FIN/FCOs can be found on Edist.  Edist can be 
  accessed internally at the following URL: http://edist.corp/.
  
* From there, follow the hyperlink path of "Enterprise Services
  Documentation" and click on "FIN & FCO attachments", then choose the
  appropriate folder, FIN or FCO.  This will display supporting
  directories/files for FINs or FCOs.
   
Internet Access:
----------------
* Access the top level URL of https://infoserver.Sun.COM
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
---------------------------------------------------------------------------
Status: active