Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-75-1004271.1
Update Date:2011-03-24
Keywords:

Solution Type  Troubleshooting Sure

Solution  1004271.1 :   Troubleshooting errors on I/O devices (Disk drives, DVD, Tapes) in Sun Fire [TM] Serengeti or LightWeight8 systems


Related Items
  • Sun Fire E6900 Server
  • Sun Fire 3800 Server
  • Sun Fire 6800 Server
  • Sun Netra 1280 Server
  • Sun Fire E4900 Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire E2900 Server
  • Sun Netra 1290 Server
  • Sun Fire 4810 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
  • GCS>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
205899


Applies to:

Sun Netra 1280 Server
Sun Netra 1290 Server
Sun Fire V1280 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
All Platforms

Purpose

Description

This document covers situations where certain I/O devices might be suspected to be defective.

Specifically, this document addresses how to troubleshoot device errors affecting Hard Disk Drives (HDDs), DVDs, or Tape Drives on Sun Fire [TM] 3800, 4800, 4810, E4900, 6800, E6900 and Sun Fire [TM] V1280, E2900, and Netra [TM] 1280, 1290 systems.  This document does not address a situation where a device is considered to be "missing" or has "disappeared".

  • To troubleshoot a "missing" device, see <Document:1005522.1> Troubleshooting a "missing" Hard Disk Drive (HDD) on Sun Fire [TM] Serengeti or LightWeight8 systems

Symptoms:

  • One might describe the situation by saying "I have a bad I/O device or devices" or "I'm getting I/O device errors" or "I'm getting I/O errors".
  • iostat may report excessive hard or soft errors on disks, DVDs, or tape drives.
  • There may be numerous messages in /var/adm/messages in the domain reflecting read or write errors, SCSI transport errors, or similar.
  • In some cases, the errors or problems could prevent a domain from booting.
  • It's possible the problems could affect the whole controller or device path or multiple controllers or device paths.

Last Review Date

March 24, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

Steps to Follow

Please validate that each troubleshooting step below is true for your environment.  The steps provide instructions, or a link to a document, for validating the step and taking corrective action as necessary. The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper resolution. Please do not skip a step.

1.  Verify the error(s) affect an I/O device or devices (Hard Disk Drive, DVD ROM, Tape Drive).

  • Use iostat -En output, or error messages logged to /var/adm/messages on the domain, to identify the device(s) in error.
  • The example iostat -En output below shows a DVD-ROM and a Hard Disk Drive (HDD):
$ iostat -En
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: TOSHIBA Product: DVD-ROM SD-C2612 Revision: 1011 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0435Q0E3UJ
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
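
When many devices are attached, scanning this output by eye can be slow. The following is a minimal Python sketch, not a Sun-provided tool, that pulls the three summary counters out of iostat -En text and lists the devices with any non-zero count. The function names are hypothetical, and the sample text is simply the example output above; on a live domain the text would come from running iostat -En.

```python
import re

# Sample text copied from the example output in Step 1.
SAMPLE = """\
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: TOSHIBA Product: DVD-ROM SD-C2612 Revision: 1011 Serial No:
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
c1t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0435Q0E3UJ
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""

# A device record starts with a line such as "c0t0d0 Soft Errors: 2 ...".
HEADER = re.compile(
    r"^(\S+)\s+Soft Errors:\s*(\d+)\s+Hard Errors:\s*(\d+)"
    r"\s+Transport Errors:\s*(\d+)"
)


def parse_error_counters(text):
    """Return {device: {"soft": n, "hard": n, "transport": n}}."""
    counters = {}
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            dev, soft, hard, tran = match.groups()
            counters[dev] = {
                "soft": int(soft),
                "hard": int(hard),
                "transport": int(tran),
            }
    return counters


def devices_with_errors(counters):
    """List devices whose soft, hard, or transport count is non-zero."""
    return [dev for dev, c in counters.items() if any(c.values())]


parsed = parse_error_counters(SAMPLE)
for dev in devices_with_errors(parsed):
    print(dev, parsed[dev])
```

A non-zero counter only flags a device for closer inspection; the later steps decide whether the count actually implicates the hardware.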

2.  Verify that the devices in error are not newly installed or replaced units.

  • If they are newly installed or repaired, make sure they have current firmware and that the units have been re-seated. 
    • For most I/O devices (Hard Disk Drive, DVD-ROM, etc.), the firmware revision is displayed in iostat -En output (see the example in Step 1) in the field "Revision:".
  • After completing these tasks, verify that the errors persist.
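
To compare firmware levels across several units, the "Revision:" field can be extracted programmatically. The sketch below is illustrative only: the regular expression assumes the field layout shown in the Step 1 example (a single-word vendor, and a product name that may contain spaces), and the function name is hypothetical.

```python
import re

# Matches the identity line of an `iostat -En` record. The Product field may
# contain spaces, so it is captured lazily up to the "Revision:" label.
IDENTITY = re.compile(
    r"Vendor:\s*(\S+)\s+Product:\s*(.+?)\s+Revision:\s*(\S+)\s+Serial No:\s*(\S*)"
)


def firmware_revision(identity_line):
    """Return (vendor, product, revision), or None if the line does not match."""
    match = IDENTITY.search(identity_line)
    if match is None:
        return None
    vendor, product, revision, _serial = match.groups()
    return vendor, product, revision


# Identity lines taken from the example output in Step 1.
disk = "Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0435Q0E3UJ"
dvd = "Vendor: TOSHIBA Product: DVD-ROM SD-C2612 Revision: 1011 Serial No:"
print(firmware_revision(disk))
print(firmware_revision(dvd))
```

Whether a given revision is current still has to be checked against the firmware documentation for that device; the sketch only reads the value.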

3.  Verify the type of error and the error count in iostat -En output to determine whether device replacement might be necessary.

  • <Document:1007250.1> iostat -E: Explanation of error counters offers advice.

4.  Verify whether any device with a high Hard Error count in iostat -En is actually unhealthy.

  • <Document:1017741.1> Solaris[TM] Operating System: High Hard Error value in iostat -E output provides an explanation of situations which may cause hard errors on sane hardware.

5.  Confirm that the iostat -En output is reporting errors that correlate to the present period of time.

  • In other words, make sure that the iostat data is not "stale" (old errors that originated sometime in the past). 
    • If there are errors logged in both /var/adm/messages and iostat from the present period of time, the iostat data can be trusted and the device is implicated, depending on the type and count of errors.
  • If the errors in iostat are "stale" and do not correlate to present errors in /var/adm/messages, do not replace the implicated device, but instead monitor it for any repeat errors.
    • The error counters will be reset following the next system reboot.
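
The correlation described in this step can be sketched as a simple filter over log lines. This is a rough illustration under stated assumptions: each line is assumed to begin with a syslog-style "Mon DD HH:MM:SS" stamp (which carries no year, so the year of the reference time is assumed), the keyword list is a guess at typical wording, and the sample lines are made up for the example rather than taken from a real /var/adm/messages file.

```python
from datetime import datetime, timedelta


def recent_error_lines(lines, now, window_hours=24,
                       keywords=("scsi", "transport", "i/o error")):
    """Return log lines that mention a keyword and fall inside the window."""
    cutoff = now - timedelta(hours=window_hours)
    hits = []
    for line in lines:
        try:
            # The first 15 characters hold the "Mon DD HH:MM:SS" stamp.
            stamp = datetime.strptime(
                f"{now.year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # continuation line or unexpected format
        if cutoff <= stamp <= now and any(k in line.lower() for k in keywords):
            hits.append(line)
    return hits


# Made-up sample lines, for illustration only.
now = datetime(2011, 3, 24, 12, 0, 0)
lines = [
    "Mar 10 08:00:00 host1 scsi: WARNING: sample transport error (made-up line)",
    "Mar 24 10:15:32 host1 scsi: WARNING: sample transport error (made-up line)",
    "Mar 24 11:00:00 host1 sendmail: unrelated sample line",
]
recent = recent_error_lines(lines, now)

# iostat counters with no matching recent log entries suggest stale data.
iostat_shows_errors = True  # assumed result of checking iostat -En
print("stale" if iostat_shows_errors and not recent else "current")
```

The 24-hour window is an arbitrary choice for the example; the appropriate period depends on when the problem was first reported.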

6.  Verify the physical location of the devices in error.

  • Decode the device paths in error using <Document:1005907.1> Solaris[TM] Operating System: Matrix of Recognized Device Paths.

7.  Verify that no system configuration changes (software changes) recently affected the implicated devices.

  • Examples of such changes would be PCI driver patches recently installed (or reboots to allow the changes to take effect), recent /etc/system file changes, disk firmware upgrades, etc.
    • This confirmation is most important when the devices in error occupy different device paths, but the actual device type is similar.  For example, if there are three disks in error and all are the same model, the likely cause might be a disk firmware issue.  But, if there are two different device types in error (like a DVD ROM and disk drive) attached to the same type of HBA in error, the likely cause might be the driver that operates that particular HBA.
  • So, essentially, confirm whether there is a commonality between the devices in error and rule out recent changes to those similarities before proceeding to the next step.
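
The commonality check above can be sketched as a comparison of attributes across the devices in error. The record fields ("product", "hba_type") and the HBA type names are hypothetical placeholders for whatever inventory data is available; the sketch only mirrors the two examples given in this step and is not a complete diagnosis.

```python
def shared_attributes(devices, keys=("product", "hba_type")):
    """Return the attributes that have the same value on every device in error."""
    if len(devices) < 2:
        return []  # commonality is only meaningful for two or more devices
    return [k for k in keys if len({d[k] for d in devices}) == 1]


# Hypothetical case 1: three disks of the same model on different HBA types.
same_model = [
    {"name": "c1t0d0", "product": "MAP3735N SUN72G", "hba_type": "hba-type-a"},
    {"name": "c2t1d0", "product": "MAP3735N SUN72G", "hba_type": "hba-type-b"},
    {"name": "c3t0d0", "product": "MAP3735N SUN72G", "hba_type": "hba-type-c"},
]

# Hypothetical case 2: a DVD-ROM and a disk attached to the same type of HBA.
same_hba_type = [
    {"name": "c0t0d0", "product": "DVD-ROM SD-C2612", "hba_type": "hba-type-a"},
    {"name": "c1t0d0", "product": "MAP3735N SUN72G", "hba_type": "hba-type-a"},
]

print(shared_attributes(same_model))     # shared model: review disk firmware changes
print(shared_attributes(same_hba_type))  # shared HBA type: review driver changes
```

A shared attribute is only a pointer to which recent change to review first; it does not by itself rule out a hardware fault.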

8.  Investigate the primary hardware cause of the errors depending on which devices are in error using the table below.

The advice in this table is intended to be used as a guide to troubleshooting this issue and not necessarily the exact resolution to every multiple device error situation.  When investigating the cause of the errors you might choose to replace or relocate the suspect components to another location in the configuration to confirm the cause or determine the resolution to the errors.

| Situation | Primary Causes | Secondary Causes | Less likely Causes |
|---|---|---|---|
| Single device on a device path. | Defective device (Disk, DVD, Tape, etc) | Defective cable/termination, or HBA | Array Controller (if device is in an array) or Media Tray (if internal to E2900/1280/1290) or disk array backplane. Collaborate with the next level of support if investigating this suspected cause (See Step 10). |
| All devices on same device path (HBA) in error. | Defective HBA or cable/termination. | Array Controller (if device is in an array) or Media Tray (if internal to E2900/1280/1290) | I/O Board or Array Backplane. Collaborate with the next level of support if investigating this suspected cause (See Step 10). |
| Some devices on same path (HBA) in error. | Defective device (Disk, DVD, Tape) or HBA. | Cable/termination or Array Controller (if device is in an array) or Media Tray (if internal to E2900/1280/1290) | Array Backplane. Collaborate with the next level of support if investigating this suspected cause (See Step 10). |
| Multiple devices on multiple paths (HBAs), same I/O Board in error. | I/O Board | Collaborate with the next level of support (See Step 10). | Collaborate with the next level of support (See Step 10). |
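
The four situations in the Step 8 table can also be expressed as a small lookup, which may help when triaging many paths at once. This is a sketch under assumptions: the row labels, the input layout, and the "IB6" board name are hypothetical, and the rule that a lone failing device counts as "single device" even when it is the only device on its path is one interpretation of the table, not something the table states. Cause texts are abbreviated from the table.

```python
# Causes abbreviated from the Step 8 table; the dictionary keys are
# hypothetical labels invented for this sketch.
CAUSES = {
    "single_device": {
        "primary": "Defective device (Disk, DVD, Tape, etc)",
        "secondary": "Defective cable/termination, or HBA",
        "less_likely": "Array Controller, Media Tray, or disk array backplane",
    },
    "all_on_path": {
        "primary": "Defective HBA or cable/termination",
        "secondary": "Array Controller or Media Tray",
        "less_likely": "I/O Board or Array Backplane",
    },
    "some_on_path": {
        "primary": "Defective device (Disk, DVD, Tape) or HBA",
        "secondary": "Cable/termination, Array Controller, or Media Tray",
        "less_likely": "Array Backplane",
    },
    "multi_path_same_board": {
        "primary": "I/O Board",
        "secondary": "Collaborate with the next level of support (Step 10)",
        "less_likely": "Collaborate with the next level of support (Step 10)",
    },
}


def classify(paths):
    """Map per-path error counts to a table row label, or None.

    paths: {path: {"io_board": str, "total": int, "in_error": int}}
    """
    failing = [p for p in paths.values() if p["in_error"] > 0]
    if not failing:
        return None
    if len(failing) > 1:
        boards = {p["io_board"] for p in failing}
        # Failing paths spread over different I/O boards are not in the table.
        return "multi_path_same_board" if len(boards) == 1 else None
    path = failing[0]
    if path["in_error"] == 1:
        return "single_device"
    if path["in_error"] == path["total"]:
        return "all_on_path"
    return "some_on_path"


# Hypothetical example: every device on path c1 is in error, c2 is clean.
example = {
    "c1": {"io_board": "IB6", "total": 4, "in_error": 4},
    "c2": {"io_board": "IB6", "total": 2, "in_error": 0},
}
row = classify(example)
print(row, "->", CAUSES[row]["primary"])
```

As with the table itself, the result is a starting point for investigation rather than a definitive resolution.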

  • If replacing a suspect device, use the Sun System Handbook to identify the part number of the device you need to replace. 
  • For Hard Disk Drive (HDD) replacements, see the reference <Document:1004390.1> Hard Disk Drive (HDD) Part Number Identification.

9.  If errors persist, investigate the secondary hardware cause of the errors depending on which devices are in error using the information and table from Step 8 above.

10.  Collect the following data and collaborate with the next level of support.

  • It is preferred that Explorer be run with the appropriate scextended or 1280extended option, as detailed in <Document:1018748.1> How to Run Sun[TM] Explorer and Forward the Data to a Sun Engineer.
  • If Explorer data cannot be collected for whatever reason, see <Document:1003529.1> Procedure to manually collect Sun Fire[TM] Midrange System Controller level failure data.


Internal Only Information

Stale iostat Data
If the errors in iostat are "stale" and do not correlate to present errors in /var/adm/messages,
reset the error counts to prevent further confusion about the device's health by following
<Document:1012731.1> If you want to reset the iostat -E hard/soft/tran errors counters without rebooting.

At this point, if the customer has validated that each troubleshooting step above is true for their
environment, and the issue still exists, collaborate with the next level of technical expertise.

Previously Published As 91433

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.