Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1005520.1
Update Date: 2011-02-28
Keywords:

Solution Type  Technical Instruction Sure

Solution  1005520.1 :   How to verify I/O errors for failing device on Sun SPARC Systems [Video]  


Related Items
  • Sun Fire V240 Server
  • Sun Fire V245 Server
  • Sun Fire V440 Server
  • Sun Fire V480 Server
  • Sun Fire T2000 Server
  • Sun Fire V215 Server
  • Sun SPARC Enterprise T5220 Server
  • Sun SPARC Enterprise T5240 Server
  • Sun Fire V445 Server
  • Sun Fire V890 Server
  • Sun Fire V210 Server
  • Sun Fire T1000 Server
  • Sun Fire V880 Server
  • Sun SPARC Enterprise T5140 Server
  • Sun Fire V490 Server
  • Sun SPARC Enterprise T5120 Server
Related Categories
  • GCS>Sun Microsystems>Servers>CMT Servers
  • GCS>Sun Microsystems>Servers>Entry-Level Servers
  • GCS>Support>KM>Content>Video

Previously Published As
207650


Applies to:

Sun Fire V490 Server
Sun Fire T1000 Server
Sun Fire V880 Server
Sun Fire V890 Server
Sun Fire T2000 Server
All Platforms


Goal

Description
This document helps you identify a failing disk device based on errors reported in the 'format' output, the 'iostat' output, and /var/adm/messages.

This information does not apply to systems whose disks are configured in a hardware RAID volume, because 'format' does not show disks that are members of a RAID volume.


A video tutorial is available for this topic. These brief how-to videos provide step-by-step instructions that answer Sun's most frequently asked questions. View the video and/or follow the detailed instructions below.


Video - Analysing Disk Errors (5:00)

 

Sunsolve users must download the attachment to view the video.


Solution

Steps to Follow
Confirming Disk failure for failing drives

On Sun Fire[TM] servers, most I/O errors reported for a failing drive are caused by the disk itself and not by the disk backplane or cables. Several checks can confirm a disk failure from I/O errors.

First, verify whether 'format' reports a device problem. A typical example is 'format' showing 'drive type unknown' for a specific drive. Server platforms such as the 280R, V480/V490, and V880/V890 use FC-AL disk drives. Each FC-AL disk has a World Wide Name (WWN), which affects how the device appears in Solaris[TM] (in the 'format' output):

AVAILABLE DISK SELECTIONS:

0. c1t0d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371e4d,0
1. c1t1d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6372ccc,0
2. c1t2d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0

After analyzing the 'format' output, it is strongly recommended to also examine /var/adm/messages for matching disk drive errors:

Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0 (ssd0):
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice]  Error for Command: read(10) 
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice]  Error Level: Retryable
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Requested Block: 404016 Error Block: 404016
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Vendor: SEAGATE Serial Number: 0446B9xxxx
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.notice] ASC: 0x29 (), ASCQ: 0x3, FRU: 0x4

Errors like these generally indicate that the listed drive needs to be replaced. To confirm which drive is failing, map the WWN from the above messages (w21000011c6371bc0,0) to a drive in the 'format' output. In this case it matches c1t2d0.
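
The matching step above can be sketched in Python. This is a minimal sketch, assuming the 'format' disk list and the WARNING line have been saved as text. The function names parse_format_listing and find_disk_for_warning are hypothetical, and the regular expressions only cover the line layouts shown in this document (real 'format' output may place the device path on a separate line).

```python
import re

# One entry of the disk list as shown in this document, e.g.
# "2. c1t2d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0"
FORMAT_ENTRY = re.compile(r"^\s*\d+\.\s+(c\d+t\d+d\d+)\s+(/\S+)")

# Device path and driver instance in a scsi WARNING line.
WARNING_PATH = re.compile(r"WARNING:\s+(/\S+)\s+\((\w+)\):")


def parse_format_listing(text):
    """Return {device_path: cXtYdZ} from saved 'format' output."""
    disks = {}
    for line in text.splitlines():
        m = FORMAT_ENTRY.match(line)
        if m:
            disks[m.group(2)] = m.group(1)
    return disks


def find_disk_for_warning(warning_line, disks):
    """Return the cXtYdZ name whose path appears in the WARNING line, or None."""
    m = WARNING_PATH.search(warning_line)
    if not m:
        return None
    return disks.get(m.group(1))


format_output = """\
AVAILABLE DISK SELECTIONS:

0. c1t0d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371e4d,0
1. c1t1d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6372ccc,0
2. c1t2d0  /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0
"""

warning = ("Dec 22 12:34:39 wspaba01 scsi: [ID 107833 kern.warning] WARNING: "
           "/pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w21000011c6371bc0,0 (ssd0):")

disks = parse_format_listing(format_output)
print(find_disk_for_warning(warning, disks))  # prints c1t2d0
```

Matching on the full device path, and not on the WWN substring alone, keeps the same code usable for the SCSI paths (sd@1,0) that carry no WWN.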

Here is another example, showing 'format' output from server platforms that use SCSI drives (servers such as V215/V245, V440/V445, T1000/T2000):

AVAILABLE DISK SELECTIONS:

0. c0t0d0 /pci@1f,700000/scsi@2/sd@0,0
1. c0t1d0 /pci@1f,700000/scsi@2/sd@1,0

The following errors appear in /var/adm/messages:

Nov 20 12:28:51 sg5000-maildb-0 scsi: WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
Nov 20 12:28:51 sg5000-maildb-0 scsi: Error for Command: persistent reservation in Error Level: Informational
Nov 20 12:28:51 sg5000-maildb-0 scsi: Requested Block: 0 Error Block: 0
Nov 20 12:28:51 sg5000-maildb-0 scsi: Vendor: SEAGATE Serial Number: 0449B9xxxx
Nov 20 12:28:51 sg5000-maildb-0 scsi: Sense Key: Soft Error
Nov 20 12:28:51 sg5000-maildb-0 scsi: ASC: 0x5d (drive operation marginal, service immediately (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x5

In the above example, the device path from the messages (/pci@1f,700000/scsi@2/sd@1,0) matches disk c0t1d0 in the 'format' output, so that disk needs to be replaced.
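
When /var/adm/messages holds many such blocks, it can help to reduce each one to its device path, Sense Key and ASC. The following is a minimal sketch, assuming the messages have been read into a list of lines. The function name summarize_scsi_errors is hypothetical, and the patterns only cover the line layouts shown in this document. ASC 0x5d is the "failure prediction threshold exceeded" code seen in the example above.

```python
import re

WARNING = re.compile(r"WARNING:\s+(/\S+)\s+\((\w+)\):")
SENSE_KEY = re.compile(r"Sense Key:\s+(.+?)\s*$")
ASC = re.compile(r"ASC:\s+(0x[0-9a-fA-F]+)")


def summarize_scsi_errors(lines):
    """Return one record per WARNING block found in the given lines."""
    records = []
    current = None
    for line in lines:
        m = WARNING.search(line)
        if m:
            # A WARNING line opens a new block for one device.
            current = {"path": m.group(1), "instance": m.group(2),
                       "sense_key": None, "asc": None}
            records.append(current)
            continue
        if current is None:
            continue
        m = SENSE_KEY.search(line)
        if m:
            current["sense_key"] = m.group(1)
        m = ASC.search(line)
        if m:
            current["asc"] = int(m.group(1), 16)
    return records


sample = """\
Nov 20 12:28:51 sg5000-maildb-0 scsi: WARNING: /pci@1f,700000/scsi@2/sd@1,0 (sd2):
Nov 20 12:28:51 sg5000-maildb-0 scsi: Error for Command: persistent reservation in Error Level: Informational
Nov 20 12:28:51 sg5000-maildb-0 scsi: Requested Block: 0 Error Block: 0
Nov 20 12:28:51 sg5000-maildb-0 scsi: Vendor: SEAGATE Serial Number: 0449B9xxxx
Nov 20 12:28:51 sg5000-maildb-0 scsi: Sense Key: Soft Error
Nov 20 12:28:51 sg5000-maildb-0 scsi: ASC: 0x5d (drive operation marginal, service immediately (failure prediction threshold exceeded)), ASCQ: 0x0, FRU: 0x5
"""

records = summarize_scsi_errors(sample.splitlines())
for rec in records:
    # Flag blocks that carry the failure-prediction ASC from the example.
    predictive = rec["asc"] == 0x5D
    print(rec["path"], rec["instance"], rec["sense_key"], predictive)
```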

When troubleshooting I/O errors for failing devices, also examine the output of the 'iostat -E' (or 'iostat -En') command carefully for any error events that affect the disk drives. Look for non-zero counts, usually in the 1st, 4th, and 5th lines of each device entry:

# iostat -En

c1t1d0 Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7CC470000xxxx
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 1
Illegal Request: 0 Predictive Failure Analysis: 0
c1t2d0 Soft Errors: 0 Hard Errors: 394 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7CLM60000xxxx
Size: 0.00GB <0 bytes>
Media Error: 394 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

If more than one disk has non-zero counts (as in the above example), the cause could be a problem on one disk and a side effect of that problem on the other. In this case, the error counts on the failing drive c1t2d0 are significantly higher than those on the other disk, c1t1d0.
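
The comparison of error counts across disks can be sketched in Python. This is a minimal sketch, assuming the 'iostat -En' output has been saved as text. The names parse_iostat_e and rank_by_errors are hypothetical, and the counter names are taken from the sample output in this document.

```python
import re

# Counter names as they appear in the 'iostat -E' samples in this document.
COUNTER_NAMES = [
    "Soft Errors", "Hard Errors", "Transport Errors",
    "Media Error", "Device Not Ready", "No Device",
    "Recoverable", "Illegal Request", "Predictive Failure Analysis",
]
COUNTER = re.compile(r"(%s):\s*(\d+)" % "|".join(COUNTER_NAMES))
# A new device entry starts with its name followed by "Soft Errors:".
HEADER = re.compile(r"^(\S+)\s+Soft Errors:")


def parse_iostat_e(text):
    """Return {device: {counter_name: value}} from saved 'iostat -E' output."""
    devices = {}
    current = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            current = devices.setdefault(m.group(1), {})
        if current is None:
            continue
        for name, value in COUNTER.findall(line):
            current[name] = int(value)
    return devices


def rank_by_errors(devices):
    """Return (device, soft + hard + transport) pairs, highest total first."""
    def total(counters):
        return sum(counters.get(k, 0)
                   for k in ("Soft Errors", "Hard Errors", "Transport Errors"))
    return sorted(((name, total(c)) for name, c in devices.items()),
                  key=lambda item: item[1], reverse=True)


sample = """\
c1t1d0 Soft Errors: 1 Hard Errors: 0 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7CC470000xxxx
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 1
Illegal Request: 0 Predictive Failure Analysis: 0
c1t2d0 Soft Errors: 0 Hard Errors: 394 Transport Errors: 0
Vendor: SEAGATE Product: ST373307LSUN72G Revision: 0507 Serial No: 3HZ7CLM60000xxxx
Size: 0.00GB <0 bytes>
Media Error: 394 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
"""

devices = parse_iostat_e(sample)
for name, total in rank_by_errors(devices):
    print(name, total)
```

The ranking is only a way to see which disk stands out. It does not by itself decide that a disk has failed.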

A disk problem reported in the 'format' output (or messages) typically translates to a high error count in iostat, for example:

2. c1t2d0 /pci@1f,700000/scsi@2/sd@2,0
# iostat -E
..........
c1t2d0 Soft Errors: 0 Hard Errors: 932 Transport Errors: 0
Vendor: FUJITSU Product: MAP3735N SUN72G Revision: 0401 Serial No: 0415Q0XXXX
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 335 No Device: 190 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

However, a non-zero count in the 'iostat -E' output does not always mean an error event on a device. Some specific conditions of the target device can cause non-zero values in the 'iostat' output. The following is an example of such a condition, where the device is working normally:

# iostat -E

ssd10 Soft Errors: 0 Hard Errors: 10 Transport Errors: 0
Vendor: SEAGATE Product: ST336605FSUN36G Revision: 0638 Serial No: 0201P1xxxx
Size: 36.42GB <36418595328 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 10 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0

In this case, the "Hard Errors" and "No Device" counts are equal. This implies that the device has gone through resets or power-on events. The device does not need immediate replacement. It is recommended to monitor the values over a period of time and to investigate further if other related errors appear.

Refer to Document 1017741.1, Solaris Operating System: High Hard Error value in iostat -E output, for more details.
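
The condition described above can be written as a simple check. This is a minimal sketch that only encodes the rule of thumb from this section. The function name classify_hard_errors and its return strings are hypothetical, and the result is a hint for triage, not a diagnosis.

```python
def classify_hard_errors(counters):
    """Classify one device's 'iostat -E' counters (a dict of name -> int)."""
    hard = counters.get("Hard Errors", 0)
    if hard == 0:
        return "no hard errors"
    only_no_device = (
        counters.get("No Device", 0) == hard
        and counters.get("Media Error", 0) == 0
        and counters.get("Device Not Ready", 0) == 0
    )
    if only_no_device:
        # Every hard error is a "No Device" event: consistent with resets
        # or power-on, so watch the counts over time.
        return "monitor"
    return "investigate"


# Counter values copied from the two samples in this document.
ssd10 = {"Soft Errors": 0, "Hard Errors": 10, "Transport Errors": 0,
         "Media Error": 0, "Device Not Ready": 0, "No Device": 10}
c1t2d0 = {"Soft Errors": 0, "Hard Errors": 394, "Transport Errors": 0,
          "Media Error": 394, "Device Not Ready": 0, "No Device": 0}

print(classify_hard_errors(ssd10))
print(classify_hard_errors(c1t2d0))
```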

NOTE: The "diskinfo.sparc" utility, which is part of Sun Explorer, is helpful here. It reports up-to-date disk model and serial number information even after a disk hot swap. For example:
# /opt/SUNWexplo/bin/diskinfo.sparc

AVAILABLE SCSI DEVICES:
   Location     Vendor          Product         Rev  Serial #
    c1t0d0      FUJITSU    MAP3147F SUN146G     1601 0515R0304B
    c1t1d0      SEAGATE    ST373307FSUN72G      0307 0426B7MQQ5

Internal Comments
This document contains normalized content and is managed by the Domain Lead(s) of the
respective domains. To notify content owners of a knowledge gap contained in this document,
and/or prior to updating this document, please contact the domain engineers that are managing this document via the "Document Feedback" alias(es) listed below:

[email protected]

Note:
Some of the error examples in this document list Vendor, Sense Key, ASC and ASCQ information.
These values will vary with the type of drive error and are explained further in
Doc ID 1005787.1 Kernel tips: understanding SCSI and its errors.

normalized, I/O errors, failed drive, format, iostat, Problem Solved = Disk Error Verification
Previously Published As
91406

Change History
Date: 2009-11-18
User name: Dencho Kojucharov
Action: Updated
Comments: Currency check, audited by Dencho Kojucharov, Entry-Level SPARC Content Lead




Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.