Sun[TM] X64/X86 Guide to System Troubleshooting

Asset ID:	1-71-1008335.1
Update Date:	2011-05-27
Keywords:

Solution Type Technical Instruction Sure

Solution 1008335.1 : Sun[TM] X64/X86 Guide to System Troubleshooting

Applies to:

Sun Fire X4240 Server
Sun Fire X4250 Server
Sun Ultra 40 M2 Workstation
Sun Ultra 27 Workstation
Sun Netra X4250 Server
All Platforms

Goal

Description

This document provides a high-level guide to troubleshooting documents for Oracle's Sun x64/x86 product line.

To discuss this information further with Oracle experts and industry peers, we encourage you to review, join, or start a discussion in the My Oracle Support Community - Sun x86 Systems.

Sun System Handbook	Docs	Downloads	Service Processor	Oracle Page
Workstations:
Sun Ultra 20	Docs	Download	None
Sun Ultra 20 M2	Docs	Download	None
Sun Ultra 24	Docs	Download	None
Sun Ultra 27	Docs	Downloads	None
Sun Ultra 40	Docs	Download	None
Sun Ultra 40 M2	Docs	Download	None
Sun Java W1100z	Docs	Download	None
Sun Java W2100z	Docs	Download	None
Servers:
Sun Fire X2100	Docs	Download	SMDC (option)	Sun x86 Systems
Sun Fire X2100 M2	Docs	Download	ELOM
Sun Fire X2200 M2	Docs	Download	ELOM
Sun Fire X2250	Docs	Download	ILOM
Sun Fire X4100	Docs	Download	ILOM
Sun Fire X4100 M2	Docs	Download	ILOM
Sun Fire X4140	Docs	Download	ILOM
Sun Fire X4150	Docs	Download	ILOM
Sun Fire X4200	Docs	Download	ILOM
Sun Fire X4200 M2	Docs	Download	ILOM
Sun Fire X4240	Docs	Download	ILOM
Sun Fire X4250	Docs	Download	ILOM
Sun Fire X4440	Docs	Download	ILOM
Sun Fire X4450	Docs	Download	ILOM
Sun Fire X4500	Docs	Download	ILOM
Sun Fire X4540	Docs	Download	ILOM
Sun Fire X4600	Docs	Download	ILOM
Sun Fire X4600 M2	Docs	Download	ILOM
Sun Fire V20z	Docs	Download	SP
Sun Fire V40z	Docs	Download	SP
Blade Servers:
Sun Blade 1600	Docs	Patches	Switch SC (SSC)	Sun Blade Servers
Sun Blade 6000	Docs	Download	ILOM
Sun Blade 8000	Docs	Download	ILOM
Netra Blades And Servers:
Netra X4200 M2	Docs	Download	ILOM	Sun Netra Carrier-Grade Servers
Netra X4250	Docs	Download	ILOM
Netra X4450	Docs	Download	ILOM
Netra CT900	Docs	Download	ShMM

The product links above contain general information about each product. The Sun System Handbook links contain system specifications, parts lists, documentation, and the list of minimum supported operating systems. System firmware, drivers, and BIOS updates can be downloaded via the Download link.

Solution

Steps to Follow

Kernel Analysis

A system becomes unresponsive for one of three reasons:

Fatal reset (hardware detected)
Operating system panic (software detected)
Operating system/application hang (not detected)

The following document provides some information about the necessary data to gather:
DocID: 1010911.1 What to send to Sun[TM] after a system panic and/or unexpected reboot

Fatal Reset
Fatal resets are hardware-detected problems. They occur when the central processing unit (CPU) takes a trap that immediately drops the system to the BIOS.
One cause is a watchdog reset, which occurs when the operating system fails to access the watchdog circuitry within its timeout period.
A watchdog reset is really an operating system hang detected by the watchdog timer, so see the Hangs section below for diagnostic techniques.

Other causes of fatal resets are hardware failures, such as loss of input voltage or other major hardware-related issues. No core file is saved, and the messages file shows normal operation followed by an abrupt system restart (no shutdown messages).
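
The distinction drawn above (panic messages, shutdown messages, or nothing at all before the restart) can be checked roughly with a small script. The following is only a heuristic sketch: the function name classify_restart and the two search patterns are illustrative assumptions, not strings taken from this document, and the patterns will need adjusting for each operating system release. Pass it only the part of the messages file that immediately precedes the restart, otherwise an old panic will be matched.

```shell
# Heuristic sketch: classify a restart from an excerpt of the messages file.
# PANIC_PATTERN and SHUTDOWN_PATTERN are illustrative assumptions; tune them.
# Assumes a POSIX grep (on older Solaris releases this may be
# /usr/xpg4/bin/grep rather than /usr/bin/grep).
PANIC_PATTERN=${PANIC_PATTERN:-'panic'}
SHUTDOWN_PATTERN=${SHUTDOWN_PATTERN:-'shutdown|halted by|rebooted by'}

classify_restart() {
    messages_file=$1
    if grep -Eiq "$PANIC_PATTERN" "$messages_file"; then
        echo "panic"                 # software detected; look for a core file
    elif grep -Eiq "$SHUTDOWN_PATTERN" "$messages_file"; then
        echo "orderly-shutdown"      # someone or something restarted the system
    else
        echo "possible-fatal-reset"  # abrupt restart with nothing logged
    fi
}
```

A result of possible-fatal-reset is only a hint to go and collect the service processor data; it does not prove a hardware fault.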

The most important diagnostic information to retrieve is listed below; most of it is obtained through the service processor (SP):

Console output. This typically contains a reason for the reset, for example "sync flood" (or nothing for a total power loss).
SP events. These may include sensor-related events, such as an undervoltage condition on one rail, or OEM-specific events such as 0x12's.
SP sensor data. This shows whether a sensor has a persistent problem, such as a voltage regulator or fan failure.
SP field replaceable unit (FRU) data. This describes the hardware inventory and configuration to assist with hardware replacement. Collect it to determine whether the system is properly configured (e.g. to detect a partially installed memory bank). The system board page in the Sun System Handbook is a good reference to check against.
Explorer or other operating system data collector that contains the messages files and other data.

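Panics
For an operating system panic, the most important diagnostic information to retrieve is the following:

Explorer or other operating system data collector that contains the messages files and other data. This typically contains panic messages and a stack trace related to the panic.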
SP events. These may include sensor-related events, such as an undervoltage condition on one rail, or OEM-specific events such as 0x12's.
SP sensor data. This shows whether a sensor has a persistent problem, such as a voltage regulator or fan failure.
SP FRU data. This describes the hardware inventory and configuration to assist with hardware replacement. Collect it to determine whether the system is properly configured (e.g. to detect a partially installed memory bank). The system board page in the Sun System Handbook is a good reference to check against.

If the cause of the reset cannot be quickly determined, it is important to run hardware diagnostics such as a full power-on self-test (POST), the bundled PCcheck, or SunVTS to determine whether the hardware is stable. PCcheck and other diagnostic tools can typically be downloaded via the Download link above, or may already be available from the BIOS boot menu.

Hangs
A hang occurs when some applications may operate properly while others appear dead, but neither the hardware nor the operating system detects a problem. Hangs are caused by resource deadlocks due to operating system race conditions, or by resource starvation due to one or more applications consuming excessive resources. Console messages sometimes indicate the source of the hang, but typically a core dump should be forced so that Sun's kernel group can analyze the data. There is a small possibility that a hang is caused by hardware, but please contact the kernel group first for isolation.

DocID: 1012991.1 How to check if your x64 platform "system hang" actually is a system hang.

This document can be referenced to assist with possible hang situations.

Read the OS Troubleshooting section below to learn how to configure and force core dumps. Note that forcing a core dump from a hung system is not always possible.

OS Troubleshooting
Sun x86/x64 systems typically support the Solaris[TM], Red Hat Enterprise Linux, SuSE Linux Enterprise Server, and Windows operating systems.
Check the Sun System Handbook to ensure that the operating system in question is supported on that platform.
A good overall operating system document to review is:

DocID: 1019144.1 Data Requirements reference: What data is needed in order to troubleshoot my software or hardware problem?

Solaris:
Six important Solaris documents that discuss procedures and configuration for Solaris panics and hangs are as follows:

DocID: 1012913.1 Troubleshooting Panics, dumps, hangs or crashes in the Solaris[TM] Operating System
DocID: 1001950.1 Troubleshooting Suspected Solaris Operating System Hangs
DocID: 1004506.1 How to force a crash when my machine is hung
DocID: 1001950.1 When to Force a Solaris[TM] System Core File
DocID: 1004530.1 KERNEL: How to enable deadman kernel code
DocID: 1003085.1 Solaris[TM] Operating System: Forcing a kernel core dump on an x86 or x64 system

Red Hat Linux:
Three important Red Hat documents that discuss procedures and configuration for Red Hat panics & hangs are as follows:

DocID: 1005528.1 How to configure Kdump on Red Hat Enterprise Linux 5 systems
DocID: 1006577.1 Red Hat Linux: Diskdump Pre-requisites, install and settings
DocID: 1007699.1 Crash Dump capturing for Red Hat Linux

SuSE Linux:
Two important SuSE documents that discuss procedures and configuration for SuSE panics & hangs are as follows:
DocID: 1108937.1 How to configure Kdump on SuSE Linux Enterprise System 10
DocID: 1010059.1 How to configure LKCD on SuSE Linux Enterprise Systems 8 and 9

Windows:
An important Windows document that discusses procedures and configuration for panics is:
DocID: 1007054.1 How to handle Microsoft Windows panics on x64 platforms

Additional documents that assist in Windows troubleshooting are:
DocID: 1011590.1 How to check for Windows platform disk errors and online/offline status
DocID: 1010936.1 Microsoft Windows and Linux operating systems: How to obtain troubleshooting information

Disk and Redundant Array of Independent/Inexpensive Disks (RAID) Troubleshooting
Disk and RAID problems are sometimes related to the disk/RAID controller firmware and boot configuration.

A good overall document on how to determine disk firmware revisions on systems running a supported operating system, and how to search for known issues, is:
DocID: 1008396.1 How to Identify Optical and Hard Disk Firmware Revisions for Checking of Known Issues

A good document on boot related issues is:
DocID: 1005506.1 How to verify your boot media exists and is bootable on a Sun Fire[TM] X4100/X4200/X4600 and M2 models Server

Once the firmware version is known, the following document describes how to list, create, or delete RAID volumes:
DocID: 1005358.1 Hardware RAID usage on X64 based systems with the LSI SAS1064

The LSI RAID controller firmware requires 64MB of unpartitioned disk space at the end of the disk for volume management. Therefore, back up all data before creating any RAID volume.
LSI RAID status can be obtained via the BIOS, as shown in the following document:

DocID: 1013107.1 How to Identify BIOS and Solaris[TM] Hardware RAID Status

Disks placed into a RAID volume should be of identical size to avoid problems.

RAID levels are:
RAID-0: Stripes 2 or more disks to form a larger virtual disk. There is no redundancy, so data is lost on a disk failure, but performance is higher because a file is accessed across multiple disks.
RAID-1: Mirrors 2 or more disks to provide redundant data copies and prevent data loss on a disk failure. Write performance decreases because each file update requires 2 or more writes, but read performance increases because a file can be read from multiple disks.
RAID-01: Mirror of striped disks. A disk failure takes its associated stripe offline.
RAID-10: Stripe of mirrored disks. It can tolerate the loss of two disks, depending on the configuration.
RAID-5: Stripes 3 or more disks with distributed parity, so data loss is prevented if one disk fails. Performance is moderate because each file update requires two writes, but access is striped across multiple disks.
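
The capacity trade-off between these levels can be illustrated with a small calculation. The sketch below assumes identically sized disks and uses the textbook capacity formulas; the function name raid_usable_gb is hypothetical, and the result ignores the 64MB per-disk reservation mentioned above as well as any controller-specific limits.

```shell
# Usable capacity of a RAID volume built from N identical disks.
# usage: raid_usable_gb LEVEL NDISKS DISK_SIZE_GB
# Sketch only: ignores controller metadata and vendor-specific limits.
raid_usable_gb() {
    level=$1
    n=$2
    size=$3
    case $level in
        0)     min=2; usable=$(( n * size )) ;;        # stripe: all space usable
        1)     min=2; usable=$size ;;                  # mirror: one disk's worth
        01|10) min=4; usable=$(( n * size / 2 ))       # half is used for mirrors
               if [ $(( n % 2 )) -ne 0 ]; then
                   echo "RAID-$level needs an even number of disks" >&2
                   return 1
               fi ;;
        5)     min=3; usable=$(( (n - 1) * size )) ;;  # one disk's worth of parity
        *)     echo "unknown RAID level: $level" >&2; return 2 ;;
    esac
    if [ "$n" -lt "$min" ]; then
        echo "RAID-$level needs at least $min disks" >&2
        return 1
    fi
    echo "$usable"
}
```

For example, raid_usable_gb 5 4 146 prints 438, because one disk's worth of space out of four holds parity.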

The Solaris raidctl command reports RAID status and can create and delete RAID volumes, as described in the following:

DocID: 1013107.1 How to Identify BIOS and Solaris[TM] Hardware RAID Status

Solaris commands that are helpful in disk troubleshooting are as follows:

# /usr/sbin/mount | grep "/ on"

/ on /dev/dsk/c1t0d0s0 read/write/setuid/devices/logging/xattr/onerror=panic/dev=f40040 on Thu Dec 6 11:49:54 2007

# iostat -E

sd0 Soft Errors: 1 Hard Errors: 2 Transport Errors: 0
Vendor: AMI Product: Virtual CDROM Revision: 1.00 Serial No: Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 1 Predictive Failure Analysis: 0
sd1 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: AMI Product: Virtual Floppy Revision: 1.00 Serial No: Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0

# iostat -xe

                 extended device statistics        ---- errors ---
device    r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b s/w h/w trn tot
sd0       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   1   2   0   3
sd1       0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0   2   0   0   2
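
When a system has many devices, the iostat -E summary lines can be filtered so that only devices reporting hard or transport errors are shown. The following is a minimal sketch that assumes the summary-line layout shown in the iostat -E sample above; the function name iostat_error_devices is hypothetical.

```shell
# Read `iostat -E` output on stdin and list devices that report
# hard or transport errors. Assumes summary lines of the form:
#   sd0 Soft Errors: 1 Hard Errors: 2 Transport Errors: 0
iostat_error_devices() {
    awk '
        $2 == "Soft" && $3 == "Errors:" {
            hard = $7
            transport = $10
            if (hard + transport > 0)
                printf "%s hard=%d transport=%d\n", $1, hard, transport
        }
    '
}
```

Usage would be: iostat -E | iostat_error_devices. In the sample above, sd0 is the AMI virtual CD-ROM, so its hard errors (No Device: 2) likely reflect absent media rather than a failing disk; check the Vendor and Product line before acting on a hit.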

Linux disk issues can be isolated using the following:
DocID: 1013003.1 How to Identify if a Linux Operating Environment is Installed on a Hardware RAID Controller

The document above describes how to determine whether a Linux disk is under hardware RAID control.
Software RAID is configured using mdadm, as discussed in:
DocID: 1011427.1 How to setup software RAID in Linux
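
On Linux, the state of mdadm software RAID arrays is exposed in /proc/mdstat, where a status such as [UU] means all members are up and an underscore marks a missing member. The following sketch lists arrays with a missing member; the function name md_degraded_arrays is hypothetical, and the parser assumes the usual /proc/mdstat layout, which may vary between kernel versions.

```shell
# List software RAID (md) arrays that have a missing member.
# Reads /proc/mdstat by default; pass another file to test the parser.
md_degraded_arrays() {
    awk '
        /^md[^ ]* :/ { dev = $1 }             # e.g. "md0 : active raid1 ..."
        dev != "" && $NF ~ /^\[[U_]+\]$/ {     # e.g. "... [2/1] [U_]"
            if ($NF ~ /_/) print dev, $NF
            dev = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

No output means that no array reported a missing member. An array that is listed should be examined further with mdadm as described in the document above.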

Linux commands that are helpful in disk troubleshooting are as follows. This example displays the root mount point:

# /bin/mount | grep "on / "

/dev/sda2 on / type ext3 (rw)

Windows disk status can be checked using information from the following:

DocID: 1011590.1 How to check for Windows platform disk errors and online/offline status

An example of a Windows RAID installation is obtained from:
DocID: 1009559.1 Installing Windows 2003 Server with RAID enabled on Sun Fire[TM] x2100

General Troubleshooting
For problems not covered by the prior two sections, collect the following information:

Obtain SP related data in all cases. This can be done via ipmitool (see below), or via the SP's GUI or command line interfaces (if functionality exists; see SP link above).
Ensure that the installed operating system is supported per the Sun System Handbook link above.
When possible, obtain operating system data collectors such as explorer or other output that records the state of the operating system and file system (including messages files).
PCcheck & other diagnostic tools can typically be downloaded via the Download link above or already be available as part of the BIOS boot menu.

IPMItool
IPMItool is a very useful tool that can gather information from the ILOM and other service processors (SPs).

Example commands for collecting data are as follows. Replace "ipaddress" with the address of the service processor, not the main platform:

ipmitool -H "ipaddress" -U root fru
ipmitool -H "ipaddress" -U root sel elist
ipmitool -H "ipaddress" -U root -v sdr
ipmitool -H "ipaddress" -U root sdr elist
ipmitool -H "ipaddress" -U root sdr list
ipmitool -H "ipaddress" -U root chassis status
ipmitool -H "ipaddress" -U root sunoem led get
ipmitool -H "ipaddress" -U root sensor
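
The commands above can be wrapped in a small script so that each output is saved to its own file for attaching to a service request. This is a minimal sketch: the function name collect_sp_data, the IPMITOOL override variable, and the output file names are assumptions made for illustration. It uses the same options as the commands above, so ipmitool will prompt for the password on each invocation unless you add your own authentication options.

```shell
# Run the ipmitool commands listed above against one service processor
# and save each output to a separate file.
# usage: collect_sp_data SP_ADDRESS OUTPUT_DIR
# IPMITOOL may be set to an alternate path to the ipmitool binary.
collect_sp_data() {
    sp_host=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for sub in "fru" "sel elist" "-v sdr" "sdr elist" "sdr list" \
               "chassis status" "sunoem led get" "sensor"; do
        # Build a file name such as sel_elist.txt or v_sdr.txt
        name=$(printf '%s' "$sub" | tr ' ' '_' | tr -d '-')
        # $sub is intentionally unquoted so "sel elist" becomes two arguments
        ${IPMITOOL:-ipmitool} -H "$sp_host" -U root $sub \
            > "$out_dir/$name.txt" 2>&1 ||
            echo "warning: 'ipmitool $sub' failed for $sp_host" >&2
    done
}
```

For example, collect_sp_data 192.0.2.10 /var/tmp/sp-data would write eight files, one per command. The address shown is a documentation placeholder.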

Keywords: X64, troubleshooting, x86
Previously Published As
88276

Change History
Date: 2009-12-01
User Name: Tony McNamara
Action: Currency check
Comment: Updated to add new products, re-defined descriptions of errors relating to x64 platforms, added new hang section, updated links and added new RAID section
Date: 2010-06-02
User: [email protected]
Action: Implemented comment
Comment: Updated link to How to configure Kdump on SuSE Linux Enterprise System 10 (Doc ID 1108937.1)

Attachments

This solution has no attachment