Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1319343.1
Update Date:2011-05-26

Solution Type  Technical Instruction Sure

Solution  1319343.1 :   Sun Enterprise [TM] 10000: POST and Hardware Dump Frequently Asked Questions  


Related Items
  • Sun Enterprise 10000 Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers




In this Document
  Goal
  Solution


Applies to:

Sun Enterprise 10000 Server - Version: Not Applicable to Not Applicable - Release: N/A to N/A
Information in this document applies to any platform.

Goal

This document contains answers to POST and hardware dump frequently asked questions.

Solution

What are the meanings of the various board and hardware component states found at the end of a POST run or log file?
Each component and the system board overall are associated with one of nine states. The Gen column is the 'general health' of the system board. Other columns report the state of individual components. From left to right: processors, memory banks, I/O controllers and slots, ASICs.
Board Descriptor Array:
Proc M/Grp IOC/Slot CIC PC XDB LDPTH
Brd Gen 3210 3210 1/3210 0/3210 3210 210 3210 10
0: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | G=Good
1: x xxxx x/xxxx x/xxxx x/xxxx xxxx xxx xxxx xx | f=Failed
2: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | m=Missing
3: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | b=Blacklisted
4: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | r=Redlisted
5: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | c=Crunched
6: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | _=Undefined
7: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | x=NotInDomain
8: G GGGG G/mmGG G/__mG G/__mG GGGG GGG GGGG GG | u=G,unconfig
9: G GGGb G/mmGG G/__mG G/__Gm GGGG GGG GGGG GG
A: G GGGG G/mmGG G/__GG G/__GG GGGG GGG GGGG GG
B: G GGGG G/GGGG G/__Gm G/__GG GGGG GGG GGGG GG
C: c mmmm m/cccc c/__mm c/__mm cccc ccc cccc cc
D: G GGGG G/mmGG G/__GG G/__mG GGGG GGG GGGG GG
E: G GGGG G/GGGG c/__mm c/__mm GGGG GGG GGGG GG
F: G GGGG G/bmGG G/__GG G/__mG GGGG GGG GGGG GG
State / Description
Good (G)
Component passed hpost tests. The hpost level is indicated on line 2 of the output.
Failed (f)
Component failed hpost tests. See the post log in /var/opt/SUNWssp/adm/<platform>/<domain>/post.
Missing (m)
The component is not physically present in the system. In the above output, an example is memory banks 2 and 3 on board 8. Also, I/O slots 8.0.1 and 8.1.1 are empty.
Blacklisted (b)
The component has been blacklisted in either a platform or domain-specific blacklist file. Proc 9.0 above is an example.
Redlisted (r)
The component is considered 'untouchable' by hpost. Typically, this means the component is part of another domain, as hpost may not change the state of resources in other domains. A component may also have been added to the redlist file, although editing this file is not recommended. See the redlist man page for more.
Crunched (c)
A component is crunched when it serves no useful purpose, and hpost therefore does not configure that component into the system. It is essentially cause and effect: if component A relies on or serves component B, and component B is not good, there is no point in configuring component A. In the above output, system board C is crunched because it has no memory, no processors, and no I/O cards. Another example is the I/O on system board E. No I/O cards are present, so the I/O controllers have been crunched. Crunching of a component can also result when ASICs are blacklisted.
Undefined (_)
An unimplemented location. Each SYSIO supports up to 4 SBus cards, but there is physical space for only 2, which is why slots 3 and 2 show as _ in the output above.
NotInDomain (x)
If a board's Gen column is x, but all other components are r, that board is part of another domain. System board 0 is an example. A board reporting all components as x is not configured in any domain. System board 1 in the example above is such a board.
G,unconfig (u)
The component is good, but not configured for some reason. This is a holdover from the CS6400 and is not used in Starfire.
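
The state legend above can be applied mechanically to a row of the Board Descriptor Array. The following Python sketch is not part of hpost or the SSP software; the function names are hypothetical, and it assumes the row layout shown in the sample output (Gen, Proc, M/Grp, two IOC/Slot columns, CIC, PC, XDB, LDPTH, separated by whitespace). It also assumes the character before each "/" is an overall state for that group, which the article does not spell out.

```python
# Legend copied from the right-hand column of the hpost output.
STATES = {
    "G": "Good", "f": "Failed", "m": "Missing", "b": "Blacklisted",
    "r": "Redlisted", "c": "Crunched", "_": "Undefined",
    "x": "NotInDomain", "u": "G,unconfig",
}


def decode(chars):
    """Map state characters to names, reordered so index 0 is unit 0."""
    # The column header counts down (3210), so reverse the string.
    return [STATES[c] for c in reversed(chars)]


def parse_board_row(line):
    """Split one Board Descriptor Array row into named fields (hypothetical helper)."""
    fields = line.split("|")[0].split()   # drop the legend after '|', if any
    board = fields[0].rstrip(":")
    gen, proc, mem, ioc1, ioc0, cic, pc, xdb, ldpth = fields[1:10]
    return {
        "board": board,
        "gen": STATES[gen],
        "proc": decode(proc),
        # Assumption: the character before '/' is a group-level state.
        "mem_banks": decode(mem.split("/")[1]),
        "ioc1_slots": decode(ioc1.split("/")[1]),
        "ioc0_slots": decode(ioc0.split("/")[1]),
        "cic": decode(cic),
        "pc": decode(pc),
        "xdb": decode(xdb),
        "ldpth": decode(ldpth),
    }


row = parse_board_row("9: G GGGb G/mmGG G/__mG G/__Gm GGGG GGG GGGG GG")
print(row["proc"][0])     # Blacklisted
print(row["mem_banks"])   # ['Good', 'Good', 'Missing', 'Missing']
```

Reversing each field keeps list indexes aligned with the unit numbers used in the text, so proc 9.0 is row["proc"][0].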

How do I analyze a WatchDog-Redmode-Dump file?

The WatchDog-Redmode-Dump file is only useful for reviewing the configuration of a domain. It will not provide any information on the failure, because a watchdog or redmode is a CPU-based failure and not an interconnect-based failure.

With a watchdog or redmode failure, look for a hostresetdump file, which will contain (among other things) the processor states.


Hpost reports Component ID discrepancy

A component ID discrepancy means that hpost has detected a piece of hardware in a domain that is unknown to the scan database on the SSP. The most common occurrence is with new processor modules. Messages can be either WARNINGs or FAILs. For example:

(output omitted)
phase jtag_integ: JTAG probe and integrity test...
WARNING: b/r/c = sysboard12/proc0/spitfire:
Component ID is up-version: Actual A003602F
Expected 9003602F
FAIL b/r/c = sysboard12/proc0/udb0: Component ID discrepancy.
FAIL Actual 00000000; Expected one of:
FAIL 4F643989 or
FAIL 3F643989 or
FAIL 2F643989 or
FAIL 1F643989 or
FAIL 0F643989 or
FAIL 5002602F or
FAIL 1002602F
(output omitted)
If the messages reported are FAILs:

  • Install any/all patches on the SSP that update the scan database with new hardware information.
  • Reboot the SSP. This makes the scan database changes take effect.
  • Run autoconfig on the system board(s) containing the new hardware. Do not run autoconfig on any system board running OS or OBP as it will crash that domain. See the autoconfig man page for details.

If the messages reported are WARNINGs, the domain should operate without problems, provided the board passes POST.
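
Because the action to take depends on whether the messages are FAILs or WARNINGs, it can help to pull just those lines out of a post log. The following Python sketch is an illustration only: the function name and regular expression are assumptions based on the sample output above, not a documented hpost log format.

```python
import re

# Excerpt of the sample post log shown above.
SAMPLE = """\
phase jtag_integ: JTAG probe and integrity test...
WARNING: b/r/c = sysboard12/proc0/spitfire:
Component ID is up-version: Actual A003602F
Expected 9003602F
FAIL b/r/c = sysboard12/proc0/udb0: Component ID discrepancy.
FAIL Actual 00000000; Expected one of:
FAIL 4F643989 or
"""

# Assumption: each report starts with WARNING or FAIL followed by
# "b/r/c = <board>/<resource>/<component>:" as in the sample.
REPORT_RE = re.compile(r"^(WARNING|FAIL):?\s+b/r/c = (\S+?):", re.MULTILINE)


def component_id_reports(log_text):
    """Return (severity, component path) pairs found in the log text."""
    return [(m.group(1), m.group(2)) for m in REPORT_RE.finditer(log_text)]


reports = component_id_reports(SAMPLE)
for severity, path in reports:
    print(severity, path)

# FAILs call for the scan database update steps; WARNINGs alone do not.
needs_scan_db_update = any(sev == "FAIL" for sev, _ in reports)
print(needs_scan_db_update)   # True
```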

Hpost reports bogus Mixed Ecache error

Symptom:

The post log reports something like:

(output omitted)
phase proc1: Initial processor module tests...
FAIL proc 9.3: Mixed Ecache sizes on board.
phase pc/cic_reg: PC and CIC register tests...
(output omitted)
However, the board in question is known to contain four identical processors, and other hpost runs have not failed with this error.

This problem occurs when:

  • A power cycle of the system boards/platform is done
  • AND The "failing" processors are 400MHz
  • AND The SSP is at 3.1.1 or 3.2

The failure occurs only on the first hpost (bringup) immediately following a power cycle of the system boards/platform. The failure does not always occur. Failures have not been observed or reported on SSP 3.0 or 3.1.

Workaround:

Another hpost (bringup) run does not fail. All subsequent hpost runs are also error free, until the next power cycle.

Resolution:

Fixed in SSP 3.4 and later.

How do I determine the memory configuration of a system board?

DIMM information is available only in an interactive redx session from the SSP. The generic command is shdimm, which takes a board and a bank as shown in the examples that follow. The repeat command can be used to loop over the banks. This command outputs all 4 banks on system board 0:

WARNING: shdimm can crash a running domain! 

redx> repeat 4 { shdimm 0 $loopcnt }
DIMMs 0.0[7:0] = 6F 6F 6F 6F 6F 6F 6F 6F
Type 6F:
0F Size/Org[4:0] Type[4:0] 128 MB dimm / 1 GB Bank
3 Speed[1:0] Type[6:5] 60 ns
0 Reserved Type[7]
DIMMs 0.1[7:0] = 6F 6F 6F 6F 6F 6F 6F 6F
Type 6F:
0F Size/Org[4:0] Type[4:0] 128 MB dimm / 1 GB Bank
3 Speed[1:0] Type[6:5] 60 ns
0 Reserved Type[7]
DIMMs 0.2[7:0] = FF FF FF FF FF FF FF FF
Type FF: Empty Socket
DIMMs 0.3[7:0] = FF FF FF FF FF FF FF FF
Type FF: Empty Socket

redx> repeat 4 { shdimm d $loopcnt }
DIMMs D.0[7:0] = 6B 6B 6B 6B 6B 6B 6B 6B
Type 6B:
0B Size/Org[4:0] Type[4:0] 32 MB dimm / 256 MB Bank
3 Speed[1:0] Type[6:5] 60 ns
0 Reserved Type[7]
DIMMs D.1[7:0] = 6B 6B 6B 6B 6B 6B 6B 6B
Type 6B:
0B Size/Org[4:0] Type[4:0] 32 MB dimm / 256 MB Bank
3 Speed[1:0] Type[6:5] 60 ns
0 Reserved Type[7]
DIMMs D.2[7:0] = FF FF FF FF FF FF FF FF
Type FF: Empty Socket
DIMMs D.3[7:0] = FF FF FF FF FF FF FF FF
Type FF: Empty Socket

To rifle through the entire platform, use:

repeat 16 { shdimm $loopcnt 0
shdimm $loopcnt 1
shdimm $loopcnt 2
shdimm $loopcnt 3 }
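
The shdimm output breaks each DIMM type byte into three bit fields: Type[4:0] for size/organization, Type[6:5] for speed, and Type[7] reserved. The following Python sketch reproduces that split. The lookup tables contain only the encodings that appear in the sample output above; other values are reported as "unknown" because this article does not document them. The function name is hypothetical.

```python
# Only the encodings visible in the sample shdimm output are listed.
SIZE_ORG = {
    0x0F: "128 MB dimm / 1 GB Bank",
    0x0B: "32 MB dimm / 256 MB Bank",
}
SPEED = {3: "60 ns"}


def decode_dimm_type(type_byte):
    """Split a DIMM type byte into the fields shown by shdimm."""
    if type_byte == 0xFF:
        return {"empty": True}           # shdimm prints "Empty Socket"
    size_org = type_byte & 0x1F          # Type[4:0]
    speed = (type_byte >> 5) & 0x03      # Type[6:5]
    reserved = (type_byte >> 7) & 0x01   # Type[7]
    return {
        "empty": False,
        "size_org": size_org,
        "size": SIZE_ORG.get(size_org, "unknown"),
        "speed_code": speed,
        "speed": SPEED.get(speed, "unknown"),
        "reserved": reserved,
    }


for value in (0x6F, 0x6B, 0xFF):
    print(f"{value:02X}", decode_dimm_type(value))
```

For 6F this yields size/org 0F, speed 3, and reserved 0, matching the three lines printed under "Type 6F:" above.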

How do I determine what DTAGs are on a system board?

DTAG information can be read from a Recordstop Dump, Arbstop Dump, or a live platform. Dump files only contain the DTAG information for those system boards in the domain that produced the dump file.

redx> repeat 4 { shdtag 0 $loopcnt };

DTAG 0.0 Component IDs[2:0] = 100000E3 100000E3 100000E3
DTAG 0.1 Component IDs[2:0] = 100000E3 100000E3 100000E3
DTAG 0.2 Component IDs[2:0] = 100000E3 100000E3 100000E3
DTAG 0.3 Component IDs[2:0] = 100000E3 100000E3 100000E3


Component ID          SRAM Vendor
00000000              system board not present in dumpfile/platform
100000E3, 100050E3    Sony
01910149, 11910149    IBM

To rifle through the entire platform, use:

repeat 16 { shdtag $loopcnt 0
shdtag $loopcnt 1
shdtag $loopcnt 2
shdtag $loopcnt 3 }
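
When many boards are dumped at once, the Component ID to vendor mapping can be applied with a small script. The following Python sketch is an illustration only; the function name is hypothetical, the line format is taken from the sample shdtag output above, and the vendor table is a reading of the Component ID table in this section.

```python
import re

# Reading of the Component ID / SRAM vendor table in this section.
SRAM_VENDOR = {
    "100000E3": "Sony",
    "100050E3": "Sony",
    "01910149": "IBM",
    "11910149": "IBM",
}

# Assumption: lines look like the sample shdtag output.
DTAG_RE = re.compile(r"^DTAG (\S+) Component IDs\[2:0\] = (.+)$")


def dtag_vendors(line):
    """Return (location, [vendor per component ID]) for one shdtag line."""
    match = DTAG_RE.match(line.strip())
    if match is None:
        return None
    location, ids = match.group(1), match.group(2).split()
    vendors = []
    for component_id in ids:
        if component_id == "00000000":
            vendors.append("board not present")
        else:
            vendors.append(SRAM_VENDOR.get(component_id, "unknown"))
    return location, vendors


print(dtag_vendors("DTAG 0.0 Component IDs[2:0] = 100000E3 100000E3 100000E3"))
# ('0.0', ['Sony', 'Sony', 'Sony'])
```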

How do I read part/serial numbers?

Serial number data can be read for the centerplane (cp), system boards (sys), control boards (ctlbd), centerplane support boards (csb), memory mezzanines (mem), and I/O mezzanines (io).


redx> eepr cp 0
Serial number eeprom of centerplane 0:
Assembly Part Number 501-6509-04 Rev 01 Serial Number 28R301696
Programmed on Thu Jan 16 12:11:11 1997

redx> eepr sys 0
Serial number eeprom of system board 0:
Assembly Part Number 501-4347-10 Rev 50 Serial Number 28Q736115
Programmed on Mon Nov 17 14:28:20 1997

redx> eepr ctlbd 0
Serial number eeprom of control board 0:
Assembly Part Number 501-4345-05 Rev 50 Serial Number 28R301232
Programmed on Mon Apr 7 10:01:15 1997

redx> eepr csb 0
Serial number eeprom of cplane sup board 0:
Assembly Part Number 501-4346-04 Rev 50 Serial Number 28R301054
Programmed on Mon Apr 14 08:36:13 1997

redx> eepr mem 0
Serial number eeprom of memory module 0:
Assembly Part Number 501-4351-04 Rev 50 Serial Number 28R303807
Programmed on Tue Mar 25 10:57:04 1997


redx> eepr io 0
I/O module type on board 0: code = 01: 2 * (SYSIO w/ 2 SBus slots)
Serial number eeprom of I/O module 0:
Assembly Part Number 501-4349-50 Rev 52 Serial Number 28B008654
Programmed on Wed Dec 3 15:05:00 1997

Part/serial number data can be obtained from Arb/Record Stop files for all components except control boards. Control board data requires an interactive redx session from the SSP.
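
If eepr output is captured to a file, the part number, revision, and serial number can be extracted for an inventory. The following Python sketch is a minimal illustration with a hypothetical function name; it assumes the "Assembly Part Number ... Rev ... Serial Number ..." line format shown in the examples above.

```python
import re

# Assumption: the field layout matches the sample eepr output.
EEPR_RE = re.compile(
    r"Assembly Part Number (\S+) Rev (\S+) Serial Number (\S+)"
)


def parse_eepr(text):
    """Return a list of (part number, revision, serial number) tuples."""
    return EEPR_RE.findall(text)


sample = """\
Serial number eeprom of system board 0:
Assembly Part Number 501-4347-10 Rev 50 Serial Number 28Q736115
Programmed on Mon Nov 17 14:28:20 1997
"""
print(parse_eepr(sample))
# [('501-4347-10', '50', '28Q736115')]
```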

How do I read thermcal data?

Thermcal data is only applicable to system boards and the centerplane. This command is primarily used to validate that thermcal data has been written to a component.

redx> eepr -T sys 7
Serial number eeprom of system board 7:
Assembly Part Number 501-4347-09 Rev 50 Serial Number 28Q736308
Programmed on Mon Mar 24 08:33:37 1997
Asic thermistors calibrated at 26.953 degrees-C. 5 thermistors:
9B3 89F 949 8FB 86C

redx> eepr -T cp 0
Serial number eeprom of centerplane 0:
Assembly Part Number 501-6509-04 Rev 01 Serial Number 28R301696
Programmed on Thu Jan 16 12:11:11 1997
Asic thermistors calibrated at 25.228 degrees-C. 10 thermistors:
994 9FC A1D 9FC 9A4 9E4 99F A3C 874 884

The component will report an error if it has not been thermcal'ed.  Part/serial number data for system boards can be obtained from Arb/Record Stop files if data for those boards is included in the dump. Centerplane information must be collected in an interactive redx session from the SSP. 

How do I see what hpost passed to OBP?

After hpost finishes, it builds a 'post2obp' structure and stores it in the BBSRAM of the bootproc of a given domain. It's also known as the board descriptor array. The structure can be viewed using an interactive redx session.

  1. Obtain the boot processor of the domain in question.
     # cat /var/opt/SUNWssp/etc/southpark/kenny/bootproc
     32
  2. Using redx, set the current processor to the bootproc and dump the 'post2obp' structure.
redx> proc 32
Current proc set to 8.0 = 32
redx> p2o
p2o_magic = XFPOST_2OBP p2o_struct_version = 010F0000
Created by pid = 864 running at level 17 on Mon Nov 22 15:30:36 1999
Bus configuration = 3F ShuffleMode = 0 Flags = 00000000
Interconnect Freq = 99902435 Hz Processor Ext Freq = 199804870 Hz.
Processor Internal to Interconnect frequency ratio = 4.

Board Descriptor Array:
Proc M/Grp IOC/Slot CIC PC XDB LDPTH
Brd Gen 3210 3210 1/3210 0/3210 3210 210 3210 10
0: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | G=Good
1: x xxxx x/xxxx x/xxxx x/xxxx xxxx xxx xxxx xx | f=Failed
2: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | m=Missing
3: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | b=Blacklisted
4: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | r=Redlisted
5: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | c=Crunched
6: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | _=Undefined
7: x rrrr r/rrrr r/rrrr r/rrrr rrrr rrr rrrr rr | x=NotInDomain
8: G GGGG G/mmGG G/__mG G/__mG GGGG GGG GGGG GG | u=G,unconfig
9: G GGGb G/mmGG G/__mG G/__Gm GGGG GGG GGGG GG
A: G GGGG G/mmGG G/__GG G/__GG GGGG GGG GGGG GG
B: G GGGG G/GGGG G/__Gm G/__GG GGGG GGG GGGG GG
C: c mmmm m/cccc c/__mm c/__mm cccc ccc cccc cc
D: G GGGG G/mmGG G/__GG G/__mG GGGG GGG GGGG GG
E: G GGGG G/GGGG c/__mm c/__mm GGGG GGG GGGG GG
F: G GGGG G/bmGG G/__GG G/__mG GGGG GGG GGGG GG

Memory total: 7 chunks, 2162688 8KB pages (16896 MBytes):
PA = 010.00000000 262144 Pages (2048 MBytes)
PA = 012.00000000 262144 Pages (2048 MBytes)
PA = 014.00000000 262144 Pages (2048 MBytes)
PA = 016.00000000 524288 Pages (4096 MBytes)
PA = 01A.00000000 65536 Pages (512 MBytes)
PA = 01C.00000000 524288 Pages (4096 MBytes)
PA = 01E.00000000 262144 Pages (2048 MBytes)
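
The "Memory total" line can be cross-checked against the individual chunks: each chunk's page count times the 8 KB page size gives its size in MBytes, and the chunks sum to the reported total. The following Python sketch performs that arithmetic on the output above. The function name is hypothetical and the parsing assumes the "PA = ... Pages (... MBytes)" layout shown in the sample.

```python
import re

P2O_MEMORY = """\
Memory total: 7 chunks, 2162688 8KB pages (16896 MBytes):
PA = 010.00000000 262144 Pages (2048 MBytes)
PA = 012.00000000 262144 Pages (2048 MBytes)
PA = 014.00000000 262144 Pages (2048 MBytes)
PA = 016.00000000 524288 Pages (4096 MBytes)
PA = 01A.00000000 65536 Pages (512 MBytes)
PA = 01C.00000000 524288 Pages (4096 MBytes)
PA = 01E.00000000 262144 Pages (2048 MBytes)
"""

PAGE_KB = 8   # page size stated on the "Memory total" line
CHUNK_RE = re.compile(r"^PA = \S+ (\d+) Pages \((\d+) MBytes\)", re.MULTILINE)


def summarize_memory(text):
    """Return (chunk count, total pages, total MBytes) from p2o output."""
    chunks = [(int(pages), int(mb)) for pages, mb in CHUNK_RE.findall(text)]
    for pages, mb in chunks:
        # Each chunk must satisfy pages * 8 KB == reported MBytes.
        if pages * PAGE_KB // 1024 != mb:
            raise ValueError(f"inconsistent chunk: {pages} pages, {mb} MB")
    total_pages = sum(pages for pages, _ in chunks)
    return len(chunks), total_pages, total_pages * PAGE_KB // 1024


print(summarize_memory(P2O_MEMORY))   # (7, 2162688, 16896)
```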







Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.