Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition
Solution Type: Technical Instruction Sure Solution

1012598.1: Understanding and Decoding Machine Check Errors on Opteron systems running Solaris[TM] Operating System for x86 Platforms
Previously Published As 217340

Applies to:
Sun Ultra 40 M2 Workstation - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Blade X6220 Server Module - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Blade X6240 Server Module - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
Sun Fire X4100 Server - Version: Not Applicable to Not Applicable [Release: N/A to N/A]
All Platforms

Goal
The machine check mechanism on the AMD64 processor allows it to detect and report hardware errors. When an unrecoverable Machine Check Error is detected, a Machine Check Exception (MCE) is generated. When this happens, Solaris[TM] Operating System for x86 Platforms can panic with a trap of type 0x12 (Machine check exception). When the Solaris[TM] Operating System (OS) receives a Machine Check Exception, it prints MCE warning messages just before the panic.

Example of the warning:

WARNING: MCE: Bank 0: error code 15:addr = cf53c000, model errcode = 0

The meaning of an MCE message depends on the processor. This document explains how to read MCE messages on AMD Opteron based systems.

Solution
Understanding and Decoding Machine Check Errors on Opteron systems running Solaris[TM] Operating System for x86 Platforms.

An MCE message has three or four values: bank, error code, address (not always present) and model error code. All values are hexadecimal numbers, whether or not the "0x" prefix is shown.

WARNING: MCE: Bank 0: error code 15:addr = cf53c000, model errcode = 0

- Bank (THIS HAS NOTHING TO DO WITH MEMORY BANK)
  Opteron processors have five error reporting banks associated with specific hardware blocks.
  These banks correspond to Bank 0-4 in the following order:

    Bank 0  Data Cache Unit (DC)
    Bank 1  Instruction Cache Unit (IC)
    Bank 2  Bus Unit (BU)
    Bank 3  Load/Store Unit (LS)
    Bank 4  Northbridge (NB)

- Error code
  The error code field shows the 16-bit MCA (Machine Check Architecture) Error Code contained in the Machine Check Status Registers. The MCA Error Code has the following format.

  Error Value          Error Type      Description
  -------------------  --------------  ----------------------------------
  0000 0000 0001 TTLL  TLB errors      Errors in the GART TLB cache
  0000 0001 RRRR TTLL  Memory errors   Errors in the cache hierarchy
  0000 1PPT RRRR IILL  Bus errors      General bus errors, including errors
                                       in the HyperTransport link or DRAM

  Transaction Type Bits (TT)
    00  Instruction
    01  Data
    10  Generic
    11  reserved

  Cache Level Bits (LL)
    00  Level 0
    01  Level 1
    10  Level 2
    11  Generic

  Participation Processor Bits (PP)
    00  Local node originated the request
    01  Local node responded to the request
    10  Local node observed error as 3rd party
    11  Generic

  Time-out Bit (T)
    0   Request did not time out
    1   Request timed out

  Memory Transaction Type Bits (RRRR)
    0000  Generic error
    0001  Generic read
    0010  Generic write
    0011  Data read
    0100  Data write
    0101  Instruction fetch
    0110  Prefetch
    0111  Evict
    1000  Snoop

  Memory or I/O Bits (II)
    00  Memory access
    01  reserved
    10  I/O access
    11  Generic

- Address
  The address field shows the address at which the Machine Check Exception occurred.

- Model error code
  The model error code field shows the 4-bit Extended Error Code contained in the Machine Check Status Registers.
  The meaning depends on the reporting bank:

  DC/IC
    0000  TLB parity error in physical array
    0001  TLB parity error in virtual array (multi-match error)
  BU
    0000  Bus or cache data array error
    0010  Cache tag array error
  LS
    Reserved
  NB
    0000  ECC error
    0001  CRC error
    0010  Sync error
    0011  Master Abort
    0100  Target Abort
    0101  GART error
    0110  RMW error
    0111  Watchdog error
    1000  ChipKill ECC error

Detailed information for decoding the message is available in AMD's document "BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors".

Decode examples:

WARNING: MCE: Bank 0: error code 15:addr = cf53c000, model errcode = 0
  TLB error at the L1 cache, detected by the Data Cache Unit.

WARNING: MCE: Bank 2: error code 152:addr = e748, model errcode = 2
  Tag parity error during an instruction fetch at the L2 cache, detected by the Bus Unit.

WARNING: MCE: Bank 4: error code 0xf0f, mserrcode = 0x7
  Watchdog error detected by the Northbridge.

References
BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors
AMD64 Architecture Programmer's Manual Volume 2: System Programming
Other MCE Related InfoDocs

Technical Instruction

Product
Solaris 9 Operating System for x86 Platforms
Solaris 10 Operating System for x86 Platforms
Sun Fire X4200 Server
Sun Fire X4100 Server
Sun Fire X2100 Server
Sun Fire V40z Server
Sun Fire V20z Server
Sun Java Workstation W2100z
Sun Ultra 20 Workstation

MCE, x86, opteron, x64, amd64

Previously Published As 82833

Change History
Date: 2005-12-14  User Name: 97961   Action: Approved  Comment: Publishing. No further edits required.  Version: 7
Date: 2005-12-14  User Name: 97961   Action: Accept    Comment:  Version: 0
Date: 2005-12-14  User Name: 105028  Action: Approved  Comment: Checked all links, they all work now.  Version: 0

Attachments
This solution has no attachment
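
Decoding sketch
The decoding rules in the Solution section can be applied mechanically. The following Python sketch is illustrative only and is not part of Solaris or of any Sun tool. The names decode_mce, decode_error_code and MCE_RE are hypothetical, the regular expression assumes only the two message layouts shown in the decode examples, and the bank-to-unit order (DC, IC, BU, LS, NB) is taken from the AMD BIOS and Kernel Developer's Guide.

```python
import re

# Bank number -> reporting block (order assumed from the AMD BKDG).
BANKS = ["Data Cache Unit", "Instruction Cache Unit", "Bus Unit",
         "Load/Store Unit", "Northbridge"]
TT = ["Instruction", "Data", "Generic", "reserved"]
LL = ["Level 0", "Level 1", "Level 2", "Generic"]
PP = ["Local node originated the request",
      "Local node responded to the request",
      "Local node observed error as 3rd party", "Generic"]
II = ["Memory access", "reserved", "I/O access", "Generic"]
RRRR = ["Generic error", "Generic read", "Generic write", "Data read",
        "Data write", "Instruction fetch", "Prefetch", "Evict", "Snoop"]

# Extended (model) error codes, keyed by bank number.
CACHE_EXT = {0: "TLB parity error in physical array",
             1: "TLB parity error in virtual array (multi-match error)"}
BU_EXT = {0: "Bus or cache data array error", 2: "Cache tag array error"}
NB_EXT = dict(enumerate(["ECC error", "CRC error", "Sync error",
                         "Master Abort", "Target Abort", "GART error",
                         "RMW error", "Watchdog error", "ChipKill ECC error"]))
EXT_BY_BANK = {0: CACHE_EXT, 1: CACHE_EXT, 2: BU_EXT, 3: {}, 4: NB_EXT}

# Hypothetical pattern covering only the two layouts shown in this document.
HEX = r"[0-9a-fA-Fx]+"
MCE_RE = re.compile(
    rf"MCE: Bank (?P<bank>{HEX}): error code (?P<code>{HEX})"
    rf"(?::\s*addr = (?P<addr>{HEX}))?"
    rf",\s*(?:model errcode|mserrcode) = (?P<ext>{HEX})")


def decode_error_code(code):
    """Split the 16-bit MCA error code into the fields listed in the table."""
    fields = {"cache_level": LL[code & 0x3]}
    if code >> 4 == 0b000000000001:        # 0000 0000 0001 TTLL
        fields.update(error_type="TLB", transaction=TT[(code >> 2) & 0x3])
    elif code >> 8 == 0b00000001:          # 0000 0001 RRRR TTLL
        rrrr = (code >> 4) & 0xF
        fields.update(
            error_type="Memory", transaction=TT[(code >> 2) & 0x3],
            mem_transaction=RRRR[rrrr] if rrrr < len(RRRR) else "undefined")
    elif code >> 11 == 0b00001:            # 0000 1PPT RRRR IILL
        rrrr = (code >> 4) & 0xF
        fields.update(
            error_type="Bus", participation=PP[(code >> 9) & 0x3],
            timed_out=bool((code >> 8) & 0x1),
            mem_transaction=RRRR[rrrr] if rrrr < len(RRRR) else "undefined",
            mem_or_io=II[(code >> 2) & 0x3])
    else:
        return {"error_type": "unknown"}
    return fields


def decode_mce(line):
    """Decode one MCE warning line; every value is parsed as hexadecimal."""
    m = MCE_RE.search(line)
    if m is None:
        raise ValueError("not an MCE warning line")
    bank, ext = int(m["bank"], 16), int(m["ext"], 16)
    result = decode_error_code(int(m["code"], 16))
    result["bank"] = BANKS[bank] if bank < len(BANKS) else "unknown"
    result["address"] = int(m["addr"], 16) if m["addr"] else None
    result["extended"] = EXT_BY_BANK.get(bank, {}).get(ext, "reserved")
    return result
```

Applied to the first decode example, decode_mce reports a TLB error with a Data transaction at Level 1 from the Data Cache Unit, which matches the manual decoding given above. Messages in any other layout raise ValueError rather than being guessed at.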