Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1012598.1
Update Date:2011-06-03
Keywords:

Solution Type  Technical Instruction Sure

Solution  1012598.1 :   Understanding and Decoding Machine Check Errors on Opteron systems running Solaris[TM] Operating System for x86 Platforms  


Related Items
  • Sun Fire X4600 M2 Server
  • Sun Fire X2100 Server
  • Sun Fire X4100 M2 Server
  • Sun Blade X8400 Server Module
  • Sun Blade X6220 Server Module
  • Sun Ultra 20 M2 Workstation
  • Sun Fire X4640 Server
  • Sun Blade X6420 Server Module
  • Sun Fire X4100 Server
  • Sun Fire X2100 M2 Server
  • Sun Java Workstation W2100z
  • Sun Blade X6440 Server Module
  • Sun Blade X8420 Server Module
  • Sun Fire X4240 Server
  • Sun Fire X4600 Server
  • Sun Fire V20z Server
  • Sun Fire X4200 Server
  • Sun Blade 6000 System
  • Sun Fire V40z Server
  • Sun Fire X2200 M2 Server
  • Sun Netra X4200 M2 Server
  • Sun Fire X4500 Server
  • Sun Blade 8000 System
  • Sun Ultra 20 Workstation
  • Sun Blade X8440 Server Module
  • Sun Fire X4540 Server
  • Sun Ultra 40 M2 Workstation
  • Solaris x64/x86 Operating System
  • Sun Java Workstation W1100z
  • Sun Fire X4440 Server
  • Sun Blade X6240 Server Module
Related Categories
  • GCS>Sun Microsystems>Servers>x64 Servers

PreviouslyPublishedAs
217340


Applies to:

Sun Ultra 40 M2 Workstation - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Blade X6220 Server Module - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Blade X6240 Server Module - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Fire X4100 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
All Platforms

Goal

The machine check mechanism on the AMD64 processor allows it to detect and report hardware errors.

When an unrecoverable Machine Check Error is detected, a Machine Check Exception is generated.

When such a condition occurs, the Solaris[TM] Operating System for x86 Platforms can panic with a trap of type 0x12 (Machine check exception). When the Solaris[TM] Operating System (OS) receives a machine check exception, it displays MCE warning messages just before the panic.

Examples of the warning:

WARNING: MCE: Bank 0: error code 15:addr = cf53c000, model errcode = 0
WARNING: MCE: Bank 2: error code 152:addr = e748, model errcode = 2
WARNING: MCE: Bank 4: error code 0xf0f, mserrcode = 0x7

The meaning of an MCE message depends on the processor. This document explains how to interpret MCE messages on AMD Opteron based systems.

Solution


Understanding and Decoding Machine Check Errors on Opteron systems running Solaris[TM] Operating System for x86 Platforms.

An MCE message has three or four values: bank, error code, address (not always present), and model error code. All values are displayed as hexadecimal numbers, whether or not they carry the "0x" prefix.

WARNING: MCE: Bank 0: error code 15:addr = cf53c000, model errcode = 0
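
As an illustration of the field layout described above, the following Python sketch splits an MCE warning line into its values. The pattern is inferred only from the sample messages in this document, not from the Solaris source, and the names `MCE_RE` and `parse_mce` are hypothetical. Every value is read as hexadecimal, with or without the "0x" prefix.

```python
import re

# Assumption: message layout inferred from the sample warnings in this
# document; other Solaris releases may print the fields differently.
HEX = r"(?:0x)?[0-9a-fA-F]+"
MCE_RE = re.compile(
    r"MCE: Bank (?P<bank>" + HEX + r"): "
    r"error code (?P<code>" + HEX + r")"
    r"(?::addr = (?P<addr>" + HEX + r"))?"          # address is optional
    r", (?:model errcode|mserrcode) = (?P<model>" + HEX + r")"
)

def parse_mce(line):
    """Return the MCE fields as integers, or None if the line does not match."""
    match = MCE_RE.search(line)
    if match is None:
        return None
    # int(text, 16) accepts hexadecimal text with or without a "0x" prefix.
    return {name: (int(text, 16) if text is not None else None)
            for name, text in match.groupdict().items()}

if __name__ == "__main__":
    print(parse_mce("WARNING: MCE: Bank 0: error code 15:addr = cf53c000, "
                    "model errcode = 0"))
```

When the message carries no address, as in the Bank 4 sample, the `addr` entry is `None`.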

- Bank (THIS HAS NOTHING TO DO WITH MEMORY BANK)

Opteron processors have five error reporting banks associated with specific hardware blocks.

  • Data Cache unit (DC) - Includes the cache structures that hold data and tags, the data TLBs, and the data cache probing logic.
  • Instruction Cache unit (IC) - Includes the instruction cache structures that hold instructions and tags, the instruction TLBs, and the instruction cache probing logic.
  • Bus unit (BU) - Includes the system bus interface to the Northbridge and the level 2 cache.
  • Load/Store unit (LS) - Includes logic used to manage loads and stores.
  • Northbridge unit (NB) - Includes the Northbridge and DRAM controller.

These units correspond to Bank 0 through Bank 4, in the order listed: DC is Bank 0, IC is Bank 1, BU is Bank 2, LS is Bank 3, and NB is Bank 4.
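
The bank-to-unit correspondence can be written as a small lookup table. This is only a sketch; the names `BANK_UNITS` and `bank_unit` are hypothetical and are not part of any Solaris or AMD interface.

```python
# Bank numbers and units as listed above (Bank 0-4, in order).
BANK_UNITS = {
    0: ("DC", "Data Cache unit"),
    1: ("IC", "Instruction Cache unit"),
    2: ("BU", "Bus unit"),
    3: ("LS", "Load/Store unit"),
    4: ("NB", "Northbridge unit"),
}

def bank_unit(bank):
    """Return (abbreviation, name) for a bank number from an MCE message."""
    # Banks outside 0-4 are not described in this document.
    return BANK_UNITS.get(bank, ("??", "unknown bank"))

if __name__ == "__main__":
    for number in range(5):
        print(number, *bank_unit(number))
```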

- Error code

The error code field shows the 16-bit MCA (Machine Check Architecture) Error Code contained in the Machine Check Status Registers. The MCA Error Code has the following format.

    Error Value        Error Type     Description
-------------------  -------------  ------------------------------------
0000 0000 0001 TTLL  TLB errors     Errors in the GART TLB cache
0000 0001 RRRR TTLL  Memory errors  Errors in the cache hierarchy
0000 1PPT RRRR IILL  Bus errors     General bus errors, including errors
                                    in the HyperTransport link or DRAM

 Transaction Type Bits (TT)
   00  Instruction
   01  Data
   10  Generic
   11  reserved

 Cache Level Bits (LL)
   00  Level 0
   01  Level 1
   10  Level 2
   11  Generic

 Participation Processor Bits (PP)
   00  Local node originated the request
   01  Local node responded to the request
   10  Local node observed error as 3rd party
   11  Generic

 Time-out Bit (T)
   0  Request did not time out
   1  Request timed out

 Memory Transaction Type Bits (RRRR)
   0000  Generic error
   0001  Generic read
   0010  Generic write
   0011  Data read
   0100  Data write
   0101  Instruction fetch
   0110  Prefetch
   0111  Evict
   1000  Snoop

 Memory or I/O Bits (II)
   00  Memory access
   01  reserved
   10  I/O access
   11  Generic
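
The bit layout above lends itself to a mask-and-shift decode. The following Python sketch classifies a 16-bit error code and extracts its sub-fields; the function and table names are hypothetical, and RRRR values above 1000 are reported as "not listed" because this document does not describe them.

```python
# Field tables transcribed from the format description above.
TT = ("Instruction", "Data", "Generic", "reserved")
LL = ("Level 0", "Level 1", "Level 2", "Generic")
PP = ("Local node originated the request",
      "Local node responded to the request",
      "Local node observed error as 3rd party",
      "Generic")
II = ("Memory access", "reserved", "I/O access", "Generic")
RRRR = ("Generic error", "Generic read", "Generic write", "Data read",
        "Data write", "Instruction fetch", "Prefetch", "Evict", "Snoop")

def _rrrr(code):
    value = (code >> 4) & 0xF
    # Values above 1000 (binary) are not described in this document.
    return RRRR[value] if value < len(RRRR) else "not listed"

def decode_error_code(code):
    """Hypothetical helper: split a 16-bit MCA error code into named fields."""
    code &= 0xFFFF
    ll = LL[code & 0x3]
    if (code & 0xFFF0) == 0x0010:            # 0000 0000 0001 TTLL
        return {"type": "TLB error", "TT": TT[(code >> 2) & 0x3], "LL": ll}
    if (code & 0xFF00) == 0x0100:            # 0000 0001 RRRR TTLL
        return {"type": "Memory error", "RRRR": _rrrr(code),
                "TT": TT[(code >> 2) & 0x3], "LL": ll}
    if (code & 0xF800) == 0x0800:            # 0000 1PPT RRRR IILL
        return {"type": "Bus error", "PP": PP[(code >> 9) & 0x3],
                "timed_out": bool((code >> 8) & 0x1), "RRRR": _rrrr(code),
                "II": II[(code >> 2) & 0x3], "LL": ll}
    return {"type": "unrecognized"}

if __name__ == "__main__":
    for sample in (0x15, 0x152, 0xF0F):
        print(hex(sample), decode_error_code(sample))
```

Applied to the error codes from the sample warnings, 15 decodes as a TLB error (Data, Level 1), 152 as a memory error (Instruction fetch, Instruction, Level 2), and f0f as a bus error with the time-out bit set and all other fields Generic.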

- Address

The address field shows the address at which the machine check exception occurred. It is not present in every message.

- Model error code

The model error code field shows the 4-bit Extended Error Code contained in the Machine Check Status Registers.

Its meaning depends on the reporting bank:

 DC/IC
   0000  TLB parity error in physical array
   0001  TLB parity error in virtual array (multi-match error)

 BU
   0000  Bus or cache data array error
   0010  Cache tag array error

 LS
   Reserved

 NB
   0000  ECC error
   0001  CRC error
   0010  Sync error
   0011  Master Abort
   0100  Target Abort
   0101  GART error
   0110  RMW error
   0111  Watchdog error
   1000  ChipKill ECC error
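
Because the same extended error code means different things in different banks, a lookup needs both the bank and the code. The sketch below keys the table above by bank number; `describe_model_errcode` and the table names are hypothetical, and any code not listed in this document is reported as such rather than guessed.

```python
# Extended error codes per unit, transcribed from the table above.
_TLB_CODES = {
    0x0: "TLB parity error in physical array",
    0x1: "TLB parity error in virtual array (multi-match error)",
}
EXT_CODES = {
    0: _TLB_CODES,                      # DC
    1: _TLB_CODES,                      # IC
    2: {0x0: "Bus or cache data array error",
        0x2: "Cache tag array error"},  # BU
    3: {},                              # LS: reserved
    4: {0x0: "ECC error", 0x1: "CRC error", 0x2: "Sync error",
        0x3: "Master Abort", 0x4: "Target Abort", 0x5: "GART error",
        0x6: "RMW error", 0x7: "Watchdog error",
        0x8: "ChipKill ECC error"},     # NB
}

def describe_model_errcode(bank, model):
    """Describe a model error code for the given bank number (0-4)."""
    if bank not in EXT_CODES:
        return "unknown bank"
    # Only the low 4 bits form the Extended Error Code.
    return EXT_CODES[bank].get(model & 0xF, "reserved or not listed")

if __name__ == "__main__":
    print(describe_model_errcode(4, 0x7))
```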

Detailed information for decoding the message is available in AMD's document:

"BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors"

Decode examples:

  WARNING: MCE: Bank 0: error code 15:addr = cf53c000, model errcode = 0
  Data TLB error at cache level 1 (TLB parity error in physical array), detected by the Data Cache unit.

  WARNING: MCE: Bank 2: error code 152:addr = e748, model errcode = 2
  Tag parity error during an instruction fetch at the L2 cache, detected by the Bus unit.

  WARNING: MCE: Bank 4: error code 0xf0f, mserrcode = 0x7
  Watchdog error (bus error, request timed out), detected by the Northbridge unit.

References

BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors

AMD64 Architecture Programmer's Manual Volume 2: System Programming

Other MCE Related InfoDocs

Technical Instruction - KERNEL: Understanding Trap Type 18 (0x12) Panics on x86 platform
Technical Instruction - Understanding and Decoding Machine Check Errors on Sun Fire[TM] V20z/Sun Fire[TM] V40z running Red Hat OS



Product
Solaris 9 Operating System for x86 Platforms
Solaris 10 Operating System for x86 Platforms
Sun Fire X4200 Server
Sun Fire X4100 Server
Sun Fire X2100 Server
Sun Fire V40z Server
Sun Fire V20z Server
Sun Java Workstation W2100z
Sun Ultra 20 Workstation

MCE, x86, opteron, x64, amd64
Previously Published As
82833

Change History
Date: 2005-12-14
User Name: 97961
Action: Approved
Comment: Publishing. No further edits required.
Version: 7

Date: 2005-12-14
User Name: 97961
Action: Accept
Comment:
Version: 0

Date: 2005-12-14
User Name: 105028
Action: Approved
Comment: Checked all links, they all work now.
Version: 0

  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.