Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1019683.1
Update Date:2011-05-27
Keywords:

Solution Type  Technical Instruction Sure

Solution  1019683.1 :   How to analyze Memory Errors on x64 Servers running Linux using HERD  


Related Items
  • Sun Fire X4600 M2 Server
  • Sun Fire X4200 M2 Server
  • Sun Blade X6220 Server Module
  • Sun Fire X2100 Server
  • Sun Fire X4100 M2 Server
  • Sun Fire X4640 Server
  • Sun Netra X4200 Server
  • Sun Fire X4140 Server
  • Sun Blade X6420 Server Module
  • Sun Fire X4100 Server
  • Sun Fire X2100 M2 Server
  • Sun Fire X4600 Server
  • Sun Fire X4200 Server
  • Sun Fire X4240 Server
  • Sun Fire X4500 Server
  • Sun Fire X2200 M2 Server
  • Sun Netra X4200 M2 Server
  • Sun Fire X4540 Server
  • Sun Fire X4440 Server
  • Sun Blade X6240 Server Module
Related Categories
  • GCS>Sun Microsystems>Servers>x64 Servers
  • GCS>Sun Microsystems>Servers>NEBS-Certified Servers
  • GCS>Sun Microsystems>Servers>Blade Servers

PreviouslyPublishedAs
243706


Applies to:

Sun Fire X4600 M2 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Fire X4600 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Fire X4640 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Netra X4200 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Sun Fire X4440 Server - Version: Not Applicable to Not Applicable   [Release: N/A to N/A]
Microsoft Windows (32-bit)
Red Hat Enterprise Linux Advanced Server x86-64 (AMD Opteron Architecture)
Microsoft Windows x64 (64-bit)
All Platforms

Goal

Memory errors on x64 systems running Linux will give rise to a machine check exception (MCE). This document will tell you how to analyze these MCEs using HERD.


Symptoms:
  • Silent Northbridge MCE
  • panic
  • performance reduction
  • memory capacity reduction
  • reset
Purpose/Scope:

This document describes which troubleshooting steps to take when trying to resolve memory errors on x64 systems running the Linux operating system. A memory error will give rise to a machine check exception (MCE).

In the case of correctable memory errors, the MCE will be reported on the console and in the system log file /var/log/messages.

In the case of an uncorrectable memory error, the platform will reset, and an MCE entry may appear in the system log file /var/log/messages after the reboot.

This log entry is not guaranteed to appear, because the severity of the platform crash determines the platform's ability to report errors.
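
As a minimal sketch of checking the system log for such entries, the helper below searches a log file for two strings that appear in the sample log lines in this document. The function name find_mce_lines and the exact match patterns are illustrative assumptions; other kernel versions may word these lines differently.

```shell
# Hypothetical helper: print log lines that look like MCE reports.
# The two patterns are copied from the sample log lines in this
# document and are an assumption; adjust them for other kernels.
find_mce_lines() {
    logfile="${1:-/var/log/messages}"
    grep -E 'Northbridge MCE|herd: HARDWARE ERROR' "$logfile"
}
```

For example, find_mce_lines /var/log/messages prints any matching lines and returns a non-zero status if none are found.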

Solution

Steps to Follow

What is HERD?

Hardware Error Report and Decode (HERD) is a tool for monitoring, decoding, and reporting of hardware errors. These errors are also known as Machine Check Exceptions (MCE). During error decoding, HERD attempts to provide as much information as possible from the data supplied by the CPU. In particular, physical addresses obtained from correctable ECC memory errors are matched to the corresponding CPU slot and DIMM number.

How to get hold of HERD

Sun product patches, updates and firmware are now available on My Oracle Support from the Patches and Updates tab.

Information on accessing and using My Oracle Support can be found at the My Oracle Support Welcome Center for Oracle Sun Customers and Partners.

To find your download on My Oracle Support:

1. Sign in to My Oracle Support.
2. Click the "Patches & Updates" tab.
3. In the "Patch Search" box on the right side, select "Product or Family (Advanced Search)".
4. In the "Product is" field, enter the full or partial product name "Hardware Error Report and Decode Tool". A list of matches will be displayed. Select the product of interest.
5. Select one or more releases in the "Release is" drop-down and close the pop-up window.
6. Click Search. A list of product downloads (listed as patches) will be displayed. Select the download of interest. This will take you to the Download Information Page.

Patch selection example

If, on the Download Information Page, you get the message "You do not have permissions to download this Patch...", see How Patches and Updates Entitlement Works to help you determine the reason.

HERD versions

HERD is currently available in four distributions: 1.x Linux releases, 2.x Linux releases, 3.x Linux releases, and 3.x Windows releases. In general, 1.x releases targeted specific hardware, whereas 2.x and later are the common releases for all AMD hardware.

How to install Windows HERD

  1. Download WinHERD_v3.x.x.zip.
  2. The zip file contains two installers. Run the one that matches your operating system: setup.exe for Windows 2003, or herdsetup.msi for Windows 2008.
  3. Follow the instructions provided by the installation routine.
  4. If DotNET is not installed, the installer will prompt you to download it from Microsoft. If you are not connected to the Internet, you can install it from the "Add or Remove Programs" section of the Control Panel.
  5. After installation, the service will start automatically (and on each reboot).
  6. Open the HERD Log in the Windows event logs. It should contain a log entry showing that the service has started.

WinHERD is written using the Microsoft DotNET framework, so DotNET must be installed for WinHERD to display its interface correctly.

HERD Windows 3.x version packages

Release        Package Designation
Windows 2003   WinHERD_v3.x.x.zip (setup.exe); requires dotnetfx.exe (Dot Net 2.0)
Windows 2008   WinHERD_v3.x.x.zip (herdsetup.msi); Windows 2008 already includes the Microsoft Dot Net framework

HERD decoding Windows

In Windows, Machine Check Events are reported to the Windows "Event Log" and the user is prompted with a dialog box to confirm the event has been seen/reported. HERD adds a HERD Log to the Windows event reporting instrumentation to catalogue and report seen MCE events.

1. Click "Start", then "Administrative Tools", then "Event Viewer".
2. In the left menu of the Event Viewer window, click "HERD Log". The window shows an "Information" event.
3. Double-click "Information" to view the Event Properties. This shows the HERD startup status.

Example of HERD location in Windows event logs

The example above shows where to locate HERD events if required.

Example of HERD in Windows HERD event list

The example above is the "Information" event log entry, which shows the startup status of WinHERD. It reports the platform type, the number of processor sockets (nodes), and the number of physical processor cores per socket (logical processors). Finally, HERD reports that it has started and is running.

Example of DIMM event in Windows

The HERD Log within the Windows Event Viewer is the default location in which MCE errors are reported. When an MCE is detected and violates the default Sun DIMM replacement policy, an event appears in the HERD Log with details of the failure condition and recommendations. The example above shows such a DIMM event.

 

HERD Linux 3.x version packages

Release         Package Designation
RedHat RHEL4    herd-3.x-x.rh4.x86_64.rpm
RedHat RHEL5    herd-3.x-x.rh5.x86_64.rpm
Novell SLES9    herd-3.x-x.sl9.x86_64.rpm
Novell SLES10   herd-3.x-x.sl10.x86_64.rpm
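
The package list above can be expressed as a small lookup, sketched here. The function name herd_pkg_suffix and its two arguments (a distribution keyword and a major release number) are hypothetical and chosen only for illustration; the suffixes are the ones listed in the table.

```shell
# Hypothetical helper: map a distribution and major release to the
# suffix used in the HERD 3.x RPM file names listed above.
# Arguments: $1 = "rhel" or "sles" (assumed keywords), $2 = major release.
herd_pkg_suffix() {
    case "$1:$2" in
        rhel:4)  echo rh4 ;;
        rhel:5)  echo rh5 ;;
        sles:9)  echo sl9 ;;
        sles:10) echo sl10 ;;
        *)
            echo "no HERD 3.x package listed for $1 $2" >&2
            return 1
            ;;
    esac
}
```

For example, herd_pkg_suffix rhel 5 prints rh5, which corresponds to herd-3.x-x.rh5.x86_64.rpm.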

How to install HERD Linux

  • To install the RPM under Linux, run the following command:

rpm -Uhv herd-2x-x.xxx.x86_64.rpm

Each RPM has a set of run-time dependencies that are enforced by RPM, such as the openssl libraries or the OpenIPMI scripts. If a dependency is missing, RPM reports an error and you will have to install it manually.

  • To install a missing dependency on SuSE Enterprise Linux, use the yast utility. For example:

yast2 -i OpenIPMI

  • To install a missing dependency on Red Hat Enterprise Linux, use up2date or system-config-packages. For example:

up2date -i openssl

Start the HERD daemon

  • For SuSE Enterprise Linux 10 and Red Hat Enterprise Linux, type:

service herd start

  • For SuSE Enterprise Linux 9, type:


/etc/init.d/herd start

When the following message appears in the system log, then HERD is running successfully:

/var/log/messages: herd: IPMI connection fully operational
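
A minimal sketch of checking for that message from a script follows. The function name herd_startup_logged is hypothetical. The check only confirms that the startup message was logged at some point; it does not prove that the daemon is still running.

```shell
# Hypothetical helper: succeed if the HERD startup message quoted
# above is present in the given log file.
herd_startup_logged() {
    logfile="${1:-/var/log/messages}"
    grep -q 'herd: IPMI connection fully operational' "$logfile"
}
```

For example: herd_startup_logged /var/log/messages && echo "HERD startup message found"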


HERD decoding Linux

Once the HERD daemon is running, any correctable MCEs that occur on the system are reported both in the system log (/var/log/messages) and in the service processor System Event Log (SEL).

In the case of correctable ECC memory errors, both reports should correctly identify the CPU slot and DIMM number on which the memory error occurred.

In the case of uncorrectable ECC memory errors, the system will have reset before an error could be written to the console, but an MCE may be reported in /var/log/messages after the reboot. For uncorrectable memory errors, collect the error address from that entry and use the manual HERD decoding method described later in this document.

Note: The Linux kernel only harvests MCE errors every 5 minutes, so a delay might occur between an MCE occurrence and its report to the system log and SEL.

  • Example output of a HERD decode:


Jan 14 18:57:32 host herd: HARDWARE ERROR. This is *NOT* a software problem!
Jan 14 18:57:32 host herd: Please contact your hardware vendor
Jan 14 18:57:32 host herd: CPU 0 4 northbridge
Jan 14 18:57:32 host herd: Northbridge Watchdog error
Jan 14 18:57:32 host herd: bit57 = processor context corrupt
Jan 14 18:57:32 host herd: bit61 = error uncorrected
Jan 14 18:57:32 host herd: bus error 'generic participation, request timed out generic error mem transaction generic access, level generic'
Jan 14 18:57:32 host herd: STATUS b200000000070f0f MCGSTATUS 0
Jan 14 18:57:32 host herd: Physical address maps to: Cpu Node 0, DIMM 0
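
When many such reports accumulate, it can help to pull out just the CPU node and DIMM numbers. The sketch below does this with sed, matching the "Cpu Node N, DIMM M" wording shown in the example output. The function name herd_dimm_locations is hypothetical.

```shell
# Hypothetical helper: read log text on stdin and print one
# "node dimm" pair for each line containing "Cpu Node N, DIMM M".
herd_dimm_locations() {
    sed -n 's/.*Cpu Node \([0-9][0-9]*\), DIMM \([0-9][0-9]*\).*/\1 \2/p'
}
```

For example, grep 'herd:' /var/log/messages | herd_dimm_locations | sort | uniq -c would count how often each node and DIMM pair has been reported.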


Manual HERD decode (Linux only)

Perform a manual HERD decode if:

  • An MCE event occurred before HERD was installed.
  • An uncorrectable memory error occurred and an MCE event is reported on reboot.

The HERD tool will identify the CPU slot and DIMM number from the physical address reported by the MCE event.

Install HERD as outlined above, and run the following HERD commands against the address reported in the MCE (in /var/log/messages).

  • Sample MCE correctable error (before HERD is installed)

Mar 03 03:33:33 testsystem kernel: CPU 0: Silent Northbridge MCE
Mar 03 03:33:33 testsystem kernel: Northbridge status 946ac002:00000813
Mar 03 03:33:33 testsystem kernel: Error ecc error
Mar 03 03:33:33 testsystem kernel: bus error local node origin, request didn't time out
Mar 03 03:33:33 testsystem kernel: generic read
Mar 03 03:33:33 testsystem kernel: memory access, level generic
Mar 03 03:33:33 testsystem kernel: link number 0
Mar 03 03:33:33 testsystem kernel: err cpu0
Mar 03 03:33:33 testsystem kernel: corrected ecc error
Mar 03 03:33:33 testsystem kernel: previous error lost
Mar 03 03:33:33 testsystem kernel: error address 000000004ed0c630
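
The error address can be pulled out of a kernel report like this one with a short filter, sketched below. The function name extract_mce_addresses is hypothetical, and the pattern assumes the "kernel: error address <hex>" wording of the sample; lines such as "error address valid" are skipped because the text after "error address" is not hexadecimal.

```shell
# Hypothetical helper: read log text on stdin and print each MCE
# error address with leading zeros removed, one per line.
extract_mce_addresses() {
    sed -n 's/.*kernel: error address 0*\([0-9a-fA-F][0-9a-fA-F]*\)[[:space:]]*$/\1/p'
}
```

For example, extract_mce_addresses < /var/log/messages prints 4ed0c630 for the sample entry above.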

  • Sample MCE uncorrectable error (reported on reboot)


Apr 04 04:44:44 testsystem kernel: CPU 0: Silent Northbridge MCE
Apr 04 04:44:44 testsystem kernel: Northbridge status a60000010005001b
Apr 04 04:44:44 testsystem kernel: GART TLB error generic level generic
Apr 04 04:44:44 testsystem kernel: extended error gart error
Apr 04 04:44:44 testsystem kernel: link number 0
Apr 04 04:44:44 testsystem kernel: err cpu0
Apr 04 04:44:44 testsystem kernel: processor context corrupt
Apr 04 04:44:44 testsystem kernel: error address valid
Apr 04 04:44:44 testsystem kernel: error uncorrected
Apr 04 04:44:44 testsystem kernel: previous error lost
Apr 04 04:44:44 testsystem kernel: error address 00000000e7f6a050
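
The HERD example output earlier in this document labels bit 57 of the status value as "processor context corrupt" and bit 61 as "error uncorrected". As a rough sketch, those two bits can be tested directly from the hexadecimal status. The function name mce_status_flags is hypothetical, and only the two bits named in this document are checked.

```shell
# Hypothetical helper: print the meaning of bits 57 and 61 if they are
# set in a 64-bit MCE status given in hex. Accepts an optional 0x
# prefix and the "high:low" form with a colon.
mce_status_flags() {
    s=$(printf '%s' "${1#0x}" | tr -d ':')
    # Left-pad to 16 hex digits so the upper half is always 8 digits.
    while [ "${#s}" -lt 16 ]; do s="0$s"; done
    hi=$(( 0x${s%????????} ))    # bits 63..32 of the status
    # Bit 57 of the status is bit 25 of the upper half; bit 61 is bit 29.
    [ $(( (hi >> 25) & 1 )) -eq 1 ] && echo "bit57 = processor context corrupt"
    [ $(( (hi >> 29) & 1 )) -eq 1 ] && echo "bit61 = error uncorrected"
    return 0
}
```

For example, mce_status_flags a60000010005001b prints both lines, which agrees with the "processor context corrupt" and "error uncorrected" lines in the sample above.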

Use HERD with the -e option to decode a physical address:

# herd -e e7f6a050
0000e7f6a050: Cpu Node 0, DIMM 0
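
If several addresses need decoding, the same herd -e call can be run in a loop, as in the sketch below. The function name decode_addresses and the HERD_CMD variable are hypothetical; HERD_CMD exists only so the loop can be tried with a stand-in command on a machine without HERD. As with any manual decode, the loop must run on the system where the MCEs occurred.

```shell
# HERD_CMD is a hypothetical override; it defaults to the herd command.
HERD_CMD="${HERD_CMD:-herd}"

# Read one hexadecimal physical address per line on stdin and run
# "herd -e <address>" for each non-empty line.
decode_addresses() {
    while read -r addr; do
        [ -n "$addr" ] || continue
        # Redirect stdin so the decoder cannot consume the address list.
        "$HERD_CMD" -e "$addr" </dev/null
    done
}
```

For example: printf 'e7f6a050\n' | decode_addresses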

HERD supports a debug option (-d) that gives more system information:

# herd -d -e e7f6a050
2 cores found, family 15, model 33, stepping 2 (revision E)
CPU description: Dual-Core AMD Opteron(tm), rev JH-E6 (940)
DRAM interface: 128-bit
Chip Kill Error Checking and Correction enabled
herd: dimm translation against system address 0000e7f6a050
Node 0: DRAM base 00000000, DRAM limit 00ffffff, HoleEn 0
Chip 0: CSBase 00000000. CSMask 0fffbfff
0000e7f6a050: Cpu Node 0, DIMM 0

Note: HERD must be run on the system on which the MCE actually occurred.

Running HERD on a different platform will cause the failing DIMM to be misreported.

Internal Comments
This document contains normalized content and is managed by the Domain Lead(s)
of the respective domains. To notify content owners of a knowledge gap
contained in this document, and/or prior to updating this document, please
contact the domain engineers that are managing this document via the Document
Feedback alias(es) listed below:

Domain Lead: [email protected]
Feedback Alias: [email protected]

normalized, memory, x64, manual, decode, MCE, machine, check, exception, Linux, HERD


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.