How To manage "Unable to send ECC event message to System Controller" messages

Asset ID:	1-71-1020467.1
Update Date:	2011-02-28
Keywords:

Solution Type Technical Instruction Sure

Solution 1020467.1 : How To manage "Unable to send ECC event message to System Controller" messages

Applies to:

Sun Fire V1280 Server
Sun Fire 3800 Server
Sun Fire 4800 Server
Sun Fire 4810 Server
Sun Fire 6800 Server
All Platforms

Goal

Description

How to deal with "Unable to send ECC event message to System Controller" messages

This document discusses how to proceed in case you are continually getting messages in the /var/adm/messages file such as:

May 10 03:10:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:11:28 system last message repeated 2 times
May 10 03:11:53 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:11:58 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:13:58 system last message repeated 5 times
May 10 03:14:12 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:14:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:17:28 system last message repeated 6 times
May 10 03:17:29 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:17:58 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:18:28 system last message repeated 1 time
May 10 03:18:58 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:18:58 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:24:58 system last message repeated 13 times
May 10 03:25:25 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:25:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:28:58 system last message repeated 8 times
May 10 03:29:27 system sgsbbc: [ID 428960 kern.notice] NOTICE: Unable to send ECC event message to System Controller
May 10 03:29:28 system sgsbbc: [ID 538587 kern.notice] NOTICE: Timed out sending message to SC
May 10 03:31:58 system last message repeated 5 times
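
To gauge how heavy the flood is, the two notices can be tallied from the messages file. The following is a minimal sketch, not a supported tool: the function name count_sc_messages is hypothetical, and it assumes the log lines look like the sample above, including the syslog "last message repeated N times" lines.

```python
import re
from collections import Counter

# Message texts copied from the sample log above.
TIMEOUT = "Timed out sending message to SC"
UNABLE = "Unable to send ECC event message to System Controller"
REPEAT = re.compile(r"last message repeated (\d+) times?")


def count_sc_messages(lines):
    """Count the two sgsbbc notices, folding in 'last message repeated' lines.

    Hypothetical helper: assumes the line layout shown in the sample log.
    """
    counts = Counter()
    last = None  # most recent notice type seen, if any
    for line in lines:
        if TIMEOUT in line:
            last = TIMEOUT
            counts[last] += 1
        elif UNABLE in line:
            last = UNABLE
            counts[last] += 1
        else:
            m = REPEAT.search(line)
            if m and last is not None:
                # A repeat line refers to the message logged just before it.
                counts[last] += int(m.group(1))
            elif not m:
                # An unrelated message breaks the repeat chain.
                last = None
    return counts
```

The function accepts any iterable of lines, so an open file handle on /var/adm/messages can be passed directly. A rapidly growing count between two runs suggests that the error storm is still in progress.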

Solution

Cause

These messages are caused by a flood of errors, for example a DIMM generating many hundreds or thousands of CEs (Correctable Errors). The flood of errors on the domain is more than the domain-to-SC data path can handle.

Note: The CE flood or storm is sometimes caused by FMA not retiring pages correctly. It is important to install the latest FMA patches, for example:
Patch 139572-02 SunOS[TM] 5.10: fmd patch (or later), which fixes Sun CR 6714311 (fma/mem): "fmstat seems to hang after/during CE storm"

This bug causes page retirement to malfunction.

Also: Patch 120011-14 SunOS 5.10: kernel patch (or later) and Patch 125369-12 SunOS 5.10: Fault Manager patch (or later) are quite important to have installed in order to avoid known issues that can lead to this condition.
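
The patch levels named above can be checked against the output of showrev -p on the domain. The following is a rough sketch under stated assumptions: the function names are hypothetical, and the code assumes each installed patch appears on a line beginning with "Patch: NNNNNN-RR". It ignores the Obsoletes field, so a patch that has been superseded by a patch with a different ID will be reported as missing and must be checked by hand.

```python
import re

# Assumed line layout: "Patch: 139572-02 Obsoletes: ... Requires: ... Packages: ..."
PATCH_LINE = re.compile(r"^Patch:\s*(\d{6})-(\d{2})\b")


def installed_revisions(showrev_output):
    """Map patch base ID to the highest installed revision (hypothetical helper)."""
    found = {}
    for line in showrev_output.splitlines():
        m = PATCH_LINE.match(line.strip())
        if m:
            base, rev = m.group(1), int(m.group(2))
            found[base] = max(rev, found.get(base, 0))
    return found


def missing_patches(showrev_output, required):
    """Return the required patches not present at the given revision or later."""
    have = installed_revisions(showrev_output)
    missing = []
    for patch in required:
        base, rev = patch.split("-")
        # -1 means the base ID was not found at all.
        if have.get(base, -1) < int(rev):
            missing.append(patch)
    return missing
```

For this document the required list would be ["139572-02", "120011-14", "125369-12"], with the showrev -p output captured to a string first. An empty result means all three are present at the stated revision or later.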

Background

Data transactions go into the error buffer on the SC, and that buffer is getting full. By design, it holds only about 100 messages. Because Solaris can no longer write to the error buffer, it logs notices in /var/adm/messages that read "Unable to send ECC event message to System Controller".

This issue is sometimes difficult to troubleshoot because the original error messages have to be examined to determine what event started the error storm. You should resolve the original error event (replace the DIMM), but only after updating the patches to ensure that page retirement is functioning properly (see the Note on patches above). It is not advisable to replace hardware (memory DIMM) if the patches above are not installed. The patches should have prevented a storm in the first place by retiring faulty pages instead of allowing them to noisily fill up the error buffer with ECC errors.

If the patches ARE installed, search for the DIMM in error by examining the showerrorbuffer output. The DIMM implicated by the "incoming" error is the root cause suspect (see Document 1002710.1 for details on this diagnosis):

  Date: Thu May 07 15:50:45 EDT 2009
  Device: /partition0/domain0/SB2/dx2
  ErrorID: 0x32091ff0
  Port: 0
  Syndrome: 0x2f(CE bit 10)
  Direction: outgoing read
  First error: true
  TargetAid: 0x8
  Transid: 0x2
  .
  .
  Date: Thu May 07 15:50:45 EDT 2009
  Device: /partition0/domain0/SB2/dx3
  ErrorID: 0x33091ff0
  Port: 0
  Syndrome: 0x1c(CE bit 11)
  Direction: outgoing read
  First error: true
  TargetAid: 0x8
  Transid: 0x2
  .
  .
  Date: Thu May 07 15:50:46 EDT 2009
  Device: /partition0/domain0/SB0/dx2
  ErrorID: 0x32091ff0
  Port: 0
  Syndrome: 0x2f(CE bit 10)
  Direction: incoming read
  First error: true
  TargetAid: 0x4
  Transid: 0x1
  .
  .
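
When the buffer holds many entries, picking out the "incoming" records by eye is tedious. The following is a minimal sketch of one way to filter a saved copy of the output. The function names are hypothetical, and the parser assumes every record starts with a "Date:" line and uses "Field: value" lines as in the sample above. It only lists the Device paths of the incoming records; mapping those to a specific DIMM still follows Document 1002710.1.

```python
def parse_error_buffer(text):
    """Split showerrorbuffer-style text into a list of {field: value} records.

    Hypothetical helper: assumes each record begins with a "Date:" line.
    """
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == ".":
            continue  # skip blank lines and the "." filler lines
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        key, value = key.strip(), value.strip()
        if key == "Date" and current:
            # A new Date line closes the previous record.
            records.append(current)
            current = {}
        current[key] = value
    if current:
        records.append(current)
    return records


def incoming_devices(records):
    """Return the Device path of every record whose Direction is incoming."""
    return [r.get("Device") for r in records
            if r.get("Direction", "").startswith("incoming")]
```

Applied to the three records shown above, only the /partition0/domain0/SB0/dx2 entry is reported, since it is the only one with Direction "incoming read".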

NOTE: The service mode command clearerrorbuffer can be used to clear the error buffer and prevent the "Unable to send" event messages from showing up again in /var/adm/messages (unless the error storm persists).

However, service mode requires that you contact Oracle Support Services to obtain a password, and this special mode is to be used only by Oracle badged employees. This is one reason that clearerrorbuffer is not really a viable solution to this problem. The main reason is that clearing the buffer wipes all of the errors it holds, which could prevent you from identifying the DIMM responsible for the noise in the first place.

It is best to install the correct patches and/or replace the DIMM in the first place.

Internal Comments

If the error buffer must be cleared, proceed as follows:
1. Get into service mode.
     Document 1010655.1 provides insight into working in service mode.
     Note: This requires an Oracle badge or use of Shared Shell. Customers should not do this themselves.

2. Use the clearerrorbuffer command (example below):
ssc0:SC[service]> clearerrorbuffer -h
    clearerrorbuffer -- clear the contents of the error buffer
    Usage: clearerrorbuffer
    clearerrorbuffer -h

NOTE: The clearerrorbuffer command empties the showerrorbuffer command output.
After it is run, that data can no longer be used to identify a faulty DIMM or other source of errors.

- The ECC storm discussed in this doc can lead to the behaviour described in Sun Alert 1019109.1
Systems With UltraSPARC IV+ Processors Running Solaris 9 or 10 May Experience "send
mondo timeout" Panic
This behaviour has been experienced on systems using both ScApp 5.19.x & 5.20.x.

Memory, CE, storm, ECC, showerrorbuffer, clearerrorbuffer, flood, page retirement, buffer

Attachments

This solution has no attachment