Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1019337.1
Update Date:2010-07-06
Keywords:

Solution Type  Technical Instruction Sure

Solution  1019337.1 :   Introduction to cache-line retirement feature for Ultrasparc-IV+ (USIV+) processors  


Related Items
  • Sun Fire E6900 Server
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 6800 Server
  • Sun Fire V890 Server
  • Sun Fire E4900 Server
  • Sun Fire 12K Server
  • Sun Fire 4800 Server
  • Sun Fire V1280 Server
  • Sun Fire E2900 Server
  • Sun Fire 15K Server
  • Sun Fire V490 Server
  • Sun Netra 1290 Server
Related Categories
  • GCS>Sun Microsystems>Servers>Midrange V and Netra Servers
  • GCS>Sun Microsystems>Servers>Entry-Level Servers
  • GCS>Sun Microsystems>Servers>High-End Servers
  • GCS>Sun Microsystems>Servers>Midrange Servers

PreviouslyPublishedAs
238627


Description
This document explains the Level 2/Level 3 (L2/L3) cache-line retirement feature available for UltraSPARC IV+ (USIV+) CPUs.
What is cache-line retirement?

Cache-line retirement is an enhancement to Solaris that allows an individual CPU cache-line to be disabled.  Similar in technique to Memory Page Retirement (MPR), it retires a small piece of L2/L3 cache without interrupting the operating system.

On the Solaris[TM] 10 Operating System, this technology uses the new mem_cache driver to disable the cache index and cache way.  On the Solaris[TM] 9 Operating System, the Soft Error Rate Discriminator (SERD) engine determines when to retire the cache-line.

A cache index/way is analogous to a page of main memory.  The Diagnosis Engine or FMA marks a cache index/way as “unavailable”; in more familiar terms, it retires that cache-line.



Steps to Follow
Cache-line retirement details.
What are the benefits of cache-line retirement?

Without this feature, when FMA detects correctable errors within L2/L3 cache, it feeds them to the SERD engine and off-lines the processor.  With cache-line retirement, only the affected cache index and way are disabled rather than the entire processor, so the CPU continues to function normally with no down time or performance impact.  In the rare case that the number of retired cache-lines exceeds the CPU's set threshold, the CPU is then off-lined.  Overall, cache-line retirement provides significant improvements in RAS features for USIV+ CPUs and their associated cache.
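
The event-counting behavior described above can be illustrated with a small Python sketch.  This is not the Solaris or FMA implementation: the class and method names are hypothetical, and the sketch assumes that a SERD engine fires when more than N events fall within a time window T.  The values used in the usage lines (more than 4 cache_tag events in 1 hour) are the thresholds listed in the internal notes of this document.

```python
from collections import deque


class SerdEngine:
    """Toy soft-error-rate discriminator (hypothetical, for illustration).

    Fires when more than `n` events fall within any window of
    `t_seconds` seconds. This is an assumed reading of SERD semantics.
    """

    def __init__(self, n, t_seconds):
        self.n = n
        self.t = t_seconds
        self.events = deque()

    def record(self, timestamp):
        """Record one event; return True if the threshold is exceeded."""
        self.events.append(timestamp)
        # Drop events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) > self.n


# Usage: five tag events, ten minutes apart, against "more than 4 in 1 hour".
tag_serd = SerdEngine(n=4, t_seconds=3600)
fired = [tag_serd.record(ts) for ts in (0, 600, 1200, 1800, 2400)]
print(fired)  # only the fifth event exceeds the threshold
```

In this sketch a firing engine would correspond to retiring the affected cache-line; off-lining the CPU after too many retirements would be a separate counter and is not modeled here.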

Cache-line retirement provides the final resolution for Sun Alert <Document: 1000495.1>.

Availability

Solaris 9 Kernel patch 122300-28
Solaris 10 Kernel patch 137111-02

Examples

Example of a cache-line being retired in Solaris 9 (from /var/adm/messages).  This example was taken from a Sun Fire V490:

May  8 03:11:45 testmachine SUNW,UltraSPARC-IV+: [ID 711633 kern.notice] NOTICE: L2_CACHE_DATA: cpu 6: Retired cache index 4199 way 1 due to event at bit 30
May  8 03:11:45 testmachine  No action required.


Example of a CPU being off-lined due to too many cache-line retirements in Solaris 9 (from /var/adm/messages).  This example was taken from a Sun Fire V490:

May 11 14:40:18 testmachine SUNW,UltraSPARC-IV+: [ID 503843 kern.notice] NOTICE: L2_CACHE_TAG: cpu 0: Retiring CPU since we have already retired 3 ways at cache index 0x3e8
May 11 14:40:18 testmachine  Recommended-Action: Service action required
May 11 14:40:18 testmachine SUNW,UltraSPARC-IV+: [ID 123177 kern.notice] NOTICE: [AFT1] CPU0 offlined
May 11 14:40:18 testmachine SUNW,UltraSPARC-IV+: [ID 307609 kern.notice] NOTICE: [AFT1] CPU16 offlined due to events detected by another CPU on the same chip
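
When reviewing many Solaris 9 systems, it may help to extract the retirement details from /var/adm/messages automatically.  The following Python sketch assumes the "Retired cache index" notice always has the layout shown in the first example above, with a decimal index; the function name and the returned dictionary keys are hypothetical.

```python
import re

# Pattern modeled on the single sample line in this document; other
# layouts (for example a hexadecimal index) are not handled.
RETIRED_LINE = re.compile(
    r"NOTICE: (?P<area>L[23]_CACHE_(?:DATA|TAG)): cpu (?P<cpu>\d+): "
    r"Retired cache index (?P<index>\d+) way (?P<way>\d+)"
)


def parse_retirement(line):
    """Return the retirement details from one log line, or None."""
    match = RETIRED_LINE.search(line)
    if match is None:
        return None
    return {
        "area": match.group("area"),
        "cpu": int(match.group("cpu")),
        "index": int(match.group("index")),
        "way": int(match.group("way")),
    }


sample = (
    "May  8 03:11:45 testmachine SUNW,UltraSPARC-IV+: [ID 711633 kern.notice] "
    "NOTICE: L2_CACHE_DATA: cpu 6: Retired cache index 4199 way 1 "
    "due to event at bit 30"
)
print(parse_retirement(sample))
```

Lines that do not describe a cache-line retirement, such as the "CPU0 offlined" notice, return None and can simply be skipped.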


Solaris 10 examples.  The following SUNW-MSG-IDs (shown as message URLs) and fault classes identify cache-line faults in fmdump output:

http://sun.com/msg/SUN4U-8007-FQ
http://sun.com/msg/SUN4U-8007-GC
http://sun.com/msg/SUN4U-8007-HH
http://sun.com/msg/SUN4U-8007-JD


fault.cpu.ultraSPARC-IVplus.l2cachedata-line
fault.cpu.ultraSPARC-IVplus.l3cachedata-line
fault.cpu.ultraSPARC-IVplus.l2cachetag-line
fault.cpu.ultraSPARC-IVplus.l3cachetag-line


Product
Sun Fire 6800 Server
Sun Fire 4800 Server
Sun Fire V1280 Server
Sun Fire E2900 Server
Sun Fire E4900 Server
Sun Fire E6900 Server
Sun Fire 15K Server
Sun Fire 12K Server
Sun Fire E20K Server
Sun Fire E25K Server
Sun Fire V490 Server
Sun Fire V890 Server
Sun Netra 1290 Server

Internal Comments
Internal section only

Facts about cache-line retirement
  • A cache-line can be retired because of an issue with the data area (L2_CACHE_DATA or L3_CACHE_DATA) or with the tag area (L2_CACHE_TAG or L3_CACHE_TAG).
  • The data and tag areas for the L2 cache and the tag area for the L3 cache reside on the CPU chip itself.
  • The data area for the L3 cache resides outside the CPU chip.
  • The kernel keeps one counter for cache-lines retired due to errors in cache that resides on the CPU chip, and a separate counter for retired cache-lines that reside off the CPU chip.
  • Solaris retires cache-lines until the threshold (64 cache-lines) is met.  At this point the processor is off-lined and the board should be replaced.
  • On Panther boards, the L2 cache is 2 MB in size and the L3 cache is 32 MB in size.  Each cache-line is 64 bytes (1 index & 1 way).
  • The L2 cache has 32,768 cache-lines, so less than two thousandths of the cache (64 of 32,768) can be retired before the processor is failed.
  • The L3 cache has 524,288 cache-lines, yet the processor is still off-lined after only 64 cache-lines are retired.  The fraction of the L3 cache that can be retired before the processor is failed (64 of 524,288) is therefore far smaller than the two thousandths for the L2 cache.
  • For Solaris 9, to see how many cache-lines have been retired, you may use the following command:
          kstat -n pn_cacheline_retire
  • Current cache-line retirement thresholds for Solaris 9 are:
    • retire cache-line if more than 4 cache_tag events in 1 hour
    • retire cache-line if more than 12 cache_data events in 1 hour
  • Current cache-line retirement thresholds for Solaris 10 are:
    • specified in /usr/platform/sun4u/lib/fm/fmd/plugins/cpumem-diagnosis.conf
    • retire cache-line if more than 4 cache_tag events in 1 hour
    • retire cache-line if more than 12 cache_data events in 1 hour
  • For Solaris 10, to see how many cache-lines have been retired, you may use the following command:
          fmdump -av
  • Cache-line retirement persists across reboots on Solaris 10, but not on Solaris 9.
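
The cache-line counts and retirable fractions quoted in the list above follow from simple arithmetic on the cache sizes, the 64-byte line size, and the 64-line retirement limit.  The following Python sketch reproduces them; the function and constant names are hypothetical, and 1 MB is taken to mean 1,048,576 bytes.

```python
LINE_BYTES = 64      # size of one cache-line, from this document
RETIRE_LIMIT = 64    # retired lines allowed before the CPU is off-lined


def cache_lines(cache_bytes, line_bytes=LINE_BYTES):
    """Number of cache-lines in a cache of the given size."""
    return cache_bytes // line_bytes


def retirable_fraction(cache_bytes, limit=RETIRE_LIMIT):
    """Fraction of the cache that can be retired before off-lining."""
    return limit / cache_lines(cache_bytes)


L2_BYTES = 2 * 1024 * 1024    # 2 MB L2 cache
L3_BYTES = 32 * 1024 * 1024   # 32 MB L3 cache

for name, size in (("L2", L2_BYTES), ("L3", L3_BYTES)):
    print(name, cache_lines(size), retirable_fraction(size))
```

The L2 fraction works out to just under 0.002 (two thousandths), and the L3 fraction is sixteen times smaller because the L3 cache has sixteen times as many lines for the same limit.
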
Related Bugs
CR 6589208 - Ultrasparc IV+: support L2/L3 Cache Line Retirement

CPU, error, Level 2, l2, level 3, l3, cache, correctable event, disabled, offline, enhancement, usiv+

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.