Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1012349.1
Update Date:2010-01-06

Solution Type: Technical Instruction

Solution 1012349.1: Kernel Cage Splitting Overview


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As
217037


Description
This document describes the kernel cage, the performance issues associated with the current cage implementation, and the recently introduced changes to the kernel cage implementation in the Solaris[TM] 9 Operating System.




Steps to Follow
1.0 What is the kernel cage?

The kernel cage is a mechanism created to help dynamic reconfiguration (DR): the system attempts to "cage" memory allocated for the kernel onto a single system board. This is done because removing boards containing kernel memory (also known as permanent memory or kernel permanent memory) with DR is more difficult and requires quiescing the system while the memory is moved. On large systems this copy-rename process can take 10-40 minutes which, while shorter than a reboot, may still be unacceptable to the customer. Other potential impacts during this process:

- application time-outs, since nothing will be running except copy-rename
- panics because of incompatible third-party drivers, certain applications or missing patches
- problems with real-time threads being unable to run
- problems with cluster software (Sun Cluster SC3.x and Veritas VCS)

By caging the kernel to a single board, the other boards can be removed with DR much more easily since no kernel memory needs to be relocated. (Note that currently the cage can extend beyond one board if the available memory on that board is exhausted.) For more detail see other resources:

- Technical Instruction, Document 1010363.1: Sun Fire[TM] 12K/15K/E20K/E25K Servers: Dynamic Reconfiguration Considerations

- Sun Fire DR page
http://www.sun.com/servers/highend/dr_sunfire/

- Blueprints
Dynamic Reconfiguration for High-End Servers: Part 1--Planning Phase
http://www.sun.com/blueprints/0304/817-5949.pdf

- Dynamic Reconfiguration for High-End Servers: Part 2--Implementation Phase
http://www.sun.com/blueprints/0304/817-5951.pdf

2.0 When is the kernel cage a problem?

On larger machines with many CPUs, especially Sun Fire 12K/15K/E20K/E25K servers, the kernel structures can become very busy, and the memory controller on the board holding the cage can become saturated by the number of memory references to the cage. Any memory references to that board then queue up and slow down, decreasing overall system performance.

Among the workloads that can experience this are any that have a heavy networking component, generating perhaps more than 50K packets/second combined input and output. Workloads that generate a similar disk I/O rate or large amounts of buffered I/O can also be affected, although that level of I/O is less common for commercial workloads.

3.0 How can I tell if my system is experiencing this?

There are many ways of monitoring whether a board is getting busy but currently the best way is using busstat. Interpreting the raw data from busstat can be difficult so involving Sun performance experts is the best way to know for sure that this is a problem.

Using "netstat -i" can be one way of looking at the network load to see if the level is approaching the rough guidelines above. Similarly "iostat" can be used to monitor the overall disk I/O rate.
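As a rough sketch of that check, the combined packet rate can be computed from two "netstat -i" samples taken a fixed interval apart. The interface name, counter values, and column positions (5 = Ipkts, 7 = Opkts, the classic Solaris "netstat -i" layout) are assumptions for illustration; verify them against real output on your system.

```shell
#!/bin/sh
# Sketch: estimate combined input+output packets/sec from two "netstat -i"
# samples captured INTERVAL seconds apart.
INTERVAL=10
# Two captured sample lines for one interface (hypothetical counters):
SAMPLE1="ce0 1500 host host 1000000 0 2000000 0 0 0"
SAMPLE2="ce0 1500 host host 1300000 0 2350000 0 0 0"
set -- $SAMPLE1; IPKTS1=$5 OPKTS1=$7    # field 5 = Ipkts, field 7 = Opkts
set -- $SAMPLE2; IPKTS2=$5 OPKTS2=$7
RATE=$(( (IPKTS2 - IPKTS1 + OPKTS2 - OPKTS1) / INTERVAL ))
echo "combined packets/sec: $RATE"
```

Sustained rates above roughly 50K packets/second are in the range where cage saturation has been observed.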

If the kernel cage is suspected in performance problems, the best current method is to escalate to Sun support to further investigate the issue.

4.0 If the cage board is saturated, what can I do?

There are two main ways of dealing with a kernel cage that has saturated a memory board controller. The first (and probably most successful) method is to reduce the pressure on the kernel structures. This is one of the main effects of KU 117171-17 for Solaris[TM] 9 and KU 117350-18 for Solaris[TM] 8, which fixed issues in mutex handling and in the dispatcher, and in doing so reduced pressure on those memory structures.

For other sources of kernel activity, other tuning may be useful. For example, reducing system calls by tuning network parameters or closely examining filesystem choices can remove contention due to segmap issues, single-writer locks or other inefficiencies that can exacerbate any kernel cage problems.

The second way to reduce the pressure is to spread the cage over more boards. Prior to this cage splitting fix, the only way to do that was to just turn the cage off. If you add the line "set kernel_cage_enable = 0" to /etc/system and reboot, the cage is disabled and kernel memory structures will be allocated across all the boards.
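As a concrete sketch, disabling the cage amounts to a single tunable in /etc/system (note the tradeoffs described below before doing this):

```
* /etc/system fragment: disable the kernel cage entirely.
* This also prevents DR board removal, so use only as a last resort.
set kernel_cage_enable = 0
```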

This does mean, however, that DR operations to remove a board will always fail and so should be done only if there are no other ways to meet one's performance needs. With the cage disabled, you may add boards, but not delete them. Only make these changes after very carefully considering alternatives and tradeoffs.

Even without disabling the cage, if the kernel memory grows beyond the capacity of one board, the cage will be split onto more boards, but this is not very controllable and may not distribute the kernel structures well, meaning one board's memory controller may still be saturated.

5.0 New Cage Splitting Patch

Solaris[TM] 9 KU 118558-05, together with the platmod patch 117124-07, includes a partial fix for kernel cage issues. It alters the current default behavior by splitting the cage across more than one board, depending on the size of the domain. This is only available on Sun Fire 12K/15K/E20K/E25K server systems.

Splitting the cage is a change in the default behavior, so you need to be aware of it before applying these patches.

The default behavior with this patch is to add one board to the cage for every six in the domain and split the cage across that number of boards. So a domain with 1 to 6 boards will still start with a single cage board, a domain with 7 to 12 boards will have at least two cage boards and a domain with 13 to 18 boards will have at least three cage boards. This ratio (1 cage board per 6 boards) is controllable by new /etc/system parameters:

   kcage_split_enable = 0 or 1   (default: 1, enabled, whenever the cage is enabled)
   kcage_split_ratio = 1 to 18   (default: 6, i.e. 1 cage board per 6 boards)
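The default policy works out to one cage board per kcage_split_ratio boards, rounded up; that ceiling-division reading is an assumption, but it matches the 1-6 -> 1, 7-12 -> 2, and 13-18 -> 3 examples given above. A minimal sketch:

```shell
#!/bin/sh
# Sketch of the default split policy: cage boards = ceil(boards / ratio).
RATIO=6    # default kcage_split_ratio
for BOARDS in 1 6 7 12 13 18; do
    CAGE=$(( (BOARDS + RATIO - 1) / RATIO ))    # ceiling division
    echo "$BOARDS board(s) -> $CAGE cage board(s)"
done
```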

The cage-splitting fix will ONLY be available in Solaris[TM] 9. Solaris[TM] 8 is not planned to have this functionality, so turning off the cage, or tuning to reduce cage pressure, are the only choices there if the memory load approaches saturation. Solaris[TM] 10 will take a different approach: instead of splitting the cage, it reduces the types of kernel memory structures kept in the cage, meaning those structures can be allocated on any board, not just boards marked for the cage.

One more change with the cage splitting patch: if a board is added to a domain by DR and the board count crosses the split threshold, the new board will be marked to contain future cage growth (when the system needs more kernel memory), unless the cage has already been split. This means that if you have left everything at the defaults in a 6-board domain, adding a seventh board may cause the cage to be split onto the new board. If you later wish to remove that board with DR, the system may have to be quiesced while the kernel is moved off that board. If these DR operations are primarily for load balancing, make sure to remove a non-cage board where possible.

You can identify the cage boards by running the cfgadm command:

 cfgadm -av | grep memory | grep perm
SB7::memory   connected    configured   ok   base address 0x12000000000,
4194304 KBytes total, 1415560 KBytes permanent

This indicates the cage is only on SB7, and so other boards could be safely removed.
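The check can be scripted by extracting the board names from lines that report permanent memory. On a live system the input would come from `cfgadm -av | grep memory`; the captured sample below (hypothetical addresses and sizes) stands in for that so the parsing can be shown end to end.

```shell
#!/bin/sh
# Sketch: list boards whose memory attachment reports permanent (cage) memory.
CFGADM_SAMPLE='SB4::memory connected configured ok base address 0x2000000000, 4194304 KBytes total
SB7::memory connected configured ok base address 0x12000000000, 4194304 KBytes total, 1415560 KBytes permanent'
# Attachment point IDs look like "SB7::memory"; the board name is the part
# before the first colon.
CAGE_BOARDS=$(printf '%s\n' "$CFGADM_SAMPLE" | grep permanent | cut -d: -f1)
echo "cage board(s): $CAGE_BOARDS"
```

Boards not listed can be removed by DR without relocating kernel memory.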

6.0 When should you apply this patch?

First of all, the benefit of this patch for a system cannot be accurately predicted without testing it.

If your system is currently operating with the cage enabled (the default behavior), this change should only affect performance in a positive way. The cage will be split in large domains, which should help performance. Be aware, however, that more boards may then contain kernel cage structures, so DR removal of those boards will require quiescing the system while the kernel structures are relocated.

If you have disabled the cage because you could not get adequate performance and would like to use DR, you might want to test whether this fix is sufficient for you. As mentioned above, you should have applied all of the appropriate patches that reduce pressure on the cage first. Then enabling the cage and cage splitting may meet your performance requirements, but it will need to be tested. (Again note that disabling the cage also disables DR.) If you explicitly disable the cage in /etc/system, this change will not have any effect, since cage splitting is only enabled if the cage is enabled.

If you use DR for reconfiguring hardware, e.g.

- moving boards between domains for load-balancing reasons
- replacing failed hardware components

you will need to know how many boards are in the domain relative to the split ratio to know whether the board being added will be marked to contain the cage. Since the cage could previously grow as the kernel grew, the recommended procedure has always included making sure a board does not contain permanent memory before removing it, so this should not be a big change; the patch just adds another way to create a cage board. (See the cfgadm note above to identify boards which have permanent memory.)

If the board being added is special in some way and you want to be sure you can DR out that board, it may be possible to adjust the split ratio, or to start the domain with its full complement of boards so that the split happens at boot time, thereby avoiding having the special board marked to contain the cage.



Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Internal Comments
THIS SECTION IS STRICTLY INTERNAL ONLY:

In Q4FY05, the EIS Standard will publish a DR checklist containing the restrictions for every DR operation, in particular for copy-rename operations.


Further DR details can be seen at the onestop "Sun Fire DR" page :

http://onestop/sunfiredr/


Only the dynamic cage is split; the static cage is still allocated on one board. (See the cage documents above for more definition of dynamic and static cage.) This means that the load is not necessarily spread evenly across the cage boards, but the goal is to reduce the memory load below saturation.

There are a lot more details about recent kernel cage changes at :

http://heseweb.east/~huah/cage_split.html


More scalability details are at :

http://onestop/starcatperf/S9_perf_improvements_roadmap.shtml


On using busstat to find out about cage saturation problems:

The main limit seems to be CDC reads, which are limited to about 30 million per second per board. An example busstat command would be:


 busstat -w axq,pic0=cdc_hits,pic1=total_cdc_read 1 1

If any of the AXQs listed have total_cdc_read counts near 30 million, that board is probably approaching saturation. Interpreting the raw data from busstat can be difficult, so if there is any doubt, involve performance experts to help confirm that this is the problem.
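A rough screening pass over captured busstat output can flag AXQ instances approaching that limit. The column layout of the sample below (field 2 = device, field 6 = total_cdc_read count) is an assumption for illustration; check the header of real busstat output before relying on field positions.

```shell
#!/bin/sh
# Sketch: flag AXQs whose total_cdc_read count approaches the ~30 million
# reads/second/board limit (here flagged above a 27 million threshold).
BUSSTAT_SAMPLE='time dev event0 pic0 event1 pic1
1 axq0 cdc_hits 1200000 total_cdc_read 29500000
1 axq7 cdc_hits 400000 total_cdc_read 8000000'
NEAR_LIMIT=$(printf '%s\n' "$BUSSTAT_SAMPLE" |
    awk 'NR > 1 && $6 > 27000000 { print $2 }')
echo "AXQs near CDC read saturation: $NEAR_LIMIT"
```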


Performance problems internally can be submitted to the Sun Systems Performance Roundtable (SPRT) to further investigate the issue. See:

http://onestop/perf-roundtable/


One of the bugs fixed by the cage splitting fix is:
CR 4860202: kernel cage splitting can eliminate single-board cage bottleneck


Keywords: kernel cage, DR, cfgadm, performance, memory saturation, split
Previously Published As
80991

Change History
Date: 2009-12-01
User Name: Volkmar Grote 117021
Action: Reviewed for Content Team
Comments: I checked the structure, links and layout, however I think someone from the Kernel team should check if the contents is still valid like this
Date: 2005-04-19
User Name: 111868
Action: Approved

Comment: Re-publishing. This article needs to stay; I had removed it because it was duplicated by another article with more hits, but this one is referred to in the patch report, so it needs to stay.


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.