Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-71-1006949.1
Update Date:2011-05-04
Keywords:

Solution Type  Technical Instruction

Solution  1006949.1 :   Busy interrupts, problems and solutions.  


Related Items
  • Sun Fire E25K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
209619


Applies to:

Sun Fire E25K Server
All Platforms

Goal

This document describes the problems that busy device interrupts can cause on a host running the Solaris[TM] Operating System (OS), and what can be done to mitigate their effect on system response.

Solution

Background

Hardware devices generate interrupts; depending on its type, a device may generate several different interrupts. As the Solaris OS kernel discovers a device and installs its device driver, it assigns a CPU to handle each interrupt. CPUs are chosen in a "round robin" fashion, and because there are often more interrupt sources than CPUs in a system, a CPU may have to deal with interrupts from more than one source. A CPU services interrupts according to the level of the interrupt.

Hardware devices also generate interrupts at different rates depending on their workload and design. A Gigabit Ethernet NIC may generate tens of thousands of interrupts per second, whilst a serial port might generate only a few hundred.

Related to the number of interrupts is the amount of work the interrupt handler has to do before it can exit. A serial port driver may just have to fiddle with some hardware registers and poke the next byte into the hardware; that might take just a microsecond. An Ethernet NIC driver will have to transfer the data into streams buffers and then run each buffer up through the IP module, then the TCP module, and then wake up a waiting thread at the stream head; this might take tens of microseconds. A disk storage HBA interrupt handler may have to do even more work, depending on the storage stack layered above the HBA. For example, the interrupt handler for a fibre channel QLC card will have to deal with hardware registers, then transfer the data up through the FCP/FP fibre channel drivers to the target driver; from there the buffer might have to progress up through some form of volume management and into a filesystem, where there will be some interaction with the VM system. This whole process may take hundreds of microseconds.

During this time the hardware could be receiving new data and signaling its need to be serviced again. Modern hardware uses interrupt blanking to reduce the number of interrupts, but the hardware still has to be emptied, so the interrupt handler for these devices will check for new pending work before it exits; if there is any, it will stay at the raised interrupt level and service the new work.

Solaris OS uses 15 levels of interrupt priority. The lower 10 run in separate interrupt threads bound to each CPU; the higher ones run in the context of the currently running thread, as they are high priority and perform only short-lived functions like MMU demaps. The level 10 and lower interrupts will stop, or "pin", the scheduled thread whilst they run on the CPU. That pinned thread could be executing userland code or lower-priority kernel code, and it may hold mutexes or userland locks. Should one of these level 10 or lower interrupt threads block on a synchronization object like a kernel mutex, the pinned thread will be allowed to execute again, but the CPU will stay at the raised interrupt level until the blocked interrupt thread resumes and then exits.

Typically, SCSI disk interrupts run at interrupt level 4 (some HBAs use 6 and 4), and network NICs run their interrupt service routines at interrupt level 6. The Solaris OS 100 Hz clock thread runs at interrupt level 10 (referred to as "lock level") on one specific CPU (the "clock" CPU), and this interrupt is produced by the hi-resolution cyclic subsystem.

The Solaris OS allows you to offline CPUs; in that case, the interrupts assigned to that CPU will be randomly reassigned to the other CPUs that remain online and accepting interrupts. The administrator can also use processor sets to contain or exclude processes/LWPs from sets of CPUs, which can also have interrupts removed. This allows "interrupt fencing": forming barriers between busy interrupt CPUs and user/system threads. The most common use of interrupt fencing is to keep userland LWPs away from CPUs taking lots of interrupts, to avoid latency bubbles as the userland threads get pinned by the busy interrupts.
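As a sketch of the mechanics (the CPU IDs here are purely illustrative), interrupt handling can be disabled on selected CPUs with psradm:

```shell
# Illustrative CPU IDs; adjust to the target domain's layout.
# Stop CPUs 4 and 5 from taking device interrupts; their assigned
# interrupts are redistributed among the remaining CPUs:
psradm -i 4 5

# Verify the new state; interrupt-disabled CPUs show as "no-intr":
psrinfo

# psradm -n 4 5 would later return them to normal online state.
```

Combined with processor sets (see below), this gives the administrator both halves of a fence: CPUs that take interrupts but no scheduled threads, and CPUs that take threads but no interrupts.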


Possible problems

1) A particular CPU is the target for interrupts from a busy network NIC at interrupt level 6 and a SCSI disk HBA at interrupt level 4. In this situation the disk driver interrupt handler will not be able to run whilst the network NIC interrupt handler is running. If the network NIC is fed enough incoming packets to keep its interrupt handler occupying the CPU solidly, the disk HBA interrupt handler will see long latency. This may cause the application using disks behind that HBA to complain about slow disk response; of course, the application may well be the operating system itself, and users will report slow behavior as applications on those disks take a long time to page in or to access swap devices.

The slow disk response time will be seen in the asvc_t column when running iostat with a short refresh interval, and it will correlate with mpstat/trapstat/intrstat showing a high rate of interrupts on one CPU. The later the version of Solaris OS, the more information is available to see which device is using the CPU resource.
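For example, the correlation can be observed with commands along these lines (Solaris 10 syntax; the one-second intervals are illustrative):

```shell
# Per-device I/O statistics, skipping idle devices; watch the
# asvc_t (active service time) column for affected disks:
iostat -xnz 1

# Per-CPU statistics; watch the intr and ithr columns for a CPU
# taking a disproportionate share of interrupts:
mpstat 1

# On Solaris 10 and later, per-device interrupt rates and the
# percentage of CPU time each device's handler consumes:
intrstat 1
```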

One solution is to steer the interrupts to separate CPUs so they do not clash.
Until intrd appears in a Solaris OS version later than Solaris[TM] 10 Operating System, there is no supported way to do this other than moving hardware around, changing bus probe orders or offlining and onlining CPUs to rearrange the interrupts.
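A sketch of the offline/online shuffle (CPU 8 is an illustrative ID): offlining a CPU forces its interrupts to be redistributed, and the assignments do not move back when it returns.

```shell
# Assume CPU 8 currently services the clashing interrupts.
# Offlining it redistributes those interrupts among the
# remaining online CPUs:
psradm -f 8

# Bring it back online; interrupts that moved away stay where
# they landed, so the assignments have been rearranged:
psradm -n 8

# Repeat until mpstat/intrstat show an acceptable spread.
```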

A simpler solution is to reconfigure the hardware to reduce the amount of interrupt processing a device needs - either by software changes/tuning or by architecture changes, e.g. using network trunking to split network load across two NICs.


2) A CPU is taking interrupts from a very busy device and is spending considerable periods at a raised interrupt level. Depending on what is running on that CPU at the time when the first interrupt comes in for a burst, several symptoms may be observed:

a) A userland thread holding no locks: the thread may stop for the duration of the interrupt burst, and users may report latency bubbles or just a slow application.

b) The userland thread is part of a multi-process/multi-threaded application and is holding a contended userland mutex at the start of an interrupt burst. Other threads on other CPUs wanting that mutex will spin, as they will think the owner thread is still executing on a CPU; the users will complain of a slow application, batch jobs may be slow, and mpstat will show high user execution percentages. The application/plockstat may also report longer-than-expected mutex holds and waits.

c) A kernel thread is pinned by the interrupt burst. If it holds no synchronization objects, its work is simply delayed; but if it holds a contended mutex, the contending threads on other CPUs will spin, burning system time and incrementing the smtx column in mpstat.

A solution is to put a processor set on top of the busy CPU. This interrupt "fence" is not to keep the interrupts in but to keep the other threads out, so they cannot be scheduled onto the CPU busy with interrupts.
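A minimal sketch of such a fence, assuming CPU 12 is the interrupt-busy CPU (the CPU and set IDs are illustrative):

```shell
# Create a processor set containing only the busy CPU; the
# command prints the new set's ID (assume it prints 1 here):
psrset -c 12

# With no LWPs bound to that set, the scheduler will no longer
# place ordinary threads on CPU 12; interrupts, however, still
# arrive there. Display the sets and their CPUs with:
psrset -i

# Destroy the set later to return CPU 12 to general use:
#   psrset -d 1
```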


3) Userland timer functions like poll()/nanosleep() require the system to schedule a softint at interrupt level 1 to run on a CPU. Usually the CPU chosen is the "clock" CPU, as it generates the majority of the softints, but there are other reasons that softints can be generated on a different CPU, and once they are pending for a specific CPU, all other softints are queued to the same CPU for delivery. If the softints are queued to a CPU that starts an interrupt burst at an interrupt level greater than 1, the softints will all queue up until that CPU lowers its interrupt level. This will cause these timing functions to take longer than expected (note: the man pages do say the timeout value is a minimum, but the delay can be unexpectedly large).

A solution to this is to reduce either the rate of interrupts on that CPU or the time taken to service each interrupt, to the point where the CPU drops the interrupt level at regular intervals. For example, a complex storage stack may cause the disk HBA interrupt thread to take a whole millisecond to process each interrupt, so if you exceed 1000 IOPS through that HBA, that CPU is 100% busy and latency bubbles can occur.
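The saturation arithmetic can be checked directly: CPU utilization from interrupt handling is roughly IOPS multiplied by the per-interrupt service time. A small sketch with illustrative numbers (800 IOPS at 1 ms per interrupt):

```shell
# utilization(%) = IOPS * service_time(ms) / 1000ms * 100
awk 'BEGIN { iops = 800; svc_ms = 1.0
             printf "%.0f%%\n", iops * svc_ms / 10 }'
# Prints 80% - the CPU spends 800 of every 1000 ms in the handler.
```

At 1000 IOPS the figure reaches 100% and the CPU never drops its interrupt level.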

It is possible to multipath the disk workload across more controllers to reduce the IOPS per controller, or to work with the storage stack vendor to make the software processing for each interrupt more efficient, say by passing work from the interrupt thread to a kernel taskq thread.

Increasing the size of each I/O also helps, since for a given transfer rate the number of interrupts is reduced; Ethernet NIC jumbo frames are a good example of this.
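As a hedged example for the ce driver mentioned later in this document (the instance number is illustrative, and the exact procedure is driver-specific; some configurations require a ce.conf or /etc/system setting instead):

```shell
# Select ce instance 0, then enable jumbo frame acceptance:
ndd -set /dev/ce instance 0
ndd -set /dev/ce accept-jumbo 1

# Raise the interface MTU so the larger frames are actually used:
ifconfig ce0 mtu 9000
```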


Observing

Most of these utilities are documented elsewhere, so this list is for reference only.

  1. mpstat
  2. lockstat/plockstat
  3. dtrace
  4. prex/tnf
  5. trapstat
  6. netstat
  7. iostat
  8. kstat
  9. vmstat

Product
Sun Fire E12K/E15K/E20K/E25K Server


Internal Section

Tools you can use in Nevada or OpenSolaris[TM]
1) intrd
2) intrstat

Unsupported tools
1) intradm

Changes and Bugs:
  • With the changes in bugid 5017095, pinned threads can be made runnable again if the interrupt handler returns but there is another interrupt pending.
  • Some drivers, like the ce network card driver, might not return for seconds.
  • 6292092 describes timer related softints being stopped by higher priority interrupts.

Previously Published As 87312  

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.