Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1006249.1
Update Date: 2011-02-17

Solution Type  Problem Resolution Sure

Solution  1006249.1 :   Sun Fire[TM] 12K/15K/E20K/E25K: Solaris[TM] 8 domain hangs on resuming JNI I/O device driver  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

Previously Published As
208764


Applies to:

Sun Fire 12K Server
Sun Fire 15K Server
Sun Fire E20K Server
Sun Fire E25K Server
All Platforms
***Checked for relevance on 17-Feb-2011***

Symptoms

On a Sun Fire[TM] 12K/15K/E20K/E25K domain, an attempt to DR out a System Board that contains permanent memory results in the domain hanging. The last entry in the domain's /var/adm/messages file or console log indicates that the domain is attempting to resume an I/O device driver. Even after more than an hour, the domain is still unresponsive and remains at this device resumption stage.
NOTE: When DR'ing a System Board that contains permanent memory out of a domain, the Solaris[TM] OS is suspended temporarily to allow the permanent memory to be reallocated to other System Board resources that will remain in the domain. This suspension is not the "hang" described in this document.

The domain's console log shows the following (SB7 is the board which contains permanent memory):

Aug  5 01:05:46 2004 root@domainA # cfgadm -v -c disconnect SB7
Aug  5 01:05:49 2004 System may be temporarily suspended, proceed (yes/no)? yes
Aug  5 01:05:49 2004 request suspend SUNW_OS
Aug  5 01:05:51 2004 request suspend SUNW_OS done
Aug  5 01:05:51 2004 request delete capacity (4 cpus)
Aug  5 01:05:51 2004 request delete capacity (1048576 pages)
Aug  5 01:05:51 2004 request delete capacity SB7 done
.
.
.
Aug  5 01:09:11 2004    resuming pci108e,8001@3d,600000 (aka pcisch)
Aug  5 01:09:11 2004    resuming JNI,FCR@1,1 (aka jnic146x)
Aug  5 01:09:12 2004    resuming JNI,FCR@1 (aka jnic146x)

In the example above, the domain was forced to the OBP via the reset command.

Cause

DR'ing out a System Board that does not contain permanent memory works without any problem. Because this operation requires no Solaris[TM] OS suspension, the I/O device driver does not need to be resumed, and therefore the domain does not hang.
This is significant because it indicates that DR itself is not the root cause of the problem; DR in general works fine. The problem lies in how DR of permanent memory, which requires a Solaris[TM] OS suspension, interacts with the resumption of the device driver in question.

Solution

The solution in this specific case is to bring the kernel patches and the st driver patch to the following revisions or higher. This picks up the cfgadm fixes contained in the kernel patches and the st driver fixes contained in the st driver patch:
  • 108528-29 - KJP Solaris[TM] 8
  • 117000-05 - KJP Solaris[TM] 8
  • 108725-16 - Solaris[TM] 8 st driver patch

See below for a brief description of why the st driver patch is part of the fix for this particular case.
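
As a rough way to confirm the revisions listed above, the following sketch checks `showrev -p` style output for a minimum patch revision. The function name `patch_at_least` is hypothetical, and the sketch assumes a POSIX-compatible awk (on Solaris 8 this may mean nawk or /usr/xpg4/bin/awk rather than /usr/bin/awk). It inspects only the leading "Patch:" field of each line, so a patch that has been superseded by a different patch ID would not be detected.

```shell
# Hypothetical helper: succeed if patch ID $1 is installed at revision >= $2.
# Reads `showrev -p` style lines ("Patch: 108725-16 Obsoletes: ...") on stdin.
patch_at_least() {
    awk -v id="$1" -v min="$2" '
        $1 == "Patch:" {
            split($2, p, "-")               # p[1] = patch ID, p[2] = revision
            if (p[1] == id && p[2] + 0 >= min + 0) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on a live domain (not run here):
#   showrev -p | patch_at_least 108725 16 && echo "st driver patch OK"
#   showrev -p | patch_at_least 117000 5  && echo "KJP 117000 OK"
```

The comparison is numeric, so a revision such as 05 is treated as 5.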

Prior to DR'ing out a System Board which contains permanent memory, one must modunload the st driver (assuming a tape device is attached to the domain). The procedure to do this follows:

# modinfo |grep tape
144 10301cab  19c8c  33   1  st (SCSI tape Driver 1.218)
# modunload -i 144
# cfgadm -c unconfigure SBXX
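
The manual lookup above can also be sketched as a small helper that pulls the st module ID out of `modinfo` output, so the ID does not have to be copied by hand. The function name `st_module_id` is hypothetical; it assumes the usual `modinfo` column layout (ID first, module name sixth), as in the sample line above.

```shell
# Hypothetical helper: print the module ID of the st driver, if it is loaded.
# Reads `modinfo` output on stdin; column 1 is the ID, column 6 the module name.
st_module_id() {
    awk '$6 == "st" { print $1; exit }'
}

# Usage on a live domain (not run here):
#   id=`modinfo | st_module_id`
#   [ -n "$id" ] && modunload -i "$id"
#   modinfo | st_module_id      # prints nothing once st is unloaded
```

Matching on the module name rather than on the word "tape" avoids picking up other modules whose description happens to mention tape.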

Assuming the st driver patch and the Solaris Kernel Jumbo Patches (KJP) are up to date, this DR should work. If either the st driver patch or the KJP is below the revisions listed above, however, the DR might hang as shown in the "Symptoms" section of this article.

After applying the st driver fixes and the KJP, and confirming that the st driver is now unloaded, the DR unconfigure of the System Board containing permanent memory should work with little delay.


This problem may occur with drivers other than the JNI driver described above, and the fix may be slightly different in your situation.

In this specific case, the domain was configured with the following hardware/software:

  • Solaris 8, KJP 108528-24
  • st driver, 108725-14
  • JNI HBA model jnic146x with firmware rev 3.9
  • StorageTek tape silo attached to the JNI card

As you can see, the st driver is included in the fix because the card is attached to a tape device. If a site had trouble resuming a device driver attached to a disk device, the sd driver would be suspect.


Internal Comments
Related Documents
<Document 1010363.1> "Sun Fire[TM] 12K/15K/E20K/E25K Servers: Dynamic Reconfiguration Considerations"
<Document 1001683.1> "Sun Fire[TM] 12K/15K/E20K/E25K: Location and Relocation of Kernel for DR Operations"
Article written as a result of Radiance case 64199977, Escalation ID 1-2917411
DR, dynamic reconfiguration, cfgadm, rcfgadm, JNI, HBA, I/O, resume, permanent memory, 12k, 15k, 20k, 25k
Previously Published As 77660

  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.