Document Audience:INTERNAL
Document ID:I0997-1
Title:MPI jobs with a large number of processes (>1000) running over the Sun Fire Link interconnect can hang due to shared memory resource limitations.
Copyright Notice:Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date:2003-07-28

---------------------------------------------------------------------
            - Sun Proprietary/Confidential: Internal Use Only -
---------------------------------------------------------------------
                        FIELD INFORMATION NOTICE
               (For Authorized Distribution by SunService)
FIN #: I0997-1
Synopsis: MPI jobs with a large number of processes (>1000) running over the Sun Fire Link interconnect can hang due to shared memory resource limitations.
Create Date: Jul/25/03
SunAlert: No
Top FIN/FCO Report: No
Products Reference: Sun Fire F12K/F15K/6800 Servers
Product Category: Server / Service
Product Affected: 
Systems Affected:
-----------------  
Mkt_ID   Platform   Model    Description            Serial Number
------   --------   -----    -----------            -------------
  -        F12K      ALL     Sun Fire 12000               -
  -        F15K      ALL     Sun Fire 15000               -
  -        S24       ALL     Sun Fire 6800                -  


X-Options Affected:
-------------------
Mkt_ID        Platform   Model    Description              Serial Number
------        --------   -----    -----------              -------------
X4121A           -         -      Sun Fire Link Assembly         -
X4141A           -         -      Sun Fire Link Assembly         -
HPCIS-500-9999   -         -      HPC ClusterTools 5  Media      -
Parts Affected: 
---------------
Part Number      Description       Model
-----------      -----------       -----
     -                -              -
References: 
BugId:   4828184 - hpc_rsmd will hang when trying to create/export 
                   rsm segment totaling >4GB.
         4816959 - RSM: allow RSM_RESOURCE_SLEEP/DONTWAIT flags to 
                   rsm_memseg_export_create(3RSM).

PatchId: 112863-01: HPC 5.0: MPI CRE RSM fixes.
Issue Description: 
Sun HPC ClusterTools jobs consisting of 1000 or more Message Passing
Interface (MPI) processes running over a Sun Fire Link interconnect may
hang.

This issue can occur with Sun Fire 6800 and Sun Fire 12K/15K servers
using a Sun Fire Link interface with Sun HPC ClusterTools 5 Software
without patch 112863-02 or later.

The following command can be used to determine if the patch 112863-02
(or later) has been installed:

    # pkgparam SUNWmpi PATCHLIST
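
The patch list can also be checked mechanically.  The following is a
minimal sketch only: it assumes the PATCHLIST value is a
whitespace-separated list of patch IDs in the form base-revision (for
example 112863-01) and that a POSIX shell such as ksh is used.  The
function name patch_rev is hypothetical and is not part of ClusterTools
or Solaris.

```shell
# patch_rev BASE "LIST"
# Prints the highest two-digit revision of patch BASE found in LIST,
# or prints nothing if the patch does not appear at all.
# Assumption: LIST is a whitespace-separated list of IDs such as 112863-01.
patch_rev() {
    base=$1
    best=
    for p in $2; do
        case $p in
            "$base"-[0-9][0-9])
                rev=${p#"$base"-}
                if [ -z "$best" ] || [ "$rev" -gt "$best" ]; then
                    best=$rev
                fi
                ;;
        esac
    done
    if [ -n "$best" ]; then
        echo "$best"
    fi
}
```

A possible invocation, under the same assumptions:

    # patch_rev 112863 "$(pkgparam SUNWmpi PATCHLIST)"

Empty output would indicate that no revision of 112863 is listed.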

The issue can appear when multiple MPI jobs run with the Sun Fire Link
as the primary interconnect on which they depend to complete.  The
affected MPI jobs hang or run well past their normal completion time.

Each node in a cluster is limited to exporting no more than 4 GB of
memory over the Fire Link interconnect.  Processes within an MPI job
communicate via exported memory.  When the number of processes within a
job becomes extremely large, this limitation can be reached and the MPI
job will hang.

The hang may occur at job startup time or well into a job run if
internode connections are established late in the run.  One way to
check the number of bytes exported on a node is to execute the
following command on the node:

    % kstat wrsm:0:rsmpi_stat

If the reported value is close to 4 GB, then it is likely that the MPI
job will hang.
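
To judge how close a reported value is to the limit, the byte count can
be expressed as a percentage of 4 GB.  This notice does not name the
statistic in the kstat output that carries the exported byte count, so
the sketch below takes the value as an argument instead of parsing
kstat output.  The function names and the 90 percent warning threshold
are hypothetical, 4 GB is taken to mean 4 x 2^30 bytes, and the shell
is assumed to provide 64-bit integer arithmetic.

```shell
# Assumption: the 4 GB export limit is 4 * 2^30 bytes.
EXPORT_LIMIT_BYTES=4294967296

# export_pct BYTES
# Prints BYTES as a whole-number percentage of the export limit
# (integer division, so the result is rounded down).
export_pct() {
    echo $(( $1 * 100 / $EXPORT_LIMIT_BYTES ))
}

# export_status BYTES [THRESHOLD_PCT]
# Prints a WARNING line when usage is at or above THRESHOLD_PCT
# (hypothetical default: 90), otherwise an OK line.
export_status() {
    pct=$(export_pct "$1")
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "WARNING: ${pct}% of the 4 GB export limit in use"
    else
        echo "OK: ${pct}% of the 4 GB export limit in use"
    fi
}
```

For example, with a byte count read by hand from the kstat output:

    % export_status 4000000000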

It is important that Patch 112863-01 (or later) for the Sun HPC
ClusterTools 5 software be installed.  This patch makes the 4 GB export
limit problem much rarer and mitigates its consequences.

With this patch, MPI jobs using default values for the Sun MPI
environment variables will encounter the 4 GB export limit only when
the number of remote processes reaches roughly 1000.  Thus, it is
anticipated that most users will never see this problem.

An MPI job may encounter this limit sooner if other MPI jobs on the
cluster are concurrently vying for the same Fire Link resources.

An MPI job may also encounter the export limit with fewer processes if
Sun MPI environment variables are set to use more memory than the
default.  Specifically, increasing MPI_RSM_SBPOOLSIZE significantly or
setting MPI_RSM_STRONGPARTITION to 1 will increase the amount of
exported memory needed by a job and may cause the 4 GB limit to be
exceeded.
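
Before launching a large job, the environment can be checked for these
two settings.  The following is a minimal sketch: the function name
rsm_env_warnings is hypothetical, and because this notice does not
state the default value of MPI_RSM_SBPOOLSIZE, the sketch only reports
that the variable is set and does not judge whether the value is large.

```shell
# rsm_env_warnings
# Prints one line for each Sun MPI environment setting that this
# notice associates with increased exported memory.  Prints nothing
# if neither setting is present in the current environment.
rsm_env_warnings() {
    if [ "${MPI_RSM_STRONGPARTITION:-}" = "1" ]; then
        echo "MPI_RSM_STRONGPARTITION=1: increases exported memory"
    fi
    if [ -n "${MPI_RSM_SBPOOLSIZE:-}" ]; then
        echo "MPI_RSM_SBPOOLSIZE=${MPI_RSM_SBPOOLSIZE}: default overridden, check the value"
    fi
}
```

Run the function in the shell from which the job will be started, so
that it sees the same environment as the job:

    % rsm_env_warnings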

The final solution for this issue will cause a job to exit with an
appropriate error message when the 4 GB export limit is exceeded.  This
solution requires two patches: Patch 112863-02 (which fixes bug
4828184) and the patch for RFE 4816959 (to be announced when
available).  In the meantime, the probability of this issue occurring
can be greatly reduced by installing patch 112863-01 which is now
available from SunSolve.
Implementation: 
         ---
        |   |   MANDATORY (Fully Proactive)
         ---

         ---
        |   |   CONTROLLED PROACTIVE (per Sun Geo Plan)
         ---

         ---
        | X |   REACTIVE (As Required)
         ---
Corrective Action: 
The following recommendation is provided as a guideline for authorized
Sun Services Field Representatives who may encounter the issue
described above.

   1. Install ClusterTools 5 patch 112863-01 (for immediate mitigation).
   
   2. Install ClusterTools 5 patch 112863-02 when it becomes available.
    
   3. Install the fix for bug ID 4816959 when it becomes available.
   
NOTE: This FIN will be updated when the last solution is released.
Comments: 
None

============================================================================
Implementation Footnote: 
i)   In case of MANDATORY FINs, Sun Services will attempt to contact   
     all affected customers to recommend implementation of the FIN. 
   
ii)  For CONTROLLED PROACTIVE FINs, Sun Services mission critical    
     support teams will recommend implementation of the FIN  (to their  
     respective accounts), at the convenience of the customer. 

iii) For REACTIVE FINs, Sun Services will implement the FIN as the   
     need arises.
----------------------------------------------------------------------------
All released FINs and FCOs can be accessed using your favorite network 
browser as follows:
 
SunWeb Access:
-------------- 
* Access the top level URL of http://sdpsweb.central/FIN_FCO/

* From there, select the appropriate link to query or browse the FIN and
  FCO Homepage collections.
 
SunSolve Online Access:
-----------------------
* Access the SunSolve Online URL at http://sunsolve.central/

* From there, select the appropriate link to browse the FIN or FCO index.

Internet Access:
----------------
* Access the top level URL of https://spe.sun.com
--------------------------------------------------------------------------
General:
--------
* Send questions or comments to [email protected]
--------------------------------------------------------------------------
Status: active