Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-73-1022287.1
Update Date:2011-03-04
Keywords:

Solution Type  FAB (standard) Sure

Solution  1022287.1 :   Update to Service Processor firmware to resolve hangs and related symptoms.  


Related Items
  • Sun Storage 7410 Unified Storage System
  • Sun Storage 7310 Unified Storage System
  • Sun Storage 7110 Unified Storage System
  • Sun Storage 7210 Unified Storage System
Related Categories
  • GCS>Sun Microsystems>Sun FAB>Standard>Reactive

PreviouslyPublishedAs
279330


Bug Id
<SUNBUG: 6921482>

Product
Sun Storage 7110 Unified Storage System
Sun Storage 7210 Unified Storage System
Sun Storage 7310 Unified Storage System
Sun Storage 7410 Unified Storage System

Date of Workaround Release
15-Apr-2010

Older versions of the Service Processor (SP) firmware can leak memory (see details below).

Impact

Older versions of the Service Processor firmware can leak memory, eventually resulting in a variety of issues as listed in the symptoms section.

Contributing Factors

Above listed platforms with Service Processor firmware not up to the levels described in this document are impacted by this issue.

When present, the issues surface after 30 to 60 days of uptime.  There is some variation in the time between failures, their severity, and even whether or not they occur at a particular site.  The reasons for these variations are not known at this time.

Symptoms

Cannot connect to Service Processor via serial or network.

Service Processor absent from hardware details page in BUI Alert.

Service Processor has stopped responding to requests.

Directories, such as /SYS, missing from SP interface.

Fans in server node running continuously at full speed.

Slow throughput to system disks (due to fan vibration).

Time out during software upgrade (due to system disks/fan vibration).

Root Cause

Root Cause is attributed to a number of CRs for memory leaks on the Service Processor.  Defects (memory leaks) in the Service Processor firmware lead to an out of memory condition, and an inability to respond to requests.  The condition deteriorates until the Service Processor is reset.

Corrective Action

Workaround:
 
The appliance software, as of version 2009.Q3, has a mechanism to reset the Service Processor every 60 days, or sooner if it becomes unresponsive.  This is sufficient to prevent the issues on the majority of systems.
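
The reset policy described above can be illustrated with a minimal sketch. The appliance's actual implementation is not described in this document; the function name and parameters below are hypothetical, and only the 60-day interval and the "sooner if unresponsive" rule come from the text.

```python
RESET_INTERVAL_DAYS = 60  # interval stated in this document


def should_reset_sp(days_since_last_reset: float, sp_responsive: bool) -> bool:
    """Return True when the Service Processor should be reset.

    Hypothetical illustration of the policy: reset immediately if the SP
    has stopped responding, otherwise reset once the interval has elapsed.
    """
    if not sp_responsive:
        return True
    return days_since_last_reset >= RESET_INTERVAL_DAYS
```

For example, should_reset_sp(10, False) is True because an unresponsive SP is reset regardless of how recently it was last reset.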

For systems that experience the problems described above, use the following procedure:

First, ensure the Service Processor is responding.  This is best done by resetting the Service Processor.  Use one of the following two methods:

Enter "maintenance hardware select chassis-000 select sp reset" at the appliance kit shell.

Download http://tsc-storage.us/products/AmberRoad/download/spreset.akwf, then install and run it from the Maintenance/Workflows screen in the BUI.  Consult the appliance help under Maintenance/Workflows for assistance with this process.  After executing the workflow, and ensuring that it ran successfully, delete it from the customer system.

This process takes some time - on the order of five minutes.  The main external indication that the reset has completed is that the fans spin down to a normal speed.  You can also monitor progress for any of these operations via a serial connection to the SP.
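
If the SP has a network address, the wait can also be scripted. The following is a rough sketch only: it assumes that an open ssh port (22) on the SP is an acceptable sign that the SP is back, which this document does not state, and the function names, timeouts and host name are all hypothetical. The fan speed and the Alert Log remain the indications named in this procedure.

```python
import socket
import time


def sp_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_sp(host, probe=sp_port_open, max_wait=600.0, interval=15.0,
                sleep=time.sleep, clock=time.monotonic):
    """Poll until probe(host) succeeds or max_wait seconds have passed.

    probe, sleep and clock are injectable so the loop can be exercised
    without a real Service Processor.
    """
    deadline = clock() + max_wait
    while True:
        if probe(host):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A call such as wait_for_sp("sp.example.com") would poll for up to ten minutes, twice the "order of five minutes" quoted above; the host name is a placeholder.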

Next, verify that the Service Processor has been reset, via the Alert Log.  You should see that the service processor either stopped, then resumed responding to requests, or simply resumed, in the case of a Service Processor that was previously unresponsive.

Download the correct BIOS and Service Processor firmware for the system being serviced, as follows:

For Sun Storage 7110, 7310, 7410:

   http://tsc-storage.us/products/AmberRoad/download/0ABMN064-r45008.pkg

For Sun Storage 7210:

   http://tsc-storage.us/products/AmberRoad/download/0ABNF032-r45117.pkg
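
To avoid picking the wrong image, the model-to-package mapping above can be written down as data. This is a small sketch; the package names and base URL are taken from this document, while the function name is an assumption made for illustration.

```python
BASE_URL = "http://tsc-storage.us/products/AmberRoad/download/"

# Package per model, as listed in this document.
FIRMWARE_PACKAGES = {
    "7110": "0ABMN064-r45008.pkg",
    "7310": "0ABMN064-r45008.pkg",
    "7410": "0ABMN064-r45008.pkg",
    "7210": "0ABNF032-r45117.pkg",
}


def firmware_url(model: str) -> str:
    """Return the download URL for the given Sun Storage model number."""
    try:
        return BASE_URL + FIRMWARE_PACKAGES[model]
    except KeyError:
        raise ValueError(f"no firmware package listed for model {model!r}") from None
```

Raising an error for an unlisted model, rather than falling back to a default, reflects the fact that the 7210 image differs from the one used by the other three models.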

Connect to the Service Processor via ssh using root credentials. Use this interface to shut down the head you are working on with "stop /SYS".

Connect to the Service Processor IP address via browser and provide the root login credentials.

Follow these steps to upgrade the Service Processor and BIOS:

  1. Click on Maintenance tab

  2. Firmware Upgrade will be the default and correct subtab

  3. Click on "Enter Upgrade Mode"

  4. Confirm this action with the pop up

  5. Click on "Browse" and select the appropriate image from your local filesystem

  6. Click on "Upload"

  7. Wait for upload to complete and the verification to succeed

  8. You will now see a Summary Table of the SP firmware and BIOS versions
     (Existing vs New).  Confirm that "Preserve existing configuration" is
     checked for the SP Firmware

  9. Click on "Start Upgrade"

 10. Confirm this action with the pop up

 11. Now wait for the upgrade to proceed. If the head was up at this point, it will
     be shut down cleanly.

     Warning! Do not interrupt the update. Leave the browser undisturbed until the
     update is complete.

 12. When finished, you will see "Upgrade Complete" and the SP will reboot.

The SP firmware and BIOS will now have been updated to the correct 7000 version.  Now you must configure some specific BIOS settings.  Boot the head and enter setup with:

    -> start /SYS
    Are you sure you want to start /SYS (y/n)? y
    Starting /SYS

    -> start /SP/console
    Are you sure you want to start /SP/console (y/n)? y

    Serial console started.  To stop, type ESC (

Once you see the initial BIOS banner, hit CONTROL-E a few times; this will trigger the BIOS Setup menu after the initialization.

You can drop back to the SP with ESC-(
NOTE: This is Escape followed by an open parenthesis, which is usually Shift-9.

NOTE: If the initialisation hangs on a 7310/7410, and it is part of a cluster with
      the other head up and in service, disconnect the SAS cables to the J4400 JBODs,
      drop back to the SP and reset with:

    Serial console stopped.

    -> reset /SYS
    Are you sure you want to reset /SYS (y/n)? y
    Performing hard reset on /SYS

    -> start /SP/console
    Are you sure you want to start /SP/console (y/n)? y

    Serial console started.  To stop, type ESC (

If you use this workaround, be very certain to reconnect the SAS cables immediately after correcting the BIOS settings.

Once into the BIOS Setup screen, start by loading factory defaults. To do this, use the right arrow key to move over to the "Exit" menu.  Down arrow to "Load Optimal Defaults" and press return, then press return again to confirm the popup asking "Load Optimal Defaults".

Now follow the specific instructions for the appropriate appliance:

For Sun Storage 7110:

Disable PCIPnP Option-ROM scanning for slots 1-5
Disable I/O allocation

Use the right arrow key to page over to "PCIPnP" menu. Use the down arrow to highlight:

  Scanning OPROM on PCI-E Slot1 Enabled

Press return and select "Disabled". This will now appear as:

  Scanning OPROM on PCI-E Slot1 Disabled

Repeat this for slots 2-5 (the last slot is off the bottom of the screen).

You should now have:

    Scanning OPROM on PCI-E Slot0 Enabled
    Scanning OPROM on PCI-E Slot1 Disabled
    Scanning OPROM on PCI-E Slot2 Disabled
    Scanning OPROM on PCI-E Slot3 Disabled
    Scanning OPROM on PCI-E Slot4 Disabled
    Scanning OPROM on PCI-E Slot5 Disabled


Just below these OPROM settings are a group of settings which allow IO allocation to be disabled per-slot.  Disable PCI-E slots 1-4. Only slots 0 and 5 should be enabled.  It should look like:

  IO Allocation on PCI-E Slot0 Enabled
  IO Allocation on PCI-E Slot1 Disabled
  IO Allocation on PCI-E Slot2 Disabled
  IO Allocation on PCI-E Slot3 Disabled
  IO Allocation on PCI-E Slot4 Disabled
  IO Allocation on PCI-E Slot5 Enabled

 
On boot, you will see the following warning message from the BIOS:

  Warning: IO resource not allocated

This is an expected message and does not indicate a failure.

Exiting BIOS Setup

  Use right arrow to page over to "Exit". Press return for the default "Save Changes
  and Exit", and return again to confirm the action with the pop up.

For Sun Storage 7210:

Disable PCIPnP Option-ROM scanning for all slots
Disable I/O allocation

Use the right arrow key to page over to "PCIPnP" menu. Use the down arrow to highlight:

    Scanning OPROM on PCI-E Slot0 Enabled

Press return and select "Disabled". This will now appear as:

    Scanning OPROM on PCI-E Slot0 Disabled

Repeat this for slots 1 and 2. You should now have:

    Scanning OPROM on PCI-E Slot0 Disabled
    Scanning OPROM on PCI-E Slot1 Disabled
    Scanning OPROM on PCI-E Slot2 Disabled


Just below these OPROM settings are a group of settings which allow IO allocation to be disabled per-slot.  Disable PCI-E slots 0 and 2. Only slot 1 should be enabled.  It should look like:

    IO Allocation on PCI-E Slot0 Disabled
    IO Allocation on PCI-E Slot1 Enabled
    IO Allocation on PCI-E Slot2 Disabled


On boot, you will see the following warning message from the BIOS:

  Warning: IO resource not allocated

This is an expected message and does not indicate a failure.

Exiting BIOS Setup

  Use right arrow to page over to "Exit". Press return for the default "Save Changes
  and Exit", and return again to confirm the action with the pop up.

For Sun Storage 7310:

Disable PCIPnP Option-ROM scanning for all slots
Disable I/O allocation
Configure boot drives

Use the right arrow key to page over to "PCIPnP" menu. Use the down arrow to highlight:

    Scanning OPROM on PCI-E Slot0 Enabled

Press return and select "Disabled", followed by return. This will now appear as:

    Scanning OPROM on PCI-E Slot0 Disabled

Repeat this for slots 1-2.  You should now have:

    Scanning OPROM on PCI-E Slot0 Disabled
    Scanning OPROM on PCI-E Slot1 Disabled
    Scanning OPROM on PCI-E Slot2 Disabled


Just below these OPROM settings are a group of settings which allow IO allocation to be disabled per-slot.  Disable PCI-E slots 1 and 2. Only slot 0 should be enabled.  It should look like:

    IO Allocation on PCI-E Slot0 Enabled
    IO Allocation on PCI-E Slot1 Disabled
    IO Allocation on PCI-E Slot2 Disabled


Next, arrow over to the Boot menu. Select the last item: "Hard Disk Drives" and press return.  The list should include only 2 drives (the 2 internal SATA drives) with labels like:
 
    SATA:11M-<drive model>
    SATA:12M-<drive model>

If this list includes anything else (such as readzilla cache devices with a 'STEC MACH8' string, or JBOD attached drives) you'll need to remove them from the list by selecting the boot position and setting it to 'Disabled' for each of the non-boot drives.

If the list is full (with 16 drives) you will not be able to edit the list.  However, the change to the OPROM settings above will cause the JBOD drives to disappear from the list on the next boot.  You will need to exit and save changes and immediately re-enter the BIOS menu on the next boot (CTRL-E).

Exiting BIOS Setup

  Once you've removed any readzilla cache or JBOD drive entries from the "Hard Disk
  Drives" list, perform the following:

   . Press ESC to exit the "Hard Disk Drives" menu, then arrow right to the "Exit" menu.
   . Press return for the default "Save Changes and Exit", and return again to confirm
     the action with the pop up.

On boot, you will see the following warning message from the BIOS:

  Warning: IO resource not allocated

This is an expected message and does not indicate a failure.

For Sun Storage 7410:

Disable PCIPnP Option-ROM scanning for all slots
Disable I/O allocation
Configure boot drives

Use the right arrow key to page over to "PCIPnP" menu. Use the down arrow to highlight:

    Scanning OPROM on PCI-E Slot0 Enabled

Press return and select "Disabled", followed by return. This will now appear as:

    Scanning OPROM on PCI-E Slot0 Disabled

Repeat this for slots 1-5 (the last slot is off the bottom of the screen).  You should now have:

    Scanning OPROM on PCI-E Slot0 Disabled
    Scanning OPROM on PCI-E Slot1 Disabled
    Scanning OPROM on PCI-E Slot2 Disabled
    Scanning OPROM on PCI-E Slot3 Disabled
    Scanning OPROM on PCI-E Slot4 Disabled
    Scanning OPROM on PCI-E Slot5 Disabled


Just below these OPROM settings (they are actually off the bottom of the screen and you will need to scroll down) are a group of settings which allow IO allocation to be disabled per slot.  Disable PCI-E slots 0-3, checking that slots 4 and 5 are Enabled.  It should look like:

  IO Allocation on PCI-E Slot0 Disabled
  IO Allocation on PCI-E Slot1 Disabled
  IO Allocation on PCI-E Slot2 Disabled
  IO Allocation on PCI-E Slot3 Disabled
  IO Allocation on PCI-E Slot4 Enabled
  IO Allocation on PCI-E Slot5 Enabled
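
The per-slot settings listed for the four models can be collected in one table and compared against text copied from the BIOS screen. This is a sketch under assumptions: the Enabled/Disabled values are those given in this procedure, but the function names and the idea of pasting the screen text into a script are illustrative only.

```python
import re


def slots(count, enabled):
    """Map slot number -> True (Enabled) / False (Disabled)."""
    return {n: n in enabled for n in range(count)}


# Expected settings per model, as listed in this procedure.
EXPECTED = {
    "7110": {"oprom": slots(6, {0}), "io": slots(6, {0, 5})},
    "7210": {"oprom": slots(3, set()), "io": slots(3, {1})},
    "7310": {"oprom": slots(3, set()), "io": slots(3, {0})},
    "7410": {"oprom": slots(6, set()), "io": slots(6, {4, 5})},
}

LINE_RE = re.compile(
    r"(Scanning OPROM|IO Allocation) on PCI-E Slot(\d+)\s+(Enabled|Disabled)"
)


def parse_screen(text):
    """Extract the per-slot settings from pasted BIOS screen text."""
    found = {"oprom": {}, "io": {}}
    for kind, slot, state in LINE_RE.findall(text):
        key = "oprom" if kind.startswith("Scanning") else "io"
        found[key][int(slot)] = state == "Enabled"
    return found


def mismatches(model, text):
    """Return (setting, slot, expected, found) for each wrong or missing slot."""
    found = parse_screen(text)
    problems = []
    for key, expected in EXPECTED[model].items():
        for slot, want in expected.items():
            got = found[key].get(slot)  # None when the slot is not in the text
            if got != want:
                problems.append((key, slot, want, got))
    return problems
```

An empty list from mismatches("7410", pasted_text) means every listed slot matches this procedure; a slot that does not appear in the pasted text is reported with a found value of None.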


Next, arrow over to the Boot menu. Select the last item: "Hard Disk Drives" and press return. The list should include only 2 drives (the 2 internal SATA drives) with labels like:

  SATA:11M-<drive model>
  SATA:12M-<drive model>


If this list includes anything else (such as readzilla cache devices with a 'STEC MACH8' string, or JBOD attached drives) you will need to remove them from the list by selecting the boot position and setting it to 'Disabled' for each of the non-boot drives.

If the list is full (with 16 drives) you will not be able to edit the list.  However, the change to the OPROM settings above will cause the JBOD drives to disappear from the list on the next boot.  You will need to exit and save changes and immediately re-enter the BIOS menu on the next boot (CTRL-E).
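
The rules for the "Hard Disk Drives" list can be summarised in a short sketch. It assumes that the internal boot drives are the entries whose label starts with "SATA:" and does not contain the 'STEC MACH8' string; that matching rule, the function name and the example labels are assumptions, not something the BIOS or this document defines.

```python
MAX_LIST_ENTRIES = 16  # a full list cannot be edited, per this procedure


def review_boot_list(entries):
    """Split list entries into (keep, disable) and report editability.

    keep:     entries that look like the internal SATA boot drives
    disable:  everything else (readzilla cache devices, JBOD drives)
    editable: False when the list is full and cannot be edited yet
    """
    keep, disable = [], []
    for entry in entries:
        internal = entry.startswith("SATA:") and "STEC MACH8" not in entry
        (keep if internal else disable).append(entry)
    editable = len(entries) < MAX_LIST_ENTRIES
    return keep, disable, editable
```

When editable is False, the list is full: save the OPROM changes, exit, and re-enter BIOS Setup on the next boot before reviewing the list again, as described above.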

Exiting BIOS Setup

  Once you've removed any readzilla cache or JBOD drive entries from the "Hard Disk
  Drives" list, perform the following:

  . Press ESC to exit the "Hard Disk Drives" menu, then arrow right to the "Exit" menu.
  . Press return for the default "Save Changes and Exit", and return again to confirm
    the action with the pop up.

On boot, you will see the following warning message from the BIOS:

  Warning: IO resource not allocated

This is an expected message and does not indicate a failure.

Resync SP Password

Finally, resync the SP password to match the root password of the NAS head.  Have the customer complete this final step.  Remember, you exit back to the SP using ESC-(

    -> cd /SP/users/root
    /SP/users/root

    -> set password
    Enter new password: *********
    Enter new password again: *********

NOTE: The SP has a minimum password length of 8, which is not enforced by the appliance
      system software; i.e., if the customer has the hopelessly simple password "abc",
      this will be rejected by the SP. To resolve this, the customer will need to set a new
      password from the appliance, which in turn will update the SP password directly.
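
The length rule in the note above can be checked before attempting the resync. This is a minimal sketch: only the minimum length of 8 comes from this document, the function name is hypothetical, and the SP may apply other password rules that are not covered here.

```python
SP_MIN_PASSWORD_LENGTH = 8  # minimum length stated in this document


def meets_sp_length_rule(password: str) -> bool:
    """Return True if the password is long enough for the SP to accept.

    Only the length is checked; any other SP password rules are unknown
    to this sketch.
    """
    return len(password) >= SP_MIN_PASSWORD_LENGTH
```

If the check fails, have the customer set a longer root password from the appliance first, as described in the note.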

Resolution:

In a future release, in-band Service Processor updates will be supported.  At that point in time, the reset procedure will be removed from the appliance software, and avoiding these issues will be as simple as keeping the system software up to date.

Identification of Affected Parts (how to):

Connect via ssh to the Service Processor and supply root credentials.  The SP version will be displayed as part of the logon banner.  The current version for the 7110, 7310 and 7410 is 2.0.2.16. Version 2.0.2.15 is current for the 7210.  Any prior version is susceptible to these issues.

Note that checking the SP version via other means, such as the administrative BUI, can be unreliable.  Due to a bug in some releases, version 2.0.2.16 may also be displayed as 2.0.2.22.

Comments

Version 2.0.2.16 is the latest supported version of the Service Processor firmware for the 7110, 7310 and 7410.  Version 2.0.2.15 is the latest supported version for the 7210.  Newer versions should not be used unless specifically tested and released for the appliance.  If a newer version is found, with the exception noted above in the "Identification of Affected Parts" section, you should escalate the case to TSC Backline, and additionally report the version found and serial number of the system to those listed as Contributor and Responsible Manager in the "Contacts" section of this FAB below.
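
The version rules in this FAB can be expressed as a short comparison. This is a sketch, not a supported tool: the version numbers are those stated in this document, while the function names, the result labels and the via_ssh_banner flag are assumptions introduced for illustration.

```python
# Latest supported SP firmware per model, as stated in this document.
CURRENT_SP_VERSION = {
    "7110": "2.0.2.16",
    "7310": "2.0.2.16",
    "7410": "2.0.2.16",
    "7210": "2.0.2.15",
}


def parse_version(text):
    """Turn '2.0.2.16' into (2, 0, 2, 16) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def classify_sp_version(model, reported, via_ssh_banner=True):
    """Classify a reported SP version for the given model.

    Returns 'susceptible' (older than current), 'current', 'newer'
    (escalate per the Comments section), or 'recheck' when a source other
    than the ssh banner shows 2.0.2.22, which may really be 2.0.2.16.
    """
    current = parse_version(CURRENT_SP_VERSION[model])
    found = parse_version(reported)
    if not via_ssh_banner and current == (2, 0, 2, 16) and found == (2, 0, 2, 22):
        return "recheck"  # confirm the version from the ssh logon banner
    if found < current:
        return "susceptible"
    if found == current:
        return "current"
    return "newer"
```

Comparing tuples of integers rather than strings matters here: as text, "2.0.2.9" would sort after "2.0.2.16".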

This procedure assumes that the Service Processor has been configured with an IP address. If this has not been done, refer to the appliance documentation under "Installation".

There is no minimum system software requirement to run this procedure; however, the customer should always follow the standard guideline of running no more than one major version behind the current release.

This procedure can be performed in a "rolling" fashion on cluster systems: simply perform it on one node at a time, and the clustering software will move resources to the partner node.





For information about FAB documents, its release processes, implementation strategies and billing information, go to the following URL:

For Sun Authorized Service Providers go to:

In addition to the above you may email:


Internal Contributor/submitter
[email protected]

Internal Eng Responsible Engineer
[email protected]
Responsible Manager: [email protected]

Internal Services Knowledge Engineer
[email protected]

Internal Eng Business Unit Group
NWS (Network Storage)

Internal Sun Alert & FAB Admin Info
06-Apr-2010: Completed draft and sent to Extended Review.
08-Apr-2010: On-hold awtg submitter corrections per feedback from Ext Rvw.
15-Apr-2010: Submitter provided corrections - sending to Publish.


Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.