Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1009154.1
Update Date:2011-03-21
Keywords:

Solution Type  Problem Resolution Sure

Solution  1009154.1 :   IBM LTO GEN-1 - Tape Drives Produce Intermittent Write and Position Errors  


Related Items
  • Sun StorageTek L700 Tape Library
Related Categories
  • GCS>Sun Microsystems>Storage - Tape>Libraries - L-Series

PreviouslyPublishedAs
212678


Oracle Confidential (PARTNER). Do not distribute to customers
Reason: Confidential for Partners and Oracle Support personnel

Applies to:

Sun StorageTek L700 Tape Library
All Platforms
Checked for relevance on 21-Mar-2011.
Outdated information on EOL product.
Set for Archive.

Symptoms

THIS IS INTERNAL ONLY. PLEASE DO NOT DISTRIBUTE ANY PART OF THIS TO THE CUSTOMER.

IBM LTO GEN1 drives in a Sun StorEdge[TM] L700 library under heavy, constant use have occasionally failed with a SCSI write error while in use by NetBackup:
Sep 14 01:40:08 WARNING: /sbus@2,0/QLGC,qla@1,30400/st@4,0 (st47):
Sep 14 01:40:08 apollo Error for Command: write
Sep 14 01:40:08 apollo Requested Block: 196 Error Block: 196
Sep 14 01:40:08 apollo Vendor: IBM
Sep 14 01:40:08 apollo Sense Key: Media Error
Sep 14 01:40:08 apollo ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x6
This write error also appears to corrupt the index on the tape, so that subsequent use of the tape in other drives fails with:
Sep 14 01:57:32 WARNING: /sbus@2,0/QLGC,qla@1,30400/st@4,0 (st47):
Sep 14 01:57:32 apollo Error for Command: rezero/rewind
Sep 14 01:57:32 apollo Requested Block: 0 Error Block: 0
Sep 14 01:57:32 apollo Vendor: IBM Serial Number:
Sep 14 01:57:32 apollo Sense Key: Media Error
Sep 14 01:57:32 apollo ASC: 0x14 (recorded entity not found), ASCQ: 0x0, FRU: 0x7
As a result, all drives appear to be failing at random, but the failures can be traced back to the first drive that produced the write error and corrupted the tape.
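To see the pattern across a long messages file, the st driver errors can be reduced to one record per failed command. The following is a minimal Python sketch, not a Sun or STK tool. It assumes the syslog text has the layout of the excerpts above; the function name summarize_st_errors and the regular expressions are hypothetical and were written only against those samples.

```python
import re

# Patterns are assumptions derived from the two syslog excerpts above.
CMD_RE = re.compile(r"Error for Command: (?P<cmd>\S+)")
KEY_RE = re.compile(r"Sense Key: (?P<key>.+?)\s*$")
ASC_RE = re.compile(r"\bASC: (?P<asc>0x[0-9a-fA-F]+) \((?P<text>[^)]*)\)")

def summarize_st_errors(lines):
    """Group st driver error lines into one dict per failed command."""
    events = []
    current = None
    for line in lines:
        m = CMD_RE.search(line)
        if m:
            # A new "Error for Command" line starts a new record.
            current = {"command": m.group("cmd")}
            events.append(current)
            continue
        if current is None:
            continue
        m = KEY_RE.search(line)
        if m:
            current["sense_key"] = m.group("key")
            continue
        m = ASC_RE.search(line)
        if m:
            current["asc"] = int(m.group("asc"), 16)
            current["asc_text"] = m.group("text")
    return events

sample_syslog = [
    "Sep 14 01:40:08 apollo Error for Command: write",
    "Sep 14 01:40:08 apollo Sense Key: Media Error",
    "Sep 14 01:40:08 apollo ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x6",
    "Sep 14 01:57:32 apollo Error for Command: rezero/rewind",
    "Sep 14 01:57:32 apollo Sense Key: Media Error",
    "Sep 14 01:57:32 apollo ASC: 0x14 (recorded entity not found),",
]

for event in summarize_st_errors(sample_syslog):
    print(event["command"], hex(event["asc"]), event["asc_text"])
```

A write failure with ASC 0xc followed later by rezero/rewind or positioning failures with ASC 0x14 is the sequence described in this document.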
The media log report shows that the tape first incurs the write error (MTWEOF); the cartridge then fails with a positioning error (MTFSF) in whichever drive it is loaded into.
Date/Time Media Server Client Job ID Severity Description Policy Schedule Status Process
-----------------------------------------------------------------------------------------------------------------------------------------------
10/04/2004 20:05:09 apollo yankee 141419 Error ioctl (MTWEOF) failed on media id D10875, drive index 4 (silo drive 2), I/O error (bptm.c.17770)
10/04/2004 21:18:32 apollo redsox 141503 Error ioctl (MTFSF) failed on media id D10875, drive index 9 (silo drive 13), I/O error (bptm.c.6522) bptm
10/04/2004 21:49:18 apollo yankee 141639 Error ioctl (MTFSF) failed on media id D10875, drive index 4 (silo drive 2), I/O error (bptm.c.6522) bptm
10/04/2004 22:19:40 apollo Beantown 141692 Error ioctl (MTFSF) failed on media id D10875, drive index 9 (silo drive 13), I/O error (bptm.c.6522) bptm
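The trace-back to the originating drive can be sketched in Python. This is an illustration only, not a NetBackup utility. It assumes the media log report has been saved as text in the layout shown above and that the lines are in chronological order; the function name first_write_error_drives and the regular expression are hypothetical.

```python
import re

# Pattern is an assumption based on the bptm lines shown above,
# not on a documented NetBackup report format.
LINE_RE = re.compile(
    r"ioctl \((?P<op>MT\w+)\) failed on media id (?P<media>\w+), "
    r"drive index (?P<index>\d+) \(silo drive (?P<silo>\d+)\)"
)

def first_write_error_drives(lines):
    """Map each media id to the (drive index, silo drive) of its first MTWEOF failure."""
    suspects = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("op") == "MTWEOF":
            # setdefault keeps only the earliest write failure per tape.
            suspects.setdefault(
                m.group("media"),
                (int(m.group("index")), int(m.group("silo"))),
            )
    return suspects

sample_media_log = [
    "10/04/2004 20:05:09 apollo yankee 141419 Error ioctl (MTWEOF) failed on media id D10875, drive index 4 (silo drive 2), I/O error (bptm.c.17770)",
    "10/04/2004 21:18:32 apollo redsox 141503 Error ioctl (MTFSF) failed on media id D10875, drive index 9 (silo drive 13), I/O error (bptm.c.6522) bptm",
    "10/04/2004 21:49:18 apollo yankee 141639 Error ioctl (MTFSF) failed on media id D10875, drive index 4 (silo drive 2), I/O error (bptm.c.6522) bptm",
]

print(first_write_error_drives(sample_media_log))
```

For the sample lines, the tape D10875 maps to drive index 4 (silo drive 2), the drive that logged the MTWEOF failure before any of the MTFSF failures.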
The drive identity as reported by sgscan:
/dev/sg/c0t2l0: Tape (/dev/rmt/0): "IBM ULTRIUM-TD1 3CKE" : NOT-IN-ST-CONFIG-FILE


Solution

Resolution
This problem seems to be inherent in the IBM LTO GEN1 drive, although STK has not officially acknowledged this. In a couple of cases, the resolution has been to test the drives proactively and replace marginal drives before they fail.

1. Upgrade the drives to the latest f/w (currently 3CKE, soon to be 4561) and enable the cleaning bit on each tape drive.
2. Clean all the drives once a week, whether they need it or not.
3. Have STK go onsite once a week and run the capacity test. If a drive fails, they clean it and run the test again. If it still fails, they replace the tape drive.

PTS also recommended that each time a write error occurs, the customer should notify Sun and isolate the tape.
Sun should then send a task to STK to collect the tape and pull the logs from it. This should supply the serial numbers of the drives in which the tape was previously used.
LTO drives do not keep a log as DLT drives do; instead they have counters that can be dumped, although these are of limited value.
To dump the counters, use the NBU command below:
# /usr/openv/volmgr/bin/scsi_command -d /dev/rmt/0 -log_dump
Note: The major fix in f/w 4561 is for tape cleaning. Since
TapeAlert does not work, the new f/w cleans the tape
after every 40 or so mounts.
Note: The capacity test performs a write to the drive. Since
LTO corrects errors on the fly, a drive with many errors
takes longer to complete the write of a full tape.
If the drive exceeds a specific time, it may have a
problem.
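
The pass/fail logic of the capacity test, as described in the note above, amounts to comparing the elapsed time of a full-tape write against a limit. The Python sketch below only illustrates that comparison. The time limit used by the actual test is not stated in this document, so the limit and the timings here are placeholder values, and the function name classify_capacity_test is hypothetical.

```python
def classify_capacity_test(elapsed_minutes, limit_minutes):
    """Return "suspect" if the full-tape write took longer than the limit, else "ok"."""
    return "suspect" if elapsed_minutes > limit_minutes else "ok"

# Placeholder values for illustration only; they are not measurements
# and the limit is not the threshold used by the real capacity test.
LIMIT_MINUTES = 150.0
example_timings = {"drive A": 118.0, "drive B": 171.5}

for name, minutes in sorted(example_timings.items()):
    print(name, classify_capacity_test(minutes, LIMIT_MINUTES))
```

A drive flagged as suspect would be cleaned and retested, and replaced if it is flagged again, following step 3 of the resolution.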



@ Thanks to John Howard (PTS engineer emeritus) for all the help in resolving this customer's long-running tape errors.

Below is an actual summary of the work performed by STK in relation to this problem:

Here is a synopsis of what occurred during our maintenance window today 09/20/04 from 12:00 to 17:00.

1. Updated firmware on drives 1,2,3,4,5,11, and 13 to level 4561.

2. Enabled cleaning bit on all devices via the oemtapetest utility ver 1.44f.

3. Cleaned all drives twice before running capacity test.

4. Ran capacity test on all drives. Drives 2,3,7,8 and 13 failed the initial test.

5. Cleaned failed drives two more times.

6. Reran capacity test on failed drives. Only drive 13 passed this time.

7. Replaced drive 7 with spare drive which we brought onsite, and tested new drive with capacity test ok.

8. Analyzed tape W11546. CM registered 03. Customer needs to freeze tape until it goes to scratch status and then reinitialize it.

9. Will order three replacement drives based on when the next maintenance window is to occur.
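
The drive outcomes in the synopsis above can be checked with simple set arithmetic. The Python sketch below uses only the drive numbers from steps 4, 6 and 7. Reading the three drives on order in step 9 as the drives that failed both tests and were not replaced onsite is an inference, not something STK states.

```python
# Drive numbers taken from the synopsis above.
failed_initial = {2, 3, 7, 8, 13}   # step 4: failed the first capacity test
passed_retest = {13}                # step 6: passed after further cleaning
replaced_onsite = {7}               # step 7: swapped for the spare drive

still_failing = failed_initial - passed_retest
awaiting_replacement = still_failing - replaced_onsite

print(sorted(still_failing))         # [2, 3, 7, 8]
print(sorted(awaiting_replacement))  # [2, 3, 8]
```

Three drives remain after the onsite swap, which matches the count of replacement drives in step 9.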

Keywords: IBM LTO GEN1, L700, Sense Key: Media Error, ASC: 0xc (write error), ASCQ: 0x0, FRU: 0x6, Intermittent Write errors, Position errors
Previously Published As 79037



Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.