HP LTO-2 - Drive May Hang With VERITAS NetBackup's bptm While SCSI Locate Command

Asset ID:	1-72-1012653.1
Update Date:	2011-05-09
Keywords:

Solution Type Problem Resolution Sure

Solution 1012653.1 : HP LTO-2 - Drive May Hang With VERITAS NetBackup's bptm While SCSI Locate Command

Applies to:

Sun Storage L100 Tape Library
Sun Storage L25 Tape Library
HP LTO2 Tape Drive - Version: Not Applicable and later [Release: N/A and later]
All Platforms
Checked for relevance on 9-May-2011.

Symptoms

HP's Ultrium 2 drive in Sun StorEdge L25/L100 may hang with VERITAS NetBackup's bptm command while SCSI locate command. Once HP Ultrium 2 drive hangs, SCSI timeout will occur three hours later. It usually occurs, bptm issues SCSI locate

command against sg tape device after bptm's ioctl is invoked for MTREW command.

For example, messages below at Dec 2 18:02:51 SCSI timeout error occurred.
NetBackup's "logs/bptm/log.120205" showed ioctl's MTREW was issued at 15:01:40 on the same day. "scsi command failed, may be timeout, scsi_pkt.us_reason = 6" was logged by sg driver since it was detected SCSI timeout for SCSI locate command.

messages:

Dec  2 15:01:00 v4u-v240b tldcd[7746]: [ID 359804 daemon.notice] TLD(0) opening robotic path /dev/sg/c3t0l0
Dec  2 15:01:00 v4u-v240b tldcd[7746]: [ID 237046 daemon.notice] inquiry() function processing library ATL      1500             12.012.0
:
Dec  2 15:01:00 v4u-v240b tldcd[7746]: [ID 625537 daemon.notice] TLD(0) initiating MOVE_MEDIUM from addr 201 to addr 129
Dec  2 15:01:30 v4u-v240b tldcd[7746]: [ID 559680 daemon.notice] TLD(0) closing/unlocking robotic path
Dec  2 15:01:30 v4u-v240b tldcd[501]: [ID 237046 daemon.notice] inquiry() function processing library ATL      1500             12.012.0
:
Dec  2 15:01:34 v4u-v240b tldd[7745]: [ID 817166 daemon.notice] Calling the drive_mount_notify script, DriveName=HPUltrium2-SCSI0
Dec  2 15:01:34 v4u-v240b tldd[7745]: [ID 428369 daemon.notice] Mount callout returned normal status, DriveName = HPUltrium2-SCSI0, Status = 0
Dec  2 15:01:40 v4u-v240b tldd[490]: [ID 179157 daemon.notice] DecodeMount(): TLD(0) drive 2, Actual status: STATUS_SUCCESS
Dec  2 15:01:40 v4u-v240b ltid[423]: [ID 412853 daemon.notice] LTID - received ROBOT MESSAGE, Type=54, LongParam=0, Param1=0, Param2=0
Dec  2 18:02:51 v4u-v240b bptm[7722]: [ID 832037 daemon.error] scsi command failed, may be timeout, scsi_pkt.us_reason = 6
Dec  2 18:02:52 v4u-v240b scsi: [ID 107833 kern.warning] WARNING: /pci@1d,700000/pci@1/scsi@5/st@2,0 (st23):
Dec  2 18:02:52 v4u-v240b 	Error for Command: rezero/rewind           Error Level: Fatal
Dec  2 18:02:52 v4u-v240b scsi: [ID 107833 kern.notice] 	Requested Block: 0                         Error Block: 0
Dec  2 18:02:52 v4u-v240b scsi: [ID 107833 kern.notice] 	Vendor: HP                                 Serial Number:
Dec  2 18:02:52 v4u-v240b scsi: [ID 107833 kern.notice] 	Sense Key: Unit Attention
Dec  2 18:02:52 v4u-v240b scsi: [ID 107833 kern.notice] 	ASC: 0x29 (), ASCQ: 0x3, FRU: 0x0
Dec  2 18:12:55 v4u-v240b scsi: [ID 107833 kern.warning] WARNING: /pci@1d,700000/pci@1/scsi@5/st@2,0 (st23):
Dec  2 18:12:55 v4u-v240b 	SCSI transport failed: reason 'timeout': retrying command
Dec  2 18:12:56 v4u-v240b scsi: [ID 107833 kern.warning] WARNING: /pci@1d,700000/pci@1/scsi@5/st@2,0 (st23):
Dec  2 18:12:56 v4u-v240b 	Error for Command: rezero/rewind           Error Level: Fatal
Dec  2 18:12:56 v4u-v240b scsi: [ID 107833 kern.notice] 	Requested Block: 0                         Error Block: 0
Dec  2 18:12:56 v4u-v240b scsi: [ID 107833 kern.notice] 	Vendor: HP                                 Serial Number:
Dec  2 18:12:56 v4u-v240b scsi: [ID 107833 kern.notice] 	Sense Key: Unit Attention
Dec  2 18:12:56 v4u-v240b scsi: [ID 107833 kern.notice] 	ASC: 0x29 (), ASCQ: 0x3, FRU: 0x0

bptm's debug log(a.k.a. activity log) showed as follows.
("7722" is the process ID of bptm when the problem occurred.)

% grep ' .7722' logs/bptm/log.120205
15:00:28.859 [7722]<2> bptm: INITIATING (VERBOSE = 1): -w -c v4u-v240b -den 14 -rt 8 -rn 0 -stunit v4u-v240b-hcart2-robot-tld-0 -cl testclass
-bt 1133503225 -b v4u-v240b_1133503225 -st 0 -cj 1 -p NetBackup -hostname v4u-v240b -ru root -rclnt v4u-v240b -rclnthostname v4u-v240b -rl 1 -rp
1209600 -sl Full -ct 0 -maxfrag 1048576 -v -mediasvr v4u-v240b -jobid 3013 -jobgrpid 3013 -masterversion 500000 -shm
15:00:28.861 [7722]<2> nb_getsockconnected: host=v4u-v240b service=bpjobd address=10.14.1.23 protocol=tcp non-reserved port=13723
15:00:28.862 [7722]<2> logconnections: BPJOBD CONNECT FROM 10.14.1.23.37103 TO 10.14.1.23.13723
15:00:28.868 [7722]<2> job_authenticate_connection: ignoring VxSS authentication check for now...
15:00:28.869 [7722]<2> job_connect: Connected to the host v4u-v240b contype 10 jobid <3013> socket <7>
15:00:28.869 [7722]<2> job_connect: Connected on port 37103
15:00:28.869 [7722]<2> set_job_details: Done
15:00:28.869 [7722]<2> get_stunit: STUNIT v4u-v240b-hcart2-robot-tld-0 2 v4u-v240b 8 0 14 1 0 *NULL* 0 1 1048576 *NULL*
15:00:28.869 [7722]<2> get_stunit: Buffer1 "STUNIT v4u-v240b-hcart2-robot-tld-0 2 v4u-v240b 8 0 14 1 0 *NULL*", Buffer2 "*NULL*", Buffer3 "0 1
1048576 *NULL*"
15:00:28.872 [7722]<2> LOCAL CLASS_ATT_DEFS: Product ID = 6
15:00:28.875 [7722]<2> get_stunit_rec_list: db maximum fragment size is 1048576, using maximum fragment size 1048576
15:00:28.875 [7722]<2> get_stunit: MD5 digest line
15:00:28.875 [7722]<2> get_stunit: myfgetline() NULL
15:00:28.876 [7722]<2> io_init: using 64512 data buffer size
15:00:28.876 [7722]<2> io_init: CINDEX 0, sched Kbytes for monitoring = 60000
15:00:28.876 [7722]<2> io_set_recvbuf: setting receive network buffer to 32032 bytes
15:00:28.876 [7722]<2> io_init: using 8 data buffers
15:00:28.876 [7722]<2> io_init: child delay = 20, parent delay = 30 (milliseconds)
15:00:28.876 [7722]<2> io_init: shm_size = 516292, buffer address = 0xfe880000, buf control = 0xfe8fe000, ready ptr = 0xfe8fe0c0
15:00:28.881 [7722]<2> setup_bpbkar_info: /usr/openv/netbackup/db/config/shm/v4u-v240b_1133503225 file successfully created
15:00:28.882 [7722]<2> get_behavior_rec: invalid entry: POSTPROCESS_IMAGES
15:00:28.882 [7722]<2> get_behavior_rec: MD5 digest line
15:00:28.882 [7722]<2> nb_getsockconnected: host=v4u-v240b service=bpdbm address=10.14.1.23 protocol=tcp non-reserved port=13721
15:00:28.883 [7722]<2> logconnections: BPDBM CONNECT FROM 10.14.1.23.37106 TO 10.14.1.23.13721
15:00:28.948 [7722]<2> add_to_vmhost_list: added v4u-v240b to vmhost list
15:00:28.949 [7722]<2> nb_getsockconnected: host=v4u-v240b service=vmd address=10.14.1.23 protocol=tcp non-reserved port=13701
15:00:28.997 [7722]<2> select_media: added 2 media id's to list that matched robot number/type and media type
15:00:29.011 [7722]<2> select_media: selected media id U597L2 for backup[0], v4u-v240b(rl = 1) <----------
15:00:29.011 [7722]<2> mount_open_media: Waiting for mount of media id U597L2 (copy 1) on server v4u-v240b.
15:01:40.864 [7722]<2> io_open: SCSI RESERVE
15:01:40.894 [7722]<2> io_open: file /usr/openv/netbackup/db/media/tpreq/U597L2 successfully opened
15:01:40.894 [7722]<2> write_backup: media id U597L2 mounted on drive index 0, drivepath /dev/rmt/0cbn, drivename HPUltrium2-SCSI0, copy 1
15:01:40.895 [7722]<2> io_read_media_header: drive index 0, reading media header, buflen = 65536, buff = 0x164bf0, copy 1
15:01:40.895 [7722]<2> io_ioctl: command (5)MTREW 1 from (bptm.c.7269) on drive index 0
15:01:40.961 [7722]<2> io_ioctl: command (1)MTFSF 1 from (bptm.c.7537) on drive index 0
15:01:41.043 [7722]<2> io_position_for_write: position media id U597L2, copy 1, current number images = 6
15:01:41.043 [7722]<2> io_position_for_write: locating to absolute block number 20, copy 1
18:02:51.990 [7722]<2> set_job_details: Done
18:02:52.008 [7722]<8> io_position_for_write: cannot locate on drive index 0, locate scsi command failed (possibly a command timeout), errno =
9, Bad file number
18:02:52.008 [7722]<2> io_ioctl: command (5)MTREW 1 from (bptm.c.6516) on drive index 0
18:02:52.012 [7722]<2> set_job_details: Done
18:02:52.038 [7722]<16> io_ioctl: ioctl (MTREW) failed on media id U597L2, drive index 0, I/O error (bptm.c.6516)
18:02:52.057 [7722]<2> log_media_error: successfully wrote to error file - 12/02/05 18:02:52 U597L2 0 POSITION_ERROR
18:02:52.070 [7722]<2> set_job_details: Done
18:02:52.098 [7722]<8> io_read_block: read error on media id U597L2, drive index 0 reading header block, I/O error
18:02:52.098 [7722]<2> io_position_for_write: error, rewind and retry
18:02:52.098 [7722]<2> io_ioctl: command (5)MTREW 1 from (bptm.c.6784) on drive index 0
18:12:56.963 [7722]<2> set_job_details: Done
18:12:56.996 [7722]<16> io_ioctl: ioctl (MTREW) failed on media id U597L2, drive index 0, I/O error (bptm.c.6784)
18:12:57.021 [7722]<2> log_media_error: successfully wrote to error file - 12/02/05 18:12:56 U597L2 0 POSITION_ERROR
18:12:57.021 [7722]<2> check_error_history: called from bptm line 19757, EXIT_Status = 86
18:12:57.022 [7722]<2> check_error_history: drive index = 0, media id = U597L2, time = 12/02/05 18:12:57, both_match = 1, media_match = 0,
drive_match = 0
18:12:57.022 [7722]<2> io_close: closing /usr/openv/netbackup/db/media/tpreq/U597L2, from bptm.c.15481
18:12:57.022 [7722]<2> tpunmount: tpunmount'ing /usr/openv/netbackup/db/media/tpreq/U597L2
18:12:57.024 [7722]<2> TpUnmountWrapper: SCSI RELEASE
19:14:56.835 [7722]<2> TpUnmountWrapper: retrying open, errno = I/O error
19:18:54.069 [7722]<2> bptm: EXITING with status 86 <----------

 At 15:01:41.043, io_position_for_write was called, and it hung for three hours.

Changes

{CHANGE}

Cause

{CAUSE}

Solution

Resolution

The problem has been identified in HP Ultrium 2's drive firmware for Sun StorEdge L25/L100
and fixed firmware (version F6AZ) has been released as patch 116170-03. To avoid this
issue, HP Ultrium 2's firmware needs to be updated to version F6AZ of patch 116170-03.

Once this issue occurs, the following operations need to be done.

1. Stop NetBackup daemons.

/usr/openv/netbackup/bin/goodies/netbackup stop

/usr/openv/netbackup/bin/goodies/bp.kill_all

2. Power off and wait for two or three minutes then power on the tape library since we cannot
unload the tape from the tape drive once it hangs.

3. Once the tape library becomes ready state, then unload the tape in the tape drive,
then put back to the original storage bin slot of the tape library
using the front panel operations of the tape library.

4. Restart NetBackup daemons.

/usr/openv/netbackup/bin/goodies/netbackup start

End of this procedure.

Note: This issue does not depend on particular version of NetBackup since it is
HP Ultrium 2 drive firmware's issue.

Once this issue occurs, sg driver takes three hours to detect SCSI command timeout.
This timeout value is the one which is originally set by bptm command.
The timeout value is hard-coded on bptm as current NetBackup's spec. Before issuing SCSI locate command via ioctl, bptm sets SCSI timeout value for the sg driver instance of the tape drive.

Note that this issue is not related to st driver at all since bptm opens sg tape device to issue SCSI locate command. That's why the following error messages were logged by sg driver when timeout was detected by sg driver.

Dec 2 18:02:51 v4u-v240b bptm[7722]: [ID 832037 daemon.error] scsi command failed, may be timeout, scsi_pkt.us_reason = 6

See also:
CR 6359993
HP Ultrium2, LTO2, LTO-2, locate, SCSI, NBU, NetBackup, bptm, hang, scsi_pkt.us_reason, 6359993
Previously Published As 86426

Comment: Here is the answer for Nakamura -san's query.

> - How to identify PID of bptm to look into it Some customer does not take activity log of bptm.

Necessarily we don't have to find the affected process ID of bptm first.
In this case, we see the following bptm's message on /var/adm/messages.

Dec 2 18:02:51 v4u-v240b bptm[7722]: [ID 832037 daemon.error] scsi command
failed, may be timeout, scsi_pkt.us_reason = 6

Then we can check bptm's debug log with the same or nearest timestamp,
then we see can find the following messages on it.

18:02:52.008 [7722]<8> io_position_for_write: cannot locate on drive index 0,
locate scsi command failed (possibly a command timeout), errno =
9, Bad file number

So we don't have to know a special method to find the PID of bptm in this case.
I think that it is the basic method to check a log file of nearest timestamp.

> - The author did not mention about "io_position_for_read".
> The issue could be detected during io_position_for_read because the operation uses
> SCSI sequential command: locate(op: 2Bh)

I don't think so. Just in case, I will send an email to explain
its stack trace. As you see bptm's debug log, we don't see io_position_for_read.
So I don't think we have to mention about io_position_for_read.

Attachments

This solution has no attachment