Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1011106.1
Update Date:2010-01-28
Keywords:

Solution Type  Problem Resolution Sure

Solution  1011106.1 :   Fabric devices and QuickLoop devices exported to Solaris [TM] via the same Fiber Channel connection.  


Related Items
  • Sun Fire E25K Server
  • Sun Fire E20K Server
  • Sun Fire 12K Server
  • Sun Fire 15K Server
Related Categories
  • GCS>Sun Microsystems>Servers>High-End Servers

PreviouslyPublishedAs
215279


Symptoms
When Fabric devices and QuickLoop devices are exported to Solaris via the same Fiber Channel connection, the link repeatedly goes offline and online under heavy load, and I/O performance is poor.
Example:
A high-end server at a customer site logged the following errors:
svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Loop OFFLINE
svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Link ONLINE
svr03 fctl: [ID 517869 kern.warning] WARNING: 2589=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b19c7
svr03 fctl: [ID 517869 kern.warning] WARNING: 2591=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b18c6
svr03 fctl: [ID 517869 kern.warning] WARNING: 2593=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID fffc0f
svr03 fctl: [ID 517869 kern.warning] WARNING: 2595=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID d0000
svr03 fctl: [ID 517869 kern.warning] WARNING: 2597=>fp(1)::fp_unsol_cb() bailing out LOGO for D_ID b0000
svr03 qlc: [ID 787125 kern.warning] WARNING: qlc(1) no lid for adisc b19c7
svr03 fp: [ID 517869 kern.info] NOTICE: fp(1): ADISC to b19c7 failed, cmd_flags=1 state=Packet Transport error, reason=No Connection
svr03 qlc: [ID 787125 kern.warning] WARNING: qlc(1) no lid for adisc b18c6
svr03 fctl: [ID 517869 kern.warning] WARNING: 2609=>fp(1)::fp_adisc_intr: Dev change notification to ULP port=300204db000, pd=300f2b5b998, map_flags=0 map_state=1
svr03 fp: [ID 517869 kern.info] NOTICE: fp(1): ADISC to b18c6 failed, cmd_flags=1 state=Packet Transport error, reason=No Connection
svr03 fctl: [ID 517869 kern.warning] WARNING: 2612=>fp(1)::fp_adisc_intr: Dev change notification to ULP port=300204db000, pd=300bd6c6140, map_flags=0 map_state=1
svr03 fcip: [ID 356328 kern.warning] WARNING: fc_ulp_login failed for d_id: 0xb19c7, rval: 0x41
svr03 fcip: [ID 356328 kern.warning] WARNING: fc_ulp_login failed for d_id: 0xb18c6, rval: 0x41
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,49 (ssd60):
svr03 	Error for Command: read(10)     Error Level: Retryable
svr03 scsi: [ID 107833 kern.notice] 	Requested Block: 95798528                  Error Block: 95798528
svr03 scsi: [ID 107833 kern.notice] 	Vendor: DGC Serial Number: 4900004D24CL
svr03 scsi: [ID 107833 kern.notice] 	Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] 	ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
... (the same error, with Error Level Retryable, was repeated for several DGC LUNs; omitted here)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):
svr03 	Error for Command: write        Error Level: Fatal
svr03 scsi: [ID 107833 kern.notice] 	Requested Block: 2303                      Error Block: 2303
svr03 scsi: [ID 107833 kern.notice] 	Vendor: IBM   Serial Number:
svr03 scsi: [ID 107833 kern.notice] 	Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] 	ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30):
svr03 	Error for Command: load/start/stop         Error Level: Fatal
svr03 scsi: [ID 107833 kern.notice] 	Requested Block: 0                         Error Block: 0
svr03 scsi: [ID 107833 kern.notice] Vendor: IBM Serial Number:
svr03 scsi: [ID 107833 kern.notice] 	Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] 	ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
svr03 tldd[19472]: [ID 861947 daemon.error] TLD(0) unload failed in io_open, I/O error[5]
svr03 tldd[10635]: [ID 821050 daemon.error] TLD(0) drive 6 (device 0) is being DOWNED, status: Unable to SCSI unload drive
svr03 tldd[10635]: [ID 229259 daemon.error] Check integrity of the drive, drive path, and media
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,1f (ssd73):
svr03 	Error for Command: write(10)               Error Level: Retryable
svr03 scsi: [ID 107833 kern.notice] 	Requested Block: 20434                     Error Block: 20434
svr03 scsi: [ID 107833 kern.notice] 	Vendor: DGC  Serial Number: 1F000042A0CL
svr03 scsi: [ID 107833 kern.notice] 	Sense Key: Unit Attention
svr03 scsi: [ID 107833 kern.notice] 	ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0xca
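
When triaging a log like the one above, it can help to count link transitions and to see which device drivers report errors behind the same fp port. The following is a minimal sketch, not a supported tool: the regular expressions are assumptions fitted to the message layout shown in this excerpt, and SAMPLE holds a few lines copied from it (in practice the input would be the system messages file).

```python
import re
from collections import Counter

# A few lines copied from the excerpt above.
SAMPLE = """\
svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Loop OFFLINE
svr03 qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(1): Link ONLINE
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,49 (ssd60):
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30):
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/ssd@w5006016830601681,1f (ssd73):
"""

# qlc instance number and the link transition it reported.
LINK_RE = re.compile(r"qlc\((\d+)\): (Loop OFFLINE|Link ONLINE)")
# Parent fp port path, then the driver name and instance, e.g. "ssd" + "60".
SCSI_RE = re.compile(r"WARNING: (/\S+/fp@[0-9a-f,]+)/\S+ \(([a-z]+)(\d+)\):")


def summarize(text):
    """Count link events per qlc instance and collect drivers seen per fp port."""
    link_events = Counter()
    drivers_by_port = {}
    for line in text.splitlines():
        m = LINK_RE.search(line)
        if m:
            link_events[(int(m.group(1)), m.group(2))] += 1
            continue
        m = SCSI_RE.search(line)
        if m:
            drivers_by_port.setdefault(m.group(1), set()).add(m.group(2))
    return link_events, drivers_by_port


link_events, drivers_by_port = summarize(SAMPLE)
for port, drivers in drivers_by_port.items():
    if {"ssd", "st"} <= drivers:
        print(f"{port}: disk (ssd) and tape (st) errors share this port")
```

A port that reports both ssd and st errors is a hint, not proof, that disk and tape traffic share one connection; the zone configuration still has to be checked on the switch.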


Resolution
Similar issues occurred several times on the same server (svr03). The customer's DBA reported that a backup UFS file system built on EMC[TM] CLARiiON Cx700 LUNs showed very poor I/O performance (less than 50 KB/s for reads or writes), and scheduled backup jobs failed.

A careful review of the I/O subsystem configuration showed that the affected EMC[TM] CLARiiON Cx700 LUNs (which the OS reports as "Vendor: DGC") and the SAN-attached tape drives (which the OS reports as "Vendor: IBM") were all presented to Solaris via the same Fiber Channel port, /pci@fd,600000/SUNW,qlc@1,1/fp@0,0. This is the lower Fiber Channel port of the server's 1st HBA (part number X6768 or 375-3108, a 2 Gb dual-port HBA). The following diagram shows the original backup SAN architecture:
+--------+  +--------+  +--------+  +--------+
| 1st HBA|  | 2nd HBA|  | 3rd HBA|  | 4th HBA|
|        |  |        |  |        |  |        |
|  FC(U) |  |  FC(U) |  |  FC(U) |  |  FC(U) |
|        |  |        |  |        |  |        |
|        |  |        |  |        |  |        |
|  FC(L) |  |  FC(L) |  |  FC(L) |  |  FC(L) |
|  |     |  |        |  |        |  |        |
+--|-----+  +--------+  +--------+  +--------+
   |
   |
   +--> To SAN switch ports for both Cx700 array & tape drives.
        (The port was configured in both zones svr03-bk-za & svr03-bk-zb,
        overlapped)
Remark: The upper FC ports [FC(U)] of the four HBAs above are used for a separate high-end storage connection.
Zone configuration (svr03-bk-za and svr03-bk-zb):
Zone Defines	Port				Port Type
--------------------------------------------------------------------
Zone svr03-bk-za	1st HBA lower port		F-Port
			Cx700 Controller SPA		F-Port
			Cx700 Controller SPB		F-Port
Zone svr03-bk-zb	1st HBA lower port(overlap) 	F-Port
			Tape Driver(st30)		L-Port, 1 Public
			Tape Driver(st31)		L-Port, 1 Public
---------------------------------------------------------------------
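
The overlap in the table above can be found mechanically. The following is a minimal sketch, assuming the zone membership has been transcribed by hand into a dictionary; ZONES and overlapping_members are illustrative names and not part of any switch or Solaris tooling.

```python
# Zone membership transcribed from the table above.
ZONES = {
    "svr03-bk-za": [
        "1st HBA lower port",
        "Cx700 Controller SPA",
        "Cx700 Controller SPB",
    ],
    "svr03-bk-zb": [
        "1st HBA lower port",
        "Tape Driver(st30)",
        "Tape Driver(st31)",
    ],
}


def overlapping_members(zones):
    """Return {member: sorted zone names} for members that sit in more than one zone."""
    seen = {}
    for zone, members in zones.items():
        for member in members:
            seen.setdefault(member, []).append(zone)
    return {m: sorted(z) for m, z in seen.items() if len(z) > 1}


overlaps = overlapping_members(ZONES)
for member, zone_names in overlaps.items():
    print(f"{member} is a member of: {', '.join(zone_names)}")
```

Here the only member reported is the 1st HBA lower port, which is the single host connection carrying both array and tape traffic.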
We also noticed that the failure occurred only under heavy I/O load; light I/O workloads ran fine.
Although Fabric devices and QuickLoop devices can work together on one connection, no storage or switch vendor recommends this configuration, because each chunk of backup data has to be read from the array over this Fiber Channel connection and then written to the tape drives over the same connection. This can cause poor I/O performance and, as a result, application failure.
When the switch ports for the tape drives in zone "svr03-bk-zb" were deliberately failed, both tape drives, st30 and st31, went offline. The following messages were logged:
svr03 fctl: [ID 517869 kern.warning] WARNING: 2793=>fp(1)::GPN_ID for D_ID=b18c6 failed
svr03 fctl: [ID 517869 kern.warning] WARNING: 2794=>fp(1)::N_x Port with D_ID=b18c6,
PWWN=50050763004a3e06 disappeared from fabric
svr03 fctl: [ID 517869 kern.warning] WARNING: 2804=>fp(1)::GPN_ID for D_ID=b19c7 failed
svr03 fctl: [ID 517869 kern.warning] WARNING: 2805=>fp(1)::N_x Port with D_ID=b19c7,
PWWN=50050763004a3e05 disappeared from fabric
svr03 scsi: [ID 243001 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
svr03 	offlining lun=0 (trace=0), target=b18c6 (trace=2800004)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30):
svr03 	transport rejected
svr03 genunix: [ID 408114 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30) offline
svr03 scsi: [ID 243001 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0 (fcp1):
svr03 	offlining lun=0 (trace=0), target=b19c7 (trace=2800004)
svr03 scsi: [ID 107833 kern.warning] WARNING: /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31):
svr03 	transport rejected
svr03 genunix: [ID 408114 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31) offline
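
The messages above tie each lost fabric address to a tape device through the port WWN embedded in the device path. The following is a minimal sketch of that correlation; the regular expressions are assumptions fitted to the message layout shown here, and SAMPLE holds lines copied from this excerpt.

```python
import re

SAMPLE = """\
svr03 fctl: [ID 517869 kern.warning] WARNING: 2794=>fp(1)::N_x Port with D_ID=b18c6,
PWWN=50050763004a3e06 disappeared from fabric
svr03 fctl: [ID 517869 kern.warning] WARNING: 2805=>fp(1)::N_x Port with D_ID=b19c7,
PWWN=50050763004a3e05 disappeared from fabric
svr03 genunix: [ID 408114 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e06,0 (st30) offline
svr03 genunix: [ID 408114 kern.info] /pci@fd,600000/SUNW,qlc@1,1/fp@0,0/st@w50050763004a3e05,0 (st31) offline
"""

# Allow optional whitespace (including a line break) between D_ID and PWWN.
GONE_RE = re.compile(r"D_ID=([0-9a-f]+),\s*PWWN=([0-9a-f]{16}) disappeared from fabric")
# The device path carries the target port WWN after "@w".
OFFLINE_RE = re.compile(r"@w([0-9a-f]{16}),\d+ \((\w+)\) offline")


def correlate(text):
    """Map each offlined device to (PWWN, D_ID); D_ID is None if no match was logged."""
    did_by_pwwn = {pwwn: did for did, pwwn in GONE_RE.findall(text)}
    return {
        dev: (pwwn, did_by_pwwn.get(pwwn))
        for pwwn, dev in OFFLINE_RE.findall(text)
    }


offlined = correlate(SAMPLE)
for dev, (pwwn, did) in sorted(offlined.items()):
    print(f"{dev}: PWWN={pwwn} D_ID={did}")
```

The result pairs st30 with D_ID b18c6 and st31 with D_ID b19c7, the same two addresses that appear in the LOGO and ADISC failures in the Symptoms section.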
With the tape drives offline, I/O performance became satisfactory: a single read or write thread could generate throughput of up to 40-50 MB/s. Disabling the tape drives can therefore be used as a temporary workaround in a similar configuration.
This indicates that the bottleneck was in the configuration.
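
To put the two figures side by side, the improvement can be worked out as a rough ratio. This is a back-of-the-envelope sketch only: decimal units (1 MB = 1000 KB) are an assumption, since the article does not say which convention applies, and both figures are bounds ("less than" and "up to") rather than measurements.

```python
KB = 1000        # bytes; decimal units are an assumption
MB = 1000 * KB

before = 50 * KB                          # "less than 50 KB/s" with tape drives online
after_low, after_high = 40 * MB, 50 * MB  # "up to 40-50 MB/s" with tape drives offline

ratio_low = after_low / before
ratio_high = after_high / before
print(f"throughput improved by roughly {ratio_low:.0f}x to {ratio_high:.0f}x")
```

A difference of about three orders of magnitude is far larger than anything bandwidth sharing alone would explain on a 2 Gb link, which is consistent with the repeated link resets seen in the logs.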
The proposed solution was to rebuild the backup SAN architecture so that Fabric
devices and QuickLoop devices are placed in two separate zones that also use
different HBAs. The two new planned zone definitions are:
Zone Defines	Port				Port Type
--------------------------------------------------------------------
Zone svr03-bk-za	1st HBA lower port		F-Port
			2nd HBA lower port(for DMP)	F-Port
			Cx700 Controller SPA		F-Port
			Cx700 Controller SPB		F-Port	
Zone svr03-bk-zb	3rd HBA lower port	 	F-Port
			Tape Driver(st30)		L-Port, 1 Public
			Tape Driver(st31)		L-Port, 1 Public
---------------------------------------------------------------------
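
The planned layout can be checked for the property the article cares about: no HBA port should be zoned with both F-Port and L-Port devices. The following is a minimal sketch under stated assumptions; the "initiator"/"target" role column is added for illustration and does not appear in the table, and PLANNED and target_port_types are hypothetical names.

```python
# Planned zones transcribed from the table above as (member, port type, role).
# The role column is an assumption added for this sketch.
PLANNED = {
    "svr03-bk-za": [
        ("1st HBA lower port", "F-Port", "initiator"),
        ("2nd HBA lower port", "F-Port", "initiator"),
        ("Cx700 Controller SPA", "F-Port", "target"),
        ("Cx700 Controller SPB", "F-Port", "target"),
    ],
    "svr03-bk-zb": [
        ("3rd HBA lower port", "F-Port", "initiator"),
        ("Tape Driver(st30)", "L-Port", "target"),
        ("Tape Driver(st31)", "L-Port", "target"),
    ],
}


def target_port_types(zones):
    """Map each initiator to the set of port types of every target zoned with it."""
    seen = {}
    for members in zones.values():
        targets = {ptype for _, ptype, role in members if role == "target"}
        for name, _, role in members:
            if role == "initiator":
                seen.setdefault(name, set()).update(targets)
    return seen


types_by_hba = target_port_types(PLANNED)
mixed = {hba: t for hba, t in types_by_hba.items() if len(t) > 1}
print(mixed or "no HBA port is zoned with both F-Port and L-Port targets")
```

Because the function merges target types across all zones an initiator belongs to, it would also flag an overlapped port like the one in the original configuration.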
The following diagram shows the final backup SAN architecture:
+--------+  +--------+  +--------+  +--------+
| 1st HBA|  | 2nd HBA|  | 3rd HBA|  | 4th HBA|
|        |  |        |  |        |  |        |
|  FC(U) |  |  FC(U) |  |  FC(U) |  |  FC(U) |
|        |  |        |  |        |  |        |
|        |  |        |  |        |  |        |
|  FC(L) |  |  FC(L) |  |  FC(L) |  |  FC(L) |
|  |     |  |   |    |  |    |   |  |        |
+--|-----+  +---|----+  +----|---+  +--------+
   |            |            |
   |            |            |
   |            |            +--> To SAN switch for tape drive connection
   |            |            (This port was configured in zone svr03-bk-zb)
   |            |
   |            +---> To SAN switch for Cx700 array connections (DMP path A)
   |            (This port was configured in zone svr03-bk-za)
   |
   +--> To SAN switch port for Cx700 array connections (DMP path B)
   (This port was configured in zone svr03-bk-za)
As a best practice for device connections through a SAN switch, avoid placing
Fabric and QuickLoop devices on the same Fiber Channel connection, especially
when both are used by the same application.


Relief/Workaround
These two types of devices need to be in different zones. Until the zoning is
changed, temporarily disabling one of the two device types avoids the poor
performance issue.

Product
Sun Fire E25K Server
Sun Fire E20K Server
Sun Fire 15K Server
Sun Fire 12K Server

Keywords: SAN switch L-Port, F-Port, X6768, 375-3108, Qlogic qlc, Loop OFFLINE, Link ONLINE, tape, QuickLoop
Previously Published As
83332

Change History
Date: 2005-12-06
User Name: 31620
Action: Approved
Comment: Verified Metadata - ok
Verified Keywords - ok
Verified still correct for audience - was free, has to be contract as per
FvF http://kmo.central/howto/FvF.html
Checked review date - currently set to 2006-12-05
Checked for TM - added one for Solaris
Publishing under the current publication rules of 18 Apr 2005:
Version: 4
Product_uuid
d842dd03-059b-11d8-84cb-080020a9ed93|Sun Fire E25K Server
1404a2d3-059a-11d8-84cb-080020a9ed93|Sun Fire E20K Server
29e4659c-0a18-11d6-9fa1-e67bbc033df8|Sun Fire 15K Server
077fd4c5-df8f-4320-ad69-7d01603a674d|Sun Fire 12K Server

Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.