Sun Microsystems, Inc.  Sun System Handbook - ISO 3.4 June 2011 Internal/Partner Edition

Asset ID: 1-72-1003864.1
Update Date:2008-06-04
Keywords:

Solution Type  Problem Resolution Sure

Solution  1003864.1 :   Invalid ID 125 on Sun StorEdge[TM] 3510 causing multiple drive/loop problems.  


Related Items
  • Sun Storage 3510 FC Array
Related Categories
  • GCS>Sun Microsystems>Storage - Disk>Modular Disk - 3xxx Arrays

Previously Published As
205426


Symptoms
The symptoms for this problem were as follows:-

1. Multiple disk drives failed (LEDs turned amber)
2. As a result of (1), logical drives failed.
3. The host lost access to the Sun StorEdge[TM] 3510 LUNs.

During troubleshooting, an invalid ID 125 was found in the loop maps of both channel 2 and channel 3. This invalid ID was causing the symptoms listed above.

This document describes how to troubleshoot the problem and remove the invalid ID, which resolves it.



Resolution

This problem occurred on the following configuration, but it can occur on any SE3510, with or without a JBOD connected, and the resolution is the same.

Configuration:-
-------------

Dual-controller RAID array with 2 JBODs, running firmware 3.27R. The RAID head has chassis ID 0 and the JBODs have chassis IDs 1 and 2.

The IDs on the SE3510 are assigned based on the chassis ID. The valid ID ranges for any SE3510 configuration are as follows:-

Chassis ID Switch Setting    ID Range
-------------------------    --------
0                            0-15
1                            16-31
2                            32-47
3                            48-63
4                            64-79
5                            80-95
6                            96-111
7                            112-125

For more details, please check <Document: 1007692.1> Sun StorEdge[TM] 3510 FC Array switch settings and disk IDs.

For example, with chassis ID = 7, the table above gives:

Target disk drive loop IDs are 112-123:
112 115 118 121
113 116 119 122
114 117 120 123

The SES loop ID is 124.
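
The ID arithmetic for a chassis ID switch setting can be sketched in Python as follows. This is an illustration only. The function name chassis_id_layout is hypothetical, and the layout it encodes (12 drive IDs starting at 16 x chassis ID, followed directly by the SES ID) is an assumption inferred from the chassis ID 7 example and the loop maps in this document, not taken from the array documentation.

```python
def chassis_id_layout(chassis_id):
    """Return the loop IDs implied by a chassis ID switch setting (0-7).

    Assumption (inferred from this document): each chassis owns a block of
    16 IDs, the 12 drive slots take the first 12 and SES takes the next one.
    """
    if not 0 <= chassis_id <= 7:
        raise ValueError("chassis ID switch setting must be 0-7")
    base = 16 * chassis_id
    # The table above caps switch setting 7 at 125 rather than 127.
    top = min(base + 15, 125)
    return {
        "range": (base, top),
        "disks": list(range(base, base + 12)),
        "ses": base + 12,
    }


layout = chassis_id_layout(7)
print(layout["range"], layout["disks"][0], layout["disks"][-1], layout["ses"])
# prints: (112, 125) 112 123 124
```

For chassis IDs 0, 1 and 2, the same arithmetic gives SES IDs 12, 28 and 44, which matches the SES rows in the loop maps shown in this document.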

When the problem occurs, the show loop-map command returns output like the following.

sccli> show loop-map channel 2

40 devices found in loop map

Channel Loop Map retrieved from CH 2 ID 12

AL_PA SEL_ID SEL_ID TYPE ENCL_ID SLOT
(hex) (hex) (dec)
----- ----- ----- ---- ------ ----
D1 0E 14 RAID N/A N/A
B1 21 33 DISK 2 1
9F 2C 44 SES 2 N/A
AE 22 34 DISK 2 2
AD 23 35 DISK 2 3
AB 25 37 DISK 2 5
B2 20 32 DISK 2 0
A7 28 40 DISK 2 8
A6 29 41 DISK 2 9
A3 2B 43 DISK 2 11
AA 26 38 DISK 2 6
A5 2A 42 DISK 2 10
A9 27 39 DISK 2 7
01 7D 125 RAID N/A N/A <<----PROBLEM
E8 01 1 DISK 0 1
E1 04 4 DISK 0 4
E4 02 2 DISK 0 2
E2 03 3 DISK 0 3
E0 05 5 DISK 0 5
EF 00 0 DISK 0 0
D9 08 8 DISK 0 8
D6 09 9 DISK 0 9
B9 1B 27 DISK 1 11
C6 16 22 DISK 1 6
BA 1A 26 DISK 1 10
C5 17 23 DISK 1 7
B6 1C 28 SES 1 N/A
CC 11 17 DISK 1 1
C9 14 20 DISK 1 4
CB 12 18 DISK 1 2
CA 13 19 DISK 1 3
C7 15 21 DISK 1 5
CD 10 16 DISK 1 0
C3 18 24 DISK 1 8
BC 19 25 DISK 1 9
D4 0B 11 DISK 0 11
DC 06 6 DISK 0 6
D5 0A 10 DISK 0 10
DA 07 7 DISK 0 7
D3 0C 12 SES 0 N/A
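
Checking a saved loop map for IDs that do not belong to any connected chassis can be sketched in Python as follows. This is a hedged illustration, not a supported tool. The function names parse_loop_map and unexpected_ids are hypothetical, the parser assumes whitespace-separated rows in the column order shown above, and the rule "SEL_ID // 16 must be a configured chassis ID" is an assumption derived from the ID range table in this document.

```python
DEVICE_TYPES = ("RAID", "DISK", "SES")


def parse_loop_map(text):
    """Extract device rows from saved 'show loop-map' output.

    Assumes columns: AL_PA, SEL_ID (hex), SEL_ID (dec), TYPE, ENCL_ID, SLOT.
    Header, separator and summary lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 6 or parts[3] not in DEVICE_TYPES:
            continue
        rows.append({
            "al_pa": parts[0],
            "sel_id": int(parts[2]),
            "type": parts[3],
            "encl": parts[4],
            "slot": parts[5],
        })
    return rows


def unexpected_ids(rows, chassis_ids):
    """Return rows whose ID falls outside every configured chassis range."""
    return [r for r in rows if r["sel_id"] // 16 not in chassis_ids]


# A few rows excerpted from the channel 2 listing above.
sample = """\
D1 0E 14 RAID N/A N/A
B1 21 33 DISK 2 1
9F 2C 44 SES 2 N/A
01 7D 125 RAID N/A N/A
E8 01 1 DISK 0 1
B6 1C 28 SES 1 N/A
"""

bad = unexpected_ids(parse_loop_map(sample), chassis_ids={0, 1, 2})
print([(r["al_pa"], r["sel_id"]) for r in bad])
# prints: [('01', 125)]
```

In this configuration the controller IDs 14 and 15 fall inside the chassis 0 range, so only ID 125 is reported.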

When the problem occurred, the following messages were logged in the event log of the SE3510.

Snippet from the show events output:

Mon Oct 18 15:33:39 2004
[113f] #5639: StorEdge Array SN#8022864 CH2 ID10: ALERT: redundant path failure detected (CH3 ID10)

Mon Oct 18 15:33:40 2004
[113f] #5640: StorEdge Array SN#8022864 CH3 ID10: ALERT: redundant path failure detected (CH2 ID10)

Mon Oct 18 15:33:41 2004
[113f] #5641: StorEdge Array SN#8022864 CH3 ID10: NOTICE: redundant path restored (CH2 ID10)

Mon Oct 18 15:33:41 2004
[113f] #5642: StorEdge Array SN#8022864 CH2 ID10: NOTICE: redundant path restored (CH3 ID10)

end snippet.

These messages were seen for many drives, and the paths to the drives were constantly failing and being restored because of this problem.
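
To see how often each path is flapping, the redundant path messages in a saved event log can be tallied. The Python sketch below is an illustration under assumptions: the function name count_path_events is hypothetical, and the regular expression only covers the two message forms shown in the snippet above.

```python
import re
from collections import Counter

# Matches only the two message forms shown in this document.
EVENT = re.compile(
    r"CH(\d+) ID(\d+): (?:ALERT|NOTICE): "
    r"redundant path (failure detected|restored)"
)


def count_path_events(log_text):
    """Count path failures and restores per (channel, ID, kind)."""
    counts = Counter()
    for match in EVENT.finditer(log_text):
        channel, target, what = match.groups()
        kind = "failed" if what == "failure detected" else "restored"
        counts[(int(channel), int(target), kind)] += 1
    return counts


# The four events from the snippet above.
sample_log = """\
[113f] #5639: StorEdge Array SN#8022864 CH2 ID10: ALERT: redundant path failure detected (CH3 ID10)
[113f] #5640: StorEdge Array SN#8022864 CH3 ID10: ALERT: redundant path failure detected (CH2 ID10)
[113f] #5641: StorEdge Array SN#8022864 CH3 ID10: NOTICE: redundant path restored (CH2 ID10)
[113f] #5642: StorEdge Array SN#8022864 CH2 ID10: NOTICE: redundant path restored (CH3 ID10)
"""

for key, count in sorted(count_path_events(sample_log).items()):
    print(key, count)
```

High and roughly equal failed and restored counts across many IDs are consistent with the repeated path failures and restores described above.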

Resolution :-
----------

Unfortunately, there is no one-step process when more than one chassis is present. To isolate the problem, disconnect each chassis in turn, re-run the loop map command, and check whether the invalid ID still appears in the loop map. In the case described above, the problem was resolved once the JBOD with chassis ID 2 was disconnected. Once the suspect chassis has been isolated, use the following procedure to isolate the bad IOM in that chassis.

1. Reconnect the suspect chassis to the loop, replace the top IOM, and check the
loop map again. If the problem is resolved, stop. If the problem persists, continue.
2. Reinstall the original top IOM and replace the bottom IOM. This should resolve
the problem.

Notes:

1. In the above example, the problem was on the second JBOD, but the problem
may instead lie in the chassis with the controllers (RAID head). In that
case, follow the same steps as above.

2. Isolating the problem requires downtime, as the problem cannot be solved
while the host is accessing the array.

3. Instead of physically disconnecting the JBODs, the "bypass" commands
available in the 3.27R firmware can be used, but their usage is outside the
scope of this document. Please refer to the documentation.

4. This problem occurred with 3.27R, but it is likely that it can also occur
with 4.x firmware.

5. Lastly, this problem may also occur on the SE3511, even though it has only
been seen on the SE3510.



Additional Information
With the RAID head set to chassis ID "0", one JBOD set to ID "1" and the other set to ID "2", the loop-map output as seen from channel 2 would be as follows.

This is a GOOD output taken from a system with NO problem.

sccli> show loop-map channel 2

41 devices found in loop map

Channel Loop Map retrieved from CH 2 ID 12

AL_PA SEL_ID SEL_ID TYPE ENCL_ID SLOT
(hex) (hex) (dec)
----- ----- ----- ---- ------ ----
D1 0E 14 RAID N/A N/A
B1 21 33 DISK 2 1
AC 24 36 DISK 2 4
AE 22 34 DISK 2 2
AD 23 35 DISK 2 3
AB 25 37 DISK 2 5
B2 20 32 DISK 2 0
A7 28 40 DISK 2 8
A6 29 41 DISK 2 9
A3 2B 43 DISK 2 11
AA 26 38 DISK 2 6
A5 2A 42 DISK 2 10
A9 27 39 DISK 2 7
9F 2C 44 SES 2 N/A
E8 01 1 DISK 0 1
E1 04 4 DISK 0 4
E4 02 2 DISK 0 2
E2 03 3 DISK 0 3
E0 05 5 DISK 0 5
EF 00 0 DISK 0 0
D9 08 8 DISK 0 8
D6 09 9 DISK 0 9
CE 0F 15 RAID N/A N/A
B9 1B 27 DISK 1 11
C6 16 22 DISK 1 6
BA 1A 26 DISK 1 10
C5 17 23 DISK 1 7
B6 1C 28 SES 1 N/A
CC 11 17 DISK 1 1
C9 14 20 DISK 1 4
CB 12 18 DISK 1 2
CA 13 19 DISK 1 3
C7 15 21 DISK 1 5
CD 10 16 DISK 1 0
C3 18 24 DISK 1 8
BC 19 25 DISK 1 9
D4 0B 11 DISK 0 11
DC 06 6 DISK 0 6
D5 0A 10 DISK 0 10
DA 07 7 DISK 0 7
D3 0C 12 SES 0 N/A
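
Comparing a suspect loop map against a known good one makes missing and unexpected IDs easy to spot. The Python sketch below is an illustration only. The function name diff_loop_ids is hypothetical, and the two lists are the SEL_ID (dec) values transcribed by hand from the two channel 2 listings in this document.

```python
def diff_loop_ids(good_ids, suspect_ids):
    """Return (missing, unexpected) IDs of the suspect map versus the good map."""
    good, suspect = set(good_ids), set(suspect_ids)
    return sorted(good - suspect), sorted(suspect - good)


# SEL_ID (dec) column of the GOOD channel 2 listing (41 devices).
good_ids = [
    14, 33, 36, 34, 35, 37, 32, 40, 41, 43, 38, 42, 39, 44,
    1, 4, 2, 3, 5, 0, 8, 9, 15,
    27, 22, 26, 23, 28, 17, 20, 18, 19, 21, 16, 24, 25,
    11, 6, 10, 7, 12,
]

# SEL_ID (dec) column of the problem channel 2 listing (40 devices).
suspect_ids = [
    14, 33, 44, 34, 35, 37, 32, 40, 41, 43, 38, 42, 39, 125,
    1, 4, 2, 3, 5, 0, 8, 9,
    27, 22, 26, 23, 28, 17, 20, 18, 19, 21, 16, 24, 25,
    11, 6, 10, 7, 12,
]

missing, unexpected = diff_loop_ids(good_ids, suspect_ids)
print("missing:", missing, "unexpected:", unexpected)
# prints: missing: [15, 36] unexpected: [125]
```

For these two listings, the comparison shows that the problem map lacks ID 15 (RAID) and ID 36 (DISK, enclosure 2, slot 4) and contains the extra ID 125. The sketch only reports the difference and does not say which component is at fault.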



Product
Sun StorageTek 3510 FC Array

Internal Comments
There are at least a couple of escalations for this problem. See escalations 1-4541318 and 1-9523889.



sccli, 3510, loop-map, SES, JBOD
Previously Published As
82392

Change History
Date: 2007-06-29
User Name: 7058
Action: Approved
Comment: Activated link for Normalization (ease of tracking & reference)
Notes for Normalization:
This document is referenced by:86947
Subset Root path:
82392-->86947-->89127-->89034-->89031-->89050/86520
This document references: 80185
Project: Minnow Normalization
Version: 4
Date: 2007-06-29
User Name: 7058
Action: Update Started
Comment: Activating link to 80185
Version: 0
Date: 2005-08-24
User Name: 95826
Action: Approved
Comment: - verified metadata
- changed review date to 2006-08-24
- checked for TM - 2 added
- checked audience : contract
Publishing
Version: 3




Attachments
This solution has no attachment
  Copyright © 2011 Sun Microsystems, Inc.  All rights reserved.