Sonexion 1600 Replacement Procedures 1.5
Contents

About Sonexion 1600 Replacement Procedures 1.5 (page 4)
Replace a 5U84 Disk (page 5)
Replace a 5U84 EBOD (page 14)
Replace a 5U84 SSD (page 18)
Replace a 5U84 Fan Module (page 24)
Replace a 5U84 Backlit Panel Bezel (page 29)
Replace a 5U84 Side Card (page 37)
    Check for Errors if LEDs are On (page 51)
Replace a 5U84 OSS Controller (page 55)
    Output Example to Check USM Firmware (page 69)
Replace a 5U84 Power Supply Unit (page 76)
Replace a 5U84 Chassis (page 82)
Replace a 2U24 Disk (page 86)
Replace a 2U24 EBOD (page 91)
Replace a 2U24 Power Module (page 95)
Replace a 2U24 Chassis (page 98)
Replace a Quad Server Disk (page 105)
    Zero the Superblock (page 112)
    Clean Up Failed or Pulled Drive in Node (page 113)
Replace a Quad Server MGMT Node (page 117)
Replace a Quad Server MGS or MDS Node (page 131)
Replace a Quad Server MGMT Node Fan Module (page 143)
Replace a Quad Server MGS or MDS Fan Module (page 149)
Replace a Quad Server Power Supply Unit (page 157)
    Replace PSU With One of Same Wattage (page 158)
    Replace Both 1200W PSUs with 1600W PSUs (page 159)
Replace a Quad Server Chassis (page 163)
Replace a CNG Server Module (page 170)
    Configure BIOS Settings for CNG Node (page 175)
    BMC IP Address Table for CNG Nodes (page 178)
Replace a CNG Chassis (page 181)
Replace a Cabinet PDU (page 189)
Replace a Cabinet Network Switch (page 197)
Replace a Cabinet Network Switch PSU (page 207)
Replace a Cabinet Management Switch (Brocade) (page 211)
    Configure a Brocade Management Switch (page 216)
Replace a Cabinet Management Switch PSU (Brocade) (page 222)
Remove a Cabinet Management Switch PSU (Netgear) (page 225)
Replace a Cabinet Power Distribution Strip (page 228)
About Sonexion 1600 Replacement Procedures 1.5
Typographic Conventions

Monospace
A monospace font indicates program code, reserved words or library functions, screen output, file names, path names, and other software constructs.
Monospaced Bold
A bold monospace font indicates commands that must be entered on a command line.
Oblique or Italics
An oblique or italic font indicates user-supplied values for options in syntax definitions.
Proportional Bold
A proportional bold font indicates a user interface control, window name, or graphical user interface button or control.
Alt-Ctrl-f
Monospaced hyphenated text typically indicates a keyboard combination.
Record of Revision, publication HR5-6098

Publication Number   Date             Release   Comment
HR5-6098-0           July 2012        1.0       1300 only
HR5-6098-A           April 2013       1.2       1300 and 1600
HR5-6098-B           April 2014       1.3.1
HR5-6098-C           July 2014        1.4
HR5-6098-D           November 2014              New MGMT node procedure
HR5-6098-E           September 2015             Revised EBOD Procedures for 2U24 and 5U84
HR5-6098-F           October 2015     1.5       Aligned with model 2000 1.5 procedures. Added references to RAS (video) procedures.
HR5-6098-G           February 2016    1.5       Deleted references to RAS; restored text-based procedures, pending availability of RAS procedures.
Replace a 5U84 Disk
Prerequisites

Part Number
  101255900: Disk drive assy, 200GB T10 SAS SSD for SSU
Time required
  1 hour
Interrupt level
  Live (can be applied to a live system with no service interruption, but requires failover/failback operations)
Tools
  ● ESD strap
  ● Monitor and USB keyboard, or a network-connected PC with terminal console emulation
  ● T-20 Torx screwdriver
About this task
Use this procedure only for Sonexion 2000 1.5 pre SU-007 systems (see Important, below) to remove and replace a failed disk drive in carrier (disk) in an OSS component within an SSU.

IMPORTANT: The procedure in this topic is intended for use with Sonexion 2000 1.5.0 (pre SU-007) systems only. For Sonexion 1.5.0 SU-007 and later systems, do not use this topic to replace failed hardware. Instead, field personnel should log in to the Sonexion service console, which provides step-by-step instructions to replace the failed part. Follow the steps below to access the service console:

1. Cable a laptop to any available port on any LMN switch (located at the top of the rack).
2. Log in to the service console and follow the procedure to remove and replace the failed part. To log in, navigate to the service console (http://service:8080). If that URL is not active, log in to port 8080 of the IP address of the currently active MGMT node (MGMT0): http://IP_address:8080, where IP_address is the IP address of the currently active (primary) MGMT node.
3. Enter the standard service credentials.

This disk is a hard disk drive (HDD) and is termed a disk drive in carrier (DDIC). Unless otherwise specified, this procedure uses the term disk to mean an HDD DDIC. A modular SSU chassis is a 5U84 enclosure that accommodates 84 disks and houses two OSS controllers. This procedure applies only to the replacement of a failed hard disk drive; the replacement of a failed SSD is covered in Replace a 5U84 SSD on page 18. A failed hot spare must be replaced with a new drive; rebuilding the disk is not needed.
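Before falling back to the MGMT node address, the service console URL can be checked quickly from the laptop cabled to the LMN switch. This is a minimal sketch and assumes curl is available on the laptop; opening the URL in a browser remains the authoritative check.

[Client]$ curl -sI http://service:8080
[Client]$ curl -sI http://IP_address:8080

A response header from either command indicates that the corresponding console address is reachable; IP_address is the address of the currently active (primary) MGMT node, as described above.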
Subtasks:
● Verify Recovery State of SSU GridRAID Devices
● Verify Location of the Failed Disk
● Remove and Install the 5U84 Disk
GridRAID Recovery

A chassis and two controllers are bundled in the modular SSU. The chassis is a 5U84 enclosure containing 82 HDD disks and two SSD disks. The HDD disks are configured in GridRAID arrays. A GridRAID array is a set of drives configured to act as a single volume. An SSU contains two 41-drive arrays, each configured as a [41 (8+2)+2] GridRAID array, and two SSD drives configured as a RAID1 array. GridRAID arrays are composed of 41 drives. Within the 41 drives there are two distributed spare volumes and RAID 6 stripes configured with 8 data units and 2 parity units per stripe.

When a disk fails and its GridRAID array is degraded, data is immediately reconstructed onto the distributed spare. The distributed spare is a virtual device whose space is distributed across all of the disks. GridRAID defines the operation of reconstructing the data onto the distributed spare as reconstruction. While reconstruction is underway, Sonexion operations continue without interruption. Once the reconstruction is complete and the failed drive is replaced, the data is copied from the distributed spare to the replacement drive. This operation is called a GridRAID rebalancing. GridRAID rebalancing is a copy operation.

5U84 Enclosure Precautions

Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.

Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the disk location is already known, skip to Verify Recovery State of SSU GridRAID Devices.
2. Look for a flashing amber Module Fault LED, Drive Fault LED (on the 5U84 enclosure's Operator Control Panel), and Drawer Status LED. The amber LED indicates a problem with the disk. Note that LEDs are located on both sides of the 5U84 enclosure. Because the RAID array is spread between the two drawers in the 5U84 enclosure, the Drawer Status LEDs on both drawers flash.
Figure 1. Left Panel Amber LEDs: Drawer Status, Drive Fault, Drive Status
Figure 2. Right side Amber Module Fault LED on the Drawer Status LED
3. Open the drawers with the flashing LEDs and look for solid amber Drive Fault LEDs on the faulty disk.
Figure 3. Amber Disk Status LEDs
4. Log in to the primary MGMT node:
   [Client]$ ssh -l admin primary_MGMT_node

Verify Recovery State of SSU GridRAID Devices

5. If the GridRAID recovery state is already known, skip to Verify Location of the Failed Disk. When a GridRAID array becomes degraded because an active disk fails, the array starts a recovery operation known as reconstruction, which reconstructs the missing data onto the distributed spare. This step verifies that the degraded device's recovery operation is in progress or completed.
6. Establish communication with the management node (n000) using one of the following two methods:
   Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP address of the MGMT node (n000) to launch a terminal session, such as PuTTY, using the settings shown in the following table.

   Table 1. Settings for MGMT Connection

   Parameter          Setting
   Bits per second    115200
   Data bits          8
   Parity             None
   Stop bits          1
   Flow control       None

   The function keys are set to VT100+.

   Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 4. Monitor and Keyboard Connections on the MGMT Node
7. Log on to the MGMT node (n000) as admin using the related password, as follows: login as: admin
[email protected]’s password: password Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59 [admin@snx11000n000 ~]$ 8. Change to root user: $ sudo su 9. Verify the state of the GridRAID devices: [nodeXY]$ sudo cat /proc/mdstat The following sample cat /proc/mdstat output shows a GridRAID array in the SSU in an optimal state. [admin@snx11000n004 ~]$ sudo cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] [raid1] md0 : active raid6 sdac[41] sdai[0] sdek[40] sdcf[39] sddm[38] sdm[37] sdbo[36] sddo[35] sdcj[34] sddy[33] sdfh[32] sdeo[30] sdao[29] sdeg[28] sdet[27] sdam[26] sdy[25] sdj[24] sdaw[23] sdbw[22] sdea[21] sdcz[20] sdaq[19] sdak[18] sdbq[17] sdfb[16] sddb[15] sds[14] sdbe[13] sdfi[12] sdcv[11] sdfd[10] sdbl[9] sdcp[8] sdbp[7] sdds[6] sdbf[5] sdag[4] sdck[3] sdo[2] sdeu[1] 30163820544 blocks super 1.2 level 6, 128k chunk, algorithm 1003 [41/41/0] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU] in: 224646417 reads, 41117476 writes; out: 971290979 reads, 2983042690 writes, 2222072048 zwrites 1543568470 in raid5d, 4613 out of stripes, 1543568470 handle called reads: 0 for rmw, 117149671 for rcw 265812120 delayed, 4282577743 bit delayed, 0 active, queues: 265769416 in, 221299131 out bitmap: 7/225 pages [28KB], 8192KB chunk, file: /WIBS/snx11000n004:md0/WIB_snx11000n004:md0 md129 : active raid1 sdei6[0] sdce6[1] 999988 blocks super 1.2 [2/2] [UU] md128 : active raid1 sdei5[0] sdce5[1] 409167 blocks super 1.2 [2/2] [UU] unused devices:
If a GridRAID array is recovering, the cat /proc/mdstat command shows the recovery in process and lists the percentage completion of the reconstruction. Personalities : [raid6] [raid5] [raid4] [raid1] md0 : active raid6 sdae[41] sdai[0] sdek[40] sdcf[39] sddm[38] sdm[37] sdbo[36] sddo[35] sdcj[34] sddy[33] sdfh[32] sdeo[30] sdao[29] sdeg[28] sdet[27] sdam[26] sdy[25] sdj[24] sdaw[23] sdbw[22] s dea[21] sdcz[20] sdaq[19] sdak[18] sdbq[17] sdfb[16] sddb[15] sds[14] sdbe[13] sdfi[12] sdcv[11] sdfd[10] sdbl[9] sdcp[8] sdbp[7] sdds[6] sdbf[5] sdag[4] sdck[3] sdo[2] sdeu[1] 30163820544 blocks super 1.2 level 6, 128k chunk, algorithm 1003 [41/40/1] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU] in: 225882973 reads, 41456081 writes; out: 1497876428 reads, 3063818325 writes, 2240329843 zwrites 1847938266 in raid5d, 8416 out of stripes, 1847938284 handle called reads: 0 for rmw, 118404480 for rcw 267956727 delayed, 4282509863 bit delayed, 3596 active, queues: 267344611 in, 222347640 out [>....................] = 0.0% (3226624/3770477568) finish=1012.0min speed=62037K/sec bitmap: 7/225 pages [28KB], 8192KB chunk, file: /WIBS/snx11000n004:md0/WIB_snx11000n004:md0 md129 : active raid1 sdei6[0] sdce6[1] 999988 blocks super 1.2 [2/2] [UU] md128 : active raid1 sdei5[0] sdce5[1] 409167 blocks super 1.2 [2/2] [UU] unused devices:
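The full /proc/mdstat listing is long. To follow just the array headers and the recovery progress lines while a reconstruction runs, the output can be filtered with standard tools. The following is a minimal sketch; the md device names match the samples above and may differ on other nodes. Press Ctrl-C to stop.

[nodeXY]$ watch -n 10 "grep -E 'md[0-9]+ :|recover|resync|finish=' /proc/mdstat"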
10. For more information on the state of an individual GridRAID device, run: [nodeXY]$ sudo mdadm --misc --detail /dev/mdX Where X is the number of the GridRAID devices. This command provides detail of an MD device; the following example was taken during recovery: [admin@snx11000n004 ~]$ mdadm --misc --detail /dev/md0 /dev/md0: Version : 1.2 Creation Time : Mon Jul 7 02:32:16 2014 Raid Level : raid6 Array Size : 30163820544 (28766.46 GiB 30887.75 GB) Used Dev Size : 966796800 (922.01 GiB 990.00 GB) Raid Devices : 41 Total Devices : 41 Persistence : Superblock is persistent Intent Bitmap : /WIBS/snx11000n004:md0/WIB_snx11000n004:md0 Update Time : Tue Jul 29 03:18:06 2014 State : active, repaired, recovering Active Devices : 40 Working Devices : 41 Failed Devices : 0 Spare Devices : 1 PD-Repaired : 1 Layout : xyratex-pd Chunk Size : 128K Stripe Width : 8 Distributed Spares: 2 Spare0 : [31, -1] Spare1 : [-1, -1] Rebuild Status : 19% complete Name : snx11000n004:md0 (local to host snx11000n004) UUID : d19a0097:1f82952b:3e5a8fcf:7ef3a07e Events : 5064 Number Major Minor RaidDevice State 0 66 32 0 active sync /dev/sdai 1 129 96 1 active sync /dev/sdeu 2 8 224 2 active sync /dev/sdo 3 69 128 3 active sync /dev/sdck 4 66 0 4 active sync /dev/sdag 5 67 144 5 active sync /dev/sdbf 6 71 160 6 active sync /dev/sdds 7 68 48 7 active sync /dev/sdbp 8 69 208 8 active sync /dev/sdcp 9 67 240 9 active sync /dev/sdbl 10 129 240 10 active sync /dev/sdfd 11 70 48 11 active sync /dev/sdcv 12 130 64 12 active sync /dev/sdfi 13 67 128 13 active sync /dev/sdbe 14 65 32 14 active sync /dev/sds 15 70 144 15 active sync /dev/sddb 16 129 208 16 active sync /dev/sdfb 17 68 64 17 active sync /dev/sdbq 18 66 64 18 active sync /dev/sdak 19 66 160 19 active sync /dev/sdaq 20 70 112 20 active sync /dev/sdcz 21 128 32 21 active sync /dev/sdea 22 68 160 22 active sync /dev/sdbw
   23      67       0       23    active sync   /dev/sdaw
   24       8     144       24    active sync   /dev/sdj
   25      65     128       25    active sync   /dev/sdy
   26      66      96       26    active sync   /dev/sdam
   27     129      80       27    active sync   /dev/sdet
   28     128     128       28    active sync   /dev/sdeg
   29      66     128       29    active sync   /dev/sdao
   30     129       0       30    active sync   /dev/sdeo
   41      65     224       31    spare rebuilding   /dev/sdae
   32     130      48       32    active sync   /dev/sdfh
   33     128       0       33    active sync   /dev/sddy
   34      69     112       34    active sync   /dev/sdcj
   35      71      96       35    active sync   /dev/sddo
   36      68      32       36    active sync   /dev/sdbo
   37       8     192       37    active sync   /dev/sdm
   38      71      64       38    active sync   /dev/sddm
   39      69      48       39    active sync   /dev/sdcf
   40     128     192       40    active sync   /dev/sdek
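As a rough sanity check on the geometry reported above, the Array Size can be reproduced from the Used Dev Size and the [41 (8+2)+2] layout described earlier: 41 drives minus the 2 distributed-spare equivalents leaves 39 device equivalents, of which 8 of every 10 stripe units hold data. This is a sketch using the values from the sample output (both sizes are reported in 1 KiB blocks); the result agrees with the reported Array Size of 30163820544 to within stripe-alignment rounding.

[nodeXY]$ echo "966796800 * (41 - 2) * 8 / 10" | bc
30164060160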
Verify Location of the Failed Disk

11. If the location of the failed disk is already known, skip to Remove and Install the 5U84 Disk.
12. If the steps in Verify Recovery State of SSU GridRAID Devices have not been performed, log in to the affected node using either method shown in Connect to the MGMT Node before performing these steps.
13. Run the dm_report command: [nodeXY]$ sudo dm_report Following is a partial sample of dm_report output, showing a failed disk in slot 5 of the GridRAID array: Diskmonitor Inventory Report: Version: 1.0.x.1.5-25.x.2455 Host: snx11000n004 Time: Tue Jul 29 00:30:44 2014 encl: 0, wwn: 50050cc10c4000f5, dev: /dev/sg6, slots: 84, vendor: XYRATEX , product_id: UD-8435-SONEXION 2000 , dev1: /dev/sg4 slot: 0, wwn: 5000c50034a9a353, cap: 1000204886016, dev: sdfn, parts: 0, status: Foreign Arrays, dev1: sdfk, t10: 11110111000 slot: 1, wwn: 5000c50034a95e7b, cap: 1000204886016, dev: sdfj, parts: 0, status: Foreign Arrays, dev1: sdfg, t10: 11110111000 slot: 2, wwn: 5000c50034a9d20f, cap: 1000204886016, dev: sdfl, parts: 0, status: Ok, dev1: sdfi, t10: 11110111000 slot: 3, wwn: 5000c50034a9c0a3, cap: 1000204886016, dev: sdfo, parts: 0, status: Foreign Arrays, dev1: sdfm, t10: 11110111000 slot: 4, wwn: 5000c50034a995ab, cap: 1000204886016, dev: sdff, parts: 0, status: Foreign Arrays, dev1: sdfc, t10: 11110111000 slot: 5, wwn: 5000c500347885cb, cap: 1000204886016, dev: sdfe, parts: 0, status: Failed, dev1: sdae, t10: 11110111000 slot: 6, wwn: 5000c50034a96b17, cap: 1000204886016, dev: sdep, parts: 0, status: Foreign Arrays, dev1: sden, t10: 11110111000 slot: 7, wwn: 5000c5003486d04b, cap: 1000204886016, dev: sdfb, parts: 0, status: Ok, dev1: sdey, t10: 11110111000
If drives are not t10 formatted, the drive members in the dm_report output do not display ‘, t10: xxxxxxxxx’ at the end of each status field, as shown below: slot: 5, wwn: 5000c500347885cb, cap: 1000204886016, dev: sdfe, parts: 0, status: Failed, dev1: sdae slot: 6, wwn: 5000c50034a96b17, cap: 1000204886016, dev: sdep, parts: 0, status: Foreign Arrays, dev1: sden
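On a fully populated enclosure the report covers 84 slots, so it can help to pull out only the failed member directly. This is a minimal sketch; the status string and the expected line match the sample output above.

[nodeXY]$ sudo dm_report | grep 'status: Failed'
slot: 5, wwn: 5000c500347885cb, cap: 1000204886016, dev: sdfe, parts: 0, status: Failed, dev1: sdae, t10: 11110111000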
14. Open the drive drawer and verify that the slot containing the failed drive has a lit Drive Fault LED.
15. Verify that the WWN (identifier) for the failed disk reported in the dm_report output matches the WWN of the failed disk in the SSU.
    IMPORTANT: The WWN identifiers may not match exactly (one digit may be different). There are several WWN identifiers associated with each drive, so there may be a one-digit difference between them.
16. If the failed disk is still responsive, retrieve its serial number:
    [nodeXY]$ sudo sg_inq /dev/sdXX
    Where XX is the SD device. This is sample sg_inq output:
    Vendor identification: SEAGATE
    Product identification: ST32000444SS
    Product revision level: XQB7
    Unit serial number: 9WM1GXFJ0000C1055E36
    Only the first eight characters of the serial number are written on each drive.

Remove and Install 5U84 Disk Drive

The following remove/replace procedure is the same whether the failed drive is active in the GridRAID array or marked as a hot spare.
17. Slide the lock latch to the right until the disk releases from the slot. Each disk has a yellow indicator line that is visible only when the disk is not latched. When the disk is latched, the line is not visible. If the line is visible, the disk is not firmly latched and the drawer should not be closed.
18. Remove the disk from the chassis.
19. Wait for the system to detect the missing drive. On a quiescent system, it takes approximately 30 seconds for the missing drive to be detected; longer on a busy system.
20. Verify that the replacement drive is in the same orientation as the other disks and slide the drive into the slot. The interface card should face the front of the drawer.
21. Push down on the two rubber pads and slide the mechanism forward (towards the rear of the chassis) until the disk locks into position. In some situations, the original disk may be re-inserted into the chassis instead of a new disk drive. In these cases, use the mdadm --force option to add the original disk to the array as a hot spare:
    [n000]$ sudo mdadm --zero-superblock --force /dev/sdXX
    Where XX is the failing drive's sd number.
22. Verify that the new disk is registered as the hot spare:
    [nodeXY]$ sudo dm_report
    Depending on the cluster's load and drive spin-up time, it may take a few minutes for the dm_report output to show the new disk registered as the hot spare. GridRAID does place a status of 'hot spare' on this drive, but there is no hot spare in GridRAID. Rather, GridRAID uses the drive as a replacement for the drive that failed. When the reconstruction is complete, GridRAID performs a GridRAID rebalance, which copies the data from the distributed spare back to the replacement drive, allowing the distributed spare to be used again.
    Following is a partial sample of dm_report output showing the disk in slot 7 registered as the hot spare:
Diskmonitor Inventory Report: Version: 1.0.x.1.5-25.x.2455 Host: snx11000n004 Time: Mon Jul 28 05:50:45 2014
encl: 0, wwn: 50050cc10c4000f5, dev: /dev/sg6, slots: 84, vendor: XYRATEX , product_id: UD-8435-SONEXION 2000 , dev1: /dev/sg4
slot: 0, wwn: 5000c50034a9a353, cap: 1000204886016, dev: sdfn, parts: 0, status: Foreign Arrays, dev1: sdfk, t10: 11110111000
slot: 1, wwn: 5000c50034a95e7b, cap: 1000204886016, dev: sdfj, parts: 0, status: Foreign Arrays, dev1: sdfg, t10: 11110111000
slot: 2, wwn: 5000c50034a9d20f, cap: 1000204886016, dev: sdfl, parts: 0, status: Ok, dev1: sdfi, t10: 11110111000
slot: 3, wwn: 5000c50034a9c0a3, cap: 1000204886016, dev: sdfo, parts: 0, status: Foreign Arrays, dev1: sdfm, t10: 11110111000
slot: 4, wwn: 5000c50034a995ab, cap: 1000204886016, dev: sdff, parts: 0, status: Foreign Arrays, dev1: sdfc, t10: 11110111000
slot: 5, wwn: 5000c500347885cb, cap: 1000204886016, dev: sdfe, parts: 0, status: Ok, dev1: sdfh, t10: 11110111000
slot: 6, wwn: 5000c50034a96b17, cap: 1000204886016, dev: sdep, parts: 0, status: Foreign Arrays, dev1: sden, t10: 11110111000
slot: 7, wwn: 5000c5003486d04b, cap: 1000204886016, dev: sdfb, parts: 0, status: Hot Spare, dev1: sdae, t10: 11110111000
23. If the new disk comes up as anything other than Hot Spare, clear the superblock information:
    [nodeXY]$ sudo mdadm --zero-superblock --force /dev/sdXX
    Where XX is the SD device.
24. Run the dm_report command again to verify that the new disk is registered as the hot spare.
25. If the new disk comes up as anything other than a hot spare again, contact Cray Support.
26. Verify that all indicator lights are normal.
    a. Verify that the status LEDs on the 5U84 enclosure's OCP are green.
    b. Verify that no Drive Fault LEDs are lit. If a Reconstruction or Rebalance operation is in progress on the GridRAID array, the amber Module Fault LED and Drawer Status Drive Fault LED are illuminated on the SSU while the Drive Fault LEDs of the disks that are members of the degraded array are flashing.
27. Terminate the console connection and disconnect the serial cable from the controller.
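Once the replacement disk is registered as the hot spare, the remaining reconstruction and the later GridRAID rebalance can be followed from the affected OSS node with the same mdstat monitoring used in the SSD replacement procedure. A minimal sketch; press Ctrl-C to stop.

[nodeXY]$ watch -n 30 cat /proc/mdstat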
Replace a 5U84 EBOD
Prerequisites

Part number
  100843600: Controller Assy, Sonexion EBOD Expansion Module
Time
  1.5 hours
Interrupt level
  Failover (can be applied to a live system with no service interruption, but requires failover/failback)
Tools
  ● Labels (attach to SAS cables)
  ● ESD strap
  ● Serial cable (9-pin to 3.5mm phone plug)
Requirements
  ● IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully redundant state.
  ● Hostnames must be assigned to all OSS nodes in the cluster containing the failed EBOD I/O module (available from the customer).
  ● To complete this procedure, the replacement EBOD I/O module must have the correct USM firmware for the system. See Sonexion 2000 USM Firmware Update Guide.
About this task
This procedure includes steps to replace the failed EBOD I/O module in the ESU component (SSU+1 or +n component), verify the operation of the new EBOD I/O module, and return the system to normal operation.

Subtasks:
● Replace EBOD on page 16
● Reactivate the Node on page 16

A 5U84 ESU enclosure comprises one 5U chassis with two EBOD I/O modules, 82 drives, two power supply units (PSUs), and five power cooling modules (PCMs). The EBOD I/O modules (left and right) are located on top of the fan modules and are accessible from the back of the rack.

5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.

Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Use the steps in this procedure to fail over and shut down the node, remove the failed EBOD, and install a new one.
Procedure
1. If the location of the failed EBOD I/O module has not been provided, look for an amber Fault LED on the failed EBOD I/O module and the Module Fault LED on the OCP (Operator Control Panel) on the 5U84 enclosure (front panel).
2. Log in to the primary MGMT node:
   $ ssh -l admin primary_MGMT_node
3. Determine whether a failover occurred:
   $ sudo cscli fs_info
   If a failover occurred, go to step 5. Otherwise continue to the next step.
4. Fail over the affected OSS node's resources to its HA partner:
   $ cscli failover -n affected_oss_nodename
   Wait for the OSS node's resources to fail over to the OSS node's HA partner. To confirm that the failover operation is complete, run:
   $ cscli fs_info
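The failover can take some time to settle. Rather than re-running the confirmation command by hand, the filesystem state can be polled until the resources show as failed over. This is a convenience sketch only; interpreting the fs_info output still requires operator judgment. Press Ctrl-C to stop.

$ watch -n 30 cscli fs_info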
5. Shut down the affected OSS node:
   $ cscli power_manage -n oss_nodename --power-off
   Wait for the affected OSS node to completely power off. To confirm that the power-off operation is complete, run:
   $ pm -q

Replace EBOD

Perform the following steps from the back of the rack, wearing the required ESD strap.
6. Disconnect both SAS cables from all ports of the failed EBOD I/O module. Label the cables so that the correct port locations are known when the cables are reconnected to the new EBOD I/O module.
7. Release the module latch by grasping it between the thumb and forefinger and gently squeezing it.
Figure 5. EBOD I/O Latch Operation
8. Using the latch as a handle, carefully remove the failed module from the enclosure.
9. Inspect the new EBOD I/O module for damage, especially to the interface connector. If the module is damaged, do not install it; obtain another EBOD I/O module.
10. With the latch in the released (open) position, slide the new EBOD I/O module into the enclosure until it completely seats and engages the latch.
11. Secure the module by closing the latch. There is an audible click as the latch engages. The I/O module may take up to 1 minute to re-initialize after the cables are inserted.

Reactivate the Node

12. Plug the SAS cables into their original ports on the EBOD I/O module, using the labels applied in step 6.
13. Power on the affected OSS node. On the primary MGMT node, run:
    $ cscli power_manage -n affected_oss_nodename --power-on
14. Wait for the OSS node to come online. This may take a few minutes. Confirm that the OSS node is online:
    $ pdsh -a uname -r | dshbak -c
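If only the replaced node needs to be checked rather than the whole cluster, pdsh can be limited to that host. A sketch, assuming the same node name used in the steps above.

$ pdsh -w affected_oss_nodename uname -r | dshbak -c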
15. Verify the following LED status:
    ● On the new EBOD, the Fault LED is extinguished and the Health LED is illuminated green.
    ● On the new EBOD, the connected SAS Lane LEDs are ON and ready, with no traffic showing.
    ● On the OCP located at the front of the 5U84 enclosure, the Module Fault LED is off.
16. When the affected OSS node is back online, fail back its HA resources on the OSS node:
    $ cscli failback -n oss_nodename
    To confirm that HA resources are failing back, run:
    $ cscli fs_info
17. Verify the status of the SAS Lane LEDs on the new EBOD I/O module:
    ON and FLASHING: the EBOD is active with I/O traffic.
    ON: the EBOD is ready, with no traffic.
18. Connect a serial cable to the new EBOD I/O module and open a terminal session with these settings:
    Bits per second: 115200
    Data bits: 8
    Parity: none
    Stop bits: 1
    Flow control: none
19. Press the Enter key until a GEM> prompt appears. Type ver and press Enter.
20. The USM and GEM firmware versions must agree between the EBOD I/O modules. Consult Cray Hardware Product Support to obtain the correct files, and use the procedures in Cray publication Sonexion USM Firmware Update Guide. When the firmware versions match, proceed to the next step.
21. If the terminal connection (console or PC) is still active, terminate it and disconnect the serial cable from the new controller.
The 5U84 enclosure EBOD I/O module FRU procedure is complete.
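For reference, the serial session in step 18 can also be opened from a Linux laptop with a terminal program such as screen, using the settings listed in that step (115200, 8 data bits, no parity, 1 stop bit, no flow control). The device path below is an example and depends on the USB-serial adapter in use.

$ screen /dev/ttyUSB0 115200

Exit screen with Ctrl-a followed by k when finished, then disconnect the serial cable as described in step 21.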
Replace a 5U84 SSD
Prerequisites

Part number
  101255900: Disk Drive Assy, Seagate 200GB T10 SAS SSD for SSU
Time
  1 hour
Service Interval
  Live (can be applied to a live system with no service interruption, but requires failover/failback operations)
Tools
  ● Console with monitor and keyboard (or serial cable and PC with a serial port configured for 115.2 Kb/s)
  ● T-20 Torx screwdriver
  ● ESD strap
About this task
Use this procedure only for Sonexion 2000 1.5 pre SU-004 systems (see Important, below) to remove and replace a failed solid-state drive (SSD) in an OSS component within an SSU.

IMPORTANT: The procedure in this topic is intended for use with Sonexion 2000 1.5.0 (pre SU-004) systems only. For Sonexion 1.5.0 SU-004 and later systems, do not use this topic to replace failed hardware. Instead, field personnel should log in to the Sonexion service console, which provides step-by-step instructions to replace the failed part. Follow the steps below to access the service console:

1. Cable a laptop to any available port on any LMN switch (located at the top of the rack).
2. Log in to the service console and follow the procedure to remove and replace the failed part. To log in, navigate to the service console (http://service:8080). If that URL is not active, log in to port 8080 of the IP address of the currently active MGMT node (MGMT0): http://IP_address:8080, where IP_address is the IP address of the currently active (primary) MGMT node.
3. Enter the standard service credentials.

In this procedure, the SSD disk drive in carrier (DDIC) is referred to as simply the SSD. This procedure includes steps to locate and replace the failed SSD and reassemble the RAID array. A chassis and two controllers are bundled in the modular SSU. The chassis is a 5U84 enclosure containing 84 DDICs. The DDICs are configured in GridRAID arrays.

Subtasks:
● Remove and Install 5U84 SSD
● Monitor the Recovery Process
In a Sonexion system, a chassis and two OSS controllers are bundled in the modular SSU. The chassis is a 5U84 enclosure containing 82 hard drives and two SSDs. The HDDs are configured in GridRAID arrays. The GridRAID array is a set of drives configured to act as a single volume. An SSU contains two 41-drive arrays, each configured as a [41 (8+2)+2] GridRAID array, with two SSD drives configured as a RAID1 array. In each GridRAID array are two distributed spare volumes and RAID 6 stripes configured with eight data units and two parity units per stripe.

GridRAID and Recovery

When a hard drive fails and its GridRAID array is degraded, data is immediately reconstructed onto the distributed spare. The distributed spare is a virtual device whose space is distributed across all disks. GridRAID defines the operation of reconstructing the data onto the distributed spare as reconstruction. While the reconstruction is underway, Sonexion operations continue without interruption. Once the reconstruction is complete and the failed drive is replaced, the data is copied from the distributed spare to the replacement drive. This operation is called a GridRAID rebalancing. The GridRAID rebalancing is a copy operation.

5U84 Enclosure Precautions

Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.

Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the failed SSD's slot number is already known, go to Remove and Install 5U84 SSD.
2. If the location of the failed SSD is not known, do the following:
   a. Locate the affected SSU in the Sonexion rack.
b. In the affected SSU, to locate the failed SSD, inspect the drive drawers and look for a drive carrier with a Fault LED (amber). In the 5U84 enclosure, the SSD drives should only be in slots 28 and 70. The 5U84 enclosure should also have a Fault LED (amber) on the front panel. 3. Connect to the SSU either by logging in to the MGMT node using one of the methods in Connect to the MGMT Node or by connecting directly to the OSS as described in Connect to an OSS Directly. 4. Issue the dm_report command to determine the failed SSD’s slot number: [nodeXY]$ sudo dm_report The following sample dm_report output shows an SSD drive in slot 28. [admin@snx11000n004 ~]$ sudo dm_report Diskmonitor Inventory Report: Version: 1.0.x.1.5-25.x.2455 Host: snx11000n004 Time: Wed Jul 30 04:04:13 2014 encl: 0, wwn: 50050cc10c4000f5, dev: /dev/sg6, slots: 84, vendor: XYRATEX , product_id: UD-8435-SONEXION 2000 , dev1: /dev/sg4 slot: 0, wwn: 5000c50034a9a353, cap: 1000204886016, dev: sdfn, parts: 0, status: Foreign Arrays, dev1: sdfk, t10: 11110111000 slot: 1, wwn: 5000c50034a95e7b, cap: 1000204886016, dev: sdfj, parts: 0, status: Foreign Arrays, dev1: sdfg, t10: 11110111000 slot: 2, wwn: 5000c50034a9d20f, cap: 1000204886016, dev: sdfl, parts: 0, status: Ok, dev1: sdfi, t10: 11110111000 slot: 3, wwn: 5000c50034a9c0a3, cap: 1000204886016, dev: sdfo, parts: 0, status: Foreign Arrays, dev1: sdfm, t10: 11110111000 slot: 4, wwn: 5000c50034a995ab, cap: 1000204886016, dev: sdff, parts: 0, status: Foreign Arrays, dev1: sdfc, t10: 11110111000 slot: 5, wwn: 5000c500347885cb, cap: 1000204886016, dev: sdfe, parts: 0, status: Ok, dev1: sdfh, t10: 11110111000 slot: 6, wwn: 5000c50034a96b17, cap: 1000204886016, dev: sdep, parts: 0, status: Foreign Arrays, dev1: sden, t10: 11110111000 slot: 7, wwn: 5000c5003486d04b, cap: 1000204886016, dev: sdfb, parts: 0, status: Ok, dev1: sdey, t10: 11110111000 slot: 8, wwn: 5000c5003486c8a7, cap: 1000204886016, dev: sdfa, parts: 0, status: Ok, dev1: sdfd, t10: 11110111000 slot: 9, wwn: 5000c50034a995b7, cap: 1000204886016, dev: sder, parts: 0, status: Ok, dev1: sdeo, t10: 11110111000 slot: 10, wwn: 5000c50034a99e43, cap: 1000204886016, dev: sdex, parts: 0, status: Ok, dev1: sdeu, t10: 11110111000 slot: 11, wwn: 5000c5003478892b, cap: 1000204886016, dev: sdez, parts: 0, status: Foreign Arrays, dev1: sdew, t10: 11110111000 slot: 12, wwn: 5000c500344009cb, cap: 1000204886016, dev: sdeq, parts: 0, status: Ok, dev1: sdet, t10: 11110111000 slot: 13, wwn: 5000c50034a96d1b, cap: 1000204886016, dev: sdev, parts: 0, status: Foreign Arrays, dev1: sdes, t10: 11110111000 slot: 14, wwn: 5000c50034a96727, cap: 1000204886016, dev: sddj, parts: 0, status: Foreign Arrays, dev1: sddg, t10: 11110111000 slot: 15, wwn: 5000c50034aa0f2f, cap: 1000204886016, dev: sddc, parts: 0, status: Foreign Arrays, dev1: sddf, t10: 11110111000 slot: 16, wwn: 5000c50034788a1f, cap: 1000204886016, dev: sddh, parts: 0, status: Foreign Arrays, dev1: sdde, t10: 11110111000 slot: 17, wwn: 5000c50034a9bacb, cap: 1000204886016, dev: sddl, parts: 0, status: Foreign Arrays, dev1: sddi, t10: 11110111000 slot: 18, wwn: 5000c5003486d167, cap: 1000204886016, dev: sddb, parts: 0, status: Ok, dev1: sdcy, t10: 11110111000 slot: 19, wwn: 5000c50034a95b33, cap: 1000204886016, dev: sdda, parts: 0, status: Foreign Arrays, dev1: sddd, t10: 11110111000 slot: 20, wwn: 5000c50034a97947, cap: 1000204886016, dev: sdcj, parts: 0, status: Ok, dev1: sdcl, t10: 11110111000 slot: 21, wwn: 5000c50034a9b2ef, cap: 1000204886016, dev: sdcx, parts: 0, status: 
Foreign Arrays, dev1: sdcu, t10: 11110111000 slot: 22, wwn: 5000c50034a95573, cap: 1000204886016, dev: sdcz, parts: 0, status: Ok, dev1: sdcw, t10: 11110111000 slot: 23, wwn: 5000c50034a9d60b, cap: 1000204886016, dev: sdcn, parts: 0, status: Ok, dev1: sdck, t10: 11110111000 slot: 24, wwn: 5000c50034a9bdff, cap: 1000204886016, dev: sdct, parts: 0, status: Foreign Arrays, dev1: sdcq, t10: 11110111000 slot: 25, wwn: 5000c500344b97c7, cap: 1000204886016, dev: sdcv, parts: 0, status: Ok, dev1: sdcs, t10: 11110111000 slot: 26, wwn: 5000c500344c8337, cap: 1000204886016, dev: sdcp, parts: 0, status: Ok, dev1: sdcm, t10: 11110111000 slot: 27, wwn: 5000c50034a9ae97, cap: 1000204886016, dev: sdcr, parts: 0, status: Foreign Arrays, dev1: sdco, t10: 11110111000 slot: 28, wwn: 5000cca013039360, cap: 100030242816, dev: sdel, parts: 5, status: Failed, dev1: sdei, t10: 11100111000 ... Array: md128, UUID: e77a4742-d001fbc5-1808357e-fad98dd2, status: Degraded, t10: disabled disk_wwn: 5000cca013039c54, disk_sd: sdch, disk_part: 5, encl_wwn: 50050cc10c4000f5, encl_slot: 70 Array: md129, UUID: dd855cd0-4c2914ef-77e2dfed-8c309601, status: Degraded, t10: disabled ... GRD_CHK(1), APP_CHK(1), REF_CHK(1), ATO(1), RWWP(1), SPT(1), P_TYPE(1), PROT_EN(1), DPICZ(1), FMT(1), READ_CHK(1) T10_key_end End_of_report
5. Verify that the failed SSD indicated by the dm_report output matches the slot containing the failed SSD drive (based on prior inspection).

Remove and Install 5U84 SSD

6. Use the Torx screwdriver to unlock the SSU drawer latches and open the drive drawer. Locate the failed SSD in its slot.
7. Unlock the SSD by moving the thumb release button to the right. The drive should pop up slightly. SSDs possess a yellow indicator line that is visible only when the SSD is not latched. When the SSD is latched, the line is not visible. If the line is visible, the drawer should not be closed.
8. Remove the SSD from the drive slot.
9. The replacement drive must be new (for example, it must not be formatted). If the replacement drive has been formatted, insert the drive into a different system (other than Sonexion) and wipe the formatting data using this command:
   [nodeXY]$ sudo dd if=/dev/zero of=/dev/sd_dev bs=1M
10. Remove the SSD from its anti-static protected package.
11. Using the rubber "touch points", move the slide latch to the unlocked position.
12. Orient the SSD in the empty slot with the SAS-SATA bridge card side facing the front of the drawer.
13. Carefully slide the SSD into position and press down while moving the slide latch to the locked position. The thumb lock engages automatically.
14. Close the drive drawer and use the Torx screwdriver to lock the drawer latches. After installation, diskmonitor takes approximately 5-10 minutes to notice the new SSD and begin the rebalancing.
15. Log in to the primary MGMT node:
    [Client]$ ssh -l admin primary_MGMT_node
16. Log in to the OSS node hosted on the new SSD:
    [admin@n000]$ ssh OSS_nodename
    Where OSS_nodename is the name of the OSS node containing the new SSD.
17. Confirm that the newly installed drive has been accepted. Check the status of the drive; it should be Hot Spare. Run:
    [nodeXY]$ sudo dm_report
    For example:
    [admin@snx11000n004 ~]$ sudo dm_report
    slot: 28, wwn: 5000cca013039360, cap: 100030242816, dev: sdel, parts: 5, status: Hot Spare, dev1: sdei, t10: 11100111000

Monitor the Recovery Process

In the following steps, monitor the recovery process until complete by running the watch cat /proc/mdstat command on both nodes controlling the GridRAID array. This is necessary because each node controls half the partitions. The SSD is a relatively small-capacity drive. Due to the speed of the rebuild process, the rebuild has been captured in stages to give a more complete view.
18. Monitor the recovery process on the OSS node hosted on the new SSD:
    [nodeXY]$ watch cat /proc/mdstat
    Press Ctrl-c to stop the watch command.
19. Exit from the node and log in to its HA partner node that controls the GridRAID array:
    [nodeXY]$ exit
    [admin@n000]$ ssh nodeXX
20. Monitor the recovery process on the partner OSS node:
    [nodeXX]$ watch cat /proc/mdstat
    This is an example of the whole process:
    [snx11000n004]$ watch cat /proc/mdstat
    [snx11000n004]$ exit
    [snx11000n000]$ ssh snx11000n005
    [snx11000n005]$ watch cat /proc/mdstat
    This is a sample of watch cat command output:
    [admin@snx11000n004 ~]$ watch cat /proc/mdstat
    Every 2.0s: cat /proc/mdstat    Sat Dec  1 03:51:53 2012
Personalities : [raid1] [raid6] [raid5] [raid4] md0 : active raid6 sdaf[40] sdae[39] sdah[38] sdag[37] sdad[36] sdaj[35] sdai[34] sdac[33] sdak[32] sdal[31] sdap[30] sdan[29] sdam[28] sdr[27] sdq[26] sdt[25] sds[24] sdp[23] sdv[22] sdu[21] sdo[20] sdx[19] sdw[18] sdab[17] sdz[16] sdy[15] sdaa[14] sdd[13] sdc[12] sdf[11] sde[10] sdb[9] sdh[8] sdg[7] sda[6] sdj[5] sdi[4] sdn[3] sdl[2] sdk[1] sdm[0] 3271557120 blocks super 1.2 level 6, 128k chunk, algorithm 1003 [41/41/0] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU] in: 85 reads, 0 writes; out: 85 reads, 0 writes, 0 zwrites 0 in raid5d, 0 out of stripes, 0 handle called reads: 0 for rmw, 0 for rcw 0 delayed, 0 bit delayed, 0 active, queues: 85 in, 85 out bitmap: 0/49 pages [0KB], 4096KB chunk, file: /mnt/wib/md0 md141 : active raid1 sdbq18[0] 409167 blocks super 1.2 [2/1] [U_] md137 : active raid1 sdbq14[0] sdao14[1] 409167 blocks super 1.2 [2/2] [UU] md140 : active raid1 sdbq17[0] 409167 blocks super 1.2 [2/1] [U_] md136 : active raid1 sdbq13[0] sdao13[1] 409167 blocks super 1.2 [2/2] [UU] md133 : active raid1 sdao10[1] sdbq10[0] 409167 blocks super 1.2 [2/1] [U_] [=========>...........] recovery = 48.7% (200064/409167) finish=0.0min speed=200064K/sec md132 : active raid1 sdbq9[0] sdao9[1] 409167 blocks super 1.2 [2/2] [UU] md129 : active raid1 sdbq6[0] sdao6[1] 409167 blocks super 1.2 [2/2] [UU] md128 : active raid1 sdbq5[0] sdao5[1] 409167 blocks super 1.2 [2/2] [UU] unused devices: Personalities : [raid6] [raid5] [raid4] [raid1] md0 : active raid6 sdae[41] sdai[0] sdek[40] sdcf[39] sddm[38] sdm[37] sdbo[36] sddo[35] sdcj[34] sddy[33] sdfh[32] sdeo[30] sdao[29] sdeg[28] sdet[27] sdam[26] sdy[25] sdj[24] sdaw[23] sdbw[22] sdea[21] sdcz[20] sdaq[19] sdak[18] sdbq[17] sdfb[16] sddb[15] sds[14] sdbe[13] sdfi[12] sdcv[11] sdfd[10] sdbl[9] sdcp[8] sdbp[7] sdds[6] sdbf[5] sdag[4] sdck[3] sdo[2] sdeu[1] 30163820544 blocks super 1.2 level 6, 128k chunk, algorithm 1003 [41/41/0] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU] in: 271811271 reads, 50718532 writes; out: 1685327354 reads, 3742691627 writes, 2729414864 zwrites 2315158679 in raid5d, 8416 out of stripes, 2315158697 handle called reads: 0 for rmw, 149399410 for rcw 325807616 delayed, 4278916029 bit delayed, 8 active, queues: 322536613 in, 264915823 out bitmap: 9/225 pages [36KB], 8192KB chunk, file: /WIBS/snx11000n004:md0/WIB_snx11000n004:md0 md129 : active raid1 sdel6[2] sdce6[1] 999988 blocks super 1.2 [2/1] [_U] [==============>......] recovery = 74.7% (748480/999988) finish=29605.7min speed=0K/sec md128 : active raid1 sdel5[2] sdce5[1] 409167 blocks super 1.2 [2/2] [UU] unused devices: When the recovery is complete, the cat/proc/mdstat output on the new node should look similar to this: [admin@snx11000n004 ~]$ sudo cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] [raid1] md0 : active raid6 sdae[41] sdai[0] sdek[40] sdcf[39] sddm[38] sdm[37] sdbo[36] sddo[35] sdcj[34] sddy[33] sdfh[32] sdeo[30] sdao[29] sdeg[28] sdet[27] sdam[26] sdy[25] sdj[24] sdaw[23] sdbw[22] sdea[21] sdcz[20] sdaq[19] sdak[18] sdbq[17] sdfb[16] sddb[15] sds[14] sdbe[13] sdfi[12] sdcv[11] sdfd[10] sdbl[9] sdcp[8] sdbp[7] sdds[6] sdbf[5] sdag[4] sdck[3] sdo[2] sdeu[1] 30163820544 blocks super 1.2 level 6, 128k chunk, algorithm 1003 [41/41/0] [UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU] in: 271811499 reads, 50718581 writes; out: 1685327998 reads, 3742693451 writes, 2729416144 zwrites 2315159735 in raid5d, 8416 out of stripes, 2315159753 handle called reads: 0 for rmw, 149399826 for rcw
325807756 delayed, 4278916029 bit delayed, 16 active, queues: 322536890 in, 264916139 out bitmap: 9/225 pages [36KB], 8192KB chunk, file: /WIBS/snx11000n004:md0/WIB_snx11000n004:md0 md129 : active raid1 sdel6[2] sdce6[1] 999988 blocks super 1.2 [2/2] [UU] md128 : active raid1 sdel5[2] sdce5[1] 409167 blocks super 1.2 [2/2] [UU] unused devices:
21. Log out of the HA partner node. [nodeXX]exit 22. Log in to the OSS node hosted on the new SSD. [admin@n000]$ ssh nodeXY 23. Verify that the rebuild operation was successful: [nodeXY]$ sudo dm_report For example: [admin@snx11000n004 ~]$ sudo dm_report Diskmonitor Inventory Report: Version: 1.0.x.1.5-25.x.2455 Host: snx11000n004 Time: Wed Jul 30 04:19:03 2014 parts: 0, status: Ok, dev1: sdcm, t10: 11110111000 slot: 27, wwn: 5000c50034a9ae97, cap: 1000204886016, dev: sdcr, parts: 0, status: Foreign Arrays, dev1: sdco, t10: 11110111000 slot: 28, wwn: 5000cca013039360, cap: 100030242816, dev: sdel, parts: 5, status: Ok, dev1: sdei, t10: 11100111000 slot: 29, wwn: 5000c50034a997f3, cap: 1000204886016, dev: sdee, parts: 0, status: Foreign Arrays, dev1: sdeh, t10: 11110111000 slot: 30, wwn: 5000c50034a9e473, cap: 1000204886016, dev: sdej, Array: md129, UUID: dd855cd0-4c2914ef-77e2dfed-8c309601, status: Ok, t10: disabled disk_wwn: 5000cca013039360, disk_sd: sdel, disk_part: 6, encl_wwn: 50050cc10c4000f5, encl_slot: 28 disk_wwn: 5000cca013039c54, disk_sd: sdch, disk_part: 6, encl_wwn: 50050cc10c4000f5, encl_slot: 70 End_of_report [admin@snx11000n004 ~]$ Showing that drive slot 28 is now showing a status of OK:slot: 28, wwn: 5000cca013039360, cap: 100030242816, dev:
sdel, parts: 5, status: Ok, dev1: sdei, t10: 11100111000
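Because the enclosure's two SSDs sit only in slots 28 and 70, a quick filter of the dm_report output confirms the state of both members of the RAID1 pair at once. A minimal sketch; both slots should report a status of Ok once the rebuild has finished.

[nodeXY]$ sudo dm_report | grep -E 'slot: (28|70),'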
24. If a serial connection was made earlier in this procedure, disconnect the cable from the controller.
Replace a 5U84 Fan Module
Prerequisites

Part number
  100842900: Fan Tray Assy, Sonexion SSU
Time
  30 minutes
Interrupt level
  Live (can be applied to a live system with no service interruption)
Tools
  ● ESD strap (recommended)
  ● Host names assigned to the two OSS nodes in the SSU that contains the failed fan module (available from the customer)
About this task
The SSU contains a 5U84 enclosure, two controllers, two PSUs, and five fan modules. Each SSU controller hosts one OSS node; there are two OSS nodes per SSU. Within an SSU, the OSS nodes are organized in an HA pair. Each fan module contains two fans that are numbered sequentially: the first fan module contains fan 0 and fan 1, the second fan module contains fan 2 and fan 3, and so on (see the short numbering sketch after the subtask list). The fan modules themselves are not individually numbered, but are ordered left to right (as viewed from the rear). Fan module 1 is the leftmost canister, fan module 5 is the rightmost canister, and fan modules 2, 3, and 4 are in between.

Subtasks:
● Locate the Failed Fan Module
● Replace the Fan Module
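The fan numbering rule above maps each fan module to a fixed pair of fan indices in the fan_set output. The following is a small illustrative sketch of that rule; module 2 is just an example value.

$ module=2; echo "fan module $module contains fan $((2*module - 2)) and fan $((2*module - 1))"
fan module 2 contains fan 2 and fan 3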
This procedure includes steps to verify the operation of the new fan module.

5U84 Enclosure Precautions

Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.

Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.

IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully redundant state.
Procedure
1. If the location of the failed fan module is known, skip to Replace the Fan Module.
2. Establish communication with the management node (n000) using one of the following two methods:
   Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP address of the MGMT node (n000) to launch a terminal session, such as PuTTY, using the settings shown in the following table.

   Table 2. Settings for MGMT Connection

   Parameter          Setting
   Bits per second    115200
   Data bits          8
   Parity             None
   Stop bits          1
   Flow control       None

   The function keys are set to VT100+.

   Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 6. Monitor and Keyboard Connections on the MGMT Node
3. Log on to the MGMT node (n000) as admin using the related password, as follows:
   login as: admin
   [email protected]'s password: password
   Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59
   [admin@snx11000n000 ~]$
4. Change to root user:
   $ sudo su

Locate the Failed Fan Module

5. Access GEM on an OSS node in the SSU that contains the failed fan module and enter:
   # conman OSS_host_name -gem
   where OSS_host_name is the host name of one of the OSS nodes in the SSU that contains the failed fan module. Obtain the OSS node names from the customer.
6. At the GEM prompt, enter:
   GEM> fan_set
7. If the following error is not issued from the preceding command, proceed to Replace the Fan Module.
   1+03:58:30.358 S1 GEM> fan_set
   The following steps can be run only from the master.
8. Exit from the OSS node.
9. Access GEM on the other OSS node (HA partner) and enter:
   # conman nodename -gem
   Where nodename is the host name of the other OSS node in the SSU that contains the failed fan module.
10. At the GEM prompt, enter:
    GEM> fan_set
    The following fan_set output sample shows status or RPM speed for the ten fans (two per module):
    GEM> fan_set
    Fan 0, speed = 13520.
    Fan 1, speed = 13920.
    Fan 2, status = STATUS_DEVICE_REMOVED.
    Fan 3, status = STATUS_DEVICE_REMOVED.
    Fan 4, speed = 13728.
    Fan 5, speed = 14016.
    Fan 6, speed = 13584.
    Fan 7, speed = 13872.
    Fan 8, speed = 13536.
    Fan 9, speed = 13888.
    In the above command output:
    ● Fans 0 and 1 refer to fan module 1, fans 2 and 3 refer to fan module 2, and so on.
    ● Speeds in the 13K-14K range indicate a potential problem. The 8K range is typical.
    ● Depending on the cause of the fan module failure, a status other than STATUS_DEVICE_REMOVED may appear.
    ● The STATUS_DEVICE_REMOVED status indicates a fan failure.

Replace the Fan Module
11. Release the fan module by grasping the handle and pushing down on the orange latch.
12. Using the handle, carefully remove the fan module's canister from the enclosure bay.
13. Insert the new fan module into the enclosure bay and slide it in until the module completely seats and resets the latch. The new fan module powers up.
14. When the fan reaches full speed, enter the fan_set command on the OSS node associated with the new fan module:
GEM> fan_set
The following example fan_set output shows the current RPM speed of each fan:
Fan 0, speed = 8208.
Fan 1, speed = 8144.
Fan 2, speed = 8448.
Fan 3, speed = 8416.
Fan 4, speed = 8224.
Fan 5, speed = 8208.
Fan 6, speed = 8416.
Fan 7, speed = 8448.
Fan 8, speed = 8496.
Fan 9, speed = 8448.
15. Exit GEM:
GEM> &.
16. Verify that the fan_set output is normal and shows fan speeds for all fan modules without errors or warnings.
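Because the GEM console session is logged under /var/log/conman (the log location referenced later in this guide), one additional, optional check after exiting GEM is to search that node's GEM log for any lingering removal status; the hostname and path below are examples only:
[root@snx11000n000 ~]# grep -i STATUS_DEVICE_REMOVED /var/log/conman/snx11000n004-gem.log | tail
If nothing recent is returned after the replacement, the new fan module is no longer being reported as removed.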
Replace a 5U84 Backlit Panel Bezel
Prerequisites
Part number:
Sonexion Model   Part Number   Description
Sonexion 2000    101205500     Panel Assy, Back-lit Sonexion 2000 SSU Fascia
Sonexion 1600    100901300     Panel Assy, Back-lit Sonexion 1600 SSU Fascia
Sonexion 1600    101337700     Panel Assy Fry, Back-lit Sonexion 1600 SSU Fascia
Time: 1 hour
Interrupt level: Interrupt (requires disconnecting the Lustre clients from the filesystem)
Tools:
● Serial cable
● 2mm recessed Allen hex socket screwdriver or torque driver
● Special plastic tool
● T-20 Torx screwdriver
● ESD strap
About this task
The modular SSU incorporates a chassis and two controllers. The chassis is a 5U84 enclosure that contains 84 disks. This procedure includes steps to take the OSS nodes offline, remove and replace the failed bezel, bring the OSS nodes online, and return the system to normal operation.
Subtasks:
● Remove the Defective Bezel
● Install the New Bezel
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
The following procedure requires disconnecting the Lustre clients from the filesystem.
IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully redundant state.
Procedure
1. If the logical locations (hostnames) of the two OSS nodes associated with the affected SSU are known, skip to Remove the Defective Bezel.
2. Establish communication with the management node (n000) using one of the following two methods:
Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP address of the MGMT node (n000) to launch a terminal session, such as Putty, using the settings shown in the following table.
Table 3. Settings for MGMT Connection
Parameter          Setting
Bits per second    115200
Data bits          8
Parity             None
Stop bits          1
Flow control       None
The function keys are set to VT100+.
Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 7. Monitor and Keyboard Connections on the MGMT Node
3. Log on to the MGMT node (n000) as admin using the related password, as follows: login as: admin [email protected]’s password: password Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59 [admin@snx11000n000 ~]$ 4. Change to root user: $ sudo su 5. Connect a serial cable to one of the controllers in the affected SSU (the serial port is on the rear panel). 6. Log in to one of the OSS nodes hosted on the controller (username admin and the customer's password). 7. Determine the hostnames of the OSS nodes hosted on the SSU containing the failed backlit panel bezel: # sudo crm_mon Leave the serial cable connected. This connection is used in the following steps. Remove the Defective Bezel 8. Power off the system as described in Power Off a Sonexion 1600 System via the CLI. 9. Attach an ESD wrist strap and use it continuously. 10. Power off both Power Supply Units (PSUs) in the SSU (located below the controllers) and remove the power cords from the PSUs. 11. Unlock the SSU drawer containing the failed bezel. Using a T-20 Torx screwdriver, rotate the handle lock screw to the unlocked position (opposite from the lock icon). 12. Locate four 2mm Allen screws (one above and one below the handle lock screw, on both sides). 13. Using an Allen screwdriver, detach the bezel from the drawer by loosening and removing each 2mm screw, as shown in the following figure. Place screws in a safe location.
Figure 8. Remove Allen Screws from Bezel
14. Unlatch the SSU drawer and open it slightly for better access to the bezel, as shown in the following figure. Figure 9. Unlatch the SSU Drawer
15. Use the following steps to disengage the five tabs along the top of the molded bezel. The molded bezel contains five plastic tabs, along the top and bottom edges. To remove the bezel from the SSU drawer, it is necessary to slightly compress these tabs to disengage them. This step releases only the tabs of the top edge of the bezel. The tabs on the bottom edge are disengaged in step 16.a on page 33 a. Using the special plastic tool, start at the upper right-hand edge and gently pry the first tab with a slight twisting motion, as shown in the following figure. The first tab will not release completely until the tool reaches the second tab, and so forth.
Figure 10. Remove Tab to Release Bezel
b. Using the plastic tool, continue to disengage each tab across the top of the bezel. 16. Remove the failed bezel from the SSU drawer: a. Rotate the bezel downward from the top edge and lift away from the drawer enough to allow access to remove the interface cable. This disengages tabs along the bottom edge. Caution: Use care when lifting the bezel away from the drawer; an interface cable attaches the bezel to the drawer. b. Remove the screw that secures the cable retaining plate, as shown in the following figure. Figure 11. Remove Cable Retaining Plate Securing Screw
c. Remove the cable retaining plate from the cable by separating the metal plate from the rubber grommet, as shown in the following figure.
Figure 12. Remove Cable Retaining Plate
d. Set the retaining plate and screw aside for re-installation. e. Disconnect the interface cable from the socket, as shown in the following figure. Figure 13. Disconnect the Interface Cable
Do not touch the printed circuit board component.
Install the New Bezel
17. Reconnect the interface cable to the PCB plug with the gold pins facing up, as shown in the following figure.
Figure 14. Reconnect Interface Cable
18. Insert the grommet of the cable assembly into the retaining plate that was set aside earlier, as shown in the following figure.
Figure 15. Inserting the Cable Assembly Grommet
19. Install the screw that secures the cable retaining plate. Torque to 1.1 Nm (Newton meters) or 9.74 in-lbf (pound-force inches). 20. At the front of the drawer, position the bezel so that the five lower tabs engage in the bottom edge and rotate the bezel into position at the top edge. 21. Press in on the bezel near each tab on the top until all tabs are fully engaged. 22. As shown in the following figure, attach the bezel to the drawer by re-installing the four 2mm Allen screws and torque them to 0.5 Nm (4.43 in-lbf).
Figure 16. Attaching the Bezel to the SSU Drawer
If a torque driver is unavailable, snug the screws. 23. Close the SSU drawer. 24. Reconnect the power cords to the two PSUs in the affected SSU (rear panel). 25. Power on each PSU by moving its ON/OFF switch to the ON position. The PSUs power on and the OSS nodes hosted on the SSU start to boot. Each OSS node can take 5-30 minutes to fully boot. 26. Verify that the new backlit panel bezel LEDs are illuminated and functioning correctly. 27. Power on the system as described in Power On a Sonexion 1600 System. 28. Disconnect the serial cable from the controller in the affected SSU and the console (or PC). 29. Using the package from the replacement backlit panel bezel, re-package the failed part and return it to Logistics using the authorized procedures.
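As a final sanity check before leaving the system, the commands used elsewhere in this guide can confirm that both OSS nodes of the affected SSU are answering and serving their targets; hostnames are examples:
[admin@snx11000n000 ~]$ pdsh -g oss date
[admin@snx11000n000 ~]$ sudo /opt/xyratex/bin/cscli fs_info
All OSS nodes should return the date, and fs_info should show the expected number of targets mounted on each node.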
Replace a 5U84 Side Card
Prerequisites
Part number:
● 100934200: PCB Assy, Sonexion SSU Side-card Left
● 100934300: PCB Assy, Sonexion SSU Side-card Right
Time: 1 hour
Interrupt level: Interrupt (requires disconnecting Lustre clients from the filesystem)
Requirement: Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits, no parity, and 1 stop bit)
Tools:
● Serial cable
● 2mm recessed Allen hex socket screwdriver or torque driver
● T-20 Torx screwdriver
● ESD strap, boots, garment or other approved protection
● SSU reset tool (P/N 101150300)
About this task
A chassis and two controllers are bundled in the modular SSU (5U84 enclosure). The chassis contains 84 disk drives, divided into two drawers with 42 drives each. A drawer contains two side cards, known as the left side card (LH) and right side card (RH). This procedure includes steps to take the SSU's OSS nodes offline, remove and replace the side card, bring the OSS nodes online, and return the Sonexion system to normal operation.
Subtasks:
● Remove Side Card
● Install Side Card
● Power On the Affected SSU
● Check for Errors if LEDs are On on page 51
The following procedure requires disconnecting the Lustre clients from the filesystem.
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Figure 17. OSS Operator’s Panel
Table 4. OSS Operator's Panel Controls and Indicators
Unit Identification Display: Dual 7-segment display used to provide feedback to the user. Its primary usage is to display an enclosure unit identification number to assist in setting up and maintaining multiple enclosure systems. The Unit Identification Display is configured via a VPD option. By default, the display is OFF, and the dual 7-segment display is OFF. If the VPD selects use of the display, the 7-segment display is ON and displays the number stored in VPD.
Mute/Input Switch: Used to set the Unit Identification Display and transition alarm states.
Power On/Standby LED (Green/Amber): Lights green when system power is available. Lights amber when only standby power is available.
Module Fault LED (PSU/Cooling Fan/SBB Status) (Amber): Lights when a system hardware fault occurs. It may be associated with a fault LED on a PSU or an I/O module; helps the user to identify the faulty component.
Logical Fault LED (Amber): Indicates a state change or a fault from something other than the enclosure management system. The condition may be caused by an internal or external source and communicated to the enclosure (normally via SES). The fault is usually associated with a disk drive. LEDs at each disk drive position help the user to identify the affected drive.
Drawer 1 Fault (Amber): Drive cable or side card fault.
Drawer 2 Fault (Amber): Drive cable or side card fault.
Procedure 1. If the failed side card is not known, determine the SSU containing the card using the front panel and drawer indicators shown in the preceding figure and table. At the front of the rack, look for an amber Drawer 1 Fault or Drawer 2 Fault on the operator’s panel, indicating the drawer containing the defective side card. 2. In the drawer containing the lit Drawer Fault LED, check the Sideplane Fault LED (on the Drawer Status LEDs panel). See the following figure and table. If this LED is lit, the drawer contains the failed side card.
Figure 18. Drawer Status LEDs
Table 5. Drawer Status LED Key
Status                         Power LED (Green)   Fault LED (Amber)   Cable Fault LED (Amber)   Drawer Fault LED (Amber)   Activity Bar Graph (Green)
Sideplane Card OK/Power Good   On                  Off                 Off                       Off                        -
Sideplane Card Fault           Off                 On                  -                         -                          Off
Drive Fault                    Off                 -                   -                         On                         Off
Cable Fault                    Off                 -                   On                        -                          Off
Drive Activity                 On                  Off                 Off                       Off                        On
3. Log in to the primary MGMT node:
[client]$ ssh -l admin primary_MGMT_node
4. Log in to the SSU containing the failed side card, using one of the following methods:
● If logging into the SSU via the primary MGMT node (preferred method), enter:
[admin@n000]$ ssh OSS_nodename
● Connect to the Management node using either approach described in Connect to the MGMT Node.
5. Open a terminal session with these settings:
● Bits per second: 115200
● Data bits: 8
● Parity: none
● Stop bits: 1
● Flow control: none
Set terminal emulation to VT100+.
6. Connect a serial cable to one of the OSS controllers in the affected SSU (serial port is on the rear panel).
7. Log in to the OSS node hosted on the OSS controller in the affected SSU (username admin and the customer's password).
8. Stop the Lustre file system:
[admin@n000]$ cscli unmount -f fsname
Example:
[admin@snx11000n000 tmp]$ sudo /opt/xyratex/bin/cscli unmount -f snx11000
unmount: stopping snx11000 on snx11000n[002-003]...
unmount: stopping snx11000 on snx11000n[004-005]...
unmount: snx11000 is stopped on snx11000n[002-003]!
unmount: snx11000 is stopped on snx11000n[004-005]!
unmount: File system ssetest is unmounted
In the following, a zero in the first column under Targets shows the file system is not mounted.
[admin@snx11000n000 tmp]$ sudo /opt/xyratex/bin/cscli fs_info
Information about "snx11000" file system:
Node          Type  Targets  Failover partner  Devices
snx11000n005  oss   0 / 4    snx11000n004      /dev/md1, /dev/md3, /dev/md5, /dev/md7
snx11000n004  oss   0 / 4    snx11000n005      /dev/md0, /dev/md2, /dev/md4, /dev/md6
snx11000n003  mds   0 / 1    snx11000n002      /dev/md66
snx11000n002  mgs   0 / 0    snx11000n003
9. Verify that Lustre has stopped on OSS nodes in the affected SSU:
[n0xy]$ sudo pdsh -g oss "mount -t lustre | wc -l" | dshbak -c
Example:
[admin@snx11000n004 ~]$ sudo pdsh -g oss "mount -t lustre | wc -l" | dshbak -c
----------------
snx11000n[004-105]
----------------
0
(A result of 0 shows Lustre is not mounted on any OSS.)
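Before powering anything off, the same check can also be run against the MDS/MGS pair. This is a sketch only; it assumes an mds pdsh group exists alongside the oss group used above:
[admin@snx11000n000 ~]$ sudo pdsh -g mds "mount -t lustre | wc -l" | dshbak -c
A count of 0 from each node confirms that no Lustre targets remain mounted there.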
10. Log in to one of the OSS nodes in the affected SSU:
[admin@n0xx]$ ssh oss_node
where oss_node is one of the OSS nodes in the affected SSU.
11. To determine which side card has failed, first run the following on each OSS controller in the affected SSU:
[OSS controller]$ sudo sg_map -i -x | grep XYRATEX
12. Check the command output for missing expanders. If the even-numbered OSS node is missing an expander, the right side-card has failed and needs to be replaced. If the odd-numbered OSS node is missing an expander, then the left side-card has failed and needs to be replaced.
The command output shows what appears to be double the expected amount of output; that is, two of each system device and two of each expander. This effect is caused by the SAS cable link on the OSS controller.
Example output for the even-numbered OSS node (no missing expanders):
[admin@snx11000n004 ~]$ sudo sg_map -i -x | grep XYRATEX
/dev/sg4 7 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg6 8 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg33 7 0 15 0 3 XYRATEX DEFAULT-SD-R24 3519
/dev/sg36 8 0 15 0 3 XYRATEX DEFAULT-SD-R24 3519
/dev/sg91 7 0 44 0 3 XYRATEX DEFAULT-SD-R36 3519
/dev/sg94 8 0 44 0 3 XYRATEX DEFAULT-SD-R36 3519
/dev/sg149 7 0 73 0 3 XYRATEX DEFAULT-SD-R36 3519
/dev/sg152 8 0 73 0 3 XYRATEX DEFAULT-SD-R36 3519
/dev/sg179 7 0 88 0 3 XYRATEX DEFAULT-SD-R24 3519
/dev/sg181 8 0 88 0 3 XYRATEX DEFAULT-SD-R24 3519
Example output for the odd-numbered OSS node (no missing expanders):
[admin@snx11000n005 ~]$ sudo sg_map -i -x | grep XYRATEX
/dev/sg1 7 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg30 7 0 29 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg45 7 0 44 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg60 7 0 59 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg89 7 0 88 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg90 8 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg119 8 0 29 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg134 8 0 44 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg149 8 0 59 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg178 8 0 88 0 3 XYRATEX DEFAULT-SD-L36 3519
Example output for an OSS node missing an expander: an empty row indicates a missing expander (in this case /dev/sg60; /dev/sg149 is a duplicate of /dev/sg60 and not missing an expander):
[admin@snx11000n005 ~]$ sudo sg_map -i -x | grep XYRATEX
/dev/sg1 7 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg30 7 0 29 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg45 7 0 44 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg60
/dev/sg89 7 0 88 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg90 8 0 0 0 13 XYRATEX UD-8435-SONEXION 2000 3519
/dev/sg119 8 0 29 0 3 XYRATEX DEFAULT-SD-L36 3519
/dev/sg134 8 0 44 0 3 XYRATEX DEFAULT-SD-L24 3519
/dev/sg149
/dev/sg178 8 0 88 0 3 XYRATEX DEFAULT-SD-L36 3519
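As a quick cross-check of step 12, the expander entries can be counted on both OSS nodes of the affected SSU at once. This is a sketch only; the hostnames are examples, it assumes passwordless pdsh access as used elsewhere in this guide, and the expected count depends on the enclosure configuration:
[admin@snx11000n000 ~]$ sudo pdsh -w snx11000n[004-005] "sg_map -i -x | grep -c XYRATEX" | dshbak -c
Because a missing expander shows up as a bare /dev/sgNN entry with no XYRATEX text, the node reporting the lower count is the one missing an expander.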
Remove Side Card
13. Exit the OSS node and return to the primary MGMT node:
[n0xy]$ exit
14. Power off the OSS node pair in the affected SSU:
[admin@n000]$ cscli power_manage -n nodexxx-xxy --power-off
Where nodexxx-xxy indicates the OSS nodes in the HA pair.
Example:
[admin@snx11000n000 ~]$ pm -n snx11000n[004-005]
Command completed successfully
[admin@snx11000n000 ~]$ pm -q
on: snx11000n[000-003]
off: snx11000n[004-005]
unknown:
15. Once the OSS node pair is powered off, shut down the affected SSU by turning off the power switches (at the back of the rack) on the PSUs (located below the OSS controllers).
16. Remove the power cords from the PSUs.
17. Attach an ESD wrist strap and use it at all times.
18. Unlock the SSU drawer containing the failed side card. Use the T-20 Torx screwdriver to rotate the handle lock screw to the unlocked position (opposite from the lock icon).
19. Push and hold the drawer latches inward while pulling the drawer all the way out until it locks.
Figure 19. Open the Drawer
Once the drawer locks in the open position, the side cards are accessible, as shown in the following figure. Figure 20. Side Card Exploded View
20. Using the 2mm hex driver, loosen the captive fasteners on the safety cover, shown in the above figure. Place the cover in a safe place, as it will be reused.
21. Unplug the three power cables attached to the side card and release them from their retaining clips by squeezing the sides of the cable connectors and firmly pulling. The connectors are shown in the preceding figure.
22. Unplug the two SAS cables. If necessary, use a spring hook to lift the female connector casing to release the hooks located on the male connector. Do not bend the metal receiver too much or the connector will not grab when reinserted.
Figure 21. SAS Cables Male/Female Connector
23. Allow the power and SAS cables to hang down freely out of the way.
24. Pull the drawer out to the limit (past where it settles in the locked position). While the drawer is held out, and holding the power connectors on the side card, gently ease the side card off the drawer, getting the card's right-angle bracket past the end of the drawer frame. It may be necessary to apply alternating pressure (an up-and-down, rocking motion) to remove the side card, but be careful not to bend it.
Figure 22. Ease Side Card from drawer
Install Side Card
25. Check for any bent pins. Again holding the drawer at the limit of travel, position the card for mounting by getting the right-angle bracket past the drawer frame. Align the three square connectors on the new side card with the receptacles on the drawer.
26. Press the new side card from the rear of the three connectors onto the drawer chassis. Be careful to press only above the connector area to avoid flexing the card.
Figure 23. Press the new Side Card onto Drawer Chassis
27. Install the white cable retaining clips included with the new side card. There are two different sizes of clip (three long clips and one short clip), and they are installed in specific locations; see the following figure.
Figure 24. Clip and Connector Locations
28. Connect the two SAS cables to the SAS connectors at the back of the new side card. 29. Connect the three power cables to the power cable connectors on the new side card. 30. Secure the power cables by clipping them to the white retaining clips. 31. Align the safety cover over the new side card, ensuring that the power cables are not trapped beneath the cover. 32. Using the 2mm hex driver, tighten the captive fasteners. ●
Tighten the silver-headed fasteners to 0.5Nm.
46
Replace a 5U84 Side Card
●
Tighten the black-headed fasteners to 0.9Nm.
33. Perform a visual inspection to ensure no cables are outside of the safety cover, and the cover is installed flat against the drawer chassis. 34. Close the drawer by pulling and holding both white latches on the sides and pushing in the drawer slightly. 35. Release the white latches and ensure that they have returned to their original position. 36. Push the drawer completely into the 5U84 enclosure. 37. If required, use the Torx driver to lock the drawer. 38. Reconnect the power cords to the two affected PSUs in the SSU (rear panel). Power On the Affected SSU 39. At the back of the rack, place the PSU switches to the ON position. Wait 2 minutes to allow the OSS controllers to come online. 40. Power on using one of the following methods. ●
Using a pen or appropriate tool, press the Power On/Off button on the rear panel of the OSSs in the SSU.
●
Issue a command from the primary MGMT node to power on the OSS nodes: 1. Log in to the primary MGMT node: [client]$ ssh -l admin primary_MGMT_node 2. Power on the OSS nodes: [admin@n000]$ cscli power_manage –n nodexxx-xxy --power-on Example: [admin@snx11000n000 ~]$ power_manage –n snx11000n[004-105] Command completed successfully [admin@snx11000n000 ~]$ pm -q on: snx11000n[000-005] off: unknown:
--power-on
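To watch for the pair to come back without re-typing the query, the power-state check shown in the example above can be repeated automatically. This is a sketch that assumes the pm wrapper is available on the MGMT node; watch is standard Linux:
[admin@snx11000n000 ~]$ watch -n 30 pm -q
Press Ctrl-C to stop once both OSS nodes report on.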
Figure 25. OSS Rear Panel on Sonexion 2000
41. Verify that OSS’s have booted: [admin@n000]$ pdsh –g oss date Example: [admin@snx11000n000 ~]$ pdsh -g oss date snx11000n004: Mon Apr 1 11:30:50 PDT 2013 snx11000n005: Mon Apr 1 11:30:50 PDT 2013 42. Use the crm_mon utility to verify the status of the OSS nodes and that the STONITH high-availability (HA) resource has started. a. Log in to an even-numbered OSS node via SSH: [admin@n000]$ ssh OSS_node_hostname b. Display the node status: [admin@n000]$ sudo crm_mon -1 43. Continue to run the crm_mon utility until the output verifies that the OSS nodes are running and that STONITH has started. Following is a partial sample of output showing the OSS nodes are online. [admin@snx11000n000 ~]$ sudo pdsh -w snx11000n004 crm_mon -1r snx11000n004: ============ snx11000n004: Last updated: Mon Apr 1 11:46:31 2013 snx11000n004: Last change: Mon Apr 1 10:47:56 2013 via cibadmin on snx11000n005 snx11000n004: Stack: Heartbeat snx11000n004: Current DC: snx11000n005 (ee6d3c80-0d49-4fa2-a013-3b4991ed6a2f) partition with quorum snx11000n004: Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052 snx11000n004: 2 Nodes configured, unknown expected votes snx11000n004: 55 Resources configured. snx11000n004: ============ snx11000n004: snx11000n004: Online: [ snx11000n004 snx11000n005 ] snx11000n004:
snx11000n004: Full list of resources:
snx11000n004:
snx11000n004: snx11000n004-stonith (stonith:external/gem_stonith): Started snx11000n004
snx11000n004: snx11000n005-stonith (stonith:external/gem_stonith): Started snx11000n005
snx11000n004: snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004
snx11000n004: snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n005
snx11000n004: baton (ocf::heartbeat:baton): Started snx11000n004
snx11000n004: snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004
snx11000n004: snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005
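A quick way to confirm from the MGMT node that the STONITH resources in particular have started, using the same pdsh and crm_mon pattern as the sample above (the hostname is an example):
[admin@snx11000n000 ~]$ sudo pdsh -w snx11000n004 crm_mon -1r | grep -i stonith
Each stonith resource should be listed as Started on one of the two OSS nodes.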
44. Log in to the primary MGMT node via SSH: [admin@n000]$ ssh -l admin primary_MGMT_node_hostname 45. Mount the Lustre file system: [admin@n000]$ cscli mount -f none Example : [admin@snx11000n000 ~]$ cscli mount -f snx11000 mount: MGS is starting... mount: MGS is started! mount: starting ssetest on snx11000n[002-003]... mount: starting ssetest on snx11000n[004-005]... mount: snx11000 is started on snx11000n[002-003]! mount: snx11000 is started on snx11000n[004-005]! mount: File system snx11000 is mounted. [root@snx11000n000 ~]# /opt/xyratex/bin/cscli fs_info Information about "snx11000" file system: Node Failover Node type Targets partner Devices snx11000n005 oss 4 / 4 snx11000n004 /dev/md1, /dev/md3, /dev/md5, /dev/ md7 snx11000n004 oss 4 / 4 snx11000n005 /dev/md0, /dev/md2, /dev/md4, /dev/ md6 snx11000n003 mds 1 / 1 snx11000n002 /dev/md66 snx11000n002 mgs 0 / 0 snx11000n003 46. Check the USM firmware version running on the new side card, as described in USM Firmware Update Guide. As viewed from the front of the SSU, if the right side card was changed, run the following command on the left (or primary) side card (even-numbered ESM controller). If the left side card was changed, run the following command on the right (or secondary) side card (odd-numbered ESM controller). [admin@n000]$ sudo conman node_name -gem a. When the GEM command prompt is visible, type: ver To list all commands type: help all Output example: 19+05:16:40.965 M0 GEM> ver Canister firmware : 3.5.0.25
Canister firmware date : May 1 2014 11:41:17 Canister bootloader : 5.03 Canister config CRC : 0xD1B030A4 Canister VPD structure : 0x06 Canister VPD CRC : 0xEE3504B4 Canister CPLD : 0x17 Canister chip : 0x80050002 Canister SDK : 3.06.01-B028 Midplane VPD structure : 0x0F Midplane VPD CRC : 0x0E74E375 Midplane CPLD : 0x05 PCM 1 firmware : 2.24|2.17|2.00 PCM 2 firmware : 2.20|2.12|2.00 PCM 1 VPD structure : 0x03 PCM 2 VPD structure : 0x03 PCM 1 VPD CRC : 0x486003DF PCM 2 VPD CRC : 0x486003DF Fan Controller 0 config : 0960837-04_0 Fan Controller 0 deviceFW : UCD90124A|2.3.6.0000|110809 Fan Controller 1 config : 0960837-04_0 Fan Controller 1 deviceFW : UCD90124A|2.3.6.0000|110809 Fan Controller 2 config : 0960837-04_0 Fan Controller 2 deviceFW : UCD90124A|2.3.6.0000|110809 Fan Controller 3 config : 0960837-04_0 Fan Controller 3 deviceFW : UCD90124A|2.3.6.0000|110809 Fan Controller 4 config : 0960837-04_0 Fan Controller 4 deviceFW : UCD90124A|2.3.6.0000|110809 Battery 1 firmware : Not present Battery 2 firmware : Not present Sled 1 Element 0 Firmware : 3.5.0.25|BL=6.10|FC=0x11EEFEAF|VR=0x06| VC=0xF027D510|CR=0x10|PC=N/A|EV=0x80040002|SV=3.06-B028 Sled 1 Element 1 Firmware : 3.5.0.25|BL=6.10|FC=0x687FE3F1|VR=0x06| VC=0x2737A206|CR=0x10|PC=N/A|EV=0x80050002|SV=3.06-B028 Sled 1 Element 2 Firmware : 3.5.0.25|BL=6.10|FC=0xFA337B92|VR=0x06| VC=0x152E4AD9|CR=0x10|PC=N/A|EV=0x80040002|SV=3.06-B028 Sled 1 Element 3 Firmware : 3.5.0.25|BL=6.10|FC=0x6591A901|VR=0x06| VC=0x38B21CF7|CR=0x10|PC=N/A|EV=0x80050002|SV=3.06-B028 Sled 2 Element 0 Firmware : 3.5.0.25|BL=6.10|FC=0x11EEFEAF|VR=0x06| VC=0xF027D510|CR=0x10|PC=N/A|EV=0x80040002|SV=3.06-B028 Sled 2 Element 1 Firmware : 3.5.0.25|BL=6.10|FC=0x687FE3F1|VR=0x06| VC=0x2737A206|CR=0x10|PC=N/A|EV=0x80050002|SV=3.06-B028 Sled 2 Element 2 Firmware : 3.5.0.25|BL=6.10|FC=0xFA337B92|VR=0x06| VC=0x152E4AD9|CR=0x10|PC=N/A|EV=0x80040002|SV=3.06-B028 Sled 2 Element 3 Firmware : 3.5.0.25|BL=6.10|FC=0x6591A901|VR=0x06| VC=0x38B21CF7|CR=0x10|PC=N/A|EV=0x80050002|SV=3.06-B028 19+05:17:56.232 M0 GEM> In the above output, the side-card firmware versions are listed last in the Sled 1 Element and Sled 2 Element lines. b. To exit GEM, type: &. 47. Any discrepancies in the Sled 1 and Sled 2 element lines require a USM firmware upgrade to level the new side card to the latest firmware version. Refer to Sonexion USM Firmware Update Guide for the model of your system. 48. If the terminal connection (console or PC) is still active, terminate it and disconnect the serial cable from the new controller.
If LEDs are illuminated on the front panel after powering up the system, check for errors using the following procedure, Check for Errors if LEDs are On on page 51.
Check for Errors if LEDs are On
About this task
During installation of a 5U84 side card, if any LEDs are lit on the affected SSU's front panel after powering up the Sonexion system, use this procedure to check for errors in the enclosure. Allow several minutes for the OSS nodes in the affected SSU to come back online and start operations. Typically, the InfiniBand connection is ready as soon as it comes online.
Procedure
1. Check if the InfiniBand connection is ready after it comes online. Log in to one of the OSS nodes in the SSU and run:
[admin@n000]$ sudo crm_mon -1
2. Once the OSS nodes are back online, visually inspect the front of the SSU and verify the LED status.
3. If all LEDs are green, proceed to step 9 on page 51. If one or more LEDs are not green, proceed to the following step.
4. Power off the SSU again, remove both side-card covers, and verify all cable connections and the seating of the side card.
5. Power on the SSU. If all LEDs are green, skip to step 9 on page 51, as the amber LED could indicate a different issue. If one or more LEDs are not green, proceed to the next step.
6. Log in to GEM on one of the OSS nodes in the affected SSU:
[admin@n000]$ conman nodenameXXX-gem
7. Create a log file on the MGS node and place it in /var/log/conman:
GEM> ddump
This log dump runs for an extended period. The log file can be located by hostname and date/time stamp. This file contains information to help determine what is causing the problem and can be sent to Seagate Support for examination and root cause analysis.
8. As a final step, if any extra side cards are available, install a different one in the SSU to determine if the LED behavior is different and results in the OSS nodes coming back online with all LEDs lit green on the front of the affected SSU.
9. From the primary MGMT node, connect to the even-numbered OSS node in the affected SSU:
[admin@n000]$ conman nodenameXXX-gem
This OSS node is hosted on the left-hand side card (as viewed from the rear of the chassis).
10. When connected, at the GEM prompt run:
GEM> gncli 3,all,all phydump
Verify that all PHYs labeled 'Drive' show the speed as 6Gbps as indicated below on the 36-port expanders.
PHY| Type  | Index | Flags | State | Speed  | Type | WWN
0  |Port   | 0  |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
1  |Port   | 0  |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
2  |Port   | 0  |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
3  |Port   | 0  |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
4  |Port   | 1  |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
5  |Port   | 1  |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
6  |Port   | 1  |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
7  |Port   | 1  |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
8  |Drive  | 20 |  | E L S |6.0Gbps| SAS |5000c50025e067c1
9  |Drive  | 23 |  | E L S |6.0Gbps| SAS |5000c50025e06775
10 |Drive  | 26 |  | E L S |6.0Gbps| SAS |5000c50025dff939
11 |Drive  | 27 |  | E L S |6.0Gbps| SAS |5000c50025e05e71
12 |Drive  | 24 |  | E L S |6.0Gbps| SAS |5000c50025dce9c9
13 |Drive  | 25 |  | E L S |6.0Gbps| SAS |5000c50025dce0a9
14 |Drive  | 21 |  | E L S |6.0Gbps| SAS |5000c50025dccba1
15 |Drive  | 22 |  | E L S |6.0Gbps| SAS |5000c50025e03dfd
16 |Drive  | 18 |  | E L S |6.0Gbps| SAS |5000c50025e07b49
17 |Drive  | 19 |  | E L S |6.0Gbps| SAS |5000c50025e066cd
18 |Drive  | 15 |  | E L S |6.0Gbps| SAS |5000c50025e08a95
19 |Drive  | 16 |  | E L S |6.0Gbps| SAS |5000c50025e0513d
20 |Drive  | 14 |  | E L S |6.0Gbps| SAS |5000c50025e08789
21 |Drive  | 17 |  | E L S |6.0Gbps| SAS |5000c50025e06995
22 |Drive  | 6  |  | E L S |6.0Gbps| SAS |5000c50025dfbb7d
23 |Drive  | 9  |  | E L S |6.0Gbps| SAS |5000c50025e041d9
24 |Drive  | 12 |  | E L S |6.0Gbps| SAS |5000c50025e05d1d
25 |Drive  | 13 |  | E L S |6.0Gbps| SAS |5000c50025dfbbd5
26 |Drive  | 10 |  | E L S |6.0Gbps| SAS |5000c50025e06099
27 |Drive  | 11 |  | E L S |6.0Gbps| SAS |5000c50025e03a59
28 |Drive  | 7  |  | E L S |6.0Gbps| SAS |5000c50025e04531
29 |Drive  | 8  |  | E L S |6.0Gbps| SAS |5000c50025e04155
30 |Drive  | 4  |  | E L S |6.0Gbps| SAS |5000c50025dfb115
31 |Drive  | 5  |  | E L S |6.0Gbps| SAS |5000c50025e04349
32 |Drive  | 1  |  | E L S |6.0Gbps| SAS |5000c50025e03c8d
33 |Drive  | 2  |  | E L S |6.0Gbps| SAS |5000c50025dc83ed
34 |Drive  | 0  |  | E L S |6.0Gbps| SAS |5000c50025e03fa1
35 |Drive  | 3  |  | E L S |6.0Gbps| SAS |5000c50025e067a9
And as shown here on the 24-port expanders:
PHY| Type  | Index | Flags | State | Speed  | Type | WWN
------------------------------------------------------------------
0  |Port   | 0 |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
1  |Port   | 0 |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
2  |Port   | 0 |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
3  |Port   | 0 |  | E L   |6.0Gbps| SAS |50050cc10ab11cbf
4  |Port   | 1 |  | E     |       |     |
5  |Port   | 2 |  | E     |       |     |
6  |Port   | 3 |  | E     |       |     |
7  |Port   | 4 |  | E     |       |     |
8  |Port   | 5 |  | E     |       |     |
9  |Port   | 6 |  | E     |       |     |
10 |Drive  | 6 |  | E L S |6.0Gbps| SAS |5000c50025e03efd
11 |Drive   | 9  |  | E L S |6.0Gbps| SAS |5000c50025e00505
12 |Drive   | 12 |  | E L S |6.0Gbps| SAS |5000c50025e06561
13 |Drive   | 13 |  | E L S |6.0Gbps| SAS |5000c50025e08a41
14 |Drive   | 10 |  | E L S |6.0Gbps| SAS |5000c50025e03c81
15 |Drive   | 11 |  | E L S |6.0Gbps| SAS |5000c50025e0849d
16 |Drive   | 7  |  | E L S |6.0Gbps| SAS |5000c50025e03de1
17 |Drive   | 8  |  | E L S |6.0Gbps| SAS |5000c50025e05b19
18 |Drive   | 4  |  | E L S |6.0Gbps| SAS |5000c50025e03b65
19 |Drive   | 5  |  | E L S |6.0Gbps| SAS |5000c50025e08471
20 |Drive   | 1  |  | E L S |6.0Gbps| SAS |5000c50025e07ed9
21 |Drive   | 2  |  | E L S |6.0Gbps| SAS |5000c50025e05da9
22 |Drive   | 0  |  | E L S |6.0Gbps| SAS |5000c50025e0672d
23 |Drive   | 3  |  | E L S |6.0Gbps| SAS |5000c50025e06361
24 |Virtual | 0  |  | E L   |       | SAS |50050cc10d01093e
11. Verify that no errors are seen on the link:
GEM> gncli 3,all,all ddump_phycounters
12. Wait for 10 minutes.
13. Verify again that no PHY errors are seen using the following command:
GEM> gncli 3,all,all ddump_phycounters
14. Run report_faults on the OSS module and verify that no errors are reported. Example:
GEM> report_faults
Drive Manager faults
No faults
Environmental Control faults
No faults
General Service faults
10 component(s) registered with fault tracker
Local faults: No faults
Remote faults: No faults
Human Interface Device faults
Ops Panel status:
Logic Fault LED : OFF
Module Fault LED : OFF
Warning:
Local:
Remote:
****No HID warnings to report****
No alarms to report
RemoteSync Client faults
No faults
Processor service faults
Board 0 Status: No faults
Board 1 Status: No faults
Power Manager faults
Running in minimal redundant mode
Sled Manager faults
No faults
15. Verify that all PHYs are at 6Gbps and there are no faults. If the data for the PHY error count is scrolling beyond the available buffer, review the GEM command results and output in the GEM log file for the currently connected node. The log file is located in /var/log/conman (for example, /var/log/conman/snx11000n002-gem.log).
16. Exit GEM:
GEM> &.
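If the PHY counter output has scrolled past the terminal buffer, the same data can also be reviewed afterwards from the GEM log on the MGMT node; the path follows the example given in step 15 and the node name is illustrative:
[admin@snx11000n000 ~]$ grep -i phy /var/log/conman/snx11000n002-gem.log | tail -40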
Replace a 5U84 OSS Controller
Prerequisites
Part number: 101171000: Controller Assy, Sonexion 2000 Turbo 32GB 2648L V2 FRU
Time: 1.5 hours
Interrupt levels:
● To remove and replace SSU controller: Failover (can be applied to a live system with no service interruption, but requires failover/failback operations)
● When a USM firmware update is needed: Interrupt. (Requires taking the Lustre file system offline. Perform a USM upgrade only if the firmware version is out of date.)
Tools:
● Console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kbps)
● Serial cable
● Reset tool (P/N 101150300)
● ESD strap
Requirements:
● The replacement OSS controller must have the correct USM firmware for your system. Consult Cray Hardware Product Support to obtain the correct files, and see USM Firmware Update Guide.
● When replacing ESM controllers in an SSU, the operator must only use controllers that have been received from Cray. Operators must not reuse controllers from one system to another, nor should they move controllers from one slot within a system to another.
● Before performing this operation, obtain a separate, stand-alone 5U84 (or similar) SSU that is not part of the existing cluster. Contact Cray Support to make arrangements.
About this task
This procedure includes steps to replace the failed controller and verify the operation of the new controller, and a procedure to check firmware in the new controller's Object Storage Server (OSS) and update it, if necessary.
Subtasks:
● Wipe HA Data from SSD on the OSS Controller
● Fail Over Controller and Verify State
● Replace the Controller
● Verify New Controller Function
● Output Example to Check USM Firmware on page 69
A chassis and two controllers are bundled in the modular SSU. Each controller hosts one OSS node; there are two OSS nodes per SSU. Within an SSU, OSS nodes are organized in an HA pair with sequential numbers (for example, n004 / n005). If an OSS node goes down because its controller fails, its resources migrate to the HA partner/OSS node in the other controller. A downed OSS node cannot be reached directly by the Sonexion system. Several steps in this procedure involve logging into the HA partner (on the other controller) to determine the downed node's status and whether its resources have successfully failed over to the HA partner.
● If the resources have failed over but the downed node is still online, it is placed into standby mode.
● If the resources have not failed over and the downed node is still online, it is placed into standby mode, which will cause its resources to fail over.
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully redundant state.
Procedure
Wipe HA Data from SSD on the OSS Controller
If the canister state is known, and it has been verified that canisters have not been used for lab or testing purposes, proceed to Fail Over Controller and Verify State. Perform the following steps on canisters that cannot be verified as new.
1. Obtain a copy of a Live USB image for the release of Sonexion software in use on the system. Contact Cray Support for the location of the image.
2. Burn the Live USB image onto a USB drive with a capacity of at least 8 GB. The Linux dd command is used to burn the Live USB image (cs15_live_usb.img) onto the USB drive (/dev/sdb). For example:
dd if=cs15_live_usb.img of=/dev/sdb bs=512k
To run dd and find the device file for the USB drive, refer to the instructions for the operating system.
3. Insert the controller to be downgraded into a powered off stand-alone SSU enclosure.
4. Insert the USB drive into one of the USB ports on the new controller.
5. Connect a serial cable from the console or PC to the new controller (serial port is on the rear panel).
6. Establish communication with the management node (n000) using one of the following two methods:
Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP address of the MGMT node (n000) to launch a terminal session, such as Putty, using the settings shown in the following table.
Table 6. Settings for MGMT Connection
Parameter          Setting
Bits per second    115200
Data bits          8
Parity             None
Stop bits          1
Flow control       None
The function keys are set to VT100+.
Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 26. Monitor and Keyboard Connections on the MGMT Node
7. Log on to the MGMT node (n000) as admin using the related password, as follows: login as: admin [email protected]’s password: password Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59 [admin@snx11000n000 ~]$ 8. Change to root user: $ sudo su 9. Power on the enclosure. If the controller does not power on automatically, use the OSS reset tool to press the Power switch. Figure 27. Controller Power and Reset Switches
10. Once the controller starts booting, interrupt the boot procedure by pressing the Esc key. The boot options menu appears. If necessary, press the Esc key several times before the boot options menu is shown.
11. From the menu, select an option that allows booting from a USB drive.
12. Wait for the operating system to boot completely to the shell prompt.
13. Locate the SSD's device file:
[node]$ sudo sg_map -i -x | grep SanDisk
For example:
[node]$ sudo sg_map -i -x | grep SanDisk
/dev/sg0 3 0 0 0 0 /dev/sda ATA SanDisk SSD U100 10.5
If this command does not produce any output, confirm that /dev/sda is the internal SSD:
[node]$ sudo sg_map -i -x
Or contact Cray support if there are any doubts.
14. Using the device file, run the following dd commands to wipe the HA data:
[node]$ sudo dd if=/dev/zero of=/dev/sdXX bs=1M count=8192 oflag=direct
For example:
[node]$ sudo dd if=/dev/zero of=/dev/sda bs=1M count=8192 oflag=direct
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 73.0721 s, 118 MB/s
Then run:
[node]$ sudo dd if=/dev/zero of=/dev/sdXX bs=1M seek=22341 oflag=direct
The expected output for this command is: No space left on device. For example:
[node]$ sudo dd if=/dev/zero of=/dev/sda bs=1M seek=22341 oflag=direct
dd: writing `/dev/sda': No space left on device
8193+0 records in
8192+0 records out
8590811136 bytes (8.6 GB) copied, 72.362 s, 119 MB/s
15. Remove the serial cable from the new controller that connects to the console or PC.
16. Remove the USB drive.
17. Remove the controller from the powered off stand-alone SSU enclosure.
Fail Over Controller and Verify State
If the canister state is unknown or if canisters might have been used for lab or testing purposes, execute the steps in Wipe HA Data from SSD on the OSS Controller.
When an OSS node is down, it cannot be reached directly by the Sonexion system. Use the following steps to log in to the HA partner on the other controller to determine the downed node's status (placing it into standby mode if necessary), and make sure its resources have failed over before replacing the failed controller.
18. Determine the physical and logical location (hostname) of the failed controller in the SSU.
19. Log in to the primary MGMT node via SSH:
[Client]$ ssh -l admin primary_MGMT_node
20. Determine which node is acting as the primary MGMT node:
[admin@n000]$ cscli show_nodes
For example:
[admin@n000]$ cscli show_nodes
Hostname      Role          Power State  Service State  Targets  HA Partner          HA Resources
snx11000n000  MGMT          On           -----          0 / 0    snx11000n001        -----
snx11000n001  (MGMT)        On           -----          0 / 0    snx11000n000        -----
snx11000n002  (MDS),(MGS)   On           N/a            0 / 0    snx11000n003        None
snx11000n003  (MDS),(MGS)   On           Stopped        0 / 1    snx11000n002        Local
snx11000n004  (OSS)         Off          Stopped        0 / 1    snx11000n005        Local
snx11000n005  (OSS)         On           Stopped        0 / 1    snx11000n004        Local
snx11000n006  (CIFS),(NFS)  On           Stopped        0 / 0    snx11000n[006-013]  -----
snx11000n007  (CIFS),(NFS)  On           Stopped        0 / 0    snx11000n[006-013]  -----
snx11000n008  (CIFS),(NFS)  On           Stopped        0 / 0    snx11000n[006-013]  -----
snx11000n009  (CIFS),(NFS)  On           Stopped        0 / 0    snx11000n[006-013]  -----
snx11000n010  (CIFS),(NFS)  On           Stopped        0 / 0    snx11000n[006-013]  -----
snx11000n011  (CIFS),(NFS)  On           Stopped        0 / 0    snx11000n[006-013]  -----
snx11000n012  (CIFS),(NFS)  On           Stopped        0 / 0    snx11000n[006-013]  -----
snx11000n013  (CIFS),(NFS)  On           Stopped        0 / 0    snx11000n[006-013]  -----
In the example above, the secondary MGMT node is identifiable by the parentheses around it (in row snx11000n001), while the primary MGMT node has no parentheses in row snx11000n000. If the incorrect MGMT node was used to run cscli show_nodes, the following error message is issued:
[MGMT01]$ cscli show_nodes
cscli: Please, run cscli on active management node
If this occurs, log in to the other MGMT node via SSH.
21. From the primary MGMT node, fail over the resources from the affected OSS node to its HA partner:
[admin@n000]$ cscli failover -n nodes
Where nodes are the names of the node(s) that require failover. For example, if there is a failure on OSS node n004 and its resources need to fail over to node n005 on cluster TEST, the command is as follows:
[admin@n000]$ cscli failover -n n004
NOTE: Use the following steps to verify the state of the failed controller. When an OSS node is down, it cannot be reached directly by the Sonexion system. You must log in to the HA partner on the other controller to determine the downed node's status (placing it into standby mode if necessary), and make sure its resources have failed over before replacing the failed controller.
22. Log in via SSH to the HA partner of the OSS node that is hosted on the failed controller:
[admin@n000]$ ssh HA_partner_node_hostname
23. Use the crm_mon utility to display the status of both OSS nodes:
[nodeXY]$ sudo crm_mon -1
When both OSS nodes are online with their resources assigned to them, the crm_mon -1 output looks as follows:
[admin@snx11000n004 ~]$ sudo crm_mon -1
============
Last updated: Wed Jul 30 13:12:06 2014
Replace a 5U84 OSS Controller
Last change: Wed Jul 30 13:10:02 2014 via crm_resource on snx11000n005 Stack: Heartbeat Current DC: snx11000n004 (191d0bb0-80da-4715-b2c9-618af928458a) - partition with quorum Version: 1.1.6.1-6.el6-0c7312c689715e096b716419e2ebc12b57962052 2 Nodes configured, unknown expected votes 35 Resources configured. ============ Online: [ snx11000n004 snx11000n005 ] snx11000n005-1-ipmi-stonith (stonith:external/ipmi): Started snx11000n005 snx11000n005-2-ipmi-stonith (stonith:external/ipmi): Started snx11000n005 snx11000n004-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n004 snx11000n004-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n004 Clone Set: cln-kdump-stonith [kdump-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-ssh-10-stonith [ssh-10-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-gem-stonith [gem-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-ssh-stonith [ssh-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-phydump-stonith [phydump-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-last-stonith [last-stonith] Started: [ snx11000n004 snx11000n005 ] snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004 snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n005 baton (ocf::heartbeat:baton): Started snx11000n005 Clone Set: cln-diskmonitor [diskmonitor] Started: [ snx11000n004 snx11000n005 ] snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004 snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005 Resource Group: snx11000n004_md0-group snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004 snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004 snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004 Resource Group: snx11000n004_md1-group snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n005 snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005 snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005 snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n005 snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005 snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n005 When the OSS node on the failed controller is in standby mode and its resources have failed over to its HA partner, the crm_mon -1 output looks like this: [admin@snx11000n004 ~]$ sudo crm_mon -1 2 Nodes configured, unknown expected votes 35 Resources configured. ============ Online: [ snx11000n004 ] OFFLINE: [ snx11000n005 ] snx11000n004-3-ipmi-stonith (stonith:external/ipmi): snx11000n004
Started
snx11000n004-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n004 Clone Set: cln-kdump-stonith [kdump-stonith] Started: [ snx11000n004 ] Stopped: [ kdump-stonith:1 ] Clone Set: cln-ssh-10-stonith [ssh-10-stonith] Started: [ snx11000n004 ] Stopped: [ ssh-10-stonith:1 ] Clone Set: cln-gem-stonith [gem-stonith] Started: [ snx11000n004 ] Stopped: [ gem-stonith:1 ] Clone Set: cln-ssh-stonith [ssh-stonith] Started: [ snx11000n004 ] Stopped: [ ssh-stonith:1 ] Clone Set: cln-phydump-stonith [phydump-stonith] Started: [ snx11000n004 ] Stopped: [ phydump-stonith:1 ] Clone Set: cln-last-stonith [last-stonith] Started: [ snx11000n004 ] Stopped: [ last-stonith:1 ] snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004 baton (ocf::heartbeat:baton): Started snx11000n004 Clone Set: cln-diskmonitor [diskmonitor] Started: [ snx11000n004 ] Stopped: [ diskmonitor:1 ] snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004 Resource Group: snx11000n004_md0-group snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004 snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004 snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004 Resource Group: snx11000n004_md1-group snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004 snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004 snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n004 In the above example, the OSS node on the failed controller is snx11000n005 and its HA partner is snx11000n004. Replace the Controller Use the following steps to collect the ddump and BMC logs (if possible), power off the controller, and remove. 24. From the primary MGMT node run: [admin@n000]$ conman nodeXX-gem Where nodeXX is the failed node.
The GEM command prompt then appears. For example:
[admin@snx11000n000]$ conman snx11000n005-gem
Connection to console [snx11000n005gem] opened.
11+01:54:53.622 M0 GEM>
25. At this GEM prompt, run:
GEM> ddump
This takes more than 7 minutes to complete. The output from this command is placed in /var/log/conman. For example:
------ 1 root root 2730380 Jul 25 04:10 snx11000n005-gem.log
26. Exit GEM:
GEM> &.
27. If the failed controller does not respond to IPMI commands, retrieve the BMC sel list as follows, if it needs to be collected separately.
[admin@n000]$ ssh nodeXX ipmitool sel list
For example:
[admin@snx11000n000 /]$ ssh snx11000n005 ipmitool sel list
1 | 07/24/2014 | 13:25:13 | Event Logging Disabled #0x10 | Log area reset/cleared | Asserted
2 | 07/24/2014 | 13:25:14 | System Event #0x20 | Timestamp Clock Sync | Asserted
3 | 07/24/2014 | 13:25:16 | System Event #0x20 | Timestamp Clock Sync | Asserted
4 | 07/24/2014 | 13:25:20 | System Event #0x20 | Timestamp Clock Sync | Asserted
5 | 07/24/2014 | 13:25:21 | System Event #0x20 | Timestamp Clock Sync | Asserted
28. If the failed controller responds to IPMI commands, collect BMC logs from the primary MGMT node as follows, if they need to be collected separately:
[admin@n000]$ ipmitool -U admin -P admin -H nodeXY-ipmi sel list | tee -a ~/bmc_logs_nodeXY_`date +%s`
The output on the screen and output in the file will be the same. The file will be stored in the admin home directory (/home/admin). For example:
[admin@snx11000n000 /]$ ipmitool -U admin -P admin -H snx11000n005-ipmi sel list | tee -a ~/bmc_logs_snx11000n005_`date +%s`
1 | 07/24/2014 | 13:25:13 | Event Logging Disabled #0x10 | Log area reset/cleared | Asserted
2 | 07/24/2014 | 13:25:14 | System Event #0x20 | Timestamp Clock Sync | Asserted
3 | 07/24/2014 | 13:25:16 | System Event #0x20 | Timestamp Clock Sync | Asserted
4 | 07/24/2014 | 13:25:20 | System Event #0x20 | Timestamp Clock Sync | Asserted
5 | 07/24/2014 | 13:25:21 | System Event #0x20 | Timestamp Clock Sync | Asserted
Contact Cray Support if any problems are encountered at any stage when retrieving these logs.
29. Power off the failed controller by running the following from the primary MGMT node: [admin@n000]$ cscli power_manage -n nodeXX --power-off 30. Unplug the two RJ-45 network cables. IMPORTANT: Make certain to mark which cable is connected to each port. The ports are numerically labeled. Make certain the cables are connected to the same ports when the new OSS controller is installed. 31. Unplug the InfiniBand (or 40GbE) cable. 32. Unplug the SAS cables if using SSU + n configuration. 33. Unplug the LSI HBA SAS Loopback cable from the two SAS ports. 34. Remove the failed controller from the SSU (using the locking lever to slide out the controller from the back of the rack). 35. Check the new controller for a dust cover cap and remove the one that corresponds to the InfiniBand cable removed earlier (Port II). 36. Insert the new controller halfway into the SSU, but do not seat it in the enclosure. 37. Connect the cables to the new controller. a. Plug in the RJ-45 network cable. b. Plug in the InfiniBand (or 40GbE) cable to the original port it came from (Port II). c.
Plug in the SAS cables if using SSU + n configuration.
d. Plug in the LSI HBA SAS Loopback cable between the two SAS ports. 38. Connect a serial cable from the console or PC to the new controller (serial port is on the rear panel). 39. Open a terminal session using either method described in Connect to the MGMT Node. This serial connection allows monitoring the boot and discovery process but is not needed to complete the procedure. 40. Completely insert the new controller into the SSU (until the locking lever engages and the unit is properly seated in the chassis) and power on the controller. Use the Reset Tool (P/N 101150300) to press the Power On/Off button (hold for 3 seconds, and then release) on the rear panel of the OSS. As shown in the following figure, the controller has two buttons, which are not labeled. The button on the left is the power button, while the button on the right is the reset button.
Figure 28. Controller Power and Reset Switches
When a serial connection is being monitored, the following steps normally occur during replacement: a. The new controller reboots into discovery mode, initiating discovery. b. The controller automatically reboots with the correct hostname and restores the HA configuration. c.
The controller reboots again and becomes completely operational.
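While the replacement controller works through the reboots described above, its progress can also be followed from the primary MGMT node instead of the serial console; the following is a minimal sketch, assuming passwordless SSH from the MGMT node, that sudo does not require a terminal, and using the HA partner (snx11000n004) from this example.
# Poll the partner's cluster view once a minute until the replaced node shows Online.
watch -n 60 "ssh snx11000n004 sudo crm_mon -1 | grep -i online"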
41. Wait for the discovery procedure to complete (this will take about 10 minutes). IMPORTANT: If the replacement controller has been removed or swapped from another system, follow the procedure described in Wipe HA Data from SSD on the OSS Controller. Otherwise the auto-discovery process can fail due to left over HA data or a duplicate BMC IP address set on the FRU. 42. Log in to the OSS node hosted on the controller (username admin and the customer's password). Verify New Controller Function 43. After the controller reboots, verify that the new controller is online: [snx11000nxxx]$ sudo crm_mon -1 When both OSS nodes are online with their resources assigned to them, the node status line changes from the following (as seen in step 23 on page 60): Online: [ snx11000n004 ] OFFLINE: [ snx11000n005 ] To: Online: [ snx11000n004 snx11000n005] For example: [admin@snx11000n004 ~]$ sudo crm_mon -1 2 Nodes configured, unknown expected votes 35 Resources configured. ============ Online: [ snx11000n004 snx11000n005] snx11000n004-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n004 snx11000n004-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n004 Clone Set: cln-kdump-stonith [kdump-stonith]
Started: [ snx11000n004 ] Stopped: [ kdump-stonith:1 ] Clone Set: cln-ssh-10-stonith [ssh-10-stonith] Started: [ snx11000n004 ] Stopped: [ ssh-10-stonith:1 ] Clone Set: cln-gem-stonith [gem-stonith] Started: [ snx11000n004 ] Stopped: [ gem-stonith:1 ] Clone Set: cln-ssh-stonith [ssh-stonith] Started: [ snx11000n004 ] Stopped: [ ssh-stonith:1 ] Clone Set: cln-phydump-stonith [phydump-stonith] Started: [ snx11000n004 ] Stopped: [ phydump-stonith:1 ] Clone Set: cln-last-stonith [last-stonith] Started: [ snx11000n004 ] Stopped: [ last-stonith:1 ] snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004 baton (ocf::heartbeat:baton): Started snx11000n004 Clone Set: cln-diskmonitor [diskmonitor] Started: [ snx11000n004 ] Stopped: [ diskmonitor:1 ] snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004 Resource Group: snx11000n004_md0-group snx11000n004_md0-wibr (ocf::heartbeat:XYRAID):Started snx11000n004 snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID):Started snx11000n004 snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR):Started snx11000n004 snx11000n004_md0-raid (ocf::heartbeat:XYRAID):Started snx11000n004 snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR):Started snx11000n004 snx11000n004_md0-stop (ocf::heartbeat:XYSTOP):Started snx11000n004 Resource Group: snx11000n004_md1-group snx11000n004_md1-wibr (ocf::heartbeat:XYRAID):Started snx11000n004 snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID):Started snx11000n004 snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR):Started snx11000n004 snx11000n004_md1-raid (ocf::heartbeat:XYRAID):Started snx11000n004 snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR):Started snx11000n004 snx11000n004_md1-stop (ocf::heartbeat:XYSTOP):Started snx11000n004 44. If the crm_mon -1 node status still shows the new node as offline after 15 minutes, verify the state of the xybridge driver and xyvnic0 port: [snx11000nxxx]$ ifconfig xyvnic0 Where [snx11000nxxx] is the node containing the significantly different output. For example: [admin@snx11000n004 ~]$ ifconfig xyvnic0 xyvnic0 Link encap:Ethernet HWaddr 80:00:E0:5B:7B:C2 inet addr:203.0.113.1 Bcast:203.0.113.255 Mask:255.255.255.0 UP BROADCAST RUNNING MULTICAST MTU:900 Metric:1 RX packets:41151 errors:0 dropped:0 overruns:0 frame:0 TX packets:40752 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:9632555 (9.1 MiB) TX bytes:9711629 (9.2 MiB) 45. If there is any other output, update the xybridge firmware and restart the xyvnic driver.
a. Determine if an xybridge firmware update is available: [snx11000nxxx]$ sudo xrtx_bridgefw -c The output will be similar to the following if an update is available: [admin@snx11000n004 ~]$ sudo xrtx_bridgefw -c Current:c3822a46 Update:98229a3e b. Update the xybridge firmware, if an update is available: [snx11000nxxx]$ sudo xrtx_bridgefw -u c.
Reboot the affected node: [snx11000nxxx]$ sudo reboot
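The check, update, and reboot in step 45 can be combined so the node is only rebooted when new firmware was actually staged; the following is a minimal sketch, assuming the Current:/Update: output format shown in the example above (if a given release reports versions differently, compare them manually instead).
# Run the check, and only apply the update and reboot when the reported
# Current and Update identifiers differ.
out=$(sudo xrtx_bridgefw -c)
cur=$(printf '%s\n' "$out" | grep -o 'Current:[^[:space:]]*')
upd=$(printf '%s\n' "$out" | grep -o 'Update:[^[:space:]]*')
if [ -n "$upd" ] && [ "${cur#Current:}" != "${upd#Update:}" ]; then
    sudo xrtx_bridgefw -u && sudo reboot
fi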
46. If the output from step 44 on page 66 matches the example, but no HA is present, attempt to ping the neighbor controller over the xyvnic0. For the even-numbered nodes the IP address is always 203.0.113.1 and for odd-numbered nodes the address is always 203.0.113.2: [snx11000nxxx]$ ping 203.0.113.2 For example: [admin@snx11000n004 ~]$ ping 203.0.113.2 PING 203.0.113.2 (203.0.113.2) 56(84) bytes of data. 64 bytes from 203.0.113.2: icmp_seq=1 ttl=64 time=0.786 ms 64 bytes from 203.0.113.2: icmp_seq=2 ttl=64 time=0.947 ms ^C — 203.0.113.2 ping statistics — 2 packets transmitted, 2 received, 0% packet loss, time 1742ms rtt min/avg/max/mdev = 0.786/0.866/0.947/0.085 ms a. If the above response appears, attempt to reboot the affected node. If there is no response from the neighbor controller, check the firmware of xybridge and possibly update it. Follow all substeps in step 45 on page 66. b. If all of the above steps failed, power-cycle the SSU whose controller was replaced. At this point it may be necessary to shut down the filesystem first or stop the IO (if any). 47. Log in to the node with the new controller via SSH: [admin@n000]$ ssh -l admin node_with_new_controller 48. Check the USM firmware version running on the new controller: [n004]$ sudo /lib/firmware/gem_usm/xyr_usm_sbb-onestor_r3.20_rc1_rel-470/ fwtool.sh -c Partial sample output: root: XYRATEX:AMI_FW: Root Current Ver: 1.44.0000 root: XYRATEX:AMI_FW: Root New Ver: 1.44.0000 root: XYRATEX:AMI_FW: Root Backup Ver: 1.44.0000
root: XYRATEX:AMI_FW: Boot Current Ver: 1.40.0005
root: XYRATEX:AMI_FW: Boot New Ver: 1.40.0005
root: XYRATEX:AMI_FW: BIOS Current Ver: 0.39.0006
root: XYRATEX:AMI_FW: BIOS New Ver: 0.39.0006
root: XYRATEX:AMI_FW: BIOS Backup Ver: 0.37.0000
root: XYRATEX:AMI_FW: CPLD Current Ver: 0.16.0001
root: XYRATEX:AMI_FW: CPLD New Ver: 0.16.0001
root: XYRATEX:FWTOOL: Version checking done
To see the complete output from the previous fwtool.sh –c command, see Output Example to Check USM Firmware on page 69. If the firmware version on the new controller is the same as the other controller, go to step 50 on page 68. 49. Update the firmware version on the new controller if it is running a different version. Refer to USM Firmware Update Guide. 50. From the primary MGMT node, fail back the resources to balance the load between the affected nodes: [admin@n000]$ cscli failback -n nodes Where nodes are the names of the node(s) that previously failed over. For example: [admin@n000]$ cscli failback -n x004 51. After the controller reboots, verify that the new controller is online: [snx11000nxxx]$ sudo crm_mon -1 The above command should return the following output: ============ Last updated: Wed Jul 30 13:12:06 2014 Last change: Wed Jul 30 13:10:02 2014 via crm_resource on snx11000n005 Stack: Heartbeat Current DC: snx11000n004 (191d0bb0-80da-4715-b2c9-618af928458a) - partition with quorum Version: 1.1.6.1-6.el6-0c7312c689715e096b716419e2ebc12b57962052 2 Nodes configured, unknown expected votes 35 Resources configured. ============ Online: [ snx11000n004 snx11000n005 ] snx11000n005-1-ipmi-stonith (stonith:external/ipmi): Started snx11000n005 snx11000n005-2-ipmi-stonith (stonith:external/ipmi): Started snx11000n005 snx11000n004-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n004 snx11000n004-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n004 Clone Set: cln-kdump-stonith [kdump-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-ssh-10-stonith [ssh-10-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-gem-stonith [gem-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-ssh-stonith [ssh-stonith] Started: [ snx11000n004 snx11000n005 ] Clone Set: cln-phydump-stonith [phydump-stonith] Started: [ snx11000n004 snx11000n005 ]
Clone Set: cln-last-stonith [last-stonith] Started: [ snx11000n004 snx11000n005 ] snx11000n004_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n004 snx11000n005_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n005 baton (ocf::heartbeat:baton): Started snx11000n005 Clone Set: cln-diskmonitor [diskmonitor] Started: [ snx11000n004 snx11000n005 ] snx11000n004_ibstat (ocf::heartbeat:ibstat): Started snx11000n004 snx11000n005_ibstat (ocf::heartbeat:ibstat): Started snx11000n005 Resource Group: snx11000n004_md0-group snx11000n004_md0-wibr (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-jnlr (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-wibs (ocf::heartbeat:XYMNTR): Started snx11000n004 snx11000n004_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n004 snx11000n004_md0-fsys (ocf::heartbeat:XYMNTR): Started snx11000n004 snx11000n004_md0-stop (ocf::heartbeat:XYSTOP): Started snx11000n004 Resource Group: snx11000n004_md1-group snx11000n004_md1-wibr (ocf::heartbeat:XYRAID): Started snx11000n005 snx11000n004_md1-jnlr (ocf::heartbeat:XYRAID): Started snx11000n005 snx11000n004_md1-wibs (ocf::heartbeat:XYMNTR): Started snx11000n005 snx11000n004_md1-raid (ocf::heartbeat:XYRAID): Started snx11000n005 snx11000n004_md1-fsys (ocf::heartbeat:XYMNTR): Started snx11000n005 snx11000n004_md1-stop (ocf::heartbeat:XYSTOP): Started snx11000n005 52. If the terminal connection (console or PC) is still active, terminate it and disconnect the serial cable from the new controller.
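To compare the USM firmware versions reported on both controllers (steps 48 and 49) without reading the full output by eye, the version lines can be captured from each OSS node and diffed. The following is a minimal sketch, run from the primary MGMT node; the node names and the fwtool.sh path are the ones used in the examples above and may differ on other systems, and it assumes passwordless SSH and that sudo does not require a terminal.
# Capture only the version lines from each controller and compare them.
FWTOOL=/lib/firmware/gem_usm/xyr_usm_sbb-onestor_r3.20_rc1_rel-470/fwtool.sh
ssh snx11000n004 "sudo $FWTOOL -c" | grep -E 'Ver:|revision' > /tmp/fw_snx11000n004.txt
ssh snx11000n005 "sudo $FWTOOL -c" | grep -E 'Ver:|revision' > /tmp/fw_snx11000n005.txt
diff /tmp/fw_snx11000n004.txt /tmp/fw_snx11000n005.txt && echo "Firmware versions match."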
Output Example to Check USM Firmware The following is the complete output for step 48 on page 67 in Replace a 5U84 OSS Controller on page 55 [admin@snx11000n005 ~]$ sudo /lib/firmware/gem_usm/xyr_usm_sbbonestor_r3.20_rc1_rel-470/fwtool.sh -c root: XYRATEX:FWTOOL: Searching for Canisters... root: XYRATEX:FWTOOL: Found 1 Canisters: root: XYRATEX:FWTOOL: PID: UD-8435-CS-9000. WWN: 50050CC10E05547E Xyratex FWDownloader v3.51 Scanning SES devices...Please Wait...2 SES devices found. ----------SES Device 0 Addr='/dev/sg1' Found SES Page 1 for Device 0 (/dev/sg1) size 284 (0x11c) WWN: 50050cc10c4000f5 Vendor ID: XYRATEX Product ID: UD-8435-CS-9000 Product Revision: 3519 S/N: SHX0965000G02FX --------------------SES Device 1 Addr='/dev/sg90' Found SES Page 1 for Device 1 (/dev/sg90) size 284 (0x11c) WWN: 50050cc10c4000f5 Vendor ID: XYRATEX Product ID: UD-8435-CS-9000 Product Revision: 3519 S/N: SHX0965000G02FX ----------Performing Check /dev/sg1 UD-8435-CS-9000 Checking canister in Slot 1 Check Canister /dev/sg1 UD-8435-CS-9000 Xyratex FWDownloader v3.51
3.51 Apply rule file rf_cs6000_check_slot1.gem to download file to Enclosure / dev/sg1 using SES page 0x0E. ----------SES Device Addr='/dev/sg1' Found SES Page 1 for Device /dev/sg1 size 284 (0x11c) WWN: 50050cc10c4000f5 Vendor ID: XYRATEX Product ID: UD-8435-CS-9000 Product Revision: 3519 S/N: SHX0965000G02FX ----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb) InterpreterVersion : 2.33 PackageVersion : 2.0 DESCRIPTION: Show Sati 2 GEM Version and Titan (xyr_usm_sbb-onestor_r3.20_rc1_rel-470)
VERSION: 1.0.1 Copyright 2012 Xyratex Inc. Enclosure Services Controller Electronics BootLoader SHOW elec_2; Boot Loader revision : 0503 MATCH Upgrade version : 0503 Enclosure Services Controller Electronics local Firmware SHOW elec_2; Firmware revision : 03050019 MATCH Upgrade version : 03050019 Enclosure Services Controller Electronics CPLD SHOW elec_2; CPLD revision : 17 MATCH Upgrade version : 17 Enclosure Services Controller Electronics Flash Config SHOW elec_2; Flash Config data CRC : d1b030a4 MATCH Upgrade version : d1b030a4 Enclosure Services Controller Electronics VPD CRC SHOW elec_2; VPD CRC : ee3504b4 MATCH Upgrade version : ee3504b4 Performing Check Check Midplane /dev/sg1 UD-8435-CS-9000 Check Fans /dev/sg1 UD-8435-CS-9000 Check Sideplane /dev/sg1 UD-8435-CS-9000 Check Sideplane /dev/sg1 UD-8435-CS-9000 Xyratex FWDownloader v3.51 3.51 Apply rule file rf_5u84_mid_check.gem to download file to Enclosure /dev/ sg1 using SES page 0x0E. ----------SES Device Addr='/dev/sg1' Found SES Page 1 for Device /dev/sg1 size 284 (0x11c) WWN: 50050cc10c4000f5 Vendor ID: XYRATEX Product ID: UD-8435-CS-9000 Product Revision: 3519 S/N: SHX0965000G02FX ----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb) InterpreterVersion : 2.33 PackageVersion : 2.0 DESCRIPTION: Show GEM Version on 5u84 and NEO 3000
VERSION: 1.0.1 Copyright 2012 Xyratex Inc. Enclosure cpld SHOW elec_1; CPLD revision : 05 MATCH Upgrade version : 05 Enclosure vpdcrc SHOW elec_1; VPD CRC : 0e74e375 MATCH Upgrade version : 0e74e375 Xyratex FWDownloader v3.51 3.51 Apply rule file rf_5u84_fan_check.gem to download file to Enclosure /dev/ sg1 using SES page 0x0E. ----------SES Device Addr='/dev/sg1' Found SES Page 1 for Device /dev/sg1 size 284 (0x11c) WWN: 50050cc10c4000f5 Vendor ID: XYRATEX Product ID: UD-8435-CS-9000 Product Revision: 3519 S/N: SHX0965000G02FX ----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb) InterpreterVersion : 2.33 PackageVersion : 2.0 DESCRIPTION: Show GEM Version on 5u84 and NEO 3000
VERSION: 1.0.1 Copyright 2012 Xyratex Inc. Fan
Slot1 Controller Config
SHOW elec_1; Controller Config : 0960837-04_0 MATCH Upgrade version : 0960837-04_0 Fan Slot1 Controller Firmware SHOW elec_1; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade version : ucd90124a|2.3.6.0000|110809 Fan Slot2 Controller Config SHOW elec_2; Controller Config : 0960837-04_0 MATCH Upgrade version : 0960837-04_0 Fan Slot2 Controller Firmware SHOW elec_2; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade version : ucd90124a|2.3.6.0000|110809 Fan Slot3 Controller Config SHOW elec_3; Controller Config : 0960837-04_0 MATCH Upgrade version : 0960837-04_0 Fan Slot3 Controller Firmware SHOW elec_3; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade version : ucd90124a|2.3.6.0000|110809 Fan Slot4 Controller Config
SHOW elec_4; Controller Config : 0960837-04_0 MATCH Upgrade version : 0960837-04_0 Fan Slot4 Controller Firmware SHOW elec_4; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade version : ucd90124a|2.3.6.0000|110809 Fan Slot5 Controller Config SHOW elec_5; Controller Config : 0960837-04_0 MATCH Upgrade version : 0960837-04_0 Fan Slot5 Controller Firmware SHOW elec_5; Controller Firmware : ucd90124a|2.3.6.0000|110809 MATCH Upgrade version : ucd90124a|2.3.6.0000|110809 Xyratex FWDownloader v3.51 3.51 Apply rule file rf_sideplane_s0_check.gem to download file to Enclosure / dev/sg1 using SES page 0x0E. ----------SES Device Addr='/dev/sg1' Found SES Page 1 for Device /dev/sg1 size 284 (0x11c) WWN: 50050cc10c4000f5 Vendor ID: XYRATEX Product ID: UD-8435-CS-9000 Product Revision: 3519 S/N: SHX0965000G02FX ----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb) InterpreterVersion : 2.33 PackageVersion : 2.0 DESCRIPTION: Show Sat 2 GEM Version and Ttan (xyr_usm_sbb-onestor_r3.20_rc1_rel-470)
VERSION: 1.0.1 Copyright 2012 Xyratex Inc. SAS Expander
BootLoader
SHOW elec_3; Boot Loader revision : 0610 MATCH Upgrade version : 0610 SAS Expander Firmware SHOW elec_3; Firmware revision : 03050019 MATCH Upgrade version : 03050019 SAS Expander CPLD SHOW elec_3; CPLD revision : 10 MATCH Upgrade version : 10 SAS Expander Flash Config SHOW elec_3; Flash Config data CRC : 6591a901 MATCH Upgrade version : 6591a901 SAS Expander VPD SHOW elec_3; VPD revision : 06 MATCH Upgrade version : 06 SAS Expander VPD CRC SHOW elec_3; VPD CRC : 38b21cf7 MATCH Upgrade version : 38b21cf7 SAS Expander BootLoader SHOW elec_4; Boot Loader revision : 0610 MATCH Upgrade version : 0610 SAS Expander Firmware
SHOW elec_4; Firmware revision : 03050019 MATCH Upgrade version : 03050019 SAS Expander CPLD SHOW elec_4; CPLD revision : 10 MATCH Upgrade version : 10 SAS Expander Flash Config SHOW elec_4; Flash Config data CRC : fa337b92 MATCH Upgrade version : fa337b92 SAS Expander VPD SHOW elec_4; VPD revision : 06 MATCH Upgrade version : 06 SAS Expander VPD CRC SHOW elec_4; VPD CRC : 152e4ad9 MATCH Upgrade version : 152e4ad9 SAS Expander BootLoader SHOW elec_7; Boot Loader revision : 0610 MATCH Upgrade version : 0610 SAS Expander Firmware SHOW elec_7; Firmware revision : 03050019 MATCH Upgrade version : 03050019 SAS Expander CPLD SHOW elec_7; CPLD revision : 10 MATCH Upgrade version : 10 SAS Expander Flash Config SHOW elec_7; Flash Config data CRC : 6591a901 MATCH Upgrade version : 6591a901 SAS Expander VPD SHOW elec_7; VPD revision : 06 MATCH Upgrade version : 06 SAS Expander VPD CRC SHOW elec_7; VPD CRC : 38b21cf7 MATCH Upgrade version : 38b21cf7 SAS Expander BootLoader SHOW elec_8; Boot Loader revision : 0610 MATCH Upgrade version : 0610 SAS Expander Firmware SHOW elec_8; Firmware revision : 03050019 MATCH Upgrade version : 03050019 SAS Expander CPLD SHOW elec_8; CPLD revision : 10 MATCH Upgrade version : 10 SAS Expander Flash Config SHOW elec_8; Flash Config data CRC : fa337b92 MATCH Upgrade version : fa337b92 SAS Expander VPD SHOW elec_8; VPD revision : 06 MATCH Upgrade version : 06 SAS Expander VPD CRC SHOW elec_8; VPD CRC : 152e4ad9 MATCH Upgrade version : 152e4ad9 Xyratex FWDownloader v3.51 3.51 Apply rule file rf_sideplane_s1_check.gem to download file to Enclosure / dev/sg1 using SES page 0x0E. ----------SES Device Addr='/dev/sg1' Found SES Page 1 for Device /dev/sg1 size 284 (0x11c) WWN: 50050cc10c4000f5 Vendor ID: XYRATEX Product ID: UD-8435-CS-9000 Product Revision: 3519 S/N: SHX0965000G02FX ----------Found SES Page 7 for Device /dev/sg1 size 8955 (0x22fb)
InterpreterVersion : 2.33 PackageVersion : 2.0 DESCRIPTION: Show Sat 2 GEM Version and Ttan (xyr_usm_sbb-onestor_r3.20_rc1_rel-470)
VERSION: 1.0.1 Copyright 2012 Xyratex Inc. SAS Expander
BootLoader
SHOW elec_1; Boot Loader revision : 0610 MATCH Upgrade version : 0610 SAS Expander Firmware SHOW elec_1; Firmware revision : 03050019 MATCH Upgrade version : 03050019 SAS Expander CPLD SHOW elec_1; CPLD revision : 10 MATCH Upgrade version : 10 SAS Expander Flash Config SHOW elec_1; Flash Config data CRC : 687fe3f1 MATCH Upgrade version : 687fe3f1 SAS Expander VPD SHOW elec_1; VPD revision : 06 MATCH Upgrade version : 06 SAS Expander VPD CRC SHOW elec_1; VPD CRC : 2737a206 MATCH Upgrade version : 2737a206 SAS Expander BootLoader SHOW elec_2; Boot Loader revision : 0610 MATCH Upgrade version : 0610 SAS Expander Firmware SHOW elec_2; Firmware revision : 03050019 MATCH Upgrade version : 03050019 SAS Expander CPLD SHOW elec_2; CPLD revision : 10 MATCH Upgrade version : 10 SAS Expander Flash Config SHOW elec_2; Flash Config data CRC : 11eefeaf MATCH Upgrade version : 11eefeaf SAS Expander VPD SHOW elec_2; VPD revision : 06 MATCH Upgrade version : 06 SAS Expander VPD CRC SHOW elec_2; VPD CRC : f027d510 MATCH Upgrade version : f027d510 SAS Expander BootLoader SHOW elec_5; Boot Loader revision : 0610 MATCH Upgrade version : 0610 SAS Expander Firmware SHOW elec_5; Firmware revision : 03050019 MATCH Upgrade version : 03050019 SAS Expander CPLD SHOW elec_5; CPLD revision : 10 MATCH Upgrade version : 10 SAS Expander Flash Config SHOW elec_5; Flash Config data CRC : 687fe3f1 MATCH Upgrade version : 687fe3f1 SAS Expander VPD
SHOW elec_5; VPD revision : 06 MATCH Upgrade version : 06 SAS Expander VPD CRC SHOW elec_5; VPD CRC : 2737a206 MATCH Upgrade version : 2737a206 SAS Expander BootLoader SHOW elec_6; Boot Loader revision : 0610 MATCH Upgrade version : 0610 SAS Expander Firmware SHOW elec_6; Firmware revision : 03050019 MATCH Upgrade version : 03050019 SAS Expander CPLD SHOW elec_6; CPLD revision : 10 MATCH Upgrade version : 10 SAS Expander Flash Config SHOW elec_6; Flash Config data CRC : 11eefeaf MATCH Upgrade version : 11eefeaf SAS Expander VPD SHOW elec_6; VPD revision : 06 MATCH Upgrade version : 06 SAS Expander VPD CRC SHOW elec_6; VPD CRC : f027d510 MATCH Upgrade version : f027d510 root: XYRATEX:AMI_FW: Root Current Ver: 2.0.000A root: XYRATEX:AMI_FW: Root New Ver: 2.0.000A root: XYRATEX:AMI_FW: Root Backup Ver: 2.0.000A root: XYRATEX:AMI_FW: Boot Current Ver: 2.0.0001 root: XYRATEX:AMI_FW: Boot New Ver: 2.0.0001 root: XYRATEX:AMI_FW: BIOS Current Ver: 0.46.0001 root: XYRATEX:AMI_FW: BIOS New Ver: 0.46.0001 root: XYRATEX:AMI_FW: BIOS Backup Ver: 0.46.0001 root: XYRATEX:AMI_FW: CPLD Current Ver: 17.0.0004 root: XYRATEX:AMI_FW: CPLD New Ver: 17.0.0004 root: XYRATEX:FWTOOL: Version checking done
Replace a 5U84 Power Supply Unit
Prerequisites
Part number
100842700: Power Supply, Sonexion for SSU
Time
30 minutes
Interrupt level
Live (can be applied to a live system with no service interruption)
Tools
● ESD strap
● Host names assigned to the two OSS nodes in the SSU that contains the failed PSU (available from the customer)
About this task
Use this procedure only for Sonexion 2000 1.5 pre SU-007 systems (see the Important note below) to remove and replace a failed power supply unit (PSU) in the 5U84 enclosure of an SSU component. IMPORTANT: The procedure in this topic is intended for use with Sonexion 2000 1.5.0 (pre SU-007) systems only. For Sonexion 1.5.0 SU-007 and later systems, do not use this topic to replace failed hardware. Instead, field personnel should log in to the Sonexion service console, which provides step-by-step instructions to replace the failed part. Follow the steps below to access the service console (a quick reachability check for the console is sketched after the Notes and Cautions below):
1. Cable a laptop to any available port on any LMN switch (located at the top of the rack). 2. Log in to the service console and follow the procedure to remove and replace the failed part. To log in, navigate to the service console (http://service:8080). If that URL is not active, then Log in to port 8080 of the IP address of the currently active MGMT node (MGMT0): http://IP_address:8080 where IP_address is the IP address of the currently active (primary) MGMT node. 3. Enter the standard service credentials. PSUs are located below the fan modules and are accessible from the back of the rack. PSUs are not individually numbered; assume a left-to-right order as viewed from the rear. PSU 1 is on the left side (bay 1) and PSU 2 is on the right side (bay 2). This procedure includes steps to verify the operation of the new PSU. In the cabinet, the SSU contains a 5U84 enclosure, two controllers, two PSUs, and five fan modules. 5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
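As noted in the Important statement above, SU-007 and later systems should be serviced through the service console. Before walking to the rack it can help to confirm the console is reachable from the laptop; the following is a minimal sketch using curl, where IP_address is the primary MGMT node address described above.
# Check that the service console answers on port 8080; try the hostname first,
# then fall back to the primary MGMT node's IP address.
curl -sI http://service:8080 || curl -sI http://IP_address:8080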
Replace the failed PSU by performing the following steps at the back of the rack:
Procedure 1. If the location of the failed PSU is known, skip to step 4. If not, use one of the following procedures. To locate the failed PSU using hardware indicators, do the following: a. Locate the SSU that contains the failed PSU. b. Look for an amber light on the failed PSU and on the front panel of the SSU. The amber PSU fault LED and other LEDs are shown in the following figure. Figure 29. PSU LEDs
2. To locate the failed PSU using commands, do the following: a. Log in to the primary MGMT node. $ ssh -l admin primary_MGMT_node b. Change to root user and enter: $ sudo su c.
Access GEM on an OSS node in the SSU that contains the failed PSU and enter: # conman nodename-gem In the above, nodename is the host name of one of the OSS nodes in the SSU that contains the failed PSU. Each SSU has two OSS nodes, each of which has a controller. Within an SSU, the OSS nodes are organized in an HA pair. Obtain the OSS node names from the customer.
3. At the GEM prompt, list the status of both PSUs in the SSU (bays 1 and 2) to determine which one failed: > getpsustatus 1 > getpsustatus 2 The following getpsustatus output for the PSU in Bay 1 (showing a fault condition) follows: > getpsustatus 1 **** PCM Bay 1 status **** PCM index: 1 present: yes type: PSMI FRUid: 0xD4 product name: UD-PSU01-2800-AC manufacturer: DELTA part number: 0948719 part revision: 05 serial number: PMD0948719G0755 firmware version: 2.16 power state: off fans self-powered: no nominal power rating: 2814W nominal power rating output bitmask: 0x3 surge power rating: 3014W surge power rating output bitmask: 0x1 surge hold time: 10s AC dropout tolerance: 20ms standby power: 14W cooling power: 0W swap: no hotSwap(private): yes hotSwap(public): yes fault: yes fault_qualifier: "General Failure" "Not providing power" "AC failure" "DC failure" DC output count: 2 output 1 (12.19V) voltage: 12.50V
current: 0.00A power: 0.00W min voltage: 11.83V min surge voltage: 0.00V max voltage: 12.60V max current: 233.40A current: 2.00A max surge current: 250.00A output 2 (5.19V) voltage: monitoring unsupported current: monitoring unsupported power: monitoring unsupported min voltage: 5.10V min surge voltage: 0.00V max voltage: 5.40V max current: 2.70A min current: 0.00A max surge current: 0.00A SES info bit: not set 4. Replace the failed PSU by performing the following steps at the back of the rack: a. Verify that the ON/OFF switch on the failed PSU is set to OFF. b. Release the power cord retainer and unplug the power from the PSU. c.
Using the handle, press the latch mechanism toward the center of the failed PSU and carefully remove the PSU from the enclosure bay.
d. Verify that the ON/OFF switch on the new PSU is in the OFF position. e. Insert the new PSU in the enclosure bay and slide it in until the PSU completely seats and engages the latch. f.
Move the ON/OFF switch to the ON position; the PSU powers on.
Figure 30. PSU Switch Location (on Rear Panel)
Verify Operation of New PSU
5. Verify that all indicator lights are normal.
● The indicator lights on the new PSU should be green.
● The indicator lights on the front panel of the SSU should be green.
6. Compare the USM and GEM firmware versions between the new PSU and the controller to make certain they match. 7. For instructions on how to compare and update USM firmware versions, see Sonexion 2000 USM Firmware Leveling Guide. 8. Verify that the getpsustatus output is normal and shows no errors (fault: no): GEM> getpsustatus 1 An example of output for the PSU in bay 1: GEM> getpsustatus 1 **** PCM Bay 1 status **** PCM index: 1 present: yes type: PSMI FRUid: 0xD4 product name: UD-PSU01-2800-AC manufacturer: DELTA part number: 0948719
part revision: 05 serial number: PMD0948719G0755 firmware version: 2.16 power state: on fans self-powered: yes nominal power rating: 2814W nominal power rating output bitmask: 0x3 surge power rating: 3014W surge power rating output bitmask: 0x1 surge hold time: 10s AC dropout tolerance: 20ms standby power: 14W cooling power: 0W swap: no hotSwap(private): yes hotSwap(public): yes fault: no DC output count: 2 output 1 (12.19V) voltage: 12.46V current: 47.89A power: 569.50W min voltage: 11.83V min surge voltage: 0.00V max voltage: 12.60V max current: 233.40A min current: 2.00A max surge current: 250.00A output 2 (5.19V) voltage: monitoring unsupported current: monitoring unsupported power: monitoring unsupported min voltage: 5.10V min surge voltage: 0.00V max voltage: 5.40V max current: 2.70A min current: 0.00A max surge current: 0.00A combined power: 569.50W SES info bit: not set 9. Exit GEM: GEM> &.
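Because the GEM session is run through conman, the getpsustatus output from step 8 is also recorded in the node's console log under /var/log/conman (as noted for ddump earlier in this guide). The following is a minimal sketch that pulls the most recent fault and power-state lines from that log; the node name is an example, and the exact contents depend on what was run in the GEM session.
# Show the latest fault / power state lines captured from the GEM console session.
grep -E 'fault|power state' /var/log/conman/snx11000n005-gem.log | tail -6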
Replace a 5U84 Chassis
Prerequisites
Part number
101170900: Chassis Assy, Sonexion 2000 5U SSU FRU, with power supplies and fans
Time
2.5 hours
Interrupt level
Interrupt (requires disconnecting the Lustre clients from the filesystem)
Tools
● ESD strap
● Console with monitor and keyboard or a PC with a serial port configured for 115.2 Kb/s
● Serial cable
About this task
The SSU comprises a chassis and the two controllers. Each controller hosts an object storage system (OSS) node. Each SSU has two OSS nodes. In this procedure, only the defective chassis is replaced; all other components are re-used in the new SSU chassis. The following procedure requires taking the Lustre filesystem offline.
5U84 Enclosure Precautions
Whenever a component (such as a PSU, fan, or controller) fails and causes an SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted ClusterStor operations, the faulty component should be replaced as soon as possible to return the SSU to a fully redundant state.
● Incorrectly carrying out this procedure may cause damage to the enclosure.
● When you replace a cooling fan in any SSU enclosure, you must complete the replacement within 2 minutes.
● When a drawer is open on any SSU enclosure, do not leave the drawer open longer than 2 minutes.
● When replacing a disk drive in any SSU enclosure, unlatch the drive and wait 5 seconds for the drive to spin down before removal.
● To prevent overturning, drawer interlocks stop users from opening both drawers at the same time. Do not attempt to force open a drawer if the other drawer is already open.
● Operating temperatures inside the enclosure drawers can reach up to 95°C. Take care when opening drawers.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully redundant state.
Procedure 1. Locate the SSU containing the defective chassis. If the logical locations (hostnames) of the two OSS nodes associated with the affected SSU are known, skip to Replace SSU Chassis and Verify. 2. Establish communication with the management node (n000) using one of the following two methods: Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP address of the MGMT node (n000) to launch a terminal session, such as Putty, using the settings shown in the following table. Table 7. Settings for MGMT Connection Parameter
Setting
Bits per second
115200
Data bits
8
Parity
None
Stop bits
1
Flow control
None
The function keys are set to VT100+. Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown in the following figure. Ensure that the connection has the settings shown in the table above. Figure 31. Monitor and Keyboard Connections on the MGMT Node
3. Log on to the MGMT node (n000) as admin using the related password, as follows: login as: admin [email protected]’s password: password Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59 [admin@snx11000n000 ~]$ 4. Change to root user: $ sudo su 5. Log in to one of the OSS nodes hosted on the controller, with username admin and the associated password. 6. Determine the hostnames of the OSS nodes in an HA pair: # crm_mon Leave the serial cable connected, to use in the following steps. 7. Power off the affected SSU: [admin@n000]$ cscli power_manage –n nodeXX --power-off Repeat the command for both nodes inside the affected SSU. Replace SSU Chassis and Verify Perform the following steps at the back of the rack. 8. Power off both PSUs in the SSU (located below the controllers). 9. Remove the power cords from the PSUs. 10. Disconnect the RJ-45 and Infiniband cables from both SSU controllers. If serial cables are attached to the controllers as part of the system, disconnect them. 11. Remove all drives from the chassis, keeping the same order per shelf/slot. Although the drive roaming functionality works, the best practice is to put each drive back in its original slot. For peak performance, SSDs must be re-inserted in their original slot. 12. Remove both PSUs from the chassis. 13. Remove all fan modules from the chassis. 14. Remove the SSU controller on the left side of the chassis and mark it as “L”. 15. Remove the SSU controller on the right side of the chassis and mark it as “R”. 16. Remove the four screws in the front mounting brackets. 17. Disconnect the rear sliding brackets from the chassis.
18. With a second person, remove the SSU chassis from the rack. 19. With a second person, move the new SSU chassis into the rack. 20. Connect the chassis to the rack. a. Secure four screws to the front mounting brackets. b. Connect the rear sliding brackets to the chassis. 21. Insert the SSU controller marked “L” into the left side of the chassis (at the rear of the rack). 22. Insert the SSU controller marked “R” into the right side of the chassis (at the rear of the rack). 23. Connect the RJ-45 and InfiniBand cables to both OSS controllers. If serial cables were previously attached to the controllers as part of the system, reconnect them. 24. Insert all fan modules into the chassis. 25. Insert both PSUs into the chassis. 26. Insert all drives into the chassis (keeping the original order per shelf/slot). 27. Connect the power cords to the PSUs and power on both modules. 28. Wait for the SSU containing the new chassis to fully power on, the OSS nodes to come online. It can take up to 15 minutes for the OSS nodes to boot. 29. Verify that the new SSU controller works correctly a. Connect a serial cable to one of the controllers in the new SSU (serial port is on the rear panel) and to the console with monitor and keyboard or the PC. b. Log in to one of the OSS nodes hosted on the controller (username admin and the customer's password). If the login fails, the OSS node is not yet online. c.
Verify that all drives in the new chassis are detected: [admin@n000]$ sudo sg_map -i -x | egrep -e 'HIT|SEA|WD' | wc -l The command should return a count of 84, or 168 on SSU+1 systems. Example: $ sudo sg_map -i -x | egrep -e 'HIT|SEA|WD' | wc -l 84 IMPORTANT: If the drive count is lower than expected or not all drives are detected, do not continue. Contact Cray Support to troubleshoot the problem. (A scripted form of this check is sketched after step 29.e below.)
d. Check the USM firmware version running on the new controller. See USM Firmware Update Guide. e. Disconnect the serial cable from the controller in the affected SSU and the console (or PC).
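The following is a minimal scripted form of the drive-count check in step 29.c; the expected count (84, or 168 on SSU+1 systems) is an assumption to adjust for the system at hand.
# Compare the detected drive count against the expected value.
EXPECTED=84
COUNT=$(sudo sg_map -i -x | egrep -e 'HIT|SEA|WD' | wc -l)
if [ "$COUNT" -lt "$EXPECTED" ]; then
    echo "Only $COUNT drives detected (expected $EXPECTED); stop and contact Cray Support."
else
    echo "$COUNT drives detected."
fi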
Replace a 2U24 Disk
Prerequisites
Part number
101018800: Disk drive assy, 300GB Sonexion SAS 15.3K MMU
Time
1 hour
Interrupt level
Live (procedure can be applied to a live system with no service interruption)
Tools
● ESD strap
● Console with monitor and keyboard or PC with a serial COM port configured for 115.2 Kb/s
● Serial cable
● Torx-20 screwdriver
About this task
Use this procedure only for Sonexion 2000 1.5 pre SU-007 systems (see the Important note below) to remove and replace a failed disk drive in carrier (disk) in the 2U24 enclosure of an MMU component. IMPORTANT: The procedure in this topic is intended for use with Sonexion 2000 1.5.0 (pre SU-007) systems only. For Sonexion 1.5.0 SU-007 and later systems, do not use this topic to replace failed hardware. Instead, field personnel should log in to the Sonexion service console, which provides step-by-step instructions to replace the failed part. Follow the steps below to access the service console:
1. Cable a laptop to any available port on any LMN switch (located at the top of the rack). 2. Log in to the service console and follow the procedure to remove and replace the failed part. To log in, navigate to the service console (http://service:8080). If that URL is not active, then log in to port 8080 of the IP address of the currently active MGMT node (MGMT0): http://IP_address:8080 where IP_address is the IP address of the currently active (primary) MGMT node. 3. Enter the standard service credentials. Subtasks: ●
3 on page 87
● 8 on page 88
● dm_report Example Output
In the MMU, a 2U quad server (Intel) is cabled to a 2U24 (or 5U84) enclosure. The 2U24 enclosure contains 24 disks configured in MDRAID arrays, each of which is a set of drives configured to act as a single volume. For data protection, the disk drives are configured as RAID1 and RAID10. When a disk fails and its MDRAID array is degraded, one of the hot spares becomes active and the data from the failed disk is immediately rebuilt on the spare disk. While the rebuild is underway, operations continue without interruption. When the disk is replaced, mark the new disk as the hot spare. If a hot spare fails, it must be replaced with a new drive, but the remove / replace procedure is easier because no disk rebuild is necessary. Instructions to replace a failed hot spare are provided in the following procedure. This procedure applies to both disks used in the MDRAID array and drives marked as hot spares. It includes steps to replace the failed disk, verify the disk recovery/rebuild on a spare disk, and mark the newly installed disk as a hot spare. Notes and Cautions ●
Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure 1. If the location of the failed disk is known, skip this step. Otherwise, look for the amber Drive Fault LED on the 2U24 Operator Control Panel (OCP) and the failed disk. The amber LED indicates a problem with the disk. When viewed from the front of the 2U24 enclosure, the slot numbers start from the left (slot 0) and continue to the right (slot 23). The dm_report utility reports drive data starting from slot 0. 2. Log in to the primary MGMT node: [Client]$ ssh –l admin primary_MGMT_node Verify Array State of 2U24 Disk 3. If the array state is already known, go to 8 on page 88. Use the following steps to verify the state of the arrays in the 2U24 enclosure. First run the dm_report command: [admin@n000]$ sudo dm_report The dm_report command provides status for all drives in the enclosure. The command output provides status of OK or Failed for drives owned by the node where the command was run, Foreign for drives owned by the other node, and Hot Spare for hot spare drives owned by either node. For hot spares that are not spinning, offline, etc., the command output reports these as missing drives. Following is a sample dm_report output, showing a failed disk in slot 20: (indicated by the shaded text). The failed disk in slot 20 was part of RAID array md64. In the subsequent shaded text, it shows the RAID array is degraded. A complete level 2 work flow example is shown in dm_report Example Output. [admin@snx11000n000 ~]$ sudo dm_report Diskmonitor Inventory Report: Version: 1.0-3026.xrtx.2287 Host: snx11000n000Time: Wed Mar 27
07:46:33 2013 encl:0, wwn: 50050cc10c201ea4, dev:/dev/sg24, slots:24, vendor: XYRATEX , product_id: EB-2425P-E6EBD slot:0, wwn: 5000c50047a5aa93, cap:450098159104, dev:sda, parts:0, status: Foreign Arrays slot:1, wwn: 5000c50047ad9dc7, cap:450098159104, dev:sdb, parts:0, status: Foreign Arrays slot:2, wwn: 5000c50047a5b4ab, cap:450098159104, dev:sdl, parts:0, status: Foreign Arrays slot:3, wwn: 5000c50047a5b323, cap:450098159104, dev:sdj, parts:0, status: Hot Spare slot:4, wwn: 5000c50047b5a953, cap:450098159104, dev:sdi, parts:0, status: Foreign Arrays slot:5, wwn: 5000c50047b0e66f, cap:450098159104, dev:sdh, parts:0, status: Foreign Arrays slot:6, wwn: 5000c50047b5a767, cap:450098159104, dev:sdg, parts:0, status: Foreign Arrays slot:7, wwn: 5000c50047a82cc7, cap:450098159104, dev:sde, parts:0, status: Foreign Arrays slot:8, wwn: 5000c50047b0e70b, cap:450098159104, dev:sdd, parts:0, status: Foreign Arrays slot:9, wwn: 5000c50047a5b3bf, cap:450098159104, dev:sdc, parts:0, status: Ok slot:10, wwn: 5000c50047a5b3cf, cap:450098159104, dev:sdf, parts:0, status: Foreign Arrays slot:11, wwn: 5000c50047a5ba8b, cap:450098159104, dev:sdk, parts:0, status: Ok slot:12, wwn: 5000c50047b0e18f, cap:450098159104, dev:sdq, parts:0, status: Foreign Arrays slot:13, wwn: 5000c50047b0e6c7, cap:450098159104, dev:sdp, parts:0, status: Foreign Arrays slot:14, wwn: 5000c50047b0e20b, cap:450098159104, dev:sdo, parts:0, status: Ok slot:15, wwn: 5000c50047aff82b, cap:450098159104, dev:sdr, parts:0, status: Foreign Arrays slot:16, wwn: 5000c50047b5a7fb, cap:450098159104, dev:sdu, parts:0, status: Foreign Arrays slot:17, wwn: 5000c50047b5a947, cap:450098159104, dev:sdt, parts:0, status: Foreign Arrays slot:18, wwn: 5000c50047a68763, cap:450098159104, dev:sds, parts:0, status: Hot Spare slot:19, wwn: 5000c500479d1013, cap:450098159104, dev:sdn, parts:0, status: Foreign Arrays slot:20, status: Empty slot:21, wwn: 5000c50047a5bd0b, cap:450098159104, dev:sdv, parts:0, status: Foreign Arrays slot:22, wwn: 5000cca0130397c0, cap:100030242304, dev:sdw, parts:0, status: Ok slot:23, wwn: 5000cca01303a354, cap:100030242304, dev:sdx, parts:0, status: Ok Array:md67, UUID: 62b4f0e3-a94bc5ce-92d43472-4e504842, status: Ok disk_wwn: 5000cca0130397c0, disk_sd:sdw, disk_part:0, encl_wwn: 50050cc10c201ea4, encl_slot:22 disk_wwn: 5000cca01303a354, disk_sd:sdx, disk_part:0, encl_wwn: 50050cc10c201ea4, encl_slot:23 Array:md64, UUID: 5e34e293-1a3e9fa0-5a2a3689-e8a34dc5, status: Degraded disk_wwn: 5000c50047a5b3bf, disk_sd:sdc, disk_part:0, encl_wwn: 50050cc10c201ea4, encl_slot:9 disk_wwn: 5000c50047a5ba8b, disk_sd:sdk, disk_part:0, encl_wwn: 50050cc10c201ea4, encl_slot:11 disk_wwn: 5000c50047b0e20b, disk_sd:sdo, disk_part:0, encl_wwn: 50050cc10c201ea4, encl_slot:14 Array:md127, UUID: 89b60bdb-8890fd5e-6b8f17b2-dbdbcdd2, status: Ok Array is unmanaged -- found no disks in a managed enclosure End_of_report
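While data is being rebuilt onto the hot spare, the progress can also be followed through the kernel's MD status in addition to dm_report; the following is a minimal sketch, using the degraded array md64 from the example output above.
# Show the overall md array status and the rebuild progress of the degraded array.
cat /proc/mdstat
sudo mdadm --detail /dev/md64 | grep -E 'State|Rebuild Status'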
4. Identify the faulty drive by examining the dm_report slot location and the disk drive with the lit Drive Fault LED (amber). 5. Verify that the WWN (identifier) for the failed disk reported in the dm_report output matches the WWN of the failed disk in the 2U24 enclosure. IMPORTANT: The WWN identifiers may not match exactly; there are several WWN identifiers associated with each disk, so there may be a one-digit difference between them. 6. If the failed disk is still responsive, fetch its serial number: [admin@n000]$ sudo sg_inq /dev/sdXX where XX is the SD device. This is sample sg_inq output: $ sudo sg_inq /dev/sdaz Vendor identification: SEAGATE Product identification: ST32000444SS Product revision level: XQB7 Unit serial number: 9WM1GXFJ0000C1055E36 Only the first eight characters of the serial number are written on each drive.
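For step 5, the WWN reported by dm_report can also be cross-checked against the Linux block device before pulling the drive; the following is a minimal sketch, using the WWN and device of the slot 9 disk from the example output purely for illustration.
# Map a WWN from dm_report to its /dev/sdX device, then confirm the serial number.
ls -l /dev/disk/by-id/ | grep -i 5000c50047a5b3bf
sudo sg_inq /dev/sdc | grep -i serial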
Replace 2U24 Disk and Verify
The remove/replace procedure is the same whether the failed drive is active in the MDRAID array or marked as a hot spare. This procedure includes steps to verify that the disk is the hot spare, verify OCP status, and verify correct LED indications. 7. Carefully insert the lock key into the lock socket and rotate it counter-clockwise until the red indicator is no longer visible in the opening above the key. 8. Remove the lock key. 9. Release the disk by pressing down the latch and rotating the latch downward. 10. Gently remove the disk from the drive slot. 11. Wait for the system to detect the missing drive. On a quiescent system, it takes approximately 30 seconds for the missing drive to be detected, longer on a busy system. 12. Verify that the disk handle is released and in the open position. 13. Insert the new disk into the empty drive slot and gently slide the drive carrier into the enclosure until it stops. CAUTION: All drive slots must have a disk or a dummy carrier installed to maintain balanced airflow.
IMPORTANT: Ensure that the new disk is oriented so the drive handle opens downward. 14. Seat the disk by pressing the handle latch. NOTE: A click is audible as the handle latch engages. 15. To verify that the new disk is oriented like the other disks, first carefully insert the lock key into the lock socket. 16. Rotate the key clockwise until the red indicator is visible in the opening above the key. 17. Remove the lock key. 18. Verfiy that the new disk is the hot spare: a. Enter: [admin@n000]$ sudo dm_report Depending on the cluster's load and drive spin up time, it may take a few minutes for the dm_report output to show the new disk registered as the hot spare. Following is a partial sample of dm_report output showing the disk in slot 20 registered as the hot spare: $ sudo dm_report Diskmonitor Inventory Report: Host: snx11000n00 Time: Tue Jan 3 14:30:39 2012 encl: 0, wwn: 50050cc10c2002e0, dev: /dev/sg24, slots: 24, vendor: XYRATEX, product_id: EB-2425P-E6EBD, dev1:/dev/sg49 slot: 0, wwn: 5000c5004293f76b, cap: 299999999488, dev: sdx, parts: 0000000000000000, status: Ok, dev1: sds slot: 1, wwn: 5000c5004293c657, cap: 299999999488, dev: sdw, parts: 0000000000000000, status: Foreign Arrays, dev1: sdz slot: 2, wwn: 5000c5004291aca7, cap: 299999999488, dev: sdv, parts: 0000000000000000, status: Ok, dev1: sdaj slot: 3, wwn: 5000c5004293f397, cap: 299999999488, dev: sdu, parts: 0000000000000000, status: Foreign Arrays, dev1: sdah slot: 4, wwn: 5000c500429371cb, cap: 299999999488, dev: sdt, parts: 0000000000000000, status: Ok, dev1: sdag slot: 5, wwn: 5000c50042935093, cap: 299999999488, dev: sdy, parts: 0000000000000000, status: Hot Spare, dev1: sdaf
b. If the new disk comes up as anything other than hot spare, clear the superblock information: [admin@n000]$ sudo mdadm --zero-superblock --force /dev/sdXX
where XX is the SD device number. c.
Run the dm_report command again to verify that the new disk is registered as the hot spare.
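If preferred, step 18.c can be scripted to poll dm_report until the replaced slot reports Hot Spare status; the following is a minimal sketch, assuming slot 20 from the example above.
# Poll dm_report every 30 seconds until slot 20 shows up as the hot spare.
until sudo dm_report | grep -Eq 'slot: ?20,.*Hot Spare'; do
    sleep 30
done
echo "Slot 20 is registered as the hot spare."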
19. Verify that the Operator Control Panel (OCP) and Drive Fault LEDs are normal, as shown in the following figure. Figure 32. Operator Control Panel 2U24/4U24
Table 8. 2U24 Control Panel Indicators
LEDs            State          Description
System Power    Steady Green   AC power is applied to the enclosure.
Module Fault    Steady Amber   Indicates one of the following (refer to individual module fault LEDs): PCM fault, OSS fault, or an over- or under-temperature fault condition.
Logical Fault   Steady Amber   Indicates a failure of a disk drive.
20. Verify that the Module Fault LED on the OCP of the 2U24 enclosure is green. 21. Verify that no Drive Fault LEDs are lit. If a recovery or rebuild operation is in progress on the MDRAID array, the Activity LEDs will be lit for each drive in the array. The Logical Fault LED on the OCP of the 2U24 enclosure will also be lit. 22. Log out of the primary MGMT node.
Replace a 2U24 EBOD
Prerequisites
Part number
100843600: Controller Assy, Sonexion EBOD Expansion Module
Time
1.5 hours
Interrupt level
Failover (can be applied to a live system with no service interruption, but requires failover/failback)
Tools
● Labels (attach to SAS cables)
● ESD strap
● Serial cable (9 pin to 3.5mm phone plug)
Requirements
● IMPORTANT: Whenever a component (such as a PSU, fan, or controller) fails and causes an MMU or SSU to change from a fully redundant state to a non-redundant state, to maintain uninterrupted Sonexion operations, replace the faulty component as soon as possible to return the MMU or SSU to a fully redundant state.
● Hostnames must be assigned to all OSS nodes in the cluster containing the failed EBOD I/O module (available from the customer)
● To complete this procedure, the replacement EBOD I/O module must have the correct USM firmware for the system. See USM Firmware Update Guide.
About this task
The MMU component consists of a 2U quad server cabled to a 2U24 (or 5U84) storage enclosure. The 2U24 storage enclosure contains 24 drives, 2 EBOD I/O modules, and 2 Power Cooling Modules (PCMs). The EBOD I/O modules (upper and lower) are located between the PCMs and are accessible from the back of the cabinet. Each EBOD I/O module has three ports, but the system uses only two ports (A and B). The system architecture requires the lower EBOD I/O module to be installed upside-down in the enclosure. Thus the module's markings and the port sequence are reversed (C / B / A). When an EBOD I/O module is replaced, the SAS cables must be reconnected to the same ports as on the failed module. Apply a label to each SAS cable before it is unplugged, to indicate which port to attach it to on the new EBOD I/O module. When the cables are reconnected, verify that the module and port marked on the label match the module and port on the enclosure. Use this procedure to halt all client I/O and file systems, replace the failed EBOD I/O module, verify the operation of the new EBOD I/O module, and return the system to normal operation.
Although either MGMT node may be in the failed state, this document presents command prompts as [MGMT0], representing the active MGMT node, which could be either MGMT0 or MGMT1.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
This procedure has steps to fail over and shut down the node, remove the failed EBOD, and install a new one.
Procedure 1. Determine the location of the failed EBOD I/O module. a. Check if the customer included this information in an SFDC case. If the ticket contains this information, proceed to step 2. b. If the information has not been provided, look for an amber Fault LED on the failed EBOD I/O module and the Module Fault LED on the Operator Control Panel (OCP) on the 2U24 enclosure (front panel). 2. Connect a KVM or console (or PC) to the primary MGMT server. 3. Log in to the primary MGMT node. [Client]$ ssh -l admin primary_MGMT_node 4. Stop Lustre on all nodes. a. Stop the file system. If multiple file systems are configured in the cluster, be sure to run this command on each file system: [admin@n000]$ cscli unmount -f fsname b. Verify that the file system is in a "stopped" state on all nodes in the cluster: [admin@n000]$ cscli fs_info 5. Power off the system as described in Power Off a Sonexion 1600 System via the CLI. Remove the Failed EBOD I/O Module Perform these steps at the back of the rack: 6. Turn off the power switches on both PCMs in the 2U24 enclosure. 7. Disconnect each SAS cable from the failed EBOD I/O module. Note which cable connects to which port so that each cable can be reconnected to the same port on the new EBOD I/O module. Use upper or lower to indicate the module and A or B to indicate the port. Following is a sample label: Upper A
EBOD I/O modules are installed opposite of each other in the enclosure. Depending on the cable clearance, it may be necessary to disconnect the SAS cables on both EBOD I/O modules to access the failed unit. If the SAS cables are disconnected from the functioning EBOD I/O module, be sure to attach an identifying label to each one (described above) so the cables will be properly reconnected after the new module is installed. 8. Release the module latch by grasping it between your thumb and forefinger and gently squeezing it (see the following figure). Figure 33. EBOD I/O Latch Operation
9. Using the latch as a handle, carefully remove the failed module from the enclosure (see the following figure). Figure 34. Remove an EBOD I/O Module
10. Inspect the new EBOD I/O module for damage, especially to the interface connector. If the module is damaged, do not install it. Obtain another EBOD I/O module. Install the New EBOD I/O Module 11. With the latch in the released (open) position, slide the new EBOD I/O module into the enclosure until it completely seats and engages the latch (see the following figure).
Figure 35. Install an EBOD I/O Module
12. Secure the module by closing the latch. An audible click occurs as the latch engages. 13. Plug in the SAS cables to their original ports on the EBOD I/O. See for more information. 14. Turn on the power switches on both PCMs in the 2U24 storage enclosure. 15. Verify the following LED status: ●
On the new EBOD I/O module, the Fault LED is extinguished and the Health LED is illuminated green.
● On the Operator Control Panel located at the front of the 2U24 storage enclosure, the Module Fault LED is off.
16. Power on the system as described in Power On a Sonexion 1600 System. Powering on the nodes may take some time. To determine when the power on sequence is completed, monitor the console output or watch for the IB link LEDs. 17. Check the USM firmware version running on the new EBOD I/O module using the latest document revision for your version of Sonexion: 18. Start the Lustre file system: [admin@n000]$ sudo cscli mount -f fsname 19. Check that the Lustre file system is started on all nodes: [admin@n000]$ sudo cscli fs_info 20. Close the console connection and disconnect the KVM, or, if using a console or PC, disconnect the serial cable from the primary MGMT server. The procedure to replace a failed EBOD I/O module on the MMU’s 2U24 storage enclosure is complete.
Replace a 2U24 Power Module
Prerequisites
Part number: 100853500: Power Supply, Sonexion 580W for 2U24 MMU
Time: 30 minutes
Interrupt level: Live (procedure can be applied to a live system with no service interruption)
Tools: ESD strap
About this task
Use this procedure, only for Sonexion 2000 1.5 pre SU-007 systems (see Important, below), to remove and replace a failed power cooling module (PCM) in the 2U24 enclosure of an MMU component.
IMPORTANT: The procedure in this topic is intended for use with Sonexion 2000 1.5 (pre SU-007) systems only. For Sonexion 1.5 SU-007 and later systems, do not use this topic to replace failed hardware. Instead, field personnel should log in to the Sonexion service console, which provides step-by-step instructions to replace the failed part. Follow the steps below to access the service console:
1. Cable a laptop to any available port on any LMN switch (located at the top of the rack).
2. Log in to the service console and follow the procedure to remove and replace the failed part. To log in, navigate to the service console (http://service:8080). If that URL is not active, then log in to port 8080 of the IP address of the currently active MGMT node (MGMT0): http://IP_address:8080 where IP_address is the IP address of the currently active (primary) MGMT node (a quick reachability check is sketched at the end of this task description).
3. Enter the standard service credentials.
The MMU includes a 2U24 enclosure and one Intel 2U server with four nodes. The enclosure contains 24 drives, two PCMs and two EBOD I/O modules. The PCMs supply power and cooling to the 2U24 enclosure; there are no separate power supply or fan modules in the chassis. The PCMs are two 100-240V 764W or 580W dual-fan SBB-compliant power cooling modules that are located to the left and right of the EBOD I/O modules, and are accessible from the back of the rack. PCM 0 is the left-side module and PCM 1 is the right-side module. If a PCM fails, the system continues to operate normally on the remaining PCM until the failed module is replaced.
Each 2U24 enclosure type has only one qualified model of PCM. The 764W PCM is certified only for the OneStor AP-2212/AP-2224 enclosure, and the 580W PCM is certified only for the OneStor SP-2212/SP-2224 enclosures. Do not install a PCM in an unqualified enclosure, or mix PCMs in the same enclosure.
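As a quick, optional check before opening a browser, the service console URL can be probed from the laptop. This is a sketch only and assumes curl is installed on the laptop; it is not part of the documented procedure.
[Client]$ curl -I http://service:8080
# If the name "service" does not resolve, probe the active MGMT node address instead
[Client]$ curl -I http://IP_address:8080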
The PCMs are located to the left and right of the EBOD I/O modules and are accessible from the back of the rack. PCM 1 is the left-side module and PCM 2 is the right-side module. This procedure includes steps to replace the failed PCM and verify the operation of the new PCM.
IMPORTANT: Treat the PCM replacement with urgency: the enclosure runs in a non-redundant state while only one PCM is operating.
● To maintain uninterrupted operations, replace a faulty PCM within 24 hours of its failure.
● Do not remove a faulty PCM unless a replacement of the correct type, ready for installation, is available.
● Two PCMs must always be installed.
● PCM replacement should take only a few minutes to perform, but must be completed within 10 minutes from removal of the failed PCM to prevent overheating.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location has not been provided, identify the defective PCM: it is indicated on the Operator Control Panel (front panel) by an amber Module Fault LED together with an amber AC Fail LED and/or DC Fail LED, and on the faulty PCM itself by a non-illuminated green PCM OK LED. Figure 36. Operator Control Panel 2U24/4U24
Figure 37. Power Control Module LEDs
2. Turn off the power switch on the failed PCM.
3. Disconnect the power cord by moving the bale toward the center of the PCM and removing the cord.
4. Release the module latch by grasping it between the thumb and forefinger and gently squeezing it.
5. Using the latch as a handle, carefully remove the failed PCM from the enclosure.
WARNING: Do not remove the cover from the PCM. Danger of electric shock exists inside the cover.
6. Carefully inspect the replacement PCM for damage, especially to the rear connector. Avoid damaging the connector pins. If the PCM is damaged, do not install it. Obtain another replacement PCM.
7. Slide the PCM into the empty bay at the rear of the 2U24 enclosure.
8. As the PCM begins to seat, grasp the handle latch and close it to engage the latch. This action engages the camming mechanism on the side of the module and secures the PCM.
9. Verify that the power switch on the replacement PCM is in the OFF position.
10. Move the bale towards the center of the PCM.
11. Connect the power cord to the PCM.
12. Place the bale over and onto the power cord.
13. Turn on the power switch on the new PCM.
14. Verify that the AC Fail and DC Fail LEDs are off and the PCM OK LED is lit green on the PCM, and that the Module Fault LED on the Operator Control Panel of the 2U24 enclosure (front panel) is extinguished.
Replace a 2U24 Chassis
Prerequisites
Part number: 100853300: Chassis Assy, 2U24 without power supplies or controller
Time: 2 hours
Interrupt level: Interrupt (requires disconnecting Lustre clients from the filesystem)
Tools:
● ESD strap
● #2 Phillips screwdriver
Required files: The following RPMs may be required to complete this procedure. See USM Firmware Update Guide. RPM versions depend on the Sonexion release. Refer to the Sonexion Update Bundles for the current firmware levels. USM firmware must be installed on the primary MGMT node before the USM firmware update procedure is performed.
● lsscsi RPM (lsscsi-*.rpm)
● fwdownloader RPM (fwdownloader-*.rpm)
● GEM/HPM RPM (XYR_USM_SBB-*.rpm)
About this task
Use this procedure to remove and replace a defective chassis (2U24 enclosure) in the MMU component. Subtasks: ●
Remove 2U24 Components and Chassis
●
Replace Chassis and Reinstall Components
●
Power on 2U24 Components and Reactivate
The MMU comprises a 2U24 enclosure and one Intel 2U chassis with four servers. The 2U24 enclosure contains 24 disk drives in carriers (DDICs, referred to simply as disks), two EBOD I/O modules and two power/cooling modules (PCMs). There can be as many as four MMU chassis in a system. The EBOD I/O modules (upper and lower) are located between the PCMs and are accessible from the back of the rack. Each EBOD I/O module has three ports but only two ports (A and B) are used in the system. The architecture requires that the lower EBOD I/O module be installed upside down in the enclosure. This causes the module's markings to be upside down and the port sequence to be reversed (C / B / A). When an EBOD I/O module is replaced, reconnect the SAS cables to the same ports as on the failed module. Apply a label to each SAS cable before it is unplugged to remember which port to attach it to on the new EBOD I/O module.
When the cables are reconnected, verify the module and port marked on the label against the module and port in the enclosure.
In this procedure, only the defective chassis is replaced; all other components in the defective MMU enclosure are re-used in the new chassis. This procedure includes steps to stop all client I/O and file systems, replace the failed 2U24 chassis, verify the operation of the new 2U24 chassis, and return the system to normal operation.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
CAUTION: The size and weight of the 2U24 chassis require two individuals to move the unit safely. Do not perform this procedure unless two individuals are onsite and available to move each 2U24 chassis.
Procedure
1. If the location of the defective chassis is not known, a fault LED (amber) on the front panel of the failed 2U24 chassis indicates the failure. (A system can have as many as four MMU 2U24 chassis.)
2. Establish communication with the management node (n000) using one of the following two methods (a sample Method 1 session is sketched below, after the figure).
Method 1: On a workstation or laptop PC with an Ethernet connection to the Sonexion system, use the IP address of the MGMT node (n000) to launch a terminal session, such as PuTTY, using the settings shown in the following table.
Table 9. Settings for MGMT Connection
Parameter          Setting
Bits per second    115200
Data bits          8
Parity             None
Stop bits          1
Flow control       None
The function keys are set to VT100+.
Method 2: Use a separate monitor and keyboard connected to the back of the MGMT node (n000) as shown in the following figure. Ensure that the connection has the settings shown in the table above.
Figure 38. Monitor and Keyboard Connections on the MGMT Node
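For Method 1, a minimal session from a Linux workstation might look like the following; MGMT_node_IP is a placeholder for the MGMT node (n000) address, and a standard SSH client is assumed in place of PuTTY.
# Connect to the MGMT node over the customer network as the admin user
[Client]$ ssh -l admin MGMT_node_IP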
3. Log on to the MGMT node (n000) as admin using the related password, as follows:
login as: admin
[email protected]'s password: password
Last login: Thu Oct 11 11:06:24 2012 from 172.30.22.59
[admin@snx11000n000 ~]$
4. Change to root user:
$ sudo su
5. Log in to the primary MGMT node:
[admin@n000]$ ssh -l admin primary_MGMT_node
6. Unmount the Lustre file system:
$ cscli unmount -f fsname
7. Verify that the Lustre file system is stopped on all nodes:
$ cscli fs_info
8. Power off the system as described in Power Off a Sonexion 1600 System via the CLI.
Remove 2U24 Components and Chassis
In the following steps, remove the EBOD I/O modules, power cooling modules (PCMs), and disks from the chassis. Perform these steps from the rear of the rack.
9. Remove the EBOD I/O modules:
a. Turn off the power switches on both PCMs in the chassis.
b. On one EBOD I/O module, disconnect each SAS cable and attach a label that indicates the module and port for reconnecting the cable to the module. Use “upper” or “lower” to indicate the module, and A or B to indicate the port. Sample label: “Upper A”. c.
Gently squeeze the module latch between the thumb and forefinger.
d. Using the latch as a handle, carefully withdraw the module from the enclosure. e. Repeat the previous three steps for the second EBOD I/O module. 10. Remove the PCMs: a. At one of the PCMs, disconnect the power cord by moving the bale toward the center of the PCM and removing the cord. b. Release the module latch by gently squeezing it between the thumb and forefinger. c.
Using the latch as a handle, carefully withdraw the PCM from the enclosure. DANGER: To avoid electrical shock, do not remove the cover from the PCM. Failure to comply will result in death or serious injury.
d. Repeat the preceding substeps to remove the second PCM. 11. Remove the disks: a. Record the exact location of the drives, as they must be installed in the same order in the new 2U24 chassis. b. On one disk, carefully insert the lock key into the lock socket and rotate it counter-clockwise until the red indicator is no longer visible in the opening above the key. c.
Remove the lock key.
d. Release the disk by pressing the latch and rotating it downward. e. Gently remove the disk from the drive slot. f.
Mark the drive with its current drive slot number in the chassis so that it can be reinstalled in the same slot in the new chassis. From the front of the rack, the drive slots are numbered 0 to 23 (left to right).
g. Repeat preceding substeps for the remaining disks. 12. Remove the 2U24 chassis from the front of the rack: a. Remove the left and right front flange caps by pulling the caps free. b. Disconnect the chassis from the rack by removing the screw from the left and right flanges (now exposed after removing the flange caps). c.
With the assistance of a second person, remove the chassis from the rack.
d. With the chassis on a bench, remove the left and right front flange caps by pulling the caps free. (The caps simply snap onto the flanges.) Replace Chassis and Reinstall Components 13. Install the new 2U24 chassis in the rack:
a. With the assistance of a second person, move the 2U24 chassis into the rack. Carefully align the guide on each side of the chassis with the groove on the rail assembly and gently push the chassis completely into the rack. b. Connect the chassis to the rack by installing a screw into the left and right flanges. c.
Install the flange caps by pressing them into position. They snap into place on the flanges.
14. From the front of the rack, install disks in the new chassis as follows, placing each disk where it was located in the old 2U24 chassis, oriented with the drive handle opening downward. a. On one disk, verify that the disk handle is released and in the open position. b. Insert each disk into the empty drive slot and gently slide the drive carrier into the enclosure until it stops. c.
Seat the disk by pressing the handle latch and rotating it to the closed position. There will be an audible click as the handle latch engages.
d. Verify that the new disk is in the same orientation as the other disks in the enclosure. e. Carefully insert the lock key into the lock socket and rotate it clockwise until the red indicator is visible in the opening above the key. f.
Remove the lock key.
g. Repeat the disk drive installation steps for the remaining disks. 15. Install the PCMs: a. Carefully inspect the PCM for damage, especially to the rear connector. Avoid damaging the connector pins. If the PCM is damaged, do not install it. Obtain another PCM. b. Slide a PCM into the chassis. As the PCM begins to seat, grasp the handle latch and close it to engage the latch. This action engages the caming mechanism on the side of the module and secures the PCM. c.
Verify that the power switch on each PCM is in the OFF position.
d. Slide a PCM into the chassis. As the PCM begins to seat, grasp the handle latch and close it to engage the latch. e. For each PCM, move the bale toward the center of the PCM, connect the power cord to the PCM, and place the bale over and onto the power cord. f.
Repeat the module installation steps for the second PCM.
16. Install EBOD I/O modules: a. Inspect the EBOD damage, especially to the interface connector. If the module is damaged, do not install it. Obtain another EBOD. b. With the latch in the released (open) position, slide the EBOD into the enclosure until it completely seats and engages the latch. c.
Secure the module by closing the latch. There will be an audible click as the latch engages.
d. Repeat the module installation steps for the second EBOD.
Power on 2U24 Components and Reactivate
17. Plug in the four SAS cables to their original ports on the EBOD I/O modules. Figure 39. 2U24 SAS Cabling
Use the cable labels prepared in step 9.b on page 101 to ensure that the cables are connected to the proper ports.
18. Turn on the power switches on both PCMs.
19. Verify that the indicator LEDs on the PCMs, EBOD I/O modules, and the front panel of the 2U24 chassis are normal and illuminated green. Figure 40. Operator Panel 2U24/4U24
Table 10. 2U24 Operator's Panel LED Description
LEDs            State          Indicates
System Power    Steady green   AC power is applied to the enclosure.
Module Fault    Steady amber   PCM fault, OSS fault, or over/under temperature fault. Refer to individual module fault LEDs.
Logical Fault   Steady amber   Failure of a disk drive.
20. Power on the system as described in Power On a Sonexion 1600 System.
21. The USM and GEM firmware versions must agree between the EBOD I/O modules. Consult Cray Hardware Product Support to obtain the correct files, and use the procedures described in USM Firmware Update Guide. When the firmware versions match, proceed to the following step.
22. Start the Lustre file system:
$ cscli mount -f fsname
23. Check that the Lustre file system is started on all nodes:
$ cscli fs_info
24. Close the console connection and disconnect the KVM, or, if using a console or PC, disconnect the serial cable from the primary MGMT server.
Replace a Quad Server Disk
Prerequisites
Part number: 100900800: Disk Drive Assy, 450GB Sonexion 2U Quad Server MMU FRU
Time: 3 hours
Interrupt level: Live (procedure can be applied to a live system with no service interruption)
Tools: ESD strap, boots, garment or other approved methods; console with monitor and keyboard or PC with a serial COM port configured for 115.2 Kb/s; serial cable; Phillips screwdriver
About this task
The following procedure can be applied to a live system with no service interruption, but requires failover/failback operations. It includes steps to replace the failed disk, verify the disk recovery/rebuild on a spare disk, and mark the newly installed disk as a hot spare. The MMU component consists of a 2U quad server cabled to a 2U24 (or 5U84) storage enclosure. The 2U quad server contains four server nodes, two power supply units (PSUs), fans, and disk drives. This FRU procedure applies only to hard drives hosting the primary and secondary MGMT nodes (nodes 00 and 01 respectively). This procedure does not apply to disks hosting the MGS and MDS internal nodes (nodes 02 and 03 respectively), as they are diskless nodes that do not contain a RAID array using internal drives. Subtasks: ●
Additional Information on MGMT Node RAID Status
●
Replace Quad Server Disk
●
Clean Up Failed or Pulled Drive in Node on page 113
Notes and Cautions ●
Only trained service personnel should perform this procedure.
●
If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
●
Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure 1. If the location of the failed disk is known, go to Replace Quad Server Disk. 2. If the failed disk is not known, log in to the primary MGMT node: $ ssh -l admin primary_MGMT_node 3. Run the dm_report command on the primary or secondary MGMT node suspected to have the failed disk. $ sudo dm_report Both the primary MGMT node and the secondary MGMT node have MDRAID arrays in the MMU. It may be necessary to run this step on each node to determine which one is associated with the failed disk. 4. Connect the console with monitor and keyboard or PC to the primary MGMT node. 5. Log in to the primary MGMT node. [admin@n000]$ ssh primary_MGMT_node 6. Issue the dm_report command: [admin@n000]$ sudo dm_report If the primary MGMT node is in an unhealthy state, the dm report command produces an output similar to the following: [admin@n000]$ sudo dm_report Diskmonitor Inventory Report: Version: 1.0.x.1.5-25.x.2455 Host: snx11000n00 Time: Thu Jul 31 03:29:36 2014 encl: 0, wwn: 50050cc10c2002dd, dev: /dev/sg24, slots: 24, vendor: XYRATEX , product_id: EB-2425P-E6EBD slot: 0, wwn: 5000c500320a6d63, cap: 450098159616, dev: sda, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 1, wwn: 5000c500320a81fb, cap: 450098159616, dev: sdb, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 2, wwn: 5000c500320a7a6f, cap: 450098159616, dev: sdl, parts: 0, status: Ok, t10: 11110111000 slot: 3, wwn: 5000c500320afb6f, cap: 450098159616, dev: sdj, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 4, wwn: 5000c500320d4073, cap: 450098159616, dev: sdi, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 5, wwn: 5000c500320a7bef, cap: 450098159616, dev: sdh, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 6, wwn: 5000c500320a7987, cap: 450098159616, dev: sdg, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 7, wwn: 5000c500320ad06b, cap: 450098159616, dev: sde, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 8, wwn: 5000c500320a7a07, cap: 450098159616, dev: sdd, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 9, wwn: 5000c500320a8383, cap: 450098159616, dev: sdc, parts: 0, status: Hot Spare, t10: 11110111000 slot: 10, wwn: 5000c500320a6f53, cap: 450098159616, dev: sdf, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 11, wwn: 5000c500320a7f6b, cap: 450098159616, dev: sdk, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 12, wwn: 5000c500320a7903, cap: 450098159616, dev: sdq, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 13, wwn: 5000c500320a852b, cap: 450098159616, dev: sdp, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 14, wwn: 5000c500320a8243, cap: 450098159616, dev: sdo, parts: 0, status: Ok, t10: 11110111000 slot: 15, wwn: 5000c500320b76db, cap: 450098159616, dev: sdr, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 16, wwn: 5000c500320a72df, cap: 450098159616, dev: sdu, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 17, wwn: 5000c500320b0b77, cap: 450098159616, dev: sdt, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 18, wwn: 5000c500320b968b, cap: 450098159616, dev: sds, parts: 0, status: Ok, t10: 11110111000 slot: 19, wwn: 5000c500320afe87, cap: 450098159616, dev: sdn, parts: 0, status: Foreign Arrays, t10: 11110111000 slot: 20, wwn: 5000c500320c8f33, cap: 450098159616, dev: sdm, parts: 0, status: Hot Spare, t10: 11110111000 slot: 21, wwn: 5000c500320b0067, cap: 450098159616, dev: sdv, parts: 0, status: Ok, t10: 11110111000 slot: 22, wwn: 5000cca01305e080, cap: 100030242816, dev: 
sdw, parts: 0, status: Foreign Arrays, t10: 11100111000 slot: 23, wwn: 5000cca01305e9fc, cap: 100030242816, dev: sdx, parts: 0, status: Foreign Arrays, t10: 11100111000 Array: md64, UUID: 435e6c6a-f76d9702-e41ab3e3-e9ce9e5c, status: Ok, t10: disabled disk_wwn: 5000c500320a7a6f, disk_sd: sdl, disk_part: 0, encl_wwn: 50050cc10c2002dd, encl_slot: 2 disk_wwn: 5000c500320a8243, disk_sd: sdo, disk_part: 0, encl_wwn: 50050cc10c2002dd, encl_slot: 14 disk_wwn: 5000c500320b968b, disk_sd: sds, disk_part: 0, encl_wwn: 50050cc10c2002dd, encl_slot: 18 disk_wwn: 5000c500320b0067, disk_sd: sdv, disk_part: 0, encl_wwn: 50050cc10c2002dd, encl_slot: 21 Array: md127, UUID: 03cf20b5-c06203d5-2300dd64-1b2cd976, status: Degraded, t10: disabled Array is unmanaged -- found no disks in a managed enclosure T10_key_begin: GRD_CHK(1), APP_CHK(1), REF_CHK(1), ATO(1), RWWP(1), SPT(1), P_TYPE(1), PROT_EN(1), DPICZ(1), FMT(1), READ_CHK(1) T10_key_end End_of_report
In this example, the command output for md127 is degraded. If the RAID status is OK, the primary MGMT node is operating correctly. Run the following commands to obtain additional RAID status information for the primary MGMT node. Additional Information on MGMT Node RAID Status The following steps provide more information on the RAID status of the primary MGMT node. 7. Check the drive sg_map to retrieve the device (/dev) number: [admin@n000]$ sudo sg_map –i –x The sg_map command should produce the following output: [sudo@node00 ~]# sg_map -i -x /dev/sg0 0 0 0 0 0 /dev/sda SEAGATE ST9450404SS XQB6 /dev/sg1 0 0 1 0 0 /dev/sdb SEAGATE ST9450404SS XQB6 /dev/sg2 0 0 2 0 0 /dev/sdc SEAGATE ST9450404SS XQB6 /dev/sg3 0 0 3 0 0 /dev/sdd SEAGATE ST9450404SS XQB6 /dev/sg4 0 0 4 0 0 /dev/sde SEAGATE ST9450404SS XQB6 /dev/sg5 0 0 5 0 0 /dev/sdf SEAGATE ST9450404SS XQB6 /dev/sg6 0 0 6 0 0 /dev/sdg SEAGATE ST9450404SS XQB6 /dev/sg7 0 0 7 0 0 /dev/sdh SEAGATE ST9450404SS XQB6 /dev/sg8 0 0 8 0 0 /dev/sdi SEAGATE ST9450404SS XQB6 /dev/sg9 0 0 9 0 0 /dev/sdj SEAGATE ST9450404SS XQB6 /dev/sg10 0 0 10 0 0 /dev/sdk SEAGATE ST9450404SS XQB6 /dev/sg11 0 0 11 0 0 /dev/sdl SEAGATE ST9450404SS XQB6 /dev/sg12 0 0 12 0 0 /dev/sdm SEAGATE ST9450404SS XQB6 /dev/sg13 0 0 13 0 0 /dev/sdn SEAGATE ST9450404SS XQB6 /dev/sg14 0 0 14 0 0 /dev/sdo SEAGATE ST9450404SS XQB6 /dev/sg15 0 0 15 0 0 /dev/sdp SEAGATE ST9450404SS XQB6 /dev/sg16 0 0 16 0 0 /dev/sdq SEAGATE ST9450404SS XQB6 /dev/sg17 0 0 17 0 0 /dev/sdr SEAGATE ST9450404SS XQB6 /dev/sg18 0 0 18 0 0 /dev/sds SEAGATE ST9450404SS XQB6 /dev/sg19 0 0 19 0 0 /dev/sdt SEAGATE ST9450404SS XQB6 /dev/sg20 0 0 20 0 0 /dev/sdu SEAGATE ST9450404SS XQB6 /dev/sg21 0 0 21 0 0 /dev/sdv SEAGATE ST9450404SS XQB6 /dev/sg22 0 0 22 0 0 /dev/sdw HITACHI HUSSL4010ASS600 A182 /dev/sg23 0 0 23 0 0 /dev/sdx HITACHI HUSSL4010ASS600 A182 /dev/sg24 0 0 24 0 13 XYRATEX EB-2425-E6EBD 3022 /dev/sg25 1 0 0 0 0 /dev/sdy SEAGATE ST9450405SS 0002 internal drive 0 , left one /dev/sg26 1 0 2 0 0 /dev/sdaa SEAGATE ST9450405SS 0002 internal drive , right one
In this example, the device numbers are sg25 and sg26, as indicated by the highlighted text above. 8. Obtain additional RAID status information: [admin@n000 ~]$ sudo mdadm –-detail /dev/md127 The preceding command should produce the following output: [admin@n000 ~]$ sudo mdadm --detail /dev/md127 If the drive is in a healthy state, the mdadm command should produce the following output: /dev/md127: Version : 1.0 Creation Time : Tue Mar 12 09:16:15 2013 Raid Level : raid1 Array Size : 439548848 (419.19 GiB 450.10 GB) Used Dev Size : 439548848 (419.19 GiB 450.10 GB)
Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Update Time : Wed Mar 20 09:44:36 2013 State : clean Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 Name : snx11000n000:md512 (local to host snx11000n000) UUID : 475cfc47:37213002:a89b42ce:b4ad65e0 Events : 90 Number Major Minor RaidDevice State 0 65 128 0 active sync /dev/sdy 1 65 144 1 active sync /dev/sdz If the drive has failed, the command produces the following output: [admin@node00 ~]$ sudo mdadm --detail /dev/md127 /dev/md127: Version : 1.0 Creation Time : Tue Mar 12 09:16:15 2013 Raid Level : raid1 Array Size : 439548848 (419.19 GiB 450.10 GB) Used Dev Size : 439548848 (419.19 GiB 450.10 GB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Update Time : Wed Mar 20 09:44:36 2013 State : clean Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 Name : snx11000n000:md512 (local to host snx11000n000) UUID : 475cfc47:37213002:a89b42ce:b4ad65e0 Events : 90 Number Major Minor RaidDevice State 0 65 128 0 active sync /dev/sdy 1 65 144 1 removed /dev/sdz In the above example, RAID device 1 has been removed. 9. Check the status of the removed RAID device: Node00$ cat /proc/mdstat If the drive has failed, the command produces the following output: [admin@n100 ~]# cat /proc/mdstat Personalities : [raid1] [raid10] md64 : active raid10 sdj[0] sdn[3] sdr[2] sdd[1] 839843744 blocks super 1.2 4K chunks 2 near-copies [4/4] [UUUU] md127 : active raid1 sdz[1](F) sdy[0] 439548848 blocks super 1.0 [2/1] [U_] unused devices:
The (F) notation after sdz indicates that the sdz member of the md127 array has failed.
Replace Quad Server Disk
10. After locating the failed disk, inspect the front of the 2U quad server and verify that the slot containing the failed disk has a lit Fault LED (amber) on the carrier.
11. Remove the failed disk by pressing the green button and opening the lever, then sliding the disk out. Figure 41. Remove a disk
12. Place the disk on a stable work surface and remove the four screws holding the disk drive in the carrier, as shown in the following figure. This step may be unnecessary if the new disk drive includes the carrier. Figure 42. Remove a Disk Drive From the Carrier
13. Place the new disk drive in the carrier, align the holes, and secure it with the four screws removed earlier, as shown in the following figure.
Figure 43. Install the Drive in the Carrier
14. Install the new drive in the 2U quad server. Verify that the handle latch is released and in the open position, then slide the drive carrier into the enclosure until it stops. Seat the disk by pushing up the handle latch and rotating it to the closed position, as shown in the following figure. There will be an audible click as the handle latch engages. Figure 44. Installing the DDIC (disk)
15. Confirm that the new drive is listed in the sg_map command output, in a good state, and operational:
[MGMT0]$ sudo sg_map -x -i
The new drive appears at the bottom of the drive list as /dev/sg25 or /dev/sg26 (depending on which is the newly inserted drive). For example:
[sudo@snx11000n000]$ sudo sg_map -x -i
/dev/sg0 6 0 0 0 0 /dev/sda SEAGATE ST9450404SS XRFA
/dev/sg1 6 0 1 0 0 /dev/sdb SEAGATE ST9450404SS XRFA
/dev/sg2 6 0 2 0 0 /dev/sdc SEAGATE ST9450404SS XRFA
/dev/sg3 6 0 3 0 0 /dev/sdd SEAGATE ST9450404SS XRFA
/dev/sg4 6 0 4 0 0 /dev/sde SEAGATE ST9450404SS XRFA /dev/sg5 6 0 5 0 0 /dev/sdf SEAGATE ST9450404SS XRFA /dev/sg6 6 0 6 0 0 /dev/sdg SEAGATE ST9450404SS XRFA /dev/sg7 6 0 7 0 0 /dev/sdh SEAGATE ST9450404SS XRFA /dev/sg8 6 0 8 0 0 /dev/sdi SEAGATE ST9450404SS XRFA /dev/sg9 6 0 9 0 0 /dev/sdj SEAGATE ST9450404SS XRFA /dev/sg10 6 0 10 0 0 /dev/sdk SEAGATE ST9450404SS XRFA /dev/sg11 6 0 11 0 0 /dev/sdl SEAGATE ST9450404SS XRFA /dev/sg12 6 0 12 0 0 /dev/sdm SEAGATE ST9450404SS XRFA /dev/sg13 6 0 13 0 0 /dev/sdn SEAGATE ST9450404SS XRFA /dev/sg14 6 0 14 0 0 /dev/sdo SEAGATE ST9450404SS XRFA /dev/sg15 6 0 15 0 0 /dev/sdp SEAGATE ST9450404SS XRFA /dev/sg16 6 0 16 0 0 /dev/sdq SEAGATE ST9450404SS XRFA /dev/sg17 6 0 17 0 0 /dev/sdr SEAGATE ST9450404SS XRFA /dev/sg18 6 0 18 0 0 /dev/sds SEAGATE ST9450404SS XRFA /dev/sg19 6 0 19 0 0 /dev/sdt SEAGATE ST9450404SS XRFA /dev/sg20 6 0 20 0 0 /dev/sdu SEAGATE ST9450404SS XRFA /dev/sg21 6 0 21 0 0 /dev/sdv SEAGATE ST9450404SS XRFA /dev/sg22 6 0 22 0 0 /dev/sdw HITACHI HUSSL4010ASS600 A202 /dev/sg23 6 0 23 0 0 /dev/sdx HITACHI HUSSL4010ASS600 A202 /dev/sg24 6 0 24 0 13 XYRATEX EB-2425P-E6EBD 3519 /dev/sg25 7 0 0 0 0 /dev/sdy SEAGATE ST9450405SS XRC0 /dev/sg26 7 0 2 0 0 /dev/sdaa SEAGATE ST9450405SS XRC0 16. Rebuild the disk drive in the md127 array, using the /dev/sd name of the newly inserted device: [admin@n000]$ sudo mdadm /dev/md127 --add /dev/sd xx Where sdxx is the new device. 17. Verify that the new disk drive has started to rebuild in the array, by running one of the following commands: ●
Use the cat /proc/mdstat command:
[admin@n000]$ cat /proc/mdstat
For example:
[root@snx11000n000 ~]$ cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4] [raid10] md64 : active raid10 sds[0] sdl[3] sdo[2] sdv[1] 859374976 blocks super 1.2 64K chunks 2 near-copies [4/4] [UUUU] bitmap: 3/7 pages [12KB], 65536KB chunk md127 : active raid1 sdaa[2] sdy[0] 439548848 blocks super 1.0 [2/1] [U_] [==================>..] recovery = 93.6% (411549504/439548848) finish=7.3min speed=63767K/sec unused devices:
●
Use the mdadm --detail /dev/md127 command:
[admin@n000]$ sudo mdadm --detail /dev/md127
For example:
/dev/md127: Version : 1.0 Creation Time : Mon Jul 7 02:00:06 2014
Raid Level : raid1 Array Size : 439548848 (419.19 GiB 450.10 GB) Used Dev Size : 439548848 (419.19 GiB 450.10 GB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Update Time : Thu Jul 31 04:53:27 2014 State : clean, degraded, recovering Active Devices : 1 Working Devices : 2 Failed Devices : 0 Spare Devices : 1 PD-Repaired : 0 Rebuild Status : 93% complete Name : snx11000n000:md512 (local to host snx11000n000) UUID : 03cf20b5:c06203d5:2300dd64:1b2cd976 Events : 51756 Number Major Minor RaidDevice State 0 65 128 0 active sync /dev/sdy 2 65 160 1 spare rebuilding /dev/sdaa If the build finishes without any errors, the procedure is complete. 18. If the mdadm command produces the following error: $ mdadm /dev/md127 --add /dev/sdz mdadm: Cannot open /dev/sdz: Device or resource busy Run the clean drive procedure for 5 minutes. [admin@n000]$ dd if=/dev/zero of=/dev/sdz bs=1M After 5 minutes, press Ctrl-C to quit the procedure. . 19. Repeat steps 17 on page 111 and 18 on page 112 to add the drive to the array and confirm that it starts to rebuild. If the drive fails to rebuild, contact Cray Support for further information.
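Although it is not part of the documented steps, the rebuild can also be watched until it completes; a minimal sketch, assuming the standard watch utility is available on the node:
# Refresh the RAID status every 30 seconds; press Ctrl-C once the recovery line disappears
[admin@n000]$ watch -n 30 cat /proc/mdstat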
Zero the Superblock Prerequisites
The original disk has been re-inserted into the chassis, instead of a new disk.
About this task
Use the mdadm zero-superblock command to erase all previous RAID configuration from a previously used drive. ●
Zero the superblock: $ sudo mdadm --zero-superblock /dev/sdz The following output should appear: [root@snx11000n000 ~]# mdadm --zero-superblock /dev/sdz mdadm: Couldn't open /dev/sdz for write
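As an optional follow-up that is not in the original procedure, the result can be checked with mdadm's examine option; a sketch, assuming the same device name:
# After a successful zero-superblock, examine should no longer report RAID metadata on the device
[admin@n000]$ sudo mdadm --examine /dev/sdz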
Clean Up Failed or Pulled Drive in Node About this task
Perform this procedure if the drive status shows "removed" or "faulty", and if inserting a new drive does not update the drive status to "spare rebuilding". This procedure uses node 1 as an example. To clean up a failed or pulled drive in a node:
Procedure
1. Clear the superblock information. If the disk comes up as anything other than hot spare, enter:
$ sudo mdadm --zero-superblock --force /dev/sdXX
2. Log in to the failed or pulled node:
$ ssh failed_MGMT_node
The preceding command should produce the following output:
[admin@n000 ~]$ ssh snx11000n001
Last login: Wed Mar 20 10:44:14 2013 from 172.16.2.2
[root@snx11000n001 ~]# cat /proc/mdstat
Personalities : [raid1] [raid10] md127 : active raid1 sdz[1](F) sdy[0] 439548848 blocks super 1.0 [2/1] [U_] unused devices:
sdz is showing as failed.
3. Retrieve md127 array details, which is built from the two internal drives:
$ sudo mdadm --detail /dev/md127
The preceding command should produce the following output:
[admin@snx11000n001 ~]$ sudo mdadm --detail /dev/md127
/dev/md127: Version : 1.0 Creation Time : Tue Mar 12 09:16:14 2013 Raid Level : raid1 Array Size : 439548848 (419.19 GiB 450.10 GB) Used Dev Size : 439548848 (419.19 GiB 450.10 GB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Update Time : Wed Mar 20 10:57:50 2013 State : clean, degraded Active Devices : 1 Working Devices : 1 Failed Devices : 1 Spare Devices : 0 Name : snx11000n001:md512 (local to host snx11000n001) UUID : 188d023a:0a1c06de:71c7b8fc:eb7f560f
Events : 1903 Number Major Minor RaidDevice State 0 65 128 0 active sync /dev/sdy 1 0 0 1 removed 1 65 144 faulty spare
The faulty spare must then be cleared.
4. Clear the faulty spare:
$ sudo mdadm --manage /dev/md127 --remove faulty
The preceding command should produce the following output:
[admin@snx11000n001 ~]$ sudo mdadm --manage /dev/md127 --remove faulty
mdadm: hot removed 65:144 from /dev/md127
5. Check the md127 array details to make certain the faulty spare has been removed:
$ sudo mdadm --detail /dev/md127
The preceding command should produce the following output:
[admin@snx11000n001 ~]$ sudo mdadm --detail /dev/md127
/dev/md127: Version : 1.0 Creation Time : Tue Mar 12 09:16:14 2013 Raid Level : raid1 Array Size : 439548848 (419.19 GiB 450.10 GB) Used Dev Size : 439548848 (419.19 GiB 450.10 GB) Raid Devices : 2 Total Devices : 1 Persistence : Superblock is persistent Update Time : Wed Mar 20 10:58:25 2013 State : clean, degraded Active Devices : 1 Working Devices : 1 Failed Devices : 0 Spare Devices : 0 Name : snx11000n001:md512 (local to host snx11000n001) UUID : 188d023a:0a1c06de:71c7b8fc:eb7f560f Events : 1924 Number Major Minor RaidDevice State 0 65 128 0 active sync /dev/sdy 1 0 0 1 removed
The failed or pulled drive will require recovering.
6. Check the dev on the SG map before recovering a faulty drive, as the dev may change.
7. Run:
$ sudo sg_map -i -x
The preceding command should produce the following output: [admin@snx11000n001 ~]$ sudo sg_map -i -x /dev/sg0 0 0 125 0 0 /dev/sda HITACHI HUSSL4010ASS600 A182 /dev/sg1 0 0 126 0 0 /dev/sdb HITACHI HUSSL4010ASS600 A182 /dev/sg2 0 0 127 0 0 /dev/sdc SEAGATE ST9450404SS XQB6 /dev/sg3 0 0 128 0 0 /dev/sdd SEAGATE ST9450404SS XQB6 /dev/sg4 0 0 129 0 0 /dev/sdf SEAGATE ST9450404SS XQB6 /dev/sg5 0 0 130 0 0 /dev/sdg SEAGATE ST9450404SS XQB6 /dev/sg6 0 0 131 0 0 /dev/sdh SEAGATE ST9450404SS XQB6 /dev/sg7 0 0 132 0 0 /dev/sdj SEAGATE ST9450404SS XQB6 /dev/sg8 0 0 133 0 0 /dev/sdk SEAGATE ST9450404SS XQB6 /dev/sg9 0 0 134 0 0 /dev/sdl SEAGATE ST9450404SS XQB6 /dev/sg10 0 0 135 0 0 /dev/sdm SEAGATE ST9450404SS XQB6 /dev/sg11 0 0 136 0 0 /dev/sdn SEAGATE ST9450404SS XQB6 /dev/sg12 0 0 137 0 0 /dev/sdo SEAGATE ST9450404SS XQB6 /dev/sg13 0 0 138 0 0 /dev/sdq SEAGATE ST9450404SS XQB6 /dev/sg14 0 0 139 0 0 /dev/sdr SEAGATE ST9450404SS XQB6 /dev/sg15 0 0 140 0 0 /dev/sds SEAGATE ST9450404SS XQB6 /dev/sg16 0 0 141 0 0 /dev/sdt SEAGATE ST9450404SS XQB6 /dev/sg17 0 0 142 0 0 /dev/sdv SEAGATE ST9450404SS XQB6 /dev/sg18 0 0 143 0 0 /dev/sdw SEAGATE ST9450404SS XQB6 /dev/sg19 0 0 144 0 0 /dev/sdx SEAGATE ST9450404SS XQB6 /dev/sg20 0 0 145 0 0 /dev/sdaa SEAGATE ST9450404SS XQB6 /dev/sg21 0 0 146 0 0 /dev/sdab SEAGATE ST9450404SS XQB6 /dev/sg22 0 0 147 0 0 /dev/sdac SEAGATE ST9450404SS XQB6 /dev/sg23 0 0 148 0 0 /dev/sdad SEAGATE ST9450404SS XQB6 /dev/sg24 0 0 149 0 13 XYRATEX EB-2425-E6EBD 3022 /dev/sg25 1 0 0 0 0 /dev/sdy SEAGATE ST9450405SS 0002 /dev/sg26 1 0 2 0 0 /dev/sde SEAGATE ST9450405SS 0002 In the above example, the new dev number is “sde”. 8. Recover the failed or pulled drive: $ sudo mdadm --manage /dev/md127 –re-add /dev/sde The preceding command should produce the following output: root@snx11000n001 ~]# mdadm --manage /dev/md127 --re-add /dev/sde mdadm: re-added /dev/sde 9. Check the md127 array details to make certain the faulty spare has been added: $ sudo mdadm –detail /dev/md127 The preceding command should produce the following output: [admin@snx11000n001 ~]$ sudo mdadm --detail /dev/md127 /dev/md127: Version : 1.0 Creation Time : Tue Mar 12 09:16:14 2013 Raid Level : raid1 Array Size : 439548848 (419.19 GiB 450.10 GB) Used Dev Size : 439548848 (419.19 GiB 450.10 GB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent
Update Time : Wed Mar 20 10:59:24 2013 State : clean, degraded, recovering Active Devices : 1 Working Devices : 2 Failed Devices : 0 Spare Devices : 1 Rebuild Status : 0% complete Name : snx11000n001:md512 (local to host snx11000n001) UUID : 188d023a:0a1c06de:71c7b8fc:eb7f560f Events : 1950 Number Major Minor RaidDevice State 0 65 128 0 active sync /dev/sdy 1 8 64 1 spare rebuilding /dev/sde The drive is now rebuilding.
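Optionally, and not as part of the documented procedure, mdadm can block until the rebuild finishes; a minimal sketch against the same array:
# Returns once md127 has finished resynchronizing, then re-check the detailed status
$ sudo mdadm --wait /dev/md127
$ sudo mdadm --detail /dev/md127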
Replace a Quad Server MGMT Node
Prerequisites
Part number: 100900501: Server, Sonexion Single E5-2680 32GB Management Server Node FDR
Time: 1 hour
Interrupt level: Interrupt, if the cords connected to the power distribution strip block the MGMT server nodes and require reconfiguring (requires taking the Lustre file system offline); Failover, if the MGMT server nodes can be easily accessed (can be applied to a live system with no service interruption, but requires failover/failback)
Tools: ESD strap, shoes, and garment; console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kb/s, 8 data bits, no parity, and 1 stop bit)
About this task
Use this procedure to remove and replace a failed server node hosting the primary and secondary MGMT nodes in the MMU component in the field. This procedure includes steps to replace the failed server node and return the Sonexion system to normal operations. Subtasks: ●
Shut Down the Failed MGMT Node and Replace
●
Set MGMT Node IPMI Address and Record MAC Address
The MMU component consists of a 2U quad server cabled to a 2U24 (or 5U84) storage enclosure. The 2U quad server contains four server nodes, two PSUs, and disk drives. The MMU’s server nodes host the primary and secondary MGMT nodes, MGS, and MDS nodes. The system's High Availability architecture provides that if one of the server nodes goes down, its resources migrate to its HA partner node so Sonexion operations continue without interruption. This document details the replacement of a failed server node hosting the primary and secondary MGMT nodes. Notes and Cautions ●
Only trained service personnel should perform this procedure.
●
If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
●
Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Since either the primary or secondary MGMT node may be in the failed state, this document presents command prompts as [MGMT], representing the active MGMT node, which could be either MGMT0 or MGMT1.
Procedure
1. If the location of the failed server node is known, go to Step 5 on page 119.
2. Attempt to log in to the primary MGMT node via SSH:
[Client]$ ssh -l admin primary_MGMT_node
3. If the primary MGMT node cannot be logged into, then attempt to log in to the secondary MGMT node via SSH:
[Client]$ ssh -l admin secondary_MGMT_node
4. Do one of the following:
●
Access CSSM and use the Health tab to identify the malfunctioning server node.
●
At the front of the rack, verify which server node has failed by looking for a System Status LED (amber) or a dark LED on the left and right control panels of the 2U quad server, as shown in the following figure and table.
The following figure and table show the mapping of server node names. Figure 45. Quad Server Control Panels
Table 11. System Status LED Descriptions
LED Color   Condition   Description
Green       On          System Ready/No Alarm
Green       Flashing    System Ready, but degraded: redundancy lost such as the power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber       On          Critical Alarm: critical power module failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber       Flashing    Non-Critical Alarm: redundant fan failure, redundant power module failure, non-critical temperature and voltage.
-           Off         Power off: system unplugged. Power on: system power off and in standby, no prior degraded/non-critical/critical state.
From the rear of the rack, physical and logical node definitions are shown in the following figure. Figure 46. Quad Server: Rear Component Identification
The server node names are mapped as defined in this table.
Table 12. Quad Server Node Designations
Logical    Physical   Function
Node 000   Node 4     MGMT Primary
Node 001   Node 3     MGMT Secondary
Node 002   Node 2     MGS
Node 003   Node 1     MDS
5. Record the BMC IP, Mask and Gateway IP addresses: [MGMT]$ grep affected_nodename-ipmi /etc/hosts This command produces the following output: [admin@snx11000n000 ~]$ grep snx11000n001-ipmi /etc/hosts 172.16.0.102 snx11000n001-ipmi
The Mask and Gateway addresses are set the same for all nodes: Mask = 255.255.0.0 Gateway = 172.16.2.1 6. In some Sonexion configurations, power cords block access to the 2U quad server, while in other configurations, the 2U quad server can be freely accessed. If the cords block access to the 2U quad server, proceed to the following substeps. If they do not block access to the 2U quad server, go to Step 7 on page 121. Figure 47. Power Distribution Strip With Cables in Top Two Sockets
a. Power off the system as described in Power Off a Sonexion 1600 System via the CLI. b. Remove the power cords from the power distribution strip. c.
Re-install the power cords in the power distribution strip so that no sockets in use block removal of the failed server node in the quad server.
Figure 48. Power Distribution Strip With Top Two Sockets Open
d. Power on the system as described in Power On a Sonexion 1600 System. 7. If the failed node is still operational, fail over its resources to its HA partner (active MGMT) node: [MGMT]$ cscli failover –n affected_nodename 8. Verify that the failover operationwas successful: [MGMT]$ sudo crm_mon -1 The following is an example of a successful failover: ============ [root@snx11000n000 ~]# crm_mon -1r ============ Last updated: Wed Aug 6 03:07:01 2014 Last change: Wed Aug 6 02:59:44 2014 via cibadmin on snx11000n001 Stack: Heartbeat Current DC: snx11000n001 (8d542227-dce8-49d7-bcc6-d1e651d7d0ec) - partition with quorum Version: 1.1.6.1-5.el6-0c7312c689715e096b716419e2ebc12b57962052 2 Nodes configured, unknown expected votes 40 Resources configured. ============ Online: [ snx11000n000 snx11000n001 ] Full list of resources: baton (ocf::heartbeat:baton): Started snx11000n000 snx11000n000_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n000 snx11000n001_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n001 snx11000n000-stonith (stonith:external/libvirt): Started snx11000n000 snx11000n001-stonith (stonith:external/libvirt): Started snx11000n001 prm-httpd (lsb:httpd): Started snx11000n001 prm-mysql (lsb:mysqld): Started snx11000n001
prm-nfslock (lsb:nfslock): Started snx11000n001 prm-bebundd (lsb:bebundd): Started snx11000n001 Clone Set: cln-cerebrod [prm-cerebrod] Stopped: [ prm-cerebrod:0 prm-cerebrod:1 ] prm-conman (lsb:conman): Stopped prm-dhcpd (lsb:dhcpd): Started snx11000n001 prm-xinetd (lsb:xinetd): Started snx11000n001 Clone Set: cln-syslogng [prm-syslogng] Started: [ snx11000n000 snx11000n001 ] prm-nodes-monitor (lsb:nodes-monitor): Started snx11000n001 Clone Set: cln-ses_mon [prm-ses_monitor] Started: [ snx11000n000 snx11000n001 ] Clone Set: cln-nsca_passive_checks [prm-nsca_passive_checks] Started: [ snx11000n000 snx11000n001 ] Resource Group: grp-icinga prm-icinga (lsb:icinga): Stopped prm-nsca (lsb:nsca): Stopped prm-npcd (lsb:npcd): Stopped Resource Group: grp-plex prm-rabbitmq (lsb:rabbitmq-server): Stopped prm-plex (lsb:plex): Stopped prm-repo-local (ocf::heartbeat:Filesystem): Started snx11000n001 prm-repo-remote (ocf::heartbeat:Filesystem): Started snx11000n000 prm-db2puppet (ocf::heartbeat:oneshot): Started snx11000n001 Clone Set: cln-puppet [prm-puppet] Started: [ snx11000n001 ] Stopped: [ prm-puppet:0 ] prm-nfsd (ocf::heartbeat:nfsserver): Started snx11000n001 prm-vip-eth0-mgmt (ocf::heartbeat:IPaddr2): Started snx11000n001 prm-vip-eth0-nfs (ocf::heartbeat:IPaddr2): Started snx11000n001 Resource Group: snx11000n000_md64-group snx11000n000_md64-raid (ocf::heartbeat:XYRAID): Started snx11000n001 snx11000n000_md64-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001 snx11000n000_md64-stop (ocf::heartbeat:XYSTOP): Started snx11000n001 Resource Group: snx11000n000_md67-group snx11000n000_md67-raid (ocf::heartbeat:XYRAID): Started snx11000n001 snx11000n000_md67-fsys (ocf::heartbeat:XYMNTR): Started snx11000n001 snx11000n000_md67-stop (ocf::heartbeat:XYSTOP): Started snx11000n001 The above output shows that the resources are running on node 001 (snx11000n001). Shut Down the Failed MGMT Node and Replace 9. To begin shutting down the affected node, run: [MGMT]$ cscli power_manage –n failed_server_nodename --power-off If the failed server node does not power down after the poweroff command, users have two options to shut down the node: 10. From the active MGMT node, run: [MGMT]$ pm -0 failed_server_nodename 11. If the node is still powered on and the above command has failed, press and hold the power button, for at least six seconds on the front panel of the failed server node.
12. Verify that the failed server node is powered off:
[MGMT]$ pm -q
For example:
on: snx11000n0[01-05]
off: snx11000n000
unknown:
13. Apply anti-static protection devices such as a wrist strap, boots, garments or other approved methods.
14. Disconnect the cables attached to the failed server node and note where each cable attaches to the enclosure to ensure that the cables can be connected properly during re-installation. IMPORTANT: Be sure to note the port that each cable is attached to, so the same cable connections can be made after the new server node is installed. Refer to the cable reference guide attached to the rack's left hand rear door.
15. Remove the failed server node and install the replacement server node.
a. Push the green latch to release the failed server node, while using the handle to pull it from the chassis. Figure 49. Removing the Server Node
b. Install the new server node by inserting it into the empty bay at the rear of the enclosure. It may be necessary to firmly push the node to fully seat it. There will be an audible click as the node seats.
Figure 50. Installing the Server Node
16. Connect the cables to the new node, except for the cables that go to the first two NIC ports of the add-in Quad-port NIC card and the Infiniband adapter, based on the notes made in Step 14 on page 123, and the cable reference guide attached to the rack’s left hand rear door. 17. Connect the console keyboard and monitor, and then press the power button on the front of the 2U quad server for the server node that was replaced. Set MGMT Node IPMI Address and Record MAC Address 18. Retrieve the BMC IP Address, Subnet Mask and Gateway IP address obtained in Step 5 on page 119, from the output of the grep affected_nodename-ipmi /etc/hosts command, for the server node being replaced. a. Press F2. b. Select BIOS. c.
Select the Server Management tab.
d. Enter the data in the Baseboard LAN configuration section.
Figure 51. Server Management Tab Screen
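As an optional sanity check that is not part of the documented steps, once the BMC is cabled and reachable on the management network, the address entered above can be pinged from the active MGMT node using the -ipmi host name recorded in step 5 (snx11000n001-ipmi in the earlier example):
# A reply confirms the BMC LAN configuration was entered correctly
[MGMT]$ ping -c 3 snx11000n001-ipmi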
19. On the new server node, record the MAC address from BIOS. a. Press F2. b. Select the Advanced tab. c.
Select PCI Configuration.
d. Select NIC Configuration. e. Record the MAC address for “IOM1 Port1 MAC Address” and “IOM1 Port2 MAC Address”. f.
Press F10 to save changes and exit BIOS.
Figure 52. Advanced Screen: MAC Address
20. With the first two NIC ports of the add-in Quad-port NIC card and InfiniBand cables still disconnected, allow the replacement node to boot to a login screen.
Figure 53. Advanced Screen – NIC Configuration
21. Log in as user ‘admin’, using the cluster administrator password. 22. Delete the udev rules file. The new updated udev rules rebuild during reboot: [new MGMT]$ sudo rm /etc/udev/rules.d/70-persistent-net.rules 23. Shut down the new server node: [new MGMT]$ sudo sync; sudo ipmitool chassis power off 24. Connect the first two NIC port (eth0 and eth1) cables to the replacement node. Do not reconnect the Infiniband/40GbE cable yet. 25. If not already done, log in to the active MGMT node: [MGMT]$ ssh –l admin active_MGMT_node 26. Save the t0db databases: [MGMT]$ mkdir -p "/home/admin/$(date +%F)_t0db" [MGMT]$ sudo mysqldump t0db --ignore-table t0db.logs --ignore-table t0db.be_command_log > "/home/admin/$(date +%F)_t0db/t0db_bkup.sql" && sudo mysqldump -d t0db logs be_command_log >> "/home/admin/$(date +%F)_t0db/ t0db_bkup.sql"
27. Save the mysql databases: [MGMT]$ mkdir -p "/home/admin/$(date +%F)_mysql" [MGMT]$ sudo mysqldump mysql > "/home/admin/$(date +%F)_mysql/mysql.sql" 28. Save the mylogin.cnf password file: [MGMT]$ sudo cp /root/.mylogin.cnf "/home/admin/$(date +%F)_t0db" 29. Check the current database: [MGMT]$ sudo mysql t0db -e "select * from netdev where hostname=' nodename '" For example: [snx11000n001]$ sudo mysql t0db -e "select * from netdev where hostname='snx11000n000'" +----+------------------------------+-------------------+-------------+------------+---------|-------------+ | id | node_id | mac_address | ip_address | network_id | if_name | hostname | | 7 | snx11000n0:Node-rack1-38U1-A | 00:1E:67:57:25:28 | 172.16.2.2 | 3 | eth0 | snx11000n0 | +----+------------------------------+-------------------+-------------+------------+---------|-------------+
30. Update the database for eth0 of the new server node using the MAC address obtained in the BIOS for IOM1 Port1 MAC Address. In the following example, new_MAC is the MAC recorded in step 19.e on page 125: [MGMT]$ sudo mysql t0db -e "update netdev set mac_address='new_MAC' where if_name='eth0' and hostname='nodename'" For example: [MGMT]$ sudo mysql t0db -e "update netdev set mac_address='00:1E:67:6B:5D:70' where if_name='eth0' and hostname='snx11000n000'" 31. Update the database for eth1 of the new server node using the MAC address obtained in the BIOS for the IOM1 Port2 MAC Address. In the following example, new_MAC is the MAC recorded in step 19.e on page 125 [MGMT]$ sudo mysql t0db -e "update netdev as a, (select * from netdev where hostname='nodename') as b set a.mac_address='new_MAC' where a.if_name='eth1' and a.node_id=b.node_id" For example: [snx11000n001]$ sudo mysql t0db -e "update netdev as a, (select * from netdev where hostname='snx11000n000') as b set a.mac_address='00:1E:67:39:D6:91' where a.if_name='eth1' and a.node_id=b.node_id" 32. Verify the database change for eth0: [MGMT]$ sudo mysql t0db -e "select * from netdev where hostname=' nodename'" For example: [snx11000n001]$ sudo mysql t0db -e "select * from netdev where hostname='snx11000n000'" +----+------------------------------+-------------------+-------------+------------+---------+---------------+ | id | node_id | mac_address | ip_address | network_id | if_name | hostname | +----+------------------------------+-------------------+-------------+------------+---------+---------------+ | 7 | snx11000n0:Node-rack1-38U1-A | 00:1E:67:6B:5D:70 | 172.16.2.2 | 3 | eth0 | snx11000n000 | +----+------------------------------+-------------------+-------------+------------+---------+---------------+
Note the difference in the MAC address in the highlighted line from the example in Step 29 on page 128. 33. Verify the database changes for eth1: [admin@n000]$ sudo mysql t0db -e "select * from netdev as a, (select * from netdev where hostname='') as b where a.if_name='eth1' and a.node_id=b.node_id" For example: [snx11000n001]$ sudo mysql t0db -e "select * from netdev as a, (select * from netdev where hostname='snx11000n000') as b where a.if_name='eth1' and a.node_id=b.node_id" +----+------------------------------+-------------------+-------------+------------+---------|-------------------+ | id | node_id | mac_address | ip_address | network_id | if_name | hostname | +----+------------------------------+-------------------+-------------+------------+---------|-------------------+ | 8 | snx11000n0:Node-rack1-38U1-A | 00:1E:67:6B:5D:71 | 169.254.0.1 | 1 | eth1 | snx11000n000-eth1 | +----+------------------------------+-------------------+-------------+------------|---------|-------------------+
34. Update Puppet on the active MGMT node to use the new MAC address:
[MGMT]$ sudo /opt/xyratex/bin/beUpdatePuppet -s -g mgmt
35. Change the STONITH settings to prevent nodes from shutting one another down:
[MGMT]$ sudo crm configure property stonith-enabled=false
36. Power on the new server node:
[MGMT]$ cscli power_manage -n replacement_nodename --power-on
37. Monitor the new server node to confirm that HA is now fully configured. Both the new server node and HA partner node should be online:
[MGMT]$ sudo ssh new_server_nodename crm_mon -1 | grep Online
38. Reconnect the Infiniband/40GbE cable to the new server node. There should no longer be any cables disconnected from the new server node.
39. Re-enable STONITH:
[MGMT]$ sudo crm configure property stonith-enabled=true
40. If the system was not fully power cycled off and on, go to Step 42 on page 130.
41. Restart Lustre:
[MGMT]$ cscli mount -f filesystem_name
For example:
mount: MGS is starting...
mount: MGS is started!
mount: No resources found on nodes snx11000n007 for "snx11000n" file system
mount: starting ssetest on snx11000n[102-103]...
mount: starting ssetest on snx11000n[104-105]...
mount: No resources found on nodes snx11000n[100-101] for "snx11000n" file system
mount: ssetest is started on snx11000n[102-103]!
mount: ssetest is started on snx11000n[104-105]!
mount: File system ssetest is mounted.
42. Fail back the resources for the new server node:
[MGMT]$ cscli failback -n new_server_nodename
43. Confirm that the failback operation completes and that the resource(s) are moved back to the new server node:
[MGMT]$ sudo ssh new_server_nodename crm_mon -1
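For reference, steps 42 and 43 might look like the following in practice. The node name snx11000n000 is reused from the earlier examples, and the grep filter is simply a convenience, not part of the documented commands.
[MGMT]$ cscli failback -n snx11000n000
# Both HA partners should appear in the Online list once the failback has completed
[MGMT]$ sudo ssh snx11000n000 crm_mon -1 | grep Online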
Replace a Quad Server MGS or MDS Node
Prerequisites
Part number: 100900601: Server, Sonexion Dual E5-2680 64GB MDS/MGS Node FDR
Time: 1 hour
Interrupt level: Interrupt (requires taking the Lustre file system offline)
Tools: ESD strap, shoes, and garment; console with monitor and keyboard (or PC with a serial COM port configured for 115.2 Kb/s, 8 data bits, no parity, and one stop bit)
About this task
One 2U quad server and either one 2U24 enclosure or one 5U84 enclosure are bundled in the MMU. Each server node hosts an MMU node: a primary MGMT node, secondary MGMT node, MGS node and MDS node. The following procedure describes the replacement of the MGS and MDS nodes, including steps to replace the failed node and return the system to normal operations. Subtasks: ●
Shut Down MGS or MDS Node and Replace
●
Set MGS/MDS Node IPMI Address and Record MAC Address
Notes and Cautions ●
Only trained service personnel should perform this procedure.
●
If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
●
Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure 1. If the location of the failed server node is known, go to Step 4 on page 133. 2. Log in to the primary MGMT node via SSH: [Client]$ ssh –l admin primary_MGMT_node 3. Do one of the following:
131
Replace a Quad Server MGS or MDS Node
●
Access CSSM and use the Health tab to identify the malfunctioning server node.
●
At the front of the rack, look for a System Status LED (amber) or a dark LED on the left and right control panels of the 2U quad server, as shown in the following figure and table.
The following figure and table show the mapping of server node names. Figure 54. Quad Server Control Panels
Table 13. System Status LED Descriptions
LED Color | Condition | Description
Green | On | System Ready/No Alarm
Green | Flashing | System Ready, but degraded: redundancy lost such as the power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber | On | Critical Alarm: Critical power modules failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber | Flashing | Non-Critical Alarm: Redundant fan failure, redundant power module failure, non-critical temperature and voltage.
- | Off | Power off: system unplugged. Power on: system power off and in standby, no prior degraded/noncritical/critical state.
132
Replace a Quad Server MGS or MDS Node
Figure 55. Quad Server: Rear Component Identification
Table 14. Quad Server Node Designations
Logical  | Physical | Function
Node 000 | Node 4   | MGMT Primary
Node 001 | Node 3   | MGMT Secondary
Node 002 | Node 2   | MGS
Node 003 | Node 1   | MDS
4. Record the BMC IP, Mask and Gateway IP addresses of the failed server node: [admin@n000]$ grep node-ipmi /etc/hosts For example, this command produces the following output: [admin@snx11000n000 ~]$ grep snx11000n002-ipmi /etc/hosts 172.16.0.103 snx11000n002-ipmi The Mask and Gateway address settings are the same for all nodes: Mask = 255.255.0.0, Gateway = 172.16.2.1 5. Check the power cords plugged into the power distribution strip. In some Sonexion configurations, these cords block access to the 2U quad server; in other configurations the 2U quad server can be freely accessed. If the cords block access to the 2U quad server, go to Step 5.a on page 134 to reposition them as necessary. If they do not block access, go to Shut Down MGS or MDS Node and Replace.
133
Replace a Quad Server MGS or MDS Node
Figure 56. Power Distribution Strip With Cables in Top Two Sockets
a. Power off the system as described in Power Off a Sonexion 1600 System via the CLI. b. Remove the power cords from the power distribution strip. c.
Re-install the power cords so that no sockets in use block removal of the failed server node in the 2U quad server. Figure 57. Power Distribution Strip with Top Two Sockets Open
d. Power on the system as described in Power On a Sonexion 1600 System. Shut Down MGS or MDS Node and Replace
134
Replace a Quad Server MGS or MDS Node
6. Fail over the failed node's resources to its HA partner node: [admin@n000]$ cscli failover -n affected_node Verify that the failover operation was successful: [admin@n000]$ ssh partner_node sudo crm_mon -1 The following is an example of a successful failover: [snx11000n000 ~]# ssh snx11000n003 sudo crm_mon -1 [sudo] password for admin: ============ Last updated: Tue Aug 5 11:07:45 2014 Last change: Tue Aug 5 11:01:45 2014 via crm_resource on snx11000n002 Stack: Heartbeat Current DC: snx11000n003 (0b53f7df-3132-4cb2-b0a4-b6fc753cfcdd) - partition with quorum Version: 1.1.6.1-6.el6-0c7312c689715e096b716419e2ebc12b57962052 2 Nodes configured, unknown expected votes 28 Resources configured. ============ Online: [ snx11000n002 snx11000n003 ] Full list of resources: snx11000n002-1-ipmi-stonith (stonith:external/ipmi): Started snx11000n002 snx11000n002-2-ipmi-stonith (stonith:external/ipmi): Started snx11000n002 snx11000n003-3-ipmi-stonith (stonith:external/ipmi): Started snx11000n003 snx11000n003-4-ipmi-stonith (stonith:external/ipmi): Started snx11000n003 Clone Set: cln-kdump-stonith [kdump-stonith] Started: [ snx11000n002 snx11000n003 ] Clone Set: cln-ssh-10-stonith [ssh-10-stonith] Started: [ snx11000n002 snx11000n003 ] Clone Set: cln-ssh-stonith [ssh-stonith] Started: [ snx11000n002 snx11000n003 ] Clone Set: cln-phydump-stonith [phydump-stonith] Started: [ snx11000n002 snx11000n003 ] Clone Set: cln-last-stonith [last-stonith] Started: [ snx11000n002 snx11000n003 ] snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003 snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002 baton (ocf::heartbeat:baton): Started snx11000n002 Clone Set: cln-diskmonitor [diskmonitor] Started: [ snx11000n003 snx11000n002 ] snx11000n003_ibstat (ocf::heartbeat:ibstat): Started snx11000n003 snx11000n002_ibstat (ocf::heartbeat:ibstat): Started snx11000n002 Resource Group: snx11000n003_md66-group snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003 snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n003 Resource Group: snx11000n003_md65-group Started snx11000n003 snx11000n003_md65-raid (ocf::heartbeat:XYRAID): snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003 snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n003 Connection to snx11000n003 closed.
The above output shows that the resources have started on the HA partner (node n003). Proceed to the next step to power off the failed server node. 7. Shut down the failed server node: [admin@n000]$ cscli power_manage -n failed_server_nodename --power-off
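An illustrative example of the step 7 command, assuming the failed node is the MGS node snx11000n002 shown in the earlier output:
[admin@n000]$ cscli power_manage -n snx11000n002 --power-off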
135
Replace a Quad Server MGS or MDS Node
If the failed server does not power down after the power-off command, there are two options to shut the node down: a. From the primary MGMT node, run: [admin@n000]$ pm -0 failed_server_nodename b. If the failed server is still powered on and the above command has failed, press and hold the power button, for at least 6 seconds, on the front panel of the failed server. c. Verify that the failed server is powered off: [admin@n000]$ pm -q For example: on: snx11000n[101-105] off: snx11000n000 unknown:
8. Apply anti-static protection devices such as a wrist strap, boots, garments or other approved methods. 9. Disconnect the cables attached to the failed server node and note where each cable attaches to the enclosure to ensure the cables can be connected properly during re-installation. IMPORTANT: Note the port that each cable is attached to, so the same cable connections can be made after the new server is installed. Refer to Sonexion 2000 Field Installation Guide in the section "Internal Cabling". 10. Remove the failed server node and install the new server node. a. Push the green latch to release the failed server node, while using the handle to pull it from the chassis. Figure 58. Removing the Server Node
b. Install the new server node by inserting it into the empty bay at the rear of the enclosure. It may be necessary to firmly push the node to fully seat it. There will be an audible click as the server node seats.
136
Replace a Quad Server MGS or MDS Node
Figure 59. Installing the Server Node
11. Connect the cables to the new server node, based on the notes made in step 9 on page 136 and information in Sonexion 2000 Field Installation Guide in the section "Internal Cabling". 12. Connect the console keyboard and monitor, and then press the power button on the front of the 2U quad server for the node that was replaced. Set MGS/MDS Node IPMI Address and Record MAC Address 13. Retrieve the BMC IP Address, Subnet Mask and Gateway IP Address obtained in step 4 on page 133, from the output of the grep node-ipmi /etc/hosts command, for the server node being replaced. a. Press F2. b. Select BIOS. c.
Select the Server Management tab.
d. Enter the data in the Baseboard LAN configuration section.
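For example, using the addresses recorded in step 4 for the example node snx11000n002 (illustrative values; enter the addresses actually recorded for the node being replaced):
IP Source   : Static
IP Address  : 172.16.0.103
Subnet Mask : 255.255.0.0
Gateway IP  : 172.16.2.1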
137
Replace a Quad Server MGS or MDS Node
Figure 60. Server Management Screen
14. On the new server node, record the MAC address from BIOS. a. Press F2. b. Select the Advanced tab. c.
Select PCI Configuration.
d. Select NIC Configuration. e. Record the MAC address for Onboard NIC1 Port1 MAC Address and Onboard NIC1 Port2 MAC Address. f.
Press F10 to exit BIOS.
138
Replace a Quad Server MGS or MDS Node
Figure 61. Advanced Screen: NIC Configuration
15. Press and hold the power button on the new server node until it powers off. IMPORTANT: The new server node must not be left powered on at this point. 16. If not already done, log in to the primary MGMT node: [Client]$ ssh –l admin primary_MGMT_node 17. Check the current database: [admin@n000]$ sudo mysql t0db -e "select * from netdev where hostname='nodename'" For example: [snx11000n000]$ sudo mysql t0db -e "select * from netdev where hostname='snx11000n002'" +----+-----------------+-------------------+------------+------------+---------+----------+ | id | node_id | mac_address | ip_address | network_id | if_name | hostname | +----+-----------------+-------------------+------------+------------+---------+----------+ | 32 | snx11000: | 00:1E:67:55:B2:9A | 172.16.3.5 | 2 | eth0 | | | | Node-1rC1-42U-C | | | | | | +----+-----------------+-------------------+------------+------------+---------+----------+
139
Replace a Quad Server MGS or MDS Node
18. Update the database for eth0 of the new server node using the MAC address obtained in the BIOS for Onboard NIC1 Port1 MAC Address. In the following example, new_MAC is the address recorded in step 14.e on page 138: [admin@n000]$ sudo mysql t0db -e "update netdev set mac_address='new_MAC' where if_name='eth0' and hostname='nodename'" For example: [admin@n000]$ sudo mysql t0db -e "update netdev set mac_address='00:1E:67:39:D6:90' where if_name='eth0' and hostname='snx11000n002'" 19. Update the database for eth1 of the new server node using the MAC address obtained in the BIOS for Onboard NIC1 Port2 MAC Address. In the following example, new_MAC is the address recorded in step 14.e on page 138: [admin@n000]$ sudo mysql t0db -e "update netdev as a, (select * from netdev where hostname='nodename') as b set a.mac_address='new_MAC' where a.if_name='eth1' and a.node_id=b.node_id" For example: [admin@n000]$ sudo mysql t0db -e "update netdev as a, (select * from netdev where hostname='snx11000n002') as b set a.mac_address='00:1E:67:39:D6:91' where a.if_name='eth1' and a.node_id=b.node_id" 20. Verify the database change for eth0: [admin@n000]$ sudo mysql t0db -e "select * from netdev where hostname='nodename'" For example: [snx11000n000]$ sudo mysql t0db -e "select * from netdev where hostname='snx11000n002'"
+----+-----------------+-------------------+-------------+------------+---------+--------------+
| id | node_id         | mac_address       | ip_address  | network_id | if_name | hostname     |
+----+-----------------+-------------------+-------------+------------+---------+--------------+
| 32 | snx11000:       | 00:1E:67:39:D6:90 | 172.16.3.5  |          2 | eth0    | snx11000n002 |
|    | Node-R1C1-42U-C |                   |             |            |         |              |
+----+-----------------+-------------------+-------------+------------+---------+--------------+
Note the new MAC address for eth0 in the output, compared with the example in step 17 on page 139. 21. Verify the database change for eth1: [admin@n000]$ sudo mysql t0db -e "select * from netdev as a, (select * from netdev where hostname='nodename') as b where a.if_name='eth1' and a.node_id=b.node_id" For example: [snx11000n000]$ sudo mysql t0db -e "select * from netdev as a, (select * from netdev where hostname='snx11000n002') as b where a.if_name='eth1' and a.node_id=b.node_id"
+----+-----------------+-------------------+------------+------------+---------+----------+
| id | node_id         | mac_address       | ip_address | network_id | if_name | hostname |
+----+-----------------+-------------------+------------+------------+---------+----------+
| 33 | snx11000:       | 00:1E:67:39:D6:91 | NULL       | NULL       | eth1    | NULL     |
|    | Node-1rC1-42U-C |                   |            |            |         |          |
+----+-----------------+-------------------+------------+------------+---------+----------+
Note the new MAC address for eth1 in the output, compared with the example in step 17 on page 139.
140
Replace a Quad Server MGS or MDS Node
22. Delete the tftpboot files for the new server node: [admin@n000]$ sudo ssh nfsserv "rm -rf /tftpboot/nodes/affected_nodename" For example: [snx11000n000 ~]$ sudo ssh nfsserv "rm -rf /tftpboot/nodes/snx11000n002" 23. Update Puppet on the MGMT nodes to use the new MAC address: [admin@n000]$ sudo /opt/xyratex/bin/beUpdatePuppet -s -g mgmt 24. Log in to the active MGS/MDS node: [admin@n000]$ ssh active_MDS/MGS_node 25. Change the STONITH settings to prevent nodes from shutting one another down: [MDS/MGS]$ sudo crm configure property stonith-enabled=false 26. Log out from the active MDS/MGS node: [MDS/MGS]$ exit 27. Power on the new server node: [admin@n000]$ cscli power_manage -n new_server_nodename --power-on 28. Monitor the new server node to confirm that HA is now fully configured. Both the new server node and its HA partner node should be Online: [admin@n000]$ sudo ssh replacement_nodename crm_mon -1 | grep Online 29. Log in to the active MGS/MDS node: [admin@n000]$ ssh active_MDS/MGS_node 30. Re-enable STONITH: [MDS/MGS]$ sudo crm configure property stonith-enabled=true 31. Log out from the active MDS/MGS node: [MDS/MGS]$ exit 32. If the system was not fully power cycled off and on, go to step 34 on page 142.
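An illustrative example of the step 28 check, assuming the example nodes snx11000n002 and snx11000n003 used earlier in this procedure; both nodes should appear in the Online line:
[admin@n000]$ sudo ssh snx11000n002 crm_mon -1 | grep Online
Online: [ snx11000n002 snx11000n003 ]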
141
Replace a Quad Server MGS or MDS Node
33. Restart Lustre: [admin@n000]$ cscli mount -f filesystem_name For example:
mount: MGS is starting...
mount: MGS is started!
mount: No resources found on nodes snx11000n07 for "snx11000n" file system
mount: starting snx11000n on snx11000n[102-103]...
mount: starting snx11000n on snx11000n[104-105]...
mount: No resources found on nodes snx11000n[100-101] for "snx11000n" file system
mount: ssetest is started on snx11000n[102-103]!
mount: ssetest is started on snx11000n[104-105]!
mount: File system snx11000n is mounted.
34. Fail back the resources for the new server node: [admin@n000]$ cscli failback -n new_server_nodename 35. Confirm that the operations are complete and that the resource(s) have moved back to the new server node: [admin@n000]$ sudo ssh new_server_nodename crm_mon -1
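For step 35, an illustrative check, assuming the replaced node is snx11000n002 as in the earlier examples. In the full crm_mon -1 output, both nodes should appear in the Online line and the resources that belong to the replaced node should be reported as Started on it, for example (output trimmed to the relevant lines):
[admin@n000]$ sudo ssh snx11000n002 crm_mon -1
Online: [ snx11000n002 snx11000n003 ]
snx11000n002_ibstat (ocf::heartbeat:ibstat): Started snx11000n002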
142
Replace a Quad Server MGMT Node Fan Module
Replace a Quad Server MGMT Node Fan Module Prerequisites ●
Part number: 100927200:
Fan FRU, 2U Quad Server
●
Time: 1 hour
●
Interrupt level: Interrupt (requires disconnecting the Lustre clients from the filesystem)
●
Tools: ○
ESD strap, shoes, garment, or other approved methods
○
Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits, no parity, and 1 stop bit)
About this task Use this procedure to remove and replace a failed fan module in the MMU's 2U quad server (Intel) in the field. This procedure includes steps to remove and replace the fan module and return the Sonexion system to normal operation. The MMU's server nodes host the primary and secondary MGMT nodes, MGS and MDS nodes. The system's High Availability architecture provides that if one of the server nodes goes down, its resources migrate to its HA partner node so Sonexion operations continue without interruption. One 2U quad server and either one 2U24 enclosure or one 5U84 enclosure are bundled in the MMU. The 2U quad server contains four server nodes, two power supply units (PSUs), cooling fans and disk drives. Each server node contains three fan modules. Notes and Cautions ●
Only trained service personnel should perform this procedure.
●
If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
●
Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure 1. If the location of the failed fan module is not known, do the following: a. Log in to the primary MGMT node via SSH. [Client]$ ssh –l admin primary_MGMT_node b. Access the CSSM GUI and use the Health tab to identify the faulty server. OR
143
Replace a Quad Server MGMT Node Fan Module
c.
To physically locate the failed server node, go to the front of the rack and look for a System Status LED (amber) or a dark LED on the left and right control panels of the 2U quad server, as shown in the following figure and table.
The following figure and table show the mapping of server node names. Figure 62. Quad Server Control Panels
Table 15. System Status LED Descriptions
LED Color | Condition | Description
Green | On | System Ready/No Alarm
Green | Flashing | System Ready, but degraded: redundancy lost such as the power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber | On | Critical Alarm: Critical power modules failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber | Flashing | Non-Critical Alarm: Redundant fan failure, redundant power module failure, non-critical temperature and voltage.
- | Off | Power off: system unplugged. Power on: system power off and in standby, no prior degraded/noncritical/critical state.
From the rear of the rack, physical and logical node definitions are shown in the following figure.
144
Replace a Quad Server MGMT Node Fan Module
Figure 63. Quad Server: Rear Component Identification
The server node names are mapped as defined in the following table.
Table 16. Quad Server Node Designations
Logical  | Physical | Function
Node 000 | Node 4   | MGMT Primary
Node 001 | Node 3   | MGMT Secondary
Node 002 | Node 2   | MGS
Node 003 | Node 1   | MDS
2. Power off the system as described in Power Off Sonexion 2000. 3. Check the power cords plugged into the power distribution strip. In some configurations, these cords can block access to the 2U quad server, while, in other configurations, the 2U quad server can be freely accessed. If the cords do not block access to the 2U quad server, proceed to step 4 on page 146. If they block access as shown in the following figure, perform the following substeps.
145
Replace a Quad Server MGMT Node Fan Module
Figure 64. Power Distribution Strip With Cables in Top Two Sockets
a. Remove the power cords from the power distribution strip. b. As shown in the following figure, reinstall the power cords in the power distribution strip so that no sockets in use block removal of the affected server node (containing the failed fan module) in the 2U quad server. Figure 65. Power Distribution Strip with Top Two Sockets Open
4. Apply anti-static protection devices such as a wrist strap, boots, garments or other approved methods. 5. Unplug the SAS cables attached to the MDS server (physical node 1, lower-right), as they obstruct access to the MGMT nodes.
146
Replace a Quad Server MGMT Node Fan Module
6. Push the green latch to release the server node, while using the handle to pull it from the chassis, as shown in the following figure. Figure 66. Removing the Server Node
7. Place the server node on a sturdy work surface. 8. Disconnect the fan module cable from the Node Power Docking (NPD) board. 9. Remove the fan module from the dock, as shown in the following figure. Figure 67. Removing the Server Node Fan Module
10. Place the new fan module into the dock, as shown in the following figure.
147
Replace a Quad Server MGMT Node Fan Module
Figure 68. Installing the Server Node Fan Module
11. Reinstall the server node by inserting the module into the empty bay at the rear of the enclosure, as shown in the following figure. It may be necessary to firmly push the module to fully insert it into the bay. There will be an audible click when the server node seats. Figure 69. Install the Server Node
12. Connect the SAS cables back to the MDS server (physical node 1). 13. Power on the system as described in Power On Sonexion 2000. 14. To check that the server node is operational, verify that the System Status LED is green or run: [admin@n000]$ sudo crm_mon -1
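An illustrative example of the step 14 check; the hostnames assume the MGMT nodes snx11000n000 and snx11000n001 of the example system, grep Online is added here only to trim the output, and both nodes should be listed as Online:
[admin@n000]$ sudo crm_mon -1 | grep Online
Online: [ snx11000n000 snx11000n001 ]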
148
Replace a Quad Server MGS or MDS Fan Module
Replace a Quad Server MGS or MDS Fan Module Prerequisites ●
Part number: 100927200: Fan FRU, 2U Quad Server
●
Time: 1 hour
●
Interrupt level:
●
○
Failover except in the case below (can be applied to a live system with no service interruption, but requires failover/failback)
○
Interrupt, if the cords connected to the power distribution strip require reconfiguring (requires taking the Lustre file system offline)
Tools: ○
ESD strap, shoes, garment, or other approved methods
○
Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, eight data bits, no parity, and one stop bit)
About this task
Use this procedure, only for Sonexion 2000 1.5 pre SU-007 systems (see Important, below), to remove and replace a failed fan module in an MGS or MDS server node of the MMU's 2U quad server. IMPORTANT: The procedure in this topic is intended for use with Sonexion 2000 1.5.0 (pre SU-007) systems only. For Sonexion 1.5.0 SU-007 and later systems, do not use this topic to replace failed hardware. Instead, field personnel should log in to the Sonexion service console, which provides step-by-step instructions to replace the failed part. Follow the steps below to access the service console:
1. Cable a laptop to any available port on any LMN switch (located at the top of the rack). 2. Log in to the service console and follow the procedure to remove and replace the failed part. To log in, navigate to the service console (http://service:8080). If that URL is not active, then log in to port 8080 of the IP address of the currently active MGMT node (MGMT0): http://IP_address:8080 where IP_address is the IP address of the currently active (primary) MGMT node. 3. Enter the standard service credentials. Subtasks: ●
Replace MGS-MDS Node Fan Module
149
Replace a Quad Server MGS or MDS Fan Module
One 2U quad server and either one 2U24 enclosure or one 5U84 enclosure are bundled in the MMU. The 2U quad server contains four server nodes, two power supply units (PSUs), cooling fans and disk drives. Each server node contains three fan modules. This procedure can be applied to a live system with no service interruption, but requires failover/failback. Notes and Cautions ●
Only trained service personnel should perform this procedure.
●
If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
●
Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure 1. If the location of the failed server fan module is not known, do one of the following: ●
Access CSSM and use the Health tab to identify the malfunctioning server node.
●
At the front of the rack, look for a System Status LED (amber) or a dark LED on the left and right control panels of the 2U quad server, as shown in the following figure and table.
The following figure and table show the mapping of server node names. Figure 70. Quad Server Control Panels
Table 17. System Status LED Descriptions
LED Color | Condition | Description
Green | On | System Ready/No Alarm
Green | Flashing | System Ready, but degraded: redundancy lost such as the power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber | On | Critical Alarm: Critical power modules failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber | Flashing | Non-Critical Alarm: Redundant fan failure, redundant power module failure, non-critical temperature and voltage.
- | Off | Power off: system unplugged. Power on: system power off and in standby, no prior degraded/noncritical/critical state.
Figure 71. Quad Server: Rear Component Identification
Table 18. Quad Server Node Designations
Logical  | Physical | Function
Node 000 | Node 4   | MGMT Primary
Node 001 | Node 3   | MGMT Secondary
Node 002 | Node 2   | MGS
Node 003 | Node 1   | MDS
2. Check the power cords plugged into the power distribution strip. In some configurations, these cords block access to the 2U quad server, while in other configurations the 2U quad server can be freely accessed. 3. If the cords do not block access to the 2U quad server, proceed to step 4 on page 152. If they block access, perform the following steps to reposition them as necessary : a. Power off the system as described in Power Off Sonexion 2000. b. Remove the power cords from the power distribution strip. c.
Install the power cords so that the first two sockets in the power distribution strip are unused, as shown in the following figure.
151
Replace a Quad Server MGS or MDS Fan Module
Figure 72. Power Distribution Strip with Top Two Sockets Open
d. Power on the system. 4. Hand off the resources for the affected server node (containing the failed fan module) to its HA partner node, unless a failover has already occurred, in which case the resources should have been handed off to the HA partner node. a. Log in to the MGMT node, whichever has not failed. b. Fail over the resources of the failed server node to its HA partner node (MGS to MDS or MDS to MGS): $ cscli failover -c cluster_name -n node_name where node_name is the name of the affected server node. 5. Make certain that failover has occurred: $ sudo pdsh –g mgs crm_mon –l The preceding command produces the following output. failover: Operation performed successfully. [root@snx11000n000 ~]# pdsh -g mgs crm_mon -1 snx11000n002: Last updated: Wed Mar 27 12:51:06 2013 snx11000n002: Last change: Wed Mar 27 12:50:55 2013 via cibadmin on snx11000n002 snx11000n002: Stack: Heartbeat snx11000n002: Current DC: snx11000n002 (b5e13092-ed1b-48ae-aeb3-51e3d83b7e5f) - partition with quorum snx11000n002: Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052 snx11000n002: 2 Nodes configured, unknown expected votes snx11000n002: 14 Resources configured. snx11000n002: Online: [ snx11000n002 snx11000n003 ] snx11000n002: snx11000n003-stonith (stonith:external/ipmi): Started snx11000n003 snx11000n002: snx11000n002-stonith (stonith:external/ipmi): Started snx11000n002 snx11000n002: snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002: snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002: baton (ocf::heartbeat:baton): Started snx11000n002 snx11000n002: snx11000n003_ibstat (ocf::heartbeat:ibstat): Started snx11000n003 snx11000n002: snx11000n002_ibstat (ocf::heartbeat:ibstat): Started snx11000n002 snx11000n002: Resource Group: snx11000n003_md66-group snx11000n002: snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n002: snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n003
152
Replace a Quad Server MGS or MDS Fan Module
snx11000n002: snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003
snx11000n002: snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n003
snx11000n002: Resource Group: snx11000n003_md65-group
snx11000n002: snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n003
snx11000n002: snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003
snx11000n002: snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n003
As the output above shows, the resources are all running on the HA partner node (snx11000n003). 6. Shut down the affected server node: $ sudo pm -0 node_name where node_name is the name of the affected server node. For example: [admin@snx11000n000 ~]$ sudo pm -0 snx11000n002 7. Check that the power off command was successful: $ sudo pm -q This command produces the following output: [root@snx11000n000 ~]# pm -q on: snx11000n[101-105] off: snx11000n002 unknown: 8. Apply anti-static protection devices such as a wrist strap, boots, garments or other approved methods. Replace MGS-MDS Node Fan Module In the following steps, note or label all cables that are removed: Ethernet, IB and SAS. All cables must be re-installed in the correct positions. 9. Push the green latch to release the server node, while using the handle to pull it from the chassis, as shown in the following figure. Figure 73. Removing the Server Node
153
Replace a Quad Server MGS or MDS Fan Module
10. Place the server node on a sturdy work surface. 11. Disconnect the fan module cable from the Node Power Docking (NPD) board. 12. Remove the fan module from the dock, as shown in the following figure. Figure 74. Removing the Server Node Fan Module
13. Place the new fan module into the dock, as shown in the following figure. Figure 75. Installing the Server Node Fan Module
14. Install the server node by inserting the module into the empty bay at the rear of the enclosure, as shown in the following figure. It may be necessary to firmly push the module to fully insert it into the bay. There will be an audible click when the server node seats.
154
Replace a Quad Server MGS or MDS Fan Module
Figure 76. Installing the Server Node
15. Power up the new server node: $ sudo pm -1 node_name where node_name is the name of the new server node. For example: [admin@snx11000n001 ~]$ sudo pm -1 snx11000n002 16. Check that the power up command was successful: $ sudo pm –q Which produces the following output: [admin@snx11000n001 ~]$ sudo pm –q on: snx11000n[000-105] off: unknown: 17. Wait 5 minutes for the node to boot, and hand back the resources to the new server node from its HA partner node, as follows. a. Log in to whichever node received the handover in step 4 on page 152 (MGS or MDS). b. Fail back the server node resources to the new server node: $ cscli failback -c cluster_name -n node_name where node_name is name of the new server node. For example, if the MGS node (node 002) failed the failback command would be: [admin@snx11000n003 ~]$ sudo cscli failback -c snx11000n000 -n snx11000n002
155
Replace a Quad Server MGS or MDS Fan Module
18. Make certain that failover has occurred: $ sudo pdsh –g mgs crm_mon –l 19. This command produces the following output. root@snx11000n000 ~]# /opt/xyratex/bin/cscli failback -c snx11000n0 –n snx11000n002 failback: Operation performed successfully. [root@snx11000n000 ~]# pdsh -g mgs crm_mon -1 snx11000n002: Last updated: Wed Mar 27 13:14:08 2013 snx11000n002: Last change: Wed Mar 27 13:06:12 2013 via cibadmin on snx11000n002 snx11000n002: Stack: Heartbeat snx11000n002: Current DC: snx11000n003 (49624f05-cdeb-49f6-a2da-8de7311192d0) - partition with quorum snx11000n002: Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052 snx11000n002: 2 Nodes configured, unknown expected votes snx11000n002: 14 Resources configured. snx11000n002: Online: [ snx11000n002 snx11000n003 ] snx11000n002: snx11000n003-stonith (stonith:external/ipmi): Started snx11000n003 snx11000n002: snx11000n002-stonith (stonith:external/ipmi): Started snx11000n002 snx11000n002: snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002: snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002: baton (ocf::heartbeat:baton): Started snx11000n002 snx11000n002: snx11000n003_ibstat (ocf::heartbeat:ibstat): Started snx11000n003 snx11000n002: snx11000n002_ibstat (ocf::heartbeat:ibstat): Started snx11000n002 snx11000n002: Resource Group: snx11000n003_md66-group snx11000n002: snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n002: snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n002: snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003 snx11000n002: snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n003 snx11000n002: Resource Group: snx11000n003_md65-group snx11000n002: snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002 snx11000n002: snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002 snx11000n002: snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
156
Replace a Quad Server Power Supply Unit
Replace a Quad Server Power Supply Unit Prerequisites ●
Part numbers: ○
100900700: Power Supply, Quad Server 1200W
○
101037800: Power Supply, Quad Server 1600W
●
Time: 30 minutes, or 1 hour if both PSUs need replacement.
●
Interrupt level:
●
○
Live, for replacing a 1600W PSU with another 1600W (procedure can be applied to a live system with no service interruption)
○
Interrupt, for replacing both 1200W PSUs with 1600W PSUs (requires interrupting system and disconnecting Lustre clients from the filesystem)
Tools: ESD strap
About this task
One 2U Quad Server and either a 2U24 enclosure or 5U84 EBOD enclosure are bundled in the MMU. The 2U Quad Server contains two power supply units (PSUs) located at the rear of the server in the center two bays. IMPORTANT: The Quad server must not have PSUs of different wattages (1200W and 1600W). The PSU’s wattage must be known before replacing. Observe the following constraints. ●
●
For the first two cases, execute Replace PSU With One of Same Wattage on page 158: ○
A 1200W PSU must be replaced with a 1200W PSU.
○
A 1600W PSU must be replaced with a 1600W PSU.
For the following case, execute Replace Both 1200W PSUs with 1600W PSUs on page 159: ○
If a 1200W PSU fails and no 1200W PSU is available, both PSUs in the server must be replaced with 1600W parts. Use the procedure in this section to remove and replace both PSUs. In this case, the procedure is classed as Interrupt service level.
Notes and Cautions ●
Only trained service personnel should perform this procedure.
●
If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
●
Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
157
Replace a Quad Server Power Supply Unit
Replace PSU With One of Same Wattage About this task
Use this procedure to remove and replace a failed PSU in a 2U Quad Server in the MMU. This procedure includes steps to remove and replace the failed PSU and return the system to normal operation.
Procedure 1. If the location of the failed PSU is not known, from the rear of the server, look for a PSU with an illuminated Status LED (amber). In an operational PSU, the Status LED is illuminated green. Figure 77. Quad Server PSU LEDs
2. Disconnect the power cord attached to the failed PSU. 3. Push the green latch to release the PSU, as shown in the following figure. Figure 78. Remove PSU from Quad Server
4. Using the handle, remove the PSU by carefully pulling it out of the power supply bay.
158
Replace a Quad Server Power Supply Unit
5. Verify that the part number on the label of the failed power supply matches the replacement power supply: ●
100900700: Power Supply, Quad Server 1200W
●
101037800: Power Supply, Quad Server 1600W
6. Insert the new PSU into the power supply bay. It may be necessary to firmly push the PSU to fully seat it. There will be an audible click as the PSU seats. Figure 79. Inserting the Power Supply Unit
7. Connect the power cord to the new PSU. 8. Verify that the Status LED is lit green. The green LED indicates that the new PSU is operational.
Replace Both 1200W PSUs with 1600W PSUs About this task
Use this procedure to remove both failed 1200W PSUs and replace them with 1600W PSUs in a 2U Quad Server in the MMU. This procedure includes steps to remove and replace the failed PSU and return the system to normal operation.
Procedure 1. If the location of the failed PSU is not known, from the rear of the server, look for a PSU with an illuminated Status LED (amber). In an operational PSU, the Status LED is illuminated green.
159
Replace a Quad Server Power Supply Unit
Figure 80. PSU LEDs
2. Log in to the primary MGMT node: [Client]$ ssh -l admin primary_MGMT_node 3. Unmount the Lustre file system: [admin@n000]$ cscli unmount -f fsname 4. Shut down all the nodes hosted on the 2U quad server: [admin@n000]$ pm -0 nodes Where nodes is the range of node names for the four server nodes (primary and secondary MGMT, MGS and MDS nodes). For example: [admin@snx11000n000]$ pm -0 snx11000n[000-003] 5. Once the primary and secondary MGMT, MGS and MDS nodes shut down (power button LEDs on the control panels at the front of the 2U quad server are not lit), power off the attached 2U24 (or 5U84) enclosure by turning off the power switches on the PSUs/PCMs at the back of the rack. ●
In a 2U24 enclosure, the power switches are located on the PCMs.
●
In a 5U84 EBOD enclosure, the power switches are located on the PSUs.
6. Remove both PSUs in the 2U quad server. a. Disconnect the power cord attached to one PSU. b. Push the green latch to release the PSU, as shown in the following figure.
160
Replace a Quad Server Power Supply Unit
Figure 81. Remove PSU from Quad Server
c.
Using the handle, remove the power supply module by carefully pulling it out of the power supply bay.
d. Repeat for the second PSU. 7. Install the two 1600W PSUs. a. Insert one 1600W PSU into the power supply bay, as shown in the following figure. It may be necessary to firmly push the unit to fully seat the PSU. There will be an audible click as the PSU seats. Figure 82. Inserting the Power Supply Unit
b. Connect the power cord to the replacement PSU. c.
Verify that the Status LED is lit green. This indicates that the replacement PSU is operational.
d. Repeat for the second PSU.
161
Replace a Quad Server Power Supply Unit
8. Connect the power cords to the PSUs. 9. Power on the system as described in Power On Sonexion 2000. 10. Start the Lustre resources. ●
For working via CLI from the primary MGMT node, run: [admin@n000]$ cscli mount -f fsname
●
For working from the CSSM GUI, perform these steps: 1. Click the Node Control tab. 2. Select all nodes in the file system. 3. Click Selected Nodes. 4. Click Start Lustre in the drop-down menu.
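For the CLI option in step 10, an illustrative command, assuming the file system name ssetest used in earlier examples in this document (substitute the site's actual file system name):
[admin@n000]$ cscli mount -f ssetest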
162
Replace a Quad Server Chassis
Replace a Quad Server Chassis Prerequisites Part number
100886701: Server, Sonexion 2U Quad MDS + MGS FDR, FRU- fully populated The above part number applies to a complete quad server including internal components. If only the chassis is defective, this complete assembly is ordered and the four server modules and disk drives are swapped between the two chassis, so that the original servers and disks remain with the system. Time 2 hours Interrupt level Interrupt (requires taking the Lustre filesystem offline.) Tools ●
ESD strap, boots, garment, or other approved methods
●
#1 and #2 Phillips screwdriver
●
Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits, no parity and 1 stop bit)
About this task
The MMU component consists of a 2U quad server cabled to a 2U24 (or 5U84) storage enclosure. The 2U quad server contains four server nodes, two PSUs, and disk drives. The MMU's server nodes host the primary and secondary MGMT nodes, MGS and MDS nodes. The system's High Availability architecture provides that if one of the server nodes goes down, its resources migrate to its HA partner node so Sonexion operations continue without interruption. In this procedure, only the defective chassis is replaced; all other components are re-used in the new MMU chassis. The Lustre file system will be unavailable during this procedure. Disconnect all clients before continuing. Notes and Cautions ●
Only trained service personnel should perform this procedure.
●
If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
●
Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Internal components need to be removed from the replacement chassis because only the chassis is to be replaced.
163
Replace a Quad Server Chassis
Procedure 1. If the location of the faulty quad server is not known, do one of the following: ●
Access the CSSM GUI and use the Health tab to identify the faulty server, or
●
At the front of the rack(s), look for the 2U quad server with its System Status LED on (amber) or dark LEDs on the left and right control panels (see the following figure). Server node LED descriptions are given in the table after the figure. The physical and logical layout of server nodes in a 2U quad server, as viewed from the back of the rack(s), is shown in the figure and table that follow.
Figure 83. Quad Server Control Panel
Table 19. System Status LED Descriptions
LED Color | Condition | Description
Green | On | System Ready/No Alarm
Green | Flashing | System Ready, but degraded: redundancy lost such as the power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber | On | Critical Alarm: Critical power modules failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber | Flashing | Non-Critical Alarm: Redundant fan failure, redundant power module failure, non-critical temperature and voltage.
- | Off | Power off: system unplugged. Power on: system power off and in standby, no prior degraded/noncritical/critical state.
164
Replace a Quad Server Chassis
Figure 84. Quad Server Rear Component Identification
Table 20. Quad Server Node Designations
Logical  | Physical | Function
Node 000 | Node 4   | MGMT Primary
Node 001 | Node 3   | MGMT Secondary
Node 002 | Node 2   | MGS
Node 003 | Node 1   | MDS
2. Disconnect all clients. 3. Log in to the primary MGMT server node as admin: [Client]$ ssh –l admin primary_MGMT_node 4. Unmount the Lustre file system: [admin@n000]$ cscli unmount -f fsname Where fsname is the file system name. 5. Power off the system as described in Power Off Sonexion 2000. 6. Once all of the nodes are shut down (power button LEDs on the control panels at the front of the 2U server are not lit), power off the 2U24 (or 5U84) MMU storage enclosure by turning off the power switches on the PSUs/PCMs (at the back of the rack). ●
In a 2U24 enclosure, the power switches are located on the PCMs.
●
In a 5U84 EBOD enclosure, the power switches are located on the PSU.
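For step 4, an illustrative unmount command, assuming the file system name ssetest used in earlier examples in this document (substitute the site's actual file system name):
[admin@n000]$ cscli unmount -f ssetest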
7. Disconnect the power cord from each PSU. 8. Disconnect the cables attached to each of the four server nodes. Make a note of each cable location and tag each cable. Cable diagrams are provided in the Sonexion 2000 Quick Start Guide included with the system. 9. Remove two PSUs from the 2U quad server:
165
Replace a Quad Server Chassis
a. Push the green latch to release the PSU, as shown in the following figure. Figure 85. Remove PSU from Quad Server
b. Using the handle, remove the power supply module by carefully pulling it out of the power supply cage. Store the PSU in a safe location. c.
Repeat for the second PSU.
10. Remove the four server nodes from the quad server: a. Note the location of each server node when it is removed from the chassis; they must be re-installed in exactly the same locations. b. For each server node, push the green latch to release the node, while using the handle to pull it from the chassis. c.
Place each node in a safe location, and repeat until all four server nodes have been removed. Figure 86. Remove the Server Node
11. Remove disk drives from the quad server:
166
Replace a Quad Server Chassis
a. Note the location (slot) of each disk drive when it is removed from the chassis; they must be re-installed in exactly the same position. b. Remove each disk drive by pressing the green button and opening the lever. Pull to remove the disk. Store the drives in a safe location. Figure 87. Remove a Disk
Replace Chassis and Install Components 12. Loosen the two fasteners securing the server chassis to the front of the rack. 13. With a second person, slide the 2U quad server out of the rack and depress the two safety locks on the rails to completely remove the chassis. 14. Place the server on a sturdy bench. 15. With an assistant, slide the server chassis into the rack and depress the two safety locks on the rails to completely seat the chassis. 16. Secure the server chassis to the front of the rack cabinet with the two fasteners. 17. Refer to notes from step 11 on page 166 to install the disk drives in the correct slots. Use the following steps for each drive (shown in the following figure): a. For each drive, verify that the disk drive handle latch is released and in the open position, then slide the drive carrier into the enclosure until it stops. b. Seat the disk drive by pushing up the handle latch and rotating it to the closed position. There will be an audible click as the handle latch engages.
167
Replace a Quad Server Chassis
Figure 88. Install the Disk Drive in the Quad Server
18. Refer to notes from step 10 on page 166 to ensure each server node is installed in the correct location. Install each server node by inserting the server into the empty bay at the rear of the enclosure, as shown in the following figure. A firm push on the module may be needed to fully seat it into the bay. There will be an audible click as the server seats. Figure 89. Install the Server Node
19. Insert the PSU into the empty power supply cage, as shown in the following figure. A firm push may be needed on the module to fully insert it into the cage. There will be an audible click as the PSU seats.
168
Replace a Quad Server Chassis
Figure 90. Insert the Power Supply Unit
20. Connect the power cords to the PSUs. 21. Connect all the cables to the server nodes. Be sure to reconnect the cables to their original ports. Refer to notes from step 8 on page 165 and to the Sonexion 2000 Quick Start Guide for cabling information. 22. Power on the 2U24 or 5U84 MMU. 23. Power on the Sonexion system, as described in the procedure for your system and software release. 24. Log in to the primary MGMT server node: [Client]$ ssh –l admin primary_MGMT_node 25. Start Lustre resources using CSCLI or the GUI: ●
From the primary MGMT node, run this command: $ cscli mount -c cluster_name -f fsname
●
If working from CSSM, perform these steps: 1. Click the Node Control tab. 2. Select all nodes in the file system. 3. Click Selected Nodes. 4. Click Start Lustre in the drop-down menu.
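For the CLI option in step 25, an illustrative command; the cluster name snx11000n0 and file system name ssetest are taken from earlier examples in this document and stand in for the site's actual names:
$ cscli mount -c snx11000n0 -f ssetest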
When all resources are running, the procedure is complete.
169
Replace a CNG Server Module
Replace a CNG Server Module Prerequisites Part number Time 2 hours Interrupt level ●
If the site has configured Round-Robin DNS for the CNG nodes, as recommended: Failover (can be applied to a live system with no service interruption, but requires failover/failback)
●
If the site uses static client assignments: Enterprise Interrupt (requires disconnecting enterprise clients from the filesystem)
Tools ●
ESD strap, boots, garment or other approved methods
●
Console with monitor (DB15) and keyboard (USB)
●
Video cable
About this task
The CIFS/NFS Gateway (CNG) is an optional component that can be added to Sonexion systems. The CNG component shares the Lustre file system to enterprise clients (Windows, Linux, Macintosh, Solaris, etc.) using enterprise NAS protocols (CIFS2 or NFS). A CNG unit consists of a 2U chassis that contains two or four server modules, each with built-in cooling fans, and two PSUs, which also contain built-in fans. The two PSUs are located at the rear of the server in the center two bays. Four-Node and Two-Node CNG Servers The physical node numbering on the four-node CNG is shown in the following figure.
170
Replace a CNG Server Module
Figure 91. CNG Four-Node Version Rear View
For the two-node variety, nodes 1 and 2 are connected identically to nodes 1 and 2 in the four-node configuration and nodes 3 and 4 are replaced with blanking plates. Distinguishing the CNG from the MMU The MMU includes a 2U chassis that also contains four server modules. However, the MMU server modules are not interchangeable with those in the CNG, and care should be taken not to confuse them. To locate CNG components and distinguish from the MMU, examine the cabling attached to the 2U server components at the back of the rack. If any components have SAS cables connected, the 2U enclosure is the MMU. If there are no SAS cables, the 2U enclosure is the CNG. This procedure was written for a CNG configured for four servers, but is similar to the procedure for a two-node chassis. CAUTION: Do not leave any enclosure bay empty.
Procedure 1. If the location of the failed server node is not known, do one of the following: ●
Access CSSM and use the Health tab to identify the faulty node by its hostname, and to determine which chassis contains the corresponding server node and which other nodes share the chassis, or
●
Locate the CNG components, and look for the 2U enclosure with its System Status LED on (amber) or dark LEDs on the left and right control panels. (See the following figure.) To locate the CNG components, examine the cabling attached to the 2U server components at the back of the rack. If any of the components have SAS cables connected then the 2U enclosure is the MMU. If there are no SAS cables, the 2U enclosure is a CNG.
171
Replace a CNG Server Module
Figure 92. Quad Server Control Panels
Table 21. Server Node System Status LED Descriptions
LED Color | Condition | Description
Green | On | System Ready/No Alarm
Green | Flashing | System Ready, but degraded: redundancy lost such as the power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber | On | Critical Alarm: Critical power modules failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber | Flashing | Non-Critical Alarm: Redundant fan failure, redundant power module failure, non-critical temperature and voltage.
- | Off | Power off: system unplugged.
2. Log in to the active MGMT server node: [Client]$ ssh -l admin active_MGMT_node 3. Record the hostname of the affected node (hosted on the failed server module): [admin@n000]$ nodeattr -n exporter For example: [admin@snx11000n000 ~]$ nodeattr -n exporter snx11000n006 snx11000n007 snx11000n008 snx11000n009
172
Replace a CNG Server Module
The hostnames are numbered in the same order as the physical nodes. In this example, the hostname snx11000n006 corresponds to physical node 1 in the following figure, and the hostname snx11000n009 corresponds to physical node 4. Figure 93. CNG Four-Node Version Rear View
4. Record the primary and secondary BMC IP, Mask and Gateway IP addresses of the affected node: [admin@n000]$ grep hostname-ipmi /etc/hosts Where hostname is replaced by the hostname of the affected node. For example: [admin@snx11000n000 ~]$ grep snx11000n008-ipmi /etc/hosts 172.16.0.52 snx11000n008-ipmi 10.10.0.9 snx11000n008-ipmi-sec The Mask and Gateway addresses are set the same for all nodes:
Subnet Mask        : 255.255.0.0
Default Gateway IP : 0.0.0.0
5. Define the gateway IP address and prefix length of the secondary IPMI network: [admin@n000]$ pdsh -N -g mgmt ip a l | grep :ipmi For example: [admin@snx11000n000 ~]$ pdsh -N -g mgmt ip a l | grep :ipmi inet 10.10.0.3/16 brd 10.10.255.255 scope global secondary eth0:ipmi Where 10.10.0.3 is the gateway IP address and /16 is the prefix length (broadcast address 10.10.255.255). CTDB performs the ECN IP takeover for failed nodes, so it is not required to disconnect active clients. However, any active IO will suffer connection loss and be interrupted during the failover process. 6. Stop the CIFS/NFS services: [admin@n000]$ cscli cng node disable -n hostname [admin@n000]$ cscli cng apply Do you confirm apply? (y/n) y This command interrupts any active connections from enterprise clients.
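For step 6, an illustrative example, assuming the affected node is snx11000n008 as in the step 4 example:
[admin@n000]$ cscli cng node disable -n snx11000n008
[admin@n000]$ cscli cng apply
Do you confirm apply? (y/n) y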
173
Replace a CNG Server Module
7. Shut down the affected node: [admin@n000]$ cscli power_manage -n node --power-off Where node is the hostname for the affected node, for example: [admin@n000]$ cscli power_manage -n snx11000n016 --power-off If the affected node does not shut down after the --power-off command is issued, there are two options to shut the node down: a. From the primary MGMT node, run: [admin@n000]$ pm -0 affected_node b. If the node is still powered on and the pm -0 command fails, press and hold the power button, for at least six seconds, on the front panel of the affected server node. 8. Apply anti-static protection devices such as a wrist strap, boots, garments or other approved methods. 9. Disconnect the cables attached to the failed server node, and make a note as to where each cable attaches to ensure the cables are connected properly during re-installation. IMPORTANT: Note the port that each cable is attached to, so that the same cable connections can be made after the new server node is installed. Refer to the cable reference guide attached to the rack's left-hand rear door or provided in the Sonexion 2000 Quick Start Guide included with the system. 10. Push the green latch to release the server node, while using the handle to pull it from the chassis. Figure 94. Remove the Server Node
11. Install the server node by inserting the server into the empty bay at the rear of the enclosure. It may be necessary to push the module firmly to seat it fully in the bay. There will be an audible click as the server seats.
174
Replace a CNG Server Module
Figure 95. Install the Server Node
To configure the new server node, go to Configure BIOS Settings for CNG Node on page 175
Configure BIOS Settings for CNG Node About this task
Before specifying the BIOS settings, verify that the following equipment is available: ●
Console with monitor and keyboard
●
Video cable
The physical node numbering on the CNG is as follows: Figure 96. CNG Four-Node Version Rear View
Procedure 1. Connect the console’s video cable to the node (in the CNG chassis) for which BIOS settings are being configured. 2. Power on the new server module using the power button.
175
Replace a CNG Server Module
3. Press F2 during POST to enter the BIOS setup utility. If pressing F2 does not bring up the BIOS setup utility, power-cycle the server module (press and hold the ON/OFF switch for 6 seconds) and try again. 4. Specify the server module's BIOS settings as follows. Depending on the specific configuration utility and BIOS version available, screen layouts, option names, and navigation may vary slightly from the descriptions in this procedure. Use the arrow keys to navigate through the BIOS and press Enter to confirm a setting. Set the Baseboard LAN Configuration and Boot Options Entries 5. Navigate to the Server Management tab > BMC LAN Configuration menu > Baseboard LAN Configuration. 6. Set the Baseboard LAN Configuration entries, using the information recorded in step 4 on page 173 of Replace a CNG Server Module on page 170, or use the factory defaults shown below:
IP Source   : Static
IP Address  : 172.16.0.xxx
Subnet Mask : 255.255.0.0
Gateway IP  : 0.0.0.0
Figure 97. CNG Node Positions Within Cabinet
7. Navigate to the Advanced tab > PCI Configuration > NIC Configuration menu > Onboard NIC1 Port1 MAC Address, and record the MAC address. 8. Navigate to the Boot Options tab > Network Devices Order. 9. Disable all network boot options except for one (IBA GE Slot 0400 v1372):
Boot Option #1 : IBA GE Slot 0100 v1372
Boot Option #2 : ST9450405SS
These entries set the boot order to the first network adapter, and then the first drive in the server. The Boot Option entries may differ slightly from the entries shown above.
176
Replace a CNG Server Module
IMPORTANT: If any other boot devices appear on this screen, disable them. 10. Press F10 to save and exit. This automatically reboots the node. 11. Disconnect the video cable from the server. 12. Connect the remaining cables to the new server node based on the notes made in step 9 on page 174. For more information about cable port assignments, refer to Sonexion 2000 Field Installation Guide in the "Internal Rack Cabling" section. 13. Log in to the primary MGMT node: [Client]$ ssh –l admin primary_MGMT 14. Configure the new MAC address: a. Update the MGMT database to use the MAC address of the new server node, which you obtained in step 7 on page 176. Use the Onboard NIC1 Port1 MAC Address: [admin@n000]$ sudo echo "update netdev set mac_address='new_node_mac_addr' where hostname='node_hostname'" | mysql t0db Where new_node_mac_addr is a new node mac address, and node_hostname is the target node hostname. For example: [admin@n000]$ sudo echo "update netdev set mac_address='34:56:78:90:ab:cd' where hostname='snx11000n016'" | mysql t0db b. Update the configuration to use the new database entry: [admin@n000]$ sudo /opt/xyratex/bin/beUpdatePuppet -s -g mgmt 15. Power cycle the new CNG node: [admin@n000]$ cscli power_manage –n node --cycle Where node is the hostname for the server node that was replaced. For example: [admin@n000]$ cscli power_manage –n snx11000n016 --cycle 16. Start the CIFS/NFS services on the new node: [admin@n000]$ cscli cng node enable –n node [admin@n000]$ cscli cng apply Do you confirm apply? (y/n) y 17. Check the firmware level on the new CNG node: [root@n000]# pdsh -w new_cng_node "/lib/firmware/mellanox/release_1/ xrtx_mlxfw -c | grep 'Current'"
Example:
[root@n000]# pdsh -w snx11000n016 "/lib/firmware/mellanox/release_1/xrtx_mlxfw -c | grep 'Current'"
snx11000n016: Name: 01:00.0 Current: 2.30.8000 Update: 2.30.8000
snx11000n016: Name: 02:00.0 Current: 2.30.8000 Update: 2.30.8000
Firmware levels should be as follows:
Expansion Bus                        Release 1.5.0    Release 2.0
CNG (Config A and B) onboard CX-3    2.30.8000        2.32.5100
CNG (Config A) PCI-e CX-3            2.30.8000        2.32.5100
CNG (Config B) PCI-e CX-2            2.9.1000         2.9.1000
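Before moving on, it can be worth confirming that the MAC address written in step 14 is now what the management database reports. The query below simply reads back the same netdev table and columns used in that step; the hostname is the example node used throughout this procedure.
[admin@n000]$ sudo echo "select hostname, mac_address from netdev where hostname='snx11000n016'" | mysql t0db
# The mac_address column should show the Onboard NIC1 Port1 MAC address recorded in step 7.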
18. If Round Robin DNS was not available, reconnect the Enterprise Clients disconnected in step 3 on page 172 in Replace a CNG Server Module on page 170, using the network utilities on the clients. If the client connection works, the server replacement procedure is complete.
BMC IP Address Table for CNG Nodes
Use the following table to look up designated IP addresses and ranges, and assign them to nodes in the CNG chassis.
Table 22. BMC IP Addresses (Enclosures)
Rack    Nodes        BMC IP Address / Range
Base    CNG nodes    172.16.0.50 to 172.16.0.57
● The minimum number of nodes per CNG chassis is two, and the maximum is four.
● Only one CNG chassis is supported in releases 1.5 and 2.0. It must be mounted in the base rack when a 2U24 EBOD is in use for MMU storage.
● The CNG chassis may also be mounted in a customer rack, and is the only option when a 5U84 EBOD is installed in the base rack for MMU storage.
● Within groupings in the CNG chassis, it is recommended that IP addresses be assigned in the following order when viewing the rack from the rear. The number referred to below is the 4th octet of the IP address:
○ Assign even numbers (4th octet of the IP address) to right-side nodes.
○ Assign odd numbers (4th octet of the IP address) to left-side nodes.
○ Assign lower numbers (4th octet of the IP address) to the bottom enclosure in the rack and work your way up the rack.
The IP addresses used in the base rack diagram below follow the conventions described above. When viewed from the rear, the IP addresses, from left to right, top to bottom, are:
● 172.16.0.53 (top left)
● 172.16.0.52 (top right)
● 172.16.0.51 (bottom left)
● 172.16.0.50 (bottom right)
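As a quick, optional check of which BMC addresses in this range are currently answering, the range can be swept from a host on the management network. This is only a sketch; it assumes the host running it can reach the 172.16.0.0/16 management network and that ICMP is not blocked.
for i in $(seq 50 57); do
  ping -c 1 -W 1 172.16.0.$i > /dev/null 2>&1 && echo "172.16.0.$i responds"
done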
Figure 98. IP Addresses Used With CNG Nodes
Replace a CNG Chassis
Prerequisites
● Part number:
● Time: 2 hours
● Interrupt level: Enterprise Interrupt (requires disconnecting enterprise clients from the filesystem)
● Tools:
○ ESD strap
○ #1 and #2 Phillips screwdriver
○ Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits, no parity and 1 stop bit)
About this task
The CIFS/NFS Gateway (CNG) is an optional component that can be added to Sonexion systems. The CNG component shares the Lustre file system to enterprise clients (Windows, Linux, Macintosh, Solaris, etc.) using enterprise NAS protocols (CIFS or NFS).
Subtasks:
● Disconnect CNG Clients and Shut Down Nodes
● Remove Components from the CNG Chassis
● Install CNG Chassis and Components
A CNG unit consists of a 2U chassis that contains two or four server modules, each with built-in cooling fans, and two PSUs, which also contain built-in fans. The two PSUs are located at the rear of the server in the center two bays.
Four-Node and Two-Node CNG Servers
The physical node numbering on the four-node CNG is shown in the following figure.
Figure 99. CNG Four-Node Version Rear View
For the two-node variety, nodes 1 and 2 are connected identically to nodes 1 and 2 in the four-node configuration and nodes 3 and 4 are replaced with blanking plates. Distinguishing the CNG from the MMU The MMU includes a 2U chassis that also contains four server modules. However, the MMU server modules are not interchangeable with those in the CNG, and care should be taken not to confuse them. To locate CNG components and distinguish from the MMU, examine the cabling attached to the 2U server components at the back of the rack. If any components have SAS cables connected, the 2U enclosure is the MMU. If there are no SAS cables, the 2U enclosure is the CNG. This procedure was written for a CNG configured for four servers, but is similar to the procedure for a two-node chassis. CAUTION: Do not leave any enclosure bay empty.
Disconnect CNG Clients and Shut Down Nodes
Procedure
1. If the location of the failed chassis is not known, do the following:
● Access CSSM and use the Health tab to determine that the CNG unit is faulty.
● Locate the CNG in the rack by looking for the 2U two-node or four-node server with its System Status LED on (amber) or dark LEDs on the left and right control panels. See the following figure. To locate CNG components (as opposed to those in the MMU), examine the cabling attached to the 2U server components at the back of the rack. If any of the components have SAS cables connected, the 2U enclosure is the MMU. If there are no SAS cables, the 2U enclosure is the CNG.
Figure 100. Quad Server Control Panels
Table 23. System Status LED Descriptions
LED Color    Condition    Description
Green        On           System Ready/No Alarm
Green        Flashing     System Ready, but degraded: redundancy lost such as the power supply or fan failure; non-critical temp/voltage threshold; battery failure; or predictive power supply failure.
Amber        On           Critical Alarm: Critical power modules failure, critical fan failure, voltage (power supply), critical temperature and voltage.
Amber        Flashing     Non-Critical Alarm: Redundant fan failure, redundant power module failure, non-critical temperature and voltage.
-            Off          Power off: system unplugged. Power on: system power off and in standby, no prior degraded/noncritical/critical state.
Figure 101. CNG Four-Node Chassis Rear View
2. Disconnect the CIFS/NFS clients, using the network utilities on your Windows/SMB/CIFS or NFS client. Gateway access is unavailable during this procedure.
3. Log in to the active MGMT server node as "admin":
[Client]$ ssh -l admin active_MGMT_node
4. Stop the CIFS/NFS services:
[admin@n000]$ cscli cng disable -y
[admin@n000]$ cscli cng apply -y
This command interrupts any active client connections.
5. Shut down the gateway nodes:
[admin@n000]$ cscli power_manage -n nodes --power-off
Where nodes is the range of node names for the four server nodes. For example:
[admin@n000]$ cscli power_manage -n snx11000n0[6-9] --power-off
Remove Components from the CNG Chassis
Once all of the nodes are shut down (the power button LEDs on the control panels at the front of the 2U server are not illuminated), remove the internal components from the CNG chassis:
IMPORTANT: It is recommended to tag the cables or make a note of their connections. Refer to the latest revision of the Internal Rack Cabling guide for your Sonexion release for cabling information.
6. Verify that the four server modules are off (a quick command-line cross-check follows step 8 below). If any power button LEDs on the control panels are illuminated, press and hold the power button for more than 4 seconds until the LED is extinguished. The network LEDs may flash, but the node is only on if the power button is illuminated.
7. Disconnect the power cord from each of the two power supplies.
8. Disconnect the cables attached to each of the four server modules.
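As a sketch of the cross-check mentioned in step 6, the power state reported by the management software can be compared against the front-panel LEDs before any hardware is pulled. The cscli show nodes command used later in this procedure reports a Power column; the node-name pattern below is only an example and should be adjusted to match the gateway nodes being replaced.
[admin@n000]$ cscli show nodes | grep snx11000n0
# The Power column for the CNG (gateway) nodes should no longer read "on".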
9. Remove the two power supply units from the faulty chassis. a. Push the green latch to release the PSU. Figure 102. Remove PSU from Quad Server
b. Using the handle, remove the power supply module by carefully pulling it out of the power supply cage. Store the PSUs in a safe location.
c. Repeat for the second PSU.
10. Remove the four server modules: a. Push the green latch to release the server module, while using the handle to pull it from the chassis. Figure 103. Removing the Server Node
b. Place the server module in a safe location, and repeat until all four server modules are removed. 11. Remove the CNG chassis from the rack: a. Loosen the two fasteners securing the server chassis to the front of the rack.
b. With a second person, slide the chassis out of the rack and depress the two safety locks on the rails to completely remove the chassis.
c. Place the chassis on a sturdy bench.
Install CNG Chassis and Components 12. Install the replacement chassis in the rack: a. With an assistant, slide the chassis into the rack and depress the two safety locks on the rails to completely seat the chassis. b. Secure the chassis to the front of the rack cabinet with the two fasteners. 13. Install the four server modules: IMPORTANT: Cray recommends placing each server module into the same bay from which it was removed. This helps the CSSM software monitor each module more accurately over its lifetime. a. Install the server module by inserting it into the empty bay at the rear of the chassis. It may be necessary to push the module firmly to fully seat it into the bay. You'll hear a click as the module seats. Figure 104. Installing the Server Module
b. Connect the network cables to the server node. IMPORTANT: Be sure to reconnect the cables to their original ports. Refer to your notes or the latest revision of Internal Rack Cabling for your Sonexion release.
c. Repeat Steps a and b for the other (one or three) server modules.
14. Install the PSUs: a. Insert the power supply unit into the empty power supply cage. It may be necessary to firmly push the unit to fully insert it into the cage. You'll hear a click as the PSU seats.
Figure 105. Installing the Power Supply Unit
b. Repeat for the second PSU.
c. Connect the power cords to the PSUs. IMPORTANT: Be sure to reconnect the power cables to their original ports. Refer to your notes or the latest revision of the Internal Rack Cabling guide for your Sonexion release.
15. Log in to the active MGMT node:
[Client]$ ssh -l admin active_MGMT_node
16. Power on the server nodes:
[admin@n000]$ cscli power_manage -n NODES --power-on
17. Wait for the server nodes to fully boot (approximately 10 minutes).
18. Make certain the Lustre filesystem has started:
[admin@n000]$ cscli show nodes
------------------------------------------------------------------------------
Hostname      Role      Power state  Lustre state  Targets  Partner       HA Resources
------------------------------------------------------------------------------
snx11000n000  Mgmt      on           ----          0 / 0    snx11000n001  ----
snx11000n001  *Mgmt     on           ----          0 / 0    snx11000n000  ----
snx11000n002  MGS,*MDS  on           N/A           0 / 0    snx11000n003  None
snx11000n003  MDS,*MGS  on           Started       1 / 1    snx11000n002  All
snx11000n004  OSS       on           Started       4 / 4    snx11000n005  All
snx11000n005  OSS       on           Started       4 / 4    snx11000n004  All
------------------------------------------------------------------------------
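To spot any Lustre target that has not started without reading the whole table, the same output can be filtered. This is only a convenience built on the command above; the MGMT rows normally show "----" here, so attention belongs on MGS/MDS/OSS rows that do not show Started.
[admin@n000]$ cscli show nodes | grep -v Started
# Any MGS/MDS/OSS node listed by this filter has targets that are not yet started.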
19. Start the CIFS and NFS services:
[admin@n000]$ cscli cng enable -y
[admin@n000]$ cscli cng apply -y
20. Connect the CIFS/NFS clients using the network utilities on your Windows/SMB/CIFS or NFS client. If the client connection works, the CNG chassis FRU procedure is complete.
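From an NFS client, a quick way to confirm that the gateway is exporting shares again is to list its exports before remounting. This assumes NFS is in use and that the showmount utility is installed on the client; the gateway hostname below is only an example.
[Client]$ showmount -e snx11000n016
# A non-empty export list indicates the CNG node is serving NFS again. CIFS clients
# can simply browse to the share as in step 20.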
Replace a Cabinet PDU
Prerequisites
Parts
Part Number    Description
100840300      PDU Assy, Sonexion 30A Triple Input US
100840400      PDU Assy, Sonexion 32A Dual Input EU
100927300      PDU Assy, Sonexion 50A Dual Input US
Time
1 hour per PDU. For example, three racks (that is, six PDUs) would take six hours.
Interrupt level
Live (can be applied to a live system with no service interruption)
Tools
● Phillips screwdriver (No. 2)
● Torx T30
● Console with monitor and keyboard, and a PC/laptop with a DB9 serial COM port (configured for 115.2 Kbps)
● Serial cable, Null Modem (DB9 to DB9), female at both ends
● ESD strap
About this task
Use this procedure to remove and replace a failed Power Distribution Unit (PDU) in a Sonexion rack. Each rack contains two PDUs that are mounted on the left and right rear-facing sides of the cabinet. The specific PDU model (whether 30A, 32A, 50A or 60A) installed in the rack is determined by the geographical location of the system; all PDUs are factory-installed with the appropriate line-in cables and plugs. This procedure has been written specifically for use by an admin user and does not require root (super user) access. We recommend that this procedure be followed as written and that the technician not log in as root or perform the procedure as a root user.
Subtask:
● Configure PDUs
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed PDU and its rack is not known, locate the rack containing the failed PDU by checking the status of any nodes and switches connected to the PDU, looking for error indications. The PDU has rotating status LED displays and alarms that may show signs of failure.
2. Attach an ESD wrist strap and use it at all times.
3. Prepare to remove the failed PDU from the rack.
a. Release 2.0 only: Remove AC power at the wall or floor socket by switching power off and removing the plugs from the AC outlet. Perform this for each of the PDU line cords. This is necessary because PDUs in Sonexion 2.0 systems have no on/off switches. Raritan PDUs have from one to three AC line cords; each line cord must be individually powered on or off.
b. Remove all component power cords attached to the PDU. To ensure the power cords are reconnected correctly later, make a note of all cable connections to the PDU. Refer also to the Sonexion 2000 Quick Start Guide, a copy of which may be located in the outer sleeve of the rack packaging.
c. Unplug the Ethernet cable attached to the failed PDU.
4. Use a Phillips screwdriver to remove the ground strap wire from the failed PDU to the ground point at the bottom of the rack. Refer to the following figure. Figure 106. Remove Ground Strap Wire
5. Remove the failed PDU by lifting it up to the keyhole openings (three total) and remove it from the frame. a. Loosen the three Torx retaining bracket screws on the vertical cable tray. Refer to the following figure. Ensure that the brackets slide down and disengage.
Figure 107. Torx Retaining Bracket Screws
b. Unhook the vertical cable tray and rotate it 90 degrees so the tray is behind the SSUs.
c. Unscrew and remove the shipping L-Bracket from the top of the PDU, using the Phillips screwdriver. Refer to the following figure. Figure 108. Shipping L-Bracket
d. Lift the failed PDU from the mounting tabs and remove it from the rack.
6. Before the new PDU is put into the rack, remove the ground strap wire from the old PDU and transfer it to the new PDU. Ensure that the ground strap wire is tight and secure.
7. Install three mounting screws in the back of the PDU before proceeding to the next step.
8. Install the new PDU.
a. Position the new PDU in the rack and feed the mains power cables into the opening at the top or bottom of the rack and to the mains power connector.
b. Hook the PDU to the mounting tabs by lifting and sliding the PDU onto the tabs.
c. Screw the shipping L-Bracket to the top of the PDU, using the Phillips screwdriver.
d. Rotate the cable tray back into position.
e. Secure the retaining brackets with three Torx screws.
9. Reconnect the ground strap wire and connect power to the PDU.
a. Re-connect the ground strap wire from the new PDU to the bottom of the rack (previously removed in step 6 on page 191).
b. Re-connect the Ethernet cable to the PDU.
c. Connect the PDU's line cords to the AC power outlets and, if required, turn them to the ON position.
d. Wait a few seconds; the PDU begins its power-on process.
e. Verify that the LEDs on the PDU indicate normal operation. There is a rotating status display.
Configure PDUs
Verify that the following equipment is available before configuring the PDU:
● PC/Laptop with a DB9 serial COM port, mouse, keyboard
● DB9 console cable (DB9 to DB9, female on both ends)
● Serial terminal emulator program, such as SecureCRT, PuTTY, Tera Term, Terminal Window (Mac), etc.
The managed power distribution units (PDUs) that are included in Sonexion systems must be configured before the system is provisioned. There are two PDUs per rack (referred to as PDU 0 and PDU 1 in this procedure). IMPORTANT: This procedure applies only to systems running release 2.0.0 or later. 10. Connect the DB9 console cable to the CONSOLE/MODEM port (DB9) on the new PDU and the serial COM port (DB9) on the PC/Laptop. Figure 109. Console/Modem Port on Power Distribution Unit
WARNING: Do not use a rollover cable (DB9 to RJ45). Plugging an RS-232 RJ45 connector into the Ethernet port of the PDU can cause permanent damage to the Ethernet hardware.
11. Open a terminal session with these settings:
Table 24. Settings for MGMT Connection
Parameter          Setting
Bits per second    115200
Data bits          8
Parity             None
Stop bits          1
Flow control       None
The function keys are set to VT100+. When using Terminal Window on a Mac, enter the following:
screen /dev/ttyUSB0 cs8,115200
12. Once connected, log in to the PDU with the default username and password.
13. Enter config mode by typing config. The following prompt appears:
Config:#
14. When prompted, change the password and the password aging parameters (so that the password does not age and possibly require changing again). Enter the current password and the new password, then re-enter the new password:
Config:# password
Current password: old_password
Enter new password: new_password
Re-type new password: new_password
Add the following line to disable password aging:
Config:# security loginLimits passwordAging disable
15. Configure the PDU to use a wired connection:
Config:# network mode wired
16. Configure the PDU to use a static IP address:
Config:# network ipv4 ipConfigurationMode static
Config:# network ipv4 ipAddress ip_address
Config:# network ipv4 subnetMask 255.255.0.0
To show the commands available, type:
config:# network ?
network command [arguments...]
Available commands:
interface    Configure LAN interface
ip           Configure IP access settings
ipv4         Configure IPv4 settings
ipv6         Configure IPv6 settings
mode         Switch between wired and wireless networking
services     Configure network service settings
wireless     Configure wireless network
17. Calculate the IP address to use, using the following table.
Table 25. Device IP Addresses
Device Type                 Starting IP Address    Ending IP Address    Usable Addresses
Power Distribution Units    172.16.253.10          172.16.254.254       498
Table 26. Base Rack Starting IP Addresses
Device Type                Which Device in Base Rack            IP Address
Power Distribution Unit    Right hand PDU (viewed from rear)    172.16.253.10
Power Distribution Unit    Left hand PDU                        172.16.253.11
a. From the above table, find the starting IP address of the PDU being configured. Example: 172.16.253.10 for the right-hand PDU and 172.16.253.11 for the left-hand PDU.
b. Determine the rack_ID where the PDU being configured is installed. The rack_ID of the base rack is 0. The rack_ID of the first expansion rack is 1, the second expansion rack is 2, and so on. Continuing example: The PDU being configured is in the third expansion rack. The rack_ID is 3.
c. Multiply the rack_ID by 2. Continuing example: rack_ID x 2 = 3 x 2 = 6.
d. Add the result from step 17.c on page 194 to the fourth octet of the IP address from step 17.a on page 194. Continuing example for the left hand PDU: 172.16.253.11 has a fourth octet of 11, and 11 + 6 = 17.
e. Replace the fourth octet of the IP address from Step 17.a on page 194 with the result from Step 17.d on page 194. Continuing example: 172.16.253.11 becomes 172.16.253.17. Example result: 172.16.253.17 is the device IP address for the left hand PDU installed in expansion rack 3.
IMPORTANT: When configuring the switch or PDU, the netmask is always 255.255.0.0.
18. Set a hostname for the PDU. To set the PDU host name, from the config command prompt, type:
config:# network ipv4 preferredHostName cluster_name-pdunum-rrack_num
Where:
● cluster_name is the value used in the YAML file, as provided by the customer.
● num is the PDU number: 0 is the PDU on the right side of the rack (rear view); 1 is the PDU on the left side of the rack (rear view).
● rack_num is the rack number in which the PDU is installed. 0 is the base rack; 1 is expansion rack 1; 2 is expansion rack 2, etc.
Example:
config:# network ipv4 preferredHostName nsit-test-pdu0-r0
Example host names for PDUs:
Table 27. Example PDU Host Names
nsit-test-pdu0-r0    Right side PDU in the base rack
nsit-test-pdu1-r0    Left side PDU in the base rack
nsit-test-pdu1-r7    Left side PDU in expansion rack #7
nsit-test-pdu0-r2    Right side PDU in expansion rack #2
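The arithmetic in steps 17 and 18 can also be expressed as a small helper, which some may find useful as a cross-check. This is purely illustrative and not part of the product tooling; the variable values are taken from the worked example above.
cluster=nsit-test   # cluster_name from the customer-provided YAML file (example value)
rack=3              # rack_ID: 0 = base rack, 1 = first expansion rack, and so on
side=left           # "right" (PDU 0, starting octet 10) or "left" (PDU 1, starting octet 11)
if [ "$side" = "right" ]; then base=10; num=0; else base=11; num=1; fi
echo "IP address: 172.16.253.$((base + 2 * rack))"
echo "Host name:  ${cluster}-pdu${num}-r${rack}"
For the example values shown, this prints 172.16.253.17 and nsit-test-pdu1-r3.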
19. To save the new configuration and leave configuration mode, type apply. To leave configuration mode without saving, type cancel.
20. Verify the settings:
show network
show network [details]  (details enables the detailed view)
The show network command shows network information, as in this example:
Networking mode: Wired
IP Configuration mode: Static
IP address: 10.22.160.113
Net mask: 255.255.240.0
Gateway: 10.22.160.1
To show the PDU name, type show PDU. This shows the PDU name, as in this example:
PDU 'dvtrack_pdu0_r0'
Model: PX2-5104X2-V2
Firmware Version: 2.5.30.5-41228
#
21. Log out of the PDU:
exit
22. After configuring the new PDU, power OFF the PDU via the AC outlets. Verify that all line cords are OFF.
23. Using the packaging from the new PDU, re-package the failed PDU and return it per the authorized procedures. Reconnect Cords from Components In the following, re-connect the line cords from all components that were disconnected earlier, into the PDU. IMPORTANT: This procedure applies only to systems running release 2.0.0 or later. 24. After configuring the new PDU, power OFF the PDU via the AC outlets. Verify that all line cords are OFF. 25. At the back of the rack, confirm that the SSU PSUs related to this PDU change are in the OFF position. 26. At the back of the rack, confirm that the MMU storage (if fitted) 2U24 or 5U84 PCM / PSU related to this PDU change are in the OFF position. 27. The PDU's line cords should still be connected to the AC power outlets; therefore, turn to the ON position (they were turned OFF in step 24 on page 196). 28. Wait for a few seconds for the PDU to begin its power on process. 29. Verify that the LEDs on the PDU indicate normal operation. There is a rotating status display. 30. At the back of the rack, power on the SSUs and MMU storage (if fitted) and any other component related to this PDU from step 3 on page 190. 31. Confirm that all components’ PSUs or PCMs related to this PDU now have a good power indication showing. 32. Using the packaging from the new PDU, re-package the failed PDU and return it according to authorized procedures.
Replace a Cabinet Network Switch
Prerequisites
Part number
100900900: Switch Assy, Mellanox IB 36-port (Mellanox 6025, unmanaged)
Time
1.5 hours
Interrupt level
Failover (can be applied to a live system with no service interruption, but requires failover/failback)
Tools
● ESD strap (recommended)
● Console with monitor and USB keyboard, or a PC with a serial port configured for 115.2 Kbps
● Serial cable
● #2 Phillips screwdriver
About this task
Use this procedure to remove and replace a failed data network switch (InfiniBand). This includes steps to remove and replace the failed network switch, configure the new switch (if it is not already configured), update firmware on the new network switch as required, and return the Sonexion system to normal operation.
IMPORTANT: Check with Cray Hardware Product Support to determine if the switch requires a specialized configuration file.
Subtasks:
● Replace Switch
● Check Switch Installation
The Sonexion system contains two data network switches, known as Network Switch 0 (lower switch) and Network Switch 1 (upper switch), stacked one on top of the other in the rack. The dual data network switches manage I/O traffic and provide network redundancy throughout the Sonexion system.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
CAUTION: Before replacing the upper network switch, verify that the dual management switches (positioned above it in the rack) are securely attached to the rack with no damaged hardware, to avoid problems when the upper network switch is removed.
Procedure
1. If the location of the failed network switch in the rack is not known, check the status of the connected (cabled) ports on both network switches. Look for the switch with inactive ports (LEDs are dark). In an operational network switch, all connected ports have valid links with green LEDs.
2. Identify which nodes are affected by the switch failure by tracing the cabled port connections from the failed network switch to components. Refer to the Sonexion 2000 Quick Start Guide, which is included with the system.
● Cables attaching to the 2U quad server affect the primary and secondary MGMT, MGS and MDS nodes.
● Cables attaching to an SSU controller (OSS) affect OSS nodes. Each OSS hosts an OSS node.
● Cables attached to the management switches, network switches, and to the Manufacturing equivalent of an Enterprise Client Network (ECN) affect the CNG nodes.
3. If the status of the affected nodes' resources is already known, skip to Replace Switch. Use the following steps to verify that the affected nodes' resources have failed over, either via CSSM or the CLI. To check node status using the CSSM GUI:
a. On the Node Control tab, check whether the affected nodes' resources have failed over to their HA partner nodes.
b. If the node resources have failed over, go to Replace Switch.
c. If the node resources have not failed over, select the affected nodes and manually fail over their resources to the HA partner nodes. When the Node Control tab indicates that all node resources have successfully failed over, go to Replace Switch.
4. The remaining steps in this topic show the use of the CLI. To check node status:
a. Log in to the primary MGMT node via SSH:
[Client]$ ssh -l admin primary_MGMT_node
b. If the affected node is an MGS, MDS, or OSS node, SSH into the node:
[admin@n000]$ ssh MGS/MDS/OSS_nodename
c. Determine if the node's resources failed over to its HA partner node:
[admin@mgs_mds_oss]$ sudo crm_mon -1
5. If the node's resources have failed over, log in to the remaining affected nodes via SSH and use the crm_mon -1 command to check if their resources have failed over. When the resources of all affected nodes have successfully failed over, go to Replace Switch.
6. If the node's resources have not failed over:
a. Return to the primary MGMT node, if necessary.
b. Fail over the node's resources to its HA partner node:
[admin@node]$ cscli failover -n HA_partner_node
Where node is the MGMT0, MGS, MDS, or OSS node.
c. On the HA partner, verify that the node's resources failed over to the HA partner node:
[admin@HA_partner_node]$ sudo crm_mon -1
7. Manually fail over the remaining nodes resources to their HA partner nodes. When the resources of all affected nodes have successfully failed over, go to Replace Switch. Following is sample output showing a node (snx11000n003) with its resources failed over to its HA partner node (snx11000n002). Last change: Fri Jan 11 10:12:44 2013 via cibadmin on snx11000n003 Stack: Heartbeat Current DC: snx11000n003 (6c10c5af-04b8-4f37-a635-e451779b1667) - partition with quorum Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052 2 Nodes configured, unknown expected votes 12 Resources configured. ============ Online: [ snx11000n002 snx11000n003 ] Full list of resources: snx11000n003-stonith (stonith:external/libvirt): Started snx11000n003 snx11000n002-stonith (stonith:external/libvirt): Started snx11000n002 snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003 snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002 baton (ocf::heartbeat:baton): Started snx11000n002 Resource Group: snx11000n003_md66-group snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n002 snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n002 snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002 snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started Resource Group: snx11000n003_md65-group snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002 snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002 snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
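When several nodes are affected, the same failover check can be run across all of them in one pass. This is only a convenience; it reuses pdsh and dshbak, which are used elsewhere in this guide, assumes the admin account can run crm_mon via sudo non-interactively on the target nodes, and the node range shown is an example only.
[admin@n000]$ pdsh -w snx11000n[002-005] "sudo crm_mon -1" | dshbak
# Review the per-node sections: resources for each affected node should show as
# Started on its HA partner before the switch is removed.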
Replace Switch The network switches are installed facing the back of the rack, one on top of the other. The front panel of each network switch (the connector side) faces the back of the rack and the back panel of each network switch (the power side) faces the front of the rack. 8. Disconnect all QSFP/Infiniband cables from the switch. 9. Disconnect the power cord from the switch. 10. Remove the four screws securing the switch to the rack. 11. Carefully slide the failed switch out of the rack while holding the front of the switch to keep it steady. 12. If the new switch has not yet been unpacked:
a. Place the shipping cartons on a flat surface.
b. Cut all straps securing the cartons.
c. Unpack the switch and accessories from the cartons.
13. Mount the rail kit hardware to the new switch as follows, shown in the following figure. Using the flat-head Phillips head screws (provided), attach the switch slides onto the switch, using seven flat-head screws for short switches and seven screws for standard depth switches. The hardware is attached so that the switch can be installed facing the back of the rack. Figure 110. Securing the Rail
14. Carefully install the new switch into the rack, securing it into place with the 4 screws used to secure the failed switch. 15. At the back of the rack, connect all QSFP/Infiniband cables to the switch. The following figures show a schematic view of network switch locations (highlighted) and port locations on those switches.
Figure 111. TOR Network Switches
Figure 112. Network Switches Ports
16. At the front of the rack, connect the power cord to the switch and connect the other end to the PDU. IMPORTANT: The switch does not have an ON/OFF control. The switch powers on when the power cord is plugged in and power is applied. Wait for the switch to power on and complete its boot cycle (approximately 5 minutes). 17. Check the status LEDs and confirm that they show status lights consistent with normal operation.
Figure 113. Status LEDs after Five Minutes
18. Check the status of the connected (cabled) ports on the new network switch.
Check Switch Installation
19. Confirm that all connected ports show a valid link (green LED).
20. Confirm that all ports are active:
[admin@n000]$ pdsh -a ibstatus | dshbak
21. Confirm that the Port 1 Status shows an ACTIVE state and a rate of 56 Gb/sec (FDR). OSS nodes have two ports; ignore the port 2 status.
Infiniband device 'mlx4_0' port 1 status:
default gid: fe80:0000:0000:0000:0050:cc03:0079:5654
base lid: 0xa
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 56 Gb/sec (4X FDR)
link_layer: InfiniBand
22. If port 1's status shows a not-active state or a rate less than 56 Gb/sec, investigate the ibstatus command output. Sample output:
[admin@n000]$ pdsh -a ibstatus | dshbak
---------------
snx11000n000
---------------Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:001e:6703:003e:2b28 base lid: 0x1 sm lid: 0x3 state: 4: ACTIVE phys state: 5: LinkUp rate: 56 Gb/sec (4X FDR) link_layer: InfiniBand ---------------snx11000n001 ---------------Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:001e:6703:003e:25b0 base lid: 0x3 sm lid: 0x3 state: 4: ACTIVE phys state: 5: LinkUp rate: 56 Gb/sec (4X FDR) link_layer: InfiniBand ---------------snx11000n002 ---------------Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:001e:6703:003e:2158 base lid: 0x4 sm lid: 0x3 state: 4: ACTIVE phys state: 5: LinkUp rate: 56 Gb/sec (4X FDR) link_layer: InfiniBand ---------------snx11000n003 ---------------Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:001e:6703:003e:0d38 base lid: 0x2 sm lid: 0x3 state: 4: ACTIVE phys state: 5: LinkUp rate: 56 Gb/sec (4X FDR) link_layer: InfiniBand ---------------snx11000n004 ---------------Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:0050:cc03:0079:5f54 base lid: 0x5 sm lid: 0x3 state: 4: ACTIVE phys state: 5: LinkUp rate: 56 Gb/sec (4X FDR) link_layer: InfiniBand Infiniband device 'mlx4_0' port 2 status: default gid: fe80:0000:0000:0000:0050:cc03:0079:5f55 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 40 Gb/sec (4X QDR) link_layer: InfiniBand
---------------snx11000n005 ---------------Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:0050:cc03:0079:5f06 base lid: 0xb sm lid: 0x3 state: 4: ACTIVE phys state: 5: LinkUp rate: 56 Gb/sec (4X FDR) link_layer: InfiniBand Infiniband device 'mlx4_0' port 2 status: default gid: fe80:0000:0000:0000:0050:cc03:0079:5f07 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 40 Gb/sec (4X QDR) link_layer: InfiniBand ---------------snx11000n006 ---------------Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:0050:cc03:0079:5654 base lid: 0xa sm lid: 0x3 state: 4: ACTIVE phys state: 5: LinkUp rate: 56 Gb/sec (4X FDR) link_layer: InfiniBand Infiniband device 'mlx4_0' port 2 status: default gid: fe80:0000:0000:0000:0050:cc03:0079:5655 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 3: Disabled rate: 40 Gb/sec (4X QDR) link_layer: InfiniBand ---------------snx11000n007 ---------------Infiniband device 'mlx4_0' port 1 status: default gid: fe80:0000:0000:0000:0050:cc03:0079:55a6 base lid: 0x7 sm lid: 0x3 state: 4: ACTIVE phys state: 5: LinkUp rate: 56 Gb/sec (4X FDR) link_layer: InfiniBand Infiniband device 'mlx4_0' port 2 status: default gid: fe80:0000:0000:0000:0050:cc03:0079:55a7 base lid: 0x0 sm lid: 0x0 state: 1: DOWN phys state: 2: Polling rate: 10 Gb/sec (4X) link_layer: InfiniBand 23. Upgrade firmware on the new switch to the version specified below (latest qualified firmware level as of November 2014).
Table 28. Network Switch Firmware Levels
Vendor/Model             Firmware/Software    PSID             .bin File Name
Mellanox SX6025 (FDR)    9.2.8000             MT_1010210021    fw-SwitchX-rel-9_2_8000-MSX6025F_B1-B4.bin
Mellanox SX6025 (FDR)    9.2.8000             MT_1010110021    fw-SwitchX-rel-9_2_8000-MSX6025F_A1.bin
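Before failing resources back, the port checks from steps 20 through 22 can be repeated with the output narrowed to the lines of interest. This is simply a filter over the commands already shown above; adjust the pattern if only specific nodes need rechecking.
[admin@n000]$ pdsh -a ibstatus | dshbak | grep -E "port|state:|rate:"
# Every port 1 should report "state: 4: ACTIVE" and "rate: 56 Gb/sec (4X FDR)";
# port 2 on OSS nodes can be ignored, as noted in step 21.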
24. Fail back resources on the nodes affected by the network switch replacement, as follows. a. Trace the cabled port connections from the failed switch to the components (either a 2U quad server or an SSU controller [OSS]). b. Fail back the resources for the MGMT nodes, the MGS/MDS nodes, and then the OSS nodes. Verify that the failbacks were successful. [admin@n000]$ cscli failback -n node It can take from 30 seconds to a few minutes for the node's resources to fail back completely. Following is sample output showing an HA node pair (snx11000n002 and snx11000n003) in online mode with their local resources assigned to them. ============ Last updated: Mon Jan 14 04:54:52 2013 Last change: Fri Jan 11 10:12:44 2013 via cibadmin on snx11000n003 Stack: Heartbeat Current DC: snx11000n003 (6c10c5af-04b8-4f37-a635-e451779b1667) partition with quorum Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052 2 Nodes configured, unknown expected votes 12 Resources configured. ============ Online: [ snx11000n002 snx11000n003 ] Full list of resources: snx11000n003-stonith (stonith:external/libvirt): Started snx11000n003 snx11000n002-stonith (stonith:external/libvirt): Started snx11000n002 snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003 snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002 baton (ocf::heartbeat:baton): Started snx11000n002 Resource Group: snx11000n003_md66-group snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003 snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n003 Resource Group: snx11000n003_md65-group snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002 snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002 snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002 c.
Repeat the preceding two substeps for all nodes affected by the switch replacement.
Once all affected nodes are online with their local resources reassigned to them, the procedure to replace a failed InfiniBand Mellanox SX6025 network switch is complete.
Replace a Cabinet Network Switch PSU
Prerequisites
● Part number: 100901000: Power Supply, Sonexion for 36-port FDR IB Mellanox Switch
● Time: 30 minutes
● Interrupt level: Failover (can be applied to a live system with no service interruption, but requires failover/failback)
● Tools:
○ ESD strap
○ #2 Phillips screwdriver
● Access requirements: This procedure has been written specifically for use by an admin user and does not require root (super user) access. We recommend that this procedure be followed as written and that the technician does not log in as root or perform the procedure as a root user.
About this task
The system contains two high-speed network switches, known as Network Switch 0 (lower switch) and Network Switch 1 (upper switch), stacked one on top of the other in the rack, as shown in the following figure. Figure 114. Power and Connector Side Panels
The dual InfiniBand network switches manage I/O traffic and provide network redundancy throughout the system. CAUTION: PSUs have directional airflows similar to the fan module. The fan module airflow must coincide with the airflow of all of the PSUs (see the following figure). The switch's internal temperature is affected if the PSU airflow direction is different from the fan module airflow direction.
Figure 115. Airflow direction
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure 1. If the location of the failed PSU is not known, from the front of the server, look for a Mellanox switch with an illuminated Status LED (amber). In an operational 36-port switch PSU, the status LED is illuminated green. CAUTION: Make certain that the PSU that is not being replaced is showing all green, for both the PSU and status indicators. Figure 116. Status LED locations
Figure 117. Status LED bar
Table 29. LED Color Status
LED Color      Status
Solid Green    OK: The power supply is delivering the correct voltage (12 VDC)
Solid Red      Error: The PS unit is not operational
Off            Off: No power to the system (neither PS unit is receiving power). If one PS unit is showing green and the second PS unit is unplugged, it will show a red indication.
2. Disconnect the power cord attached to the failed PSU. 3. Grasping the handle with the right hand, push the latch release with the thumb while pulling the handle outward, as shown in the following figure. Figure 118. Power Supply Unit Removal Latch
As the PSU unseats, the PSU status indicators turn off. 4. Using the handle, remove the power supply module by carefully pulling it out of the power supply bay. Figure 119. Removing a Power Supply Unit
5. Make certain the mating connector of the new unit is free of any dirt or obstacles.
6. Insert the PSU by sliding it into the opening until a slight resistance is felt. 7. Continue pressing the PSU until it seats completely. The latch snaps into place, confirming the proper installation. 8. Insert the power cord into the supply connector. 9. Verify that the PSU indicator is illuminated green. This indicates that the replacement PSU is operational. If the PSU is not green, repeat the whole procedure to extract and re-insert the PSU.
Replace a Cabinet Management Switch (Brocade)
Prerequisites
Part number
Sonexion Model    Part Number    Description
Sonexion 2000     101171100      Switch Assy, Brocade 24-Port ICX6610 1GBE RJ45 Mgmt FRU
Sonexion 2000     101161901      Switch Assy, Brocade 48-Port ICX6610 1GBE RJ45 Mgmt FRU
Sonexion 900      101018600      Switch, GbE Managed 24-Port Airflow=PS to Port (Brocade)
Sonexion 900      101018700      Switch, GbE Managed 48-Port Airflow=PS to Port (Brocade)
Time
1.5 hours
Interruption level
Failover (can be applied to a live system with no service interruption, but requires failover/failback operations)
Tools
● Phillips screwdriver (#2)
● Console with monitor and keyboard (or PC with a serial port configured for 9600 bps, 8 data bits, no parity and 1 stop bit)
● PC/Laptop with a DB9 serial COM port, mouse, keyboard
● Rollover/Null modem serial cable (DB9 to RJ45 Ethernet)
● Serial terminal emulator program, such as SecureCRT, PuTTY, Tera Term, Terminal Window (Mac), etc.
● ESD strap, boots, garment or other approved methods
Requirements
The size and weight of the Brocade switch requires two individuals to move the unit safely. Do not perform this procedure unless two individuals are onsite and available to move each switch.
About this task
Use this procedure to remove and replace a failed management switch (24/48-port Brocade), configure the new switch (if it is not already configured), bring the nodes online, and return the Sonexion system to normal operation. The switch configuration must be done properly to ensure a resilient, redundant, secure and easily accessible management environment. IP addresses are obtained by the switches via the DHCP server, which allows access for configuration changes and firmware upgrades. The instructions in this procedure can be used for both 24- and 48-port Brocade switches.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Take note of the following:
● The port on the failed switch into which the ISL is plugged
● Whether this is a multi-rack system with port(s) that have inter-rack connections. These ports will not be configured as admin_edge_ports. Leave these blank (unconfigured).
● Which management switch is being replaced (the upper or lower switch). This information will be needed for configuring the new switch.
Procedure
1. Identify the port into which the inter-switch link (ISL) cable is plugged. IMPORTANT: In a multi-rack Sonexion system, several ports have inter-rack connections. These ports are not configured as admin_edge_ports; leave them blank (unconfigured).
2. If the location of the failed management switch is not known, check the status of the connected (cabled) ports or look for an error indicator. The LEDs on inactive ports are off. In a switch that is operating normally, all connected ports have valid links with green LEDs. No warning or error LEDs should be lit.
3. If the identity of the nodes affected by the switch failure is already known, go to Verify Failover. Otherwise, identify the affected nodes by tracing the cabled port connections from the failed network switch to Sonexion components, and associating cables with nodes as follows:
● Cables attached to the 2U quad server affect the MGMT, MGS, and MDS nodes.
● Cables attached to an OSS controller in an SSU affect OSS nodes. Each OSS controller hosts an OSS node.
Verify Failover
Once the affected nodes are identified, use the following steps to verify that the affected nodes' resources have failed over to their HA partner nodes. If the resources are confirmed to have failed over, go to Remove Failed Management Switch. Determine node status using the GUI or CSCLI: Check node status on CSSM GUI 1. If CSSM is running, go to the Node Control tab and check if the affected nodes' resources have failed over to their HA partner nodes. 2. If the node resources have failed over, go to step Remove Failed Management Switch. 3. If the node resources have not failed over, select the affected nodes and manually fail over their resources to the HA partner nodes. 4. When the Node Control tab indicates that all node resources have successfully failed over, go to Remove Failed Management Switch. Check node status via CSCLI The following steps show the use of CSCLI. 4. Log in to the primary MGMT node via SSH: [Client]$ ssh -l admin primary_MGMT_node 5. If the affected node is an MGS, MDS, or OSS node, SSH into the node: [admin@n000]$ ssh MGS/MDS/OSS_nodename 6. Determine if the node's resources failed over to its HA partner node: [admin@n000]$ sudo crm_mon -1 7. If the node's resources have failed over, log in to the remaining affected nodes via SSH and use the crm_mon -1 command to check if their resources have failed over. When the resources of all affected nodes have successfully failed over, go to Remove Failed Management Switch. 8. If the node's resources have not failed over: a. Return to the primary MGMT node, if necessary. b. Fail over the node’s resources to its HA partner node: [admin@n000]$ cscli failover -n node_name Where node_name is the name of the HA partner node c.
On the HA partner, verify that the node's resources failed over to the HA partner node: [HA Partner]$ sudo crm_mon -1
9. Manually fail over the remaining resources to the respective HA partner nodes. When the resources of all affected nodes have successfully failed over, go to Remove Failed Management Switch.
This is sample output showing a node (snx11000n003) with its resources failed over to its HA partner node (snx11000n002).
============
Last updated: Mon Jan 14 04:54:52 2013
Last change: Fri Jan 11 10:12:44 2013 via cibadmin on snx11000n003
Stack: Heartbeat
Current DC: snx11000n003 (6c10c5af-04b8-4f37-a635-e451779b1667) - partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
12 Resources configured.
============
Online: [ snx11000n002 snx11000n003 ]
Full list of resources:
snx11000n003-stonith (stonith:external/libvirt): Started snx11000n003
snx11000n002-stonith (stonith:external/libvirt): Started snx11000n002
snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003
snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002
baton (ocf::heartbeat:baton): Started snx11000n002
Resource Group: snx11000n003_md66-group
snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n002
snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n002
snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002
snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started
Resource Group: snx11000n003_md65-group
snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002
snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002
snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002
Remove Failed Management Switch
On the lower switch, the mounting tabs might catch on the PDU as you remove it from the rack; it will be a tight fit, but the switch will slide out.
Install New Management Switch
16. If the new switch has not yet been unpacked from the shipping carton(s), follow these steps:
a. Place the shipping carton(s) on a flat surface.
b. Cut all straps securing the carton(s).
c. Unpack the switch and accessories from the carton(s).
17. Using the Phillips head screws (provided), attach the mounting brackets (2) to the sides of the new switch. One bracket attaches to each side of the switch (in the front). Each mounting bracket requires 4 screws.
18. With a second person, slide the switch into the rack.
19. Align the mounting brackets with the rack holes. Using two pan-head screws with nylon washers, attach each bracket to the rack.
20. Connect the power cord to the power receptacle on the switch. The switch does not have an ON/OFF control. The switch powers on when the power cord is plugged in and power is applied. Wait for the switch to power on and complete its boot cycle (approximately 5 minutes). Do not connect the network cables to the new switch. That step will be performed when the switch is configured.
21. Configure the new switch. Refer to Configure a Brocade Management Switch on page 216.
22. Fail back resources on the nodes affected by the switch replacement (MGMT nodes, then MGS/MDS nodes and finally the OSS nodes).
a. Trace the port connections previously cabled to the failed switch.
b. For each affected node, fail back its resources. Verify that the failback operation was successful.
[admin@n000]$ cscli failback -n affected_node
It may take 30 seconds to a few minutes for the nodes' resources to fail back completely. Following is an example output showing an HA node pair (snx11000n002 and snx11000n003) in online mode with their local resources assigned to them.
============
Last updated: Mon Jan 14 04:54:52 2013
Last change: Fri Jan 11 10:12:44 2013 via cibadmin on snx11000n003
Stack: Heartbeat
Current DC: snx11000n003 (6c10c5af-04b8-4f37-a635-e451779b1667) partition with quorum
Version: 1.1.6.1-2.el6-0c7312c689715e096b716419e2ebc12b57962052
2 Nodes configured, unknown expected votes
12 Resources configured.
============
Online: [ snx11000n002 snx11000n003 ]
Full list of resources:
snx11000n003-stonith (stonith:external/libvirt): Started snx11000n003
snx11000n002-stonith (stonith:external/libvirt): Started snx11000n002
snx11000n003_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n003 snx11000n002_mdadm_conf_regenerate (ocf::heartbeat:mdadm_conf_regenerate): Started snx11000n002 baton (ocf::heartbeat:baton): Started snx11000n002 Resource Group: snx11000n003_md66-group snx11000n003_md0-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n003_md66-raid (ocf::heartbeat:XYRAID): Started snx11000n003 snx11000n003_md66-fsys (ocf::heartbeat:XYMNTR): Started snx11000n003 snx11000n003_md66-stop (ocf::heartbeat:XYSTOP): Started snx11000n003 Resource Group: snx11000n003_md65-group snx11000n003_md65-raid (ocf::heartbeat:XYRAID): Started snx11000n002 snx11000n003_md65-fsys (ocf::heartbeat:XYMNTR): Started snx11000n002 snx11000n003_md65-stop (ocf::heartbeat:XYSTOP): Started snx11000n002 c.
Repeat Step 7.b for all nodes affected by the switch replacement. The procedure to replace a failed management switch (24/48-port Brocade) in the field is complete.
Configure a Brocade Management Switch
About this task
Use this procedure to configure the new Brocade management switch to use DHCP. The following schematic shows the management switches (highlighted) and network switches in the rack (rear view). Perform this procedure only on the new switch installed in Replace a Cabinet Management Switch (Brocade) on page 211.
Figure 120. TOR Management Switches
In this procedure, each rack has two management switches (referred to as management switch 0 and management switch 1). Perform this procedure only on the switch that is replacing the problematic switch. Commands used in this procedure are not case-sensitive.
Procedure
1. Connect to the new switch by connecting the rollover cable to the leftmost console port (RJ45), shown in the following figure, on the new switch and to the serial COM port (DB9) on the PC or laptop. Figure 121. Console Port on Brocade ICX 6610-24
On Brocade switches, the console port is coupled with the out-of-band management port (which is not used by the Sonexion system). Use the following table to determine the console port's location.
Table 30. Console Port Location
Brocade Switch Type      Console Port Location
ICX 6610-24 (24-port)    Left side of the pair of ports
ICX 6610-48 (48-port)    Right side of the pair of ports
2. Open a terminal session and specify these serial port settings in a PuTTY/SecureCRT emulator program:
Baud rate (bits per second):    9600
Data bits:                      8
Stop bits:                      1
Parity:                         None
Flow control:                   Off
3. When using Terminal Window on a Mac, specify the serial port settings as follows:
screen /dev/ttyUSB0 cs8,9600
After connecting to the new switch, the terminal screen should resemble the following figure.
Figure 122. Terminal Screen after Connecting to the Switch
After connecting to the new switch, go to the next step. 4. Configure the new switch's hostname as follows, using naming conventions shown below: enable configure terminal hostname switchname write memory For switchname, use the following format: cluster_name-sw_type num-rrack_num Where: ●
cluster_name is the value used in the YAML file, as provided by the customer. Use snx11000n for generic systems where the cluster name is not specified.
●
sw_type has these values: ○
ibsw for InfiniBand switches or gesw for 40GbE or 10/40GbE switches
○
mgmt for management switches
●
num is the switch number; 0 is the lower switch, 1 is the upper switch.
●
rack_num is the rack number in which the switch is installed. 0 is the base rack, 1 is expansion rack 1, 2 is expansion rack 2, etc.
Sample hostnames for management switches, where the cluster name is snx11000n:
Sample Hostname         Description
snx11000n-mgmtsw1-r0    Upper management switch in the base rack
snx11000n-mgmtsw0-r7    Lower management switch in expansion rack #7
5. Configure the Spanning-Tree as follows. The primary management switch is the lower switch (0), and the secondary management switch is the upper switch (1), as shown in the following figure. Figure 123. TOR Management Switches
Enter the following:
Spanning-tree 802-1w
Spanning-tree 802-1w priority ['0' for primary, '4096' for secondary]
With the exception of the ISL port and any inter-rack link port, the remaining ports are configured this way (a helper for generating these lines follows this step):
interface ethernet 1/1/1
 spanning-tree 802-1w admin-edge-port
interface ethernet 1/1/2
 spanning-tree 802-1w admin-edge-port
interface ethernet 1/1/3
 spanning-tree 802-1w admin-edge-port
…
This sequence is repeated until all ports are configured, followed by:
write memory
This saves the configuration.
IMPORTANT: Ports used for the ISL link between switches and any inter-rack connection should not have admin-edge-port enabled; leave them blank (unconfigured).
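Typing the per-port block for 24 or 48 ports by hand is error-prone. One option, sketched below, is to generate the lines on a laptop or the MGMT node and paste them into the switch console; this is not a switch feature, just shell text generation, and the port count and skip list are examples that must be adjusted to the actual ISL and inter-rack ports.
# Generate the admin-edge-port configuration for ports 1-24 (use 48 for a 48-port switch).
# SKIP lists ports that must stay unconfigured (ISL and inter-rack links); edit as needed.
SKIP="24"
for p in $(seq 1 24); do
  echo "$SKIP" | grep -qw "$p" && continue
  printf 'interface ethernet 1/1/%s\n spanning-tree 802-1w admin-edge-port\n' "$p"
done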
6. Set the management IP address of the new switch. Setting the management IP address enables remote login, should it be necessary to manually update the switch configuration or apply firmware updates. The management IP address is assigned by the DHCP server (172.16.2.2). The steps below ensure the management IP address of the new switch is properly set. Type:
enable
configure terminal
ip dhcp-client enable
ip dhcp-client auto-update enable
write mem
7. Configure the username and password:
username admin priv 0 password Sonexion
enable super-user-password Sonexion
write mem
8. Enable logging, SSH, SNMP, and NTP; and disable Telnet:
logging 172.16.2.2
snmp-server community ClusterRead ro
snmp-server community ClusterWrite rw
crypto key generate rsa modulus 1024
ip access-list standard 10 permit 172.16.0.0/16
ssh access-group 10
enable aaa console
aaa authentication login default local
no telnet server
write memory
9. Obtain the IP address of the switch:
SSH@snx11000n-primary# show ip address
This is sample output:
IP Address       Type       Lease Time
172.16.255.88    Dynamic    308
SSH@snx11000n-primary#
10. Verify that the new switch can be accessed via SSH.
a. Log in to the primary management node via SSH.
b. Access the new switch via SSH: ssh admin@mgmt_switch_ip_address
c. Enter the switch's password.
d. If a switch prompt appears, the new switch can be accessed via SSH.
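For example, using the dynamic address reported in step 9 (a sample value; substitute the address reported by the switch), the check from the primary MGMT node might resemble:
[admin@n000]$ ssh [email protected]
Password:
SSH@snx11000n-mgmtsw0-r0#
The switch prompt shown here assumes the hostname configured in step 4; its appearance confirms that SSH access works.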
11. The following cipher-mismatch error may occur when SSH is used to access the switch:
"aes128-ctr,3des-ctr,aes256-ctr,aes128-cbc,3des-cbc,aes256-cbc,twofish256-cbc,twofish-cbc,twofish128-cbc"
The above text lists the ciphers supported by the switch, while the client supports only the following:
"arcfour256,arcfour,blowfish-cbc"
If this occurs, do the following:
a. Add one of the switch's ciphers to the SSH client configuration on the primary MGMT node (MGMT0), in one of the following files:
~/.ssh/config
OR
/etc/ssh/ssh_config
Following is an example that adds one of the switch's ciphers to the client configuration:
cd /etc/ssh
[root@localhost-mgmt ssh]# vi ssh_config
# Protocol 2,1
# Cipher 3des
# Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc
# MACs hmac-md5,hmac-sha1,[email protected],hmac-ripemd160
# EscapeChar ~
# Tunnel no
# TunnelDevice any:any
# PermitLocalCommand no
"ssh_config" 64L, 2164C written
b. Remove the "#" comment character from the Ciphers line so that it matches the following entry:
Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc
c. Press the Esc key.
d. Enter :wq! and press Enter to save the file and exit. SSH should now work on the Sonexion system.
12. Once the new switch has booted, reconnect the network cables. Refer to the Sonexion 2000 Quick Start Guide, which is included with the system.
13. Check the status of the connected (cabled) ports.
IMPORTANT: Wait for links to be established on all connected ports (green LEDs).
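As an additional check (a suggestion beyond the original procedure), the link state of the cabled ports can also be confirmed from the switch CLI:
show interfaces brief
Ports with an established link are typically reported with a Link state of Up.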
Replace a Cabinet Management Switch PSU (Brocade)
Prerequisites
Part number: 101233400: Power Supply, Sonexion for 24-port ICX 6610 1 GbE Brocade Switch
Time: 30 minutes
Interrupt level: Live (can be applied to a live system with no service interruption)
Tools:
● ESD strap
● #2 Phillips screwdriver
About this task
The Sonexion 2000 system contains two Brocade ICX 6610-24-I or ICX 6610-48-I management switches used for configuration management and health monitoring of all system components. These are the only management switches used in Sonexion 2000 systems. These switches have dual redundant power supplies, eliminating the management switches as a single point of failure. The management network is private and not used for data I/O in the cluster. The Brocade switches have two PSU receptacles at the rear of the switch (see the following figure). Each switch ships from the manufacturer with one PSU installed, but a second PSU can be installed to provide backup power in case of a PSU failure and for load-balancing when both PSUs are operational. Each Brocade switch shipped with a Sonexion 2000 system has two PSUs installed as the standard configuration. PSUs are hot-swappable and each has one standard power receptacle for the AC power cable.
Figure 124. Rear Panel 24- and 48-port Brocade Switch
Figure 125. Brocade Switch PSU
Precautions
Be sure to have the correct type of PSU before beginning the procedure to replace the PSU.
IMPORTANT: Check with Cray Hardware Product Support to determine if the switch requires a specialized configuration file.
WARNING: When inserting a power module into the switches, do not use unnecessary force. Doing so can damage the connectors on the rear of the supply and on the midplane.
CAUTION: Verify that the PSU that is not being replaced shows all green for both the PSU and status indicators.
CAUTION: Make sure that the proper airflow direction will be available when replacing a PSU (see the following figures).
Figure 126. Airflow Direction, Front to Back, E-labeled PSU
Figure 127. Airflow Direction, Back to Front, I-labeled PSU
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed PSU is not known, from the back of the rack, locate the switch with the failed PSU by checking the status of the power status LED.
Figure 128. Management Switch Power Status LEDs
2. Disconnect the power cord attached to the failed PSU.
3. Loosen the two captive screws on the PSU.
4. Using the extraction handle, remove the PSU by carefully pulling it out of the power supply bay.
Figure 129. Remove Management Switch PSU
5. Before opening the package that contains the power supply, touch the bag to the switch casing to discharge any potential static electricity. Cray recommends using an ESD wrist strap during installation.
6. Remove the PSU from the anti-static shielded bag.
7. Make certain the mating connector of the new unit is free of any dirt or obstacles.
8. Holding the PSU level, guide it into the carrier rails on each side and gently push it all the way into the slot, ensuring that it firmly engages with the connector.
9. Align the two captive screws with the screw holes in the switch's back panel.
10. Using a screwdriver, gently tighten the captive screws.
11. Insert the power cord into the supply connector.
12. Confirm that the new PSU is powered on and displays a green power LED. If the PSU power LED is not green, repeat the procedure to extract and re-insert the PSU. If the PSU still does not show a green power LED, contact Cray Support. This procedure has an interrupt level of Live, and powering off this switch could affect the running cluster.
Remove a Cabinet Management Switch PSU (Netgear)
Prerequisites
Part number: 100901400: Power Supply, Netgear 24-Port GbE Switch
Time: 30 minutes
Interrupt level: Interrupt (requires taking the Lustre file system offline)
Tools:
● ESD strap
● #2 Phillips screwdriver
Requirements
Root access is not required to perform this procedure on a system.
WARNING: When inserting a power module into the switches, do not use unnecessary force. Doing so can damage the connectors on the rear of the supply and on the midplane.
About this task
Use this procedure to remove and replace a failed Power Supply Unit (PSU) in a 24- or 48-port Netgear switch. This procedure includes steps to replace the failed PSU and return the Sonexion system to normal operation. Sonexion 1600 systems use 24- or 48-port Netgear switches (either one or two) to provide a dedicated local network for configuration management and health monitoring of all components. The management network is private and not used for data I/O in the cluster. Each Netgear switch contains one PSU, which is accessible from the back of the switch (see the following figure).
Figure 130. Netgear 24/48-port Switch Rear Panel
Procedure
1. If the location of the switch with the failed PSU is known, skip to step 3.
2. Locate the failed PSU in a Netgear switch by checking the status of the power LED. See the following figures.
Figure 131. Netgear 24-port Switch LEDs and Ports
Figure 132. Netgear 48-port Switch LED and Port Layout
3. Log in to the primary MGMT node via SSH (user name admin and the customer password):
[Client]$ ssh -l admin primary_MGMT_node
4. Stop the Lustre file system:
[admin@n000]$ cscli unmount -f fsname
5. Check that the Lustre file system is stopped on all nodes:
[admin@n000]$ cscli fs_info
After verifying the Lustre file system has stopped, power off the Sonexion system. Refer to Sonexion 1600 Power On/Off Procedures.
6. Disconnect the power cord attached to the failed PSU.
7. Loosen the two captive screws on the PSU.
8. Use the extraction handle and remove the PSU by carefully pulling it out of the power supply bay. See the following figure.
Figure 133. Remove Netgear Switch PSU
9. Make certain the mating connector of the new unit is free of any dirt and obstructions.
10. Insert the new PSU into the power supply bay, and gently push the PSU into the bay.
11. Align the two captive screws with the screw holes in the switch's back panel.
12. Using a screwdriver, gently tighten the captive screws.
13. Insert the power cord into the supply connector. There is no power switch on the PSU. When power is supplied to the PSU, it starts to boot.
14. Allow 2 minutes for the PSU to complete its boot cycle, then verify that all connected RJ45 connections show green LEDs and the power LED is green.
15. Power on the Sonexion system. Refer to Sonexion 1600 Power On/Off Procedures. If the PSU's power LED is not lit green, remove the PSU and repeat the installation procedure.
Replace a Cabinet Power Distribution Strip
Prerequisites
Part number: 100894000: Power Distribution Strip, C20-C13
Figure 134. Power Distribution Strip
Time: 1.5 hours
Interrupt level: Interrupt (requires taking the Lustre file system offline)
Tools:
● Phillips screwdriver (#2)
● Console with monitor and keyboard (or PC with a serial port configured for 115.2 Kbps, 8 data bits, no parity, and 1 stop bit)
● Serial cable
● Cat5e cables (2)
● RS-232 to Ethernet serial cable
● ESD strap, boots, garment, or other approved methods
Access requirements: This procedure has been written specifically for use by an admin user and does not require root (super user) access. Cray recommends that this procedure be followed as written and that the technician does not log in as root or perform the procedure as a root user.
About this task
Each rack contains two power distribution strips that are mounted on the left and right rear-facing sides of the cabinet. There are two types of power distribution strip in the rack: a 7-position model and a 12-position model. The replacement procedure is the same for both, differing only in the plugging order. To replace a defective power distribution strip, power off the entire rack.
Notes and Cautions
● Only trained service personnel should perform this procedure.
● If this equipment is used in a manner not specified by the manufacturer, the protection provided by the equipment may be impaired.
● Due to product acoustics, it is recommended that users wear ear protection for any prolonged exposure.
Procedure
1. If the location of the failed power distribution strip is not known, determine which power distribution strip failed by checking the status of any failed power receptacles. In most cases, a failed receptacle can be re-cabled to work around the issue until appropriate downtime can be scheduled.
2. Power off the system as described in Power Off Sonexion 2000.
3. At the back of the rack, disconnect all the power cords from the failed power distribution strip.
Figure 135. Power Distribution Strip with Power Cords
4. At the back of the rack, using the Phillips (#2) screwdriver, remove the two retaining pan-head screws from the front of the failed power distribution strip.
5. Using the Phillips-head screws (provided), attach the mounting brackets (two total) to the sides of the new power distribution strip.
6. Align the mounting brackets with the rack holes (42U position). Using two pan-head screws with nylon washers, attach each bracket to the rack.
7. Connect all power cords to the power receptacles on the power distribution strip. Refer to the cabling diagrams in the following figures for cable configuration. If the power cords block access to components, install the power cords so that the first two sockets are empty, rather than the last two.
Figure 136. Power Distribution Strip With Top Sockets Empty
8. Power on the system as described in Power On Sonexion 2000.
Power Diagrams
Figure 137. Base Rack (4U MMU) Power Distribution Strip for PX2-5104X2-V2
Figure 138. Base Rack with PDU: PX2-5965X3-V2
Figure 139. Base Rack (4U MMU) Power Distribution for PX2-5100X2-V2