Disaster Recovery Solution for Hitachi NAS Platform
User Guide By Steven Sonnenberg
February 2016
Feedback Hitachi Data Systems welcomes your feedback. Please share your thoughts by sending an email message to
[email protected]. To assist the routing of this message, use the paper number in the subject and the title of this white paper in the text.
Contents

Background Information
    Important Replication Terms
    Hitachi NAS Platform Important Terms
    Relationship Between Replication and Hitachi NAS Platform Entities
    Typical Hitachi Universal Replicator Components
    Typical Global-active Device Components — Dual Cluster
    Typical Global-active Device Components — Stretch Cluster
    Takeover
    Disaster Recovery-Validation
    Data Recovery Solution for Hitachi NAS Platform Components
    Configuration Synchronization Methods
Basic Operations
    Management through Hitachi Command Suite
    Summary of Operations
    Global Arguments
    Command Syntax
    Configuration Files
    Operations Reference
Configuration Management
    Configuration Record
    Resiliency
    Failure Points
    Loss of the Disaster Recovery Solution for Hitachi NAS Platform
    Repair and Replacement of the Disaster Recovery Solution for Hitachi NAS Platform
    Loss of SMU
    Restore an SMU
    Loss of a Hitachi NAS Platform Node
    Loss of all Hitachi NAS Platform Nodes
    Loss of Array Configuration
Array Provisioning
    XML Introduction
    Discovery Function
    Array Provisioning
Storage Management
    Initial Provisioning
    Expansion
    Duplication
    groupdev Tool
    Converting From a Non-tiered Pool to a Tiered Storage Pool
    Managing Storage for Disaster Recovery-Validation
    Deleting Storage
XML Tags
Pair Status Codes
System Status Codes
Configuration Record
Active Directory Registration Tool
Getconfig.sh — Support Incident Collection
Database Schema
hdrs.cfg Config File Values
Disaster Recovery Solution for Hitachi NAS Platform User Guide This user guide describes how to use the disaster recovery solution for Hitachi NAS Platform. Hitachi Universal Replicator is a block-level asynchronous replication facility provided through a pair of cooperating, Fibre Channel-connected enterprise storage arrays. Compared with server-based (object/file) replication on Hitachi NAS Platform, the replication overhead with Hitachi Universal Replicator is off-loaded from the server to the storage array, which can provide a lower RPO. Global-active device is a synchronous replication technology enabling read/write access from both sides. Within a limited distance, global-active device delivers the following:
An RPO of zero
Simultaneous read/write coordination
Automatic failover in case the storage is lost
This solution provides a way to leverage Hitachi NAS Platform with Universal Replicator or global-active device by doing the following:
Managing the replication and provisioning lifecycles
Facilitating the disaster handling and recovery processes
The solution supports both stretched cluster and dual cluster designs. In a dual cluster environment, a Hitachi NAS Platform cluster is configured to use the storage of each storage system. In a stretched cluster environment, a Hitachi NAS Platform node is configured to use each replicated storage array. The replication technologies manage coherent storage as a unit referred to as a "pool." Pools are presented to clients, such as Hitachi NAS Platform or VMware vSphere. When presenting identical storage to clients at different locations, the pools are referred to as "paired." In normal operation, one of the pools for Hitachi NAS Platform and Hitachi Virtual Storage Platform is referred to as the primary pool. The pool on the other system is the secondary pool.
Primary Pool — The storage from Virtual Storage Platform is presented to Hitachi NAS Platform as one or more storage pools sponsoring file system storage in the Hitachi NAS Platform system. Secondary Pool — The storage for the other system is in secondary mode and the storage is not exposed to Hitachi NAS Platform. This cooperation between Virtual Storage Platforms is known as a "paired state." The other system in secondary mode maintains the following:
A nearly real-time copy of all changes made to the storage when using Hitachi Universal Replicator
An exact copy of all changes when using global-active device
The key differences between Hitachi Universal Replicator and global-active device are the following:
Universal Replicator is asynchronous, supporting writes from a single site but supporting replication over great distances. Global-active device is synchronous, supporting read/write from both locations up to a distance of 100 km (62 miles).
Although global-active device supports writes in both locations, in dual-cluster configurations only one cluster will have exposed replicated storage at any time.
Background Information Understanding this background information is necessary to implement this solution.
Important Replication Terms You need familiarity with these important terms regarding replication.
HORCM (facility) — Hitachi Open Remote Copy Manager. This is an interface for controlling device pairing on Hitachi storage subsystems. HORCM#.conf (instance) — This is a configuration file describing the pairing of LUNs, consistency groups, communication paths, and ports needed to support pairing and array operations. Virtual CMD-Dev (out-of-band) — This is the management of pairing conducted through a UDP conversation between Hitachi Open Remote Copy Manager and the SVP on the storage array. Contrast this with in-band control, where SCSI commands are sent using a special Fibre Channel device to manage the array. Journal Volume — This is storage designated for the exclusive use of an asynchronous pairing relationship. In particular, a journal volume buffers changes between peers. Consistency Group (copy-group) — This is a group of volumes treated as a single unit for all pairing operations. Device-group — This is a logical grouping of volumes managed within the storage array. Each device can belong to only a single device-group. Each device is given a device-group-specific name when it is assigned.
Logical device name — This is the name assigned to a device as part of joining a device group.
LDEV name — This is the name assigned to the LDEV when creating it.
LUN — This is a storage volume (contiguous storage) which can be used as a SCSI device through a Fibre Channel port once it has been mapped or presented through one or more ports. Port (Fibre Channel) — This is a communications endpoint used for exposing LUNs through a Fibre Channel interconnection to storage clients (such as Hitachi NAS Platform) or for communications between arrays. Historically, each array port is configured with a single role, but several arrays offer 'Universal Ports' that can implement all roles simultaneously. The common roles used for this solution are listed below.
Target role — This provides general SCSI communication. An example is Hitachi NAS Platform.
RCU role — This acts as the Hitachi Universal Replicator receiver for inbound replication.
MCU role — This acts as the Hitachi Universal Replicator initiator for outbound replication.
ELUN role — This port acts as an initiator for connecting an external array.
S-VOL — This is a volume (LUN) currently assigned a secondary (read-only) role in a paired volume set.
P-VOL — This is a volume (LUN) currently assigned the primary (read-write) role in a paired volume set.
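The terms above come together in the Hitachi Open Remote Copy Manager configuration file. The fragment below is a rough sketch only; the serial number, LDEV IDs, group names, addresses, and ports are hypothetical placeholders rather than values from this guide:

```
# HORCM0.conf — hypothetical sketch of a primary-site instance
HORCM_MON
#ip_address   service   poll(10ms)   timeout(10ms)
localhost     11000     1000         3000

HORCM_CMD
#out-of-band virtual command device: \\.\IPCMD-<SVP address>-<UDP port>
\\.\IPCMD-192.0.2.10-31001

HORCM_LDEV
#dev_group   dev_name   Serial#   CU:LDEV(LDEV#)   MU#
HUR_Pool1    dev01      400001    01:23            0
HUR_Pool1    dev02      400001    01:24            0

HORCM_INST
#dev_group   ip_address(peer)   service
HUR_Pool1    smu-siteb          11001
```

HORCM_MON names this instance's listening endpoint, HORCM_CMD the out-of-band command device, HORCM_LDEV the consistency group (device-group) membership, and HORCM_INST the peer instance.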
Pairing (States) These are the pairing states for replication.
SMPL — The pair is split (unpaired), with changes not tracked.
COPY — Peers are performing a coarse track-by-track copy. There are no disaster recovery guarantees.
PAIR — This is normal steady-state replication, with bit-for-bit updates and disaster recovery guarantees.
PSUS (SSUS) — The pair is suspended, with recording of changed tracks for future resynchronization.
PSUE (SSUE) — The pair is suspended due to error.
SSWS — This is the force-write mode for a secondary site, as a result of forced takeover.
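The pairing states above can be summarized as a small lookup table. The sketch below is illustrative only (not part of the product) and maps each state to whether disaster recovery guarantees are in force:

```python
# Pairing states and whether disaster recovery guarantees are in force.
# SSUS and SSUE are the secondary-side views of PSUS and PSUE.
PAIR_STATES = {
    "SMPL": ("split (unpaired), changes not tracked", False),
    "COPY": ("coarse track-by-track copy in progress", False),
    "PAIR": ("normal replication, bit-for-bit updates", True),
    "PSUS": ("suspended, changed tracks recorded for resync", False),
    "SSUS": ("suspended, changed tracks recorded for resync", False),
    "PSUE": ("suspended due to error", False),
    "SSUE": ("suspended due to error", False),
    "SSWS": ("secondary in force-write mode after forced takeover", False),
}

def dr_guaranteed(state: str) -> bool:
    """Return True only when the pair state carries DR guarantees."""
    return PAIR_STATES[state.upper()][1]
```

Only the PAIR state provides disaster recovery guarantees; every other state indicates that the pair is copying, split, or suspended.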
Hitachi NAS Platform Important Terms You need familiarity with these important terms regarding Hitachi NAS Platform.
SMU (system management unit) — This is a dedicated server acting as the management and control interface to a Hitachi NAS Platform node or cluster. It provides a quorum device for managing cluster failover. Storage Pool (or span) — This is a grouping of storage assigned to provide storage to one or more file systems. System Drive — This is a managed (logical) SCSI device that Hitachi NAS Platform uses to provide raw storage for pools. Drive-spec — This is a shorthand specification for one or more system drives consisting of comma-separated drive numbers or dash-separated ranges of drives. COD (configuration-on-disk) — This is information written into each system drive as metadata. This allows the system to recognize disk, pool, or cluster information about a drive. LUID — This is the unique identifier for any SCSI device. Shadow (or Mirror) Drive — This is a fictitious system drive representing a remote device on a peer system. This exists only for Hitachi Universal Replicator.
Cluster — This is a group of 1 to 8 Hitachi NAS Platform nodes acting as a single entity.
Stripeset — This is a set of drives that are treated as a single unit for allocation operations.
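The drive-spec shorthand described above can be expanded mechanically. This illustrative parser (not a product tool) turns a spec such as "0-3,5" into the list of system drive numbers it denotes:

```python
def parse_drive_spec(spec: str) -> list[int]:
    """Expand a drive-spec (comma-separated drive numbers or
    dash-separated ranges, e.g. "0-3,5") into a sorted list of
    system drive numbers."""
    drives: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = (int(n) for n in part.split("-", 1))
            drives.update(range(lo, hi + 1))
        else:
            drives.add(int(part))
    return sorted(drives)
```

For example, the spec "0-3,5" denotes system drives 0, 1, 2, 3, and 5.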
Relationship Between Replication and Hitachi NAS Platform Entities This describes the relationship between replication and Hitachi NAS Platform entities.
System Drive | LUN | LDEV — Each system drive has a corresponding LUN from the storage array. Usually the LUN is from a dynamic provisioning pool providing storage for use in a storage pool. Technically, a LUN is an LDEV presented to one or more clients through a Fibre Channel connection. Storage Pool | Consistency Group — Each storage pool, also known as a span, is mapped to a replication consistency group consisting of multiple system drives. The name of the storage pool must match the consistency group entry exactly, or the disaster recovery solution will not be aware that Hitachi NAS Platform has access to the storage.
Typical Hitachi Universal Replicator Components
Figure 1 shows a typical Hitachi NAS Platform with Hitachi Universal Replicator arrangement. It sponsors a single Hitachi NAS Platform pool consisting of four LDEVs which are part of HUR Pair Group1. Shown in light blue are the SMUs used to manage Hitachi NAS Platform. The disaster recovery solution software is installed in the SMU. The disaster recovery software uses IP to manage all components, local and remote. These connections are shown in dashed red lines.
Figure 1
Figure 2 shows more detail of the Fibre Channel connectivity required to create a Hitachi Universal Replicator environment.
Figure 2
Using Figure 2 as a reference, the following is a description of the various components and the configuration files normally needed to support this arrangement.
Hitachi NAS Platform cluster on Site A is hur-rep01.
This cluster may consist of one or more nodes.
It provides a pool (or span) named HUR Pvol consisting of two system drives: SD4 and SD5.
Hitachi NAS Platform cluster on Site B is hur-rep02.
The cluster hur-rep02 has a pool equivalent to that of cluster hur-rep01, consisting of two system drives, SD 0 and SD 1. The system drives have been mirrored between the storage arrays using LDEV 00:83 and LDEV 00:84 on Hitachi Virtual Storage Platform.
Port 5B and Port 7B are set up as initiator and RCU in order to support inter-array replication.
The system drives are mapped over Fibre Channel through Port 1A and Port 2A to LDEV 01:23 and LDEV 01:24 on Hitachi Virtual Storage Platform.
Normally two pairs would be used to provide additional throughput and higher reliability.
Both systems are using Journal ID #22 with different internal storage.
Both systems provide command-devices CD:05 and CD:04 for the Hitachi Open Remote Copy Manager controllers to manage the command and control of the pairing.
Note — Although the Virtual Storage Platform in Site A is labeled Primary VSP, the storage array itself does not become primary or secondary. Only the state of the LUNs in the consistency group becomes primary or secondary. Normal operation transfers the primary attribute of all consistency groups at the same time, which is why a site can be referred to as primary or secondary. The box labeled ICC consists of the management software and configuration files used to manage the replication relationship. A pair of Hitachi Open Remote Copy Manager configuration files contains the necessary information to provide replicated storage across the two Hitachi NAS Platform systems. These files are typically prepared by the storage administration team during software setup to enable remote replication using, for example, Hitachi Universal Replicator, or local replication. The configuration for Site B is similar with a few key differences:
The peer, Site A, is listening on port 11000 (instance 0).
The instance on Site B is ‘1.’
By convention, the instance number is added to the base listening port: 11000 + 1 = 11001. Typical management of a Hitachi Universal Replicator system is performed through a series of Hitachi Open Remote Copy Manager command line interface commands. These commands include the following examples:
paircreate
pairsplit
pairresync
pairdisplay
raidcom
When using Hitachi Universal Replicator with Hitachi NAS Platform, an additional set of more than 30 Hitachi NAS Platform command line interface scripts is used to set up Hitachi NAS Platform and to respond to requests for site swap and disaster recovery. With the exception of tearing down or intentionally suspending a Hitachi Universal Replicator environment, there is no need to run any Hitachi Open Remote Copy Manager command for operational management or monitoring.
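The instance-to-port convention described earlier (the base listening port plus the instance number) can be expressed directly. The sketch below is illustrative and assumes the base port of 11000 used in the example configuration:

```python
BASE_PORT = 11000  # base listening port from the example configuration

def horcm_service_port(instance: int, base: int = BASE_PORT) -> int:
    """Listening port for a HORCM instance: base port + instance number."""
    if instance < 0:
        raise ValueError("instance number must be non-negative")
    return base + instance

# Instance 0 on Site A listens on 11000; instance 1 on Site B on 11001.
```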
Figure 3
The Hitachi Universal Replicator facility operates as a state machine. Without introducing failure scenarios, the operational cycle proceeds as follows:
Simplex State
In the simplex state, the storage at both sites has no relationship. Both sides could act as writeable primary storage. Do not present writeable storage to separate clusters prior to creating a pairing relationship.
Copy State
Once pair creation is initiated, the secondary becomes non-writeable. The storage remains in the copy state until the data in the primary group has been copied to the secondary.
During this time, entire tracks are copied from primary to secondary. Data consistency guarantees are not yet in force.
Paired State
Once copying tracks from the primary to the secondary completes, the pairs automatically enter the paired state. An efficient byte-level update mechanism maintains synchronization. In the paired state, only the primary is allowed to have write access. The secondary site has read-only access. After completing pairing, Hitachi NAS Platform is configured to become aware of the special relationship between the storage exposed to the cooperating Hitachi NAS Platform clusters. This relationship is known as a remote-mirror. Once mirroring is configured, Hitachi NAS Platform is prepared to undergo a swap of roles as needed, or promote a secondary if the primary becomes unavailable.
When Hitachi NAS Platform is configured as the secondary, the following happens:
The pool is not exposed.
The file systems are not mounted.
Any CIFS shares are unavailable.
Configuration changes occurring at the primary site are not known on the secondary site, as they are not stored in the storage (and so are not replicated). Status information collected from Hitachi NAS Platform and Hitachi Universal Replicator is digested, synthesized, and presented through a single command and the resulting XML file. The administrator needs little or no knowledge of the command control interface to manage these capabilities on Hitachi Universal Replicator or Hitachi NAS Platform, including any need to be aware of, or to modify, the Hitachi Open Remote Copy Manager configuration files.
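The single status command and its XML output described above lend themselves to simple scripted checks. The sketch below is an assumption-laden illustration: the element and attribute names are hypothetical (the actual tags are listed in the XML Tags appendix), and only the parsing pattern is the point:

```python
import xml.etree.ElementTree as ET

# Hypothetical status output; real tag and attribute names may differ.
SAMPLE = """
<status>
  <pool name="HUR_Pool1" role="primary" pair-state="PAIR"/>
  <pool name="HUR_Pool2" role="secondary" pair-state="PSUS"/>
</status>
"""

def pools_needing_attention(xml_text: str) -> list[str]:
    """Return names of pools whose pair state is not PAIR."""
    root = ET.fromstring(xml_text)
    return [p.get("name") for p in root.iter("pool")
            if p.get("pair-state") != "PAIR"]
```

A monitoring wrapper could run the status command, feed its XML to a function like this, and alert on any pool not in the PAIR state.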
Typical Global-active Device Components — Dual Cluster Figure 4 shows a typical Hitachi NAS Platform with global-active device replication arrangement sponsoring a single NAS Platform pool. It consists of four LDEVs, replicated to both sites, but only active or exposed on a single site at a time.
The dashed lines around the pink boxes indicate that normally FS1 and SP1 are not visible on Site 2. Although the Site 2 cluster does not expose the storage, the following is true:
The storage array at Site 2 does maintain an up-to-date copy of the data.
Cross-site paths that are shown in black can be used as alternate paths in the case of local path failure.
There is a dependency on a third storage system that acts as a quorum server. This server is required for global-active device operation. It can be located on Site 1, or preferably at a third site.
Figure 4
Figure 5 shows the logical Fibre Channel infrastructure needed to support a Hitachi NAS Platform using a global-active device environment. The solid lines are designated as primary paths. The dashed lines are alternate or cross-site paths.
(Figure 5 labels: HNAS Cluster 1/HNAS Node 1 and HNAS Cluster 2/HNAS Node 2; two VSP G1000 arrays joined by GAD pairs over MCU-RCU links on ports 5H, 7H and 6H, 8H; a VSP providing the quorum through ports 1H, 1E, and 2E; host-facing ports 1, 3, 3E, and 4E.)
Figure 5
Typical Global-active Device Components — Stretch Cluster Figure 6 shows a typical Hitachi NAS Platform/global-active device replication arrangement sponsoring a pair of Hitachi NAS Platform pools consisting of four LDEVs each. In any global-active device implementation, there will be a quorum array providing a small quantity of storage that is virtualized to both global-active device arrays and used for synchronization between them. In a stretched cluster, both nodes are active and have access to all resources. There is a higher cost to accessing resources where the P-VOL (lock coordinator) is remote, so a stretched cluster is designed so that the virtual servers (EVS) at each node manage storage attached directly to the local array. EVS1 on Site 1 is bound to filesystem1 (FS1) and storage pool1 (SP1), whose four LDEVs are labeled 'P' on VSP G1000 #1, forming global-active device pair group1. SP1 and FS1 are available on Hitachi NAS Platform Node 2 through the corresponding LDEVs labeled 'S', with lock/data synchronization occurring through the global-active device replication link. A parallel situation exists for global-active device pair group2, which is principally used on Site 2 through EVS2. Read/write access from either node to either pool is available without using the non-preferred (optional) paths, which are shown in black. Note — In order to support a stretched cluster, two dedicated 10 Gb/sec connections must be available for NVRAM mirroring between cluster nodes (labeled as the cluster interconnect below). Currently, there is a limitation of only two nodes, one per site.
Figure 6
Takeover Takeover is the transitioning of storage from one site to the other in a dual-cluster configuration. In a dual-cluster using Hitachi Universal Replicator, the following happens:
Storage at the primary site becomes S-VOLs. The copy at the secondary site becomes P-VOLs. This copy is exposed to Hitachi NAS Platform and configured the same as the primary cluster.
If global-active device is the replication technology, the takeover implementation is slightly different but the result is identical. With global-active device, the data is always accessible; the P-VOL site is the one that performs lock coordination for the pairs. A stretched cluster does not have a takeover operation because both sites always access the same data. Takeover works in both directions and is effective even if the other site becomes unavailable.
Disaster Recovery-Validation
Disaster recovery-validation is the ability to increase confidence in the validity of the remote copy. It allows full access to a snapshot of that remote copy. Disaster recovery-validation performs restoration exercises on the remote copy without disrupting production activities. The management of the snapshot copy is through local replication technologies:
Hitachi ShadowImage
Hitachi Thin Image
In order to support disaster recovery-validation, the disaster recovery solution for Hitachi NAS Platform provides the following capabilities:
The hdrs utility that supports the storage provisioning cycle of these copies.
A set of operations for loading and unloading of validation snapshots.
The status command that displays a catalog of disaster recovery-validation copies.
Currently, Hitachi NAS Platform only allows a single disaster recovery-validation copy of each type (SI or HTI) to be exposed at any time. This copy must be unloaded prior to promotion of the disaster recovery site. There is a mechanism provided to select which version to expose at any time. Figure 7 shows the general flow:
(Figure 7 labels: Takeover — the storage pool on the disaster site is promoted to primary, usually as the result of a disaster at the production site. Takeback — the storage pool on the production site, demoted to secondary during takeover, is restored as primary. The production copy is R/W, the disaster copy is R/O, and the validation copy is R/W.)
Figure 7
The LUNs making up a pool on the disaster site (as S-VOLs) can be used to take a snapshot that can be mounted at the disaster site in read-write mode. This pool, represented by the green hexagon in Figure 7, is a disaster recovery-validation pool. Once the disaster recovery-validation pool is presented to the Hitachi NAS Platform on the disaster system, its configuration can be set to match that of the production Hitachi NAS Platform so that it can be used to perform validation. The disaster recovery solution on Hitachi NAS Platform uses both of the following:
EVS migration
Configuration of a Hitachi NAS Platform synchronization process to apply the latest configuration from the peer site during takeover and disaster validation scenarios
Refer to Configuration Synchronization for an understanding of the options and tradeoffs.
Data Recovery Solution for Hitachi NAS Platform Components There is no additional hardware required for this solution. Instead, each SMU is provided with data recovery solution software for Hitachi NAS Platform that provides the management interfaces and monitors the status of replicated pools. The data recovery solution collects configuration and health information by monitoring the storage arrays and Hitachi NAS Platform activities. Figure 8 shows the components involved in the solution. There are two or three storage arrays:
One storage array is for the primary site.
One storage array is for the disaster site.
When using global-active device, an additional array serves as the quorum array.
The blue lines depict the Fibre Channel connections between the following, providing remote cross-site paths between the Hitachi NAS Platform systems and the peer arrays:
Local Hitachi NAS Platform and storage array
The storage arrays
If used, the quorum server and the arrays
All management functions, shown as solid black lines, are conducted using IP protocols between the active disaster recovery solution components and the other systems or components. Fibre Channel connectivity is shown in blue.
Figure 8
Figure 9 shows the inside of the SMU with the disaster recovery solution components.
In the yellow-gold blocks are third party software add-ons for storage array management.
In the orange block is the ssc tool used for managing the Hitachi NAS Platform equipment.
The elements in the upper box make up the disaster recovery solution, consisting of the following:
HDSlib — Code library
hdrsmon — Daemon for monitoring and correcting issues (monitor.pyc)
Config — Maintains a copy of the provisioned state of the HNAS clusters as well as the storage arrays
Tools — Perform operations (perform.py, hdrs.pyc)
Figure 9
HORCMGR is configured to communicate with storage arrays using an IP-based command device in a mode known as out-of-band communication. The Hitachi Open Remote Copy Manager managers run as an auxiliary daemon service on the SMU. Additional Hitachi Open Remote Copy Manager instances are used on the disaster site to support a disaster recovery-validation copy using Hitachi Thin Image or Hitachi ShadowImage.
HORCM0 — Primary storage array
HORCM1 — Disaster site storage array
HORCM2 and HORCM3 (if used) — Thin Image
HORCM4 and HORCM5 (if used) — ShadowImage
Additional Hitachi Open Remote Copy Manager instances are possible for provisioning on the quorum server array or, in the future, to support 3DC.
An additional daemon, hdrsmon, also runs on the primary HDRS site (normally the primary SMUs). It does the following:
Provides automated monitoring and correction
Maintains an archive of saved configuration information (local and remote)
Monitors performance of the various consistency groups during normal replication and copy operations
Notifies Hitachi NAS Platform of any automated corrections made as well as conditions impacting the disaster recovery solution that cannot be corrected
Activities initiated by administrators or operators are performed using ssh to invoke the perform.py script on the HDRS. An optional set of example Microsoft® Windows® PowerShell® scripts can be provided that wrap the ssh-based operations. These are intended to be modified as needed. Because ssh provides a simple, secure, and remote operational interface for disaster recovery solution functions, it can be integrated with management solutions or with Hitachi Automation Director. The disaster recovery solution for Hitachi NAS Platform can be installed external to the SMU on a physical or virtual server. This server must be configured with the same operating system and software as a backup SMU. The disaster recovery solution requires the following:
4 GB of memory
1 GB of disk space
Two or more network adapters
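The ssh-based operational interface described above can be wrapped in a small helper, much as the example PowerShell scripts wrap it. This sketch is illustrative only: the host, user, and argument names are hypothetical placeholders, not the documented perform.py syntax:

```python
import shlex

def build_perform_command(host: str, user: str, operation: str,
                          span: str) -> str:
    """Compose the ssh command line that would invoke perform.py
    remotely. Host, user, and argument names are illustrative."""
    remote = f"perform.py --operation={operation} --span={shlex.quote(span)}"
    return f"ssh {user}@{host} {shlex.quote(remote)}"
```

A management tool could compose commands this way for operations such as takeover, dr-load, or rebuild, keeping the quoting of the remote command safe.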
Configuration Synchronization Methods Configuration synchronization is a feature offered for all dual cluster deployments. This is because much of the configuration of a Hitachi NAS Platform cluster is maintained in the registry, which is internal to the nodes forming a cluster. It is not storage based and is not automatically migrated along with storage. As a result, replication of the storage will not convey the configuration to the target automatically. Without conveying the configuration, elements that give form to the storage, such as CIFS shares or NFS exports, will not match the source cluster. As an example, file systems are fully storage resident with the exception of the device ID, which is managed within the registry of the cluster. When the pool or span migrates from one cluster to another, the device ID may change. This information normally is not utilized by clients and is not synchronized. Virtual volumes are exposed but are not stored in the file system. These require specialized handling, as virtual volumes are part of the context that clients depend on, such as quota management. There are two basic technologies used to synchronize configuration. Both can be utilized during a span transition:
Configuration Synchronization
EVS Migration
The default is to use both technologies: first EVS migration, and then global configuration synchronization. EVS migration is applied only for the non-administrative EVS. Overrides exist for the following:
To not use EVS migration
To not perform global configuration
To ignore configuration synchronization altogether
Span transitions include the following:
Operation=takeover — When a span is moved from one site to another
Operation=dr-load — When a snapshot of a secondary span is presented locally for disaster recovery-validation
Operation=rebuild — When an on-demand re-synchronization request is made
Configuration Synchronization Configuration synchronization is implemented using the command line interface for Hitachi NAS Platform. This creates a database of entries that require synchronization on span transition. When required, the database is replayed to ensure that the configuration matches. The process erases any existing configuration and then rebuilds it according to the database.
The following is the list of elements that are synchronized:
Global CNS
Per-EVS CNS
  Links, directories
Virtual volumes
  Links, directories
  Usage quotas
CIFS shares
  Mounts in filesystems
  Mounts in CNS
  Share permissions
  Share settings (ABE, snapshot exposure, forced casing, symlink following, virus scan)
  Access rules
NFS exports
  Mounts in filesystems
  Mounts in CNS
  Configuration changes
  Access rules
EVS Migration EVS migration is the process of taking a registry dump at the source site to import one or more EVSs at the target site. The operational interface is pool-oriented, as the pool is the smallest unit of replication management. An EVS is not necessarily bound to a single span. If an EVS is bound to file systems from multiple pools, it is not possible to perform an EVS migration. An issue is reported if this arrangement is maintained, as it is against best practice and complicates operations.
Figure 10
In Figure 10 on the left, EVS1 acts as if it were exclusively bound to SPAN1; that is, migrating EVS1 only affects filesystems on SPAN1 (FS1 and FS2). In Figure 10 on the right, migrating SPAN1 would not be possible because EVS2 cannot be disabled, reset, or deleted, as it is serving FS3, which is not involved in the operation. When performing a full site operation, SPAN1 and SPAN2, EVS migration can be supported. When performing an EVS migration, two behaviors are affected by whether the pair of EVSs uses uniform or non-uniform networking.
With uniform networking, an EVS in cluster1 can be migrated to cluster2, running with identical network settings. Only one copy of the EVS can run in either cluster at the same time. Before the EVS is enabled in the target cluster, the EVS in the source cluster is disabled. Non-uniform networking implies that the EVS running in cluster1 uses different settings than when run in cluster2. Each EVS can function in parallel with the other, and the source EVS does not have to be disabled during an EVS migration. Whenever EVS migration is used, the target EVS must be deleted before the remote EVS can be imported. If this is unacceptable for a specific environment, then use the 'nomigrate' option for takeover, dr-load, and rebuild operations.
Another factor influencing the ability to use EVS migration is whether the target EVS is enabled and servicing other filesystems; if so, it cannot be deleted in order to perform an import. The licensed quantity of EVS servers must also be sufficient to allow EVS migration to take place.
Configuration Synchronization versus EVS Migration The administrative EVS cannot be migrated because it would cause the target cluster to become an exact copy of the source cluster. Therefore, information that is considered global is not migrated using EVS migration. Configuration synchronization performs CNS migration, and it may be extended to cover other areas in the future. The use of the command line interface by configuration synchronization is not particularly efficient, but it can be used when EVS migration cannot be, such as where one or more isolated EVSs do not exist or where deletion of a target EVS is unacceptable.
Algorithm EVS migration is the preferred mechanism. This will be used unless one of the following is true:
The lack of isolation prevents EVS migration from occurring.
The target EVS is currently bound to file systems.
Insufficient EVS licenses exist to create a new EVS on the target.
Configuration Sync will be employed to perform global configuration changes only when EVS migration has succeeded. The EVS ID is permanently associated with the EVS, following it across clusters. The name or label of the EVS can be changed easily. When the labels and IDs match, or when no target EVS is present, the following happens with EVS migration:
Delete the target EVS if it exists
Import or create the source EVS
When labels match, the assumption is that uniform networking is in use. This means that the source EVS will be disabled prior to enabling the target EVS. If labels do not match, the following happens:
The addressing of the existing target EVS (and its private label EVS) will be preserved and applied to the target EVS after it is restored. The source EVS will not be disabled prior to enabling the target EVS.
When performing a disaster recovery-load operation to create a snapshot of an in-production pool, the production EVS will not be disabled. Therefore, the target EVS should be created with a different name and uniform networking should not be used.
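The selection logic described above, preferring EVS migration and falling back to configuration synchronization, can be sketched in Python. This is a simplified model; the function and field names (can_use_evs_migration, spans_multiple_pools, and so on) are illustrative assumptions, not part of the shipped perform.py.

```python
# Simplified model of the EVS-migration eligibility rules described above.
# All names are illustrative; the shipped perform.py differs.

def can_use_evs_migration(evs, licenses_available):
    """Return True when EVS migration is permitted for this span transition."""
    # The source EVS must be isolated to the pool(s) being transitioned.
    if evs["spans_multiple_pools"]:
        return False
    # A target EVS serving other filesystems cannot be deleted for the import.
    if evs["target_bound_to_other_filesystems"]:
        return False
    # Importing the EVS consumes an EVS license on the target cluster.
    if licenses_available < 1:
        return False
    return True

def choose_method(evs, licenses_available, nomigrate=False):
    """EVS migration is preferred; otherwise fall back to configuration sync."""
    if not nomigrate and can_use_evs_migration(evs, licenses_available):
        return "evs-migration"
    return "config-sync"
```

Note that even when "evs-migration" is chosen, the text above states that global configuration synchronization still runs afterward for global items; this sketch only models which mechanism handles the per-EVS configuration.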
Basic Operations The disaster recovery solution for Hitachi NAS Platform supports five primary operations:
Status
Display status
Identify issues
The status operation reports on the current state of the pairs and identifies any corrective action required. The hdrsmon daemon running on the primary system invokes status periodically, automatically correcting a set of configured issues. Unfixable issues will be reported to Hitachi NAS Platform using its event log and forwarded to management if Hitachi NAS Platform is being monitored by the Hitachi HiTrack system.
Correct
Perform any corrections needed
Correct uses the result of the status operation to take a set of programmed actions to reduce or repair the issues it detects. These actions may include the following:
Initiating replication
Re-syncing pairs
Creating a Hitachi NAS Platform pool
Re-assigning an EVS
Mounting file systems
Waiting for replication copy to complete
Adding a new copy-group to Hitachi Open Remote Copy Manager, and restarting Open Remote Copy Manager service
The correct process is iterative and normally continues as long as progress is being made or until a time or cycles limit is reached. Hdrsmon uses correct to implement repairs for any of the issues it has been configured to repair.
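The iterative behavior described above (continue while progress is being made, stop at a time or cycle limit) can be sketched as follows. This is a hypothetical model of the loop, not the shipped implementation; the callables and return convention are assumptions loosely based on the return codes documented later for operation=correct.

```python
import time

# Illustrative sketch of an iterative correction loop: repair detected
# issues, re-check, and stop when clear, stalled, or out of time/cycles.

def correct(detect_issues, repair, maxwait=300, cycles=10):
    """Return the number of cycles used: positive when all issues were
    cleared, negative when issues remain or no progress was made."""
    deadline = time.monotonic() + maxwait
    previous = None
    for cycle in range(1, cycles + 1):
        issues = detect_issues()
        if not issues:
            return cycle              # all clear: success in 'cycle' cycles
        if previous is not None and len(issues) >= len(previous):
            return -cycle             # no progress is being made
        for issue in issues:
            repair(issue)
        previous = issues
        if time.monotonic() > deadline:
            return -cycle             # time limit reached
    return -cycles                    # cycle limit reached
```

A caller would pass in the status-evaluation and repair actions; hdrsmon could invoke such a loop with the issue set it has been configured to repair.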
Takeover
Swap primary and secondary roles
Performing a planned takeover causes the primary and secondary roles to swap for the specified pool or pools. It then exposes the storage, file systems, and EVS binding on the new primary. Assuming that inter-array communication is healthy, this process takes only a few minutes. Part of this process is restoring the configuration that was last active on the old primary onto the new primary. In a typical deployment, a script created on-site invokes release on the old primary, invokes DNS/IP changes, and then invokes takeover on the existing primary. The configuration synchronization that occurs during takeover is performing the equivalent of a cross-cluster EVS migration. Each EVS running on the specified pool will be restored at the target site. When not using a migration takeover, after transitioning the storage, a configuration rebuild phase is performed that applies the configuration from the partner site to the new storage. The resources include the following:
Cluster name space (CNS)
Virtual volumes
CIFS and SMB shares and share permissions
NFS exports
The purpose is to synchronize the configuration between the old and new site. This is accomplished by removing the old configuration and rebuilding all configuration items. Additional complexity arises if factors prevent an organized takeover, such as the loss of communications between any of the components that comprise this system. In such a situation, starting from a PAIR(Primary)-PAIR(Secondary) state, a forced takeover on the Secondary results in a PAIR(Primary)-SSWS(Primary) situation. The new primary can mount the storage writeable after forcing the drives to 'primary.' The old primary, being out of contact, is unaware of the transition. Once communication is restored, the original (old) primary becomes PSUE(Primary) [suspended due to error]. Then the pairs can be swapped to restore the new PAIR(Secondary)-PAIR(Primary) state. All of these activities are carried out transparently by the disaster recovery solution for Hitachi NAS Platform management software. Here is a description of the process:
(1) The operator requests a forced takeover using 'bin/perform.py operation=takeover force'. Perform.py recognizes that the lack of communication with the peer requires a forced takeover.
(2) After the takeover, perform.py 'pegs' the drives so that access to storage transitions to the other site.
(3) Once a change in status occurs, such as communication being restored, the hdrsmon daemon executes 'bin/perform.py operation=status' in the background to correct the pairing status and drive access, and to undo the SCSI mode pegging automatically.
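The pair-state transitions described above can be modeled as a toy state machine. The PAIR, SSWS, and PSUE labels follow the replication states quoted in the text; the functions themselves are illustrative only and are not part of the solution software.

```python
# Toy model of the forced-takeover state transitions described above.
# Each side is a dict with 'state' (PAIR/SSWS/PSUE) and 'role'.

def forced_takeover(primary, secondary):
    """Secondary forces takeover while out of contact with the primary."""
    secondary["state"], secondary["role"] = "SSWS", "Primary"
    # The old primary, being out of contact, is unaware of the transition.
    return primary, secondary

def restore_communication(primary, secondary):
    """On reconnect, the original primary is suspended due to error."""
    primary["state"] = "PSUE"
    return primary, secondary

def swap_pairs(primary, secondary):
    """Swap roles to restore a healthy PAIR(Secondary)-PAIR(Primary) state."""
    primary.update(state="PAIR", role="Secondary")
    secondary.update(state="PAIR", role="Primary")
    return primary, secondary
```

Walking a pair through forced_takeover, restore_communication, and swap_pairs reproduces the PAIR/PAIR, PAIR/SSWS, PSUE/SSWS, and PAIR/PAIR sequence described in the paragraph above.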
Release
Prepare to lose primary status, by unmounting file systems
Disaster Recovery-Validation
Management of a snapshot on the disaster site
Disaster recovery-validation is managed using two operations:
dr-load
dr-unload
The goal of disaster recovery-validation is to do the following:
Create a point-in-time snapshot of the production data.
Present the production data on the disaster site, where the data with the configuration can be used to validate disaster-site readiness.
dr-load will do any and all of the following:
Create the pair, then split the snapshot.
Expose the LUNs to Hitachi NAS Platform as a pool.
Configure the Hitachi NAS Platform resources on this pool using configuration synchronization.
Additional options will do either or both of the following:
Refresh the snapshot with the current state of the storage on the production system
Freshen the configuration to match the production system
dr-unload will do any and all of the following:
Delete the Hitachi NAS Platform configuration, including unmounting filesystems.
Hide the LUNs so that takeover can be performed.
The operations described can be visualized in the following diagram.
Figure 11
Management through Hitachi Command Suite
Hitachi Command Suite is a global management suite for enterprise storage management. In a typical environment, a customer may license one or more of the components in Table 1 to manage them through the Command Suite framework.

Table 1. Hitachi Command Suite Framework Components

Automation — Hitachi Automation Director
Analytics — Analytics in Hitachi Command Suite
Mobility — Data Mobility in Hitachi Command Suite; Data Mobility for Mainframe in Hitachi Command Suite; Nondisruptive migration
Replication — Hitachi Local Replication; Hitachi Local Replication for Mainframe; Hitachi Remote Replication; Hitachi Remote Replication for Mainframe; Hitachi Remote Replication Extended; Global-active device
Mainframe Compatibility — Hitachi Compatible PAV for Mainframe
Common Operating System — Hitachi Storage Virtualization Operating System
Command Suite uses a set of agents to interact with various elements. In particular, it uses pair management servers, which are systems running Hitachi Device Manager agent software on a Microsoft Windows or Linux host with Fibre Channel connectivity to one or more storage arrays. To allow visibility and limited management of resources that the disaster recovery solution for Hitachi NAS Platform is providing or has provisioned, a Hitachi Device Manager agent is packaged with the disaster recovery solution, and a special set of Hitachi Open Remote Copy Manager files is made available to give Device Manager visibility and access. The agent registers with the Device Manager server during installation if its address has been configured in the disaster recovery solution configuration file.
Hitachi Device Manager will periodically refresh the pair configuration information that it obtains using this agent. Device Manager allows Hitachi Replication Manager to monitor or perform some pair management operations on storage being managed through the disaster recovery solution for Hitachi NAS Platform. The disaster recovery solution host should be visible as one of the Linux agents for Device Manager, and, if configured correctly and after a refresh, pair configurations will display the managed pools. See Figure 12.
Figure 12
Summary of Operations
Table 2 provides a summary of available commands and options.

Table 2. Summary of Operations

release — Prepare to release the Primary copy from the active site.
Arguments: pool=pool-or-poolpattern, cluster=cluster
Results: 0 is success; -1 is failure.

takeover — Initiate a transition to change the active site for the specified pool.
Arguments: force=y, pool=pool-or-poolpattern, noconfig=y, noglobal=y, nomigrate=y, cluster=cluster
Results: 0 is success; -1 is failure; N is the count of configuration failures.

rebuild — Reapply configuration or a registry dump for the specified pool.
Arguments: db=database-spec, noaction=y, noglobal=y, nomigrate=y, foreign=cluster, delete=nodelete|deletefirst
Results: 0 is success; N is the count of failures.

dr-load — Take a snapshot of a disaster recovery storage pool and present it to Hitachi NAS Platform.
Arguments: snaptype=SI,CW, pool=pool, snap=snapshot, refresh=config|snap, noglobal=y, nomigrate=y
Results: 0 is success; -1 is failure.

dr-unload — Unload and optionally invalidate snapshot resources.
Arguments: release=y, snap=snapshot
Results: 0 is success; -1 is failure.

correct — Perform correction of issues impacting the solution.
Arguments: pool=pool-or-poolpattern, maxwait=secs, cycles=number, issues=issueset
Results: +N is success. See the Operations Reference.

status — Report on the status of all components, correcting discrepancies if needed.
Arguments: pool=pool
Results: 0 is success. See the Operations Reference.

display — Display XML log file.
Arguments: pattern=pattern
Results: 0 is success.
Global Arguments
The arguments in Table 3 can be used with any of the commands listed in Table 2.

Table 3. Global Arguments

flush=y — Used to discard any stored DB.
result=filename — Used to record the result of the operation.
The result of an operation invocation is always an XML file.
When the "result" argument is missing, an XML file will be created using the name of the operation and a date specification which is stored in the LOG directory.
If the "result" argument supplies the filename, the XML file will be post-fixed in the LOG directory using filename.xml.
Using stdout for the filename causes the result to be displayed rather than stored.
A dash '-' can be used in place of result=stdout
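The result-file rules above can be sketched in Python. This is an illustrative model of the naming behavior only; the actual perform.py implementation and timestamp format are assumptions.

```python
import datetime
import os

# Sketch of the result-file naming rules described above (illustrative).

LOG_DIR = "LOG"

def result_target(operation, result=None):
    """Return the path the XML result is written to, or 'stdout'."""
    if result in ("stdout", "-"):
        return "stdout"                       # display instead of storing
    if result is None:
        # No result= argument: <operation>-<date>.xml in the LOG directory.
        stamp = datetime.datetime.now().strftime("%Y-%m-%d-%H%M%S")
        return os.path.join(LOG_DIR, f"{operation}-{stamp}.xml")
    # result=filename: stored in LOG with .xml appended.
    return os.path.join(LOG_DIR, f"{result}.xml")
```

For example, result_target("status") yields a dated file under LOG/, while result_target("status", "-") yields stdout, matching the dash shortcut above.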
Command Syntax perform.py [flush=y] operation=status|takeover|release|rebuild|dr-load|dr-unload|query|display [operation-specific options] [result=filename] The operation-specific options are covered with each operation.
Output Coloration The output of commands is colored to provide a visual indication of the success of the operation. For example, in the output below, the yellow on red background means that status identified issues that need investigation or correction.
In the example below, the 'correct' command was successful.
Note - In both of the examples above, the detailed information has been stored in XML files in the LOG directory on the HDRS host.
Command Example The following example was executed directly on the SMU.
The output file contains only the XML which starts after any Progress/Debug messages. The coloration is not part of the file. It is added to help the reader focus on key information. Initiating collection (fs,smu,cluster,repl,evs,pool)...done
To re-display the XML with colored highlighting, use 'bin/perform.py display pattern=status'
Command Shortcuts The following shortcuts simplify the syntax required for issuing a command.
Using '-' represents 'result=stdout'
The operation can be specified without 'operation='
Example: bin/perform.py operation=status -
Example: bin/perform.py status -
Where only a single mandatory parameter is required, its name can also be omitted (pattern= is not needed)
Example: bin/perform.py display status
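The shortcut expansion above can be sketched as a small argument normalizer. This is a hypothetical model of the parsing behavior, not the shipped parser; the operation list mirrors the command syntax given earlier.

```python
# Sketch of the command-line shortcuts described above (illustrative).

KNOWN_OPERATIONS = {"status", "takeover", "release", "rebuild",
                    "dr-load", "dr-unload", "query", "display", "correct"}

def normalize_args(argv):
    """Expand shortcut forms into explicit name=value arguments."""
    args = {}
    positional = []
    for token in argv:
        if token == "-":
            args["result"] = "stdout"         # '-' stands for result=stdout
        elif "=" in token:
            name, value = token.split("=", 1)
            args[name] = value
        elif token in KNOWN_OPERATIONS and "operation" not in args:
            args["operation"] = token         # bare operation name
        else:
            positional.append(token)          # e.g. the pattern for 'display'
    if positional and args.get("operation") == "display":
        args.setdefault("pattern", positional[0])
    return args
```

With this model, `display status` resolves to operation=display pattern=status, matching the final example above.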
Microsoft Windows PowerShell Command Syntax perform.ps1
IP|DNS-name
status|takeover|release|auto-correct|revert
Note — The implementation of the perform.ps1 script is fairly rudimentary but provides a basis for further integration.
Table 4. Microsoft Windows PowerShell Command Syntax

status — Performs 'operation=status' result=req-datetime
takeover — Performs 'operation=takeover' result=req-datetime
release — Performs 'operation=release' result=req-datetime
revert — Performs 'operation=takeover' noconfig result=req-datetime
correct — Performs 'operation=correct' result=req-datetime
The following example was executed using PowerShell PS C:\Users\steve> .\perform.ps1 172.31.62.72 status pool=HUR_POOLC
pool       available state
----       --------- -----
HUR_POOLC  Yes       PAIR
The perform.ps1 script is only one example of using PowerShell to invoke operations on the SMU and to parse the result. In the example above, the following happens:
1. Invokes perform.py using PLINK.EXE (ssh).
2. Retrieves the result file using PSCMD.EXE (scp).
3. Displays the result as text.
4. Opens the result as an XML object.
5. Performs a simple query.
Configuration Files Three configuration files used to configure the operation of the utility are stored in the conf directory:
general.cfg — This contains general configuration items and the configuration of all supported commands.
use-cases.cfg — This is used to provide pool-specific or characteristic driven options.
hdrs.cfg — This is created prior to installation and normally not edited unless the array credentials are changed.
general.cfg
The configuration file consists of a set of stanzas, each named by a string within square brackets, for example [operations]. Each stanza has a list of name-value pairs. A value is multi-valued when it contains items separated by commas. In the general.cfg file, there is a DEFAULT stanza, along with a stanza for each supported operation. Each operation stanza consists of a function entry, which is the name of the Python definition implementing the operation, as well as name-value pairs for managing the arguments of the function: required, optional, choice, and so forth. The contents of general.cfg should not be changed, with the exception of the DEFAULT stanza. This is a small extract from the file.

[DEFAULT]
data_freshness = 30
dbstale = 120
debug = 0
sleeptime = 30
operations = query,takeover,release,status,rebuild,dr-load,dr-unload,correct,display

[status]
function = status_hur
required = operation
optional = pool,cluster,result

[takeover]
function = takeover_hur
required = operation
optional = pool,force,result,noconfig

[release]
function = release_hur
required = operation
optional = pool,cluster,result
[rebuild]
….
[dr-load]
….
[dr-unload]
…
[correct]
function = status
required = operation
optional = maxwait,issues,cycles
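Because the stanza/name-value layout above is standard INI syntax, it can be read with Python's configparser. The snippet below is a sketch using an abbreviated sample modeled on the extract; it is not the solution's actual loader.

```python
import configparser

# Sketch of reading the general.cfg stanza layout (abbreviated sample
# modeled on the extract above; not the solution's actual loader).

SAMPLE = """
[DEFAULT]
debug = 0
sleeptime = 30
operations = query,takeover,release,status

[status]
function = status_hur
required = operation
optional = pool,cluster,result
"""

def load_operations(text):
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    ops = {}
    for stanza in cfg.sections():        # DEFAULT values propagate implicitly
        ops[stanza] = {
            "function": cfg.get(stanza, "function"),
            # Multi-value fields are comma-separated lists.
            "optional": cfg.get(stanza, "optional").split(","),
        }
    return ops
```

Note that configparser automatically merges DEFAULT-stanza values into every operation stanza, which matches the role of DEFAULT described above.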
use-cases.cfg
The DEFAULT stanza in use-cases.cfg applies values that are not otherwise overridden by more specific entries.

Table 5. use-cases.cfg Attributes
Explanation
superflush
Specifies the system-drive superflush to use for newly provisioned pools. Reference: sd-set Hitachi NAS Platform command line interface.
queuedepth
Specifies the system-drive queuedepth value to use for newly provisioned pools. Reference: sd-set Hitachi NAS Platform command line interface This is supported in version 12.4 and later.
stripecnt
This is the number of drives incorporated into a single stripe (logically contiguous span area). Note: only a quantity of drives divisible by the stripecnt will be added to a span. Reference: span-create Hitachi NAS Platform command line interface.
pairrepair
A set of one or more hexadecimal values, separated by commas, indicating issues that should be automatically corrected. A value of '0x0' indicates that no repairs are performed.
vsmserial
The array serial number sponsoring the pool's resources. This attribute is mandatory when using the site="GAD-both" storage model. The value 'n/a' can be used as a wildcard, but is not advised. Note: if the serial number is 6 digits starting with a "3," omit the 3. Use 'sd-list -R' to view the serial number seen by Hitachi NAS Platform.
evs-interface The name of an EVS interface that should not be imported as part of EVS migration. This is typically used if the EVS uses a non-standard interface only at the target site (e.g., evs-interface=agi-vlan0712).
The logic used to determine the stripecnt, pairrepair, and vsmserial is the following:
If there is a stanza matching the pool with the desired attribute, use it.
Otherwise, see if the pool matches any pool-patterned stanzas containing the desired attribute.
If a pool-patterned stanza exists, use it.
Otherwise, use the value in the DEFAULT stanza.
For the 'superflush' and 'queuedepth' attribute, a modified algorithm is used. The search order is as follows:
Specified-pool, pool-match, drive-type, drive-tier, DEFAULT.
Drive-type: 'SAS', 'SSD', and other types reported by sd-object-dump. Reference: sd-object-dump in the Hitachi NAS Platform command line interface.
Drive-tier: 'tier' followed by a numeric value.
The contents of this file can be changed. What follows is a small extract from the file that demonstrates how the file can be used.

[DEFAULT]
superflush = -w 1 -s 384
queuedepth = 32
stripecnt = 4
pairrepair = 0x40000000,0x10000000,0x8000,0x4000,0x1000
vsmserial = n/a

[tier0]

[SSD]
queuedepth = 64

[SAS]

[HUR-sample-1]
stripecnt = 8
queuedepth = 16

[HUR-sample-SSD*]
queuedepth = 128

[GAD-norepair]
pairrepair = 0x0
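The lookup order above can be sketched in Python. This is a simplified model of the described search order (specified pool, pool pattern, then for superflush/queuedepth the drive-type and drive-tier stanzas, then DEFAULT); the function name and data layout are assumptions, not the shipped code.

```python
import fnmatch

# Sketch of the use-cases.cfg attribute-lookup order described above.
# stanzas: dict of stanza-name -> dict of attribute values.

def resolve(stanzas, attribute, pool, drive_type=None, drive_tier=None):
    """Return the first value found in the documented search order."""
    search = [pool]
    # Pool-pattern stanzas, e.g. [HUR-sample-SSD*].
    search += [s for s in stanzas
               if s not in ("DEFAULT", pool) and fnmatch.fnmatch(pool, s)]
    # superflush and queuedepth additionally consult drive type and tier.
    if attribute in ("superflush", "queuedepth"):
        search += [s for s in (drive_type, drive_tier) if s]
    search.append("DEFAULT")
    for name in search:
        value = stanzas.get(name, {}).get(attribute)
        if value is not None:
            return value
    return None
```

Using the extract above, a pool matching [HUR-sample-SSD*] would resolve queuedepth to 128, while any other SSD-backed pool would fall through to the [SSD] stanza's value of 64.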
hdrs.cfg This file is described in the Implementation Guide, and in the section hdrs.cfg Config File Values on page 143, as this file is created during the design of an HDRS solution.
Operations Reference This describes each of the supported operations.
Operation=display This function locates the newest file under LOG/ matching the supplied pattern and displays it with colorization.
Returns
0 — success (no issues)

Required arguments
pattern=grep-pattern
Sample Transaction
bin/perform.py operation=display pattern=status — Displays the most recent LOG file named 'status' (usually the last status command).
bin/perform.py display ^status — Displays the most recent LOG file starting with the pattern 'status'. Note: operation= and pattern= are omitted.
bin/perform.py display correct-2016-01-16 — Displays the most recent LOG file containing the pattern 'correct-2016-01-16', presumably a correct operation performed on January 16, 2016.
bin/perform.py display fix — Displays the most recent automated correction performed by hdrsmon.
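The newest-file lookup that display performs might be sketched as follows. This is an illustrative implementation, not the shipped one; it only assumes what the text states: a grep-style pattern matched against filenames under LOG/, with the most recent match chosen.

```python
import os
import re

# Sketch of locating the newest LOG file matching a grep-style pattern.

def newest_log(log_dir, pattern):
    """Return the path of the most recently modified file under log_dir
    whose name matches the regular-expression pattern, or None."""
    rx = re.compile(pattern)
    matches = [os.path.join(log_dir, name)
               for name in os.listdir(log_dir) if rx.search(name)]
    if not matches:
        return None
    return max(matches, key=os.path.getmtime)
```

With files such as status-2016-01-15.xml and status-2016-01-16.xml present, newest_log(log_dir, "^status") selects the more recent of the two.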
Operation=status This function evaluates the current status of the replication environment on both sites and of the Hitachi NAS Platform configuration, reporting any issues. The status operation does not perform any corrective actions, but it displays single-line information on the issues it identifies.
Returns
0 — success (no issues)
1 — pairs have non-healthy status
+ system-severities are the following:
0 — healthy
10 — minor
20 — moderate
30 — severe
40 — critical
Optional arguments
pool=poolname
This specifies the name of an existing pool, if statistics on an individual pool are desired.

XML description
The XML output contains a 'pair' stanza for each managed pool. For each pair, information is provided from each Hitachi NAS Platform cluster. If a disaster recovery-validation snapshot exists, it will be cataloged along with an indication of whether it is mounted. A code and pair-state are supplied in the 'health' tag for each pair.
Issues — Any detected irregularities are noted. Each issue contains a description and, if it is actionable, a code that is documented in the section Pair Status Codes on page 133.
Health — A final code, pair-code, and system-state are provided for the system as a whole. The pair-code is the collection of all pair-related issues, and the code represents the system issue, if any. The system issues are documented in System Status Codes.
Status — This indicates the return code for the operation.
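A consumer of the status output could walk these elements with a standard XML parser. The sketch below uses a hand-made sample built from the element and tag names in the text; it is not actual tool output, and the exact attribute layout of the real XML may differ.

```python
import xml.etree.ElementTree as ET

# Sketch of reading per-pair health from a status result.  The sample
# is hand-made from the element names in the text, not real tool output.

SAMPLE = """<request>
  <operation command="status">
    <pair pool="HUR_POOLC">
      <health code="0x0" pair-state="healthy"/>
    </pair>
    <health system-state="healthy" pair-code="0x0"/>
    <status code="0"/>
  </operation>
</request>"""

def pair_states(xml_text):
    """Map each managed pool to its reported pair-state."""
    root = ET.fromstring(xml_text)
    return {pair.get("pool"): pair.find("health").get("pair-state")
            for pair in root.iter("pair")}
```

This pattern, iterating the 'pair' stanzas and reading each subordinate 'health' element, mirrors the structure described above.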
Sample Transaction
In the following output, the XML element request/operation/health indicates that the system was found to be healthy:
system-state=healthy
Each pair-state=healthy
One issue was flagged, and no actions/corrections/recommendations were made.
The issue that was flagged does not contain an issue code, so it is informational in nature. In the case of a stretched cluster, each node's primary access is to its local array, with secondary access using remote paths to the peer's array. In the example below, all pools are included in the status output.
The healthy highlighting shown in green-on-blue identifies two replicated storage pools, HDRS-STRETCH-1 and HDRS-STRETCH-2, as well as the overall health of the system and the status of the operation. Shown in yellow-on-red is the fact that each node can access its peer's storage using remote paths if needed. Also note that under each paired pool are XML elements for each node, rather than for a cluster as shown in the following example.
There are several points to notice in the previous output:

Messages emitted prior to the XML are indications of actionable issues that should be corrected using the 'correct' operation. This output is displayed even if the XML has just been logged. In this case, two pools have a global-active device replication suspension that can be corrected.

This system consists of five replicated pools in a dual cluster environment (see the green-on-blue highlighted lines). Two of the pools are flagged with issues 0x80000 and 0x100 (yellow-on-orange), indicating the global-active device suspension and the impact to drive paths as a result of one of the global-active device pairs being 'blocked'.

Each issue has an 'action' element indicating the proper course of action; that is, to let 'correct' repair the global-active device suspension.

Each pool is also given an imperfect health code (0x80100) indicating the issues flagged against it. This is shown in orange-on-white under each pair element.

The overall health of the solution is shown as /request/operation/health.
The overall pair-code for the solution is the accumulation of all pair-code values affecting any of its pairs.
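Given the hexadecimal issue codes shown above, this accumulation is naturally a bitwise OR of the per-pair codes. The sketch below is illustrative; the example values are taken from the codes quoted in the text.

```python
# Sketch of accumulating per-pair issue codes into an overall pair-code.
# A bitwise OR preserves every distinct issue flag that was raised.

def overall_pair_code(pair_codes):
    code = 0
    for value in pair_codes:
        code |= value
    return code
```

For instance, a pool flagged with both 0x80000 (replication suspension) and 0x100 (drive-path impact) contributes the combined health code 0x80100 seen in the output above.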
The config-state=healthy indicates that a recent configuration backup had been taken and archived, as shown below in /request/operation/configuration. Of final note, shown in yellow-on-red, the overall status code for operation=status is '1', indicating that pairing issues exist but no system issues.
System issues are issues that are environmental rather than being pool or pair specific. An example is shown below.
System issues are not encapsulated within a 'pair' XML element. In the example above, code=0x100000 has been raised against every node in our dual cluster. The state="s|s|s|s" indicates that all four Fibre Channel ports have lost partial drive access. This can occur when one of the arrays is disconnected or damaged. The example further shows that the HDRS-GAD-1 pool is now using 'all' remote paths, as it is forced to do remote I/O with the peer array. An additional system issue, code 0x200000, was also flagged to indicate that there are Fibre Channel pathing issues affecting device availability. In this example, there are no 'action' elements providing guidance on this issue. Because this is a global-active device protected pool, use of the remote paths is automatic, and therefore the health for replication pairs is still healthy. This final example shows the status for a Hitachi Universal Replicator pool.
The pool replicated using Hitachi Universal Replicator does not fare as well when the array is lost. It loses all paths to its local copy, causing Hitachi NAS Platform to experience a storage failure that forcibly unmounts its filesystems. The 'action' recommendation is to perform a 'forced takeover' to transition the copy at the disaster site to become the primary, and thus writeable. The status code is the result of both a serious system error and the loss of pools (at least temporarily). The following major XML elements are used in status and correct output. Individual tags are described in the section XML Tags on page 130.

Table 6. Major XML Elements

request — An element encapsulating the logical request. Tags: elapsed, finish, start, version.
arguments — An element containing the arguments of the request. Tags: operation, [see individual commands].
operation — An encapsulating element for the actual operation. Tags: command, function, start.
pair — An element for each replicated pool as specified by the pool argument (or its absence). Tags: available, home-cluster, pool, pool-cluster, storage-cluster.
health — An element subordinate to each pair indicating the health state of the pair. Also an element subordinate to 'operation' indicating the overall health of the system, configuration, and pairs/pools. Tags: config-state, system-code, system-state, pair-code.
configuration — An element containing status about the configuration management of the components of the solution, and when they were most recently backed up. Tags: elapsed, start.
status — The value of the operation return code. Tags: code.
cluster — A subordinate element to a 'pair' in a dual-cluster environment; contains the attributes of a pool on a single cluster. Tags: drive-access, drive-role, drive-status, iomode, name, pool-available, state, type.
node — A subordinate element to a 'pair' in a stretched-cluster environment; contains the attributes of a pool on a single node. Tags: drive-access, drive-role, drive-status, iomode, name, pool-available, state, type.
stats — A collection of statistics about a pool replicated using Hitachi Universal Replicator. Tags: avg-changes-sec, avg-queuejournal-util, max-changes-sec, max-queue.
snapshot — Information about a DR validation snapshot appearing on the disaster site for validation/backup purposes. Tags: horcm-ready, loaded, pct, pool, role, snap, state.
Other XML elements used in status and correct output:

Table 7. XML Elements

issue — An issue identifying a system or pool-specific problem. Tags: code, msg, severity, system-code, [drive-list, reason, using-local-paths, using-remote-paths, affected-drives, detail, filesystems, state, …].
action — A message possibly containing a recommended course of action. Tags: recommendation.
smu — A command record issued against the current cluster's SMU (unless noted otherwise). Tags: cmd, returncode, [cycle, cluster, pause].
corrective-actions — The record of a completed automated correction. Tags: cycle, code.
horcm — A command record issued to the replication subsystem. Tags: cmd, returncode, [cycle].
config — An element containing status about the configuration management of the components of the solution, and when they were most recently backed up. Tags: start, status, topic.
Operation=correct This function evaluates the current status and performs any required corrective action for the replication environment on both sites and the Hitachi NAS Platform configuration.
Returns
0 — success (no issues found),
255 — failure
+ value — issues have been repaired in value cycles
- value — the number of cycles attempted, but some issues remain, or correction failed
Optional arguments
pool = pool name Specifies the name of an existing pool, if corrections on an individual pool are desired.
issues = issues The set of issues for which correct should be tasked with correction. These values may come from a previous 'status' operation, and should be specified as a single hexadecimal value, e.g. issues=0x80100.
maxwait = seconds The maximum amount of time in seconds to spend processing actions. No subsequent actions will be performed after the time has expired.
cycles = count
The maximum number of iterations to perform in attempting to correct discovered issues.
XML description Refer to the XML description for the status operation, as auto-correct leverages much of the same logic. When an issue is detected, an action tag is added describing the overall operation to be performed. Depending on the complexity, a section stanza may be added, which contains all of the commands issued to address the issue. If the issue is corrected, a corrective-actions tag indicates whether the operation succeeded, and another cycle may be required to determine if additional actions are necessary. If this happens, an interim XML log file is created.
Status — This indicates the return code for the operation, as described above.
Sample Transaction In the following example, correct is repairing any issues associated with a specific pool.
After a subsequent check, no additional issues are detected, the correction loop terminates, and since the correction was made, the resulting status was successful. To see the details of the correction, use 'bin/perform.py display correct' to view the XML file.
The two XML elements and one element show the corrective action that was performed. If you know the issue(s), they can be specified using issues=0xNNNN. In this way, 'correct' won't continue scanning for issues after it makes the correction (as shown in the example below). A status code=1 indicates that the issue was corrected in one cycle. If you don't want 'correct' to recheck for other issues (and thus complete faster), either specify the issues it should correct or use cycles=1 as shown below.
Operation=release This function performs a graceful preparation prior to yielding Primary access to paired pools. It ensures that the latest configuration has been collected and then releases any services accessing the shared data, such as un-mounting file systems.
Note — If operation=takeover force is used, the release will be performed automatically and remotely.
Returns
0 — success
-1 — failure
Optional arguments
pool=pool name This specifies the name of an existing pool or a glob pattern to select the pools to release. The default is the following: pool=*
cluster = clustername This indicates which cluster will be taking over the pool.
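Because the pool argument accepts a glob pattern, an invocation such as pool=HUR_* selects every pool whose name begins with HUR_. A minimal sketch of this style of matching, using Python's fnmatch and hypothetical pool names (HDRS discovers the real set itself):

```python
import fnmatch

# Hypothetical pool names for illustration only.
pools = ["HUR_pool1", "HUR_pool4", "GAD_pool2"]

# pool=HUR_* would select only the pools matching the glob pattern.
selected = [p for p in pools if fnmatch.fnmatch(p, "HUR_*")]
```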
Sample Transaction
Operation=takeover This function performs the role swap for the specified pool after verifying that the current configuration state supports the operation. The takeover operation performs the storage role swap, and then proceeds to restore any missing configuration between the systems. The final step is to prepare the new secondary for a future role swap.
Returns
0 — success
-1 — failure
Other — The count of failures during re-configuration
Optional arguments
pool=pool name This specifies the name of an existing pool or a glob pattern to select the pools to takeover. The default is the following: pool=*. Note: if you specify the pool, the cluster is implicit in the pool's current location.
noconfig This instructs the system to not perform any configuration changes during takeover, other than re-assigning the EVS and mounting the file systems.
noglobal This instructs the system to not perform any global configuration changes that are non-file system related during takeover. An example includes CNS operations.
nomigrate This instructs the system to not attempt EVS migration from the existing primary to the takeover site using registry backup and import. See the migration notes for more information.
cluster = clustername This indicates which cluster will be taking over for all of the pools. This parameter cannot be used when 'pool' is specified.
Migration notes
EVSs selected for migration must support only file systems bound on one of the selected pools.
If the EVS ID does not exist in the target cluster, it will be created.
If insufficient EVS licenses are available, EVS migration will not be used.
If the EVS ID does exist and there is more than one EVS with the same name, the target EVS will be disabled and deleted prior to import. If the EVS ID does exist with a different name, it would not be deleted, but imported as described in the sections that follow.
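The migration notes above amount to a small decision table. A sketch of that logic follows, with an illustrative helper (not part of the HDRS tool) that maps a candidate EVS to the action the notes describe; target_evs is an assumed mapping of existing EVS IDs on the target cluster to their names.

```python
def evs_migration_action(evs_id, evs_name, target_evs, licenses_available):
    """Illustrative decision logic per the migration notes above."""
    if not licenses_available:
        return "skip-migration"          # insufficient EVS licenses
    if evs_id not in target_evs:
        return "create"                  # EVS ID absent on target: created
    if target_evs[evs_id] == evs_name:
        return "disable-delete-import"   # same name: deleted prior to import
    return "import-without-delete"       # different name: imported, not deleted
```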
Sample Transaction The output from a forced takeover contains two XML files in addition to the progress information returned:
One XML file for the release
One XML file for the takeover
In the following takeover output, besides the XML results, progress is displayed using stdout. Since this was a forced takeover, it runs operation=release on the peer, and that output is displayed (in yellow). Due to the complexity of the operation, the output is divided into sections.
Running release pool: * on SCE-1939-C1
Saved output in LOG/release-20:52:00.xml
Retrieve configuration from 172.17.252.61
Execute takeover on SCE-1939-C2 for HUR_pool4
Execute takeover on SCE-1939-C2 for HUR_pool1
Horctakover phase has completed
Verify HNAS SCE-1939-C2 sees new pool HUR_pool4
Verify HNAS SCE-1939-C2 sees new pool HUR_pool1
Rebuild configuration phase
Rebuild configuration from saved DB for pool HUR_pool4
Rebuild configuration from saved DB for pool HUR_pool1
Prepare Secondary site (SCE-1939-C1) for future takeover
Saved output in LOG/takeover-23:53:33.xml
Operation=query This function performs a variety of query functions, including a free-form SQL search. Its use is only recommended for diagnostic purposes.
Returns
0 — success
-1 — failure
Required argument
sql=SQL expression This specifies the SQL expression to evaluate.
Sample Transaction Refer to the database content in “Active Directory Registration Tool” on page 139. SQL expressions conforming to SQLite3 are supported, but not tested.
bin/perform.py operation=query sql="SELECT * FROM smu"
Hquery "SELECT DISTINCT pool FROM pool"
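Since the backing store is SQLite3, queries like the one above can be prototyped against a local copy of the database with Python's sqlite3 module. The schema below is illustrative only; the real tables have more columns.

```python
import sqlite3

# Illustrative stand-in for the HDRS database; the real schema differs.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pool (pool TEXT, site TEXT)")
con.executemany("INSERT INTO pool VALUES (?, ?)",
                [("HUR_pool1", "A"), ("HUR_pool1", "B"), ("HUR_pool4", "A")])

# Equivalent of: operation=query sql="SELECT DISTINCT pool FROM pool"
rows = con.execute("SELECT DISTINCT pool FROM pool").fetchall()
```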
Operation=dr-load This function performs all the steps needed to support disaster recovery validation of the specified pool or pools:
Take a snapshot of the remote pool.
By default, mount the storage and apply the latest configuration, including VVols and shares.
Prepare the necessary Hitachi Open Remote Copy Manager environment to support the snapshot, potentially even allocating storage for this purpose.
Returns
0 — success
-1 — failure
Required arguments You must use one of the following arguments.
pool=pool-name
This specifies the name of an existing pool. To take the initial snapshot, use the following: pool
snap=snapshot-name This instructs the system to use an existing snapshot. If the snapshot-name is specified and it exists in SSUS mode, it will be loaded. Snapshots are named using the pool-name with a -CW or -SI extension.
Optional arguments
snaptype=CW | SI This instructs the system to use the specified snapshot mechanism. CW is the default mechanism. It covers Hitachi Thin Image and Hitachi Copy-on-Write Snapshot. Storage resources must be available or the load operation will fail.
refresh=config | snap This is used with the ‘snap’ option. refresh=snap instructs the system to re-snap, load, and configure an existing snapshot. When used with a loaded snapshot, refresh=config causes the configuration to be refreshed.
noglobal This instructs the system to not perform any global configuration changes (non-file system related) during takeover, such as CNS operations.
nomigrate This instructs the system to not attempt EVS migration from the existing primary to the takeover site using registry backup and import (see notes).
Notes
The function can only be used on a system in which the pools specified are in the secondary mode.
Disaster recovery-validation is only available at the secondary site for a given pool.
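The snapshot naming rule (pool name plus a -CW or -SI suffix) can be expressed as a one-line helper. This is a sketch of the convention only, not an HDRS function:

```python
def snapshot_name(pool, snaptype="CW"):
    """Derive the DR-validation snapshot name: pool name plus -CW or -SI.
    CW is the default mechanism, matching the dr-load default above."""
    if snaptype not in ("CW", "SI"):
        raise ValueError("snaptype must be CW or SI")
    return "%s-%s" % (pool, snaptype)
```

For example, snapshot_name("HUR-tfs") yields HUR-tfs-CW, the name used in the sample transactions that follow.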
Sample Transactions This example shows a fresh snapshot starting from an automatic Hitachi Open Remote Copy Manager setup. bin/perform.py operation=dr-load pool=HUR-tfs This example shows a re-load of a snapshot which had been unloaded.
bin/perform.py operation=dr-load snap=HUR-tfs-SI This example shows a re-snap of an existing snapshot.
bin/perform.py operation=dr-load snap=HUR-tfs-CW refresh=snap
waiting for all LUNs to reach SSUP state
This example shows a re-configuration of a loaded snapshot. bin/perform.py operation=dr-load snap=HUR-tfs-SI refresh=config
Operation=dr-unload This function unloads the specified snapshot, normally in preparation to support automatic takeover. When the destroy option is specified, it also frees any resources associated with the snapshot type or name, and terminates Hitachi Open Remote Copy Manager support for the snapshot provided.
When used with the release option, it splits the snapshot back to simplex mode, erasing any snapshot information it contains.
Returns
0 — success
-1 — failure
Required arguments
snap=snapname This specifies the name of an existing snapshot. Snapshots are named using pool-CW or pool-SI.
Optional arguments
release When specified, this performs a split to simplex mode, discarding any contents in the snapshot.
Sample Transaction This example shows a normal unload. bin/perform.py operation=dr-unload snap=HUR-tfs-CW
This example shows an unload with a release.
bin/perform.py operation=dr-unload snap=HUR-tfs-SI release
waiting for all LUNs to reach SSUP state
Operation=rebuild This function rebuilds the Hitachi NAS Platform configuration, or produces the commands to do so, using the specified database.
Do not use this unless directed to do so by support personnel.
Returns
0 — success
-1 — failure
Count — failed configuration operations
Required arguments
db=db-directory-or-prefix This specifies the directory or prefix of a set of database files to use. For example, for the last takeover configuration use the following: db=TAKEOVER/IP-address
pool=pool-name This specifies the name of an existing pool. To target the rebuild operation to a specific pool, use the following: pool
Optional arguments
noaction This does not perform any changes to the target. This is used to obtain the list of commands that were issued. This also generates output that can be used directly in a siconsole session.
foreign=cluster When specified, this uses the cluster name from the database rather than the current cluster. This is normally only used with the noaction option when importing a database from another site.
noglobal This instructs the system to not perform any global configuration changes that are non-file system related during takeover. An example includes CNS operations.
nomigrate This instructs the system to not attempt EVS migration from the existing primary to the takeover site using registry backup and import.
delete=deletefirst | nodelete When present, this modifies the behavior of the delete operations, which normally precede the create operations.
deletefirst: Default operation is to delete prior to creation.
nodelete: Perform no delete operations.
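The delete= behavior can be pictured as a simple ordering rule. The sketch below assumes operations are tagged as deletes or creates; the helper and the tagging are illustrative, not the HDRS implementation:

```python
def plan_rebuild(ops, delete="deletefirst"):
    """Order rebuild operations: deletes normally precede creates
    (deletefirst); nodelete drops the delete operations entirely."""
    deletes = [o for o in ops if o.startswith("del ")]
    creates = [o for o in ops if not o.startswith("del ")]
    if delete == "nodelete":
        return creates
    return deletes + creates
```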
Using delete=nodelete may not work in all cases. For example, creating a share assigns a set of default share permissions which are then modified. Skipping the delete may produce unexpected results.
Sample Transaction This example uses the .csv files under SAVE named table-172.31.62.70.csv to rebuild the configuration on the node where this is invoked.
bin/perform.py operation=rebuild db=SAVE/172.31.62.70
Warning: no such DB exists: SAVE/smu-172.31.62.70.db
In this example, a database directory was copied from a foreign cluster. It will be used to generate command line interface commands that can be replayed in an siconsole session. The output consists of two main parts: command line interface output and the standard XML output. The output shown below is abridged. The warnings about inconsistent columns indicate that the local database schema differs from the imported data and can usually be ignored. The output containing the command line interface-formatted instructions will need to have the ‘//’ converted to ‘/’ before pasting into the siconsole.
bin/perform.py operation=rebuild db=DB-10-08 noaction foreign=BN1-HNAS
Warning abnormal: missing columns need 26/ have 24 in table hur ['efun-hdp-03', '10.23.78.11:11000', '3', '94718', '13:00|13:01|13:02|13:03|13:04|13:05|13:06|13:07|13:08|13:09|13:0A|13:0B|13:0C|13:0D|13:0E|13:0F|13:A0|13:A1|13:A2|13:A3', '16|17|18|19|20|21|22|23|61|62|63|64|65|66|67|68|114|115|120|121', 'Secondary', 'NA', 'OK', 'Secondary', 'Allowed', 'N', 'NA', 'PAIR', '100', '0', '22.00', '34', '0.18', '0.48', '2.20', 'NML-25', '2000', '2000']
Warning abnormal: missing columns need 9/ have 7 in table smu ['1.0.5', '10.58.151.75', '10.58.151.75', 'running', '1', 'OK', 'Y']
namespace-rm -r Builds /
namespace-delete Builds
namespace-create Builds
evsfs add efun-fs01 1
evs-select 1
filesystem-list efun-fs01
namespace-mklink efun-fs01 /AdAssistancePlatform Builds /AdAssistancePlatform
namespace-mklink efun-fs01 /ADC_MPT_EVAL Builds /ADC_MPT_EVAL
namespace-mklink efun-fs01 /multimediaig Builds /multimediaig
namespace-mklink efun-fs01 /osdemand Builds /osdemand
namespace-mklink efun-fs01 /ping Builds /ping
namespace-mklink efun-fs01 /pla_store Builds /pla_store
namespace-mklink efun-fs01 /sdpstripe Builds /sdpstripe
evs-select 1
virtual-volume removeall efun-fs01
evs-select 1
virtual-volume list efun-fs01 Atlas
evs-select 1
virtual-volume list efun-fs01 DeliveryEngine
evs-select 1
virtual-volume list efun-fs01 cache1
evs-select 1
quota add --usage-limit 5T efun-fs01 cache1
evs-select 1
virtual-volume list efun-fs01 cache2
evs-select 1
quota add --usage-limit 5T efun-fs01 cache2
evs-select 1
virtual-volume list efun-fs01 cache5
evs-select 1
quota add --usage-limit 5T efun-fs01 cache5
evs-select 1
virtual-volume list efun-fs01 cnsroot
evs-select 1
virtual-volume list efun-fs01 sdpstripe
evs-select 1
virtual-volume list efun-fs01 searchgold
evs-select 1
cifs-share del --target-label efun-fs01 ping1
evs-select 1
cifs-share add --snapshot-dirs disable ping1 efun-fs01 \\ping1
evs-select 1
cifs-share del --target-label efun-fs01 cache1
evs-select 1
cifs-share add --snapshot-dirs disable cache1 efun-fs01 \\cache1
evs-select 4
cifs-share add --snapshot-dirs disable cache7 efun-fs04 \\cache7
cifs-saa add merchantdatapipeline REDMOND\\BUILDS_FULLCONTROL af
cifs-saa add merchantdatapipeline REDMOND\\BUILDS_READ ar
cifs-saa delete merchantdatapipeline Everyone
cifs-share del --target-label Builds adsapps
cifs-share add --snapshot-dirs disable --noscanforviruses adsapps Builds \\adsapps
cifs-saa add adsapps REDMOND\\BUILDS_FULLCONTROL af
cifs-saa add adsapps REDMOND\\BUILDS_READ ar
cifs-saa delete adsapps Everyone
cifs-share del --target-label Builds networkquality
cifs-share add --snapshot-dirs disable --noscanforviruses networkquality Builds \\networkquality
cifs-saa add networkquality REDMOND\\BUILDS_FULLCONTROL af
cifs-saa add networkquality REDMOND\\BUILDS_READ ar
cifs-saa delete networkquality Everyone
cifs-share del --target-label Builds av2
cifs-share add --snapshot-dirs disable --noscanforviruses av2 Builds \\av2
cifs-saa change av2 Everyone ar
Configuration Management To assist in the reconstruction of failed equipment, the loss of configuration, or the actual destruction of equipment, the disaster recovery solution for Hitachi NAS Platform maintains a configuration repository of Hitachi NAS Platform and a subset of array configuration information in readable and non-readable formats. In most cases, having this information can significantly reduce the time needed to re-create an operational environment and should make professional services unnecessary. The hdrsmon daemon automatically takes a daily backup, and then copies it to the backup system running the disaster recovery solution for Hitachi NAS Platform. The backup system maintains a 2-week rolling history. Hitachi NAS Platform information is collected automatically. Storage array configuration is managed through the HiTrack SVP agent and communication infrastructure. HDRS supplements this information through its discovery function, which creates an XML configuration along with selected raidcom commands.
Configuration Record The configuration record contains both the Hitachi NAS Platform configuration and SAN configuration. This information is used in reconstructing lost hardware in the case of a material loss. Once a day, hdrsmon collects a configuration record consisting of the following items:
Hitachi NAS Platform registry dump and SMU backup
Hitachi NAS Platform configuration output files
SAN configuration output files
hdrs discovery output
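The 2-week rolling history mentioned above could be maintained with logic along these lines. This is a sketch under stated assumptions only; hdrsmon's actual retention code is not shown in this guide, and the .tgz filter is an assumption based on the record format described later.

```python
import os
import time

def prune_config_records(directory, days=14):
    """Delete .tgz configuration records older than the rolling window.
    Illustrative sketch; hdrsmon's real implementation may differ."""
    cutoff = time.time() - days * 86400
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(".tgz") and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```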
Resiliency The following factors affect the resiliency of a replicated deployment.
Many components participate to provide business continuity in the case of failure. All components are designed to handle single failures in a variety of ways, including RAID and multi-pathing. Design factors beyond the control of Hitachi Data Systems can reduce the resiliency of the solution. This includes the use of a single power source, the network provider, or the switching gear. The disaster recovery solution for Hitachi NAS Platform does its best to manage the situation in the face of outages, temporary and long term. It provides detailed cross-product failure detection and, in many cases, automated correction. In some cases, failures can be resolved by restarting services or re-initiating pairing. In other situations, administrator overrides may become necessary, such as performing a takeover to an alternate site.
In the event of a real disaster, components may need to be replaced, or storage and drives re-provisioned. Maintaining the configuration record facilitates the repair and reconfiguration of many of these components. Another area that impacts the resiliency of the solution is the number of touch points and the proliferation of information required to resume operation after a disaster. The disaster recovery solution for Hitachi NAS Platform requires no additional components in order to manage the equipment under its sphere of influence. All disaster recovery solution for Hitachi NAS Platform components are stateless, in that they do not maintain their own configuration. As such, in certain failure situations, it is easier to re-install the disaster recovery solution than to try to reconstruct its state. This simplicity is crucial for maintaining a robust disaster recovery solution.
Table 8 lists the locations of configuration information.
Table 8. Configuration Information
Component
Location
Backup/preservation Strategy
The disaster recovery solution for Hitachi NAS Platform configuration
conf/hdrs.cfg
Identical copy exists on both sites. This does not change during operation
conf/general.cfg, use-cases.cfg
This is not copied. It is not expected to change significantly once set
Hitachi Open Remote Copy Manager configuration
/etc/horcm*.cfg
Information is maintained in the SAN rather than in Hitachi Open Remote Copy Manager files (device-groups). All files are rebuilt on demand.
Hitachi NAS Platform configuration
Hitachi NAS Platform registry
Backed up daily and copied to the peer.
SAN configuration
SVP, persisted to the system drives and to the SVP
Hitachi HiTrack sends the configuration offsite. The disaster recovery solution for Hitachi NAS Platform optionally maintains a copy and copies it to the peer.
Failure Points Each of the solution components required for access and regular maintenance of the solution needs to be examined for potential failures. The greater the role the component has in the solution, the more critical its availability. Using this frame of reference, access to data is more important than access to future storage provisioning.
Figure 13 identifies 12 points affecting the global-active device replication technology. All of the following failures are hardware related.
All hardware failures are protected with redundancy at multiple levels, so the likelihood of failure is quite low. Global-active device can sustain any single failure shown in Figure 13, and the failure can be corrected automatically through the disaster recovery solution for Hitachi NAS Platform once the circumstances have cleared. Single failures usually cause the pairing to be suspended, which limits access to a single array. A dual failure may impact global-active device further, which may require deletion of the global-active device pairs in order to access any copy of the data.
Figure 13
Figure 14 shows the failure locations that impact the disaster recovery solution for Hitachi NAS Platform itself. The overall solution is impacted by a combination of factors.
Figure 14 Management operations directed by operators require an SSH connection to the SMU. The disaster recovery solution for Hitachi NAS Platform continues to monitor and correct issues that it discovers without intervention. Issues found uncorrected are presented to Hitachi HiTrack using SNMP (not shown). Communication between the disaster recovery solution and the Hitachi NAS Platform clusters is performed using ssc. Loss of communication between them impacts the ability of the disaster recovery solution to monitor and manage the Hitachi NAS Platform systems (also #5 in Figure 13). The disaster recovery solution for Hitachi NAS Platform uses raidcom and Hitachi Open Remote Copy Manager commands to manage the arrays. This is UDP communication to Open Remote Copy Manager, which runs on the SVP and translates commands to the array. Loss of communication impacts the ability of the disaster recovery solution to monitor and manage these arrays (also #6 in Figure 13). The production disaster recovery solution uses SSH to transfer the configuration records to the backup disaster recovery solution. A configuration record is a single tar-formatted file whose file name contains a date and time stamp. For the loss cases that follow, you must locate the configuration record, extract a portion of its contents, and transport those contents to a system where the browser is being used.
Loss of the Disaster Recovery Solution for Hitachi NAS Platform
The following circumstances can cause the disaster recovery solution for Hitachi NAS Platform to become unusable.
An SMU upgrade causes several system files to be overwritten. The biggest impact is to the /etc/sysconfig/iptables file, followed by the /etc/sudoers file. When these files are overwritten by an SMU upgrade and the disaster recovery solution for Hitachi NAS Platform runs on an external server, the SMU will not tunnel ssc communications to the AdminIP. Tunneling is used only when the disaster recovery solution cannot reach the adminIP of the cluster, such as when it only has an address on the private network; it uses the SMU to tunnel access to its managed Hitachi NAS Platform cluster. The collect.pyc module detects a change in SMU version and can re-configure itself if the root password has not changed.
If the SMU operating system is reloaded and the disaster recovery solution for Hitachi NAS Platform is co-resident, all of the disaster recovery solution software is erased. This is not a serious issue, as the disaster recovery solution can be reinstalled in 10 minutes.
Repair and Replacement of the Disaster Recovery Solution for Hitachi NAS Platform To repair or replace the disaster recovery solution for Hitachi NAS Platform, do the following.
1. Verify that you have the conf/backup-hdrs.cfg file or a local conf/hdrs.cfg file of the peer. Store it in a safe location.
2. Remove the disaster recovery solution software, if it exists.
(1) Become root by typing the following: $ su (provide root password)
(2) Type the following: # bin/removesmu.sh
Follow any directions to complete the removal. Key files are saved and can be retrieved to re-install the solution.
(3) Become manager by typing the following: # exit
The /home/manager directory should now be empty.
3. Perform a regular installation. After unpacking the HDRS.tgz file, restore the conf/hdrs.cfg file (or copy backup-hdrs.cfg from the peer to conf/hdrs.cfg). It should take about 5 minutes. $ util/install.sh (provide root password)
Loss of SMU
If the SMU configuration is lost or needs to be re-installed, do the following. 1. Make sure that you have a valid configuration record. You will need several files from the configuration record to bring the SMU back.
If restoring the primary SMU, use files under 0/.
If restoring the secondary, use files under 1/.
Most of the files contain random characters at the end of the pathnames to preserve their uniqueness. The pathnames used as examples below are just sample names.
2. Transfer the configuration record to the workstation you are using as your browser. This may require the use of WinSCP or a similar tool that can use SSH to retrieve files from the disaster recovery solution for Hitachi NAS Platform host.
[manager@HNAS-B-SMU SAVE]$ tar xvf 198.18.0.8-config-08-11.tgz ./0/smu_11Aug15_010000.zip ./0/SMUbasicinfo.txt
./0/smu_11Aug15_010000.zip
./0/SMUbasicinfo.txt
Reconfiguring an SMU will require basic network information, which can be found in SMUbasicinfo.txt in the configuration record.
1: lo: mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
2: eth0: mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:50:56:87:31:29 brd ff:ff:ff:ff:ff:ff
inet 172.17.248.106/24 brd 172.17.248.255 scope global eth0
3: eth1: mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:50:56:87:67:37 brd ff:ff:ff:ff:ff:ff
inet 192.0.2.99/24 brd 192.0.2.255 scope global eth1
172.17.248.0/24 dev eth0 proto kernel scope link src 172.17.248.106
192.0.2.0/24 dev eth1 proto kernel scope link src 192.0.2.99
169.254.0.0/16 dev eth0 scope link metric 1002
169.254.0.0/16 dev eth1 scope link metric 1003
default via 172.17.248.1 dev eth0
smub.hnasdrsandbox.gs.lab
(none)
Use the information above to derive and supply the information requested below. After undergoing basic configuration and a reboot, you will need the SMU backup file to restore its configuration. The following section goes through the process from beginning to end.
Restore an SMU
To restore an SMU, do the following.
1. After logging on as root using KVM, virtual console, a directly connected terminal, or a serial cable, run the following command:
# smu-config
Figure 15 2. Provide passwords for root and manager.
Use the default nasadmin password and ignore all warnings.
Figure 16
3. Enter the following information in the dialog, and type y to confirm:
SMU public IP (eth0)
Netmask
Gateway
SMU private IPv4 (eth1)
Enable IPv6
IPv6 stateless auto-configuration
SMU static IPv6 (eth0)
SMU static IPv6 gateway
Domain
SMU hostname
The values for your installation may not match the settings shown in Figure 17.
Figure 17 The system reboots. After the reboot, you will be able to access the SMU using a browser at the public address of the SMU to proceed.
4. Restore the SMU.
(1) Open the SMU using a browser, and log on to the SMU.
(2) From the Home menu, click SMU Administration and then SMU Backup and Restore. Do not use the wizard.
(3) Click Choose File and then select the SMU backup file from the configuration record (Figure 18).
(4) Click Restore, and then confirm the restoration (Figure 19).
A screen similar to Figure 20 displays while the SMU reboots. Your browser may not refresh after the reboot of the SMU.
(5) Open the SMU with the browser again, and log on to the SMU again.
Figure 18
Figure 19
Figure 20
5. If the disaster recovery solution for Hitachi NAS Platform was installed on the SMU and this is a replacement SMU, repeat the procedure to restore the disaster recovery solution in the section Repair and Replacement of the Disaster Recovery Solution for Hitachi NAS Platform on page 60.
6. To verify that the configuration has been restored, from the Home menu, click Server Management, and then click Cluster Configuration.
Figure 21
Loss of a Hitachi NAS Platform Node
If a Hitachi NAS Platform node is damaged or needs to be replaced, the SMU will report the node in an error state. Every EVS running on the node will automatically migrate to another working node so that the damaged node can be reconfigured non-disruptively.
Figure 22
Repair a Hitachi NAS Platform Node (Part of a Cluster) To repair a node whose loss does not affect quorum in the cluster, do the following.
1. Run nas-preconfig as shown in Rebuild a Lost Node to restore addressability. Use the following IP addresses:
The same IP address for the cluster node being replaced. Make sure this is different from the surviving nodes.
A temporary Admin EVS IP address, different from the Admin EVS address that the cluster uses. Note that this IP will disappear once the node joins the cluster.
Note that these IP addresses must be reachable by the SMU.
2. After the node reboots, it automatically rejoins the cluster. If you do not know the addresses used by the node, retrieve them from the configuration record as follows.
[manager@HNAS-A-SMU ~]$ tar xvf conf/config-08-11.tgz ./0/registry.tgz ./0/evsipaddr-906ob6.txt \
./0/cluster-verbose-UftOAY.txt
./0/evsipaddr-906ob6.txt
./0/cluster-verbose-UftOAY.txt
./0/registry.tgz
Use the evsipaddr*.txt and cluster-verbose*.txt files from the Configuration Record to understand the addressing scheme of the cluster.
[manager@HNAS-A-SMU ~]$ cat 0/evsipaddr-906ob6.txt
EVS Type   Label      IP Address    Mask            Port
--------   --------   -----------   -------------   ----
admin      HNAS-A-1   192.0.2.2     255.255.255.0   eth1
admin      HNAS-A-1   198.18.0.11   255.255.255.0   ag1
evs 1      EVS1       198.18.0.30   255.255.255.0   ag1
evs 2      EVS2       198.18.0.31   255.255.255.0   ag1
[manager@HNAS-A-SMU ~]$ cat 0/cluster-verbose-UftOAY.txt
Node   EVS ID   Type      Label             Enabled   Status   IP Address     Port
----   ------   -------   ---------------   -------   ------   ------------   ----
1               Cluster   HNAS-A-Clustr-1   Yes       Online   192.0.2.200    eth1
1      1        Service   EVS1              Yes       Online   198.18.0.30    ag1
1      2        Service   EVS2              Yes       Online   198.18.0.31    ag1
2               Cluster   HNAS-A-Clustr-2   Yes       Online   192.0.2.201    eth1
3               Cluster   HNAS-A-Clustr-3   Yes       Online   192.0.2.205    eth1
3      0        Admin     HNAS-A-1          Yes       Online   192.0.2.2      eth1
                                                               198.18.0.11    ag1
Rebuild a Lost Node This documents the process when only a single node needs to be replaced, but otherwise the cluster remains up with surviving nodes. Typically, step 1 is executed by the Customer Service and Support personnel from Hitachi Data Systems who install the replacement node. They require the local addressing information in the following format:
Admin Public (eth0): IP =; Netmask =
Admin Private (eth1): IP = 192.0.2.99 ; Netmask = 255.255.255.0
Physical Node (eth1): IP = 192.0.2.201 ; Netmask = 255.255.255.0
Gateway: 192.0.2.1
Domain: srlab.local
Unit Hostname: HNAS-A-Clustr-2 (will disappear after joining the cluster)
Note — It is very important not to use the Admin IP address from the existing cluster. Instead use an unused temporary address. Also, leave the Optional Admin Service Public IP address empty. 1. Run nas-preconfig as root on the replacement node.
root@HNAS-A-Clustr-2(bash):/opt# nas-preconfig
This script configures the server's basic network settings (when such settings have not been set before). Please provide the server's:
- IP address
- netmask
- gateway
- domain name
- host name
After this phase of setup has completed, further configuration may be carried out via web browser.
Please enter the Admin Service Private (eth1) IP address: 192.0.2.99
Please enter the Admin Service Private (eth1) Netmask: 255.255.255.0
Please enter the Optional Admin Service Public (eth0) IP address:
Please enter the Admin Service Public (eth0) Netmask:
Please enter the Optional Physical Node (eth1) IP address: 192.0.2.201
Please enter the Physical Node (eth1) Netmask: 255.255.255.0
Please enter the Gateway: 192.0.2.1
Please enter the Domain name (without the host name): srlab.local
Please enter the Hostname (without the domain name): HNAS-A-Clustr-2
Admin Public (eth0): IP =; Netmask = Admin Private (eth1): IP = 192.0.2.99 ; Netmask = 255.255.255.0 Physical Node (eth1): IP = 192.0.2.201 ; Netmask = 255.255.255.0 Gateway: 192.0.2.1 Domain: srlab.local Unit Hostname: HNAS-A-Clustr-2 Are the above settings correct? [y/n]y Configuration written to /etc/opt/mfb.ini.
2. Reboot the node.

root@HNAS-A-Clustr-2(bash):/opt# reboot
The system is going down for reboot NOW! (Tue Aug 11 14:17:49 2015)
3. Under Server Settings in Cluster Configuration, click Details on the node that is being recycled.
4. Click Remove (Figure 23) and confirm (Figure 24). The result is shown in Figure 25.
Figure 23
Figure 24
Figure 25
5. To add the node to the existing cluster, click Add Node.
6. If properly configured in the network, the replacement node should appear (Figure 26). Click the node, type the supervisor password (default 'supervisor'), and click Next. Figure 27 shows the results.
Figure 26
Figure 27

7. To complete the addition, click Finish. After the node reboots, it rejoins the cluster. This may take several minutes. For a few minutes, the cluster might appear healthy but the new node will not appear in the list. Wait several minutes until it appears.

Note — It is important to recognize that when actual node replacement takes place, new identifiers are assigned to the network and fabric devices. The network identifier, or MAC address, may be registered with a switch or firewall. The fabric identifier, or node WWN, may be used in both the zoning configuration and in host-group security. These need to be updated when introducing new equipment to the cluster.
Loss of all Hitachi NAS Platform Nodes
When the whole cluster is lost, the goal is to leverage our registry dump to avoid the need to professionally reconfigure the cluster. The Customer Service and Support team from Hitachi Data Systems is responsible for initial node network addressing. It is your responsibility to ensure that they are provided with these addresses. The information needed is in the configuration record, specifically the following:
cluster-show
cluster-verbose
evsipaddr
In order to use this information in configuration recovery, extract the appropriate files as well as the registry.tgz file and copy them to the system where the user interface will be executing the recovery. Table 9 is an example configuration record provided to the Customer Service and Support staff.

Table 9. Configuration Record Table

Node name       | Node IP     | Node Admin IP
HNAS-A-Clustr-3 | 192.0.2.203 | 192.0.2.23
HNAS-A-Clustr-2 | 192.0.2.202 | 192.0.2.22
HNAS-A-Clustr-1 | 192.0.2.201 | 192.0.2.2

Common Information

Admin Public (eth0): IP = ; Netmask =
Admin Private (eth1): IP = ; Netmask = 255.255.255.0
Physical Node (eth1): IP = ; Netmask = 255.255.255.0
Gateway: 192.0.2.1
Domain: srlab.local
Unit Hostname:

Note — It is important to re-use the Admin IP information from the original cluster for Node #1.
If Node #1 is being replaced, then the following is true:
The cluster will be assigned a new MAC-address
Replacement licenses need to be obtained
See the comments in Rebuild a Lost Node regarding licensing. Always perform the restore first on the node that will be licensed.
1. Configure the first node (the one that has the macid that the license has been assigned to) using nas-preconfig.
2. Go to SMU Home > SMU Administration > Managed Servers. Remove the existing "managed server". Even though the IP address is the same as the new one you have just configured, it will show as "unknown". In the same screen, go back and click the Add button. Enter the Admin EVS Address, username and password.
3. Proceed to add licenses. The license should be generated based on this node's MacID.
4. Apply all licenses. There are two in the previous output.
This error message is OK. No restart will be required. You can go to the main License Keys window to confirm that all licenses are there.
If you have preserved the Admin IP on Node 1, this node will appear as a managed server for the SMU.
5. Use the Cluster Wizard to promote to a cluster. Go to Server Settings > Cluster Wizard. The configuration record contains the name of the existing cluster. Do not change it.

[manager@HNAS-A-SMU ~]$ cat /tmp/config-08-12_03-11/0/cluster-show-qLMKme.txt
Overall Status = Online
Cluster Health = Robust
Cluster Mode = Clustered
Cluster Name = HNAS-A-Clustr
Cluster UUID = 9bb2ac4e-7e11-11d0-9000-7700107c4c83
Cluster Size = 2
Node Name = HNAS-A-Clustr-1
Node ID = 1
Cluster GenId = 6
Cluster Master = No
6. Use the Cluster Node IP Address assigned to Node 1.
7. Select the SMU as the quorum device. This should appear in the list.
8. When finished making the settings, click OK.
9. Confirm the changes. Figure 28 displays while the system is rebooting. It may take from 5 to 10 minutes to reboot the server.
Figure 28
Figure 29
10. Once the node has finished rebooting, you will get the following prompt, inviting you to add new nodes. We do not want to do this, because we want to restore instead. Click No to restore the server configuration.
Figure 30
Under Server Settings, click Configuration Backup & Restore. (Do not confuse this with SMU Backup & Restore). Click Browse to select the registry file that was extracted from the configuration record (Figure 31). Figure 32 displays while the file serving restarts. Figure 33 displays when the restart finishes.
Figure 31
Figure 32
Figure 33

11. Now that the restore is finished, proceed to add the rest of the nodes. Power on, install, and configure the second node, running 'nas-preconfig' from the console.
Figure 34
12. Reboot the node once you're done, and wait a few minutes (3-5 minutes), enough time for the HNAS operating system to be fully running.
13. Under Server Settings, click Cluster Configuration and then click Add Node.
14. Each node is added one at a time. The order is not important.
15. Nodes typically appear pre-entered with their names and IP addresses as the cluster discovers them.
Figure 35

16. Click the node, type the Username (supervisor) and Password, and then click Next (Figure 36).
Figure 36

17. To complete the process, click Finish (Figure 37). The new node reboots and should join the cluster within 10 minutes.
Figure 37
Figure 38

Go back to Home > Server Settings > Cluster Configuration and refresh until the new node appears in the cluster.
Figure 39

Note — Do not add the other nodes until this node has joined the cluster and appears in the Cluster Configuration window.
18. After restoring all the nodes, enable every EVS on the EVS Management window under Server Settings.
19. Set the EVSs back to their preferred nodes, if these were configured; otherwise, migrate them as appropriate. Go to Home > Server Settings > EVS Migration. Click the "Migrate all to preferred" link.
20. If the HNAS is not at the desired firmware level, upgrade it; otherwise, all of the features might not work as they did in the original cluster that was replaced. Go to Server Settings > Firmware Package Management, then click the "upload package" link.
21. Select the Managed Server (default) and click OK.
22. In the Upgrade File section, click Browse and select the firmware file. Leave the rest of the options ('Set as default package' and 'Restart file serving, and reboot the server if necessary') checked, and click Apply.
23. The process of uploading and installing the file will take a while (15 minutes or more), so wait for it to complete. Both nodes will reboot automatically, one at a time. If the nodes have been physically replaced, additional activities are required because the WWNs of the host ports have changed. This may impact zoning and the WWNs registered in secure host groups on the array. This process is currently not automated.
Loss of Array Configuration
On rare occasions, the storage array itself may become completely inoperative or damaged. The following procedure works for an existing array or for a new array. However, the procedure behaves differently, based on the situation.
The array install team should recover the array configuration record from the HiTrack server maintained by HDS. This will allow installing the array with its storage in the same configuration as the older array.
For a new array, the procedure does not format the LDEVs; therefore, formatting will need to be started.
For an existing array, the procedure does the following:
Leaves the LDEVs in a blocked state
Note — In all cases, a restore does not recreate all resources, such as device_grps or copy_grps. At the end of the process, you will use the disaster recovery tool to complete the re-configuration.
Post Array Provisioning Activities

After configuration restoration, LDEVs remain blocked. Figure 40 shows an example of blocked LDEVs.
Figure 40

In addition to LDEVs, other storage array-based resources are also not recoverable using the mechanism in "Loss of Array Configuration". For a new storage array, the following is true:
Recovery of pools and pool-based volumes is not supported.
Copy-grps and device-grps are also not supported, so pairing of devices is not possible.
After rebuilding a lost storage array, do the following to complete provisioning of the storage array.
1. Unblock resources.
For a new storage array, complete formatting the LDEVs before proceeding.
For an existing storage array, use the following command:

python -m bin/hdrs -reset
2. Create a subset of the XML files and, using the -provision option, re-create the dynamic provisioning pools and paired-pool.
(1) Extract the XML configuration files from the configuration record at 0/*.xml and 1/*.xml.
(2) Edit those files to include the following sections:
Dp_pools
Pools
Remote_connections
Journals (for Hitachi Universal Replicator)
The peer storage array (the one not being restored) should have an empty XML file, such as the following:
The file you create should look similar to the following:
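The example listings are not reproduced here; as a hedged sketch only (the gad-config root element and its attributes follow the description in "Important Definitions", while all attribute values, IP addresses, and the lowercase section names matching the element descriptions later in this guide are illustrative assumptions):

```xml
<!-- Hypothetical sketch of the revised file for the array being restored
     (240195); section contents come from the extracted discovery XML -->
<gad-config cluster="HNAS-A" instance="0" ip="192.0.2.50" serial="240195">
  <dp_pools>
    <!-- dp_pool definitions copied from the configuration record -->
  </dp_pools>
  <pools>
    <!-- pool definitions -->
  </pools>
  <connections id="0" serial="240136">
    <!-- remote connection paths -->
  </connections>
  <journals>
    <!-- journal definitions (Hitachi Universal Replicator only) -->
  </journals>
</gad-config>

<!-- Hypothetical 240136-empty.xml for the peer array: a bare root element -->
<gad-config cluster="HNAS-B" instance="1" ip="192.0.2.60" serial="240136"/>
```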
(3) Run this command:

python -m bin/hdrs -provision 240195-revised.xml 240136-empty.xml

The command above will complete the reprovisioning of the following for Hitachi Universal Replicator:
Remote connection
Journals
Dynamic provisioning pools
Paired-pools
Array Provisioning

This section covers the assistance provided in performing storage array provisioning to support a replicated environment. Aside from the Fabric networking and array licensing, additional tasks need to be performed on both systems to support replication and replicated pools. The data recovery solution for Hitachi NAS Platform can perform any of these tasks when provided with appropriate XML description files. Since the disaster recovery solution does not rely on Hitachi Command Suite, it comes equipped with its own lifecycle functions for replicated storage management. Storage array provisioning takes place at any of the following times:
Initial installation - XML files provided during installation can perform any of the provisioning functions.
Storage changes - Addition of storage or storage applications may require any of the following:
New LDEVs, dynamic provisioning pools, parity groups
New resource groups of VSMs
New host groups
Retirement of any of the components
The data recovery solution for Hitachi NAS Platform does not do the following:
Create parity groups - These operations are usually performed by the Customer Service and Support staff from Hitachi Data Systems.
Carve parity groups into basic LDEVs and perform formatting - These operations are performed by the storage administration team.
In order to use global-active device for provisioning, you need to become familiar with the syntax of the XML specifications used to drive this process.
XML Introduction

Many XML tutorials are available, such as the following:
http://www.w3schools.com/xml/xml_whatis.asp

Note — It will normally not be necessary to construct most XML documents from scratch. Cut-and-paste from existing documents will cover most cases.
Important Definitions

This example (with line numbers added for readability) contains these examples of XML components:
Root element — gad-config This root element has four attributes.
Attribute — cluster, instance, ip, and serial Each attribute contains a value within double quotation marks that further defines its root element.
Element — extpaths This element has element content (children) to further define it.
Element content — dev Each element content is a child of extpaths. Each is differentiated by the attribute extgroup.
Tag — gad-config, extpaths, extpath, dev, exthostgroup, and path
Each of these is a tag of its respective parent element.
The indentation makes the XML hierarchy and nesting apparent.
There are two styles of element syntax used, self-closing and explicit. The use of each is correct.
Self-closing — The element opens and closes within a single set of angle brackets (line 18 of the example).
Explicit — The element opens with a set of angle brackets on one line, has intervening content, and then closes with another set of angle brackets (line 13 of the example).
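As a hedged illustration of the two styles (the element and attribute names follow the extpaths examples later in this guide; the values are invented and do not reproduce the original listing):

```xml
<!-- Self-closing style: the element opens and closes in a single tag -->
<path port="CL3-B" wwn="50060e80132ada34"/>

<!-- Explicit style: an opening tag, intervening content, then a closing tag -->
<dev extgroup="1" serial="240195">
  <exthostgroup group="quorum-hg" port="0A"/>
</dev>
```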
The XML documents will not be shown in full. Instead, the elements are discussed in sections in the following order.
ports - These are elements or attributes associated with the Fibre connection resources of a storage array, including host-groups.
resources - These are elements or attributes associated with logical groupings maintained within the machine, used for administrative segregation or with virtual storage machines.
pools - These are elements or attributes corresponding to one or more LDEVs. They are treated as a group and replicated between storage arrays.
extpaths - These correspond to external paths and the logical wiring for external quorum devices.
copygroups - These correspond to array copy groups.
journals - These correspond to journal volumes used for Hitachi Universal Replicator (async buffering).
dp_pools - These correspond to dynamic provisioning pools, DP-VOLs, and dynamic provisioning pool vols.
snap_pools - These correspond to a special type of pool used to provide storage for Hitachi Thin Image.
parity_grps - These are the basic physical storage representation and the basic LDEVs they contain.
connections - These represent the remote connections between machines used for all remote replication technologies.
Each of these sections represents a logical configuration area within the storage array. There is usually a single XML document for each storage array, and not all element types need be present for each array. For example, journals may not be present unless Hitachi Universal Replicator is used. However, pools are an exception: they represent a replicated resource, and their attributes are non-overlapping, allowing a single XML file to represent resources on both the source array and the target array. An XML schema file, conf/HDRS1.xsd, is provided which validates all XML provided for use with the hdrs -provision operation. It rejects XML that is missing attributes or uses information that isn't acceptable.
Discovery Function

The data recovery solution for Hitachi NAS Platform supports a discovery function that creates a set of XML documents describing the state of the connected storage array at the time it is run. This tool is run during installation to create files named conf/serial/initial.xml capturing the initial state of the storage arrays. It is also run on a daily basis and incorporated within the configuration record for each array. You can learn about the XML syntax by studying the output from an array discovery. These files also make it easy to obtain the initial copy material for building your own XML definitions. The discovery function uses hdrs to obtain the list of storage arrays to interrogate and the credentials to use for doing so. While performing the discovery, hdrs outputs any irregularities that it discovers, as the following example shows (irregularities are highlighted in red in the original document).

[manager@smua ~]$ python -m bin/hdrs -discover
Warning: path 100-2 on port CL2-F is (WAR) not NML on 0
Warning: path 100-2 on port CL1-F is (WAR) not NML on 0
Warning: path 100-1 on port CL2-F is (WAR) not NML on 0
Warning: path 100-1 on port CL1-F is (WAR) not NML on 0
Warning: path 101-4 on port CL2-F is (DSC) not NML on 0
Warning: path 101-4 on port CL1-F is (DSC) not NML on 0
Warning: path 101-5 on port CL2-F is (BLK) not NML on 0
Warning: path 101-5 on port CL1-F is (BLK) not NML on 0
No internal ldev assigned to 100-2
remote rcu status is abnormal ERR
remote rcu status is abnormal WAR
remote rcu status is abnormal ERR
Saved output in conf/310073/08-05_18-11.xml
Saved output in conf/310038/08-05_18-13.xml
Saved output in LOG/gadtool-08-05_18-11.xml
Discovery creates three files:
One file for the primary storage array (written twice)
One file for the disaster site
One log file of modification activity or irregular information.

Note — You can also pass an optional parameter, which will be used as the directory and/or filename prefix for the XML files produced. The serial number of the array is always appended to the prefix.
Each configuration section is described by reviewing its highlights.
ports Element

(The ports example listing is not reproduced here.)
The ports element contains one type of sub-element:
port
A port contains these attributes:
name
type
TAR - target port
RCU - receiver / target
MCU - initiator
ELUN - external initiator
A port may contain this attribute:
conn
Port CL7-H is in RCU mode and the connection attribute is point-to-point. Where not specified, FCAL is assumed. Information comes from the raidcom get host_grp command. A port may contain one or more hostgrp elements.
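As a hedged sketch (the port names come from the surrounding discussion, but the conn value, host-mode values, and group name are illustrative assumptions, not values from the original listing):

```xml
<ports>
  <!-- RCU (receiver/target) port; conn overrides the default FCAL -->
  <port name="CL7-H" type="RCU" conn="PtoP"/>
  <!-- Target port carrying a host-group for the HNAS nodes -->
  <port name="CL1-E" type="TAR">
    <hostgrp name="P1_P2" host-mode="STANDARD" host-mode-options="" group="1"/>
  </port>
</ports>
```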
Port CL1-E is in target mode, and there are a number of host-groups defined. For each host-group, a name, host-mode, host-mode-options, and a group name are defined. Information comes from the raidcom get host_grp and raidcom get hba_wwn commands.

resources Element

(The resources and pools example listings are not reproduced here.)

pools Element
The pools element contains one type of sub-element:

pool
A pool must contain the following attributes:
name
site
A pool may optionally contain the following attributes:
capacity
sourcepool
targetpool
sourceresource
targetresource
quorumdev
ldevname
Information about pool is obtained through the raidcom get copy_grp and raidcom get device_grp commands. Unlike the other array resources, a pool is not a native array resource and should not be confused with a dynamic pool. Also, some global-active device clients, such as VMware Virtual Volumes, may only have a single LDEV in a pool. Table 10 provides a summary of available commands and options for pool.

Table 10. Available Commands and Options for the pool Element

Attribute      | Values                                        | Usage
capacity       | Numeric value (GB)                            | Required when submitting for provisioning, otherwise optional
sourcepool     | dynamic provisioning pool ID (number) or name | Required when submitting for provisioning on the primary array, otherwise optional
targetpool     | dynamic provisioning pool ID (number) or name | Required when submitting for provisioning on the target array, otherwise optional
sourceresource | Resource name                                 | Required when provisioning resources on the primary into a resource VSM (e.g. GAD-alternate for stretched cluster)
targetresource | Resource name                                 | Required when provisioning resources on the target into a resource VSM (e.g. GAD-primary for dual-cluster)
Table 10. Available Commands and Options for the pool Element (Continued)

Attribute | Values                                        | Usage
ldevname  | Device-member-name, for example 'pool-name-#-dev-' | Optional; sets naming convention for LDEVs added to device group. Must contain 'tier0' or 'tier1' when creating groups for a tiered span
site      | 'GAD-primary', 'GAD-secondary', 'GAD-both-primary', 'GAD-both-secondary', or 'HUR' | Required attribute; indicates which site becomes the Primary during pairing. If omitted or another value, no pairing is initiated
The pool element must contain one or two types of sub-element:
lun
hostgrp
A hostgrp must contain either one of the following attributes containing the host-group information for all of the LDEVs in the pool:
source
target
The format for the attribute values is port host-grp-name[, port host-grp-name]…, as shown in the example. The lun element must contain one or two types of sub-element:
source
target
count[optional]
ldevname[optional]
The attribute values must contain hexadecimal ldev-ids, such as 00:b3, of existing allocated LDEVs. When used with provisioning, an additional format is supported that simplifies allocation. In this format, a count attribute is added to indicate the number of LUNs to allocate. Wildcards can be used in the ldev-id specifiers to indicate starting values to use for ldev-id allocation.
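A hedged sketch of this allocation format (the pool name, site value, capacity, host-group layout, and ldev-id wildcards are illustrative assumptions, not the original example's values; whether source and target appear as attributes or nested content may differ in the real schema):

```xml
<pools>
  <pool name="Span1" site="GAD-both-primary" capacity="40"
        sourcepool="0" targetpool="0" quorumdev="0">
    <hostgrp source="CL1-A P1_P2, CL2-A P1_P2"
             target="CL1-A P1_P2, CL2-A P1_P2"/>
    <!-- count requests three LUNs; wildcards give the starting ldev-id -->
    <lun source="10:--" target="10:--" count="3"/>
  </pool>
</pools>
```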
This example also demonstrates a specification that will create a global-active device pool on both sites using the pools and resources indicated, with 40 GB LDEVs. If desired, the devicegrpname attribute may also be added to each lun element to provide a naming system for the device group members.

extpaths Element
The extpaths element contains one type of sub-element:
extpath
An extpath must contain at least one of each of these elements:
dev
path
Two paths are recommended to reduce single points of failure.
A dev element may contain a sub-element exthostgroup, which describes the paths and host-group name of the quorum array. The extpath element must contain the following attributes:
pathgroup
product
serial
Information is derived from the raidcom get path and raidcom get external_grp commands. Information about the ext attributes is obtained by querying the quorum array using raidcom or the Hitachi Storage Navigator Modular command line gateway.
A dev element must contain the following attributes:
extgroup
exthlun
extlun
extpool
intlun
qid
serial
The path element's attributes include the following, among others:
port
wwn
Table 11 summarizes the values of the attributes.

Table 11. Available Commands and Options for the extpaths Element

Attribute        | Values                                                           | Notes
pathgroup        | A locally assigned group for all devices mapped from the specified resource |
product          | The device name of the quorum array                              | Examples: HUS, VSP
serial (extpath) | The serial number of the quorum array                            | An entry must be present in hur.cfg providing credentials to interrogate/provision on the quorum server
extgroup         | The locally assigned path-group-id assigned for this LDEV        |
extlun           | Hex ldev-id on the quorum array                                  |
extpool          | Pool number of the quorum array                                  | Optional; may not be present for quorum devices created from basic LDEVs
intlun           | Hex ldev-id on the local array                                   |
qid              | Quorum device ID (0-n)                                           | These values are unique across the pair of arrays
serial (dev)     | Serial number of the local array                                 |
port (path)      | Port on the array virtualizing the quorum array                  |
wwn              | Worldwide name of the quorum array's port (lowercase, no colons) |
Table 11. Available Commands and Options for the extpaths Element (Continued)

Attribute           | Values                             | Notes
group               | The quorum array's host-group name |
port (exthostgroup) | The quorum array's port            |
The extlun and intlun attributes can be specified using ldev-id wildcards. The extlun attribute can be set to '-' as these will be determined automatically. Global-active device is prepared to allocate external LDEVs (13G) from the extpool provided, map them to the quorum array's ports, import them into the local array, and set them up for use as quorum devices. Use the same devices on the remote array. Provisioning of quorum devices must be specified in both XML documents.

copygroups Element
The copygroups element contains one type of sub-element:
copygroup
A copygroup always contains these attributes:
devicegrp
name
Depending on the type and state, copygroup may also contain any of these attributes:
pairing
mirror
journal
devicesvolgrp

Note — Copygroups are never used for provisioning. Copygroups are only reported during discovery.
Table 12 provides a summary of available attributes and values for the copygroup element.

Table 12. Available Commands and Options for the copygroup Element

Attribute     | Values                                                       | Notes
devicegrp     | The name of the device-grp containing the list of LDEV members for the pool or pair | This name must be prefixed with the copygroup name and should not end in -CW or -SI unless the group is used to maintain a snapshot
pairing       | The pairing state as reported by pairdisplay, if it is known |
mirror        | h0 - HUR, h1 - GAD, h2 - CW, h3 - HTI                        |
journal       | The journal ID assigned to this pair or pool                 |
devicesvolgrp | The name of the related device group containing the S-VOLs of the original pairs | This is only present for CW or SI, as it is used for local replication
journals Element
The journals element contains one type of sub-element:
journal
A journal must contain the following attributes:
If it is comprised of a virtual LDEV from a dynamic provisioning pool:
capacity
id
ldevid
pool
If it is comprised of a basic LDEV from a parity group:
raid-grps
raid-lvl
raid-typ
When provisioning, express the capacity in gigabytes, such as 32G. The id can be '-' to indicate that it will be assigned. The ldevid can use the same ldevid wildcarding schema (for example, 12:--). The pool can be specified using pool numbers or names. When provisioning journals from parity groups, the raid-grps can consist of any of the following:
Match the RAID level and type when specifying '*'
Match a pattern, such as 01-??
Match any from a list, such as 01-01,01-04,01-23
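A hedged sketch of a journals element showing both forms (the IDs, capacities, and RAID values are illustrative assumptions):

```xml
<journals>
  <!-- Carved from a parity group; raid-grps may be '*', a pattern such
       as 01-??, or an explicit list -->
  <journal id="0" ldevid="0a:--" capacity="32G"
           raid-grps="01-01" raid-lvl="RAID6"/>
  <!-- Carved from a dynamic provisioning pool; '-' lets the id be assigned -->
  <journal id="-" ldevid="12:--" capacity="32G" pool="0"/>
</journals>
```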
dp_pools Element
The dp_pools element contains one type of sub-element:
dp_pool
A dp_pool sub-element must contain these attributes:
id
name
status [optional: for reporting only]
usage [optional: for reporting only]
The status and usage attributes are reported during discovery. A dp_pool always contains at least one of the following sub-element:
composition
The composition sub-element corresponds to the dynamic provisioning pool volumes: one or more basic LDEVs used to provide storage to the dp_pool. The composition attributes are identical to those described in the journals Element for the following:
capacity
ldevid
name
raid-grps
raid-lvl
raid-typ
A dp_pool may contain the following sub-elements:
pool
ldev
The pool element always contains two attributes when reported during discovery:
ldevcnt (the number of DP VOLs making up the named pool)
name
The ldev element identifies a DP VOL from the dp_pool that is not allocated to a paired or replicated pool. Its attributes consist of the following:
capacity
ldevid
ldevname (or N/A)
When performing provisioning, only use the composition sub-element. It can be used multiple times to allocate multiple LDEVs. Below is an example that will create two dynamic provisioning pools to support a tiered span.
Place this fragment into two XML files: one for the primary site and one for the disaster recovery site:
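A hedged sketch of such a fragment (the pool names, ldev-id wildcards, capacities, and RAID group choices are illustrative assumptions, not the original example's values):

```xml
<dp_pools>
  <!-- tier0 pool of the tiered span -->
  <dp_pool id="0" name="Span-tier0">
    <composition capacity="1610G" ldevid="20:--"
                 raid-grps="01-02" raid-lvl="RAID6"/>
  </dp_pool>
  <!-- tier1 pool of the tiered span -->
  <dp_pool id="1" name="Span-tier1">
    <composition capacity="1610G" ldevid="21:--"
                 raid-grps="01-03" raid-lvl="RAID6"/>
  </dp_pool>
</dp_pools>
```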
Invoke the data recovery solution for Hitachi NAS Platform with both files, as shown:

python -m bin/hdrs -provision site1.xml site2.xml

snap_pools Element
The snap_pools element contains one type of sub-element:
snap_pool
A snap_pool must contain these attributes:
id
name
The status and usage attributes are reported during discovery. A snap_pool always contains at least one composition sub-element. The composition sub-element corresponds to the dynamic provisioning pool volume. See dp_pools Element. The composition sub-elements are the following, which are identical to where they are used in dp_pools Element:
capacity
ldevid
name
raid-grps
raid-lvl
raid-typ
The snap_pools element is only used by the data recovery solution for Hitachi NAS Platform on the secondary site for thin-image (snapshot) disaster recovery validation.

parity_grps Element
The parity_grps element contains one type of sub-element:
parity_grp
A parity_grp must contain the following attributes:
id
raid-lvl
raid-typ
A parity_grp optionally contains one or more of the following elements:
ldev
Information is gathered from raidcom get parity_grp and raidcom get ldev -ldev_list parity_grp. The attributes of ldev are detailed in Table 13.

Table 13. Available Commands and Options for the ldev Element

Attribute | Values                                                          | Explanation
attrib    | A set of attributes for the LDEV including: CMD, CVS, ELUN, VVOL, JNL, POOL |
capacity  | The allocated size of the LDEV in gigabytes                     |
dp_pool   | The name of the dynamic provisioning pool this LDEV is part of  | Attrib includes POOL
Table 13. Available Commands and Options for the ldev Element (Continued)

Attribute | Values                                            | Explanation
ldevid    | The hexadecimal LDEV ID                           |
name      | The name of the LDEV, if defined                  |
status    | The status of the LDEV, including: NML, BLK, BSY  |
The parity_grps element cannot be provisioned using the data recovery solution for Hitachi NAS Platform. It only reports on parity_grp contents to assist in locating unused storage.

connections Element
The connections element contains one type of sub-element:
connection
The connections element contains these attributes:
id
serial
The id attribute is the path ID of a connection. The serial attribute is the identifier of the peer array. A connection sub-element must contain the following attributes:
mcu
rcu
The mcu attribute identifies the local port that forms the connection. The rcu attribute identifies the remote port that forms the connection. This information is obtained using raidcom get rcu.
To perform provisioning, a definition must exist in the conf/hur.cfg file for both storage arrays. Also, the ports must be connected, which includes being zoned correctly, among other things.
Array Provisioning

Using the correct XML, you can replicate storage, provision storage, delete storage, or expand storage in an environment using Hitachi Universal Replicator or global-active device. Storage management and the discovery option have been fully covered. To provision array components, use the following command:

python -m bin/hdrs -provision XMLfile1 [XMLfile2]

XMLfile1 contains definitions for the primary array; XMLfile2 contains definitions for the secondary array. The attributes on the root element of each file indicate the array being targeted.
Specifying a single file supports only storage provisioning. To create a global-active device environment, you need the following sections:
Ports designated for MCU/RCU to support the remote connection
Ports designated as target ports for connection with Hitachi NAS Platform
For the secondary site, a resource group matching the serial of the primary site
Host-groups appropriately designed to support the configuration
Host-groups assigned to this resource group
Connection definitions (at least one path in each direction)
To create a Hitachi Universal Replicator environment, you need the following sections:
Ports designated for MCU/RCU to support the remote connection
Ports designated as target ports for connection with Hitachi NAS Platform
Host-groups appropriately designed to support the configuration
For each site, create a set of journals (1 per pair)
Connection definitions (at least one path in each direction)
Hitachi Universal Replicator Example

Figure 41 presents a dual-cluster environment consisting of two Hitachi Universal Replicator connected arrays. Each cluster consists of three nodes that are connected to eight target ports on their respective storage arrays.
[Figure 41 diagram: each node's ports 1,2 and 3,4 in HNAS-A-Clustr and HNAS-B-Clustr connect through Brocade 6505-A and Brocade 6505-B switches to ports 1A, 2A, 5B, 6B and 1B, 2B, 5A, 6A on HUS VM arrays 240136 and 240195, with Hitachi Universal Replicator pairs (MCU->RCU) between the arrays.]
Figure 41

Creating the connections Element is straightforward. On the primary storage array (240195), use the following:
On the secondary storage array (240136), use the following:
There are a few choices when designing a Fibre Channel connectivity strategy. This example does the following:
Map Hitachi NAS Platform port 1 and port 2 on all nodes to ports 1A, 2A, 5B, 6B on both storage arrays.
Map Hitachi NAS Platform port 3 and port 4 on all nodes to array ports 1B, 2B, 5A, 6A on both storage arrays.
Name the host-groups P1_P2 and P3_P4.
The resulting ports Element is for one storage array (the other is based on this):
To create journals on both machines, and dynamic provisioning pools for the paired pools, use the result of the initial discovery output files to identify available parity groups. The following output shows that parity_grp 1-1 has been carved into 200 GB LDEVs, which can be used for journals. The remainder of storage is carved into 1610 GB LDEVs, which can be used to create dynamic provisioning pools.
The journals Element looks like this for one storage array starting with four journals (it is the same for both storage arrays).
id="0" ldevid="0a:00" raid-grps="01-01" raid-lvl="RAID6"
id="1" ldevid="0a:01" raid-grps="01-01" raid-lvl="RAID6"
id="2" ldevid="0a:02" raid-grps="01-01" raid-lvl="RAID6"
id="3" ldevid="0a:03" raid-grps="01-01" raid-lvl="RAID6"
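A journals Element with the attributes shown above could be generated by a short script like this. The journal tag and attribute spellings follow the fragment above, but the enclosing schema is an assumption.

```python
import xml.etree.ElementTree as ET

# Sketch: generate four journal entries on parity group 01-01, matching
# the id/ldevid/raid-grps/raid-lvl attributes shown in the fragment above.
journals = ET.Element("journals")
for i in range(4):
    ET.SubElement(journals, "journal",
                  {"id": str(i), "ldevid": f"0a:{i:02d}",
                   "raid-grps": "01-01", "raid-lvl": "RAID6"})
print(ET.tostring(journals, encoding="unicode"))
```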
The final section needed is the dp_pools Element for paired storage and, on the secondary site, for Hitachi Thin Image and Hitachi ShadowImage. Using the available parity groups, the following is the construction for each storage array:
For the primary storage array:
For the secondary storage array:
raid-grps="01-02"
raid-grps="01-02"
raid-grps="01-03"
raid-grps="01-03"
Global-active Device Dual-Cluster Example
Figure 42 shows a dual-cluster environment consisting of the following:
Two global-active device-connected storage arrays (Hitachi Virtual Storage Platform G1000)
Each cluster consists of a single node whose port 1 and port 3 are connected to two target ports on their respective storage arrays.
One quorum device (Hitachi Virtual Storage Platform)

[Figure 42 diagram: HNAS clusters HNAS1 and HNAS2, each a single node with ports 1 and 3 connected to target ports 3E and 4E on VSP G1000 10073 (47.75) and VSP G1000 10038 (47.76); GAD pairs (MCU-RCU) over ports 5H, 7H and 6H, 8H; quorum on VSP 53086 (45.30) mapped through port 1H and ports 1E, 2E.]
Figure 42

Focusing on the key differences: there are paths from the primary Hitachi NAS Platform cluster to the peer array and vice versa. These are the cross-site paths, which global-active device can leverage since storage is accessible globally. The quorum array has storage mapped to both storage arrays using port 1H, known as the quorum device or devices. The connections Element for global-active device is identical to that for Hitachi Universal Replicator. No journals Element is needed, as global-active device is synchronous. The host-groups and resource group elements use this naming scheme for Hitachi NAS Platform ports:
HNAS-GAD-CL#-P#
HNAS-GAD-cluster#-port#
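The naming scheme above (HNAS-GAD-CL#-P#, where # is the cluster and port number) can be sketched as a small helper. The cluster and port values below are illustrative, taken from the single-node, port 1/port 3 layout in Figure 42.

```python
# Helper sketching the HNAS-GAD-CL#-P# host-group naming scheme.
def gad_hostgroup_name(cluster: int, port: int) -> str:
    """Return the host-group name for an HNAS cluster/port pair."""
    return f"HNAS-GAD-CL{cluster}-P{port}"

# Each single-node cluster in Figure 42 uses ports 1 and 3.
names = [gad_hostgroup_name(c, p) for c in (1, 2) for p in (1, 3)]
print(names)
# ['HNAS-GAD-CL1-P1', 'HNAS-GAD-CL1-P3', 'HNAS-GAD-CL2-P1', 'HNAS-GAD-CL2-P3']
```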
There are a few additional considerations when designing for storage sharing using global-active device. Host-mode-option 78 is added to host-groups that present remote storage. In the previous XML example, the following host-groups present storage to HNAS-SiteB, using the hmo attribute to notify site B that there is additional cost in using this link:
On Port CL3-E:
CL3-E-5, CL3-E-6,
On Port CL4-E
CL4-E-4, CL4-E-5.
Figure 42 shows these host-group links as dashed lines. The Hitachi NAS Platform node must see identical storage across both storage arrays. This means that the secondary storage array uses a virtual storage machine to represent its copy of the storage. Figure 43 illustrates this.
Figure 43
Figure 43 shows exporting a pool consisting of four LDEVs to two different clients, labelled Cluster A and Cluster B. Using global-active device, a resource group is created on the secondary storage array using the serial number of the primary storage array. The host-groups that the pool uses must be in the same resource group. The LDEVs have virtualization enabled so that they can leverage the ldevid of the primary LDEVs where they do not match. The section below shows the resources Element of the XML document. This is only used on the secondary storage array.
Figure 44 shows the pairing resources in virtual storage machines on both storage arrays. The serial number of these arrays is neither of the actual arrays. To create this arrangement, both arrays must define the HDRS VSM resource and the host groups. Then LDEVs will be associated with these resource groups.
Figure 44

To create a pool using this arrangement, use site="GAD-both" in the pool definition file, as shown in the following example.
If using the same host groups to support Hitachi Universal Replicator and global-active device, a Hitachi Universal Replicator pool must use the sourceresource and targetresource attributes. The extpaths Element performs the following:
Mapping, if needed
Device allocation, if needed
Assignment as a quorum device.
Since the same ports are used on both storage arrays, this section will be identical in both XML documents.
The previous example shows the quorum server exporting two LDEVs from a dynamic provisioning pool. If the quorum server uses basic LDEVs, then the following is necessary:
The LDEVs must be formatted.
The extpool attribute is omitted.
Global-active Device Stretched-Cluster
When designing a stretched cluster, create pools whose primary storage is on both arrays. The figure below extends the previous design for a stretched cluster.
Figure 45

In Figure 45, the blue volume/pool's primary is on Cluster B, where it is represented as a physical device, and the corresponding virtual storage machine/resource group containing the secondary volume/pool is on the primary array. There is a resource definition in both XML documents, as well as host-groups assigned to each. A host-group cannot be in the global resource group (id=0) and a private resource group at the same time.
Final Points
The same XML format can be used for deletion of array resources using the following:

python -m bin/hdrs -delete XMLfile1 [XMLfile2]

Note — Be careful. Performing a discover followed by a delete may empty your array, and the array may be shared by others. Always trim down the XML documents and delete only what you know you will not need. Re-provisioning existing resources is harmless, so you do not need to trim elements that may already exist. The output may show errors in attempting to re-create existing resources.
Storage Management
Storage management consists of the following categories of operations:
Installation and initial configuration of paired storage pools
Expansion of storage pools
Creation of additional pools
Conversion from non-tiered to tiered pools
Initial Provisioning
To add storage during installation, add a pools Element to the XML configuration files. The example below adds two pools:
HUR-notier-1
HUR-tiered-1
The second pool is a tiered span, containing LDEVs from two different dynamic provisioning pools. In these examples, only a single file is required, as it contains both source and target attributes. The example that follows shows a segment for creating two global-active device pools.
In the next example, this definition creates pools for a stretched cluster:
One pool is GAD-primary
The other pool is GAD-secondary.
Also note the use of a sourceresource to name the VSM on the primary.
Note — Although this is presented as a mechanism for initial provisioning, the python -m bin/hdrs -provision XMLfile command can be used at any time.
This final example shows a definition for a pool that is managed within a pair of VSMs, as shown in Figure 46. The VSMs use a serial number that is shared between them, belonging to neither of the storage arrays. In order to set this up, host-groups must be assigned to both VSMs.
Figure 46

The following XML document provisions these resources on one of the sites.
name="DUAL_C2_N_P2"
name="DUAL_C2_N_P1"
name="DUAL_C1_N_P3"
name="DUAL_C1_N_P4"
Although you can create a large number of host-groups, a WWN can only belong to a single host-group per port, whether in a VSM or otherwise on an array. This requires you to decide on a single pool model, unless redundant ports are available.
The main advantage to the GAD-both model is that, after a forced-delete, it can be re-paired.
Expansion
Expansion is supported for existing pools. The simplest form is the following:

python -m bin/hdrs -more paired-pool

Here is an example of a pool expansion:

[manager@smua ~]$ python -m bin/hdrs -more GAD-Test-2
Allocating 4 new ldevs from pool: HNAS GAD Pool 0
Saved output in conf/GAD-Test-2-extend-7-30_4-28.xml
Pool GAD-Test-2 already exists on 0, but may be expanding (have 4,0,0)
Provision pool: GAD-Test-2
Verifying target host groups belong to VSP-G1000-310073(6)
Adding host-groups for 4 ldevs, have 4
Refresh HORCM
Pairing pool: GAD-Test-2 quorumdev 0
Initiated pairing GAD-Test-2 -> 0 on 0
Provision HNAS: HNASsiteA GAD-Test-2 -> (1) LOG/hnasauto-GAD-Test-2-7-30_4-27.xml
Saved output in LOG/gadtool-7-30_4-27.xml
The logic behind this function is to analyze the existing pool to determine all of its characteristics, such as dynamic provisioning pools, names, and sizes. It then allocates the LDEVs and creates an XML document describing them.
This document is then passed to the -provision function to perform the host-group mapping, Hitachi Open Remote Copy Manager update, pairing, and Hitachi NAS Platform span expansion. The more function adds a stripeset of new LDEVs to the paired pool on both sites. A stripeset is the basic allocation unit of a Hitachi NAS Platform span.

HNASnodeA:$ span-list -s GAD-Test-1
Span instance name    OK?  Free  Cap/GiB  Chunks              Con
--------------------- ---  ----  -------  ------------------  ---
GAD-Test-1            Yes  48%   160      160 x 1073741824    90%
On HDP pool 10 with 224GiB free, shared with GAD-Test-2 and GAD-Test-3
Set 0: 4 x 40GiB = 160GiB, of which 33GiB is free, 0GiB is vacated
SD 0 (rack '10038', SD '5500')
SD 1 (rack '10038', SD '5501')
SD 2 (rack '10038', SD '5503')
SD 3 (rack '10038', SD '5504')
If the pool is not a Hitachi NAS Platform span, it will use the value in conf/hdrs.cfg. You can continue expanding the pool, subject to the provisioning limits of the dynamic provisioning pools on both sites. Pools can also be expanded using the -provision option. Create an XML document with the storage that you want added, as shown:

[manager@smua ~]$ bin/hdrs.py -provision pool-expand.xml
Provide 1 configuration file for storage/dp_pool and 2 for system provisioning
Pool GAD-Test-1 already exists on 0, but may be expanding (have 4,0,0)
Provision pool: GAD-Test-1
Verifying target host groups belong to VSP-G1000-310073(6)
Adding host-groups for 4 ldevs, have 4
Refresh HORCM
Pairing pool: GAD-Test-1 quorumdev 0
Initiated pairing GAD-Test-1 -> 229 on 0
Saved output in LOG/gadtool-08-10_21-44.xml
Re-running this command will continue adding additional shared storage.
Duplication
Duplication is supported for dynamic provisioning pools and existing pools. The simplest form is:

python -m bin/hdrs -another HDP-pool | paired-pool

Here is an example of a pool duplication:

[manager@smua ~]$ python -m bin/hdrs -another GAD-Test-1
Using GAD-Test-2 as the next copygrp/pool
Allocating 4 new ldevs from pool: HNAS GAD Pool 0
Create new copy group GAD-Test-2
Saved output in conf/GAD-Test-2-duplicate-7-30_4-20.xml
Pool GAD-Test-2 already exists on 0, but may be expanding (have 4,0,0)
Provision pool: GAD-Test-2
Verifying target host groups belong to VSP-G1000-310073(6)
Adding host-groups for 4 ldevs, have 0
Refresh HORCM
Pairing pool: GAD-Test-2 quorumdev 0
Initiated pairing GAD-Test-2 -> 0 on 0
Provision HNAS: HNASsiteA GAD-Test-2 -> (1) LOG/hnasauto-GAD-Test-2-7-30_4-19.xml
Saved output in LOG/gadtool-7-30_4-19.xml
The logic is the following:
1. Examine the existing pool.
2. Create an XML specification for the new pool.
3. Invoke the -provision function internally using the specification as shown.
4. Refresh the instances of Hitachi Open Remote Copy Manager.
5. Initiate pairing of the resources.
6. Invoke the HDRS 'correct' function to provision the additional pool to Hitachi NAS Platform.
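The transcript above shows the tool choosing "GAD-Test-2" as the next copy group after "GAD-Test-1". That naming step can be sketched as follows; the fallback behavior for names without a trailing number is an assumption for illustration.

```python
import re

# Sketch of deriving the next copy-group/pool name, as seen in the
# transcript ("Using GAD-Test-2 as the next copygrp/pool").
def next_pool_name(name: str) -> str:
    """Increment the trailing number of a pool/copy-group name."""
    m = re.search(r"(\d+)$", name)
    if m:
        return name[: m.start()] + str(int(m.group(1)) + 1)
    return name + "-2"  # assumed fallback when no trailing number exists

print(next_pool_name("GAD-Test-1"))  # GAD-Test-2
```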
Here is an example of dynamic provisioning pool duplication:

[manager@smua ~]$ bin/hdrs.py -another "HNAS GAD Pool 0"
The supplied model is not a valid group/pool, try as HDP pool
Using HNAS GAD Pool 2 as the next HDP pool on 0
Saved output in conf/HNAS GAD Pool 2-duplicate-08-09_21-21.xml
Create DP pool HNAS GAD Pool 2 on 0
Find a pg-ldev with available ldevs in: ['1-1', '1-2', '1-3', '1-4', '2-1', '2-2', '2-3', '2-4'] on 0
Chose ldev 41:02 from parity-grp 2-2
Find a pg-ldev with available ldevs in: ['1-1', '1-2', '1-3', '1-4', '2-1', '2-2', '2-3', '2-4'] on 0
Chose ldev 41:03 from parity-grp 2-2
Using HNAS GAD Pool 2 as the next HDP pool on 1
Saved output in LOG/HNAS GAD Pool 2-dp-1.xml
Create DP pool HNAS GAD Pool 2 on 1
Find a pg-ldev with available ldevs in: ['1-1', '1-2', '1-3', '1-4', '2-1', '2-2', '2-3', '2-4'] on 1
Chose ldev 04:04 from parity-grp 1-3
Find a pg-ldev with available ldevs in: ['1-1', '1-2', '1-3', '1-4', '2-1', '2-2', '2-3', '2-4'] on 1
Chose ldev 04:05 from parity-grp 1-4
Saved output in LOG/gadtool-08-09_21-20.xml
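The pg-ldev selection shown in the transcript ("Find a pg-ldev with available ldevs in: [...]") can be sketched as follows. The data structures are hypothetical stand-ins for the discovery output, not the tool's internal representation.

```python
# Sketch: pick an unused LDEV from the first parity group whose RAID
# level matches, as the transcript's "Find a pg-ldev" step does.
def find_pg_ldev(parity_groups, raid_lvl, used):
    for pg, info in parity_groups.items():
        if info["raid_lvl"] != raid_lvl:
            continue
        for ldev in info["ldevs"]:
            if ldev not in used:
                return pg, ldev
    return None

# Illustrative data modeled on the transcript above.
parity_groups = {
    "2-2": {"raid_lvl": "RAID6", "ldevs": ["41:02", "41:03"]},
}
print(find_pg_ldev(parity_groups, "RAID6", used={"41:02"}))  # ('2-2', '41:03')
```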
When a '*' is used as the value for raid-grps, the logic determines the composition of the existing dynamic provisioning pool and matches any unused LDEV in a parity group with the same raid-lvl and raid-type. The allocation is then performed on both sites. In keeping with Hitachi NAS Platform best practice, create one dynamic provisioning pool per Hitachi NAS Platform pool. The disaster recovery solution for Hitachi NAS Platform proceeds even without using separate pools. The disaster recovery solution for Hitachi NAS Platform supports tiered and untiered storage. Tiered storage is a mechanism whereby the pool consists of two types (or tiers) of storage:
One tier is used for file system metadata (tier 0).
One tier is used for the data itself (tier 1).
Significant performance gains can be achieved when using faster storage for file system metadata.
The storage functions supported for expansion and tiered storage pools are the following:
Adding an additional pool
Adding storage to an existing pool (not tiered)
Adding storage to a specific tier to an existing tiered span (or both)
Conversion from untiered to tiered pool by adding tier0 storage to an existing pool
To simplify the Hitachi Universal Replicator configuration tasks, as well as to eliminate the need to perform tiering operations in Hitachi NAS Platform, use the device naming facility within the storage array for all storage management functions. Introducing this activity requires describing the array naming capabilities and then how these facilities are leveraged for storage management tasks. All of the current arrays provide naming at four levels:

1. LDEV labels (or nicknames)
LDEV-level naming organizes the storage devices by assigning meaningful names. Figure 47 shows four LDEVs (volumes) and associated labels. Labels can be assigned during volume creation and modified at any time.
Figure 47

2. Copy group name
Copy groups correspond to the Hitachi NAS Platform storage pools and the Hitachi Universal Replicator consistency group. Each copy group normally contains a single device group, which consists of one or more devices.

3. Device group name
Device groups are a container for related LDEVs. When a device/LDEV is added to a device group, it is assigned a device group member name. This name does not need to correspond to the device's LDEV label. Instead, a specific set of naming conventions is used to manage the storage.
4. Device group member name

In order to leverage SAN naming for the disaster recovery solution for Hitachi NAS Platform, a few restrictions are applied.
The copy group will be used as the Hitachi NAS Platform storage pool name.
The copy group for Hitachi Universal Replicator uses Mirror ID = 0, and must specify an available journal ID.
The copy group for global-active device will use Mirror ID = 0, and must not specify a journal ID.
Device groups (part of the copy_grp) must be prefixed with the name of the copy group. The following convention must be used:
-PVOL contains all devices making up the pool
-CW contains all thin image devices for maintaining a snapshot
-SI contains all S-VOLs used to support Hitachi ShadowImage
-SVOL-SI contains a reference to an existing Hitachi Universal Replicator or global-active device secondary using unique names. This is a requirement for using local replication.
A copy group with one device group is for Hitachi Universal Replicator or global-active device. A copy group with two device groups is for local replication. Loopback is not supported. Device groups must contain member names that are all prefixed by the name of the copy group.
All of these restrictions are handled automatically when using the hdrs or groupdev tools for managing copy and device groups.

[manager@smu1 ~]$ raidcom get copy_grp
COPY_GROUP             LDEV_GROUP                  MU# JID# Serial#
SI-bohr-bkup-10242014  SI-bohr-bkup-10242014-PVOL  0   -    356006
SI-bohr-bkup-10242014  SI-bohr-bkup-10242014-SVOL      -    356006
HUR-heisenberg         HUR-heisenberg-PVOL         h1  -    356006
HUR-SQL2012            SQL2012-PVOL                h1  -    356006
IS_5825_1414436106     IS_DEV_P_5825_1414436106    0   -    356006
HTI_pauli_G            HTI_pauli_G_drive_PVOL      3   -    356006
HTI_pauli_G            HTI_pauli_G_drive_SVOL          -    356006
HUR-einstein           HUR-einstein-PVOL           h1  -    356006
TI-einstein-1          TI-einstein-1-PVOL          h3  -    356006
TI-einstein-1          TI-einstein-1-SVOL              -    356006
HUR-pool2              HUR-pool2-PVOL              h0  12   356006
HUR-pool2              HUR-pool2-SVOL                  12   356006
HUR-hnas               HUR-hnas-PVOL               h0  10   356006
HUR-hnas               HUR-hnas-SVOL                   10   356006
You must use a 3-character minimum case-sensitive prefix to associate storage resources that will be managed by the disaster recovery solution for Hitachi NAS Platform. As an example, using 'HUR' in the output above yields the following candidates:
HUR-hnas
HUR-pool2
HUR-heisenberg
HUR-SQL2012
Only HUR-hnas and HUR-pool2 obey the naming conventions above ("Device groups must contain member names that are all prefixed by the name of the copy group"). The other copy groups will be ignored by the disaster recovery solution for Hitachi NAS Platform. To display the device group members of the selected items above, do the following:

[manager@smu1 ~]$ raidcom get device_grp -device_grp_name HUR-hnas-PVOL
LDEV_GROUP     LDEV_NAME       LDEV#  Serial#
HUR-hnas-PVOL  HUR-hnas-PVOL1  794    356006
HUR-hnas-PVOL  HUR-hnas-PVOL2  795    356006
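The selection rule (prefix match on the copy-group name, and every member name prefixed by the copy-group name) can be sketched as follows. The member lists below are illustrative, modeled on the raidcom output shown.

```python
# Sketch: a copy group qualifies only if its name carries the configured
# prefix AND every device-group member name is prefixed by the copy-group
# name, per the convention stated above.
def qualifying_copy_groups(prefix, members_by_group):
    return [
        grp for grp, members in members_by_group.items()
        if grp.startswith(prefix)
        and all(m.startswith(grp) for m in members)
    ]

# Illustrative member data based on the raidcom listings above.
members = {
    "HUR-hnas": ["HUR-hnas-PVOL1", "HUR-hnas-PVOL2"],
    "HUR-SQL2012": ["SQL2012-PVOL1"],  # members not prefixed: ignored
}
print(qualifying_copy_groups("HUR", members))  # ['HUR-hnas']
```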
The HUR-hnas pool has no tiered storage because neither LDEV_NAME has tier0 or tier1 in it.

[manager@smu1 ~]$ raidcom get device_grp -device_grp_name HUR-pool2-PVOL
LDEV_GROUP      LDEV_NAME          LDEV#  Serial#
HUR-pool2-PVOL  HUR-pool2-1-tier0  131    356006
HUR-pool2-PVOL  HUR-pool2-2-tier0  132    356006
HUR-pool2-PVOL  HUR-pool2-3-tier0  133    356006
HUR-pool2-PVOL  HUR-pool2-4-tier0  137    356006
HUR-pool2-PVOL  HUR-pool2-tier1-5  138    356006
HUR-pool2-PVOL  HUR-pool2-tier1-6  141    356006
HUR-pool2-PVOL  HUR-pool2-tier1-7  145    356006
HUR-pool2-PVOL  HUR-pool2-tier1-8  146    356006
The HUR-pool2 pool consists of four tier0 and four tier1 LUNs.
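Tier membership is inferred from the LDEV names: an LDEV belongs to tier 0 or tier 1 when 'tier0' or 'tier1' appears in its LDEV_NAME; names with neither indicate an untiered pool. A minimal sketch of that classification:

```python
# Sketch: classify LDEVs into tiers by substring, per the naming
# convention described above. None collects untiered LDEVs.
def classify_tiers(ldev_names):
    tiers = {0: [], 1: [], None: []}
    for name in ldev_names:
        if "tier0" in name:
            tiers[0].append(name)
        elif "tier1" in name:
            tiers[1].append(name)
        else:
            tiers[None].append(name)
    return tiers

pool2 = ["HUR-pool2-1-tier0", "HUR-pool2-tier1-5"]
print(len(classify_tiers(pool2)[0]), len(classify_tiers(pool2)[1]))  # 1 1
```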
With this introduction to duplication, here is a description of the primary storage management functions within the solution.
Installation
During installation, a prefix is supplied which identifies the replicated resources managed through this software.
Data collection cycle
During data collection, copy groups and their members are obtained and reviewed to determine whether additional pools need to be added to the Hitachi Open Remote Copy Manager configuration.
Adding an additional pool (administrative action)
Create storage resources, naming LUNs and device groups based on the guidelines presented. Once a journal has been assigned, a copy group can be created. Data collection will notice the new copy group, then update and restart the Hitachi Open Remote Copy Manager services, including the new pool in the disaster recovery solution database. After running operation=auto-correct, any new pools will be paired and the Hitachi NAS Platform pool loaded.
Expanding non-tiered storage (administrative action)
Add LUNs (usually four or more) to the existing device group. Using the first example, add the LUNs to HUR-hnas-PVOL on the production machine and to the corresponding device group on the secondary machine. Restart Hitachi Open Remote Copy Manager services to recognize the new storage. After running operation=correct, the drives will be paired and the Hitachi NAS Platform pool expanded.
Expanding tiered storage (administrative action)
Add LUNs (usually four or more) to the appropriate device group on both systems (for example, HUR-pool2-PVOL), naming the LDEV_NAME with tier0 or tier1. After running operation=correct, the drives are paired and the appropriate tier expanded.
Converting to tiered storage (administrative action)
Add LDEV_NAMEs (usually four or more) containing tier0 (for example, HUR-hnas-tier0-1) to the device group, which must be added to the copy-grp (for example, HUR-pool1 in example 1). After running operation=correct, the new drives are paired, existing storage is converted to tier1 with no change made to the existing LDEV names, the tier0 storage is added, and file systems are converted.
groupdev Tool
Use the groupdev tool to convert existing Hitachi Open Remote Copy Manager files into copy and device groups. The syntax to call groupdev is the following:

python -m bin/groupdev -convert tier0prefix filename

The convert option takes the existing filename, which is assumed to be in horcm#.conf format, to create the device_grps and copy_grps. The command is typically run twice, once for each side of the pair. The collect tool automatically updates the HORCM-0 and HORCM-1 instance files when it discovers new copy_grps.
Converting From a Non-tiered Pool to a Tiered Storage Pool
The assumption is that an existing non-tiered storage pool has been replicated and exposed to Hitachi NAS Platform. At least one filesystem must exist on the existing pool for conversion to take place. To do the conversion, add storage from another dynamic provisioning pool and label the LDEVs in the device-grp with 'tier0'.

[manager@smua ~]$ python -m bin/hdrs -provision pool-convert.xml
Provide 1 configuration file for storage/dp_pool and 2 for system provisioning
Pool GAD-Test-1 already exists on 0, but may be expanding (have 12,0,0)
Provision pool: GAD-Test-1
Verifying target host groups belong to VSP-G1000-310073(6)
Adding host-groups for 4 ldevs, have 4
Refresh HORCM
Initiated pairing GAD-Test-1 -> 0 on 0
Provision HNAS: HNASsiteA GAD-Test-1 -> (1) LOG/hnasauto-GAD-Test-1-08-10_22-31.xml
Saved output in LOG/gadtool-08-10_22-31.xml
After correction and re-expansion, auto-correct pairs, expands, converts, and re-mirrors the pool.
Managing Storage for Disaster Recovery-Validation
In order to support a snapshot of a storage pool, one of the following must be true:
Storage must be prepared to house the copy if using Hitachi ShadowImage
Virtual volumes must be prepared to manage a virtual copy if using Hitachi Thin Image.
Allocate storage using global-active device, as shown below. Assuming that a snap pool exists in the storage array on the disaster site, an existing pool can be prepared for Thin Image using the following:

python -m bin/hdrs -thinimage pool-name

If there is not an existing snap dynamic provisioning pool, the tool will fail with this message:

No thin image pools available. Use -provision on the thin_pool.template
The choices for creating a snap pool are the following:
Use raidcom
Use Hitachi Device Manager
If there is a thin_pool.template, use the template to provision a snap pool
Create the snap pool with the following command:

python -m bin/hdrs -provision thin_pool.template

[manager@smua ~]$ python -m bin/hdrs -provision thin_pool.template
Provide 1 configuration file for storage/dp_pool and 2 for system provisioning
Create snap_pool DP-HTI-Pool on 1
Saved output in LOG/gadtool-08-05_15-37.xml

The contents of the thin_pool.template are the following:
Once a snap pool exists, re-run the operation, as follows:

[manager@smua ~]$ python -m bin/hdrs -thinimage GAD-Test-1
Allocating 4 new ldevs from pool: snap
Adding target copy groups for: GAD-Test-1-CW
Saved output in LOG/GAD-Test-1-CW-pool-snap-08-05_15-46.xml
Pool GAD-Test-1-CW already exists on 1, but may be expanding (have 0,0,0)
Provision pool: GAD-Test-1-CW
Verifying target host groups belong to VSP-G1000-310073(6)
Adding host-groups for 4 ldevs, have 0
Saved output in LOG/gadtool-08-05_15-45.xml
This shows creating a perfectly-sized pool called GAD-Test-1-CW to hold the snapshot of GAD-Test-1. For Hitachi ShadowImage, the name of a dynamic provisioning pool must be specified, as actual storage is allocated to hold the shadow copies. The syntax is as follows:

python -m bin/hdrs -shadowimage pool-name HDP-poolname
If you are not sure which pools can be used for this purpose, do any of the following:
Use raidcom get pool -key opt -I1 to identify pools on the disaster site.
Use the output of python -m bin/hdrs -discover to review the dp_pools Element for available pools.
Create another pool by duplicating an existing one using: python -m bin/hdrs -another HDP-poolname
Ask your storage administrator
Below is an actual text capture showing this process.

[manager@smua ~]$ python -m bin/hdrs -shadowimage GAD-Test-1 "HNAS GAD Pool 1"
Allocating 4 new ldevs from pool: HNAS GAD Pool 0
Adding target copy groups for: GAD-Test-1-SI
Saved output in LOG/GAD-Test-1-SI-pool-11-08-05_15-55.xml
Pool GAD-Test-1-SI already exists on 1, but may be expanding (have 0,0,0)
Provision pool: GAD-Test-1-SI
Verifying target host groups belong to VSP-G1000-310073(6)
Adding host-groups for 4 ldevs, have 0
Saved output in LOG/gadtool-08-05_15-54.xml
Hitachi ShadowImage and Hitachi Thin Image storage provisioning of tiered Hitachi NAS Platform pools is not currently supported by the disaster recovery solution for Hitachi NAS Platform.
Deleting Storage
The disaster recovery solution for Hitachi NAS Platform provides a way to delete storage resources.

python -m bin/hdrs -delete paired-pool | device-grp | ldevs | HDP-pool | XMLfile1 [XMLfile2]

The resources in XML files are deleted in an optimal order. When deleting resources at the command line, be careful to delete the resource at its highest level. For example, take note of the following:
Do not delete LDEVs if they are part of a paired-pool, also known as a copy-grp.
Do not delete device-grps that belong to copy-grps. Instead, delete the copy-grp first.
If the pool has had a snapshot/shadowimage created, delete the snapshot/ShadowImage copy-grp first.
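The cautions above imply a top-down deletion order: snapshot/ShadowImage copy-grps first, then paired pools (copy-grps), then device-grps, then LDEVs. The ranking below is an illustrative assumption, not the tool's actual implementation.

```python
# Sketch of the highest-level-first deletion order implied above.
DELETE_ORDER = ["snapshot-copy-grp", "copy-grp", "device-grp", "ldev", "hdp-pool"]

def deletion_plan(resources):
    """Sort (kind, name) pairs so higher-level containers are deleted first."""
    return sorted(resources, key=lambda r: DELETE_ORDER.index(r[0]))

plan = deletion_plan([("ldev", "0a:00"), ("copy-grp", "GAD-Test-1"),
                      ("device-grp", "GAD-Test-1-PVOL")])
print([r[0] for r in plan])  # ['copy-grp', 'device-grp', 'ldev']
```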
Deleting storage array resources is a bit tricky for the following reasons:
Deleting a LUN (an LDEV with mappings) cannot be performed without deleting the mappings first.
Deletion of devices within resource groups or with virtual LDEV mappings is more complicated.
Deletion of a paired device (snapshot) or remote pairing is not allowed without removing the pairings.
Deletion of LDEVs from a dynamic provisioning pool leaves the devices in the blocked state.
Error codes from operations are not recorded in one place. Places to look include the following:
Global-active device User Guide
Command Control Interface User Reference
Hitachi Thin Image User Guide
Hitachi ShadowImage User Guide
Hitachi Universal Replicator User Guide
XML Tags

Table 14. XML Tags. Each entry lists the tag, the element (or elements) in which it appears, and an explanation.

Elapsed (Multiple): The amount of time elapsed in performing the enclosed activity
Finish (Multiple): The wall time at the completion of the operation
start (Multiple): The wall time at the start of the operation
Version (Request): The version of the HDRS
Home-cluster (Pair): The HNAS cluster which maintains the primary (or writeable) replication copy under normal operation. Note: quorum device = 0 is used to indicate the 'production' site, = 1 for the 'alternate' site
Pool (Pair): The name of the replication group and HNAS span/pool
Pool-cluster (Pair): The HNAS cluster where the pool is available (normally the same as the home-cluster)
Storage-cluster (Pair): The HNAS cluster whose primary array is used as the storage array for the pool (normally the same as the home-cluster)
Available (Pair): 'Yes' indicates that the pool is available and healthy
Status (config): 'complete': The status of the configuration capture
Backup-ip (config): The address of the HDRS backup software
Topic (config): 'hnas-config', 'archive-config': The type of configuration operation
System-code (Issue, health): The accumulation of all system-codes detected on the system. See the section System Status Codes on page 135 for a detailed list of codes
System-state (Health): A level from healthy, normal, warning, severe, critical, based on the severity associated with the most severe system-code
Table 14. XML Tags (Continued)

Pair-code (Health): The accumulation of the issue codes against a pool or against all pools in the cluster
code (Issue): A pair-code normally identified in an issue element and referenced during correction
Drive-access (Node, cluster): 'Allowed', 'Denied': This is the HNAS 'allowed' characteristic indicating the drive is licensed and can be used by the cluster
Drive-role (Node, cluster): 'Primary', 'Secondary': This is the HNAS drive 'role' reported by the sd-list command
Drive-status (Node, cluster): 'OK', 'Unknown', etc.: This is the HNAS drive 'status' reported by the sd-list command
Iomode (Node, cluster): 'L/M', 'L/L', 'B/B', '-': This is the GAD iomode as reported by 'pairdisplay'. It indicates the mirroring state of the GAD pairs
Name (Node, cluster): The name of the HNAS cluster or node
Pool-available (Node, cluster): 'True', 'False': This is an indication that the pool/span is usable on the node/cluster
State (Node, cluster): 'PAIR', 'COPY', 'PSUS', 'SSUS', 'PSUE', 'SSUE', 'SSWS', 'SMPL': This is the replication state of the pool/group
type (Node, cluster): Dual = dual-cluster, stretch = stretched-cluster
Avg-changes-sec (Stats): Average number of changes/sec measured by the amount of change of the journal index (HUR)
Avg-queue (Stats): The average delta between the journal index of the primary and secondary
Journal-util (Stats): The amount of journal storage being utilized
Max-changes-sec (Stats): The maximum value seen for average changes per second
Max-queue (Stats): The maximum value seen for average queue
Horcm-ready (Snapshot): If 'yes', indicates that the HORCM instances and storage are allocated to support this type of snapshot
Loaded (Snapshot): If 'yes', indicates that the snapshot is loaded at the disaster cluster. N/A indicates there is no valid data to load
132 Table 14. XML Tags (Continued) Tag
Explanation
Element
Pct
The percentage value for the copy being made
snapshot
Pool
The base pool for the snapshot
Snapshot
Role
The replication role (normally = 'primary'
Snapshot
snap
The name of the snapshot
Snapshot
state
The replication state for the snapshot (COPY/SMPL/PSUS)
Snapshot
Msg
The issue description
Issue
Node
The node it impacts
Issue
Cluster
The cluster it impacts
Issue
State
X|X|X|X codes representing each of the Fibre Channel ports on a node. The possible values are: s - some devices are now unavailable, S - all devices are not reachable, N - normal, D - port down, O - administratively disabled, C - missing cable
Issue = 0x100000
Using-local-paths
The number of devices accessed using local paths
Issue
Using-remote-paths
The number of devices accessed using remote paths
Issue
detail
Further information
Issue
Drive-list
The list of HNAS system-drives involved
Issue
Affected-drives
The list of HNAS system-drives involved
Issue
Filesystems
List of affected filesystems
Issue = 0x40000
State
'failure-storage'
Issue = 0x40000
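The tags above can be read with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree against a small snippet; the element and attribute names follow Table 14, but the snippet itself is invented for illustration and may not match the exact layout of real hdrs output.

```python
import xml.etree.ElementTree as ET

# Illustrative snippet only: tag/attribute names follow Table 14, but the
# real status output layout may differ.
SAMPLE = """
<status>
  <cluster Name="HNAS-A" type="dual" State="PAIR" Pool-available="True"/>
  <issue Msg="No file systems mounted" Cluster="HNAS-B" State="failure-storage"/>
</status>
"""

root = ET.fromstring(SAMPLE)
for cluster in root.iter("cluster"):
    print(cluster.get("Name"), cluster.get("State"), cluster.get("Pool-available"))
for issue in root.iter("issue"):
    print("ISSUE:", issue.get("Cluster"), issue.get("Msg"))
```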
132
133
Pair Status Codes

These are the pair status codes returned for Operation=display. Each entry lists the code, then in brackets whether it is enabled for automated correction and its severity, then the explanation.

Table 15. Pair Status Codes

0x00000001  [N; notice]          In Simplex mode, pools should not be available in Hitachi NAS Platform prior to pairing
0x00000002  [N; notice]          Some fibre paths have been hardcoded (see GAD 0x1000)
0x00000004  [N; severe]          Pool is exposed on wrong site
0x00000008  [Y, HDRS; notice]    Pool is paired, but not available on either site
0x00000010  Not used
0x00000020  [Y; warning]         Administrative access has been given to a secondary site (SSWS) using -RS
0x00000040  [Y; severe]          Database is missing peer information on the pool (is peer alive?)
0x00000080  [N; severe]          One side of the pair has gone simplex; this is usually the result of a forced-delete
0x00000100  Not used
0x00000200  [-; warning]         Secondary drives are missing (storage trauma?)
0x00000400  [Y; notice]          Shadow drives on the secondary are present and secondary volumes maintain access (appears as unhealthy in UI)
0x00000800  [N; warning]         Secondary drives are not in 'allowed' state (HUR); not available for takeover
0x00001000  [Y; severe]          HUR - Ready to implement Hitachi NAS Platform remote mirror provisioning
0x00002000  [-; severe]          HUR - Primary system's drives are not in 'allowed' state, but shadows exist
0x00004000  [Y; warning]         Paired pools are suspended (administratively)
0x00008000  [Y; warning/severe]  Primary state is marked 'pair in error'
0x00010000  [-; warning]         Queue discrepancy may indicate one-way Fibre Channel blockage
0x00020000  Not used
0x00040000  [Y; warning]         No file systems mounted, manual EVS assignment is required, and other filesystem issues
0x00080000  [-; warning]         GAD - Pair suspension due to error
0x00100000  [HDRS; notice]       Drives added to existing pair, some still in SMPL mode (span expansion)
0x00200000  [-; warning]         Mismatch in the drive count between consistency groups
0x00400000  Not used
0x00800000  [HDRS; warning]      HUR - Drives added; re-mirroring required
0x01000000  [-; severe]          GAD - Forced delete has occurred; peer must do so as well
0x02000000  [-; warning]         Remote drives are in Secondary mode but are still in Primary state (result of forced takeover)
0x04000000  Not used
0x08000000  [N; warning]         Primary state, but drives are marked as Secondary; un-pegging is needed
0x10000000  [Y; severe]          Local state is Secondary but is operating in Primary state (result of storage swap)
0x20000000  [-; severe]          Forced delete required to restore access
0x40000000  [Y; severe]          Forced takeover has taken place, but role reversal has not taken place (still in SSWS/PSUE)
0x80000000  [-; severe]          Pool health is not marked OK
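The codes are powers of two, so a reported status word can combine several conditions. Assuming the bits are simply OR'd together, a minimal Python sketch decodes such a word; the dictionary reproduces only a few of the documented codes, and the function name is illustrative.

```python
# Sketch: decode a pair-status word into individual Table 15 codes.
# Only a few of the documented codes are reproduced here; extend the
# dictionary with the rest of the table as needed.
PAIR_STATUS = {
    0x00000004: ("Pool is exposed on wrong site", "severe"),
    0x00000020: ("Administrative access has been given to a "
                 "secondary site (SSWS) using -RS", "warning"),
    0x00004000: ("Paired pools are suspended (administratively)", "warning"),
    0x40000000: ("Forced takeover has taken place, but role reversal "
                 "has not taken place (still in SSWS/PSUE)", "severe"),
}

def decode_pair_status(word):
    """Return (bit, explanation, severity) for each documented bit set."""
    return [(bit, text, sev)
            for bit, (text, sev) in sorted(PAIR_STATUS.items())
            if word & bit]

for bit, text, sev in decode_pair_status(0x40004004):
    print(f"0x{bit:08x} [{sev}] {text}")
```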
System Status Codes

These are the system status codes. Codes marked "(global-active device)" are specific to global-active device.

Table 16. System Status Codes

0x40000000  [severe]    A HORCM manager is not running or cannot authenticate
0x08000000  [critical]  Inter-array communication lost in both directions
0x04000000  [warning]   Inter-array communication degraded (partial path loss)
0x02000000  [severe]    Inter-array communication lost in one direction
0x00800000  [warning]   Inconsistent EVS security mode between clusters
0x00200000  [severe]    Pathing issues impact SCSI device availability
0x00100000  [warning]   A node has at least 2 not-normal fibre ports
0x00020000  [severe]    Quorum device is blockaded (global-active device)
0x00010000  [severe]    Quorum paths are blocked to one array (global-active device)
0x00008000  [warning]   Quorum paths are partially blocked (global-active device)
0x00004000  [critical]  Quorum array is unreachable by either cluster (global-active device)
0x00080000  [severe]    One node in a cluster has lost all fibre connectivity
0x00040000  [critical]  All nodes in a cluster lose all fibre connectivity
Configuration Record

All files carry a random 6-character value at the end of the pathname that should be ignored. Unless otherwise stated, all files are text files ending in '.txt'. The configuration record contains subdirectories (usually '0' and '1') corresponding to the Hitachi Open Remote Copy Manager instances for the primary and secondary arrays (and clusters).

Table 17. Configuration Record

Pathname                 Explanation (source command or contents)
span-list                span-list --sds
scsi-racks               scsi-racks
sd-list-tier             sd-list --show-tier
sd-list                  sd-list -c
sd-list_raid-name        sd-list --raid-name
cluster-show             cluster-show
cluster-verbose          cluster-show -v
cluster-getmac           macid-cluster
evsipaddr                evsipaddr -l
scsi-raid-groups         scsi-raid-groups
sdpath                   sdpath
fc-hports                pn all hport-wwn
macid-all                pn all getmacid
license-all              licensekey list
copy_grp                 raidcom get copy_grp | grep "^pattern"
device_grps              raidcom get device_grp | grep -oP "^pattern*"
ldev_devgrp-devgrp       raidcom get ldev -grp_opt ldev -device_grp_name devgrp -key frnt
host_grp-port            raidcom get host_grp -port port
hba_wwn-port             raidcom get hba_wwn -port port
lun-port                 raidcom get lun -port port
port-RCUMCU              raidcom get rcu
get_journal              raidcom get journal
get_pool                 raidcom get pool -key opt
get_ldev                 raidcom get ldev -ldev_list defined -cnt 2000 -key frnt
horcctl                  horcctl -D
registry.tgz             backupregistry
hdcp.zip                 Auto-dump array configuration and restructuring
SMUbasicinfo.txt         ip addr, ip route, hostname, domainname
MM-DD_HH-mm.xml          hdrs -discover output file
Smu_DDMMMYY_TTTTTT.zip   SMU backup
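Because each saved file carries that random 6-character suffix, mapping a file back to the command that produced it just means stripping the suffix. A small Python sketch (the helper name base_name is illustrative, not part of the product):

```python
import re

# The trailing "-XXXXXX" (6 random characters) before the extension is
# noise; stripping it recovers the command-derived file name.
def base_name(filename):
    return re.sub(r'-[A-Za-z0-9_]{6}(?=\.[a-z]+$)', '', filename)

print(base_name('copy_grps-tSCByg.txt'))                  # copy_grps.txt
print(base_name('device_grp-HUR_pool1-PVOL-pHM9k9.txt'))  # device_grp-HUR_pool1-PVOL.txt
print(base_name('registry.tgz'))                          # unchanged: no random suffix
```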
The files are packed into a single compressed tar image (.tgz extension), which is saved locally under conf/config-MM-DD.tgz and copied to the peer under SAVE/IP-of-SMU-config-MM-DD.tgz. The system maintains the last 2 weeks of configuration records. An example of the contents of a configuration record is shown below.

-rw------- 1 manager manager    75 Feb 26 10:43 copy_grps-tSCByg.txt
-rw------- 1 manager manager    47 Feb 26 10:43 device_grps-hlntLz.txt
-rw------- 1 manager manager  1275 Feb 26 10:43 device_grp-HUR_pool1-PVOL-pHM9k9.txt
-rw------- 1 manager manager  3208 Feb 26 10:43 ldev_devgrp-HUR_pool1-PVOL-hVCboN.txt
-rw------- 1 manager manager   375 Feb 26 10:43 device_grp-HUR_pool-PVOL-CW-OzszEm.txt
-rw------- 1 manager manager   888 Feb 26 10:43 ldev_devgrp-HUR_pool-PVOL-CW-ROJgVk.txt
-rw------- 1 manager manager   375 Feb 26 10:43 device_grp-HUR_pool3-PVOL-Rp47a7.txt
-rw------- 1 manager manager   880 Feb 26 10:43 ldev_devgrp-HUR_pool3-PVOL-S9zX3D.txt
-rw------- 1 manager manager   276 Feb 26 10:43 host_grp-CL4-A-1-IBRrGV.txt
-rw------- 1 manager manager   201 Feb 26 10:43 hba_wwn-CL4-A-1-R4KE7j.txt
-rw------- 1 manager manager  2798 Feb 26 10:43 lun-CL4-A-1-KTCsYs.txt
-rw------- 1 manager manager   276 Feb 26 10:44 host_grp-CL4-B-2-3Zjogq.txt
-rw------- 1 manager manager   201 Feb 26 10:44 hba_wwn-CL4-B-2-MiKmlx.txt
-rw------- 1 manager manager  2798 Feb 26 10:44 lun-CL4-B-2-0Wt2mX.txt
-rw------- 1 manager manager   276 Feb 26 10:44 host_grp-CL4-B-1-pde4ai.txt
-rw------- 1 manager manager   201 Feb 26 10:44 hba_wwn-CL4-B-1-SMZx1Y.txt
-rw------- 1 manager manager  2798 Feb 26 10:44 lun-CL4-B-1-KS0e9p.txt
-rw------- 1 manager manager   276 Feb 26 10:44 host_grp-CL4-A-2-geNHZA.txt
-rw------- 1 manager manager   201 Feb 26 10:44 hba_wwn-CL4-A-2-U5Ly1O.txt
-rw------- 1 manager manager  2798 Feb 26 10:44 lun-CL4-A-2-_kUn2q.txt
-rw------- 1 manager manager   328 Feb 26 10:44 port-RCUMCU-t0UItI.txt
-rw------- 1 manager manager   136 Feb 26 10:44 get_rcu-pNUeBT.txt
-rw------- 1 manager manager   224 Feb 26 10:45 get_journal-5CHdyY.txt
-rw------- 1 manager manager   792 Feb 26 10:45 get_pool-K1bYwm.txt
-rw------- 1 manager manager    55 Feb 26 10:45 horcctl-gzCFcG.txt
-rw-rw-r-- 1 manager manager 18722 Feb 26 10:45 registry.tgz
-rw------- 1 manager manager  2517 Feb 26 10:45 span-list-I40wlf.txt
-rw------- 1 manager manager  4093 Feb 26 10:45 sd-list_raid-name-i1Vm1N.txt
-rw------- 1 manager manager  4432 Feb 26 10:45 sd-list-tier-5QixUS.txt
-rw------- 1 manager manager  2398 Feb 26 10:45 scsi-raid-groups-SaoXkc.txt
-rw------- 1 manager manager  4124 Feb 26 10:45 scsi-racks-T3eVuT.txt
-rw------- 1 manager manager 30322 Feb 26 10:45 sd-list-nOFLqf.txt
-rw------- 1 manager manager   499 Feb 26 10:45 fc-hports-HGnxv_.txt
In the output above, the name of each file is directly related to the command that was issued to obtain its contents. As an example, the following is run:

raidcom get device_grp -device_grp_name HUR_pool1-PVOL

The result is the device_grp-HUR_pool1-PVOL-pHM9k9.txt file, and this file contains:

LDEV_GROUP      LDEV_NAME           LDEV#  Serial#
HUR_pool1-PVOL  HUR_pool1-tier1-13     33   210068
HUR_pool1-PVOL  HUR_pool1-tier1-14     34   210068
HUR_pool1-PVOL  HUR_pool1-tier1-15     35   210068
HUR_pool1-PVOL  HUR_pool1-tier1-16     36   210068
HUR_pool1-PVOL  HUR_pool1-tier0-9      41   210068
HUR_pool1-PVOL  HUR_pool1-tier0-10     42   210068
HUR_pool1-PVOL  HUR_pool1-tier0-11     43   210068
HUR_pool1-PVOL  HUR_pool1-tier0-12     45   210068
HUR_pool1-PVOL  HUR_pool1-1            62   210068
HUR_pool1-PVOL  HUR_pool1-2            63   210068
HUR_pool1-PVOL  HUR_pool1-3            64   210068
HUR_pool1-PVOL  HUR_pool1-4            65   210068
HUR_pool1-PVOL  HUR_pool1-5            66   210068
HUR_pool1-PVOL  HUR_pool1-6            67   210068
HUR_pool1-PVOL  HUR_pool1-7            68   210068
HUR_pool1-PVOL  HUR_pool1-8            69   210068
Active Directory Registration Tool

To simplify failover between clusters that are registered to the same Active Directory Domain, a special utility can provide the secondary site with the same Kerberos credentials as those established by the primary site. This is accomplished through an export from the primary site, which is imported at the secondary site using a privileged utility. At the end of the status command output, a config element summarizes configuration information; it indicates when there is a discrepancy between the Active Directory Domain on the primary and secondary clusters that can be addressed using this facility.

To perform the operation, you will need to open a case with Hitachi Global Support requesting a temporary 'dev' password from the Hitachi NAS Platform team to update the registry at the secondary site. This is accomplished by running the bin/importAD.sh command. The output below is an example of the process:

[manager@hnas-smu-a ~]$ bin/importAD.sh
MAC ID is 68-1B-AA-3E-DA-4E
Missing required dev-password argument. (AVN: 2)
Usage: bin/importAD.sh dev-password [node]

The support representative will need the 'MAC ID' from the output above, as well as the date on the SMU, in order to generate a dev password. Do not perform this procedure until you are ready to reboot the secondary cluster.

[manager@hnas-smu-a ~]$ bin/importAD.sh epq2LfOxnSAi06kp
Using 198.18.1.8 to access HNAS-B-Clustr
Validating 'dev' password
Wrote 363 B in 68 ms at 5.213 KB/s (5338 B/s)
Transferred local file conf/cifs-krb-update to server file tmp
Are you prepared for an immediate reboot of HNAS-B-Clustr? Enter 'YES' to proceed: YES
reg.write \S\rf\mmb\SMBDomainInfo.meta=0x2:0x1
reg.write = Success
reg.write \S\rf\mmb\SMBDomainInfo.trigger=
reg.write = Success
reg.write \S\rf\mmb\SMBDomainInfo\00000000000000000002=/0/0/0/1/0/0/0/4CORP/0/0/0/fcorp.bigo.local/0/0/0/f corp.bigo.local/0/0/0/fCORP.BIGO.LOCAL/0/0/0\10\80\de\f0\a2\980\baF\b9\ac/9X\1b\ee\87*/0/0/0/0
reg.write = Success
Applied AD registry import
Rebooting cluster HNAS-B-Clustr
Getconfig.sh - Support Incident Collection

When creating a support ticket, the support team will ask you to run a collection tool that assembles the information the escalation group may require. You can run the tool as follows:

bin/getconfig.sh

This script collects environmental data about the system, recent logs, databases, Hitachi NAS Platform diagnostics, and the last configuration record, and it also runs the HDS support team 'getconf.sh' script. A sample session is shown below:

[manager@SMU-3113 ~]$ bin/getconfig.sh HDS11113333
Starting getcfg collection...done
Updating package with HDRS information
##################################################################
Please Upload the following results of the getconfig:
/home/manager/SAVE/getconfig-HDRS-HDS11113333.12_25-00_10.tgz
to the GSC Support Site TUF: https://tuf.hds.com
*note: GSC Support prefers this method as it places an entry in the case management system and an email notification is sent to the entire support team when your upload is complete.
##################################################################

Copy the resulting file to your workstation using WinSCP or an equivalent tool, then open a browser, enter the case information, and upload the file at https://tuf.hds.com/upload.php.
Database Schema

These are the database schema, expressed as a mapping from table name to (column, type) pairs:

'smu': (('smu_key', 'INTEGER'), ('version', 'TEXT'), ('smu_ip', 'TEXT'), ('serial', 'TEXT'),
        ('micro', 'TEXT'), ('horcm', 'TEXT'), ('instance', 'INTEGER'), ('hur_creds', 'TEXT'),
        ('reachable', 'TEXT'), ('firmware', 'TEXT'), ('cci', 'TEXT'), ('cow', 'TEXT'),
        ('license', 'TEXT'), ('quorum_dev', 'TEXT'), ('quorum_path', 'TEXT'), ('patch', 'TEXT'),
        ('smu_vers', 'TEXT')),
'cluster': (('cluster_key', 'INTEGER'), ('cluster', 'TEXT'), ('node_count', 'INTEGER'),
        ('health', 'TEXT'), ('mode', 'TEXT'), ('uuid', 'TEXT'), ('admin_ip', 'TEXT'),
        ('passwd', 'TEXT'), ('smu_key', 'INTEGER'), ('max_evs', 'INTEGER')),
'cns': (('cns_key', 'INTEGER'), ('entry', 'TEXT'), ('type', 'TEXT'), ('target', 'TEXT'),
        ('local_cache', 'TEXT'), ('remote_cache', 'TEXT'), ('evsid', 'INTEGER'),
        ('smu_key', 'INTEGER')),
'node': (('node_key', 'INTEGER'), ('node_ip', 'TEXT'), ('node', 'TEXT'), ('cluster_key', 'INTEGER'),
        ('nodeid', 'TEXT'), ('evs_count', 'INTEGER'), ('cpu_load', 'INTEGER'), ('fs_ops', 'INTEGER'),
        ('fs_MBs', 'INTEGER'), ('rd_msec', 'INTEGER'), ('wr_msec', 'INTEGER'),
        ('stripe_msec', 'INTEGER'), ('smu_key', 'INTEGER'), ('wwns', 'TEXT'), ('port_state', 'TEXT')),
'pool': (('pool_key', 'INTEGER'), ('pool', 'TEXT'), ('online', 'TEXT'), ('status', 'TEXT'),
        ('freespace_pct', 'INTEGER'), ('capacity_GB', 'INTEGER'), ('provisioned_GB', 'INTEGER'),
        ('drives', 'TEXT'), ('node_key', 'TEXT'), ('fs_count', 'INTEGER'), ('type', 'TEXT'),
        ('smu_key', 'INTEGER'), ('cluster_key', 'INTEGER'), ('repl_OK', 'TEXT'), ('wwns', 'TEXT')),
'repl': (('pool_key', 'INTEGER'), ('pool', 'TEXT'), ('peer', 'TEXT'), ('type', 'TEXT'),
        ('journal', 'INTEGER'), ('arrays', 'TEXT'), ('luns', 'TEXT'), ('drives', 'TEXT'),
        ('state', 'TEXT'), ('capacity', 'TEXT'), ('status', 'TEXT'), ('role', 'TEXT'),
        ('access', 'TEXT'), ('tier0', 'TEXT'), ('tier1', 'TEXT'), ('peg', 'TEXT'),
        ('shadows', 'TEXT'), ('repl_state', 'TEXT'), ('iomode', 'TEXT'), ('pct', 'INTEGER'),
        ('max_journal_util', 'INTEGER'), ('avg_queue', 'TEXT'), ('max_queue', 'TEXT'),
        ('avg_delta_rt', 'TEXT'), ('avg_changes_sec', 'TEXT'), ('max_changes_sec', 'TEXT'),
        ('ports', 'TEXT'), ('smu_key', 'INTEGER'), ('cluster_key', 'INTEGER'), ('pathing', 'TEXT'),
        ('jid', 'TEXT'), ('paths', 'INTEGER')),
'evs': (('evs_key', 'INTEGER'), ('evsid', 'INTEGER'), ('evs', 'TEXT'), ('node_key', 'INTEGER'),
        ('nodeid', 'INTEGER'), ('status', 'TEXT'), ('enabled', 'TEXT'), ('fs_count', 'INTEGER'),
        ('ip_address', 'TEXT'), ('mask', 'TEXT'), ('port', 'TEXT'), ('smu_key', 'INTEGER'),
        ('security', 'TEXT')),
'fs': (('fs_key', 'INTEGER'), ('fsid', 'INTEGER'), ('fs', 'TEXT'), ('evs_key', 'INTEGER'),
        ('evsid', 'TEXT'), ('pool_key', 'INTEGER'), ('confine_mb', 'INTEGER'), ('size_mb', 'INTEGER'),
        ('mounted', 'TEXT'), ('used_mb', 'INTEGER'), ('used_pct', 'INTEGER'),
        ('snapshot_mb', 'INTEGER'), ('quota_mb', 'INTEGER'), ('thin', 'TEXT'), ('dedup', 'TEXT'),
        ('replication', 'TEXT'), ('fs_block', 'INTEGER'), ('fs_ops', 'INTEGER'),
        ('fs_MBs', 'INTEGER'), ('smu_key', 'INTEGER'), ('state', 'TEXT')),
'vivol': (('vivol_key', 'INTEGER'), ('fs_key', 'INTEGER'), ('fs', 'TEXT'), ('vivol', 'TEXT'),
        ('path', 'TEXT'), ('quota_used', 'INTEGER'), ('file_used', 'TEXT'), ('quota_limit', 'TEXT'),
        ('smu_key', 'INTEGER')),
'nfs': (('nfs_key', 'INTEGER'), ('fs_key', 'INTEGER'), ('fs', 'TEXT'), ('path', 'TEXT'),
        ('nfs', 'TEXT'), ('config', 'TEXT'), ('key', 'INTEGER'), ('clients', 'TEXT'),
        ('smu_key', 'INTEGER')),
'cifs': (('cifs_key', 'INTEGER'), ('fs_key', 'TEXT'), ('fs', 'TEXT'), ('path', 'TEXT'),
        ('cifs', 'TEXT'), ('settings', 'TEXT'), ('perms', 'TEXT'), ('user_limit', 'INTEGER'),
        ('config', 'TEXT'), ('key', 'TEXT'), ('smu_key', 'INTEGER'))
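Since each schema entry maps a table name to (column, type) pairs, the tables can be created directly in SQLite. A minimal sketch using two of the tables above; create_sql is an illustrative helper, not part of the product.

```python
import sqlite3

# Sketch: replay part of the schema into SQLite.  Two tables are shown
# for brevity; the full mapping works the same way.
SCHEMA = {
    'cluster': (('cluster_key', 'INTEGER'), ('cluster', 'TEXT'),
                ('node_count', 'INTEGER'), ('health', 'TEXT')),
    'evs': (('evs_key', 'INTEGER'), ('evsid', 'INTEGER'), ('evs', 'TEXT')),
}

def create_sql(table, columns):
    """Build a CREATE TABLE statement from (column, type) pairs."""
    cols = ', '.join(f'{name} {ctype}' for name, ctype in columns)
    return f'CREATE TABLE {table} ({cols})'

db = sqlite3.connect(':memory:')
for table, columns in SCHEMA.items():
    db.execute(create_sql(table, columns))
print(create_sql('evs', SCHEMA['evs']))
```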
hdrs.cfg Config File Values

Table 18. Disaster Recovery Solution Stanza Installation Information

[ipaddress]     Stanza title. The IP address of the disaster recovery solution server; this is an SMU if the disaster recovery solution is installed on an SMU. Typical value: 192.0.2.2
type            Installation type: 'dual' for a dual cluster, 'dual-backup' for the standby cluster, 'stretch' for a single (stretched) cluster, 'n/a' for a non-NAS Platform environment. Typical value: dual
smu             SMU IP address. This is the public primary SMU IP. On a backup cluster, the standby SMU is the primary SMU. Typical value: 192.0.2.2
admin           Admin IP address. This is the private administrative IP address of the cluster.
peeradmin       Peer admin IP address. This is the private administration IP address of the secondary cluster. Set to n/a on a stretched cluster, or on the backup cluster where it cannot reach the primary SMU.
peersmu         Peer SMU IP address. This is the public IP address of the SMU of the secondary cluster. Set to n/a on a stretched cluster.
serial          Serial. The serial number of the primary storage array. For a Hitachi VSP G1000, this must be six digits and start with a "3."
peerserial      Peer serial. The serial number of the secondary storage array. For a Hitachi VSP G1000, this must be six digits and start with a "3."
pattern         The pattern used to correlate copy groups. The pattern must have a minimum of 3 characters; for global-active device environments, start the pattern with GAD. Typical value: HUR
excludepattern  Regular expression pattern. Any pools matching this pattern are excluded.
backup          Backup IP address. This is the address of the disaster recovery solution backup server, usually the remote SMU.
hcsip           HCS IP address. The IP address of the Hitachi Command Suite server. If Command Suite is not deployed, set this value to n/a. Typical value: n/a
hcsport         HCS (Hitachi Device Manager) port. The service port of the Hitachi Device Manager server; use 2001 if SSL is not being configured. Typical value: 2001 or 2443
hcskey          Certificate (key) password. The password associated with the certificate. Only used with SSL.
hcscert         Certificate file. The file containing the Hitachi Device Manager server's certificate. Typical value: filename
Table 19 describes the information required for each array stanza. The typical values shown are examples; provide values appropriate to your environment.

Table 19. Array Stanza Information

[serial]        Array stanza title. Use the storage array serial number. For a Hitachi Virtual Storage Platform G1000, this must be six digits and start with a "3." For quorum servers, assign values starting with 8.
instance        Instance. Use 0 for the primary storage array and 1 for the secondary storage array. On the backup cluster, 0 is used for the secondary storage array. Typical value: 0 or 1
hcsinstance     HCS instance. Use 100 for the primary site and 101 for the secondary site when communicating with Hitachi Command Suite. If not using Command Suite, use n/a. If this is a quorum server, do not provide this key/value. Typical value: 100, 101, or n/a
raiduser        Raiduser. The name of the account created on the storage array for the disaster recovery solution. Typical value: hdrs-raidcom
raidpswd        Raidpswd. The password of the Raiduser account.
peer            Peer. The serial number of the peer storage array. For a Virtual Storage Platform G1000, this must be six digits and start with a "3." For quorum servers, omit this key/value.
type            Type. The type code for the system: Hitachi Unified Storage VM = M700; Hitachi Virtual Storage Platform = R700; Hitachi Virtual Storage Platform G1000 = R800. For quorum servers, omit this key/value. Typical value: M700, R700, or R800
ip              IP address(es). Provide one or two addresses for midrange quorum servers (comma separated).
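Putting Tables 18 and 19 together, a dual-cluster hdrs.cfg might look roughly like the sketch below. The stanza titles and keynames come from the tables, but every value is a placeholder, and the 'key = value' separator is an assumption, since the guide does not reproduce a literal file.

```
# Illustrative hdrs.cfg sketch only -- all values are placeholders.
[192.0.2.10]
type = dual
smu = 192.0.2.10
admin = 192.0.2.2
peeradmin = 192.0.2.3
peersmu = 192.0.2.11
serial = 312345
peerserial = 312346
pattern = HUR
backup = 192.0.2.11
hcsip = n/a
hcsport = 2001

[312345]
instance = 0
hcsinstance = 100
raiduser = hdrs-raidcom
raidpswd = replace-me
peer = 312346
type = R800
ip = 192.0.2.20
```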
Corporate Headquarters 2845 Lafayette Street Santa Clara, CA 95050-2639 USA www.HDS.com
community.HDS.com
Regional Contact Information Americas: +1 408 970 1000 or [email protected] Europe, Middle East and Africa: +44 (0) 1753 618000 or [email protected] Asia Pacific: +852 3189 7900 or [email protected]
HITACHI is a trademark or registered trademark of Hitachi, Ltd. Microsoft, Windows, and PowerShell are registered trademarks or trademarks of Microsoft Corporation. All other trademarks, service marks, and company names are properties of their respective owners. Notice: This document is for informational purposes only, and does not set forth any warranty, expressed or implied, concerning any equipment or service offered or to be offered by Hitachi Data Systems Corporation. AS-475-00 February 2016.