Huawei FusionSphere 3.1
Technical White Paper on Reliability
Issue V1.0
Date 2014-04-20
HUAWEI TECHNOLOGIES CO., LTD.
Copyright © Huawei Technologies Co., Ltd. 2014. All rights reserved. No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.
Trademarks and Permissions: The Huawei logo and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.
Notice The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied. The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.
Huawei Technologies Co., Ltd.
Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129, People's Republic of China
Website: http://enterprise.huawei.com
About This Document

Purpose
This document describes the system reliability of FusionSphere.
Intended Audience
This document is intended for:
Marketing engineers
Sales engineers
Distributors
Symbol Conventions
The symbols that may be found in this document are defined as follows:

DANGER: Indicates an imminently hazardous situation which, if not avoided, will result in death or serious injury.
WARNING: Indicates a potentially hazardous situation which, if not avoided, could result in death or serious injury.
CAUTION: Indicates a potentially hazardous situation which, if not avoided, may result in minor or moderate injury.
NOTICE: Indicates a potentially hazardous situation which, if not avoided, could result in equipment damage, data loss, performance deterioration, or unanticipated results. NOTICE is used to address practices not related to personal injury.
NOTE: Calls attention to important information, best practices, and tips. NOTE is used to address information not related to personal injury, equipment damage, and environment deterioration.
Change History
Changes between document issues are cumulative. The latest document issue contains all the changes in earlier issues.
Issue V1.0 (2014-04-20): first release.
Contents

About This Document

1 System Architecture Description
1.1 Overview of Huawei FusionSphere Solution

2 Architecture Reliability
2.1 Redundant Network Paths
2.2 Detached-Plane Network Communication
2.3 Management Node HA
2.4 Traffic Control
2.5 Fault Detection
2.6 Data Consistency Check
2.7 Management Data Backup and Restoration
2.8 Global Time Synchronization

3 FusionCompute Reliability
3.1 VM Live Migration
3.2 Storage Cold and Live Migration
3.3 VM Load Balancing
3.4 VM HA
3.5 VM Fault Isolation
3.6 VM OS Fault Detection
3.7 Black Box
3.8 Virtualized Deployment of Management Nodes
3.9 Host Fault Recovery

4 FusionStorage Reliability
4.1 Data Store Redundancy Design
4.2 Multi-Failure Domain Design
4.3 Data Security Design
4.4 Tight Data Consistency
4.5 NVDIMM Power Failure Protection
4.6 I/O Traffic Control
4.7 Disk Application Reliability
4.8 Metadata Reliability

5 FusionManager Reliability
5.1 Active and Standby Management Nodes Architecture
5.2 Data Consistency Between Active and Standby Hosts
5.3 Real-Time Backup of Management Data

6 Network Reliability
6.1 Multipathing Storage Access
6.2 Traffic Control over Virtualized Networks
6.3 NIC Load Balancing
6.4 Switch Stacking
6.5 Switch Interconnection Redundancy
6.6 VRRP

7 Hardware Reliability
7.1 Memory Reliability
7.2 Hard Disk Reliability
7.3 Online Scheduled Disk Fault Detection and Precaution
7.4 Power Reliability
7.5 System Detection
7.6 Onboard Software Reliability
1 System Architecture Description
1.1 Overview of Huawei FusionSphere Solution

Figure 1-1 Overview of Huawei FusionSphere data center solution
The Huawei FusionSphere solution consolidates multiple applications in the service system, improving server utilization and system reliability, reducing purchase costs, and increasing maintenance efficiency. Elastic hosts provide high-quality, pay-per-use elastic services. Users can request resource scheduling and query resource utilization by themselves, which significantly improves the response speed, and they are charged automatically. Compared with traditional service modes, the FusionSphere solution reduces costs while providing a better user experience.
2 Architecture Reliability
Architecture reliability covers the reliability of services and of the public platform, both across data centers and across subsystems within a data center.
2.1 Redundant Network Paths

The network of the FusionSphere solution can be divided into the core layer, convergence layer, access layer, and virtual network layer.

Switching devices at the core layer enable communication between data centers and connect FusionSphere to external networks. The S93xx switch cluster provides redundant connections to firewalls/NAT devices and to the aggregation switches of data centers.

Switching devices at the convergence layer are deployed in the equipment room of data centers. They converge traffic from, and provide access for, the switches at the access layer, and exchange data with switches at the core layer. The S93xx switch cluster provides redundant connections to switching devices at the core layer and at the access layer.

Access switches provide access for the servers in their own cabinets. The stacked S53xx switches provide redundant connections to switching devices at the convergence layer and to the internal virtual network layer.

The virtual network layer, located in the server, implements communication among VMs on the server and between VMs and external networks. On servers, multiple network interface cards (NICs) are bound together to prevent service interruption due to a fault on a single NIC.
Figure 2-1 Configuration of redundant network paths
2.2 Detached-Plane Network Communication

The cloud computing system is divided into the management plane, storage plane, and service plane. FusionSphere employs the detached-plane architecture to ensure the reliability and security of data on each network plane. The planes are isolated by VLANs so that a fault on one plane has no impact on the others. For example, if the management plane becomes faulty, the service plane can still be used to access VMs. In addition, the system supports VLAN-based priority settings. By assigning the highest priority to internal management and control packets, the administrator and other users can manage and control the system at any time. Figure 2-2 shows the network connection among servers, switching devices at the access layer, and switching devices at the convergence layer.
Figure 2-2 Detached-plane network communication
In servers, the management plane, service plane, and storage plane can be deployed on different physical NICs and connected to different switching devices at the access layer through NIC binding and categorization, thereby implementing network isolation at the physical layer.
2.3 Management Node HA

The active and standby management nodes of FusionSphere use a heartbeat detection mechanism. The standby node monitors the health status of the active node in real time. Once a fault is detected, the standby management node takes over services from the active node.

A software watchdog monitors the processes on the management nodes. If the watchdog detects that a process is in a deadlock or an infinite loop, it restarts the process. If the problem persists after the restart, the watchdog initiates a switchover between the active and standby nodes and sends an alarm, ensuring the availability of the process.
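The exact probe intervals and takeover logic are internal to FusionSphere; the following Python sketch only illustrates the two mechanisms described above, with hypothetical probe(), failover(), restart(), switchover(), and alarm() hooks:

```python
import time

HEARTBEAT_INTERVAL = 2   # seconds between heartbeat probes (assumed value)
HEARTBEAT_MISSES = 3     # missed probes before declaring the active node dead

def monitor_active_node(probe, failover):
    """Standby-side heartbeat loop: probe() returns True while the
    active node is healthy; failover() promotes this standby node."""
    misses = 0
    while True:
        if probe():
            misses = 0
        else:
            misses += 1
            if misses >= HEARTBEAT_MISSES:
                failover()            # take over services from the active node
                return
        time.sleep(HEARTBEAT_INTERVAL)

def watchdog(process, restart, switchover, alarm):
    """Software watchdog: restart a hung process once; if it is still
    unhealthy, trigger an active/standby switchover and raise an alarm."""
    if not process.is_healthy():      # e.g. deadlock or infinite loop detected
        restart(process)
        if not process.is_healthy():
            switchover()
            alarm("process %s unhealthy after restart" % process.name)
```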
Figure 2-3 HA of the management nodes
Management nodes work in active/standby mode to manage services in the system. If both the active and standby management nodes become faulty, new service operations, for example, creating or deleting VMs, are adversely affected. However, the services of VMs that are already running properly are not interrupted.
2.4 Traffic Control

The traffic control mechanism helps the management node provide highly available concurrent services without system collapse due to excessive traffic. Traffic control is enabled for the Elastic Service Controller (ESC) and Virtualization Resource Management (VRM) access points to prevent excessive load on the front end and enhance system stability. To prevent service failures due to excessive traffic, this function is also enabled for each key internal process in the system, for example, traffic control on image download, authentication, VM services (including VM migration, VM high availability, VM creation, hibernation, wake-up, and stop), and operation and maintenance (O&M).
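The white paper does not specify the traffic control algorithm; a token bucket is one common way to realize this kind of admission control. The sketch below is illustrative only, and the rate and capacity values are invented:

```python
import time

class TokenBucket:
    """Simple token-bucket limiter: admit a request only if a token is
    available; tokens refill at a fixed rate up to a burst capacity."""
    def __init__(self, rate, capacity):
        self.rate = rate             # tokens added per second
        self.capacity = capacity     # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True              # admit the request
        return False                 # reject: the front end is overloaded

# Separate buckets could guard each access point or internal service,
# e.g. image download, authentication, or VM migration requests.
vrm_bucket = TokenBucket(rate=100, capacity=200)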
2.5 Fault Detection

The system provides fault detection, alarm reporting, and a web-based tool for displaying faults. While a cluster is running, users can monitor cluster management and load balancing with a data visualization tool to detect faults, including load balancing problems, abnormal processes, and hardware performance deterioration trends. Users can view historical records to obtain information about daily, weekly, and even annual hardware resource consumption.

By running a probe program on each monitored node, including customized VMs, the O&M module of FusionSphere collects key indicators, such as the CPU usage, network traffic, and memory data of the monitored node or VM. The O&M module also detects all exceptions in the system, such as process crashes, management and storage link faults, node breakdowns, and system resource overload.

In addition, the FusionSphere solution provides a set of health check tools and generates check reports for each component for technical support and maintenance engineers. The tools can check the current information and running status of the system for use in site deployment, preventive maintenance, or upgrades.
2.6 Data Consistency Check

FusionSphere automatically audits and restores key resource data, periodically audits VMs, and checks volume information to ensure volume data and status consistency. When detecting an exception, FusionSphere automatically generates a log and provides maintenance instructions.
2.7 Management Data Backup and Restoration

FusionSphere can connect to a third-party FTP server and supports periodic local and remote backup of the configuration and service data on management nodes. If the management node service becomes abnormal and cannot be restored automatically, it can be restored rapidly using the local data backup. If a devastating fault occurs, in which the active and standby management nodes fail at the same time and cannot be restored by restarting, they can be restored using the remote data backup within one hour. This service reduces the restoration time.
2.8 Global Time Synchronization

The FusionSphere solution provides an internal clock synchronization mechanism to ensure time consistency among all internal components, such as IP storage area network (IP SAN) devices, switches, management nodes, computing nodes, server baseboard management controllers (BMCs), and firewalls. The FusionSphere solution uses a precise external clock source to ensure time consistency across the entire system. Global time synchronization allows proper communication between all network elements (NEs) and facilitates system maintenance.
3 FusionCompute Reliability
3.1 VM Live Migration

The system supports VM live migration without interrupting services. The cloud management system creates an image for the VM on the destination server and synchronizes the VM's data with the source server. The data to be synchronized includes the status of the memory, registers, stack, vCPUs, and storage devices, as well as dynamic information about the virtual hardware. The hypervisor rapidly duplicates memory data to keep the memory synchronized and prevent service interruption during VM migration. Meanwhile, shared storage ensures persistent data consistency before and after VM migration. Figure 3-1 shows the mechanism of VM live migration.

Figure 3-1 VM live migration
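The document does not detail the memory synchronization algorithm; iterative pre-copy is the technique commonly used for live migration, and the following sketch illustrates it. The source/destination hypervisor interfaces (create_image, dirty_pages, and so on) are hypothetical:

```python
def live_migrate(vm, source, destination, max_rounds=30, dirty_threshold=1000):
    """Illustrative pre-copy loop: copy all memory once, then repeatedly
    re-copy pages dirtied during the previous round; when the dirty set
    is small enough, pause the VM briefly and transfer the remainder."""
    destination.create_image(vm)                 # empty VM container on target
    destination.copy(source.all_memory_pages(vm))  # first full memory pass
    for _ in range(max_rounds):
        dirty = source.dirty_pages(vm)           # pages written since last pass
        if len(dirty) < dirty_threshold:
            break
        destination.copy(dirty)
    source.pause(vm)                             # brief stop-and-copy phase
    destination.copy(source.dirty_pages(vm))     # final dirty pages
    destination.copy(source.device_state(vm))    # vCPU, registers, devices
    destination.resume(vm)                       # VM continues on the target
    source.destroy(vm)
```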
Live migration allows VMs scattered across different servers to be consolidated onto a few servers, or even one server, when traffic is light. Idle servers can then be turned off, reducing costs for customers and saving energy.

VM live migration also ensures high reliability of the customer system. If a fault occurs on a running physical machine, its services can be migrated to other healthy machines before the situation worsens.

Hardware can be upgraded online without interrupting services. Before upgrading a physical machine, users can migrate all its VMs to other machines; after the upgrade is complete, users can migrate the VMs back, so services are not interrupted.

VM live migration applies to the following scenarios:
Manual VM migration to any idle physical server as required
Batch VM migration to any idle physical server based on the resource utilization
3.2 Storage Cold and Live Migration

FusionSphere offers cold migration and live migration for VM disks. Cold migration moves VM disks from one data store to another while the VM is stopped. Live migration moves VM disks from one data store to another without service interruption.

Figure 3-2 Mechanism of storage cold migration
Figure 3-3 Mechanism of storage live migration
3.3 VM Load Balancing

When a new VM is started, VMs are live migrated, or computing nodes are remotely restarted due to faults, a system working in load balancing mode dynamically distributes the load based on the load status of each physical computing server in the cluster.
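As an illustration of this kind of placement decision, the sketch below picks the least-loaded host that can still fit a VM. The host attributes and the scoring rule are hypothetical; the real scheduler's policy is not described in this paper:

```python
def pick_host(hosts, vm_demand):
    """Choose the least-loaded host in the cluster that can still fit
    the VM's CPU and memory demand; return None if none qualifies."""
    candidates = [h for h in hosts
                  if h.free_cpu >= vm_demand.cpu and h.free_mem >= vm_demand.mem]
    if not candidates:
        return None
    # Load score: the more utilized resource dominates, so a host that
    # is nearly full on either CPU or memory is a poor target.
    return min(candidates, key=lambda h: max(h.cpu_usage, h.mem_usage))
```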
3.4 VM HA

If a physical CNA server breaks down or restarts abnormally, the system can migrate the VMs with high availability (HA) enabled to other computing servers, ensuring rapid restoration of the VMs. A cluster can house thousands of VMs. Therefore, if a computing server breaks down, the system migrates its VMs to different servers based on the network traffic and destination server load, preventing network congestion and overload on any single destination server.

Figure 3-4 VM HA
VM HA is triggered if the heartbeat connection between the VRM and a CNA is down for 30 seconds or if a VM suddenly works abnormally. A lockout mechanism at the storage layer prevents one VM instance from being started concurrently on multiple CNAs.
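The paper does not describe how the lockout mechanism is implemented; an exclusive file lock on the shared data store is one classic way to achieve the same guarantee. A minimal, POSIX-only sketch, assuming a per-VM lock file on shared storage:

```python
import fcntl

def try_acquire_vm_lock(lock_path):
    """Take an exclusive, non-blocking lock on a per-VM file that lives
    on the shared data store. Only one CNA can hold it, so a VM cannot
    be started on two hosts at once, even during HA recovery."""
    f = open(lock_path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f          # keep the handle open for the VM's lifetime
    except BlockingIOError:
        f.close()
        return None       # another CNA already runs this VM
```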
CNA nodes can recover from power-off failures: after the restart, service processes automatically resume, and the VMs that were running on the CNAs are migrated to other computing nodes.
3.5 VM Fault Isolation

With virtualization technology, one physical server can be virtualized into multiple VMs. The VMs are isolated from each other: if one VM fails, the other VMs continue to work properly. User experience on a VM is the same as on a physical machine.

Figure 3-5 Protocol stack in a hypervisor
In short, any operation performed on a VM has no impact on the other VMs on the same physical server or on the virtualization platform.
3.6 VM OS Fault Detection

The system can detect and rectify VM OS faults, such as the blue screen of death (BSOD) on Windows VMs and the panic state of Linux VMs. If a VM becomes faulty, the system automatically restarts the faulty VM on the physical server where it is located or on another physical server, depending on the preset policy. Users can also configure the system to ignore the faults. VM OS fault detection provides the following benefits:
Enhances system self-recovery abilities, reducing the reliance on maintenance personnel.
Minimizes the mean time to repair (MTTR), improving system reliability.
3.7 Black Box

The virtualization software and virtualization management software support a black box. The black box collects the system information available before a critical fault or system crash occurs and backs it up to a local directory for future fault locating. Specifically, the black box collects and stores kernel logs and the diagnosis information provided by the diagnosis tool before the OS exits due to unexpected errors, enabling maintenance engineers to export the data for analysis after the system breaks down. To ensure that the stored information is not lost, the black box uses netpoll to send the collected information to a remote server in real time. If the network is abnormal, the black box stores the information on the local server.
3.8 Virtualized Deployment of Management Nodes

The management software of the FusionSphere solution can be deployed on VMs, that is, management nodes support virtualized deployment. Besides redundancy, live migration, and HA, management node VMs support the following functions:
IP SAN storage + local storage, improving system reliability.
Automatic startup upon host power-on. After a host is powered on, its VRM management node VMs automatically start. If both active and standby management node VMs fail to start, FusionManager provides VRM heartbeat detection and alarm sending, and FusionCompute provides a tool for restoring the management nodes.
3.9 Host Fault Recovery

If an entire device or a CNA node becomes faulty and cannot be recovered by a restart or by following the alarm handling guide, FusionCompute allows users to replace the node and restore its original services and configurations with one click or a few commands. The replaceable components include the hard disk, mainboard, NIC, and RAID card. After a host is recovered, the VMs bound to it can automatically start again, and public configurations, such as the network, storage, computing, and NTP configurations, are recovered as well.
4 FusionStorage Reliability
By deeply integrating computing and storage, FusionStorage, Huawei-developed distributed storage software, delivers optimal performance, sound reliability, and high cost-effectiveness. It can be deployed on servers to consolidate the local disks of all servers into one virtual storage resource pool. FusionStorage can therefore completely replace external storage area network (SAN) devices in some scenarios.
4.1 Data Store Redundancy Design

User data can be stored in two or three copies, ensuring data reliability. As shown in Figure 4-1, three nodes form a resource pool in which storage data is kept in two copies. The active data copy is saved on one node, and the standby data copies are evenly distributed on the other two nodes. In this case, no data is lost in the event of a single point of failure.

Figure 4-1 Data saved as two copies in the FusionStorage system
In the dual-copy scenario, if one disk in a FusionStorage resource pool becomes faulty, no data in the system is lost, and services continue to run properly.

In the three-copy scenario, if two disks in a FusionStorage resource pool become faulty concurrently, no data in the system is lost, and services continue to run properly.
The system data persistence reaches 99.99% in the dual-copy scenario and 99.99999% in the three-copy scenario.
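FusionStorage's actual placement algorithm is not given here; the sketch below uses simple hash-based placement to illustrate how the copies of a block can be forced onto distinct nodes, so that a single node failure never destroys all copies. The node names are hypothetical:

```python
import hashlib

def place_copies(block_id, nodes, copies=2):
    """Map a data block to `copies` distinct nodes: hash the block ID
    onto a sorted node ring, put the active copy there, and spread the
    standby copies over the following nodes so no node holds two copies."""
    ring = sorted(nodes)
    start = int(hashlib.md5(str(block_id).encode()).hexdigest(), 16) % len(ring)
    return [ring[(start + i) % len(ring)] for i in range(copies)]

# Example: with three nodes and two copies, the copies of any block
# always land on two different nodes, surviving a single node failure.
print(place_copies("vol1-block42", ["nodeA", "nodeB", "nodeC"]))
```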
4.2 Multi-Failure Domain Design

FusionSphere regards one resource pool as a failure domain by default. As shown in Figure 4-2, if two resource pools are created in the FusionStorage system, two independent failure domains are available. If one disk fails in each of several resource pools (failure domains) at the same time, no single pool experiences a two- or three-point failure, so no data in the system is lost. This design greatly reduces the probability of two- or three-point failures.

Figure 4-2 Multi-failure domains in the FusionStorage system
4.3 Data Security Design

In a FusionStorage resource pool, data store DR can be configured at the server or rack level, which effectively reduces the risk that two or three copies of the same data fail concurrently.
1. Server-based security: The standby data copy is distributed only on another server. Any disk fault on the same server does not cause data loss in the system, ensuring service continuity. The server-based security level is the default value. Figure 4-3 shows the copy distribution by server.
Figure 4-3 Server-based security of FusionStorage
2. Rack-based security: The standby data copy is distributed only on the nodes of other racks. Any blade or disk fault within the same rack does not cause data loss in the system, ensuring service continuity. Figure 4-4 shows the copy distribution by rack.

Figure 4-4 Rack-based security of FusionStorage
4.4 Tight Data Consistency

A tight consistency and replication protocol is used to ensure the consistency of the multiple data copies. Only after all copies are successfully written to disk does the system report the write as successful. In most cases, FusionStorage ensures that data read from any copy is the same. If a disk is temporarily faulty, FusionStorage stops writing data to the copies on that disk until the fault is rectified; it then restores the data in those copies and resumes writing to them. If a disk can no longer be used, FusionStorage removes it from the cluster and finds another available disk for the copies. Using the rebalance mechanism, FusionStorage balances data distribution among all disks.
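A minimal sketch of this write-all-copies rule, with hypothetical node objects; the real replication protocol is more involved, as it also handles the temporary-fault and rebalance cases described above:

```python
def replicated_write(block, data, replicas):
    """Strong-consistency write: the write is acknowledged only after
    every copy has been committed to disk; any failed copy is marked
    out of sync so it is repaired before serving reads again."""
    results = [node.write(block, data) for node in replicas]
    if all(results):
        return "success"                   # safe: every copy is identical
    for node, ok in zip(replicas, results):
        if not ok:
            node.mark_out_of_sync(block)   # catch up after the fault is fixed
    return "failure"
```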
4.5 NVDIMM Power Failure Protection

Some key dynamic data, such as metadata, is held in memory while FusionStorage is running and would be lost if the server were powered off unexpectedly. To prevent such data loss, FusionSphere uses non-volatile dual in-line memory modules (NVDIMMs), which provide fast access and preserve data integrity in the case of power failures.
4.6 I/O Traffic Control

FusionSphere supports overload I/O traffic control. If I/O traffic is overloaded, FusionSphere preferentially guarantees high-priority services by disabling low-priority services based on the traffic control algorithm and policy. This prevents the processing delays, reduced service success rates, and system resets that may occur due to insufficient resources.
4.7 Disk Application Reliability

FusionSphere supports hard disk S.M.A.R.T. detection, detection of slowly-rotating and fast-rotating disks, disk SCSI fault handling, hard disk hot swap and identification handling, and disk scanning. It also allows upper-layer services to conduct read repair, remove or reconstruct a disk, mark a bad block, scan disks for valid data, handle S.M.A.R.T. threshold violations, and handle slowly-rotating disks (removing them after pre-reconstruction).
Read Repair
When a data read fails, FusionStorage automatically identifies the failure location. If the data cannot be read from a disk sector, FusionStorage retrieves the data from another copy and writes it back into the original disk sector. The Read Repair feature can repair most sector read failures. If a failure persists after read repair, the system selects another disk for the copy and removes the failed disk from the cluster.
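A sketch of the read-repair idea under assumed replica interfaces (the read/write methods are hypothetical); the real feature additionally escalates to disk removal when repair keeps failing:

```python
def read_with_repair(block, primary, other_replicas):
    """Read repair sketch: if a sector read fails on one disk, fetch the
    data from another healthy copy and write it back over the bad sector."""
    data = primary.read(block)
    if data is not None:
        return data
    for replica in other_replicas:
        data = replica.read(block)
        if data is not None:
            primary.write(block, data)    # rewrite the failed sector
            return data
    raise IOError("no healthy copy of block %s" % block)
```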
Block Status Table (BST)
If a bad sector is found when the system scans disks or reads data, an I/O error (EIO) is reported. The system then attempts to read the data from another copy and writes it back into the original disk sector. If the other copy is unavailable, the system marks the bad block in the BST and restores the lost data in the bad block using upper-layer applications.
Disk removal and reconstruction
S.M.A.R.T. data can reveal disk errors, such as WP, ABRT, and DF errors, and report a special EIO to FusionStorage. If only one copy is available, FusionStorage does not remove the disk but performs the procedure for handling a dual-disk failure. If two copies are available, FusionStorage removes the disk and reconstructs the data.
Disk scan for valid data
FusionStorage scans disks to prevent silent data corruption. If the scan fails due to a bad sector (an extended EIO is reported), FusionStorage performs a fine-grained scan to locate the bad sector and performs read repair on it. If the read repair fails, FusionStorage marks the bad block in the BST.
Handling S.M.A.R.T. threshold violations and slowly-rotating disks (pre-reconstruction and disk removal)
Upon detecting that a disk exceeds the S.M.A.R.T. threshold or rotates slowly, FusionStorage first migrates the primary partition away and pre-constructs another copy (three copies exist during this process: the original two plus the new one). After the copy is constructed, FusionStorage removes the disk that exceeds the threshold or rotates slowly.
4.8 Metadata Reliability

The metadata of volumes and snapshots is stored in two metadata volumes, each of which has two copies. The system therefore contains four copies altogether, ensuring high reliability of the metadata.
5 FusionManager Reliability
5.1 Active and Standby Management Nodes Architecture

The FusionSphere management system works in active/standby mode. The active node provides services through a floating IP address. If a process on the active node is faulty, or the OS on the active node or the host breaks down, the standby node takes over service processing. During the switchover, the floating IP address is reconfigured and the MAC address is updated on the gateway. All processes monitored on the original active node are started on the standby node to provide services. The active and standby management nodes use a heartbeat detection mechanism: the standby node monitors the health status of the active node in real time, and once a fault is detected, it takes over services from the active node.
5.2 Data Consistency Between Active and Standby Hosts

FusionSphere uses databases working in active/standby mode. The active database performs data read and write operations. Any change to the data in the active database is synchronized to the standby database. To preserve the performance of the active database, synchronization between the active and standby databases is asynchronous, which minimizes the data lost if an active/standby database switchover occurs.
5.3 Real-Time Backup of Management Data

Manual backup is performed before an important operation, such as a system upgrade or critical data modification, is performed on FusionSphere. If the important operation fails or does not achieve the expected result, the data backup can be used to restore the FusionSphere system, minimizing the impact on services. Therefore, the data of the management node needs to be backed up in advance. FusionManager supports the following functions to implement management data backup:
Interworking with third-party FTP servers to back up management data.
Uploading the management data backups of each component to the third-party FTP servers.
Instantly backing up the management data of components, such as FusionCompute and FusionManager.
Querying the backup status.
6 Network Reliability
The network subsystem takes four measures to enhance system reliability:
Uses the NIC binding technology to improve the availability of server ports.
Uses the switch stacking technology to virtualize two switches into one, improving the link utilization and the reliability of access switches.
Uses the Smart Link technology to connect aggregation switches.
Uses Virtual Router Redundancy Protocol (VRRP) to deploy active and standby routers at the core router side, improving the availability of the core network.
Figure 6-1 shows the network of the data center.

Figure 6-1 Data center network (the figure shows terminals reaching the data center over the Internet and intranet through firewalls, SSL VPN gateways, and load balancers; the service core, convergence, and access layers connect to computing resources (VMs on UVP), storage access and convergence, backup resources, and disaster recovery resources; the legend distinguishes the service, management, and storage networks, network connections, and optional devices)
The network is divided into three layers:
Access layer
Connect servers and storage devices to the switches at the access layer in the uplink.
Issue V1.0 (2014-04-20)
Huawei Proprietary and Confidential Copyright © Huawei Technologies Co., Ltd.
18
Huawei FusionSphere 3.1 Technical White Paper on Reliability
6 Network Reliability
Deploy six NICs on each server, two each for the service plane, management plane, and storage plane, to ensure link redundancy. Add the management, service, and storage planes to VLANs on the access switches to isolate them. You are advised to use switch stacking to simplify the networking and improve network reliability. The service plane network carries the service data of VMs. The management plane network transmits internal management messages between management servers and resource servers. The storage plane network transmits data between servers and disk arrays.
Convergence layer
Connect access switches to the switches at the convergence layer in the uplink. You are advised to configure aggregation switches to work in cluster mode. Connect access switches to aggregation switches through Eth-Trunk ports. After the aggregation switches are stacked, the VRRP function is not required. If aggregation switches are required to provide the gateway function, set the user gateway to the IP address of the VLANIF interface.
Core layer
Connect aggregation switches to the switches at the core layer in the uplink. You are advised to configure core switches to work in cluster mode. Connect core switches to upper-layer devices through Open Shortest Path First (OSPF) or static routes. If they are connected through OSPF, the addresses advertised by OSPF include the interconnection addresses of the core switches, direct route addresses, and loopback addresses. If they are connected through static routes, the VRRP address is used as the gateway.
6.1 Multipathing Storage Access

Computing nodes support redundant deployment of storage initiator modules, and VMs on these nodes can access the storage system using standard protocols, such as iSCSI. The NIC load balancing, switch stacking, and clustering technologies provide physically redundant storage paths.
Figure 6-2 Multipathing access to data stores
Figure 6-2 shows the multipathing access process when computing nodes communicate with storage nodes. Every VM has at least two paths to its attached virtual volumes. Multipathing software controls the multipathing access and implements service switchover upon failures. This prevents the service interruption that could be caused by a single point of failure and ensures system reliability.
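As an illustration of path failover, the sketch below tries each redundant path in turn; the path objects and their methods are hypothetical stand-ins for the multipathing software:

```python
def multipath_read(volume, offset, length, paths):
    """Try each redundant storage path in turn; the first healthy path
    serves the I/O, so a single path failure never interrupts service."""
    for path in paths:                 # e.g. two NICs via two stacked switches
        if not path.is_alive():
            continue
        try:
            return path.read(volume, offset, length)
        except IOError:
            path.mark_failed()         # probe and reinstate the path later
    raise IOError("all storage paths to volume %s failed" % volume)
```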
6.2 Traffic Control over Virtualized Networks

Traffic control over virtualized networks allows users to configure the outbound bandwidth based on the network plane or on a virtual NIC.

1. Bandwidth control based on the network plane
Figure 6-3 QoS on the network plane
The management plane, storage plane, and service plane are each allocated specified bandwidth out of the physical bandwidth resources. Traffic congestion on one plane does not affect I/O on the other planes. Administrators can configure the assured bandwidth (only when the host uses an iNIC), the upper bandwidth limit, and the bandwidth priority to implement network I/O control. The assured bandwidth guarantees basic transmission on each network plane even under extreme network congestion. System administrators can configure proper assured bandwidths for the network planes based on the actual service scenario.

2. Bandwidth control based on a virtual NIC

The assured bandwidth, upper bandwidth limit, and bandwidth priority can be configured for a host equipped with an iNIC to ensure the quality of communication between VMs and to prevent VMs from affecting one another. The administrator can configure a virtual interface and set the assured bandwidth for a virtual NIC of a VM, ensuring high communication quality for the VM even under network congestion. The administrator can set the upper bandwidth limit of virtual NICs to cap the maximum bandwidth of a VM. The bandwidth priority allows a VM with a higher priority to obtain more bandwidth.
6.3 NIC Load Balancing

Multiple NICs on a physical server work in bonding mode for reliability and load balancing. The NICs are bound into one logical NIC so that traffic accessing the server is balanced across the physical NICs. This absorbs traffic bursts from concurrent access and ensures stable, smooth access to the server. If one NIC becomes faulty, the other NICs immediately take over its traffic without service interruption. Binding multiple NICs into an array increases the inbound and outbound bandwidth of the server, implements load balancing, and enhances disaster tolerance, avoiding transmission congestion on the server and service interruption caused by the failure of a single NIC.
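Flow-hash slave selection is one common bonding policy; whether FusionSphere's NIC binding uses this exact policy is not stated, so the following sketch is illustrative, with a hypothetical packet structure:

```python
import hashlib

def select_nic(packet, nics):
    """Hash-based slave selection for a bonded NIC: packets of one flow
    always use the same physical NIC (preserving packet order) while
    different flows spread across all healthy NICs."""
    healthy = [n for n in nics if n.link_up]   # failed NICs drop out instantly
    if not healthy:
        raise RuntimeError("all bonded NICs are down")
    flow = "%s-%s-%s-%s" % (packet.src_ip, packet.dst_ip,
                            packet.src_port, packet.dst_port)
    index = int(hashlib.md5(flow.encode()).hexdigest(), 16) % len(healthy)
    return healthy[index]
```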
6.4 Switch Stacking

Stackable switches at the same location can be connected through stacking cables or high-speed uplink ports to form one reliable logical switch. The S5300 access switch is stacked through stacking ports. The switch stacking mechanism improves switch reliability, enables centralized management and maintenance of switches, and reduces maintenance costs.

The switch stacking technology allows two stacked physical switches to act as one switch, with no trunk interface configured between them. The two physical switches in a stack work in active/standby mode: if one switch is faulty, the other takes over services. Before a stack is established, each switch is a standalone entity with its own IP address and must be managed separately. Stacking lets the switches work as one logical device with a single IP address, which can be used to manage and maintain all switches in the stack.

The stacking protocol elects the active, standby, and slave switches to implement data backup and active/standby switchover. Switches can be connected in a ring or link topology. The active switch, elected using the stack management protocol, manages the stack: it allocates IDs to stack members, collects information about the stacking topology, and sends the topology information to the members. The active switch also designates its standby switch, which takes over as the active switch to manage the stack if the original one becomes faulty.
6.5 Switch Interconnection Redundancy

Smart Link, also known as backup link, enables reliable and efficient backup and switchover for dual uplinks. Compared with the Spanning Tree Protocol (STP), Smart Link provides better convergence performance. Compared with the Rapid Ring Protection Protocol (RRPP) and Smart Ethernet Protection (SEP), Smart Link simplifies configuration.

Dual-uplink networking is one of the most common networking modes. In this mode, STP is traditionally used to block the standby link for backup; if the active link is faulty, packets are switched to the standby link. This meets the requirement for redundancy and backup, but the performance is not satisfactory, because convergence takes seconds even with rapid STP migration. This is an unfavorable performance indicator for high-end Ethernet switches applied to carrier-class core networks.

To resolve these problems, the FusionSphere solution introduces the Smart Link technology to implement active/standby link redundancy and fast switchover for dual-uplink networking. The solution is dedicated to dual-uplink networking, ensuring performance and simplifying configuration. It also introduces the Monitor Link technology for monitoring the uplink.
6.6 VRRP

The Virtual Router Redundancy Protocol (VRRP) is a fault tolerance protocol. With this protocol enabled, several routers can be grouped into one virtual router. If the next-hop switch of a host becomes faulty, another router rapidly takes over services, ensuring service continuity and reliability.

VRRP unites a group of routing devices in a LAN into a VRRP backup group, which is equivalent to one virtual router. A host on the LAN needs to know only the IP address of the virtual router, not the IP addresses of the individual devices. After the default gateway of the host is set to the IP address of the virtual router, the host uses the virtual gateway to communicate with external networks. VRRP dynamically associates the virtual router with a physical device that transmits service packets. If that physical device becomes faulty, VRRP selects a new device to take over service transmission. The entire process is invisible to users, providing continuous communication between the internal and external networks.
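A sketch of the master election and virtual-IP ownership idea behind VRRP; the router objects, their methods, and the virtual address below are hypothetical, and real VRRP also involves advertisement timers and preemption rules:

```python
def elect_master(routers):
    """VRRP-style election sketch: the reachable router with the highest
    priority owns the virtual IP. Hosts keep the virtual IP as their
    gateway, so a failover is invisible to them."""
    alive = [r for r in routers if r.reachable]
    if not alive:
        raise RuntimeError("no router available for the virtual gateway")
    master = max(alive, key=lambda r: r.priority)
    master.own_virtual_ip("192.168.0.254")   # example virtual router address
    for backup in alive:
        if backup is not master:
            backup.release_virtual_ip()
    return master
```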
7 Hardware Reliability
7.1 Memory Reliability

Memory errors are divided into hard errors and soft errors. Hard errors are caused by invalid or damaged hardware: a component with a hard error returns incorrect data continuously. Hard errors can be detected by the memory self-check during RH2285 startup. Soft errors occur frequently but cannot be caught by the memory self-check; data in memory can be protected against them only by error checking and correction algorithms. The RH2285 employs Error Checking and Correcting (ECC) technology to detect 2-bit memory errors and correct 1-bit memory errors.
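The RH2285's ECC is implemented in the memory controller hardware; as a worked illustration of the same SEC-DED property (correct 1-bit errors, detect 2-bit errors), here is a Hamming(7,4) code with an overall parity bit applied to a 4-bit value:

```python
def secded_encode(nibble):
    """Encode 4 data bits with Hamming(7,4) plus an overall parity bit,
    giving single-error correction and double-error detection (SEC-DED)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p4 = d[1] ^ d[2] ^ d[3]
    word = [p1, p2, d[0], p4, d[1], d[2], d[3]]   # codeword positions 1..7
    word.append(sum(word) % 2)                    # overall parity bit
    return word

def secded_decode(word):
    """Return (nibble, status): correct any 1-bit error, flag 2-bit errors."""
    w = list(word[:7])
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]    # checks codeword positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]    # checks codeword positions 2, 3, 6, 7
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]    # checks codeword positions 4, 5, 6, 7
    syndrome = s1 + 2 * s2 + 4 * s4
    parity_ok = sum(word) % 2 == 0
    if syndrome and parity_ok:
        return None, "double-bit error detected (uncorrectable)"
    if syndrome:                       # single-bit error: flip it back
        w[syndrome - 1] ^= 1
    nibble = w[2] | (w[4] << 1) | (w[5] << 2) | (w[6] << 3)
    return nibble, "ok" if syndrome == 0 and parity_ok else "corrected"

w = secded_encode(0b1011)
w[5] ^= 1                      # inject a single bit error
print(secded_decode(w))        # -> (11, 'corrected')
```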
7.2 Hard Disk Reliability

The following features ensure the reliability of hard disks:
Hard disk hot swap: The RH2285 supports hot swappable hard disks, including the SAS and SATA disks.
Hard disk RAID: The RH2285 supports several RAID modes, such as RAID 0, RAID 1, and RAID 5, and supports hot spare disks for RAID groups, ensuring high reliability of hard disk data. If a hard disk in a RAID group is faulty, the RH2285 supports data restoration, RAID group recovery, and online hard disk replacement. The RAID card is battery-backed, which improves hard disk access performance and protects the data in the cache when a power failure occurs.
7.3 Online Scheduled Disk Fault Detection and Precaution

The Huawei cloud computing solution uses the industry-leading Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) to detect and manage hard disks based on Advanced Technology Attachment (ATA) and Small Computer System Interface (SCSI) interfaces, including checking the reliability of hard disks and predicting their failures. The S.M.A.R.T. technology can measure hard disk attributes, such as the data throughput performance, motor start-up time, and track seeking error rate, and compare them with standard values. It can therefore identify disk faults and notify users so that data loss can be avoided.
The S.M.A.R.T. technology is widely applied to hard disk data reliability. It can analyze the status of the motors, circuits, platters, and heads while hard disks are running. If an anomaly occurs, it enables the system to send an alarm or, in some special situations, to reduce the disk speed and back up the data. A S.M.A.R.T.-enabled hard disk analyzes the operating status and historical records of the disk head, platters, motor, and circuits, using detection signals on the disk and detection software on the host, and compares them with preset thresholds. An alarm is sent if the values fall outside the preset thresholds. The S.M.A.R.T. technology monitors the following key attributes (a threshold-check sketch follows the list):
Read Error Rate
Start/Stop Count
Reallocated Sector Count
Spin-up Retry Count
Drive Calibration Retry Count
ULTRA DMA CRC Error Rate
Multi-zone Error Rate
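As referenced above, a minimal threshold-check sketch; the attribute values and thresholds below are invented for illustration, and real S.M.A.R.T. thresholds are vendor-specific:

```python
# Hypothetical attribute readings: (attribute, normalized value, threshold).
readings = [
    ("Read Error Rate",          62, 44),
    ("Reallocated Sector Count", 95, 36),
    ("Spin-up Retry Count",     100, 97),
]

def check_smart(readings):
    """S.M.A.R.T.-style check: normalized values decrease as a disk
    degrades, so a value at or below its threshold predicts failure."""
    for attribute, value, threshold in readings:
        if value <= threshold:
            print("ALARM: %s = %d (threshold %d): back up data and "
                  "replace the disk" % (attribute, value, threshold))

check_smart(readings)
```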
7.4 Power Reliability

Each Huawei-developed server in the Huawei cloud data center, such as the RH2285, is equipped with two power supply units (PSUs) that generate alarms when a fault occurs. The PSUs work in 1+1 redundancy mode and are hot swappable. If one PSU becomes faulty, the system runs properly without service interruption, and the faulty PSU can be replaced online.
7.5 System Detection

The Huawei-developed servers in the Huawei cloud data center, such as the RH2285, can detect the temperature of key heat-generating components, such as the CPU and memory, in real time. They collaborate with intelligent fans to ensure reliable system operation. The RH2285 can also detect the operating status of key components, such as the fans, power supplies, and hard disks, and generates an alarm if a fault occurs. Faulty components can be replaced online if they support hot swap; devices that do not support hot swap must be powered off before replacement.
7.6 Onboard Software Reliability

The BMC software supports dual images: if one image in the flash memory is damaged, the BMC starts from the other image, preventing system startup failures. The BMC software also monitors the processes running on the server and restarts the server if a process stops responding.