
System High Availability Technical White Paper


Huawei FusionCloud Desktop Solution 6.1
System High Availability Technical White Paper

Issue 01
Date 2017-02-03

HUAWEI TECHNOLOGIES CO., LTD.

Copyright © Huawei Technologies Co., Ltd. 2017. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions
and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice
The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.
Address: Huawei Industrial Base, Bantian, Longgang, Shenzhen 518129, People's Republic of China
Website: http://e.huawei.com

Issue 01 (2017-02-03) Huawei Proprietary and Confidential Copyright © Huawei Technologies Co., Ltd.

Contents

1 Huawei FusionCloud Desktop Solution
2 System Availability Specifications
3 System Reliability
  3.1 Cabinet
  3.2 Server
    3.2.1 CPU Reliability
    3.2.2 Memory Reliability
    3.2.3 Hard Disk Reliability
    3.2.4 Supporting Regular Disk On-Line Faulty Detection and Precautions
    3.2.5 Power Reliability
    3.2.6 System Monitoring
    3.2.7 Onboard Software Reliability
  3.3 Storage Devices
  3.4 Network Devices
    3.4.2 NIC Load-Sharing
    3.4.3 Switch Stacking
    3.4.4 Switch Interconnection Redundancy
    3.4.5 Virtual Router Redundancy Protection
    3.4.6 Detached-Plane Network Communication
  3.5 Cloud Platform HA
    3.5.1 Management Node HA
    3.5.2 Data Backup for Management Nodes
    3.5.3 VM Backup
    3.5.4 VM HA
    3.5.5 VRM-Independent VM HA Management
    3.5.6 VM Fault Detection and Handling
    3.5.7 Live Migration of VMs
    3.5.8 Storage Migration
    3.5.9 VM Load Balancing
    3.5.10 Black Box
    3.5.11 Data Consistency
    3.5.12 Health Check Tool and Information Collection Tool
  3.6 FusionAccess Availability
    3.6.1 FusionAccess Service HA
    3.6.2 FusionAccess Service Monitoring
    3.6.3 Desktop Access HA
    3.6.4 FusionAccess Management Data Backup
    3.6.5 Power-on Recovery Reliability Design
A Glossary

1 Huawei FusionCloud Desktop Solution

The architecture components of the Huawei FusionCloud desktop solution, a desktop cloud product, are deployed on virtual machines (VMs). Figure 1-1 shows the architecture of the Huawei FusionCloud desktop solution.

Figure 1-1 Architecture of Huawei FusionCloud desktop solution

2 System Availability Specifications

• Annual average global VM availability rate: 99.9% (written into the Service Level Agreement)
This specification indicates the proportion of time during which VMs are available. It is determined by how often failures occur and how quickly they are repaired, as given by the following formula:

A = MTBF / (MTBF + MTTR)

Where:
A indicates the availability.
MTBF indicates the mean time between failures.
MTTR indicates the mean time to repair.
For details about how to achieve the promised value, see Chapter 3 "System Reliability."

• Duty time: 24 x 7
The value 24 x 7 indicates that VMs can provide services at all times.

• Power recovery duration: shorter than two hours
This specification indicates the duration from the time power to the cloud platform is restored to the time all services are recovered. System software of the cloud platform, including management software and computing server software, does not need to be loaded sequentially. Loading each server takes less than 5 minutes. A maximum of 20 servers can be loaded concurrently.

• VM migration duration: 3 minutes
This specification indicates the duration from the time the system detects that a VM has been powered off or has broken down to the time the system successfully restarts the VM, or the duration from the time the system detects that a VM is faulty to the time the system starts the VM on another server. The duration depends on the startup time of the operating system (OS) on the VM. If the system management server does not receive any heartbeat response from a VM within 40 seconds, it starts the VM on another server. This is the high availability (HA) process. The VM migration duration does not include the startup duration of the VM itself. VM split-brain is avoided by using a lock mechanism.

• Live migration duration: 20 seconds for a VM with 1 GB memory
This specification indicates the duration for migrating a VM from one server to another server without affecting service provisioning. During live migration, the virtualization software copies the VM memory to the destination physical server at a rate of about 1 GB per 20 seconds. It then copies the data that changed during the previous copy operation to the destination physical server at the same rate. The process repeats until all the latest data has been migrated to the destination physical server. The new VM is then started and the original VM is stopped. This final switchover takes a few milliseconds, and the user is unaware of the process.

• TC yearly failure rate: smaller than 3%

3 System Reliability

3.1 Cabinet

The cabinet has the following reliability characteristics:
• Dual power distribution units (PDUs) with overcurrent protection
• Earthquake-proof up to magnitude 9
• Compliant with NEBS L3

3.2 Server

3.2.1 CPU Reliability

Reliability Feature | Concept | Benefit
Core isolation | Disables services running on CPU cores that are faulty. | Ensures service availability at the cost of some performance, and recovers system processing capability during off-peak hours.
Socket isolation | Starts only the active CPU for services when the standby CPU is faulty. | Ensures service availability at the cost of some performance, and recovers system processing capability during off-peak hours.
PECI-based MCA register access by out-of-band systems | Enables out-of-band systems to access MCA registers using the Platform Environment Control Interface (PECI). | Provides a PECI channel, decoupled from the active system, through which out-of-band systems can access the MCA registers of CPUs. When an internal error occurs in the PCH and the PECI channel of the ME is unavailable, out-of-band systems use this channel to capture as much fault information as possible for fault location.
IVR (core/socket voltage detection) | Provides monitoring and alarm functions for the IVR modules integrated into CPUs. | Provides fault monitoring and alarm functions for all modules in CPUs so that risks to system stability can be identified and handled in a timely manner.

3.2.2 Memory Reliability

Memory errors are mainly of two kinds: hard errors and soft errors. Hard errors are caused by invalid or damaged hardware, so the affected components return incorrect data continuously. Hard errors can be detected by the memory self-check during server startup. Soft errors occur more frequently and cannot be detected by the memory self-check. Data in memory can be protected against them only by error checking and correcting (ECC) algorithms. For soft error correction, the X6000 and E6000 servers use industry-standard ECC technology to detect 2-bit memory errors and correct single-bit memory errors.

3.2.3 Hard Disk Reliability

• Hard disk hot swapping: The servers support hot swapping of SATA/SAS hard disks while the system is running.
• Hard disk RAID: The X6000 and E6000 servers support several RAID modes, such as RAID 0, RAID 1, and RAID 5, and support hot spare disks for RAID groups, ensuring high reliability of hard disk data. When a hard disk in a RAID group is faulty, the servers support data restoration, RAID group recovery, and online hard disk replacement. The RAID card has a battery, which improves hard disk access performance and protects the data in the cache when a power outage occurs.

3.2.4 Supporting Regular Disk On-Line Faulty Detection and Precautions

The storage module of the Huawei desktop cloud solution uses the SMART standard to monitor Advanced Technology Attachment (ATA) and small computer system interface (SCSI) hard disks, to manage and check hard disk reliability, and to predict disk errors. SMART measures hard disk properties such as data throughput, motor start time, and error rate, compares the attribute values with standard values, and on that basis infers impending hard disk faults and displays a warning dialog box so that data loss can be avoided. SMART is widely used to improve hard disk reliability. The key attributes monitored by SMART include:
• Read error rate
• Start/stop count
• Reallocated sector count
• Spin-up retry count
• Drive calibration retry count
• Ultra DMA CRC error rate
• Multi-zone error rate

3.2.5 Power Reliability

Servers are configured with multiple power supply units (PSUs) and generate alarms when a fault occurs. PSUs support redundancy and hot swap, which ensures that the system keeps running when any PSU is faulty. The faulty PSU can be replaced online.

3.2.6 System Monitoring

The system monitors the temperature of key components, such as the CPU and memory.
Together with intelligent fan speed control and monitoring, this ensures system reliability. The system also monitors the running status of key components, such as fans, PSUs, and hard disks. An alarm is generated when a fault occurs. Devices that support hot swap can be replaced online. Devices that do not support hot swap must be powered off before replacement.

3.2.7 Onboard Software Reliability

BMC software supports dual images. If one image in the flash memory is damaged, the BMC starts from the other image. This prevents system startup failures. BMC software monitors processes running on the server and restarts the server if a process stops responding.

3.3 Storage Devices

The FusionStorage distributed storage system is designed for appliance deployments. FusionStorage uses local storage resources on the computing nodes to store user data, and adopts redundancy and distributed cache technologies to ensure data consistency and provide a high-performance storage solution. In the FusionStorage distributed storage scenario, three copies of the data are stored, providing 99.9993% availability.

Each IP SAN (a Huawei S5500T system) consists of one controller enclosure and three disk enclosures, with a maximum of 96 hard disks. The following configurations ensure the reliability of the storage device:
• Eight physical links for multipathing
• Two global hot spare disks for each set
• Seven 9+1 RAID 5 groups for each set

Table 3-1 shows IP SAN reliability specifications.

Table 3-1 IP SAN reliability specifications

Scenario | Availability | MTBF (y) | Yearly Outage (min/y)
Ten disks configured as RAID 5 | 99.9991% | 12.684 | 4.73

The availability of the IP SAN is 99.9991%, the MTBF is 12.684 years (111110.1 hours), and the yearly outage is 4.73 minutes.
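
The figures in Table 3-1 can be cross-checked against the availability formula from chapter 2, A = MTBF / (MTBF + MTTR). The following Python sketch is only an illustration: the function names and the 365-day year are assumptions made here, and the white paper does not state the MTTR it used.

```python
# Illustrative cross-check of Table 3-1 (helper names are hypothetical).
HOURS_PER_YEAR = 365 * 24            # assumption: 365-day year
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_outage_minutes(avail: float) -> float:
    """Expected unavailable minutes per year for a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


def implied_mttr_hours(avail: float, mtbf_hours: float) -> float:
    """Solve A = MTBF / (MTBF + MTTR) for MTTR."""
    return mtbf_hours * (1.0 - avail) / avail


ip_san_availability = 0.999991       # 99.9991%, from Table 3-1
ip_san_mtbf_hours = 111110.1         # from the text above

print(round(yearly_outage_minutes(ip_san_availability), 2))   # about 4.73 minutes
print(round(implied_mttr_hours(ip_san_availability, ip_san_mtbf_hours), 2))
print(round(yearly_outage_minutes(0.999) / 60, 2))            # hours per year at 99.9%
```

Under these assumptions the published availability and MTBF are consistent with a repair time of roughly one hour, and the 99.9% VM availability target in chapter 2 corresponds to an outage budget of about 8.76 hours per year.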

FusionStorage and the IP SAN support the core data protection mechanisms, such as data redundancy protection, power-off protection, background scanning, and data pre-reconstruction. These mechanisms ensure data security. The user data persistence rate reaches 99.999%, whereas the data persistence rate for traditional PCs is less than 95%.

3.4 Network Devices

The network subsystem takes five measures to enhance system reliability. They are described in sections 3.4.2 to 3.4.6.

Figure 3-1 Network subsystem

3.4.2 NIC Load-Sharing

As shown in Figure 3-1, the system bonds the multiple NICs provided by each physical server. This ensures both system reliability and load sharing.

In bonding mode, multiple NICs are bonded into one logical NIC and work together, so the traffic to the server is shared among the NICs. The load on each NIC is therefore much smaller and the server handles concurrent access better, which ensures stable and quick access to the server. In addition, if one NIC is faulty, the other NICs take over its load seamlessly. This avoids service interruption caused by the failure of a single NIC or link.

After multiple NICs on a server are bonded, the system uses the bonded group to expand the incoming and outgoing bandwidth of the server, to implement load balancing, and to enhance disaster tolerance. This avoids transmission congestion or service interruption caused by a single NIC.

3.4.3 Switch Stacking

Switch stacking connects a set of physical switches through stacking cables or high-speed uplink ports to form one reliable logical switch. The access switches are stacked through stacking ports.
The stacking mechanism improves the reliability of switch devices and allows them to be managed centrally, which reduces maintenance cost. After two switches are stacked, they act as a single switch and are presented to peripherals as one switch device. The two physical switches work in master/slave mode. When one of them is faulty, the other takes over its services.

The switches are connected in a ring or chain topology, and the stack master is then elected by running the stack management protocol. The stack master is responsible for managing the stack system, including assigning IDs to stack members, collecting information about the stack topology, and sending the topology information to stack members. The stack master also designates the stack slave, which is promoted to stack master to manage the stack system if the master becomes faulty.

3.4.4 Switch Interconnection Redundancy

Smart Link, also called backup link, provides a reliable and efficient backup and switchover solution for dual uplinks, and is usually used in dual-uplink networking. Compared with the Spanning Tree Protocol (STP), Smart Link converges faster. Compared with the Rapid Ring Protection Protocol (RRPP) and Smart Ethernet Protection (SEP), Smart Link is simpler to configure.

Dual-uplink networking is one of the most common networking modes. It conventionally uses STP to block the redundant link and provide a backup path. When the master link is faulty, traffic fails over to the slave link. This meets users' functional requirement for redundancy, but fails to meet many users' performance requirements, because convergence still takes seconds even when the rapid transition mechanism of Rapid STP is used. This is an unfavorable performance KPI for high-end Ethernet switches used in carrier-class core networks.
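
The master/slave behavior of a dual-uplink network can be pictured with a small Python sketch. It is only an illustration of the failover rule described above, not an implementation of STP or of Huawei's Smart Link; the `Uplink` type and `select_forwarding_link` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Uplink:
    name: str
    is_up: bool


def select_forwarding_link(master: Uplink, slave: Uplink) -> Optional[Uplink]:
    """Return the uplink that should carry traffic, or None if both are down.

    The master link is preferred; the slave link carries traffic only
    while the master link is faulty.
    """
    if master.is_up:
        return master
    if slave.is_up:
        return slave
    return None


master = Uplink("uplink-1", is_up=True)
slave = Uplink("uplink-2", is_up=True)
print(select_forwarding_link(master, slave).name)   # master carries traffic

master.is_up = False                                 # master link fails
print(select_forwarding_link(master, slave).name)   # traffic moves to the slave
```

The rule itself is trivial. What distinguishes the protocols discussed in this section is how quickly the switch detects the failure and re-evaluates the rule, which is the convergence time.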

To address the slow convergence of STP in dual-uplink networking, Huawei FusionCloud introduces the Smart Link solution, which provides active/standby link redundancy and fast switchover. Smart Link is designed specifically for dual-uplink networking, so it ensures performance and simplifies configuration. A port association solution called Monitor Link is introduced as a complement to Smart Link. It monitors the uplink and thereby improves the backup function of Smart Link.

3.4.5 Virtual Router Redundancy Protection

The Virtual Router Redundancy Protocol (VRRP) is a fault tolerance protocol. With this protocol, several routers can be grouped into one virtual router. When the next-hop device of a host is faulty, services fail over to a backup router in the virtual router without interruption.

VRRP combines a group of routing devices in the LAN into a VRRP backup group, which is equivalent to one virtual router. Hosts on the LAN only need to know the IP address of the virtual router, not the IP addresses of the specific devices. After the default gateway of the hosts is set to the IP address of the virtual router, the hosts communicate with external networks through the virtual gateway.

VRRP dynamically associates the virtual router with the physical device that forwards traffic. When that device is faulty, another device is selected to take over. The whole process is transparent to users, so communication between the internal network and external networks continues without interruption.

3.4.6 Detached-Plane Network Communication

The entire cloud computing system is logically divided into three planes: management plane, storage plane, and service plane.
To ensure data reliability on each network plane, the FusionCloud solution adopts a detached-plane architecture. The planes are separated by using virtual local area networks (VLANs). If one plane malfunctions, the other two planes keep working. For example, when a temporary malfunction occurs on the management plane, the service plane still works properly and provides services to cloud end users.

In addition, the cloud computing system supports VLAN-based priority settings. Internal management and control packets are given the highest priority, so that administrators and users can manage and control the system at any time.

3.5 Cloud Platform HA

3.5.1 Management Node HA

The active and standby management nodes of the FusionCloud system use a heartbeat detection mechanism. The standby node monitors the health status of the active node in real time. When a fault is detected on the active management node, the standby management node takes over its services and continues to provide them.

A watchdog monitors all application processes on the service management node in real time. The watchdog detects abnormal process states, such as deadlock, and restarts the affected process to recover it. If the process cannot be recovered by restarting, an active/standby switchover is performed on the service management node and an active/standby exception alarm is generated, which ensures the reliability of the application process.

Figure 3-2 Management node HA

The management node is responsible for all the services in the entire system. The node works in active/standby mode.
When the active and standby nodes malfunction simultaneously, the services that depend on them, such as VM creation and deletion, also fail. Running VMs are not affected by the malfunction. Users can keep using the applications on their VMs without noticing that a fault has occurred on the management nodes.

3.5.2 Data Backup for Management Nodes

All data on the management nodes is backed up automatically at regular intervals. Even when both the active and standby management servers are faulty and all data is lost, the data can be recovered quickly.

The following describes the data recovery process when the active and standby management servers are faulty and all data is lost:
Step 1 Replace the management server.
Step 2 Reload the management node.
Step 3 Copy the backup data to the management node, and start the management node.
It takes about 30 minutes to recover all the lost data.
----End

3.5.3 VM Backup

The eBackup VM backup scheme uses the Huawei eBackup software and the snapshot backup function of FusionCompute to back up VM data. Working with FusionCompute, eBackup backs up the data of a specified VM, or of a specified volume of the VM, based on specified policies. If VM data is lost or the VM is faulty, the data can be restored from the backup. Backup data is stored on virtual disks attached to the eBackup VM or on peripheral storage devices accessed over the network file system (NFS) or common Internet file system (CIFS).

The eBackup VM backup scheme has the following characteristics:
• Ease of use. Users do not need to install backup proxy software. They only need to create VM templates and manage VMs on the graphical user interface (GUI).
• VM-level backup service. Users can configure the full backup policy, incremental backup policy, backup cycle, backup period, and backup data expiration policy. Different types of VMs can be configured with different backup policies.
• Efficient backup and restoration. In full backup mode, only valid data is backed up. In incremental backup mode, only modified data is backed up. This minimizes the backup traffic and the required backup storage space.
• Concurrent backup and restoration. Each backup device supports 200 VMs and allows concurrent backup and restoration for eight VMs. Each backup domain supports 10 backup devices. Backup has no impact on production VMs because the eBackup software is deployed on dedicated virtual devices.

3.5.4 VM HA

If a physical CNA server is powered down or restarts abnormally, the system migrates the VMs that have high availability (HA) enabled to other computing servers. This ensures that the VMs are quickly restored to the normal state.

The FusionCloud solution provides multiple migration strategies. Because thousands of VMs can run within a cluster, after a computing server is powered down the system migrates its VMs to different destination servers based on the network traffic status and the load of the destination servers. This avoids network congestion and destination server overload.

Figure 3-3 VM HA feature

VMs support the HA feature. If a VM cannot connect to the VRM, the VRM considers the VM faulty and issues a command to restart the VM on another computing node. The VM fault is then recovered automatically.
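
The detection and placement logic described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the VRM's actual algorithm: the 40-second heartbeat timeout comes from chapter 2, while the `Host` and `VM` types, the `load` metric, and the `load_per_vm` increment are hypothetical. Network traffic status, which the text also names as a placement input, is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

HEARTBEAT_TIMEOUT_S = 40  # from chapter 2: no heartbeat response within 40 seconds


@dataclass
class Host:
    name: str
    load: float           # hypothetical utilization, 0.0 to 1.0
    healthy: bool = True


@dataclass
class VM:
    name: str
    host: str             # name of the host the VM currently runs on
    last_heartbeat: float  # timestamp of the last heartbeat, in seconds


def find_faulty_vms(vms: List[VM], now: float) -> List[VM]:
    """VMs whose last heartbeat is older than the timeout."""
    return [vm for vm in vms if now - vm.last_heartbeat > HEARTBEAT_TIMEOUT_S]


def pick_destination(vm: VM, hosts: List[Host]) -> Optional[Host]:
    """Least-loaded healthy host other than the VM's current host."""
    candidates = [h for h in hosts if h.healthy and h.name != vm.host]
    return min(candidates, key=lambda h: h.load) if candidates else None


def plan_restarts(vms: List[VM], hosts: List[Host], now: float,
                  load_per_vm: float = 0.05) -> Dict[str, str]:
    """Map each faulty VM to a destination host, spreading VMs by load."""
    plan: Dict[str, str] = {}
    for vm in find_faulty_vms(vms, now):
        dest = pick_destination(vm, hosts)
        if dest is None:
            continue                    # no capacity: leave the VM unplaced
        plan[vm.name] = dest.name
        dest.load += load_per_vm        # account for the VM just placed
    return plan


hosts = [Host("cna-a", 0.50, healthy=False), Host("cna-b", 0.20), Host("cna-c", 0.22)]
vms = [VM("vm1", "cna-a", 0.0), VM("vm2", "cna-a", 0.0), VM("vm3", "cna-b", 95.0)]
print(plan_restarts(vms, hosts, now=100.0))
```

Updating the destination load after each placement is what spreads the VMs of a failed server across several hosts instead of sending them all to the one that was least loaded at the start.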

To prevent VM split-brain caused by incorrect decisions, the system uses an anti-split-brain lock mechanism.

Management nodes run in active/standby mode and feature high availability (HA). The active and standby management nodes are deployed on different CNAs and are mutually exclusive. For example, when the active management node is faulty, the standby management node becomes the active management node. Meanwhile, the faulty management node is restarted on another CNA and serves as the standby management node. As a result, the active and standby management nodes do not break down simultaneously.

3.5.5 VRM-Independent VM HA Management

FusionSphere supports VRM-independent VM HA management. This function allows FusionSphere to detect the network heartbeat connection between hosts independently of the VRM node, so the HA function between hosts takes effect even if the VRM node is faulty. The function also allows the data stores associated with hosts to detect host status, which prevents HA misjudgment caused by management network faults. After this function is enabled, FusionSphere can detect host faults on the service plane and generate alarms accordingly.

3.5.6 VM Fault Detection and Handling

Most VMs run Windows, which is prone to faults such as the blue screen of death (BSOD). When a BSOD occurs on a VM, the Huawei cloud platform detects the BSOD information and automatically restarts the VM. After the VM has restarted, the user only needs to reconnect to it.

The Huawei cloud platform also supports a one-touch disk migration function. When the operating system (OS) of a user VM breaks down, the user does not need to reinstall the OS, which avoids the risk of data loss.
Instead, the Huawei cloud platform creates a VM with the same specifications as the faulty one and automatically mounts the data of the faulty VM to the new VM. The user can log in to the new VM and access the data without any other manual operations.

3.5.7 Live Migration of VMs

The VM is the resource entity through which the cloud platform provides elastic computing services. To prevent service interruption caused by VM unavailability, the system allows a VM to be migrated without interrupting services, which is called live migration. During migration, the hypervisor quickly copies the memory data to keep the memory synchronized, and migrates the VM to the target host without interrupting the service. Figure 3-4 shows how the VM migrates to the target host without interrupting services. Because shared storage resources are used, the data on the VM remains unaltered after the migration.

Figure 3-4 VM live migration

Live migration reduces service running costs for the customer. When traffic is light, services running on different servers can be consolidated onto fewer servers, or onto one server, and the idle servers can be turned off. This reduces costs, saves energy, and reduces emissions.

Live migration ensures high reliability of the customer system. When a fault occurs on a running physical machine, the services can be migrated to another properly running machine before the situation gets worse.

Hardware can be upgraded without interrupting services. When the customer wants to upgrade hardware without interrupting services, all the VMs on the physical machine can be migrated to other machines before the machine is upgraded.
After the upgrade is finished, migrate the VMs back. The services are not interrupted during this process.

Currently, the system supports live migration of VMs only in the following application scenarios:

- Manually migrate the VM to any idle physical server as required.
- Migrate VMs in batches to any idle physical server based on resource utilization.

3.5.8 Storage Migration

The storage virtualization module on the cloud platform supports storage live migration. User data must be migrated from one storage device to another in any of the following scenarios:

- The storage device is being maintained.
- Users have higher requirements on storage performance.
- The existing storage devices cannot meet requirements.

Storage live migration ensures that user data can be migrated from one storage device to another without affecting VMs. Figure 3-5 shows the storage migration.

Figure 3-5 Storage migration

3.5.9 VM Load Balancing

When the system works in load balancing mode, it dynamically distributes load based on the current load of each physical computing server whenever a new VM is started, VMs are live migrated, or computing nodes are remotely restarted due to faults. This keeps the load of the physical computing servers in a cluster dynamically balanced.

3.5.10 Black Box

Black box technology is introduced to the management and computing nodes.
When the system runs abnormally or breaks down, the black box automatically saves the kernel logs of the virtual machine manager (VMM), the system snapshot, the diagnosis information of the kernel logs, and the last words of the system to a reliable storage device (the computing node), or sends them to a remote server, such as a TFTP server, using netpoll. After the system breaks down, this information can therefore be exported for problem analysis and identification.

3.5.11 Data Consistency

The entire cloud system meets the high reliability requirements of the telecom field. More than 80% of the system code is devoted to handling faults. Checkpoint and rollback mechanisms are adopted to ensure data consistency. The cloud system also has a data auditing mechanism that audits and clears potential junk data left by faults. The junk data can also be collected. This prevents data inconsistency due to junk data and ensures that services run properly.

3.5.12 Health Check Tool and Information Collection Tool

The system provides a health check tool and an information collection tool to meet the high reliability requirements of the telecom field.

The health check tool regularly examines system health. It monitors the system running status, alarm and log information, progress status, key configuration information, key resource usage, and the changing trend of key resources to detect performance, security, and reliability risks. It also checks the system before and after high-risk operations and system upgrades to verify the system health status.

The information collection tool accurately collects fault-related logs and alarm information based on fault types.
This facilitates fault location and analysis, simplifies fault information collection, and shortens the service breakdown duration.

3.6 FusionAccess Availability

3.6.1 FusionAccess Service HA

Table 3-2 describes the FusionAccess service software deployment modes. FusionAccess services other than the license service adopt redundancy deployment modes. The license service is the exception because the HDC can cache licenses: services are not affected for one month even if a single point of failure occurs in the license service. When a fault occurs in any service, the system detects and isolates the fault in a timely manner.

In addition to redundancy deployment, all FusionAccess services support local service monitoring. If a service is abnormal, it is restarted to ensure proper running.

In scenarios where domain controllers, including the AD and LiteAD, are used to authenticate users, if the domain controllers cannot authenticate users properly, the system can use the local authentication function of VMs to ensure proper use of desktop cloud services.

Table 3-2 FusionAccess service software deployment modes

Service Name | Function | Deployment Mode | Service Impact of Single Point Failure
WI | Allows users to log in to VMs. | Load balancing | None
HDC | Performs desktop access control. | Load balancing | None
License | Performs license control. | Single-node deployment | Services are not affected for one month even if a single point of failure occurs in the license service.
ITA | Supports service provisioning. | Active/Standby | None
vLB | Performs load balancing for WIs. | Active/Standby | None
vAG | Functions as the self-service maintenance login gateway and desktop access gateway. | Load balancing | None
GaussDB | Stores desktop service data. | Active/Standby | None
AD (LiteAD)/DNS/DHCP | Functions as IT infrastructure facilities. | Active/Standby | None
Backup Server | Manages system backup. | Single-node deployment | Services are not affected, and monitoring is implemented.
UNS | Unified domain name service | Load balancing |

3.6.2 FusionAccess Service Monitoring

FusionAccess monitors VDI infrastructure servers in real time. When services (or servers) malfunction, alarms are centrally displayed on the ITA portal. Guides are provided for handling each alarm.

Different services (or servers) are monitored using different methods. Services on Linux servers proactively report heartbeats to the ITA. The heartbeat information carries CPU and memory usage. If the ITA does not receive server heartbeats in three consecutive cycles, an alarm indicating abnormal services is generated. If the CPU or memory usage carried in heartbeats exceeds 80%, an alarm is also generated. Windows servers are monitored by checking service status. Table 3-3 lists the major FusionAccess alarms.

Table 3-3 FusionAccess alarms

Service Name | Function | Monitoring Method | Remarks
WI | VM login page | Heartbeat |
HDC | Desktop access control | Heartbeat |
License | License control | Heartbeat |
ITA | Service provisioning | Checking service status | Two ITA servers check the service status of each other.
vLB | WI load balancer | Heartbeat |
vAG | Self-maintenance login gateway and desktop access gateway | Heartbeat |
GaussDB | Stores desktop service data. | Triggered by HDC and ITA services. |
AD (LiteAD)/DNS/DHCP | IT infrastructure | The ITA proactively monitors processes. |
Backup Server | Manages and configures data backup. | Monitored by checking backup results. |
UNS | Unified domain name management | Heartbeat |
CPU/Memory/Disk | Monitors all servers, CPUs, memories, and disks of VDI. | Servers periodically report CPU, memory, and disk status. | Ensure that the disk is not full. In the Linux OS, ensure that inode usage does not exceed 80% on any disk partition.
Clock synchronization | Synchronizes clocks in a system. | Servers periodically report clock synchronization status. |

3.6.3 Desktop Access HA

The following three methods are used to improve desktop access HA:

- Automatic desktop reconnection: If a desktop is disconnected due to intermittent network disconnection or other causes, clients automatically reconnect to the desktop and users do not need to log in again.
- Automatic desktop service port switching: If desktop service ports were fixed, clients could fail to connect to desktops whenever applications installed by users occupied those ports on user VMs. To avoid such software compatibility problems, the Huawei desktop service adopts automatic port switching technology so that a client can use another available port when a port is occupied.
- Desktop service process HA: Desktop service processes running on user VMs automatically recover after abnormal termination, regardless of whether the processes are terminated by users or by other processes due to errors.

3.6.4 FusionAccess Management Data Backup

Figure 3-6 shows the management data backup.

Figure 3-6 Management data backup (the ITA, WI, DB, AD (LiteAD)/DHCP/DNS, HDC, and License nodes each send backups to the backup server over FTPS)

The data on each node is backed up into compressed files at 01:00. The backup files are sent to the backup server over FTPS. The backup server can be the one included in the Huawei VDI solution or an FTP server provided by the customer.
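As a rough illustration of the backup flow described above, the following Python sketch computes the next 01:00 run, packs a node's data directory into a compressed file, and uploads it over FTPS. This is only a sketch under stated assumptions, not the FusionAccess implementation: the function names, the archive naming scheme, the tar.gz format, the daily schedule, and the use of explicit FTPS through Python's standard ftplib are all hypothetical choices made for the example.

```python
import tarfile
from datetime import datetime, time, timedelta
from ftplib import FTP_TLS
from pathlib import Path

BACKUP_TIME = time(1, 0)  # 01:00, the backup time given in the text


def next_backup_run(now: datetime) -> datetime:
    """Return the next 01:00 strictly after `now` (assumes a daily schedule)."""
    candidate = datetime.combine(now.date(), BACKUP_TIME)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def create_backup_archive(data_dir: Path, out_dir: Path, node: str,
                          now: datetime) -> Path:
    """Pack data_dir into a gzip-compressed tar file.

    The tar.gz format and the file naming are assumptions; the source only
    says that data is backed up into compressed files.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{node}-{now:%Y%m%d-%H%M%S}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(data_dir, arcname=node)  # store files under "<node>/..."
    return archive


def upload_over_ftps(archive: Path, host: str, user: str, password: str) -> None:
    """Send the archive to the backup server over explicit FTPS.

    host, user, and password are placeholders supplied by the caller.
    """
    with FTP_TLS(host) as ftps:
        ftps.login(user, password)
        ftps.prot_p()  # protect the data channel as well as the control channel
        with archive.open("rb") as fh:
            ftps.storbinary(f"STOR {archive.name}", fh)
```

A scheduler would wait until `next_backup_run(datetime.now())`, call `create_backup_archive` for each node, and then pass the result to `upload_over_ftps`.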
The IP address of the FTP backup server can be configured on the ITA management page. Only the 10 most recent backup copies are retained on the backup server. If data is damaged, you can download the data from the backup server and quickly restore it according to the documentation.

3.6.5 Power-on Recovery Reliability Design

When the power supply is recovered after an unexpected outage in a data center, the system can automatically start all servers. Server nodes can start in a random sequence, and the system still works properly after the power supply is recovered in this scenario.

A Glossary

A
AD: active directory
ATA: advanced technology attachment

B
BIOS: basic input/output system

C
CNA: Computing Node Agent

D
DB: database
DHCP: Dynamic Host Configuration Protocol
DNS: domain name server

G
GM: GalaxManager

I
ITA: IT adapter

M
MTBF: mean time between failures
MTTR: mean time to repair

N
NC: network computer
NEBS: Network Equipment Building System

P
PDU: power distribution unit

R
RAID: redundant array of independent disks
RBD: reliability block diagram

S
SCSI: Small Computer System Interface

T
TFTP: Trivial File Transfer Protocol