Transcript
Building Continuous Cloud Infrastructures
Deepak Verma, Senior Manager, Data Protection Products & Solutions
John Harker, Senior Product Marketing Manager
October 8, 2014
WebTech Educational Series: Building Continuous Cloud Infrastructures

In this WebTech, Hitachi design experts cover what is needed to build continuous cloud infrastructures: servers, networks, and storage. Geographically distributed, fault-tolerant, stateless designs allow efficient distributed load balancing, easy migrations, and continuous uptime in the face of individual system element or site failures. Starting with distributed stretch-cluster server environments, learn how to design and deliver enterprise-class cloud storage with the Hitachi Storage Virtualization Operating System and Hitachi Virtual Storage Platform G1000.

In this session, you will learn:
• Options for designing continuous cloud infrastructures from an application point of view.
• Why a stretch-cluster server operating environment is important to continuous cloud infrastructure system design.
• How Hitachi global storage virtualization and global-active devices can simplify and improve server-side stretch-cluster systems.
Application Business Continuity Choices
• Types of failure scenarios
• Locality of reference of the failure
• How much data can we lose on failover? (RPO)
• How long does recovery take? (RTO)
• How automatic is failover?
• How much does the solution cost?
Types of Failure Events and Locality of Reference
(2x2 view: logical vs. physical failure, localized vs. remote recovery; probability and cost vary by quadrant)

Localized recovery, logical failure:
  Probability: High | Causes: human error, bugs | Desired RTO/RPO: Low/Low
  Remediation: savepoints, logs, backups, point-in-time snapshots | Cost: $

Localized recovery, physical failure:
  Probability: Medium | Causes: hardware failure | Desired RTO/RPO: Zero/Zero
  Remediation: local high-availability clusters (servers and storage) | Cost: $$

Remote recovery, logical failure:
  Probability: Low | Causes: rolling disasters | Desired RTO/RPO: Medium/Low
  Remediation: remote replication with point-in-time snapshots | Cost: $$$$

Remote recovery, physical failure:
  Probability: Very low | Causes: immediate site failure | Desired RTO/RPO: Low/Zero
  Remediation: synchronous replication, remote high availability | Cost: $$$
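As a rough illustration, the quadrants above can be encoded as a lookup structure when automating protection-policy selection. The sketch below is hypothetical Python; the names and values come only from the matrix above, not from any Hitachi tooling.

```python
# Hypothetical mapping of the failure-event quadrants above to typical
# remediation choices and relative cost. Illustrative only.
FAILURE_MATRIX = {
    ("localized", "logical"): {
        "probability": "high",
        "remediation": ["savepoints", "logs", "backups", "point-in-time snapshots"],
        "relative_cost": 1,
    },
    ("localized", "physical"): {
        "probability": "medium",
        "remediation": ["local HA clusters (servers and storage)"],
        "relative_cost": 2,
    },
    ("remote", "logical"): {
        "probability": "low",
        "remediation": ["remote replication with point-in-time snapshots"],
        "relative_cost": 4,
    },
    ("remote", "physical"): {
        "probability": "very low",
        "remediation": ["synchronous replication", "remote high availability"],
        "relative_cost": 3,
    },
}

def remediation_for(locality: str, failure_type: str) -> list:
    """Return the remediation options for a given failure quadrant."""
    return FAILURE_MATRIX[(locality, failure_type)]["remediation"]
```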
Understanding RPO and RTO

[Timeline diagram: the recovery point objective (RPO) is measured backward from the outage to the last usable copy of the data (hours, seconds, or zero), while the recovery time objective (RTO) is measured forward from the outage to the point when service is restored (zero, seconds, or hours).]
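To make the two metrics concrete, here is a minimal sketch (assumed timestamps, not from the presentation) that computes achieved RPO and RTO from an outage timeline:

```python
from datetime import datetime, timedelta

def achieved_rpo(outage_start: datetime, last_recovery_point: datetime) -> timedelta:
    """Data-loss window: time between the last usable copy and the outage."""
    return outage_start - last_recovery_point

def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime window: time between the outage and restored service."""
    return service_restored - outage_start

# Example: nightly backup at 12am, outage at 8pm, service restored at 2am next day.
outage = datetime(2014, 10, 8, 20, 0)
print(achieved_rpo(outage, datetime(2014, 10, 8, 0, 0)))   # 20:00:00 of potential data loss
print(achieved_rto(outage, datetime(2014, 10, 9, 2, 0)))   # 6:00:00 of downtime
```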
Data Protection Options
‒ Traditional approach: multi-pathing, server clusters, backups, application- or database-driven
‒ Storage-array-based replication: remote and local protection
‒ Appliance-based solutions: stretched clusters, quorums
‒ Array-based high availability
Traditional Data Protection Approach

[Diagram: clustered app/DB servers with in-memory buffers at the local and remote sites; application and database backups go to tape and are restored from tape, with the remote copy moved by truck, tape copy, or VTL replication.]

‒ Focus has been on server failures at the local site only
‒ Coupled with enterprise storage for higher localized uptime

RPO/RTO by failure scenario:
       Local Physical   Local Logical   Remote Physical   Remote Logical & Physical
RPO    0*               4-24 hrs.       8-48 hrs.         8-48 hrs.
RTO    0*               4-8 hrs.        4+ hrs.           4+ hrs.

Caveats:
‒ Logical failures and rolling disasters have high RPO/RTO
‒ Scalability and efficiency are at odds
‒ Recovery involves manual intervention and scripting

*Assumes HA for every component and a cluster-aware application.
Application-Based Data Protection Approach

[Diagram: clustered app/DB servers at both sites; the application itself transfers data to the remote site, while backups still go to tape and move by truck, tape copy, or VTL replication.]

‒ Reduces remote physical recovery times
‒ Requires additional standby infrastructure and licenses
‒ Consumes processing capacity of the application/database servers
‒ Specific to every application type, OS type, etc.
‒ Fail-back involves manual intervention and scripting

RPO/RTO by failure scenario:
       Local Physical   Local Logical   Remote Physical    Remote Logical & Physical
RPO    0*               4-24 hrs.       0-4 hrs.#          8-48 hrs.
RTO    0*               4-8 hrs.        15 min.-4 hrs.#    4+ hrs.

Caveats:
*Assumes HA for every component and a cluster-aware application.
#Network latency and application overhead dictate values.
Array-Based Data Protection Approach

[Diagram: clustered app/DB servers with offline standby servers at the remote site; array-based block replication (synchronous or asynchronous) with single-I/O consistency, app/DB-aware local and remote array clones/snapshots, and an optional batch copy to tape.]

‒ Reduces recovery times across the board
‒ No additional standby infrastructure, licenses, or compute power
‒ Generic to any application type, OS type, etc.
‒ Fail-back as easy as fail-over, with some scripting
‒ No application awareness; copies are usually crash-consistent

RPO/RTO by failure scenario:
       Local Physical   Local Logical      Remote Physical   Remote Logical & Physical
RPO    0*               15 min.-24 hrs.    0-4 hrs.#         15-24 hrs.
RTO    0*               1-5 min.           5-15 min.         1-5 min.

Caveats:
*Assumes HA for every component and a cluster-aware application.
#Network latency and application overhead dictate values.
Appliance-Based High Availability Approach

[Diagram: an extended (stretched) server cluster across sites, with virtualization appliances in front of the storage at each site and a quorum arbitrating between them; backups still go to tape and move by truck, tape copy, or VTL replication.]

‒ Takes remote physical recovery times to zero
‒ Combine with app/DB/OS clusters for "true" 0 RPO and RTO
‒ Introduces complexity (connectivity, quorum), risk, and added latency to performance
‒ Does not address logical recovery RPO and RTO

RPO/RTO by failure scenario:
       Local Physical   Local Logical   Remote Physical   Remote Logical & Physical
RPO    0*               4-24 hrs.       0#                8-48 hrs.
RTO    0*               4-8 hrs.        0#                4+ hrs.

Caveats:
*Assumes HA for every component and a cluster-aware application.
#Synchronous distances, coupled with app/DB/OS geo-clusters.
Array-Based H/A + Data Protection Approach

[Diagram: an extended server cluster across sites with array-based bi-directional high-availability copy (block sync or async) between the arrays, a quorum, app/DB-aware local and remote array clones/snapshots with single-I/O consistency, and optional copy to tape.]

‒ Takes remote physical recovery times down to zero
‒ Generic to any application type, OS type, etc.
‒ No performance impact; built-in capability of the array
‒ Combine with app/DB/OS clusters for "true" 0 RPO and RTO
‒ Fail-back as easy as fail-over, no scripting
‒ Combined with snaps/clones for dual logical protection

RPO/RTO by failure scenario:
       Local Physical   Local Logical      Remote Physical   Remote Logical & Physical
RPO    0*               15 min.-24 hrs.    0#                15 min.-24 hrs.
RTO    0*               1-5 min.           0#                1-5 min.

Caveats:
*Assumes HA for every component and a cluster-aware application.
#Synchronous distances, coupled with app/DB/OS geo-clusters.
Considerations for moving to an active-active, highly available architecture:
‒ Storage platform capable of supporting H/A
‒ Application/DB/OS clusters capable of utilizing storage H/A functionality without impacts
‒ Network capable of running dual-site workloads with low latency
‒ Quorum site considerations to protect against split-brain or H/A downtime (see the arbitration sketch below)
‒ People and process maturity in managing active-active sites
‒ Coupled logical protection across both sites and third-site DR
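To illustrate why the quorum matters, here is a minimal, hypothetical sketch of the arbitration a storage pair might apply when the inter-site link fails. It is a conceptual model of quorum-based split-brain avoidance in general, not the actual global-active device algorithm.

```python
def keep_serving_io(peer_reachable: bool, quorum_reachable: bool) -> bool:
    """Decide whether this array should keep serving I/O for an HA pair.

    Conceptual rule set: with the peer reachable, both sides serve I/O.
    With the peer unreachable, only a side that can still reach the quorum
    keeps serving; a side that has lost both suspends I/O so the two copies
    cannot diverge (split-brain).
    """
    if peer_reachable:
        return True           # normal active-active operation
    return quorum_reachable   # isolated side without quorum access suspends I/O
```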
Options for Data Protection

[Timeline diagram mapping Hitachi data protection options onto the RPO/RTO spectrum, from hours through seconds to zero:
‒ RPO in hours: archive (Hitachi Content Platform); backup (Hitachi Data Instance Manager, Hitachi Data Protection Suite, Symantec NetBackup)
‒ RPO in seconds to zero: CDP (Hitachi Data Instance Manager); application-aware snapshots and mirroring
‒ Operational resiliency and operational recovery: HAPRO, HDPS IntelliSnap, Thin Image or in-system replication
‒ Disaster recovery: Universal Replicator (async), TrueCopy (sync)
‒ RTO in hours to minutes: restore/recover from backup, database logs, snapshots, mirroring, and replication
‒ RTO of zero: transparent cluster failover with global-active device (always on)]
Hitachi Storage Virtualization Operating System: Introducing Global Storage Virtualization

Virtual server machines forever changed the way we see data centers; the Hitachi Storage Virtualization Operating System is doing the same for storage.

[Diagram comparing the two stacks. Server virtualization: applications and operating systems run on virtual hardware that the server OS and VM file system abstract from the physical CPU, memory, NIC, and drives. Storage virtualization: a virtual storage identity and host I/O and copy management run on virtual hardware that the virtual storage software abstracts from the physical virtual storage directors, cache, front-end ports, and media.]
Disaster Avoidance Simplified: New SVOS Global-Active Device (GAD)
‒ A virtual storage machine abstracts the underlying physical arrays from hosts
‒ Storage-site failover is transparent to the host and requires no reconfiguration
‒ When new global-active device volumes are provisioned from the virtual storage machine, they can be automatically protected
‒ Simplified management from a single pane of glass

[Diagram: Site A and Site B form a compute HA cluster and a storage HA cluster, presented as one virtual storage machine through global storage virtualization.]
Supported Server Cluster Environments: SVOS global-active device OS + multipath + cluster software support matrix

OS                     | Version       | Cluster              | Global-active device support
VMware                 | 4.x, 5.x      | VMware HA (vMotion)  | Supported (August 2014)
IBM AIX                | 6.x, 7.x      | HACMP / PowerHA      | Supported (August 2014)
Microsoft Windows      | 2008          | MSFC                 | Supported (August 2014)
Microsoft Windows      | 2008 R2       | MSFC                 | Supported (August 2014)
Microsoft Windows      | 2012          | MSFC                 | Supported (August 2014)
Microsoft Windows      | 2012 R2       | MSFC                 | Supported (August 2014)
Red Hat Linux          | 5.x, 6.x      | Red Hat Cluster      | Supported (August 2014)
Red Hat Linux          | 5.x, 6.x      | VCS                  | Supported (August 2014)
Hewlett Packard HP-UX  | 11iv2, 11iv3  | MC/SG                | Supported (1Q2015)
Oracle Solaris         | 10, 11.1      | SC                   | Supported (1Q2015)
Oracle Solaris         | 10, 11.1      | VCS                  | Supported (1Q2015)
Oracle Solaris         | 10, 11.1      | Oracle RAC           | Supported (1Q2015)
Hitachi SVOS Global-Active Device: Clustered Active-Active Systems

[Diagram: global storage virtualization with global-active device. Two arrays, each holding its own LDEVs (10:00, 10:01, 10:02 and 20:00, 20:01, 20:02) in separate resource groups, present the same virtual storage identity (123456) and the same virtual LDEVs (10:01, 10:02), arbitrated by a quorum. Servers with applications requiring high availability write to both copies simultaneously and read locally, from multiple applications at once.]
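The active-active behavior in the diagram (synchronous writes to both copies, reads served locally) can be summarized with the conceptual Python sketch below. It illustrates the general pattern only; `local_copy` and `remote_copy` are assumed objects with `read`/`write` methods, not Hitachi APIs.

```python
import concurrent.futures

class ActiveActiveVolume:
    """Conceptual model of an active-active volume pair: writes are committed
    to both copies before acknowledgement, reads are served from the local copy."""

    def __init__(self, local_copy, remote_copy):
        self.local_copy = local_copy      # array at the server's own site
        self.remote_copy = remote_copy    # array at the partner site

    def write(self, block: int, data: bytes) -> None:
        # Mirror the write synchronously to both copies; acknowledge the host
        # only once both sites have committed it (zero RPO).
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(copy.write, block, data)
                       for copy in (self.local_copy, self.remote_copy)]
            for f in futures:
                f.result()    # raise if either site failed to commit

    def read(self, block: int) -> bytes:
        # Serve reads from the local copy to avoid cross-site latency.
        return self.local_copy.read(block)
```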
One Technology, Many Use Cases: Heterogeneous Storage Virtualization

[Diagram: a host accesses global-active devices on a virtual storage machine whose CPU, cache, and ports belong to the physical storage machine, while the media can come from the physical machine itself or from external (virtualized) storage machines.]
One Technology, Many Use Cases: Non-Disruptive Migration

[Diagram: the virtual storage machine preserves the identity of global-active devices and logical devices during migration, so data can move between two physical storage machines without disrupting the host.]
One Technology, Many Use Cases: Multi-Tenancy

[Diagram: two virtual storage machines, each presenting its own global-active devices to a different host, carved from the CPU, cache, ports, and media of a single physical storage machine.]
One Technology, Many Use Cases: Fault Tolerance

[Diagram: a single virtual storage machine spans two physical storage machines; global-active devices are mirrored between them, so the host keeps running if either physical machine fails.]
One Technology, Many Use Cases: Application/Host Load Balancing

[Diagram: a virtual storage machine spanning two physical storage machines lets an application's I/O be balanced across both sets of CPU, cache, ports, and media.]
One Technology, Many Use Cases: Disaster Avoidance and Active-Active Data Center

[Diagram: a server cluster and NAS heads at Site A and Site B access one virtual storage machine; global-active devices are mirrored between the physical storage machines at the two sites, so either site can run the workload.]
Delivering Always-Available VMware

[Diagram: production servers active at Site 1 and Site 2 in a VMware stretch cluster, connected by global-active device pairs with a quorum (QRM) system.]

‒ Extends native VMware functionality with or without vSphere Metro Storage Cluster
‒ Active/active over metro distances
‒ Fast, simple, non-disruptive migrations
‒ 3-data-center high availability (with SRM support)
‒ Hitachi Thin Image snapshot support
VMware Continuous Infrastructure Scenarios

[Diagram: VMware ESX hosts with Hitachi Dynamic Link Manager (HDLM) at both sites, managed by an active Hitachi Command Suite instance, with a quorum arbitrating the global-active device pair.]

‒ Application migration: read/write I/O switches to the local site's path
‒ Path/storage failover: ESX switches paths to the alternate site's path
‒ HA failover: VMware HA fails over the VM, and the local site's I/O path is used
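A minimal sketch of the path-selection behavior behind these scenarios is shown below. It is a conceptual model only; the real logic lives in HDLM/ESX multipathing, and the names here are hypothetical.

```python
def select_path(local_path_healthy: bool, remote_path_healthy: bool) -> str:
    """Prefer the local site's path; fall back to the alternate site's path.

    During normal operation, and after a VM migration or HA restart, I/O uses
    the surviving host's local path; if the local path (or local storage)
    fails, I/O switches to the alternate site's path.
    """
    if local_path_healthy:
        return "local-site path"
    if remote_path_healthy:
        return "alternate-site path"
    raise RuntimeError("no usable path to the global-active device volume")
```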
Delivering Always-Available Oracle RAC

[Diagram: Oracle RAC production servers active at Site 1 and Site 2, connected by global-active device pairs with a quorum (QRM).]

‒ Elegant distance extension to Oracle RAC
‒ Active/active over metro distances
‒ Simplified designs; fast, non-disruptive migrations
‒ 3-data-center high availability
‒ Increased infrastructure utilization and reduced costs
Delivering Always-Available Microsoft Hyper-V

[Diagram: production servers active at Site 1 and Site 2 in a Microsoft multisite/stretch cluster, connected by global-active device pairs with a quorum (QRM).]

‒ Active/active over metro distances
‒ Complement or avoid Microsoft geo-clustering
‒ Fast, simple, non-disruptive application migrations
‒ Hitachi Thin Image snapshot support
‒ Simple failover and failback
Global-Active Device Management

Hitachi Command Suite (HCS) offers efficient management of global-active devices while providing central control of multiple systems.

[Diagram: active and passive storage management servers running clustered HCS; production servers at each site running the HCS agent, CCI, and clustered applications/DBMS; pair management servers at each site; command devices (CMD) on each array; HA mirroring between the arrays; the HCS database replicated between sites; and primary and remote quorum volumes.]

‒ With a clustered HCS server, the local HCS server handles GAD management; if the local site fails, the remote HCS server takes over
‒ The HCS database should be replicated with either TrueCopy or GAD
‒ Pair management servers run the Hitachi Device Manager agent and CCI and are managed through Hitachi Replication Manager
‒ HCS management requests to configure and operate the HA mirror are issued via the command device
3-Data-Center Always-Available Infrastructures: Protecting the Protected

[Diagram: a server cluster (e.g., Oracle RAC) with I/O active at both metro sites on a global-active device pair; each GAD site also holds an HUR primary volume and journal group (one HUR link active, one standby) replicating to an HUR secondary volume and journal group at a third, any-distance site over FCIP, with a quorum for the GAD pair.]

Global-active device (GAD):
‒ Active-active high availability, read-local
‒ Bi-directional synchronous writes over metro distance
‒ Consistency groups (supported early 2015)

Hitachi Universal Replicator (HUR):
‒ Active/standby 'remote' paths to the third site
‒ Journal groups with delta resync
‒ Any distance, remote FCIP

Pair configuration is on a GAD consistency group and HUR journal group basis, with delta resync.
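As a conceptual aid, the 3-data-center topology described above could be represented roughly as follows; the site names and field names are illustrative only, not configuration syntax.

```python
# Hypothetical representation of the 3-data-center layout described above.
THREE_DC_TOPOLOGY = {
    "metro_pair": {
        "sites": ["site_a", "site_b"],
        "replication": "global-active device (synchronous, bi-directional)",
        "quorum": "quorum volume arbitrating the pair",
        "reads": "served locally at each site",
    },
    "remote_copy": {
        "source_sites": ["site_a", "site_b"],   # HUR primary journal groups
        "target_site": "site_c",                # HUR secondary journal group
        "replication": "Hitachi Universal Replicator (asynchronous, journaled)",
        "paths": {"site_a": "active", "site_b": "standby"},
        "resync": "delta resync if the active HUR source site is lost",
    },
}
```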
Global-Active Device Specifications
‒ Global-active device management: Hitachi Command Suite v8.0.1 or later
‒ Max number of volumes (creatable pairs): 64K
‒ Max pool capacity: 12.3 PB
‒ Max volume capacity: 46 MB to 4 TB (August 2014); 46 MB to 59.9 TB (late 2014)
‒ Supporting products in combination with global-active device, on either side or both sides: Dynamic Provisioning, Dynamic Tiering, Hitachi Universal Volume Manager, ShadowImage, Thin Image (August 2014); HUR with delta resync, Nondisruptive Migration (NDM) (late 2014)
‒ Campus distance support: can use any qualified path failover software
‒ Metro distance support: Hitachi Dynamic Link Manager is required (until ALUA support)
Hitachi Storage Software Implementation Services: Service Description
‒ Pre-deployment assessment of your environment
‒ Planning and design
‒ Prepare the subsystem for replication options
‒ Implementation: create and delete a test configuration; create the production configuration; integrate the production environment with Hitachi storage software
‒ Test and validate the installation
‒ Knowledge transfer
Don’t Pay the Appliance Tax!
SAN port explosion Appliance proliferation
Additional management tools Limited snapshot support
Per-appliance capacity pools Disruptive migrations
All of the above
With Appliances
Complexity Scales Faster Than
Capacity
Global-Active Device: Simplicity at Scale
Native, high-performance design Single management interface
Advanced non-disruptive migrations Simplified SAN topologies
Large-scale data protection support Full access to storage pool
All of the above
Avoid the
Appliance Tax With
Hitachi
Hitachi Global Storage Virtualization OPERATIONAL SIMPLICITY
ENTERPRISE SCALE
Questions and Discussion
Upcoming WebTechs (9 a.m. PT, 12 p.m. ET):
‒ The Rise of Enterprise IT-as-a-Service, October 22
‒ Stay tuned for new sessions in November

Check www.hds.com/webtech for:
‒ Links to the recording, the presentation, and Q&A (available next week)
‒ Schedule and registration for upcoming WebTech sessions
Questions will be posted in the HDS Community: http://community.hds.com/groups/webtech
Thank You