Fachhochschule Wiesbaden, Fachbereich Design Informatik Medien (Department of Design, Computer Science and Media), Computer Science degree programme

Master's thesis submitted in partial fulfilment of the requirements for the academic degree Master of Science (M.Sc.)

Design and Evaluation of Self-Management Approaches for Virtual Machine-Based Environments

Submitted by Dan Marinescu on 29 February 2008

First examiner (Referent): Prof. Dr. Reinhold Kröger
Second examiner (Korreferent): Prof. Dr. Steffen Reith
Declaration pursuant to BBPO, Sect. 6.4.2: I affirm that I wrote this Master's thesis independently and used no aids other than those indicated.

Wiesbaden, 29.02.2008
Dan Marinescu

I hereby declare my consent to the following forms of distribution of this Master's thesis:

Form of distribution                                      yes   no
Deposit of the thesis in the library of the FHW            √
Publication of the thesis title on the Internet            √
Publication of the thesis on the Internet                  √

Wiesbaden, 29.02.2008
Dan Marinescu
Contents

1 Introduction
2 Background
  2.1 Virtualization
    2.1.1 Taxonomy
    2.1.2 Case Study: Xen
    2.1.3 Live Migration
    2.1.4 Hardware-Assisted Virtualization
    2.1.5 Management of Virtual Machine Environments
  2.2 Service Level Management
  2.3 Autonomic Computing
    2.3.1 Motivation
    2.3.2 Taxonomy
    2.3.3 Architectural Considerations
    2.3.4 Examples of Autonomic Computing Systems
  2.4 Complexity Theory and Optimization
    2.4.1 Introduction to Complexity Theory
    2.4.2 Complexity Classes
    2.4.3 Optimization Problems
    2.4.4 The Knapsack Family of Problems
    2.4.5 Approximation Algorithms
    2.4.6 Heuristics
3 Analysis
  3.1 State of the Art
    3.1.1 Related Academic Work
    3.1.2 Limitations of the Previous Work
    3.1.3 Commercial Solutions
  3.2 Architectural Considerations
    3.2.1 A Formal Model of the Managed Infrastructure
    3.2.2 Framework Requirements
  3.3 The Resource Allocation Problem
    3.3.1 Controller Requirements
    3.3.2 Approaches
    3.3.3 Heuristics
    3.3.4 Approximation Algorithms
  3.4 A Bottleneck Determination Mechanism
    3.4.1 The Expert Systems Approach
    3.4.2 The Time Series Analysis Approach
  3.5 The Overprovisioning Problem
4 Design
  4.1 Architecture of the Framework
    4.1.1 The Basic Concept
    4.1.2 The Physical Manager
    4.1.3 The VM Manager
    4.1.4 The Cluster Manager
    4.1.5 The Complete Architecture
    4.1.6 The Intra-Manager Communication
    4.1.7 The Inter-Manager Communication
  4.2 The Bottleneck Determination Mechanism
  4.3 The Controller
    4.3.1 Algorithm A
    4.3.2 Algorithm B
5 Implementation
  5.1 Software Configuration of the Cluster
  5.2 The Framework
    5.2.1 The Self-Manager Backbone
    5.2.2 The Physical Manager
    5.2.3 The VM Manager
    5.2.4 The Cluster Manager
    5.2.5 The Inter-Manager Communication
  5.3 The Bottleneck Determination Mechanism
  5.4 The Cluster Controller
    5.4.1 Algorithm A
    5.4.2 Algorithm B
  5.5 Code Statistics
6 Evaluation
  6.1 The Testbed
  6.2 Evaluation of the Management Framework
  6.3 Evaluation of the Cluster Controllers
7 Summary
8 Bibliography
List of Figures

2.1 Hosted Architecture
2.2 OS-Layer Virtualization
2.3 Hardware-Layer Virtualization
2.4 The structure of a machine running Xen [BDF+03]
2.5 Xen Live Migration [CFH+05]
2.6 An exemplary lifecycle of a hardware-assisted VMM [Int05]
2.7 The Xen Management Architecture [Mel07]
2.8 Autonomic Element [KC03]
2.9 Complexity Classes
2.10 Search Space Example
2.11 The GA approach
3.1 Example of local search through the search space
3.2 Crossover example
3.3 Operation mode of an expert system
3.4 Time series example
4.1 The basic concept
4.2 The Physical Manager
4.3 The VM Manager
4.4 The Cluster Manager
4.5 The complete architecture of the framework
4.6 Modular Self-Manager architecture
4.7 Exemplary inter-manager and intra-manager communication
4.8 The internal architecture of the VM Logic module
4.9 The internal architecture of the Cluster Logic module
4.10 Exemplary limitation of Algorithm A
4.11 Algorithm B visual example
5.1 The complete architecture of the framework
5.2 The architecture of the Self-Manager
5.3 The Implementation of the Physical Manager
5.4 Using the Xen API to access the Xen hypervisor
5.5 The Implementation of the VM Manager
5.6 The Implementation of the Cluster Manager
5.7 VMProblemSolver UML class diagram
5.8 The VM Manager-to-Cluster Manager communication
5.9 ClusterMonitorUpdater UML class diagram
5.10 The Physical Manager-to-Cluster Manager communication
5.11 ClusterRequestProcessor UML class diagram
5.12 The Cluster Manager-to-Physical Manager communication
6.1 The testbed
6.2 Empirical observations
6.3 Influence of available memory on SLO violations
6.4 A scenario comprising two physical machines
6.5 A scenario comprising three physical machines
List of Tables

4.1 Exemplary VM-to-PM encoding of two parent-child states
5.1 Code statistics
List of Listings

5.1 Configuration file example
5.2 The Module interface
5.3 The properties of the Physical Manager Monitor module
5.4 Extract from PMMonitor.java, showing how the module properties are loaded
5.5 Extract from PMMonitor.java, showing how the parameters to be monitored are sent to the Xen module
5.6 Extract from PMMonitor.java, showing how the messages received from the Xen module are processed
5.7 The properties of the Physical Manager Actuator module
5.8 The setMemoryVM() method from PMActuator.java
5.9 The properties of the Physical Manager Xen module
5.10 Extract from the XenManager Python class
5.11 Extract from the XenHelper Java class
5.12 Extract from XenModule.java, showing how the received messages are processed
5.13 The properties of the SL Monitor module
5.14 The subscribeServiceParameters() method of the SL Monitor
5.15 Extract from VMActuatorModule.java, showing how a received message is processed
5.16 Extract from ClusterMonitor.java, showing how a message is sent to the Cluster Logic
5.17 The giveMemory() method
5.18 Instantiating and registering a remote object
5.19 Updating the global resource view
5.20 Delegating a task to a Physical Manager
5.21 Parameters loaded by the VM Logic module
5.22 Updating a TomcatParameter bean
5.23 Calling Jess from inside the VM Logic module
5.24 The Jess rule
5.25 The configuration parameters for the Cluster Logic
5.26 Algorithm A
5.27 Extract from ClusterState.java
5.28 The startKbfsAlgorithm() method
Chapter 1

Introduction

The first computing systems developed were large and expensive to operate. Due to an increasing demand for usage, these computers evolved into time-sharing systems (around 1965), which permitted multiple applications to run simultaneously. Yet this also allowed a faulty application to crash the entire system. The need to increase the reliability of the system led to the appearance of the virtual machine monitor (VMM) in the late 1960s. Also known as a hypervisor, a VMM is a software indirection layer between the hardware on one side and the operating system on the other. Back then, the VMM was primarily used to provide isolation between applications. Probably the best-known system of that time using a VMM was VM/370, developed by IBM.

The 1970s were flourishing years for virtualization technology, but the 1980s and 1990s brought a drop in hardware prices, which made it affordable to host one application per computing system. While in the 1970s computer architectures were virtualization-aware [PG74], the replacement of mainframes with minicomputers and PCs gave birth to new computer architectures that did not offer the necessary hardware support for VMMs [RG05]. As such, by the late 1980s, VMMs had become history. In the 1990s, VMMs were brought back to light by researchers at Stanford University, in a research project that led to the birth of VMware Inc.

Nowadays, virtualization is mainly used to reduce costs by consolidating a number of under-utilized server boxes onto a single server system, with almost every big name in the IT industry being involved in a virtualization project. With VMware leading the market, vendors like SWsoft and XenSource [1] all offer competitive alternatives. Also, long-established vendors like Sun [2] and Oracle have recently joined the market with their own Xen-based virtualization solutions. Not only are virtualization products being developed for the IA-32 architecture, but the architecture itself is being extended by both Intel (VT) and AMD (SVM) to support virtualization.

[1] recently acquired by Citrix
[2] note that Sun already provided a virtualization solution known as Solaris Zones or Solaris Containers

Besides reducing costs, it is the simplification of maintenance and system management that contributes to the fast adoption of virtualization in the data center. These factors have grown in importance over the past years, with system administrators avoiding products that fail to provide rich management capabilities.

Yet it has slowly become clear to the research community that the ever increasing complexity of computing systems will eventually lead to systems that are impossible for human beings to manage [KC03]. Even today, managing computing systems is expensive and prone to errors. This indicates that a change in the paradigm used to design computing systems needs to take place. The solution to this problem is to develop systems that are capable of managing themselves, called autonomic computing systems or self-* systems. This solution is inspired by biological systems existing in nature, particularly the human autonomic nervous system (ANS). The ANS is responsible for many low-level, yet vital body functions like heart rate or body temperature, without conscious control. Similarly, given high-level goals, computing systems should be able to manage themselves, with little or no human interaction.

Virtualization significantly increases the overall complexity of computing systems. Applications are now hosted inside virtual machines, which are themselves hosted on physical machines. This increased complexity and the dynamic nature of virtualization-based data centers justify the need for introducing self-management approaches in these environments. Although it increases the overall complexity, virtualization also offers a great basis for self-management. Features like the allocation of resources (memory, CPU shares and others) at run-time and the live migration of virtual machines are powerful control knobs that can be used by autonomic systems. Advanced monitoring and debugging are also possible due to the introduction of an indirection layer between the hardware and the operating system. What is missing is an intelligent controller, capable of taking decisions based on data obtained through monitoring, and then transforming these decisions into tasks executed by means of the previously mentioned control knobs. The goal of this thesis is to design and evaluate various approaches for such an intelligent controller.

The remainder of this thesis is organized as follows. In Chapter 2, background information with respect to virtualization, service level management and autonomic computing is provided, as well as a short introduction to complexity theory and optimization. This background information is necessary to understand the chapters that follow. Next, in Chapter 3, a state-of-the-art analysis with respect to the autonomic management of virtualization-based environments is presented. Based on this analysis, a set of requirements is defined for this thesis. Finally, an analysis of approaches that can be used to fulfil these requirements is presented at the end of that chapter.

Following this analysis, the design of a self-management framework is presented in Chapter 4. This framework is based on the principles of the autonomic computing paradigm and uses a service level management approach. Besides the management framework, the design of three "intelligent" components to be used in the framework is described. Both the framework and the "intelligent" components are based on the analysis performed in the previous chapter and fulfil the requirements defined there. In Chapter 5, the implementation of this design is presented. While the design is technology-independent, the implementation is Java-based and uses Xen as the virtualization technology. The concept and the implementation are subsequently evaluated in Chapter 6. For this purpose, a testbed was set up in which the "intelligent" components and the framework as a whole were tested. The thesis is concluded in Chapter 7, where a summary is presented and indications are given as to how the work of this thesis can be extended in the future.
Chapter 2

Background

2.1 Virtualization
This section provides background information regarding virtualization, starting with a taxonomy and followed by an in-depth analysis of Xen, an open-source virtualization solution.
2.1.1 Taxonomy
Smith and Nair proposed a taxonomy in which a distinction is made between process-level and system-level virtualization [SN05]. While process virtualization creates a virtual platform that executes a single process, system virtualization provides a complete environment that supports an operating system together with multiple user processes. For the purpose of this thesis, the focus will be on system-level virtualization; from now on, unless otherwise specified, by virtualization we mean system-level virtualization. In this context, virtualization is commonly defined as a technology that introduces a software abstraction layer between the hardware on one side, and the operating system and the applications running on top of it on the other side. This abstraction layer is called the virtual machine monitor (VMM) or hypervisor and essentially hides the physical resources of the computing system from the operating system (OS). Since the hardware resources are controlled by the VMM and not by the OS, it is possible to run multiple (possibly different) OSs in parallel on the same hardware. As a result, the hardware platform is partitioned into one or more logical units called virtual machines (VMs). The following requirements have been adapted from the original requirements for a VMM defined by Popek and Goldberg [PG74]:
• Equivalence: Running an application inside a virtual machine must be equivalent to running the same application on the underlying hardware.
• Control: The VMM must control and synchronize the access of VMs to the hardware resources.
• Isolation: VMs must be isolated from each other in order to ensure stability (the crash of one VM should not affect other VMs), security (a possibly compromised VM should not grant access to other VMs) and data consistency.
• Performance: The performance overhead caused by virtualization should be minimal, close to "bare metal" performance.
• Encapsulation: VMs must exist in a form (i.e. a file or a partition) which allows easy migration or cloning of the VM.

There are different approaches to virtualization, and there are different ways to classify them. From an architectural point of view, however, the various virtualization approaches can be categorized into:

• Hosted Virtual Machine Architecture
• OS-Layer Virtualization
• Hardware-Layer Virtualization

Note that for the purpose of this thesis, the focus will be entirely on the Intel IA-32 (x86) architecture.

Hosted Architecture

In this approach, the VMM runs on top of a host operating system, commonly as an application in userspace. The result is that, inside a VM, the applications and the guest OS run on top of virtual hardware provided by the VMM. This architecture is shown in Figure 2.1 and is commonly used in products dedicated to the desktop market. VMware Workstation [1], Parallels [2] and Virtual PC [3] are commercial desktop virtualization products which use this technique. The main advantage of this approach is that it is very easy to use. A common user can install a software product like VMware Workstation just like any other software product on their operating system [4]. Inside VMware Workstation, a guest OS [5] can be installed and used just as if it were running directly on hardware. The main disadvantage of this approach is its poor performance, which can be up to 30% lower than when running directly on hardware. Since usability is more important than performance in the desktop market, it is easy to understand why products using this approach have been so successful in the desktop sector. However, when it comes to servers, performance plays a key role, with ease of use and installation being less relevant. This is why the server market requires a different virtualization solution.

Figure 2.1: Hosted Architecture

[1] http://www.vmware.com/products/ws/
[2] http://www.parallels.com/
[3] http://www.microsoft.com/windows/products/winfamily/virtualpc/default.mspx
[4] currently VMware Workstation supports Microsoft Windows and several Linux distributions
[5] Microsoft Windows, Linux, Solaris x86, Netware and FreeBSD are supported
OS-Layer Virtualization

Also known as Single Kernel Image (SKI) or container-based virtualization, this concept implements virtualization by running several instances of the same OS in parallel. This means that not the hardware but the host OS is what is being virtualized. The resulting VMs all use the same virtualized OS image. This architecture is presented in Figure 2.2, where the virtualized OS image is called the virtualization layer.

Figure 2.2: OS-Layer Virtualization

Among the products that use this approach are Virtuozzo [6] and its open-source variant OpenVZ [7], Solaris Zones (Containers) [8], BSD Jails [9] and Linux VServer [10]. All these products are commonly used in Web hosting, high performance computing (HPC) clusters and grid computing. This thin architecture eases the administration of the system, but has a big drawback: since the VMs use the same kernel as the host OS, the guest OS must be the same as the host OS. As such, it is not possible to run e.g. Windows on top of Linux.

[6] http://www.swsoft.com/en/products/virtuozzo/
[7] http://openvz.org/
[8] http://www.sun.com/bigadmin/content/zones/
[9] http://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/jails.html
[10] http://linux-vserver.org/Welcome_to_Linux-VServer.org
Hardware-Layer Virtualization

This approach is commonly used in the server market due to its high virtual machine isolation and performance. Here, the VMM runs directly on the hardware, controlling and synchronizing the access of the guest OSs to the hardware resources. Figure 2.3 depicts this architecture. VMware ESX Server [11] and Xen [12] are the two main competing VMMs that use this approach.

Figure 2.3: Hardware-Layer Virtualization

Since the x86 architecture was not developed with virtualization in mind, and thus is not what Popek and Goldberg would refer to as a virtualizable architecture [PG74], new techniques were developed to implement CPU virtualization. Paravirtualization is the technique used by Xen: it provides a virtual machine interface representing a slightly modified copy of the underlying hardware, meaning that the nonvirtualizable portions of the original x86 instruction set are replaced with easily virtualized equivalents. The result is that either the guest operating systems running on top of this interface must be ported to use the slightly modified instruction set [RG05], or, more recently, the hardware architecture must support virtualization (Intel VT, AMD SVM). On the other hand, the applications running on top of the guest OS do not require any modifications. Although the cost of porting an OS to Xen is relatively low [BDF+03], this is still a big disadvantage of the paravirtualization approach. The approach used by VMware ESX avoids this problem by performing on-the-fly binary translation of privileged code (kernel code), which replaces the nonvirtualizable instructions with code that can be run directly on the CPU [AA06]. A caching system is then used to increase performance.

[11] http://www.vmware.com/products/vi/esx/
[12] http://xensource.com/
Process-level Virtualization Techniques

The previously mentioned classification covers system-level virtualization. While these techniques are used in both server and desktop virtualization, there are two process-level virtualization techniques that are widely used in software development and should be mentioned: emulation and high-level-language virtualization.

The emulation of one or more hardware devices, be it a printer, a CPU or even a complete computing system, represents the complete imitation of that device or set of devices. This technique is also referred to as interpretation, since it involves an interpreter program that fetches, decodes and emulates the executed guest instructions. While this clearly offers poor performance compared with other virtualization techniques [SN05], it is the only technique that makes it possible to write and run code written for one architecture (e.g. ARM) on top of a different architecture (e.g. Intel IA-32). Examples of emulation software are Bochs [13] and PearPC [14], but these are not the only ones.

High-level-language (HLL) virtualization has been used to provide cross-platform portability. An HLL VM does not emulate any specific hardware architecture but specifies its own virtual instruction set architecture (ISA). Source code written in such a language is compiled against the virtual ISA, resulting in binary code that can run on top of the HLL VM. This code is portable since it can run on all host platforms that implement the HLL VM. Examples of such VMs are the Java Virtual Machine (JVM) [15] and the Microsoft Common Language Runtime (CLR), which is the backbone of the .NET framework [16].

[13] http://bochs.sourceforge.net/
[14] http://pearpc.sourceforge.net/
[15] http://java.sun.com/
[16] http://msdn.microsoft.com/netframework/
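To make the HLL idea concrete, the following minimal Java program (an illustration added here, not taken from the thesis) is compiled by javac against the JVM's virtual ISA; the resulting bytecode runs unchanged on any host platform that provides a JVM.

// Minimal illustration of HLL virtualization: the compiled Portable.class
// is bytecode for the JVM's virtual ISA, not for any physical CPU, and the
// JVM maps it to the native instruction set of whatever host runs it.
public class Portable {
    public static void main(String[] args) {
        // The same class file executes on Linux, Windows or Solaris hosts.
        System.out.println("Running on: " + System.getProperty("os.name")
                + " / " + System.getProperty("os.arch"));
    }
}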
2.1.2 Case Study: Xen
Xen started as an open-source project at the University of Cambridge (UK) and, while its code base is still open source, it has grown into a commercial product marketed by XenSource. At the end of October 2007, Citrix acquired XenSource. Besides Citrix, vendors like Virtual Iron, Sun, Oracle and, more recently, Microsoft also provide Xen-based virtualization solutions. This subsection describes how Xen approaches virtualization.
The Virtual Machine Interface

As previously mentioned, Xen uses a technique called paravirtualization. The difference between paravirtualization and full system virtualization is that, while full system virtualization provides an identical copy of the underlying hardware, paravirtualization provides a slightly different copy of the hardware. This modified copy of the hardware is referred to as the virtual machine interface (VMI). Since the VMI differs from the hardware interface for which common operating systems like Linux or Windows were designed, running such an OS on top of the VMI requires applying some modifications to the respective OS. A modified guest OS communicates with the hypervisor via so-called hypercalls. In an operating system, a system call is a software trap from the application to the kernel; similarly, a hypercall is a software trap from the guest OS to the hypervisor. The VMI consists of three main subsystems: memory management, CPU and device I/O. The following paragraphs present how Xen virtualizes these subsystems, as described in [BDF+03]; the focus is mostly on how Xen virtualizes memory and the CPU.
Virtualizing Memory is considered to be the most difficult part of paravirtualizing the x86 architecture [BDF+03], since it requires both the development of complex mechanisms in the hypervisor and modifications to the guest OS. The approach used by Xen is to provide the guest OS with direct read-only access to the hardware page tables, while page table update requests are passed to Xen via a hypercall. It is then the job of the hypervisor to validate these updates to ensure safety. Since entering the hypervisor through hypercalls is an obvious performance overhead compared to direct updates, a guest OS may queue updates before applying an entire batch with a single hypercall. To avoid a TLB flush when entering and leaving the hypervisor, the top 64 MB of each address space are reserved for Xen and cannot be accessed by any guest OS. However, this is no problem for the applications running on the guest OS, since none of the common x86 application binary interfaces (ABIs) uses this address region. Just like VMware ESX Server, Xen provides a balloon driver [Wal02] for its guests, with the help of which the memory allocation of a domain can be dynamically adjusted. This is done by passing memory pages back and forth between the hypervisor and the page allocator of the guest OS. A threshold can be specified beyond which the memory capacity of a domain can no longer be extended.
Virtualizing CPU implies inserting a hypervisor below the OS and thus violates the common assumption that the OS is the most privileged part of the system. To guarantee domain isolation, the guest OS must run at a lower privilege level. Since the x86 architecture provides four distinct privilege levels (referred to as rings 0-3), the code of a Xen guest OS, originally running in ring 0, must be ported to run in ring 1, while Xen becomes the most privileged entity of the system by running in ring 0. This is possible because none of the modern OSs use rings 1 and 2, with applications executing in ring 3. Because the ported guest OS runs in ring 1, privileged instructions cannot run directly on the CPU and need to be paravirtualized and executed by Xen.

Since several domains can run concurrently on the same CPU, they need to be scheduled. Three scheduling mechanisms have been developed for Xen: Borrowed Virtual Time (BVT) [DC99], Simple Earliest Deadline First (SEDF) and the Credit Scheduler. According to the Xen wiki [17], the first two have shown several limitations and have been replaced by the Credit Scheduler, which is currently the default scheduler in Xen. The Credit Scheduler is a proportional fair-share scheduler. Each virtual machine, called a domain in Xen terminology, is assigned a weight and a cap. The weight of a domain determines its relative share of CPU time: when a host is contended, a domain with a weight of 4 will get twice as much CPU time as a domain with a weight of 2 (legal weights range from 1 to 65535). The cap determines the maximum amount of CPU a domain is allowed to consume, independently of CPU utilization. This parameter is optional and means that, even if the CPU has idle cycles, a domain with a cap value of, say, 50 is allowed to consume at most 50% of the CPU time. On a symmetric multiprocessing (SMP) host, a cap value of, say, 200% means that the respective domain can fully use at most two CPUs.

The Credit Scheduler uses the notion of a virtual CPU (VCPU) to represent the CPU of a virtual machine. If not specifically restricted, a VCPU can run on any physical CPU. A physical CPU can host several VCPUs, which are held in a run queue managed by that CPU. An accounting thread recomputes every 30 ms the amount of credits each VCPU has earned; when a VCPU runs, its credits are consumed. The VCPUs in the queue of a CPU are ordered by their priority. A VCPU can have one of two priorities: over or under. A priority of over indicates that the VCPU has a negative amount of credits, while a priority of under indicates a positive amount of credits. On an SMP host, VCPUs are load-balanced across the physical CPUs: when no VCPUs with a priority of under are located in the run queue of a CPU, that CPU looks for VCPUs with a priority of under in the run queues of the other physical CPUs of the SMP host. This guarantees fair scheduling across all domains. When no such VCPU is found, the next to run is a VCPU with a priority of over from the local run queue. When the run queue of a physical CPU is empty, the queues of the other physical CPUs are searched for VCPUs that request CPU time.

Virtualizing Device I/O is done in Xen with the help of the privileged control domain, called Domain 0. This domain contains the original, unmodified device drivers, which access the hardware through the hypervisor. The other, unprivileged domains use paravirtualized drivers. This is done by modifying the kernel to use virtual drivers that communicate with the original drivers in Domain 0. For example, when an unprivileged domain wants to write something to its hard disk, the modified kernel will recognize that this is a write operation on a block device and use its virtual block device driver.

[17] http://wiki.xensource.com/xenwiki/
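To illustrate the weight/cap semantics described above, the following small Java sketch (an illustration only, neither Xen code nor part of the thesis implementation; the domain names and numbers are invented) computes the CPU time each domain would receive on a single fully contended CPU.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the Credit Scheduler's weight/cap semantics:
// CPU time is shared in proportion to the weights, and a domain never
// exceeds its cap (given in percent of one CPU). The real scheduler also
// redistributes time a capped domain leaves unused; this sketch omits that.
public class CreditShareExample {
    public static void main(String[] args) {
        Map<String, int[]> domains = new LinkedHashMap<>();
        domains.put("dom1", new int[]{256, 100}); // weight 256, cap 100%
        domains.put("dom2", new int[]{512, 100}); // weight 512, cap 100%
        domains.put("dom3", new int[]{256, 20});  // weight 256, cap 20%

        double totalCpu = 100.0; // one fully contended physical CPU
        int totalWeight = domains.values().stream().mapToInt(d -> d[0]).sum();

        for (Map.Entry<String, int[]> e : domains.entrySet()) {
            double proportional = totalCpu * e.getValue()[0] / totalWeight;
            double granted = Math.min(proportional, e.getValue()[1]);
            System.out.printf("%s: %.1f%% of the CPU%n", e.getKey(), granted);
        }
    }
}

With these numbers, dom2 receives 50% and dom1 25% of the CPU, while dom3, although entitled to 25% by its weight, is limited to 20% by its cap.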
Figure 2.4: The structure of a machine running Xen [BDF+03]
This driver is nothing more than a communication interface to the original driver located in Domain 0. The write operation is then performed by the driver in Domain 0, with the hypervisor being responsible for determining the hardware address of the hard disk.

Control and Management. When Xen was developed, one of the goals was to separate mechanism from policy. As a result, the hypervisor itself provides only the basic control operations, exporting them by means of a control interface to the authorized domain, Domain 0. This domain is created at boot time and is the place where the application management software can be hosted. This architecture is presented in Figure 2.4. Through the control interface, Domain 0 is able to create and terminate other domains, adjust scheduling parameters and memory allocation, and create virtual network interfaces and block devices.

2.1.3 Live Migration

Migrating the workload from one physical computer to another can be very useful: administrators can then easily perform load balancing, fault management and hardware maintenance. In [HJ04], the authors point out three approaches to workload migration. Process migration was a popular technique in the 1980s to achieve workload migration by the guest OS [pro00]. However, because of several limitations, process migration has not been widely used. The most prominent limitation is that process migration suffers from dependencies on the source host, which must remain online to handle certain system calls or memory accesses from the migrated process. These dependencies are called residual dependencies. The second approach is object mobility. This only works in complex frameworks like CORBA, J2EE and .NET, yet porting all legacy software to these frameworks is not realistic since it would be a very time-intensive job. With the emergence of x86 virtualization, a third approach became possible: VM migration. This can be implemented either in the hypervisor or in the OS itself, in the latter case being called self-migration [HJ04]. The following describes the Xen approach to VM migration, as presented in [CFH+05].

The Xen live migration mechanism was designed to enable the migration of a VM between two physical hosts located in the same LAN. The performance of this approach depends on two criteria:

• the downtime, representing the amount of time for which the VM is offline (suspended)
• the total migration time, representing the total amount of time needed for the migration

Since it is not possible to minimize both downtime and total migration time, a trade-off must be found between the two. There are three phases that can be part of the migration process:

• the push phase, where pages are copied across the network from the source VM to the destination VM while the source machine keeps running
• the stop-and-copy phase, where the source VM is suspended while pages are copied to the destination VM
• the pull phase, where the destination VM starts and, if it accesses a page that has not been copied yet, pulls that page from the source VM over the network

OS migration techniques can be differentiated depending on which of these phases are part of the memory migration process. For example, in pure stop-and-copy, the VM is halted and all pages are copied to the destination VM, which is then started. While easy to implement, this approach obviously does not fit a scenario where a live service (for example a streaming server) is hosted inside the VM. With pure demand-migration, this problem is solved by copying only the essential kernel data structures to the destination during the stop-and-copy phase and then entering the pull phase. While the downtime is reduced this way, the total migration time is very long and the performance after migration can turn out to be unacceptable.

The approach used by Xen is called pre-copy and combines an iterative push phase with a stop-and-copy phase. The functioning of this approach is shown in Figure 2.5 and can be split into six stages. First, a destination host must be selected (Stage 0) and a VM container must be initialized on that host (Stage 1). Then, the push phase is applied iteratively, copying at first all pages and then only the dirty pages from the source VM to the destination VM (Stage 2). The number of rounds is limited based on a statistical analysis. In Stage 3, the source VM is suspended and the CPU state and the remaining dirty pages are copied to the destination VM. To ensure protection against hardware failures, Xen uses a transactional model for the migration process: in Stage 4, the destination physical machine commits that it has received a consistent OS image. The source VM is discarded and then, in Stage 5, the VM on the destination host resumes normal operation.

Figure 2.5: Xen Live Migration [CFH+05]

The two other resources besides memory that need to be migrated are network and storage. A migrated VM offering a live service needs to maintain the same IP address that it had before migration. This is done by broadcasting an unsolicited ARP reply to advertise the new physical location (the MAC address of the destination host). Xen does not handle the migration of local storage; instead, network-attached storage has to be used.
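The staged pre-copy idea can be sketched as follows. This is an illustrative simulation of the decision logic only, added here for clarity; the page counts, thresholds and round limit are invented, and it is neither Xen code nor part of the thesis framework.

import java.util.Random;

// Illustrative simulation of iterative pre-copy (Stages 2 and 3): keep
// pushing dirty pages while the VM runs, and switch to the short
// stop-and-copy phase once the dirty set is small or a round limit is hit.
public class PreCopySketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int dirtyPages = 262144;           // pages still to transfer (~1 GB of 4 KB pages)
        final int stopAndCopyLimit = 2000; // switch when the dirty set is this small
        final int maxRounds = 30;          // bound the number of pre-copy rounds

        int round = 0;
        while (dirtyPages > stopAndCopyLimit && round < maxRounds) {
            int copied = dirtyPages;                     // push the current dirty set
            int redirtied = rnd.nextInt(copied / 4 + 1); // pages dirtied again meanwhile
            dirtyPages = redirtied;
            round++;
            System.out.printf("round %d: copied %d pages, %d dirtied again%n",
                    round, copied, dirtyPages);
        }
        // Stage 3: suspend the VM and copy the small remainder plus the CPU state.
        System.out.printf("stop-and-copy: %d pages transferred while suspended%n",
                dirtyPages);
    }
}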
2.1.4 Hardware-Assisted Virtualization
As already mentioned in Chapter 1, both Intel (VT) and AMD (SVM) have extended the IA-32 architecture to support virtualization. Although the extensions from Intel and AMD are not compatible with each other, they are very similar [AMD05, Int05]. What follows is an overview of the Intel VT-x architecture.

With VT-x, Intel introduced a new form of processor operation named VMX. This form of operation is entered by executing the VMXON instruction and left by executing the VMXOFF instruction. There are two VMX execution modes: VMX root and VMX non-root. Both of these modes support all four (0-3) IA-32 privilege levels, which means that legacy OSs can run without modifications with respect to the privilege level for which they were designed. The VMX root mode is similar to IA-32 without VT-x; it is the VMM that is supposed to run in this mode. The VMX non-root mode is where the guest OS should run, because software running in this mode is restricted in certain ways, regardless of the privilege level it is running in.

Figure 2.6: An exemplary lifecycle of a hardware-assisted VMM [Int05]

The two VMX execution modes can be entered by means of two VMX transitions:

• VM Entry: from VMX root to VMX non-root (VMM to guest OS)
• VM Exit: from VMX non-root to VMX root (guest OS to VMM)

The exemplary lifecycle of a hardware-assisted VMM is presented in Figure 2.6. A virtual machine control structure (VMCS) is used to manage VMX transitions and VMX non-root operation. This data structure is divided into several sections. One of them, the guest state section, is used to save and load guest state (the virtual CPU state) at VM exits
1-3
Chapter 2. Background
2.1. Virtualization
respectively VM entries. Similarly, the host state section is used to load host state at VM exits. In the VMX non-root mode, many instructions and events can cause VM exits. While many instructions exit unconditionally, other instructions, interrupts and events can be configured to do so. This is achieved by using the VM-execution control fields in the VMCS, which provides support for VMMs to implement flexible virtualization strategies by specifying which instructions, interrupts and events should cause VM exits [UNR+ 05]. To optimize the performance of a VMM, the frequency of VM exists must be reduced [AA06].
2.1.5
Management of Vitual Machine Environments
With virtualization now making its place in the data centers, the management of virtualized environments is becoming an important issue. However, managing virtualizationbased environments nowadays generally means that one has to use simple, GUI-based tools. An integration with traditional platforms like IBM Tivoli18 and HP Open View 19 is still under development. In this context, the management capabilities of Xen and VMWare ESX are described. This involves both management interfaces provided by the two VMMs, as well as existent management solutions that use these management interfaces.
2.1.5.1
Management Interfaces
Xen has recently finished its 1.0 version of the Xen-API20 and included it in the Xen 3.1 release. This management API is aiming at integrating systems running Xen with enterprise management solutions. Figure 2.7 shows the Xen management architecture. On top of the Xen hypervisor runs Xend, the Xen daemon, which controls and supervises the execution of the unprivileged domains. One of the requirements of the Xen-API was to allow remote access and control. The result was the Xen-RPC layer which consists of a set of RPCs that use a wire format based on XML-RPC and runs on top of Xend. Based on Xen-RPC, bindings for various languages have been developed or are in the process of being developed. Bindings for Python and C (libvirt) are already available. Although the Xen-API already reached version 1.0, the integration with enterprise management solutions still requires the development of a CIM integration interface. The Xen18
http://www.ibm.com/tivoli http://www.managementsoftware.hp.com/ 20 http://wiki.xensource.com/xenwiki/XenApi 19
17
2.1. Virtualization
Chapter 2. Background
!"#$%&'"("#)$*+,-.)",)/+" !"#69C%
>(
>"#.#D1 ?C#'&+2 %":E'@
F&:&6=&0"2
AB>"#
789HH!%G
*A&,-" !%G6789
:.+)6(&'"+ 5.=:.+)
5.=>"#$?9$=.#2.#'0@
!"#6789$1:"+$3;;8$1+$3;;8< !"#2 !"#0)1+" !
!"#
310) 4).5.)."0
!
Figure 2.7: The Xen Management Architecture [Mel07]
CIM project21 works on an open source implementation of a DMTF22 CIM provider for Xen which uses the Xen-API. The DMTF System Virtualization, Partitioning, and Clustering Working Group (SVPC WG) is a standardization body which is currently working on defining models for virtualization and resource allocation which are not directly bound to any product. VMware has already developed several management APIs. For managing a VMware infrastructure, two APIs are available: the VMware infrastructure SDK23 provides developers of third-party tools a Web Service interface to the Virtual Center, which controls the ESX Server. The other infrastructure API is the CIM SDK24 which allows managing the ESX Server via a CIM interface. Besides these APIs, a guest SDK25 can be used to monitor guests running inside an ESX Server. The management capabilities of ESX Server clearly overpower the ones of Xen, at least for the time being. 2.1.5.2
Management Software
In the following paragraphs, several management software products used for the management of virtual machine-based environments are described. Like before, the focus is only on management tools for Xen and VMware ESX. Note that only the products offering a larger amount of features have been selected. Enomalism26 is an open source, web-based management tool for Xen virtual machines. 21
http://wiki.xensource.com/xenwiki/XenCim http://www.dmtf.org/ 23 http://www.vmware.com/support/developer/vc-sdk/ 24 http://www.vmware.com/support/developer/cim-sdk/ 25 http://www.vmware.com/support/developer/guest-sdk/ 26 http://www.enomalism.com/ 22
18
Chapter 2. Background
2.1. Virtualization
The feature list of Enomalism includes: • Monitoring interface • Resource management (like for example dynamic memory allocation) • Live migration • Disk management tools • SSH client and VNC virtual desktop • Virtual Appliance Management (package management) • An open API for integration with third party products Virt-manager27 is the open source virtual machine manager developed by Red Hat. Although originally designated to manage Xen systems, Virt-manager can also be used to manage Qemu28 and KVM29 guests. This is possible because Virt-manager was build on top of libvirt30 , a generic virtual machine management API. The list of features is quite similar with the ones of Enomalism, with few notable exceptions like the lack of disk management tools or support for live migration. Virtual Iron31 is a commercial virtualization product that is based on the Xen hypervisor. The Virtual Iron Virtualization Manager includes a web-based GUI for the management of a virtual environment. The main management feature of Virtual Iron is its policy-driven resource and workload management capability. This allows the automatic management of resources and makes the difference between the Virtual Iron Virtualization Manager and products like Enomalism and Red Hats Virt-Manager. On the other hand, the fact that it is not open-source and its free version is stripped of some features makes this product less attractive as it would otherwise be. Novell32 has been intensifying its work lately in the field of virtualization. As such, its release of the SUSE Linux Enterprise Server 10 SP1 includes a broad support for managing and provisioning Xen virtual machines. Also its management product, Novell ZENworks33 , has been extended with virtualization management capabilities. The 27
27 http://virt-manager.et.redhat.com/
28 http://fabrice.bellard.free.fr/qemu/
29 http://kvm.qumranet.com
30 http://libvirt.org/
31 http://www.virtualiron.com/products/virtualization.cfm
32 http://www.novell.com/linux/virtualization/
33 http://www.novell.com/products/zenworks/
Novell ZENworks Orchestrator addresses the automatic life-cycle management of virtual machines. XenEnterprise34 is a commercial product offered by Citrix. It is built on top of Xen and adds features like easy installation, increased performance and a management console. The management capabilities of XenEnterprise are mainly focused on monitoring, life-cycle management and resource management of CPU, memory, disk and network I/O for QoS. VMware's Virtual Center35 was designed to manage a VMware virtual infrastructure. Its feature list includes support for provisioning, monitoring, automation through task scheduling and alerting, automated optimization and live migration. As before, VMware seems to be one step ahead of its competitors.
2.2 Service Level Management
The existence of an IT department inside a company has been common ever since information technology started to play a role in everyday business. Nowadays, IT departments are present in every big and mid-sized company, and even in some small ones. The job of an IT department is to provide IT services inside the company and, in some cases, also to external customers. In the last years, however, a trend towards outsourcing IT services can be observed. This is because companies have started depending on a large variety of IT services, and having an IT department that provides such a variety of services has proven to be very costly. To reduce these costs, companies prefer to outsource certain services to specialized third parties. For example, companies have become dependent on their Web presence. In many cases, downtime translates into losing money. A certain degree of reliability is required, and this is known to be costly. In such cases, many companies choose to outsource their Web services to Web hosting providers, in a move aimed at reducing costs. When a company outsources IT services to a specialized third party, the company needs a certain guarantee that these services are working well. However, the company is not interested in how many resources (e.g. memory, CPU, network bandwidth) have been assigned to a certain service, or in how the service has been implemented, but in how well the service performs. As such, a so-called service level management approach is required.
34 http://xensource.com/products/xen_enterprise/index.html
35 http://www.vmware.com/products/vi/vc/
In the following, the core terms of service level management are defined, based on the taxonomy described in [Deb04].
• The term Quality of Service (QoS) stands for the quality with which a service is provided, and is described by a set of QoS parameters.
• A QoS parameter is a quantifiable and measurable indicator used to evaluate the quality of a service. Various QoS parameters exist; of relevance for the purpose of this thesis are response time and throughput.
• Response time is a QoS parameter and represents the period of time between the moment in which a user makes a call to a service and the moment in which this call ends, i.e. the user receives an answer corresponding to this call.
• Throughput is a QoS parameter and represents the number of calls successfully performed per time unit.
• A Service Level Objective (SLO) defines a range of values for one or more QoS parameters, subject to an agreement.
• A Service Level Agreement (SLA) represents a contract between a service provider and a service consumer. An SLA defines the services to be provided, a set of QoS parameters for these services and the respective SLOs.
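To make these terms more concrete, the following sketch shows one possible way of representing SLOs and SLAs and of checking measured QoS values against them. The data structures and parameter names are illustrative assumptions only; they are neither part of a standard nor of the framework developed later in this thesis.

```python
# Illustrative sketch (not from the thesis): an SLA as a set of SLOs over
# measurable QoS parameters, plus a check of measured values against it.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SLO:
    qos_parameter: str   # e.g. "response_time_ms" or "throughput_rps"
    min_value: float     # lower bound of the agreed range
    max_value: float     # upper bound of the agreed range

    def is_met(self, measured: float) -> bool:
        return self.min_value <= measured <= self.max_value

@dataclass
class SLA:
    service: str
    slos: List[SLO]

    def violations(self, measurements: Dict[str, float]) -> List[SLO]:
        # An SLO counts as violated if a measurement exists and lies outside its range.
        return [slo for slo in self.slos
                if slo.qos_parameter in measurements
                and not slo.is_met(measurements[slo.qos_parameter])]

# Example: response time must stay below 500 ms, throughput above 100 requests/s.
sla = SLA("webshop", [SLO("response_time_ms", 0.0, 500.0),
                      SLO("throughput_rps", 100.0, float("inf"))])
print(sla.violations({"response_time_ms": 620.0, "throughput_rps": 140.0}))
```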
2.3 Autonomic Computing

2.3.1 Motivation
In the last decades, advances in computer hardware and software have led to an ever increasing complexity of computing systems. Nowadays, this complexity seems to be reaching the limits of human capability [KC03]. Managing computing systems is becoming costly and prone to errors, and in the future this complexity will go beyond the capabilities of human experts. In [GC03], Ganek et al. back this assumption with facts: according to Nick Tabellion, the CTO of Fujitsu Softek, for every dollar spent on storage, nine further dollars are spent to manage it. Furthermore, over 40% of computer system outages are caused by operator errors due to the complexity of the system. Knowing this, it is clear that a change in the paradigm used to design computing systems needs to take place.
While computing systems are only now reaching this level of complexity, in nature biological systems of high complexity have existed for millions of years. Researchers looked for inspiration in these biological systems and found the human autonomic nervous system (ANS). The ANS is responsible for many low-level, yet vital functions, such as the heart rate or the body temperature. These functions are managed without conscious control. Similarly, the autonomic computing paradigm [Hor01] states that computing systems should be able to manage themselves, based on a set of high-level goals defined by the system administrator. Thus, tasks like installation, configuration, optimization and maintenance should be performed by the computing system without human interaction.
2.3.2 Taxonomy
In many cases, when a new field of research gets the attention of the research community, a lack of unity can be observed with respect to terms and definitions. What IBM calls autonomic computing is more or less what Intel calls proactive computing [Ten00], whereas under the term organic computing [MSMW04] the German research community addresses the same problems but focuses more on hardware-related aspects. For the purpose of this thesis, the IBM taxonomy will be used as a backbone and extended with complementary terms and definitions where necessary. Self-management denotes the process by which computing systems manage themselves. This is considered to be the essence of autonomic computing. There are various aspects of self-management. In [KC03], Kephart and Chess define four such aspects:
• Self-configuration represents the property of a computing system to configure itself automatically, based on a high-level policy. This way, the job of the system administrator is to provide this high-level policy, while the system is responsible for the installation and configuration processes. New components added to the system must also be incorporated by the system in a seamless way.
• Self-optimization represents the property of a computing system to continuously seek to improve its operation. This is achieved by dynamically adjusting its parameters to better adapt to the changes taking place in the system.
• Self-healing represents the property of a computing system to detect, diagnose and repair local problems. These problems can be caused by both software and hardware components.
• Self-protection represents the property of a computing system to defend itself in the face of malicious attacks or cascading failures. Besides protecting itself against problems that have already taken place, the system should also be capable of foreseeing possible problems based on logs and reports.
There are other self-management properties besides these four, which can be found in various publications. Terms like self-monitoring, self-inspection, self-testing, self-repairing and self-organization have all been used in academic research, and are referred to as self-* or self-X properties. A term-frequency statistic for each of these self-X properties is provided in [FZ05]. One frequently appearing term is self-organization. Self-organization is a highly interdisciplinary term: definitions for self-organization have been provided by biologists, physicists, chemists, philosophers and, recently, computer scientists. The latter define self-organization as the capability of a computing system to adapt itself to a dynamic environment without outside control. The presence of a central coordination component in such systems is generally excluded. Self-organization is considered to be a special property of a self-managing system [HMG04].
2.3.3 Architectural Considerations
Implementing autonomic computing imposes new architectural considerations. Autonomic systems are seen as interactive collections of autonomic elements. The autonomic element represents the basic building block of the system. Such an element is able to manage both its internal behaviour and its relationships with other autonomic elements. Together, the autonomic elements collaborate to meet the goals set by a higher-ranked autonomic element or by a human system administrator. From a structural point of view, an autonomic element is composed of one or more managed elements and exactly one autonomic manager. The autonomic manager controls the managed elements. This structure is shown in Figure 2.8. The autonomic manager monitors the managed element, analyzes the obtained information and, based on its knowledge base, constructs and executes plans that modify the behavior of the managed element. The roadmap towards autonomic computing foresees both the development of new managed elements together with their corresponding autonomic managers, and the development and attachment of autonomic managers to already existing manageable elements. Every autonomic system has a specific behaviour. This behaviour can be seen as a collection of functions, each implemented by a specific autonomic element. However, to facilitate the system-level behaviour, additional autonomic elements are required [?].
Figure 2.8: Autonomic Element [KC03]

These additional elements are called infrastructure elements and support the operation of the autonomic system. Examples of such infrastructure elements are:
• registry: used by autonomic elements to find each other
• sentinel: provides monitoring services to autonomic elements
• aggregator: combines the functions of several autonomic elements in one
• broker: facilitates the interaction between autonomic elements

2.3.4 Examples of Autonomic Computing Systems

Although a relatively new field of research, autonomic computing approaches are already being implemented. The SMART DB2 project started by IBM aims to make the DB2 database management system autonomic [LL02]. The project focuses mostly on properties like self-optimization and self-configuration. The OceanStore project at Berkeley works on a highly available data store, dealing with aspects like self-healing, self-optimization, self-configuration and self-protection [KBC+ 00]. Also at IBM, the Oceano project [IBM] targets the management of computing resources for software farms. In the Distributed Systems Lab of the University of Applied Sciences Wiesbaden, a Self-Manager framework was developed [Sch07, SG06].
Many other examples exist. This shows that, when it comes to designing computing systems, autonomic computing is acknowledged as a viable and, in the future, necessary paradigm. We therefore expect to see more and more systems equipped with such autonomic computing capabilities.
2.4 Complexity theory and Optimization

2.4.1 Introduction to Complexity Theory
Complexity theory is a branch of theoretical computer science which analyzes the difficulty of providing efficient algorithms for computational problems and the amount of resources required by the execution of an algorithm (e.g. processing time, working memory and others). The introduction to complexity theory provided in this section is mainly based on [ACK+ 00, Vaz01, KPP04]. Much of complexity theory deals with decision problems. A decision problem is formally defined by the characteristic function f : I → S, where I represents the set of problem instances and S = {0, 1}. Informally, a decision problem can be described as a question with a “yes” or “no” answer. Thus, based on the input of a problem, one can distinguish between instances that yield a “yes” and instances that yield a “no” answer. This distinction is made because, in general, the proof for one of the answers is easier to verify than for the other. For example, consider the problem of determining whether a natural number is prime. This is a decision problem where a proof for a “no” answer (a non-trivial divisor) can be verified more easily than a proof for a “yes” answer. A decision problem X in which the “yes” and “no” answers of a decision problem Y are reversed is called the complement of Y.
2.4.2 Complexity Classes
Computational problems can be categorized into different classes, depending on their complexity. Thus, a complexity class represents a set of problems of similar complexity. Decision problems are commonly categorized into two main classes: P and NP.
• P is the class of all decision problems for which a deterministic machine exists that solves the problem in O(n^k), for some k ∈ N (polynomial time).
• NP is the class of all decision problems for which a non-deterministic machine exists that solves the problem in O(n^k), for some k ∈ N.
As can be observed from the definitions above, P is a subset of NP. Although not proven, it is commonly believed that P ≠ NP and thus P ⊂ NP. The most difficult problems in NP are the ones that are most likely not in P; they are called NP-complete. The class of NP-hard problems consists of those problems that are at least as hard as the
problems in NP, i.e. any problem in NP can be reduced to them in polynomial time. As shown in Figure 2.9, the set of NP-hard problems is not a subset of NP; the NP-complete problems are exactly the NP-hard problems that are in NP.
Figure 2.9: Complexity Classes
2.4.3 Optimization Problems
In complexity theory, optimization problems have always been among the most studied subjects. This is because of the practical relevance of optimization problems in many areas, for example networking, stock cutting, artificial intelligence, resource scheduling and many others. An optimization problem P can be defined as a quadruple P = {I, Sol, m, goal}, where:
• I is the set of instances of P;
• Sol is a function that associates to any x ∈ I the set of feasible solutions to x;
• m is the measure function that associates a positive integer to each pair (x, y) with x ∈ I and y ∈ Sol(x);
• goal ∈ {MIN, MAX} specifies whether P is a minimization or a maximization problem.
Informally, solving an optimization problem means finding the best solution(s) from the set of all feasible solutions. For each optimization problem, a corresponding decision problem exists.
If the optimization problem is a minimization problem, the decision problem asks for the existence of a pair (x, y), with x ∈ I, y ∈ Sol(x) and m(x, y) ≤ K, where K is a positive integer. Analogously, for a maximization problem the decision problem asks for the existence of a pair (x, y), with x ∈ I, y ∈ Sol(x) and m(x, y) ≥ K, where K is a positive integer. If the decision problem corresponding to the optimization problem is in NP, then the optimization problem is an NP optimization problem, or in NPO.
2.4.4 The Knapsack Family of Problems
Among the most studied optimization problems are the knapsack problems. The original knapsack problem (KP) can be informally formulated as follows: consider a hiker who plans to go on a tour. He has a rucksack (knapsack) with a limited capacity c and a set of available items, each having a weight w_i and a benefit p_i that it brings to the hiker. The hiker is looking for a subset of these items whose total weight does not exceed the capacity of the rucksack and whose total benefit is maximized. Formally, the 0-1 knapsack problem is formulated as the following maximization problem:
maximize      \sum_{i=1}^{n} p_i x_i

subject to    \sum_{i=1}^{n} w_i x_i ≤ c                                      (2.1)

              x_i ∈ {0, 1},   i = 1, . . . , n

The binary variable (hence "0-1" in the name of the problem) x_i equals 1 if item i is included in the knapsack, and 0 otherwise. Close variants of the 0-1 knapsack problem are the bounded knapsack problem (BKP) and the unbounded knapsack problem (UKP). The BKP defines a bound on the number of identical items that can be packed in the knapsack, while the UKP defines no such limit. Another variant of the KP with high practical relevance is the subset sum problem (SSP). The SSP is a particular case of the KP where p_i = w_i, and has applications in logistics, cutting and packing. In many situations, the items to be packed in a knapsack have more than one dimension. This leads to the d-dimensional knapsack problem (d-KP), also called the multidimensional knapsack problem. This problem is formally defined by:
maximize      \sum_{i=1}^{n} p_i x_i

subject to    \sum_{i=1}^{n} w_{ji} x_i ≤ c_j,   j = 1, . . . , d             (2.2)

              x_i ∈ {0, 1},   i = 1, . . . , n

In other situations, several knapsacks are available, in which one-dimensional items must be packed. This problem is called the multiple knapsack problem (MKP) and is formally defined by:
maximize      \sum_{j=1}^{m} \sum_{i=1}^{n} p_i x_{ji}

subject to    \sum_{i=1}^{n} w_i x_{ji} ≤ c_j,   j = 1, . . . , m

              \sum_{j=1}^{m} x_{ji} ≤ 1,   i = 1, . . . , n                   (2.3)

              x_{ji} ∈ {0, 1},   j = 1, . . . , m,   i = 1, . . . , n

where m is the number of knapsacks, n the number of items, and x_{ji} = 1 if item i is assigned to knapsack j and x_{ji} = 0 otherwise. A special case of the MKP of high relevance is when all the knapsacks have the same capacity. This problem is called the bin packing problem (BP). Another special case of the MKP arises when a cardinality constraint is added to the list of constraints. Informally, this means that no more than k items can be packed in a knapsack. The resulting variant is called the k-item multiple knapsack problem (kMP), or, in the special case of identical knapsacks, the k-item bin packing problem (kBP). When the generalizations of the d-KP and the MKP are applied at the same time, the multidimensional multiple knapsack problem (MMKP) results. This problem has many applications in the field of cargo packing and load scheduling. There are many other variants of the knapsack problem: the multiple subset sum problem, the quadratic knapsack problem, the multiple choice knapsack problem and others (see [KPP04]). They are, however, of less relevance for the purpose of this thesis.
An important property of the knapsack problems is their complexity: all problems in the knapsack family are known to be NP-hard [Pis95], and even the basic KP is NP-complete (in its decision version).
2.4.5 Approximation Algorithms
Assuming that P ≠ NP, there is no algorithm that finds an optimal solution to an NP-hard problem in polynomial time [Kan92, ACK+ 00]. In practice, however, it is not always necessary to find an optimal solution to a problem. Instead, in most cases it is enough to find a good approximate solution in a short time. An algorithm that finds such an approximate solution in polynomial time is called a polynomial-time approximation algorithm. Although the paradigm of finding approximate solutions to computational problems has been used ever since the beginning of computer science, only in the late 1960s did the research community introduce the notion of a guaranteed performance ratio. The performance ratio is the common quality measure for approximation algorithms and represents the ratio between the value of the solution returned by the approximation algorithm and the value of the optimal solution. For years, the solutions yielded by approximation algorithms had a poor quality of approximation, many having a relative error of about 50% [ACK+ 00]. Over the past years, results have improved: the approximation quality has increased and problems which previously resisted any approximation strategy have now been “cracked”. In the following, the main techniques used for the design of approximation algorithms are briefly introduced.
2.4.5.1 Greedy
The greedy method is frequently used in the design of algorithms for optimization problems. Its main idea is that, at each step, from a multitude of choices, the choice that seems best at the moment is made. A greedy algorithm comprises the following:
• an initial set of items
• a ranking function
• a validation function
• an objective function
29
2.4. Complexity theory and Optimization
Chapter 2. Background
The initial set of items represents the input of the problem. The goal is to determine a subset of these items that satisfies the constraints while maximizing the objective function. First, the ranking function is used to sort the items according to some criterion. At each step, the item with the highest ranking is chosen, and the validation function is used to determine whether this item can contribute to the solution. This cycle is repeated until a complete solution is found. The quality of the solution depends on the ranking function. Although for each problem a ranking function exists that leads to an optimal solution, for an NP-hard problem we do not expect to be able to compute such a ranking function in polynomial time. However, in many cases simple ranking functions can be found that guarantee good approximate solutions. For example, for the knapsack problem, a ranking function that orders the items by their profit/weight ratio leads to a greedy algorithm with a worst-case performance ratio of 2 (when the greedy packing is compared against the single most profitable item and the better of the two is returned).
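As an illustration of the greedy method, the following sketch (not taken from the thesis) implements this profit/weight-ratio heuristic for the 0-1 knapsack problem; returning the better of the greedy packing and the single most profitable item is what yields the worst-case performance ratio of 2.

```python
# Greedy heuristic for the 0-1 knapsack problem: rank items by profit/weight
# ratio (ranking function) and pack them while they fit (validation function).
def greedy_knapsack(profits, weights, capacity):
    order = sorted(range(len(profits)),
                   key=lambda i: profits[i] / weights[i], reverse=True)
    chosen, total_profit, total_weight = [], 0, 0
    for i in order:
        if total_weight + weights[i] <= capacity:   # item still fits
            chosen.append(i)
            total_profit += profits[i]
            total_weight += weights[i]
    # Compare with the best single item that fits on its own.
    best_single = max((i for i in range(len(profits)) if weights[i] <= capacity),
                      key=lambda i: profits[i], default=None)
    if best_single is not None and profits[best_single] > total_profit:
        return [best_single], profits[best_single]
    return chosen, total_profit

print(greedy_knapsack([60, 100, 120], [10, 20, 30], 50))  # -> ([0, 1], 160)
```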
2.4.5.2 Linear Programming
Linear programming is a technique used to optimize (minimize or maximize) a linear objective function subject to a finite number of equalities and inequalities. Many practical problems can be formulated as linear programming problems. A problem that can be solved by linear programming has the following form:
maximize      F(x_1, . . . , x_p) = c_1 x_1 + . . . + c_p x_p                          (2.4)

subject to    a_{i1} x_1 + . . . + a_{ip} x_p ≤ b_i    for i = 1, . . . , m_1          (2.5)

              a_{i1} x_1 + . . . + a_{ip} x_p ≥ b_i    for i = m_1 + 1, . . . , m_2    (2.6)

              a_{i1} x_1 + . . . + a_{ip} x_p = b_i    for i = m_2 + 1, . . . , m      (2.7)

where         x_j ≥ 0    for j = 1, . . . , p                                          (2.8)
A vector x = (x_1, . . . , x_p) ∈ R^p is called a solution of the problem if it satisfies the constraints (2.5) - (2.7). When x also satisfies (2.8), then x is a valid solution. A valid solution x* = (x*_1, . . . , x*_p) is called an optimal solution when for each valid solution x, F(x*) ≥ F(x) (respectively F(x*) ≤ F(x) for minimization problems). The first algorithm for solving a linear program was created in 1947 by George Dantzig and is named the Simplex algorithm. This algorithm has proven its efficiency in practice. In the worst case, however, the Simplex algorithm has an exponential running time. In 1979, the Ellipsoid algorithm was introduced by Leonid Khachiyan as the first worst-case
polynomial-time algorithm for linear programming. Although faster in the worst case, this algorithm is outperformed by the Simplex algorithm on most practical problems. In 1984, Narendra Karmarkar introduced an algorithm that is nowadays known as Karmarkar's algorithm. This algorithm improves the worst-case performance of the Ellipsoid algorithm and yields a practical performance similar to that of the Simplex algorithm. For an in-depth description of these algorithms, [Kar91] is a good reference. For an NP-hard optimization problem P, the fact that any linear program can be solved in polynomial time means that, unless P = NP, we cannot find a linear programming formulation of P in which, for each instance p of P, the number of constraints is polynomial in the size of p [ACK+ 00]. However, most optimization problems can be formulated as integer linear programming problems. By relaxing the integrality constraint, we obtain linear programming problems that can be solved in polynomial time. After obtaining an optimal solution to the linear program, we can derive feasible solutions to the original integer linear program. Obviously, these solutions are only approximate solutions and can be obtained by rounding the values of the variables that violate the integrality constraint.
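The following sketch illustrates this relaxation-and-rounding idea on the 0-1 knapsack problem. It relies on the fact that the LP relaxation of the knapsack problem (with 0 ≤ x_i ≤ 1) can be solved greedily by profit/weight ratio, leaving at most one fractional variable; it is an illustrative example, not part of the thesis itself.

```python
# LP relaxation and rounding for the 0-1 knapsack problem (illustrative sketch).
# The relaxation is solved greedily; rounding the single fractional variable
# down yields a feasible 0-1 solution, and the LP value is an upper bound.
def knapsack_lp_relaxation(profits, weights, capacity):
    order = sorted(range(len(profits)),
                   key=lambda i: profits[i] / weights[i], reverse=True)
    x = [0.0] * len(profits)
    remaining = capacity
    for i in order:
        if weights[i] <= remaining:
            x[i] = 1.0
            remaining -= weights[i]
        else:
            x[i] = remaining / weights[i]   # the single fractional variable
            break
    lp_value = sum(profits[i] * x[i] for i in range(len(profits)))
    return x, lp_value

def rounded_solution(profits, weights, capacity):
    x, lp_value = knapsack_lp_relaxation(profits, weights, capacity)
    integral = [1 if xi == 1.0 else 0 for xi in x]   # round the fractional x_i down
    value = sum(p for p, xi in zip(profits, integral) if xi)
    return integral, value, lp_value

print(rounded_solution([60, 100, 120], [10, 20, 30], 50))
```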
2.4.5.3 Dynamic Programming
Dynamic programming is a technique used to solve problems which can be divided into subproblems and for which an optimal solution can be calculated from the optimal solutions of the subproblems. This is known as Bellman's principle of optimality, named after Richard Bellman, the mathematician who first introduced dynamic programming. Unlike the divide-and-conquer strategy, dynamic programming can be used when the subproblems overlap. The technique generally comprises three steps:
1. break the problem into subproblems
2. solve the subproblems recursively
3. calculate the solution of the problem from the solutions of the subproblems
Dynamic programming can be used to solve NP-hard optimization problems in pseudo-polynomial time. A polynomial-time algorithm has a running time which is polynomial with respect to the length of the input, which is commonly approximately the logarithm of the numerical value of the input. A pseudo-polynomial-time algorithm has a running time which is polynomial with respect to the numerical value of the input, and not the length of the input. Thus, since the numerical value of the input is exponential in the length of
the input, such algorithms are technically exponential-time algorithms and are not considered polynomial. For example, take the most “naive” way of testing whether a number n is prime: check if n is evenly divisible by any of the numbers between 2 and n − 1. This algorithm is pseudo-polynomial, since it requires n − 2 divisions, so it is linear with respect to the numerical value of n. Obviously, this algorithm is not efficient for large values of n. According to [Vaz01], all pseudo-polynomial time algorithms known for NP-hard problems are based on dynamic programming.
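As an illustration (not from the thesis), the following sketch shows the classical pseudo-polynomial dynamic program for the 0-1 knapsack problem; its O(n·c) running time is polynomial in the numerical value of the capacity c, but exponential in the length of its binary encoding.

```python
# Pseudo-polynomial dynamic program for the 0-1 knapsack problem.
def knapsack_dp(profits, weights, capacity):
    n = len(profits)
    # best[w] = maximum profit achievable with total weight at most w
    best = [0] * (capacity + 1)
    for i in range(n):
        # iterate weights downwards so each item is used at most once
        for w in range(capacity, weights[i] - 1, -1):
            best[w] = max(best[w], best[w - weights[i]] + profits[i])
    return best[capacity]

print(knapsack_dp([60, 100, 120], [10, 20, 30], 50))  # optimal profit: 220
```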
2.4.5.4 Randomized Algorithms
A randomized algorithm is an algorithm that makes some of its choices in a non-deterministic, random way. This means that, on the same input, the algorithm might yield different outputs. For optimization problems, this means that running a randomized algorithm several times on the same instance of a problem might yield different feasible solutions. Randomized algorithms offer a probabilistic performance guarantee with respect to either the execution time or the quality of the solution; that is, there is a small probability that the algorithm takes a long time to execute or yields a poor solution, respectively. Randomized algorithms that offer a probabilistic performance guarantee with respect to their execution time are called Las Vegas algorithms, while those offering a probabilistic performance guarantee with respect to the quality of the solution are called Monte Carlo algorithms. Both types of algorithms are known to be easy to implement and very efficient.
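A minimal Monte-Carlo-style sketch for the 0-1 knapsack problem is shown below (illustrative only): the number of samples fixes the running time, while the quality of the returned solution is only guaranteed probabilistically, and different runs on the same instance may return different feasible solutions.

```python
# Monte-Carlo-style randomized heuristic for the 0-1 knapsack problem:
# sample random feasible packings and keep the best one found.
import random

def random_sampling_knapsack(profits, weights, capacity, samples=1000, seed=None):
    rng = random.Random(seed)
    n = len(profits)
    best_profit, best_subset = 0, []
    for _ in range(samples):
        items = list(range(n))
        rng.shuffle(items)
        subset, profit, weight = [], 0, 0
        for i in items:                     # pack items in a random order
            if weight + weights[i] <= capacity:
                subset.append(i)
                profit += profits[i]
                weight += weights[i]
        if profit > best_profit:
            best_profit, best_subset = profit, subset
    return best_subset, best_profit

print(random_sampling_knapsack([60, 100, 120], [10, 20, 30], 50, seed=42))
```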
2.4.6 Heuristics
In the previous subsection we have seen how approximation algorithms can be used to find solutions to optimization problems with a certain performance guarantee. For some problems, however, finding such algorithms is hard, while for others it has been demonstrated that, unless P = NP, no approximation algorithm with a good performance ratio exists. In other cases, the approximation algorithms found are hard to implement or have little practical relevance. In such cases, using heuristics is the only way to find solutions to an optimization problem. Heuristic methods generally have the property of being simple, easy to implement and fast in execution. As with approximation algorithms, the aim is not to obtain an optimal solution but one that is reasonably good. However, when a heuristic is used, no guarantee can be made about the quality of the solution yielded by such a heuristic-based algorithm. Likewise, no guarantee can be made about the execution
time of the algorithm. Both are major disadvantages compared with approximation algorithms. In the following, several heuristic methods based on local search are described.
2.4.6.1 Local Search, Hill Climbing and Simulated Annealing
One way to solve optimization problems is to use local search. Algorithms that use local search start from an initial solution and then move to a neighbour solution in the search space. This step is repeated until a local optimum is reached. The term local optimum denotes the fact that a global optimum might exist, that is, a solution better than the current one, but one that has not yet been visited. Obviously, visiting every solution in the search space is a bad strategy with respect to the execution time. Thus, to speed up the local search, heuristics are required: instead of visiting all the neighbours of a solution, a heuristic is used to first rank the neighbours and then move to the best one. Of course, even when heuristics are used, we do not expect to reach a global optimum in polynomial time (unless P = NP). Many heuristics, like for example hill climbing, only permit moving to a solution better than the current one. When this is no longer possible, which coincides with finding a local optimum, the algorithm returns this final state (solution) and terminates. However, there are many situations in which this approach can lead to bad solutions. Take for example the search space in Figure 2.10. Here, both a local maximum (the smaller hill) and a global maximum (the bigger hill) exist. A heuristic like hill climbing can easily get stuck on the smaller hill, the local maximum, and thus return what could be a suboptimal solution. A way to prevent such behaviour is to permit moving to a solution that is not necessarily better than the current one. Simulated annealing is a heuristic method that permits moving from one neighbour solution to another, even if this neighbour solution is worse than the current one. It was inspired by the process of annealing a solid, which is used in metallurgy to increase the size and reduce the defects of the crystals of a material. The annealing of a solid first involves melting the solid by heating it to high temperatures. Then, due to the high energy of the system, the atoms become unstuck from their initial state and move randomly in a search for states with lower energy. After the solid has been brought into the liquid phase, the cooling down begins. During this process, the probability of atoms moving around decreases. To avoid reaching a locally minimal configuration, which corresponds to defective crystals, the cooling process must take place very slowly. This increases the chances of finding a better state.
Figure 2.10: Search Space Example

In computer science, simulated annealing works as follows: from the current state, a neighbour state is randomly chosen. A chosen state that is better than the current state is always accepted; otherwise, the state is accepted with a probability p, with p < 1. Over time, the value of p slowly decreases, meaning that the probability of moving into a state worse than the current one decreases. An algorithm that uses simulated annealing always remembers the best state that was visited. When p reaches the value 0, the algorithm only permits moving to a neighbour state if this state is better than the current one. When this is no longer possible, the algorithm returns the best state that was visited and terminates. While this state might only be a local maximum, it is known that when p decreases slowly enough, a global maximum is reached with a probability close to 1 [RN04].
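The following sketch (illustrative, not from the thesis) applies this scheme to the 0-1 knapsack problem; it uses the explicit acceptance probability p described above rather than the classical temperature-based Metropolis criterion.

```python
# Simulated annealing sketch for the 0-1 knapsack problem. The acceptance
# probability p for worse neighbours decays towards 0 over the run, and the
# best state visited so far is always remembered.
import random

def evaluate(solution, profits, weights, capacity):
    weight = sum(w for w, x in zip(weights, solution) if x)
    if weight > capacity:
        return -1                     # infeasible states are never the best state
    return sum(p for p, x in zip(profits, solution) if x)

def simulated_annealing(profits, weights, capacity, steps=5000, seed=None):
    rng = random.Random(seed)
    n = len(profits)
    current = [0] * n                 # start from the empty knapsack
    best = current[:]
    for step in range(steps):
        p = 1.0 - step / steps        # acceptance probability slowly decreases
        neighbour = current[:]
        i = rng.randrange(n)
        neighbour[i] = 1 - neighbour[i]   # flip one item in or out
        if (evaluate(neighbour, profits, weights, capacity)
                > evaluate(current, profits, weights, capacity)) or rng.random() < p:
            current = neighbour
        if (evaluate(current, profits, weights, capacity)
                > evaluate(best, profits, weights, capacity)):
            best = current[:]         # remember the best state visited
    return best, evaluate(best, profits, weights, capacity)

print(simulated_annealing([60, 100, 120], [10, 20, 30], 50, seed=1))
```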
2.4.6.2 Tabu Search
Tabu search is a local search technique that, just like simulated annealing, permits the move from one solution to a worse neighbour solution in the hope of obtaining a better local optimum in the end. However, unlike simulated annealing, tabu search uses a deterministic approach when choosing the next move. Commonly, a memory structure called a tabu list is used to keep moves that are considered tabu, that is, moves that are believed to lead to a bad local optimum and are therefore not allowed. In tabu search, a move to a neighbour solution means choosing the best non-tabu move. Since moving to a worse solution is allowed, the best solution visited so far needs to be remembered. A common use for a tabu list is to save the moves
that have already been performed, thus preventing the search from returning to solutions that have already been visited. Tabu lists and other “tabu” memory structures are at the core of tabu search. However, using tabus can sometimes prohibit attractive moves that could eventually lead to an improved solution. To avoid this, so-called aspiration criteria are used, which permit the execution of a tabu move if it leads to a solution better than the current best solution.
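A small sketch of tabu search for the 0-1 knapsack problem is given below (illustrative only); the tabu list stores recently flipped items, and the aspiration criterion allows a tabu move whenever it improves on the best solution found so far.

```python
# Tabu search sketch for the 0-1 knapsack problem with a fixed-length tabu list.
from collections import deque

def value(solution, profits, weights, capacity):
    weight = sum(w for w, x in zip(weights, solution) if x)
    return -1 if weight > capacity else sum(p for p, x in zip(profits, solution) if x)

def tabu_search(profits, weights, capacity, iterations=200, tenure=3):
    n = len(profits)
    current = [0] * n
    best, best_value = current[:], 0
    tabu = deque(maxlen=tenure)            # items whose flip is currently tabu
    for _ in range(iterations):
        candidates = []
        for i in range(n):                 # neighbourhood: flip a single item
            neighbour = current[:]
            neighbour[i] = 1 - neighbour[i]
            v = value(neighbour, profits, weights, capacity)
            aspiration = v > best_value    # tabu moves allowed if they beat the best
            if (i not in tabu or aspiration) and v >= 0:
                candidates.append((v, i, neighbour))
        if not candidates:
            break
        v, i, current = max(candidates)    # best non-tabu (or aspirated) move
        tabu.append(i)
        if v > best_value:
            best, best_value = current[:], v
    return best, best_value

print(tabu_search([60, 100, 120], [10, 20, 30], 50))
```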
2.4.6.3 Genetic Algorithms
Genetic Algorithms (GAs) are a search technique used to solve optimization problems. The idea behind this technique is to simulate biological evolution: technical systems are optimized based on the principles of biological evolution. Thanks to the advances in modern molecular genetics, we now know how the properties of individuals are inherited by the following generations. This inherited material is carried by so-called chromosomes. Chromosomes basically represent a collection of genes. The gene is the place where the value corresponding to a property is stored. For example, a gene can hold the value “blond” of the “hair colour” property. All the genes and their values are referred to as the genotype of an organism. The observable appearance and behaviour of this organism is referred to as the organism's phenotype. In other words, one can learn about the genotype by observing the DNA of the organism, while the phenotype can be known by observing the outward appearance of the organism. When two organisms reproduce, a cross between the genetic material of the two parents takes place. This is referred to as crossover and implies recombining the chromosomes of the two parents. In some cases, a mutation can take place, meaning that the value of a resulting gene is not the result of the chromosome recombination, but rather the result of the influence of an external factor. However, the probability of a mutation taking place is very low. While the crossover operator is the actual local search operator, a mutation can be seen as a “jump” to a state located “far away” in the search space, relative to the current state. According to the theory of evolution, the individuals of a population are subject to natural selection. The core of natural selection is the principle of survival of the fittest. This way, evolution can be seen as an elimination process, where the individuals that are better adapted to the surrounding environment survive. The new generation, obtained as a result of the reproduction of the surviving individuals, will share the genetic
properties of their ancestors and will generally be even better adapted to the surrounding environment, due to the crossover and occasional mutation processes. Genetic algorithms use these principles of biological evolution to solve optimization problems. The genetic algorithm approach comprises two phases, as shown in Figure 2.11.
The first phase is called the initialization phase. In this phase, an appropriate representation form is chosen with respect to the problem that needs to be solved. Note that there is no general representation that suits all problems; rather, each problem requires its own representation that takes the nature of the problem into consideration. After such a representation form has been found, the initial population must be generated, thereby ending the initialization phase, which is never repeated in the problem solving process.

The second phase is called the evolution cycle. As the name suggests, this phase is repeated until a termination condition is reached. This termination condition generally means finding one or more acceptable solutions to the problem, or reaching a maximum number of iterations. The evolution cycle starts with an evaluation of all the individuals of the current generation. After that, the parent individuals are selected, based on their fitness. The parent individuals are then reproduced by means of crossover. Every now and then, depending on a previously determined probability, a mutation can take place. The last step of the evolution cycle is the selection of the individuals belonging to the new generation.

Figure 2.11: The GA approach
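The following sketch (illustrative, not from the thesis) applies this two-phase scheme to the 0-1 knapsack problem, using a bit string as genotype, the packed profit as fitness, one-point crossover and a small mutation probability.

```python
# Genetic algorithm sketch for the 0-1 knapsack problem. The genotype is a bit
# string (one gene per item); infeasible individuals get fitness zero.
import random

def fitness(ind, profits, weights, capacity):
    weight = sum(w for w, g in zip(weights, ind) if g)
    return 0 if weight > capacity else sum(p for p, g in zip(profits, ind) if g)

def genetic_knapsack(profits, weights, capacity, pop_size=30, generations=100,
                     mutation_rate=0.05, seed=None):
    rng = random.Random(seed)
    n = len(profits)
    # Phase I: the encoding is fixed, generate the initial population
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):                       # Phase II: evolution cycle
        scored = sorted(population,
                        key=lambda ind: fitness(ind, profits, weights, capacity),
                        reverse=True)
        parents = scored[:pop_size // 2]               # fitness-based parent selection
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)                  # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n):                         # occasional mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(child)
        population = children                          # selection of the new generation
    best = max(population, key=lambda ind: fitness(ind, profits, weights, capacity))
    return best, fitness(best, profits, weights, capacity)

print(genetic_knapsack([60, 100, 120], [10, 20, 30], 50, seed=7))
```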
Chapter 3

Analysis

The way virtualization-based environments are managed nowadays is via GUI or command-line tools like the ones described in Section 2.1. While virtualization is generally used to simplify maintenance and to consolidate servers, it does not help to reduce the complexity of the infrastructure. In fact, a model where a service is hosted inside a virtual machine, which in turn is hosted along with other virtual machines on a physical machine, is more complex than the previously used model where a physical machine hosts a single service. We have previously argued that precisely this complexity is the main driving factor for the development of self-managing systems. With virtualization making its way into the data center [Inf], a solution for automatically managing such VM-based environments is obviously needed. In this chapter, an analysis of self-management approaches for virtualization-based environments is presented. The rest of this chapter is organized as follows: first, Section 3.1 presents the related academic work as well as related commercial products. Based on the observed limitations of this work, a formal model of the infrastructure and a set of requirements for the management framework are defined in Section 3.2. Designing a framework that fulfills these requirements leads to three problems, which are analyzed in Sections 3.4, 3.3 and 3.5.
3.1 State of the Art
There are no commercial solutions that can manage VM-based environments in a fully autonomic way. The few existing solutions that address this problem use simple approaches to ease the management task of the system administrator. This section presents a survey of preliminary academic work related to the self-management of VM-based environments, followed by a brief description of currently available commercial solutions.
3.1.1 Related Academic Work
Various publications exist that deal with the autonomic management of virtualization-based environments. Unfortunately, the approaches described in these publications prove to be inconsistent with respect to the taxonomy used to classify each approach. Therefore, we propose a basic taxonomy which can easily be extended in the future, and describe the existing academic research within this taxonomy. The taxonomy used here is partially based on the work of Kephart and Walsh [pol04], where the authors define three types of policies: action policies, goal policies and utility function policies. However, the term policy is used in many publications with a more restricted meaning, denoting what Kephart and Walsh refer to as an action policy. In order to avoid confusion, we will use the term policy only when it represents an action policy. Thus, the currently published self-management approaches which are the subject of this section can be categorized as one of the following:
• Policy-based
• Goal-driven
• Utility-based
Policy-based

At the core of a policy-based approach lies a policy that dictates which actions are to be performed in a given state. A mechanism is used to match the current state of the system to one of the states given in the policy; when this match is successful, the action corresponding to that state is performed. Policy-based approaches have been a popular way to manage computing systems [LLS02, LMKB01], mainly because they are easy to implement and perform reasonably well. Virtualization-based computing systems are no exception. In [RRX+ 06], Ruth et al. developed a system called VIOLIN, based on the so-called nanoHUB1 infrastructure, which uses the Xen hypervisor for virtualization. The system is composed of a multitude of physical machines and one adaptation manager. Each physical machine of the system hosts one or more virtual machines and one monitor daemon. The monitor daemon monitors the resource allocation (memory and CPU) of both the physical machine and the virtual machines running on that physical machine. Besides monitoring,
1 http://www.nanohub.org
the monitor daemon is also able to allocate resources to local virtual machines, thus serving both as a monitoring and as an actuator component. Results from all these monitoring daemons are gathered by the adaptation manager, which then computes a global view of available and allocated resources. Furthermore, the adaptation manager dictates the virtual machine resource allocation through the monitoring daemons. Two mechanisms are used: a local mechanism and live migration. The local mechanism uses memory ballooning and weighted CPU scheduling to reallocate local resources belonging to the physical machine hosting the VMs. When this is not possible, live migration is used to transfer a virtual machine requiring more resources to another physical machine. The resource reallocation policy used by the adaptation manager does not attempt to find the optimal resource utilization. Instead, a heuristic is used to improve the overall performance of the system by finding over- and under-utilized VMs. Results obtained by the authors show an increase in performance and resource utilization. The overhead caused by migration is also acceptable. Another policy-based approach has been published by Grit et al. [GIYC]. Here, the authors used and extended Shirako2, a Java-based toolkit for resource leasing services, to explore algorithmic challenges when using policies for adaptive virtual machine hosting on clusters. Shirako uses the concept of leases. These leases are contracts between resource providers and resource consumers; they define the resources allocated to a consumer and the duration of their ownership. Although Xen is used as the virtualization technology, the analysis performed by the authors has a generic character. The authors propose a two-layer architecture, with brokers responsible for VM provisioning and provider sites responsible for VM placement. The authors explored three challenges: the provisioning problem, the placement problem and the combination of the two. The provisioning problem has proven to be an instance of the knapsack problem, which, as seen in Section 2.4, is NP-hard even in its simplest form. The analysis of the VM placement problem also shows that finding a plan that minimizes the number of migrations is NP-hard. Things get even more complicated when provisioning and placement policies must work together. An alternative solution can be the tight integration of the two policies, to avoid creating tension between them. Historical data was used to emulate the behaviour of a system when a mechanism is used that combines the two types of policies. This mechanism is quite simple: when more resources are required by a VM, the VM is reprovisioned if enough local resources are available; otherwise, live migration is used to relocate the VM to another host inside the cluster. The authors reduced the complexity of the problem by suspending a VM to disk when neither local nor remote resources are
2 http://www.cs.duke.edu/nicl/cereus/shirako.html
directly available. Results are presented that show how different combinations of policies perform with respect to the number of migrations over a period of time.
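The decision logic of such a combined mechanism can be summarized by the following simplified sketch. It illustrates the general idea only, with made-up names and a single abstract resource dimension, and is not the authors' Shirako-based implementation.

```python
# Illustrative combined policy: try local reprovisioning first, then live
# migration to another host, and suspend the VM to disk as a last resort.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Host:
    name: str
    capacity: int        # abstract capacity units (CPU shares, memory, ...)
    allocated: int = 0

    def free(self) -> int:
        return self.capacity - self.allocated

def handle_demand(vm_host: Host, hosts: List[Host], extra_demand: int) -> str:
    # 1. Local reprovisioning: enough spare capacity on the current host.
    if vm_host.free() >= extra_demand:
        vm_host.allocated += extra_demand
        return "reprovisioned locally"
    # 2. Live migration: find another host that can take the additional demand.
    target: Optional[Host] = next(
        (h for h in hosts if h is not vm_host and h.free() >= extra_demand), None)
    if target is not None:
        target.allocated += extra_demand
        return f"migrated to {target.name}"
    # 3. Neither local nor remote resources available: suspend the VM to disk.
    return "suspended to disk"

hosts = [Host("pm1", 100, allocated=95), Host("pm2", 100, allocated=60)]
print(handle_demand(hosts[0], hosts, extra_demand=20))   # -> migrated to pm2
```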
Goal-driven

Instead of having a policy that defines what to do when a specific state is reached, goal-driven approaches specify a desired state, a goal that must be achieved. The intelligent agent must then perform actions that bring the system into the desired state, and thus achieve the goal. Different techniques exist to implement such an intelligent agent. One field of research that has inspired strategies for the dynamic allocation of resources in virtual environments using a goal-driven approach is control theory [HDPT04]. Zhang et al. [ZBG+ 05] used a control-theoretic model for VM adaptation. The basic idea is that the VMs, called friendly virtual machines, are responsible for adjusting their own demand for resources. This adjustment must consider both efficiency and fairness. To implement this, the authors modified an instance of User Mode Linux (UML) running on one physical machine, so that each UML VM is able to perform these adjustments. The technique implemented in each VM is called feedback-based adaptive control and comprises three mechanisms. First, a feedback signal mechanism is used to determine the bottlenecked resource of a VM. This mechanism uses the virtual clock time (VCT), the real time between two consecutive clock cycles, as a method to detect overload and to determine which resource is the bottleneck. Second, a control signal mechanism is used to adapt the resource consumption. The authors propose two methods: (1) controlling the maximum number of processes or threads and (2) controlling the rate by forcing the VM to periodically sleep. Finally, the brain of the system is the controller, which determines the adaptation strategy. The authors use a rule called additive-increase/multiplicative-decrease (AIMD), which increases the demand of a VM linearly and forces the demand to decrease multiplicatively when a feedback signal indicates an overload. This way, the VM can adapt its demand to match the capacity of the underlying resource. According to the authors, AIMD is an adaptation rule known to converge to efficiency and fairness. In [PSZ+ 07], Padala et al. use classical control theory to address dynamic resource allocation problems in shared infrastructures. To test their approach, the authors use the Xen hypervisor, RUBiS, an online auction site [RUB], and a Java implementation of TPC-W, an e-commerce suite [TPCa]. Both RUBiS and TPC-W are multi-tier applications, consisting of a database tier and a Web tier, and are used for benchmarking purposes. The feedback-control strategy chosen by the authors involves using a sensor, an actuator and
a controller. The sensor measures how many of the allocated CPU cycles are actually consumed by the VM. The actuator uses the CPU scheduler of the hypervisor to assign CPU shares to virtual machines. This way, both the sensor and the actuator make use of the Xen SEDF scheduler, which implements weighted fair sharing of CPU processing time. The controller uses a two-layer architecture. Each VM is assigned a utilization controller which computes the CPU share required by that VM. In doing this, the utilization controller uses two criteria: (1) make sure that the VM has enough resources so that it does not become a bottleneck3 and (2) make sure that the resource utilization of the VM is kept at a high level. The utilization controllers send the results of their measurements to the arbiter controller, which determines whether the requested CPU entitlements for all virtual machines can be satisfied. If this is not possible, the arbiter controller uses a QoS differentiation mechanism to determine which CPU entitlements will be satisfied. Results presented by the authors show that this approach increases the resource utilization while meeting certain QoS goals. Time series analysis has been used for decades to forecast the behavior of a system. In [BKB07], Bobroff et al. describe a mechanism for the dynamic migration of VMs. For this purpose, a forecasting technique based on time series is applied to historical data to determine future demands. The result of the forecasting is then fed to a management algorithm: bin packing heuristics are used on the forecasted demand to remap virtual machines to physical machines with the purpose of providing probabilistic Service Level Agreement (SLA) guarantees.
Utility-based

Utility-based approaches use utility functions to express the value of a state. The job of the intelligent agent is to find a sequence of actions that leads to the state with the highest utility, or close to it. Utility functions have previously been used in autonomic systems [WTKD04]. Menascé et al. used a utility-based approach for the dynamic allocation of CPU to virtual machines [MB06]. A formal model was developed which defines different workload classes for VMs. The response times of these workloads are used as performance metrics. Each VM has an associated utility function, which is a function of its performance metrics. The global utility function of the system, Ug, is a function of the utility functions of the individual VMs. This model abstracts the task of optimal CPU allocation to optimizing Ug. The authors use two
3 Queueing theory shows that the response time increases sharply when the CPU utilization gets close to 100%; the authors used a CPU utilization of 80% as their safety margin.
approaches to CPU allocation: CPU prioritization and CPU shares. In both cases, finding an optimal allocation generates a search space that can be extremely large, depending on the number of virtual machines and on the number of priority classes or CPU shares used. This means that a heuristic is required to find a solution close to the optimum. The authors use a combinatorial search technique called beam search, which works as follows: starting from the current allocation, compute Ug for all neighbours of the allocation; keep the k neighbours with the best values of Ug; repeat the process for these k neighbours, again keeping only the k best ones; continue until a certain depth d is reached. Although no actual virtualization technology was used to implement this approach, the authors obtained promising results by means of simulation.
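The following sketch (illustrative only; the utility function, neighbourhood and parameter values are assumptions, not those of [MB06]) shows the general structure of such a beam search over CPU-share allocations.

```python
# Generic beam search: keep the k best states per level, up to a given depth.
def beam_search(initial, neighbours, utility, k=3, depth=5):
    beam = [initial]
    best = max(beam, key=utility)
    for _ in range(depth):
        candidates = [n for state in beam for n in neighbours(state)]
        if not candidates:
            break
        beam = sorted(candidates, key=utility, reverse=True)[:k]
        best = max([best] + beam, key=utility)
    return best

# Toy example: distribute 10 CPU shares over 3 VMs; a neighbour moves one share
# between two VMs; the (hypothetical) global utility prefers the allocation (5, 3, 2).
def neighbours(alloc):
    result = []
    for i in range(len(alloc)):
        for j in range(len(alloc)):
            if i != j and alloc[i] > 0:
                moved = list(alloc)
                moved[i] -= 1
                moved[j] += 1
                result.append(tuple(moved))
    return result

def utility(alloc):
    target = (5, 3, 2)
    return -sum(abs(a - t) for a, t in zip(alloc, target))

print(beam_search((4, 4, 2), neighbours, utility, k=3, depth=5))  # -> (5, 3, 2)
```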
3.1.2 Limitations of the Previous Work
In the preceding Subsection 3.1.1, the existing academic work related to the use of autonomic computing techniques for managing virtual environments has been presented. Yet all the analyzed approaches have some limitations. Many of them use historical data and simulation to test the results of their controller strategy. Although correct from a theoretical point of view, they still need to prove their fitness in a real, working environment. Furthermore, to reduce the complexity of the resource allocation problem, the authors used two different simplifications. The most common one was to take only one resource into consideration, for example dealing only with CPU allocation while ignoring other resources such as memory, or performing live migration whenever service level agreement violations occur without considering the reallocation of locally available resources. This is not realistic, since in a real environment different resources can be the bottleneck. In a scenario where a VM needs more memory to perform well, and that memory is locally available, neither increasing the CPU share nor migrating the VM is an acceptable solution. The second simplification was to reduce the resource demand of a VM when these resources cannot be provided. This is done by either suspending the VM or by forcing the VM to periodically sleep. However, both of these solutions actually increase the response times of the services running in the respective VMs. An intelligent controller capable of replacing the system administrator should take its decisions based on data gathered through both service-level and VM-level monitoring. The ultimate quality criterion is the behaviour of the services hosted on the virtual machines, and not the resource utilization of these virtual machines. Conclusions regarding the
quality of a service cannot always be drawn by simply monitoring the resource utilization of a virtual machine.
3.1.3 Commercial Solutions
A few years ago, when technologies like VMware and Xen started to be used in data centers, it became obvious that the next big concern would be how to manage these new virtualization-based environments. Ever since, a multitude of tools have been developed to address this problem, like the ones presented in Section 2.1.5.2. Yet most of them focus on providing a feature-rich and user-friendly management interface. In this subsection, commercial solutions from two vendors are presented, which differ from the others by focusing on automating the management process. These products have not been tested, and their description is based on the information available on the respective Web sites. For this reason, in some cases no assertions can be made about their mode of operation. ToutVirtual's Virtual IQ4 Enterprise Suite consists of four products: VirtualIQ Pro, ProvisionIQ, AssetIQ and ShieldIQ. VirtualIQ Pro provides monitoring and policy-based management through a Web interface and can be installed as a virtual appliance. Such virtual appliances are available for VMware, Xen and Microsoft Virtual Server. ProvisionIQ, as the name suggests, is used to handle the provisioning and life-cycle management of virtual machines, while AssetIQ offers a comprehensive view of all the environment's resources. ShieldIQ is probably the most interesting product of the range and is used to monitor and maintain QoS parameters of a virtual appliance. Unfortunately, ShieldIQ is only available as a virtual appliance that runs on top of VMware products. It is also hard to estimate how this product range can scale with respect to different self-management strategies. Netuitive5 developed its product range based on the observation that managing a virtual environment with script- and rule-based monitoring tools is a difficult task. This is because performance metrics tend to change fast, thus making it very difficult to track possible problems and react properly to avoid them. Instead, Netuitive offers a set of performance management solutions which it claims to be “self-learning and continuously adaptive”, thus replacing “human guesswork with automated mathematics and analysis”. The software learns what the normal parameters of the
http://www.toutvirtual.com/solutions/solutions.php http://www.netuitive.com/products/
43
3.2. Architectural considerations
Chapter 3. Analysis
environment and is able to forecast problems before they occur. Besides forecasting, the tools from Netuitive are also able to perform fault diagnostic and find the root cause of a problem, thus reducing the mean time of repair. Overall, Netuitive provides an advanced monitoring solution, which is able to forecast problems and perform diagnostics. A big advantage is the self-leaning process, which eliminates the need for acquiring the knowledge from a human expert (i.e. the system administrator), a task that always proved to be difficult. However, the solutions from Netuitive can’t provide what it takes to be a full autonomic manager. While components are available for acquiring knowledge, monitoring and analyzing, there seem to be no components for planning and automatically executing decisions based on the performed analysis. These tasks still need to be accomplished by a human system administrator.
3.2 Architectural considerations
As previously argued, the existing solutions for self-managing virtual machines have considerable limitations. This section presents the formal requirements of this thesis, which is aimed at designing and evaluating self-management approaches for virtual machine-based environments. First, we start with a formal model of the managed infrastructure. Then, a motivation for a common framework is given, as well as a set of requirements for this framework.
3.2.1 A Formal Model of the Managed Infrastructure
To evaluate different self-management approaches, they all must have one thing in common: the managed infrastructure. In the following, a formal model of the managed infrastructure is provided. Our infrastructure is based on a cluster model. In this cluster, three types of entities exist: virtual machines, physical machines and the cluster itself. Thus, we define C, the cluster, as C = {S, P, V, g, s}, where:
• S is the set of services
• P is the set of physical machines (PMs)
• V is the set of virtual machines (VMs)
• g ⊆ P × V is a relation which states which VMs are provisioned on which PMs
• s : V → S is a surjective function which says which service runs on a given VM (we have defined a model where a single service is hosted on a virtual machine, but the same service can be hosted on more than one virtual machine; we believe that this corresponds to scenarios where virtualization is used in data centers)
Each VM v_i has:
• a vector a_i = (cpu_i, mem_i, net_i, io_i), representing the amount of allocated resources
• a vector r_i = (cpu_i, mem_i, net_i, io_i), representing the amount of required resources
where:
• cpu represents the CPU share
• mem represents the amount of memory
• net represents the network bandwidth
• io represents the I/O bandwidth
We say that p ≥ q if every element p_n of p and its corresponding element q_n of q are in the relationship p_n ≥ q_n. The vector operators ≤, >, < and = are defined analogously. Furthermore, each PM p_j has:
• a vector t_j = (cpu_j, mem_j, net_j, io_j), representing the total amount of resources on that PM
• a vector u_j = (cpu_j, mem_j, net_j, io_j), representing the amount of resources provisioned to VMs
Using this formal model, we can define when a virtual machine has too many or too few resources assigned: if r_i > a_i, then the virtual machine v_i needs more resources; if r_i < a_i, then the virtual machine v_i has more resources than it requires.
Similarly, we can define whether a virtual machine can be provisioned on a physical machine, or even determine the resource utilization of a physical machine. Note that the abstraction provided by this model is somewhat simplified, since the dimensions are considered independently. This means that we exclude the possibility of a virtual machine requiring more of resource x while at the same time releasing some of resource y.
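To make the model concrete, the following is a minimal Java sketch of the resource vectors and the component-wise comparison defined above. The class and method names (ResourceVector, dominates, needsMoreResources) are illustrative choices for this example, not part of the framework designed later in this thesis.

/** A resource vector (CPU share, memory, network and I/O bandwidth) as in Section 3.2.1. */
public class ResourceVector {
    public final double cpu, mem, net, io;

    public ResourceVector(double cpu, double mem, double net, double io) {
        this.cpu = cpu; this.mem = mem; this.net = net; this.io = io;
    }

    /** Component-wise >=: "this" provides at least as much of every resource as "other". */
    public boolean dominates(ResourceVector other) {
        return cpu >= other.cpu && mem >= other.mem && net >= other.net && io >= other.io;
    }
}

class VirtualMachine {
    ResourceVector allocated; // a_i
    ResourceVector required;  // r_i

    VirtualMachine(ResourceVector allocated, ResourceVector required) {
        this.allocated = allocated;
        this.required = required;
    }

    /** Pragmatic reading of r_i > a_i: the allocation does not cover the requirement in every dimension. */
    boolean needsMoreResources() {
        return !allocated.dominates(required);
    }
}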
3.2.2 Framework Requirements
While the authors of the publications presented in Subsection 3.1.1 all claim to have obtained good results, it is impossible to evaluate these approaches against each other. First of all, this is not possible because different architectures and infrastructures were used. By defining a formal model of the managed infrastructure, we have addressed part of the problem. However, to be able to evaluate different controller strategies, a common framework is required into which these strategies can be plugged and replaced. For such a framework, we define the following requirements:
1. Support for service level management
2. Separation of control algorithms from the management framework
3. Support for different virtualization technologies
4. Support for a common evaluation mechanism
5. Scalability
(1) means that the framework should provide support for managing services hosted in virtual machines. As shown in Section 2.2, this involves monitoring various QoS parameters and making sure that these parameters are SLA-conform. (2) ensures that the management framework is responsible for dealing with aspects like monitoring and execution of tasks, while a separate intelligent controller defines management tasks based on the gathered monitoring data. As such, the framework should support different types of intelligent controllers by providing a generic controller interface. (3) requires the framework to support different virtualization technologies (e.g. VMware ESX or Xen, see Section 2.1) without affecting the design of the controller. Thus, from a controller's perspective, the framework acts as an abstraction layer on top of the virtualization technology. Generic interfaces for monitoring and task execution should simplify plugging in adapters for various virtualization technologies.
(4) means that the framework has to support a consistent mechanism for tracing and comparing management decisions. Only in this way can different self-management approaches be evaluated against each other. (5) states that the framework should be scalable with respect to the number of physical and virtual machines that it can manage. New physical or virtual machines dynamically added to the cluster must be recognized by the framework.
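As a sketch of requirement (2), the separation between framework and controller could be expressed as a plain Java plug-in interface. The names below (Controller, ResourceRequest, AllocationDecision, GlobalResourceView) are hypothetical and only illustrate the idea; they are not the interfaces defined later in Chapter 4.

import java.util.List;

/** Hypothetical plug-in point: the framework feeds monitoring data in, the controller returns decisions. */
interface Controller {
    /** Called by the framework whenever a VM requests additional resources. */
    AllocationDecision onResourceRequest(ResourceRequest request, GlobalResourceView view);
}

/** A request from a VM for more of a single resource (cf. the formal model in Section 3.2.1). */
class ResourceRequest {
    String vmId;
    String resource;   // "cpu", "mem", "net" or "io"
    double amount;
}

/** The controller's answer: an ordered list of tasks (local reallocation, migration, ...) for the framework to execute. */
class AllocationDecision {
    List<String> tasks;
}

/** Read-only snapshot of the cluster resources, maintained by the framework. */
interface GlobalResourceView {
    List<String> physicalMachines();
    List<String> virtualMachinesOn(String pmId);
}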
3.3 The Resource Allocation Problem
The framework requirements defined in the previous section lead to a clear separation between the controller and the management framework. The management framework provides an abstraction layer on top of which the controller only deals with resource providers (the physical machines) and resource consumers (the virtual machines). The job of the controller is to satisfy the requirements of the resource consumers, given a global view of the cluster resources supplied by the resource providers.
3.3.1 Controller Requirements
Although separated from the management framework, the controller is still subject to some requirements derived from the nature of the managed infrastructure. Thus, any controller mechanism must consider the following requirements:
1. fast execution times
2. solutions at reduced costs
(1) derives from the fact that resource requirements coming from the virtual machines must be satisfied within a short time. This puts an upper bound on the execution time of the resource allocation algorithm. (2) derives from the fact that reallocating resources is costly, especially when virtual machines have to be migrated. Together, these requirements mean that the goodness of a solution is a function of both the profit offered by this solution and the cost of reaching it.
3.3.2 Approaches
The controller is responsible for satisfying the resource requirements of the virtual machines. An equivalent formulation of our resource allocation problem is the problem of packing virtual machines into physical machines. More abstractly, this means packing items into boxes with the purpose of reaching a certain profit. This problem is similar to the knapsack problems presented in Section 2.4. Thus, a first observation can be made: our resource allocation problem belongs to the knapsack family of problems. This is an important observation since, as described in Section 2.4, all knapsack optimization problems are known to be NP-hard (in their most general case). This means that the problem that needs to be solved by the controller is an NP-hard optimization problem. Sometimes, however, satisfying all the requirements of all virtual machines might not be possible. This is similar to not being able to pack all the items into the available knapsacks. In such a case, the approach used is to maximize the profit. For knapsack problems, where each item has a specific profit, this means finding the set of items that can be packed into the available knapsacks and which gives the highest profit. For our problem, if we consider that all virtual machines have the same profit, this means maximizing the number of satisfied virtual machines. Alternatively, if the virtual machines have different profits, then the goal is to maximize this profit while not exceeding the capacity of the physical machines. We have seen in Section 2.4 that, unless P = NP, there is no algorithm that finds an optimal solution to an NP-hard optimization problem in polynomial time. In practice, small instances of an optimization problem can be solved by simply enumerating all possible solutions. For our problem, however, this approach cannot be used, since the size of our problem corresponds to the size of a virtualized data center. Such a data center could, for example, have 100 virtual machines provisioned on 10 physical machines, in which case the simple enumeration of all possible virtual machine to physical machine mappings is not a feasible approach. This leaves two options:
• heuristics
• approximation algorithms
In the following, these two alternatives are analyzed.
3.3.3 Heuristics
Although there are no execution time guarantees for algorithms using heuristics, in practice such algorithms are known to have fast execution times. This makes heuristics a good candidate for our problem, since they fulfill the first controller requirement. The second controller requirement states that the cost of reaching a solution should not be too high. For our problem, a solution translates into a certain resource allocation map. Thus, the second requirement translates into finding a resource allocation map close to the initial one. In such cases, heuristics based on local search are a good choice, since they look for solutions in the neighbourhood of the initial state. This is exemplified in Figure 3.1, where the black spots represent states in the search space, the red spot represents the initial state and the circle around the initial state represents its neighbourhood.
Figure 3.1: Example of local search through the search space
In Section 2.4, several local search heuristics have been introduced. For the analysis, we considered the following four techniques:
• Hill climbing
• Simulated annealing
• Genetic algorithms
• Tabu search
Hill climbing always selects the best neighbour from all the neighbours of the present state. For most problems, this strategy converges fast to a local maximum. For our problem, such behaviour has both advantages and disadvantages. The advantages are that hill climbing generally executes fast, and that the solution found, corresponding to a local maximum, is usually not far away from the initial state. In terms of resource allocation, this means that a new resource allocation map is found fast, and this map does not differ much from the initial one, thus reducing the number of necessary reallocations. However, there are many cases in which the quality of the solution found by hill climbing is rather poor. This is a clear disadvantage of the approach.

Simulated annealing reduces the risk of reaching a local maximum of poor quality by allowing random moves through the search space. While the solutions found by simulated annealing are still quite close to the initial state, increasing the quality of the solution generally means increasing the execution time. Using simulated annealing might therefore lead to a violation of the first controller requirement.

Genetic algorithms use an approach based on the concept of natural evolution and the survival-of-the-fittest principle. Although this makes them very different from the other heuristics based on local search, genetic algorithms are still a local search technique. For our problem, using an appropriate encoding, the crossover mechanism of genetic algorithms is similar to exchanging virtual machines between physical machines. The success of a genetic algorithms approach depends mostly on finding such an encoding. Figure 3.2 shows an example of a one-point crossover, using a possible encoding. Here, a mapping of virtual machines to physical machines is encoded in the form of an individual. A virtual machine corresponds to a gene, and sections of the individual represent physical machines. When the crossover takes place, the mappings of PM 1 and PM 2 are exchanged by the two individuals, while the mapping of PM 3 remains the same. As such, in the next generation an individual appears corresponding to a state where VM 2 and VM 1 are provisioned on PM 1, VM 4 and VM 5 are provisioned on PM 2, and VM 3, VM 6 and VM 7 are provisioned on PM 3. A fitness function must then check the validity and, if valid, the utility of this state.

Figure 3.2: Crossover example

Compared with the other approaches, tabu search introduces the concept of taboo moves. In the search space of our problem, states can be either valid or invalid. An invalid state is one where the total size of the virtual machines packed onto a physical machine exceeds the total capacity of that physical machine. When a solution is reached, corresponding to
a final state, the system needs to start from the initial state and go through all the intermediate states visited on the way to the final state. If any of these intermediate states is invalid, then the solution found is invalid even if the final state is valid. None of the previous techniques has a special mechanism guaranteeing that such a situation can never occur. With tabu search, however, this can be achieved by defining a move from a valid state to an invalid state as a taboo move. All these heuristics have their advantages and disadvantages. A theoretical analysis of these approaches alone cannot guarantee the performance of a controller implementing such a heuristic. However, there are also no strong arguments against using any of them. It is also possible to combine these local search heuristics into one special heuristic that best fits our problem.
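As an illustration of the local search idea discussed above, the following is a minimal hill-climbing sketch over VM-to-PM mappings in Java. The mapping representation, the neighbour generation (all single-migration moves) and the externally supplied utility function are assumptions made for this example; this is not the controller designed later in this thesis.

import java.util.*;

/** Steepest-ascent hill climbing over VM-to-PM mappings (VM id -> PM id). */
class HillClimbing {

    static Map<String, String> search(Map<String, String> initial,
                                      java.util.function.ToDoubleFunction<Map<String, String>> utility,
                                      List<String> pms) {
        Map<String, String> current = initial;
        double currentUtility = utility.applyAsDouble(current);
        while (true) {
            Map<String, String> bestNeighbour = null;
            double bestUtility = currentUtility;
            // neighbours: move exactly one VM to a different PM
            for (String vm : current.keySet()) {
                for (String pm : pms) {
                    if (pm.equals(current.get(vm))) continue;
                    Map<String, String> neighbour = new HashMap<>(current);
                    neighbour.put(vm, pm);
                    double u = utility.applyAsDouble(neighbour);
                    if (u > bestUtility) {
                        bestUtility = u;
                        bestNeighbour = neighbour;
                    }
                }
            }
            if (bestNeighbour == null) return current; // local maximum reached
            current = bestNeighbour;
            currentUtility = bestUtility;
        }
    }
}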
3.3.4 Approximation Algorithms
Nowadays, approximation algorithms exist for many NP-hard optimization problems. However, such algorithms exist for a specific family or class of problems, not necessarily for a specific instance of a problem. Knapsack problems, in particular, are known to have approximation algorithms. We have previously shown that our problem belongs to the knapsack family of problems. Thus, finding an approximation algorithm for our resource
allocation problem can be done by reducing our problem to a specific knapsack problem which has an approximation algorithm. For our problem, when only one resource is taken into consideration (e.g. memory) and only one physical machine is available in the cluster, we are dealing with the original 0-1 knapsack problem. Furthermore, when more resources are taken into consideration (e.g. CPU, memory, network and I/O), we are dealing with a multidimensional knapsack problem. When the resources are provided by several identical physical machines, we are dealing with a multidimensional bin packing (or vector packing) problem. Finally, in the most general case, when the physical machines need not be identical, our problem is a multidimensional multiple knapsack problem (MMKP). Although mostly similar, our problem has some properties that differentiate it from the MMKP. Treating our resource allocation problem like an MMKP would mean that resources are provisioned offline. However, our problem deals with dynamic resource allocation. In knapsack terminology, this means that the items are already packed in knapsacks when one dimension of one of the items changes. In this case, taking the second controller requirement into consideration, only a limited number of items may be moved between knapsacks. An algorithm that performs under such a constraint is called a semi-online algorithm [EGCGJ97]. Various approximation algorithms exist for knapsack problems. In [EGCGJ97], the authors present a survey of approximation algorithms for the one-dimensional bin packing problem. Several semi-online algorithms are mentioned there that offer a guaranteed performance ratio smaller than 1.5. Approximation algorithms for multidimensional bin packing problems are described in [BCS06] and [CKP01]. Finding an approximation algorithm for our problem means finding an algorithm among the existing approximation algorithms for knapsack problems. From this multitude of algorithms, the one is chosen which fits our resource allocation problem best without violating any (or too many) constraints.
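To make the packing view concrete, the following is a Java sketch of the classic first-fit decreasing heuristic for one-dimensional bin packing, phrased here as packing VM memory demands into identically sized PMs. It is only a minimal illustration of the problem family; it is neither one of the semi-online algorithms surveyed in [EGCGJ97] nor the algorithm chosen for this thesis.

import java.util.*;

/** First-fit decreasing: sort items by size, place each into the first bin (PM) with enough
 *  remaining capacity, opening a new bin if none fits. A simple offline example of the
 *  knapsack/bin-packing family discussed above. */
class FirstFitDecreasing {
    static List<List<Double>> pack(List<Double> vmSizes, double pmCapacity) {
        List<Double> sorted = new ArrayList<>(vmSizes);
        sorted.sort(Collections.reverseOrder());

        List<List<Double>> bins = new ArrayList<>();
        List<Double> remaining = new ArrayList<>();
        for (double size : sorted) {
            int target = -1;
            for (int i = 0; i < bins.size(); i++) {
                if (remaining.get(i) >= size) { target = i; break; }
            }
            if (target < 0) {                       // open a new PM
                bins.add(new ArrayList<>());
                remaining.add(pmCapacity);
                target = bins.size() - 1;
            }
            bins.get(target).add(size);
            remaining.set(target, remaining.get(target) - size);
        }
        return bins;
    }
}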
3.4 A Bottleneck Determination Mechanism
In our model, a single service is hosted inside a virtual machine. The first framework requirement states that the framework should provide support for managing the services inside virtual machines. However, the controller allocates resources based on requirements coming from the virtual machines through the management framework. It is then the management framework that is responsible for translating the behaviour of the service
into resource requirements when this is the case. To achieve this, the framework needs to provide what we refer to as a bottleneck determination mechanism. The job of the bottleneck determination mechanism is to monitor both the service and the virtual machine on which the service is provisioned. When a service misbehaves, the mechanism must be able to determine which resource is the cause of this behaviour. Note that, as described in Section 3.2.1, we address only single resources. Other causes of such behaviour, like software errors, are not treated by the bottleneck determination mechanism. When a resource is found to be the bottleneck, the management framework lets the controller know that more of that specific resource is required by a certain virtual machine. The bottleneck determination mechanism behaves just like a human expert (e.g. a system administrator): it first observes the problem and then uses its knowledge to find the resource causing the bottleneck. Based on this diagnosis, the bottleneck determination mechanism finds or suggests a solution. This is typical expert-system behaviour, indicating that an expert system approach could be adequate for such a mechanism.
3.4.1 The Expert Systems Approach
In computer science, the term expert system denotes a system capable of performing classifications and diagnostics, as well as making decisions. Thus, an expert system is a program capable of performing the job of a human expert. Like humans, expert systems can be experts for a specific domain. For example, an expert system for medical diagnostics can assist a doctor in the diagnosis of a patient. To be able to perform such actions, an expert system must maintain a knowledge base focused on that specific domain. To create such a knowledge base, a process called knowledge acquisition is performed. Acquiring knowledge basically means transferring expertise from a knowledge source (for example a human expert or a book) to a program. Furthermore, this knowledge needs to be encoded in the knowledge base in a format that is machine readable and can be associated with the way humans store knowledge (from a logical perspective). This is known as knowledge representation. Depending on the approach of the expert system, there are several ways to represent knowledge. Besides a knowledge base, an expert system must also have a reasoning mechanism. Confronted with a specific problem, the inference engine of the expert system uses the knowledge base to determine a solution. This mode of operation is pictured in Figure 3.3.
Figure 3.3: Operation mode of an expert system

Over the years, many approaches to expert systems have been developed [Jac98]. In the following, two expert system techniques are described, showing how each of them could be used for the bottleneck determination mechanism. These techniques were selected because they appear to be the most adequate for our purposes.
3.4.1.1 Rule-Based Systems
Rule-based systems are a traditional approach to expert systems. A rule-based system consists of:
• a set of rules, also known as the production memory
• a rule interpreter
• a working memory
The knowledge base is represented in the form of a set of rules. Each rule has a set of facts and a set of actions. The working memory represents the central data structure and contains data about the current state. It is the data in the working memory that triggers the rules, with the rule interpreter being responsible for the selection and activation of rules. The rule interpreter tries to match the facts of a rule with the data in the working memory. When all the facts of a rule are matched, the rule interpreter fires the actions associated with these facts. In case more than one rule fires, the rule interpreter must choose one of them to apply. Rule-based systems are a possible approach to our bottleneck determination mechanism. The main challenge of this approach is acquiring and encoding the knowledge of a human expert (system administrator) in the form of a set of rules.
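The sketch below illustrates, in plain Java and independently of any concrete rule engine, how such production rules over a working memory of monitoring facts might look. The fact names, the 90% CPU threshold and the rule body are invented for illustration only; they are not rules defined in this thesis.

import java.util.*;

/** Working-memory facts and a tiny rule interpreter: each rule has a condition over the facts
 *  and an action; the interpreter fires every rule whose condition matches. */
class TinyRuleSystem {
    interface Rule {
        boolean matches(Map<String, Double> facts);
        String action();
    }

    static List<String> run(Map<String, Double> facts, List<Rule> rules) {
        List<String> firedActions = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.matches(facts)) firedActions.add(rule.action());
        }
        return firedActions;
    }

    public static void main(String[] args) {
        Map<String, Double> facts = Map.of(
                "responseTimeMs", 850.0, "slaResponseTimeMs", 500.0,
                "cpuUtilization", 0.97, "memUtilization", 0.40);

        Rule cpuBottleneck = new Rule() {
            public boolean matches(Map<String, Double> f) {
                return f.get("responseTimeMs") > f.get("slaResponseTimeMs")
                        && f.get("cpuUtilization") > 0.9;
            }
            public String action() { return "request more CPU share"; }
        };

        System.out.println(run(facts, List.of(cpuBottleneck)));
    }
}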
3.4.1.2 Case-Based Reasoning
Case-based reasoning is a relatively new approach (early 1980s) to expert systems. Like other approaches, case-based reasoning tries to imitate the way human experts solve problems. Unlike the other approaches, however, case-based reasoning is based on the observation that a human expert solves a problem by recalling a similar problem that he or she solved in the past and applying a similar solution to the current problem. As such, the assumption at the core of case-based reasoning is that similar problems have similar solutions. An expert system that uses case-based reasoning usually performs the following steps:
• search the knowledge base for a case which documents a solution to a problem similar to the current problem
• adapt this solution to the current problem
• apply the adapted solution to the current problem
• store the current problem and its solution as a new case in the knowledge base
One advantage of this approach is that the knowledge representation in the form of cases is easy to implement and has proven to be efficient. Case-based reasoning systems are able to determine a solution relatively fast, which means that they can be used to resolve on-line problems. However, the generalization that similar problems have similar solutions has always been criticized and can lead to erroneous solutions. Like rule-based systems, case-based reasoning is a possible approach for our problem. However, acquiring knowledge from a human expert and encoding it in the form of cases is more difficult than encoding this knowledge in the form of rules.
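As a sketch of the retrieval step listed above, the following Java snippet stores cases as pairs of a symptom vector (e.g. normalized QoS and resource metrics) and the bottleneck that was eventually diagnosed, and retrieves the most similar stored case by Euclidean distance. The case structure and the similarity measure are assumptions for illustration; adaptation and the final store step are left out.

import java.util.*;

/** Minimal case base for case-based reasoning: retrieve the solution of the most similar past case. */
class CaseBase {
    static class Case {
        final double[] symptoms;   // e.g. normalized response time, cpu, mem, net, io
        final String bottleneck;   // the resource that was diagnosed as the bottleneck
        Case(double[] symptoms, String bottleneck) { this.symptoms = symptoms; this.bottleneck = bottleneck; }
    }

    private final List<Case> cases = new ArrayList<>();

    void store(double[] symptoms, String bottleneck) { cases.add(new Case(symptoms, bottleneck)); }

    /** Return the bottleneck of the most similar stored case (smallest Euclidean distance). */
    String retrieve(double[] symptoms) {
        Case best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Case c : cases) {
            double d = 0;
            for (int i = 0; i < symptoms.length; i++) {
                double diff = symptoms[i] - c.symptoms[i];
                d += diff * diff;
            }
            if (d < bestDistance) { bestDistance = d; best = c; }
        }
        return best == null ? "unknown" : best.bottleneck;
    }
}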
3.4.2 The Time Series Analysis Approach
A time series is a sequence of observations taken over a time interval. An example of a time series is shown in Figure 3.4. In many cases, adjacent observations are dependent, and time series analysis deals with techniques for analyzing these dependencies. According to [BJR94], time series analysis has four important areas of application:
1. forecasting future values using current and past values (observations)
2. determining the output of a system given a specific input; this is known as determining the transfer function of a system
3. determining the effects of external events, called interventional events, on the behaviour of a time series
4. designing discrete control systems, which use so-called control schemes to adjust the input of the system in order to obtain a desired output

Figure 3.4: Time series example

Time series analysis has applications in economics, social sciences, geophysics, meteorology, engineering and other fields. For our bottleneck determination problem, time series analysis could be used to forecast future bottlenecks and, together with an expert systems technique, to determine the cause of such a future bottleneck. Alternatively, time series analysis could also be used to design a discrete control system capable of determining the bottleneck on its own, without the use of an expert system.
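A minimal forecasting sketch in Java, using simple exponential smoothing over a resource-utilization series and flagging a potential future bottleneck when the forecast crosses a threshold. The smoothing factor, the threshold and the sample data are illustrative assumptions, not values or methods prescribed by this thesis.

/** Simple exponential smoothing over a utilization time series with a one-step-ahead forecast. */
class ExponentialSmoothingForecast {
    static double forecast(double[] observations, double alpha) {
        double level = observations[0];
        for (int t = 1; t < observations.length; t++) {
            level = alpha * observations[t] + (1 - alpha) * level; // smoothed level
        }
        return level; // forecast for the next observation
    }

    public static void main(String[] args) {
        double[] cpuLoad = {0.42, 0.47, 0.55, 0.61, 0.70, 0.78, 0.85};
        double next = forecast(cpuLoad, 0.5);
        if (next > 0.9) {
            System.out.println("CPU bottleneck expected, forecast = " + next);
        } else {
            System.out.println("forecast = " + next);
        }
    }
}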
3.5 The Overprovisioning Problem
Up to this point of the analysis, we have only considered the problem of virtual machines requiring additional resources. However, in some cases, virtual machines could be overprovisioned. This means that more resources have been provisioned to a virtual machine than the virtual machine requires. In such cases, a mechanism is required to release these resources.
In the previous section, an analysis of a bottleneck determination mechanism was provided. For the overprovisioning problem, a similar mechanism can be used. Both the service and the virtual machine parameters should be monitored to determine when the virtual machine is overprovisioned, and both expert systems and time series analysis, presented in the previous section, are feasible approaches for implementing a mechanism capable of detecting overprovisioning. The controller approach for this problem is trivial. When a virtual machine releases resources, these resources are returned to the physical machine hosting the respective virtual machine. In some cases, this could lead to overprovisioned physical machines, that is, physical machines with a low resource utilization. However, such situations are compensated as soon as the controller needs to satisfy resource requirements coming from virtual machines, since in such cases the controller aims to optimize the resource utilization of the entire cluster. Such an optimization also means that the controller tries to achieve a configuration that minimizes the energy consumption of the data center, i.e. reduces the number of online physical machines.
Chapter 4 Design

In the previous chapter, the goal of this thesis was described, a set of requirements was defined and various approaches to fulfilling these requirements were proposed and analyzed. In this chapter, the design resulting from this analysis is presented. The first section describes the design of the framework. Section 4.2 describes the design of the bottleneck determination mechanism, while Section 4.3 is dedicated to the design of the various controller approaches.
4.1 Architecture of the Framework

4.1.1 The basic concept
The framework created as part of this thesis is targeting the self-management of virtual machine-based environments. As such, the autonomic computing paradigm must be taken into consideration when designing the framework. As described in Section 2.3, the autonomic computing paradigm states that autonomic systems represent a collection of autonomic elements. In this context, the autonomic element is the basic component of the system. In section 3.2, a formal model of the managed infrastructure was defined. In this model, the system is composed of four types of entities: the service, the virtual machine, the physical machine and the cluster itself. Since in our model only one service is hosted on a virtual machine, the service and the virtual machine are treated together. As a result, the model of the system comprises three different types of entities.
Applying the autonomic computing paradigm to this model means that these entities must appear in the system as autonomic elements. To achieve this, each of these elements requires an autonomic manager. As such, the following pairs of managed elements and autonomic managers are defined:
• virtual machine - VM Manager
• physical machine - Physical Manager
• cluster - Cluster Manager
The VM Manager manages both the virtual machine and the service hosted on it. Each virtual machine has exactly one VM Manager. The Physical Manager is responsible for a physical machine hosting virtual machines. While in our model a physical machine can host several virtual machines, each physical machine has exactly one Physical Manager. The cluster entity is a singleton, and thus exactly one Cluster Manager exists in the system, responsible for the cluster itself. Together with their managed elements, the autonomic managers form the autonomic elements of the system. In Section 3.2, five framework requirements were defined; at this stage of the design, two of them are addressed. The second framework requirement states that the controller needs to be separated from the management framework. We address this requirement by creating a hierarchical model, somewhat similar to the model proposed by Ruth et al. [RRX+06] and described in Section 3.1. This model is shown in Figure 4.1.

Figure 4.1: The basic concept

Each virtual machine has a VM Manager. Each physical machine has a Physical Manager. The VM Manager monitors a virtual machine and the service running on this virtual machine. When the VM Manager determines that the virtual machine it manages requires more resources, it sends a message to the Cluster Manager, asking for the necessary resources. Each Physical Manager monitors the resource utilization of the physical machine it is assigned to. All Physical Managers forward this information to the Cluster Manager. Based on this data, the Cluster Manager computes a global resource view. This global resource view is then used by the Cluster Manager to satisfy the requirements of the VM Managers. Whenever a request arrives, the Cluster Manager consults the global resource view and determines how the resources can be allocated. The Cluster Manager then defines reallocation tasks, which are forwarded to and executed by the Physical Managers. Through this framework design, we have addressed but not completely fulfilled the second framework requirement. The VM Managers and Physical Managers are responsible for
monitoring and executing tasks, while the Cluster Manager only takes resource allocation decisions. Thus, the Cluster Manager wraps the controller and is separated from the rest of the management framework. This design is obviously scalable with respect to the number of physical and virtual machines, within the data center limits described in Section 3.3.2. From an architectural point of view, the framework supports adding any type of these machines at run time. This fulfills the fifth framework requirement.
4.1.2 The Physical Manager
Among the autonomic managers of the framework, the Physical Manager is the simplest one. From an architectural point of view, it comprises only two modules: the MONITOR and the ACTUATOR module (module names that have not been introduced before are written using small capitals). This architecture is shown in Figure 4.2. The Physical Manager has two responsibilities. First, it monitors the resource utilization levels of the physical machine it was assigned to. The resources subject to monitoring are memory, CPU, network and I/O bandwidth, in line with the formal model defined in Section 3.2.1. This data is then forwarded to the Cluster Manager. The monitoring is performed by the Monitor module.
Figure 4.2: The Physical Manager

The second responsibility of the Physical Manager is the execution of tasks coming from the Cluster Manager. The framework has a generic character, and thus there are various tasks which can be assigned to the Physical Manager by the Cluster Manager. For example, the Physical Manager can be commanded to allocate additional memory to a certain virtual machine, increase the CPU share of a virtual machine or even migrate a virtual machine to another physical machine. Inside the Physical Manager, the execution of tasks is performed by the Actuator module. Both the monitoring and the execution of tasks are performed by the Physical Manager on its associated physical machine through the VMM. This is because, as shown in Section 2.1, in a virtualized physical machine it is the VMM and not the OS that controls the access to resources. It is therefore at the VMM that resource utilization levels can be monitored and resource allocation tasks can be executed. As can be observed in Figure 4.2, the Physical Manager has no logic component, i.e. while a common autonomic manager, as shown in Section 2.3, performs a so-called MAPE (Monitor, Analyze, Plan, Execute) loop, the Physical Manager only performs the Monitor and Execute phases. In our design, the Analyze and Plan phases are simply offloaded to the Cluster Manager. Furthermore, this design provides support for different virtualization technologies. The Monitor and Actuator modules can be implemented to work with a certain VMM (e.g. Xen), and then easily adapted to work with another VMM (e.g. VMware ESX), without changing the way the framework works. This way, our framework is virtualization technology agnostic and thus fulfills the third framework requirement.
Figure 4.3: The VM Manager
4.1.3 The VM Manager
From an architectural point of view, as can be observed in Figure 4.3, the VM Manager is the most complex component among the autonomic managers. It comprises four modules: the SL MONITOR, the VM MONITOR, the VM LOGIC and the ACTUATOR module. As the figure indicates, the VM Logic is the central module of the VM Manager: inside the VM Manager, all modules communicate with, and only with, the VM Logic. At this design stage, we address the first framework requirement: support for service level management. Through its SL Monitor module, the VM Manager is capable of monitoring the service hosted inside the virtual machine. By monitoring certain stipulated QoS parameters, the VM Manager is capable of determining whether the quality of the service complies with a stipulated SLA. The main responsibility of the VM Manager is to determine whether the service inside the virtual machine is working at the required parameters, and if this is not the case, to determine what is causing the bottleneck. The SL Monitor module monitors service parameters like response time and throughput. It then forwards these parameters to the VM Logic module. The VM Monitor module monitors the resource utilization of the virtual machine. The parameters monitored by the VM Monitor are memory, CPU, network and I/O bandwidth, in line with the formal model defined in Section 3.2.1. Like the SL Monitor, the VM Monitor forwards these parameters to the VM Logic. Based on the monitoring data gathered from the two monitoring modules, the VM Logic is capable of determining: (a) if the service is working well and (b) the resource that is
causing the bottleneck when the service is not working well. When the service is not providing the required quality, specified by means of an SLA, the VM Manager must, upon determining the resource causing the bottleneck, request more of the bottlenecked resource from the Cluster Manager. This is done by the Actuator module: when a resource causing a bottleneck has been determined, the VM Logic sends the required additional amount of the resource to the Actuator module, which forwards this in the form of a request to the Cluster Manager. Having only one resource as a possible bottleneck is consistent with the formal model defined in Section 3.2.1. The VM Manager performs a complete MAPE loop. The Monitor phase is performed by the VM Monitor and SL Monitor modules. The VM Logic module performs the Analyze and Plan phases; to be able to perform these phases, an implementation of the VM Logic must maintain a specific knowledge base. Finally, the Execute phase is performed by the Actuator module.
4.1.4 The Cluster Manager
In the hierarchy of the framework, the Cluster Manager occupies the highest position. This autonomic manager addresses the self-optimization of the cluster and communicates with all VM Managers and Physical Managers. It comprises three modules: the MONITOR, the CLUSTER LOGIC and the ACTUATOR module. This architecture is shown in Figure 4.4.

Figure 4.4: The Cluster Manager

The Monitor module receives data from both VM Managers and Physical Managers. This data is preprocessed and forwarded to the Cluster Logic. The Cluster Manager performs two main tasks:
1. computes a global view of the resources in the cluster
2. satisfies the requirements of the VM Managers
These tasks are addressed by the Cluster Logic, based on the data received via the Monitor module. Thus, the Cluster Logic actually represents the intelligent controller mentioned throughout Chapter 3. Decisions taken by the Cluster Logic are forwarded to the Physical Managers through the Actuator module. This architectural approach addresses the second and fourth framework requirements. The Cluster Logic represents the controller and is separated from the monitoring and task execution functionalities of the framework. Various controllers can easily be implemented as a Cluster Logic module and plugged into the framework. This fulfills the second framework requirement. Furthermore, it is possible to use the same infrastructure and management framework with various controllers. This represents the foundation of a common evaluation mechanism and thus fulfills the fourth framework requirement. Computing a global view of the resources in the cluster is done based on data received from the Physical Managers. Each Physical Manager monitors the resource utilization levels of the physical machine to which it is assigned. This local resource view is then forwarded to the Cluster Manager by all Physical Managers. From this multitude of local resource views, the Cluster Manager computes its global resource view. The global resource view is actually defined by the vectors a_i and the relation g, as in Section 3.2.1. As such, the global resource view contains data such as which VMs are provisioned on each physical machine, how much memory is provisioned on each VM, what the CPU share of each VM is, how much memory is still available on each physical machine and what the CPU utilization of each physical machine is. Satisfying the requirements of the VM Managers represents the most difficult task of the Cluster Logic. When a requirement from a VM Manager is received, the Cluster Logic consults its global resource view and, based on it, takes a resource allocation decision. This decision is then forwarded to the Actuator module. The job of the Actuator module is to translate a Cluster Logic decision into tasks for the Physical Managers. Simple decisions like "increase the CPU share of VM i by x" are simply forwarded to the appropriate Physical Manager, in this case the Physical Manager of the machine on which VM i is provisioned. However, the Cluster Logic can also take complex decisions like: "migrate VM i from physical machine t to physical machine p, migrate VM j from physical machine q to physical machine t, allocate an additional 100MB of memory to VM j". When such a decision is taken, the Actuator module must delegate tasks to various Physical Managers. For our example, the Actuator must command the Physical
Manager of t to migrate VM i to p, then command the Physical Manager of q to migrate VM j to t, and then again command the Physical Manager of t to allocate an additional 100MB of memory to VM j. Thus, the Actuator must coordinate the delegation of tasks among several Physical Managers. We have previously mentioned that the Analyze and Plan phases of the MAPE loop corresponding to the Physical Manager have been offloaded to the Cluster Manager. The necessity of this design decision is obvious: the Cluster Manager is the only component with a global resource view, while each Physical Manager only has a local resource view. The framework is not aimed at optimizing single physical machines, but the cluster as a whole. As such, the Cluster Manager performs a complete MAPE loop. The Monitor phase is performed by the Monitor module, the Analyze and Plan phases are performed by the Cluster Logic module and the Execute phase is performed by the Actuator module.
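The following sketch illustrates how the Actuator could delegate such an ordered, multi-step decision to individual Physical Managers. The Task and PhysicalManagerClient types are hypothetical illustrations of the idea described above, not the interfaces of the implemented framework.

import java.util.*;

/** A decision is an ordered list of tasks; the Cluster Manager's Actuator dispatches each task
 *  to the Physical Manager responsible for the PM named in the task, preserving the order
 *  (e.g. free capacity on a target PM before migrating another VM onto it). */
class DecisionDispatcher {
    static class Task {
        final String targetPm;     // Physical Manager that must execute the task
        final String description;  // e.g. "migrate VM i to p" or "allocate 100MB to VM j"
        Task(String targetPm, String description) { this.targetPm = targetPm; this.description = description; }
    }

    interface PhysicalManagerClient { void execute(String taskDescription); }

    static void dispatch(List<Task> decision, Map<String, PhysicalManagerClient> physicalManagers) {
        for (Task task : decision) {                       // order matters: execute sequentially
            physicalManagers.get(task.targetPm).execute(task.description);
        }
    }
}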
4.1.5 The Complete Architecture
Section 4.1.1 described the managed infrastructure, together with the management backbone of the framework. Afterwards, in Sections 4.1.2, 4.1.3 and 4.1.4, the architecture of each of the autonomic managers was described in detail. Putting all these pieces together, the complete architecture of the framework is obtained. This is shown in Figure 4.5. Since each entity in the system has an autonomic manager, our system is composed only of autonomic components. Compared with other architectural approaches to autonomic computing systems, our architecture most closely resembles the approach proposed by White et al. in [WHW+04]. Furthermore, the internal architecture of each autonomic element conforms to the IBM vision [KC03] of how autonomic computing systems operate. This complete architecture fulfills all five framework requirements defined in Section 3.2, as shown in Sections 4.1.1, 4.1.2, 4.1.3 and 4.1.4.
4.1.6 The Intra-Manager Communication
In the previous subsections, the communication between the modules of each autonomic manager was described. This is called the intra-manager communication. The intra-manager communication is based on a modular Self-Manager, developed in a project outside of this thesis in the Distributed Systems Lab of the University of Applied Sciences Wiesbaden [Sch07, SG06]. The architecture of the Self-Manager can be observed in Figure 4.6.
Figure 4.5: The complete architecture of the framework

Internally, the Self-Manager consists of four main components. The central element of the Self-Manager is the management kernel, which provides adapters to connect extension modules. Extension modules are instantiated by the central module manager. The modules implement sensors, effectors, and the internal logic of the self-management controller. A messaging subsystem, which is part of the kernel, is responsible for message handling within the Self-Manager. In order to use the communication backbone offered by the Self-Manager, each module of the framework must be implemented as a Self-Manager module. This has serious implications for the implementation of the framework: because the Self-Manager is written in Java and the communication between modules takes place over the Java Messaging Service (JMS, http://java.sun.com/products/jms/), the framework must be implemented in Java.

Figure 4.6: Modular Self-Manager architecture
4.1.7 The Inter-Manager Communication
To understand how all the pieces of the framework work together, the following paragraphs present the intra-manager communication together with the inter-manager
communication, using a dynamic resource allocation example. By inter-manager communication we mean the communication between autonomic managers. Figure 4.7 shows the sequence diagram corresponding to an exemplary dynamic resource allocation. The SL Monitor and VM Monitor modules of the VM Manager monitor the parameters of the service and the virtual machine, respectively. This data is fed to the VM Logic, which then checks whether the QoS parameters of the service are SLA-conform and, if not, determines the resource causing the bottleneck. The results of the diagnosis are forwarded to the Actuator module of the VM Manager, which then issues a request for additional resources. This request is received by the Monitor module of the Cluster Manager and fed to the Cluster Logic. The Cluster Logic decides on the resource reallocation and sends its decision to the Actuator module of the Cluster Manager, which then delegates this task to the involved Physical Manager. The Physical Manager receives the command via its Actuator module and performs the reallocation via the VMM. Note that this sequence diagram is meant only to exemplify the inter- and intra-manager communication. Certain details, like updates from Physical Managers and the delegation of tasks to several Physical Managers, are not shown in Figure 4.7.
Figure 4.7: Exemplary inter-manager and intra-manager communication
4.2 The Bottleneck Determination Mechanism
In Section 3.4 it was argued that, in order to provide the abstraction level required by the controller, the framework needs to comprise a bottleneck determination mechanism. As can be observed from the architecture of the framework presented in the previous section, this functionality is provided by the VM Logic module of a VM Manager. There are various approaches to implementing the bottleneck determination mechanism. These approaches were described in Section 3.4. Among them, the rule-based system approach appears to be the most attractive one to start with, because rule-based systems are known to be a feasible, easy-to-implement approach in the context of autonomic computing. In the previous section, we made the design decision of implementing the entire framework in Java. This means that the VM Logic module, like all modules of the framework, must be written in Java. The VM Logic module hosts the bottleneck determination mechanism. Ideally, the rule engine implementing the bottleneck determination mechanism should support integration into a Java program. Fortunately, such rule engines exist, and the Java integration is defined by JSR 94 (http://jcp.org/en/jsr/detail?id=94). The internal architecture of the VM Logic module using a JSR 94 rule engine is shown in Figure 4.8. The VM Logic provides a communication interface, over which the communication with the SL Monitor, VM Monitor and Actuator modules of the VM Manager takes place. The center of the VM Logic module is the rule engine.
The rule engine uses the communication interface to receive data from the SL Monitor and VM Monitor. This data is fed to the rule engine via the standardized JSR 94 interface. The SLAs defined for the service hosted inside the virtual machine are translated into rules for the rule engine, which is then used for the bottleneck determination. When a bottleneck is determined, this information is sent to the Actuator module via the communication interface.

Figure 4.8: The internal architecture of the VM Logic module
4.3 The Controller

As described in Section 4.1, the controller functionality of the framework is implemented by the Cluster Logic module of the Cluster Manager. The internal architecture of the Cluster Logic is shown in Figure 4.9.
Figure 4.9: The internal architecture of the Cluster Logic module
The Communication Interface handles the communication with the Monitor and Actuator modules of the Cluster Manager. When the Monitor module forwards monitoring data coming from the Physical Managers, the Communication Interface updates this data in the Global Resource View. When a VM Manager sends a request for additional resources, again via the Monitor module of the Cluster Manager, the Communication Interface forwards this request to the controller Algorithm. The Algorithm uses the Global Resource View to make a resource allocation decision and sends this decision via the Communication Interface to the Actuator module of the Cluster Manager, which then delegates tasks to the respective Physical Managers. Obviously, the central component of the controller is the algorithm. We have designed two such algorithms: Algorithm A and Algorithm B. They are presented in the following subsections.
4.3.1 Algorithm A
Algorithm A is the simpler of the two and is quite similar to the greedy best-first technique [RN04]. Whenever a VM Manager requests additional resources for its corresponding virtual machine, Algorithm A uses the global resource view to check whether these resources are locally available, i.e. on the physical machine hosting this virtual machine. If this is the case, then the requested resources are locally assigned to the virtual machine. If the requested resources are not locally available, then Algorithm A checks for physical machines in the cluster which can satisfy the resource requirements of the requesting virtual machine. These requirements represent the sum of the resources currently provisioned to the virtual machine and the additional resources requested by it. The pseudocode of this algorithm is shown in Algorithm 1. From the list of physical machines that can provision the virtual machine, the one with the highest resource utilization is chosen. Note that the list of remote machines which provide sufficient resources can contain both offline and online machines. Choosing the remote machine with the highest resource utilization makes sure that offline machines are powered on only if no online machine can provide the required resources. This strategy aims at increasing the energy efficiency of the cluster. Using a different heuristic when choosing the remote physical machine to migrate to, for example choosing the physical machine with the lowest resource utilization, has one big disadvantage. Consider requirements coming from two different virtual machines. Both require a migration to a remote physical machine, and they are processed sequentially.
Algorithm 1: Algorithm A
Input: a VM that requires more resources
Output: a new mapping
if resources available locally then
    assign local resources;
    return new mapping;
else if remote machines exist with the required resources available then
    migrate to the remote machine with the highest resource utilization;
    return new mapping;
else
    return no solution found;
end
The first virtual machine requires a relatively small amount of resources and is provisioned on the physical machine with the lowest resource utilization, although physical machines with a higher resource utilization could also provision this virtual machine. When the requirements of the second virtual machine must be fulfilled, Algorithm A finds that no remote physical machine has enough free resources to provision this second virtual machine. However, on the physical machine where the first virtual machine was provisioned, these resources were available before the first virtual machine was placed there. Thus, it would have been possible to satisfy the requirements of the second virtual machine if the first virtual machine had been provisioned on a physical machine with a higher resource utilization. This clearly speaks for the heuristic used in Algorithm A and against an alternative heuristic in which low-utilized physical machines are preferred over highly utilized ones. However, Algorithm A has two limitations. First, since it is a heuristic-based algorithm, it has no guaranteed performance ratio. For our case this means that, although Algorithm A aims at optimizing the allocation of virtual machines to physical machines with the final purpose of reducing the number of necessary online physical machines, one can never say how far its solution is from the optimal allocation. The second limitation of Algorithm A is that it is not capable of performing complex virtual machine to physical machine reallocations in the cluster. This limitation is exemplified in Figure 4.10. Here, VM 2 was created and needs to be provisioned. However, VM 2 is too big for PM 1, and on PM 2 there is not enough room, i.e. there are not enough free resources. In such a case, Algorithm A finds no solution although, as can be observed in Figure 4.10, it is possible
to migrate VM 1 from PM 2 to PM 1 and then to provision VM 2 on PM 2. Such complex solutions are not found by Algorithm A.

Figure 4.10: Exemplary limitation of Algorithm A
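A minimal Java sketch of the placement logic of Algorithm A as described above: satisfy the request locally if possible, otherwise choose the remote physical machine with the highest resource utilization that can still host the VM. The PhysicalMachine abstraction and its methods are assumptions standing in for the global resource view, not the actual Cluster Logic implementation.

import java.util.*;

/** Greedy placement in the spirit of Algorithm A. */
class AlgorithmA {
    interface PhysicalMachine {
        boolean canSatisfyLocally(String vmId, double additional);
        boolean canHost(String vmId, double totalRequirement);
        double utilization();
        String id();
    }

    /** Returns the id of the PM that should host the VM, or null if no solution was found. */
    static String place(String vmId, double additional, double totalRequirement,
                        PhysicalMachine local, List<PhysicalMachine> remote) {
        if (local.canSatisfyLocally(vmId, additional)) {
            return local.id();                                   // assign local resources
        }
        PhysicalMachine best = null;
        for (PhysicalMachine pm : remote) {
            if (pm.canHost(vmId, totalRequirement)
                    && (best == null || pm.utilization() > best.utilization())) {
                best = pm;                                       // prefer highest utilization
            }
        }
        return best == null ? null : best.id();                  // null: no solution found
    }
}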
4.3.2 Algorithm B
Algorithm B was designed to address the limitations of Algorithm A. It is based on the K-Best-First Search (KBFS) algorithm introduced by Felner et al. [FKK03] in 2003. KBFS is a generalization of best-first search (BFS) and has been shown to outperform BFS (Felner et al. showed that KBFS outperforms BFS by a factor of 15 on random trees with dead-ends). Some tabu search techniques, presented in Section 2.4, are also used. Simplified, KBFS basically performs a search in a tree. To be able to apply KBFS to our problem, an encoding of the VM-to-PM mapping is required. Furthermore, the parent-child relationship needs to be defined, as well as a "goodness" criterion for each state.

4.3.2.1 The Encoding
Algorithm B encodes the mapping of VMs to physical machines in the cluster in the form of states. In addition, the encoding includes the allocated and available resources of each physical machine. This basically reflects the global resource view computed by the Cluster Manager, and thus the formal model defined in Section 3.2.1. The initial state represents the root of the search tree. From the initial state, child states are generated, and this process is repeated recursively. A child state B of a parent state A represents a VM-to-PM mapping which differs from the VM-to-PM mapping of state A in exactly one position, i.e. a single migration has taken place. An exemplary encoding is shown in Table 4.1.
State A:  VM 1 → PM 1,  VM 2 → PM 1,  VM 3 → PM 2
State B:  VM 1 → PM 1,  VM 2 → PM 2,  VM 3 → PM 2

Table 4.1: Exemplary VM-to-PM encoding of two parent-child states
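To make the encoding concrete, the following is a minimal Java sketch of such a state; the class and field names (ClusterState, vmToPm, totalCapacity, requestedCapacity) are hypothetical illustrations, not the names used in the actual implementation.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a search-tree state: a VM-to-PM mapping plus the
// capacity data needed to judge validity. All names are hypothetical.
public class ClusterState {
    // e.g. "VM 1" -> "PM 1"
    final Map<String, String> vmToPm = new HashMap<String, String>();
    // total capacity of each physical machine (here: memory, as in the prototype)
    final Map<String, Integer> totalCapacity = new HashMap<String, Integer>();
    // requested capacity of each virtual machine
    final Map<String, Integer> requestedCapacity = new HashMap<String, Integer>();

    // A child state differs from its parent in exactly one mapping entry,
    // i.e. it is reached by a single migration.
    ClusterState migrate(String vm, String targetPm) {
        ClusterState child = new ClusterState();
        child.vmToPm.putAll(this.vmToPm);
        child.totalCapacity.putAll(this.totalCapacity);
        child.requestedCapacity.putAll(this.requestedCapacity);
        child.vmToPm.put(vm, targetPm);
        return child;
    }
}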
4.3.2.2 The Goodness Criteria
To apply KBFS to our problem, one needs to be able to compare different states in order to determine which one is better. Each state has a global profit, which is a function of the sum of the local profits of the physical machines and of the validity of the state. The local profit of a physical machine is a function of its resource utilization: the higher the resource utilization of a machine, the higher its local profit. An offline physical machine has the maximum possible local profit. This ensures that the global profit of a state increases when some machines are kept at a high resource utilization while the others are kept offline. This is expressed by the formula:

local_profit(p) = f(resource_utilization(p))

where: local_profit : P → Q, resource_utilization : P → Q, and f : Q → Q.
The global profit is calculated by multiplying the sum of the local profits with the validity of the state, which can be either 0 or 1. Thus, the global profit is either the sum of the local profits, if the state is valid, or 0, if the state is not valid. A valid state is a state in which the total capacity of every physical machine in the cluster is greater than or equal to the sum of the requested capacities of the VMs provisioned on that physical machine. In terms of the formal model defined in Section 3.2.1, this means that ∀j ∈ P, t_j ≥ ∑_{i=1}^{n} a_i, where a_1, ..., a_n represent the virtual machines provisioned on the physical machine j. All other states are invalid. As a result, Algorithm B defines a valid global state as a state with a global profit higher than 0. Furthermore, valid states can be ranked by their global profit. This is expressed by the formula:

global_profit(state) = validity(state) · ∑_{i=1}^{|P|} local_profit(i),

where: Z = the set of states, global_profit : Z → Q, and validity : Z → {0, 1}.

Besides the global profit of a state, Algorithm B also computes the total cost of moving from the initial state to any given state. The total cost is a function of the number of migrations necessary for the transition from the initial state to that state. The total cost is then used in combination with the global profit to determine the utility of the transition from the initial state to a given state. This utility is the ultimate quality factor used to select the final state. Thus, using the formal model defined in Section 3.2, the utility of a state can be calculated with the formula:

utility(state) = d(global_profit(state), total_cost(state))

where: utility : Z → Q, d : Q × Q → Q, total_cost(state) = e(migrations_necessary(initial_state, state)), total_cost : Z → Q, e : N → Q, and migrations_necessary : Z × Z → N.
Note that the functions d, e and f are not determined at this point. Instead, they can vary, depending on the implementation. Tests using different configurations of d, e and f can then determine which configuration fits best for certain scenarios.
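As an illustration, the following Java sketch computes local profit, global profit and utility for a state in the hypothetical ClusterState encoding sketched above. The concrete choices for f, d and e (utilization used directly as local profit, a linear migration cost, and a simple difference for the utility) are only example configurations, not the ones fixed by the design.

import java.util.Map;

// Example goodness computation for a ClusterState (see the sketch above).
// The functions f, d and e are deliberately simple placeholders.
public class Goodness {

    // f: local profit grows with utilization; an offline machine (no VMs)
    // is assigned the maximum local profit used here (1.0).
    static double localProfit(ClusterState s, String pm) {
        int used = 0;
        for (Map.Entry<String, String> e : s.vmToPm.entrySet()) {
            if (e.getValue().equals(pm)) {
                used += s.requestedCapacity.get(e.getKey());
            }
        }
        if (used == 0) {
            return 1.0; // offline machine: maximum local profit
        }
        return (double) used / s.totalCapacity.get(pm); // utilization
    }

    // validity: 1 if no physical machine is overloaded, 0 otherwise
    static int validity(ClusterState s) {
        for (String pm : s.totalCapacity.keySet()) {
            int used = 0;
            for (Map.Entry<String, String> e : s.vmToPm.entrySet()) {
                if (e.getValue().equals(pm)) {
                    used += s.requestedCapacity.get(e.getKey());
                }
            }
            if (used > s.totalCapacity.get(pm)) {
                return 0;
            }
        }
        return 1;
    }

    static double globalProfit(ClusterState s) {
        double sum = 0.0;
        for (String pm : s.totalCapacity.keySet()) {
            sum += localProfit(s, pm);
        }
        return validity(s) * sum;
    }

    // d: utility trades global profit against migration cost; e is linear here.
    static double utility(ClusterState s, int migrationsFromInitialState, double costPerMigration) {
        double totalCost = costPerMigration * migrationsFromInitialState; // e
        return globalProfit(s) - totalCost;                               // d
    }
}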
4.3.2.3 Applying KBFS
Using this encoding and goodness criterion, Algorithm B performs its local search. For this search, two lists are used: the open list, containing the states (nodes) that have been evaluated with respect to their utility but not yet expanded, and the closed list, containing the nodes that have already been expanded. At each iteration, the best K nodes from the open list are expanded, and their children are evaluated and added to the open list. After a limited number of iterations, the algorithm terminates and the node with the highest utility is selected. This represents the final state, and the system moves to that state by means of successive migrations. What Algorithm B performs is thus a mixture between breadth-first and depth-first search. Figure 4.11 shows an example of how Algorithm B proceeds, while the pseudocode of Algorithm B is shown in Algorithm 2.
Input: A VM that requires more resources, K, ITERATIONS
Output: New mapping
if resources available locally then
    assign local resources;
    return new remapping;
end
else
    determine all the children of the current state;
    add the current state to the closed list;
    add the children of the current state to the open list;
    while ITERATIONS ≠ 0 do
        select the K children with the highest utility from the open list;
        foreach selected child do
            move the child to the closed list;
            add the children of this child to the open list;
        end
        decrement ITERATIONS by one;
    end
    select the state with the highest utility from the closed list;
    return new remapping;
end

Algorithm 2: Algorithm B
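The following Java sketch of the expansion loop is built on the hypothetical ClusterState and Goodness classes sketched earlier; the enumeration of children (all single-migration neighbours) is left abstract, the depth of a node is used to approximate migrations_necessary, and only the valid-to-invalid tabu restriction described below is included.

import java.util.ArrayList;
import java.util.List;

// Sketch of a KBFS-style expansion loop in the spirit of Algorithm B.
public class KbfsSketch {

    interface Expander {
        List<ClusterState> childrenOf(ClusterState state);
    }

    static ClusterState search(ClusterState initial, int k, int iterations,
                               Expander expander, double costPerMigration) {
        List<ClusterState> open = new ArrayList<ClusterState>();
        List<ClusterState> closed = new ArrayList<ClusterState>();
        List<Integer> openDepth = new ArrayList<Integer>();    // migrations from initial state
        List<Integer> closedDepth = new ArrayList<Integer>();

        closed.add(initial);
        closedDepth.add(0);
        for (ClusterState child : expander.childrenOf(initial)) {
            open.add(child);
            openDepth.add(1);
        }

        while (iterations-- > 0 && !open.isEmpty()) {
            // expand the K open nodes with the highest utility
            for (int picked = 0; picked < k && !open.isEmpty(); picked++) {
                int best = 0;
                for (int i = 1; i < open.size(); i++) {
                    if (Goodness.utility(open.get(i), openDepth.get(i), costPerMigration)
                            > Goodness.utility(open.get(best), openDepth.get(best), costPerMigration)) {
                        best = i;
                    }
                }
                ClusterState node = open.remove(best);
                int depth = openDepth.remove(best);
                closed.add(node);
                closedDepth.add(depth);
                for (ClusterState child : expander.childrenOf(node)) {
                    // tabu move: do not step from a valid state into an invalid one
                    if (Goodness.validity(node) == 1 && Goodness.validity(child) == 0) {
                        continue;
                    }
                    open.add(child);
                    openDepth.add(depth + 1);
                }
            }
        }

        // return the expanded state with the highest utility
        int best = 0;
        for (int i = 1; i < closed.size(); i++) {
            if (Goodness.utility(closed.get(i), closedDepth.get(i), costPerMigration)
                    > Goodness.utility(closed.get(best), closedDepth.get(best), costPerMigration)) {
                best = i;
            }
        }
        return closed.get(best);
    }
}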
Figure 4.11: Algorithm B visual example (with K = 2 and ITERATIONS = 3)

Algorithm B is always capable of finding a valid solution if valid solutions exist and the transition cost to at least one of them does not exceed a certain predefined value. To achieve this, the ITERATIONS parameter needs to be adapted to the specific allocation problem, so that the algorithm terminates exactly when the maximum number of migrations permitted is exceeded. Note that, in finding a solution, the algorithm determines a local optimum, which is not necessarily the global optimum. When a final state is obtained, the solution is represented in the form of transitions (migrations) from the initial state, through intermediary states, to the final state. While the final
state is, in most cases, a valid state (except when the problem has no acceptable solution), one or more of the intermediary states might not be valid states. To avoid this problem, Algorithm B uses a tabu search technique: it defines transitions from valid to invalid states as tabu moves. As such, no solutions containing tabu moves are accepted, and the algorithm chooses the best state in the closed list that leads to a solution without tabu moves. Furthermore, to avoid cycles, the parent of a state cannot become a child of that same state; such a migration is also defined as a tabu move. Algorithm B provides significant advantages. Most importantly, unlike Algorithm A, if an acceptable solution exists, Algorithm B will find it. Furthermore, Algorithm B improves the resource utilization of the cluster and reduces energy consumption by minimizing the number of physical machines needed. However, just like Algorithm A, Algorithm B has one limitation: it provides no guaranteed performance ratio.
Chapter 5

Implementation

In Section 4.1, the architecture of the management framework was designed. The framework is used to manage a cluster of virtualized physical machines. The software configuration of the cluster is described in Section 5.1. Furthermore, the management framework comprises three types of autonomic managers: the Physical Manager, the VM Manager and the Cluster Manager. The overall implementation of each of these managers is presented in Section 5.2, while the implementation of their core control components, i.e. the bottleneck determination mechanism and the cluster controller, is described in Sections 5.3 and 5.4 respectively. Finally, Section 5.5 presents the effort of this implementation, measured in lines of code. Note that although the framework and the controllers were designed using the mathematical model from Section 3.2.1, where four types of resources are defined (memory, CPU, network bandwidth and I/O bandwidth), the implementation of the framework focuses on only one resource, namely memory. This is because the implementation presented here is meant to be a prototype, a proof of concept with respect to the design presented in the previous chapter. In the following sections, the implementation is described in detail; the purpose of this high level of detail is to provide a solid building block for future work.
5.1 Software Configuration of the Cluster
The framework developed as part of this thesis is targeted at the autonomic management of virtualization-based data centers. As such, for the purpose of this implementation, a virtualization technology commonly found in data centers must be used to virtualize the
physical machines in the cluster. As shown in Section 2.1, three products fit well into this category: Virtuozzo and its open source variant OpenVZ, Xen (including all the virtualization products that are based on the Xen hypervisor) and VMware ESX. Out of the three, the open source variant of Xen that comes with Fedora Core 7 was chosen for the implementation. The argument for using Xen was mostly of a financial nature, since Xen is open source, unlike Virtuozzo and VMware ESX. Furthermore, OpenVZ has the disadvantage of hosting only VMs that use the same (Linux) kernel as the host OS, which is not the case with Xen, as seen in Section 2.1. As such, the physical machines in the managed infrastructure were set up to run Fedora Core 7 with Xen 3.1. Fedora Core 7 was also used as the guest OS running in the virtual machines. Inside each virtual machine, a Tomcat 5.5.25 server was installed and the JMX monitoring of Tomcat was enabled. This permits monitoring the Tomcat parameters as well as some of the OS parameters of the virtual machine hosting the Tomcat server. Furthermore, for the communication with the Xen hypervisor, the XenAPI daemon was activated on both physical machines.
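For illustration, the following minimal Java sketch shows how such JMX monitoring data can be read from a guest; the hostname and port are placeholders (not the values of the actual test bed), while the MBean and attribute names are the ones that also appear later in the VM Logic configuration.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal sketch: read one Tomcat parameter and one OS parameter over JMX,
// assuming Tomcat was started with remote JMX enabled on port 9999.
public class JmxReadSketch {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://vm-hostname:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Tomcat service-level parameter
            Object maxTime = mbsc.getAttribute(
                    new ObjectName("Catalina:type=GlobalRequestProcessor,name=http-8080"),
                    "maxTime");

            // OS-level parameter of the guest, exposed by the JVM
            Object freeMem = mbsc.getAttribute(
                    new ObjectName("java.lang:type=OperatingSystem"),
                    "FreePhysicalMemorySize");

            System.out.println("maxTime=" + maxTime + ", freePhysicalMemory=" + freeMem);
        } finally {
            connector.close();
        }
    }
}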
5.2 The Framework

5.2.1 The Self-Manager Backbone
In Section 4.1, we have seen that the management framework is based on a Self-Manager developed in the Distributed Systems Lab of Wiesbaden University of Applied Sciences. The Self-Manager is the backbone of our modular architecture, providing support for loading modules and for the communication between modules. Recall the architecture of the framework; to facilitate the reading of this thesis, it is shown again in Figure 5.1. Each of the three autonomic managers, i.e. the VM Manager, the Physical Manager and the Cluster Manager, represents an instance of the Self-Manager; the differences between them are determined by the modules loaded by the Self-Manager core. The overall architecture of the Self-Manager is shown in Figure 5.2. In order to load and initialize modules, the Self-Manager kernel must be started with an argument pointing at a configuration file, in which the modules to be loaded and their initialization parameters are specified. The configuration file uses the Java .properties
Figure 5.1: The complete architecture of the framework

Figure 5.2: The architecture of the Self-Manager
format (see http://java.sun.com/j2se/1.4.2/docs/api/java/util/Properties.html). As an example, the configuration file is shown in Listing 5.1.

1  #Module Manager settings
2  moduleManager.class = de.fh_wiesbaden.cs.vs.manager.ModuleManager
3  ModuleManager.loadModules = module_1 module_2
4
5  ########################################
6  # The modules with their configuration #
7  ########################################
8
9  #################################
10 # Module module_1 configuration #
11 #################################
12
13 module_1 = de.fh_wiesbaden.cs.vs.example.Module1
14
15 #reference to the module_2 module
16 module_1.module_2_reference = module_2
17 #the initial value
18 module_1.init_value = 233
19
20 #################################
21 # Module module_2 configuration #
22 #################################
23 module_2 = de.fh_wiesbaden.cs.vs.example.Module2
24
25 #the hostname to monitor
26 module_2.hostname = wwwvs
Listing 5.1: Configuration file example

In this example, two modules, module_1 and module_2, are loaded. This is done in line 3. For each module, the main class of the module must be specified. For module_1, this is done in line 13, while for module_2, this is done in line 23. Each of the two modules specifies a set of properties, which are loaded when the module initializes: lines 15-18 for module_1 and lines 25-26 for module_2. There are three abstract types of modules that can be loaded by the Self-Manager: Control modules, Action modules and Event modules. The modules are loaded based on the main class of the module specified in the configuration file. Each of the three types of modules extends the Module interface, shown in Listing 5.2.

1  public interface Module {
2
3      public abstract void initialize(
4          String moduleName, ConfigReader configReader, Log tracer,
5          Log logger, Messenger messenger);
6
7      public abstract void shutdown();
8
9      public abstract void updateConfig();
10
11     public abstract boolean ping();
12
13     public abstract State getState();
14
15     public abstract ContentElement invoke(BasicMessage requestMessage);
16 }
Listing 5.2: The Module interface

A module implementing one of the three module types must provide an implementation of the initialize() (lines 3-5), shutdown() (line 7) and invoke() (line 15) methods. For the remaining methods defined in the Module interface, a default implementation is provided by the class BasicModule, which can be used by extending this class in the module. Each of the three types of autonomic managers in the framework represents a composition of concrete modules that implement one of the three module types (Control, Action, Event). In the following subsections, the implementation of each of these autonomic managers is described.
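As a rough illustration of what a concrete module looks like, the following hypothetical skeleton implements only the methods declared in Listing 5.2; what a real module does inside initialize() and invoke() (reading its properties, answering messages) is an assumption made for the sake of the example.

// Hypothetical minimal module skeleton, relying only on the Module interface
// of Listing 5.2; method bodies are illustrative placeholders.
public class ExampleModule implements Module {

    private String moduleName;
    private Log logger;

    public void initialize(String moduleName, ConfigReader configReader,
                           Log tracer, Log logger, Messenger messenger) {
        this.moduleName = moduleName;
        this.logger = logger;
        // a module would typically read its own properties here,
        // e.g. entries named <moduleName>.someProperty
    }

    public void shutdown() {
        // release resources (threads, connections) held by the module
    }

    public void updateConfig() {
        // re-read configuration at runtime; left empty in this sketch
    }

    public boolean ping() {
        return true; // report the module as alive
    }

    public State getState() {
        return null; // a real module would report its current State
    }

    public ContentElement invoke(BasicMessage requestMessage) {
        // called asynchronously when another module sends a message
        return null;
    }
}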
5.2.2 The Physical Manager
The purpose of the Physical Manager is to monitor and execute commands on a virtualized physical machine. Although all the physical machines in the cluster use Xen, the implementation of the framework keeps the generic character of the design presented in Section 4.1.2. Thus, from the point of view of the implementation, the internal architecture of the Physical Manager comprises three modules, as shown in Figure 5.3. The Monitor and Actuator modules are used to monitor the physical machine and to execute commands on it, respectively, and have a generic character. This means that if, say, VMware ESX is used instead of Xen, the Monitor and Actuator modules require no modification. Instead, it is the Xen module that, as the name suggests, is technology-bound. This module receives messages from the Actuator and Monitor modules and transforms these requests into Xen-specific commands. Thus, it is this Xen module that is used for the actual interaction with the VMM. To adapt the Physical Manager to another virtualization technology, say VMware ESX, all one needs to do is to implement a VMware ESX module capable of handling the same messages as the Xen module and replace the
Xen module with it. In the following paragraphs, the implementation of each of these modules is described in detail.

Figure 5.3: The Implementation of the Physical Manager
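To illustrate this replaceability, the following hypothetical skeleton sketches what such a VMware ESX module could look like. It extends BasicModule (whose default implementations are assumed here) and reacts to the same command identifiers that the Xen module handles later in this chapter; the ESX-side calls themselves are placeholders only, not an actual ESX API.

// Hypothetical sketch of a technology-bound replacement for the Xen module.
// It must understand the same messages as the Xen module; the command
// identifiers mirror those handled by the Xen module, the ESX-specific
// calls are left as placeholders.
public class VmwareEsxModule extends BasicModule {

    public ContentElement invoke(BasicMessage requestMessage) {
        String command = requestMessage.getBody().getName();

        if (command.equals(Identifiers.COM_GET_VALUE)) {
            // query the ESX host for the requested parameter (placeholder)
            return null;
        } else if (command.equals(Identifiers.COM_SET_VM_TOTAL_MEMORY)) {
            // change the memory allocation of a VM via the ESX API (placeholder)
            return null;
        } else if (command.equals(Identifiers.COM_LIVE_MIGRATE)) {
            // trigger a live migration via the ESX API (placeholder)
            return null;
        }
        return null;
    }
}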
5.2.2.1 The Monitor Module
The Monitor module of the Physical Manager is an Event module that is used to monitor the resource utilization of the physical machine. The implementation of the Monitor module is used to monitor:

• The amount of free physical memory
• The amount of used physical memory
• The amount of physical memory allocated to each virtual machine hosted on the physical machine

The resources to be monitored are specified as a module property in the configuration file of the Physical Manager. Listing 5.3 shows an extract of the Physical Manager configuration file where the properties of the Monitor module are set. The hostname of the physical machine is specified in line 2. Line 7 specifies the main class of the module. A reference to the module used to communicate with the hypervisor is set in line 10, while in lines 12-13, the parameters subject to monitoring are specified. In this case, the number
of hosted VMs and the amount of physical memory allocated to each VM, including the remaining free memory on the physical host, are to be monitored. 1 2 3 4 5 6 7 8 9 10 11 12 13 14
#the hostname of the physical machine hostname = asterix ################################### # PM Monitor Module configuration # ################################### pm_monitor = de.fh_wiesbaden.cs.vs.manager.modules.monitor.pm_monitor.PMMonitor #reference to the hypervisor module pm_monitor.hypervisor_module = xen_module #parameters to monitor pm_monitor.monitored_parameter_1= number_hosted_vms pm_monitor.monitored_parameter_2= memory_util_of_each_vm
Listing 5.3: The properties of the Physical Manager Monitor module Note the generic character of loading the properties of the Monitor module. Assume one wants to monitor the CPU utilization of each VM, in addition to the number of VMs provisioned on the physical machine and the amount of memory allocated to each VM. In such a case, the configuration file would be extended with a line like: pm_monitor.monitored_parameter_3 = cpu_util_of_each_vm
In the main class of the Monitor module, these properties are read and their value is used to initialize the attributes of the module. This is shown in Listing 5.4. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
hypervisorModuleReference = moduleProperties.getProperty( this.moduleName + ".hypervisor_module"); //read the parameters to monitor while (readingParameters) { tmpParameterName = moduleProperties.getProperty( this.moduleName + ".monitored_parameter_" + parameterCounter); if (tmpParameterName != null) { monitoredParameters.add(tmpParameterName); updatedMonitoredParameters.put(tmpParameterName, "-1"); parameterCounter++; } else { readingParameters = false; } }
Listing 5.4: Extract from the PMMonitor.java, showing how the module properties are loaded
Here, the hostname parameter is read in lines 1-2. Then, in line 5-16, the parameters to be monitored are read, until no more parameters are found in the config file. Each parameter is added to the monitoredParameters Vector (line 10) and to the updatedMonitoredParameters Hashtable (line 11) with the initial value -1. This method of reading the parameters supports the generic character of the configuration file. After the parameters to be monitored are read, they are forwarded to the Xen module. This is presented in Listing 5.5 and represents a good example of how the intra-manager communication is implemented. 1 hypervisorModuleQueueSender = messenger.createJMSQueueSender( 2 queueSession.createQueue(hypervisorModuleReference)); 3 4 for (int i = 0; i < monitoredParameters.size(); i++) { 5 ComplexContentElement hypervisorModuleMessageElement = new ComplexContentElement( 6 Identifiers.COM_GET_VALUE); 7 8 hypervisorModuleMessageElement.addElement(new SimpleContentElement( 9 "id", monitoredParameters.get(i))); 10 11 hypervisorModuleComplexMessage = new ComplexMessage(); 12 hypervisorModuleComplexMessage.setcontent(hypervisorModuleMessageElement); 13 14 hypervisorModuleMessage = queueSession.createObjectMessage( 15 hypervisorModuleComplexMessage); 16 hypervisorModuleMessage.setJMSReplyTo(queueReceiver.getQueue()); 17 hypervisorModuleQueueSender.send(hypervisorModuleMessage); 18 }
Listing 5.5: Extract from the PMMonitor.java, showing how the to be monitored parameters are sent to the Xen module Here, a QueueSender is first initialized (lines 1-2). Next, for each parameter to be monitored, a complex element is created. The complex element contains the command for the Xen module, in this case the “get the value of the following parameter” command (lines 5-6), and a parameter identifier (lines 8-9). This complex element is then packed in a complex message (lines 11-12) and sent to the Xen module (lines 14-17). As already mentioned, the messages between the modules of an autonomic manager are sent via JMS. No further details are provided here with respect to the implementation of this JMS-based communication, since this is a feature that comes with the Self-Manager and was developed outside of this thesis. When the Monitor module receives a message from the Xen module, the invoke() method is called asynchronously. In this method, the processComplexElement() method is subsequently called. The content of this method is shown in Listing 5.6.
1 private ContentElement processComplexElement( 2 ComplexContentElement receivedMessageBody, ComplexMessage receivedMessage) { 3 4 if (receivedMessageBody.getName().equals( 5 Identifiers.COM_GET_NUMBER_HOSTED_VMS)) { 6 7 String tmpValue = MessageUtility.getValueFromMessage( 8 Identifiers.NUMBER_HOSTED_VMS, receivedMessage); 9 10 if (!updatedMonitoredParameters.get(Identifiers.NUMBER_HOSTED_VMS).equals( 11 tmpValue)) { 12 13 updatedMonitoredParameters.put(Identifiers.NUMBER_HOSTED_VMS, tmpValue); 14 sendUpdateToClusterManager(); 15 } 16 17 } else if (receivedMessageBody.getName().equals(Identifiers.MEMORY_UTIL_OF_EACH_VM)) { 18 processMemoryUtilOfEachVM(receivedMessageBody, receivedMessage); 19 } 20 21 return null; 22 }
Listing 5.6: Extract from the PMMonitor.java, showing how the messages received from the Xen module are processed This method has been implemented to deal with the number_hosted_vms (line 8) and memory_util_of_each_vm (line 17) parameters, but can be easily extended to deal with other monitoring parameters. For the number_hosted_vms parameter, the received data is parsed (lines 7-8), and if the value of this parameter has been modified (lines 10-11), then the value of the parameter is updated (lines 13), an update is sent to the Cluster Manager (line 14). For the memory_util_of_each_vm parameter, the received data is parsed, updated and sent to the Cluster Manager in the method processMemoryUtilOfEachVM() (line 18). Updates to the Cluster Manager are sent via Java RMI. This is part of the inter-manager communication and is therefore described in Section 5.2.5.
5.2.2.2
The Actuator Module
The Actuator module of the Physical Manager is an Action module that is used to execute tasks specified by the Cluster Manager on the physical machine monitored by the Physical Manager. How these tasks are received is part of the inter-manager communication and thus described in Section 5.2.5.
#the hostname of the physical machine hostname = asterix #################################### # PM Actuator Module configuration # #################################### pm_actuator = de.fh_wiesbaden.cs.vs.manager.modules.monitor.pm_actuator.PMActuator #reference to the hypervisor module pm_actuator.hypervisor_module = xen_module
Listing 5.7: The properties of the Physical Manager Actuator module The Actuator module is loaded by the Self-Manager core using the configuration file just like any other module. The parameters loaded for the Actuator module are shown in Listing 5.7. Line 7 specifies the main class of the module. Since the Actuator module receives its tasks from the Cluster Manager, the only parameters from the configuration file used to initialize the attributes of the module are the hostname parameter (line 2), which specifies the hostname or IP address of the managed physical machine, and the hypervisor_module parameter (line 10), which points to the module used to communicate with the hypervisor. The way these parameters are read from the configuration file and the initialization of the attributes of the module is performed similar to the one previously described for the Monitor module, and therefore not further described. 1 public void setMemoryVM(String vmName, String bytes) throws JMSException { 2 3 ComplexContentElement hypervisorModuleMessageElement = new ComplexContentElement( 4 Identifiers.COM_SET_VM_TOTAL_MEMORY); 5 hypervisorModuleMessageElement.addElement( 6 new SimpleContentElement(Identifiers.VM_NAME, vmName)); 7 hypervisorModuleMessageElement.addElement( 8 new SimpleContentElement(Identifiers.VM_TOTAL_MEMORY, bytes)); 9 10 ComplexMessage cm = new ComplexMessage(); 11 cm.setcontent(hypervisorModuleMessageElement); 12 13 ObjectMessage om = queueSession.createObjectMessage(cm); 14 om.setJMSReplyTo(queueReceiver.getQueue()); 15 16 hypervisorModuleQueueSender.send(om); 17 }
Listing 5.8: The setMemoryVM() from the PMActuator.java
The implementation of the Actuator modules can handle two types of tasks: (1) set the memory allocation of a virtual machine and (2) migrate a VM from one physical machine to another. This is done by the methods setMemoryVM() and migrateVM(), of which, because they are similar, only the first is shown in Listing 5.8. The setMemoryVM() method is used to set the memory allocation of a specific VM. This is done by sending a message to the Xen module (lines 10-16), in which a complex element is embedded (lines 3-4) which contains the command for the Xen module, in this case “set the memory allocation of the following VM to the following value”. As such, in the complex element, a reference to the virtual machine (lines 5-6) and the total amount of memory (in bytes) allocated to the VM (lines 7-8) are embedded.
5.2.2.3
The Xen Module
The Xen module of the Physical Manager is a proxy module that implements both the functionalities of an Action module and an Event module. It is responsible for the actual interaction with the Xen hypervisor. Through the Xen module, the memory allocation of each machine can be monitored and dynamically changed. For the dynamic memory allocation, the Xen hypervisor uses a balloon driver. The Xen module can also dictate a Xen hypervisor to migrate one of the VMs running on top of it to another physical machine in the cluster3 . Both the balloon driver and the migration of virtual machines are techniques that were described in Section 2.1. 1 2 3 4 5 6 7 8 9
############################### # PM Xen Module configuration # ############################### xen_module = de.fh_wiesbaden.cs.vs.manager.modules.hypervisor.xen.XenModule #URL, username and password to connect to the hypervisor xen_module.hypervisor_URL = http://172.25.48.226:9363 xen_module.username = secret_username xen_module.password = secret_pasword
Listing 5.9: The properties of the Physical Manager Xen module Just like the other modules, the Xen module loads its parameters from the configuration file. The parameters of the Xen module are shown in Listing 5.9. In line 4, the main class of the module is specified. The Xen hypervisor is remotely managed via the XenAPI (see Section 2.1). The Xen hypervisor can be accessed using the Xen API under a URL 3
under the constraint that this physical machine also uses Xen and the CPUs of the two machines are from the same family
(line 7), which specifies the IP address of the physical machine on which the Xen hypervisor runs and the port number at which XenAPI requests can be sent via XML RPC. Additionally, the Xen hypervisor can be configured to accept commands over the XenAPI only when a username (line 8) and password (line 9) are provided. These parameters are loaded by the Xen module and then used to initialize the attributes of the module, just like in the previously described modules. The XenAPI comes with bindings for C, Python and C#. This is a problem, since the management framework is written in Java. To overcome this problem, the approach shown in Figure 5.4 was used. The XenAPI Python bindings are used to implement a XenManager Python class, which provides various methods for monitoring and executing commands on a Xen physical machine. This class is wrapped in the XenHelper Java class via Jython.
XenHelper.java (Jython Wrapper) xenmanager.py
Python XenAPI Python Bindings XML RPC
Xen 3.1 (VMM)
Figure 5.4: Using the Xen API to access the Xen hypervisor
An extract from the XenManager Python class is shown in Listing 5.10. In lines 6-10 the constructor of the class is defined. The three parameters of the constructor are the url of the hypervisor, the username and the password. The default value of the url parameter is set to http://asterix:9363. These parameters are used in the constructor to connect to the Xen hypervisor (lines 9-10), via XenAPI (also see line 1). The only method shown in this extract is the set_vm_total_memory(self, vm_name,
mem_in_bytes) method (lines 12-17). This method is used to dynamically change the memory allocation of a certain VM, and receives as parameters the name of the VM and the amount of memory to allocate to the VM, in bytes (line 12). In line 17, the XenAPI is used to perform the dynamic memory allocation. 1 import XenAPI as xenapi 2 3 class XenManager: 4 "This class is used to manage a Xen Hypervisor" 5 6 def __init__(self, url=’http://asterix:9363’, username=’’, password=’’): 7 "Constructor, connect to the hypervisor" 8 9 self.session=xenapi.Session(url) 10 self.session.xenapi.login_with_password(username, password) 11 12 def set_vm_total_memory(self, vm_name, mem_in_bytes): 13 "Dynamically allocate/deallocate memory to a VM" 14 15 vm=self.get_vm_by_name(vm_name) 16 if vm == None: return None 17 self.session.xenapi.VM.set_memory_dynamic_max_live(vm, mem_in_bytes)
Listing 5.10: Extract from the XenManager Python class To use the Python XenAPI bindings in the Physical Manager, which is written in Java, the XenHelper Java class was implemented. This class uses Jython to wrap the XenManager Python class. An extract from the XenHelper Java class is shown in Listing 5.11. In line 1, the PythonInterpreter class provided by Jython is imported. A PythonInterpreter object is used to call Python code out of a Java program. This is used in the class constructor, in lines 10-19, to initialize a XenManager Python object and thus connect to the Xen hypervisor. The setVMTotalMemory() method is also presented in this code extract. This Java method simply wraps the set_vm_total_memory() Python method (line 28) and is used to dynamically change the memory allocation of a certain VM. 1 import org.python.util.PythonInterpreter; 2 3 public class XenHelper { 4 private PythonInterpreter interp; 5 6 public XenHelper( 7 String hypervisorURL, String username, String password) 8 throws HypervisorConnectionException { 9 10 interp=new PythonInterpreter();
5.2. The Framework 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 }
interp.exec("import sys"); interp.exec("import xenmanager"); interp.set("url", hypervisorURL); interp.set("username", username); interp.set("password", password); interp.exec("manager=xenmanager.XenManager(url, username, password)"); } public void setVMTotalMemory(String vmName, String memInBytes) throws VMDoesNotExistException { interp.set("vm_name", vmName); interp.set("mem_in_bytes", memInBytes); try { interp.exec("manager.set_vm_total_memory(vm_name, mem_in_bytes)"); } catch(NullPointerException e) { throw new VMDoesNotExistException( "VM named "+vmName+" does not exist on this host"); } } ...
Listing 5.11: Extract from the XenHelper Java class The Xen module receives requests in form of messages from both the Monitor and the Actuator modules of the Physical Manager. When such a message is received, the processComplexElement() is called asynchronously. This is shown in Listing 5.12. In line 5, the identifier of the received command is read. In lines 7-13, this identifier is matched against a set of known command identifiers. 1 private ContentElement processComplexElement( 2 ComplexContentElement receivedMessageBody, ComplexMessage receivedMessage) { 3 4 ContentElement returnValue = null; 5 String command = receivedMessageBody.getName(); 6 7 if (command.equals(Identifiers.COM_GET_VALUE)) { 8 returnValue = getParameterValue(receivedMessage); 9 } else if (command.equals(Identifiers.COM_SET_VM_TOTAL_MEMORY)) { 10 setMemoryVM(receivedMessage); 11 } else if (command.equals(Identifiers.COM_LIVE_MIGRATE)) { 12 liveMigrate(receivedMessage); 13 } 14 15 return returnValue; 16 } 17
18 private void setMemoryVM(ComplexMessage receivedMessage) { 19 20 String vmName = MessageUtility.getValueFromMessage( 21 Identifiers.VM_NAME, receivedMessage); 22 String memInBytes = MessageUtility.getValueFromMessage( 23 Identifiers.VM_TOTAL_MEMORY, receivedMessage); 24 25 try { 26 xenHelper.setVMTotalMemory(vmName, memInBytes); 27 } catch (VMDoesNotExistException e) { 28 logger.error(this.moduleName, e.getLocalizedMessage()); 29 } 30 }
Listing 5.12: Extract from XenModule.java, showing how the received messages are processed In this module, three types of commands are treated: a command from the Monitor module asking for one of the two parameters subject to monitoring (lines 7-8) and two commands from the Actuator module, the first asking to dynamically allocate memory to a certain VM (lines 9-10) and the second asking to migrate a certain VM (lines 11-12). Upon a successful match, the corresponding method is called. In lines 18-30, the content of the setMemoryVM() method is shown. This method is called when a message commanding the Xen module to perform a dynamic memory allocation is received. The message is parsed (lines 20-23) and the setVMTotalMemory() method is called on a XenHelper object which was previously initialized in the constructor of the Xen module. A similar process is performed for getParameterValue() (line 8) and liveMigrate() (line 12), with the difference that for the first one a response is also sent to the requesting module, in this case the Monitor module.
5.2.3
The VM Manager
The VM Manager is responsible for the management of both a virtual machine and the service running in this virtual machine. From the point of view of the implementation, the VM Manager comprises five modules, as shown in Figure 5.5. The SL M ONITOR and VM M ONITOR modules are used to monitor the service and the virtual machine respectively. These modules are independent of the monitoring technology and thus have a generic character. For this implementation, JMX was used for monitoring. As such, a JMX module was used for the monitoring, in a similar way as the X EN module was used as a “proxy” module in the Physical Manager. The VM L OGIC module
Chapter 5. Implementation RMI
Actuator JMS
JMS
VM Logic
SLAs JMS
SL Monitor
VM Monitor
JMS
JMS
JMX
JMX
JMX
Tomcat TPC-W
Fedora Core 7 (VM)
Figure 5.5: The Implementation of the VM Manager is the central component of the VM Manager and implements the bottleneck determination mechanism. When a bottleneck was determined, the ACTUATOR module is used to request an additional amount of the bottlenecked resource from the Cluster Manager. In the following paragraphs, the implementation of each of these modules is described, with the exception of the JMX module which was implemented outside of this thesis, and the VM Logic module, which is an “intelligent" module and so described separately in Section 5.3. 5.2.3.1
The SL Monitor Module
The SL Monitor module of the VM Manager realizes service level monitoring and represents an Event module that is used to monitor the service running inside a virtual machine. Like all other modules, the SL Monitor loads its parameters from a configuration file. The part of the VM Manager configuration file dedicated to the SL Monitor is shown in Listing 5.13. The main class of the module is specified in line 4. The other two parameters read are a reference to the technology-specific monitoring module (line 7), in this case the JMX module, and a reference to the controller module (line 10), in this case the VM Logic.
###################################### # VM SL Monitor Module configuration # ###################################### service_monitor = de.fh_wiesbaden.cs.vs.manager.modules.monitor.service_monitor.VMServiceMonitor #Reference to the jmx module used to gather the required service parameters service_monitor.monitoring_module_reference = jmx_module #Reference to the controller module where the service parameters are to be forwarded service_monitor.controller_module_reference = controller_module
Listing 5.13: The properties of the SL Monitor module These parameters are initialized when the module is loaded by the Self-Management core. Unlike the Monitor module of the Physical Manager, the SL Monitor receives the service parameters it needs to monitor from the central component, the VM Logic. A message containing the parameters that need to be monitored is received as soon as the module finishes is initialization process. When such a message is received, the method subscribeServiceParameters() is called asynchronously. The content of this method is shown in Listing 5.14. In Line 4, a QueueSender object is created, which is later used to send a message containing the monitoring parameters to the JMX module (lines 22-26). In lines 13-15, the message received from the VM Logic is parsed, and for each monitoring parameter received, a complex element is created (line 11) to wrap the respective monitoring parameter (lines 17-29). Each complex element is packed in a separate message (lines 22-23) and sent to the JMX module. By sending these messages, the SL Monitor modules subscribes to the requested parameters, with updates being sent every 20 seconds4 . 1 private void subscribeServiceParameters(ContentElement receivedMessageBody) 2 throws JMSException { 3 4 jmxModuleQueueSender = messenger.createJMSQueueSender( 5 queueSession.createQueue(jmxModuleReference)); 6 ComplexContentElement complexBodyElement = (ComplexContentElement) receivedMessageBody; 7 ComplexContentElement jmxModuleMessageElement; 8 String[] tmpRequiredParams; 9 10 for (int i = 0; i < complexBodyElement.getSize(); i++) { 11 jmxModuleMessageElement = new ComplexContentElement("Subscribe_value"); 12 13 tmpRequiredParams = complexBodyElement.getElement(i).getName().split("\\|"); 4
of course, this 20 seconds threshold can be modified in the configuration file of the JMX module
5.2. The Framework 14 15 16 17 18 19 20 21 22 23 24 25 26 27 } 28 }
requiredServiceParameters.add(tmpRequiredParams[0]); requiredServiceParameters.add(tmpRequiredParams[1]); jmxModuleMessageElement.addElement( new SimpleContentElement("command",JmxModule.COM_SUBSCRIBE_VALUE)); jmxModuleMessageElement.addElement( new SimpleContentElement("id",complexBodyElement.getElement(i).getName())); jmxModuleCm = new ComplexMessage(); jmxModuleCm.setcontent(jmxModuleMessageElement); jmxModuleMessage = queueSession.createObjectMessage(jmxModuleCm); jmxModuleMessage.setJMSReplyTo(queueReceiver.getQueue()); jmxModuleQueueSender.send(jmxModuleMessage);
Listing 5.14: The subscribeServiceParameters() method of the SL Monitor When a message from the JMX module is received, the SL Monitor module simply forwards this message to the VM Logic.
5.2.3.2
The VM Monitor Module
The VM Monitor module of the VM Manager is an Event module which is responsible for monitoring the resource utilization of a virtual machine. The parameters indicating what exactly the VM Monitor is monitoring are received from the VM Logic. For this implementation, the virtual machine parameters monitored are free memory, total swap space and free swap space. Since the implementation of the VM Monitor is very similar to the one of the SL Monitor, no further details are presented.
5.2.3.3
The Actuator Module
The Actuator module of the VM Manager is an Action module responsible for requesting additional resources from the Cluster Manager. When the VM Logic determines a resource that is causing a bottleneck, a message is sent to the Actuator module. When the Actuator module receives the message, the invoke() method is called asynchronously. The content of this method is shown in Listing 5.15. In this method, the nature of the message is first determined (line 5) and then the corresponding method is called, in this case the only available option being the callClusterManager() method (line 8). It is possible that the VM Logic sends a second message, signaling that a bottleneck has occurred, before the Cluster Manager is capable of allocating the required resource (e.g. when a migration is necessary). To avoid sending redundant requests to the Cluster
Manager, the problemSolved boolean variable, originally set to false, is now set to true. The Actuator module sends no requests to the Cluster Manager until the value of this variable is not set to false again. 1 public ContentElement invoke(BasicMessage requestMessage) { 2 3 ContentElement requestMessageBody = requestMessage.getBody(); 4 5 if (requestMessageBody.getName().equals(Identifiers.BOTTLENECK_IDENTIFIER)) { 6 if(!problemSubmitted){ 7 problemSubmitted=true; 8 callClusterManager((ComplexMessage) requestMessage); 9 } else { 10 //Do nothing, give the cluster manager some time to solve the problem 11 } 12 } 13 return null; 14 } 15 16 private synchronized void callClusterManager(ComplexMessage requestMessage) { 17 String problemCause = MessageUtility.getValueFromMessage( 18 PROBLEM_IDENTIFIER, requestMessage); 19 20 if(problemCause.equals(Identifiers.MEMORY_IDENTIFIER)) { 21 try { 22 vmProblemSolver.giveMemory(vmName); 23 //the method returns only after the dynamic memory allocation was performed 24 25 problemSubmitted=false; 26 } catch (RemoteException e) { 27 logger.error(this.moduleName + e.getLocalizedMessage()); 28 } 29 } 30 }
Listing 5.15: Extract from VMActuatorModule.java, showing how a received message is processed Inside the callClusterManager() method, the resource causing the bottleneck is first read from the message received from the VM Logic (lines 17-18), and then the corresponding method is called. As previously mentioned at the beginning, the only type of resource treated by the implementation of the framework is memory. Thus, in lines 20-29, a situation where memory is the resource causing the bottleneck is treated. In line 22, a method is called to send a synchronous request to the Cluster Manager. This mechanism is part of the inter-manager communication and thus not described here, but in Section 5.2.5. When a confirmation is received from the Cluster Manager, the value of the problemSolved variable is set to false and thus the Actuator stops dumping the
messages received from the VM Logic module.
5.2.4
The Cluster Manager
The Cluster Manager is responsible for the self-optimization of the cluster as a whole. From the point of view of the implementation, the Cluster Manager comprises three modules, shown in Figure 5.6.
RMI
Monitor JMS
Cluster Logic
JMS
Actuator
RMI
Figure 5.6: The Implementation of the Cluster Manager
The M ONITOR module is used to receive updates from the Physical Managers and requests from the VM Managers. These requests are then forwarded to the C LUSTER L OGIC module, which determines the allocation of resources in the cluster. Upon making a decision, the Cluster Logic module forwards this decision to the ACTUATOR module. The Actuator module then transforms this decision in forms of tasks which are delegated to the corresponding Physical Managers. In the following paragraphs, the Monitor and the Actuator modules are described. The Cluster Logic module represents the most important controller of the framework and is thus described separately in Section 5.4.
Chapter 5. Implementation 5.2.4.1
The Monitor Module
The Monitor module of the Cluster Manager represents an Event module and is used to receive requests from the VM Managers and updates from the Physical Managers. The way these requests and updates are received is part of the inter-manager communication and thus described in Section 5.2.5. When a request is received from a VM Manager, the request is wrapped in form of a complex element which is then given as a parameter to the sendMessageToController() method, shown in Listing 5.16. Here, the complex element is wrapped in a message (lines 3-4) which is then sent to the Cluster Logic (lines 6-8). An update from a Physical Manager is forwarded to the Cluster Logic in a similar way. 1 public static void sendMessageToController(ComplexContentElement controllerModuleMessageElement) throws JMSException{ 2 3 ComplexMessage cm = new ComplexMessage(); 4 cm.setcontent(controllerModuleMessageElement); 5 6 ObjectMessage om = queueSession.createObjectMessage(cm); 7 om.setJMSReplyTo(queueReceiver.getQueue()); 8 controllerModuleQueueSender.send(om); 9 }
Listing 5.16: Extract from ClusterMonitor.java, showing how a message is sent to the Cluster Logic
5.2.4.2
The Actuator Module
The Actuator module of the Cluster Manager is an Action module which is used to delegate tasks coming from the Cluster Logic to the Physical Manager. This delegation of tasks is part of the inter-manager communication and thus described in Section 5.2.5.
5.2.5
The Inter-Manager Communication
The autonomic managers can be located on different physical machines in the cluster. As such, the inter-manager communication must take place over the network. Java RMI provides support for creating Java-based distributed applications by allowing objects running in one Java Virtual Machine to invoke methods on objects running in another Java Virtual Machine [Jav]. Since our framework is written in Java, using the high-level approach provided by Java RMI for the inter-manager communication is a natural choice. The
following paragraphs describe how Java RMI was used to implement the inter-manager communication.
5.2.5.1
The VM Manager-to-Cluster Manager Communication
When a VM Manager determined a resource causing a bottleneck, the VM Manager sends a request to the Cluster Manager, asking for more of that resource. This communication is referred to as the VM Manager-to-Cluster Manager communication. For this communication, the Cluster Manager plays the role of the server while the VM Manager represents the client. When the Cluster Manager is started, several remote objects are created. One of them is used for the VM Manager-to-Cluster Manager communication and is an instance of the VMProblemSolverImpl class, which extends the UnicastRemoteObject class and implements the VMProblemSolver interface. The corresponding UML class diagram is shown in Figure 5.7. Because the VMProblemSolver interface implements the Remote interface, it is here where the methods are defined that can be remotely called on this remote object.
«interface» java.rmi.Remote
«interface» VMProblemSolver +giveMemory() : int
UnicastRemoteObject
VMProblemSolverImpl +giveMemory() : int
Figure 5.7: VMProblemSolver UML class diagram
The sequence diagram in Figure 5.8 shows how the VM Manager-to-Cluster Manager
communication takes place. First, the Monitor module of the Cluster Manager creates and starts a server thread. When this thread is created, an object instance of the VMProblemSolverImpl class is created and registered in the RMI Registry. There is only one RMI Registry in the cluster, and this is the central place where all the remote objects are registered. When the Actuator module of the VM Manager wants to request more memory, it does this by first getting a remote reference of the remote object in the RMI Registry.
ClusterMonitor
VMRequestProcessorServerThread
VMActuatorModule
RMI Registry
VMProblemSolverImpl
start instantiate bind
get remote reference remote reference
giveMemory() Forward to Cluster Logic
memory allocated memory allocated
Figure 5.8: The VM Manager-to-Cluster Manager communication
The VMProblemSolverImpl class is not contained in the VM Manager, but dynamically downloaded at run time. After the class definition was downloaded, the giveMemory() method is called on the remote object (see Listing 5.15, line 22). The implementation of the giveMemory() method is shown in Listing 5.17. The method receives as single parameter an identifier for the virtual machine that requires additional memory (line 1). Next, this request is wrapped in a complex element (lines 3-6) and then forwarded to the Cluster Logic module (line 9). In case an error occurs, the method returns -1, otherwise returning 0. This return value is received by the VM Actuator module, which can then determine if the request was successful or not.
1 public int giveMemory(String vmIdentifier) throws RemoteException { 2 3 ComplexContentElement controllerModuleMessageElement = new ComplexContentElement( 4 Identifiers.COM_GET_MORE_MEMORY); 5 controllerModuleMessageElement.addElement( 6 new SimpleContentElement(Identifiers.VM_NAME, vmIdentifier)); 7 8 try { 9 ClusterMonitor.sendMessageToController(controllerModuleMessageElement); 10 } catch (JMSException e) { 11 logger.error(this.moduleName, e.getLocalizedMessage()); 12 return -1; 13 } 14 15 return 0; 16 }
Listing 5.17: The giveMemory() method
5.2.5.2
The Physical Manager-to-Cluster Manager Communication
One of the responsibilities of the Physical Manager is to keep the Cluster Manager up to date with the resource utilization of the physical machine assigned to it. For this implementation, the Physical Manager sends the Cluster Manager data showing the amount of memory assigned to each virtual machine, and the amount of free memory remaining. This communication is called the Physical Manager-to-Cluster Manager communication. One of the remote objects created when the Cluster Manager starts is an instance of the ClusterMonitorUpdaterImpl class. This remote object is then used by the Physical Manager for the Physical Manager-to-Cluster Manager communication. The ClusterMonitorUpdaterImpl class extends the UnicastRemoteObject class and implements the ClusterMonitorUpdater interface. The corresponding UML diagram is shown in Figure 5.9. The ClusterMonitorUpdater interface extends the Remote interface and thus specifies the methods that can be remotely called on the remote object. The sequence diagram in Figure 5.10 shows how the Physical Manager-to-Cluster Manager communication takes place. The first part is similar to the VM Manager-to-Cluster Manager communication: a remote object is instantiated and registered in the RMI Registry. This is also shown in Listing 5.18. The object is first created (line 6) and then registered in the RMI Registry (line 11).
«interface» java.rmi.Remote
«interface» ClusterMonitorUpdater +updateResourceView() : string
UnicastRemoteObject
ClusterMonitorUpdaterImpl +updateResourceView() : string
Figure 5.9: ClusterMonitorUpdater UML class diagram
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
public ClusterMonitorUpdaterServerThread() throws RemoteException{ if(System.getSecurityManager()==null) { System.setSecurityManager(new SecurityManager()); } clusterMonitorUpdater= new ClusterMonitorUpdaterImpl(); } public void run(){ try { Naming.rebind("rmi://localhost:1099/ClusterMonitorUpdater",clusterMonitorUpdater); } catch (Exception e) { //handle expection } }
Listing 5.18: Instantiating and registering a remote object
Listing 5.19 shows how the Monitor module of the Physical Manager gets a remote reference to the remote object (lines 7-8) and then calls the updateResourceView() method on this object (line 16). This method receives as parameters the hostname of the physical machine and the local resource view, encoded in a hashtable-based data structure. This local resource view is then used in the Cluster Logic to update the global resource view.
ClusterActuator
Chapter 5. Implementation
ClusterMonitorUpdaterServerThread
PMMonitor
RMI Registry
ClusterMonitorUpdaterImpl
start instantiate bind
get remote reference remote reference
updateResourceView() updateHostResourceView()
Figure 5.10: The Physical Manager-to-Cluster Manager communication
1 private void sendUpdateToClusterManager() { 2 try { 3 localRersourceView.put(Identifiers.RESOURCE_VIEW_MEMORY, localMemoryView); 4 5 if (cmu == null) { 6 try { 7 cmu = (ClusterMonitorUpdater) Naming.lookup( 8 "rmi://localhost:1099/ClusterMonitorUpdater"); 9 } catch (Exception e1) { 10 this.logger.error( 11 this.moduleName, e1.getLocalizedMessage() 12 + " please start the cluster manager"); 13 } 14 } 15 16 cmu.updateResourceView(hostname, localRersourceView); 17 } catch (Exception e) { 18 e.printStackTrace(); 19 } 20 }
Listing 5.19: Updating the gloabl resource view
5.2.5.3
The Cluster Manager-to-Physical Manager Communication
After the Cluster Logic has taken a resource reallocation decision, this decision must be transformed in form of tasks which are then delegated to the corresponding Physical Managers. This delegation of tasks is called the Cluster Manager-to-Physical Manager communication. When a Physical Manager starts, it creates and registers a remote object to the RMI
Registry. This object is an instance of the ClusterRequestProcessorImpl class and is used by the Cluster Manager to delegate tasks to the Physical Manager. The ClusterRequestProcessorImpl class extends the UnicastRemoteObject class and implements the ClusterRequestProcessor interface. The corresponding UML diagram is shown in Figure 5.11. The ClusterRequestProcessor interface extends the Remote interface and thus specifies which methods can be remotely called on the remote object.
«interface» java.rmi.Remote
«interface» ClusterRequestProcessor +setMemoryVM() : string +migrateVM() : string
UnicastRemoteObject
ClusterRequestProcessorImpl +setMemoryVM() : string +migrateVM() : string
Figure 5.11: ClusterRequestProcessor UML class diagram
The Cluster Manager-to-Physical Manager communication is shown in Figure 5.12. In this example, the method setMemoryVM() is called on the remote object. The corresponding source code is shown in Listing 5.20. In this code extract from the Actuator module of the Cluster Manager, the name of the virtual machine (lines 3-4) subject to memory reallocation, the name of the physical machine hosting this virtual machine (lines 5-6) and the new amount of memory allocated to this machine (7-8) are read from the message received from the Cluster Logic. The name of the physical machine is then used to determine the remote object instantiated by the Physical Manager that manages this machine (lines 11-12). In line 13, the setMemoryVM() method is called on the remote object, and receives as parameters the name of the virtual machine and the new amount of memory allocated to this virtual machine. For the migrateVM() method, this is done similarly.
5.3. The Bottleneck Determination Mechanism PMActuator
ClusterRequestProcessorServerThread
Chapter 5. Implementation
ClusterActuator
RMI Registry
ClusterRequestProcessorImpl
start instantiate bind
get remote reference remote reference
setMemoryVM() setMemoryVM()
Figure 5.12: The Cluster Manager-to-Physical Manager communication
1 private void setVmMemory(ComplexMessage receivedMessage) { 2 3 String vmName = MessageUtility.getValueFromMessage( 4 Identifiers.VM_NAME, receivedMessage); 5 String physicalHostname = MessageUtility.getValueFromMessage( 6 Identifiers.LOCAL_PHYSICAL_MACHINE_NAME, receivedMessage); 7 String vmMemorySize = MessageUtility.getValueFromMessage( 8 Identifiers.VM_TOTAL_MEMORY, receivedMessage); 9 10 try { 11 ClusterRequestProcessor crp = (ClusterRequestProcessor) Naming.lookup( 12 "rmi://localhost:1099/ClusterRequestProcessor" + physicalHostname); 13 crp.setMemoryVM(vmName, vmMemorySize); 14 } catch (Exception e) { 15 logger.error(this.moduleName, e.getLocalizedMessage()); 16 } 17 18 }
Listing 5.20: Delegating a task to a Physical Managers
5.3 The Bottleneck Determination Mechanism The bottleneck determination mechanism is implemented in the VM Logic module of the VM Manager. In Section 4.2, it was argued that a rule engine that supports JSR 94 is convenient for the bottleneck determination mechanism. For the implementation, the Jess rule engine was used [San]. Although a commercial product, Jess is available free of charge when used for academic purposes.
The VM Logic determines the parameters that are to be monitored by the SL Monitor and the VM Monitor modules of the VM Manager. These parameters are loaded from the configuration file while the VM Logic module is being initialized, as shown in Listing 5.21. The parameters for the service are specified in lines 19-20, while the parameters of the operating system to be monitored are specified in lines 23-28. These parameters are loaded and forwarded to the SL Monitor and the VM Monitor, respectively.
1  ######################################
2  # VM Controller Module configuration #
3  ######################################
4  controller_module = de.fh_wiesbaden.cs.vs.manager.modules.controller.VMControllerModule
5
6
7  #Reference to the rule base
8  controller_module.rule = conf/rulebase/bottleneckdetermination.cls
9
10 #The rule service provider
11 controller_module.rule_service_provider = jess.jsr94
12
13 #Reference to the actuator module
14 controller_module.actuator_module_reference = actuator_module
15 controller_module.service_monior_module_reference = service_monitor
16 controller_module.vm_os_monior_module_reference = vm_os_monitor
17
18 #Tomcat mbeans and attributes
19 controller_module.tomcat_mbean_name_1=Catalina:type=GlobalRequestProcessor,name=http-8080
20 controller_module.tomcat_mbean_attribute_1=maxTime
21
22 #VM OS mbeans and attributes
23 controller_module.vm_os_mbean_name_1=java.lang:type=OperatingSystem
24 controller_module.vm_os_mbean_attribute_1=FreePhysicalMemorySize
25 controller_module.vm_os_mbean_name_2=java.lang:type=OperatingSystem
26 controller_module.vm_os_mbean_attribute_2=FreeSwapSpaceSize
27 controller_module.vm_os_mbean_name_3=java.lang:type=OperatingSystem
28 controller_module.vm_os_mbean_attribute_3=TotalSwapSpaceSize
Listing 5.21: Parameters loaded by the VM Logic module

In line 8 of Listing 5.21, the file containing the rule set for the bottleneck determination is specified, while in line 11, the rule engine (Jess) is specified. Two Java beans are used to wrap the monitoring data received from the two monitoring modules: TomcatParameter for data received from the SL Monitor and OSParameter for data received from the VM Monitor. When a message is received from one of the two monitoring modules, the corresponding bean is updated. For the TomcatParameter bean, this is shown in Listing 5.22.
if (smthChanged) {
    tomcatParameter = new TomcatParameter();
    tomcatParameter.setMaxTime(tomcatParameterValuePair.get(Identifiers.TOMCAT_MAX_TIME));
}
Listing 5.22: Updating a TomcatParameter bean

After an update takes place, the rule engine (Jess) is called in order to determine whether the service parameters are still SLA-conform and, if not, which resource is causing the bottleneck. Listing 5.23 shows how Jess is called from inside the VM Logic module. If the rule engine is called for the first time, a session is initialized (lines 1-6). In this process, a new bean, an instance of the Bottleneck class, is created. This bean is used later on to process the response of the rule engine. A list is then created which contains the three beans: tomcatParameter and osParameter for the input and bottleneck for the output. In line 13, this list is fed to the rule engine and the rules are executed.
1  if (!sessionInitialized) {
2      session = (StatelessRuleSession) runtime.createRuleSession(
3          ruleURI, new HashMap(), RuleRuntime.STATELESS_SESSION_TYPE);
4      bottleneck = new Bottleneck();
5      sessionInitialized = true;
6  }
7
8  List inputList = new ArrayList();
9  inputList.add(tomcatParameter);
10 inputList.add(osParameter);
11 inputList.add(bottleneck);
12
13 List results = session.executeRules(inputList);
Listing 5.23: Calling Jess from inside the VM Logic module

When Jess is called for the first time, the data from the rule file specified in the configuration file is loaded into the working memory of Jess. The content of the rule file is shown in Listing 5.24. This rule set is used to determine whether memory is the cause of a bottleneck. Memory is identified as the cause of a bottleneck when an SLA violation occurs and, at the same time, there is little free memory available and a lot of swapping takes place. In line 10, the counter global variable is initialized to the value 1. Line 4 makes sure that the next time Jess is called in this session, the counter variable retains its old value and is not reset to 1. An SLO is defined in line 15: the value of the maxTime Tomcat parameter must stay under 2000 ms. This is the only SLO contained in the SLA. In lines 25-27, 31-33 and 37-39, templates are declared that hold the values of the TomcatParameter, OSParameter and Bottleneck beans respectively. Next, the rule in lines 43-49 states that if the maxTime Tomcat parameter is under the limit defined in line 15, then everything is SLA-conform and thus no bottleneck exists. The rule in lines 54-59 increases the value of the counter if the value of the maxTime Tomcat parameter is over the limit defined in line 15 but the counter variable is not bigger than 3. This means that three SLO violations are tolerated. If the value of the counter variable is bigger than 3, the rule in lines 65-74 checks whether there is little free memory available (threshold set in line 20) and a lot of swapping takes place (threshold set in line 21); if so, memory is causing a bottleneck. Otherwise, if the SLO violation persists but the free memory and swapping thresholds are not exceeded, then there is a bottleneck, but memory is not its cause. This is specified by the rule in lines 80-89.
1  /*
2  Makes sure that the global variables are not reseted
3  */
4  (set-reset-globals nil)
5  /*
6  This is a counter
7  This counter is used to specify that the maxTime QoS
8  parameter can be violated 3 times before a problem is raised
9  */
10 (defglobal ?*counter* = 1)
11 /*
12 This is the QoS parameter that specifies which is the
13 maximum time (in s) allowed for processing a request
14 */
15 (defglobal ?*qosmaxtimelimit* = 2000)
16 /*
17 These to parameters are the results of an empiric analysis
18 of the Tomcat maxTime parameter behavior under different resource availability
19 */
20 (defglobal ?*freeMemLowerLimit* = 10000000)
21 (defglobal ?*usedSwapUpperLimit* = 30000000)
22 /*
23 Points to the java class
24 */
25 (deftemplate TomcatParameter
26   (declare (from-class de.fh_wiesbaden.cs.vs.manager.modules.controller.util.TomcatParameter))
27 )
28 /*
29 Points to the java class
30 */
31 (deftemplate OSParameter
32   (declare (from-class de.fh_wiesbaden.cs.vs.manager.modules.controller.util.OSParameter))
33 )
34 /*
35 Points to the java class
36 */
37 (deftemplate Bottleneck
38   (declare (from-class de.fh_wiesbaden.cs.vs.manager.modules.controller.util.Bottleneck))
39 )
40 /*
41 This rule basically states that if the maxTime is under the limit, than we have no problem
42 */
43 (defrule noproblem
44   (TomcatParameter {maxTime <= ?*qosmaxtimelimit*})
45   ?botfact <- (Bottleneck {havingProblem == TRUE})
46   =>
47   (bind ?*counter* 1)
48   (modify ?botfact (havingProblem FALSE))
49 )
50 /*
51 This rule basically states that if the maxTime is over the limit, but the counter
52 has not passed 3, than increase the counter and reset maxTime
53 */
54 (defrule increasecounter
55   ?fact <- (TomcatParameter {maxTime > ?*qosmaxtimelimit*})
56   (test (<= ?*counter* 3))
57   =>
58   (bind ?*counter* (+ ?*counter* 1))
59 )
60 /*
61 This rule basically states that if the maxTime is over the limit, and if there is
62 little free physical memory and quite a lot of swapping takes place,
63 than memory is bottleneck
64 */
65 (defrule dosmth
66   ?botfact <- (Bottleneck {memory == FALSE})
67   (TomcatParameter {maxTime > ?*qosmaxtimelimit*})
68   (OSParameter {freePhysicalMem < ?*freeMemLowerLimit* && usedSwapMem > ?*usedSwapUpperLimit*})
69   (test (> ?*counter* 3))
70   =>
71   (modify ?botfact (memory TRUE))
72   (modify ?botfact (havingProblem TRUE))
73   (bind ?*counter* 1)
74 )
75 /*
76 This rule basically states that if the maxTime is over the limit, but there is some
77 free physical memory and there is little swapping taking place, than memory is
78 not the bottleneck
79 */
80 (defrule dosmthmemnoprob
81   (TomcatParameter {maxTime > ?*qosmaxtimelimit*})
82   (OSParameter {freePhysicalMem >= ?*freeMemLowerLimit* && usedSwapMem <= ?*usedSwapUpperLimit*})
83   ?botfact <- (Bottleneck {memory == TRUE || havingProblem == FALSE})
84   (test (> ?*counter* 3))
85   =>
86   (modify ?botfact (memory FALSE))
87   (modify ?botfact (havingProblem TRUE))
88   (bind ?*counter* 1)
89 )
Listing 5.24: The Jess rules

The results of the bottleneck determination performed by the rule engine are saved in the Bottleneck bean. From this bean, the results are parsed and, if memory turns out to be the bottleneck, a message containing this information is sent by the VM Logic module to the Actuator module of the VM Manager. The Actuator module then sends a request for additional memory to the Cluster Manager.
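How the Bottleneck bean might be evaluated after session.executeRules() returns is sketched below; the getter names and the notification call are assumptions derived from the havingProblem and memory slots used in Listing 5.24, not code taken from the thesis.

    // Hypothetical post-processing of the result list returned in Listing 5.23
    for (Object fact : results) {
        if (fact instanceof Bottleneck) {
            Bottleneck bottleneckResult = (Bottleneck) fact;
            // memory == TRUE is only set by the "dosmth" rule in Listing 5.24
            if (bottleneckResult.isHavingProblem() && bottleneckResult.isMemory()) {
                // ask the Actuator module to request additional memory from the Cluster Manager
                notifyActuatorOfMemoryBottleneck();
            }
        }
    }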
5.4 The Cluster Controller
The Cluster Logic implements the functionality of the cluster controller. For the purpose of this thesis, the two algorithms described in Section 4.3 were implemented. Both algorithms are contained in the Cluster Logic component, but only one of them can be used at a time. As can be seen in Listing 5.25, this is decided based on the controller_algorithm parameter (line 12), which specifies which algorithm should be used. In the example in Listing 5.25, Algorithm B (kbfs_algorithm) is used. As lines 15-16 show, each algorithm can have its own parameters.
1  ######################################
2  # Cluster Logic Module configuration #
3  ######################################
4  cluster_logic = de.fh_wiesbaden.cs.vs.manager.modules.logic.cluster_logic.ClusterLogic
5
6  #reference to the cluster actuator module
7  cluster_logic.actuator_module_reference = cluster_actuator
8
9  #the controller algorithm to use, choose one of the following:
10 # simple_greedy_algorithm
11 # kbfs_algorithm
12 cluster_logic.controller_algorithm = kbfs_algorithm
13
14 #parameters for the kbfs algorithm
15 cluster_logic.kbfs_algorithm.k = 3
16 cluster_logic.kbfs_algorithm.iterations = 3
Listing 5.25: The configuration parameters for the Cluster Logic
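A minimal sketch of how the Cluster Logic might select the controller based on this configuration is shown below; the property-reading call and variable names are assumptions, not code from the thesis.

    // Hypothetical dispatch on the cluster_logic.controller_algorithm key from Listing 5.25
    String algorithm = configuration.getProperty("cluster_logic.controller_algorithm");
    if ("kbfs_algorithm".equals(algorithm)) {
        startKbfsAlgorithm(vmName, physicalHost, addMemory);          // Algorithm B
    } else if ("simple_greedy_algorithm".equals(algorithm)) {
        startSimpleGreedyAlgorithm(vmName, physicalHost, addMemory);  // Algorithm A
    }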
In the following subsections, the implementation of both algorithms is described.
5.4.1 Algorithm A
Of the two algorithms, Algorithm A is the simpler one. Its implementation is straightforward and is shown in Listing 5.26. The implementation of Algorithm A is contained in a single method, called startSimpleGreedyAlgorithm(). This method receives as parameters the name of the virtual machine requesting more memory, the name of the physical machine on which the virtual machine is provisioned and the amount of additional memory required by the virtual machine. In lines 10-11, the total amount of memory required by the virtual machine is computed.

1  private void startSimpleGreedyAlgorithm(
2      String vmName, String physicalHost, Long addMemory) {
3
4      long localAvailableMemory = ClusterResourceView.getFreeMemory(
5          physicalHost).longValue();
6
7      String remoteHost;
8      long remoteAvailableMemory;
9
10     long necessaryMemory = ClusterResourceView.getVMMemory(vmName).longValue()
11         + addMemory.longValue();
12
13     if (localAvailableMemory >= addMemory) {
14         try {
15             localIncreaseVMMemory(vmName, physicalHost, necessaryMemory);
16         } catch (JMSException e) {
17             logger.error(this.moduleName, e.getLocalizedMessage());
18         }
19     } else {
20         Enumeration physicalHosts = ClusterResourceView.getPhysicalHostNames();
21
22         physicalHosts = sortBasedOnUtilization(physicalHosts);
23
24         while (physicalHosts.hasMoreElements()) {
25             remoteHost = physicalHosts.nextElement();
26             remoteAvailableMemory = ClusterResourceView.getFreeMemory(
27                 remoteHost).longValue();
28
29             if (remoteAvailableMemory >= necessaryMemory) {
30                 try {
31                     migrateVmOnMachine(vmName, physicalHost, remoteHost);
32
33                     Thread.sleep(MAX_MIGRATION_TIME);
34
35                     localIncreaseVMMemory(vmName, remoteHost, necessaryMemory);
36                 } catch (Exception e) {
37                     logger.error(this.moduleName, e.getLocalizedMessage());
38                 }
39                 break;
40             }
41         }
42     }
43 }
Listing 5.26: Algorithm A

If the additional amount of memory required by the virtual machine is locally available (line 13), then the total amount of memory required by the virtual machine is allocated locally on the physical machine on which the virtual machine is currently provisioned (line 15). If this is not the case, then the virtual machine is migrated to the first physical machine (line 31) that provides enough memory (line 29). After the migration is performed, the memory allocation of the virtual machine is increased to the required value. Note that the design of Algorithm A from Section 4.3 specifies that the physical machine with the highest resource utilization which still provides enough resources must be chosen. This is ensured in line 22 by sorting the list of physical machines in increasing order of their free memory, i.e. in decreasing order of their utilization, so that the while loop selects the most utilized machine that can still accommodate the request.
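The sortBasedOnUtilization() helper itself is not listed in the thesis; the following is a hypothetical sketch of what it could look like under the interpretation above, assuming ClusterResourceView.getFreeMemory() returns a Long as in Listing 5.26.

    // Hypothetical sketch: order hosts by increasing free memory, so that the most
    // utilized host that still offers enough memory is found first by the while loop.
    private Enumeration<String> sortBasedOnUtilization(Enumeration<String> physicalHosts) {
        List<String> hosts = Collections.list(physicalHosts);
        Collections.sort(hosts, new Comparator<String>() {
            public int compare(String a, String b) {
                return Long.compare(ClusterResourceView.getFreeMemory(a).longValue(),
                                    ClusterResourceView.getFreeMemory(b).longValue());
            }
        });
        return Collections.enumeration(hosts);
    }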
5.4.2 Algorithm B
Of the two controller algorithms, Algorithm B is the more complex one. The encoding of the virtual machine to physical machine mapping as a state, together with all the resources in the cluster, is implemented by the class ClusterState, as described in Section 4.3. An extract of the ClusterState.java file is presented in Listing 5.27, showing the attributes and the two constructors of the class. Each state has a parent state (line 3), from which this state was generated by migrating a virtual machine (line 4) from one physical machine (line 5) to another physical machine (line 6) in the cluster. The cost attribute represents the number of migrations performed to reach the current state from the initial state. Furthermore, a state has a state utility (line 8) and a global profit (line 9), calculated by means of the formulas presented in Section 4.3. The resourceView attribute corresponds to the global view of the resources in the cluster, while the allocationView is a helper data structure which encodes only the virtual machine to physical machine mapping. Although redundant, this information speeds up the execution of the algorithm.
1  public class ClusterState implements Comparable {
2
3      private ClusterState parent;
4      private String migratedVM;
5      private String fromPyhsical;
6      private String toPhysical;
7      private int cost;
8      private double stateUtility;
9      private double globalProfit;
10     private Hashtable>> resourceView;
11     private Hashtable allocationView;
12
13     // for the initial state
14     public ClusterState(
15         Hashtable>> resourceView,
16         Hashtable allocationView) {
17
18         this.parent = this;
19         this.cost = 0;
20         this.allocationView = allocationView;
21         this.resourceView = resourceView;
22         this.globalProfit = calcuateGlobalProfit();
23         this.stateUtility = calculateStateUtility();
24     }
25
26     public ClusterState(
27         ClusterState parent, String vmName, String sourcePhysicalName,
28         String destinationPhysicalName) {
29
30         this.parent = parent;
31         this.migratedVM = new String(vmName);
32         this.fromPyhsical = new String(sourcePhysicalName);
33         this.toPhysical = new String(destinationPhysicalName);
34         this.resourceView = adaptResourceView(vmName, sourcePhysicalName,
35             destinationPhysicalName);
36         this.allocationView = adaptAllocationView(vmName, sourcePhysicalName,
37             destinationPhysicalName);
38         this.cost = parent.getCost() + 1;
39         this.globalProfit = calcuateGlobalProfit();
40         this.stateUtility = calculateStateUtility();
41     }
42     ...
43 }
Listing 5.27: Extract from ClusterState.java
The constructor in lines 14-24 of Listing 5.27 is used to create the initial state. The initial state is its own parent (line 18), has no migration cost (line 19), and receives its allocationView (line 20) and resourceView (line 21) as parameters of the constructor. Based on these attributes, the global profit (line 22) and the utility of the state (line 23) are calculated.
The constructor in lines 26-41 of Listing 5.27 is used to create a child state, given a certain parent state and the identifier of a virtual machine which is migrated from a source physical machine to a destination physical machine in the cluster. These are all parameters of the constructor, as shown in lines 26-28. Inside the constructor, the attributes are set to their corresponding values, either directly as in lines 30-33, or by calculating the new values of these attributes as in lines 34-40.

The encoding provided by the ClusterState class is then used by Algorithm B in the startKbfsAlgorithm() method, which is shown in Listing 5.28. The beginning of this method is similar to that of the startSimpleGreedyAlgorithm() method. In fact, if the necessary memory is locally available (line 10), this method locally allocates the necessary memory (line 12) and returns, thus behaving just like the startSimpleGreedyAlgorithm() method. It is when this additional amount of memory is not locally available (line 16) that the differences between the two algorithms become clear. First, two vectors are created, the first containing the states (nodes) that were already expanded (the closedList, line 17) and the second containing the states that were obtained as a result of this expansion (the openList, line 18). Next, the global resource view is updated (lines 21-24) so that it reflects the additional amount of resources required by the virtual machine in question. An initial state is generated (line 26) which, in the terms used in Section 4.3, is invalid, since the total amount of memory required by the virtual machines on one of the physical machines is bigger than the total amount of memory available on that machine. This initial state is then expanded and its children are added to the openList (line 27). Lines 35-39 represent the KBFS core of Algorithm B. The best K children (lines 29 and 35) are expanded and their children are added to the openList. This process is repeated a specific number of times, determined by the kbfsIterations variable. Based on the configuration of the managed infrastructure and the constraints set on the total migration time, the value of the kbfsIterations variable can be chosen so that all the states added to the open list have acceptable total migration costs. The values of the kbfsK and kbfsIterations variables are read from the configuration file of the Cluster Logic module, as shown in Listing 5.25, lines 15 and 16 respectively.
1  private void startKbfsAlgorithm(String vmName, String physicalHost, Long addMemory) {
2
3      long localAvailableMemory = ClusterResourceView.getFreeMemory(
4          physicalHost).longValue();
5      String remoteHost = "";
6      long remoteAvailableMemory;
7      long necessaryMemory = ClusterResourceView.getVMMemory(vmName).longValue()
8          + addMemory.longValue();
9
10     if (localAvailableMemory >= addMemory) {
11         try {
12             localIncreaseVMMemory(vmName, physicalHost, necessaryMemory);
13         } catch (JMSException e) {
14             logger.error(this.moduleName, e.getLocalizedMessage());
15         }
16     } else {
17         Vector closedList = new Vector();
18         Vector openList = new Vector();
19
20         //do this because the vm has now a new size
21         Hashtable>> tmpResourceView
22             = updateResourceView(ClusterResourceView.getResourceView());
23         Hashtable tmpAllocationView
24             = updateAllocationView(ClusterResourceView.getAllocationView());
25
26         ClusterState initialState = new ClusterState(tmpResourceView, tmpAllocationView);
27         addChildrentoTheOpenList(initialState, openList, closedList);
28
29         int limit = Integer.parseInt(kbfsK);
30
31         if (openList.size() < limit) {
32             limit = openList.size();
33         }
34
35         for (int j = 0; j < Integer.parseInt(kbfsIterations); j++) {
36             for (int i = 0; i < limit; i++) {
37                 addChildrentoTheOpenList(openList.get(i), openList, closedList);
38             }
39         }
40
41         closedList.addAll(openList);
42         ClusterState finalState = initialState;
43
44         for (ClusterState state : closedList) {
45             if (state.getStateUtility() > finalState.getStateUtility()) {
46                 finalState = state;
47             }
48         }
49
50         Stack instructions = new Stack();
51
52         boolean running = true;
53         while (running) {
54             if (finalState.getCost() != 0) {
55                 instructions.push(finalState.getMigratedVM());
56                 instructions.push(finalState.getFromPyhsical());
57                 instructions.push(finalState.getToPhysical());
58                 finalState = finalState.getParent();
59             } else {
60                 running = false;
61             }
62         }
63
64         while (!instructions.empty()) {
65             remoteHost = instructions.pop();
66             physicalHost = instructions.pop();
67             vmName = instructions.pop();
68             try {
69                 migrateVmOnMachine(vmName, physicalHost, remoteHost);
70             } catch (JMSException e) {
71                 logger.error(this.moduleName, e.getLocalizedMessage());
72             }
73         }
74
75         try {
76             Thread.sleep(SLEEP_TIME);
77             localIncreaseVMMemory(vmName, remoteHost, necessaryMemory);
78         } catch (Exception e) {
79             logger.error(this.moduleName, e.getLocalizedMessage());
80         }
81     }
82 }
Listing 5.28: The startKbfsAlgorithm() method

After this process finishes, the state with the highest utility is selected (lines 44-48). Next, the instructions for reaching this state must be determined. This is done by walking back from this state to the initial state. Along the way, the steps are pushed onto a stack (lines 53-62) so that, once the initial state is reached, they can be popped in reverse order as migration instructions (lines 64-73). After all the migrations are performed, the necessary memory is allocated to the virtual machine in question (line 77).
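The addChildrentoTheOpenList() helper is not shown in the thesis. The following hypothetical sketch illustrates the expansion step it has to perform; the accessor names getAllocationView(), getPhysicalHosts() and hasEnoughMemoryFor() do not appear in the listings and are assumptions.

    // Hypothetical sketch of the expansion step: every feasible single migration
    // starting from the parent state yields one child state.
    private void addChildrentoTheOpenList(ClusterState parent,
            Vector<ClusterState> openList, Vector<ClusterState> closedList) {
        for (String vm : parent.getAllocationView().keySet()) {
            String source = parent.getAllocationView().get(vm);
            for (String destination : parent.getPhysicalHosts()) {
                if (!destination.equals(source) && parent.hasEnoughMemoryFor(vm, destination)) {
                    ClusterState child = new ClusterState(parent, vm, source, destination);
                    if (!openList.contains(child) && !closedList.contains(child)) {
                        openList.add(child);
                    }
                }
            }
        }
        closedList.add(parent);
        // Keep the open list ordered by decreasing state utility, so that the first
        // kbfsK entries in Listing 5.28 are the K best candidates.
        Collections.sort(openList, new Comparator<ClusterState>() {
            public int compare(ClusterState a, ClusterState b) {
                return Double.compare(b.getStateUtility(), a.getStateUtility());
            }
        });
    }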
5.5 Code Statistics
In Table 5.1, the complexity of this implementation, measured in lines of code (LOC), is presented. For the sake of completeness, both the Self-Manager core and the JMX module are present in this table, despite being developed outside of this thesis. This data does not include the lines of the configuration files.

Component            Module           Lines of Code
Physical Manager     Monitor          350
                     Actuator         324
                     Xen              782
                     Total            1456
VM Manager           VM Monitor       380
                     SL Monitor       442
                     JMX              (2273)
                     VM Logic         1011
                     Actuator         393
                     Total            4499
Cluster Manager      Monitor          364
                     Actuator         274
                     Cluster Logic    1413
                     Total            2051
Self-Manager Core                     (4961)
Total                                 5733 (12967)
Table 5.1: Code statistics
Chapter 6

Evaluation

The previous two chapters described the design and the implementation of the framework, the bottleneck determination mechanism and the two controller algorithms. In this chapter, the results of this concept and its implementation are evaluated. First, the testbed used for this evaluation is presented in Section 6.1. Next, in Section 6.2, the management framework, including the bottleneck determination mechanism, is evaluated. Finally, in Section 6.3, the two controllers are evaluated separately and compared to each other.
6.1 The Testbed
The framework was tested in a lab environment using a Java implementation of the TPC-W benchmarking standard [Mar01, TPCb]. The testbed used for this evaluation is depicted in Figure 6.1. The TPC-W benchmark can be used to simulate the activities of a business-oriented transactional Web server. The Java implementation of TPC-W is basically a collection of servlets that implement this benchmark for J2EE application servers/servlet containers. For this evaluation, the Java TPC-W e-commerce application suite is hosted inside a Tomcat server version 5.5.25. The Tomcat server runs on top of a Fedora Core 7 installation that runs inside a virtual machine managed by the VM Manager. Two Fujitsu Siemens Esprimo physical machines were used, each equipped with an Intel Core 2 Duo 1.86 GHz processor and 2 GB of RAM. They are operated by Fedora Core 7 containing the Xen hypervisor version 3.1. One of the two physical machines hosts the virtual machine, but both of them have access to the disk image of the virtual machine via NFS. The two physical machines are managed by two different Physical Managers.
Figure 6.1: The testbed

The entire cluster, here formed by the two physical machines, is managed by the Cluster Manager. The two Physical Managers, the VM Manager and the Cluster Manager are hosted on a third physical machine, which also hosts the TPC-W load generator. This third machine is not part of the managed infrastructure and thus not part of the cluster. The TPC-W load generator is a Java application and is also part of the Java implementation of the TPC-W benchmarking standard.
6.2 Evaluation of the Management Framework

In this part of the evaluation, the framework was tested using the testbed described in the previous section. This consists of two steps: (1) test whether the VM Manager is capable of discovering a memory bottleneck and (2) test whether the components of the framework work together in order to allocate additional memory when such a bottleneck has been determined. For this purpose, tests were first conducted in order to determine a scenario in which memory is the bottleneck. The results of these empirical observations are shown in Figure 6.2.
Here, a logarithmic scale is used to present the results of the five tests that were performed. These tests consisted of measuring the response times of the new_products servlet, which is part of the Java TPC-W implementation. The load for this servlet was generated using the TPC-W Java load generator, which simulated the shopping actions of 30 users. For the five tests, 80, 90, 100, 110 and 120 MB of memory were allocated to the virtual machine, respectively. The response times of the servlet shown in Figure 6.2 clearly decrease with the increase in memory allocation. Thus, we were able to determine a scenario where the amount of memory allocated to a virtual machine can be the cause of a bottleneck.
[Chart: response times of the new_products servlet over time for memory allocations of 80, 90, 100, 110 and 120 MB, plotted on a logarithmic scale.]

Figure 6.2: Empirical observations
Following these experiments, we set the memory allocation of our virtual machine to 70 MB and started the TPC-W Java load generator. Again, the load generator was set up to simulate 30 active users shopping on the e-commerce site hosted inside the Tomcat server. The results of this evaluation are shown in Figure 6.3. To determine when an SLO violation takes place, the same rules presented in Section 5.3 were used. As can be observed from Figure 6.3, four SLO violations took place within two minutes, caused by the increase in response times of the Tomcat server. While the first three SLO violations were tolerated, when the fourth SLO violation occurred, the VM Manager reacted and requested more memory. This is in accordance with the rules presented in Section 5.3.
[Chart: influence of memory allocation on service response time, showing VM OS free memory (bytes), VM OS used swap (bytes) and measured SLO violations per minute over time.]
Figure 6.3: Influence of available memory on SLO violations

Up to the point where the fourth SLO violation occurred, the virtual machine had virtually no free memory available. Furthermore, the Linux hosted in the virtual machine was heavily swapping. As can be observed from Figure 6.3, after the fourth SLO violation occurred, an additional 100 MB of free memory became available to the virtual machine. This demonstrates that the VM Manager was capable of determining a bottleneck and requesting additional memory, and that the components of the framework were able to work together to allocate additional memory to the requesting virtual machine. After the memory allocation, the level of free memory sank to approximately 80 MB, since 20 MB were allocated to the Tomcat server. The swapping level remained the same, which is common Linux behaviour. After the additional memory was allocated, no SLO violation took place over the rest of the interval. This shows that the bottleneck determined by the VM Manager was truly a bottleneck and thus that the bottleneck determination mechanism works correctly, at least for scenarios where memory is the cause of a bottleneck (other scenarios were not implemented and thus could not be tested).
6.3 Evaluation of the Cluster Controllers
The cluster controllers, as described in Sections 4.3 and 5.4, were first tested using the previously described testbed.
The results shown in Figure 6.3 were obtained using Algorithm A as a controller. In that scenario, Algorithm A determined that there was enough memory locally available, and thus decided to allocate this memory locally. The same results were obtained using Algorithm B for this scenario. Next, the scenario was modified so that the necessary memory was not locally available, but only on the second physical machine. In this case, only one solution is possible: to migrate the virtual machine to the other physical machine. Both algorithms found this solution and performed a migration. The total migration time was under five seconds. However, evaluating these algorithms in a cluster which only contains two physical machines is of little relevance. In both scenarios that were designed (with and without migration), the limitations of Algorithm A compared to Algorithm B were not visible. Because of the architecture of the framework and the internal architecture of the Cluster Logic, it is possible to isolate the two algorithms and test them by means of simulation. We designed two simulation scenarios in which migration must be performed in order to obtain a valid solution.
[Bar charts: initial and final memory allocation (in units of 100 MB) on Host 1 and Host 2, showing VM 1, VM 2 and the free memory on each host.]

Figure 6.4: A scenario comprising two physical machines
In the first scenario, two physical machines named Host 1 and Host 2 each provision one VM, VM 1 and VM 2 respectively (see Figure 6.4). Host 1 possesses a total of 500 MB of RAM for VMs, while Host 2 provides 1000 MB. At the beginning, VM 1 uses 300 MB, while VM 2 uses 400 MB. As the TPC-W load generator stresses VM 1, the VM Manager responsible for VM 1 requests additional memory (in this case an extra 300 MB) from the Cluster Manager. In this scenario, both Cluster Logic algorithms deliver the same solution, namely to migrate VM 1 from Host 1 to Host 2, as Host 1 is not able to provide an extra 300 MB of memory.
[Bar charts: initial, intermediate and final memory allocation (in units of 100 MB) on Host 1, Host 2 and Host 3, showing VM 1, VM 2, VM 3 and the free memory on each host.]
Figure 6.5: A scenario comprising three physical machines

The second scenario comprises three physical machines: Host 1, Host 2 and Host 3 (see Figure 6.5, left chart). At the beginning, each of them provisions one VM: VM 1, VM 2 and VM 3 respectively. In this scenario, Host 1 possesses a total of 500 MB of RAM for VMs, Host 2 600 MB and Host 3 1100 MB, while VM 1 uses 300 MB, VM 2 400 MB and VM 3 600 MB. Again, the VM Manager responsible for VM 1 requests 300 MB of additional memory. Unlike in the first scenario, the two algorithms perform differently. In fact, Algorithm A is not even capable of finding a valid solution, as neither Host 2 nor Host 3 has 600 MB of free memory available. This scenario clearly shows the limitations of this simple algorithm. The more sophisticated Algorithm B is able to find an optimal solution: VM 2 is migrated to Host 3 before VM 1 is migrated to Host 2. As a side effect, this solution results in a high resource utilization on both Host 2 and Host 3, while Host 1 provisions no VM and thus can be set offline. The evolution of Algorithm B from the initial state to this final state can be observed in Figure 6.5.
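The memory arithmetic behind this outcome can be checked directly from the figures given above:

    VM 1 after the request:          300 MB + 300 MB = 600 MB
    Free memory on Host 2 initially: 600 MB - 400 MB = 200 MB   (not enough for VM 1)
    Free memory on Host 3 initially: 1100 MB - 600 MB = 500 MB  (not enough for VM 1)
    After migrating VM 2 to Host 3:  Host 3 has 100 MB free, Host 2 has 600 MB free,
                                     which is exactly enough for the enlarged VM 1.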
Chapter 7

Summary

The goal of this thesis was to design and evaluate self-management approaches for virtual machine-based environments. To this end, an in-depth state of the art analysis was first performed. Based on this analysis, it was observed that although various controller approaches exist for the autonomic management of virtualization-based environments, they use different architectural approaches and manage different infrastructures, and are thus impossible to compare with each other. It was argued that in order to design and evaluate various controller approaches, a generic framework is required in which different controllers can be plugged in, thus permitting an objective form of evaluation. For such a framework, a set of requirements was defined.

The first contribution of this thesis is the design and implementation of such a framework using the principles of the autonomic computing paradigm. The resulting framework exhibits a modular architecture and consists of three types of components that manage the three types of entities found in the cluster:

• The VM Manager manages a virtual machine
• The Physical Manager manages a physical machine
• The Cluster Manager manages the entire cluster

These components are arranged in a hierarchical fashion, with the higher-ranked Cluster Manager placed over the lower-ranked VM Managers and Physical Managers. A VM Manager is responsible for signaling when the virtual machine to which it is associated requires more resources, while the Physical Managers allocate additional resources based on decisions taken by the Cluster Manager.
Three types of modules are present in the framework: monitor, actuator and "intelligent" modules. Furthermore, two different types of "intelligent" modules have been introduced:

• The bottleneck determination mechanism, located in a VM Manager
• The global cluster controller, located in the Cluster Manager

The modular, loosely coupled approach used to design the framework had the effect of increasing the complexity of the implementation. The main reason for this was the asynchronous form of communication between modules that the architecture imposed. However, precisely this architectural approach is what provides the generic character of the framework. Technology-specific modules like the JMX module or the Xen module can easily be replaced with modules developed for other technologies. Furthermore, various cluster controllers and bottleneck determination mechanisms can be plugged into the framework and evaluated.

The bottleneck determination mechanism is responsible for determining the resource causing a bottleneck in a virtual machine. This is done by monitoring both the virtual machine's resource utilization level and the response times of the service running inside the virtual machine. Thus, a service level management approach is used instead of a simple, resource-oriented approach. Each of these "intelligent" modules is loosely coupled with the framework and can be plugged in. For the bottleneck determination mechanism, three possible approaches were analyzed:

• Rule-based reasoning
• Case-based reasoning
• Time series analysis

The rule-based approach was selected, designed and implemented. The resulting bottleneck determination mechanism represents the second contribution of this thesis. The evaluation performed and described in Chapter 6 shows that this mechanism works well for a scenario where memory is the cause of a bottleneck.

One of the main features of the framework is that it reduces the complex task of managing a virtualization-based environment to a resource allocation problem. This level of abstraction significantly simplified the design of cluster controllers.
On the other hand, using such a two-layered hierarchical architecture was a source of overhead in the implementation of the framework.

To develop such a cluster controller, an analysis was first performed. The analysis showed that the resource allocation problem that needs to be solved by the controllers belongs to the knapsack family of problems and is NP-hard. It is commonly believed that for such problems there is no algorithm that finds an optimal solution in polynomial time. For our problem, this means that either heuristics or approximation algorithms can be used to implement the controller, and that the results of the algorithm used by the controller are only approximate solutions. Different techniques were analyzed with respect to how they can be used for the design of cluster controllers. Finally, two different controllers were designed, called Algorithm A and Algorithm B. Both of them use heuristic-based strategies and were implemented and evaluated. The design and implementation of these controllers represents the third contribution of this thesis. The evaluation of these controllers demonstrated their fitness but also indicated some limitations. While Algorithm A is not always capable of finding a solution when one exists, Algorithm B addresses this limitation and is always capable of finding a solution if one exists. However, both Algorithm A and Algorithm B are heuristic-based algorithms and thus offer no guaranteed performance ratio.

Finally, it is necessary to mention the virtualization technology-agnostic character of the framework. Although the implementation of the framework uses the Xen hypervisor, adapters for other virtualization technologies can easily be developed and plugged into the framework. This is the fourth contribution of this thesis and represents an important step towards the management of heterogeneous virtual machine-based data centers.

The generic, modular approach of the framework provides great support for extending the results of this thesis. In the following, possible ways of continuing this work are presented. The implementation of the framework dealt only with one type of resource, namely memory. Future work could extend the framework so that it deals with all four resource types defined in Section 3.2: memory, CPU, network bandwidth and I/O bandwidth. Furthermore, as part of the implementation, adapters for Xen were developed. Part of future work should be developing adapters for other virtualization technologies, like VMware ESX, Virtuozzo and the ones that will follow. This could also include adapters based on the now emerging standards from the DMTF System Virtualization, Partitioning,
and Clustering Working Group (SVPC WG). Similarly, adapters for various monitoring technologies can be incorporated.

In the design and implementation chapters, a rule-based bottleneck determination mechanism was presented. Besides extending this rule-based approach to deal with all four types of resources, one could also design and evaluate various other approaches, like the ones analyzed in Section 3.4. For example, it might be possible to use time series analysis to forecast a future bottleneck, so that the management system can react proactively before a bottleneck occurs and not only after the fact.

One limitation of the current framework implementation is that it only deals with underprovisioned virtual machines that require additional resources, but not with overprovisioned virtual machines. However, extending the framework with this capability simply means extending the bottleneck determination mechanism with a mechanism for detecting overprovisioned resources, and the cluster controller with a mechanism for deallocating the resources that are not required. Such future work should be based on the analysis of this problem performed in Section 3.5.

Finally, in Section 4.3 we have seen that both Algorithm A and Algorithm B have a serious limitation: they are heuristic-based algorithms. However, we have shown in Section 3.3 that our problem belongs to the knapsack family of problems. As seen in Section 2.4, knapsack problems have approximation algorithms. This means that, if our problem can be reduced to a concrete knapsack problem that has an approximation algorithm, then we have found an algorithm for our problem with a guaranteed performance ratio. Future work could thus design, implement and evaluate a controller based on an approximation algorithm. One way to do this is to first reduce the problem to a multidimensional bin packing problem. This can be achieved by using physical machines that have identical hardware configurations. After this reduction, one of the existing multidimensional bin packing algorithms [CKP01] can be adapted and implemented.
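As a pointer in that direction, the following minimal sketch shows a first-fit-decreasing packing of virtual machines onto identical hosts for the one-dimensional (memory-only) case. It is only an illustration of the kind of algorithm such future work could start from, not part of the thesis implementation; the method name and data structures are chosen here for the example.

    import java.util.*;

    // First-fit decreasing: place the largest VMs first, each on the first host
    // that still has enough free memory (hostCapacity and sizes in MB).
    static Map<String, Integer> firstFitDecreasing(Map<String, Long> vmMemory,
                                                   int hostCount, long hostCapacity) {
        List<String> vms = new ArrayList<>(vmMemory.keySet());
        vms.sort((a, b) -> Long.compare(vmMemory.get(b), vmMemory.get(a))); // largest VM first

        long[] freeMemory = new long[hostCount];
        Arrays.fill(freeMemory, hostCapacity);
        Map<String, Integer> placement = new HashMap<>();

        for (String vm : vms) {
            int host = 0;
            while (host < hostCount && freeMemory[host] < vmMemory.get(vm)) {
                host++;                      // skip hosts without enough free memory
            }
            if (host == hostCount) {
                return null;                 // the VMs do not fit onto the given hosts
            }
            freeMemory[host] -= vmMemory.get(vm);
            placement.put(vm, host);
        }
        return placement;
    }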
Chapter 8

Bibliography

[AA06] Adams, Keith ; Agesen, Ole: A comparison of software and hardware techniques for x86 virtualization. In: ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems. ACM Press. – ISBN 1595934510, 2–13

[ACK+00] Ausiello, G. ; Crescenzi, P. ; Kann, V. ; Marchetti-Spaccamela, Alberto M. ; Gambosi, Giorgio: Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer, 2000. – ISBN 3540654313

[AMD05] AMD: AMD64 Virtualization Codenamed "Pacifica" Technology: Secure Virtual Machine Architecture Reference Manual, May 2005

[BCS06] Bansal, Nikhil ; Caprara, Alberto ; Sviridenko, Maxim: Improved approximation algorithms for multidimensional bin packing problems. In: FOCS '06: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, 697–708

[BDF+03] Barham, Paul ; Dragovic, Boris ; Fraser, Keir ; Hand, Steven ; Harris, Tim ; Ho, Alex ; Neugebauer, Rolf ; Pratt, Ian ; Warfield, Andrew: Xen and the art of virtualization. In: SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles. ACM Press. – ISBN 1581137575, 164–177

[BJR94] Box, George ; Jenkins, Gwilym M. ; Reinsel, Gregory: Time Series Analysis: Forecasting & Control (3rd Edition). Prentice Hall. – ISBN 0130607746

[BKB07] Bobroff, Norman ; Kochut, Andrzej ; Beaty, Kirk: Dynamic Placement of Virtual Machines for Managing SLA Violations, 119–128

[CFH+05] Clark, Christopher ; Fraser, Keir ; Hand, Steven ; Hansen, Jacob G. ; Jul, Eric ; Limpach, Christian ; Pratt, Ian ; Warfield, Andrew: Live Migration of Virtual Machines. In: Proceedings of the 2nd ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI). Boston, MA, May 2005, S. 273–286

[CKP01] Caprara, Alberto ; Kellerer, Hans ; Pferschy, Ulrich: Approximation Schemes for Ordered Vector Packing Problems. In: Lecture Notes in Computer Science 2129 (2001), 63–?? http://citeseer.ist.psu.edu/470961.html

[DC99] Duda, Kenneth J. ; Cheriton, David R.: Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler. In: SOSP '99: Proceedings of the seventeenth ACM symposium on Operating systems principles Bd. 33. ACM Press. – ISSN 0163–5980, 261–276

[Deb04] Debusmann, Markus: Modellbasiertes Service Level Management verteilter Anwendungssysteme, Universität Kassel, Diss., December 2004

[EGCGJ97] Coffman, E. G., Jr. ; Garey, M. R. ; Johnson, D. S.: Approximation algorithms for bin packing: a survey. (1997), S. 46–93. – ISBN 0–534–94968–1

[FKK03] Felner, Ariel ; Kraus, Sarit ; Korf, Richard E.: KBFS: K-Best-First Search. In: Annals of Mathematics and Artificial Intelligence 39 (2003), Nr. 1, 19–39. http://dx.doi.org/10.1023/A:1024452529781

[FZ05] Fromm, Jochen ; Zapf, Michael: Selbst-Eigenschaften in verteilten Systemen. In: Praxis der Informationsverarbeitung und Kommunikation (PIK) 2005 (2005), October, Nr. 4, 189–198. http://www.vs.uni-kassel.de/publications/2005/FZ05

[GC03] Ganek, A. G. ; Corbi, T. A.: The dawning of the autonomic computing era. In: IBM Syst. J. 42 (2003), January, Nr. 1, 5–18. http://portal.acm.org/citation.cfm?id=1014770. – ISSN 0018–8670

[GIYC] Grit, Laura ; Irwin, David ; Yumerefendi, Aydan ; Chase, Jeff: Virtual Machine Hosting for Networked Clusters: Building the Foundations for Autonomic Orchestration

[HDPT04] Hellerstein, Joseph L. ; Diao, Yixin ; Parekh, Sujay ; Tilbury, Dawn M.: Feedback Control of Computing Systems. John Wiley & Sons, 2004. – ISBN 047126637X

[HJ04] Hansen, Jacob G. ; Jul, Eric: Self-migration of operating systems. In: EW11: Proceedings of the 11th workshop on ACM SIGOPS European workshop: beyond the PC. ACM Press

[HMG04] Herrmann, Klaus ; Mühl, Gero ; Geihs, Kurt: Self-Management - Potentiale, Probleme, Perspektiven. In: Praxis der Informationsverarbeitung und Kommunikation (PIK) 27 (2004), April, Nr. 2, 74–79. http://www.kbs.cs.tu-berlin.de/publications/fulltext/pik2004.pdf. – (in German). ISBN 3–598–01248–9

[Hor01] Horn, Paul: Autonomic computing: IBM's Perspective on the State of Information Technology. 2001

[IBM] IBM. – http://www.research.ibm.com/oceanoproject/ (Last visited 20.01.2008)

[Inf] InfoWorld: Gartner Says Virtualization will Top IT Agenda Until 2010. http://weblog.infoworld.com/virtualization/archives/general_news/

[Int05] Intel: Intel Virtualization Technology Specification for the IA-32 Intel Architecture, April 2005

[Jac98] Jackson, Peter: Introduction to expert systems. Boston, MA, USA : Addison-Wesley Longman Publishing Co., Inc., 1998. – ISBN 0–201–87686–8

[Jav] Java RMI. – http://java.sun.com/javase/technologies/core/basic/rmi/index.jsp (Last visited 24.02.2008)

[Kan92] Kann, Viggo: On the Approximability of NP-complete Optimization Problems, Royal Institute of Technology Stockholm, Diss., 1992

[Kar91] Karloff, Howard: Linear programming. Cambridge, MA, USA : Birkhauser Boston Inc., 1991. – ISBN 0–8176–3561–0

[KBC+00] Kubiatowicz, John ; Bindel, David ; Chen, Yan ; Czerwinski, Steven ; Eaton, Patrick ; Geels, Dennis ; Gummadi, Ramakrishna ; Rhea, Sean ; Weatherspoon, Hakim ; Wells, Chris ; Zhao, Ben: OceanStore: an architecture for global-scale persistent storage. In: SIGARCH Comput. Archit. News 28 (2000), Nr. 5, S. 190–201. http://dx.doi.org/10.1145/378995.379239. – ISSN 0163–5964

[KC03] Kephart, J. O. ; Chess, D. M.: The Vision of Autonomic Computing. In: Computer 36 (2003), S. 41–50

[KPP04] Kellerer, H. ; Pferschy, U. ; Pisinger, D.: Knapsack Problems. Springer, Berlin, Germany, 2004

[LL02] Lohman, Guy M. ; Lightstone, Sam S.: SMART: making DB2 (more) autonomic. In: VLDB '02: Proceedings of the 28th international conference on Very Large Data Bases, VLDB Endowment, 2002, S. 877–879

[LLS02] Lymberopoulos, L. ; Lupu, E. ; Sloman, M.: An Adaptive Policy Based Management Framework for Differentiated Services Networks. In: POLICY '02: Proceedings of the 3rd International Workshop on Policies for Distributed Systems and Networks. IEEE Computer Society. – ISBN 0769516114

[LMKB01] Lutfiyya, Hanan ; Molenkamp, Gary ; Katchabaw, Michael ; Bauer, Michael A.: Issues in Managing Soft QoS Requirements in Distributed Systems Using a Policy-Based Framework. In: POLICY '01: Proceedings of the International Workshop on Policies for Distributed Systems and Networks. Springer-Verlag. – ISBN 3540416102, 185–201

[Mar01] Marden, Morris: An Architectural Evaluation of Java TPC-W. In: HPCA '01: Proceedings of the 7th International Symposium on High-Performance Computer Architecture. Washington, DC, USA : IEEE Computer Society, 2001. – ISBN 0–7695–1019–1, S. 229

[MB06] Menasce, Daniel A. ; Bennani, Mohamed N.: Autonomic Virtualized Environments. In: ICAS '06: Proceedings of the International Conference on Autonomic and Autonomous Systems. Washington, DC, USA : IEEE Computer Society, 2006. – ISBN 0–7695–2653–5, S. 28

[Mel07] Mellor, Ewan: The Xen-API. April 2007

[MSMW04] Müller-Schloer, Christian ; Malsburg, Christoph von d. ; Würt, Rolf P.: Organic Computing. In: Informatik-Spektrum 27 (2004), Nr. 4, 332–336. http://dx.doi.org/10.1007/s00287-004-0409-6

[PG74] Popek, Gerald J. ; Goldberg, Robert P.: Formal requirements for virtualizable third generation architectures. In: Commun. ACM 17 (1974), July, Nr. 7, 412–421. http://dx.doi.org/10.1145/361011.361073. – ISSN 0001–0782

[Pis95] Pisinger, D.: Algorithms for Knapsack Problems, DIKU, University of Copenhagen, Denmark, Diss., 1995

[pol04] An artificial intelligence perspective on autonomic computing policies. – S. 3–12

[pro00] Process migration. In: ACM Comput. Surv. 32 (2000), September, Nr. 3, 241–299. http://dx.doi.org/10.1145/367701.367728. – ISSN 0360–0300

[PSZ+07] Padala, Pradeep ; Shin, Kang G. ; Zhu, Xiaoyun ; Uysal, Mustafa ; Wang, Zhikui ; Singhal, Sharad ; Merchant, Arif ; Salem, Kenneth: Adaptive control of virtualized resources in utility computing environments. In: EuroSys '07: Proceedings of the 2007 conference on EuroSys. ACM Press. – ISSN 0163–5980, 289–302

[RG05] Rosenblum, M. ; Garfinkel, T.: Virtual machine monitors: current technology and future trends. In: Computer 38 (2005), Nr. 5, 39–47. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1430630

[RN04] Russell, Stuart ; Norvig, Peter: Künstliche Intelligenz. Pearson Studium. – ISBN 3827370892

[RRX+06] Ruth, P. ; Rhee, Junghwan ; Xu, Dongyan ; Kennell, R. ; Goasguen, S.: Autonomic Live Adaptation of Virtual Computational Environments in a Multi-Domain Infrastructure. In: Autonomic Computing, 2006. ICAC '06. IEEE International Conference on, 5–14

[RUB] RUBiS. – http://rubis.objectweb.org/ (Last visited 11.12.2007)

[San] Sandia National Laboratories. – http://herzberg.ca.sandia.gov/ (Last visited 31.01.2008)

[Sch07] Schmid, Markus: Ein Ansatz für das Service Level Management in dynamischen Architekturen. In: Braun, T. (Hrsg.) ; Carle, G. (Hrsg.) ; Stiller, B. (Hrsg.): KiVS 2007 - Kommunikation in Verteilten Systemen - Industriebeiträge, Kurzbeiträge und Workshops, VDE Verlag, March 2007. – ISBN 978–3–8007–2980–7, S. 255–266

[SG06] Schmid, Markus ; Geihs, Kurt: Self-Organisation in the Context of QoS Management in Service Oriented Architectures. In: Boudaoud, Karima (Hrsg.) ; Nobelis, Nicolas (Hrsg.) ; Nebe, Thomas (Hrsg.): Proceedings of the 13th Annual Workshop of HP OpenView University Association, Hosted by University of Nice at Cote d'Azur, May 21-24, 2006. Stuttgart : Infonomics-Consulting, May 2006. – ISBN 3000187804, S. 153–164

[SN05] Smith, J. E. ; Nair, Ravi: The architecture of virtual machines. In: Computer 38 (2005), Nr. 5, 32–38. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1430629

[Ten00] Tennenhouse, David: Proactive computing. In: Commun. ACM 43 (2000), May, Nr. 5, 43–50. http://dx.doi.org/10.1145/332833.332837. – ISSN 0001–0782

[TPCa] TPC-W. – http://www.tpc.org/tpcw/ (Last visited 11.12.2007)

[TPCb] TPC-W Java Implementation. – http://www.ece.wisc.edu/~pharm/tpcw.shtml (Last visited 11.12.2007)

[UNR+05] Uhlig, R. ; Neiger, G. ; Rodgers, D. ; Santoni, A. L. ; Martins, F. C. M. ; Anderson, A. V. ; Bennett, S. M. ; Kagi, A. ; Leung, F. H. ; Smith, L.: Intel virtualization technology. In: Computer 38 (2005), Nr. 5, 48–56. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1430631

[Vaz01] Vazirani, Vijay V.: Approximation algorithms. New York, NY, USA : Springer-Verlag New York, Inc., 2001. – ISBN 3–540–65367–8

[Wal02] Waldspurger, Carl A.: Memory resource management in VMware ESX server. In: SIGOPS Oper. Syst. Rev. 36 (2002), Nr. SI, 181–194. http://dx.doi.org/10.1145/844128.844146. – ISSN 0163–5980

[WHW+04] White, S. R. ; Hanson, J. E. ; Whalley, I. ; Chess, D. M. ; Kephart, J. O.: An architectural approach to autonomic computing. In: Autonomic Computing, 2004. Proceedings. International Conference on, 2–9

[WTKD04] Walsh, W. E. ; Tesauro, G. ; Kephart, J. O. ; Das, R.: Utility functions in autonomic systems. In: Autonomic Computing, 2004. Proceedings. International Conference on, 70–77

[ZBG+05] Zhang, Yuting ; Bestavros, Azer ; Guirguis, Mina ; Matta, Ibrahim ; West, Richard: Friendly virtual machines: leveraging a feedback-control model for application adaptation. In: VEE '05: Proceedings of the 1st ACM/USENIX international conference on Virtual execution environments. ACM Press. – ISBN 1595930477, 2–12