Project Acronym: Fed4FIRE
Project Title: Federation for FIRE
Instrument: Large scale integrating project (IP)
Call identifier: FP7-ICT-2011-8
Project number: 318389
Project website: www.fed4fire.eu

D6.1 Detailed specifications for first cycle ready

Work package: WP6
Task: Task 6.1, Task 6.2
Due date: 28/02/2013
Submission date: 07/03/2013
Deliverable lead: Max Ott (NICTA)
Version: 1.0
Authors: Olivier Mehani (NICTA), Guillaume Jourjon (NICTA), Yahya Al-Hazmi (TUB), Wim Vandenberghe (iMinds), Donatos Stavropoulos (UTH), Jorge Lanza (UC), Kostas Choumas (UTH), Luis Sanchez (UC), Pablo Sotres (UC)
Reviewers: Kostas Kavoussanakis (EPCC), Steve Taylor (IT Innovation)
© Copyright NICTA and other members of the Fed4FIRE consortium 2013
Abstract

This document provides an overview of the requirements of the testbeds involved in Fed4FIRE in terms of monitoring and measurements. It then identifies commonalities, and presents implementation steps for the monitoring and measurement aspects of the first development cycle of the Fed4FIRE federation.

Keywords: Report, Deliverable, Measurement, OML, TopHat, WP6

Nature of the deliverable: Report (R)
Dissemination level: Public (PU)
Disclaimer

The information, documentation and figures available in this deliverable are written by the Fed4FIRE (Federation for FIRE) project consortium under EC co-financing contract FP7-ICT-318389 and do not necessarily reflect the views of the European Commission. The European Commission is not liable for any use that may be made of the information contained herein.
Executive Summary

This deliverable presents the first cycle implementation steps for Fed4FIRE work package 6 (WP6). After recalling the objectives set in previous Fed4FIRE deliverables, it collects input from the involved testbeds in terms of current deployment status, supported measurement infrastructures, and requirements. This input is consolidated in order to inform the creation of a detailed implementation design for the first cycle. Next to this input from the involved testbeds, the design also relies on a survey and analysis of the state of the art in tools for data acquisition, collection and reporting.

These different inputs led to the insight that the most widespread commonality is the use of OML as a collection and reporting framework. It allows instrumenting any sort of system through the abstraction of a measurement point, describing a group of metrics. This abstraction allows more latitude in the choice of measurement tools: as long as they conform to the same measurement point for the same information, their output can be used interchangeably in further analysis. Selecting OML for reporting purposes therefore allows flexibility in the choice of measurement tools, both for monitoring and measurement tasks, as well as a unified way to access the collected data.

This, however, only caters for collection and storage, not for direct access to or visualization of the data, let alone from a federated environment. Another tool therefore needs to be identified for this purpose. From the surveys in this document, TopHat fits the bill, thanks to its ability to run queries over distributed systems and data stores, and its pre-existing deployments.

Facility and infrastructure monitoring tasks require specific metrics about the testbed and its nodes to always be made available. While some deployments already have solutions in place, the recommended solutions for the others are, in order of preference, Zabbix, Nagios or collectd.

Overall, this caters for the measurement architecture shown in Figure 1. Essentially, all testbeds will be required to deploy OML and TopHat, for measurement collection and federated access, respectively. With the aim of limiting the impact on deployed solutions, monitoring and measurement tools already in use will not be superseded, but rather adapted to be included in the proposed architecture. For cases where a requirement is not met, default solutions are prescribed.
Figure 1: Proposed cycle 1 measurement architecture for Fed4FIRE. Elements in bold are the default proposal for new deployments on a canonical testbed
In order to implement this design, seven implementation steps have been identified, as presented in Table 1. Most steps (service deployment and application instrumentation) need to be undertaken independently by all participants. Where commonalities exist (e.g. Zabbix and Nagios), instrumentation should be a common effort. To support the instrumentation task, NICTA will provide and curate a clearinghouse of homogenized OML measurement point schemas. The goal is to ease the integration of new applications while maintaining the highest level of data exchangeability and interoperability between measurement tools providing similar information. A TDMI agent will also be written to allow queries to the OML storage from TopHat.
Functional element: Facility and infrastructure monitoring
Implementation strategy:
• Deploy Nagios and/or Zabbix and/or collectd if not yet available (all participants)
• Instrument the relevant measurement systems (all participants, with support from NICTA)

Functional element: Experiment measurement
Implementation strategy:
• Deploy OML if not yet available (all participants)
• Instrument relevant measurement systems (all participants, with support from NICTA)
• Maintain a clearinghouse of measurement points (NICTA)

Functional element: Data access
Implementation strategy:
• Deploy TopHat (all participants)
• Make OML measurement databases accessible to TopHat (NICTA, UPMC)

Table 1: Implementation strategy of functional elements
Acronyms and Abbreviations

API: Application Programming Interface
CPU: Central Processing Unit
CREW: Cognitive Radio Experimentation World (FP7 IP project)
DWDM: Dense Wavelength Division Multiplexing
EC: Experiment Controller
EPC: Evolved Packet Core
FIRE: Future Internet Research and Experimentation
FLS: First Level Support
FTP: File Transfer Protocol
GENI: Global Environment for Network Innovations
GPS: Global Positioning System
HDD: Hard Disk Drive
I/O: Input/Output
IETF: Internet Engineering Task Force
IoT: Internet of Things
IP: Internet Protocol
IPMI: Intelligent Platform Management Interface
KPI: Key Performance Indicator
MAC: Media Access Control
MIB: Management Information Base
MP: Measurement Point
MS: Measurement Stream
MTU: Maximum Transmission Unit
NRPE: Nagios Remote Plugin Executor
OCF: OFELIA Control Framework
OMF: cOntrol and Management Framework, a testbed management framework
OML: Measurement Library, an instrumentation system allowing for remote collection of any software-produced metrics, with in-line filtering and multiple SQL back-ends
OS: Operating System
PDU: Power Distribution Unit
PoE: Power over Ethernet
PTPd: Precision Time Protocol daemon
QoS: Quality of Service
RCS: Rich Communication Services
RF: Radio Frequency
S.M.A.R.T.: Self-Monitoring, Analysis and Reporting Technology
SLA: Service Level Agreement
SNMP: Simple Network Management Protocol
SQL: Structured Query Language
SSH: Secure Shell
TCP: Transmission Control Protocol
TDMI: TopHat Dedicated Measurement Infrastructure
UDP: User Datagram Protocol
USB: Universal Serial Bus
VM: Virtual Machine
VoIP: Voice over IP
VoLTE: Voice over LTE
Table of Contents

1 INTRODUCTION
2 INPUTS TO THIS DELIVERABLE
2.1 ARCHITECTURE (D2.1)
2.2 REQUIREMENTS ADDRESSED BY THE ARCHITECTURE
2.2.1 Generic requirements of a FIRE federation (D2.1)
2.2.2 Requirements from a sustainability point of view (D2.1)
2.2.3 High priority requirements of the infrastructure community (D3.1)
2.2.4 High priority requirements of the services community (D4.1)
2.2.5 High priority requirements of shared support services (D8.1)
2.3 ADDITIONAL WP6 REQUIREMENTS
2.3.1 Generic requirements
2.3.2 Monitoring and measurement metrics overview
2.3.3 Consolidated summary of testbeds' inputs
3 IMPLEMENTATION OF THE ARCHITECTURAL FUNCTIONAL ELEMENTS
3.1 INTRODUCTION
3.2 EVALUATION OF POSSIBLE APPROACHES FOR IMPLEMENTATION
3.2.1 Data acquisition
3.2.2 Collection and Reporting
3.2.3 Summary
3.3 DETAILS OF THE SELECTED MONITORING AND MEASURING TOOLS IN FED4FIRE
3.3.1 Monitoring (Zabbix, Nagios)
3.3.2 Collection and Reporting (OML)
3.3.3 Federated Queries (TopHat)
3.4 IMPLEMENTATION STEPS
3.4.1 Installation of New Tools
3.4.2 Adaptation of Existing Tools
3.4.3 Coordination
4 SUMMARY
4.1 MAPPING OF ARCHITECTURE TO IMPLEMENTATION PLAN
4.2 DEVIATION OF SUPPORTED REQUIREMENTS COMPARED TO D2.1
REFERENCES
APPENDIX A: PLANETLAB EUROPE REQUIREMENTS
APPENDIX B: VIRTUAL WALL REQUIREMENTS
APPENDIX C: OFELIA (OPENFLOW IN EUROPE LINKING INFRASTRUCTURE AND APPLICATIONS) CONTROL FRAMEWORK REQUIREMENTS
APPENDIX D: OPTICAL TESTBED – EXPERIMENTA DWDM RING TESTBED REQUIREMENTS
APPENDIX E: OPTICAL TESTBED – OPENFLOW-ENABLED ADVA ROADMS TESTBED REQUIREMENTS
APPENDIX F: BONFIRE REQUIREMENTS
APPENDIX G: GRID'5000 REQUIREMENTS
APPENDIX H: FUSECO PLAYGROUND REQUIREMENTS
APPENDIX I: NITOS REQUIREMENTS
APPENDIX J: W-ILAB.T WIRELESS NETWORK TESTBED REQUIREMENTS
APPENDIX K: NETMODE REQUIREMENTS
APPENDIX L: REQUIREMENTS FROM SMARTSANTANDER
APPENDIX M: NICTA REQUIREMENTS
1 Introduction

This deliverable details the specifications for cycle 1 development in WP6, based on the cycle 1 architecture described in D2.1 "First Federation Architecture". These specifications cover the details that pave the way for the actual implementations, in terms of tool deployment and/or adaptation, as well as coordination efforts in terms of data format harmonization.

This document is structured as follows. Section 2 summarizes the specific constraints that WP6 has to operate within when defining the specifications for the first cycle of Fed4FIRE. This is achieved by recalling relevant information from previous Fed4FIRE deliverables, and by presenting the related current state of deployment and the specific requirements collected from each of the involved Fed4FIRE testbeds. Note that section 2 presents a consolidated view, while the individual per-testbed information can be found in detail in the appendices. Based on the information gathered in section 2, section 3 evaluates possible approaches for implementation. The outcome of this evaluation is a more detailed design of the measurement and monitoring components of the first Fed4FIRE development cycle (to be refined in cycles 2 and 3). The last part of section 3 defines the implementation steps required to adopt this design. Finally, section 4 concludes this deliverable with a summary.
2 Inputs to this deliverable

This section revisits a specific selection of information from several previously submitted Fed4FIRE deliverables. The goal of this exercise is to summarize the specific constraints that WP6 has to operate within when defining the specifications for the first cycle of Fed4FIRE.
2.1 Architecture (D2.1)

Fed4FIRE identified the following types of monitoring and measurement (Figure 2) in D2.1 "First Federation Architecture":

• "Facility monitoring: this monitoring is used in the first level support to see if the testbed facilities are still up and running. The most straightforward way for this is that there is a common distributed tool which monitors each facility (Zabbix, Nagios or similar tools). The interface on top of this facility monitoring should be the same and will further be specified in WP6 (it seems in this case more straightforward to use all the same monitoring tool than to define and implement new interfaces)."
• "Infrastructure monitoring: this is monitoring of the infrastructure which is useful for experimenters, e.g. monitoring of switch traffic, wireless spectrum or physical host performance if the experimenter uses virtual machines. This should be provided by the testbed provider (an experimenter has e.g. no access to the physical host if he uses virtual machines) and as such a common interface would be very handy, but is not existing today."
• "Experiment measuring: measurements which are done by a framework that the experimenter uses and which can be deployed by the experimenter itself on his testbed resources in his experiment. In the figure one can see two experiment measuring frameworks, each with its own interfaces (and thus experimenter tools). Of course, a testbed provider can ease this by providing e.g. OS images with certain frameworks pre-deployed."

This deliverable also stated that in the first cycle of Fed4FIRE, facility monitoring will be rolled out on all testbeds. Infrastructure monitoring and experiment measuring are to be further discussed in WP6.
[Figure: experimenter tools and brokers retrieving monitoring data from a central facility monitoring dashboard (first level support), testbed/tool/certificate directories and identity providers, and per-testbed discovery/reservation/provisioning, rules-based authorization, facility monitoring, infrastructure monitoring and measurement components]

Figure 2: Monitoring and measurement architecture for cycle 1
2.2 Requirements addressed by the architecture

This section recalls the requirements relevant to WP6 set forth in D2.1, D3.1, D4.1 and D8.1.

2.2.1 Generic requirements of a FIRE federation (D2.1)

• Support:
  o How easily can components/testbeds/software be upgraded? For this, the APIs should be versioned, and tools and testbeds should support 2 or 3 versions at the same time, so that all components can be gradually upgraded.
  o How can different versions of protocols be supported (e.g. an upgrade of RSpec)? With versions.
• Experimenter ease of use:
  o DoW: the final goal is to make it easier for experimenters to use all kinds of testbeds and tools. If an experimenter wants to access resources on multiple testbeds, this should be possible from a single experimenter tool environment. This is possible, but Fed4FIRE should also aim to keep such tools up to date during the lifetime of the project, and set up a body which can further define the APIs, also after the project.
2.2.2 Requirements from a sustainability point of view (D2.1)

• From a sustainability point of view, it is preferable that the number of required central components is minimal, as these components put a high risk on a federation in terms of scalability and long-term operability.
  o Yes: there is no central component which is needed in order to use the testbeds. Of course, the central portal, identity provider, testbed directory, tool directory and certificate directory ease the use of the federation, as all the information needed to bring new experimenters to testbeds and tools is available in a single place.
• It is also required that the federation framework supports the joining and leaving of testbeds very easily, as this will be common practice.
  o The architecture supports multiple identity providers/portals. There is a common API for discovery, requirements, reservation and provisioning, while no restrictions are imposed on the use of specific experiment control, monitoring and storage. The common API makes it straightforward to add new tools and testbeds, while a testbed can also act as an extra identity provider.
2.2.3 High priority requirements of the infrastructure community (D3.1)

• I.2.101 (Monitoring): Measurement support framework. Remark: the architecture facilitates the use of a measurement framework.
• I.2.104 (Monitoring): Monitoring resources for operational support. Remark: the architecture facilitates the use of such a monitoring framework.
• I.2.105 (Monitoring): Monitoring resources for suitable resource selection and measurement interpretation. Remark: the architecture makes a distinction between facility monitoring, infrastructure monitoring and experiment monitoring. All three are supported by the architecture, but should be worked out in WP6.
• I.2.106 (Monitoring): Minimal impact of monitoring and measuring tools. Remark: this seems to be more of a WP6 requirement; there is no architecture requirement.
• I.2.201 (Permanent storage): Data storage. Remark: doing this in a structured way will be tackled in cycle 2 or 3.
• I.2.202 (Permanent storage): Data security. Remark: doing this in a structured way will be tackled in cycle 2 or 3.
• I.2.203 (Permanent storage): Stored experiment configuration. Remark: doing this in a structured way will be tackled in cycle 2 or 3.
• I.4.005 (Interconnectivity): IPv6 support. Remark: in cycle 1, some testbed resources will be reachable through IPv6. The architecture can cope with this, e.g. through DNS names which resolve to IPv6.
2.2.4 High priority requirements of the services community (D4.1)

• ST.2.001 (Monitoring): Monitoring management control. Remark: from an architectural viewpoint, the facility, infrastructure and experiment monitoring can cope with this. The details should be filled in by WP6.
• ST.2.007 (Permanent storage): Monitoring data and experiment result storage; audit, archiving, accountability. Remark: not yet foreseen in the architecture.
• ST.4.001 (Interconnectivity): Access between testbeds (interconnectivity). Remark: interconnectivity in a structured way is not tackled in cycle 1.
2.2.5 High priority requirements of shared support services (D8.1)
• FLS.1: Facility monitoring should push RAG (Red, Amber, Green) status to a central dashboard for FLS reactive monitoring.
• FLS.2: The facility RAG status should be based upon the monitoring of key components of each facility that indicate the overall availability status of the facility.
• FLS.3: The FLS should be able to drill down from the facility RAG status to see which components are degraded or down.
• FLS.4: The key components monitored for facility monitoring should be standardised across all facilities as much as possible.
• FLS.5: A commitment is required from each testbed to maintain the quality of monitoring information (FLS is "monitoring the monitoring", and the information FLS has is only as good as the facility monitoring data).
• FLS.6: Any central federation-level systems/components that may be implemented will need to be monitored by FLS (e.g. a central directory).
• FLS.7: FLS requires visibility of planned outages on a push basis from testbeds and administrators of central systems.
• FLS.8: Exception alerts from both testbeds and central systems should be filtered prior to reaching the FLS, to avoid reacting to alerts unnecessarily.
2.3 Additional WP6 requirements

This section defines some additional requirements in the context of WP6. It first introduces some generic requirements, followed by a consolidated view of the concrete requirements from each Fed4FIRE testbed. The corresponding details per testbed can be found in the appendices, in which detailed measurement and monitoring requirements at the infrastructure level are presented. Information about the status and availability of physical machines, the capacity of each, and the connectivity among them are examples of such important requirements, both from the facility provider's viewpoint, to determine the health and performance of their infrastructures, and from the user's point of view, to understand the operational conditions and the behaviour of the environment in which their experiment is conducted.

In addition to this, experimenters' requirements and their interest in specific monitoring metrics are also presented. These were collected from the individual facility providers involved in Fed4FIRE, based on their experience and on feedback received from experimenters who have already used or are currently using their facilities. Similarly, measurement and monitoring requirements at the services and application levels, such as cloud services and sensors, are also addressed. In order to fulfil services' and applications' requirements in terms of monitoring, it is required to identify a set of metrics to be monitored for the resources belonging to services and applications, including CPU performance, disk I/O and network measurements. An additional requirement from the services community is the ability to collect experimenter-specified, application-specific metrics. It is also required to ensure that the collection and publishing of measurements is consistent across the various Fed4FIRE facilities.

2.3.1 Generic requirements

Multiple stakeholders are interested in monitoring services: facility providers, experimenters, and those at the federation level. Their requirements differ from each other depending on their need for monitoring. From a facility provider's viewpoint, rich measurement information, and an understanding of the performance, health and behaviour of the whole infrastructure, are needed for effective management and optimization purposes. Specific monitoring information is required at the federation level to enable interoperability and compatibility across the federated facilities. From an experimenter's viewpoint, monitoring experiment resources and collecting observations are an essential part of any scientific evaluation or comparison of the technologies or services being studied. Such observations are no different from the monitoring of user-defined experimentation metrics (e.g., service metrics, application parameters) of the solution under test. Examples include the number of simultaneous download sessions, the number of refused service requests, packet loss, and spectrum analysis.

Furthermore, measurement data should be provided to these stakeholders in different manners, including the following:
• Periodic monitoring: resources are monitored on a regular basis, so that suitable measurement or monitoring tools are deployed and the data can be provided either through GUIs or APIs.
• Single-value monitoring: if only one measured value is of interest, then the required metrics to be measured can be deployed, such as the CPU usage of a VM, or the number of currently running VMs on a specific physical machine that hosts a user's VM. Only one measured value per metric is sent back to the requester.
• Data transportation: suitable methods to transport data.
• Data conversion: converters could be needed to change data from a source format to another one; where, how, and who is responsible for this should therefore be defined.
• Data collection: methods to collect data and store them into storage resources such as collectors, aggregators or repositories.
• Data viewing: methods to show measurement data, such as APIs, GUIs or visualization.
• Data visualisation: measurement data should be visualised, especially real-time data.
• Data harmonization: measurement data is collected from heterogeneous monitoring resources across federated infrastructures, which could provide data in various manners and/or formats. Therefore, there is a need to harmonize measurement data so that it is provided to stakeholders in a unified data representation and in standard manners.

The specific adoption of these different aspects in a single measurement campaign depends on how the measurement data is going to be used. To this end, different methodologies on how to provide data are to be addressed. The Fed4FIRE monitoring and measurement framework presented in this document should be able to cover all aspects of this monitoring lifecycle in an efficient way, with a high level of user satisfaction.
2.3.2 Monitoring and measurement metrics overview

One very important aspect of monitoring and measuring is that of the metrics that should be collected. This requirement varies from one facility to another (wired, wireless, etc.) and amongst services and platforms; and experimenters have different interests, based on their experiments and the resources and services they use. Because of this heterogeneity, numerous metrics are to be measured at multiple levels: component level (physical and virtual), network level, traffic level, and service/software level. The remainder of this section gives some examples of metrics whose measurement could be of importance (to facility and service providers and to experimenters) in different domains, such as cloud infrastructures, cloud services, wireless networks (Wi-Fi or cellular networks), and virtualised or non-virtualised infrastructures. In this deliverable, measurement metrics are classified in four categories:

• component-level metrics
• network metrics
• traffic metrics
• software metrics
2.3.2.1 Component-level metrics

This category covers metrics from both virtual and physical (computing, network or storage) devices. There are many metrics to be addressed here, related to performance, storage, the OS and processes, including the following:
Memory: total memory, used memory, free memory, total swap memory, used swap memory, free swap memory
•
Storage: total storage, used storage, free storage
•
CPU: CPU load, CPU utilisation, CPU count, CPU idle time, Cumulative CPU time, Cumulative CPU usage, Number of CPUs used by VM (in case of physical machines)
•
I/O reads/writes, the amount of I/O operations
•
OS: number of users, max number of processes
•
Processes: Number of processes, number of running processes
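Most of these component-level metrics map directly onto the built-in item keys of off-the-shelf monitoring agents. As an illustrative sketch, a subset of the above could be collected with the following Zabbix agent item keys (Zabbix is one of the tools selected in section 3.3; the exact key syntax may differ between Zabbix versions):

    system.cpu.load[all,avg1]     # CPU load (1-minute average)
    system.cpu.util[,idle]        # CPU idle time
    vm.memory.size[total]         # total memory
    vm.memory.size[available]     # available memory
    system.swap.size[,free]       # free swap memory
    vfs.fs.size[/,used]           # used storage on the root filesystem
    proc.num[]                    # number of processes
    proc.num[,,run]               # number of running processes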
2.3.2.2 Network metrics

Network metrics qualify the static aspects of a network deployment. They include connectivity, network speed, topology detection, accounting, reachability, signal quality, signal strength, noise level, interference, data transfer rates, Radio Frequency (RF) quality, throughput, available bandwidth, utilisation (bandwidth, protocol, ports), protocol analysis, per-node and per-channel statistics, and IP connection statistics: IP addresses, ports, sessions, supported clients, authentication and de-authentication rates.

2.3.2.3 Traffic metrics

Traffic metrics capture the more dynamic aspects of what is happening on a network. Many are of interest in traffic measurement and monitoring, such as link utilisation, packet arrival rate, packet loss, delay, jitter, number of flows, flow type, flow volume, traffic density, cost, route performance, the number of bytes, packets and flows per time slot, and VoIP analysis: call flow, signalling sessions, registrations, media streams or errors.

2.3.2.4 Software metrics

Software metrics provide information about the applicative code running on a node. This category covers metrics which provide information about the state of the (service) software, its performance, and other software-specific information. It also includes custom metrics identified by users. These metrics are heterogeneous and vary depending on the applications under study. Examples include memory consumption, errors and warnings, or internal buffers.
2.3.3 Consolidated summary of testbeds' inputs

The tables presented here provide a consolidated view of the concrete requirements of each Fed4FIRE testbed deployment. The corresponding details per testbed can be found in the appendices. Table 2 presents the current software deployments, Table 3 the requirements of each testbed and whether they are currently met or not, and Table 4 summarizes the metrics of interest and identifies commonalities.
• PlanetLab Europe: facility monitoring: varied, mainly Nagios; node/flow monitoring: CoMoN, TDMI, PlanetFlow, MyPLC and MySlice; experimental measurements: varied, OML supported
• Virtual Wall: facility monitoring: Zabbix, EmuLab scripts, Zenoss
• OFELIA: facility monitoring: polling of the OCF AM; node/flow monitoring: OpenFlow, polling of the OCF AM; experimental measurements: OML planned
• DWDM Ring Testbed: (manual)
• ADVA ROADM: (manual)
• BonFIRE: facility monitoring: Nagios for services, Zabbix; experimental measurements: Zabbix
• Grid'5000: facility monitoring: Nagios, Munin, Cacti, smokeping, ganglia, g5k-checks
• FUSECO Playground: facility monitoring: Zabbix, SNMP; node/flow monitoring: TUB Packet Tracking, scripts; experimental measurements: OML
• NITOS: facility monitoring: CM cards, scripts; experimental measurements: OML
• Netmode: facility monitoring: Nagios; experimental measurements: OML
• w-iLab.t: facility monitoring: PDUs, PoE, scripts; node/flow monitoring: Wi-Fi connectivity tool; experimental measurements: OML
• SmartSantander: facility monitoring: (OML); experimental measurements: OML
• NICTA: facility monitoring: CM cards, scripts; node/flow monitoring: ad hoc tools; experimental measurements: OML, Zabbix supported

Table 2: Consolidated data on software tools from partners' input. Missing entries are assumed to be handled manually.
• Node (servers, switches, sensors, etc.) availability. Supported in: PlanetLab Europe, BonFIRE, Virtual Wall, OFELIA, Grid'5000, SmartSantander, NICTA. Required by: w-iLab.t
• Node performance measurements. Supported in: PlanetLab Europe, BonFIRE, FUSECO Playground, Grid'5000. Required by: NITOS, SmartSantander, w-iLab.t
• Path measurements between nodes. Supported in: Virtual Wall (manually initiated by experimenter), FUSECO Playground, NICTA
• Information about the location of nodes. Supported in: PlanetLab Europe, FUSECO Playground (limited), NICTA
• Infrastructure monitoring for facility providers and operations (health, performance, etc.). Supported in: Grid'5000, FUSECO Playground, w-iLab.t, SmartSantander. Required by: Experimenta, OpenFlow-enabled ADVA ROADMs
• Experimentation metrics. Supported in: SmartSantander, NICTA
• Information about the changeable network topologies. Supported in: PlanetLab Europe, BonFIRE, NICTA, w-iLab.t. Required by: OFELIA, Experimenta, OpenFlow-enabled ADVA ROADMs, NITOS, w-iLab.t
• Experiment monitoring (e.g. experiment statistics). Supported in: PlanetLab Europe, BonFIRE, OFELIA, NITOS, w-iLab.t. Required by: Grid'5000, Virtual Wall, FUSECO Playground, Netmode
• Permanent storage for monitoring data. Supported in: BonFIRE, Virtual Wall, Netmode, PlanetLab Europe, w-iLab.t, NICTA. Required by: all testbeds
• Experiment and flow statistics. Supported in: BonFIRE, NICTA, Virtual Wall. Required by: OFELIA, Experimenta, OpenFlow-enabled ADVA ROADMs, NITOS, w-iLab.t
• Monitoring information to cloud services experimenters about the physical machines hosting their virtual machines. Supported in: BonFIRE. Required by: FUSECO Playground
• Path and network monitoring. Supported in: NICTA. Required by: NITOS, Netmode, FUSECO Playground, NICTA
• Wireless connectivity. Supported in: NICTA, w-iLab.t
• Sensor measurements and environment depiction. Required by: NITOS, w-iLab.t, NICTA

Table 3: Consolidated data on monitoring requirements from partners' input. Some of these are already supported by some facilities.
• Device status (ON/OFF). Supported in: BonFIRE, Grid'5000, FUSECO Playground, NITOS, Netmode, NICTA, Virtual Wall (admins only), w-iLab.t. Required by: OFELIA, Experimenta
• Physical interfaces (ON/OFF). Supported in: BonFIRE. Required by: OFELIA, Experimenta, w-iLab.t
• Virtual interfaces (ON/OFF). Supported in: BonFIRE. Required by: OFELIA, Experimenta, w-iLab.t
• Port status. Required by: OFELIA, Experimenta, OpenFlow-enabled ADVA ROADMs
• Memory: total memory, used memory, free memory, total swap memory, used swap memory, free swap memory. Supported in: PlanetLab Europe, Virtual Wall, BonFIRE, Grid'5000, FUSECO Playground, NICTA. Required by: OFELIA, NITOS, w-iLab.t, Netmode
• CPU: CPU load, CPU utilisation, and CPU idle time. Supported in: PlanetLab Europe, Virtual Wall, BonFIRE, Grid'5000, FUSECO Playground, NICTA. Required by: OFELIA, NITOS, w-iLab.t, Netmode
• VMs: list of VMs, ON/OFF; CPU: total number of CPUs, core speed, L1/L2/L3 cache memory, total CPU utilisation, CPU utilisation per VM. Supported in: BonFIRE. Required by: OFELIA
• Storage: total capacity, used capacity, free capacity. Supported in: BonFIRE, NICTA. Required by: OFELIA, w-iLab.t, SmartSantander
• Temperature. Required by: OpenFlow-enabled ADVA ROADMs, FUSECO Playground, w-iLab.t
• Network speed, reachability, IP addresses, connectivity. Supported in: Virtual Wall, NICTA. Required by: OpenFlow-enabled ADVA ROADMs, w-iLab.t
• Network topology. Required by: Experimenta, OpenFlow-enabled ADVA ROADMs, w-iLab.t, NICTA, NITOS, Virtual Wall
• Packet loss, delay, jitter. Supported in: PlanetLab Europe, Grid'5000, FUSECO Playground, Netmode, SmartSantander, NICTA. Required by: OFELIA, OpenFlow-enabled ADVA ROADMs, FUSECO Playground, w-iLab.t, Netmode
• IP connection statistics. Supported in: NICTA. Required by: OFELIA, Experimenta
• Available bandwidth, bandwidth usage on all links in the experiment topology. Supported in: PlanetLab Europe, Virtual Wall, NICTA. Required by: OpenFlow-enabled ADVA ROADMs
• Flow statistics. Required by: OFELIA, Experimenta
• Running services (SSH, FTP, etc.). Supported in: BonFIRE, Grid'5000, Netmode. Required by: w-iLab.t
• Inbound and outbound traffic on each interface. Supported in: Grid'5000, NICTA. Required by: w-iLab.t, NICTA
• Noise floor of the Wi-Fi cards, channels used on the Wi-Fi cards. Supported in: NICTA. Required by: NITOS, w-iLab.t, Netmode
• Indications about power on/off status from the Chassis Management card. Supported in: SmartSantander, NICTA, w-iLab.t. Required by: NITOS
• Signal quality, signal strength, noise level, interference. Supported in: NICTA, w-iLab.t. Required by: FUSECO Playground, NITOS, w-iLab.t, Netmode, NICTA
• Throughput of each channel. Required by: FUSECO Playground, NITOS, w-iLab.t
• Battery status, consumed power. Supported in: SmartSantander. Required by: w-iLab.t, NICTA
• Position of node. Supported in: SmartSantander, NICTA, w-iLab.t. Required by: w-iLab.t, NICTA

Table 4: Consolidated data on required metrics to be measured from partners' input. Some of these are already supported by some facilities.
3 Implementation of the architectural functional elements

3.1 Introduction

In this section we discuss every functional element of the D2.1 architecture that is related to WP6, and define how it will be implemented. It is possible that an available piece of software or tool, or a combination of such software, will be used as a starting point. It is also possible that some elements will be implemented from scratch.
3.2 Evaluation of possible approaches for implementation

The backbone of all three monitoring/measurement needs identified in D2.1 "First Federation Architecture" comprises two main classes of elements, in charge of (i) obtaining the readings, and (ii) making them accessible to the relevant stakeholders. This section, partly based on contents from [16], reviews the state of the art in these areas. While some tools cater for both aspects, they are classified in the group which is most relevant to their prime usage. Based on the pros and cons of the studied solutions, some of them are selected as the initial Fed4FIRE approach; well-understood alternatives are proposed in some cases. The goal is to direct testbeds still lacking specific capabilities towards a preferred core solution. Testbeds already supporting these capabilities can retain their solution, but will need to ensure its integration into the rest of the federation.

3.2.1 Data acquisition
3.2.1.1 Facility and Infrastructure Monitoring

Facility and infrastructure monitoring are essentially the same basic task; they vary, however, in terms of the precision and availability of the data, and of the receiving stakeholder. Facility monitoring consists of exhaustive data aggregated into status summaries about the testbed, for use by the provider, while infrastructure monitoring provides detailed measurements about a relevant subset of the testbed to the experimenter. Based on the review of the previous section, apart from ad hoc tools, a few off-the-shelf tools are used for various levels of monitoring.

Some of the testbed control frameworks, such as OMF and OCF, natively include some facility-monitoring functionality in the form of their Aggregate Manager. These can be complemented by physical chassis management (CM) cards.

In commercial, and also experimental, cloud infrastructures, many monitoring systems or solutions are being used. Examples are Zabbix [42][7], Nagios [44], Ganglia [43], EVEREST [41], Groundwork [45], MonALISA [46], CloudStatus [47], and CA Nimsoft Monitor [48].
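To give an idea of the deployment effort involved, facility checks in Nagios are declared through object definitions such as the following minimal sketch (the host name, address and check thresholds are placeholders, not prescriptions for Fed4FIRE):

    # Hypothetical Nagios object definitions for a basic facility check.
    define host {
        use                  generic-host
        host_name            node01
        address              10.0.0.1
        check_command        check-host-alive
    }

    define service {
        use                  generic-service
        host_name            node01
        service_description  PING
        check_command        check_ping!100.0,20%!500.0,60%
    }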
In addition to these, many other monitoring architectures are already deployed in cloud infrastructures [31][32][33][34][35][36][37]. Moreover, the BonFIRE monitoring solution based on Zabbix is used to monitor federated cloud infrastructures [49]. However, these architectures focus only on wired cloud infrastructures. Furthermore, the heterogeneity considered in these architectures concerns the virtualization solutions used, but not the variety of hardware infrastructures.

With the addition of a plugin (PNP), Nagios can log historical performance data in round-robin databases (based on RRDTool), and provide graphs of their evolution. Similarly, both Munin and Cacti store historical performance data in RRDTool databases, and also provide instrumentation tools to poll distributed nodes and collect this information. While RRDTool is a well-known and widespread tool for the storage of performance data, its basic operation relies on reducing the resolution of old records over time, which might not be desirable in the context of infrastructure monitoring.

Collectd is an extensible daemon that collects and stores performance data from network equipment and from the computer it is running on. Through the use of plugins, it can be extended to monitor various aspects of the nodes, such as common application servers and specific system metrics. It can also query information from SNMP-enabled nodes and, through the libvirt plugin, monitor guest virtual machines. Its default storage backend relies on RRDTool, from which time-based graphs can be generated by external tools. However, a writer plugin is available which makes the data available through OML.
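Collectd's plugin mechanism also includes a Python plugin, so node- or testbed-specific metrics can be added with a few lines of code. The following is a minimal sketch of a read callback dispatching a single, purely illustrative, gauge value (the plugin name is hypothetical):

    # Minimal collectd Python plugin, loaded through collectd's "python" plugin.
    import collectd

    def read_callback(data=None):
        # Dispatch one gauge value; collectd forwards it to whichever writer
        # plugins are configured (RRDTool, network, the OML writer, ...).
        vl = collectd.Values(type='gauge', plugin='example_testbed_metric')
        vl.dispatch(values=[42.0])

    collectd.register_read(read_callback)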
nmetrics is a multi-platform library that allows querying similar system runtime parameters, such as load, memory or network use, in a system-agnostic way. It is not as thorough as collectd, however, and does not cater for remote reporting, but it might be a lighter infrastructure monitoring solution for low-powered nodes. An OML instrumentation for this library is available.

DIMES [21] allows measurement probes to be deployed throughout the Internet. Its goal is, however, oriented more towards measuring the live Internet than planned experiments. PlanetFlow [5] and CoMon [3] provide flow logging and slice or node (i.e., infrastructure) monitoring for PlanetLab, including sophisticated query mechanisms.

Table 5 shows a summary of these tools.
OMF/OCF, CM cards
• Advantages: facility monitoring
• Disadvantages: limited information

Nagios (selected as initial approach)
• Advantages: alert management; plugin support; plugin for historical infrastructure monitoring (but RRDTool-based); already deployed in 3 testbeds
• Disadvantages: ad hoc storage (but SQL export scripts available)

Zabbix (selected as initial approach)
• Advantages: supports both facility and infrastructure monitoring; alert management; SQL storage; plugin support; VM monitoring; agent-less monitoring; SNMP support; support for remote collection to a centralised server; already deployed in 4 testbeds
• Disadvantages: the SQL database can become huge and unresponsive in certain cases; not always very intuitive

Zenoss
• Advantages: supports Nagios plugins

Munin, Cacti
• Advantages: good for infrastructure monitoring
• Disadvantages: no facility monitoring; RRDTool backend (loss of resolution on old data)

Collectd (selected as initial approach)
• Advantages: plugin support; libvirt for VM monitoring; OML writer; SNMP support; support for remote collection to a centralised server
• Disadvantages: needs a local (or SNMP) agent

nmetrics
• Advantages: OML-instrumented application available; good for lightweight infrastructure monitoring
• Disadvantages: library, not a stand-alone application; limited monitored metrics; no remote reporting

TopHat / TDMI / MySlice (selected as initial approach)
• Advantages: infrastructure (network, flows) monitoring; federated; support for external queries
• Disadvantages: TopHat: only running above Team Cymru, MaxMind, TDMI; MySlice: only running above TopHat, SFA

DIMES
• Advantages: allows measurement of the live Internet
• Disadvantages: not deployed in any of the involved testbeds

PlanetFlow
• Advantages: infrastructure (flows) monitoring; already deployed in 2 testbeds; support for external queries
• Disadvantages: only deployed on PlanetLab

CoMon
• Advantages: infrastructure monitoring; support for external queries
• Disadvantages: currently not maintained

Observium
• Advantages: RRD storage; SNMP-based; collectd integration; IPMI integration; minimal install effort; application monitoring (direct or via collectd); LDAP authentication
• Disadvantages: adding non-SNMP devices not supported; monitoring data in RRD only

dstat
• Advantages: low-level info available; flexible; well-known tool
• Disadvantages: granularity over time limited to 1/s

Table 5: Consolidated data on facility and infrastructure monitoring tools
3.2.1.2 Experimental Measurement

For use as part of ongoing experiments, the networking community has been developing and using several types of measurement tools. The most common ones focus on traffic capture and analysis. Among the best known Internet measurement software tools are tcpdump and Iperf [17]. The former has been shown to report accurately at capture rates up to gigabits per second [18], while the latter allows researchers to generate a traffic load to evaluate the capacity of a network or the resilience of a system; the authors of [19] showed that it generated the highest load on network paths compared to a number of other traffic generators. High-performance and versatile hardware solutions have also been developed, such as DAG or NetFPGA [20].
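As a point of reference, typical invocations of these two software tools are sketched below (the interface, host and traffic parameters are placeholders):

    # Capture traffic on eth0 into a trace file with tcpdump
    tcpdump -i eth0 -w experiment.pcap

    # Generate a 10 Mbit/s UDP test load for 60 s with iperf
    iperf -s -u                                    # on the receiver
    iperf -c receiver.example.org -u -b 10M -t 60  # on the sender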
The TopHat Dedicated Measurement Infrastructure (TDMI) is a measurement infrastructure based on TopHat. It consists of modular probing agents that are deployed in a slice on various PlanetLab nodes and probe the underlying network in a distributed, efficient manner. In addition, they probe outwards to a number of target IP addresses that are not within PlanetLab. The aim of TDMI is to obtain the basic information required by TopHat. It implements algorithms such as Paris Traceroute to remove the artifacts arising from the presence of load balancers in the Internet. TDMI aims at providing the necessary information to TopHat users about the evolution of the overlay, and focuses on catching the dynamic aspects of the topology.

There are several research activities focusing on the federation of heterogeneous infrastructures towards large-scale Future Internet experimental facilities, in which various measurement and monitoring solutions are used. For instance, the monitoring architecture of the EU FP7 NOVI project [40] uses four monitoring tools deployed across heterogeneous virtualized infrastructures: (1) Multi-Hop Packet Tracking, an efficient passive one-way delay measurement solution; (2) HADES (Hades Active Delay Evaluation System), which is used for one-way delay, loss statistics and hop count information; (3) SONoMA, which provides experimenters with monitoring information about the status of the measurement agents and about the state of measurement processes; and (4) packet capturing cards for line speeds up to 10 Gbit/s.

The Multi-Hop Packet Tracking tool of (1) is characterized by the fact that it records detailed hop-by-hop metrics like delay and loss (traffic engineering). Packet tracking also enables measurements of environment conditions, like cross-traffic and its influence on the user or experimenter traffic. It also allows tracking single packets through the network, which supports "trace-back systems" by deriving the source of malicious traffic and revealing the location of the adversary. The tool also adopts a hash-based packet selection technique that ensures a consistent selection throughout the network while maintaining statistically desired features of the sample. It can also efficiently export measurement results with IPFIX, and it provides the experimenter with a choice of suitable packet ID generation functions. It can reduce measurement traffic with hash-based packet selection, and it is able to visualize the measurement results. Finally, it can synchronize the sampling fractions in the network. Some disadvantages of the Multi-Hop Packet Tracking tool are the fact that it cannot measure passively if there is no traffic, that hash calculation and measurement export require resources, and that it requires time synchronisation of the nodes for delay measurements.

CoMo (Continuous Monitoring) [24] is a network measurement system based on packet flows. It has core processes that are linked in stages, namely packet capture, export, storage, and query. These processes are linked by user-defined modules, used to customize the measurement system and implement filtering functions. The query process provides an interface for distributed queries on the captured packet traces. It is a highly tailored tool designed for efficient packet trace capture and analysis.

Most of the tools mentioned here do not share any common output format. Data collection and post-processing are therefore required before any data can be cross-analyzed. The next section reviews tools that allow doing so.

Table 6 shows a summary of the tools described in this section. It only offers a description of their advantages and disadvantages, and does not prescribe any selection as part of the Fed4FIRE
implementation. This choice is left to the experimenters, who will have the freedom to deploy their preferred tools as part of their experiment. Indeed, the integration of testbeds in Fed4FIRE only requires that facility and infrastructure measurements be readily available to the federation, and that experimental measurements be accessible by the experimenter from a remote testbed. The next section discusses the remote data collection and reporting tools which will be used for that purpose.

Iperf
• Advantages: well-known tool; OML instrumentation; TCP, UDP support; DCCP, SCTP in some flavours
• Disadvantages: no unified output (by default); no remote reporting (by default); segmented codebase

D-ITG
• Advantages: OML instrumentation; different traffic profiles; TCP, UDP, DCCP support
• Disadvantages: requires precise node synchronisation

OTG
• Advantages: OML instrumentation; different traffic profiles; modular; TCP, UDP support
• Disadvantages: no DCCP nor SCTP support

Tcpdump
• Advantages: well-known tool
• Disadvantages: no unified output, which can differ strongly based on the chosen options; no remote reporting

libtrace
• Advantages: OML instrumentation available; Radiotap support
• Disadvantages: not a standalone tool

DAG
• Advantages: fast processing
• Disadvantages: no unified output; no remote reporting

NetFPGA
• Advantages: fast processing
• Disadvantages: no unified output; no remote reporting

Multi-Hop Packet Tracking
• Advantages: detailed hop-by-hop metrics like delay and loss (traffic engineering); environment conditions like cross-traffic; tracking single packets through the network; hash-based packet selection technique; export of measurement results with IPFIX; visualization of measurement results
• Disadvantages: cannot measure passively if there is no traffic; hash calculation and measurement export require resources; time synchronisation of nodes required for delay measurements

PlanetFlow
• Advantages: TCP, UDP, ICMP support; well-known data format (SiLK format); NetFlow query system; fast and extensive querying facilities; web GUI access
• Disadvantages: only deployed on PlanetLab

Table 6: Measurement tools
3.2.2 Collection and Reporting

In the previous sections, attention was given to data acquisition. In this section, the focus shifts to techniques to make this data available. Several solutions exist to instrument, and collect information from, various networking applications and devices. Generic reporting tools include SNMP [26], already leveraged by some of the monitoring tools reviewed in the previous section, DTrace [27], OML [16] and INSTOOLS [50]. They all allow the instrumentation of any software and/or devices. In addition, DTrace can dynamically instrument live applications, and is shipped by default with some operating systems. However, its measurement processing is limited to aggregating functions, and it does not support the streaming of measurements from different devices to a remote collection point. The INSTOOLS monitoring framework [50] is a system of instrumentation tools that enables GENI users to monitor and understand the behaviour of their experiments, by automatically setting up and initializing experiment-specific network measurement and monitoring capabilities on behalf of users.

SNMP has been widely adopted for the management and monitoring of devices, and allows the collection of information over the network. However, it has some performance and scaling limitations when measurements from a large number of devices are required within a short time window [19]; SNMP is also constrained to only reporting information predefined in its management information base (MIB). IPFIX [29] is an IETF standard which leverages SNMP's MIB and defines a protocol for streaming information about IP traffic over the network. IPFIX exporters stream collected, and potentially filtered, measurements to collector points. While IPFIX was initially limited to measurements about IP flows, an extension [30] allows custom types to be specified for the reported data, covering a wider range of metrics than just flows.

OML [16] is a generic framework that can instrument the whole software stack, and take input from any sensor with a software interface. It has no preconception of the type of software to be instrumented, nor does it force a specific data schema. Rather, it defines and implements a reporting protocol and a collection server. On the client side, any application can be instrumented using libraries abstracting the complexity of the network communication. Additionally, some of the libraries provide in-band filtering, allowing the measurement streams obtained from an instrumented application to be adapted to the requirements of the current observation (e.g., time averages rather than raw metrics). Applications for which the code is not available can also be integrated in the reporting chain by writing simple wrappers using one of the supported scripting languages (Python or Ruby). After collection from the distributed application, the timestamped data is stored in an SQL database (SQLite3 or PostgreSQL), grouped by experimental domain; a server can collect data from several domains at the same time.
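To illustrate the client side, the sketch below uses the oml4py Python binding to declare a measurement point and inject samples into it. The application name, measurement point schema and collector address are placeholders, and the exact binding API may differ slightly between OML versions:

    # Minimal OML client sketch using the oml4py binding.
    import oml4py

    oml = oml4py.OMLBase("myapp", "my_experiment_domain", "sender01",
                         "tcp:oml-collector.example.org:3003")
    # Declare a measurement point: a named group of typed metrics.
    oml.addmp("packet_stats", "seq:int32 rtt_ms:double")
    oml.start()

    # Inject one sample per call; samples are timestamped and streamed
    # to the collection server, which stores them in an SQL database.
    for seq, rtt in [(1, 12.3), (2, 11.8)]:
        oml.inject("packet_stats", [seq, rtt])

    oml.close()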
29 of 70
© Copyright NICTA and other members of the Fed4FIRE consortium 2013
FP7-‐ICT-‐318389/NICTA/REPORT/PUBLIC/D6.1 than raw metrics). Applications for which the code is not available can also be integrated in the reporting chain by writing simple wrappers using one of the supported scripting language (Python or Ruby). After collection from the distributed application, the timestamped data is stored in an SQL database (SQLite3 or PostgreSQL), grouped by experimental domain; a server can collect data from several domains at the same time. Netcat [59] is a featured networking utility which reads and writes data across network connections, using the TCP/IP protocol. In its essence, Netcat is a solution that allows to easily communicate text over a tunnel. Hence it is rather simple to understand. It is also flexible and easy in the way that it can be used, since you can pipe output of other tools into it on the command line. A downside is that Netcat provides no other functionality than the text over tunnel. Features such as filtering or persistence are not natively supported by Netcat. MINER [25] is a solution sharing a lot of similarities with OML. It comprises a measurement architecture as well as elements of a management framework. The MINER tools are Java components that may provide measurement results directly, or may be wrappers around external libraries or applications that do the actual measurements. Unfortunately, MINER is not open source software, which limits its extensibility. TopHat is a measurement system offering a dedicated service that provides network topology information, as measured by the TDMI described above. It supports the entire lifecycle of an experiment, from assisting users during experiment setup in choosing the nodes on which the experiment will be deployed (facility monitoring) to providing live information to support adaptive applications and experiment control and supporting retrospective analysis through access to archived data (infrastructure monitoring). Additionally, TopHat-‐instrumented testbeds can be federated, and data made available to external users through MySlice. Tool Advantages Disadvantages Selected as final approach SNMP • Standard measurement • No remote reporting systems • Unified reporting DTrace • Dynamic instrumentation of • Limited to aggregating live applications functions • Default installed on some OS • No remote reporting IPFIX • Unified reporting • Limited representable information (but extensible) • Remote reporting Netcat
• Simple to understand (text over tunnel) • Simple to use (can pipe output of other tools into it on the command line)
• Limited functionality (e.g. no persistence to a database as part of Netcat)
30 of 70
© Copyright NICTA and other members of the Fed4FIRE consortium 2013
FP7-‐ICT-‐318389/NICTA/REPORT/PUBLIC/D6.1 OML
• Unified reporting • Only reporting and collection: no measurement • Centralized reporting • Already deployed or planned in 5 testbeds • In-‐line filtering
MINER
• Similar in functionality as OML
• Not open source, hence less extensible
TopHat
• Allows federation of data sources • Can be plugged into the Fed4FIRE portal (which uses MySlice technology)
• Not intended for collecting of data on the level of the actual resources.
X
X
Table 7: Collection and reporting systems
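The in-line filtering listed among OML's advantages above can be pictured as follows. This is a conceptual sketch in plain Python, not the OML filter API itself; the window size and the RTT metric are arbitrary.

```python
# Illustrative sketch of OML-style in-line filtering: instead of reporting
# every raw sample, a filter condenses a window of samples into one report
# before it leaves the application. This mimics OML's averaging filter
# conceptually; it is not the OML API.

class TimeAverageFilter:
    """Average all samples injected within a reporting window."""

    def __init__(self, window):
        self.window = window          # number of raw samples per report
        self.samples = []

    def inject(self, value):
        """Add a raw sample; return an averaged report when the window fills."""
        self.samples.append(value)
        if len(self.samples) < self.window:
            return None               # keep buffering, nothing to report yet
        report = sum(self.samples) / len(self.samples)
        self.samples = []
        return report

# Example: report one averaged value for every 5 raw RTT samples.
f = TimeAverageFilter(window=5)
for rtt in [10.2, 11.0, 9.8, 10.5, 10.1, 12.0]:
    out = f.inject(rtt)
    if out is not None:
        print("reported average RTT: %.2f ms" % out)
```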
3.2.3 Summary
There is large variability in the tools currently deployed on the various testbeds. A few commonalities can be found, mostly among the monitoring solutions, where Nagios and Zabbix are primarily used. It is however worth noting that Zabbix natively caters for both facility and infrastructure data management, while Nagios only provides the former. It is on the experimental measurement side that the variability shows the most, with many different and sometimes ad hoc tools. This disparity can be solved through the use of a middleware measurement system in charge of reporting samples from heterogeneous distributed tools in a unified and centralisable way. Here, a commonality can be identified around OML, with quite a few active or planned deployments. Its lightweight API is also a good match for the instrumentation of the various measurement tools in use. Federated access to measurement data is also important. TopHat, with its ability to query distributed data sources and its MySlice support, is probably a good candidate for this task.
3.3 Details of the selected monitoring and measuring tools in Fed4FIRE
From the surveys in the previous sections, the most widespread commonality is the use of OML as a collection and reporting framework. It allows any sort of system to be instrumented through the abstraction of a measurement point, describing a group of metrics. This abstraction allows more latitude in the choice of measurement tools: as long as they conform to the same measurement point for the same information, their output can be used interchangeably in further analysis. Selecting OML for reporting purposes therefore allows flexibility in the choice of measurement tools, both for monitoring and measurement tasks, as well as a unified way to access the collected data.
This however only caters for collection and storage, not direct access to or visualization of the data, let alone from a federated environment. Another tool therefore needs to be identified for this purpose. From the previous sections, TopHat fits the bill, thanks to its ability to run queries over distributed systems and data stores, and its pre-existing deployments. Facility and infrastructure monitoring tasks require specific metrics about the testbed and its nodes to always be made available. While some deployments already have solutions in place, the recommended ones for the others are, in order of preference, Zabbix, Nagios or Collectd.
Figure 3: Proposed cycle 1 measurement architecture for Fed4FIRE. Elements in bold are the default proposal for new deployments on a canonical testbed
Overall, this caters for the measurement architecture shown in Figure 3. The rest of this section describes these tools in more detail, presented according to their order of appearance along the measurement-to-analysis chain.
3.3.1 Monitoring (Zabbix, Nagios)
Zabbix is an open-source solution for facility and infrastructure monitoring. It supports performance monitoring natively, in addition to facility monitoring and alerting. It also supports an extensive list of operating systems and platforms, including virtual machines. Three types of resource components are available for monitoring: native Zabbix agents, SNMP monitoring, and agentless script-based queries; all data is then aggregated within a central collection server which relies on SQL databases (MySQL or PostgreSQL) for storage.
Nagios is another open-source base for infrastructure monitoring solutions. Unlike Zabbix, it does not support performance monitoring natively. It provides status reports for hosts, applications and networks. Nagios can also be extended through the use of user scripts run by the Nagios Remote Plugin Executor (NRPE). It has built-in support for raising alerts on problematic situations. Data is stored in an ad hoc backend, but some plugins allow export to SQL databases. Data can also be processed and exchanged between instances using the Nagios Remote Data Processor (NRDP).
Both tools can be extended through the use of plugins, and plugins written for Nagios are also reusable with the Zenoss monitoring tool.
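For illustration, a check run by NRPE is simply an executable following the Nagios plugin convention: print a one-line status message (optionally followed by performance data after a pipe character) and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch, with arbitrary load thresholds:

```python
#!/usr/bin/env python
# Minimal sketch of a Nagios-style check, suitable for execution via NRPE.
# Plugins communicate through their exit code (0=OK, 1=WARNING, 2=CRITICAL,
# 3=UNKNOWN); the first output line is shown in the Nagios UI.
import os
import sys

WARN, CRIT = 4.0, 8.0   # illustrative 1-minute load thresholds

try:
    load1, _, _ = os.getloadavg()
except OSError:
    print("LOAD UNKNOWN - cannot read load average")
    sys.exit(3)

if load1 >= CRIT:
    print("LOAD CRITICAL - load1=%.2f | load1=%.2f" % (load1, load1))
    sys.exit(2)
elif load1 >= WARN:
    print("LOAD WARNING - load1=%.2f | load1=%.2f" % (load1, load1))
    sys.exit(1)
print("LOAD OK - load1=%.2f | load1=%.2f" % (load1, load1))
sys.exit(0)
```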
3.3.2 Collection and Reporting (OML)
OML is an instrumentation tool that allows application writers to define customizable measurement points (MPs) inside new or pre-existing applications. Experimenters running the applications can then direct the measurement streams (MSs) from these MPs to remote collection points, for storage in measurement databases. It consists of two main components for injection and collection, plus an optional proxy:
• The OML client libraries: the OML client library provides a C API for applications to collect the measurements that they produce. The library includes a dynamically configurable filtering mechanism that can perform some processing on each measurement stream before it is forwarded to the OML server. The C library, as well as native implementations for Python (OML4Py) and Ruby (OML4R), are maintained.
• The OML server: the OML server component is responsible for collecting and storing measurements in a database. Currently, SQLite3 and PostgreSQL are supported as database backends.
• The optional OML proxy server: when running experiments involving disconnections from the control/measurement network (such as with mobile devices), the proxy server can be used to temporarily buffer measurement data until a connection is available to transfer it to the collection server.
OML can be used to collect data from any source, such as statistics about network traffic flows, CPU and memory usage, input from sensors such as temperature sensors, or GPS location measurement devices. It is a generic framework that can be adapted to many different purposes.
It is mainly used as the measurement part of OMF-based testbeds, as a way to collect and process data from distributed experiments, but it can also be used as a standalone reporting system. Moreover, any activity that involves measurement on many different computers or devices connected by a network can benefit from using OML: its reporting system allows better collection of experimental data, and its flexible schema-based definition of measurement samples supports data reuse and sharing.
An instrumentation using the C library usually consists of a few additions to the application code, in the form of some initialization code and some injection code. The same goes for the Python and Ruby bindings, though they can also be used to instrument an application for which source code is not available by, e.g., parsing its output or log files.
In the context of Fed4FIRE, there is however an additional need to unify the structure of measurements coming from different applications with similar purposes. OML does not cater for this, and a separate coordination effort will therefore be needed on this aspect.
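As an illustration of this instrumentation pattern, the sketch below uses the OML4Py bindings. The application name, domain, measurement point and collection URI are placeholders, and the exact API should be checked against the OML documentation.

```python
# Illustrative OML4Py instrumentation: define a measurement point (MP),
# then inject typed samples that are streamed to an OML collection server.
# The application name, domain and server URI below are placeholders.
import oml4py

oml = oml4py.OMLBase("myapp", "mydomain", "node1", "tcp:oml.example.org:3003")
oml.addmp("rtt", "destination:string rtt_ms:double")  # MP schema
oml.start()

oml.inject("rtt", ["10.0.0.1", 23.4])   # one typed sample per injection
oml.inject("rtt", ["10.0.0.2", 42.1])

oml.close()   # flush and close the measurement streams
```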
3.3.3 Federated Queries (TopHat)
TopHat collects data from various sources (TDMI, MaxMind, Team Cymru, ...) and aggregates them in order to expose enriched measurement data to the user. These data can typically be used for monitoring or, more generally, to gain a better understanding of the network. TopHat provides measurement data and is built on top of the Manifold framework (as is MySlice, which provides testbed-oriented data). Manifold allows the user to query various sources of data through a single API (in the case of TopHat, measurements through the TopHat API) while relieving the user from needing to know which platforms must be queried. Each source of data announces what kind of data it provides according to a common ontology. Manifold dispatches the user queries to each relevant platform, collects their replies, combines them, and sends the result to the user. For example, TDMI provides traceroute measurements, while MaxMind can map an IP address to a city name; one can thus query TopHat to retrieve traceroute measurements from Paris to New York. Since each platform uses its own API (database, web service, ...), each query issued by TopHat/Manifold is translated into the platform's API by a dedicated gateway. In the same way, the reply of the platform is translated by the gateway to be expressed in the TopHat/Manifold format.
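This dispatch pattern can be pictured as follows. The sketch is purely hypothetical: the class and method names are invented for illustration and do not correspond to the actual Manifold code base.

```python
# Hypothetical illustration of the Manifold dispatch pattern described above:
# a query against a common ontology is routed to per-platform gateways that
# translate it into each platform's native API. All names are invented.

class TDMIGateway:
    def query(self, fields, filters):
        # Would call the TDMI measurement API and normalise field names.
        return [{"src": "paris", "dst": "new-york", "hops": 14}]

class MaxmindGateway:
    def query(self, fields, filters):
        # Would look up IP-to-city mappings in the MaxMind database.
        return [{"ip": "192.0.2.1", "city": "Paris"}]

class Dispatcher:
    """Routes a user query to the platform announcing the queried object."""

    def __init__(self):
        self.platforms = {"traceroute": TDMIGateway(), "geoip": MaxmindGateway()}

    def query(self, obj, fields, filters=None):
        gateway = self.platforms[obj]          # pick the relevant platform
        return gateway.query(fields, filters)  # replies come back normalised

d = Dispatcher()
print(d.query("traceroute", fields=["src", "dst", "hops"]))
```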
3.4 Implementation Steps
The measurement and monitoring effort in Fed4FIRE aims at unifying the information and metrics available from the federated testbeds. Three levels of reporting have been identified, with an increasing degree of precision. Facility monitoring allows an experimenter to choose testbed resources according to their needs, infrastructure monitoring allows them to measure and record the behavior of these resources during an experiment, and experimental measurement allows the collection of any other specific metric germane to the study, usually measured with dedicated tools.
Not all testbeds involved in the Fed4FIRE project provide the same tools or resources. To support a working federation, commonalities have to be found and supported. This section presents the necessary steps to be taken towards this goal. They fall into three categories: installation of new tools, adaptation of existing tools, and coordination.
3.4.1 Installation of New Tools
As seen in the previous sections, some testbeds do not expose sufficient information about the facility and infrastructure status. These testbeds will therefore be required to deploy one of the selected tools for that purpose, with a preference for Zabbix. According to Table 2, this includes the following testbeds: NITOS, w-iLab.t, SmartSantander and NICTA.
All testbeds will have to support OML. This requires the installation of the OML client library on all testbed nodes, and the deployment of at least one OML collection server reachable by all. This requirement impacts the following testbeds, either in terms of deployment or activation: PlanetLab Europe, VirtualWall, OFELIA, DWDM Ring, ADVA ROADM, BonFIRE, Grid'5000, FUSECO and SmartSantander. NICTA will provide technical guidance for these deployments.
Currently, TopHat is only deployed within PlanetLab Europe. All other testbeds will need to install a local instance, with support from UPMC.
3.4.2 Adaptation of Existing Tools
Many of the monitoring and measurement tools output data in varied formats. For integration within the federation, these tools need to be adapted to support reporting via OML, as sketched below. This primarily concerns Zabbix and Nagios, but also all experiment-specific measurement tools listed in Table 6. Information on how to instrument an application can be found in [15], as well as on the OML website. It is preferable to create a separate plugin when the tool supports it (e.g., Zabbix). An example for collectd is available. Each testbed will be responsible for instrumenting its own tools, with support from NICTA.
Additionally, a TopHat agent able to query the OML storage backend is also needed in order to provide a unified gateway into distributed measurements. This will be a task of UPMC and NICTA.
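As a sketch of such an adaptation, the wrapper below runs an unmodified tool (ping), parses its output, and re-injects the values via OML, reusing the illustrative OML4Py calls from Section 3.3.2. The regular expression matches common Linux ping output and may need adjusting per tool.

```python
# Sketch of an OML wrapper around an unmodified tool: run ping, parse each
# reply line, and inject the RTT into an OML measurement point.
import re
import subprocess

import oml4py

RTT_RE = re.compile(r"from ([\d.]+).* time=([\d.]+) ms")

oml = oml4py.OMLBase("ping-wrapper", "mydomain", "node1",
                     "tcp:oml.example.org:3003")   # placeholder URI
oml.addmp("rtt", "destination:string rtt_ms:double")
oml.start()

proc = subprocess.Popen(["ping", "-c", "10", "10.0.0.1"],
                        stdout=subprocess.PIPE, universal_newlines=True)
for line in proc.stdout:
    m = RTT_RE.search(line)
    if m:                                    # one OML sample per ping reply
        oml.inject("rtt", [m.group(1), float(m.group(2))])
proc.wait()
oml.close()
```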
3.4.3 Coordination
As noted in Section 3.2.2, OML does not enforce any semantics on the schema of its measurement points. For the federation to be successful, it is important that similar tools provide monitoring and measurement data following the same structure. A unified abstraction is therefore needed for the way data from monitoring and measurement tools are grouped together in meaningful sets, and these sets must be made standard across tools measuring similar aspects. For that purpose, NICTA will curate a list of measurement point schemas to use for specific types of metrics, and provide technical support for their implementation in the tools in use by each testbed. This will ensure homogeneity across testbeds using different sets of tools to measure the same characteristics.
Note that these measurement point schemas will not only ease the life of the experimenters; they are also a vital instrument in the implementation of a First Level Support (FLS) service. This service will consolidate common facility monitoring information produced by all Fed4FIRE testbeds in a single operator view. The exact list of compulsory metrics will be defined based on further discussions and experience, but they will all have to be provided to the FLS service in a common format. The coordination task presented in this section will therefore also contribute to the successful implementation of the FLS service.
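As an invented example of what such curated schemas could look like (these are not the actual Fed4FIRE schemas), every wrapper would reuse the same schema string verbatim, so that, e.g., a CPU sample reported via a Zabbix wrapper and one reported via a collectd wrapper end up structurally identical:

```python
# Invented examples of curated, shared measurement-point schemas: one
# canonical schema string per metric family, reused verbatim by every
# instrumented tool so the resulting measurement tables are interchangeable.
COMMON_MPS = {
    "cpu":     "node:string user:double system:double idle:double",
    "memory":  "node:string total_kb:uint64 used_kb:uint64 free_kb:uint64",
    "traffic": "node:string interface:string bytes_in:uint64 bytes_out:uint64",
}

# Every wrapper registers the same schema, whatever tool feeds it:
#   oml.addmp("cpu", COMMON_MPS["cpu"])
```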
4 Summary
4.1 Mapping of Architecture to Implementation Plan
In order to implement the architecture from D2.1, seven implementation steps have been identified, as presented in Table 8. Most steps (service deployment and application instrumentation) need to be undertaken independently by all participants. Where commonalities exist (e.g. Zabbix and Nagios), instrumentation should be a common effort. To support the instrumentation task, NICTA will provide and curate a clearinghouse of homogenized OML measurement point schemas. The goal is to ease the integration of new applications while maintaining the highest level of data exchangeability and interoperability between measurement tools providing similar information. A TDMI agent will be written to allow queries to the OML storage from TopHat.
Functional element: Facility and infrastructure monitoring
Implementation strategy:
• Deploy Nagios and/or Zabbix and/or collectd if not yet available (all participants)
• Instrument the relevant measurement systems (all participants, with support from NICTA)
Functional element: Experiment measurement
Implementation strategy:
• Deploy OML if not yet available (all participants)
• Instrument relevant measurement systems (all participants, with support from NICTA)
• Maintain clearinghouse of measurement points (NICTA)
Functional element: Data access
Implementation strategy:
• Deploy TopHat (all participants)
• Make OML measurement databases accessible to TopHat (NICTA, UPMC)
Table 8: Implementation strategy of functional elements
4.2 Deviation of supported requirements compared to D2.1
Figure 4: Monitoring and measurement specification for cycle 1
As shown in Figure 4, there are a few divergences from the architecture presented in Figure 2. First, specific data collection systems are introduced in each testbed. This element takes the form of one or more OML collection servers (their number and respective roles are an implementation detail to be discussed for each testbed). Based on this semi-centralised data collection, a data access layer is introduced, in the form of a TopHat agent in each testbed, with the ability to query the OML datastores on behalf of the experimenter or central federation management. Finally, rather than querying testbed entities directly, both experimenters and management can delegate all their queries to the testbed-local TopHat agents.
References
[1] PlanetLab Europe. Available online at http://www.planet-lab.eu, last visited on February 11, 2013.
[2] S. Soltesz, M. Fiuczynski, and L. Peterson. MyOps: A Monitoring and Management Framework for PlanetLab Deployments. Working paper. 2009.
[3] CoMon. Available online at http://www.comon.cs.princeton.edu/, last accessed on February 11, 2013.
[4] TopHat. Available online at http://www.top-hat.info/, last accessed on February 11, 2013.
[5] M. Huang, A. C. Bavier, and L. L. Peterson. PlanetFlow: Maintaining Accountability for Network Services. ACM SIGOPS Operating Systems Review, Volume 40, Issue 1, pp. 89-94, January 2006.
[6] W. Barth, Nagios: Systems and Network Monitoring, 2nd ed., San Francisco: No Starch Press (2008).
[7] Zabbix – open source monitoring system, www.zabbix.com, last accessed on February 11, 2013.
[8] Zenoss. Available online at http://www.zenoss.com/, last accessed on February 11, 2013.
[9] FUSECO Playground. Available online at www.fuseco-playground.org, and at www.ngn2fi.org/playgrounds/Fuseco_Playground/index.html, last accessed on February 11, 2013.
[10] Multi-Hop Packet Tracking – http://www.av.tu-berlin.de/pt, last accessed on February 11, 2013.
[11] FITeagle – Future Internet Testbed Experimentation and Management Framework, http://www.fiteagle.org/, last accessed on February 11, 2013.
[12] BonFIRE architecture. Available online at http://doc.bonfire-project.eu/R3/reference/bonfire-architecture.html, last accessed on February 11, 2013.
[13] BonFIRE monitoring documentation. Available online at http://doc.bonfire-project.eu/R3/monitoring/howto.html, last accessed on February 11, 2013.
[14] BonFIRE monitoring documentation. Available online at http://doc.bonfire-project.eu/R3/monitoring/getting-data.html, last accessed on February 11, 2013.
[15] SmartSantander. Available online at http://www.smartsantander.eu/index.php/testbeds/item/132-santander-summary, last accessed on February 11, 2013.
[16] O. Mehani, G. Jourjon, T. Rakotoarivelo, and M. Ott, "An instrumentation framework for the critical task of measurement collection in the future Internet," under review, 2012.
[17] M. Gates, A. Tirumala, J. Dugan, K. Gibbs, Iperf version 2.0.0, NLANR applications support, University of Illinois at Urbana-Champaign, Urbana, IL, USA, 2004.
[18] F. Schneider, J. Wallerich, A. Feldmann, Packet capture in 10-gigabit ethernet environments using contemporary commodity hardware, in: S. Uhlig, K. Papagiannaki, O. Bonaventure (Eds.), PAM 2007, 8th International Conference on Passive and Active Network Measurement, volume 4427 of Lecture Notes in Computer Science, Springer-Verlag Berlin, Heidelberg, Germany, 2007, pp. 207-217.
[19] S. S. Kolahi, S. Narayan, D. Nguyen, Y. Sunarto, Performance monitoring of various network traffic generators, in: R. Cant (Ed.), UKSim 2011, 13th International Conference on Computer Modelling and Simulation, IEEE Computer Society, Los Alamitos, CA, USA, 2011, pp. 501-506.
[20] G. Gibb, J. W. Lockwood, J. Naous, P. Hartke, N. McKeown, NetFPGA: an open platform for teaching how to build gigabit-rate network switches and routers, IEEE Transactions on Education 51 (2008) 364-369.
[21] Y. Shavitt, E. Shir, DIMES: Let the internet measure itself, SIGCOMM Computer Communication Review 35 (2005) 71-74.
[22] M. Huang, A. Bavier, L. Peterson, PlanetFlow: Maintaining accountability for network services, ACM SIGOPS Operating Systems Review 40 (2006) 89-94.
[23] K. Park, V. S. Pai, CoMon: A mostly-scalable monitoring system for PlanetLab, ACM SIGOPS Operating Systems Review 40 (2006) 65-74.
[24] G. Iannaccone, CoMo: An Open Infrastructure for Network Monitoring - Research Agenda, Technical Report, Intel Research, Cambridge, UK, 2005.
[25] C. Brandauer, T. Fichtel, MINER: a measurement infrastructure for network research, in: T. Magedanz, S. Mao (Eds.), TridentCom 2009, 5th International Conference on Testbeds and Research Infrastructures for the Development of Networks & Communities, IEEE Computer Society, Los Alamitos, CA, USA, 2009, pp. 1-9.
[26] D. Harrington, R. Presuhn, B. Wijnen, An Architecture for Describing Simple Network Management Protocol (SNMP) Management Frameworks, RFC 3411, RFC Editor, Fremont, CA, USA, 2002.
[27] B. M. Cantrill, M. W. Shapiro, A. H. Leventhal, Dynamic instrumentation of production systems, in: A. Arpaci-Dusseau, R. Arpaci-Dusseau (Eds.), USENIX 2004, USENIX Association, Berkeley, CA, USA, 2004, pp. 15-28.
[28] Q. Zhao, Z. Ge, J. Wang, J. Xu, Robust traffic matrix estimation with imperfect information: Making use of multiple data sources, SIGMETRICS Performance Evaluation Review 34 (2006) 133-144.
[29] B. Claise, S. Bryant, S. Leinen, T. Dietz, B. H. Trammell, Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information, RFC 5101, RFC Editor, Fremont, CA, USA, 2008.
[30] E. Boschi, B. Trammell, L. Mark, T. Zseby, Exporting Type Information for IP Flow Information Export (IPFIX) Information Elements, RFC 5610, RFC Editor, Fremont, CA, USA, 2009.
[31] Tordsson, J.; Djemame, K.; Henriksson, D.; Katsaros, G.; Ziegler, W.; Waldrich, O.; Konstanteli, K.; Sajjad, A.; Rajarajan, M.; Gallizo, G.; Nair, S., "Towards holistic Cloud management", in: "European Research Activities in Cloud Computing", Cambridge Scholars Publishing, 2012.
[32] G. Katsaros et al., "Building a service-oriented monitoring framework with REST and Nagios", IEEE International Conference on Services Computing (SCC), July 2011.
[33] G. Katsaros et al., "A service oriented monitoring framework for soft real-time applications", IEEE International Conference on Service-Oriented Computing and Applications (SOCA), Dec. 2010.
[34] G. Katsaros et al., "Monitoring: A Fundamental Process to Provide QoS Guarantees in Cloud-based Platforms", in: "Cloud computing: methodology, systems, and applications", CRC, Taylor & Francis Group, September 2011.
[35] S. A. De Chaves et al., "Toward an architecture for monitoring private clouds", IEEE Communications Magazine, Volume 49, Issue 12, pp. 130-137, December 2011.
[36] G. Katsaros et al., "A multi-level architecture for collecting and managing monitoring information in Cloud environments", in: Proceedings of the 1st International Conference on Cloud Computing and Services Science (CLOSER), May 2011.
[37] G. Katsaros et al., "An integrated monitoring infrastructure for Cloud environments", in: Lecture Notes in Business Information Processing (LNBIP), Springer-Verlag, September 2011.
[38] FIRE OpenLab project - http://www.ict-openlab.eu/home.html, visited on February 12, 2013.
[39] FITeagle - Future Internet Testbed Experimentation and Management Framework, www.fiteagle.org/, last visited on February 12, 2013.
[40] NOVI FIRE Project, http://www.fp7-novi.eu/, last visited on February 12, 2013.
[41] EVEREST - EVEnt REaSoning Toolkit. Available online at http://sourceforge.net/apps/trac/sla-at-soi/wiki/EverestCore, last accessed on February 11, 2013.
[42] Zabbix – open source monitoring system, www.zabbix.com, last accessed on February 11, 2013.
[43] Ganglia, "Ganglia monitoring system," website, available online at www.ganglia.sourceforge.net, last visited on February 12, 2013.
[44] Nagios monitoring tool. Website, available at www.nagios.org, visited on February 12, 2013.
[45] GroundWork, "GroundWork," website, available online at www.gwos.com, last visited on February 12, 2013.
[46] MonALISA, "MonALISA: Monitoring agents using a large integrated services architecture," website, available online at monalisa.caltech.edu/monalisa.htm, last visited on February 12, 2013.
[47] CloudStatus, "CloudStatus," website, available online at www.hyperic.com/products/cloud-status-monitoring, last visited on October 14, 2012.
[48] Nimsoft, "CA Nimsoft Monitor," website, available online at www.nimsoft.com/solutions/nimsoft-monitor.html, last visited on February 12, 2013.
[49] BonFIRE monitoring documentation. Available online at http://doc.bonfire-project.eu/R3/monitoring/getting-data.html, last accessed on February 11, 2013.
[50] J. Griffioen, Z. Fei, and H. Nasir. Architectural Design and Specification of the INSTOOLS Measurement System. December 2009.
[51] Y. Al-Hazmi and T. Magedanz. A Flexible Monitoring System for Federated Future Internet Testbeds. Proceedings of the 3rd IEEE International Conference on the Network of the Future (IEEE NoF 2012), Tunis, Tunisia, Nov 2012.
[52] OCF. OFELIA Control Framework. Code available online at https://github.com/fp7-ofelia/ocf, last accessed on March 5, 2013.
[53] EXPERIMENTA Platform at XiPi. http://www.xipi.eu/infrastructure/-/infrastructure/view/2009, last accessed on March 5, 2013.
[54] OFELIA Project website. https://alpha.fp7-ofelia.eu/, last accessed on March 5, 2013.
[55] FIBRE Project website. http://www.fibre-ict.eu/, last accessed on March 5, 2013.
[56] OpenEPC – Open Evolved Packet Core. Available online at http://www.openepc.net, last accessed on February 12, 2013.
[57] Open IMS Core homepage. Available online at http://www.openimscore.org, last accessed on February 12, 2013.
[58] The OpenMTC vision. Available online at http://www.open-mtc.org, last accessed on February 12, 2013.
[59] The GNU Netcat project. Available online at http://netcat.sourceforge.net/, last accessed on February 13, 2013.
Appendix A: PlanetLab Europe requirements
We describe here the monitoring requirements for PlanetLab Europe [1], but most of what we describe also applies to other PlanetLab systems, and notably to PlanetLab Central. The PlanetLab architecture has purposely included only the minimum required functionality in the system, leaving to third parties the provision of various functionalities that might be directly included in other systems. Monitoring is no exception. A PlanetLab system consists of a set of nodes on the internet, but node monitoring and monitoring of the internet are not provided natively by PlanetLab. The most widely used node monitoring system for PlanetLab is called CoMon [3], and for monitoring of the internet the PlanetLab Europe operations team uses the TopHat Dedicated Measurement Infrastructure (TDMI) [4]. These two systems are the most experimenter-oriented of the tools in use, but there are also other monitoring systems, such as PlanetFlow [5] and MyOps [2], that tend to serve the testbed operator more. CoMon measures the activity of each slice on a node, for example CPU, memory, disk usage and bandwidth usage. TDMI has agents on each node that conduct traceroutes in a full mesh, providing information on the paths between the nodes.
The third party service used by PlanetLab Europe for managing measurements is TopHat. TopHat draws from TDMI, CoMon and other measurement systems, such as Team Cymru (providing autonomous system numbers for the nodes) and MaxMind (providing geolocalization). The TopHat developers can easily add other measurement sources by developing a dedicated gateway for each new service.
This information serves at different phases in the experiment lifecycle. In setting up an experiment, users make use of the data in order to choose the nodes that they put in a slice. They might want lightly loaded nodes, for instance, or nodes in a given country, or nodes that have stable routes between each other. Then, while an experiment is running, the experimenter might call on these measurement sources to monitor node health or path stability, for instance. After an experiment has concluded, the experimenter may wish to use TopHat to call up historical data in an effort to understand how node or network conditions affected the experiment. Furthermore, PlanetLab is fully open to third party experiment control tools, which includes third party measurement tools. So, for instance, work is being done to easily enable OMF-enabled slices, which come with OML.
Monitoring solutions
Monitoring splits into four categories: (A) infrastructure (MyPLC and related); (B) nodes from an operations point of view; (C) nodes from a user point of view; (D) experimentation metrics.
As far as (A) "infrastructure" is concerned, we use a Nagios-based deployment [6] for monitoring all the boxes and services known to play a role in smooth operations; a lot of these have been added over time.
As far as (B) "nodes from an operations point of view" is concerned, we use a separate tool tailored for PlanetLab deployments named MyOps, which allows escalation policies to be implemented. For example,
if one site has a node down, we first send messages to the technical contact, then to the PI, and after a while the site loses some of its capabilities (fewer slices). All this workflow is entirely automated under MyOps, which also provides raw data on the status of nodes; see http://planet-lab.eu/monitor/site.
For (C) "nodes from a user point of view", PlanetLab Europe uses TopHat, with TDMI and the other sources mentioned above. One difficulty encountered by the PlanetLab Europe operations team was that the potentially very useful data gathered by MyOps are not easily accessible through an API or other query-based techniques, and so MyOps does not lend itself to the creation of a gateway through which it could be made available via TopHat. There clearly was a need for both: (1) aggregating data about nodes from various places (MyOps being one, CoMon being another, and we found quite a few other sources of potentially high interest), and (2) providing a decent querying interface to these data. In a first step, we leveraged the internal 'tags' mechanism in the MyPLC DB to extend it with such external data. In a federated world, and in particular with respect to MySlice, it might make sense to design a separate tool for hosting this aggregated data.
We do not try to address (D) "experimentation metrics" at all, this being deemed part of the experimental plane. OML is considered to be a suitable candidate to deploy when support for (D) is pursued.
Monitoring requirements
Nothing is crucially missing in what we have at this point. There are no current monitoring requirements from PlanetLab Europe. Of course, the introduction of Fed4FIRE monitoring systems may not lead to a loss of supported functionality compared to the situation as it is today.
Required metrics
The metrics that are measured, and therefore should also be supported in Fed4FIRE, are:
• TopHat
o traceroute measurements between each pair of PlanetLab nodes
o for each IP hop, more information could be provided (ASN, country, hostname, etc.)
• CoMon data (see http://codeen.cs.princeton.edu/slicestat/)
o Slice name
o Slice context id
o CPU consumption (%)
o Physical memory consumption (%)
o Physical memory consumption (in KB)
o Virtual memory consumption (in KB)
o Number of processes
o Average sending bandwidth for last 1 min (in Kbps)
o Average sending bandwidth for last 5 min (in Kbps)
o Average sending bandwidth for last 15 min (in Kbps)
o Average receiving bandwidth for last 1 min (in Kbps)
o Average receiving bandwidth for last 5 min (in Kbps)
o Average receiving bandwidth for last 15 min (in Kbps)
o Local IP address of this node
o Number of active processes, that is, processes using the CPU at the moment
Appendix B: Virtual Wall requirements
The Virtual Wall is a testbed consisting of physical servers, switches and management servers operating the infrastructure.
Monitoring solutions
At this moment the Virtual Wall runs Zabbix [7] to monitor the management servers and switches, accompanied by scripts that check temperature and fan speeds through IPMI on the physical servers. If a server runs too hot (e.g. because the fans are not running), it is turned off. This can be considered facility monitoring functionality. The Emulab software that powers the Virtual Wall has a built-in health check system, which puts nodes that do not come up correctly after deploying images in a "hardware down" pool (the experiment then uses another free node, so this is transparent to the experimenter). This can also be considered facility monitoring functionality. The Emulab software also has the possibility to run topology/link tests after an experimenter swaps in an experiment. This is infrastructure monitoring functionality. As experimenters have root access on the nodes, they can deploy other infrastructure monitoring and experiment measurement frameworks themselves.
Monitoring requirements
The system works very well as it is; however, being part of a federation adds some extra requirements. For first level support, there is a need for a central view on the health of all facilities, and probably also e.g. on the nodes available. This still has to be implemented / agreed in Fed4FIRE.
Required metrics
A very important requirement is that detailed monitoring/statistics are needed if the facility is open to traffic from and to the internet (probably through a programmable firewall). These are needed to trace back malicious traffic. A non-exhaustive list of examples of such metrics on a per-experiment basis: in- and outgoing throughput, logging of the destination IP addresses of outgoing packets, logging of the quantity of the different types of outgoing messages based on information regarding the higher layers of the protocol stack (specific ICMP messages, usage of well-known UDP/TCP ports that could indicate malicious behaviour, etc.), and so on. For other metrics, ICMP, ssh, http and https checks would appear to be adequate for first level support.
The following metrics are of interest to the Virtual Wall experimenters:
• Bandwidth usage on all links in the experiment topology
• Memory: total memory, used memory, free memory, total swap memory, used swap memory, free swap memory
• CPU: CPU load, CPU utilisation, CPU count, CPU idle time, cumulative CPU time, cumulative CPU usage, CPU usage over time for one or multiple specific processes, or even per specific thread
• Bandwidth usage per session (e.g. per TCP connection)
• Network metrics: connectivity, network speed
• IP connections statistics: IP addresses
• Link utilisation, number of bytes, packets and flows per time slot
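Since experimenters have root access on the nodes, most of these node-level metrics can also be sampled directly by the experimenter. A minimal sketch using the psutil library (one possible choice, not mandated by the testbed):

```python
# Minimal sketch of collecting some of the node metrics listed above
# with the psutil library.
import psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()
net = psutil.net_io_counters()

sample = {
    "cpu_percent": psutil.cpu_percent(interval=1),  # CPU utilisation over 1 s
    "cpu_count": psutil.cpu_count(),
    "mem_total": mem.total, "mem_used": mem.used, "mem_free": mem.available,
    "swap_total": swap.total, "swap_used": swap.used, "swap_free": swap.free,
    "bytes_sent": net.bytes_sent, "bytes_recv": net.bytes_recv,
}
print(sample)
```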
It would also be very interesting to have a test infrastructure in a central place which starts and stops experiments, e.g. once every 24 hours, to verify that each facility supports the whole experiment lifecycle and that the APIs still conform.
Appendix C: OFELIA (OpenFlow in Europe Linking infrastructure and Applications) Control Framework requirements
This platform has multiple instances distributed inside and outside Europe, some of them deployed by partners involved in Fed4FIRE.
Monitoring solutions
At this moment, monitoring is limited to the general infrastructure status and VMs. Status of servers and switches is monitored with Zenoss [8], which allows the Island Manager and experimenters to check which servers and switches are up and available. VM status is monitored by the agent in the server, which sends notifications to the Aggregate Manager each time an event happens. VM status can be "started" or "stopped". There is no support for experiment monitoring at the moment.
Monitoring requirements
A planned improvement in OFELIA is to add call-backs to OpenFlow to monitor changes in network topology. Experiment monitoring is also planned to be implemented in Fed4FIRE, to get statistics through OpenFlow counters. Experimenters can check the status of the requested resources.
The following user needs can be identified:
• Experimenters should be able to take measurements from the resources during experiment development using facility tools, in addition to the measurements they can take themselves.
• Experimenters should be able to store the measurements for later study and sharing.
The following facility needs can be identified:
• A monitoring tool should be available to offer experimenters the possibility of monitoring data from the resources of their experiment and storing its statistics.
Required metrics
The overall required metrics are as follows:
• Network (L2 switches) level:
o Device status: ON/OFF
o CPU: CPU utilisation, CPU idle time
o Memory: total memory, used memory, free memory
o Temperature
o Physical interfaces: ON/OFF
o Virtual interfaces: ON/OFF
o VLANs
• Switch interface level:
o Interface status: ON/OFF
o Throughput
o Packet rate
o Error rate
o IP addresses
o MAC addresses
o MTU
• Server level:
o Device status: ON/OFF
o VMs: list of VMs, ON/OFF
o Memory: total memory, used memory, free memory
o CPU: total number of CPUs, core speed, L1/L2/L3 cache memory, total CPU utilisation, CPU utilisation per VM
o Storage: total capacity, used capacity, free capacity
o Running services: SSH, FTP, etc.
o Virtual interfaces: ON/OFF, IP and MAC addresses
• VM level:
o VM status: ON/OFF
o Memory: total memory assigned to VM
Appendix D: Optical Testbed – Experimenta DWDM Ring Testbed requirements
The complete Experimenta testbed [53] comprises the DWDM Ring testbed, a pool of VMs and also OFELIA/FIBRE [54][55] OpenFlow islands. The OFELIA/FIBRE island is described in the previous section; this section focuses on the DWDM Ring testbed. The goal of the provider (i2CAT) is to offer Experimenta under the OFELIA Control Framework (OCF) [52]. This integration is still in its early stages. Therefore, the details provided in this document are based on the current capabilities of OCF and the generic requirements for enabling OCF to control and manage optical testbeds such as the Experimenta DWDM Ring testbed. Future revisions will update these details to properly reflect the on-going integration of OCF and this testbed.
Monitoring solutions
The integration of Experimenta with the OFELIA/FIBRE islands and OCF is still in the very early stages. Therefore, monitoring is not available at the moment. After the integration is complete, experimenters will be able to collect statistics using the OpenFlow controller. In particular, experimenters will be able to monitor and collect flow statistics.
Monitoring requirements
The experimenter might consider a load balancing mechanism applied to the proposed topology. Using inherent OpenFlow monitoring, the experimenter can monitor the number of bytes sent across the different load balancing paths. Assuming that OpenFlow-based monitoring is already available to experimenters, an additional user need is to provide automated monitoring mechanisms, e.g. scripts to aid experimenters in collecting OpenFlow flow statistics.
Required metrics
The following represents monitoring requirements and their associated metrics.
1. Property: Virtualisation server status
Virtualisation servers are one of the main resources provided by OCF. VMs are created in the servers with variable characteristics (memory, HDD size, etc.). It is important to monitor the status of the facility servers in order to properly manage their capacities and detect malfunctions. All the component-level metrics can be useful, but the ones that limit the servers' capacity are the used memory and storage. CPU metrics are also interesting for facility monitoring, as many VMs working at the same time can overload a server's capacity.
• Server's status (up/down)
• Server's memory usage
• Server's storage space available
• Server's CPU usage
• VM status (up/down)
• VM memory
• VM storage capacity
2. Property: Network status
Another resource provided by OCF is OpenFlow resources. This mainly includes OpenFlow-enabled switches and a software-defined network topology over them. Component-level metrics related to the switches are required to know their availability for use in an experiment. Network metrics like connectivity and topology detection are also useful to present facility users with the status of the network when setting up an experiment.
• Switch status (up/down)
• Switch ports
• Network topology
3. Property: Experiment monitoring
The previous properties are useful in a pre-experimentation phase. The servers' CPU and memory utilization help the Island Managers to check the status of the facility, to prevent and/or solve problems as soon as possible, and help the experimenters to choose the most appropriate resources for their experiments. Once the experiment has started, it would be very interesting to monitor its development and the usage of the reserved resources in order to obtain useful information. When an experimenter sets up an experiment, it consists of a network topology composed of VMs and OpenFlow-enabled switches. The experimenter then defines a controller (which can be deployed in a VM or any computer connected to the network). This controller reconfigures the routing tables of the OpenFlow switches dynamically to redirect the network traffic according to the experimenter's purposes. Any monitoring metric about the experiment's development is useful, whether at the equipment, network or traffic level.
• All the traffic metrics:
o Delay
o Packet loss
o Throughput
o Etc.
• IP connection statistics
Appendix E: Optical Testbed – OpenFlow-enabled ADVA ROADMs Testbed requirements
This testbed comprises OpenFlow-enabled ADVA ROADMs and distributed dark fibre in the UK, with connectivity via JANET, Internet2 and GEANT.
Monitoring solutions
From the facility owner's perspective, monitoring the infrastructure health/resource usage is possible via proprietary APIs and scripts. From the experimenter's perspective, OpenFlow can be used to monitor flow statistics such as the number of packets or the number of bytes.
Monitoring requirements
For experimenters, statistics can be collected using the OpenFlow controller for the experiment. The experimenter determines what statistics to collect in this case. However, more sophisticated flow-based monitoring might enhance experiment monitoring in this testbed. There are currently considerations to integrate some OML functionality into OCF, and this could further improve and automate the monitoring capabilities of this platform. Experimenters may want to monitor bandwidth consumption or packet loss. This could be used to evaluate switching algorithms, and also for load balancing and VM migration experiments. The experimenter might want to observe the number of packets traversing the network to check whether the application-aware switching algorithm has been able to select a high quality of service path, taking into consideration the requirements of the beyond-high-definition format sent from the traffic source. As OpenFlow-based monitoring is already available to experimenters, the following additional user need can be defined: providing automated monitoring mechanisms, e.g. scripts to aid experimenters in collecting OpenFlow flow statistics.
Required metrics
Standard optical metrics can be monitored, including:
• Bandwidth consumption
• Packet loss
• Flow statistics
• Port status, i.e. ports on/off
• Connectivity status, i.e. signal status
• Total number of cross-connections
• Topology (using cross-connect configurations)
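Such flow-statistics collection scripts could, for example, be built on an OpenFlow controller framework. The sketch below uses Ryu as an illustrative choice (neither the testbed nor this deliverable prescribes a specific controller): it tracks connected switches and polls their flow statistics every ten seconds.

```python
# Sketch of an automated OpenFlow flow-statistics collector built on the
# Ryu controller framework (an illustrative choice). It tracks connected
# switches and periodically requests their per-flow counters.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import DEAD_DISPATCHER, MAIN_DISPATCHER, set_ev_cls
from ryu.lib import hub
from ryu.ofproto import ofproto_v1_3


class FlowStatsCollector(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    def __init__(self, *args, **kwargs):
        super(FlowStatsCollector, self).__init__(*args, **kwargs)
        self.datapaths = {}
        self.monitor_thread = hub.spawn(self._monitor)

    @set_ev_cls(ofp_event.EventOFPStateChange,
                [MAIN_DISPATCHER, DEAD_DISPATCHER])
    def _state_change(self, ev):
        # Track switches as they connect and disconnect.
        dp = ev.datapath
        if ev.state == MAIN_DISPATCHER:
            self.datapaths[dp.id] = dp
        elif ev.state == DEAD_DISPATCHER:
            self.datapaths.pop(dp.id, None)

    def _monitor(self):
        # Poll every connected switch for its flow statistics every 10 s.
        while True:
            for dp in self.datapaths.values():
                dp.send_msg(dp.ofproto_parser.OFPFlowStatsRequest(dp))
            hub.sleep(10)

    @set_ev_cls(ofp_event.EventOFPFlowStatsReply, MAIN_DISPATCHER)
    def _flow_stats_reply(self, ev):
        # Per-flow packet/byte counters, e.g. to compare load-balancing paths.
        for stat in ev.msg.body:
            self.logger.info("dpid=%016x packets=%d bytes=%d",
                             ev.msg.datapath.id,
                             stat.packet_count, stat.byte_count)
```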
Appendix F: BonFIRE requirements
BonFIRE is a multi-site testbed, implemented using a broker architecture [12]. Observability is one of the four BonFIRE pillars, and as such extensive monitoring is available to the experimenters at both the broker and the site level.
Site-level Monitoring
Monitoring is provided on the BonFIRE sites using the Zabbix framework. Three types of monitoring are available on BonFIRE through Zabbix: VM, application and infrastructure monitoring. The BonFIRE documentation describes how to set up [13] and how to access [14] monitoring data. We briefly discuss the BonFIRE types of monitoring below.
VM Monitoring: Over 100 metrics are monitored at the VM level by default, and the user can deactivate or re-activate them through the Zabbix interface. Examples of per-VM monitoring metrics are: total memory, used memory, free memory, total swap memory, used swap memory, free swap memory, total storage, used storage, free storage, CPU load, CPU utilization (%), CPU count, network metrics (e.g. incoming and outgoing traffic on interfaces), OS-related metrics, processes-related metrics (e.g. number of running processes), services-related metrics (e.g. FTP/Email/SSH server is running), etc. A list of the concrete measured metrics, provided either as active or disabled metrics, is given at the end of this appendix.
Application Monitoring: Zabbix allows users to add their own metrics to be monitored and stored through the framework. These can provide information about the state of the running application, its performance and other application-specific information. As these are application-specific, the experimenters need to explicitly configure them at the agent and the server.
Infrastructure Monitoring: BonFIRE provides its experimenters the ability to get monitoring data about the physical machines that run their VMs. We refer to this service as infrastructure monitoring. The most requested infrastructure metrics are: CPU load, total and free memory, free swap memory, the number of VMs on a physical node, disk IO, disk read/write, incoming and outgoing traffic on interfaces, etc.
Broker-level Monitoring
The BonFIRE sites do not generally understand the concept of the experiment; this is implemented at the broker level. BonFIRE timestamps and exposes to the experimenters the times at which events relevant to their experiment take place. This includes experiment and resource (e.g. compute, storage or network) events.
Measured metrics in BonFIRE
VM Monitoring: the following metrics are actively measured by default:
Buffers memory
Cached memory
CPU system time (avg1)
CPU nice time (avg1)
CPU idle time (avg1)
CPU iowait time (avg1)
CPU user time (avg1)
Free disk space on /
Free memory
Free swap space
Host boot time
Host status
Host uptime (in sec)
Incoming traffic on interface lo
Incoming traffic on interface eth1
Incoming traffic on interface eth0
Number of processes
Number of running processes
Number of users connected
Outgoing traffic on interface lo
Outgoing traffic on interface eth0
Outgoing traffic on interface eth1
Ping to the server (TCP)
Processor load
Shared memory
System cpu usage average
Total disk space on /
Total memory
Total swap space
Used disk space on /
Used disk space on / in %
The following metrics are disabled (not measured by default), but can be enabled by the BonFIRE user at any time:
Checksum of /usr/sbin/sshd
Checksum of /usr/bin/ssh
Checksum of /vmlinuz
Checksum of /etc/services
Checksum of /etc/inetd.conf
Checksum of /etc/passwd
Email (SMTP) server is running
Free disk space on /usr
Free disk space on /var
Free disk space on /tmp
Free disk space on /home
Free disk space on /opt
Free disk space on /tmp in %
Free disk space on /var in %
Free disk space on /usr in %
Free disk space on / in %
Free disk space on /home in %
Free disk space on /opt in %
Free number of inodes on /usr
Free number of inodes on /tmp
Free number of inodes on /home
Free number of inodes on /
Free number of inodes on /opt
Free number of inodes on /tmp in %
Free number of inodes on / in %
Free number of inodes on /usr in %
Free number of inodes on /opt in %
Free number of inodes on /home in %
Free swap space in %
FTP server is running
Host information
Host local time
Host name
IMAP server is running
Maximum number of opened files
Maximum number of processes
News (NNTP) server is running
Number of running processes zabbix_server
Number of running processes zabbix_agentd
Number of running processes apache
Number of running processes inetd
Number of running processes mysqld
Number of running processes sshd
Number of running processes syslogd
POP3 server is running
Processor load5
Processor load15
Size of /var/log/syslog
SSH server is running
Temperature of CPU 1of2
Temperature of CPU 2of2
Temperature of mainboard
Total disk space on /home
Total disk space on /usr
Total disk space on /tmp
Total disk space on /opt
Total number of inodes on /usr
Total number of inodes on /
Total number of inodes on /opt
Total number of inodes on /home
Total number of inodes on /tmp
Used disk space on /usr
Used disk space on /var
Used disk space on /home
Used disk space on /tmp
Used disk space on /opt
Used disk space on /usr in %
Used disk space on /var in %
Used disk space on /tmp in %
Used disk space on /opt in %
Version of zabbix_agent(d) running
WEB (HTTP) server is running
Infrastructure Monitoring: information about the following 16 measured metrics is provided:
Eth0 outgoing traffic
Eth0 incoming traffic
Running VMs
Processor load
Free swap space
Total memory
Free memory
Disk sda Write Bytes/sec
Disk sda Write: Ops/second
Disk sda IO ms time spent performing IO
Disk sda IO currently executing
Disk sda Read: Milliseconds spent reading
Disk sda Read: Ops/second
Disk sda Write: Milliseconds spent writing
Disk sda Read Bytes/sec
Ping to the server (TCP)
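Application monitoring, described above, relies on experimenter-defined metrics being pushed into Zabbix. One common route is Zabbix's zabbix_sender utility, driven here from Python as a sketch; the server, host and item key are placeholders, and the corresponding trapper item must already be configured on the Zabbix server.

```python
# Sketch of pushing an application-specific metric into Zabbix using the
# standard zabbix_sender utility. Assumes a Zabbix "trapper" item with key
# "myapp.requests_per_sec" is configured for this host on the server.
import subprocess

def push_metric(server, host, key, value):
    """Send one value to the Zabbix server via zabbix_sender."""
    subprocess.check_call([
        "zabbix_sender",
        "-z", server,        # Zabbix server to send to
        "-s", host,          # host name as registered in Zabbix
        "-k", key,           # item key
        "-o", str(value),    # value to store
    ])

push_metric("zabbix.example.org", "vm-42", "myapp.requests_per_sec", 117)
```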
Appendix G: Grid'5000 requirements
Grid'5000 is a scientific instrument designed to support experiment-driven research in all areas of computer science related to parallel, large-scale or distributed computing and networking. Supporting over 550 different experimenters a year, it is composed of 11 sites in France and Luxembourg, providing over 1500 nodes and 8500 cores that experimenters can reserve and reconfigure at the bare hardware level, including the capacity to turn nodes on and off. At the networking level, most sites are interconnected with a dedicated 10G network link, with some sites providing Infiniband or Myrinet 10G as a supplement to standard 1G Ethernet. Experiments can be isolated from each other in dedicated VLANs, including nation-wide VLANs.
Monitoring solutions
Grid'5000 distinguishes between monitoring (concerned with service health) and metrology (concerned with measurements for experiments). Monitoring is configured by Puppet for all services deployed through Puppet. It uses Nagios for service status and Munin for tracking the evolution of statistics on some services (disk utilization, for example). Alerts are configured if services do not respond. The network between sites is monitored using SmokePing, but no alerts are sent from SmokePing. Local networks are monitored through Cacti, but no alerts are configured. Node health is checked through g5k-checks at boot time, and regularly by the resource manager. g5k-checks makes sure the properties described for a node (memory, uplink speed, disk size, number of cores, etc.) are those seen at run-time. Grid'5000 uses Monika to display node health. No alerts are issued through that system. Deployment statistics are collected through kstats, and planning is under way to detect unusual patterns in failures, so as to identify nodes that fail at above-average rates.
Monitoring requirements
From the above:
• Service status for all network-reachable services
• Server utilization for servers running services:
o CPU utilisation
o Disk utilisation, IO, latency, throughput
o Network traffic
o Number of processes
• Network properties (ping probes between sites, network throughput)
• Networking equipment (CPU utilisation, main counters)
• Node health (conformance and usage failure rates)
Required metrics
All information you can get is useful for assessing the health of the infrastructure, to detect problems before users do, or to be able to diagnose quickly when problems are reported. Grid'5000 plans to go further and increase monitoring, and would not accept a solution that lowers the surveillance of the health of the infrastructure, as this is key to the reliability of the facility. For nodes, these are the metrics collected in Grid'5000: cpu_nice, mem_cached, pkts_out, disk_total, part_max_used, mem_buffers, cpu_idle, boottime, mem_free, load_five, proc_run, load_one, pkts_in, pdu (energy consumed in watts), swap_free, cpu_num, mem_shared, cpu_user, swap_total, pdu_shared (energy consumed by the PDU this node is connected to), ambient_temp, cpu_speed, bytes_in, cpu_wio, cpu_system, bytes_out, proc_total, disk_free, cpu_aidle, load_fifteen, mem_total. For the network hardware: inbound and outbound traffic on each interface (in bits/s and packets) and CPU utilisation.
Appendix H: FUSECO Playground requirements
The FUSECO Playground experimentation facility [9] is a 3GPP Evolved Packet Core (EPC) centric, independent and open laboratory for mobile broadband communication research and development. It supports 3GPP (e.g. LTE, UMTS) and non-3GPP (e.g. Wi-Fi, WiMAX) technologies. In addition to the FOKUS OpenEPC toolkit [56], the FUSECO Playground also features the Open IMS Core [57], enabling experimenters to gain experience with IMS-based Rich Communication Services (RCS) and Voice over LTE (VoLTE) services, as well as the OpenMTC toolkit [58], supporting an open range of M2M applications. FITeagle [11] (mainly based on the Panlab FIRE-Teagle developments) is an extensible open source experimentation and management framework for federated Future Internet testbeds. It is used to manage and book instances of FUSECO services such as OpenIMSCore-aaS and OpenEPC-aaS (licensed).
Monitoring solution
Two monitoring tools are used to monitor the FUSECO Playground infrastructure and its services. First, Zabbix is used to monitor the infrastructure to ensure its health and performance. In addition to monitoring performance metrics such as CPU, memory, network, disk space, processes and OS-related metrics, it supports custom metrics to monitor whatever users want. Thus, service- and application-related metrics (state, performance) can be measured. Second, the Multi-Hop Packet Tracking tool [10] is a distributed network surveillance solution for detecting the paths of data packets through networks. It enables monitoring of routes, delays, packet loss, the influence of cross traffic, and more. Its visualisation GUI, Netview, is used to track and visualize packet paths and their hop-to-hop characteristics on a world map. Packet tracking enables the following functionalities:
• QoS validation, supporting one of the main functions of EPC, which is QoS guarantee
• Total end-to-end as well as individual network-to-network monitoring
• SLA validation
• Handover validation, performance evaluation and visualization
• Per-hop real-time link quality measurements and security constraint validations
• Flexible measurement granularity adjustment, from single packets up to flows
Monitoring requirements
1. Property: Network status
Monitoring information about the environment and the available access networks will enhance the performance of the Access Network Discovery and Selection Function (ANDSF) in EPC. Many metrics
should be measured, such as signal strength, the number of users/subscribers, the number of active users/subscribers, or operator KPIs.
2. Property: IMS performance
Monitoring information about IMS performance is required, and therefore several KPIs should be measured. Both signalling and media metrics are relevant, such as the number of dropped calls/sessions, packet loss and delay (signalling and media), messaging rate (signalling), bandwidth (media) and jitter (media).
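Among the media metrics listed above, jitter is commonly computed as the RFC 3550 interarrival jitter: a running estimate J updated for each received packet as J += (|D| - J)/16, where D is the change in relative transit time between consecutive packets. The following sketch is a generic illustration of that computation, not part of the FUSECO tooling:

```python
def interarrival_jitter(send_times, arrival_times):
    """RFC 3550 interarrival jitter: J += (|D| - J) / 16, where D is
    the difference in relative transit time of consecutive packets."""
    jitter = 0.0
    prev_transit = None
    for sent, arrived in zip(send_times, arrival_times):
        transit = arrived - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter

# Example: packets sent every 20 ms, arriving with variable network delay.
sent = [0.000, 0.020, 0.040, 0.060]
arrived = [0.050, 0.072, 0.091, 0.115]
# Prints the running jitter estimate, converted to milliseconds.
print(round(interarrival_jitter(sent, arrived) * 1000, 3), "ms")
```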
Appendix I: NITOS requirements

NITOS is a wireless testbed which currently consists of 50 operational wireless nodes, based on commercial Wi-Fi cards and Linux open-source drivers. The testbed is designed to achieve reproducibility of experimentation, while also supporting the evaluation of protocols and applications in real-world settings. The NITOS testbed is deployed on the exterior of the University of Thessaly (UTH) campus building.

Monitoring solution
Resources are monitored through Chassis Manager cards, which report whether a node is currently powered on or off. The experimenter can even force a node to reset or power on/off. NITOS is in the process of developing a tool for monitoring the status of the nodes (checking ssh and telnet service operation); a minimal sketch of such a check is given at the end of this appendix.

Monitoring requirements
More sophisticated methods for monitoring the infrastructure are needed. Right now, all monitoring is left to the administrator of the testbed and no specific tools are used. Regarding experiments, a monitoring framework is under development which will offer the experimenter spectrum analysis during their experiments.

Required metrics
The following list includes the properties and associated metrics that are essential for normal NITOS operation.
1. Normal functionality and utilisation of the NITOS nodes, and accurate detection of any misbehaviour.
This property is essential from the testbed provider's view. It is very important to ensure the regular behaviour of the NITOS nodes and their accurate response to management commands (power on/off, retrieval of features or statistics, etc.). Each NITOS node is equipped with a Chassis Management card which must always be online and is responsible for powering the node on/off. It is also interesting to monitor the usage of the reserved nodes, in order to create user utility profiles that inform the scheduling policies of the NITOS scheduler and maximise testbed utilisation. The related metrics should be:
• Indications of power on/off status from the Chassis Management card.
• Total and used memory of nodes.
• CPU load and utilisation of nodes.
• Traffic produced by the nodes (on both wireless and Ethernet interfaces).
2. Stable wireless connectivity environment, for a given level of outside interference.
This property is essential from both the testbed provider's view and the user's view. It is very important to "build" a stable wireless connectivity environment that is affected only by outside interference, while any other factor that influences this environment should be detected. Examples of such factors are damage to the wireless antennas, or a persistent change of the topology due to unpredicted outside causes (the testbed is outdoors, and a strong wind, for example, may change the antennas' position). It would also be very useful to create a connectivity map of the testbed, as an introduction for experimenters who want to use the facility. The related metrics should be:
• Signal quality and strength for each pair of nodes and each channel.
• Noise level for each wireless interface and each channel.
3. Sensor measurements and environment depiction
This property is essential from the user's view. The Chassis Management card of each NITOS node is equipped with a plethora of sensors (humidity, light, temperature, power consumption, etc.). The sensor measurements of all reserved nodes are available, through a web GUI, to the user who reserved the nodes. The related metrics should include temperature, light, humidity and other sensor measurements.
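As referenced above, the node-status checks NITOS is developing (ssh and telnet service operation) can be approximated by probing the corresponding TCP ports. The sketch below is illustrative only; the node names are hypothetical and the ports are assumed defaults:

```python
import socket

SERVICES = {"ssh": 22, "telnet": 23}  # assumed default ports

def check_node(host, timeout=3.0):
    """Return a dict mapping service name to True/False reachability."""
    status = {}
    for name, port in SERVICES.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[name] = True
        except OSError:
            status[name] = False
    return status

# Hypothetical node names, for illustration only.
for node in ["node001", "node002"]:
    print(node, check_node(node))
```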
Appendix J: w-iLab.t wireless network testbed requirements

The requirements for both the sensor part and the Wi-Fi part of the w-iLab.t testbed are addressed in this section. Note that there are two w-iLab.t locations: Office and Zwijnaarde. While listed as a separate logical entity in the Fed4FIRE proposal, "w-iLab.t mobile" is not seen as a separate testbed from an administrative and implementation point of view, but as part of w-iLab.t Zwijnaarde. At the moment, different tools are used to manage and operate the two locations (Motelab-based vs. OMF-based). The choice of management tool does not influence the monitoring requirements of the two testbed instances, so they are described in the same section.

Monitoring solution
Three kinds of monitoring solutions can be distinguished:
Facility monitoring: for the w-iLab.t Office, the health of the key facility components can currently only be determined by checking whether the central servers can be reached (either through the web interface, or by using ping/ssh). For w-iLab.t Zwijnaarde, a Google Calendar account is available which indicates whenever there is a power outage. As for the w-iLab.t Office, the health of the central server can be checked by testing whether the web interface is operational or by using ping/ssh. When taking facility monitoring down to the level of the individual nodes, the health (ability to ping/ssh) of the nodes (embedded PCs and sensor nodes) can be checked on the web interface of both testbed locations.
Infrastructure monitoring: for w-iLab.t Zwijnaarde, the power consumption of the nodes is shown on the web interface; it is retrieved from the PDUs (Power Distribution Units) through SNMP, or from the PoE (Power over Ethernet) switches (a minimal sketch of such an SNMP query is given at the end of this appendix). Wireless connectivity between the nodes in w-iLab.t Zwijnaarde can also be visualised on the web interface. More advanced infrastructure monitoring can easily be set up by the user through custom scripts, and the output of this custom logging can also be saved with OML.
Experiment monitoring: in w-iLab.t Zwijnaarde, OML is used for the collection of user-defined experimentation metrics. In the w-iLab.t Office, only measurement data from the sensor nodes can be logged to the central database, using the Motelab framework.

Monitoring requirements
Monitoring is very important. OML already provides a good basic set of functionalities, but the experimenter would also like background information about the testbed, e.g. the amount of interfering activity on the shared wireless medium. At the same time, the goal of the experiment is to produce actual measured results that will be used in publications to characterise, for example, the experimenter's newly
developed resilient multi-hop wireless sensor network protocol. The experimenter therefore wants to store metrics related to the experiment, such as packet error rate or disconnection duration. The following user needs can be identified:
• The experimenter and the facility provider should be able to distinguish errors introduced by the solution under test from errors related to the infrastructure (e.g. an error in the experimenter's algorithm vs. an error caused by a malfunction of the embedded PC forwarding sensor messages to the testbed controller).
• The experimenter should be able to measure time differences between actions on different nodes very accurately, e.g. a Wi-Fi experimenter who wants to measure the uni-directional delay instead of the round-trip time, or the recovery time to restore connectivity after a link failure, handover, etc. Typically these measurements need sub-millisecond accuracy, which requires very solid and accurate clock synchronisation between all nodes in the testbed. The iMinds w-iLab.t testbed relies on PTPd to provide this functionality.
• Monitoring of the interference in a network is also important in sensor networks; interference may also be caused by nodes that are not part of the testbed. Without a view on the wireless activity in a testbed unrelated to the experimenter's solution, it is hard to say anything about the performance of the solution under test.
• The facility provider should be made aware of issues with the installation. Detecting a failing node is easy; detecting a failing sensor is harder, and so is detecting a failing network interface (or a loose antenna, etc.). Yet detecting these issues is important.
An additional requirement can be identified when focusing on the aspect of facility federation: to make measurements comparable in a federation context, there should be a clear understanding of what the different tools really measure, and of how the measurement impacts the experiment.

Required metrics
Metrics which are important to the testbed provider of w-iLab.t (both Office and Zwijnaarde):
• Metrics regarding the embedded PC (which hosts all peripheral devices, such as sensor nodes):
o CPU (load/idle time), plus temperature
o Memory (total/used/free)
o Hard disk (total/used/free)
§ Number of times the hard drive was re-imaged (w-iLab.t Zwijnaarde only, OMF load), plus the number of failed attempts to re-image
§ Monitoring through S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology, currently integrated into most hard drives) could indicate a malfunctioning hard drive, from the number of faulty writes, before it breaks down completely (see the sketch after the metric lists below)
o Powered on/off
§ Power consumption when on
o Control interface status
§ Ability to ping/ssh/telnet to nodes
§ Saturation of the control interface (e.g. eth0 up: 60%, down: 1%)
o Active interfaces (with the amount of traffic received/transmitted)
§ Wired
§ Wireless (with logging of the channel used and the noise floor)
• Bandwidth usage could indicate a loose antenna or a bad wireless card
o (All) present peripheral (USB) devices:
§ For example, in w-iLab.t Zwijnaarde:
• Environment Emulator
• Tmote Sky or RM090 sensor nodes
• Webcam
• Bluetooth dongle
• Imec sensing engine
• Metrics regarding the sensor nodes:
o Number of times the node has been re-imaged
§ Failed attempts
o Statistics of the topologies used
§ Which combinations of nodes are used in the same experiment?
• Metrics that are important to the experimenters:
o Important metrics that can be re-used from the testbed provider:
§ Node status (available/in use)
§ Status of peripheral devices (such as a sensor node or webcam)
§ Commonly used sensor/Wi-Fi node topologies
o Monitoring of the wireless interference/wireless activity
§ Using the Imec sensing engines
o Monitoring of the wireless links between each pair of nodes
§ Generate full topology maps with, for each pair of nodes, an indication of:
• Received signal strength
• Throughput (Iperf)
• Noise floor
§ All of the above with different settings:
• Transmit power
• Wireless channel (on all a/b/g/n bands)
§ These topology maps should be regenerated periodically to account for changes in the environment of any kind.
The metrics required by Wi-Fi experimenters are:
o Free memory, CPU load and utilisation
o Network metrics: network speed, topology detection, reachability, signal quality, signal strength, noise level, interference, data transfer rates, Radio Frequency (RF) quality, throughput, available bandwidth
o Packet arrival rate, packet loss, delay, jitter
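The S.M.A.R.T. monitoring mentioned in the provider metrics above could, for instance, be implemented by shelling out to the smartctl utility (from smartmontools) and extracting the raw value of an attribute of interest. The sketch below is an assumption-laden illustration: the device path is hypothetical, and the simple last-column parse only suits plain integer attributes such as reallocated sector counts:

```python
import subprocess

def smart_raw_value(device, attribute="Reallocated_Sector_Ct"):
    """Return the raw value of a S.M.A.R.T. attribute, or None if absent.

    `smartctl -A` prints the attribute table; for simple integer
    attributes the raw value is the last column of the matching row.
    """
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if attribute in line:
            return int(line.split()[-1])
    return None

# Hypothetical device path, for illustration only.
print(smart_raw_value("/dev/sda"))
```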
In the iMinds w-iLab.t, the sensor boards are flashed by a separate node (an embedded PC) which is physically connected to the sensor network board. The following metrics would be interesting when monitoring this node:
o Memory: free memory
o CPU: CPU load
o Board temperature
o Saturation of the control interface (e.g. eth0 up: 60%, down: 1%)
o Some insight into the physical disks via S.M.A.R.T. (faulty writes; see the sketch above) would be welcome, as disk health has a big impact on performance and is hard to debug based on other metrics
o Number of received/transmitted packets and noise floor of the Wi-Fi cards
o Channels used on the Wi-Fi cards
Regarding the sensor node itself, the following monitoring metrics are of interest:
• The number of times the node has been flashed, and the number of failed attempts to flash it
• Statistics of the topologies that are used
Regarding the wireless medium, the following monitoring metrics are of interest:
• This is where the CREW project comes in, with its sensing engines.
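As noted in the infrastructure monitoring description above, per-node power readings are retrieved from the PDUs through SNMP. A minimal sketch of such a query, using the pysnmp library's synchronous high-level API, follows. The PDU host, community string and OID are assumptions: power-per-outlet OIDs are vendor-specific and must be taken from the PDU's MIB:

```python
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

# Hypothetical, vendor-specific OID for the power (W) drawn on one outlet;
# consult the PDU's MIB for the real one.
OUTLET_POWER_OID = "1.3.6.1.4.1.99999.1.1.1"

def read_outlet_power(pdu_host, oid=OUTLET_POWER_OID, community="public"):
    """Issue a single SNMP GET and return the value, or raise on error."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community),
        UdpTransportTarget((pdu_host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity(oid))))
    if error_indication or error_status:
        raise RuntimeError(str(error_indication or error_status))
    return var_binds[0][1]

print(read_outlet_power("pdu1.example.net"))  # hypothetical PDU host name
```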
Appendix K: NETMODE requirements

The NETMODE testbed is comprised of wireless nodes equipped with 802.11 a/b/g/n interfaces.

Monitoring solutions
There are two types of monitoring solutions in the NETMODE testbed:
• Infrastructure monitoring. This type of monitoring covers the health of the nodes (UP/DOWN) and the services running on them (ssh, telnet). The monitoring tool used to gather this information is Nagios.
• Experimenter's measurements. This type of monitoring is based on the OML library, which the experimenter can use to collect statistics (e.g. bandwidth, delay, packet loss) for their experiments (a minimal OML sketch is given after the metric list below).

Monitoring requirements
The infrastructure should be able to provide experimenters with more detailed monitoring information. In the context of a wireless testbed like NETMODE, metrics such as signal strength and noise level, per node pair, would be valuable to the experimenter when defining, collecting and evaluating experimental results.

Required metrics
The following metrics gather monitoring information valuable to both the facility owner and the experimenter:
• Node status (UP/DOWN)
• Service status (ssh, telnet)
• CPU utilisation
• Memory utilisation
• Hard disk usage
• Bandwidth utilisation (on both wireless and wired interfaces)
• Signal strength for each node pair and wireless interface
• Noise level for each node pair and wireless interface
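As referenced above, reporting experimenter measurements through OML amounts to declaring a measurement point (a named group of typed metrics) and injecting samples into it. The sketch below uses the oml4py client binding; the application name, measurement point schema, node name and collection URI are illustrative assumptions:

```python
import oml4py

# Illustrative names: application "netmode_probe", experiment domain
# "netmode_demo", sender "node01", and an assumed OML server address.
oml = oml4py.OMLBase("netmode_probe", "netmode_demo", "node01",
                     "tcp:oml.example.net:3003")

# One measurement point grouping the per-sample link statistics.
oml.addmp("link_stats", "bandwidth_kbps:long delay_ms:double loss_pct:double")
oml.start()

# In a real tool these values would come from the measurement itself.
oml.inject("link_stats", [5400, 12.3, 0.4])
oml.close()
```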
Appendix L: SmartSantander requirements

At present, the Santander facility [15] consists of more than 2,000 IEEE 802.15.4 devices, most of them supporting both experimentation and service provision. In the future, the deployment will consist of around 12,000 IoT devices. It is important to highlight that the SmartSantander facility is not only meant for IoT experimentation, but also provides real-world Smart City services.

Monitoring solutions
Two facility monitoring processes are performed dynamically by the Management and Fault-Monitoring Subsystem implemented within the SmartSantander platform: resource discovery and resource monitoring. The resource discovery process involves detecting new IoT resources in the testbed, registering them for use, and recording the resource descriptions using standard data models to facilitate the query and retrieval of these resources. The resource monitoring process concerns the dependability of the testbed platform, i.e. its robustness with respect to software component or equipment failure. Under normal operation, the SmartSantander platform is in a constant state of flux: new sensor devices are detected and registered with the platform, and the context status parameters (battery level, availability status, link quality) of existing devices change while they run experiments, which generate experiment traces or sensor data streams and execute data transformation functions. Ensuring the correct execution of the IoT testbed's services in the face of such dynamicity, and ensuring the testbed's resilience to failures, therefore requires continuous monitoring of the state of its IoT resources.

Monitoring requirements
The priority of the monitoring requirements is low. As previously mentioned, some metrics are already monitored to ensure the correct behaviour of the network, such as device status (including battery level), measurement periodicity and link quality. Nevertheless, some new metrics are not monitored at present but could be interesting, such as whether the measurements are within an expected range, the link quality within the mesh network, or the number of users subscribed to the observations of a particular wireless sensor node. Exporting the already available metrics through OML would be valuable to the experimenter. In addition, some SmartSantander experiments rely on mobile sensors and participatory sensing using users' mobile phones; in this case the number of nodes varies, and their position is retrieved whenever possible. From an experimenter's point of view, the required metrics could be device status, measurement frequency and link quality. An experimenter should also be aware of each node's capabilities and geo-localisation information (GPS position).
Required metrics
1. Currently available metrics
• Device status: shows whether or not a node is alive and sending sensor observations (online/offline).
• Battery status: shows the relative level, as a percentage of the maximum capacity of the battery.
• Measurement frequency: the time between sensor observations that a node is currently configured to use.
• Link quality: the percentage of the expected service frames that have not been lost across the multi-hop network. A value of 100% means that no frames have been lost; a value of 0% means that the node is currently offline.
• Position: the GPS position of the non-fixed nodes.
2. Desirable metrics
• Measurement validity: shows whether the observations being sent by a node are in the valid range of the sensor. For example, a temperature sensor measuring -15°C in Santander is malfunctioning and should be fixed or replaced.
• Users subscribed: the number of experimenters/users subscribed to each node's observations.
3. Other information that needs to be tracked and shown to an experimenter
• Capabilities: the types of sensor observations that a node can provide. This is usually fixed for the complete life of the node, although new sensors can be connected to the node.
• Geo-localisation information: the SmartSantander wireless sensor network is a real field deployment with both fixed and mobile nodes, so the position where a node actually is at any moment is valuable information for an experimenter.
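The "measurement validity" metric above is essentially a range check against each sensor's specification. A minimal sketch follows; the valid ranges are purely illustrative assumptions:

```python
# Hypothetical valid ranges per sensor type (units noted in comments).
VALID_RANGES = {
    "temperature": (-10.0, 50.0),  # degrees Celsius, plausible for Santander
    "humidity": (0.0, 100.0),      # percent relative humidity
    "light": (0.0, 100000.0),      # lux
}

def measurement_valid(sensor_type, value):
    """Flag observations outside the sensor's assumed valid range."""
    low, high = VALID_RANGES[sensor_type]
    return low <= value <= high

print(measurement_valid("temperature", -15.0))  # False: sensor likely broken
```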
Appendix M: NICTA requirements

NICTA maintains a testbed (NOrbit) of about 40 wireless nodes. These machines are equipped with one or two 802.11 wireless cards, and have wired connectivity to two Ethernet networks. One of these is reserved for control traffic (management and measurement), while the other can be used for specific experiments. There are plans to install OpenFlow equipment on the experimental Ethernet link to support finer-grained setup of experiments. The testbed is controlled through OMF (5.3, 5.4 and 6). All experimental measurement is done through OML, with a collection server collocated with the OML AM (and default EC).
Monitoring solutions
The status of the nodes is monitored through an ad hoc solution which both queries the chassis management cards of the machines for power control, and attempts to contact various services (such as telnet or ssh) to obtain more information about their operating system state (e.g., ready or PXE). Though Zabbix was experimented with for more fine-grained infrastructure monitoring, this solution is not currently in use.
During the course of an experiment, the experimenter can deploy OML-instrumented tools to provide more information about the nodes and the network. The currently available tools cover: system status such as load, memory or users (SIGAR/nmetrics wrapper, or collectd), active network testing (OTG, Iperf and D-ITG), passive network and spectrum monitoring (a wrapper for libtrace, with radiotap support for the latter, and a wrapper for wpa_supplicant), node reachability (a wrapper for ping), and mobile nodes' location from GPS fixes, when available (gpsd client; a minimal sketch of reading such fixes is given at the end of this appendix).

Monitoring requirements
The requirements on node monitoring are low at the moment, as experimenters can instantiate all required monitoring tools as part of the experiment. Since these measurements are of potential interest to many users, they could eventually be provided by the facility as part of an infrastructure monitoring service.
When running an experiment, the user would like to know the characteristics of the wireless environment during that run. For example, a user running a wireless experiment on channel 6 in the 2.4 GHz band would like information about all signals occupying a given frequency band around that channel during that experiment. This information allows i) the identification of potential factors that might have affected the experiment results (e.g. adjacent-channel interference), and ii) a sound comparison of the results of this experiment run with previous ones run under similar 'environment' conditions. More specifically, for a given time window around the
experiment, the user would like to know the frequency (Hz) against occupancy (% of time) and/or power/signal level (dBm).
1. Wireless Connectivity Measurement
Prior to or during the run of an experiment, the user would like to know the connectivity characteristics of all the resources involved in the experiment. This information allows i) the selection of which resources to use, based on average connectivity properties, and ii) the identification of potential factors affecting the experiment results (e.g. highly variable packet delay between A and B correlated with a highly variable SNR between A and B). More concretely, this encompasses the following information, for a given time window prior to or during the experiment and for each pair of resources available in the testbed: RSSI (received signal strength indicator, in arbitrary units) for received frames, frame loss, retries and error rate (%), and SNR (dB).
2. Wireless Device Location
Prior to or during the run of an experiment, the user would like to know the absolute or relative (to a given reference) geographical position of the resources involved in their experiment. This information allows the correlation of observed results with the spatial context of each resource. More concretely, this encompasses the following information, for a given time window prior to or during the experiment and for each resource: x, y, z coordinates (with absolute or relative reference and unit information).
3. Device Energy Consumption
During the run of an experiment, the user would like to know the energy consumed by the resources involved in their experiment. This information allows the correlation of observed results with the energy cost of the system being studied (e.g. a new routing scheme). More concretely, this encompasses the following information, for a given time window during the experiment and for each resource: consumed energy (in mWh, or equivalently average power in mW over the window).
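As referenced above, node positions are obtained from GPS fixes through a gpsd client. gpsd exposes a JSON report stream on TCP port 2947; the sketch below enables watching and collects a few TPV (time-position-velocity) reports. Error handling is omitted, and the defaults assume a locally running gpsd:

```python
import json
import socket

def read_fixes(host="localhost", port=2947, max_fixes=5):
    """Read up to max_fixes TPV reports from a running gpsd instance."""
    with socket.create_connection((host, port)) as sock:
        f = sock.makefile("rw")
        # Ask gpsd to stream reports as JSON.
        f.write('?WATCH={"enable":true,"json":true}\n')
        f.flush()
        fixes = []
        for line in f:
            report = json.loads(line)
            if report.get("class") == "TPV" and "lat" in report:
                fixes.append((report.get("time"),
                              report["lat"], report["lon"],
                              report.get("alt")))
                if len(fixes) >= max_fixes:
                    break
        return fixes

print(read_fixes())
```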