D6.1 Detailed Specifications for First Cycle Ready
Project Acronym: Fed4FIRE
Project Title: Federation for FIRE
Instrument: Large scale integrating project (IP)
Call identifier: FP7-ICT-2011-8
Project number: 318389
Project website: www.fed4fire.eu

Work package: WP6
Task: Task 6.1, Task 6.2
Due date: 28/02/2013
Submission date: 07/03/2013
Deliverable lead: Max Ott (NICTA)
Version: 1.0
Authors: Olivier Mehani (NICTA), Guillaume Jourjon (NICTA), Yahya Al-Hazmi (TUB), Wim Vandenberghe (iMinds), Donatos Stavropoulos (UTH), Jorge Lanza (UC), Kostas Choumas (UTH), Luis Sanchez (UC), Pablo Sotres (UC)
Reviewers: Kostas Kavoussanakis (EPCC), Steve Taylor (IT Innovation)

© Copyright NICTA and other members of the Fed4FIRE consortium, 2013

Abstract

This document provides an overview of the requirements of the testbeds involved in Fed4FIRE in terms of monitoring and measurements. It then finds commonalities, and presents implementation steps for the first development cycle of the project in terms of the monitoring and measurement aspects of the Fed4FIRE federation.

Keywords: Report, Deliverable, Measurement, OML, TopHat, WP6

Nature of the deliverable: R (Report)
Dissemination level: PU (Public)

Disclaimer

The information, documentation and figures available in this deliverable are written by the Fed4FIRE (Federation for FIRE) project consortium under EC co-financing contract FP7-ICT-318389 and do not necessarily reflect the views of the European Commission. The European Commission is not liable for any use that may be made of the information contained herein.

Executive Summary

This deliverable presents the first cycle implementation steps for Fed4FIRE Work Package 6. After recalling the objectives set in previous Fed4FIRE deliverables, it collects input from the involved testbeds in terms of current deployment status, supported measurement infrastructures, and requirements. This input is consolidated in order to inform the creation of a detailed implementation design for the first cycle. Next to this input from the involved testbeds, the design also relies on a survey and analysis of the state of the art in tools for data acquisition, collection and reporting.
These different inputs led to the insight that the most widespread commonality is the use of OML as a collection and reporting framework. OML allows any sort of system to be instrumented through the abstraction of a measurement point, which describes a group of metrics. This abstraction allows more latitude in the choice of measurement tools: as long as they conform to the same measurement point for the same information, their output can be used interchangeably in further analysis. Selecting OML for reporting purposes therefore allows flexibility in the choice of measurement tools, both for monitoring and measurement tasks, as well as a unified way to access the collected data.

OML however only caters for collection and storage; it does not directly provide access to or visualisation of the data, let alone from a federated environment. Another tool therefore needs to be identified for this purpose. Of the tools surveyed in this document, TopHat fits the bill thanks to its ability to run queries over distributed systems and data stores, and its pre-existing deployments.

Facility and infrastructure monitoring tasks require specific metrics about the testbed and its nodes to be available at all times. While some deployments already have solutions in place, the recommended options for the others are, in order of preference, Zabbix, Nagios or collectd.

Overall, this leads to the measurement architecture shown in Figure 1. Essentially, all testbeds will be required to deploy OML and TopHat, for measurement collection and federated access, respectively. With the aim of limiting the impact on deployed solutions, monitoring and measurement tools already in use will not be superseded, but rather adapted to be included in the proposed architecture. For cases where a requirement is not met, default solutions are prescribed.

Figure 1: Proposed cycle 1 measurement architecture for Fed4FIRE. Elements in bold are the default proposal for new deployments on a canonical testbed.

In order to implement this design, seven implementation steps have been identified, as presented in Table 1. Most steps (service deployment and application instrumentation) need to be undertaken independently by all participants. Where commonalities exist (e.g. Zabbix and Nagios), instrumentation should be a common effort. To support the instrumentation task, NICTA will provide and curate a clearinghouse of homogenised OML measurement point schemas. The goal is to ease the integration of new applications while maintaining the highest level of data exchangeability and interoperability between measurement tools providing similar information.

A TDMI agent will also be written to allow queries to the OML storage from TopHat.
Functional element: Facility and infrastructure monitoring
Implementation strategy:
• Deploy Nagios and/or Zabbix and/or collectd if not yet available (all participants)
• Instrument the relevant measurement systems (all participants, with support from NICTA)

Functional element: Experiment measurement
Implementation strategy:
• Deploy OML if not yet available (all participants)
• Instrument relevant measurement systems (all participants, with support from NICTA)
• Maintain clearinghouse of measurement points (NICTA)

Functional element: Data access
Implementation strategy:
• Deploy TopHat (all participants)
• Make OML measurement databases accessible to TopHat (NICTA, UPMC)

Table 1: Implementation strategy of functional elements

Acronyms and Abbreviations

API   Application Programming Interface
CPU   Central Processing Unit
CREW   Cognitive Radio Experimentation World (FP7 IP project)
DWDM   Dense Wavelength Division Multiplexing
EC   Experiment Controller
EPC   Evolved Packet Core
FLS   First Level Support
FTP   File Transfer Protocol
GENI   Global Environment for Network Innovations
GPS   Global Positioning System
HDD   Hard Disk Drive
I/O   Input/Output
IETF   Internet Engineering Task Force
IoT   Internet of Things
IP   Internet Protocol
IPMI   Intelligent Platform Management Interface
FIRE   Future Internet Research and Experimentation
KPI   Key Performance Indicator
MAC   Media Access Control
MIB   Management Information Base
MP   Measurement Point
MS   Measurement Stream
MTU   Maximum Transmission Unit
NRPE   Nagios Remote Plugin Executor
OCF   OFELIA Control Framework
OML   Measurement Library: an instrumentation system allowing for remote collection of any software-produced metrics, with in-line filtering and multiple SQL back-ends
OMF   cOntrol and Management Framework: a testbed management framework
OS   Operating System
PDU   Power Distribution Unit
PoE   Power over Ethernet
PTPd   Precision Time Protocol daemon
QoS   Quality of Service
RCS   Rich Communication Services
RF   Radio Frequency
S.M.A.R.T.   Self-Monitoring, Analysis and Reporting Technology
SLA   Service Level Agreement
SNMP   Simple Network Management Protocol
SQL   Structured Query Language
SSH   Secure Shell
TCP   Transmission Control Protocol
TDMI   TopHat Dedicated Measurement Infrastructure
UDP   User Datagram Protocol
USB   Universal Serial Bus
VoIP   Voice over IP
VoLTE   Voice over LTE
VM   Virtual Machine

Table of Contents

1   INTRODUCTION  ................................................................................................................................
11   2   INPUTS  TO  THIS  DELIVERABLE  ..........................................................................................................  12   2.1   ARCHITECTURE  (D2.1)  ..........................................................................................................................  12   2.2   REQUIREMENTS  ADDRESSED  BY  THE  ARCHITECTURE  .....................................................................................  13   2.2.1   Generic  requirements  of  a  FIRE  federation  (D2.1)  ..................................................................  13   2.2.2   Requirements  from  a  sustainability  point  of  view  (D2.1)  .......................................................  14   2.2.3   High  priority  requirements  of  the  infrastructure  community  (D3.1)  ......................................  14   2.2.4   High  priority  requirements  of  the  services  community  (D4.1)  ................................................  15   2.2.5   High  priority  requirements  of  shared  support  services  (D8.1)  ................................................  15   2.3   ADDITIONAL  WP6  REQUIREMENTS  ..........................................................................................................  16   2.3.1   Generic  requirements  .............................................................................................................  16   2.3.2   Monitoring  and  measurement  metrics  overview  ...................................................................  17   2.3.3   Consolidated  summary  of  testbeds’  inputs  .............................................................................  19   3   IMPLEMENTATION  OF  THE  ARCHITECTURAL  FUNCTIONAL  ELEMENTS  ..............................................  23   3.1   INTRODUCTION  ....................................................................................................................................  23   3.2   EVALUATION  FOR  POSSIBLE  APPROACHES  FOR  IMPLEMENTATION  ...................................................................  23   3.2.1   Data  acquisition  ......................................................................................................................  23   3.2.2   Collection  and  Reporting  ........................................................................................................  29   3.2.3   Summary  ................................................................................................................................  31   3.3   DETAILS  OF  THE  SELECTED  MONITORING  AND  MEASURING  TOOLS  IN  FED4FIRE  ................................................  31   3.3.1   Monitoring  (Zabbix,  Nagios)  ...................................................................................................  33   3.3.2   Collection  and  Reporting  (OML)  .............................................................................................  33   3.3.3   Federated  Queries  (TopHat)  ...................................................................................................  34   3.4   IMPLEMENTATION  STEPS  .......................................................................................................................  35   3.4.1   Installation  of  New  Tools  ........................................................................................................  
35   3.4.2   Adaptation  of  Existing  Tools  ...................................................................................................  35   3.4.3   Coordination  ...........................................................................................................................  36   4   SUMMARY  .......................................................................................................................................  37   4.1   4.2   MAPPING  OF  ARCHITECTURE  TO  IMPLEMENTATION  PLAN  .............................................................................  37   DEVIATION  OF  SUPPORTED  REQUIREMENTS  COMPARED  TO  D2.1  ...................................................................  38   REFERENCES  ............................................................................................................................................  39   APPENDIX  A:  PLANETLAB  EUROPE  REQUIREMENTS  .................................................................................  42   APPENDIX  B:  VIRTUAL  WALL  REQUIREMENTS  .........................................................................................  45   APPENDIX  C:  OFELIA  (OPENFLOW  IN  EUROPE  LINKING  INFRASTRUCTURE  AND  APPLICATIONS)  CONTROL   FRAMEWORK  REQUIREMENTS  .....................................................................................................................  47   APPENDIX  D:  OPTICAL  TESTBED  –  EXPERIMENTA  DWDM  RING  TESTBED  REQUIREMENTS  .......................  49   APPENDIX  E:  OPTICAL  TESTBED  –  OPENFLOW-­‐ENABLED  ADVA  ROADMS  TESTBED  REQUIREMENTS  ........  51   APPENDIX  F:  BONFIRE  REQUIREMENTS  ...................................................................................................  52   APPENDIX  G:  GRID’5000  REQUIREMENTS  ................................................................................................  56   APPENDIX  H:  FUSECO  PLAYGROUND  REQUIREMENTS  .............................................................................  58   9  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   APPENDIX  I:  NITOS  REQUIREMENTS  ........................................................................................................  60   APPENDIX  J:  W-­‐ILAB.T  WIRELESS  NETWORK  TESTBED  REQUIREMENTS  ...................................................  62   APPENDIX  K:  NETMODE  REQUIREMENTS  ................................................................................................  66   APPENDIX  L:  REQUIREMENTS  FROM  SMARTSANTANDER  ........................................................................  67   APPENDIX  M:  NICTA  REQUIREMENTS  ......................................................................................................  69     10  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   1 Introduction     This  deliverable  details  the  specifications  regarding  cycle  1  development  in  WP6,  based  on  the  cycle  1   architecture  described  in  D2.1  “First  Federation  Architecture”.  These  specifications  cover  the  details   that  pave  the  way  for  the  actual  implementations  in  terms  of  tools  deployment  and/or  adaptation,  as   well   as   coordination   efforts   in   terms   of   data   format   harmonization.   This   document   is   structured   as   follows.   
Section   2   summarizes   the   specific   constraints   that   WP6   has   to   operate   within   when   defining   the   specifications   for   the   first   cycle   of   Fed4FIRE.   This   is   achieved   by   reviving   relevant   information   from   previous   Fed4FIRE   deliverables,   and   by   presenting   the   related   current   state   of   deployment   and   specific   requirements   collected   from   each   of   the   involved   Fed4FIRE   testbeds.   Note   that   section   2   presents  a  consolidated  view,  while  the  individual  per-­‐testbed  information  can  be  found  in  detail  in   the   different   appendices.   Based   on   this   information   gathered   in   section   2,   section   3   can   evaluate   possible  approaches  for  implementation.  The  outcome  of  this  evaluation  is  a  more  detailed  design  of   the  measurement  and  monitoring  components  of  the  first  Fed4FIRE  development  cycle  (these  will  be   refined  in  cycle  2  and  3).  The  last  part  of  section  3  defines  the  required  implementation  steps  for  the   adoption  of  this  design.  Finally,  section  4  concludes  this  deliverable  with  an  appropriate  summary.       11  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   2 Inputs  to  this  deliverable     In   this   section,   a   specific   selection   of   information   is   revived   that   originates   from   several   previously   submitted   Fed4FIRE   deliverables.   The   goal   of   this   exercise   is   to   summarize   the   specific   constraints   that  WP6  has  to  operate  within  when  defining  the  specifications  for  the  first  cycle  of  Fed4FIRE.       2.1 Architecture  (D2.1)     Fed4FIRE   identified   the   following   types   of   monitoring   and   measurement   (Figure   2)   in   D2.1   “First   Federation  Architecture”:     • “Facility   monitoring:   this   monitoring   is   used   in   the   first   level   support   to   see   if   the   testbed   facilities   are   still   up   and   running.   The   most   straight   forward   way   for   this,   is   that   there   is   a   common   distributed   tool   which   monitors   each   facility   (Zabbix,   Nagios   or   similar   tools).   The   interface  on  top  of  this  facility  monitoring  should  be  the  same  and  will  further  be  specified  in   WP6   (it   seems   in   this   case   more   straightforward   to   use   all   the   same   monitoring   tool,   then   to   define  and  implement  new  interfaces).”   • “Infrastructure   monitoring:   this   is   monitoring   of   the   infrastructure   which   is   useful   for   experimenters.   E.g.   monitoring   of   switch   traffic,   wireless   spectrum   or   physical   host   performance   if   the   experimenter   uses   virtual   machines.   This   should   be   provided   by   the   testbed   provider   (an   experimenter   has   e.g.   no   access   to   the   physical   host   if   he   uses   virtual   machines)  and  as  such  a  common  interface  would  be  very  handy,  but  is  not  existing  today.”   • “Experiment   measuring:   measurements   which   are   done   by   a   framework   that   the   experimenter   uses   and   which   can   be   deployed   by   the   experimenter   itself   on   his   testbed   resources   in   his   experiment.   
In the figure one can see two experiment measuring frameworks each with its own interfaces (and thus experimenter tools). Of course, a testbed provider can ease this by providing e.g. OS images with certain frameworks pre-deployed.”

This deliverable also stated that, in the first cycle of Fed4FIRE, facility monitoring will be rolled out on all testbeds. Infrastructure monitoring and experiment measuring are to be further discussed in WP6.

[Figure: architecture diagram showing central facility monitoring (first level support), central directories, brokers and identity providers, and per-testbed facility and infrastructure monitoring alongside discovery, reservation and provisioning.]

Figure 2: Monitoring and measurement architecture for cycle 1

2.2 Requirements addressed by the architecture

This section recalls the requirements relevant to WP6 set forth in D2.1, D3.1, D4.1 and D8.1.

2.2.1 Generic requirements of a FIRE federation (D2.1)

• Support:
  o How easily can components/testbeds/software be upgraded?
    • For this, the APIs should be versioned, and tools and testbeds should support 2 or 3 versions at the same time, so that all components can be gradually upgraded
  o How can different versions of protocols be supported? (e.g. upgrade of RSpec)
    • With versions
• Experimenter ease of use:
  o DoW: The final goal is to make it easier for experimenters to use all kinds of testbeds and tools. If an experimenter wants to access resources on multiple testbeds, this should be possible from a single experimenter tool environment.
    • This is possible, but Fed4FIRE should also aim to keep such tools up-to-date during the lifetime of the project and set up a body which can further define the APIs, also after the project.

2.2.2 Requirements from a sustainability point of view (D2.1)

• From a sustainability point of view, it is preferential that the number of required central components is minimal, as these components put a high risk on a federation in terms of scalability and long-term operability.
  o Yes, there is no central component which is needed in order to use the testbeds. Of course, the central portal, identity provider, testbed directory, tool directory and certificate directory ease the use of the federation, as you have all the information in a single place to get new experimenters to use testbeds and tools.
It  is  also  required  that  the  federation  framework  supports  the  joining  and  leaving  of  testbeds   very  easily,  as  this  will  be  common  practice.   o The   architecture   supports   multiple   identity   providers/portals.   There   is   a   common   API  for  discovery,  requirements,  reservation  and  provisioning  while  it  imposes  no   restrictions  on  the  use  of  specific  experiment  control,  monitoring  and  storage.  The   common   API   makes   it   straight   forward   to   add   new   tools   and   testbeds   while   a   testbed  can  be  an  extra  identity  provider  also.     2.2.3 High  priority  requirements  of  the  infrastructure  community  (D3.1)     Federation  aspect   Req.  ID   Req.  statement   Remark   Monitoring   I.2.101   Measurement  support   framework   Monitoring  resources   for  operational   support   Monitoring  resources   for  suitable  resource   selection  and   measurement   interpretation   Monitoring   I.2.104   Monitoring   I.2.105   Monitoring   I.2.106   Permanent  storage   I.2.201   Minimal  impact  of   monitoring  and   measuring    tools   Data  storage   Permanent  storage   I.2.202   Data  security   Permanent  storage   I.2.203   Interconnectivity   I.4.005   Stored  experiment   configuration   IPv6  support   Architecture  facilitates  the  use  of   measurement  framework   Architecture  facilitates  the  use  of  such   monitoring  framework   The  architecture  makes  a  distinction   between  facility  monitoring,   infrastructure  monitoring  and  experiment   monitoring.  All  three  are  supported  by  the   architecture,  but  should  be  worked  out  in   WP6.  This  requirement  seems  to  be  more   a  WP6  requirement.   No  architecture  requirement   Doing  this  in  a  structured  way  will  be   tackled  in  cycle  2  or  3   Doing  this  in  a  structured  way  will  be   tackled  in  cycle  2  or  3   Doing  this  in  a  structured  way  will  be   tackled  in  cycle  2  or  3   In  cycle  1,  some  testbed  resources  will  be   reachable  through  IPv6.  The  architecture   can  cope  with  this,  e.g.  through  DNS   names  which  resolve  to  IPv6     14  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   2.2.4   High  priority  requirements  of  the  services  community  (D4.1)   Field   Monitoring   Req.  ID   Req.  statement   Remark   ST.2.001   Monitoring   management  control   From  an  architectural  viewpoint,  the   facility,  infrastructure  and  experiment   monitoring  can  cope  with  this.  The  details   should  be  filled  in  by  WP6.   Permanent  Storage   ST.2.007   Monitoring  data  and   experiment  result   storage.  Audit,   archiving,   accountability   Access  between   testbeds   (interconnectivity)   Not  yet  foreseen  in  the  architecture.   Interconnectivity   ST.4.001   Interconnectivity  in  a  structured  way  is   not  tackled  in  cycle  1       2.2.5 High  priority  requirements  of  shared  support  services  (D8.1)     Req.  
ID   Description   FLS.1   Facility  monitoring  should  push  RAG  (Red,  Amber,  Green)  status  to  a  central  dashboard  for   FLS  reactive  monitoring   FLS.2   The  Facility  RAG  status  should  be  based  upon  the  monitoring  of  key  components  of  each   facility  that  indicate  the  overall  availability  status  of  the  facility   FLS.3   The  FLS  should  be  able  to  drill  down  from  the  facility  RAG  status  to  see  which  components   are  degraded  or  down   FLS.4   The  key  components  monitored  for  facility  monitoring,  should  be  standardised  across  all   facilities  as  much  as  possible   FLS.5   A  commitment  is  required  from  each  testbed  to  maintain  the  quality  of  monitoring   information  (FLS  is  “monitoring  the  monitoring”  and  the  information  FLS  has  is  only  as   good  as  the  facility  monitoring  data)   FLS.6   Any  central  federation-­‐level  systems/components  that  may  be  implemented  will  need  to   be  monitored  by  FLS  (e.g.  a  central  directory)   FLS.7   FLS  requires  visibility  of  planned  outages  on  a  push  basis  from  testbeds  and  administrators   of  central  systems   FLS.8   Exception  alerts  from  both  testbeds  and  central  systems  should  be  filtered  prior  to   reaching  the  FLS,  to  avoid  reacting  to  alerts  unnecessarily.       15  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   2.3 Additional  WP6  requirements     This   section   defines   some   additional   requirements   in   the   context   of   WP6.   It   first   introduces   some   generic   requirements,   followed   by   a   consolidated   view   on   the   concrete   requirements   from   each   Fed4FIRE  testbed.  The  corresponding  details  per  testbed  can  be  found  in  the  different  appendices,  in   which  detailed  measurement  and  monitoring  requirement  at  the  infrastructure  level  are  presented.   Information   about   the   status   and   the   availability   of   physical   machines,   capacity   of   each,   and   connectivity  among  them,  are  examples  of  such  important  requirements  from  the  facility  provider’s   viewpoint  to  determine  the  health  and  the  performance  of  their  infrastructures,  and  well  as  from  the   user’s   point   of   view   to   understand   the   operational   conditions   and   the   behaviour   of   the   environment   in  which  his  experiment  is  conducted.       In  addition  to  this,  experimenters’  requirements  and  their  interest  in  specific  monitoring  metrics  are   also   presented.   These   were   collected   from   the   individual   facility   providers   involved   in   Fed4FIRE   based  on  their  experience  and  feedback  received  from  experimenters  who  have  already  used  or  are   currently   using   their   facilities.   Similar,   measurement   and   monitoring   requirements   at   the   services   and  application  levels  such  as  cloud  services  and  sensors  are  also  addressed.  In  order  to  fulfil  services   and  applications’  requirements  in  terms  of  monitoring,  it  is  required  to  identify  a  set  of  metrics  for   the   resources   belonging   to   services   and   applications   to   be   monitored   including   CPU   performance,   disk  I/O  and  network  measurements.  
An  additional  requirement  from  the  services  community  is  the   ability   to   collect   experimenter-­‐specified,   application   specific   metrics.     It   is   also   required   to   ensure   that   the   collection   and   publishing   of   measurements   is   consistent   across   the   various   Fed4FIRE   facilities.       2.3.1 Generic  requirements     Multiple   stakeholders   are   interested   in   monitoring   services:   facility   providers,   experimenters,   and   those   in   the   federation   level.   Their   requirements   differ   from   each   other   depending   on   their   need   for   monitoring.  From  a  facility  provider’s  viewpoint,  rich  measuring  information,  and  understanding  the   performance,   health   and   behavior   of   the   whole   infrastructure   are   needed   for   effective   management   and   optimization   purposes.   Specific   monitoring   information   is   required   at   the   federation   level   to   enable  interoperability  and  compatibility  across  the  federated  facilities.     From   an   experimenter's   viewpoint,   monitoring   experiment   resources   and   collecting   observations   are   an  essential  part  of  any  scientific  evaluation,  or  comparison  of  technologies  or  services  being  studied.   Such   observations   are   not   different   than   monitoring   of   user-­‐defined   experimentation   metrics   (e.g.,   service   metrics,   application   parameters)   of   the   solution   under   test.   Examples   include:   number   of   simultaneous   download   sessions,   number   of   refused   service   requests,   packet   loss   and   spectrum   analysis.     Furthermore,   measurements   data   should   be   provided   to   these   stakeholders   in   different   manners,   including  the  following:   16  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Periodic   monitoring:   resources   are   monitored   on   a   regular   basis   so   that   suitable   measurements   or   monitoring   tools   are   deployed   and   the   data   can   be   provided   either   through  GUIs  or  APIs.   • Single-­‐value   monitoring:   if   only   one-­‐measured-­‐value   is   of   interest,   then   the   required   metrics   to   be   measured   can   be   deployed,   such   as   CPU   usage   of   a   VM,   the   number   of   currently   running   VMs   on   a   specific   physical   machine   that   hosts   a   user’s   VM,   etc.   Only   one-­‐measured-­‐ value  per  metric  is  sent  back  to  the  requester.   • Data  transportation:  suitable  methods  to  transport  data.   • Data   converting:   converters   could   be   needed   to   change   data   from   a   source   format   to   another  one,  therefore  where,  how,  who  is  responsible  for  this  should  be  defined.   • Data   collection:   methods   to   collect   data   and   store   them   into   storage   resources   such   as   collectors,  aggregators  or  repositories.   • Data  viewing:  methods  to  show  measurements  data  such  as  APIs,  GUI  or  visualization.   • Data  visualisation:  measurement  data  should  be  visualised  especially  the  real-­‐time  data.   • Data   harmonization:   measurement   data   is   collected   from   heterogeneous   monitoring   resources  across  federated  infrastructures  that  could  provide  data  in  various  manner  and/or   format.   
Therefore,   there   is   a   need   for   harmonizing   measurement   data   to   be   provided   to   stakeholders  in  a  unified  data  representation  and  in  standard  manners.     The  specific  adoption  of  these  different  aspects  in  a  single  measurement  campaign  depends  on  how   the  measurements  data  is  going  to  be  used.  To  this  end,  different  methodologies  on  how  to  provide   data  are  to  be  addressed.  The  Fed4FIRE  monitoring  and  measurement  framework  presented  in  this   document  should  be  able  to  cover  all  aspects  of  this  monitoring  lifecycle  in  an  efficient  way  with  a   high  level  of  user  satisfaction.       • 2.3.2 Monitoring  and  measurement  metrics  overview     One   very   important   aspect   in   monitoring   and   measuring   is   that   of   the   metrics   that   should   be   collected.  This  requirement  is  varied  from  facility  to  another  (wired,  wireless,  etc.)  amongst  services   and   platforms;   and   experimenters   have   different   interests,   based   on   their   experiments   and   used   resources,   services,   etc.   Because   of   this   heterogeneity,   numerous   metrics   are   to   be   measured   at   multiple   levels:   component   level   (physical   and   virtual),   network   level,   traffic   level,   and   service/software  level.  The  remainder  of  this  section  gives  some  examples  about  metrics  that  could   be  of  importance  (by  facility  and  service  providers  and  experimenters)  to  be  measured  in  different   domains   such   as   cloud   infrastructures,   cloud   services,   wireless   networks   (Wi-­‐Fi   or   cellular   networks),   virtualised   or   non-­‐virtualised   infrastructures.   In   this   deliverable   measurement   metrics   are   classified   in  four  categories:     • component-­‐level  metrics   • network  metrics   • traffic  metrics   • software  metrics       17  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1     2.3.2.1 Component-­‐level  metrics       This   represents   metrics   from   both   virtual   and   physical   (computing,   network   or   storage)   devices.   There   are   many   metrics   to   be   addressed   here   related   to   performance,   storage,   OS,   processes,   including  the  following     • Memory:   total   memory,   used   memory,   free   memory,   total   swap   memory,   used   swap   memory,  free  swap  memory   • Storage:  total  storage,  used  storage,  free  storage   • CPU:  CPU  load,  CPU  utilisation,  CPU  count,  CPU  idle  time,  Cumulative  CPU  time,  Cumulative   CPU  usage,  Number  of  CPUs  used  by  VM  (in  case  of  physical  machines)   • I/O  reads/writes,  the  amount  of  I/O  operations     • OS:  number  of  users,  max  number  of  processes   • Processes:  Number  of  processes,  number  of  running  processes     2.3.2.2 Network  metrics     Network   metrics   qualify   the   static   aspects   of   a   network   deployment.   
They   include   connectivity,   network   speed,   topology   detection,   accounting,   reachability,   signal   quality,   signal   strength,   noise   level,   interference,   data   transfer   rates,   Radio   Frequency   (RF)   quality,   throughput,   available   bandwidth,   utilisation   (bandwidth,   protocol,   ports),   protocol   analysis,   per-­‐node   and   per-­‐channel   statistics,   IP   connections   statistics:   IP   addresses,   ports,   sessions,   supported   clients,   authentication   and  de-­‐authentication  rates.     2.3.2.3 Traffic  metrics       Traffic  metrics  capture  the  more  dynamic  aspects  of  what  is  happening  on  a  network.  Many  are  of   interest  in  traffic  measurements  and  monitoring,  such  as  link  utilisation,  packet  arrival  rate,  packet   loss,   delay,   jitter,   number   of   flows,   flow   type,   flow   volume,   and   traffic   density,   cost,   route   performance,  number  of  (bytes,  packets,  and  flows)  per  time  slot,  VoIP  analysis:  call  flow,  signalling   sessions,  registrations,  media  streams  or  errors.     2.3.2.4 Software  metrics     Software   metrics   provide   information   about   the   applicative   code   running   in   the   node.   This   represents  metrics  which  provide  information  about  the  state  of  the  software  (service  software),  its   performance  and  other  software  specific  information.  This  well  includes  custom  metrics  identified  by   users.   These   metrics   are   heterogeneous   and   vary   depending   on   the   applications   under   study.   Examples  include  memory  consumption,  errors  and  warnings  or  internal  buffers   18  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   2.3.3 Consolidated  summary  of  testbeds’  inputs     The   tables   presented   here   provide   a   consolidated   view   on   the   concrete   requirements   from   each   Fed4FIRE  testbed  deployment.  The  corresponding  details  per  testbed  can  be  found  in  the  different   appendices.   Table   2   presents   the   current   software   deployments,   Table   3   the   requirements   of   each   testbed,   and   whether   they   are   currently   met   or   not.   Table   4   summarizes   the   metrics   of   interest,   and   identifies  commonalities.     Test  bed   PlanetLab  Europe   Virtual  Wall   OFELIA   DWDM  Ring  Testbed   ADVA  ROADM   BonFIRE   Grid'5000   FUSECO   FUSECO   NITOS   Netmode   w-­‐iLab.t   SmartSantander   NICTA   Facility  Monitoring   Node/Flow  Monitoring   Experimental   measurements   Varied;  mainly  Nagios   CoMoN,  TDMI,   Varied;  OML  supported   and  MySlice   PlanetFlow,  MyPLC   Zabbix,  EmuLab,  scripts     Zenoss,  Polling  of  OCF  AM       PlanetFlow     OpenFlow,  Polling  of  OCF  AM   OML  planned   Nagios  for  services,  Zabbix   Zabbix   Nagios,  Munin,  Cacti,   ganglia,  g5k-­‐checks     smokeping,  ganglia,   checks   Zabbix,  SNMP       TUB  Packet  Tracking,  scripts   CM  cards,  scripts     OML   Nagios     OML   PDU’s,  PoE,  scripts   Wi-­‐Fi  connectivity  tool   OML   (OML)       OML   CM  cards,  scripts;   Ad  hoc  tools   OML   Zabbix  supported   Table  2:  Consolidated  data  on  software  tools  from  partners'  input.  Empty  cells  are  assumed  to  be  done   manually.   
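As an illustration of the component-level metrics listed in section 2.3.2.1, the sketch below shows how such values could be sampled on a node with the psutil Python library before being handed to whichever collection tool a testbed already deploys (Table 2). The field names and the sampling period are illustrative only, not a prescribed Fed4FIRE schema.

```python
import time
import psutil  # cross-platform system metrics library

def sample_component_metrics():
    """Collect a subset of the component-level metrics of section 2.3.2.1."""
    mem = psutil.virtual_memory()
    swap = psutil.swap_memory()
    disk = psutil.disk_usage("/")
    return {
        "timestamp": time.time(),
        "mem_total": mem.total, "mem_used": mem.used, "mem_free": mem.free,
        "swap_total": swap.total, "swap_used": swap.used, "swap_free": swap.free,
        "storage_total": disk.total, "storage_used": disk.used, "storage_free": disk.free,
        "cpu_count": psutil.cpu_count(),
        "cpu_utilisation": psutil.cpu_percent(interval=1),  # percent over 1 s
        "process_count": len(psutil.pids()),
    }

if __name__ == "__main__":
    # Periodic monitoring (section 2.3.1): sample every 10 s and hand the record
    # to the local reporting tool (printed here for illustration only).
    while True:
        print(sample_component_metrics())
        time.sleep(10)
```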
19  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1     Requirement   Supported  in   Node  (servers,  switches,   sensors,  etc.)  availability     Node  performance   measurements   Path  measurements  between   nodes   Information  about  the  location   of  nodes   Infrastructure  monitoring  for   facility  providers  and   operations  (health,   performance,  etc.)   Experimentation  metrics   Information  about  the   changeable  network   topologies   Experiment  monitoring  (e.g.   experiment  statistics)   Permanent  storage  for   monitoring  data   Experiment  and  flow  statistics   PlanetLab  Europe,  BonFIRE,   Virtual  Wall,  OFELIA,  Grid’5000,   SmartSantander,  NICTA   PlanetLab  Europe,  BonFIRe,   FUSECO  Playground,  Grid’5000,   Virtual  Wall  (manually  initiated   by  experimenter),  FUSECO   Playground,  NICTA   PlanetLab  Europe,  FUSECO   Playground  (limited),  NICTA   w-­‐iLab.t,     NITOS,  SmartSantander,  w-­‐ iLab.t   OpenFlow-­‐enabled  ADVA   ROADMs,  Grid’5000,  FUSECO   Playground,  w-­‐iLab.t,   SmartSantander   SmartSantander,  NICTA   PlanetLab  Europe,  BonFIRE,   NICTA,  w-­‐iLab.t   PlanetLab  Europe,  BonFIRE,   OFELIA,  NITOS,  w-­‐iLab.t   Grid’5000,  Virtual  Wall,  FUSECO   Playground,  Netmode   BonFIRE,  Virtual  Wall,  Netmode,   PlanetLab  Europe,  w-­‐iLab.t   NICTA     OFELIA,  Experimenta,   OpenFlow-­‐enabled  ADVA   ROADMs,  NITOS,  w-­‐iLab.t   BonFIRE,  NICTA   OFELIA,  Experimenta,   OpenFlow-­‐enabled  ADVA   ROADMs,  NITOS,  w-­‐iLab.t   BonFIRE,  NICTA,  Virtual  Wal   By  all  testbeds     Monitoring  information  to   BonFIRE   cloud  services  experimenters   about  the  physical  machines   hosting  their  virtual  machines     Path  and  network  monitoring     NICTA   Wireless  connectivity     NICTA,  w-­‐iLab.t   Sensor  measurements  and   environment  depiction   Required  by     Experimenta,  OpenFlow-­‐ enabled  ADVA  ROADMs,       FUSECO  Playground   NITOS,  Netmode,    FUSECO   Playground,  NICTA   NITOS,  w-­‐iLab.t,  NICTA   Table  3:  Consolidated  data  on  monitoring  requirements  from  partners'  input.  Some  of  these  are  already   supported  by  some  facilities.   
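Several of the requirements in Table 3, as well as the FLS requirements of section 2.2.5, come down to each testbed exposing a simple availability check for its key components. The sketch below shows the shape such a check could take if exposed through Nagios or NRPE, following the standard plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL); the probed host, port and thresholds are placeholders.

```python
#!/usr/bin/env python
# Minimal Nagios-style availability check; the checked endpoint and thresholds
# are illustrative, a real testbed would probe its own aggregate manager or gateway.
import sys
import socket
import time

HOST, PORT = "am.example-testbed.org", 443   # hypothetical key component
WARN_S, CRIT_S = 1.0, 3.0                    # response-time thresholds (seconds)

def main():
    start = time.time()
    try:
        socket.create_connection((HOST, PORT), timeout=CRIT_S).close()
    except OSError as err:
        print("CRITICAL - %s:%d unreachable (%s)" % (HOST, PORT, err))
        return 2
    elapsed = time.time() - start
    status, code = ("OK", 0) if elapsed < WARN_S else ("WARNING", 1)
    # Performance data after the pipe can be graphed by Nagios add-ons such as PNP.
    print("%s - connected in %.2fs|time=%.3fs" % (status, elapsed, elapsed))
    return code

if __name__ == "__main__":
    sys.exit(main())
```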
20  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Metrics   Supported  in   Device  status:  ON/OFF   Physical  interfaces:  ON/OFF   Virtual  interfaces:  ON/OFF   Port  status   Memory:  total  memory,  used   memory,  free  memory,  total   swap  memory,  used  swap   memory,  free  swap  memory   CPU:  CPU  load,  CPU  utilisation,   and  CPU  idle  time   VMs:     List  of  VMs,  ON/OFF,  CPU:   Total  number  of  CPUs,  Core   speed,  L1/L2/L3  Cache   memory,  total  CPU  utilization,   CPU  utilisation  per  VM   Storage:  Total  capacity,  used   capacity,  free  capacity   Temperature   BonFIRE,  Grid’5000,  FUSECO   Playground,  NITOS,  Netmode,   NICTA,  Virtual  Wall  (admins   only),  w-­‐iLab.t   BonFIRE   BonFIRE     PlanetLab  Europe,  Virtual  Wall,   BonFIRE,  Grid’5000,  FUSECO   Playground,  NICTA   Required  by   OFELIA,  Experimenta   OFELIA,  Experimenta,  w-­‐iLab.t   OFELIA,  Experimenta,  w-­‐iLab.t   OFELIA,  Experimenta,   OpenFlow-­‐enabled  ADVA   ROADMs   OFELIA,  NITOS,  w-­‐iLab.t,   Netmode   PlanetLab  Europe,  Virtual  Wall,   BonFIRE,  Grid’5000,  FUSECO   Playground,  NICTA   BonFIRE   OFELIA,  NITOS,  w-­‐iLab.t,   Netmode   BonFIRE,  NICTA   OFELIA,     OFELIA,  w-­‐iLab.t,   SmartSantander   OpenFlow-­‐enabled  ADVA   ROADMs,  FUSECO  Playground,   w-­‐iLab.t   OpenFlow-­‐enabled  ADVA   ROADMs,  w-­‐iLab.t   Experimenta,  OpenFlow-­‐ enabled  ADVA  ROADMs,  w-­‐ iLab.t,  NICTA,  NITOS,  Virtual   Wall   OFELIA,  Experimenta   OFELIA,  OpenFlow-­‐enabled   ADVA  ROADMs,  FUSECO   Playground,  w-­‐iLab.t,  Netmode   OpenFlow-­‐enabled  ADVA   ROADMs   Network  speed,  reachability,   IP  addresses,  connectivity   Virtual  Wall,  NICTA   Network  topology     Packet  loss,  delay,  jitter   PlanetLab  Europe,  Grid’5000,   FUSECO  Playground,  Netmode,   SmartSantander,  NICTA   IP  connection  statistics   Available  bandwidth,   bandwidth  usage  on  all  links  in   the  experiment  topology   Flow  statistics   NICTA   PlanetLab  Europe,  Virtual  Wall,   NICTA     OFELIA,  Experimenta   21  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Metrics   Supported  in   Required  by   Running  services:  SSH,  FTP  etc.   Inbound  and  outbound  traffic   on  each  interface   Noise  floor  of  the  WiFi  cards,   channels  used  on  the  WiFi   cards   Indications  about  power   on/off  status  from  Chassis   Management  card   Signal  quality,  signal  strength,   noise  level,  interference   Throughput  of  each  channel   BonFIRE,  Grid’5000,  Netmode   Grid’5000,  NICTA   w-­‐iLab.t   w-­‐iLab.t,  NICTA   NICTA   NITOS,  w-­‐iLab.t,  Netmode   SmartSantander,  NICTA,  w-­‐ iLab.t   NITOS   NICTA,  w-­‐iLab.t   Battery  status,  consumed   power   Position  of  node   SmartSantander   FUSECO  Playground,  NITOS,  w-­‐ iLab.t,  Netmode,  NICTA   FUSECO  Playground,  NITOS,  w-­‐ iLab.t     w-­‐iLab.t,  NICTA   w-­‐iLab.t,  NICTA     SmartSantander,  NICTA,  w-­‐ iLab.t   Table  4:  Consolidated  data  on  required  metrics  to  be  measured  from  partners'  input.  Some  of  these  are   already  supported  by  some  facilities.     
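Table 2 shows that Zabbix and Nagios are the most commonly deployed facility-monitoring tools among the partners. As an illustration of how availability data of the kind listed in Tables 3 and 4 could be retrieved from such a deployment, the sketch below queries a Zabbix server over its JSON-RPC API; the server URL, host name and credentials are placeholders, and the method names follow the Zabbix 2.x API.

```python
import json
import urllib.request

# Placeholder: a real deployment would use its own Zabbix front-end URL and credentials.
ZABBIX_URL = "https://monitoring.example-testbed.org/zabbix/api_jsonrpc.php"

def zabbix_call(method, params, auth=None, req_id=1):
    """Send one JSON-RPC 2.0 request to the Zabbix API and return its result."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": req_id}
    if auth:
        payload["auth"] = auth
    req = urllib.request.Request(ZABBIX_URL, json.dumps(payload).encode(),
                                 {"Content-Type": "application/json-rpc"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["result"]

# Authenticate, then fetch the last known value of the agent availability item
# (item keys such as 'agent.ping' follow the standard Zabbix key naming).
token = zabbix_call("user.login", {"user": "fed4fire-fls", "password": "secret"})
items = zabbix_call("item.get",
                    {"output": ["name", "lastvalue"],
                     "host": "node42.example-testbed.org",
                     "search": {"key_": "agent.ping"}},
                    auth=token)
for item in items:
    print(item["name"], item["lastvalue"])
```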
3 Implementation of the architectural functional elements

3.1 Introduction

In this section we discuss every functional element of the D2.1 architecture that is related to WP6, and define how it will be implemented. An available piece of software or tool, or a combination of such software, may be used as a starting point; some elements may also be implemented from scratch.

3.2 Evaluation for possible approaches for implementation

The backbone of all three monitoring/measurement needs identified in D2.1 "First Federation Architecture" comprises two main classes of elements, in charge of (i) obtaining the readings, and (ii) making them accessible to the relevant stakeholders. This section, partly based on contents from [16], reviews the state of the art in these areas. While some tools cater for both aspects, they are classified in the group most relevant to their prime usage.

Based on the pros and cons of the studied solutions, some of them are selected as the initial Fed4FIRE approach; well-understood alternatives are proposed in some cases. The goal is to direct testbeds still lacking specific capabilities towards a preferred core solution. Testbeds already supporting these capabilities can retain their solution, but will need to ensure its integration into the rest of the federation.

3.2.1 Data acquisition

3.2.1.1 Facility and Infrastructure Monitoring

Facility and infrastructure monitoring are essentially the same basic task; they however differ in terms of the precision and availability of the data, and the receiving stakeholder. Facility monitoring aggregates exhaustive data into status summaries about the testbed, for use by the provider, while infrastructure monitoring provides detailed measurements about a relevant subset of the testbed to the experimenter. Based on the review in the previous section, apart from ad hoc tools, a few off-the-shelf tools are used for various levels of monitoring.

Some of the testbed control frameworks, such as OMF and OCF, natively include some facility-monitoring functionality in the form of their Aggregate Manager. These can be complemented by physical chassis management (CM) cards.

In commercial as well as experimental cloud infrastructures, many monitoring systems or solutions are in use. Examples include Zabbix [42][7], Nagios [44], Ganglia [43], EVEREST [41], Groundwork [45], MonALISA [46], CloudStatus [47], and CA Nimsoft Monitor [48].
In addition to these, many other monitoring architectures are already deployed in cloud infrastructures [31][32][33][34][35][36][37]. Moreover, the BonFIRE monitoring solution based on Zabbix is used to monitor federated cloud infrastructures [49]. However, these architectures focus only on wired cloud infrastructures. Furthermore, the heterogeneity considered in these architectures concerns the virtualisation solutions in use rather than varied hardware infrastructures.

With the addition of a plugin (PNP), Nagios can log historical performance data in round-robin databases (based on RRDTool), and provide graphs of their evolution. Similarly, both Munin and Cacti store historical performance data in RRDTool databases, and also provide instrumentation tools to poll distributed nodes and collect this information. While RRDTool is a well-known and widespread tool for the storage of performance data, its basic operation relies on reducing the resolution of old records over time, which might not be desirable in the context of infrastructure monitoring.

Collectd is an extensible daemon that collects and stores performance data from network equipment and from the computer it is running on. Through the use of plugins, it can be extended to monitor various aspects of the nodes, such as common application servers and specific system metrics. It can also query information from SNMP-enabled nodes and, through the libvirt plugin, monitor guest virtual machines. Its default storage backend relies on RRDTool, from which time-based graphs can be generated by external tools. However, a writer plugin is available which makes the data available through OML. Nmetrics is a multi-platform library for querying similar system run-time parameters, such as load, memory or network use, in a system-agnostic way. It is not as thorough as collectd, and does not cater for remote reporting, but might be a lighter infrastructure-monitoring solution for low-powered nodes. An OML instrumentation for this library is available.

DIMES [21] allows measurement probes to be deployed throughout the Internet. Its goal is however more oriented towards measuring the live Internet than planned experiments. PlanetFlow [5] and CoMon [3] provide flow logging and slice or node (i.e., infrastructure) monitoring for PlanetLab, including sophisticated query mechanisms.

Table 5 shows a summary of these tools.
24  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Tool   OMF/OCF,  CM   cards   Nagios     Advantages   Disadvantages   Selected  as   initial   approach     • Facility  monitoring   • Limited  information   • Alert  management   • Plugin  support   • Plugin  for  historical   infrastructure  monitoring  (but   RRDTool)   • Already  deployed  in  33   testbeds   • Ad  hoc  storage  (but  SQL   export  scripts  available)   X     Zabbix   • Supports  both  Facility  and   Infrastructure  Monitoring   • Alert  management   • SQL  storage   • Plugin  support   • VM  monitoring   • Agent-­‐less  monitoring   • SNMP  support   • Support  for  remote  collection   to  centralised  server   • Already  deployed  in  44   testbeds   • SQL  database  can  become   huge  and  unresponsive  in   certain  cases   • Not  always  very  intuitive   X     Zenoss   • Support  Nagios  plugins       Munin,   Cacti   • Good  for  infrastructure   monitoring   • No  facility  monitoring   • RRDTool  backend  (loss  of   resolution  on  old  data)     Collectd   • Plugin  support   • Need  local  (or  SNMP)   agent   • libvirt  for  VM  monitoring     • OML  writer   • SNMP  support   • Support  for  remote  collection   to  centralised  server   X   nmetrics   • OML-­‐instrumented   application  available   • Good  for  lightweight   infrastructure  monitoring   • Library,  but  not  stand-­‐ alone  application   • Limited  monitored  metrics   • No  remote  reporting     TopHat  /  TDMI  /   MySlice   • Infrastructure    (network,   flows)  monitoring   • Federated   • Tophat:  only  running   above  TeamCumry,   Maxmind,  TDMI   X   25  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Tool   Advantages   Disadvantages   Selected  as   initial   approach   • Support  for  external  queries   • MySlice  :  only  running   above  TopHat,  SFA   DIMES   • Allows  measurement  of  the   live    Internet   • Not  deployed  in  any  of  the   involved  testbeds     PlanetFlow   • Infrastructure  (flows)     monitoring   • Already  deployed  in  2   testbeds   • Support  for  external  queries   • Only  deployed  on   PlanetLab     CoMon   • Infrastructure  monitoring   • Support  for  external  queries   • Currently  not  maintained     Observium   • RRD   • SNMP  based   • CollectD  integration   • IPMI  integration   • Minimal  install  effort   • App.  Monitoring  (direct  or  via   CollectD)   • Ldap  authentication   • Adding  non-­‐SNMP  devices   not  supported   • Monitoring  data  in  RRD   only     dstat   • Low  level  info  available   • Flexible   • Well-­‐known  tool   • Granularity  over  time   (limited  to  1/s)     Table  5:  Consolidated  data  on  facility  and  infrastructure  monitoring  tools       3.2.1.2  Experimental  Measurement     For  use  as  part  of  on-­‐going  experiments,  the  networking  community  has  been  developing  and  using   several   types   of   measurement   tools.   The   most   common   ones   focus   on   traffic   capture   and   analysis.   Among  the  best  known  Internet  measurement  software  tools  are  tcpdump  and  Iperf  [17].  
The  former   has  been  shown  to  accurately  report  at  capture  rates  up  to  gigabits  per  second  [18]  while  the  latter   allows  researchers  to  generate  a  traffic  load  to  evaluate  the  capacity  of  a  network  or  resilience  of  a   system;  the  authors  of  [19]  showed  that  it  generated  the  highest  load  on  networks  paths  compared   to  a  number  of  other  traffic  generators.  High  performance  or  versatile  hardware  solutions  have  also   been  developed,  such  as  DAG3  or  NetFPGA  [20].     The  TopHat  Dedicated  Measurement  Infrastructure  (TDMI)  is  a  measurement  infrastructure  based  on   Top  Hat.  It  consists  of  modular  probing  agents  that  are  deployed  in  a  slice  of  various  PlanetLab  nodes   and  probe  the  underlying  network  in  a  distributed  efficient  manner.  In  addition,  they  probe  outwards   26  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   to  a  number  of  target  IP  addresses  that  are  not  within  PlanetLab.  The  aim  of  TDMI  is  to  obtain  the   basic  information  required  by  TopHat.  It  implements  such  algorithms  as  Paris  Traceroute  to  remove   the  artifacts  arising  from  the  presence  of  load  balancers  in  the  Internet.  TDMI  aims  at  providing  the   necessary  information  to  TopHat  users  about  the  evolution  of  the  overlay,  and  focuses  on  catching   the  dynamic  aspects  of  the  topology.     There   are   several   research   activities   focusing   on   the   federation   of   heterogeneous   infrastructures   towards   large-­‐scale   Future   Internet   experimental   facilities.   In   these,   various   measurement   and   monitoring  solutions  are  used.  For  instance,  the  monitoring  architecture  of  the  EU  FP7  NOVI  project   [40]  uses  four  monitoring  tools  deployed  across  heterogeneous  virtualized  infrastructures,  these  are:   (1)  Multi-­‐Hop  Packet  Tracking,  an  efficient  passive  one-­‐way  delay  measurement  solution,  (2)  HADES   (Hades   Active   Delay   Evaluation   System)   which   is   used   for   one-­‐way   delay,   loss   statistics,   and   hop   count   information,   (3)   SONoMA,   which   provides   experimenters   with   monitoring   information   about   the   status   of   the   measurement   agents   and   about   the   state   of   measurement   processes,   and   (4)   packet  capturing  cards  for  line  speeds  up  to  10  GBit/sec.     The  Multi-­‐Hop  Packet  Tracking  of  (1)  is  characterized  by  the  fact  that  it  records  detailed  hop-­‐by-­‐hop   metrics   like   delay   and   loss   (traffic   engineering).   Packet   tracking   also   enables   measurements   of   environment  conditions  like  cross-­‐traffic  and  its  influence  on  the  user  or  experimenter  traffic.  It  also   allows   tracking   single   packets   through   the   network,   which   supports   “trace-­‐back   systems”   by   deriving   the  source  of  malicious  traffic  and  revealing  the  location  of  the  adversary.  The  tool  also  adops  a  hash-­‐ based  packet  selection  technique  that  ensures  a  consistent  selection  throughout  the  network  while   maintaining   statistically   desired   features   of   the   sample.   
It can also efficiently export measurement results with IPFIX, and it provides the experimenter with a choice of suitable packet ID generation functions. It can also reduce measurement traffic with hash-based packet selection, and it is able to visualize the measurement results. Finally, it can synchronize the sampling fractions in the network. Some disadvantages of the Multi-Hop Packet Tracking tool are that it cannot measure passively if there is no traffic, that hash calculation and measurement export require resources, and that it requires time synchronisation of nodes for delay measurements.

CoMo (Continuous Monitoring) [24] is a network measurement system based on packet flows. It has core processes that are linked in stages, namely packet capture, export, storage, and query. These processes are linked by user-defined modules, used to customize the measurement system and implement filtering functions. The query process provides an interface for distributed queries on the captured packet traces. It is a highly tailored tool designed for efficient packet trace capture and analysis.

Most of the tools mentioned here do not share a common output format. Data collection and post-processing is therefore required before any data can be cross-analyzed. The next section reviews tools that support this.

Table 6 shows a summary of the tools described in this section. It only offers a description of their advantages and disadvantages, and does not prescribe any selection as part of the Fed4FIRE implementation. This choice is left to the experimenters, who will have the freedom to deploy their preferred tools as part of their experiment. Indeed, the integration of testbeds in Fed4FIRE only requires that facility and infrastructure measurements be readily available to the federation, and that experimental measurements be accessible by the experimenter from a remote testbed. The next section discusses remote data collection and reporting tools which will be used for that purpose.

Iperf
  Advantages: Well known tool; OML instrumentation; TCP, UDP support; DCCP, SCTP in some flavours
  Disadvantages: No unified output (by default); No remote reporting (by default); Segmented codebase

D-ITG
  Advantages: OML instrumentation; Different traffic profiles; TCP, UDP, DCCP support
  Disadvantages: Requires precise node synchronisation

OTG
  Advantages: OML instrumentation; Different traffic profiles; Modular; TCP, UDP support
  Disadvantages: No DCCP nor SCTP support

Tcpdump
  Advantages: Well known tool
  Disadvantages: No unified output, can differ strongly based on the chosen options; No remote reporting

libtrace
  Advantages: OML instrumentation available; Radiotap support
  Disadvantages: Not a standalone tool

DAG3
  Advantages: Fast processing
  Disadvantages: No unified output; No remote reporting

NetFPGA
  Advantages: Fast processing
  Disadvantages: No unified output; No remote reporting

Multi-Hop Packet Tracking
  Advantages: Detailed hop-by-hop metrics like delay and loss (traffic engineering); Environment conditions like cross-traffic; Tracking single packets through the network; Hash-based packet selection technique; Export of measurement results with IPFIX; Visualization of measurement results
  Disadvantages: Cannot measure passively if there is no traffic; Hash calculation and measurement export requires resources; Time synchronisation of nodes required for delay measurements

PlanetFlow
  Advantages: TCP, UDP, ICMP support; Well known data format (silk format); Netflow query system; Fast and extensive querying facilities; Web GUI access
  Disadvantages: Only deployed on PlanetLab

Table 6: Measurement tools

3.2.2 Collection and Reporting

In the previous sections, attention was given to data acquisition. In this section, the focus is shifted to techniques to make this data available. Several solutions exist to instrument and collect information from various networking applications and devices. Generic reporting tools include SNMP [26], already leveraged by some of the monitoring tools reviewed in the previous section, DTrace [27], OML [16] and INSTOOLS [50]. They all allow the instrumentation of any software and/or devices.

In addition, DTrace can dynamically instrument live applications, and is shipped by default with some operating systems. However, its measurement processing is limited to aggregating functions, and it does not support the streaming of measurements from different devices to a remote collection point.

The INSTOOLS monitoring framework [50] is a system of instrumentation tools that enables GENI users to monitor and understand the behaviour of their experiments by automatically setting up and initializing experiment-specific network measurement and monitoring capabilities on behalf of users.

SNMP has been widely adopted for the management and monitoring of devices, and allows the collection of information over the network. However, it has some performance and scaling limitations when measurements from a large number of devices are required within a short time window [19]; SNMP is also constrained to only reporting information predefined in its management information base (MIB).
IPFIX [29] is an IETF standard which leverages SNMP's MIB and defines a protocol for streaming information about IP traffic over the network. IPFIX exporters stream collected and potentially filtered measurements to collector points. While IPFIX was initially limited to measurements about IP flows, an extension [30] allows custom types to be specified for the reported data, covering a wider range of metrics than just flows.

OML [16] is a generic framework that can instrument the whole software stack, and take input from any sensor with a software interface. It has no preconception about the type of software to be instrumented, nor does it force a specific data schema. Rather, it defines and implements a reporting protocol and a collection server. On the client side, any application can be instrumented using libraries abstracting the complexity of the network communication. Additionally, some of the libraries provide in-band filtering, allowing the measurement streams obtained from an instrumented application to be adapted to the requirements of the current observation (e.g., time averages rather than raw metrics). Applications for which the code is not available can also be integrated in the reporting chain by writing simple wrappers using one of the supported scripting languages (Python or Ruby). After collection from the distributed application, the timestamped data is stored in an SQL database (SQLite3 or PostgreSQL), grouped by experimental domain; a server can collect data from several domains at the same time.

Netcat [59] is a featured networking utility which reads and writes data across network connections, using the TCP/IP protocol. In essence, Netcat is a solution that allows text to be easily communicated over a tunnel, and is hence rather simple to understand. It is also flexible and easy to use, since the output of other tools can be piped into it on the command line. A downside is that Netcat provides no functionality other than the text-over-tunnel transport. Features such as filtering or persistence are not natively supported by Netcat.

MINER [25] is a solution sharing a lot of similarities with OML. It comprises a measurement architecture as well as elements of a management framework. The MINER tools are Java components that may provide measurement results directly, or may be wrappers around external libraries or applications that do the actual measurements. Unfortunately, MINER is not open source software, which limits its extensibility.

TopHat is a measurement system offering a dedicated service that provides network topology information, as measured by the TDMI described above.
It supports the entire lifecycle of an experiment: from assisting users during experiment setup in choosing the nodes on which the experiment will be deployed (facility monitoring), to providing live information to support adaptive applications and experiment control, and supporting retrospective analysis through access to archived data (infrastructure monitoring). Additionally, TopHat-instrumented testbeds can be federated, and data made available to external users through MySlice.

SNMP
  Advantages: Standard measurement systems; Unified reporting
  Disadvantages: No remote reporting
  Selected as final approach: no

DTrace
  Advantages: Dynamic instrumentation of live applications; Installed by default on some OSs
  Disadvantages: Limited to aggregating functions; No remote reporting
  Selected as final approach: no

IPFIX
  Advantages: Unified reporting; Remote reporting
  Disadvantages: Limited representable information (but extensible)
  Selected as final approach: no

Netcat
  Advantages: Simple to understand (text over tunnel); Simple to use (can pipe output of other tools into it on the command line)
  Disadvantages: Limited functionality (e.g. no persistence to a database as part of Netcat)
  Selected as final approach: no

OML
  Advantages: Unified reporting; Centralized reporting; Already deployed or planned in 5 testbeds; In-line filtering
  Disadvantages: Only reporting and collection: no measurement
  Selected as final approach: yes

MINER
  Advantages: Similar in functionality to OML
  Disadvantages: Not open source, hence less extensible
  Selected as final approach: no

TopHat
  Advantages: Allows federation of data sources; Can be plugged into the Fed4FIRE portal (which uses MySlice technology)
  Disadvantages: Not intended for collection of data at the level of the actual resources
  Selected as final approach: yes

Table 7: Collection and reporting systems

3.2.3 Summary

There is a large variability in the tools currently deployed on the various testbeds. A few commonalities can be found, mostly in the monitoring solutions, where Nagios and Zabbix are primarily used. It is however worth noting that Zabbix natively caters for both facility and infrastructure data management, while Nagios only provides the former.

On the experimental measurement side the variability shows the most, with many different and sometimes ad hoc tools. This disparity can be solved through the use of a middleware measurement system in charge of reporting samples from heterogeneous distributed tools in a unified and centralisable way. Here, a commonality can be identified around OML, with quite a few active or planned deployments. Its lightweight API is also a good match for the instrumentation of the various measurement tools in use.

Federated access to measurement data is also important. TopHat, with its ability to query distributed data sources, and its MySlice support, is probably a good candidate for this task.
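As an illustration of the wrapper-based instrumentation mentioned in Section 3.2.2, the following minimal sketch pipes the output of an existing, uninstrumented tool (ping is used here only as a stand-in) into OML. It assumes the oml4py package's OMLBase interface (addmp/start/inject/close) and a collection server at oml.example.org; all names are illustrative. The corresponding direct instrumentation of an application's own code is sketched in Section 3.3.2.

    import re
    import subprocess
    import oml4py

    # Hypothetical wrapper: run ping, parse its RTT samples and report them
    # through OML instead of leaving them in an ad hoc text format.
    oml = oml4py.OMLBase("ping-wrapper", "fed4fire-demo", "node-1",
                         "tcp:oml.example.org:3003")
    oml.addmp("rtt", "destination:string rtt_ms:double")
    oml.start()
    try:
        proc = subprocess.Popen(["ping", "-c", "10", "example.org"],
                                stdout=subprocess.PIPE, universal_newlines=True)
        for line in proc.stdout:
            match = re.search(r"time=([\d.]+) ms", line)
            if match:
                oml.inject("rtt", ["example.org", float(match.group(1))])
    finally:
        oml.close()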
3.3 Details of the selected monitoring and measuring tools in Fed4FIRE

From the surveys of the previous sections, the most widespread commonality is the use of OML as a collection and reporting framework. It allows instrumenting any sort of system through the abstraction of a measurement point, describing a group of metrics. This abstraction allows more latitude in the choice of measurement tools: as long as they conform to the same measurement point for the same information, their output can be used interchangeably in further analysis. Selecting OML for reporting purposes therefore allows flexibility in the choice of measurement tools, both for monitoring and measurement tasks, as well as a unified way to access the collected data.

This however only caters for collection and storage, not directly for access to or visualization of the data, let alone from a federated environment. Another tool for this purpose therefore needs to be identified. From the previous sections, TopHat fits the bill thanks to its ability to run queries over distributed systems and data stores, and its pre-existing deployments.

Facility and infrastructure monitoring tasks require specific metrics to always be made available about the testbed and its nodes. While some deployments already have solutions in place, the recommended tools for the others are, in order of preference, Zabbix, Nagios or Collectd.

Figure 3: Proposed cycle 1 measurement architecture for Fed4FIRE. Elements in bold are the default proposal for new deployments on a canonical testbed

Overall, this caters for the measurement architecture shown in Figure 3. The rest of this section describes these tools in more detail, presented according to their order of appearance along the measurement-to-analysis chain.

3.3.1 Monitoring (Zabbix, Nagios)

Zabbix is an open-source solution for facility and infrastructure monitoring. It supports performance monitoring natively, in addition to facility monitoring and alerting. It also supports an extensive list of operating systems and platforms, including virtual machines. Three types of resource components are available for monitoring: native Zabbix agents, SNMP monitoring, and agentless script-based queries; all data is then aggregated within a central collection server which relies on SQL databases (MySQL or PostgreSQL) for storage.

Nagios is another open-source base for infrastructure monitoring solutions. Unlike Zabbix, it does not support performance monitoring natively. It provides status reports for hosts, applications and networks; these reports are produced by small check plugins, a minimal example of which is sketched below.
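Nagios-style check plugins report their result through an exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and a single line of output, optionally followed by performance data; they can be run locally or through the NRPE mechanism described next. The following is a minimal, hypothetical check of this kind; the thresholds and the choice of the load average as the monitored metric are purely illustrative.

    #!/usr/bin/env python
    # Illustrative Nagios-style check: compare the 1-minute load average against
    # (hypothetical) warning and critical thresholds, and report the result
    # through the conventional exit codes and performance data string.
    import os
    import sys

    WARN, CRIT = 4.0, 8.0               # illustrative thresholds
    load1 = os.getloadavg()[0]          # 1-minute load average

    if load1 >= CRIT:
        status, code = "CRITICAL", 2
    elif load1 >= WARN:
        status, code = "WARNING", 1
    else:
        status, code = "OK", 0

    print("%s - load average %.2f | load1=%.2f" % (status, load1, load1))
    sys.exit(code)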
Nagios can also be extended through the use of user scripts run by the Nagios Remote Plugin Executor (NRPE). It has built-in support for raising alerts on problematic situations. Data is stored in an ad hoc backend, but some plugins allow export to SQL databases. Data can also be processed and exchanged between instances using the Nagios Remote Data Processor (NRDP).

Both tools can be extended through the use of plugins, and plugins written for Nagios are also reusable with the Zenoss monitoring tool.

3.3.2 Collection and Reporting (OML)

OML is an instrumentation tool that allows application writers to define customizable measurement points (MP) inside new or pre-existing applications. Experimenters running the applications can then direct the measurement streams (MS) from these MPs to remote collection points, for storage in measurement databases.

It consists of two main components for injection and collection, and an optional proxy:
• The OML client libraries: the OML client library provides a C API for applications to collect measurements that they produce. The library includes a dynamically configurable filtering mechanism that can perform some processing on each measurement stream before it is forwarded to the OML server. The C library is complemented by natively maintained implementations for Python (OML4Py) and Ruby (OML4R).
• The OML server: the OML server component is responsible for collecting and storing measurements inside a database. Currently, SQLite3 and PostgreSQL are supported as database backends.
• The optional OML proxy server: when running experiments involving disconnections from the control/measurement network (such as with mobile devices), the proxy server can be used to temporarily buffer measurement data until a connection is available to transfer it to the collection server.

OML can be used to collect data from any source, such as statistics about network traffic flows, CPU and memory usage, input from sensors such as temperature sensors, or GPS location measurement devices. It is a generic framework that can be adapted to many different purposes. It is mainly used as the measurement part of OMF-based testbeds, as a way to collect and process data from distributed experiments, but it can also be used as a standalone reporting system. Moreover, any activity that involves measurement on many different computers or devices connected by a network can benefit from using OML: the reporting system allows a better collection of experimental data, and the flexible schema-based definition of measurement samples eases its reuse and sharing.
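As a concrete illustration, the following minimal sketch instruments a trivial reporting loop with the Python bindings. It assumes the oml4py package's OMLBase interface (addmp/start/inject/close) and a collection server reachable at oml.example.org; the application name, measurement point and fields are purely illustrative.

    import time
    import oml4py

    # Initialisation: declare the application, experimental domain, sender
    # identity and collection URI, then the measurement point and its schema.
    oml = oml4py.OMLBase("demo-app", "fed4fire-demo", "node-1",
                         "tcp:oml.example.org:3003")
    oml.addmp("task_stats", "task:string duration_s:double")
    oml.start()

    # Injection: report one sample per iteration into the measurement stream;
    # the OML library takes care of buffering and network transport.
    for i in range(10):
        t0 = time.time()
        time.sleep(0.1)                 # stand-in for the work being measured
        oml.inject("task_stats", ["iteration-%d" % i, time.time() - t0])

    oml.close()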
An instrumentation using the C library similarly consists of a few additions to the application code, in the form of some initialization code and some injection code. The same goes for the Python and Ruby bindings, though they can also be used to instrument an application for which source code is not available by, e.g., parsing its output or log files.

In the context of Fed4FIRE, there is however an additional need to unify the structure of measurements coming from different applications with similar purposes. OML does not cater for this, and a separate coordination effort will therefore be needed on this aspect.

3.3.3 Federated Queries (TopHat)

TopHat collects data from various sources (TDMI, MaxMind, Team Cymru, ...) and aggregates them in order to expose enriched measurement data to the user.

These data can typically be used for monitoring or, more generally, to gain a better understanding of the network. TopHat provides measurement data and is built on top of the Manifold framework (as is MySlice, which provides testbed-oriented data). Manifold allows the user to query various sources of data through a single API (in the case of TopHat, measurements through the TopHat API) while relieving the user from needing to know which platforms must be queried.

Each source of data announces what kind of data it provides according to a common ontology. Manifold dispatches the user queries to each relevant platform, collects their replies, combines them, and sends the result to the user. For example, TDMI provides traceroute measurements, while MaxMind can map an IP address to a city name; thanks to this combination, one can query TopHat to retrieve traceroute measurements from Paris to New York.

Since each platform uses its own API (database, web service, ...), each query issued via TopHat/Manifold is translated into the platform's API through a dedicated gateway. In the same way, the reply of the platform is translated by the gateway to be expressed in the TopHat/Manifold format.

3.4 Implementation Steps

The measurement and monitoring effort in Fed4FIRE aims at unifying the information and metrics available from the federated testbeds. Three levels of reporting have been identified, with an increasing degree of precision. Facility monitoring allows an experimenter to choose testbed resources according to their needs, infrastructure monitoring allows them to measure and record the behaviour of these resources during an experiment, and experimental measurement allows the collection of any other specific metric germane to the study, usually measured with dedicated tools.

Not all testbeds involved in the Fed4FIRE project provide the same tools or resources.
To support a working federation, commonalities have to be found and supported. This section presents the necessary steps to be taken towards this goal. They fall into three categories: installation of new tools, adaptation of existing tools, and coordination.

3.4.1 Installation of New Tools

As seen in the previous sections, some testbeds do not expose sufficient information about the facility and infrastructure status. These testbeds will therefore be required to deploy one of the selected tools for that purpose, with a preference for Zabbix. According to Table 2, this includes the following testbeds: NITOS, w-iLab.t, SmartSantander and NICTA.

All testbeds will have to support OML. This requires the installation of the OML client library on all testbed nodes, and the deployment of at least one OML collection server reachable by all of them. This requirement impacts the following testbeds, either in terms of deployment or activation: PlanetLab Europe, Virtual Wall, OFELIA, DWDM Ring, ADVA ROADM, BonFIRE, Grid'5000, FUSECO and SmartSantander. NICTA will provide technical guidance for these deployments.

Currently, TopHat is only deployed within PlanetLab Europe. All other testbeds will need to install a local instance, with support from UPMC.

3.4.2 Adaptation of Existing Tools

Many of the monitoring and measurement tools output data in varied formats. For integration within the federation, these tools need to be adapted to support reporting via OML. This primarily concerns Zabbix and Nagios, but also all experiment-specific measurement tools listed in Table 6. Information on how to instrument an application can be found in [16], as well as on the OML website. It is preferable to create a separate plugin when the tool supports it (e.g., Zabbix). An example for collectd is available. Each testbed will be responsible for instrumenting its own tools, with support from NICTA.

Additionally, a TopHat agent able to query the OML storage backend is also needed in order to provide a unified gateway into distributed measurements. This will be a task for UPMC and NICTA.

3.4.3 Coordination

As noted in section 3.2.2, OML does not enforce any semantics on the schema of its measurement points. For the federation to be successful, it is important that similar tools provide monitoring and measurement data following the same structure.

It is therefore necessary to provide a unified abstraction for the way data from monitoring and measurement tools are grouped together in meaningful sets, and to make these sets standard across tools measuring similar aspects; a possible shape for such a shared measurement point is sketched below.
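For illustration only, the following sketch shows what such a shared measurement point could look like: the same MP name and schema string would be declared by, say, a Zabbix bridge, a Nagios check wrapper or a collectd plugin, so that their measurements arrive at the collection server in a structurally identical form. The names, fields and collection URI are hypothetical, not the schemas that will actually be curated.

    import oml4py

    # Hypothetical "curated" measurement point schemas: any tool reporting the
    # same kind of metric declares the same MP name and fields, whatever
    # mechanism it uses to obtain the values.
    CURATED_SCHEMAS = {
        "memory_usage": "node_id:string total_kb:double used_kb:double free_kb:double",
        "cpu_usage":    "node_id:string load1:double load5:double load15:double",
    }

    def register_curated_mps(oml):
        # Declare every shared measurement point on an OMLBase instance.
        for name, schema in CURATED_SCHEMAS.items():
            oml.addmp(name, schema)

    if __name__ == "__main__":
        # The same registration call would appear in a Zabbix, Nagios or collectd
        # wrapper, keeping their reports interchangeable at the collector.
        oml = oml4py.OMLBase("zabbix-bridge", "fed4fire-demo", "node-1",
                             "tcp:oml.example.org:3003")
        register_curated_mps(oml)
        oml.start()
        oml.inject("memory_usage", ["node-1", 8192.0, 4096.0, 4096.0])
        oml.close()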
NICTA will curate a list of measurement point schemas to use for specific types of metrics for that purpose, and provide technical support for their implementation in the tools in use by each testbed. This will ensure homogeneity across testbeds using different sets of tools to measure the same characteristics.

Note that these measurement point schemas will not only ease the life of experimenters; they are also a vital instrument in the implementation of a First Level Support (FLS) service. This service will consolidate common facility monitoring information produced by all Fed4FIRE testbeds in a single operator view. The exact list of compulsory metrics will be defined based on further discussions and experience, but they will all have to be provided to the FLS service in a common format. The coordination task presented in this section will therefore also contribute to the successful implementation of the FLS service.

4 Summary

4.1 Mapping of Architecture to Implementation Plan

In order to implement the architecture from D2.1, seven implementation steps have been identified, as presented in Table 8. Most steps (service deployment and application instrumentation) need to be undertaken independently by all participants. Where commonalities exist (e.g. Zabbix and Nagios), instrumentation should be a common effort.

To support the instrumentation task, NICTA will provide and curate a clearinghouse of homogenized OML measurement point schemas. The goal is to ease the integration of new applications while maintaining the highest level of data exchangeability and interoperability between measurement tools providing similar information.

A TDMI agent will be written to allow queries to the OML storage from TopHat.

Facility and infrastructure monitoring
• Deploy Nagios and/or Zabbix and/or collectd if not yet available (all participants)
• Instrument these measurement systems (all participants, with support from NICTA)

Experiment measurement
• Deploy OML if not yet available (all participants)
• Instrument relevant measurement systems (all participants, with support from NICTA)
• Maintain clearinghouse of measurement points (NICTA)

Data access
• Deploy TopHat (all participants)
• Make OML measurement databases accessible to TopHat (NICTA, UPMC)

Table 8: Implementation strategy of functional elements

4.2 Deviation of supported requirements compared to D2.1

Figure 4: Monitoring and measurement specification for cycle 1

As shown in Figure 4, there are a few divergences from the architecture presented in Figure 2. First, specific data collection systems are introduced in each testbed.
This  element  takes  the  form  of  one   or  more  OML  collection  servers  (their  number  and  respective  role  is  an  implementation  detail  to  be   discussed   for   each   testbed).   Based   on   this   semi-­‐centralised   data   collection,   a   data   access   layer   is   introduced.   This   is   in   the   form   of   a   TopHat   agent   in   each   testbed,   with   ability   to   query   the   OML   datastores   on   behalf   of   the   experimenter   or   central   federation   management.   Finally,   rather   than   querying  testbed  entities  directly,  both  experimenter  and  management   can  delegate  all  their  queries   to  the  testbed-­‐local  TopHat  agents.   38  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   References       [1] PlanetLab   Europe.   Available   online   at   http://www.planet-­‐lab.eu,   last   visited   on   February   11,   2013.   [2] S.  Soltesz  ,  M.  Fiuczynski  ,  and  L.  Peterson.  MyOps:  A  Monitoring  and  Management  Framework   for  PlanetLab  Deployments.    Working  paper.  2009.   [3] CoMon.  Available  online  at  http://www.comon.cs.princeton.edu/,  last  accessed  on  February  11,   2013.   [4] TopHat.  Available  online  at  http://www.top-­‐hat.info/,  last  accessed  on  February  11,  2013.   [5] M.   Huang,   A.   C.   Bavier,   and   L.   L.   Peterson.   PlanetFlow:   Maintaining   Accountability   for   Network   Services.  ACM  SIGOPS  Operating  Systems  Review.  Volume  40,  Issue  1,  pp.  89  –  94,  January  2006.   [6] Nagios:  systems  and  network  monitoring;  2nd  ed.,  by  Wolfgang  Barth,  San  Francisco:  No  Startch   Press  (2008)   [7] Zabbix  –  open  source  monitoring  system,  www.zabbix.com,  last  accessed  on  February  11,  2013.   [8] Zenoss.  Available  online  at  http://www.zenoss.com/,  last  accessed  on  February  11,  2013.   [9] FUSECO   Playground.   Available   online   at   www.fuseco-­‐playground.org,   and   at   www.ngn2fi.org/playgrounds/Fuseco_Playground/index.html,   last   accessed   on   February   11,   2013.   [10]Multi-­‐Hop  Packet  Tracking  –  http://www.av.tu-­‐berlin.de/pt,  last  accessed  on  February  11,  2013.   [11]FITeagle   -­‐   Future   Internet   Testbed   Experimentation   and   Management   Framework,   http://www.fiteagle.org/,  last  accessed  on  February  11,  2013.   [12]BonFIRE   architecture.   Available   online   at   http://doc.bonfire-­‐project.eu/R3/reference/bonfire-­‐ architecture.html,  last  accessed  on  February  11,  2013.   [13]BonFIRE   monitoring   documentation.   Available   online   at   http://doc.bonfire-­‐ project.eu/R3/monitoring/howto.html,  last  accessed  on  February  11,  2013.   [14]BonFIRE   monitoring   documentation.   Available   online   at   http://doc.bonfire-­‐ project.eu/R3/monitoring/getting-­‐data.html,  last  accessed  on  February  11,  2013.   [15]SmartSantander.   Available   online   at   http://www.smartsantander.eu/index.php/testbeds/item/132-­‐santander-­‐summary,   last   accessed  on  February  11,  2013.   [16]O.  Mehani,   G.  Jourjon,   T.  Rakotoarivelo,   and   M.  Ott,   "An   instrumentation   framework   for   the   critical  task  of  measurement  collection  in  the  future  Internet,"  Under  review,  2012.   [17]M.   Gates,   A.   Tirumala,   J.   Dugan,   K.   
Gibbs,   Iperf   version   2.0.0,   NLANR   applications   support,   University  of  Illinois  at  Urbana-­‐Champaign,  Urbana,  IL,  USA,  2004.   [18]F.  Schneider,  J.  Wallerich,  A.  Feldmann,  Packet  capture  in  10-­‐gigabit  ethernet  environments  using   contemporary   commodity   hardware,   in:   S.   Uhlig,   K.   Papagiannaki,   O.   Bonaventure   (Eds.),   PAM   2007,   8th   Internatinal   Conference   on   Passive   and   Active   Network   Measurement,   volume   4427   of   Lecture  Notes  in  Computer  Science,  Springer-­‐Verlag  Berlin,  Heidelberg,  Germany,  2007,  pp.  207-­‐ 217.   39  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   [19]S.  S.  Kolahi,  S.  Narayan,  D.  Nguyen,  Y.  Sunarto,  Performance  monitoring  of  various  network  traffic   generators,   in:   R.   Cant   (Ed.),   UKSim   2011,   13th   International   Conference   on   Computer   Modelling   and  Simulation,  IEEE  Computer  Society,  Los  Alamitos,  CA,  USA,  2011,  pp.  501506.   [20]G.   Gibb,   J.   W.   Lockwood,   J.   Naous,   P.   Hartke,   N.   McKeown,   NetFPGA-­‐-­‐-­‐an   open   platform   for   teaching   how   to   build   gigabit-­‐rate   network   switches   and   routers,   IEEE   Transactions   on   Education   51  (2008)  364369.   [21]Y.  Shavitt,  E.  Shir,  DIMES:  Let  the  internet  measure  itself,  SIGCOMM  Computer  Communucation   Review  35  (2005)  7174.   [22]M.   Huang,   A.   Bavier,   L.   Peterson,   PlanetFlow:   Maintaining   accountability   for   network   services,   ACM  SIGOPS  Operating  Systems  Review  40  (2006)  8994.   [23]K.   Park,   V.   S.   Pai,   CoMon:   A   mostly-­‐scalable   monitoring   system   for   PlanetLab,   ACM   SIGOPS   Operating  Systems  Review  40  (2006)  6574.   [24]G.   Iannaccone,   CoMo:   An   Open   Infrastructure   for   Network   Monitoring   -­‐-­‐-­‐   Research   Agenda,   Technical  Report,  Intel  Research,  Cambridge,  UK,  2005.   [25]C.   Brandauer,   T.   Fichtel,   MINER   -­‐-­‐-­‐   a   measurement   infrastructure   for   network   research,   in:   T.   Magedanz,   S.   Mao   (Eds.),   TridentCom   2009,   5th   International   Conference   on   Testbeds   and   Research   Infrastructures   for   the   Development   of   Networks   &   Communities,   IEEE   Computer   Society,  Los  Alamitos,  CA,  USA,  2009,  pp.  19.   [26]D.   Harrington,   R.   Presuhn,   B.   Wijnen,   An   Architecture   for   Describing   Simple   Network   Management   Protocol   (SNMP)   Management   Frameworks,   RFC   3411,   RFC   Editor,   Fremont,   CA,   USA,  2002.   [27]B.  M.  Cantrill,  M.  W.  Shapiro,  A.  H.  Leventhal,  Dynamic  instrumentation  of  production  systems,   in:  A.  Arpaci-­‐Dusseau,  R.  Arpaci-­‐Dusseau  (Eds.),  USENIX  2004,  USENIX  Association,  Berkeley,  CA,   USA,  2004,  pp.  1528.   [28]Q.   Zhao,   Z.   Ge,   J.   Wang,   J.   Xu,   Robust   traffic   matrix   estimation   with   imperfect   information:   Making  use  of  multiple  data  sources,  SIGMETRICS  Performance  Evaluation  Review  34  (2006)  133-­‐ 144.   [29]B.  Claise,  S.  Bryant,  S.  Leinen,  T.  Dietz,  B.  H.  Trammell,  Specification  of  the  IP  Flow  Information   Export   (IPFIX)   Protocol   for   the   Exchange   of   IP   Traffic   Flow   Information,   RFC   5101,   RFC   Editor,   Fremont,  CA,  USA,  2008.   [30]E.   Boschi,   B.   Trammell,   L.   Mark,   T.   
Zseby,   Exporting   Type   Information   for   IP   Flow   Information   Export  (IPFIX)  Information  Elements,  RFC  5610,  RFC  Editor,  Fremont,  CA,  USA,  2009.   [31]Tordsson,   J.;   Djemame,   K.;   Henriksson,   D.;   Katsaros,   G.;   Ziegler,   W.;   Waldrich,   O.;   Konstanteli,   K.;   Sajjad,   A.;   Rajarajan,   M.;   Gallizo,   G.;   Nair,   S.;,   "Towards   holistic   Cloud   management",   in   Book   "European  Research  Activities  in  Cloud  Computing",  Cambridge  Scholars  Publishing,  2012.   [32]G.  Katsaros  et  al.,  "Building  a  service-­‐oriented  monitoring  framework  with  REST  and  Nagios",  IEEE   International  Conference  on  Services  Computing  (SCC),  July  2011.   [33]  G.  Katsaros  et  al.,  "A  service  oriented  monitoring  framework  for  soft  real-­‐time  applications",  IEEE   International  Conference  on  Service-­‐  Oriented  Computing  and  Applications  (SOCA),  Dec.  2010.   [34]  G.   Katsaros   et   al.,   "Monitoring:   A   Fundamental   Process   to   Provide   QoS   Guarantees   in   Cloud-­‐ based   Platforms",   in   Book   "Cloud   computing:   methodology,   systems,   and   applications",   CRC,   Taylor  &  Francis  group,  September  2011.   [35]S.A.   De   Chaves   et   al.,   "Toward   an   architecture   for   monitoring   private   clouds",   in   the   Communications  Magazine,  IEEE,  Volume  (49),  Issues  (12),  pp.  130-­‐137,  December  2011.   40  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   [36]G.  Katsaros  et  al.,  "A  multi-­‐level  architecture  for  collecting  and  managing  monitoring  information   in   Cloud   environments",   in   the   Proceedings   of   the   1st   International   Conference   on   Cloud   Computing  and  Services  Science  (CLOSER),  May  2011.   [37]G.   Katsaros   et   al.,   "An   integrated   monitoring   infrastructure   for   Cloud   environments",   in   Book   "Lecture  Notes  in  Business  Information  Processing"  (LNBIP),  Springer-­‐Verlag,  September  2011.   [38]FIRE  OpenLanb  project  -­‐  http://www.ict-­‐openlab.eu/home.html,  visited  on  February  12,  2013.   [39]FITeagle   -­‐   Future   Internet   Testbed   Experimentation   and   Management   Framework,   www.fiteagle.org/,  last  visited  on  February  12,  2013.   [40]  NOVI  FIRE  Project  http://www.fp7-­‐novi.eu/,  last  visited  on  February  12,  2013.   [41]  EVEREST  -­‐  EVEnt  REaSoning  Toolkit.  Available  online  at  http://sourceforge.net/apps/trac/sla-­‐at-­‐ soi/wiki/EverestCore,  last  accessed  on  February  11,  2013.   [42]  Zabbix  –  open  source  monitoring  system,  www.zabbix.com,  last  accessed  on  February  11,  2013.   [43]  Ganglia,   “Ganglia   monitoring   system,”   Website,   available   online   at   www.ganglia.sourceforge.net,  last  visited  on  February  12,  2013.   [44]  Nagios  monitoring  tool.  Website,  available  at  www.nagios.org,  visited  on  February  12,  2013.   [45]  GroundWork,   “Groundwork,”   Website,   available   online   at   www.gwos.com,   last   visited   on   February  12,  2013.   [46]  MonALISA,   “Monalisa:   Monitoring   agents   uasing   a   large   integrated   services   architecture,”   Website,   available   online   at   monalisa.   caltech.edu/monalisa.htm,   last   visited   on   February   12,   2013.   
[47]  CloudStatus,   “Cloudstatus,”   Website,   available   online   at   www.hyperic.com/products/cloud-­‐ status-­‐monitoring,  last  visited  on  October  14,  2012.   [48]  Nimsoft,   “Ac   nimsoft   monitor,”   Website,   available   online   at   www.nimsoft.com/solutions/nimsoft-­‐monitor.html,  last  visited  on  February  12.   [49]BonFIRE   monitoring   documentation.   Available   online   at   http://doc.bonfire-­‐ project.eu/R3/monitoring/getting-­‐data.html,  last  accessed  on  February  11,  2013.   [50]  J.   Griffioen,   Z.   Fei,   and   H.   Nasir.   Architectural   Design   and   Specification   of   the   INSTOOLS   Measurement  System.  December  2009.   [51]  Y.   Al-­‐Hazmi,   and   T.   Magedanz.   A   Flexible   Monitoring   System   for   Federated   Future   Internet   Testbeds.   Proceeding   of   the   3rd   IEEE   International   Conference   on   the   Network   of   the   Future   (IEEE  NoF  2012),  Tunis,  Tunisia,  Nov  2012.   [52]OCF.  OFELIA  Control  Framework.  Code  available  online  at  https://github.com/fp7-­‐ofelia/ocf,  last   accessed  on  March  5,  2013.   [53]EXPERIMENTA   Platform   at   XiPi.   http://www.xipi.eu/infrastructure/-­‐/infrastructure/view/2009,   last  accessed  on  March  5,  2013.   [54]OFELIA  Project  website.  https://alpha.fp7-­‐ofelia.eu/,  last  accessed  on  March  5,  2013.   [55]FIBRE  Project  website.  http://www.fibre-­‐ict.eu/,  last  accessed  on  March  5,  2013.   [56]OpenEPC  –  Open  Evolved  Packet  core.  Available  online  at  http://www.openepc.net,  last  accessed   on  February  12th  2013   [57]Open   IMS   Core’s   homepage.   Available   online   at   http://www.openimscore.org,   last   accessed   on   February  12th  2013   [58]The   OpenMTC   vision.   Available   online   at   http://www.open-­‐mtc.org,   last   accessed   on   February   12th  2013   [59]The   GNU   Netcat   project.   Available   online   at   http://netcat.sourceforge.net/,   last   accessed   on   February  13th  2013.   41  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Appendix  A:  PlanetLab  Europe  requirements     We   describe   here   the   monitoring   requirements   for   PlanetLab   Europe   [1],   but   most   of   what   we   describe  here  applies  to  other  PlanetLab  systems,  and  notably  to  PlanetLab  Central.   The   PlanetLab   architecture   has   purposely   included   the   minimum   required   functionality   for   the   system,  leaving  to  third  parties  the  provision  of  various  functionalities  that  might  be  directly  included   in  other  systems.  This  is  no  exception  for  monitoring.  A  PlanetLab  system  consists  of  a  set  of  nodes   on   the   internet,   but   node   monitoring   and   monitoring   of   the   internet   are   not   provided   natively   by   PlanetLab.  The  most  widely  used  node  monitoring  system  for  PlanetLab  is  called  CoMon  [3]  and  for   monitoring   of   the   internet   the   PlanetLab   Europe   operations   team   uses   the   TopHat   Dedicated   Measurement  Infrastructure  (TDMI)   [4].   These   two   systems   are   the   most   experimenter-­‐oriented   of   the   used   tools,   but   there   are   also   other   monitoring   systems   such   as   PlanetFlow  [5]   and   MyOps   [2]   that  tend  to  serve  the  testbed  operator  more.  
CoMon  measures  the  activity  of  each  slice  on  a  node,   for   example   CPU,   memory,   disk   usage   and   bandwidth   usage.   TDMI   has   agents   on   each   node   that   conduct  traceroutes  in  a  full  mesh,  providing  information  on  the  paths  between  the  nodes.     The   third   party   service   used   by   PlanetLab   Europe   for   managing   measurements   is   TopHat.   TopHat   draws   from   TDMI,   CoMon   and   other   measurement   systems,   such   as   Team   Cymru   (providing   autonomous   numbers   for   the   nodes)   and   MaxMind   (providing   geolocalization).   The   TopHat   developers  can  easily  add  other  measurement  sources  by  developing  a  dedicated  gateway  for  each   new  service.  This  information  serves  at  different  phases  in  the  experiment  lifecycle.  In  setting  up  an   experiment,  users  make  use  of  the  data  in  order  to  choose  the  nodes  that  they  put  in  a  slice.  They   might   want   lightly   loaded   nodes,   for   instance,   or   nodes   in   a   given   country,   or   nodes   that   have   stable   routes  between  each  other.  Then,  during  the  time  that  an  experiment  is  running,  the  experimenter   might   call   on   these   measurement   sources   in   order   to   monitor   node   health,   or   path   stability,   for   instance.   After   an   experiment   has   concluded,   the   experimenter   may   wish   to   use   TopHat   to   call   up   historical  data  in  an  effort  to  comprehend  how  node  or  network  conditions  affected  the  experiment.   Furthermore,   PlanetLab   is   fully   open   to   third   party   experiment   control   tools,   which   includes   third   party  measurement  tools.  So,  for  instance,  work  is  being  done  to  easily  enable  OMF-­‐enabled  slices,   which  comes  with  OML.     Monitoring  solutions     Splitting   into   4   categories   :   (A)   infrastructure   (MyPLC   and   related)   –   (B)   nodes   from   an   operations   point  of  view  –  (C)  nodes  from  a  user  point  of  view  –  (D)  experimentation  metrics.     As  far  as  (A)  “infrastructure”  is  concerned,  we  use  a  Nagios-­‐based  deployment  [6]  for  monitoring  all   the  boxes  and  services  known  to  play  a  role  in  smooth  operations;  a  lot  of  these  have  been  added   over  time.     As   far   as   (B)   “nodes   from   an   operations   point   of   view”,   we   use   a   separate   tool   but   tailored   for   PlanetLab  deployments  named  MyOps,  which  allows  to  implement  escalation  policies.  For  example,   42  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   if  one  site  has  a  node  down  we  first  send  messages  to  the  technical  contact,  then  to  the  PI,  and  after   a   while   the   site   loses   some   of   its   capabilities   (fewer   slices).   All   this   workflow   is   entirely   automated   under   MyOps,   which   also   provides   raw   data   on   the   status   of   nodes   –   see   http://planet-­‐ lab.eu/monitor/site.     For  (C)  “nodes  from  a  user  point  of  view”,    PlanetLab  Europe  uses  TopHat,  with  TDMI  and  the  other   sources   mentioned   above.   
One   difficulty   encountered   by   the   PlanetLab   Europe   operations   team   was   that  the  potentially  very  useful  data  gathered  by  MyOps  are  not  easily  accessible  through  an  API  or   other  query-­‐based  techniques,  and  so  MyOps  does  not  lend  itself  to  the  creation  of  a  gateway  so  as   to  be  made  available  through  TopHat.  There  clearly  was  a  need  for  both:  (1)  aggregating  data  about   nodes  from  various  places  (MyOps  being  one,  CoMon  being  another  one,  and  we  found  quite  a  few   other   sources   of   potentially   high   interest),   and   (2)   providing   a   decent   querying   interface   to   these   data.  In  a  first  step,  we  leveraged  the  internal  ‘tags’  mechanism  right  in  the  MyPLC  DB  to  extend  it   with   such   external   data.   In   a   federated   world,   and   in   particular   with   respect   to   MySlice,   it   might   make  sense  to  design  a  separate  tool  for  hosting  this  aggregated  data.     We   do   not   try   to   address   (D)   “experimentation   metrics”   at   all,   this   being   deemed   part   of   the   experimental  plane.  OML  is  considered  to  be  a  suitable  candidate  to  deploy  when  support  for  (D)  is   pursued.     Monitoring  requirements       Nothing   is   crucially   missing   in   what   we   have   at   this   point.   There   are   no   current   monitoring   requirements   from   PlanetLab   Europe.   Of   course   the   introduction   of   Fed4FIRE   monitoring   systems   may  not  lead  to  a  loss  of  supported  functionality  compared  to  the  situation  as  it  is  today.     Required  metrics       The  metrics  that  are  measured  and  therefore  should  also  be  supported  in  Fed4FIRE  are:   • TopHat   o traceroute  measurements  between  each  pair  of  PlanetLab  nodes   o For  each  IP  hop,  we  could  provide  more  information  (ASN,  country,  hostname,  etc.).   • CoMon  data1   o Slice  name   o Slice  context  id   o CPU  consumption  (%)   o Physical  memory  consumption  (%)   o Physical  memory  consumption  (in  KB)   o Virtual  memory  consumption  (in  KB)   o Number  of  processes   o Average  sending  bandwidth  for  last  1  min  (in  Kbps)   o Average  sending  bandwidth  for  last  5  min  (in  Kbps)                                                                                                                           1    http://codeen.cs.princeton.edu/slicestat/   43  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   o o o o o o Average  sending  bandwidth  for  last  15  min  (in  Kbps)   Average  receiving  bandwidth  for  last  1  min  (in  Kbps)   Average  receiving  bandwidth  for  last  5  min  (in  Kbps)   Average  receiving  bandwidth  for  last  15  min  (in  Kbps)   Local  IP  address  of  this  node   Number  of  active  processes  -­‐  that  is,  processes  using  the  CPU  cycle  at  the  moment   44  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Appendix  B:  Virtual  Wall  requirements       The   virtual   wall   is   a   testbed   consisting   of   physical   servers,   switches   and   management   servers   operating  the  infrastructure.     
Monitoring solutions

At this moment the Virtual Wall runs Zabbix [7] to monitor the management servers and switches, accompanied by scripts that check temperature and fan speeds through IPMI on the physical servers. If a server runs too hot (e.g. if the fans are not running), it is turned off. This can be considered facility monitoring functionality.

The Emulab software that powers the Virtual Wall has a built-in health check system, which puts nodes that do not come up correctly after deploying images in a "hardware down" pool (the experiment then uses another free node, so this is transparent to the experimenter). This can also be considered facility monitoring functionality.

The Emulab software also has the possibility to run topology/link tests after an experimenter swaps in an experiment. This is infrastructure monitoring functionality. As experimenters have root access on the nodes, they can deploy other infrastructure monitoring and experiment measurement frameworks themselves.

Monitoring requirements

The system works very well as it is; however, being part of a federation brings some extra requirements. For first level support, there is a need for a central view on the health of all facilities, and probably also e.g. on the nodes available. This still has to be implemented and agreed in Fed4FIRE.

Required metrics: A very important requirement is that detailed monitoring/statistics are needed if the facility is open to traffic from and to the internet (probably through a programmable firewall), in order to trace back malicious traffic. A non-exhaustive list of examples of such metrics on a per-experiment basis is: in- and outgoing throughput, logging of the destination IP addresses of outgoing packets, logging of the quantity of the different types of outgoing messages based on information regarding the higher layers of the protocol stack (specific ICMP messages, usage of well known UDP/TCP ports that could indicate malicious behaviour, etc.), and so on.

For other metrics, ICMP, ssh, http and https checks would appear to be adequate for first level support. The following metrics are of interest to the Virtual Wall experimenters:
• Bandwidth usage on all links in the experiment topology
• Memory: total memory, used memory, free memory, total swap memory, used swap memory, free swap memory
• CPU: CPU load, CPU utilisation, CPU count, CPU idle time, cumulative CPU time, cumulative CPU usage, CPU usage over time for one or multiple specific processes, or even per specific thread
• Bandwidth usage per session (e.g. per TCP connection)
• Network metrics: connectivity, network speed
• IP connection statistics: IP addresses
• Link utilisation, number of bytes, packets, and flows per time slot

It would also be very interesting to have a test infrastructure in a central place which starts and stops experiments, e.g. once every 24 hours, to check that each facility supports the whole lifecycle and that the APIs are still conforming.

Appendix C: OFELIA (OpenFlow in Europe Linking infrastructure and Applications) Control Framework requirements

This platform has multiple instances distributed inside and outside Europe, some of them deployed by partners involved in Fed4FIRE.

Monitoring solutions

At this moment, monitoring is limited to the general infrastructure status and VMs. The status of servers and switches is monitored with Zenoss [8]. It allows the Island Manager and experimenters to check which servers and switches are up and available.

VM status is monitored by an agent on the server, which sends notifications to the Aggregate Manager each time an event happens. A VM's status can be "started" or "stopped".

There is no support for experiment monitoring at the moment.

Monitoring requirements

A planned improvement in OFELIA is to add callbacks to OpenFlow to monitor changes in network topology. Experiment monitoring is also planned to be implemented in Fed4FIRE, gathering statistics through OpenFlow counters so that experimenters can check the status of the requested resources; a minimal sketch of such a collection script is given at the end of this section.

The following user needs can be identified:
• Experimenters should be able to take measurements from the resources during the course of the experiment using facility tools, in addition to the measurements they can take themselves.
• Experimenters should be able to store the measurements for later study and sharing.

The following facility needs can be identified:
• A monitoring tool should be available that offers experimenters the possibility of monitoring data from the resources of their experiment and storing its statistics.
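As a purely illustrative sketch of how such OpenFlow counter statistics could be collected where the datapaths are Open vSwitch instances, the following polls per-flow packet and byte counters with ovs-ofctl. The bridge name is hypothetical, and in Fed4FIRE the values would typically be injected into an OML measurement point rather than printed.

    import re
    import subprocess

    # Hypothetical collection script: read per-flow counters from an Open vSwitch
    # bridge ("br0" is illustrative) using ovs-ofctl, and print them.
    BRIDGE = "br0"

    output = subprocess.check_output(["ovs-ofctl", "dump-flows", BRIDGE],
                                     universal_newlines=True)
    for line in output.splitlines():
        packets = re.search(r"n_packets=(\d+)", line)
        nbytes = re.search(r"n_bytes=(\d+)", line)
        if packets and nbytes:
            print("flow counters: packets=%s bytes=%s"
                  % (packets.group(1), nbytes.group(1)))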
Required metrics

The overall required metrics are as follows:
• Network (L2 switches) level:
  o Device status: ON/OFF
  o CPU: CPU utilisation, CPU idle time
  o Memory: total memory, used memory, free memory
  o Temperature
  o Physical interfaces: ON/OFF
  o Virtual interfaces: ON/OFF
  o VLANs
• Switch interface level:
  o Interface status: ON/OFF
  o Throughput
  o Packet rate
  o Error rate
  o IP addresses
  o MAC addresses
  o MTU
• Server level:
  o Device status: ON/OFF
  o VMs: list of VMs, ON/OFF
  o Memory: total memory, used memory, free memory
  o CPU: total number of CPUs, core speed, L1/L2/L3 cache memory, total CPU utilisation, CPU utilisation per VM
  o Storage: total capacity, used capacity, free capacity
  o Running services: SSH, FTP, etc.
  o Virtual interfaces: ON/OFF, IP and MAC addresses
• VM level:
  o VM status: ON/OFF
  o Memory: total memory assigned to the VM

Appendix D: Optical Testbed – Experimenta DWDM Ring Testbed requirements

The complete Experimenta testbed [53] comprises the DWDM Ring testbed, a pool of VMs and also OFELIA/FIBRE [54] [55] OpenFlow islands. The OFELIA/FIBRE island is described in the previous section. In this section we focus on the DWDM Ring testbed.

The goal of the provider (i2CAT) is to offer Experimenta under the OFELIA Control Framework (OCF) [52]. This integration is still in its early stages. Therefore the details provided in this document are based on the current capabilities of OCF and the generic requirements for enabling OCF to control and manage optical testbeds such as the Experimenta DWDM Ring testbed. Future revisions will update these details to properly reflect the ongoing integration of OCF and this testbed.

Monitoring solutions

The integration of Experimenta with the OFELIA/FIBRE islands and OCF is still in the very early stages. Therefore monitoring is not available at the moment. After the integration is complete, experimenters will be able to collect statistics using the OpenFlow controller. In particular, experimenters will be able to monitor and collect flow statistics.

Monitoring requirements

The experimenter might consider a load balancing mechanism applied to the proposed topology. Using inherent OpenFlow monitoring, the experimenter can monitor the number of bytes sent across the different load balancing paths. Assuming that OpenFlow-based monitoring is already available to experimenters, an additional user need is to provide automated monitoring mechanisms, e.g. scripts, to help experimenters collect OpenFlow flow statistics; a sketch of such a script is given below.
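One possible shape for such a helper script is sketched here, assuming the experiment's controller exposes Ryu's ofctl_rest REST interface (an assumption; the controller software actually used with OCF is not specified in this document). The controller URL, datapath identifier and polling interval are placeholders.

    # Illustrative per-path byte accounting via a controller's REST statistics API.
    import time
    import requests

    CONTROLLER = "http://127.0.0.1:8080"   # assumed ofctl_rest endpoint
    DPID = 1                               # placeholder datapath id

    def flow_bytes(dpid):
        # ofctl_rest returns {"<dpid>": [{"match": ..., "byte_count": ..., ...}, ...]}
        reply = requests.get("%s/stats/flow/%d" % (CONTROLLER, dpid)).json()
        return {str(f["match"]): f["byte_count"] for f in reply[str(dpid)]}

    previous = flow_bytes(DPID)
    while True:
        time.sleep(10)
        current = flow_bytes(DPID)
        for match, count in current.items():
            delta = count - previous.get(match, 0)
            print("path %s: %d bytes in last interval" % (match, delta))
        previous = current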
Required metrics

The following represents the monitoring requirements and their associated metrics.

1. Property: Virtualisation server status

The virtualisation server is one of the main resources provided by OCF. VMs are created on the servers with variable characteristics (memory, HDD size, etc.). It is important to monitor the status of the facility servers in order to properly manage their capacities and detect malfunctions.

All the component-level metrics can be useful, but the ones that limit the servers' capacity are the used memory and storage. CPU metrics are also interesting for facility monitoring, as many VMs working at the same time can overload a server's capacity. A sketch of how these metrics could be collected is given at the end of this section.
• Server status (up/down)
• Server memory usage
• Server storage space available
• Server CPU usage
• VM status (up/down)
• VM memory
• VM storage capacity

2. Property: Network status

Another resource provided by OCF is the OpenFlow resources. These mainly comprise OpenFlow-enabled switches and a software-defined network topology over them. Component-level metrics related to the switches are required to know their availability for use in an experiment. Network metrics such as connectivity and topology detection are also useful to present facility users with the status of the network when setting up an experiment.
• Switch status (up/down)
• Switch ports
• Network topology

3. Property: Experiment monitoring

The previous properties are useful in a pre-experimentation phase. The servers' CPU and memory utilisation help the Island Managers to check the status of the facility, in order to prevent and/or solve problems as soon as possible, and help the experimenters to choose the most appropriate resources for their experiments.

Once the experiment has started, it would be very interesting to monitor its progress and the usage of the reserved resources in order to obtain useful information. When an experimenter sets up an experiment, it consists of a network topology composed of VMs and OpenFlow-enabled switches. The experimenter then defines a controller (which can be deployed in a VM or on any computer connected to the network). This controller dynamically reconfigures the routing tables of the OpenFlow switches to redirect the network traffic according to the experimenter's purposes.

Any monitoring metric about the progress of the experiment is useful, whether at the equipment, network or traffic level.
• All the traffic metrics:
  o Delay
  o Packet loss
  o Throughput
  o Etc.
• IP connection statistics
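As an illustration of the virtualisation server and VM status metrics above, the following sketch assumes the servers run a libvirt-managed hypervisor; this is an assumption, as OCF's actual monitoring agent is not described in this document.

    # Sketch: list host memory and per-VM state/memory through the libvirt Python bindings.
    import libvirt

    conn = libvirt.open("qemu:///system")             # local hypervisor connection
    print("host memory: %d MiB" % conn.getInfo()[1])  # getInfo() reports memory in MiB

    for dom in conn.listAllDomains():
        # info() returns [state, maxMem(KiB), memory(KiB), vCPUs, cpuTime(ns)]
        _state, max_mem, mem, vcpus, _cpu_time = dom.info()
        print("VM %-20s %-4s mem=%d/%d KiB vcpus=%d"
              % (dom.name(), "up" if dom.isActive() else "down", mem, max_mem, vcpus))
    conn.close()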
Appendix E: Optical Testbed – OpenFlow-enabled ADVA ROADMs Testbed requirements

This testbed comprises OpenFlow-enabled ADVA ROADMs, dark fibre distributed across the UK, and connectivity via JANET, Internet2 and GÉANT.

Monitoring solutions

From the facility owner's perspective, monitoring the infrastructure health/resource usage is possible via proprietary APIs and scripts. From the experimenter's perspective, OpenFlow can be used to monitor flow statistics such as the number of packets or number of bytes.

Monitoring requirements

For experimenters, statistics can be collected using the OpenFlow controller for the experiment. The experimenter determines what statistics to collect in this case. However, more sophisticated flow-based monitoring might enhance experiment monitoring in this testbed. There are currently considerations to integrate some OML functionality into OCF, and this could further improve and automate the monitoring capabilities of this platform.

Experimenters could want to monitor bandwidth consumption or packet loss. This could be used to evaluate switching algorithms, and also for load balancing and VM migration experiments. The experimenter might want to observe the number of packets traversing the network, in order to verify whether the application-aware switching algorithm has been able to select a high quality of service path, taking into consideration the requirements of the beyond-high-definition format sent from the traffic source. As OpenFlow-based monitoring is already available to experimenters, the following additional user need can be defined: providing automated monitoring mechanisms, e.g. scripts, to help experimenters collect OpenFlow flow statistics.

Required metrics

Standard optical metrics can be monitored, including:
• Bandwidth consumption
• Packet loss
• Flow statistics
• Port status, i.e. ports on/off
• Connectivity status, i.e. signal status
• Total number of cross-connections
• Topology (using cross-connect configurations)

Appendix F: BonFIRE requirements

BonFIRE is a multi-site testbed, implemented using a broker architecture [12]. Observability is one of the four BonFIRE pillars, and as such extensive monitoring is available to the experimenters at both the broker and the site level.

Site-level Monitoring

Monitoring is provided on the BonFIRE sites using the Zabbix framework. Three types of monitoring are available on BonFIRE through Zabbix: VM, Application and Infrastructure monitoring. The BonFIRE documentation describes how to set up [13] and how to access [14] monitoring data. We briefly discuss the BonFIRE types of monitoring below.

VM Monitoring: over 100 metrics are monitored at the VM level by default, and the user can deactivate or re-activate them through the Zabbix interface.
Examples of per-VM monitoring metrics are as follows: total memory, used memory, free memory, total swap memory, used swap memory, free swap memory, total storage, used storage, free storage, CPU load, CPU utilisation (%), CPU count, network metrics (e.g. incoming and outgoing traffic on interfaces), OS-related metrics, process-related metrics (e.g. number of running processes), service-related metrics (e.g. FTP/Email/SSH server is running), etc. A list of concrete measured metrics that are provided either as active or disabled metrics is given at the end of this appendix.

Application Monitoring: Zabbix allows users to add their own metrics to be monitored and stored through the framework. These can provide information about the state of the running application, its performance and other application-specific information. As these are application-specific, the experimenters need to explicitly configure them at the agent and the server.

Infrastructure Monitoring: BonFIRE provides its experimenters with the ability to get monitoring data about the physical machines that run their VMs. We refer to this service as infrastructure monitoring. The most requested infrastructure metrics are: CPU load, total and free memory, free swap memory, the number of VMs on a physical node, disk IO, disk read/write, incoming and outgoing traffic on interfaces, etc.

Broker-level Monitoring

The BonFIRE sites do not generally understand the concept of the experiment; this is implemented at the broker level. BonFIRE timestamps and exposes to the experimenters the times at which events relevant to their experiment take place. This includes experiment and resource (e.g. compute, storage or network) events.
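Relating to the application monitoring described above, the following is a small sketch of one way an experimenter could push a custom metric into the site's Zabbix server, using the standard zabbix_sender utility. The server name, host name and item key are placeholders and would need to match a trapper item configured on the Zabbix server.

    # Sketch: report an application-specific value to Zabbix from inside the experiment.
    import subprocess

    def push_metric(value,
                    server="zabbix.example.org",     # assumed Zabbix server
                    host="my-experiment-vm",         # monitored host as known to Zabbix
                    key="app.requests_per_second"):  # user-defined trapper item key
        subprocess.check_call([
            "zabbix_sender", "-z", server, "-s", host,
            "-k", key, "-o", str(value)])

    push_metric(42.0)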
Measured  metrics  in  BonFIRE     52  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   VM  Monitoring:  the  following  metrics  are  actively  measured  by  default:   Buffers  memory   Cached  memory   CPU  system  time  (avg1)   CPU  nice  time  (avg1)   CPU  idle  time  (avg1)   CPU  iowait  time  (avg1)   CPU  user  time  (avg1)   Free  disk  space  on  /   Free  memory   Free  swap  space   Host  boot  time   Host  status   Host  uptime  (in  sec)   Incoming  traffic  on  interface  lo   Incoming  traffic  on  interface  eth1   Incoming  traffic  on  interface  eth0   Number  of  processes   Number  of  running  processes   Number  of  users  connected   Outgoing  traffic  on  interface  lo   Outgoing  traffic  on  interface  eth0   Outgoing  traffic  on  interface  eth1   Ping  to  the  server  (TCP)   Processor  load   Shared  memory   System  cpu  usage  average   Total  disk  space  on  /   Total  memory   Total  swap  space   Used  disk  space  on  /   Used  disk  space  on  /  in  %     The   following   metrics   are   disabled   (are   not   measured   by   default),   but   can   be   enabled   by   BonFIRE  user  any  time:   Checksum  of  /usr/sbin/sshd   Checksum  of  /usr/bin/ssh   Checksum  of  /vmlinuz   Checksum  of  /etc/services   Checksum  of  /etc/inetd.conf   Checksum  of  /etc/passwd   Email  (SMTP)  server  is  running   Free  disk  space  on  /usr   Free  disk  space  on  /var   Free  disk  space  on  /tmp   Free  disk  space  on  /home   53  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Free  disk  space  on  /opt   Free  disk  space  on  /tmp  in  %   Free  disk  space  on  /var  in  %   Free  disk  space  on  /usr  in  %   Free  disk  space  on  /  in  %   Free  disk  space  on  /home  in  %   Free  disk  space  on  /opt  in  %   Free  number  of  inodes  on  /usr   Free  number  of  inodes  on  /tmp   Free  number  of  inodes  on  /home   Free  number  of  inodes  on  /   Free  number  of  inodes  on  /opt   Free  number  of  inodes  on  /tmp  in  %   Free  number  of  inodes  on  /  in  %   Free  number  of  inodes  on  /usr  in  %   Free  number  of  inodes  on  /opt  in  %   Free  number  of  inodes  on  /home  in  %   Free  swap  space  in  %   FTP  server  is  running   Host  information   Host  local  time   Host  name   IMAP  server  is  running   Maximum  number  of  opened  files   Maximum  number  of  processes   News  (NNTP)  server  is  running   Number  of  running  processes  zabbix_server   Number  of  running  processes  zabbix_agentd   Number  of  running  processes  apache   Number  of  running  processes  inetd   Number  of  running  processes  mysqld   Number  of  running  processes  sshd   Number  of  running  processes  syslogd   POP3  server  is  running   Processor  load5   Processor  load15   Size  of  /var/log/syslog   SSH  server  is  running   Temperature  of  CPU  1of2   Temperature  of  CPU  2of2   Temperature  of  mainboard   Total  disk  space  on  /home   Total  disk  space  on  /usr   Total  disk  space  on  /tmp   Total  disk  space  on  /opt   Total  number  of  inodes  on  /usr     54  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Total  number  of  inodes  on  /   Total  number  of  inodes  on  /opt   Total  number  of  inodes  on  
/home   Total  number  of  inodes  on  /tmp   Used  disk  space  on  /usr   Used  disk  space  on  /var   Used  disk  space  on  /home   Used  disk  space  on  /tmp   Used  disk  space  on  /opt   Used  disk  space  on  /usr  in  %   Used  disk  space  on  /var  in  %   Used  disk  space  on  /tmp  in  %   Used  disk  space  on  /opt  in  %   Version  of  zabbix_agent(d)  running   WEB  (HTTP)  server  is  running     Infrastructure  Monitoring:  information  about  the  following  16  measured  metrics  is  provided   Eth0  outgoing  traffic   Eth0  incoming  traffic   Running  VMs   Processor  load   Free  swap  space   Total  memory   Free  memory   Disk  sda  Write  Bytes/sec   Disk  sda  Write:  Ops/second   Disk  sda  IO  ms  time  spent  performing  IO   Disk  sda  IO  currently  executing   Disk  sda  Read:  Milliseconds  spent  reading   Disk  sda  Read:  Ops/second   Disk  sda  Write:  Milliseconds  spent  writing   Disk  sda  Read  Bytes/sec   Ping  to  the  server  (TCP)         55  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Appendix  G:  Grid’5000  requirements     Grid’5000   is   a   scientific   instrument   designed   to   support   experiment-­‐driven   research   in   all   areas   of   computer   science   related   to   parallel,   large-­‐scale   or   distributed   computing   and   networking.   Supporting   over   550   different   experimenters   a   year,   it   is   composed   of   11   sites   in   France   and   Luxembourg,   providing   over   1500   nodes   and   8500   cores   that   experimenters   can   reserve   and   reconfigure   at   bare   hardware   level,   including   the   capacity   to   turn   nodes   on   and   off.   At   the   networking  level,  most  sites  are  interconnected  with  a  dedicated  10G  network  link,  with  some  sites   providing   Infiniband   Myrinet   10G   as   a   supplement   over   standard   Ethernet   1G.   Experiments   can   be   isolated  from  each  other  in  dedicated  VLANs,  including  nation-­‐wide  VLANs.     Monitoring  solutions      Grid'5000   distinguishes   between   monitoring   (concerned   with   service   health)   and   metrology   (concerned  with  measurements  for  experiments).     Monitoring   is   configured   by   puppet   for   all   services   deployed   through   puppet.   It   uses   nagios   for   service   status   and   munin   for   stat   evolution   on   some   services   (disk   utilization   for   example).   Alerts   are   configured  if  services  do  not  respond.  Network  between  sites  is  monitored  using  smokeping,  but  no   alerts   are   sent   from   smokeping.   Local   networks   are   monitored   through   cacti,   but   no   alerts   are   configured.  Node  health  is  checked  through  g5k-­‐checks  at  boot  time,  and  regularly  by  the  resource   manager.   g5K-­‐check   makes   sure   the   properties   described   for   a   node   (memory,   uplink   speed,   disk   size,  number  of  cores,  etc)  are  those  seen  at  run-­‐time.  Grid’5000  uses  monika  to  display  node  health.   No   alerts   are   issued   through   that   system.   Deployments   statistics   are   collected   through   kstats3,and   we  are  now  in  the  planning  phases  to  detect  unusual  patterns  in  failures  to  identify  nodes  that  fail  in   above-­‐-­‐average  rates.     
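As an illustration of the service-status checks performed by Nagios in such a setup, a check plugin is essentially a small program that prints a status line and exits with code 0 (OK), 1 (WARNING) or 2 (CRITICAL). The sketch below follows that convention; the URL is a placeholder, not an actual Grid'5000 service endpoint.

    # Sketch of a Nagios-compatible HTTP service check.
    import sys
    import urllib.request

    URL = "https://service.example.org/"   # hypothetical service endpoint

    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            if resp.status == 200:
                print("OK - service answered with HTTP 200")
                sys.exit(0)
            print("WARNING - unexpected HTTP status %d" % resp.status)
            sys.exit(1)
    except Exception as exc:
        print("CRITICAL - %s" % exc)
        sys.exit(2)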
Monitoring  requirements     From  the  above:   • Service  status  for  all  network  reachable  services   • Server  utilization  for  servers  running  services   o CPU  utilization   o Disk  utilisation,  IO,  latency,  throughput,   o Network  traffic   o Number  of  processes   • Network  properties  (ping  probes  between  sites,  network  throughput)   • Networking  equipment  (CPU  utilisation,  main  counters)   • Node  health  (conformance  and  usage  failure  rates)     Required  metrics   56  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1     All   information   you   can   get   is   useful   to   assess   the   health   of   the   infrastructure   to   detect   problems   before  users  or  to  be  able  to  diagnose  quickly  when  problems  are  reported.  Grid’5000  has  plans  to   go  forward  and  increase  monitoring,  and  would  not  accept  a  solution  that  lowers  the  surveillance  of   the  health  of  the  infrastructure,  as  this  is  the  key  to  the  reliability  of  the  facility.     For  nodes,  these  are  the  metrics  collected  in  Grid’5000:  cpu_nice,  mem_cached,  pkts_out,  disk_total,   part_max_used,   mem_buffers,   cpu_idle,   boottime,   mem_free,   load_five,   proc_run,   load_one,   pkts_in  pdu  (energy  consumed  in  watts),  swap_free,  cpu_num,  mem_shared,  cpu_user,  swap_total,   pdu_shared   (energy   consumed   by   the   pdu   this   nodes   is   connected   to),   ambient_temp,   cpu_speed,   bytes_in,   cpu_wio,   cpu_system,   bytes_out,   proc_total,   disk_free,   cpu_aidle,   load_fifteen,   mem_total.     And   for   the   network   hardware:   Inbound   and   outbound   traffic   on   each   interface   (in   bits/s   and   packets)  and  CPU  utilisation.     57  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Appendix  H:  FUSECO  Playground  requirements     The   FUSECO   Playground   experimentation   facility   [9]   is   a   3GPP   Evolved   Packet   Core   (EPC)   centric,   independent  and  open  laboratory  for  mobile  broadband  communication  research  and  development.   It   supports   3GPP   (e.g.   LTE,   UMTS)   and   non-­‐3GPP   (e.g.   Wi-­‐Fi,   WiMAX)   technologies.   In   addition   to   the   FOKUS   OpenEPC   toolkit   [56],   the   FUSECO   Playground   also   features   the   Open   IMS   Core   [57]   and   enables   to   gain   experience   with   IMS-­‐based   Rich   Communication   Services   (RCS)   and   Voice   over   LTE   (VoLTE)   services,   as   well   as   the   OpenMTC   toolkit   [58],   supporting   an   open   range   of   M2M   applications.       FITeagle   [11]   (mainly   based   on   the   Panlab   FIRE-­‐Teagle   developments)   is   an   extensible   open   source   experimentation   and   management   framework   for   federated   Future   Internet   testbeds.   It   is   used   to   manage   and   book   instances   of   FUSECO   services   such   as   OpenIMSCore-­‐aaS   and   OpenEPC-­‐aaS   (licensed).     Monitoring  solution     Two   monitoring   tools   are   used   to   monitor   the   FUSECO   playground   infrastructure   and   its   services.   Zabbix   is   used   to   monitor   the   infrastructure   to   ensure   its   health   and   performance.   
In addition to monitoring performance metrics such as CPU, memory, network, disk space, processes and OS-related metrics, it supports custom metrics to monitor whatever users want. Thus, service- and application-related metrics (state, performance) can also be measured.

Second, the multi-hop Packet Tracking tool [10] is a distributed network surveillance solution for detecting the paths of data packets through networks. It enables the monitoring of routes, delays, packet loss, the influence of cross traffic, and more. The Packet Tracking visualisation GUI, called Netview, is used to track and visualise packet paths and their hop-to-hop characteristics on a world map. Packet tracking enables the following functionalities:
• QoS validation, supporting one of the main functions of EPC, namely QoS guarantees
• Total end-to-end as well as individual network-to-network monitoring
• SLA validation
• Handover validation, performance evaluation and visualisation
• Per-hop real-time link quality measurements and security constraint validations
• Flexible measurement granularity adjustment, from single packets up to flows

Monitoring requirements

1. Property: Network status

Monitoring information about the environment and the available access networks will enhance the performance of the Access Network Discovery and Selection Function (ANDSF) in EPC. Many metrics should be measured, such as signal strength, number of users/subscribers, number of active users/subscribers or operator KPIs.

2. Property: IMS Performance

Monitoring information about the IMS performance is required and therefore many KPIs should be measured. Signalling and media metrics are relevant, such as the number of dropped calls/sessions, packet loss and delay (signalling and media), messaging rate (signalling), bandwidth (media) and jitter (media).

Appendix I: NITOS requirements

NITOS is a wireless testbed which currently consists of 50 operational wireless nodes, based on commercial Wi-Fi cards and Linux open source drivers. The testbed is designed to achieve reproducibility of experimentation, while also supporting the evaluation of protocols and applications in real-world settings. The NITOS testbed is deployed on the exterior of the University of Thessaly (UTH) campus building.

Monitoring solution

The monitoring of resources is performed through Chassis Manager cards, which report whether a node is currently powered on or off. The experimenter can even force a node to reset or power on/off. NITOS is in the process of developing a tool for monitoring the status of the nodes (checking ssh and telnet service operation).
Monitoring requirements

More sophisticated methods for monitoring the infrastructure are needed. Right now, all the monitoring is left to the administrator of the testbed and no specific tools are being used. Regarding experiments, NITOS is in the process of developing a monitoring framework which will offer the experimenter spectrum analysis during experiments.

Required metrics

The following list includes the properties and the associated metrics that are essential for the normal operation of NITOS.

1. Normal functionality and utilisation of the NITOS nodes and accurate detection of any misbehaviour

This property is essential from the testbed provider's view. It is very important to ensure the regular behaviour of the NITOS nodes and their accurate response to management commands (power on/off, retrieval of features or statistics, etc.). Each NITOS node is equipped with a Chassis Management card that must always be online and is responsible for powering the node on/off. It is also interesting to monitor the usage of the reserved nodes, in order to create user utility profiles that affect the scheduling policies of the NITOS scheduler and maximise the testbed utilisation. The related metrics should be:
• Indications of power on/off status from the Chassis Management card.
• Total and used memory of nodes.
• CPU load and utilisation of nodes.
• Traffic produced by the nodes (on both wireless and Ethernet interfaces).

2. Stable wireless connectivity environment, given a specific outside interference

This property is essential from both the testbed provider's view and the user's view. It is very important to "build" a stable wireless connectivity environment that is affected only by outside interference, while any other cause that influences this environment should be detected. Such causes include possible damage to the wireless antennas, or a persistent change of the topology due to unpredicted outside factors (the testbed is outdoors and, for example, a strong wind may change the antenna positions). In addition, it would be ideal to create a connectivity map of the testbed, which would be a perfect introduction for an experimenter who wants to use this facility (a sketch of how per-link signal measurements could be gathered is given at the end of this metrics section). The related metrics should be:
• Signal quality and strength for each pair of nodes and each channel.
• Noise level for each wireless interface and each channel.

3. Sensor measurements and environment depiction

This property is essential from the user's view. The Chassis Management card of each NITOS node is equipped with a plethora of sensors (humidity, light, temperature, power consumption, etc.). The sensing measurements of all reserved nodes are available, through a web GUI, to the user who reserved the nodes. The related metrics should include temperature, light, humidity and other sensor measurements.
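A rough sketch of how per-link signal measurements could be gathered on a node, towards the connectivity map mentioned under property 2. It assumes Linux nodes where the iw tool is available and an interface named wlan0; both are assumptions about the node configuration.

    # Sketch: collect the signal level reported for each currently associated peer.
    import re
    import subprocess

    def station_signals(interface="wlan0"):
        """Return {peer MAC: signal in dBm} for all currently associated stations."""
        out = subprocess.check_output(["iw", "dev", interface, "station", "dump"],
                                      text=True)
        signals, current = {}, None
        for line in out.splitlines():
            if line.startswith("Station "):
                current = line.split()[1]           # peer MAC address
            m = re.search(r"signal:\s*(-?\d+)", line)
            if m and current:
                signals[current] = int(m.group(1))  # dBm
        return signals

    print(station_signals())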
Appendix J: w-iLab.t wireless network testbed requirements

The requirements for both the sensor part and the Wi-Fi part of the w-iLab.t testbed are addressed in this section. Note that there are two w-iLab.t locations: Office and Zwijnaarde. While listed as a separate logical entity in the Fed4FIRE proposal, "w-iLab.t mobile" is not seen as a separate testbed from an administrative and implementation point of view, but as part of the w-iLab.t Zwijnaarde. At the moment different tools are used to manage/operate the two locations (Motelab-based vs. OMF-based). The usage of these management tools does not influence the monitoring requirements for the two testbed instances, so we describe them in the same section.

Monitoring solution

There are three types of monitoring solutions:

Facility monitoring: for the w-iLab.t Office, the health of the key facility components can currently only be determined by checking whether the central servers can be reached (either through the web interface, or by using ping/ssh). For the w-iLab.t Zwijnaarde, a Google calendar account is available which indicates whenever there is a power outage. Similarly to the w-iLab.t Office, the health of the central server can be checked by testing whether the web interface is operational or by using ping/ssh. When taking facility monitoring down to the level of the individual nodes, the health (ability to ping/ssh) of the nodes (embedded PCs + sensor nodes) can be checked on the web interface of both testbed locations.

Infrastructure monitoring: for the w-iLab.t Zwijnaarde, the power consumption of the nodes is shown on the web interface and is retrieved from the PDUs (Power Distribution Units) through SNMP, or from the PoE (Power over Ethernet) switches. Wireless connectivity between the nodes in the w-iLab.t Zwijnaarde can also be visualised on the web interface. More advanced infrastructure monitoring can easily be set up by the user with custom scripts, and the output of this custom logging can also be saved with OML.

Experiment monitoring: in the w-iLab.t Zwijnaarde, OML is used for the collection of user-defined experimentation metrics. In the w-iLab.t Office, only measurement data from the sensor nodes can be logged to the central database, using the Motelab framework.

Monitoring requirements

Monitoring is very important. OML already provides a good basic set of functionalities, but the experimenter would also like to know background information regarding the testbed, e.g. the amount of interfering activity on the shared wireless medium.
On  the  other  hand,  the  goal  of  his  experiment  is   to   produce   actual   measured   results   that   will   be   used   in   publications   to   characterize   e.g.   his   62  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   developed   resilient   multi-­‐hop   wireless   sensor   network   protocol.   Therefore   the   experimenter   wants   to  store  some  metrics  related  to  his  experiment,  such  as  packet  error  rate  or  disconnection  duration.   The  following  user  needs  can  be  identified:   • The   experimenter   and   the   facility   provider   should   be   able   to   distinguish   errors   introduced   by   the  solution  under  test  from  errors  related  to  the  infrastructure  (e.g.  error  in  my  algorithm   vs.   error   caused   by   malfunction   of   the   embedded   PC   forwarding   sensor   messages   to   the   testbed  controller)   • The  experimenter  should  be  able  to  measure  time  differences  between  actions  on  different   nodes  very  accurately.  E.g.  a  Wi-­‐Fi  experimenter  wants  to  measure  the  uni-­‐directional  delay   instead   of   the   roundtrip   time.   Or   wants   to   know   the   recovery   time   to   restore   connectivity   after   a   link   failure,   handover,   etc.   Typically   these   measurements   need   sub-­‐millisecond   accuracy.   This   requires   very   solid   and   accurate   clock   synchronization   between   all   nodes   in   the  testbed.  The  iMinds  w-­‐iLab.t  testbed  relies  on  PTPd  to  provide  this  functionality.   • Monitoring   of   the   interference   in   a   network   is   also   important   in   sensor   networks;   interference  may  also  be  caused  by  nodes  that  are  not  part  of  the  testbed.    Without  a  view   on  the  wireless  activity  in  a  testbed  not  related  to  the  experimenter’s  solution,  it  is  hard  to   tell  anything  about  performance  of  the  solution  under  test.   • The  facility  provider  should  be  made  aware  of  issues  with  the  installation.    Detecting  a  failing   node   is   easy.     Detecting   a   failing   sensor   is   harder.   Detecting   a   failing   network   interface   (or   loose  antenna,  etc.)  may  also  be  harder.    Yet  detecting  these  issues  is  important.     Additional   requirement   can   be   identified   when   focusing   on   the   aspect   of   facility   federation.   To   make   measurements   comparable   in   a   federation   context,   we   should   have   a   clear   understanding   of   what   different  tools  really  measure,  and  how  the  measurement  impacts  the  experiment.     
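A minimal sketch of how such experiment metrics (for example a per-interval packet error rate) could be reported through OML from within an experimenter's own tool. It assumes the oml4py client library and its OMLBase interface (addmp/start/inject/close); the application name, measurement point schema and collection URI are all placeholders for whatever the testbed's OML server expects.

    import oml4py

    oml = oml4py.OMLBase("my_wsn_app",       # application name (placeholder)
                         "my_experiment",    # experiment/domain id (placeholder)
                         "node12",           # sender id (placeholder)
                         "tcp:oml.example.org:3003")  # collection URI (placeholder)
    oml.addmp("link_quality", "interval:long sent:long lost:long per:double")
    oml.start()

    # Report one measurement interval: 1000 packets sent, 37 lost.
    sent, lost = 1000, 37
    oml.inject("link_quality", [1, sent, lost, float(lost) / sent])

    oml.close()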
Required  metrics     Metrics  which  are  important  to  the  testbed  provider  of  w-­‐iLab.t  (both  Office  and  Zwijnaarde):   • Metrics  regarding  the  embedded  PC  (hosts  all  peripheral  devices  (like  sensor  nodes):   o CPU  (load/  idle  time)   § +  temperature   o Memory  (total/used/free)   o Hard  disk  (total/used/free)   § Number   of   times   the   hard   drive   was   re-­‐imaged   (only   w-­‐iLab.t   Zwijnaarde,   OMF  load)  +  number  of  failed  attempts  to  re-­‐image   § Monitoring   through   S.M.A.R.T.(Self-­‐Monitoring   Analysis   and   Reporting   Technology,   currently   integrated   into   most   hard   drives)   could   indicate   malfunctioning  hard  drive  by  the  number  of  faulty  writes  before  it  is  totally   broken)   o Powered  On/Off   § Power  consumption  when  On   o Control  interface  status   § Able  to  Ping/SSH/Telnet  to  nodes?   63  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   • • § Saturation  of  control  interface  (ex.  eth0  up:60%  down:1%)   o Active  interfaces  (with  Amount  of  traffic  (received/transmitted))   § Wired     § Wireless  (with  logging  of  used  channel  and  Noise  floor)   • Bandwidth  usage  could  indicate  a  loose  antenna  or  bad  wireless  card   o (All)  Present  peripheral  (USB)  devices  :     § For  example  in  w-­‐iLab.t  Zwijnaarde  :     • Environment  Emulator   • Tmote  Sky  or  RM090  sensor  nodes   • Webcam   • Bluetooth  dongle   • Imec  sensing  engine   Metrics  regarding  the  sensor  nodes  :   o Number  of  times  the  node  has  been  re-­‐imaged   § Failed  attempts   o Stats  of  used  topologies   § Which  combination  of  nodes  is  used  in  the  same  experiment?   Metrics  that  are  important  to  the  experimenters:   o Important  metrics  that  can  be  re-­‐used  from  the  testbed  provider  :   § Node  status  (available/in-­‐use)   § Status  of  peripheral  devices  (like  sensor  node  or  webcam)   § Commonly  used  sensor/Wi-­‐Fi  node  topologies   o Monitoring  of  the  wireless  inference/wireless  activity   § Using  the  Imec  sensing  engines   o Monitoring  of  the  wireless  links  between  each  pair  of  nodes     § Generate  full  topology  maps  with  for  each  pair  of  nodes  an  indication  of  :   • Received  Signal  Strength   • Throughput  (Iperf)   • Noise  floor   § All  of  the  above  with  different  settings:   • Transmit  power   • Wireless  channel  (on  all  a/b/g/n  bands)   § These   topology   maps   should   be   re-­‐generated   periodically   to   account   for   changes  in  the  environment  of  any  kind.     The  required  metrics  by  Wi-­‐Fi  experimenters  are:   o Free  memory,  CPU  load  and  utilisation   o Network   metrics:   network   speed,   topology   detection,   reachability,   signal   quality,   signal   strength,   noise   level,   interference,   data   transfer   rates,   Radio   Frequency   (RF)   quality,  throughput,  available  bandwidth   o Packet  arrival  rate,  packet  loss,  delay,  jitter     64  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   In  the  iMinds  w-­‐iLab.t,  the  sensor  boards  are  flashed  by  a  separate  node  (embedded  PC)  which  is   physically   connected   to   the   sensor   network   board.   
The following metrics would be interesting when monitoring this node:
o Memory: free memory
o CPU: CPU load
o Board temperature
o Saturation of the control interface (e.g. eth0 up: 60%, down: 1%)
o Insight into the physical disks via S.M.A.R.T. (faulty writes) would be very useful, as disk problems have a big impact on performance and are hard to debug based on other metrics
o Number of received/transmitted packets and noise floor of the Wi-Fi cards
o Channels used on the Wi-Fi cards

Regarding the sensor node itself, the following monitoring metrics are of interest:
• The number of times the node has been flashed and the number of failed attempts to flash it
• Statistics of the topologies that are used

Regarding the wireless medium, the following monitoring metrics are of interest:
• Wireless activity observed using sensing engines; here the CREW project comes in

Appendix K: Netmode requirements

The NETMODE testbed comprises wireless nodes equipped with 802.11 a/b/g/n interfaces.

Monitoring solutions

There are two types of monitoring solutions in the NETMODE testbed:
• Infrastructure monitoring. This type of monitoring involves information regarding the health of the nodes (UP/DOWN) and the services running on the nodes (ssh or telnet). The monitoring tool used to gather this information is Nagios.
• Experimenter's measurements. This type of monitoring is based on the OML library. The experimenter can use this library to collect statistics (e.g. bandwidth, delay, packet loss, etc.) for his experiments.

Monitoring requirements

The infrastructure should be able to provide experimenters with more detailed monitoring information. In the context of a wireless testbed like NETMODE, metrics such as signal strength and noise level per node pair would be valuable to the experimenter in order to define, collect and evaluate experimental results.

Required metrics

The following metrics gather monitoring information valuable to both the facility owner and the experimenter.
• Node status (UP/DOWN)
• Service status (ssh, telnet)
• CPU utilisation
• Memory utilisation
• Hard disk usage
• Bandwidth utilisation (on both wireless and wired interfaces)
• Signal strength for each node pair and wireless interface
• Noise level for each node pair and wireless interface

Appendix L: requirements from SmartSantander

At the time being, the Santander facility [15] consists of more than 2,000 IEEE 802.15.4 devices, most of them supporting both experimentation and service provision. In the future, the deployment will consist of around 12,000 IoT devices.
It  is  important  to  highlight  that  the  SmartSantander  facility  is   not  only  meant  for  IoT  experimentation  but  also  for  providing  real-­‐world  Smart  City  services.       Monitoring  solutions     Two   facility   monitoring   processes   are   performed   dynamically   by   the   Management   and   Fault-­‐ Monitoring   Subsystem   implemented   within   the   SmartSantander   platform,   namely:   resource   discovery  and  resource  monitoring.       The  resource  discovery  process  involves  detecting  new  IoT  resources  in  the  testbed,  registering  them   for   use   and   recording   the   resource   descriptions   using   standard   data   models   to   facilitate   the   query   and  retrieval  of  these  resources.       The   resource   monitoring   process   concerns   the   dependability   of   the   testbed   platform   i.e.   its   robustness  with  respect  to  software  component  or  equipment  failure.  Under  normal  operation,  the   SmartSantander   platform   is   in   a   constant   state   of   flux.   New   sensor   devices   are   detected   and   registered   with   the   platform.   The   context   status   parameters   (battery   level,   availability   status,   link   quality)  of  existing  devices  change  whilst  they  run  experiments  which  generate  experiment  traces  or   sensor   data   streams   and   execute   data   transformation   functions.   Therefore,   ensuring   the   correct   execution   of   the   IoT   testbed’s   services   in   the   face   of   such   dynamicity   and   ensuring   the   testbed’s   resilience  to  failures  requires  continuous  monitoring  of  the  state  of  its  IoT  resources.     Monitoring  requirements     The  priority  of  monitoring  requirements  is  low.  As  it  has  been  previously  mentioned,  some  metrics   are  currently  being  monitored  to  assure  the  correct  behaviour  of  the  network,  such  as  device  status   (including   battery   level),   measurement   periodicity   and   link   quality.   Nevertheless,   some   other   new   metrics   are   not   monitored   at   the   time   being   and   could   be   interesting,   such   as   whether   or   not   the   measurements   are   within   an   expected   range,   the   link   quality   within   the   mesh   network   or   the   number  of  users  subscribed  to  the  observations  of  a  particular  wireless  sensor  node.  Exporting  the   already  available  metrics  through  OML  would  be  valuable  to  the  experimenter.       Besides,   the   SmartSantander   project   relies   for   some   of   the   experiments   on   mobile   sensors   and   participatory   sensing   where   we   use   users'   mobile   phones.   In   this   respect,   the   variable   number   of   nodes   is   available   and   the   position   of   them   is   retrieved   whenever   possible.   From   an   experimenter   point  of  view  the  required  metrics  could  be  device  status,  measurement  frequency  and  link  quality.   Also,  an  experimenter  should  be  aware  of  node's  capabilities  and  geo-­‐localisation  information  (GPS   position).   67  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1     Required  metrics     1.  Currently  available  metrics     • Device  status:  shows  whether  or  not  a  node  is  alive  and  sending  sensor  observations.  
(Online   /  Offline)   • Battery  status:  shows  relative  level  percentage  against  the  maximum  capacity  of  a  battery.   • Measurement  frequency:  it  measures  the  time  between  sensor  observations  that  a  node  is   currently  using.   • Link   quality:   it   measures   the   percentage   of   the   expected   service   frames   that   haven't   been   lost  across  the  multihop  network.  A  value  of  100%  shows  that  no  frames  have  been  lost  and   a  value  of  0%  means  that  the  node  is  currently  offline.   • Position:  it  refers  to  the  GPS  position  of  the  non-­‐fixed  nodes.     2.  Desirable  metrics     • Measurement   validity:   this   shows   if   the   observations   being   sent   by   a   node   are   in   the   valid   range   of   the   sensor.   For   example,   a   temperature   sensor   measuring   -­‐15ºC   in   Santander   is   malfunctioning  and  should  be  fixed  or  replaced.   • Users  subscribed:  number  of  experimenters/users  subscribed  to  each  node's  observation     3.  Other  information  that  needs  to  be  tracked  and  shown  to  an  experimenter     • Capabilities:  type  of  sensor  observations  that  a  node  can  provide.  This  value  is  usually  fixed   during  the  complete  life  of  the  node,  although  new  sensors  can  be  connected  into  the  node.   • Geolocalisation   information:   SmartSantander   wireless   sensor   network   is   a   real   on   the   field   deployment  with  both  fixed  and  mobile  nodes,  so  the  position  where  a  node  really  is  at  every   moment  is  valuable  information  for  an  experimenter.     68  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   Appendix  M:  NICTA  requirements     NICTA  maintains  a  testbed  (NOrbit)  of  about  40  wireless  nodes.  These  machines  are  equipped  with   one  or  two  802.11  wireless  cards,  and  have  wired  connectivity  to  two  Ethernet  network.  One  of  them   is   reserved   for   control   traffic   (management   and   measurement),   while   the   other   can   be   used   for   specific   experimentations.   There   are   projects   to   install   OpenFlow   equipment   on   the   experimental   Ethernet  link  to  support  finer  setup  of  experiments.  The  testbed  is  controlled  through  OMF  (5.3,  5.4   and   6).   All   experimental   measurement   is   done   through   OML,   with   a   collection   server   being   collocated  with  the  OML  AM  (and  default  EC).     Monitoring  Solutions     The   status   of   the   nodes   is   monitored   through   an   ad   hoc   solution   which   queries   both   the   chassis   management   cards   of   the   machines   for   power   control,   and   attempts   to   contact   various   services   (such  as  telnet,  ssh)  to  get  more  information  as  to  their  operating  system  state  (e.g.,  ready  or  PXE).   Though   there   was   some   experiment   with   Zabbix,   for   more   fine-­‐grained   infrastructure   monitoring,   this  solution  is  not  currently  in  use.     During   the   course   of   an   experiment,   the   experimenter   can   deploy   OML-­‐instrumented   tools   to   provide  more  information  about  the  nodes  and  network.  
The  currently  available  tools  cover:  system   status  such  as  load,  memory  or  users    (SIGAR/nmetrics  wrapper  or  collectd),  active  network  testing   (OTG,   Iperf   and   D-­‐ITG),   passive   network   and   spectrum   monitoring   (wrapper   for   libtrace   for   the   latter   with   radiotap   support,   and   wrapper   for   wpa_supplicant),   node   reachability   (wrapper   to   ping)   or   mobile  nodes'  location  from  GPS  fixes,  when  available  (gpsd  client).     Monitoring  Requirements     The  requirements  on  node  monitoring  are  low  at  the  moment,  as  experimenters  can  instantiate  all   required   monitoring   tools   as   part   of   the   experiment.   Since   these   measurements   are   of   potential   interest  to  many  users,  they  could  eventually  be  provided  by  the  facility  as  part  of  an  infrastructure   monitoring  service  to  the  user.     When   running   an   experiment,   the   user   would   like   to   know   the   characteristics   of   the   wireless   environment  during  that  run.  For  example,  if  the  user  is  running  a  wireless  experiment  using  channel   6   in   the   2.4Ghz   band,   he/she   would   like   to   have   information   about   all   signals   occupying   a   given   frequency   band   around   that   channel   and   during   that   experiment.   This   information   allows   i)   to   identify   potential   factors   that   might   have   affected   the   experiment   results   (e.g.   adjacent   channel   interference)  and  ii)  sound  comparison  of  this  experiment  run  results  with  previous  one  which  were   run   with   similar   'environment'   conditions.   More   specific,   for   a   given   time   window   around   the   69  of  70       ©  Copyright  NICTA  and  other  members  of  the  Fed4FIRE  consortium    2013     FP7-­‐ICT-­‐318389/NICTA/REPORT/PUBLIC/D6.1   experiment   the   user   would   like   to   be   aware   of   frequency   (Hz)   against   occupancy   (%   time)   and/or   Power/Signal  Level  (dBm).     1.  Wireless  Connectivity  Measurement     Prior  or  during  the  run  of  an  experiment,  the  user  would  like  to  know  the  connectivity  characteristics   for   all   the   resources   involved   in   the   experiment.   This   information   allows   i)   the   selection   of   which   resource   to   use   based   on   average   connectivity   properties,   ii)   the   identification   of   potential   factors   affecting  the  experiment  results  (e.g.  highly  variable  packet  delay  between  A  and  B  correlated  with   highly  variable  SNR  between  A  and  B).       More   concrete,   this   encompasses   the   following   information:   for   a   given   time   window   prior   or   during   the   experiment   and   for   each   pair   of   resources   available   in   the   testbed:   RSSI   (received   signal   strength   indicator  in  arbitrary  unit)  for  received  frames,  frame  Loss,  Retries  and  Error  Rate  (%),  SNR  (dBm).     2.  Wireless  Device  Location     Prior  or  during  the  run  of  an  experiment,  the  user  would  like  to  know  the  absolute  or  relative  (to  a   given   reference)   geographical   position   of   the   resources   involved   in   their   experiment.   This   information   allows   the   correlation   of   observed   results   with   the   spatial   context   of   each   resource.   
More concretely, this encompasses the following information: for a given time window prior to or during the experiment, and for each resource: x, y, z coordinates (with absolute or relative reference and unit information).

3. Device Energy Consumption

During the run of an experiment, the user would like to know the energy consumed by the resources involved in their experiment. This information allows the correlation of observed results with the energy cost of the system being studied by the user (e.g. a new routing scheme). More concretely, this encompasses the following information: for a given time window during the experiment, and for each resource: the consumed energy (reported, for example, as average power draw in mW).
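Where nodes are fed from manageable PDUs (as, for example, in the w-iLab.t Zwijnaarde deployment described earlier), such figures could be obtained over SNMP. The sketch below is illustrative only: the community string, PDU address and, in particular, the OID are placeholders, since power-measurement OIDs are vendor-specific.

    # Sketch: read a per-outlet power value from a PDU over SNMP with pysnmp.
    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    PDU = "pdu1.example.org"                        # placeholder PDU address
    OUTLET_POWER_OID = "1.3.6.1.4.1.99999.1.1.1"    # placeholder, vendor-specific

    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(), CommunityData("public"),
        UdpTransportTarget((PDU, 161)), ContextData(),
        ObjectType(ObjectIdentity(OUTLET_POWER_OID))))

    if error_indication or error_status:
        print("SNMP query failed:", error_indication or error_status)
    else:
        for name, value in var_binds:
            print("%s = %s W" % (name, value))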