
IBM BigFix Version 9.x: Capacity Planning, Performance, and Management Guide


IBM® Security
IBM BigFix Version 9.x: Capacity Planning, Performance, and Management Guide
Document version 9.x.11
IBM BigFix Performance Team

© Copyright International Business Machines Corporation 2015, 2016, 2017.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Contents
List of Figures
Author List
Revision History
1 Introduction
2 IBM BigFix 9.x Overview
  2.1 Functional Overview
  2.2 Architectural Overview
  2.3 Return on Investment
3 BigFix Performance Management
  3.1 Reference Benchmarks
    3.1.1 The Unified Benchmark
  3.2 Key Performance Indicators
    3.2.1 Little's Law
    3.2.2 Evaluating the Number of Concurrent Users
  3.3 Monitoring Tools
    3.3.1 Iometer
    3.3.2 BigFixPerf
    3.3.3 BigFixDisk
4 BigFix Capacity Planning
  4.1 How to Use This Guide: Capacity Planning Guidelines
  4.2 CPUs, vCPUs, and Cores
  4.3 Capacity Planning Spreadsheet
  4.4 BigFix Management Server Capacity Planning Recommendations
    4.4.1 Message Level Encryption Enablement
    4.4.2 WebUI Enablement
    4.4.3 Considerations for a Local or Remote WebUI
  4.5 BigFix Console Capacity Planning Recommendations
  4.6 BigFix Relay and Associated Infrastructure Capacity Management Considerations
    4.6.1 Relay Virtualization
  4.7 Upgrade Capacity Planning Considerations
    4.7.1 IBM BigFix 9.2.5 Upgrade
    4.7.2 IBM BigFix 9.5 Upgrade
    4.7.3 IBM BigFix 9.5.3 Upgrade
  4.8 Capacity Planning Example
5 Performance Management
  5.1 Infrastructure Management Approaches
    5.1.1 Virtualization
    5.1.2 Operating System Management
    5.1.3 Database System Management
  5.2 BigFix Management Approaches
    5.2.1 FillDB Options
    5.2.2 Agent Heartbeat Frequency
    5.2.3 Console Refresh Frequency
    5.2.4 Data Archiving
    5.2.5 WebUI Management
  5.3 Benchmark Management Approaches
6 BigFix Maintenance Recommendations
  6.1 Database Backup Management
    6.1.1 Online Backup Support
    6.1.2 Database Backup Cleanup
  6.2 Database Statistics Management
  6.3 Database Reorganization
  6.4 Database Maintenance Automation
7 Security Considerations
  7.1 Security Management
    7.1.1 Web Application Security Scanning
    7.1.2 Application Source Code Scanning
    7.1.3 Threat Modeling
    7.1.4 Security Regulatory Compliance Reports
  7.2 Security Hardening
    7.2.1 Port Management and Firewall Configuration
    7.2.2 Common Vulnerabilities and Exposures Management
8 Summary Cookbook
  8.1 Base Installation Recommendations
  8.2 Post Installation Recommendations
  8.3 High Scale Recommendations
Appendix A: DB2 Online Backup Enablement
References

List of Figures

Figure 1: Revision History
Figure 2: BigFix Architecture
Figure 3: BigFix Server Elements
Figure 4: Business Value Analyst for IBM BigFix and MobileFirst
Figure 5: BigFix Performance Benchmark Environment Sample
Figure 6: Little's Law
Figure 7: Monitoring Tools
Figure 8: Iometer User Interface Sample
Figure 9: Iometer Workload Sample
Figure 10: Iometer Results Sample
Figure 11: Disk Queue Length by IO Subsystem Type
Figure 12: BigFixPerf Syntax
Figure 13: BigFixPerf Windows Example
Figure 14: BigFixPerf Windows Example Output
Figure 15: BigFixPerf UNIX Example
Figure 16: BigFixPerf UNIX Example Output
Figure 17: BigFixDisk Syntax
Figure 18: BigFixDisk Windows Example
Figure 19: BigFix Management Server Capacity Planning Requirements
Figure 20: Storage Requirements Breakdown
Figure 21: BigFix Message Level Encryption Capacity Planning Requirements
Figure 22: BigFix WebUI v1 Capacity Planning Requirements
Figure 24: Console Workstation Capacity Planning Requirements
Figure 25: Terminal Server Capacity Planning Requirements
Figure 26: BigFix Relay Infrastructure
Figure 27: Top Level Relay Capacity Planning Requirements
Figure 28: BigFix Virtualization Performance
Figure 29: Sample Capacity Planning Profile
Figure 30: BigFix Capacity Planning & Performance Management
Figure 31: Modifying the Linux IO Scheduler
Figure 32: Linux IO Scheduler Throughput
Figure 33: Linux IO Scheduler Latency
Figure 34: BigFix Schema Characteristics
Figure 35: DB2 Configuration Recommendations
Figure 36: FillDB Database Boost Levels
Figure 37: FillDB Parallelism Example
Figure 38: Database Maintenance Approaches
Figure 39: Database Backup with Compression Command
Figure 40: Database Offline Backup Restore
Figure 41: Database Online Backup Schedule
Figure 42: Database Incremental Backup Enablement
Figure 43: Database Online Backup Manual Restore
Figure 44: Database Online Backup Automatic Restore
Figure 45: Database Log Archiving to Disk
Figure 46: Database Log Archiving to TSM
Figure 47: Database Roll Forward Recovery: Sample A
Figure 48: Database Roll Forward Recovery: Sample B
Figure 49: Database Backup Cleanup Command
Figure 50: Database Backup Automatic Cleanup Configuration
Figure 51: Database Statistics Collection Command
Figure 52: Database Statistics Collection Table Iterator
Figure 53: Database Reorganization Commands
Figure 54: Database Reorganization Table Iterator
Figure 55: Sample Database Maintenance Schedule
Figure 56: BigFix Security Management Summary
Figure 57: Security Compliance Report Options
Figure 58: BigFix Port Management
Figure 59: Port Utility Hosts Configuration
Figure 60: Port Utility Active Port Configuration
Figure 61: Port Utility Ports and Programs to Ignore
Figure 62: Base Installation Recommendations
Figure 63: Post Installation Recommendations
Figure 64: High Scale Recommendations
Figure 65: BigFix Database LOB Logging Check
Figure 66: Sample Database Backup with Compression Command
Figure 67: Sample Database Connect
Figure 68: Sample Migration
Figure 69: Sample Database Offline Backup Restore

Author List

This paper is the team effort of a number of security and performance specialists comprising the IBM BigFix performance team. Additional recognition goes out to the entire IBM BigFix development team.

Mark Leitch (primary contact for this paper)
IBM Toronto Laboratory

Bernardo Pastorelli
Federico Pezzotti
Mariella Corbacio
Nicola Milanese
Pietro Marella
IBM Rome Laboratory

Revision History

Date                 Version  Revised By  Comments
July 15th, 2015      Draft    MDL         Initial version for review.
November 1st, 2015   9.2.x.0  MDL         Initial version for publication.
December 10th, 2015  9.2.x.1  MDL         Incorporated review comments (minor edits).
January 11th, 2016   9.2.x.2  MDL         Incorporated review comments (minor edits).
March 27th, 2016     9.x.3    MDL         Changes made for BigFix 9.5.
April 13th, 2016     9.x.4    MDL         Qualified DB2 online backup support.
May 11th, 2016       9.x.5    MDL         Incorporated updates for the 9.5.2 release.
June 16th, 2016      9.x.6    MDL         Added storage and anti-collocation recommendations.
July 6th, 2016       9.x.7    MDL         Refined storage and virtualization content.
July 19th, 2016      9.x.8    MDL         Incorporated review comments.
July 26th, 2016      9.x.9    MDL         Added DB2 online backup enablement content.
October 7th, 2016    9.x.10   MDL         Added BigFix 9.5.3 agent resources content.
July 30th, 2017      9.x.11   MDL         BigFix 9.5.5 updates.

Figure 1: Revision History

1 Introduction

Capacity planning involves the specification of the various components of an installation to meet customer requirements, often with growth or timeline considerations. IBM BigFix offers endpoint lifecycle and security management for large scale, distributed deployments of servers, desktops, laptops, and mobile devices across physical and virtual environments.

This document provides an overview of capacity planning for the IBM BigFix Version 9.x solution. In addition, it offers management best practices to achieve a well performing installation that demonstrates service stability. This includes the deployment of the BigFix management servers into cloud, or virtual, environments. Capacity planning for virtual environments typically involves the specification of sufficient physical resources to provide the illusion of unconstrained resources in an environment that may be characterized by highly variable demand.

In this document we will provide an IBM BigFix overview, including functionality, architecture, and performance. We will then offer the capacity planning recommendations, including considerations for hardware configuration, software configuration, and cloud best practices. A summary "cookbook" is provided to manage installation and configuration for specific instances of BigFix.

Note: This document is considered a work in progress. Capacity planning recommendations will be refined and updated as new BigFix releases are available.
While the paper in general is considered suitable for all BigFix Version 9.x releases, it is best oriented towards IBM BigFix Version 9.2.6 onwards, including IBM BigFix 9.5. In addition, a number of references are provided in the References section. These papers are highly recommended for readers who want detailed knowledge of BigFix server configuration, architecture, and capacity planning.

Note: Some artifacts are distributed with this paper (see "View > Navigation Panels > Attachments" in the document viewer). The distributions are in zip format. However, Adobe protects against files with a "zip" suffix. As a result, the file suffix is set to "zap" per distribution. To use these artifacts, simply rename the distribution to "zip" and process as usual.

2 IBM BigFix 9.x Overview

An overview of IBM BigFix Version 9.x will be provided from the following perspectives:

1. Functional
2. Architectural
3. Return on Investment

2.1 Functional Overview

The IBM BigFix portfolio provides a comprehensive security solution encompassing a number of operational areas. These areas include the following.

- Lifecycle management (asset discovery and inventory, software distribution, patch management, operating system deployment, remote control).
- Security and compliance (security configuration management, vulnerability management, patch management, anti-virus and anti-malware client management, network self-quarantine).
- Patch management.
- Power management.
- Software use analysis.
- Core protection.
- Server automation.

Additional information on the functional management may be obtained from a variety of IBM resources (for example, this announcement letter for IBM BigFix 9.2: URL). Note that IBM BigFix was previously known as IBM Endpoint Manager, but was rebranded in 2015 with a very positive response from the field.
In general, IBM BigFix spans the broadest OS and device set in the industry, including the management of physical and virtual servers, PCs, Macs, tablets, smartphones, embedded and hardened devices, and point of sale devices. This is managed via a scalable distributed infrastructure that includes a lightweight dedicated agent. We will describe this infrastructure in the architectural overview section.

2.2 Architectural Overview

The following diagram provides a basic view of the BigFix architecture.

Figure 2: BigFix Architecture

The notable components of this diagram follow.

- Root Server. The base BigFix Enterprise Server. It is comprised of a number of core services as identified.
- Console. A management console (user interface) for BigFix operators. The console is a Windows only application. A console server is used to support one or more instances of the BigFix Console.
- The WebUI. A Node.js instance with an associated database intended to support the Web based user interface.
- The Web Reports Server. The Web Reports Server can provide a variety of stock and custom reports for one or more BigFix server installations.
- Relays. A distributed, hierarchical infrastructure to manage the deployment of BigFix agents across diverse network topologies.
- Agents (as part of the client population). A lightweight, native agent that manages the endpoint.
- Fixlet Servers (represented via the Internet content). These servers are used as the object repository for all client content (Fixlets, tasks, baselines, and analyses). In addition, dashboards, wizards, and WebUI applications are delivered via the Fixlet servers. Fixlets are utilized by the agent to determine relevance, compliance, and remediation activities.
- Disaster Server Architecture (aka DSA, not shown). The DSA is a server replication architecture intended to provide fault tolerance.
- The Database Management Server, or DBMS (either Microsoft SQL Server or IBM DB2 for Linux, UNIX, and Windows, also referred to as DB2 LUW).

The diagram below shows the anti-collocation options for these elements (meaning, the ability to deploy on nodes distinct from the BigFix root server). The pros and cons of anti-collocation are described later in this document.

Figure 3: BigFix Server Elements

Recommendations will be provided in the BigFix Capacity Planning section for optimal performance management of these components.

2.3 Return on Investment

Return on Investment (ROI) is a key concern for any deployed solution. In the security space, the notion of "return" can involve many dimensions, given the potentially catastrophic impact security exploits may have on an enterprise. To facilitate the understanding of ROI for IBM BigFix, a business value analyzer is available (URL).

Figure 4: Business Value Analyst for IBM BigFix and MobileFirst

The analyzer is based on the establishment of a comprehensive profile comprising the following elements.

- The company profile, including legacy systems and current incident and problem resolution rates.
- Hardware and software investment profiles, including endpoint audits and device decomposition.
- Patch, application, and power management profiles.
- The security management profile.

Based on these responses, the benefits, investments, and overall Return on Investment are provided through a number of multi-year views.

3 BigFix Performance Management

We will provide an overview of BigFix performance management including reference benchmarks, key performance indicators, and monitoring tools.

3.1 Reference Benchmarks

There are a number of reference benchmarks managed for the BigFix solution. These benchmarks ensure the offering is "field ready" and able to manage future scalability requirements.
The set of benchmarks includes, but is not limited to, the following.

- The concurrent user performance benchmark. A set of workloads to simulate user activity and responsiveness. The responsiveness is managed for both the console and Web User Interface offerings.
- The client metering & evaluation loop benchmark. A set of workloads to manage client performance, including device impact and request latency.
- FillDB benchmark. A set of workloads to manage and optimize the BigFix FillDB operation.
- Mailbox benchmark. A set of workloads to manage and evaluate the BigFix endpoint mailbox functionality.
- Relay plugins benchmark. A set of workloads to evaluate the relay plugin, including resource utilization and request latency.
- REST API benchmark. A set of workloads to evaluate the RESTful interface of BigFix.
- Web Reports benchmark. A set of workloads to evaluate the Web Reports interface of BigFix.
- The Unified Benchmark. The comprehensive benchmark for BigFix.

We will discuss the Unified Benchmark in more detail. The most basic performance methodology for any benchmark is to establish a baseline, and then iterate on the baseline as you drive improvements through code, infrastructure, and tuning. Once the solution offering is delivered, the baseline is then moved to the new "improved" state.

3.1.1 The Unified Benchmark

The Unified Benchmark combines the component level benchmarks into a single, unified benchmark that provides comprehensive simulation and prediction of the behavior of BigFix in large customer environments. The benchmark is not only defined by the diversity and scale of the workload, but also by time. Sample characteristics of the Unified Benchmark include the following.

- Duration.
The benchmark is persistent, meaning it is continuously running (as a customer workload would be continuously running) in order to manage the long term stability characteristics of BigFix (e.g. performance stability, system resource stability, etc.).
- Data population. Database population is used to simulate large scale customer installations. Sample population parameters include, but are not limited to, the following.
  - Number of managed devices = 250,000.
  - Number of managed Fixlets = 100,000 (20% custom).
  - Sites = 20 (25% custom sites).
  - Software packages = 750.
  - Patches = 28,000.
  - Users = 100 (10% master operators).
  - Roles = 50.
  - Computer groups = 50.
- Client simulation. Real and simulated clients may be used to represent large installations comprising hundreds of thousands of endpoints.
- Workload saturation. Workload levels should not be constant. Workload oscillation, meaning workload peaks and valleys are in evidence, is expected in customer environments. It can be useful to drive beyond solution saturation levels for brief periods to demonstrate product stability and preservation of service under high load.
- User transaction rate control. The frequency at which simulated users drive actions within the workload is managed via loop control functions. Closed loop simulation approaches are used, where a new user enters the system only when a previous user completes. Through the closed loop system, steady state operations under load may be driven. Sustained user transaction rates are characterized by a warm up phase, where users are introduced to the system in a controlled manner until the desired concurrency is reached. For example, in our tests, we expect to introduce a new user every five seconds as part of the warm up phase.
- Think times. Think times are the "pause" between user operations, meant to simulate the behavior of a human user.
- Bandwidth throttling.
In order to simulate low speed or high latency lines, bandwidth throttling is employed for some customer workloads. A sample throttle for a moderate speed ADSL connection (cable/DSL simulation) is a setting of 1.5 Mbps download and 384 Kbps upload.

The following diagram shows a sample unified benchmark environment.

Figure 5: BigFix Performance Benchmark Environment Sample

3.2 Key Performance Indicators

Key Performance Indicators (KPI) are the quantitative values managed in the reference benchmarks that are used to determine solution success. Sample KPIs may include the following.

- Request characteristics, including response times and associated latency.
- System utilization, including the standard CPU, IO, network, and memory views.
- Request concurrency characteristics.

The final category, request concurrency, has many interesting dimensions. Two areas we will focus on are Little's Law, and how to evaluate the number of concurrent users for a solution.

3.2.1 Little's Law

The field of queuing theory is mathematically rich and often complex. However, Little's Law offers a simple and intuitive view of queuing theory. Little's Law may be summarized by the following figure.

L = λW

Where:
L = the number of concurrent requests in the system.
λ = the request arrival rate.
W = the average time a request spends in the system.

Figure 6: Little's Law

This elegant equation makes it clear that if you want to improve concurrency you may:

- Increase the request arrival rate.
- Reduce the request service time.

Increasing the arrival rate will eventually hit a solution limit (whether software or infrastructure). At that point, the focus is typically on optimizing the software and/or infrastructure to reduce the average request time.

3.2.2 Evaluating the Number of Concurrent Users

The number of concurrent users supported by a solution is typically a function of request response times under load.
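Little's Law above gives a quick way to sanity check concurrency figures. The sketch below applies L = λW with hypothetical numbers (the rates shown are illustrative, not BigFix measurements):

```python
def littles_law_concurrency(arrival_rate_per_sec: float, avg_time_in_system_sec: float) -> float:
    """L = lambda * W: average number of concurrent requests at steady state."""
    return arrival_rate_per_sec * avg_time_in_system_sec

# Hypothetical example: requests arrive at 10 per second, and each
# request spends 2 seconds in the system on average.
print(littles_law_concurrency(10.0, 2.0))  # 20.0 concurrent requests

# Halving the average time in the system at the same arrival rate
# halves the number of requests held in the system.
print(littles_law_concurrency(10.0, 1.0))  # 10.0 concurrent requests
```

Equivalently, reducing W at a fixed concurrency level L permits a higher arrival rate λ = L/W, which is why tuning effort shifts to service time once the arrival rate reaches a solution limit.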
For example, if a solution can manage 50 concurrent users with an average response time of "X" seconds, it comes down to whether "X" is acceptable for the user (where the user may be a person, a program, etc.). For the concurrent users of a user interface (e.g. the BigFix console or WebUI), it is important to understand what is meant by a concurrent user. Consider:

- P = the total population for an instance of BigFix (including administrators, end users, etc.).
- C = the concurrent user population for an instance of BigFix.

Concurrent users are considered to be the set of users within the overall population P that are actively managing the environment at a point in time (e.g. administrator operations in the user interface, endpoint operations, etc.). In general, P is a much larger value than C (i.e. P >> C). For example, it is not unrealistic that a total population of 200 users may have a concurrent user population of 20 users (i.e. 10%).

3.3 Monitoring Tools

Monitoring tools may include system monitoring approaches as well as associated infrastructure benchmarks. The following table describes a number of approaches that are used for BigFix.

Tool: BigFixPerf
Description: A BigFix custom data collection tool that wraps the nmon and perfmon utilities. It is a Perl based utility and requires an instance of Perl.
Documentation: See detail section below.
Recommended invocation: BigFixPerf –monitor –interval <interval> –iterations <iterations>

Tool: BigFixDisk
Description: A BigFix custom data collection tool that logs disk utilization. It is a Perl based utility and requires an instance of Perl.
Documentation: See detail section below.
Recommended invocation: BigFixDisk –monitor –interval <interval> –iterations <iterations>

Tool: nmon
Description: nmon is a comprehensive system monitoring tool for the UNIX platform. It is highly useful for understanding system behavior.
Documentation: URL
Sample invocation: nmon -T -s <interval> -c <count> -F <output file>

Tool: perfmon
Description: perfmon is a comprehensive monitoring tool for the Windows platform.
  Documentation: URL

db2support
  Database support collection tool.
  Documentation: URL
  Recommended invocation: db2support -d <dbname> -c -f -s -l

DBMS Snapshots
  DBMS snapshot monitoring can offer insight into SQL workload, and in particular expensive SQL statements.
  Documentation: URL

WAIT
  Java WAIT monitoring can provide a non-invasive view of JVM performance through accumulated Java cores and analytic tools.
  Documentation and recommended invocation: URL

esxtop
  VMware performance collection tool.
  Documentation: URL
  Recommended invocation: esxtop -b -a -d 60 -n <iterations> > <output file>

iometer
  I/O subsystem measurement and characterization tool for single and clustered systems. Additional information is also provided below.
  Documentation: URL
  Recommended invocation: dynamo /m <manager>

iperf
  TCP and UDP measurement and characterization tool that reports bandwidth, delay, jitter, and datagram loss.
  Documentation: URL
  Recommended server invocation: iperf -s
  Recommended client invocation #1: iperf -c <server>
  Recommended client invocation #2: iperf -c <server> -R

UnixBench
  UNIX measurement and characterization tool, with reference benchmarks and evaluation scores.
  Documentation: URL
  Recommended invocation: ./Run

Perfkit Benchmarker
  A Google suite offering comprehensive benchmark capability.
  Documentation: URL

Figure 7: Monitoring Tools

3.3.1 Iometer

Out of all system resources, IO is typically the most difficult to manage. High performance IO subsystems are relatively expensive, and prone to failure if redundancy is not managed. In addition, many solutions that perform well in a physical environment may plummet in a virtual environment (to, say, 100 IOPS). We will describe a sample Iometer workload relevant for BigFix. We will then show sample results and targets.

Figure 8: Iometer User Interface Sample

Sample Iometer Workload

The following table demonstrates a recommended Iometer workload.
The workload is based on IO subsystem monitoring of a large scale BigFix deployment under load.

Workload                          Block Size   Read %   Random %
Workload 1: Stock                 4KB          25%      0%
Workload 2: Open Action Profile   8KB          10%      20%
Workload 3: REST API Profile      8KB          90%      20%

Figure 9: Iometer Workload Sample

Sample Results & Recommendations

The following table shows sample results, across a number of high and low performing IO subsystems. The IO subsystem types include:

- SSD.
- A Storage Area Network (SAN) Raid 5 configuration configured for file system access (LUN A).
- A Storage Area Network (SAN) Raid 5 configuration configured for database system access (LUN B).

The first graph below shows the latency of each IO subsystem, for each reference workload. The supported IOPS are shown as part of the "X axis" per column. A second graph is also provided below, showing the average disk IO queue length for the same workloads. Essentially, the system and disk queue impact of the various storage subsystems translates directly to the Iometer results.

Figure 10: Iometer Results Sample

Figure 11: Disk Queue Length by IO Subsystem Type

In summary, the general recommendation for a healthy IO subsystem is the ability to manage in the range of 5,000 IOPS to 10,000 IOPS with a latency of 1ms or less. In Iometer, as you increase the number of workers, you may see the IOPS increase while latency will also typically increase. For example, for a 16 worker workload, a latency of 2ms may be observed, and still be relatively healthy (though, clearly, lower is better). In general, the number of workers used in Iometer for concurrent workloads should not exceed the number of cores available on the system.

3.3.2 BigFixPerf

The BigFixPerf utility is a custom monitoring tool intended to standardize data collection on Windows and UNIX.
The utility is distributed as an attachment to this paper (see the introduction for details on obtaining the attachments).

Syntax:

BigFixPerf --monitor <name> [--interval <seconds>] [--iterations <count>] [--norun] [--short] [--sql] [--verbose]

Options:

--monitor <name>
  The monitor results object. On Windows, this is the name of the defined monitoring element, and may be viewed under the Windows perfmon utility. On UNIX, this is the name of the monitoring output file.
--interval <seconds>
  The monitor sample interval. The default value is 5 seconds.
--iterations <count>
  The number of samples to collect. The default value is 720 sample iterations. With a 5 second sample interval, this provides a one hour monitoring capture.
--norun
  Do not run the commands, merely echo them to stdout.
--short
  Only collect the minimal set of counters.
--sql
  Include Microsoft SQL Server counters.
--verbose
  Enables verbose mode for the BigFixPerf utility itself. This may be useful for debugging purposes.

Figure 12: BigFixPerf Syntax

Windows Example

The following command will create a performance counter named "BigFixPerf" and will collect samples every 60 seconds, for a 24 hour interval.

BigFixPerf --monitor BigFixPerf --interval 60 --iterations 1440

Figure 13: BigFixPerf Windows Example

The monitor results are typically written to the $systemdir directory, and may be viewed in the Windows perfmon utility. The following image shows the aggregate counter view. The view may be tailored so only selected counters are visible.

Figure 14: BigFixPerf Windows Example Output

UNIX Example

The following command will create a performance monitor output file named "BigFixPerf.nmon" and will collect samples every 60 seconds, for a 24 hour interval.

BigFixPerf --monitor BigFixPerf.nmon --interval 60 --iterations 1440

Figure 15: BigFixPerf UNIX Example

The monitor results are written to the designated output file.
The results must then be formatted via the formatting utility provided with nmon. The following image shows the sample formatted output. Note it is a multi-tabbed spreadsheet that offers an excellent view of resource utilization.

Figure 16: BigFixPerf UNIX Example Output

3.3.3 BigFixDisk

The BigFixDisk utility is a custom disk space monitoring tool intended to standardize data collection on Windows and UNIX. The utility is distributed as an attachment to this paper (see the introduction for details on obtaining the attachments).

Syntax:

BigFixDisk --monitor <file> [--interval <seconds>] [--iterations <count>] [--disks <label>] [--norun] [--verbose]

Options:

--monitor <file>
  The monitor results file, generated relative to the current working directory.
--interval <seconds>
  The monitor sample interval. The default value is 5 seconds.
--iterations <count>
  The number of samples to collect. The default value is 720 sample iterations. With a 5 second sample interval, this provides a one hour monitoring capture.
--disks <label>
  The utility has an associated configuration file. This option references the label in the configuration file for the disk volumes to be monitored. Samples for Windows and UNIX are provided in the configuration file.
--verbose
  Enables verbose mode for the BigFixDisk utility itself. This may be useful for debugging purposes.

Figure 17: BigFixDisk Syntax

Windows Example

The following command will create a monitor results file named "BigFixDisk.csv" and will collect samples every 60 seconds, for a 24 hour interval.

BigFixDisk --monitor BigFixDisk.csv --interval 60 --iterations 1440 --disks Windows

Figure 18: BigFixDisk Windows Example

4 BigFix Capacity Planning

We will provide capacity planning recommendations through the following approaches.

- Static planning via a spreadsheet approach.
- Capacity planning for the BigFix management server(s) (aka the "managed from" infrastructure). This includes the BigFix console.
- Capacity planning for the managed endpoints, including the BigFix relay infrastructure (aka the "managed to" infrastructure).
- Capacity planning considerations for BigFix server upgrades.

The capacity planning recommendations will be given in terms of the number of CPUs. Given there is a broad range of capability here, we will provide a general definition of a "CPU". First though, some practical advice on how to use this guide will be given.

4.1 How to Use This Guide: Capacity Planning Guidelines

For pretty much any performance or capacity planning question, the standard answer is "it depends". It depends on specific hardware, workload, time of day versus workload, collocation aspects (at many levels), configuration options, etc. In practical terms, there is no way to account for all of the permutations. As a result, the following guidelines apply to the provided capacity planning recommendations.

- Capacity planning recommendations are general purpose for a "typical" workload. We model the "typical" workload in our performance labs and provide a description of the benchmarking approaches used in this guide.
- Capacity planning recommendations are provided in terms of the number of managed endpoints. This is a simplification to make the recommendations consumable, but there are many more dimensions than this that may apply for a specific installation.
- In the event you are within a range of capacity planning recommendations (e.g. somewhere between 50,000 and 100,000 managed endpoints for the BigFix root server), you may start at the low end of the range and grow, assuming your workload and system behavior is well understood. This applies to the CPU, memory, and storage allocations. The IO subsystem and network requirements are universal.
- Monitoring over time is always recommended. This guide includes specific recommendations for how to monitor at the operating system, application, and database levels.
- Capacity planning is seldom static. Systems grow over time.
Entropy increases. Maintenance operations are typically required, especially at the database level, to manage this and ensure performance stability. Approaches for this are described in this guide.

4.2 CPUs, vCPUs, and Cores

Today's multiprocessor architectures are often defined in terms of cores, which may be physical cores or virtual CPUs (aka vCPUs). The following rules of thumb apply for our capacity planning outlook.

- In terms of pure clock speed, per the IBM SoftLayer definition, we will generally consider a CPU core to be a relatively current generation 2.0+ GHz/core implementation. See the References section for more information on SoftLayer server specifications.
- A discrete core and a vCPU can be considered equivalent. However, more management is typically required for a vCPU due to the hypervisor scheduler and contention for vCPUs under the virtual configuration and over-commitment model. As a result, hypervisor monitoring is required to ensure a vCPU delivers in the range of 90+% of the capability of a physical core.
- A hyper-threaded core does not have the throughput capability of a "pure" core, and can be considered to have on the order of 30% of the capability of a pure core.

Our sizing approach is based on pure, physical cores at 2.0+ GHz/core. For virtual or hyper-threaded cores, the above efficiency ratios should be part of the sizing methodology. An example will be provided at the end of this section.

4.3 Capacity Planning Spreadsheet

In order to arrive at a desired hardware and software configuration for an IBM BigFix implementation, a wide range of parameters must be understood. The following questions are usually relevant.

1. What operations are expected to be performed with BigFix?
2. What are the average and peak concurrent user workloads?
3. What is the expected scale of the "managed to" infrastructure?
4.
What is the expected workload for the "managed to" infrastructure (e.g. patch or software deployment scenarios)?
5. What is the enterprise network topology?
6. Is the deployment pure physical, or are some aspects virtualized?
7. What are the solution availability requirements?

A capacity planning spreadsheet is attached to this paper ("BigFix Capacity Planning Profile v9.x.xlsx"). The spreadsheet may be used to provide a profile for further sizing activities (e.g. a capacity planning activity in association with the document authors).

4.4 BigFix Management Server Capacity Planning Recommendations

The following table describes the BigFix management server's capacity planning recommendations as a function of the scale of the "managed to" estate.

- The recommendations are for a collocated server configuration. Finer grained, non-collocated recommendations will be provided in a future update of this paper.
- Requirements for the BigFix management server include the base operating system requirements. All other capacity planning numbers are in addition to the base operating system requirements.
- Suitable storage is recommended to accommodate the growing database size and associated management overhead (e.g. a working set of database backups). The storage requirements reflect the requirements for BigFix 9.5, where the Unicode base has driven greater storage requirements.
- The BigFix application directory also contains a download cache which defaults to 1GB. We recommend increasing this cache size to somewhere between 100GB and 1TB (or higher) depending on needs. The storage requirements below should be increased to account for the desired cache size.
- Network requirements are for a 1 Gbps network or better for the management server infrastructure.
- Additional requirements based on BigFix function enablement are also provided.
Deployment Size   CPU      Memory (GB)   Storage (GB)
< 1,000           4 (1)    16            100
10,000            4        16            250
50,000            6        24            300
100,000           12       48            500
150,000           16       72            750
200,000           20       128           1000
250,000           24       128           1250

Figure 19: BigFix Management Server Capacity Planning Requirements

(1) While more granular sizing is possible here, to ensure base operating system health and given the commodity level of a system with four processor cores, this is considered a reasonable, minimal base.

For the storage requirements, in general it is advantageous to have distinct storage subsystems/controllers for specific components. For example, a recommended best practice is discrete storage subsystems for each of the base, the database logs, and the database containers. A description of each of these categories follows.

- Base. The base including the operating system, BigFix binaries, BigFix logs, and BigFix content. Note for the operating system, the BigFix Knowledge Center provides minimum recommendations (URL).
- Database Logs. The database logs, used for data recovery. By default, BigFix uses simple logging for MS SQL, and circular logging for DB2. In the event that more extensive logging approaches are used, there will be additional impact. In addition, if options such as BigFix Compliance or BigFix Inventory are enabled, there will be additional impact.
- Database Containers. The MS SQL or DB2 database containers. These include the actual tables and indexes, and potentially the archive of a set of database backups.

Each of these categories has distinct storage requirements and very specific access patterns. For example, the database logs are typically characterized by the need for very fast sequential IO, while the containers can have much more diverse access patterns, but are highly insulated from storage impact by virtue of the database buffer pools.
In terms of the storage breakdown for these categories, the following table provides a good rule of thumb.

Base (%)   Database Logs (GB)   Database Containers (%)
30%        2GB                  70% - 2GB

Figure 20: Storage Requirements Breakdown

We will next describe additional capacity planning requirements based on enablement of specific BigFix functions: Message Level Encryption and the WebUI.

4.4.1 Message Level Encryption Enablement

Message Level Encryption (MLE) provides data encryption support for clients, and is particularly valuable for insecure networks or when secure communication in general is required. It is worth noting MLE does not affect actions taken from the console or Fixlets that are already protected by digital signatures. More information on MLE is provided in the References section. The following table describes the capacity planning requirements for Message Level Encryption.

Function Enabled           Additional CPU   Additional Memory (GB)   Additional Storage (GB)
Message Level Encryption   +2               +4                       n/a

Figure 21: BigFix Message Level Encryption Capacity Planning Requirements

4.4.2 WebUI Enablement

The BigFix WebUI offers a new scalable and highly responsive management interface for BigFix. Once enabled, the BigFix WebUI will drive additional system utilization as a function of the number of concurrent administrators (as described earlier, this is the population of administrators active at any one time, and the number of concurrent administrators is typically << the total number of administrators). We will describe the capacity planning requirements for the current WebUI offering (aka WebUI v1).

- WebUI v1. This instance originated on a BigFix 9.2 base. It features an Extract-Transform-Load (ETL) process that builds a standalone SQLite database for the WebUI. The SQLite database will be established under the BigFix WebUI directory (versus the database instance directory) and has associated storage impact.
The size of this database is typically on the order of 30% of the size of the reference BigFix database. Additional details for the WebUI v1 implementation are provided below.

WebUI Component    Additional CPU               Additional Memory (GB)       Additional Storage (GB)
BigFix WebUI ETL   +1                           +4                           30% of BigFix database
BigFix WebUI       +3 per 10 concurrent users   +2 per 10 concurrent users   n/a

Figure 22: BigFix WebUI v1 Capacity Planning Requirements

In terms of the scalability characteristics of the WebUI, the offering is targeted at the activities of non-master operators (versus the broader administrative role of master operators). A realistic upper bound for the WebUI is management by 30 concurrent users on Windows and Linux, over an estate of 60k devices. An alternative view would be management by 10 concurrent users on Windows and Linux, over an estate of 120k devices. The rationale for these different concurrency limits is that the number of endpoints determines the scale and complexity of the WebUI data processing requirements, which ultimately impacts the concurrent user resource requirements and performance.

It is worth noting the concurrent users would typically be non-master operators, managing a subset of the estate. For example, some non-master operators may only be managing a handful of devices, while others may be managing on the order of 20k devices. It is possible to manage at a larger scale based on user operations, infrastructure capability, etc. However, the stated bounds should be considered a good "rule of thumb" for the scale of the solution.

4.4.3 Considerations for a Local or Remote WebUI

The WebUI may be run either local (i.e. collocated with the root server) or remote (i.e. anti-collocated with the root server). In terms of the capacity planning recommendations, the current WebUI offering is relatively self-contained. For example, for a local WebUI, simply add the WebUI specific capacity planning requirements to the root server.
For a remote WebUI implementation, the WebUI specific capacity planning requirements should be added to the base operating system requirements.

Of more interest is the question "am I better off running the WebUI locally or remotely?" As usual, the answer is "it depends". For example, in virtual environments a remote WebUI offers the potential for your VMs to individually have fewer vCPUs, and they may schedule better. For a physical environment, database communication paths tend to be optimal when the server is collocated, and advantages like shared memory may be used for communication. In terms of reliability, collocation is generally preferred as it implies less dependence on the network stack.

4.5 BigFix Console Capacity Planning Recommendations

The following table shows capacity planning requirements for a workstation based console. The expectation is that data center level network speeds are available. The site cache is expected to be on the order of 20MB per external site.

Deployment Size   CPU   Memory (GB)   Storage (GB)
< 10,000          1     2             0.25 + site cache
< 100,000         2     4             2 + site cache
> 100,000         2     6             4 + site cache
> 200,000         2     8             8 + site cache

Figure 23: Console Workstation Capacity Planning Requirements

The following table shows capacity planning requirements for a terminal or Citrix server based implementation. The expectation is that data center level network speeds are available for the server, and each server may be managing on the order of 10-20 concurrent users (remote users, meaning they may not reside in the data center). In the event a greater number of concurrent users are in effect, the general rule of thumb is to add on the order of 1 CPU and 2-6 GB of RAM for every additional concurrent user (RAM is dependent on the deployment size). As always, requirements are workload dependent, so monitoring of the system under load is always recommended.
As a result, ranges are given where appropriate with the expectation that monitoring may be used to fine tune in the customer environment.

Deployment Size   CPU       Memory (GB)   Storage (GB)
< 10,000          8 - 16    8 - 16        20
< 100,000         10 - 20   16 - 48       80
> 100,000         10 - 20   32 - 80       80 - 240

Figure 24: Terminal Server Capacity Planning Requirements

The question often arises how many console operators may be supported by BigFix. A primary selling feature of BigFix is the ability of a small number of operators to manage a large estate. However, in the event that fine grained management is required, a base of 300 operators may be managed with careful attention to the console infrastructure. Proceeding beyond this value would require understanding of the infrastructure and associated workload impact.

4.6 BigFix Relay and Associated Infrastructure Capacity Management Considerations

The following diagram provides a deployment scenario for the BigFix relay and agent infrastructure. (2)

Figure 25: BigFix Relay Infrastructure

(2) Diagram courtesy of the IBM BigFix Knowledge Center: URL.

In terms of strict capacity planning, the relay requirements are fairly straightforward. The relays are deployed in a hierarchy with top level relays serving other relays, and the leaf node relays (often referred to as site level relays) serving the endpoints. The following ratios apply for relay deployment.

- Relays managing relays: 1:250 (meaning each relay node can manage on the order of 250 other relays).
- Relays managing BigFix agent children: 1:1000 (meaning each relay node can manage on the order of 1000 child agents or endpoints).

For the relays serving other relays (also known as top level relays), the following capacity planning recommendations apply. Top level relays are generally recommended once you approach 40,000 endpoints or over 200 relays (whichever comes first).
A top level relay should manage no more than 40,000 endpoints or 250 relays (whichever comes first). The relays managing the endpoints exhibit low utilization, and may possibly be collocated with server nodes already distributed in the enterprise.

Deployment Size (Endpoints Served)   CPU   Memory (GB)
10,000                               1     2
20,000                               2     2
40,000                               2     4

Storage (GB), all sizes: OS + 3GB (BigFix binaries + logs) + 2GB (default cache) + extended cache (if additional cache space is configured) + 300 MB/site (site cache for relay).

Figure 26: Top Level Relay Capacity Planning Requirements

For the relay storage requirements, the operating system storage may vary widely. For example, Windows 10 64 bit requires 20GB. Something like the Linux Tiny Core, where relays have been successfully demonstrated, requires on the order of megabytes. The net is we have high performance Windows 10 relays running with 25GB, and they have been proven to virtualize extremely well.

A more difficult decision to reach is the actual placement of the relays, and in particular the hierarchy of nodes. Decision points include network bandwidth, network latency, firewalls, server infrastructure, etc. Network topology maps typically exist for most enterprises. However, these maps rarely contain accurate metrics for network performance between nodes. Even when metrics are provided, they are often out of date or represent ideal conditions. In addition, network shaping often applies, meaning network characteristics may be dynamic based upon load.

In order to facilitate understanding of the network and placement considerations, basic network ping tests may be performed. Sample ping commands follow.

- Basic ping: ping -c 10 <host>
- Flood ping: ping -f -c 100000 -s 1500 -I 4 <host>

Using these commands, a map may be built showing latency and packet loss. Additional tests to demonstrate the number of hops via trace route (e.g. tracert) or equivalent are recommended. Secure copy (e.g. scp) tests for sample payloads are also helpful.
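As an illustration of turning such measurements into a latency and loss map, the summary lines produced by the ping runs above can be parsed programmatically. This is a minimal sketch; the function and regular expressions are illustrative and assume Linux-style ping summary output:

```python
import re

def parse_ping_summary(output):
    """Extract (packet loss %, average RTT ms) from Linux ping summary output."""
    loss = re.search(r"([\d.]+)% packet loss", output)
    # The rtt summary line has the form: rtt min/avg/max/mdev = a/b/c/d ms
    rtt = re.search(r"= [\d.]+/([\d.]+)/", output)
    return (float(loss.group(1)) if loss else None,
            float(rtt.group(1)) if rtt else None)

sample = ("10 packets transmitted, 10 received, 0% packet loss, time 9012ms\n"
          "rtt min/avg/max/mdev = 0.421/0.507/0.741/0.092 ms")
print(parse_ping_summary(sample))  # (0.0, 0.507)
```

Running such a parser across each candidate relay link yields the latency and packet loss map described above, which can then drive hierarchy placement decisions.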
In essence, once the network characteristics are defined, placement decisions typically become very straightforward.

4.6.1 Relay Virtualization

The BigFix relays virtualize extremely well. Their small system requirements and lightweight workloads are ideal for virtual machine deployments. In addition, it is possible to collocate a number of virtual relay instances in order to achieve virtual clusters for system throughput and redundancy. The following diagram shows the performance of an increasing number of relay requests, across a number of physical and virtual Linux instances. Performance across the instances is extremely consistent, providing excellent support for virtualizing relays. It should be pointed out that at the 50,000 request level, results are more inconsistent as we are at the infrastructure threshold for both virtual and physical deployments. For tuning virtual instances, the Virtualization section later in this paper should be followed.

Figure 27: BigFix Virtualization Performance

4.7 Upgrade Capacity Planning Considerations

As stated previously, the BigFix capacity planning profile stated herein generally reflects the requirements for IBM BigFix 9.5. In terms of upgrade capacity planning, there have been some significant events for the 9.x stream. We will discuss some of the notable upgrades in the 9.x service stream.

4.7.1 IBM BigFix 9.2.5 Upgrade

The 9.2.5 patch level introduced the BigFix WebUI. This upgrade includes schema level improvements that will result in growth of the database container utilization on the order of the size of the BigFix Fixlet content. The reason for this is the WebUI creates an ETL stream to manage the WebUI content in a SQLite database, and additional storage is required to manage this. For example, based on the Fixlet content, growth in the range of [50%,170%] has been evidenced.
As a result, it is recommended that ample container space is available for any upgrade starting below patch level 9.2.5 and moving to level 9.2.5 and beyond.

4.7.2 IBM BigFix 9.5 Upgrade

The 9.5.0 patch level introduced Unicode support. Before upgrading to BigFix V9.5 you need to ensure that you have enough free space. The free space projection involves a number of factors: the row counts of specific tables, the density of the data pages storing the tables and indexes, and specific database options for log management and table compression.

The basic rule of thumb for the 9.5 upgrade is to ensure you can accommodate 100% growth in the size of the actual database (twice the size of your current database). The additional space is required to manage the growth of tables for UTF-8 transcoding, including the allocation of temporary space for the movement of data within the BigFix databases. For Windows deployments, the database size may be determined through the Microsoft SQL Server Management Studio. For Linux deployments, the get_dbsize_info procedure may be used.

4.7.3 IBM BigFix 9.5.3 Upgrade

The 9.5.3 stream has two changes of note:

1. Support for a remote WebUI.
2. Introduction of a "global" SQLite implementation for the BigFix agent.

For the remote WebUI, the primary impact is on the WebUI Extract-Transform-Load (ETL) process, because this process manages data synchronization between the BigFix root server database and a WebUI database that now reside on different nodes. As with most remote database implementations, the ETL process incurs extra overhead for the data extraction over the network. This overhead is typically within the interval [20%,40%] of additional ETL duration. However, as the ETL process is a back end process and hidden from user view, the WebUI user experience is essentially unchanged.

For the BigFix agent SQLite implementation, the primary impact is improved agent concurrency at the expense of increased memory utilization.
The agent is designed to have very low system utilization in terms of CPU, memory, and IO. To be more specific, typical agent memory utilization can be on the order of 40 MB, though this will vary with the workload and the operating system. The operating system is highly relevant, as modern operating systems tend to be very good at utilizing memory (if it is available, why not use it to improve system performance). As a result, there are many ways to interpret memory utilization (e.g. overall, working set size, etc.). The net is, for the agent, the central SQLite implementation typically uses an additional 20% of working set memory. It is important to realize this memory comes along with reduced IO. This is expected, as modern databases prefer to use fast, cheap memory to offset relatively slow, expensive IO.

4.8 Capacity Planning Example

We will do a sample capacity planning exercise based on the following customer requirements.

- BigFix management server preferred OS: Linux
- Number of total users: 20
- Number of concurrent users: 10
- Number of managed endpoints: 20,000
- Network topology: Hub topology with 1 central data center and four hubs, with 1Gbps data center speed, and 100Mbps WAN speed.
- Number of endpoints per data center: 4,000
- Functions enabled: BigFix Console, BigFix WebUI

BigFix Component              Number of Servers                     CPU (per Server)   RAM (GB) (per Server)
Base server (includes DBMS)   1                                     6                  24
WebUI v1 ETL                  Added to base server                  +1                 +4
WebUI v1 Server               Added to base server                  +3                 +2
Terminal Server               1                                     8                  16
Failover Relay                1                                     2                  2
Leaf (Site Level) Relays      20 (collocated on existing servers)   n/a                n/a
Total                         3 (not counting collocated relays)    20                 48

Figure 28: Sample Capacity Planning Profile

5 Performance Management

Capacity planning and performance management go hand in hand. There is no standard workload for BigFix. Every enterprise has different requirements, infrastructure, and customization.
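Returning briefly to the capacity planning example above, the management server row in Figure 28 can be derived mechanically from the deployment size tiers given earlier in Figure 19. The following is an illustrative sketch, with the tier values transcribed from that table; the function name is hypothetical:

```python
# (deployment size ceiling, CPU, memory GB, storage GB), transcribed from Figure 19.
MGMT_SERVER_TIERS = [
    (1_000, 4, 16, 100),
    (10_000, 4, 16, 250),
    (50_000, 6, 24, 300),
    (100_000, 12, 48, 500),
    (150_000, 16, 72, 750),
    (200_000, 20, 128, 1000),
    (250_000, 24, 128, 1250),
]

def management_server_size(endpoints):
    """Return (cpu, memory_gb, storage_gb) for the smallest tier covering the estate."""
    for ceiling, cpu, mem, storage in MGMT_SERVER_TIERS:
        if endpoints <= ceiling:
            return cpu, mem, storage
    raise ValueError("estate exceeds documented sizing tiers")

# The 20,000 endpoint example falls into the 50,000 tier: 6 CPUs, 24 GB RAM.
print(management_server_size(20_000))  # (6, 24, 300)
```

This matches the base server row in the sample profile (6 CPUs, 24 GB RAM), with the WebUI ETL and WebUI server increments then added on top per Figure 22.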
This section will build upon the base capacity planning recommendations, and offer a set of defined decision points for building an optimal BigFix deployment based on the enterprise needs. The following diagram provides a form of decision tree for a BigFix deployment, and supports:

- Understanding the capability of the cloud infrastructure (and potentially poorly configured or underperforming components of the infrastructure).
- Understanding the base capability of the BigFix implementation and associated customization.
- Understanding the long term performance stability of the system.

We will describe basic system monitoring approaches, infrastructure benchmarks, and cloud benchmarks. In general, the "green path" in the diagram is recommended. The deviations for the "yellow path" may be followed, and can certainly result in a very successful implementation, but tradeoffs typically exist. For example:

- Virtualization adds additional levels that must be managed for performance. For example, hypervisor management and IO tuning are critical in virtual deployments. In physical deployments, this management cost is typically significantly reduced (IO) or eliminated (the hypervisor).
- Windows adds additional Total Cost of Ownership (TCO) concerns for licensing. For example, the DB2 entitlement is provided as part of the BigFix product itself, so no additional licensing is needed. This can simplify deployments, especially Proof of Concept situations that often desire minimal licensing and infrastructure requirements.
- A remote database typically adds overhead (e.g. request latency) versus a local database.

With these tradeoffs in mind, we will present a number of performance management considerations for BigFix. We will break the discussion down into infrastructure management approaches and BigFix management approaches. We will also discuss benchmark management.
Figure 29: BigFix Capacity Planning & Performance Management

5.1 Infrastructure Management Approaches

We will discuss infrastructure management approaches in terms of virtualization, operating system management (including IO subsystems and network management), and database system management.

5.1.1 Virtualization

In today’s modern enterprise, virtualization is seen as a powerful way to address the management of cost and scale. In general terms, performance management of physical servers tends to be simpler: resources are isolated, there is no hypervisor involved, and the operating system view of performance is a direct indicator of system and application performance. In order to simplify performance management and keep latency characteristics to a minimum, the first recommendation is always to deploy on physical hardware. However, a virtual deployment may still be desired (whether due to enterprise standards, strong team skills in virtual system performance, etc.). In order to manage BigFix in a virtual environment, precautions must be taken to ensure performance. We will describe some of the key management aspects, and then reinforce the fact that monitoring and understanding are critical in a virtual world.

“Right Sizing” the CPU Allocation

When deploying a physical server, additional CPU resources are generally seen as harmless, or perhaps even beneficial. In a virtual environment, an oversized VM may actually degrade performance. The reason is that in a shared deployment model, larger VMs require greater scheduling and orchestration effort, which may lead to scheduling delays or wait time. As a result, “right sizing” is critical. The capacity recommendations in this paper are the starting point, with monitoring being essential. Some classic elements for monitoring follow.

• CPU ready.
This is the percentage of time the VM is ready to run but is waiting due to scheduler constraints.

• CPU wait. The amount of time the CPU spends in the wait state.

A set of VMware performance troubleshooting guides is provided in the References section.

VMware Snapshot Management

Snapshots are a powerful tool in virtual environments, and many teams new to virtualization start to leverage snapshots as a backup approach. In the context of VMware, it cannot be emphasized enough that snapshots should not be used as a backup policy! In order to understand why, it is strongly recommended to read the VMware literature in the References section of this paper. Essentially, snapshots result in the chaining of images, with degradation incurred as a result of managing the chains. It is not unusual to see degradation on the order of hundreds of percent. To further compound the issue, such performance problems are often difficult to diagnose as they are obscured by the hypervisor.

VMware Latency Management

With VMware vSphere 5.5, it is possible to set the latency sensitivity of a virtual machine. This serves to reduce the impact of virtualization with improved application performance, at the expense of “dedicated” resources. Further information is available through the VMware content in the References section.

Virtual IO Management

Of all system resources, IO is typically the most difficult to manage. High performance IO subsystems are relatively expensive, and prone to failure if redundancy is not managed. In addition, many solutions that perform well in a physical environment (say in the range of 5,000 to 10,000 IOPS) may plummet in a virtual environment (to, say, 100 IOPS). As a result, in any virtual environment it is critical to benchmark and monitor the IO subsystem. More information on how to achieve this is provided in the benchmarking section below.
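A formal Iometer or fio benchmark is the right way to characterize a virtual IO subsystem, but a quick probe can catch gross storage misconfiguration early. The following sketch (the scratch path is an assumption; point it at the volume under test) measures a flushed sequential write with dd. It is only a smoke test, not a substitute for a real benchmark.

```shell
#!/bin/sh
# Quick-and-dirty sequential write probe for a candidate BigFix volume.
# NOT a replacement for Iometer/fio -- it only flags grossly slow storage.

probe_write() {
    # $1 = scratch file on the volume under test (an assumed path).
    # Writes 64 MB, forces it to disk (conv=fdatasync), prints dd's
    # timing summary line, then removes the scratch file.
    dd if=/dev/zero of="$1" bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
    rm -f "$1"
}

# Example: probe the volume backing /tmp.
probe_write /tmp/bigfix_io_probe.bin
```

On healthy local storage the reported rate is typically in the hundreds of MB/s; a rate in the low tens of MB/s on a virtual volume would justify a deeper investigation with the benchmarks described in this paper.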
In addition, we will next describe specific guidelines for IO management for Linux virtual deployments.

The Linux IO Scheduler

Each Linux instance has an IO scheduler. The intent of the IO scheduler is to optimize IO performance, potentially by clustering or sequencing requests to reduce the physical impact of IO. In a virtual world, however, the operating system is typically disassociated from the physical world through the hypervisor. As a result, it is recommended to alter the IO scheduler algorithm so that it is more efficient in a virtual deployment, with scheduling delegated to the hypervisor. The default scheduling algorithm is typically “cfq” (completely fair queuing)3. Alternative and recommended algorithms are “noop” and “deadline”. The “noop” algorithm, as expected, does as little as possible, with a first in, first out queue. The “deadline” algorithm is more advanced, with priority queues and age as a scheduling consideration. System specific benchmarks should be used to determine which algorithm is superior for a given workload. The general recommendation is to use the “deadline” scheduler. The following console output shows how to display and modify the IO scheduler algorithm for a set of block devices. In the example, the “noop” scheduler algorithm is set. Note that to ensure the scheduler configuration persists, it should be enforced via the operating system configuration (e.g. /etc/rc.local).

Figure 30: Modifying the Linux IO Scheduler

3 With Red Hat Enterprise Linux 7, the default scheduler has been set to “deadline”.

The following graphs show the throughput and latency results, based on an Iometer benchmark across a variety of storage subsystems. In terms of throughput (where higher is better) and latency (where lower is better), the deadline scheduler is preferable. Note that while the differences may appear small, the gaps increase significantly under load as concurrency and contention go up.
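The operations captured in Figure 30 follow the pattern below. This is a minimal sketch: the device name (sda) and default sysfs path are assumptions that vary by system, root privileges are required to write the sysfs node, and (as noted above) the change must be reapplied at boot, e.g. from /etc/rc.local, to persist.

```shell
#!/bin/sh
# Display and set the IO scheduler for a block device (cf. Figure 30).
# SCHED_FILE is parameterized; on a real system it is
# /sys/block/<device>/queue/scheduler ("sda" here is an assumption).
SCHED_FILE="${SCHED_FILE:-/sys/block/sda/queue/scheduler}"

show_scheduler() {
    # The active algorithm is shown in brackets, e.g. "noop deadline [cfq]".
    cat "$SCHED_FILE"
}

set_scheduler() {
    # Writing an algorithm name ("noop", "deadline", ...) selects it
    # for the current boot only.
    echo "$1" > "$SCHED_FILE"
}
```

Typical use would be show_scheduler to inspect the current algorithm, then set_scheduler deadline to apply the recommended setting.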
Figure 31: Linux IO Scheduler Throughput

Figure 32: Linux IO Scheduler Latency

5.1.2 Operating System Management

Both Linux and Windows are evolved and refined operating systems that can be used as the base for high performance systems. As described earlier in the context of Total Cost of Ownership, when it comes to performance management we have a preference for Linux based deployments. In terms of operating system tuning, very little is typically required. However, specific guidelines for CPU, memory, IO, and network management apply. CPU and memory are very straightforward, with capacity planning and monitoring approaches provided in this paper, along with recommendations for virtual deployments. IO and network management are more complex. Fortunately, there are some excellent BigFix Knowledge Base articles addressing this. Rather than duplicate that content here, the following URLs are recommended.

• BigFix server disk performance: URL
• BigFix network management and bandwidth throttling: URL

We will describe two special areas of operating system management: Linux “swappiness” and the Linux “ulimit”.

Linux Swappiness

The Linux swappiness kernel parameter (vm.swappiness) is a value in the interval [0,100] that defines the relative weight of swapping out runtime memory versus dropping pages from the system page cache. The default value is typically 60. The recommendations for setting this value are as follows.

• For a dedicated database management server, the swappiness should be set to zero (0).
• For a database management server collocated with the BigFix application server, the swappiness should be set to ten (10).

Further details on managing DB2 performance are provided in the References section.

Linux ulimit Management

The Linux operating system defines a ulimit for the maximum number of open files allowed for a process (i.e.
the nofiles option shown when you run the command “ulimit -a”). For the DB2 instance, the value for this kernel limit should be either “unlimited” or “65536”.

5.1.3 Database System Management

The Database Management System (DBMS) is critical to the success of all BigFix deployments. We will recommend specific tuning and configuration options.

Supported Database Versions

IBM BigFix has a well-documented matrix of system requirements, including operating system versions, hypervisor versions, browser versions, and database versions. In terms of database versions, the following offerings are supported.

• Microsoft SQL Server 2008, 2012, 2014.
• DB2 Workgroup Edition 10.1, 10.5.

The versions supported for a specific BigFix release are documented in the system requirements matrix (URL). The general recommendation is to use the most current database release supported for your BigFix version, as database performance, resilience, and function tend to only improve with each new release.

It should be noted that Microsoft SQL Server Express is also included in the list of reference databases. Both DB2 and Microsoft SQL Server offer “express” versions. These are license free, limited utility offerings typically intended for low demand or proof of concept situations. In the case of DB2 Express, there is no compelling reason to use it: it is simply a constrained version of DB2, and the license for the full version is provided for BigFix Linux deployments. In the case of Microsoft SQL Server Express, the support matrix clearly indicates it may be used for evaluation purposes, with the customer providing the full Microsoft SQL Server license. What does this mean in the context of a BigFix deployment? Essentially, Microsoft SQL Server Express may be used for a BigFix deployment with the following constraints.

• The user must be aware of the Microsoft SQL Server Express constraints. The constraints for a specific version are documented in the Microsoft Knowledge Base (e.g. URL).
In general, the DBMS is constrained to a single CPU socket and up to four cores, utilizing up to 1 GB of RAM and 10 GB of database storage. Once the Microsoft SQL Server Express limits are reached, the configuration is no longer supported by IBM BigFix.

• In terms of the scale limits for IBM BigFix with Microsoft SQL Server Express, scale on the order of 100 devices with one or two operators is expected. Even at this level of “proof of concept” scale, system monitoring is critical to ensure system health. It may be possible to exceed 100 devices with careful monitoring, but 100 is considered a good rule of thumb. In addition, some BigFix functions such as the IBM BigFix WebUI should be enabled with care. Further detail on managing monitoring and the WebUI is provided in the following points.

• The user must perform adequate system monitoring to ensure the database system limits are not impacting the health of BigFix. For example, once the 10 GB storage limit is reached, the database will no longer be viable, and when the CPU and memory limits are reached, system response times and throughput will degrade. As a result, system monitoring is critical. To monitor the storage limits, the Windows file explorer may be used. To monitor the CPU and memory limits, the Windows performance monitor or task manager may be used for the SQL Server process. For advanced users, the Microsoft SQL Server Management Studio may be used for monitoring.

• In the event it is desired to enable the BigFix WebUI functionality, it should be done with caution, as the WebUI will initiate additional database workload that will increase system resource requirements. Before enabling the BigFix WebUI, the base workload should be running with some headroom with respect to the Microsoft SQL Server Express limitations (CPU, memory, and storage).
If resource utilization issues are indicated, either before or after WebUI enablement, it is recommended to upgrade to the licensed version of Microsoft SQL Server.

• In the event the defined limits are reached, Microsoft does support an in-place upgrade approach. As a result, a maintenance outage may be taken, the DBMS licensed, and service resumed.

Local versus Remote Database

A local database is generally recommended for performance, as a remote database adds additional request latency. While a dedicated DBMS server is usually ideal in terms of resource segregation and allocation (e.g. CPU, memory), communication with the database typically has higher overhead. For example, the Open Database Connectivity (ODBC) driver used by BigFix may exploit shared memory when the BigFix server and the DBMS server are collocated; for a remote database, the network transport chain must be invoked. As a result, even well configured data centers that have low latency 10 Gbps networks may demonstrate a slowdown with a remote database versus a local database.

Database and Schema Overview

The following table describes the BigFix schema characteristics.

Database  Schema  Number of Tables  Comments
BESREPOR  DBO     26                BigFix reporting database.
BFENT     DBO     114               BigFix enterprise database.

Figure 33: BigFix Schema Characteristics

DB2 Configuration Recommendations

For DB2 based deployments, there are a number of database configuration parameters that are required. These parameters are typically set at installation time, but their settings should be verified because, depending on the history of the database server, they may need to be (re)applied. These are all database level configuration parameters and are typically set by a statement of the form: “db2 update db cfg for BFENT using <parameter> <value>”.
Database Configuration Parameter(s)  Comments
STMT_CONC = LITERALS                 Enables the statement concentrator to improve package cache utilization for dynamic statements. This will improve query performance.
LOCKTIMEOUT = 120                    Enables a lock timeout to prevent long running queries from dominating the system and degrading performance.
AUTO_MAINT = ON                      Enables automatic maintenance operations for the management of
AUTO_TBL_MAINT = ON                  table statistics and reorganization. Statistics are critical so the
AUTO_RUNSTATS = ON                   database optimizer may make effective query plan choices.
AUTO_STMT_STATS = ON
AUTO_REORG = ON
CUR_COMMIT = ON                      Enables improved concurrency for Cursor Stability (CS) scans.

Figure 34: DB2 Configuration Recommendations

5.2 BigFix Management Approaches

BigFix is typically configured to self-manage and to adapt to an extremely broad range of enterprise deployments. However, a broad array of tuning knobs is available for specific, custom situations. We will focus on the most impactful knobs that are of interest in any deployment. Please see the References section for a more comprehensive listing.

5.2.1 FillDB Options

FillDB is perhaps the most critical BigFix daemon, and there has been a continuous emphasis on driving performance improvements for FillDB. We will describe these improvements across a number of releases.

BigFix 9.2 and FillDB

For BigFix 9.2, a number of database boost level options were introduced to optimize the performance of FillDB on Windows systems. The boost level may be in the integer interval [0,3]. In order to set the boost level on Windows and Linux, the following approaches may be used.

• Windows: Add the DatabaseBoostLevel DWord value to the registry key HKLM\Software\Wow6432Node\BigFix\Enterprise Server\FillDB.
• Linux: Add the following lines to the /var/opt/BESServer/besserver.config file:
  [Software\BigFix\Enterprise Server\FillDB]
  DatabaseBoostLevel = <boost level>

For BigFix 9.2, the default database boost level is zero (0).
While this is sufficient for Linux, the recommended database boost level for Windows based deployments is three (3). The following graph shows a sample scenario for database boost level impact on Windows (higher numbers are better).

Figure 35: FillDB Database Boost Levels

BigFix 9.5 and FillDB

Note that with the IBM BigFix 9.5 release, the database boost level has been simplified. For Linux, the boost level is always enabled and no customization is required. For Windows, the option may be set to “1” (on) or “0” (off), with a default value of “0”. It is recommended the option be set to “1” and the impact observed. In most deployments, improvements will be evident. However, results should be monitored, as workload specifics will be relevant in terms of which setting drives the highest database throughput and optimal database locking characteristics.

BigFix 9.5.5 and FillDB

In the BigFix 9.5.5 release, a parallel FillDB implementation was delivered. The following parameters are managed, with the first set being for the base FillDB capability, and the second set being for the BigFix Query result processing capability that is also managed by FillDB.

• ParallelismEnabled
  NumberOfParsingThreads
  NumberOfDBUpdatingThreads
• ParallelismEnabledForQuery
  NumberOfParsingThreadsForQuery
  NumberOfDBUpdatingThreadsForQuery

The parallelism is enabled by default by the BigFix installer based on the host hardware capability (number of cores). For example, there is a base report concurrency of [3 readers, 3 writers] if the machine has 6+ cores. If the machine has 10+ cores, the BigFix Query report concurrency is also set to [3, 3]. While concurrency sounds like a good thing, what does all of this really mean? Essentially, when the base parallelism is enabled, FillDB throughput doubles with a nominal increase of 1 to 1.5 cores.
Similarly, for BigFix Query, the parallelism can essentially cut the processing time for large result sets in half. Both of these are significant improvements, with no modification required by the user. It is possible, in an environment with significant available system capability, that more parallelism may be enabled by the user with additional throughput improvements. However, caution should be used. The default parallelism provides an excellent combination of high throughput with low resource impact. While it is possible to increase parallelism, this can also put more pressure on the IO subsystem, with the possibility of degrading performance. The following figure provides an example of this: modest improvements are evident as we proceed beyond the default of “3/3”, but once we go beyond “10/10”, degradation appears.

Figure 36: FillDB Parallelism Example

5.2.2 Agent Heartbeat Frequency

The agent sends a heartbeat to the server (essentially, reporting in) every 15 minutes. As a BigFix deployment scales, the heartbeat activity can become significant. For example, if 250,000 agents are reporting every 15 minutes, that is nearly 278 heartbeats per second! In order to mitigate this, the general recommendation is to add 15 minutes to the heartbeat interval for every 10,000 agents. For example, for a 250,000 agent deployment, this would mean a heartbeat interval on the order of 6 hours. At this setting, there would be on the order of 12 heartbeats a second, which is much more manageable. The heartbeat may be set in the BigFix console preference section.

5.2.3 Console Refresh Frequency

The BigFix console refresh frequency determines the interval for populating (refreshing) the console cache. The default value is 15 seconds, meaning every 15 seconds there is a database workload (impacting CPU, disk, network, etc.) in order to maintain the cache. Setting the refresh frequency is a trade-off between data currency and system workload.
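The heartbeat arithmetic in 5.2.2 is easy to check. A small sketch follows (the figures are taken from the 250,000-agent example above; the helper function itself is purely illustrative):

```shell
#!/bin/sh
# Heartbeats per second = agents / (heartbeat interval in seconds).
heartbeat_rate() {
    # $1 = number of agents, $2 = heartbeat interval in minutes
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", n / (m * 60) }'
}

heartbeat_rate 250000 15    # default 15-minute interval: ~277.8/second
heartbeat_rate 250000 360   # ~6-hour interval: ~11.6/second
```

The same helper makes it easy to evaluate candidate intervals (e.g. from the 15-minutes-per-10,000-agents rule) before changing the console preference.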
The setting is ultimately derived from the number of console operators, the size of the estate, and the desired user experience. For a large estate with many concurrent users, a refresh frequency of 300 seconds is not out of the question. The console refresh frequency may be set in the BigFix console preference section.

5.2.4 Data Archiving

A variety of data archiving approaches are available and should be performed at regular intervals.

• Deletion of closed and expired actions. This is a console activity that will remove the actions from the console view. While the action will still persist in the database, this cleanup approach will reduce the console workload.
• Computer removal. A computer removal utility (URL) is available to remove obsolete computers and thereby reduce the overhead for database operations.
• Audit cleanup. An audit cleanup utility (URL) is available to prune entries based on a variety of criteria, including time. The audit cleanup should be done in accordance with the enterprise audit policies. For example, a database archive may be generated to store the audit content for that point in time, with the subsequent cleanup serving to reduce the overhead for database operations. Note the IBM BigFix 9.5 release offers significant improvements for data archiving and management.
• Database cleanup tools. You may use the BESAdmin interface on Windows systems, or the BESAdmin command line on Windows and Linux systems, to remove data about computers, custom Fixlets, properties, analyses, and actions, and to update the PropertyIDMap table with changes.
• FillDB log rotation. The log rotation feature is active by default with LogFileSizeLimit set to 100 MB.

5.2.5 WebUI Management

We will break down our recommendations by WebUI version.

WebUI v1

The WebUI v1 functionality is based on a relatively transparent SQLite instance.
This instance is managed via a small set of file system objects (a database container, a log container, etc.). To populate the SQLite instance, an ETL process is run periodically to ensure data freshness. The following parameter defines the ETL interval.

• _WebUI_ETL_DelaySeconds
  This parameter defines the ETL run interval, with the count starting at the completion of the last ETL run. The default value is 600 seconds, which equates to 10 minutes. Adjusting this value is a tradeoff between WebUI data freshness and database resource impact.

As is typical of most databases, statistics are collected by the database manager and used in the establishment of query plans. In the case of the WebUI SQLite instance, there are two parameters that impact statistics collection.

• _WebUIAppEnv_ETL_STATISTICS_THRESHOLD
  This parameter defines the row threshold for statistics collection. Once this threshold is passed, statistics will be refreshed based on the associated time interval (see the next setting).
• _WebUIAppEnv_ETL_STATISTICS_THRESHOLD_TIME
  This parameter defines the time interval for refreshing statistics. By default it is set to 3 a.m. (i.e. “03:00”). An array of values may be provided (e.g. “03:00,11:00,16:00”).

In the event that query slowdowns are experienced during the daily usage of the WebUI (as evidenced by user interface operations taking more time to load), more frequent statistics collection may be specified via the above parameters. In addition to statistics management, concurrency in the WebUI (Node.js) runtime is also important. The following configuration setting should be set to the expected number of peak concurrent users.

• _WebUIAppEnv_UV_THREADPOOL_SIZE
  This parameter determines the thread pool size utilized by the Node.js runtime for the WebUI.

5.3 Benchmark Management Approaches

A common theme in this paper is that for any sufficiently complex deployment, benchmark management is required.
Recommendations for a benchmark management approach, based on the building blocks described in this paper, follow.

• Establishment of a micro benchmark baseline. Before an application is deployed, especially in a virtual environment, a micro benchmark suite should be run to demonstrate the capabilities of the environment. The minimum recommended benchmark is Iometer.
• Establishment of a reference workload baseline. The core premise of all performance work is the establishment of a baseline. This is the known starting state of the installation, typically derived at low scale, where the system is considered relatively healthy. A reference workload should be defined, key performance indicators established, and then managed as the system grows and ages. In this manner, any degradation may be managed and addressed. Note that reference workloads do not have to be elaborate: simple deployment times or REST API response times can be sufficient. In the event workload automation is desired, a recommended tool for managing reference workload benchmarks is JMeter.
• Monitoring should be lightweight, detailed, and persistent. Every enterprise tends to have comprehensive management tools for hypervisors, operating system instances, database server instances, and applications. These monitoring tools should be visible to administrators so average and peak workloads may be understood and managed. In the event that it is necessary to do fine grained monitoring of a BigFix instance, the esxtop (for VMware instances) and BigFixPerf tools are a recommended starting point.

In the event it is desired to set up a benchmark and monitoring reference for a BigFix installation, the authors of this paper may be contacted for consultation purposes.

6 BigFix Maintenance Recommendations

A number of maintenance approaches have been prescribed in the performance management section.
In this section we will describe recommended maintenance approaches for the DB2 Database Management System. DB2 is used as the example because the licensed offering is available with BigFix; similar approaches are available for Microsoft SQL Server and may be covered in a future version of this paper. The following figure lists the maintenance approaches to be described, along with the comparable DB2 and Microsoft SQL Server utilities involved.

Maintenance Approach    DB2 Utility      Microsoft SQL Server Utility
Backup Management       backup, restore  backup, restore
Statistics Management   runstats         create or update statistics
Reorganization          reorg            alter index with the reorganize or rebuild option
Maintenance Automation  Based upon operating system and database schedulers.

Figure 37: Database Maintenance Approaches

The steps make reference to recommended scheduling frequencies. The general purpose “cron” scheduling utility may be used to achieve this, though other scheduling utilities may also be used. The key aspect of a cron’ed activity is that it is scheduled at regular intervals (e.g. nightly, weekly) and typically does not require operator intervention. Designated maintenance windows may be used for these activities.

6.1 Database Backup Management

It is recommended that nightly database backups be taken. The following figures offer a sample offline database backup (utilizing compression), along with a sample restore. The angle-bracketed values are placeholders.

backup db <dbname> user <user> using <password> to <backup path> compress

Figure 38: Database Backup with Compression Command

restore db <dbname> from <backup path> taken at <timestamp> without prompting

Figure 39: Database Offline Backup Restore

6.1.1 Online Backup Support

Online backups are typically desired for full application availability. Prior to the release of IBM BigFix 9.5.3, not all database data types were logged (to be specific, Large Object types, or LOBs, were not logged). The logging of all necessary data types is required to ensure the integrity of online backups, given that they depend on the log content.
However, with the IBM BigFix 9.5.3 release, all necessary data types are logged for new installations. As a result, these installations may safely enable online backups. On the other hand, legacy installations, even if they upgrade to IBM BigFix 9.5.3, will typically not log all data types, and as a result online backups are not recommended for them. For these legacy installations, Appendix A contains a set of prescribed steps to support the enablement of online backups. These steps include how to determine if specific data types are not logged, as well as a database migration procedure to enable logging for specific columns. Once the migration is complete, online backups may also be enabled for legacy installations.

Performing Online Backups

In the event online backups are enabled, the following figure provides commands that comprise a sample weekly schedule. With the given schedule, the best case scenario is a restore requiring one image (a Monday failure restored using the Sunday night backup). The worst case scenario would require four images (Sunday + Wednesday + Thursday + Friday). An alternate approach would be to utilize a full incremental backup each night, making the worst case scenario two images. The tradeoffs for the backup approaches are the time to take the backup, the amount of disk space consumed, and the restore dependencies. A best practice can be to start with nightly full online backups, and introduce incremental backups if time becomes an issue.
(Sun) backup db <dbname> online include logs use tsm
(Mon) backup db <dbname> online incremental delta use tsm
(Tue) backup db <dbname> online incremental delta use tsm
(Wed) backup db <dbname> online incremental use tsm
(Thu) backup db <dbname> online incremental delta use tsm
(Fri) backup db <dbname> online incremental delta use tsm
(Sat) backup db <dbname> online incremental use tsm

Figure 40: Database Online Backup Schedule

Note that to enable incremental backups, the database configuration must be updated to track page modifications, and a full backup taken in order to establish a baseline.

update db cfg for BFENT using TRACKMOD YES

Figure 41: Database Incremental Backup Enablement

To restore the online backups, either a manual or an automatic approach may be used. For the manual approach, you must start with the target image, then revert to the oldest relevant backup and move forward, finishing with the target image. A far simpler approach is to use the automatic option and let DB2 manage the images. A sample of each approach is provided below, showing the restore based on the Thursday backup.

restore db <dbname> incremental use tsm taken at <target image timestamp>
restore db <dbname> incremental use tsm taken at <oldest image timestamp>
restore db <dbname> incremental use tsm taken at <next image timestamp>

Figure 42: Database Online Backup Manual Restore

restore db <dbname> incremental auto use tsm taken at <target image timestamp>

Figure 43: Database Online Backup Automatic Restore

In order to support online backups, archive logging must be enabled. The next subsection provides information on archive logging, including the capability to restore to a specific point in time using a combination of database backups and archive logs.

Database Log Archiving

A basic approach we advocate is archive logging with the capability to support online backups. The online backups themselves may be full, incremental (based on the last full backup), or incremental delta (based on the most recent backup of any type).
In order to enable log archiving to a location on disk, the following command may be used.

update db cfg for <dbname> using logarchmeth1 DISK:/path/logarchive

Figure 44: Database Log Archiving to Disk

Alternatively, in order to enable log archiving to TSM, the following command may be used4.

update db cfg for <dbname> using logarchmeth1 TSM

Figure 45: Database Log Archiving to TSM

Note that a “logarchmeth2” configuration parameter also exists. If both of the log archive method parameters are set, each log file is archived twice (once per log archive method configuration setting). This results in two copies of archived log files in two distinct locations (a useful feature, depending on the resiliency and availability of each archive location). Once the online backups and log archive(s) are in effect, the recovery of the database may be performed via a database restore followed by a roll forward through the logs. Several restore options have been previously described. Once the restore has been completed, roll forward recovery must be performed. The following are sample roll forward operations.

rollforward db <dbname> to end of logs

Figure 46: Database Roll Forward Recovery: Sample A

rollforward db <dbname> to 2012-02-23-14.21.56 and stop

Figure 47: Database Roll Forward Recovery: Sample B

It is worth noting the second example recovers to a specific point in time. For a comprehensive description of the DB2 log archiving options, the DB2 information center should be consulted (URL). A service window (i.e. stopping the application) is typically required to enable log archiving.

4 The log archive methods (logarchmeth1, logarchmeth2) have the ability to associate configuration options with them (logarchopt1, logarchopt2) for further customization.

6.1.2 Database Backup Cleanup

Unless specifically pruned, database backups may accumulate and cause issues with disk utilization or, potentially, a stream of failed backups.
If unmonitored backups begin to fail, disaster recovery may become nearly impossible in the event of a hardware or disk failure. A simple manual method to prune backups older than seven days follows.

find /backup/DB2 -mtime +7 | xargs rm

Figure 48: Database Backup Cleanup Command

A superior approach is to let DB2 automatically prune the backup history and delete old backup images and log files. A sample configuration is provided below.

update db cfg for BFENT using AUTO_DEL_REC_OBJ ON
update db cfg for BFENT using NUM_DB_BACKUPS 21
update db cfg for BFENT using REC_HIS_RETENTN 180

Figure 49: Database Backup Automatic Cleanup Configuration

It is also generally recommended to keep the backup storage independent from the database itself. This provides a level of isolation in the event volume issues arise (e.g. it ensures that a backup operation will not fill the volume hosting the tablespace containers, which could lead to application failures).

6.2 Database Statistics Management

As discussed in the previous performance management section, database statistics ensure that the DBMS optimizer makes wise choices for database access plans. The DBMS is typically configured for automatic statistics management. However, it may often be wise to force statistics collection as part of a nightly or weekly database maintenance operation. A simple way to update statistics for all tables in a database is the "reorgchk" command.

reorgchk update statistics on table all

Figure 50: Database Statistics Collection Command

One issue with the reorgchk command is that it does not allow full control over the statistics capture options. For this reason, it may be beneficial to perform statistics updates at the individual table level. However, this can be a daunting task for a database with hundreds of tables. As a result, the following SQL statement may be used to generate administration commands on a table by table basis.

select 'runstats on table ' || STRIP(tabschema) || '.' || tabname ||
       ' with distribution and detailed indexes all;'
from SYSCAT.TABLES where tabschema in ('DBO');

Figure 51: Database Statistics Collection Table Iterator

6.3 Database Reorganization

Over time, the space associated with database tables and indexes may become fragmented. Reorganizing the tables and indexes may reclaim space and lead to more efficient space utilization and query performance. In order to achieve this, the table reorganization command may be used. Note, as discussed in the previous performance management section, automatic database reorganization may be enabled to reduce the need for manual maintenance.

The following commands are examples of running a "reorg" on a specific table and its associated indexes. Note that the "reorgchk" command demonstrated previously provides a per table indicator of which tables require a reorg. Using the result of "reorgchk", per table reorganization may be performed for optimal database space management and usage.

reorg table <tabname> allow no access
reorg indexes all for table <tabname> allow no access

Figure 52: Database Reorganization Commands

It is important to note there are many options and philosophies for performing database reorganization. Every enterprise must establish its own policies based on usage, space considerations, performance, and so on. The example above is an offline reorg. However, it is also possible to perform an online reorg via the "allow read access" or "allow write access" options. The "notruncate" option may also be specified (indicating the table will not be truncated in order to free space). The "notruncate" option permits more relaxed locking and greater concurrency, which may be desirable if the space usage is small or will soon be reclaimed. If full online access during a reorg is required, the "allow write access" and "notruncate" options are both recommended.

Note it is also possible to use our table iteration approach to generate massive reorgs across hundreds of tables, as shown in the following figure. The DB2 provided snapshot routines and views (e.g. SNAPDB, SNAP_GET_TAB_REORG) may be used to monitor the status of reorg operations.

select 'reorg table ' || STRIP(tabschema) || '.' || tabname || ' allow no access;'
from SYSCAT.TABLES where tabschema in ('DBO');

select 'reorg indexes all for table ' || STRIP(tabschema) || '.' || tabname || ' allow no access;'
from SYSCAT.TABLES where tabschema in ('DBO');

Figure 53: Database Reorganization Table Iterator

6.4 Database Maintenance Automation

For standard database maintenance, it is advisable to automate the scheduling and execution of the maintenance activities via the crontab. The following table shows a sample schedule for the maintenance operations for the relevant BigFix databases.

Database   Statistics   Reorgs     Archiving
BESREPOR   Saturday     Saturday   Saturday
BFENT      Sunday       Sunday     Saturday

Figure 54: Sample Database Maintenance Schedule

7 Security Considerations

This paper is primarily concerned with capacity planning and performance management for BigFix.
However, BigFix is primarily a security offering, so it seems fitting to provide a description of security management and hardening approaches for BigFix deployments.

7.1 Security Management

The following table provides a summary of BigFix security management. Specific security areas are expanded upon as appropriate.

Security Area: Web Application Security Scanning
Disposition: Scans mandated by IBM Corporate Security Standards. They offer an automated and repeatable security assessment.

Security Area: Application Source Code Scanning
Disposition: Scans mandated by IBM Corporate Security Standards. They offer an automated and repeatable security assessment.

Security Area: Threat Modeling
Disposition: Threat model assessment mandated by IBM Corporate Security Standards.

Security Area: Security Regulatory Compliance Reports
Disposition: Several compliance reports (e.g. PCI DSS) are available as part of the web application security scanning work.

Figure 55: BigFix Security Management Summary

7.1.1 Web Application Security Scanning

Web application security scanning is performed by the IBM Rational Appscan Standard Edition reference tool. Some of the capabilities of this tool include the following.

• Provides visibility into the security and regulatory compliance risks web applications present to your organization.
• Uses a combination of testing techniques to provide thorough, automated assessments.
• Scans websites for both embedded malware and links to malicious or undesirable websites.
• Helps ensure your website is not infecting visitors or directing them to unwanted or dangerous websites.
• Correlates results discovered using dynamic and static analysis techniques.
• Tests web services.
• Heightens scan severity ratings through the enablement of Collateral Damage and Target Distribution settings specifically for cloud offerings.
• Delivers more than 40 security compliance reports, including PCI Data Security Standard (PCI DSS), Payment Application Data Security Standard (PA-DSS), ISO 27001 and ISO 27002, HIPAA, GLBA, and Basel II.

Further information on the Rational Appscan offering is available in the References section.

7.1.2 Application Source Code Scanning

Application source code scanning is performed by the Rational Appscan Source reference tool. Some notable features of the Rational Appscan instance that apply to cloud deployments follow.

• Identifies security vulnerabilities and defects in the source code during the early stages of the application lifecycle, when they are the least expensive to remediate.
• Builds automated security into development by integrating security source code analysis with automated scanning during the build process.
• Scans, triages, and manages security policies; prioritizes assignment of results to security teams for vulnerability remediation.
• Delivers fast scans of more than one million lines of code per hour, allowing you to scan even the most complex enterprise applications.
• Uses string analysis to simplify the adoption of security testing by development teams.
• Supports testing of mobile applications, including Java, C#, and Objective-C.

Further information on the Rational Appscan Source offering is available in the References section.

7.1.3 Threat Modeling

Threat modeling assessments may encompass automated and manual approaches, including ethical hacking. Basic methods employed include the following.

• Enforcement of non-root runtime for audit and trust purposes.
• Enforcement of necessary permissions.
• Secure credentials management (e.g. passwords).
• Secure port analysis.
• Ethical hacking approaches.

7.1.4 Security Regulatory Compliance Reports

The web application security scanning tool offers a number of regulatory compliance reports. See the following figure for some sample report types.
Figure 56: Security Compliance Report Options

It is worth noting these are simply report options. For example, for the PCI DSS report, neither Rational Appscan nor IBM are approved scanning vendors. While the reports are considered to have value in terms of classifications and exposures, they are not considered to be at the certification level.

7.2 Security Hardening

Security hardening has multiple dimensions, particularly given the scope of BigFix. We will describe the following, very basic, hardening approaches.

• Port management and firewall configuration.
• Common vulnerabilities and exposures management.

7.2.1 Port Management and Firewall Configuration

The following tables provide a summary of BigFix port management. Active ports are listed by major component. Inter node communication (which can be used for firewall configuration) is listed where applicable. Intra node communication (i.e. communication on ports that are local to the host) between components is expected. For intra node communication, the local host is not listed as an incoming host. The following attributes are managed.

• Port. The specific port that is open.
• Protocol. The specific network protocol in effect, where applicable.
• Program instance. The program holding the port. This may be a specific executable or a general class designation (e.g. "Operating System").
• Operating system user id instance. The operating system user id the program is running under.

Some critical items of interest follow for the reference tables.

• The reference tables describe the BigFix runtime requirements. The install and upgrade requirements are not included.
• The reference tables describe a central database server topology.
• The ports described are for the BigFix content. Additional operating system services may be active.
• DNS and directory services are specific to an enterprise deployment and may require additional customization.
• Information is not provided for the BigFix Disaster Server Architecture (DSA) configuration at the time of this writing.
• The information presented was derived on Linux via a custom BigFix port management tool. A similar configuration is in effect for Windows. The BigFix port management tool is described after the reference table.

Port       Protocol  Program        User      Comments
50000      TCP       db2sysc        db2inst1  DB2 database instance. Local traffic only.
8080       TCP       BESWebReports  root      BigFix Web Reports service.
52311      TCP       BESRootServer  root      BigFix Root Server service.
52315      TCP       BESRootServer  root      BigFix Root Server API service.
80         TCP       node           root      WebUI Node.js service.
443        TCP       node           root      WebUI Node.js service (port 80 redirects to 443). Local traffic only.
5000-5009  TCP       node           root      WebUI Node.js application processes. Local traffic only.

Figure 57: BigFix Port Management

The BigFix Port Management Tool

A port management tool (BigFixPorts) is provided with this paper. We will describe the four management modes it provides, and then the tool configuration.

1. List mode: List the set of interesting ports currently being listened on.
2. Inbound connection mode: List the inbound connections to interesting ports.
3. Outbound connection mode: List the outbound connections to interesting ports.
4. Monitor mode: Continuously monitor the ports being listened on, and determine whether unrecognized ports are active. This is the most useful capability.

The monitor mode loops indefinitely and, for all of the active ports, lists any port it cannot identify as being on the active or ignore lists. Why is this so useful? By running the monitor mode, it can be established whether new, unexpected ports are being opened. These ports may then either be shut down or managed per enterprise firewall standards.

The 'BigFixPorts' tool is a Perl based utility.
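The heart of the monitor mode is a simple set difference between what is actually listening and what the configuration expects. A minimal sketch of that check follows (in Python for illustration; the shipped tool itself is Perl, and the function name unrecognized_ports is hypothetical).

```python
def unrecognized_ports(listening, ports_active, ports_ignore, programs_ignore=()):
    """Return (port, program) pairs that are neither expected nor ignored.

    listening: iterable of (port, program) pairs observed on the host,
    e.g. as parsed from netstat or ss output.
    """
    expected = set(ports_active) | set(ports_ignore)
    ignored_programs = set(programs_ignore)
    # Anything listening that is not an expected port and not an
    # ignored program is flagged for follow-up.
    return sorted(
        (port, prog)
        for port, prog in set(listening)
        if port not in expected and prog not in ignored_programs
    )

# Example: port 6379 (redis-server) is neither active nor ignored.
seen = [(52311, "BESRootServer"), (22, "sshd"), (6379, "redis-server")]
alerts = unrecognized_ports(seen, ports_active=[52311, 8080],
                            ports_ignore=[], programs_ignore=["sshd"])
print(alerts)  # -> [(6379, 'redis-server')]
```

Run in a loop, a check like this surfaces new listeners as they appear, which is exactly the monitor mode behavior described above.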
A standard Perl technique is to put configuration settings in a separate file, using Perl variables that may be sourced directly. The benefit of this approach is that advanced data structures may be supported, with all parsing provided by the Perl interpreter. The utility uses this approach. We will describe each of the variables in the provided configuration sample.

The first variable is the set of servers to be managed. This is a hash of the node alias (a symbolic value) and the fully qualified host name. This structure should be changed per BigFix installation, for the nodes the utility is to be run against. A sample follows.

%hosts = ('BF1' => 'blade13.romelab.it.ibm.com');

Figure 58: Port Utility Hosts Configuration

The next structure shows the set of active ports required for BigFix. These are the defined listening ports, broken down by host and organized by component. A sample is shown for the BigFix Enterprise Server.

%ports_active = (
    'BF1' => [
        # DB2
        50000,
        # BF
        8080, 52311, 52315,
        # Node
        80, 443, 5000, 5001, 5002, 5003, 5004, 5005, 5006, 5007, 5008, 5009
    ],
);

Figure 59: Port Utility Active Port Configuration

The next structures serve a common purpose: they indicate the ports, or the programs associated with ports, that may be ignored. The intent is to remove noise from the port monitoring view. This is particularly valuable in monitor mode.

%ports_ignore = ('BF1' => []);
@programs_ignore = ('cupsd', 'dnsmasq', 'master', 'repo_srv.', 'rpcbind', 'rpc.statd', 'sshd');

Figure 60: Port Utility Ports and Programs to Ignore

7.2.2 Common Vulnerabilities and Exposures Management

BigFix security management typically implies multi data center security management, a herculean task. The "Common Vulnerabilities and Exposures" (CVE) program offers a free dictionary of publicly known vulnerabilities (see the References section) that can assist in this task.
Given BigFix typically involves a "bring your own operating system" approach, it is extremely useful to be aware of these vulnerabilities and the associated alerts. Some prominent recent alerts, which should be addressed by any cloud deployment, follow.

1. Heartbleed: An OpenSSL vulnerability (URL).
2. POODLE: An OpenSSL vulnerability (URL).
3. Shellshock: A GNU Bash shell vulnerability (URL).

It should also be noted that the IBM Rational scan tools cited earlier are CVE compatible.

8 Summary Cookbook

The following tables provide a cookbook for the solution implementation. The cookbook approach implies a set of steps the reader may "check off" as completed to provide a stepwise implementation of the BigFix solution. The recommendations are provided in three basic steps:

1. Base installation recommendations.
2. Post installation recommendations.
3. High scale recommendations.

All recommendations are provided in tabular format. The preferred order of implementing the recommendations is from the first row of each table through to the last.

8.1 Base Installation Recommendations

The base installation recommendations are considered essential to a properly functioning BigFix instance. All steps should be implemented.

Identifier  Description                                                          Status
B1          Perform the base BigFix installation, ensuring the recommended
            capacity planning recommendations are followed. The general server
            recommendation is a physical node with a collocated DBMS.
B2          In the event a virtual environment is used, follow the
            virtualization recommendations.
B3          In the event of a Linux deployment, follow the operating system
            and DBMS configuration recommendations.

Figure 61: Base Installation Recommendations

8.2 Post Installation Recommendations

The post installation recommendations provide additional throughput and superior functionality. All steps should be implemented.
Identifier  Description                                                          Status
P1          Perform a set of infrastructure and BigFix benchmarks to determine
            the viability of the installation.
P2          Implement the database statistics maintenance activity.
P3          Implement the database reorg maintenance activity.
P4          Implement the database archiving maintenance activity.
P5          Implement a suitable backup and disaster recovery plan comprising
            regular backups of all critical server components (including the
            database and relevant file system objects).

Figure 62: Post Installation Recommendations

8.3 High Scale Recommendations

The high scale recommendations should be incorporated once the production installation must support the high water mark for scalability. All steps may be implemented optionally, over time, based upon workload.

Identifier  Description                                                          Status
S1          Apply the latest BigFix fixpack.
S2          Monitor the performance of the installation and adjust the
            management server to the recommended installation values as
            appropriate.
S3          Optimize DBMS performance. A basic way to achieve this is to have
            dedicated, high performance storage allocated to the database
            containers and logs.

Figure 63: High Scale Recommendations

APPENDIX A: DB2 ONLINE BACKUP ENABLEMENT

The following sections provide an overview of DB2 online backup enablement. Enablement consists of two steps.

1. Determining if database migration is required.
2. Performing the database migration for DB2 online backup enablement.

Determining if Database Migration is Required

The simplest way to determine if database migration is required is to look at some sample table definitions and inspect the logging for LOB columns. For example, the following command displays the table definition for the LONGQUESTIONRESULTS table and shows the LOB content is not logged. In this case, database migration is required based on the "NOT LOGGED" qualifier for the table's LOB content.
Figure 64: BigFix Database LOB Logging Check

Performing the Database Migration

To perform the database migration, the following steps are recommended. Note the steps are required for both the BFENT and BESREPOR databases. The sample provided below uses BFENT.

1. Stop the BigFix services.

2. To verify the BigFix services are indeed stopped, and not persisting connections, it can be useful to restart the database (i.e. db2stop followed by db2start). In the event the stop is not successful, verify the BigFix services are down and, if necessary, force the stop.

3. Take a full offline backup of the BigFix BFENT database.

db2 backup db BFENT to /home/db2inst1/LOBMigration compress

Figure 65: Sample Database Backup with Compression Command

4. Connect to the BFENT database.

db2 connect to BFENT

Figure 66: Sample Database Connect

5. Perform the database migration step. Note the BigFixLOBLogging.sql script is distributed as an attachment to this paper.

db2 -tvf BigFixLOBLogging.sql

Figure 67: Sample Migration

6. Upon successful completion of the migration, it is recommended to take another offline backup of the database. In the event errors are encountered, whether as part of the migration process or once the BigFix services are started, the backup captured in step 3 may be restored to reset the server state. A sample restore command follows.

restore db BFENT from /home/db2inst1/LOBMigration taken at 201607281234 without prompting

Figure 68: Sample Database Offline Backup Restore

7. Restart the BigFix services.

Outline of Database Migration Steps

The database migration script (BigFixLOBLogging.sql) performs the following steps.

1. Creates a set of three (essentially temporary) tables to keep track of the LOB migration content: DBO.LOB_TABLES, DBO.LOB_COLUMNS, DBO.LOB_COLUMN_DUMP.

2. Creates a set of three stored procedures to perform the migration steps: DBO.getLobColumns, DBO.generateRowIDValues, DBO.changeLOBLogging.

3. The changeLOBLogging procedure is the main processing loop for the migration. It derives the set of tables that require migration from the system catalogs. For each relevant table and column, the procedure will:

• Dump the column data into one of the temporary tables.
• Drop the column.
• Recreate the column, with suitable logging.
• Restore the column data.
• Perform any table reorganization the migration determines to be necessary.

REFERENCES

IBM BigFix Performance Blogs

IBM BigFix Query: Unleashing the Chief Security Officer @ Scale
https://www.ibm.com/developerworks/community/blogs/81c130c7-4408-4e01-adf5-658ae0ef5f0c/entry/IBM_BigFix_Query_Unleashing_the_Chief_Security_Officer_Scale?lang=en

Case Study on BigFix Performance
https://www.ibm.com/developerworks/community/blogs/81c130c7-4408-4e01-adf5-658ae0ef5f0c/entry/Case_study_on_BigFix_performance_by_IBM_Cloud_Development_WW_IT_Services?lang=en

IBM BigFix References

IBM BigFix Version 9.5 Knowledge Center

IBM BigFix Resource Center

IBM BigFix developerWorks Resource Center

IBM BigFix 9.5.0 System Requirements
https://www-01.ibm.com/support/docview.wss?uid=swg21976664

IBM BigFix Message Level Encryption

IBM BigFix Performance Configurations
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Endpoint%20Manager/page/Performance%20Configurations

IBM BigFix Server Disk Performance
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Endpoint%20Manager/page/Server%20Disk%20Performance

BigFix Network Management and Bandwidth Throttling
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20End
point%20Manager/page/Bandwidth%20Throttling

BigFix Client Usage Profiler
http://www-01.ibm.com/support/docview.wss?uid=swg21506248

BigFix Utilities
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Endpoint%20Manager/page/Utilities

IBM BigFix Return on Investment (ROI) References

Business Value Analyst for IBM BigFix
https://roianalyst.alinean.com/ibm/security/

IBM DB2 References

DB2 10.5 Knowledge Center

DB2: Best practices tuning and monitoring database system performance (DB2 white paper)

Virtualization References

IBM SoftLayer Cloud Server Specifications

Performance Best Practices for VMware vSphere 5.0
http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.0.pdf

Performance Best Practices for VMware vSphere 5.1
http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.1.pdf

Performance Best Practices for VMware vSphere 5.5
https://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.5.pdf

Best practices for virtual machine snapshots in the VMware environment
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1025279

VMware: Troubleshooting ESX/ESXi virtual machine performance issues (VMware Knowledge Base)

VMware: Troubleshooting virtual machine performance issues (VMware Knowledge Base)

VMware: Performance Blog
http://blogs.vmware.com/vsphere/performance

Linux on System x: Tuning KVM for Performance

Kernel Virtual Machine (KVM): Tuning KVM for performance
http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaat/liaattuning_pdf.pdf

PowerVM Virtualization Performance Advisor (developerWorks)

IBM PowerVM Best Practices
http://www.redbooks.ibm.com/redbooks/pdfs/sg248062.pdf

Benchmark References

Report on Cloud Computing to the OSG Steering Committee, SPEC Open Systems Group,
https://www.spec.org/osgcloud/docs/osgcloudwgreport20120410.pdf

Security Scan References

IBM Rational Security Appscan Enterprise Edition
http://www-03.ibm.com/software/products/us/en/appscan-enterprise

IBM Rational Security Appscan Source
http://www-03.ibm.com/software/products/us/en/appscan-source

Common Vulnerabilities and Exposures
https://cve.mitre.org/

© Copyright IBM Corporation 2015, 2016, 2017
IBM United States of America
Produced in the United States of America

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PAPER "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes may be made periodically to the information herein; these changes may be incorporated in subsequent versions of the paper. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this paper at any time without notice.

Any references in this document to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
4205 South Miami Boulevard
Research Triangle Park, NC 27709
U.S.A.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S.
registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at http://www.ibm.com/legal/copytrade.shtml.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Other company, product, or service names may be trademarks or service marks of others.