Transcript
IBM Software Information Management
Oracle Exadata and IBM PureData System for Analytics compared by Phil Francisco, Vice President, Product Management and Product Marketing, IBM and Mike Kearney, Senior Director, Product Marketing, IBM
Contents
Introduction
Online transaction processing (OLTP) and data warehousing
Performance
Simplicity of operation
Value
Conclusion
Introduction
To innovate requires us to think and do things differently, solving a problem using new approaches. Many organizations recognize they have a problem: which technology will best extract value from the growing volumes of data created daily in their operations? IBM® PureData™ System for Analytics, powered by Netezza® technology, is designed to query and analyze large volumes of data. Wishing to exploit data at lower costs of operation and ownership, many organizations have adopted this system and moved their data warehouses from Oracle. In response, Oracle hitched their database management system to a massively parallel processing storage system to launch Exadata. Oracle’s promise that Exadata brings extreme performance to processing both online transactions and analytic queries characterizes it as a general purpose platform for managing mixed workloads. But data warehousing and analytics make different demands of their software and hardware than online transaction processing (OLTP). Quite simply, workloads for data warehousing perform better and at lower cost on a system that is purpose-built for analytics. Exadata’s data warehousing credentials demand scrutiny, particularly with respect to performance, simplicity and value—benefits customers cite when moving from Oracle to PureData System for Analytics.
IBM PureData System for Analytics delivers excellent performance for customers’ warehouse queries. PureData System for Analytics offers customers simplicity; anyone with basic knowledge of SQL and Linux has the skills needed to perform the few administrative tasks required to maintain consistent service levels through dynamically changing workloads. PureData System for Analytics’ performance with simplicity reduces their costs of owning and running their data warehouses. More important, IBM customers create new business value by deploying analytic applications, which their previous database technology placed beyond their reach. This eBook opens by reviewing differences between processing online transactions and processing queries and analyses in a data warehouse. It then discusses Exadata and PureData System for Analytics from perspectives of their performance, simplicity of operation and value. As you read this document, put aside notions of how a database management system should work, be open to new ways of thinking and be prepared to do less, not more, to achieve a better result.
The IBM PureData System for Analytics team has no direct access to an Exadata machine. This team is fortunate in the detailed feedback we receive from many organizations that have evaluated both technologies and selected PureData System for Analytics. We have analyzed postings by Oracle and others in the industry. Information shared in this paper is made available in the spirit of openness. Any inaccuracies result from the mistakes of IBM researchers, not an intent to mislead.
Online transaction processing (OLTP) and data warehousing
OLTP systems execute many short transactions. Each transaction’s scope is small, limited to one or a small number of records. Their highly predictable data requirements enable many transactions to be completed using data copied from disk into fast access in-memory cache, improving the performance of the system. Although OLTP systems process large volumes of database queries, their focus is writing (UPDATE, INSERT and DELETE) to a current data set. These systems are typically specific to a business process or function; for example, managing the current balance of a checking account. Their data is commonly structured in third normal form (3NF). Transaction types of OLTP systems are stable and their data requirements are well understood, so secondary data structures such as indices can usefully locate records on disk, prior to their transfer to memory for processing.
In comparison, data warehouse systems are characterized by predominantly heavy database read (SELECT) operations against a current and historical data set. Whereas an OLTP operation accesses a small number of records, a data warehouse query might scan a table of billions of rows and join its records with those from multiple other tables. Furthermore, the unpredictability of which queries will be made against a data warehouse not only limits the value of caching, partitioning and indexing strategies, but can penalize data traversal by alternate paths.1 Choices for structuring data in the warehouse range from 3NF to dimensional models such as star and snowflake schemas. Data within each system feeding a typical warehouse are structured to reflect the needs of a specific business process. Before loading to the warehouse, data may be cleansed, de-duplicated and integrated.
This eBook divides data warehouses into either first or second generation. While this classification may not stand the deepest scrutiny, it reflects how many of IBM’s customers talk about their evolutionary path to generating ever greater value from their data. First generation data warehouses are typically loaded overnight. They provide information to their business via a stable body of slowly evolving SQL-based reports and dashboards. As these simple warehouses somewhat resemble OLTP systems—their workload and data requirements are understood and stable—organizations often adopt the same database management products they use for OLTP. With the product comes the practice: database administrators analyze each report’s data requirements and build indices to accelerate data retrieval. Creep of OLTP’s technology and techniques appears a success, until data volumes in the warehouse outstrip those commonly managed in transactional systems.
Corporations and public sector agencies accept growth rates for data of 30 to 50 percent per year as normal. Technologies and practices successful in the world of OLTP prove less and less applicable to data warehousing; the index as an aid to data retrieval is a case in point. As the database system processes jobs to load data, it is also busy updating its multiple indices. With large data volumes this becomes a very slow process, causing load jobs to overrun their allotted processing window. Despite working long hours, the technical team misses service levels negotiated with the business. Productivity suffers as business units wait for reports and data to become available.
Organizations are redefining how they need and want to exploit their data; this eBook refers to this development as the second-generation data warehouse. These new warehouses, managing massive data sets with ease, serve as the corporate memory. When interrogated, they recall events recorded years previously; these distant memories increase the accuracy of predictive analytic applications. Constant trickle feeds are replacing overnight batch loads, reducing latency between the recording of an event and its analysis. Beyond operational analytics and the simple SQL used to populate reports and dashboards, the warehouse processes linear regressions, Naive Bayes and other mathematical algorithms of advanced analytics.
Noticing a sudden spike in sales of a high-margin product at just five stores drives a retailer to understand what happened and why. This knowledge informs strategies to promote similar sales activity at all 150 store locations. The computing system underpinning the warehouse must be capable of managing these sudden surges in demand without disrupting performance of regular reports and dashboards. The business users demand the freedom to exploit their data at the time and in the manner of their choosing, and their appetite for immediacy leaves no place for technologies whose performance depends on the tuning work of administrators.
Performance
Continued growth in volumes of data creates challenges for organizations wishing to exploit these resources. Many first-generation data warehouses failed to deliver on early promises and frustrated the business’s thirst for information because their architects defaulted to the database management technology their organizations had chosen for online transaction processing. The technology was incapable of responding quickly enough to requests from business intelligence applications. A system unable to respond at business speed to data warehousing queries undermines an organization’s success in taking advantage of data.
Performance with Oracle Exadata
In building Exadata, Oracle focused on removing congestion in data movement between disk storage and the database management system; technologists refer to this as an I/O (input/output) bottleneck. Exadata achieves this by pairing a storage area network (SAN) subsystem with a database grid and connecting the two by a fast network. Unlike enterprise SANs, which support multiple servers and their applications, Exadata’s storage grid accepts requests only from, and serves data only to, its database grid running Oracle Database 11g R2 with Real Application Clusters (RAC), or Oracle’s Exalytics machine. At the time of publication, Oracle offered two models. A single rack X3-2 comprises a grid of 8 database servers, each with two 8-core Intel® Xeon® E5-2690 processors (128 cores in total dedicated to database processing) connected via InfiniBand QDR to a storage grid of 14 servers running 168 cores and 56 PCI flash cards with 22.4 terabytes of flash cache. A single rack X3-8 comprises a grid of two database servers, each with eight 10-core Intel® Xeon® E7-8870 processors (160 cores in total dedicated to database processing) connected via InfiniBand QDR to a storage grid of 14 servers running 168 cores and 56 PCI flash cards with 22.4 terabytes of flash cache. As Exadata’s performance is reviewed, keep in mind that while Oracle has updated the specifications of hardware across three generations of machines, their underlying architecture is largely unchanged.
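The core counts quoted for the two single-rack models can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `RackSpec` class and its field names are hypothetical, and the numbers are simply those stated in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackSpec:
    """Core counts for one single-rack model, as quoted in the text."""
    name: str
    db_servers: int
    sockets_per_server: int
    cores_per_socket: int
    storage_cores: int = 168  # 14 storage servers, per the text

    @property
    def db_cores(self) -> int:
        # Cores dedicated to database processing.
        return self.db_servers * self.sockets_per_server * self.cores_per_socket

    @property
    def total_cores(self) -> int:
        return self.db_cores + self.storage_cores

    @property
    def storage_share(self) -> float:
        # Fraction of all cores that sit in the storage grid.
        return self.storage_cores / self.total_cores


X3_2 = RackSpec("X3-2", db_servers=8, sockets_per_server=2, cores_per_socket=8)
X3_8 = RackSpec("X3-8", db_servers=2, sockets_per_server=8, cores_per_socket=10)

if __name__ == "__main__":
    for rack in (X3_2, X3_8):
        print(f"{rack.name}: {rack.db_cores} database cores, "
              f"{rack.total_cores} total, "
              f"{rack.storage_share:.0%} in the storage grid")
```

In both configurations the storage grid holds more than half of the machine’s cores, which is the proportion the performance discussion turns on.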
A video series posted on YouTube at www.youtube.com/watch?v=hMXsrxyeRro by Oracle’s Real World Performance Group provides insight into how Exadata processes what the presenter characterizes as a difficult query: “what were the popular items in the baskets of shoppers who visited stores in California in the first week of May and didn’t buy bananas.” Expressed in SQL, the query comprises the following processing steps:
Step 1 - A projection and restriction (SELECT FROM WHERE syntax) on four tables
Step 2 - A JOIN across four tables
Step 3 - A JOIN of two tables
Step 4 - An outer JOIN of the products of steps 2 and 3
Step 5 - Applying the RANK function, GROUP BY and ORDER BY clauses to the results of step 4.
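To make the five steps concrete, here is a minimal sketch of a query with the same shape, run against an in-memory SQLite database. Everything specific in it is an assumption made for illustration: the four table names, their columns, the sample rows and the year in the dates are hypothetical, and the SQL is not the statement shown in Oracle’s video.

```python
import sqlite3

# Hypothetical four-table schema; the source does not name the tables.
SCHEMA = """
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE baskets (basket_id INTEGER PRIMARY KEY, store_id INTEGER, day TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE basket_items (basket_id INTEGER, product_id INTEGER);
"""

# RANK() needs SQLite 3.25 or later (bundled with current Python releases).
NO_BANANAS_SQL = """
WITH in_scope AS (            -- steps 1 and 2: restrict, project, join four tables
    SELECT b.basket_id, p.name
    FROM stores s
    JOIN baskets b ON b.store_id = s.store_id
    JOIN basket_items i ON i.basket_id = b.basket_id
    JOIN products p ON p.product_id = i.product_id
    WHERE s.state = 'CA' AND b.day BETWEEN '2013-05-01' AND '2013-05-07'
),
banana_baskets AS (           -- step 3: join two tables
    SELECT DISTINCT i.basket_id
    FROM basket_items i
    JOIN products p ON p.product_id = i.product_id
    WHERE p.name = 'bananas'
),
counted AS (                  -- step 4: outer join, keep baskets with no bananas
    SELECT q.name, COUNT(*) AS n
    FROM in_scope q
    LEFT OUTER JOIN banana_baskets bb ON bb.basket_id = q.basket_id
    WHERE bb.basket_id IS NULL
    GROUP BY q.name
)
SELECT name, n, RANK() OVER (ORDER BY n DESC) AS rnk   -- step 5
FROM counted
ORDER BY rnk, name
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the toy schema and load a handful of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO stores VALUES (?, ?)", [(1, "CA"), (2, "NY")])
    conn.executemany(
        "INSERT INTO baskets VALUES (?, ?, ?)",
        [(10, 1, "2013-05-02"), (11, 1, "2013-05-03"), (12, 2, "2013-05-02"),
         (13, 1, "2013-05-20"), (14, 1, "2013-05-05")],
    )
    conn.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [(100, "bananas"), (101, "milk"), (102, "bread")],
    )
    conn.executemany(
        "INSERT INTO basket_items VALUES (?, ?)",
        [(10, 101), (10, 102), (11, 100), (11, 101), (12, 101), (13, 102), (14, 101)],
    )
    return conn


def popular_items_without_bananas(conn: sqlite3.Connection):
    """Return (item, count, rank) rows for qualifying baskets."""
    return conn.execute(NO_BANANAS_SQL).fetchall()


if __name__ == "__main__":
    print(popular_items_without_bananas(build_demo_db()))
```

Only step 1 is a simple scan with a filter; the joins, the anti-join and the ranking are where most of the computation lies, which matters for the discussion of where each system runs them.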
Figure: Exadata Smart Scan Processing
1 - SELECT customer_id FROM calls WHERE amount > 200
2 - Smart Scan constructed and sent to cells
3 - Smart Scan identifies rows and columns within terabyte table that match request
6 - Rows returned
7 - Consolidated result built from all cells
8 - 2MB of data is returned to the server
Smart Scan limitations
Smart Scan is not comprehensive; Exadata’s MPP storage tier is unable to process: data blocks containing active transactions (INSERT, UPDATE, DELETE); SQL joins across multiple tables or complex joins across two tables; 192 of Oracle’s 511 database functions; user-defined functions; or distinct aggregations, common in simple reports.
MPP underused
Exadata’s engineering does not fully exploit MPP architecture. Database management is not completely integrated into the storage tier, meaning too little is asked of the hardware in its MPP grid.
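The idea in the figure, that each storage cell filters its own slice and returns only the matching rows and columns for the server to consolidate, can be sketched in a few lines. This is a toy model, not Oracle’s implementation: the `Call` record, the `cell_scan` and `smart_scan` function names and the sample data are all hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Call(NamedTuple):
    customer_id: int
    amount: float
    duration_s: int  # a column the query never asks for


def cell_scan(rows: Iterable[Call], min_amount: float) -> List[int]:
    """Work done on one storage cell: restrict rows, project one column."""
    return [row.customer_id for row in rows if row.amount > min_amount]


def smart_scan(cells: Iterable[Iterable[Call]], min_amount: float = 200) -> List[int]:
    """Work done on the database server: ask every cell, consolidate results."""
    result: List[int] = []
    for rows in cells:
        result.extend(cell_scan(rows, min_amount))
    return result


# Three hypothetical cells, each holding its own slice of the calls table.
DEMO_CELLS = [
    [Call(1, 250.0, 60), Call(2, 50.0, 30)],
    [Call(3, 500.0, 10)],
    [Call(4, 199.0, 5)],
]

if __name__ == "__main__":
    # Equivalent of: SELECT customer_id FROM calls WHERE amount > 200
    print(smart_scan(DEMO_CELLS))
```

Only the customer identifiers of matching rows cross the boundary between the two tiers. Joins, sorts and aggregations have no counterpart in `cell_scan`, mirroring the limitations listed above.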
Understanding how Exadata confirms that yes, we have no bananas requires careful observation of the Linux Collectl utility shown in the video as the query runs within the single rack machine. The top section of the dynamic monitoring tool’s screen shows activity across fourteen servers that comprise the storage grid, while the lower screen shows activity across eight servers running Oracle RAC.
Within its storage grid, Exadata deploys a large flash cache capable of serving data at fast rates. In the video of the no bananas query the dynamic monitoring tool shows data copied from disk to flash memory being read at 7236 MB/second. Flash cache gets query processing in Exadata off to a great start, solving the I/O bottleneck plaguing first generation Oracle warehouses. Next, data flows to 14 servers (168 cores) comprising the storage grid. The video shows these processors reaching a maximum average utilization of about 30 percent as they complete the SQL projections and restrictions of step 1. Having placed the results of this relatively light-weight processing on the InfiniBand network, all 14 servers (168 cores) in the storage grid fall idle, running at an average of six percent utilization. Data transport across the internal network connecting the two grids peaks at 7236 MB/second as data flows to the grid running Oracle RAC. The database grid is now asked to process steps 2, 3, 4 and 5 of the no bananas query and, unsurprisingly, its servers become fully utilized at 99 percent busy. At the time the business most needs the warehouse to give its all, Exadata’s software leaves idle more than 50 percent of its 328 processing cores. Exadata compounds its inefficiency by running CPU-intensive activities (table joins, sorts and aggregations) exclusively in its RAC grid with fewer processors. In solving its I/O bottleneck problems with a dedicated SAN, Exadata shunts the point of contention downstream to its database grid.
While keeping the business informed by running queries efficiently is fundamental to a warehouse’s success, its computer system must concurrently process multiple other activities. When data is loaded from source systems, records are commonly in ASCII format. Oracle must convert these to its internal data types, such as CHAR, DATE and VARCHAR2. Because warehouses are tasked with being an organization’s memory reaching back years, data are compressed to reduce their cost of storage. To select the lowest cost access path for queries, Oracle’s optimizer gathers statistics from the database. All these activities are CPU-intensive: the more processing power they are assigned, the faster they complete, freeing the business to exploit the warehouse. Exadata comprises two grids of servers, yet it processes all these activities in its smaller database grid, which quickly becomes CPU-bound. This stalls business intelligence and analytic applications, while the majority of the machine’s computing power, the 168 cores comprising its storage grid, is under-utilized.
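The utilization figures observed in the video can be combined into a single core-weighted average for the whole machine. The sketch below is a rough illustration that assumes the two readings (six percent in the storage grid, 99 percent in the database grid) hold at the same moment; the function name is hypothetical. Both database-grid sizes from the model descriptions are shown: 160 cores, which matches the 328-core total quoted above, and 128 cores, which matches a rack with eight RAC servers.

```python
from typing import Iterable, Tuple


def average_utilization(grids: Iterable[Tuple[int, float]]) -> float:
    """Core-weighted mean utilization over (cores, busy_fraction) pairs."""
    grids = list(grids)
    total_cores = sum(cores for cores, _ in grids)
    busy_cores = sum(cores * busy for cores, busy in grids)
    return busy_cores / total_cores


STORAGE_GRID = (168, 0.06)   # cores, utilization while the joins run
DB_GRID_160 = (160, 0.99)    # database grid consistent with a 328-core total
DB_GRID_128 = (128, 0.99)    # database grid of eight two-socket servers

if __name__ == "__main__":
    for label, db in (("160-core database grid", DB_GRID_160),
                      ("128-core database grid", DB_GRID_128)):
        print(f"{label}: {average_utilization([STORAGE_GRID, db]):.1%} of all cores busy")
```

Under either assumption roughly half of the machine’s processing capacity is doing useful work while the database grid is saturated.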
In its most recent version of Exadata, Oracle increased capacity of flash cache to 22.4 terabytes and positions this upgrade as a “Database In-Memory Machine.”2 While it is unclear exactly what this means, it does not mean that Exadata supports in-memory database processing. Exadata’s flash cache is located in its storage grid, not in its database grid. Flash cache is not memory available to Oracle RAC; it is a fast peripheral device that the database management system can read via an I/O request, and an I/O request via any network is far slower than direct access to data in memory. An administrator pinning a table into flash cache will accelerate the rate at which Exadata reads data; without this human intervention Smart Scans bypass flash and access data from the slower medium of disk. In a system capable of processing data as fast as it can read it, flash cache would reduce the total time taken to run a data warehouse query. When deployed for data warehousing and analytics, Exadata’s 22.4 terabytes of flash cache is an expensive mechanism to move more data, faster, to its bottleneck at the database grid.
Performance with IBM PureData System for Analytics
PureData System for Analytics is designed from the ground up as a platform for data warehousing and analytics. It employs an asymmetric massively parallel processing (AMPP) architecture. A symmetrical multiprocessing (SMP) host3 fronts a grid of massively parallel processing (MPP) nodes which do the heavy lifting of warehousing and analyzing data. Each host machine comprises dual quad-core processors and seven 300 gigabyte drives. A node in the MPP grid is called an S-Blade (Snippet Blade), each an HX5 IBM Blade Center server with 16 Intel CPU cores and 128 gigabytes of random access memory. An accelerator card attached to each S-Blade boosts performance with 16 multi-engine field programmable gate array (FPGA) cores and 16 gigabytes of RAM. A single rack PureData System for Analytics (model N2001) has seven S-Blades (in total comprising 112 CPU cores, 112 FPGA cores and a large pool of memory) teamed with twelve SAS disk enclosures holding a total of 288 600 GB drives. A single rack IBM PureData System for Analytics manages up to 192 terabytes of compressed data, scans at 450 terabytes per hour, and maintains high query performance while loading data at up to 5 TB per hour.
To understand how PureData System for Analytics works, let’s observe as it processes the no bananas query. The query enters the PureData System for Analytics at the symmetrical multi-processing host, a node operating on Linux and running a SQL-standard database management system. On the host machine the optimizer, the compiler and the scheduler decompose the query into many different pieces or snippets—these light-weight processes are well-suited to the host’s SMP architecture. The familiarity of the host environment removes a barrier to the adoption of analytics, and in masking MPP’s complexity, opens its capabilities to any organization with basic Linux and SQL technical skills. Its work done, the host distributes its instructions and kick-starts activity across PureData System for Analytics’ MPP grid. Within the grid all4 S-Blades work in parallel, processing their workload simultaneously against their locally-managed slice of data.
Figure: IBM PureData System for Analytics AMPP Architecture. Labels: Applications (Advanced Analytics, BI, ETL, Loader); Host (CPU, Memory); Network Fabric; S-Blades (FPGA, CPU, Memory); Disk Enclosures.
At each S-Blade, the arriving snippet initiates reading of compressed data from disk into memory. Here, PureData System for Analytics’ innovative architecture becomes apparent. A field programmable gate array is a semiconductor chip equipped with a large number of internal gates programmable to implement almost any logical function, and particularly effective at managing streaming processing tasks. Outside of the PureData System for Analytics, FPGAs are used in such applications as digital signal processing, medical imaging and speech recognition. The IBM engineering team has built software machines within the FPGAs which replace work traditionally undertaken by CPUs. The FPGA reads the data from memory buffers and, using its Compress Engine, decompresses it, instantly transforming each block from disk into the equivalent of four to eight data blocks within the FPGA. Next, within the FPGA, data streams into the Project Engine, which removes columns based on parameters specified in the SELECT clause. The remaining data is then passed further downstream to the Restrict Engine, where unneeded rows are blocked from passing through gates, based on restrictions specified in the WHERE clause. The final stage of FPGA processing is the Visibility Engine, which filters out rows belonging to uncommitted data loads and so maintains ACID (atomicity, consistency, isolation and durability) compliance at streaming speeds. At this point PureData System for Analytics has completed processing step 1 of the no bananas query.
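The four-stage flow described above (decompress, project, restrict, visibility) maps naturally onto a chain of streaming stages. The sketch below models that chain with Python generators purely as an illustration of the data flow. It is not IBM’s implementation, which the text says runs in FPGA hardware; the block format (zlib-compressed JSON), the per-row committed flag and every function name are assumptions.

```python
import json
import zlib
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# A row is (committed flag, column values); the flag is hypothetical metadata.
Row = Tuple[bool, Dict[str, object]]


def compress_block(rows: List[Row]) -> bytes:
    """Helper that builds a compressed block for the demo."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def decompress(blocks: Iterable[bytes]) -> Iterator[Row]:
    """Stage 1: expand each compressed block into a stream of rows."""
    for block in blocks:
        for committed, cols in json.loads(zlib.decompress(block)):
            yield committed, cols


def project(rows: Iterable[Row], keep: List[str]) -> Iterator[Row]:
    """Stage 2: drop columns not named in the SELECT list."""
    for committed, cols in rows:
        yield committed, {name: cols[name] for name in keep}


def restrict(rows: Iterable[Row], predicate: Callable[[Dict[str, object]], bool]) -> Iterator[Row]:
    """Stage 3: drop rows that fail the WHERE predicate."""
    for committed, cols in rows:
        if predicate(cols):
            yield committed, cols


def visible(rows: Iterable[Row]) -> Iterator[Dict[str, object]]:
    """Stage 4: drop rows that belong to uncommitted loads."""
    for committed, cols in rows:
        if committed:
            yield cols


def scan(blocks, keep, predicate) -> List[Dict[str, object]]:
    """Chain the four stages; rows stream through one at a time."""
    return list(visible(restrict(project(decompress(blocks), keep), predicate)))


DEMO_BLOCKS = [
    compress_block([(True, {"customer_id": 1, "amount": 250, "region": "CA"}),
                    (True, {"customer_id": 2, "amount": 50, "region": "CA"})]),
    compress_block([(False, {"customer_id": 3, "amount": 900, "region": "NY"}),
                    (True, {"customer_id": 4, "amount": 300, "region": "NY"})]),
]

if __name__ == "__main__":
    print(scan(DEMO_BLOCKS, ["customer_id", "amount"], lambda c: c["amount"] > 200))
```

Because each stage discards data before handing rows on, the volume reaching the final consumer is a fraction of what was read, which is the property the text attributes to the FPGA stage ahead of the CPU.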
Immediately we see differences between the IBM and Oracle systems: where Exadata takes a traditional route and implements compression, projections and restrictions in software, PureData System for Analytics innovates and achieves these functions in hardware. For customers this translates directly to high query performance, low cost of acquisition and low cost of ownership. IBM’s use of the FPGA accelerates processing before data streams to the CPU and balances the whole system by clearing the I/O path of bottlenecks. The FPGA’s work reduces the volume of data streamed to the CPU, where PureData System for Analytics processes the more computationally demanding steps 2 to 5 of the no bananas query. Its work done, the S-Blade returns its data to the host node, which undertakes the light-weight processing of compiling responses from all seven S-Blades, before sending the final result to the application.
Premier Healthcare Alliance loads two-and-a-half million clinical transactions to their PureData System for Analytics every day. Health practitioners at 2,700 hospitals and 90,000 other clinical sites analyze these and other data to drive best practices within their communities of care. Premier maintains the USA’s largest clinical, financial and outcomes database with information on one-in-four of the nation’s patient discharges, allowing every hospital and clinic in the network to learn and benefit from peer organizations sharing results of their process innovations. Todd Wilkes, Vice President of Enterprise Solution Development at Premier Healthcare Alliance, credits PureData System for Analytics’ simpler administration, faster response times, faster load times and in-database analytics as technical enablers of the continual improvement within a health network spanning forty percent of the United States.5
While Exadata utilizes high performing components, the success of a system is realized when its software makes the most of these resources. Flash cache gives fast access to data, but the system is challenged to maintain the pace of processing across its downstream activities. Exadata’s architecture interrupts the natural flow of SQL processing, splitting this across two grids. The powerful array of processing cores that comprise the storage grid are assigned the relatively lightweight tasks of SQL’s projections and restrictions and then left to idle while the smaller database grid is overwhelmed by more CPU-intensive tasks required to complete query processing. Flash cache and InfiniBand dramatically reduce time spent in input/output work, but the bottleneck created by Oracle RAC becoming CPU-bound nullifies this.
The under-utilized storage grid compromises performance. Exadata’s architecture is unbalanced. Engineers create balance across a system by matching throughput of each component to capabilities of its neighbors. A system constructed from Exadata’s hardware, but whose software fully utilizes all its resources, will always outperform Oracle’s database machine. PureData System for Analytics is built on an integrated and balanced architecture. Its pragmatic blending of SMP and MPP delivers performance to business intelligence and analytics, and simplifies operations in the data warehouse.
Simplicity of operation
Beyond a warehouse’s capability to inform the business in time for it to act, communities relying on business intelligence and analytic applications expect consistent performance: a report that completes in 5 seconds on Monday creates frustration if it is still running after 30 seconds on Friday. A system that makes it easy for a warehouse’s administrators to manage dynamic, changing workloads and ensure consistent performance creates a solid foundation for success.
Simplicity with Oracle Exadata
To help the business make decisions on current data, a modern data warehouse is constantly updated from source systems; during its busiest times, when multiple reports and queries are running, the warehouse must also direct its resources to loading, compressing, transforming and integrating data. Exadata loads data through its database grid. Data loaded from ASCII format sources must be converted to Oracle’s internal data types and Exadata runs this CPU-demanding activity in its database grid. A warehouse commonly manages data in different structures to the transactional systems that originate data, and the MPP capabilities of the modern warehouse lend themselves to transforming data from one structure to another; Exadata processes data transformations in its database grid. Exadata’s administrators may choose to take advantage of Hybrid Columnar Compression when loading data; Exadata runs the CPU-intensive work of compressing data with this mechanism in its database grid. Because Exadata pushes most of the work needed to process queries to its less powerful database grid, Oracle’s system creates challenges for its administrators, who strive to balance competing claims for processing power on less than half the machine’s compute resources while unable to draw on the greater resources idling in the storage grid.
Attempts to boost query performance by delving deep into Oracle RAC Wait Events to identify contention and resolve bottlenecks are not new to Exadata. A customer of IBM from the financial services industry used the Lean6 approach to analyze resource expenditure required to manage their Oracle data warehouse. They learned that more than 90 percent of their IT team’s work was either required waste or non-value added processing. The cost of this waste translates to unnecessary hardware and software license costs, terabytes of wasted storage, elongated development and data load cycles, long periods of data unavailability, stale data, poorly performing loads and queries and excessive administrative costs. Exadata presented its engineers with an opportunity to reduce the burden of tuning and administration that Oracle RAC imposes on their customers, but as this professional noted of his experience with Exadata: “It only takes a few contending processes to bring a DB on a full RAC to its knees.”7
To improve the rate at which it moves data from disk, Exadata supports a mechanism called the storage index. This meta-data structure is dynamically created within the memory of the storage grid servers. A storage index is a map of physical storage used by queries so they can skip over areas of disk storage known not to contain data they require. A storage index can only store this information for up to eight columns, yet in this age of big data, modern data warehouses must manage very wide tables, some with hundreds of columns. Because storage indexes are dynamically created, the first query to use the feature will be relatively slow; subsequent queries accessing the same index will be faster. Memory is a limited resource. Different queries requiring the creation of different storage indexes will push least-recently used storage indexes out of memory. This means when the original query is re-run it will again be relatively slow as its storage indexes are re-created. Business users place great value on predictability of query performance from a data warehouse; in data warehouses with a broad range of queries, including ad-hoc analyses, storage indexes create unpredictable query performance.
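The behaviour described above, a per-region summary that lets a scan skip areas of storage, built on first use and evicted when memory runs short, can be sketched as a small cache of minimum and maximum values. This is a toy model under stated assumptions, not Oracle’s data structure: the class name, the use of min/max summaries, the LRU policy details and the sample regions are all hypothetical, and regions are assumed to be non-empty.

```python
from collections import OrderedDict
from typing import Dict, List, Tuple

Region = List[Dict[str, int]]  # rows stored together in one area of disk


class RegionIndexCache:
    """Keeps per-column (min, max) summaries for a bounded number of columns."""

    def __init__(self, regions: List[Region], capacity: int) -> None:
        self.regions = regions
        self.capacity = capacity
        self._cache: "OrderedDict[str, List[Tuple[int, int]]]" = OrderedDict()
        self.builds = 0  # how often the slow first-use path was taken

    def _summary(self, column: str) -> List[Tuple[int, int]]:
        if column in self._cache:
            self._cache.move_to_end(column)  # mark as most recently used
        else:
            self.builds += 1  # a full pass over every region
            self._cache[column] = [
                (min(row[column] for row in region), max(row[column] for row in region))
                for region in self.regions
            ]
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[column]

    def regions_to_read(self, column: str, value: int) -> List[int]:
        """Return indexes of regions whose range could contain the value."""
        return [i for i, (low, high) in enumerate(self._summary(column))
                if low <= value <= high]


DEMO_REGIONS: List[Region] = [
    [{"day": 1, "store": 7}, {"day": 3, "store": 9}],
    [{"day": 4, "store": 2}, {"day": 6, "store": 5}],
    [{"day": 7, "store": 8}, {"day": 9, "store": 1}],
]

if __name__ == "__main__":
    cache = RegionIndexCache(DEMO_REGIONS, capacity=1)
    print(cache.regions_to_read("day", 5), cache.builds)    # first use builds the summary
    print(cache.regions_to_read("day", 8), cache.builds)    # reuses it
    print(cache.regions_to_read("store", 8), cache.builds)  # new column evicts "day"
    print(cache.regions_to_read("day", 5), cache.builds)    # rebuilt: slow again
```

The `builds` counter makes the unpredictability visible: the same lookup is cheap or expensive depending on which other queries ran in between.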
Administrators can accelerate input/output processing by pinning tables into Exadata’s cache of flash memory. When a data warehouse’s active data set is larger than the cache available (22.4 terabytes on a full rack X3-2 model) administrators must decide which data to advantage. This decision creates administrative burden, for at least two reasons. First, business users create great value from ad-hoc queries; problem solving is typified by an iterative pattern of discovery where one question generates further questions. In these scenarios it is impossible for an administrator to guess which data should be pinned in flash. The alternative is for an administrator to accompany every journey data scientists take through their data. Secondly, while OLTP systems typically operate on single records, analytic queries typically retrieve multiple record sets. While flash cache has advantages in the world of online transaction processing where active data sets are relatively small and highly predictable, it creates only costly overhead when Exadata is deployed as a data warehouse.
Compared to the relatively predictable loads created by OLTP systems, computer systems running data warehouses must contend with highly varying demands. Achieving and maintaining consistent performance for large communities of users, with different application and data requirements, through rising and falling loads, is a difficult challenge. Oracle’s philosophy of workload management is to offer administrators multiple tuning parameters. Oracle’s parameters have a high degree of dependency on one another, and in Exadata some must be set to the same value for every processor in the grid. This complexity forces administrators to experimentally change parameter settings, constantly tuning as new demands are made of the warehouse. Demanding a high degree of Oracle expertise, Exadata does little to shield warehouse administrators from the complexities of workload management.
The already difficult task of system administration and maintaining consistent performance is compounded by Exadata’s partitioning of processing between two grids and a feature labelled passthrough, which sees the storage grid hand responsibility for processing to the database grid. Note that passthrough is strictly one way; the lesser powered database grid has no opportunity to call upon the services of the more powerful storage grid. Exadata unpacks Hybrid Columnar Compressed data within its storage grid only when it has CPU cycles available. With its SAN busy and processing cycles available in its database grid, Exadata ships compressed data over its network to the smaller database grid for unpacking. The dynamic and conditional nature of this passthrough feature creates uncertainty, complicating life for Exadata administrators.
Oracle releases patches for Exadata once every three months. Each quarterly release typically includes multiple patches to be applied in the storage grid, on the InfiniBand switches, in the database management system and for systems management utilities. Patching Exadata is so complicated as to warrant presentations on patch planning,8 professional service offerings from Oracle and partners,9 blogs from technicians sharing their experiences,10 and webinars.11 The number of training days recommended by a product’s manufacturer is one indicator of a product’s ease of use; Oracle12 recommends 24 days of education for Exadata database administrators. The machine’s complexity burdens its administrators, including technical experts with deep knowledge of Oracle’s database management system. Despite training programs, blogs and articles, the reality is that, even for experienced professionals, Exadata’s complexity makes it difficult to master, or as this professional noted: “I think the bottom line is that there are very few people on the planet who really know enough.”13
Simplicity with IBM PureData System for Analytics
IBM’s philosophy is to bring simplicity to all phases of data warehousing. The first task facing a customer is loading their data. For example, MediaMath provides tools to marketing professionals so they can use data to understand consumer behavior and identify opportunities; their data warehouse is a world of large and growing data volumes and constant loads. Tom Craig, VP Data and Information Strategy, credits PureData System for Analytics with “the ability to consume massive amounts of data from disparate sources and come up with that decision that you need to make at that point in time.”14 During the load process PureData System for Analytics automates data distribution. Experience from proof-of-concept projects is that customers load their data to PureData System for Analytics using automatic distribution, run their queries and compare results to their highly tuned Oracle environments. For all but the simplest queries, automatic distribution is good enough for the IBM machine to outperform Oracle. Customers may later analyze all their queries to identify those that can be accelerated by redistributing data on different keys.
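The text does not say how distribution on a key works, so the following is only a generic sketch of one common approach in MPP systems: hashing a chosen column to pick a data slice. The function names, the use of CRC32 as the hash and the sample rows are assumptions made for illustration, not a description of PureData System for Analytics internals.

```python
import zlib
from typing import Dict, List

Row = Dict[str, object]


def slice_for(key: str, n_slices: int) -> int:
    """Map a key to a slice number with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % n_slices


def distribute(rows: List[Row], key_column: str, n_slices: int) -> List[List[Row]]:
    """Place each row on the slice chosen by hashing its key column."""
    slices: List[List[Row]] = [[] for _ in range(n_slices)]
    for row in rows:
        slices[slice_for(str(row[key_column]), n_slices)].append(row)
    return slices


DEMO_ROWS: List[Row] = [
    {"customer_id": 1, "store": "A", "amount": 20},
    {"customer_id": 2, "store": "B", "amount": 35},
    {"customer_id": 1, "store": "B", "amount": 15},
    {"customer_id": 3, "store": "A", "amount": 50},
    {"customer_id": 2, "store": "A", "amount": 10},
]

if __name__ == "__main__":
    for key in ("customer_id", "store"):
        sizes = [len(s) for s in distribute(DEMO_ROWS, key, n_slices=4)]
        print(f"distributed on {key}: rows per slice = {sizes}")
```

Rows sharing a key value always land on the same slice, so a join or aggregation on that key can proceed within each slice without moving data. Choosing a different key changes which operations get that benefit, which is the kind of later adjustment the paragraph above mentions.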
PureData System for Analytics eliminates the wasted work of database tuning. Equipped to make its own intelligent decisions, the PureData System for Analytics requires no tuning and little system administration. The few administrative tasks necessary to maintain consistent performance through dynamic, changing workloads are within easy reach of anyone with Linux and SQL experience. All that is required of the administrator is to allocate the system’s resources to groups within the user community and hand control to the workload management system. Another client, TEOCO, uses 24 racks of PureData System for Analytics to provide analytics solutions to communications service providers worldwide—those machines are managed by just two people.15 Jonjie Sena, Senior Director of Product Management at TEOCO, calls this “The iPod approach to big data.”
One key to fast performance in a data warehouse is to reduce the physical disk area accessed by each and every query. This approach minimizes time wasted by individual queries moving data from storage into memory where it is immediately discarded, and promotes efficient resource management across the entire data warehousing compute environment. This use case is the opposite of online transaction processing, where an index acting as a pointer to a single record speeds processing. Pre-dating Exadata’s storage index by several years, PureData System for Analytics solves the challenge of steering queries past areas storing unneeded data with the ZoneMap.16 PureData System for Analytics automatically builds and maintains ZoneMaps with no administrative intervention and provides consistent query performance regardless of growth in the underlying database tables or, unlike Exadata’s storage indexes, of when a query was previously processed.
Because PureData System for Analytics applies full parallelism to all tasks, its workload management system plays a critical role in controlling how much of the system’s computing resources are made available to each and every job. In the IBM architecture, one software component controls all system resources: processors, disks, memory, and network. This simplicity is the foundation of the system’s workload management system, making it easy for administrators to allocate computational resources to users and groups based on priorities agreed with the business and to maintain consistent response times for multiple communities.
Maintaining current releases of all software comprising PureData System for Analytics is simple. IBM routinely issues patches that include updates to multiple components, and customers report that the automated update process usually completes in less than one hour. IBM recommends a single three-day course for administrators of the PureData System for Analytics.
To inform their customers of what we buy and watch, Nielsen runs very complex analytic workloads against huge amounts of data integrated from point of sale systems, social media sites, and set-top boxes. Because Nielsen runs analytics on demand for their customers, they need a system capable of running at the speed of thought. As John Naduvathusseril, Chief Data Architect at Nielsen, explains of their PureData System for Analytics: “We just load the data and we fire away the queries and we get the performance that we need without investing too much time in customization, which is what we’ve traditionally seen with a lot of other competitive vendors. This simplicity, and the paradigm of an appliance where you load data and just start running queries, was not visible at other competitive vendors that we tested.”17
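The ZoneMap idea discussed in this section, steering a query past storage that cannot contain the rows it needs, can be illustrated with a short sketch. It assumes a zone map records the minimum and maximum value of a column for each block of storage; Extent, build_extents and range_scan are hypothetical names used for illustration and do not represent IBM’s implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    """A contiguous block of rows plus the min/max of one column."""
    rows: List[dict]
    lo: int
    hi: int


def build_extents(rows, column, rows_per_extent):
    """Split rows into fixed-size extents and record each extent's value range."""
    extents = []
    for start in range(0, len(rows), rows_per_extent):
        chunk = rows[start:start + rows_per_extent]
        values = [r[column] for r in chunk]
        extents.append(Extent(chunk, min(values), max(values)))
    return extents


def range_scan(extents, column, lo, hi):
    """Return matching rows and the number of extents actually read."""
    matches, extents_read = [], 0
    for ext in extents:
        if ext.hi < lo or ext.lo > hi:
            continue  # value range cannot overlap the predicate: skip the extent
        extents_read += 1
        matches.extend(r for r in ext.rows if lo <= r[column] <= hi)
    return matches, extents_read


# Rows loaded in day order, so each extent covers a narrow range of days.
rows = [{"day": d, "amount": d % 7} for d in range(1, 1001)]
extents = build_extents(rows, "day", rows_per_extent=100)
matches, extents_read = range_scan(extents, "day", 250, 260)
print(f"read {extents_read} of {len(extents)} extents, found {len(matches)} rows")
```

Because the min/max ranges are recorded as the extents are built, the skip decision does not depend on whether a similar query has run before, and it adds no structure for an administrator to maintain.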
Bringing large volumes of data under management and making them exploitable by large communities of business users are complex challenges. The measure of a well-engineered product is its success at shielding its users and operators from a task’s complexity. Unable to contain complexity, Oracle’s engineers burden their customers with high costs of operation and ownership. In the role of data warehouse, Exadata constantly challenges its administrators to maintain consistent performance for existing reports and analyses as new and changing query, load and transformation tasks contend for less than 50 percent of the machine’s total processing power, while the greater part of the machine’s compute resources idle in its storage grid. PureData System for Analytics is simple to own, simple to use and simple to manage. Customers who move their warehouses from Oracle to IBM immediately recognize that attending weeks of training and spending years developing deep product knowledge are unfortunate attributes of their previous choice in technology, and not prerequisites of successful data warehousing. IBM’s developers succeed in engineering software and hardware to really work together, ensuring customers don’t waste their time and money. These differences between the products have consequences for their potential to create value for their customers.
Value
At a fundamental level, data management is a cost while business intelligence and analytic applications create value. While investing in the former is required in order to gain the benefit of the latter, the more the costs of managing data are reduced, the more resources are made available for exploiting information. Simply speeding up queries run for the past decade misses the point. The opportunity is to create new ways of doing business where decisions, often made as events unfold, are informed by deep analysis of data.
Value with Oracle Exadata
Exadata dedicates more than half its processing power to filtering data; its SAN-based servers idle while the database grid is completely consumed by the heavier-weight activities of SQL processing. Exadata’s inefficient architecture translates to wasted space, power and cooling in data centers. These costs grow when we consider the opportunity cost of hardware dedicated to Exadata, which otherwise could be usefully employed by other tasks. Customers pay for under-utilized space which would return greater value if used to house a more efficient computer system or a SAN making data available to systems beyond those supplied by Oracle.
Exadata reduces the costs of managing an Oracle-based data warehouse; the test of a well-engineered system is by how much. One Exadata customer18 reports, “More than half the time we spent tuning vanishes, and all that time can be immediately dedicated to development.” Given case studies from first generation warehouses where customers report a team of six people spending most of their time tuning and managing a relatively small Oracle warehouse, “more than half” remains too high a cost for data management. Once data filtering completes (step 1 of the no bananas query) and the benefits of a storage grid and InfiniBand are realized, Exadata is Oracle’s database management system running on multiple cores. Oracle’s no bananas video shows a single rack Exadata with all processors within its database grid busied by just one query on one terabyte of data. Depending on which disks are purchased, when fully loaded this rack can store either 45 or 224 terabytes of user data. A demonstration of this same query against either of these larger data sets would show the servers in the storage grid busy as they unleash a wave of data, and progress stalling at the bottleneck of Exadata’s database grid.
Unlike the video, a busy organization is unlikely to dedicate its entire warehouse to a single query. A demonstration of no bananas running concurrently with queries and data loads that typically run throughout the day in a modern warehouse would show the application waiting for a response while the 160 cores of the database grid run at full utilization. Despite Oracle’s Real World Performance Group representing no bananas as a difficult query, it is just an everyday query. And that’s Exadata’s problem: the Oracle database management system struggles to process even moderately demanding SQL on large data sets. Talking of her experience of data warehousing with Oracle, Christine Twyford, Manager of Technology at T-Mobile, reminisces: “We’d never had access to nation-wide data because it was too massive to query before.” Since moving to PureData System for Analytics, T-Mobile loads 20 billion rows of new data using 150 thousand extract-load-transform jobs to a warehouse of two petabytes of data. Where their previous system constrained reporting to individual regions, the new system informs the business of events across the entire network of 35,000 sites across the USA. “We’ve had a lot of success building on our network because we’ve had access to this data. We’re using it in pretty unique ways. We’re using predictive analytics to figure out when new network nodes need to be added and plan for that in a much smarter way. Having that kind of accuracy and timeliness is critical.”19
Evaluating the Systems
Systems compared: IBM PureData System for Analytics N1001 / N2001 and Oracle Exadata X3
Performance and architecture
MPP
IBM PureData System for Analytics:
• True MPP
• Optimized for data warehousing and analytics
Oracle Exadata X3:
• Hybrid: parallel storage nodes and SMP clustered head node
• A generalized architecture
Hardware architecture
IBM PureData System for Analytics:
• Full processing S-Blades (1 CPU core + 1 FPGA core / 1 disk drive)
• SMP host node used primarily for user/applications interface
• Independent blade-to-blade redistribution
Oracle Exadata X3:
• Intelligent storage (1 CPU core / 1.5 disk drives)
• SMP cluster nodes running Oracle 11g RAC
• InfiniBand (Exadata nodes to SMP cluster)
• Head node engagement in all data redistributions
Data streaming
IBM PureData System for Analytics:
• FPGA performance assist on S-Blade: decompression, predicate filtering, row-level security enforcement
• >95 percent of work done on S-Blades
Oracle Exadata X3:
• Exadata nodes primarily used for decompression and predicate filtering
• Most DW and analytics work done in SMP head node
In-database analytics
IBM PureData System for Analytics:
• Fully engaged MPP platform for analytics
• User-defined functions, aggregates and tables
• Language support: C/C++, Java, Python, R, Fortran
• Paradigm support: SQL, Matrix, Grid, Hadoop
• Built-in set of >150 key analytics (fully parallelized)
• Integrated development environment: Eclipse and R GUI w/ wizards
Oracle Exadata X3:
• Analytics processing limited to head node cluster only
• User-defined functions and aggregates
• Language support: C/C++, Java
• Paradigm support: SQL, Matrix (minor)
• Basic analytics functions
Scale
IBM PureData System for Analytics:
• Linear performance and data size scalability
• Full-featured, enterprise-class workload management and other features
Oracle Exadata X3:
• Non-linear performance and data size scaling: performance and I/O bottleneck at/to head node cluster
Simplicity
Appliance system management and integration
IBM PureData System for Analytics:
• No tuning, no indexing, no partitions
• Balanced system developed to deliver best price-performance
Oracle Exadata X3:
• Heavily tuned performance dependency
• Performance depends on physical database design skills, including indices and partitions
Value with IBM PureData System for Analytics
Where Exadata customers must run 168 processing cores to perform the filtration processing of their SQL queries, data warehouses running on PureData System for Analytics do more work in just sixteen field programmable gate arrays. Each FPGA, one square inch of silicon, decompresses and filters data with enormous efficiency, drawing little power and generating little heat. The administrative overhead of Oracle-based warehouses means customers must grow their administrative teams as data volumes grow. Moving to PureData System for Analytics remedies that situation. At T-Mobile the same team of three people who previously managed 40 terabytes in an under-performing Oracle warehouse now exploit two petabytes with PureData System for Analytics.
Customers moving to PureData System for Analytics benefit immediately from reduced costs of data management. As data management ceases to be a problem, they see their organizations freed to exploit data – doing things better or doing different things. One of the US’s largest communications service providers, XO Communications supplies more than fifty percent of the Fortune 500. XO pledges to provide the greatest level of service available in the telecom industry. As explained by Chris Payne, Senior Manager of Business Intelligence at XO, managing data is crucial to both company profits and good customer experiences. “At any given time we probably have close to five or six hundred different variables that we are looking at, just in terms of a single customer.” Those variables form a predictive model that in its first year of operation saved the company $11.2 million. Because PureData System for Analytics is so simple to operate, XO is continually finding ways to optimize their predictive model. “Our most recent model has produced around $15 million worth of annual savings just by having better and more optimized data available to us.”20
PureData System for Analytics supports XO Communications and other like-minded organizations looking to create value because the system’s engineers recognized data warehousing has moved beyond processing of straightforward SQL as represented by queries such as no bananas. PureData System for Analytics’ architecture excels at processing predictive models, investigative graphs and other analytic applications. In this world, SQL can be viewed as a convenient I/O mechanism; queries such as no bananas represent the beginning and not the endpoint of the machine’s work. As the system receives an analytic query, each S-Blade streams data to the FPGA, from where filtered data moves to memory. The same Snippet code that delivers SQL instructions also carries algorithms of advanced analytics, and the CPU goes to work on data in memory. Computing answers to highly complex algorithms requires communication among S-Blades; PureData System for Analytics exploits a message-passing interface that communicates interim results between the nodes and accelerates production of the final result.
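The flow described here (filter first, compute on what survives, then combine interim results across nodes) can be sketched as a filter-then-partial-aggregate pattern. This is a hedged illustration in plain Python, not the appliance’s Snippet code or its message-passing interface; blade_scan, blade_partial and merge are hypothetical names, and the sample data is invented for the example.

```python
from collections import Counter


def blade_scan(rows, predicate):
    """Filter stage: stands in for hardware discarding rows a query does not need."""
    return (r for r in rows if predicate(r))


def blade_partial(rows, predicate, group_col, value_col):
    """Per-blade work: aggregate only the filtered rows held by this blade."""
    partial = Counter()
    for r in blade_scan(rows, predicate):
        partial[r[group_col]] += r[value_col]
    return partial


def merge(partials):
    """Combine interim results from every blade into the final answer."""
    total = Counter()
    for p in partials:
        total.update(p)  # Counter.update adds counts key by key
    return total


# Two hypothetical blades, each holding its own share of the table.
blades = [
    [{"region": "north", "sales": 3, "year": 2012},
     {"region": "south", "sales": 5, "year": 2011}],
    [{"region": "north", "sales": 2, "year": 2012},
     {"region": "west", "sales": 4, "year": 2012},
     {"region": "south", "sales": 1, "year": 2010}],
]
only_2012 = lambda r: r["year"] == 2012
result = merge(blade_partial(b, only_2012, "region", "sales") for b in blades)
print(dict(result))
```

Each blade touches only its own rows and passes on a small interim result, so the volume of data exchanged between nodes is governed by the number of groups rather than the number of rows scanned.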
Armed with a team of data scientists and a machine designed for analytic processing, some organizations choose to do different things, exploiting data to differentiate themselves from their competitors. The UK’s highly competitive market for motor vehicle insurance is commoditized as its players issue quotes based on standardized actuarial tables used to calculate risk based on factors including age of the driver and the vehicle’s night-time parking location. Insurethebox21 takes a different approach, installing a telemetry device in each insured vehicle to collect real data in near real-time. Knowing how, where and when a vehicle is driven allows the company to issue personalized insurance premiums based on hard evidence.
PureData System for Analytics analyzes each driver’s behavior and communicates results and recommendations back via a portal. As Mike Brockman, CEO of Insurethebox, explains, “It’s completely changed the game, basically. It’s all about educating customers to drive more responsibly.” Society gains from safer drivers, individuals learn to drive more safely and reduce their premiums, and Insurethebox creates new value in an otherwise commoditized market. These are the breakthroughs that can result when data warehouses bring massive data volumes under management at low cost and deliver performance to the algorithms of advanced analytics. PureData System for Analytics is proven. Oracle’s database management system is designed for transaction processing; Exadata’s dedicated SAN and InfiniBand network cannot make it excel at warehousing and analytics.
Conclusion
Oracle Exadata is built from fast componentry, but the whole system is less than the sum of its parts. Because its architecture under-utilizes processing power in its storage grid and creates contention at its database grid, Exadata cannot match the query performance of PureData System for Analytics. Bringing large volumes of data under management and making them exploitable by large communities of business users are complex challenges. Where Exadata burdens customers with high costs of operation and ownership, PureData System for Analytics screens its users and operators from complexity. Once its storage grid and InfiniBand network complete their work, Exadata is just an Oracle database management system: the same poorly performing software that prompted hundreds of customers to seek an alternative and move to a solution specialized for processing analytic queries on large data sets. Oracle’s marketing recommends that organizations dissatisfied with their current Oracle-based warehouses move to Exadata. Hundreds of organizations have found a better way forward: data warehousing and analytics with PureData System for Analytics.
© Copyright IBM Corporation 2013
IBM Corporation, Software Group, Route 100, Somers, NY 10589 U.S.A.
Produced in the United States of America, April 2013
IBM, the IBM logo, ibm.com, PureData and PureSystems are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Netezza is a registered trademark of IBM International Group B.V., an IBM Company. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.
The performance data and client examples cited are presented for illustrative purposes only. Actual performance results may vary depending on specific configurations and operating conditions. It is the user’s responsibility to evaluate and verify the operation of any other products or programs with IBM products and programs. Statements regarding IBM’s future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.
1 For greater analysis of this subject see Understanding Analytic Workloads - Meeting the complex processing demands of advanced analytics, available at www.ibmbigdatahub.com/sites/default/files/document/IBM-Netezza-Understanding-AnalyticWorkloads-eBook.pdf
2 http://www.oracle.com/us/corporate/press/1931761
3 PureData System for Analytics has two SMP hosts for redundancy but only one is active at any one time.
4 Within each rack one S-Blade is a spare on warm stand-by.
5 http://www.ibmbigdatahub.com/video/ibm-puredata-analyticscustomer-reference-featuring-premier-healthcare
6 With roots in manufacturing, “Lean” is a practice using tools and techniques of Six Sigma to analyze wasteful expenditure of resources, and target activities adding no value to the product or service for elimination.
7 www.vanoug.org/exdata10x100xslower.pdf
8 www.oracleimg.com/openworld/lad-en/session-presentations/12961-enok-1432301.pdf
9 www.enkitec.com/database-solutions/oracle-database/exadatapatching
10 www.divakarmehta.wordpress.com/2011/12/15/exadata-11-2-2-4-postpatch-issues-with-infiniband-diagnsotics-failing-on-compute-nodes
11 www.centroid.com/knowledgebase/webinars/exadata-patching
12 www.education.oracle.com/pls/web_prod-plq-dad/ou_product_category.getPage?p_cat_id=212
13 www.martincarstenbach.wordpress.com/2012/04/05/exadataexperience-what-does-that-actually-mean
14 www.ibmbigdatahub.com/video/netezza-data-analysis-powers-realtime-bidding-mediamath-demand-side-platform
15 www.ibmbigdatahub.com/video/teoco-corporation-simplifyingbusiness-analytics-ibm-netezza
16 ibm.com/developerworks/mydeveloperworks/blogs/Netezza/entry/zone_maps_and_data_power20?lang=en
17 http://www.youtube.com/watch?v=vwOSZixbhiU&list=PLqZTdGs5ywRW4WsD7HonOu8KVbcQXYdI5&index=24
18 www.informationweek.com/software/information-management/oracle-exadata-v2-customer-tells-all/226300160
19 http://www.ibmbigdatahub.com/video/ibm-puredata-systemanalytics-customer-testimonial-t-mobile
20 http://www.ibmbigdatahub.com/video/ibm-puredata-systemtestimonial-featuring-xo-communications
21 www.ibmbigdatahub.com/video/insurethebox-transforming-motorinsurance-industry-ibm-netezza-data-warehouse-appliance
IMM14080-USEN-04