When The Rules Change.key - New York Oracle User Group
Transcript

WHEN THE RULES CHANGE
Next Generation Oracle Database Architectures using Super-fast Storage
James Morle, EMC DSSD

INTRO
[Timeline graphic: speaker history from 1993 to 2015 - "thar be dragons"]

Disclaimer: I work for EMC these days, and use some of the corporate content, but all opinions here are my own - this is not an official company presentation.

"I/O certainly has been lagging in the last decade" - Seymour Cray, 1976

THE ACCESS TIME GAP
[Chart: cost ($/GB) versus access time (ns) for DRAM, 3D XPoint, NAND and disk, spanning 1ns to 100,000,000ns - 3D XPoint and NAND sit in the "access time gap" between DRAM and disk]

"Bandwidth is the work of man, latency is the realm of God" - Jeff Bonwick, CTO and Founder, DSSD

BIG PIPES ARE EASY
[Graphic: performance versus latency - big pipes (bandwidth) are the easy part]

WHAT MATTERS WITH ORACLE WORKLOADS?
• DW/BI workloads:
  • Multiblock read bandwidth
  • Sequential write bandwidth and latency
• OLTP workloads:
  • Single block read latency
  • Sequential write latency

SO WHAT'S THE PROBLEM?
• Delivery of low latency I/O requires low latency transport in addition to low latency media
• We have the media, currently NAND flash, but…
• Fibre Channel often adds up to 200 microseconds of latency
• This needs something new, and fit for purpose… let's start with the software

BLOCK DEVICE ACCESS TO DSSD
• Traditional stack (application and libraries → system call → POSIX file system → volume manager → device driver → SAS/SATA HBA → device controller → disk/NAND): 300µs to 5,000µs
• DSSD stack (application and libraries → libflood → user DMA port → PCIe client card → DSSD I/O Module → DSSD Flash Module): <120µs
• The DSSD block driver takes the same path through the kernel, with a bit more latency due to kernel overhead
• (A measurement sketch of the conventional path follows)
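To see what that kernel-path latency looks like in practice, a tool such as fio can drive Oracle-like 8KB random reads through the standard block-device path. A minimal sketch, assuming fio is installed and that /dev/dssd0000 is a device you may safely read (the device name is illustrative):

    # 8KB random reads, direct I/O, synchronous engine - approximates an
    # Oracle single block read through the kernel block-device path.
    # Read-only, so it cannot damage existing data.
    fio --name=sbr-latency --filename=/dev/dssd0000 --readonly \
        --rw=randread --bs=8k --direct=1 --ioengine=psync \
        --numjobs=1 --time_based --runtime=30

The completion-latency ('clat') percentiles in the output correspond roughly to what Oracle would record as single block read time against the same device.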
DSSD FM VS OTHER FLASH STORAGE
• Standard flash devices: each SSD runs complex firmware with limited power, independently implementing the FTL, defect management, garbage collection, flash physics, wear leveling, ECC and vaulting
• DSSD has simpler, faster Flash Modules (FMs): only ECC and vaulting live on the module
• The media is independently managed: a Control Module (CM) with rich resources implements the advanced algorithms (wear leveling, flash physics, garbage collection, Cubic RAID, defect management, FTL) globally

HARDWARE + SOFTWARE RESILIENCE
• Always-on Cubic RAID: twice the reliability of other RAID schemes, with the same overhead (17%)
• The Cubic RAID grid is an interlocked, multidimensional array of multi-page "cells" of NAND die
• High performance, always on: system-wide data protection

DSSD D5 - 5U RACK SCALE FLASH PLATFORM
Dense and shared flash: Flash Modules and Control Modules
• 36 Flash Modules (FMs), or 18 when half populated
• 2TB/4TB Flash Modules today; larger FMs on the roadmap
• Dual ported PCIe Gen 3 x4 per FM
• Dual-redundant Control Modules (CMs), PCIe Gen 3 connected

DSSD D5 - 5U RACK SCALE FLASH PLATFORM
I/O Modules, fans and power supplies
• Redundant power supplies (x4)
• Dual-redundant I/O Modules (IOMs), PCIe Gen 3 connected
• 48 PCIe Gen 3 x4 client ports per IOM, for a total of 96 PCIe Gen 3 x4 client port connections per D5 (see the arithmetic sketch below)
• Redundant fan modules (x5)
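A rough cross-check of that client fabric, assuming the commonly quoted figure of about 3.9GB/s of usable bandwidth per PCIe Gen 3 x4 link (8GT/s per lane with 128b/130b encoding):

    SQL> SELECT ROUND(96 * 3.9) AS fabric_gbs FROM dual;

    FABRIC_GBS
    ----------
           374

Call it roughly 370GB/s of aggregate client connectivity: the host-facing fabric is sized well beyond the array's 100GB/s, so port bandwidth is not the limiting dimension.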
NOISY NEIGHBORS
• In all other (non-D5) storage solutions, data is served by CPUs
• The CPUs execute the code to service HBA requests, check caches, request data from media, and so on:
  1. Request arrives
  2. CPU accepts the interrupt, checks CPU memory for a cached copy
  3. If found, skip to 8; if not, continue
  4. CPU forwards the request to the media HBA
  5. HBA makes the request from persistent media
  6. Media locates the data and responds
  7. HBA forwards the data to the CPU
  8. CPU forwards the data to the network HBA
  9. Data returns to the host
• CPU is a relatively scarce resource, and prone to abuse by certain sessions/systems/users - the noisy neighbors
• When CPU resource is unavailable, response times degrade rapidly and exponentially

NOISY NEIGHBOURS
• In DSSD D5, data is self-service: "Much less prone to Noisy Neighbor Syndrome!"
• Hosts have full access to 18,432 flash chips, a much less scarce resource
• Data is spread thinly across those chips, minimizing contention
• All data transfers, read and write, are direct DMA between host and flash:
  1. Request arrives (as a DMA write of the requested LBA)
  2. CPU writes DMA directly to the appropriate Flash Module
  3. Flash Module returns the data via DMA write to the host
• The D5 has so much performance capacity, compared to other platforms, that the likelihood of a single errant system affecting others is greatly reduced

PERFORMANCE ORIENTED ARCHITECTURE
[Diagram: dual I/O Modules with PCIe ports, Flash Modules, and Control Module CPUs]

WHAT DOES ALL THIS GIVE US?
• Marketing 'hero' numbers (real, but using artificial tools):
  • 100TB usable
  • 100GB/s bandwidth
  • 100µs latency
  • 10 million IOPs (4KB)
  • 5U rack space
• Proven Oracle numbers:
  • 100TB usable
  • 60GB/s bandwidth into Oracle
  • 140µs latency
  • 5.3 million IOPs (8KB, SLOB)
  • 5U rack space

AND THERE'S MORE…
• Up to two D5s are currently supported on a single system
• Proven Oracle numbers:
  • 200TB usable
  • 120GB/s bandwidth into Oracle
  • 140µs latency
  • 10.6 million IOPs (8KB, SLOB)
  • 10U rack space
• (A quick cross-check of those figures follows)
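The claims hang together: multiplying IOPs by block size gives the implied small-block bandwidth, a useful sanity check on any storage benchmark. A sketch, taking 4KB/8KB as 4,096/8,192 bytes:

    SQL> SELECT ROUND(10000000 * 4096 / POWER(10,9), 1) AS hero_4k_gbs,
      2         ROUND(5300000 * 8192 / POWER(10,9), 1)  AS slob_8k_gbs
      3    FROM dual;

    HERO_4K_GBS SLOB_8K_GBS
    ----------- -----------
             41        43.4

Both land around 40GB/s, comfortably inside the 100GB/s large-block ceiling: the small-block limits are set by latency and queueing, not by the pipes.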
NEW RULES
• D5 has NO cache - everything is fast
• You just have a full 100TB usable 'working set'

TRADITIONAL STORAGE
[Diagram: a small, fast cache in front of slow persistent storage, with data motion between the two]

D5 STORAGE
[Diagram: persistent storage only - the entire dataset is fast]

WHAT DOES IT LOOK LIKE TO A DBA?
• Familiar block-driver interface: i.e. /dev/dssdXXXX devices
• Fully shared disk
• Multipathing is fully automatic and invisible: no child devices exposed, no tunables
• Udev rules are recommended to create friendly names (a sketch follows this slide)
• Reference documentation is the "Oracle Databases on DSSD D5 – Best Known Methods" paper
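A minimal sketch of such a udev rule, with hypothetical match values; the exact attributes exposed by the DSSD driver, and the supported recipes, are in the Best Known Methods paper:

    # /etc/udev/rules.d/99-asmdisks.rules (illustrative only)
    # Give dssd block devices stable friendly names and Oracle-friendly
    # ownership; the serial-number key and value here are assumptions.
    KERNEL=="dssd*", ENV{ID_SERIAL}=="000441_0032", SYMLINK+="asmdisks/OraVol000441_00", OWNER="oracle", GROUP="dba", MODE="0660"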
WHAT DOES IT LOOK LIKE TO A DBA?

    # ls -l /dev/asmdisks
    total 0
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraOCR000441_00 -> ../dssd0030
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraOCR000441_01 -> ../dssd0031
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraOCR000444_00 -> ../dssd0028
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraOCR000444_01 -> ../dssd0029
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraRedo000441_00 -> ../dssd0000
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraRedo000441_01 -> ../dssd0001
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraRedo000444_00 -> ../dssd0026
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraRedo000444_01 -> ../dssd0027
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraVol000441_00 -> ../dssd0032
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraVol000441_01 -> ../dssd0033
    lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraVol000441_02 -> ../dssd0034

WHAT DOES IT LOOK LIKE TO A DBA?

    SQL> l
      1* select group_number,path,name,failgroup,mount_status from v$asm_disk order by 1,4,3
    SQL> /

    GROUP_NUMBER PATH                                     NAME                 FAILGROUP            MOUNT_S
    ------------ ---------------------------------------- -------------------- -------------------- -------
               0 /dev/asmdisks/OraFRA000441_03                                                       CLOSED
               0 /dev/asmdisks/OraVol000441_11                                                       CLOSED
               0 /dev/asmdisks/OraVol000444_06                                                       CLOSED
               0 /dev/asmdisks/OraVol000444_03                                                       CLOSED
               0 /dev/asmdisks/OraRedo000444_00                                                      CLOSED
               0 /dev/asmdisks/OraVol000444_09                                                       CLOSED
               0 /dev/asmdisks/OraVol000444_01                                                       CLOSED
    … etc
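Grouping those devices is then plain ASM SQL. A sketch under assumptions, not the BKM prescription: it reads the 000441/000444 name components as two separate D5s and maps each onto its own failure group for normal redundancy mirroring across the arrays:

    SQL> CREATE DISKGROUP DATA NORMAL REDUNDANCY
      2  FAILGROUP d5_000441 DISK '/dev/asmdisks/OraVol000441_*'
      3  FAILGROUP d5_000444 DISK '/dev/asmdisks/OraVol000444_*';

External redundancy on a single D5 is an equally valid sketch, since Cubic RAID already protects the media; mirroring across two failure groups is what enables the Preferred Read Failure Group architecture mentioned later.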
ELIMINATION OF COMPLEXITY
• No dm-multipath or PowerPath:
  • Purpose built, high performance multipathing is integral to the client drivers
  • Only a single device name is exposed; all detail is handled by the driver
• No manipulation of I/O elevators:
  • NOOP is forced
  • Everything is 4KB anyway (blkdev)
• DMA access, with separate submission and completion queues
• No queue tuning - DMA enqueue is so fast that it is largely unnecessary - but we make an exception for redo

WHICH DIMENSION MATTERS?
• Bandwidth?
• Latency?
• IOPs?
• Nobody actually needs 5.3M IOPs, but they are a side effect of the bandwidth and low latency - which people DO need!

ANALYSIS OF DB TIME
• Low latency storage dramatically alters the split of time for a process
• Using SLOB:
  • Traditional storage: ~200µs CPU, ~6,000µs single block read - a 30:1 ratio
  • D5: ~200µs CPU, ~200µs (at high load) single block read - a 1:1 ratio

LATENCY: SYNCHRONOUS I/O
• Oracle workloads are most frequently dependent on synchronous I/O:
  • Index traversal and Nested Loop joins (a serial I/O pathology)
  • Log writer (redo bandwidth is proportional to write latency)
• Latency is now so low that the returns are diminishing from here:
  • Reducing disk latency from 6ms to 3ms was almost a 2x speedup
  • But now the compute time is similar to the I/O time, so halving I/O latency is only a 25% speedup
• OMG - if we eliminate I/O altogether, we can only go 2x faster. Where did the orders of magnitude go?! (the arithmetic is worked through below)
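Worked through with the SLOB-style figures from the previous slide (~200µs of CPU and ~200µs of single block read per operation), this is a minimal sketch of why the gains flatten out:

    SQL> SELECT ROUND((200+200)/(200+100), 2) AS speedup_half_io,
      2         ROUND((200+200)/(200+0), 2)   AS speedup_no_io
      3    FROM dual;

    SPEEDUP_HALF_IO SPEEDUP_NO_IO
    --------------- -------------
               1.33             2

Halving I/O latency yields 1.33x (a 25% reduction in elapsed time), and eliminating I/O entirely caps out at 2x: classic Amdahl's Law, with compute now the dominant term.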
BANDWIDTH: BIG QUERIES
• It is rare that 'ad hoc query' exists in reality:
  • Sure, submit the query
  • But it might not come back until next Tuesday
  • Oh, and everyone else will suffer while it runs

THE REALITY: THE DBA'S PLIGHT!
• Physical schema mitigations are adopted to minimize the data scan volume:
  • Materialized Views
  • Secondary indexes
  • Fine grain subpartitioning
  • Even Smart Scan - a non-deterministic workaround

AN EXPERIMENT
• Exam question: how much do Materialized Views actually help with runtimes when you have next-generation I/O horsepower?

DISSECTING THE QUERY: TOP LEVEL
[Query diagram: the top level projects cur_year_sales_cnt, prev_year_sales_cnt, sales_count_diff and sales_amount_diff from a join of the ALL_SALES inline view filtered to the current year (WHERE year=2000) with ALL_SALES filtered to the previous year (WHERE year=1999)]

DISSECTING THE QUERY: MAIN BLOCK
[Query diagram: the 'NET SALES' query block projects year, brand_id, class_id, category_id, manufact_id, cs_quantity - COALESCE(cr_return_quantity,0) AS sales_cnt, and cs_ext_sales_price - COALESCE(cr_return_amount,0.0) AS sales_amt. It joins date_dim and item (filter: WHERE category='Shoes') with a Sales table (i.e. WEB_SALES, CATALOG_SALES or STORE_SALES) and its Returns table; table sizes of 2-6 billion and 7-14 billion rows are marked]

DISSECTING THE QUERY: UNION
[Query diagram: the ALL_SALES inline view is Net Sales (Store) UNION Net Sales (Catalog) UNION Net Sales (Web)]

THE TEST
• Materialize the main query block of each of the three sales channels: STORE, CATALOG and WEB

RESULTS
[Chart: full query versus MV-optimized query. Data volume scanned: 4.6TB versus 1.3TB (72% less data). Query runtime: 4.5 minutes versus 3.4 minutes - only a 24% runtime reduction]

WHY ONLY A SMALL SPEEDUP?
• DSSD D5 makes the I/O portion of the query much less significant in the total runtime
• The remaining work, such as CPU compute, serialization, and inter-node communication, remains constant

D5 VERSUS A TYPICAL ALL-FLASH ARRAY
[Chart: complex query runtime in minutes (shorter is better) for the DSSD D5 versus an all-flash array, full query and MV-optimized query, with reductions of 64.2% and 24.4% marked]

D5 FULL QUERY VS. AFA MATERIALIZED VIEW
[Chart: complex query runtime in minutes (shorter is better) - the full query on the DSSD D5 versus the MV-optimized query on the all-flash array, with a 55.5% reduction marked]

BANDWIDTH MATTERS
• Full query running on DSSD D5 (with gas left in the tank):
[Screenshot: bandwidth utilization during the full query run]

SO WHAT?
• "Extreme Performance" is not just for "Extreme Workloads"
• As a DBA, you have only ever been able to deliver that which the hardware allows
• "Extreme Performance" is an enabler of business transformation

SOFTWARE: ALGORITHMS
• Until now, a cache miss meant certain death…
  • At least 50x slower, including code path
• Net result: algorithms carefully maximize cache hits, and the optimizer aggressively favors cached access paths

SOFTWARE: ALGORITHMS
• Next-gen storage:
  • The cost of a cache miss is much, much less
  • But the algorithms remain largely the same
  • Algorithms could be significantly more speculative in approach

SQL OPTIMIZER
• Should push more I/O out as large physical I/O requests
• Large index joins will become less relevant - a synchronous/serial pathology, and an inefficient join algorithm at scale
• Large PIO is async and parallel, and hash joins are highly effective (if you can spill to disk at a decent rate) - a sketch follows
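A hedged illustration on a hypothetical schema (orders, line_items): the first shape is the serial pathology, one synchronous block read per index visit; the second trades it for large, asynchronous multiblock reads feeding a hash join:

    -- Nested loops + index access: latency-bound, one block at a time
    SELECT /*+ USE_NL(l) INDEX(l) */ o.order_id, SUM(l.amount)
      FROM orders o JOIN line_items l ON l.order_id = o.order_id
     WHERE o.order_date >= DATE '2016-01-01'
     GROUP BY o.order_id;

    -- Full scans + hash join: bandwidth-bound, async and parallel
    SELECT /*+ FULL(o) FULL(l) USE_HASH(l) PARALLEL(8) */ o.order_id, SUM(l.amount)
      FROM orders o JOIN line_items l ON l.order_id = o.order_id
     WHERE o.order_date >= DATE '2016-01-01'
     GROUP BY o.order_id;

On storage like this, the second shape wins whenever the scans can be fed at tens of GB/s, which is exactly the point of this slide.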
WHAT'S MISSING?
• Data services that are coming (probably):
  • At-rest encryption
  • Snapshots
  • Replication
  • Full non-disruptive operations support (definitely, and soon)
• Things that will probably never come:
  • Data compression
  • Dedupe

ARCHITECTURES
• Tiering with D5
• Preferred Read Failure Group
[Architecture diagrams]

HADOOP/HDFS SUPPORT
• There is also an HDFS Datanode plugin

FLASH OPTIMIZED HDFS
• Traditional Hadoop (3 copies of data):
  • HDFS uses a replication factor of at least 3 for availability
  • This results in 3x+ data on persistent media
  • Not economical for flash
• Hadoop with DSSD D5 (1 copy of data):
  • Stores just one copy of data regardless of replication factor
  • The entire flash capacity is used for data
  • Increased data locality without using more capacity

SIMPLIFIED ARCHITECTURE: INDEPENDENT SCALING
• HDFS on DSSD storage: add a D5 to increase storage; add cluster nodes to increase compute
• Scale compute independent of storage
• Achieve an optimal asymmetric high performance balance
• Add additional performance as hardware evolves

HADOOP/HDFS SUPPORT
• Elimination of replication: the storage savings make the D5 price competitive with local SSDs
• Local data access is possible for every attached host, without storage multiplication
• Eliminates any key-hashing hotspots
• Run all of this (Oracle, Hadoop, filesystems) on the same storage platform

NEXT STEPS
• Moore's Law++: <12 month doubling in storage density
• Controller CPU and memory are also subject to Moore's Law: balanced growth
• Optane/3D XPoint: another order of magnitude

THANK YOU!
• Any questions?
• [email protected]
• @jamesmorle