Transcript
WHEN THE RULES CHANGE: Next Generation Oracle Database Architectures using Super-fast Storage - James Morle, EMC DSSD
© James Morle, June 2016
INTRO
thar be dragons
[Timeline slide: 1993 · 2001 · 2010 · 2015]
Disclaimer: I work for EMC these days, and use some of the corporate content, but all opinions here are my own - this is not an official company presentation.
“I/O certainly has been lagging in the last decade” - Seymour Cray 1976
THE ACCESS TIME GAP
[Chart: cost ($/GB) versus access time (ns) for DRAM, 3D XPoint, NAND flash and disk, illustrating the access time gap between memory and disk]
"Bandwidth is the work of man, latency is the realm of
" Jeff Bonwick, CTO and Founder, DSSD
© James Morle, June 2016
6
BIG PIPES ARE EASY
PERFORMANCE LATENCY
WHAT MATTERS WITH ORACLE WORKLOADS?
• DW/BI Workloads:
  • Multiblock read bandwidth
  • Sequential write bandwidth and latency
• OLTP Workloads:
  • Single block read latency
  • Sequential write latency
SO WHAT'S THE PROBLEM?
• Delivery of low latency I/O requires a low latency transport in addition to low latency media
• We have the media, currently NAND flash, but…
• Fibre Channel often adds up to 200 microseconds of latency (a way to observe end-to-end latency is sketched below)
• This needs something new, and fit for purpose… let's start with the software
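As a hedged illustration only (not an official DSSD procedure; the device name is simply borrowed from the device listing shown later in the deck), end-to-end single-block read latency - media plus transport - can be observed from a Linux host with a generic tool such as fio:

# 8KB random reads, queue depth 1, direct I/O - the reported completion latency
# percentiles show the full host-to-media round trip. --readonly guards against
# accidental writes to a device that may hold database data.
$ fio --name=single-block-read --filename=/dev/dssd0000 --readonly \
      --rw=randread --bs=8k --iodepth=1 --direct=1 --ioengine=libaio \
      --runtime=30 --time_based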
DSSD BLOCK DEVICE ACCESS TO DSSD
[Diagram: I/O path through the software, kernel and hardware layers]
• Traditional path (300µS to 5,000µS): Application / Libraries / System Call (software) → POSIX File System / Volume Mgr. / Device Driver (kernel) → PCIe HBA → SAS/SATA → Device Controller → Disk/NAND (hardware)
• DSSD path (<120µS): Application / Libraries / libflood (software) → User DMA Port or DSSD Block Driver (kernel) → PCIe Client Card → DSSD I/O Module → DSSD Flash Module (hardware)
• Block device access adds a bit more latency than direct libflood access, due to kernel overhead
DSSD FM VS OTHER FLASH STORAGE: SIMPLER AND FASTER FLASH MODULES
[Diagram: firmware stack of standard SSD drives versus DSSD D5 Flash Modules]
• Standard flash devices: each SSD drive's ASIC implements its own Flash Physics, Wear Leveling, Garbage Collection, Defect Mgmt, FTL, ECC and Vaulting
  • Complex firmware, limited power
  • Independently managed media
• DSSD D5: simple, fast Flash Modules (FMs) implement only ECC and Vaulting
  • The Control Module (CM), with rich resources, implements the advanced global algorithms: Flash Physics, Wear Leveling, Garbage Collection, Defect Mgmt, FTL and Cubic RAID
HARDWARE + SOFTWARE RESILIENCE: SYSTEM-WIDE DATA PROTECTION
Always-On Cubic RAID
• Cubic RAID has 2x greater reliability than other RAID schemes, with the same overhead (17%)
• The Cubic RAID grid is an interlocked, multidimensional array of multi-page "cells" of NAND die
• High performance - always on
DSSD D5 - 5U RACK-SCALE FLASH PLATFORM: DENSE AND SHARED FLASH
Flash and CMs
• 36 Flash Modules (FMs); 18 Flash Modules when half populated
• 2TB/4TB Flash Modules today; larger FMs on the roadmap
• Dual-ported PCIe Gen 3 x4 per FM
• Dual-redundant Control Modules (CMs), PCIe Gen 3 connected
DSSD D5 - 5U RACK-SCALE FLASH PLATFORM: IOMs, FANS, POWER SUPPLIES
• Redundant power supplies x4
• Dual-redundant I/O Modules (IOMs), PCIe Gen 3 connected
• 48 PCIe Gen 3 x4 client ports per IOM; a total of 96 PCIe Gen 3 x4 client port connections per D5
• Redundant fan modules x5
NOISY NEIGHBORS
• In all other (non-D5) storage solutions, data is served by CPUs
  • CPUs execute the code to service HBA requests, check caches, request data from media, and so on
• CPU is a relatively scarce resource, and prone to abuse by certain sessions/systems/users - the noisy neighbors
• When CPU resource is unavailable, response times degrade rapidly and exponentially
[Diagram: request path through a conventional array - network HBA (FC or IP), CPU and CPU memory, media HBA, persistent media]
1. Request arrives
2. CPU accepts interrupt, checks CPU memory for cached copy
3. If found, skip to 8. If not, continue
4. CPU forwards request to Media HBA
5. HBA makes request from persistent media
6. Media locates data and responds
7. HBA forwards data to CPU
8. CPU forwards data to Network HBA
9. Return data to host
NOISY NEIGHBOURS
• In DSSD D5, data is self-service
  • Hosts have full access to 18,432 flash chips, a much less scarce resource
  • Data is spread thinly across those chips, minimizing contention
• All data transfers, read and write, are direct DMA between host and flash
• The D5 has so much performance capacity, compared to other platforms, that the likelihood of a single errant system affecting others is greatly reduced
Much less prone to Noisy Neighbour Syndrome!
[Diagram: self-service request path - host, CPU, flash media]
1. Request arrives (as DMA write of requested LBA)
2. CPU writes DMA directly to appropriate Flash Module
3. Flash Module returns data via DMA write to host
PERFORMANCE ORIENTED ARCHITECTURE
[Diagram: D5 internals - I/O Module PCIe ports on either side of the Flash Modules and Control Module CPUs]
WHAT DOES ALL THIS GIVE US?
• Marketing 'hero' numbers (real, but using artificial tools):
  • 100TB usable
  • 100GB/s bandwidth
  • 100µs latency
  • 10 million IOPs (4KB)
  • 5U rack space
• Proven Oracle numbers:
  • 100TB usable
  • 60GB/s bandwidth into Oracle
  • 140µs latency
  • 5.3 million IOPs (8KB, SLOB)
  • 5U rack space
AND THERE'S MORE…
• Up to two D5s are currently supported on a single system
• Proven Oracle numbers:
  • 200TB usable
  • 120GB/s bandwidth into Oracle
  • 140µs latency
  • 10.6 million IOPs (8KB, SLOB)
  • 10U rack space
NEW RULES
• D5 has NO cache - everything is fast
• You just have a full 100TB usable 'working set'
TRADITIONAL STORAGE
[Diagram: a small, FAST cache in front of SLOW persistent storage, with data motion between the two tiers]

D5 STORAGE
[Diagram: FAST persistent storage holding the entire dataset - no cache tier, no data motion]
WHAT DOES IT LOOK LIKE TO A DBA?
• Familiar block-driver interface:
  • i.e. /dev/dssdXXXX devices
• Fully shared disk
• Multipathing is fully automatic and invisible
• No child devices exposed, no tunables
• udev rules recommended to create friendly names (a sketch follows below)
• Reference documentation is the "Oracle Databases on DSSD D5 - Best Known Methods" paper
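A minimal sketch of such a udev rule, assuming the /dev/dssdXXXX kernel names are matched directly - the device and ASM names here are taken from the listing on the next slide, and the authoritative match keys and ownership settings are in the "Best Known Methods" paper:

# /etc/udev/rules.d/99-dssd-asm.rules (illustrative example only)
# Present the raw DSSD block device under /dev/asmdisks with an ASM-friendly
# name, owned by the grid infrastructure user.
KERNEL=="dssd0030", OWNER="grid", GROUP="asmadmin", MODE="0660", SYMLINK+="asmdisks/OraOCR000441_00"

# Apply without a reboot:
$ udevadm control --reload-rules && udevadm trigger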
WHAT DOES IT LOOK LIKE TO A DBA?
# ls -l /dev/asmdisks
total 0
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraOCR000441_00 -> ../dssd0030
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraOCR000441_01 -> ../dssd0031
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraOCR000444_00 -> ../dssd0028
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraOCR000444_01 -> ../dssd0029
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraRedo000441_00 -> ../dssd0000
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraRedo000441_01 -> ../dssd0001
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraRedo000444_00 -> ../dssd0026
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraRedo000444_01 -> ../dssd0027
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraVol000441_00 -> ../dssd0032
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraVol000441_01 -> ../dssd0033
lrwxrwxrwx 1 root root 11 Feb 11 20:19 OraVol000441_02 -> ../dssd0034
WHAT DOES IT LOOK LIKE TO A DBA?
SQL> l
  1* select group_number,path,name,failgroup,mount_status from v$asm_disk order by 1,4,3
SQL> /

GROUP_NUMBER PATH                                     NAME                 FAILGROUP            MOUNT_S
------------ ---------------------------------------- -------------------- -------------------- -------
           0 /dev/asmdisks/OraFRA000441_03                                                       CLOSED
           0 /dev/asmdisks/OraVol000441_11                                                       CLOSED
           0 /dev/asmdisks/OraVol000444_06                                                       CLOSED
           0 /dev/asmdisks/OraVol000444_03                                                       CLOSED
           0 /dev/asmdisks/OraRedo000444_00                                                      CLOSED
           0 /dev/asmdisks/OraVol000444_09                                                       CLOSED
           0 /dev/asmdisks/OraVol000444_01                                                       CLOSED
… etc
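For completeness, a hedged sketch of how a pair of those devices might be presented to ASM - the diskgroup name and redundancy choice are illustrative assumptions, not recommendations from the paper:

SQL> CREATE DISKGROUP REDO EXTERNAL REDUNDANCY
  2  DISK '/dev/asmdisks/OraRedo000441_00',
  3       '/dev/asmdisks/OraRedo000441_01';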
ELIMINATION OF COMPLEXITY
• No dm-multipath or PowerPath
  • Purpose-built, high-performance multipathing is integral to the client drivers
  • Only a single device name is exposed; all detail is handled by the driver
• No manipulation of I/O elevators
  • NOOP is forced (a quick check is shown below)
  • Everything is 4KB anyway (blkdev)
• DMA access, with separate submission and completion queues
  • No queue tuning - DMA enqueue is so fast that it is largely unnecessary - but we make an exception for redo
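A quick, generic way to confirm the elevator on a Linux block device - the device name is just one from the earlier listing, and the output shown is what a noop-forced device would report on a pre-blk-mq kernel:

$ cat /sys/block/dssd0000/queue/scheduler
[noop] deadline cfq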
WHICH DIMENSION MATTERS?
• Bandwidth?
• Latency?
• IOPs?
• Nobody actually needs 5.3M IOPs, but they are a side effect of the bandwidth and low latency - which people DO need!
ANALYSIS OF DB TIME
• Low latency storage dramatically alters the split of time for a process
• Using SLOB:
  • Traditional storage: ~200µs CPU, ~6,000µs single block read - a 30:1 ratio
  • D5: ~200µs CPU, ~200µs single block read (at high load) - a 1:1 ratio
LATENCY: SYNCHRONOUS I/O
• Oracle workloads are most frequently dependent on synchronous I/O
  • Index traversal and Nested Loop joins (serial I/O pathology)
  • Log writer (redo bandwidth is proportional to write latency)
• Latency is now so low that the returns diminish from here (see the arithmetic below):
  • Reducing disk latency from 6ms to 3ms was almost a 2x speedup
  • But now the compute time is similar to the I/O time - halving I/O latency is only a 25% speedup
  • OMG - if we eliminate I/O altogether, we can only go 2x faster. Where did the orders of magnitude go?!
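The arithmetic behind those ratios, using the ~200µs CPU figure from the SLOB analysis - speedup = (CPU + old I/O) / (CPU + new I/O), times in microseconds, illustrative numbers only:

$ echo 'scale=2; (200 + 6000) / (200 + 3000)' | bc   # disk era: 6ms -> 3ms reads
1.93
$ echo 'scale=2; (200 + 200) / (200 + 100)' | bc     # D5 era: halving a 200µs read
1.33
$ echo 'scale=2; (200 + 200) / (200 + 0)' | bc       # eliminating I/O entirely
2.00

A 1.33x speedup is the ~25% runtime reduction quoted above, and 2x is the ceiling once compute time equals I/O time.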
BANDWIDTH: BIG QUERIES
• It is rare that 'ad hoc query' exists in reality:
  • Sure, submit the query
  • But it might not come back until next Tuesday
  • Oh, and everyone else will suffer while it runs
THE REALITY: THE DBA'S PLIGHT!
• Physical schema mitigations are adopted to minimize the data scan volume:
  • Materialized Views
  • Secondary Indexes
  • Fine-grain subpartitioning
  • Even Smart Scan - a non-deterministic workaround
AN EXPERIMENT
• Exam question: how much do Materialized Views actually help with runtimes when you have next-generation I/O horsepower?
DISSECTING THE QUERY: TOP LEVEL
[Diagram: the top level joins the ALL_SALES inline view for the current year (filtered WHERE year=2000) to ALL_SALES for the previous year (filtered WHERE year=1999), projecting cur_year_sales_cnt, prev_year_sales_cnt, sales_count_diff and sales_amount_diff]
DISSECTING THE QUERY: MAIN BLOCK (THE 'NET SALES' QUERY BLOCK)
[Diagram: the main query block joins date_dim, item and a Sales table (WEB_SALES, CATALOG_SALES or STORE_SALES) to the corresponding Returns table - tables in the 2-6 billion and 7-14 billion row range - with the filter WHERE category='Shoes', projecting year, brand_id, class_id, category_id, manufact_id, cs_quantity-COALESCE(cr_return_quantity,0) AS sales_cnt and cs_ext_sales_price-COALESCE(cr_return_amount,0.0) AS sales_amt]
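For orientation, here is a sketch of what that 'NET SALES' block looks like for the catalog channel, reconstructed from the expressions on the slide. The schema is TPC-DS style, and the join keys and table aliases are assumptions rather than the exact query used in the test:

SELECT d_year                                                 AS year,
       i_brand_id                                             AS brand_id,
       i_class_id                                             AS class_id,
       i_category_id                                          AS category_id,
       i_manufact_id                                          AS manufact_id,
       cs_quantity        - COALESCE(cr_return_quantity, 0)   AS sales_cnt,
       cs_ext_sales_price - COALESCE(cr_return_amount, 0.0)   AS sales_amt
FROM   catalog_sales
       JOIN item     ON i_item_sk = cs_item_sk
       JOIN date_dim ON d_date_sk = cs_sold_date_sk
       LEFT JOIN catalog_returns
              ON cs_order_number = cr_order_number
             AND cs_item_sk      = cr_item_sk
WHERE  i_category = 'Shoes';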
DISSECTING THE QUERY: UNION
[Diagram: the ALL_SALES inline view is the union of the Net Sales blocks for the Store, Catalog and Web channels]
THE TEST
• Materialize the main query block for each of the three sales channels: WEB, CATALOG and STORE (a sketch of one such materialization follows)
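One plausible shape for such a materialization - hypothetical, since the MV definitions used in the test are not shown - is a query-rewrite-enabled MV over the catalog-channel block sketched earlier:

CREATE MATERIALIZED VIEW catalog_net_sales_mv
  PARALLEL
  BUILD IMMEDIATE
  ENABLE QUERY REWRITE
AS
SELECT d_year, i_brand_id, i_class_id, i_category_id, i_manufact_id,
       cs_quantity        - COALESCE(cr_return_quantity, 0)  AS sales_cnt,
       cs_ext_sales_price - COALESCE(cr_return_amount, 0.0)  AS sales_amt
FROM   catalog_sales
       JOIN item     ON i_item_sk = cs_item_sk
       JOIN date_dim ON d_date_sk = cs_sold_date_sk
       LEFT JOIN catalog_returns
              ON cs_order_number = cr_order_number
             AND cs_item_sk      = cr_item_sk;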
RESULTS
72% less data scanned, but only a 24% runtime reduction:

                            Full Query    MV Optimized Query
Data Volume Scanned (TB)    4.6           1.3
Query Runtime (min)         4.5           3.4
WHY ONLY A SMALL SPEEDUP?
• DSSD D5 makes the I/O portion of the query much less significant in the total runtime
• The remaining work - CPU compute, serialization and inter-node communication - remains constant
D5 VERSUS A TYPICAL ALL-FLASH ARRAY
[Chart: complex query runtime in minutes (shorter is better) for the Full Query and the MV Optimized Query on the DSSD D5 and on an all-flash array. The MV optimization cuts runtime by 64.2% on the all-flash array, but by only 24.4% on the D5.]
D5 FULL QUERY VS. AFA MATERIALIZED VIEW
[Chart: complex query runtime in minutes (shorter is better). The full, unoptimized query on the DSSD D5 completes 55.5% faster than the MV-optimized query on the all-flash array.]
BANDWIDTH MATTERS
• Full query running on DSSD D5 (with gas left in the tank)
SO WHAT?
• "Extreme Performance" is not just for "Extreme Workloads"
• As a DBA, you have only ever been able to deliver what the hardware allows
• "Extreme Performance" is an enabler of business transformation
SOFTWARE: ALGORITHMS
• Until now, a cache miss meant certain death…
  • At least 50x slower, including the code path
• Net result: algorithms carefully maximize cache hits, and the optimizer aggressively favours cached access paths
SOFTWARE: ALGORITHMS
• Next-gen storage:
  • The cost of a cache miss is much, much less
  • But the algorithms remain largely the same
  • Algorithms could be significantly more speculative in approach
SQL OPTIMIZER
• Should push more I/O out as large physical I/O requests (an illustrative hint set follows)
  • Large index joins will become less relevant - a synchronous/serial pathology and an inefficient join algorithm at scale
  • Large PIO is asynchronous and parallel, and hash joins are highly effective (if you can spill to disk at a decent rate)
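As a purely illustrative example of nudging a plan away from index-driven nested loops and towards large, parallel physical I/O with a hash join - the table, column and alias names are hypothetical:

SELECT /*+ FULL(s) FULL(c) USE_HASH(c) PARALLEL(8) */
       c.cust_segment, SUM(s.amount_sold)
FROM   sales s
       JOIN customers c ON c.cust_id = s.cust_id
GROUP  BY c.cust_segment;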
WHAT'S MISSING?
• Things that are coming:
  • Data Services (probably):
    • At-rest Encryption
    • Snapshots
    • Replication
  • Full Non-disruptive Operations support (definitely, and soon)
• Things that will probably never come:
  • Compression
  • Dedupe
ARCHITECTURES
• Tiering with D5
• Preferred Read Failure Group (an illustrative ASM setting follows)
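For the preferred-read approach, the standard ASM mechanism is the asm_preferred_read_failure_groups parameter. The diskgroup, failure group and instance names below are hypothetical, assuming a normal-redundancy diskgroup mirrored between a D5 failure group and a legacy-array failure group:

SQL> ALTER SYSTEM SET asm_preferred_read_failure_groups = 'DATA.FG_D5' SID='+ASM1';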
HADOOP/HDFS SUPPORT
• There is also an HDFS DataNode plugin
FLASH OPTIMIZED HDFS
[Diagram: three servers running traditional Hadoop (3 copies of data) versus three servers sharing a DSSD D5 (1 copy of data)]
• Traditional Hadoop:
  • HDFS uses a replication factor of at least 3 for availability
  • Results in 3x+ data on persistent media
  • Not economical for flash
• Hadoop with DSSD D5:
  • Stores just 1 copy of data regardless of replication factor
  • Uses the entire flash capacity for data
  • Increases data locality without using more capacity
SIMPLIFIED ARCHITECTURE: INDEPENDENT SCALING
[Diagram: with traditional HDFS, increasing compute requires adding cluster nodes; with HDFS on DSSD, add a D5 to increase storage]
• Scale compute independently of storage
• Achieve an optimal asymmetric high-performance balance
• Add additional performance as hardware evolves
HADOOP/HDFS SUPPORT
• Elimination of replication:
  • Storage savings make the D5 price-competitive with local SSDs
  • Local data access is possible for every attached host without storage multiplication
  • Eliminates any key-hashing hotspots
• Run all of this - Oracle, Hadoop, filesystems - on the same storage platform
NEXT STEPS
• Moore's Law++ - roughly a 12-month doubling in storage density
• Controller CPU and memory are also subject to Moore's Law, so growth stays balanced
• Optane/3D XPoint - another order of magnitude
THANK YOU!
• Any questions?
• [email protected]
• @jamesmorle