Transcript
Low latency edge computing with QEMU/KVM: Challenges and future
Mihai Caraman, PhD | Virtualization Architect | August 2015
Agenda
• 5G Requirements
• Base Station Virtualization
• RT KVM
− Embedded PA and ARMv8
• I/O Virtualization
− Direct Assignment
− Virtio
• vSwitch Offload
• OpenStack Integration
• Tuning & Performance
• Future
5G Requirements
• Air interface enhancements
• Flexible deployment (small cell, macro, C-RAN) through SDN and NFV (Source: METIS)
[Figure: end-to-end latency budget from UE through eNB (PDCP crypto, IPsec), EPC, metro, and core network to a virtualized server; segment latencies of 1 ms, 5 ms, 12 ms, 20 ms, and 20 ms, for a total of < 80 ms.]
Base Station Virtualization
[Figure: proprietary BS appliance vs. virtualized deployment. The proprietary appliance integrates L3/control plane, PDCP, MAC/RLC, L2/QoS, and L1 (DSP + accelerator) over an on-chip/on-board interconnect (AXI, sRIO, PCIe). The virtualized design runs L3/control-plane, PDCP, and MAC/RLC vAppliances with SEC offload on an ARMv8 compute farm in an OpenStack cloud with SDN transport control; an iNIC (on-chip or off-chip) provides L2/QoS, with transport over 10+ GbE (on-board or physically remote) down to L1 (DSP + accelerator).]
Virtual Base Station - PoC
• Scenarios
− L2/L3 stack in a VM, end-to-end video download using a commercial LTE dongle
− PDCP scaling
• Platform Phase I
− QorIQ T4240: PA Book III-E, Security Engine
− QorIQ T4240 PCI SR-IOV intelligent NIC
• Platform Phase II
− QorIQ LS2085A: ARMv8; DPAA2 devices with Management Complex (MC) configuration bus; Advanced I/O Processor (AIOP) integrated NPU
Virtualization Requirements
Timing and latency requirements
• Transmission Time Interval (TTI)
− Synchronized between L2 and L1 at 1 ms
− Provisioned through GPS, IEEE 1588/PTP (an interrupt for demo purposes)
− L1 with a 1 Gbps interface adds 150 µs of latency; 10 Gbps: 15 µs
− TTI IRQ delivered to the guest user-space application with ~50 µs max latency
• KVM
− RT Linux guest
− One RT vcpu per cpu
− CPU oversubscription (nice to have)
• I/O virtualization
− Direct assignment
− Virtio
• OpenStack integration
[Figure: RT KVM host running an RT Linux guest; the TTI IRQ is delivered into the user-space vBTS application; Ethernet and SEC devices sit behind iNIC/NPU packet processing.]
RT KVM - PA Book III-E
MPIC emulation
• Replaced spinlocks with raw spinlocks; the PREEMPT_RT spinlock implementation uses a sleeping mutex
• RT-friendly refactoring (to do):
− Increase lock granularity
− Minimize code paths that run with interrupts disabled
− Avoid races with (lazy) pending exceptions
Timer
• Moved the processing of the KVM decrementer from softirq (ksoftirqd kthread) to hardirq context
• Replaced wait queues with simple wait queues; wait queue callbacks prevent the use of raw spinlocks
• Removed an unnecessary tasklet used by the hrtimer
RT KVM - Timer Interrupt Latency
[Figure: guest timer interrupt delivery timelines, initial vs. RT. Each shows low-priority stress tasks and a high-priority cyclictest on a vcpu (HWI), with the chain interrupt → timer ISR → IPI/schedule → vcpu; in the RT case the timer ISR and IPI/schedule are serviced promptly ahead of the stress load.]
RT KVM - TTI IRQ & Latency Tracer
TTI IRQ
• GPIO assigned
− Fast-path delivery
• GPIO interrupt affinity
• Extensive host and guest debug statistics
Latency Tracer
• A mix of two of the available ftrace modes of operation (driven from the shell as sketched below):
− function-trace / function-graph
− max latency: retain the maximum latency of this execution chain
• Exposes the maximum execution latency for injecting the TTI interrupt into the guest, and the associated code path
• Used to analyze the causes of TTI interrupt delivery latency
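The custom tracer itself is not part of stock kernels; as a point of reference, here is how the two ftrace facilities it combines, plus the GPIO interrupt affinity mentioned above, are driven from the shell (the IRQ variable and filter pattern are illustrative, not from the deck):

# pin the TTI GPIO interrupt to the isolated core (IRQ number illustrative)
echo 20 > /proc/irq/$TTI_IRQ/smp_affinity_list
# stock ftrace: graph the injection code path (filter pattern illustrative)
cd /sys/kernel/debug/tracing
echo function_graph > current_tracer
echo 'kvmppc_*' > set_ftrace_filter
echo 1 > tracing_on
# ... generate TTI interrupts, then stop and inspect
echo 0 > tracing_on
head -50 trace
# stock max-latency retention, as done by the irqsoff tracer
echo irqsoff > current_tracer
cat tracing_max_latency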
RT KVM - Configuration

# defconfig
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_CPU_STALL_TIMEOUT=300
# CONFIG_MMC is not set

# bootargs
isolcpus=17-23 rcu_nocbs=17-23

# disable RT throttling
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo -1 > /proc/sys/kernel/sched_rt_period_us

# disable RCU stall warnings
echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress
echo 0 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

# taskset & chrt
taskset -c $CPU_QEMU qemu-system-x …
chrt -p 95 $QEMUPID
taskset -pc $CPU_VCPU $VCPUPID
chrt -p 95 $VCPUPID
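A usage note: the vCPU thread ID used as $VCPUPID above can be read from the QEMU monitor, assuming QEMU was started with -monitor unix:/tmp/mon.sock,server,nowait (socket path illustrative):

# HMP "info cpus" reports one thread_id per vCPU; use it as $VCPUPID
echo info cpus | nc -U /tmp/mon.sock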
RT KVM - PV-SCHED
Para-virtualized scheduling ported to PA Book III-E
• Follows Jan Kiszka's original x86 implementation, adapted for PA and rebased on newer kernels
• The significant differences are in interrupt handling and delivery to the guest
I/O Virtualization - Direct Assignment
PCI SR-IOV VF direct assignment (VFIO PCI)
• Place VFs in different IOMMU groups
− Access Control Services quirks
• PCI EP partitioning PoC
− Device ID allocation and programming
− Enable IOMMU entries after each device attach
− Dump IOMMU information for debugging
• Arch fix-ups are evil
− QEMU PCI EP inadvertently hidden
• Memory translation
• MSI interrupt affinity
PCI DMA memory translation chain:
T4-iNIC: iNIC app (receives physical/IOVA address from driver)
→ T4-iNIC: DMA engine
→ T4-iNIC: outbound PCI ATMU
→ PCI bus (bus:dev:func)
→ T4-RDB: inbound PCI ATMU
→ T4-RDB: PCI controller maps bus:dev:func to LIODN
→ T4-RDB: PAMU (IOVA to real address)
→ T4-RDB: DDR
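A minimal sketch of the standard VFIO sequence this PoC builds on; the PCI addresses and device ID are illustrative (1957 is the Freescale PCI vendor ID), not taken from the deck:

# create VFs on the PF, then hand one VF to vfio-pci
echo 2 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
echo 0000:01:00.1 > /sys/bus/pci/devices/0000:01:00.1/driver/unbind
echo 1957 0809 > /sys/bus/pci/drivers/vfio-pci/new_id
# the VF must sit alone in its IOMMU group before QEMU may claim it
ls /sys/bus/pci/devices/0000:01:00.1/iommu_group/devices
# assign it to the guest
qemu-system-ppc64 … -device vfio-pci,host=01:00.1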
I/O Virtualization - Direct Assignment (cont.)
Security engine direct assignment (VFIO for platform devices)
• QEMU glue code
• Physical and virtual functions
Management Complex bus device assignment (VFIO MC)
• VFIO for Management Complex
− QEMU integration
− Legacy interrupts with irqfd support
− Performance improvements
KVM ARM support for direct I/O
• Guest caching attributes (I/O portal)
• I/O portal HW interrupt coalescing
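For the platform-device cases, the upstream host-side binding step looks like the sketch below; the device node name is hypothetical, and the QEMU side depends on the glue code described above rather than on an upstream device name:

# bind a platform device (e.g. the security engine) to vfio-platform
echo vfio-platform > /sys/bus/platform/devices/ffe300000.crypto/driver_override
echo ffe300000.crypto > /sys/bus/platform/drivers/vfio-platform/bind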
I/O Virtualization - virtio-crypto
Provides binary compatibility across HW platforms (from different vendors) and machine migration
Virtio-crypto device
• Supplies cryptographic offloading for the guest
Cryptographic transformations
• ablkcipher: block ciphers (encryption)
• ahash: hashing (authentication)
• aead: authentication and encryption in one job
Virtqueues
• session, crypto, control
PoC
• QEMU integration
• Linux frontend driver
• vhost-crypto over the crypto API
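For orientation only: the virtio-crypto device that later landed in upstream QEMU is instantiated along these lines (option names from later QEMU releases, not from this PoC):

qemu-system-aarch64 … \
  -object cryptodev-backend-builtin,id=cryptodev0 \
  -device virtio-crypto-pci,id=crypto0,cryptodev=cryptodev0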
Packet Processing Performance
[Figure: each VM sits behind a software stack of IPtables + SYN cookie, OF vSwitch, VxLAN decap/demux, and IP reassembly + IPsec.]
• Increasing complexity of the infrastructure stack
• Performance bottleneck from the software implementation of the networking stack

Native networking stack performance: throughput by frame size (bytes)

                                    64    390   390 (1K conn)   1024   1472
TCP/IP                             370   2205       2205        5722   8042
OVS                                279   1652       1514        4346   6097
IPtables + OVS                     195   1194       1051        3085   4334
OVS + VxLAN                        181   1072        914        2737   2906
IPtables + OVS + VxLAN             136    824        639        2080   2365
IPtables + OVS + VxLAN + IPsec      49    146         98         190    197
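Each row of the table adds a layer to the datapath; the IPtables + OVS + VxLAN configuration, for instance, corresponds to a setup along these lines (bridge names and addresses illustrative):

# OpenFlow vSwitch with a VxLAN tunnel port
ovs-vsctl add-br br0
ovs-vsctl add-port br0 vxlan0 -- set interface vxlan0 \
    type=vxlan options:remote_ip=192.0.2.2 options:key=100
# IPtables + SYN cookie layer in front of the VM
echo 1 > /proc/sys/net/ipv4/tcp_syncookies
iptables -A FORWARD -p tcp --syn -m conntrack --ctstate NEW -j ACCEPT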
Packet Processing Offload
• Limited GPP involvement (management/control plane only)
• Offload as much packet processing as possible to the NPU/iNIC
• NPU implements the fast path
− Direct connectivity to the VM
− Faster connection rate
− IP table policy caching
− Entire OF pipeline processing for switching
− All OF-based data paths
− IPsec dataplane
[Figure: the host splits into management (Nova compute, Neutron agent, ovs-vsctl, ovs-ofctl, OVSDB, OF agent), a control path (kernel TCP/IP, IPtables, OF control) spanning user and kernel space, and a datapath offloaded to the NPU/iNIC fast path: firewall, switching, VxLAN, IPsec, with direct connectivity to the VMs.]
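The control path stays on the GPP and still speaks standard OpenFlow; a trivial flow pushed through ovs-ofctl, which the NPU fast path would then execute (bridge and ports illustrative):

ovs-ofctl add-flow br0 "in_port=1,dl_type=0x0800,nw_dst=203.0.113.0/24,actions=output:2"
ovs-ofctl dump-flows br0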
OpenStack - Integration
[Figure: OpenStack controller (DB, dashboard) and compute node hosting L2 VMs; an NSCS vBTS plugin and NSCS relay agent distribute configuration through a config daemon to NSCS agents on the BSC9131 (B9131) and the EPC, following numbered configuration flows 1-8.]
Network Service Configuration Stack:
• Single dashboard for configuration relay
• Service function chaining support
• OF-Controller, OF-Switch onboarding
• Dynamic scalability
• AMQP fan-out exchange
• Configuration versioning and delta support
OpenStack - L1 & eNodeB integration
Tuning & Performance
Latency benchmarks
• Cyclictest
− Stress: coremark, lmbench
• L2 application
Networking benchmarks
• Iperf
Performance tools (representative invocations below)
• perf kvm
• CPU statistics
• KVM ARM CPU accounting
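Representative invocations of the benchmarks and tools named above (the core number matches the isolcpus range from the configuration slide; server address and durations illustrative):

# latency under load on an isolated core, 12 h, histogram output
cyclictest -m -a 17 -t 1 -p 95 -i 1000 -h 100 -q -D 12h
# networking throughput, 750 parallel TCP flows
iperf -c 192.0.2.1 -P 750 -t 60
# guest exit statistics for the QEMU process
perf kvm stat record -p $QEMUPID sleep 60
perf kvm stat report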
RT KVM PA Latency Results - Cyclictest
Native vs. virtualized - 1 CPU, core isolation, 12 h

CPU stress, latency (cycles):
        Native   Virtual
Min       1660      2770
Avg       2220      6660
Max       3330     11660

Memory stress, latency (cycles):
        Native   Virtual
Min       1660      2220
Avg       2770      7220
Max      13880     25550
RT KVM PA Latency Results - Cyclictest
Virtualized pv-sched - host & guest coremark stress, 15 min

CPU stress (PV-SCHED), latency (cycles):
Min       5550
Avg      14440
Max      27770

Prio host coremark: 10
Prio guest coremark: 0
Virtual Base Station - Benchmarking Setup
[Figure: benchmarking topology. A traffic generator drives 80 Gbps of bidirectional traffic over 4x 10 Gig links into a metro switch controlled by an OpenDaylight controller. A T4240-iNIC hosts VM1 and VM2; BSC9131RDB-1 and BSC9131RDB-2 boards (the latter used for latency measurement), a UE, and the EPC attach over 10 Gig and 1 Gig links.]
RT KVM PA Latency Results - TTI External Interrupt
Virtualized - LTE L2 stress

LTE L2 stress, latency (us):
        Virtual
Min        8330
Avg       13880
Max       29444
KVM ARMv8 - Networking Performance Degradation
Iperf TCP client, 750 flows - DPAA2 direct assignment
• 1 network interface
• 1 VM, 1 vCPU
[Chart: performance degradation (0-40%) vs. IRQ coalescing time (0-80 µs), with the best-absolute-performance point marked.]
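On NICs with standard coalescing controls, the sweep in the chart can be reproduced with ethtool; whether the DPAA2 interface exposes this knob the same way is an assumption here, and the interface name is illustrative:

# sweep receive interrupt coalescing time (µs) and measure throughput
for usecs in 0 10 20 30 40 50 60 70 80; do
    ethtool -C eth0 rx-usecs $usecs
    iperf -c 192.0.2.1 -P 750 -t 30
done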
Future
• Scalability and performance
− Stack disaggregation
− Optimized I/O and accelerator access
− CPU oversubscription
• Fast guest interrupt delivery
• IOMMU emulation
Summary
• "5G" networks are enabled by SDN and NFV
• Base station virtualization with RT KVM
• I/O virtualization using direct assignment and virtio
• NFV packet processing offload
• Virtual base station integration with OpenStack
• Low interrupt latency and low network-performance degradation with KVM
Q&A
www.Freescale.com
Freescale and the Freescale logo are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. All other product or service names are the property of their respective owners. ARM and Cortex are registered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. © 2015 Freescale Semiconductor, Inc.
LS2085A

[Block diagram: 8x ARM A57 general-purpose processing cores (48KB L1-I + 32KB L1-D each) in clusters with banked L2; coherency fabric with 1MB platform cache, secure boot, TrustZone, power management, IOMMUs, prefetch; 2x 64-bit DDR2/3/DDR4 memory controllers plus a 32-bit DDR4 memory controller; queue manager, buffer manager, SEC, DCE, PME, Accelerated I/O Processor (AIOP), WRIOP (8x 1/10GbE + 8x 1GbE), L2 switch; 2x 8-lane 10GHz SERDES, 4x PCIe, 2x SATA 3.0, 2x USB 3.0 + PHY, SDXC/eMMC, flash controller, 2x DUART, 4x I2C, SPI, GPIO, JTAG.]

General Purpose Processing
• 8x ARM A57 CPUs, 64b, 2.0 GHz
• 4MB banked L2 cache
• HW L1 & L2 prefetch engines
• Neon SIMD in all CPUs
• 1MB L3 platform cache w/ECC
• 2x 64b DDR4 up to 2.4 GT/s

Datapath Acceleration
• 40 Gbps packet processing (AIOP)
• 20 Gbps SEC crypto acceleration
• 15 Gbps pattern match/RegEx (PME)
• 20 Gbps Data Compression Engine (DCE)
• 4MB Packet Express Buffer

Express Packet IO
• Supports 1x8, 4x4, 4x2, 4x1 PCIe Gen3 controllers
• 2x SATA 3.0, 2x USB 3.0 with PHY

Network IO
• Wire-rate I/O processor (WRIOP): 8x 1/10GbE + 8x 1GbE
• XAUI/XFI/KR and SGMII
• MACSec on up to 4x 1/10GbE

Other Parametrics
• 37.5 x 37.5 mm flip-chip, 1 mm pitch, 1292 pins
Express Packet IO • Supports1x8, 4x4, 4x2, 4x1 PCIe Gen3 controllers • 2 x SATA 3.0, 2 x USB 3.0 with PHY Network IO • Wire Rate IO Processor: • 8x1/10GbE + 8x1G • XAUI/XFI/KR and SGMII • MACSec on up to 4x 1/10GbE