Low latency edge computing with QEMU/KVM: Challenges and future
Mihai Caraman, PhD | Virtualization Architect
August 2015

Agenda
• 5G Requirements
• Base Station Virtualization
• RT KVM
  − Embedded PA and ARMv8
• I/O Virtualization
  − Direct Assignment
  − Virtio
• vSwitch Offload
• OpenStack Integration
• Tuning & Performance
• Future

5G Requirements
• Air interface enhancements
• Flexible deployment (small cell, macro, CRAN) through SDN and NFV (Source: METIS)
[Latency budget diagram: UE, eNB (PDCP crypto, IPSec), EPC, metro, core network, server; hop budgets of 1 ms, 5 ms, 12 ms, 20 ms, 20 ms; end-to-end < 80 ms]

Virtualized Base Station
[Diagram: a proprietary BS appliance (L3/control plane, PDCP, MAC/RLC, L2/QoS, transport, L1 DSP + accelerator over an on-chip/on-board AXI/sRIO/PCIe interconnect) contrasted with a virtualized split: an OpenStack cloud and SDN transport control over an ARMv8 compute farm running vAppliances (L3/control plane, PDCP + SEC, MAC/RLC, L2/QoS), a transport iNIC (on-chip or off-chip), and 10+ GbE (on-board or physically remote) down to L1 (DSP + accelerator)]

Virtual Base Station - PoC
• Scenarios
  − L2/L3 stack in a VM, end-to-end video download using a commercial LTE dongle
  − PDCP scaling
• Platform Phase I
  − QorIQ T4240 (PA Book III-E, Security Engine)
  − QorIQ T4240 PCI SR-IOV intelligent NIC
• Platform Phase II
  − QorIQ LS2085A (ARMv8)
  − DPAA2 devices with Management Complex (MC) configuration bus
  − Advanced I/O Processor (AIOP) integrated NPU

Virtualization requirements
Timing and latency requirements
• Transmission Time Interval (TTI)
  − Synchronized between L2 and L1 at 1 ms
  − Provisioned through GPS or IEEE 1588/PTP (an interrupt for demo purposes)
  − L1 with a 1 Gbps interface adds 150 µs of latency; 10 Gbps adds 15 µs
  − TTI IRQ delivered to the guest user-space application with ~50 µs max latency
RT KVM
• RT Linux guest
• One RT vcpu per cpu
• CPU oversubscription (nice to have)
I/O Virtualization
• Direct assignment
• Virtio
OpenStack Integration
[Diagram: the TTI IRQ path from the iNIC/NPU packet-processing hardware (TTI, Eth, SEC) through RT KVM and the RT Linux guest into the vBTS user-space application]

RT KVM - PA Book III-E
MPIC emulation
• Replaced spinlocks with raw spinlocks; the PREEMPT_RT spinlock implementation uses a sleeping mutex
• RT-friendly refactoring (to do):
  − Increase lock granularity
  − Minimize code paths that run with interrupts disabled
  − Avoid races with (lazy) pending exceptions
Timer
• Moved the processing of the KVM decrementer from softirq (ksoftirqd kthread) to hardirq context
• Replaced wait queues with simple wait queues; wait-queue callbacks prevent the use of raw spinlocks
• Removed an unnecessary tasklet used by the hrtimer

RT KVM - Timer Interrupt Latency
[Diagram: timer interrupt delivery, initial vs. RT kernel — hardware interrupt, timer ISR, and IPI/schedule paths between low-priority stress tasks and the high-priority cyclictest vcpu]

RT KVM - TTI IRQ & Latency Tracer
TTI IRQ
• GPIO assigned
  − Fast-path delivery
• GPIO interrupt affinity
• Extensive host and guest debug statistics
Latency Tracer
• The tracer is a mix of two of the available ftrace modes of operation:
  − function-trace / function-graph
  − max latency: retain the maximum latency of this execution chain
• Exposes the maximum execution latency for injecting the TTI interrupt into the guest, together with the associated code path
• Used to analyze the causes of TTI interrupt delivery latency
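The deck's tracer is a custom combination of those two ftrace modes. As a rough illustration only, a stock ftrace session along the same lines might look like the sketch below; tti_gpio_handler is a placeholder symbol for the host's TTI IRQ path, not a real kernel function:

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function_graph > current_tracer
echo tti_gpio_handler > set_graph_function   # placeholder symbol: trace only the TTI IRQ path
echo > trace                                 # clear the ring buffer
echo 1 > tracing_on
# ... deliver a stream of TTI interrupts to the guest ...
echo 0 > tracing_on
# function_graph prints a duration per call; keep the maximum:
grep -o '[0-9.]\+ us' trace | sort -rn | head -1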
RT KVM - Configuration

# defconfig
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_CPU_STALL_TIMEOUT=300
# CONFIG_MMC is not set

# bootargs
isolcpus=17-23 rcu_nocbs=17-23

# disable RT throttling
echo -1 > /proc/sys/kernel/sched_rt_runtime_us
echo -1 > /proc/sys/kernel/sched_rt_period_us

# disable RCU stall warnings
echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress
echo 0 > /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout

# taskset & chrt
taskset -c $CPU_QEMU qemu-system-x …; chrt -p 95 $QEMUPID
taskset -pc $CPU_VCPU $VCPUPID; chrt -p 95 $VCPUPID

RT KVM - PV-SCHED
Para-virtualized scheduling ported to PA Book III-E
• Follows the original implementation by Jan Kiszka on x86, adapted for PA and rebased on newer kernels
• The significant differences are in interrupt handling and delivery to the guest

I/O Virtualization - Direct Assignment
PCI SR-IOV VF direct assignment (VFIO PCI)
• Place VFs in different IOMMU groups
  − Access Control Services quirks
• PCI EP partitioning PoC
  − Device ID allocation and programming
  − Enable IOMMU entries after each device attach
  − Dump IOMMU information for debugging
• Arch fix-ups are evil
  − QEMU PCI EP inadvertently hidden
• Memory translation
• MSI interrupt affinity

PCI DMA memory translations:
T4-iNIC: iNIC app (receives physical/IOVA addr from driver)
→ T4-iNIC: DMA engine
→ T4-iNIC: outbound PCI ATMU
→ PCI bus (bus:dev:func)
→ T4-RDB: inbound PCI ATMU
→ T4-RDB: PCI controller maps bus:dev:func to LIODN
→ T4-RDB: PAMU (IOVA to real addr)
→ T4-RDB: DDR

I/O Virtualization - Direct Assignment
Security engine direct assignment (VFIO for platform devices)
• QEMU glue code
• Physical and virtual functions
Management Complex bus device assignment (VFIO MC)
• VFIO for Management Complex
  − QEMU integration
  − Legacy interrupts with irqfd support
  − Performance improvements
     KVM ARM support for direct I/O guest caching attributes (I/O portal)
     I/O portal HW interrupt coalescing
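As a concrete illustration of the VF assignment flow above, a minimal VFIO PCI sketch follows. The VF address, the IOMMU group check, and the QEMU invocation (qemu-system-ppc64 here, with the remaining guest arguments elided) are placeholders for a real T4240 setup:

modprobe vfio-pci
BDF=0000:01:00.1                                  # the VF to assign (placeholder address)
echo $BDF > /sys/bus/pci/devices/$BDF/driver/unbind
echo vfio-pci > /sys/bus/pci/devices/$BDF/driver_override
echo $BDF > /sys/bus/pci/drivers/vfio-pci/bind
readlink /sys/bus/pci/devices/$BDF/iommu_group    # the VF must sit in its own IOMMU group
qemu-system-ppc64 … -device vfio-pci,host=$BDF    # hand the VF to the guest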
I/O Virtualization - virtio-crypto
Provides binary compatibility across HW platforms (from different vendors) and supports machine migration
Virtio-crypto device
• Supplies cryptographic offloading for the guest
Cryptographic transformations
• ablkcipher: block ciphers (encryption)
• ahash: hashing (authentication)
• aead: authentication and encryption in one job
Virtqueues
• session, crypto, control
PoC
• QEMU integration
• Linux frontend driver
• vhost-crypto over the crypto API

Packet Processing Performance
[Diagram: VMs behind a progressively deeper infrastructure stack — IP tables + SYN cookie, OF vSwitch, VxLAN decap/demux, IP reassembly + IPSec]
• Increasing complexity of the infrastructure stack
• Performance bottleneck from the software implementation of the networking stack

Native networking stack performance (throughput per packet size):

Stack                            64     390    390 (1K conn)  1024   1472
TCP/IP                           370    2205   2205           5722   8042
OVS                              279    1652   1514           4346   6097
IPtables + OVS                   195    1194   1051           3085   4334
OVS + VxLAN                      181    1072   914            2737   2906
IPtables + OVS + VxLAN           136    824    639            2080   2365
IPtables + OVS + VxLAN + IPsec   49     146    98             190    197

Packet Processing Offload
• Limited GPP involvement (management/control plane only)
• Offload as much packet processing as possible to the NPU/iNIC
• NPU implements the fast path
  − IP table policy caching
  − Entire OF pipeline processing for switching
  − All OF-based data paths
  − Direct connectivity to the VM
• Faster connection rate
[Diagram: management and control paths stay on the GPP — Nova compute, Neutron agent, ovs-vsctl, ovs-ofctl, OVSDB, OF agent, TCP/IP, IPtables, and the IPSec dataplane split across kernel and user space — while the datapath is offloaded to the NPU/iNIC fast path: firewall, switching, VxLAN, IPSec]

OpenStack - Integration
[Diagram: OpenStack controller (DB, dashboard, NSCS vBTS plugin, NSCS relay agent) and compute node (L2 VM, NSCS agent, config daemon), with a numbered configuration flow reaching a B9131 board and the EPC]
Network Service Configuration Stack:
• Single dashboard for configuration relay
• Service function chaining support
• OF-controller, OF-switch onboarding
• Dynamic scalability
• AMQP fan-out exchange
• Configuration versioning and delta support

OpenStack - L1 & eNodeB integration
[Diagram-only slide]

Tuning & performance
Latency benchmarks
• Cyclictest
  − Stress: coremark, lmbench
• L2 application
Networking benchmarks
• Iperf
Performance tools
• perf kvm
• CPU statistics
   KVM ARM CPU accounting
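For reference, the cyclictest runs behind the results on the following slides can be approximated as below (run inside the RT guest for the virtualized numbers). The core numbers and the coremark binary path are placeholders chosen to match the isolcpus=17-23 setup shown earlier:

taskset -c 18 ./coremark &          # background CPU stress on a neighboring isolated core
cyclictest -m -n -t 1 -a 17 -p 95 -i 1000 -D 12h
# -m locks memory, -n uses clock_nanosleep, -a pins the measurement
# thread to the isolated core, -p 95 matches the vcpu RT priority,
# -i 1000 wakes every 1 ms (one TTI), -D 12h matches the 12 h runs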
RT KVM PA Latency Results - Cyclictest
Native vs. virtualized: 1 CPU, core isolation, 12 h

CPU stress:
  Latency (cycles)   Native   Virtual
  Min                1660     2770
  Avg                2220     6660
  Max                3330     11660

Memory stress:
  Latency (cycles)   Native   Virtual
  Min                1660     2220
  Avg                2770     7220
  Max                13880    25550

RT KVM PA Latency Results - Cyclictest
Virtualized pv-sched: host & guest coremark stress, 15 min

CPU stress (pv-sched):
  Latency (cycles)
  Min                5550
  Avg                14440
  Max                27770
Prio host coremark: 10; prio guest coremark: 0

Virtual Base Station - Benchmarking Setup
[Diagram: a traffic generator drives 80 Gbps of bidirectional traffic over 4x 10 GbE links through a metro switch managed by an OpenDaylight controller into a T4240 with iNIC hosting VM1 and VM2; BSC9131RDB-1/-2 boards serve as UE and latency reference over 1 GbE links toward the EPC]

RT KVM PA Latency Results - TTI External Interrupt
Virtualized, LTE L2 stress:
  Latency (us)
  Min     8330
  Avg     13880
  Max     29444

KVM ARMv8 - Networking performance degradation
Iperf TCP client, 750 flows, DPAA2 direct assignment
• 1 network interface
• 1 VM, 1 vCPU
[Chart: performance degradation (0-40%) vs. IRQ coalescing time (0-80 µs), with the best absolute performance point marked]

Future
− Scalability and performance
   Stack disaggregation
   Optimized I/O and accelerator access
   CPU oversubscription
− Fast guest interrupt delivery
− IOMMU emulation

Summary
• “5G” networks are enabled by SDN and NFV
• Base station virtualization with RT KVM
• I/O virtualization using direct assignment and virtio
• NFV packet processing offload
• Virtual base station integration with OpenStack
• Low interrupt latency and low network performance degradation with KVM

Q&A

www.Freescale.com
Freescale and the Freescale logo are trademarks of Freescale Semiconductor, Inc., Reg. U.S. Pat. & Tm. Off. All other product or service names are the property of their respective owners. ARM and Cortex are registered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. © 2015 Freescale Semiconductor, Inc.

LS2085A (backup)
[Block diagram: 8x ARM A57 cores with 48KB L1-I / 32KB L1-D each and banked L2 per cluster, a coherency fabric with a 1MB platform cache and IOMMUs, 2x 64-bit DDR4 memory controllers plus a 32-bit DDR4 controller, the Accelerated I/O Processor (AIOP) with WRIOP, SEC, DCE, PME, queue and buffer managers and an L2 switch, two 8-lane 10GHz SERDES blocks, PCIe/SATA/USB, low-speed peripherals (4x I2C, 2x DUART, SPI, GPIO, JTAG, flash controller, SDXC/eMMC), secure boot, TrustZone, and power management]

General-purpose processing
• 8x ARM A57 CPUs, 64-bit, 2.0 GHz
• 4MB banked L2 cache
• HW L1 & L2 prefetch engines
• Neon SIMD in all CPUs
• 1MB L3 platform cache with ECC
• 2x 64-bit DDR4 up to 2.4 GT/s

Datapath acceleration
• 40 Gbps packet processing (AIOP)
• 20 Gbps SEC crypto acceleration
• 15 Gbps pattern matching / RegEx (PME)
• 20 Gbps Data Compression Engine (DCE)
• 4MB Packet Express Buffer

Express packet IO
• Supports 1x8, 4x4, 4x2, 4x1 PCIe Gen3 controllers
• 2x SATA 3.0, 2x USB 3.0 with PHY

Network IO
• Wire-rate I/O processor (WRIOP): 8x 1/10GbE + 8x 1GbE
• XAUI/XFI/KR and SGMII
• MACSec on up to 4x 1/10GbE

Other parametrics
• 37.5x37.5 flip-chip, 1 mm pitch, 1292 pins