Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device
Shuang Liang, Ranjit Noronha, D.K. Panda Department of Computer Science and Engineering The Ohio State University Email: {liangs,noronha,panda}@cse.ohio-state.edu
Presentation Outline
• Introduction
• Problem Statement
• Design Issues
• Performance Evaluation
• Conclusions and Future Work
Application Trends
• Applications are becoming increasingly data intensive, with high memory demand
  – Larger working sets for a single application, such as data warehouse applications, scientific simulations, etc.
  – The memory resources of a single node in a cluster may not be able to accommodate the working set, while another node may host plenty of unused memory
Utilizing Remote Memory
• Can we utilize that remote memory to improve application performance?
[Figure: two nodes (Node 1 and Node 2), each with CPU, NIC, and MEM, connected by a network; pages move from one node's memory to the other's over the network]
Motivation
• Emergence of commodity high performance networks such as InfiniBand
  – Offloaded transport protocol stack
  – User-level communication bypassing the OS
  – Low latency
  – High bandwidth, comparable to local memcpy performance
  – Novel hardware features such as Remote Direct Memory Access (RDMA) with minimal host overhead
• Can we utilize these features to boost sequential data intensive applications? And how?
Approaches
• Global memory management [Feeley95]
  – Close integration with virtual memory management
  – Implementation complexity and poor portability
• User-level run-time libraries [Koussih95]
  – Application-aware interface
  – Additional management layer in user space
• Remote paging [Markatos96]
  – Flexible
  – Moderate implementation effort
Presentation Outline
• Background
• Problem Statement
• Design Issues
• Performance Evaluation
• Conclusions and Future Work
Problem Statement
• Enable InfiniBand clusters to take advantage of remote memory through remote paging
  – Enhance local memory hierarchy performance
  – Deliver high performance
  – Enable applications to benefit transparently
• Evaluate the impact of network performance
  – Compare remote paging over GigE, IPoIB, and InfiniBand native communication
Presentation Outline
• Background
• Problem Statement
• Design Issues
• Performance Evaluation
• Conclusions and Future Work
Design Choices
• Kernel-level design
  – Pros:
    • Transparent to applications
    • Benefits all processes in the system
    • Takes advantage of the virtual memory system for page management
  – Cons:
    • Dependent on the OS
    • Not easy to debug
• User-level design
  – Pros:
    • Portable across different OSes
    • Easier to debug
  – Cons:
    • Not completely transparent to applications
    • Benefits only applications using the user-level library
    • High overhead from user-level signal handling
Network Block Device
• A software mechanism to utilize remote block-based resources over the network
  – Examples: NBD, ENBD, DRBD, GNBD, etc.
  – Often used to export remote disk resources for storage, e.g. RAID devices, mirror devices, etc.
• Use a ramdisk-based network block device as the swapping device (see the sketch below)
  – Seamless integration with the VM system for remote paging
  – NBD, a TCP-based implementation of the network block device in the default kernel source tree, can be used for a comparison study
  – An InfiniBand-based network block device needs to be designed
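The appeal of the block-device route is that nothing above the block layer has to change: once the remote device is attached and formatted as swap, the VM pages to it like any other swap device. Below is a minimal user-space sketch of that last step, assuming the device has already been attached and prepared with mkswap; /dev/nbd0 is purely an illustrative device name.

```c
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
    /* /dev/nbd0 is an illustrative name; it stands for whatever network
     * block device (NBD or HPBD) has already been attached and formatted
     * with mkswap. Enabling swap requires root privileges. */
    if (swapon("/dev/nbd0", 0) != 0) {
        perror("swapon");
        return 1;
    }

    /* From this point on the kernel's VM treats the remote ramdisk as an
     * ordinary swap device; applications benefit without modification. */
    return 0;
}
```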
Architecture of the remote paging system
[Figure: on the client, applications run in user space; in kernel space, virtual memory management directs swap traffic either to the local disk or to the HPBD swap device, which communicates through the HCA over the InfiniBand network with one or more memory servers]
Design Issues
• Memory registration and buffer management
  – Registration is a costly operation compared with memcpy for small buffers (see the sketch below)
  – Pre-registration outside the critical path would require registering all memory pages
  – Paging messages are bounded above by 128KB in Linux
[Figure: cost in microseconds of memcpy, kernel-level registration, and user-level registration for buffer sizes from 1 byte up to ~100KB]
Memory copy is more than 12 times faster than memory registration for one page.
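The registration cost is easy to reproduce from user space with the verbs API. The following is a minimal sketch, not the paper's kernel-level code: it times ibv_reg_mr() against memcpy() for a single 4KB page on the first available HCA.

```c
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
}

int main(void)
{
    int n;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no HCA found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    const size_t len = 4096;                 /* one page */
    void *src, *dst;
    posix_memalign(&src, 4096, len);
    posix_memalign(&dst, 4096, len);
    memset(src, 1, len);

    struct timespec t0, t1;

    /* Time a plain page copy. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double copy_us = elapsed_us(t0, t1);

    /* Time pinning and registering the same page with the HCA. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    struct ibv_mr *mr = ibv_reg_mr(pd, src, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double reg_us = elapsed_us(t0, t1);

    printf("memcpy: %.2f us, ibv_reg_mr: %.2f us\n", copy_us, reg_us);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(src);
    free(dst);
    return 0;
}
```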
Design Issues (cont’d)
• Dealing with message completions
  – Polling-based synchronous completion wastes CPU cycles
  – Kernel-mode code is non-preemptive
  – InfiniBand supports event-based completion by registering an asynchronous event handler
• Thread safety
  – There can be multiple instances of the driver running, so mutual exclusion is needed for shared data structures
• Reliability issues
Our Design
• RDMA-based server design (see the sketch below)
  – Page-in: the client sends a page-in request; the memory server RDMA Writes the page into the client’s buffer and returns a request Ack
  – Page-out: the client sends a page-out request; the memory server RDMA Reads the page from the client’s buffer and returns a request Ack
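As a concrete illustration of the server side of the page-in path, the sketch below posts a single RDMA Write of one page into the client's buffer. The work-request fields are standard verbs; the function name and the assumption that the client advertises its (remote address, rkey) pair in the page-in request are illustrative, since this is user-space verbs rather than the kernel code used in HPBD.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Push one 4KB page back to the client with RDMA Write; an explicit Ack
 * message would follow once this work request completes. The QP, the
 * registered local page buffer, and the client's remote_addr/rkey are
 * assumed to have been set up and exchanged beforehand. */
static int rdma_write_page(struct ibv_qp *qp, struct ibv_mr *page_mr,
                           void *page, uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)page,
        .length = 4096,
        .lkey   = page_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)page,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,   /* completion triggers the Ack */
    };
    wr.wr.rdma.remote_addr = remote_addr;  /* client's page-in buffer */
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```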
Our Design (cont’d)
• Registered buffer pool management
  – Pre-register a buffer pool for page copies before communication starts
• Hybrid completion handling (see the sketch below)
  – Register an event handler with the InfiniBand transport
  – Both client and server block when there is no traffic
  – Use a polling scheme for bursts of incoming requests
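A user-space approximation of this hybrid scheme with the verbs API is sketched below: block on the completion channel while the device is idle, then drain the completion queue by polling so that a burst of paging requests costs only one wakeup. Creating the channel and CQ is assumed to happen elsewhere.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Event loop: sleep on the completion channel when idle, poll when busy.
 * 'channel' and 'cq' are assumed to come from ibv_create_comp_channel()
 * and ibv_create_cq() during setup. */
static void completion_loop(struct ibv_comp_channel *channel, struct ibv_cq *cq)
{
    struct ibv_wc wc;

    /* Arm the CQ so the next completion generates an event. */
    ibv_req_notify_cq(cq, 0);

    for (;;) {
        struct ibv_cq *ev_cq;
        void *ev_ctx;

        /* Block until the HCA signals a completion event. */
        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
            break;
        ibv_ack_cq_events(ev_cq, 1);

        /* Re-arm before polling so no completion slips through the gap. */
        ibv_req_notify_cq(ev_cq, 0);

        /* Poll until the CQ is empty: bursty traffic is absorbed without
         * taking one event per completion. */
        while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS)
                fprintf(stderr, "work request %llu failed\n",
                        (unsigned long long)wc.wr_id);
            /* ... dispatch page-in/page-out completion here ... */
        }
    }
}
```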
Our Design (cont’d)
• Reliable communication
  – Use InfiniBand RC (Reliable Connection) services
• Flow control
  – Use credit-based flow control
• Multiple server support (see the sketch below)
  – Distribute blocks across multiple servers in linear mode
[Figure: the swap area is divided into contiguous ranges of n blocks mapped linearly: the first n blocks go to Server 1, blocks n to 2n to Server 2, the next range to Server 3, and so on]
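The linear-mode distribution and the credit check amount to a few lines of client-side bookkeeping; the structure and field names below are illustrative, not taken from the HPBD source.

```c
#include <stdint.h>

/* One entry per memory server. */
struct hpbd_server {
    int credits;                  /* send credits granted by this server */
};

struct hpbd_client {
    struct hpbd_server *servers;
    unsigned int nservers;
    uint64_t blocks_per_server;   /* n: blocks exported by each server */
};

/* Linear mode: blocks [0, n) map to server 0, [n, 2n) to server 1, etc. */
static inline void hpbd_map_block(const struct hpbd_client *c, uint64_t block,
                                  unsigned int *server, uint64_t *offset)
{
    *server = (unsigned int)(block / c->blocks_per_server);
    *offset = block % c->blocks_per_server;
}

/* Credit-based flow control: only post a request when the target server
 * has granted credit; credits are replenished by its acknowledgements. */
static inline int hpbd_may_send(struct hpbd_client *c, unsigned int server)
{
    if (c->servers[server].credits == 0)
        return 0;                 /* caller queues the request until an Ack */
    c->servers[server].credits--;
    return 1;
}
```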
Presentation Outline
• Background
• Problem Statement
• Design Issues
• Performance Evaluation
• Conclusions and Future Work
Experiment Setup
• Xeon 2.66GHz cluster with 2GB DDR memory, a 40GB ST340014A hard disk, and InfiniBand Mellanox MT23108 HCAs
• Memory size configuration:
  – 2GB for the local memory test scenario
  – 512MB for the swapping scenario
• Swap area setup
  – Use a ramdisk on each memory server as the swap area
Latency Comparison
[Figure: one-way latency in microseconds for Memcpy, IB, IPoIB, and GigE as message size grows from 1 byte to 64KB]
InfiniBand native communication latency for one page is 4 times faster than IPoIB, 8 times faster than GigE, and 2.4 times slower than memcpy.
Micro-benchmark: Execution Time
[Figure: execution time in seconds for Memory, HPBD, NBD-IPoIB, NBD-GigE, and Local-Disk]
Network overhead is approximately 30% for IPoIB. Using server-based RDMA further improves performance for HPBD.
Quicksort – Execution time
[Figure: execution time in seconds for Memory, HPBD, NBD-IPoIB, NBD-GigE, and Local-Disk]
HPBD is 1.4 times slower than running with enough local memory and 4.7 times faster than swapping to disk.
Barnes – Execution time
[Figure: execution time in seconds for Memory, HPBD, NBD-IPoIB, NBD-GigE, and Local-Disk]
For a slightly oversized working set, HPBD is still 1.4 times faster than swapping to disk.
Two processes of Quicksort
[Figure: execution time in seconds of process1 and process2 under four configurations: 512MB memory with disk swapping, 512MB memory with 3×512MB HPBD servers, 1GB memory with 2×512MB HPBD servers, and 2GB memory]
Concurrent instances of quicksort run up to 21 times faster than swapping to disk.
Quicksort with multiple servers
[Figure: execution time in seconds for 1 server over GigE, 1 server over IPoIB, and 1, 2, 4, 8, and 16 HPBD servers]
Maintaining multiple connections does not degrade performance for up to 12 servers.
Presentation Outline
• Background
• Problem Statement
• Design Issues
• Performance Evaluation
• Conclusions and Future Work
Conclusions
• Remote paging is an efficient way to enable sequential applications to take advantage of remote memory
• Using InfiniBand for remote paging improves performance compared with GigE and IPoIB, and is comparable to a system with enough local memory
• As network speed increases, host overhead becomes more critical for further performance improvement
Future Work
• Achieve zero-copy along the communication path to reduce host overhead on the critical path
• Dynamic management of idle cluster memory for swap area allocation
Acknowledgements
Our research is supported by the following organizations:
• Current funding support
• Current equipment donations
Thank You! {liangs, noronha, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory http://nowlab.cis.ohio-state.edu/