
Bisection Width - Department Of Computer Science


Transcript

Course Outline
• Introduction to algorithms and applications
• Parallel machines and architectures: overview of parallel machines, trends in the top-500, clusters, many-cores
• Programming methods, languages, and environments
  – Message passing (SR, MPI, Java)
  – Higher-level language: HPF
• Applications: N-body problems, search algorithms
• Many-core (GPU) programming (Rob van Nieuwpoort)

Parallel Machines
Barry Wilkinson and Michael Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers (2/e), Pearson, 2005. Section 1.3 + (part of) 1.4.

Overview
• Processor organizations
• Types of parallel machines
  – Processor arrays
  – Shared-memory multiprocessors
  – Distributed-memory multicomputers
• Cluster computers
• Blue Gene

Processor Organization
• Network topology is a graph
  – A node is a processor
  – An edge is a communication path
• Evaluation criteria
  – Diameter (maximum distance between any two nodes)
  – Bisection width (minimum number of edges that must be removed to split the graph into two almost equal halves)
  – Number of edges per node

Key issues in network design
• Bandwidth: number of bits transferred per second
• Latency:
  – Network latency: time to make a message transfer through the network
  – Communication latency: total time to send the message, including software overhead and interface delays
  – Message latency (startup time): time to send a zero-length message
• Diameter influences latency
• Bisection width influences bisection bandwidth (collective bandwidth over the "removed" edges)

Mesh
• q-dimensional lattice; q = 2 gives a 2-D grid
• Number of nodes: k² (k = √p)
• Diameter: 2(k - 1)
• Bisection width: k
• Edges per node: 4

Binary Tree
• Number of nodes: 2^k - 1 (k = log₂ P levels)
• Diameter: 2(k - 1)
• Bisection width: 1
• Edges per node: 3

Hypertree
• Tree with multiple roots, gives better bisection width
• 4-ary hypertree (depth k):
  – Number of nodes: 2^k (2^(k+1) - 1)
  – Diameter: 2k
  – Bisection width: 2^(k+1)
  – Edges per node: 6

Engineering solution: fat tree
• Tree with more bandwidth at links near the root
• Example: CM-5

Hypercube
• k-dimensional cube: label each node with a k-bit binary value, and connect nodes that differ in exactly 1 bit
• Number of nodes: 2^k
• Diameter: k
• Bisection width: 2^(k-1)
• Edges per node: k

Comparison
                    Mesh   Tree   Hypercube
  Diameter           o      +        +
  Bisection width    o      -        +
  #edges per node    4      3        unlimited
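To make the comparison concrete, here is a minimal Python sketch (not part of the slides; the function names and the choice of p = 64 are illustrative) that evaluates the node-count, diameter, bisection-width, and edges-per-node formulas above for a 2-D mesh, a binary tree, and a hypercube of roughly 64 processors.

    import math

    # Formulas from the slides; k means side length (mesh), number of levels (tree),
    # or dimension (hypercube). Each function returns
    # (nodes, diameter, bisection width, edges per node).
    def mesh(k):
        return k * k, 2 * (k - 1), k, 4

    def binary_tree(k):
        return 2**k - 1, 2 * (k - 1), 1, 3

    def hypercube(k):
        return 2**k, k, 2**(k - 1), k

    p = 64  # target number of processors (illustrative)
    rows = [("mesh", mesh(math.isqrt(p))),
            ("binary tree", binary_tree(int(math.log2(p)))),
            ("hypercube", hypercube(int(math.log2(p))))]
    print(f"{'topology':<12}{'nodes':>6}{'diameter':>10}{'bisection':>11}{'edges/node':>12}")
    for name, (n, d, b, e) in rows:
        print(f"{name:<12}{n:>6}{d:>10}{b:>11}{e:>12}")

For 64 processors this reproduces the comparison table: the hypercube has the smallest diameter and the largest bisection width, the tree has the worst bisection width (1), and the mesh sits in between, while the hypercube pays for this with a number of links per node that grows with p.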
Types of parallel machines
• Processor arrays
• Shared-memory multiprocessors
• Distributed-memory multicomputers

Processor Arrays
• Instructions operate on scalars or vectors
• Processor array = front-end + synchronized processing elements
• Front-end
  – Sequential machine that executes the program
  – Vector operations are broadcast to the PEs
• Processing element (PE)
  – Performs the operation on its part of the vector
  – Communicates with other PEs through a network

Examples of Processor Arrays
• CM-200, MasPar MP-1 and MP-2, ICL DAP (~1970s)
• Earth Simulator (Japan, 2002, former #1 of the top-500)
• The ideas are now applied in GPUs and CPU extensions like MMX

Shared-Memory Multiprocessors
• Bus easily gets saturated => add caches to the CPUs
• Central problem: cache coherency
  – Snooping cache: monitor the bus, invalidate copies on a write
  – Write-through or copy-back
• Bus-based multiprocessors do not scale

Other Multiprocessor Designs (1/2)
• Switch-based multiprocessors (e.g., crossbar)
• Expensive (requires many very fast components)

Other Multiprocessor Designs (2/2)
• Non-Uniform Memory Access (NUMA) multiprocessors
  – Memory is distributed
  – Some memory is faster to access than other memory
• Example: Teras at SARA, the Dutch national supercomputer (1024-node SGI)
• The ideas are now applied in multi-cores

Distributed-Memory Multicomputers
• Each processor only has a local memory
• Processors communicate by sending messages over a network
• Routing of messages:
  – Packet-switched message routing: split the message into packets, buffered at intermediate nodes (store-and-forward)
  – Circuit-switched message routing: establish a path between source and destination

Packet-switched Message Routing
• Messages are forwarded one node at a time
• Forwarding is done in software
• Every processor on the path from source to destination is involved
• Latency is linear in distance x message length
• Old examples: Parsytec GCel (T800 transputers), Intel iPSC

Circuit-switched Message Routing
• Each node has a routing module
• A circuit is set up between source and destination
• Latency is linear in distance + message length
• Example: Intel iPSC/2

Modern routing techniques
• Circuit switching: needs to reserve all links in the path (cf. the old telephone system)
• Packet switching: high latency, needs buffering space (cf. postal mail)
• Cut-through routing: packet switching, but packets are immediately forwarded (without buffering) if the outgoing link is available
• Wormhole routing: transmit the head (a few bits) of the message; the rest follows like a worm

Performance
[Graph: network latency versus distance (number of hops) for packet switching, wormhole routing, and circuit switching; a first-order latency model is sketched at the end of this transcript.]

Distributed Shared Memory
• Shared memory is easier to program, but doesn't scale
• Distributed memory is hard to program, but does scale
• Distributed Shared Memory (DSM): provide a shared-memory programming model on top of distributed-memory hardware
  – Shared Virtual Memory (SVM): use memory-management hardware (paging), copy pages over the network
  – Object-based: provide replicated shared objects (Orca language)
• Was a hot research topic in the 1990s, but performance remained the bottleneck
• Some revival with Partitioned Global Address Space (PGAS) languages, like X10 and Chapel

Flynn's Taxonomy
                          Single Data    Multiple Data
  Single Instruction         SISD            SIMD
  Multiple Instruction       MISD            MIMD
• Instruction stream: sequence of instructions
• Data stream: sequence of data manipulated by the instructions
• SISD (Single Instruction, Single Data): traditional uniprocessors
• SIMD (Single Instruction, Multiple Data): processor arrays
• MISD (Multiple Instruction, Single Data): nonexistent?
• MIMD (Multiple Instruction, Multiple Data): multiprocessors and multicomputers
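The Performance graph above (latency versus number of hops) follows from the usual first-order latency models for the three routing techniques. The Python sketch below is an illustration under stated assumptions, not material from the slides: the per-hop and per-byte costs are made-up values, and contention is ignored.

    T_HOP = 50e-9    # per-hop switching/routing delay in seconds (assumed value)
    T_BYTE = 1e-9    # per-byte transmission time in seconds (assumed value)

    def store_and_forward(hops, msg_bytes):
        # Packet switching with store-and-forward: the whole message is received and
        # retransmitted at every hop, so latency grows with distance * message length.
        return hops * (T_HOP + msg_bytes * T_BYTE)

    def wormhole(hops, msg_bytes):
        # Wormhole / cut-through routing: only the header pays the per-hop cost,
        # the body is pipelined behind it, so latency grows with distance + message length.
        return hops * T_HOP + msg_bytes * T_BYTE

    def circuit_switched(hops, msg_bytes):
        # Circuit switching: the per-hop cost is paid once while setting up the circuit,
        # then the message streams over the reserved path (same first-order shape as wormhole).
        return hops * T_HOP + msg_bytes * T_BYTE

    msg = 4096  # message size in bytes (illustrative)
    for hops in (1, 4, 16, 64):
        print(f"hops={hops:3d}"
              f"  store-and-forward={store_and_forward(hops, msg)*1e6:9.2f} us"
              f"  wormhole={wormhole(hops, msg)*1e6:7.2f} us"
              f"  circuit={circuit_switched(hops, msg)*1e6:7.2f} us")

As the hop count grows, store-and-forward latency scales with distance x message length, while wormhole and circuit switching stay close to one message-transmission time plus a small per-hop term, matching the "linear in distance + message length" claim on the routing slides.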